Add ability to export from gmvault's internal storage to standard formats #80

Merged
merged 4 commits into from Mar 15, 2013

Conversation

vasi
Contributor

@vasi vasi commented Aug 31, 2012

Fixes issue #68.

The command to export is: gmvault export -d DB_DIR -t FORMAT OUTPUT_DIR

Currently valid formats are 'maildir' and 'mbox'. The maildir variant is 'Maildir++' as used by Dovecot, and indeed the exported maildir can be used as-is as a Dovecot backend. The original 'mboxo' variant of mbox is used, with '.sbd' as a suffix for directories as used by Thunderbird.

As mentioned in the issue, the maildir (or mbox) format duplicates emails, once per label. This is clearly not optimal, but there's no way to avoid it. Users should continue to use gmvault's internal format for everyday use, only exporting to maildir when they need it.

- Inbox is exposed
- Starred messages are exposed
- Folder names are escaped properly
- Chats are exported
- Progress messages are printed
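
The per-label duplication described above can be illustrated with Python's standard `mailbox` module. This is only a sketch, not the code in this pull request: `export_to_maildir` and the `(raw_message, labels)` input shape are hypothetical, and treating the root maildir as the inbox is an assumed convention.

```python
import mailbox


def export_to_maildir(messages, output_dir):
    """Write each message once per label into a Maildir++ tree.

    `messages` is an iterable of (raw_message, labels) pairs. That shape
    is an assumption for this example, not gmvault's storage API.
    `output_dir` must not exist yet (or must already be a maildir).
    Label escaping is deliberately left out.
    """
    root = mailbox.Maildir(output_dir, create=True)
    for raw_message, labels in messages:
        for label in labels:
            if label == "INBOX":
                # Assumed convention: the root maildir doubles as the inbox.
                root.add(raw_message)
            else:
                # add_folder() creates or reopens the ".<label>" subfolder.
                root.add_folder(label).add(raw_message)
    return root
```

A message carrying two labels ends up as two files on disk, which is the duplication cost mentioned above.
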
@dmd
Contributor

dmd commented Aug 31, 2012

This is great (if it works, I have not tested). Please test & merge!

@gaubert
Owner

gaubert commented Sep 19, 2012

@vasi I am just back from my holiday. Thanks, I will check your add-on and let you know about it.

@gaubert
Owner

gaubert commented Oct 9, 2012

@vasi Sorry for the delay, I am very busy at the moment. I am getting to the integration of your format translators.
I will let you know soon how it goes.

@vasi
Contributor Author

vasi commented Oct 10, 2012

No problem, I know the feeling! :)

@keltia

keltia commented Oct 22, 2012

mbox clearly can't share mails across label-based mailboxes, but along with the tag cache I mentioned in #92, Maildir++ could.

@vasi
Contributor Author

vasi commented Oct 22, 2012

I don't quite understand, could you explain further?

@keltia

keltia commented Oct 22, 2012

What I want to store for each tag is the list of gmail_ids that have the tag. That way you can check whether a mail has multiple tags or not, and hard link them if you want.
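
A rough sketch of that idea, for illustration only: `build_label_index` and `link_into_labels` are hypothetical names, and the `gmail_id -> labels` mapping is an assumed input shape rather than gmvault's actual metadata format. Hard links only work when all directories live on the same filesystem.

```python
import os
from collections import defaultdict


def build_label_index(metadata):
    """Map each label to the sorted list of gmail_ids carrying it.

    `metadata` maps gmail_id -> list of labels (assumed shape).
    """
    index = defaultdict(list)
    for gmail_id, labels in metadata.items():
        for label in labels:
            index[label].append(gmail_id)
    return {label: sorted(ids) for label, ids in index.items()}


def link_into_labels(source_file, label_dirs):
    """Hard-link one stored message into several label directories,
    so the message bytes exist on disk only once."""
    for directory in label_dirs:
        os.makedirs(directory, exist_ok=True)
        target = os.path.join(directory, os.path.basename(source_file))
        os.link(source_file, target)
```
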

@gaubert
Owner

gaubert commented Oct 23, 2012

@keltia OK, I get what you want: some kind of index. I will think about it, because if it is in JSON it can be a pretty long list.

@vasi
Contributor Author

vasi commented Dec 13, 2012

Anything I can do to help this along?

@gaubert
Owner

gaubert commented Dec 14, 2012

@vasi Yes, I have had no time to try and test it. Does it come with unit test files?
I am preparing a proper continuous integration system with Jenkins so that multiple people can work on Gmvault.

@vasi
Contributor Author

vasi commented Dec 14, 2012

No tests yet, I'll take a look at adding some. Currently I'm basing off of master, is that the right thing to do? Or should I base off of gmv-perf-1.7.2?

@gaubert
Owner

gaubert commented Dec 14, 2012

Base on gmv-perf-1.7.2. Then with the tests we will integrate.
Thanks for the help


@gaubert
Owner

gaubert commented Dec 18, 2012

@vasi I will go on vacation until the 28th of Dec but I can be contacted. When you are ready with the testing contact me by email as we will integrate your tests in the Jenkins suite I have created for Gmvault.

Guillaume

@vasi
Contributor Author

vasi commented Dec 18, 2012

I will also be on vacation until the new year. We'll work on this later. Enjoy your holiday!

@gaubert
Owner

gaubert commented Dec 18, 2012

@vasi You too. Contact me next year once you have the test suite. I would really like to release a new version at the beginning of next year.

@sjuxax

sjuxax commented Dec 24, 2012

Just used these patches to export 14,000 mails to mbox format and it worked very well. Thanks.

@gaubert
Owner

gaubert commented Jan 2, 2013

@sjuxax Great! Version 1.7.2 will contain these patches. We are working on it.

@vasi
Contributor Author

vasi commented Jan 3, 2013

@gaubert Ok, I'm back from holiday and I've rebased on gmv-perf-1.7.2, you can see the result in the 'export2' branch: https://github.com/vasi/gmvault/tree/export2

I'm ready to look into adding tests now, but I currently can't successfully run any of the existing tests I tried because they all look for specific paths like '/homespace/gaubert/.ssh'. Is there something I'm doing wrong?

@vasi
Contributor Author

vasi commented Jan 3, 2013

Also, I'm not 100% sure what we want to test, for the purposes of export. Obviously 'we can ask for an export, and nothing crashes' is a nice baseline, but just doing nothing passes that test ;)

Testing that the output is correct is trickier. A diff with known-good output isn't good enough, since an export format may have multiple ways of correctly representing the same database. For example the hostname part of maildir IDs can be ignored, but will cause exports of different machines to have different pathnames. Maybe the right solution is to add import as well, and verify that export-then-import yields the identical database?

Since the purpose of export is to be able to use gmvault databases with other programs, the real best test would be to actually use these other programs, but that's hard to automate! We should at least specify in the docs what external programs are supported by export, currently I've targeted mbox at Thunderbird, and maildir at dovecot.

@gaubert
Owner

gaubert commented Jan 3, 2013

Yes, the tests look for OAuth token files stored in /homespace/gaubert/.ssh on the test machine. The best for you is to ignore these tests for now and build unit tests for the mail export part. Once that is done, I will integrate them in the main gmvault src tree and in the Jenkins I use for validation. One of the current test ideas is to have a reference test mail account which contains a limited set of emails (100 max) with specific quirks (tricky labels, badly formatted emails, ...) and to check that Gmvault can back them up and restore them, with some checks to validate that the restored mailbox is identical to the original one. Another test file validates the command line interface.
So you should create an export_test.py file validating the export. The export_test could do a backup or start from a limited gmvault-db and create the right format.

@gaubert
Owner

gaubert commented Jan 3, 2013

@vasi, you could indeed add the import, which would be a good feature as well, or keep a mbox and maildir export and then create the export from the Gmvault-db and compare the result with the kept mbox or maildir export that works with Thunderbird and Dovecot. The kept reference mbox and maildir exports should be small but representative.
You should also validate issues and error handling.
Let me know what you plan to do.

@gaubert
Owner

gaubert commented Jan 3, 2013

@vasi I am testing. Does it work with labels containing non-ASCII UTF-8 characters (French with accents, German, Japanese, ...)?
A test for that would be good. Reasonable corner cases should be in the test suite.

@vasi
Contributor Author

vasi commented Jan 3, 2013

@gaubert My test gmail account already has labels with forbidden characters, like tilde, but testing UTF-8 too is a good idea!

or keep a mbox and maildir export and then create the export from the Gmvault-db and compare the result with the kept mbox or maildir export

As I was trying to say before, this doesn't work. Maildir includes paths that look like "1357207594.M717564P16249Q11615.myhost.mydomain", but obviously "myhost.mydomain" will be different depending on the system. I don't see any obvious way in the Python 'mailbox' module to have it use a mock hostname instead of the real one :(

@gaubert
Owner

gaubert commented Jan 3, 2013

@vasi If I launch the export command a second time, will it redo the whole export or only what is new? Do we need a --resume mode in case of failure, like we have for sync and restore? It might be useful to be able to restart from where you were.
I have a mailbox with 33000 emails and it takes at least 30 minutes to do the export in mbox, so it might be interesting to have an incremental export.

@gaubert
Owner

gaubert commented Jan 3, 2013

@vasi Regarding myhost.mydomain: in your tests, you could anticipate that and get it from the system the same way mailbox does (probably using socket.gethostname()) and assert the right parts. As I said, we want validation tests here, so a hundred emails at most should be enough.
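
An alternative to asserting on the hostname part is to ignore file names altogether and compare message contents. The sketch below takes that different route; `maildir_fingerprint` is a hypothetical helper, and it only checks a single maildir (subfolders would need the same treatment).

```python
import hashlib
import mailbox


def maildir_fingerprint(path):
    """Return the sorted SHA-256 digests of every message in a maildir.

    File names, and therefore the hostname embedded in them, are ignored,
    so two exports made on different machines compare equal when they
    hold the same messages.
    """
    box = mailbox.Maildir(path, create=False)
    return sorted(
        hashlib.sha256(box.get_bytes(key)).hexdigest() for key in box.keys()
    )
```

A test could then assert that the fingerprint of a fresh export equals the fingerprint of a small reference export kept in the repository.
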

@vasi
Contributor Author

vasi commented Jan 10, 2013

Hmm, so I do see the "From MAILER-DAEMON" stuff in the mbox file, but in my case Thunderbird still uses the proper "From:" line. I guess we can try calling set_from() on mbox messages and seeing if that helps your case. But I would still like to understand what's going on, and why Thunderbird interprets your messages so strangely. Good luck narrowing it down!

@vasi
Contributor Author

vasi commented Jan 10, 2013

Oh, and so I don't forget: Using the import method I showed above, Thunderbird will not import subdirectories that don't have an associated main mailbox. We should probably create an empty main mailbox for each subdirectory when exporting to mbox!
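
One way this could look, as a sketch only: `mbox_path_for_label` is a hypothetical helper, labels are assumed to be '/'-separated, and name escaping is left out.

```python
import os


def mbox_path_for_label(root, label):
    """Return the mbox file path for a '/'-separated label.

    For every ancestor, create the '<parent>.sbd' directory and an empty
    '<parent>' mbox file next to it, so that no subdirectory is left
    without a main mailbox.
    """
    parts = label.split("/")
    directory = root
    os.makedirs(directory, exist_ok=True)
    for parent in parts[:-1]:
        parent_mbox = os.path.join(directory, parent)
        if not os.path.exists(parent_mbox):
            open(parent_mbox, "ab").close()  # empty placeholder mailbox
        directory = os.path.join(directory, parent + ".sbd")
        os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, parts[-1])
```
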

@gaubert
Owner

gaubert commented Jan 11, 2013

@vasi I could not do it last night. I will try this weekend but I have a busy schedule. In the meantime, could you try to reproduce the error by adding >From in one of your emails and see if it breaks mbox?
You can also work on the unit tests, with a test that tries to parse emails from a gmvault-db that are incomplete because of a bug in gmvault.
I will follow your import procedure and see if I can make it work.
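
For anyone trying to reproduce this, the small experiment below shows what Python's standard `mailbox` module writes. It is a sketch under assumptions, not gmvault's exporter: `write_single_message_mbox` is a hypothetical helper. It exercises the two things discussed here, the "From MAILER-DAEMON" separator line (and `set_from()` to change it) and the ">From " quoting of body lines.

```python
import mailbox


def write_single_message_mbox(path, raw_message, envelope_sender=None):
    """Write one message to a new mbox file and return the file's raw bytes."""
    box = mailbox.mbox(path)
    try:
        message = mailbox.mboxMessage(raw_message)
        if envelope_sender is not None:
            # Replace the default "MAILER-DAEMON" envelope sender;
            # True asks set_from() to append the current time.
            message.set_from(envelope_sender, True)
        box.add(message)
    finally:
        box.close()
    with open(path, "rb") as handle:
        return handle.read()
```

With a body line that starts with "From ", the file should contain ">From " instead, and the separator line should start with "From MAILER-DAEMON" unless an envelope sender is supplied.
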

@vasi
Contributor Author

vasi commented Jan 12, 2013

Ok, I've improved logging.

  • The "starting export" log now lists what labels are being exported, if known.
  • The "exported N messages..." log is now "processed N messages...".
  • There are now debug messages to show what's being done for each individual message.

Some notes:

  • It's difficult to print out the list of labels for each message, because logbook complains if a label contains non-ascii characters. I'm doing label.encode('ascii', 'backslashreplace') to deal with that, but it should probably be fixed in logutils somehow?
  • I'm not printing what directory each message is in. This is partly because it's difficult, since storer.get_directory_from_id() returns an absolute path. Also, it's not clear why the exporter should care that emails are stored in directories. If we changed to storing them in a real DB, or in a flat file, or even something crazy like over a network, there's no reason that should force us to change the exporter code!
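
For reference, the workaround from the first note looks roughly like this. The sketch is written for Python 3, where `encode` returns bytes and a `decode` is needed to get text back; `ascii_safe` is a hypothetical helper name.

```python
def ascii_safe(label):
    """Return an ASCII-only rendering of a label, suitable for logging.

    Non-ASCII characters are replaced by backslash escapes instead of
    raising UnicodeEncodeError.
    """
    return label.encode("ascii", "backslashreplace").decode("ascii")


print(ascii_safe("Café"))   # Caf\xe9
print(ascii_safe("Inbox"))  # Inbox
```
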

@vasi
Contributor Author

vasi commented Jan 12, 2013

A couple more things:

  • What do we do if the user has a real mailbox named 'Archived'? We can't easily detect this ahead of time!
  • I'm not sure about exporting 'Important' and 'Starred' by default. We can't avoid all duplication, it is simply the nature of Gmail that an email may be in multiple labels. Maybe instead we can add an "--exclude LABEL" option? That way a user who stars tons of messages can just exclude Starred, while those of us who only use stars rarely can keep it.

@vasi
Contributor Author

vasi commented Jan 13, 2013

I did some more testing against different importers.

  • Thunderbird ImportExportTool is still working fine with mbox. Now, we make sure to import subdirectories even if the parent directory is empty.
  • Dovecot works ok with maildir. Fixed some mailbox naming issues.
  • OfflineIMAP works with maildir, but it's imperfect because of some mismatches with how Dovecot works.
    • Dovecot considers the root directory of an export to be the inbox; OfflineIMAP just calls it "INBOX".
    • Dovecot uses '.' as a separator by default; OfflineIMAP prefers '/', at least for Gmail.
    • Dovecot prefixes mailboxes with a '.'; OfflineIMAP doesn't use a prefix.
    • Dovecot uses a listescape plugin to deal with illegal characters; OfflineIMAP just allows them.

These can be worked around using OfflineIMAP's nametrans feature, to convert between mailbox names. But that's a little annoying for users. Another option is instead of having --type maildir, we can have --type dovecot and --type offlineimap, which use different conventions. (Or maybe just pick one of them to be the default --type maildir.)
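
To make the differences concrete, here is a sketch of how the two naming conventions could be expressed. Everything in it is an assumption for illustration: `maildir_folder_name` and `_listescape` are hypothetical, the escape format (backslash plus two hex digits) and the set of escaped characters are my reading of Dovecot's listescape plugin, and labels are assumed to be '/'-separated.

```python
def _listescape(name):
    """Escape characters that clash with the Maildir++ layout as a
    backslash followed by two hex digits (assumed escape format)."""
    out = []
    for position, char in enumerate(name):
        if char in "./\\" or (char == "~" and position == 0):
            out.append("\\%02x" % ord(char))
        else:
            out.append(char)
    return "".join(out)


def maildir_folder_name(label, flavour):
    """Map a '/'-separated Gmail label to an on-disk folder name."""
    if flavour == "dovecot":
        if label == "INBOX":
            return ""  # the export root itself is the inbox
        parts = label.split("/")
        return "." + ".".join(_listescape(part) for part in parts)
    if flavour == "offlineimap":
        return label  # no prefix, '/' kept, inbox is literally "INBOX"
    raise ValueError("unknown flavour: %r" % flavour)
```
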

@gaubert
Owner

gaubert commented Jan 14, 2013

@vasi I was away this weekend and could not dedicate time to Gmvault.
I watched your screencast and this is what I do. For me it doesn't work, and I don't know why. I need to investigate.
I will try again tonight, get your latest version, and send my comments.
Regarding your comments:

  • logbook handling non-ASCII characters: what version of logbook do you use? logbook 0.4.1 normally handles label printouts correctly for me. I have labels with Japanese and French characters.
  • you are right for storer.get_directory_from_id(). We should change that.

Need more time to think about the rest

@gaubert
Owner

gaubert commented Jan 14, 2013

@vasi Regarding the OfflineIMAP and Dovecot issues, I agree with you. Having options in the command line for that should do the trick. What about --type maildir --flavour dovecot or offlineimap, with one chosen by default? Tell me which is best. The flavour option would only apply in the case of --type maildir.
An alternative would be --type maildir-offlineimap and --type maildir-dovecot, but you would have to parse the CLI option. I would like to keep --type maildir or --type mbox, as it is a generic type that can be easily understood.

@gaubert
Owner

gaubert commented Jan 14, 2013

@vasi for this one:

  • I'm not printing what directory each message is in. This is partly because it's difficult, since storer.get_directory_from_id() returns an absolute path. Also, it's not clear why the exporter should care that emails are stored in directories. If we changed to storing them in a real DB, or in a flat file, or even something crazy like over a network, there's no reason that should force us to change the exporter code!

Instead, you could use the internal date which is in the meta info and then forget about the directory.

@vasi
Contributor Author

vasi commented Jan 14, 2013

@gaubert,

  • It turns out I was using logbook 0.4, not 0.4.1. It works fine with the new version. Good catch! Maybe gmvault should require 0.4.1? Then I can get rid of my workaround code.
  • The --flavour flag seems ok. I still prefer having --type dovecot and --type offlineimap though, since it only requires one flag, it's a cleaner interface. The main --type maildir flag can just be an alias to whichever of dovecot/offlineimap we decide we like more :)
  • Sure, I'll use the date for logging.
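
The single-flag interface from the second point could be sketched with `argparse` as follows. This is illustrative only: `parse_export_type` and `FLAVOUR_ALIASES` are hypothetical names, and both the default value and the flavour that `maildir` aliases to are placeholder assumptions.

```python
import argparse

# Which concrete flavour the generic "maildir" name maps to is an
# assumption made for this example.
FLAVOUR_ALIASES = {"maildir": "dovecot"}


def parse_export_type(argv):
    """Parse -t/--type and resolve generic names to a concrete flavour."""
    parser = argparse.ArgumentParser(prog="export-sketch")
    parser.add_argument(
        "-t", "--type",
        default="mbox",  # assumed default
        choices=["mbox", "maildir", "dovecot", "offlineimap"],
    )
    args = parser.parse_args(argv)
    return FLAVOUR_ALIASES.get(args.type, args.type)
```
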

@gaubert
Owner

gaubert commented Jan 18, 2013

@vasi
Where are we with this activity? I would like to finalise everything within the next 3 weeks to release 1.7.2.
Yesterday evening, I could test mbox under Thunderbird following your screencast and it is OK now. I still have the problem in Archived and will try to find the faulty email, but so far I have been unsuccessful.
Below is a list of what I think is left to decide between you and me to finalise the export function:

  1. Option for the different mbox and maildir flavours
    OK, we will use --type to select the flavour (dovecot or offlineimap for maildir, plus mbox). We still need a mbox and a maildir option that default to the most appropriate flavour. So --type can be maildir, mbox, offlineimap or dovecot.
  2. Bug in export: bad email display in the Archived folder
    We still need to find a solution to that problem, because if it happens to me it will happen to a user. You could try to reproduce the problem and find a solution. In the meantime I can try to find the culprit email, but it will take time. Do you think that having some code to validate the email structure before passing it to the "mbox transformer" would allow us to discard bad emails?
  3. What if somebody has an existing Archived label
    In that case we should suffix our label with something like -gmv. To simplify the problem we could always add this suffix. Still, if we encounter an existing label Archived-gmv, we can raise an error with a meaningful message. We should also make the archived folder name an option in the configuration file, in order to mitigate the issue in case somebody contacts us about it (and it will happen one day, so it is better to fix it right now).
  4. Logging improvements
    Where are we with the logging? Could we add the email date in the logs, to follow what has been processed and what still needs to be?
  5. Unit tests
    Could you write a unit test file to test most of the different cases for the exporter, as I would like to integrate it in the Jenkins I run for Gmvault? Contact me via email if you want the IP, because I do not want to spread it everywhere, to minimize potential issues.
  6. Build documentation
    We need to integrate it in the Gmvault documentation.
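
The naming rule proposed in point 3 could be sketched like this. `archive_folder_name` is a hypothetical helper; the `base` and `suffix` parameters stand in for the configuration option mentioned above.

```python
def archive_folder_name(existing_labels, base="Archived", suffix="-gmv"):
    """Return the folder name to use for archived (unlabelled) mail.

    The suffix is always added; if a user label with the resulting name
    already exists, fail with a message that explains what to change.
    """
    name = base + suffix
    if name in existing_labels:
        raise ValueError(
            "label %r already exists; choose another archive folder name "
            "in the configuration" % name
        )
    return name
```
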

OK, let me know if I have missed something and where we are with these tasks.
Many thanks

@vasi
Contributor Author

vasi commented Jan 19, 2013

  1. I will try to implement flavours early next week.
  2. Can you check if the bad emails have a "From:" line in them? I suspect lack of From: is causing the problem, but it's impossible for me to tell without some actual test data.
  3 & 4. Ok, will do.
  5. Do you have an existing Gmvault db that would be good to test against? I have my own little test db, but there are probably cases you have thought of that I have not. I might have some trouble writing tests; unfortunately I don't have a lot of experience in testing, so advice would be appreciated.
  6. Ok.
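
If a missing "From:" header does turn out to be the cause, a pre-export check along these lines might be enough to flag the affected messages. This is a sketch under that unconfirmed assumption; `has_from_header` is a hypothetical helper built on the standard `email` package.

```python
import email


def has_from_header(raw_message):
    """Return True if the message has a non-empty From: header.

    Only real headers count; a "From:" line inside the body does not.
    """
    message = email.message_from_string(raw_message)
    return bool((message.get("From") or "").strip())
```

Messages failing the check could be logged and skipped, or exported with a placeholder sender, depending on what we decide.
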

I think the only thing we missed is testing for Windows compatibility, including import. We will have to do that at some point.

@gaubert
Owner

gaubert commented Feb 13, 2013

@vasi I would like to merge your branch. Where are you with the different flavours ? I think I will take over from where you are if you don't mind. Let me know

@vasi
Contributor Author

vasi commented Feb 15, 2013

Sorry for the lack of progress, I've had the flu :( Feel free to take over if you like.

@gaubert
Owner

gaubert commented Feb 17, 2013

@vasi Could you still implement the --type dovecot and --type offlineimap flavours? I will take over from there, as you were almost there. This would allow me to include export in the next release. All the remaining bugs on the other features have been solved.
Let me know quickly, as I have to decide whether or not to put that feature in the next release.
Thanks.

@vasi
Contributor Author

vasi commented Feb 19, 2013

Ok, I will finish implementing the --type options.

@gaubert
Owner

gaubert commented Feb 19, 2013

Hi Dave,

When do you think you can have it done ?
I would like to merge it asap in order to release the next version within
the next 3 weeks.

Thks.


@gaubert
Owner

gaubert commented Feb 21, 2013

@vasi any progress ?

@vasi
Contributor Author

vasi commented Feb 21, 2013

Yup, starting it out.

@gaubert
Owner

gaubert commented Feb 21, 2013

@vasi add the support for OfflineIMAP and the other flavours in --type and I will deal with it afterwards.
Thanks

@vasi
Contributor Author

vasi commented Feb 26, 2013

Ok, I've added the flavour support, though it was harder than I thought. Dovecot and OfflineIMAP flavours both appear to work now. I did some testing, but not exhaustively.

https://github.com/vasi/gmvault/commits/export2

@gaubert
Owner

gaubert commented Feb 26, 2013

@vasi ok thanks. I will pull your add-ons and include them in the export branch for the next release. I really would like to release the new Gmvault within 2-3 weeks now.

@vasi
Contributor Author

vasi commented Feb 26, 2013

Great! Let me know if you need any help.

@gaubert
Owner

gaubert commented Feb 26, 2013

@vasi Why did you decide to put offlineimap as the default ?
Don't you think that the default mbox export compliant with Thunderbird will be more popular ?

@gaubert
Owner

gaubert commented Feb 26, 2013

@vasi I have briefly tested all the modes (with label selection and hierarchical labels) and verified that the result could then be used with Thunderbird. I am going to merge it in my dev branch for the final testing, because I want to release the next version.
Documentation needs to be written for that function, and documentation for standard email clients should be written in the future if we want people to use it.
We also need a proper test suite to quickly validate it.
I will declare this functionality as experimental for the next version.

Many thanks for your help and efforts

@vasi
Contributor Author

vasi commented Feb 27, 2013

I used OfflineIMAP as the default because I assumed people would use export to switch to another IMAP provider. But maybe Thunderbird makes more sense, I don't know.

Thanks for all your work on gmvault.

@sjuxax

sjuxax commented Feb 27, 2013

FYI I regularly export my emails just to save a local copy. I then delete everything from the server, import the exported files into Thunderbird, compress and encrypt the exported files and distribute them such that I can recover them if necessary. This way, I don't have a lot of emails sitting on Gmail waiting to be pwned and/or subpoenaed.

@gaubert
Owner

gaubert commented Feb 27, 2013

@sjuxax Thanks for detailing your usage of that functionality, this helps.
@vasi OK, I propose to make Thunderbird (mbox) the default and then describe your usage scenario in the documentation (change of email provider), as well as part of @sjuxax's scenario (use Thunderbird to look at your emails offline).

@ghost ghost assigned gaubert Mar 11, 2013
@gaubert gaubert merged commit 5738a50 into gaubert:master Mar 15, 2013