Import many messages from multiple accounts #14

Merged
merged 24 commits into from Mar 29, 2012

tomafro commented Mar 29, 2012

As discussed yesterday, these are the changes I made to enable sauron to both import a lot of messages from different accounts and show them quickly through the web interface.

tomafro added some commits Mar 23, 2012

Split searching for message uids from loading the messages.
This will allow more flexible message loading strategies, such as loading messages in batches or only loading specific messages.
Select the INBOX when initialising the Gmail client
As we never operate on any other mailboxes, there's no need to select a mailbox before each operation.  This makes the client specific to the single mailbox, which given the mailbox-centric way that IMAP works seems to be a useful property.
Store messages in the repository against a given key, using the message UID

This will allow us to check whether the repository already contains the given key, allowing us in the future to only import messages that aren't yet in the repository.
Move MD5 key generation right down to the message store.
If we're storing messages in the repository with a given id, key generation is redundant as far as the repository goes.  It's now the responsibility of the message store itself to decide how to save files.  For now we're still using MD5 hashes but this could easily change.
Import ALL the email from a user's account, not just their inbox messages

I've been testing against my tom.ward account and sauron-test, neither of which has enough messages to really break anything (in their inboxes at least).  Switching to All Mail to try to destroy things.
Use Mail to bypass encoding issues
Attempting to import all the emails from tom.ward@gofreerange.com I encountered some nasty encoding problems like this one:

  Encoding::UndefinedConversionError: "\xA3" from ASCII-8BIT to UTF-8
    from /Users/tomw/Projects/freerange/sauron/lib/file_based_message_store.rb:14:in `write'
    from /Users/tomw/Projects/freerange/sauron/lib/file_based_message_store.rb:14:in `[]='

Rather than worry too much about this, I decided to pipe all imported messages through Mail.new to see if that fixed the problem.  It seemed to, but we should be aware of encoding problems that may arise and fix the underlying problem when or if it becomes the next most important thing to do.
Use Base64 encoding when storing messages to avoid encoding issues.
Passing messages through Mail.new(message).to_s worked for about 1,100 of my messages, but not all of them.  Base64 is safer.
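
The encoding failure and the Base64 fix described in the commits above can be reproduced in isolation. This is a minimal sketch, not the sauron implementation: `SketchMessageStore` and its methods are hypothetical names, and it assumes the store writes one file per key, named by the MD5 hash of that key as the earlier commit describes.

```ruby
require "base64"
require "digest/md5"
require "tmpdir"

# Hypothetical file-backed store; not the project's FileBasedMessageStore.
class SketchMessageStore
  def initialize(root)
    @root = root
  end

  # Lets an importer skip messages that are already stored.
  def include?(key)
    File.exist?(path_for(key))
  end

  def []=(key, raw_message)
    # Base64 output is plain ASCII, so the write never needs transcoding.
    File.write(path_for(key), Base64.strict_encode64(raw_message))
  end

  def [](key)
    return nil unless include?(key)
    Base64.strict_decode64(File.read(path_for(key)))
  end

  private

  # The store, not the caller, decides how a key maps to a file name.
  def path_for(key)
    File.join(@root, Digest::MD5.hexdigest(key.to_s))
  end
end

# Bytes as an IMAP server might return them: binary, with a Latin-1 pound sign (0xA3).
raw = "Subject: Invoice for \xA3100".b

# The kind of failure seen above: 0xA3 has no defined conversion from ASCII-8BIT to UTF-8.
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts "transcoding failed: #{e.message}"
end

Dir.mktmpdir do |dir|
  store = SketchMessageStore.new(dir)
  store[1234] = raw
  puts store[1234] == raw
end
```

The cost of this approach is that stored files grow by roughly a third and are no longer readable as plain text on disk.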
lazyatom commented on 28c7357 Mar 23, 2012

I'm pleased that you hit this issue already :)

lazyatom commented on a9d3348 Mar 23, 2012

I know this isn't a concern at this point, but we know that UID isn't going to be unique across multiple accounts (or even mailboxes).
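
One way to address the uniqueness concern is to key messages on more than the UID alone. The sketch below is an assumption about how that could look, not something from this pull request: `MessageKey` and its fields are hypothetical. IMAP only guarantees that a UID is unique within a single mailbox, so the account and mailbox are included in the key.

```ruby
# Hypothetical value object; the field names are assumptions.
MessageKey = Struct.new(:account, :mailbox, :uid) do
  def to_s
    "#{account}/#{mailbox}/#{uid}"
  end
end

alice = MessageKey.new("alice@example.com", "INBOX", 42)
bob   = MessageKey.new("bob@example.com", "INBOX", 42)

# Struct instances compare and hash by their members, so they work as hash keys.
seen = { alice => true }
puts seen.key?(bob)   # same UID, different account
puts alice.to_s
```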

tomafro added some commits Mar 23, 2012

GmailImapClient => GoogleMail::Mailbox
What was previously the client only provides access to a single mailbox, so can be called Mailbox with little confusion.  This makes many things clearer to me, such as referring to 'all the messages in the mailbox' (rather than the client).
Use an ActiveRecord model to record each imported message.
To display a list of messages, there's no need to store them as files on disk.  Only the subject, date and from fields are required, so store these as records in the database.

One advantage of storing the messages on disk is that it avoids having to reimport data (which can be very slow).  However, using the CachedConnection to IMAP should negate this.
Support importing messages from multiple accounts.
As the initial schema hasn't been merged into master, I didn't see any problem adding a column to the migration directly, rather than cluttering up the world with more migrations.
Add the ability to show a full message.
In this case, full takes its most extreme meaning - the entire content of the original email.  This is more as proof that we can still access original message content via the web than anything else.

In order to prevent the display of a list of messages taking ages, the original message is only loaded at the point it is needed, not when returning one or a number of messages from the repository.
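
The deferred loading described in the last commit can be sketched as follows. This is illustrative only: `SketchMessage` is a hypothetical plain Ruby class standing in for the ActiveRecord model, and the block passed to it stands in for whatever fetches the original message.

```ruby
# Hypothetical class; the real model in this pull request is an ActiveRecord record.
class SketchMessage
  attr_reader :subject, :date, :from

  def initialize(subject:, date:, from:, &loader)
    @subject = subject
    @date = date
    @from = from
    @loader = loader
  end

  # The original message is fetched on first use and memoised afterwards.
  def original
    return @original if defined?(@original)
    @original = @loader.call
  end
end

fetches = 0
messages = 3.times.map do |i|
  SketchMessage.new(subject: "Message #{i}", date: "2012-03-2#{i}", from: "someone@example.com") do
    fetches += 1
    "raw source of message #{i}"
  end
end

messages.each { |m| puts m.subject }  # listing does not run any loader
puts messages.first.original          # runs the loader for one message only
puts fetches
```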
lazyatom commented on 6dacbae Mar 27, 2012

I think it would be really helpful to explain your thinking behind the introduction of this.

tomafro replied Mar 27, 2012

Sure. I got bored importing messages again and again and again because it took so long. Previously (and maybe still at this commit) there was a message store which placed each message on disk, but if I wanted to change the import process (adding a database record for example), or change the message store itself, there was no easy way to get at a whole bunch of messages without re-downloading them. Fine for 10s of messages, but crap when you're dealing with more than a hundred, say.

I'd noticed that IMAP calls were split into two parts, one to find the ids of the messages you wanted, and one to download the actual data. It occurred to me that if I cached the data calls (uid_fetch) as low down as possible, I could try out different importers over a large range of data much more easily. I was also interested in how easy it would be to change the connection class. In the end it was very simple (though I'm sure the tests could be improved).
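
The idea can be sketched as a thin wrapper around the connection. This is an assumption-laden illustration rather than the CachedConnection from this pull request: `SketchCachedConnection` and `FakeConnection` are hypothetical, the cache here is an in-memory hash (it would need to be persisted to survive between import runs), and only the method names `uid_search` and `uid_fetch` are taken from Ruby's Net::IMAP.

```ruby
# Hypothetical decorator; not the project's CachedConnection.
class SketchCachedConnection
  def initialize(connection, cache = {})
    @connection = connection
    @cache = cache
  end

  # Searches always go to the server so new messages are still discovered.
  def uid_search(*args)
    @connection.uid_search(*args)
  end

  # Fetches are memoised per (uid, attribute) pair.
  def uid_fetch(uid, attr)
    @cache.fetch([uid, attr]) do
      @cache[[uid, attr]] = @connection.uid_fetch(uid, attr)
    end
  end
end

# Stand-in for a Net::IMAP connection that counts fetch round trips.
class FakeConnection
  attr_reader :fetch_count

  def initialize
    @fetch_count = 0
  end

  def uid_search(_keys)
    [1, 2, 3]
  end

  def uid_fetch(uid, attr)
    @fetch_count += 1
    "#{attr} for #{uid}"
  end
end

fake = FakeConnection.new
cached = SketchCachedConnection.new(fake)

# Two full import passes; only the first should reach the underlying connection.
2.times do
  cached.uid_search("ALL").each { |uid| cached.uid_fetch(uid, "BODY.PEEK[]") }
end
puts fake.fetch_count
```

Because the wrapper exposes the same two methods as the connection it wraps, an importer can be handed either one without changes.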

lazyatom added a commit that referenced this pull request Mar 29, 2012

Merge pull request #14 from freerange/import-all-a-single-users-messages
Import many messages from multiple accounts

Chris, James M and I reviewed Tom's changes and agreed to merge into master.

@lazyatom lazyatom merged commit 324ec35 into master Mar 29, 2012

floehopper Apr 2, 2012

Owner

I looked for a migration called CreateMessages, which is what I'd expect by convention, but didn't find it. What's the thinking behind calling this InitialSchema?


tomafro replied Apr 2, 2012

Not put too much thinking into it, but it's something I've done on a lot of recent personal projects. I find that at the start of a project the schema can be in a state of flux. Until the data on production can't be recreated I put everything in a single migration, moving things around and changing them at will. I only add more migrations once I actually need to migrate production (not just rake db:migrate:reset).
