I downloaded my email data from Google Takeout. It gets delivered as a single `.mbox` file, which Python can parse natively using the standard-library `mailbox` module.

In [3]:
import mailbox
import pathlib

data_path = pathlib.Path('../../data')

# Load the combined file that contains all the messages.
combined_mbox = mailbox.mbox(data_path / 'prakashdotvijay.mbox', create=False)

I looked for an existing way to split this file up, because it's hard to deal with 70k messages at once, but didn't find any great options. I eventually realized I could split it into separate files myself, using the Gmail labels stored on each message. Groups that make sense to me:

- Trash
- Spam
- Category Promotions
- Receipts
- Everything else

In [4]:
# Create/open individual mailboxes
trash_mbox = mailbox.mbox(data_path / 'trash.mbox')
spam_mbox = mailbox.mbox(data_path / 'spam.mbox')
promotions_mbox = mailbox.mbox(data_path / 'promotions.mbox')
receipts_mbox = mailbox.mbox(data_path / 'receipts.mbox')
leftover_mbox = mailbox.mbox(data_path / 'leftover.mbox')

In [3]:
# Divide the messages into individual mailboxes
messages = combined_mbox.itervalues()
for message in messages:
  labels = (message['X-Gmail-Labels'] or '').split(',')
  if 'Trash' in labels:
    trash_mbox.add(message)
  elif 'Spam' in labels:
    spam_mbox.add(message)
  elif 'Category Promotions' in labels:
    promotions_mbox.add(message)
  elif 'Receipts' in labels:
    receipts_mbox.add(message)
  else:
    leftover_mbox.add(message)
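
One caveat with the split above: `mailbox.mbox(path)` reopens an existing file and `add()` appends to it, so running the cell a second time would write every message again. Here is a minimal sketch of one way to guard against that. The `bucket_for` and `split_by_label` helpers, the file names, and the three sample messages are all made up for illustration; they stand in for the real Takeout data.

```python
import mailbox
import pathlib
import tempfile

# Label checks in priority order; mirrors the if/elif chain above.
BUCKETS = ['Trash', 'Spam', 'Category Promotions', 'Receipts']

def bucket_for(message):
    """Return the first matching Gmail label, or 'Leftover' if none match."""
    labels = (message['X-Gmail-Labels'] or '').split(',')
    for bucket in BUCKETS:
        if bucket in labels:
            return bucket
    return 'Leftover'

def split_by_label(messages, out_dir):
    """Hypothetical helper: write each message to <out_dir>/<bucket>.mbox."""
    boxes = {}
    for name in BUCKETS + ['Leftover']:
        filename = name.lower().replace(' ', '_') + '.mbox'
        box = mailbox.mbox(out_dir / filename)
        box.clear()  # start empty so a re-run does not append duplicates
        boxes[name] = box
    for message in messages:
        boxes[bucket_for(message)].add(message)
    counts = {}
    for name, box in boxes.items():
        counts[name] = len(box)
        box.close()  # close() flushes pending changes to disk first
    return counts

def make_message(labels):
    """Build a tiny sample message carrying only the headers used here."""
    msg = mailbox.mboxMessage()
    msg['X-Gmail-Labels'] = labels
    msg['From'] = 'sender@example.com'
    msg.set_payload('hello')
    return msg

sample = [
    make_message('Inbox,Spam'),
    make_message('Category Promotions,Unread'),
    make_message('Inbox'),
]

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp)
    split_by_label(sample, out)
    counts = split_by_label(sample, out)  # second run must not double up

print(counts)
```

Clearing each output mailbox first is the simplest option for a notebook. Deleting the output files before the cell runs would work too.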


In [5]:
# List out the number of messages in each mailbox
print(f'Total messages: {len(combined_mbox)}')
print(f'Trash messages: {len(trash_mbox)}')
print(f'Spam messages: {len(spam_mbox)}')
print(f'Promotion messages: {len(promotions_mbox)}')
print(f'Receipt messages: {len(receipts_mbox)}')
print(f'Leftover messages: {len(leftover_mbox)}')


Total messages: 72983
Trash messages: 454
Spam messages: 38
Promotion messages: 19471
Receipt messages: 2078
Leftover messages: 50942
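
Since the `if`/`elif` chain puts every message in exactly one bucket, the bucket sizes should add up to the total. A quick check on the numbers printed above (the dictionary is just those counts copied by hand):

```python
# Counts copied from the output above.
total = 72983
buckets = {
    'Trash': 454,
    'Spam': 38,
    'Promotions': 19471,
    'Receipts': 2078,
    'Leftover': 50942,
}

# Every message lands in exactly one bucket, so nothing should be lost.
assert sum(buckets.values()) == total

# Share of the whole mailbox that each bucket represents.
shares = {name: count / total for name, count in buckets.items()}
for name, share in shares.items():
    print(f'{name}: {share:.1%}')
```

Promotions come out to roughly a quarter of all messages, which is why that bucket is worth a closer look.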


One thing I wanted to be able to do is poke at the Promotions bucket. Nearly everything in here is spam, but Gmail doesn't provide a convenient way to group messages by sender and get rid of them all at once. That should be easier in Python. Step 1 is to reduce each sender's email address to just its domain.

In [7]:
messages = promotions_mbox.itervalues()
message = next(messages)
print(message.keys())

['X-GM-THRID', 'X-Gmail-Labels', 'Delivered-To', 'Received', 'X-Google-Smtp-Source', 'X-Received', 'ARC-Seal', 'ARC-Message-Signature', 'ARC-Authentication-Results', 'Return-Path', 'Received', 'Received-SPF', 'Authentication-Results', 'DKIM-Signature', 'DKIM-Signature', 'Date', 'To', 'From', 'Reply-to', 'Subject', 'Message-ID', 'CFBL-Address', 'X-Subscription', 'X-Mailer', 'X-MessageID', 'List-Id', 'X-Abuse-Info', 'X-Abuse-Info', 'X-Complaints-To', 'X-Report-Abuse', 'X-CSA-Complaints', 'Feedback-ID', 'List-Unsubscribe', 'List-Unsubscribe-Post', 'X-Virtual-MTA', 'MIME-Version', 'Content-Type']


In [46]:
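import email.utils
import re

# parseaddr returns a (display name, address) pair; keep only the address.
parsed_email = email.utils.parseaddr(message['From'])[1]
print(parsed_email)

# Capture the last two dot-separated labels of an address, e.g. 'google.com'.
regex = re.compile(r"[@\.](\w+?\.\w+)\Z")
match = regex.findall("foo@a.b.c.google.com")
print(match)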


info@cantura.club
['google.com']
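
The regex has a blind spot worth knowing about: `\w` does not match a hyphen, so a sender at a hyphenated domain produces no match at all. Below is a rough sketch of an alternative that splits the host on dots instead. `sender_domain` is a hypothetical helper and the addresses are invented examples.

```python
import email.utils
import re

regex = re.compile(r"[@\.](\w+?\.\w+)\Z")

# \w excludes '-', so the hyphenated domain is never captured.
hyphen_match = regex.findall("news@mail.my-shop.com")

def sender_domain(from_header):
    """Hypothetical helper: last two dot-separated labels of the sender's host."""
    address = email.utils.parseaddr(from_header)[1]
    _, at, host = address.rpartition('@')
    if not at:
        return None  # no '@' means there is no host part to inspect
    labels = host.lower().split('.')
    if len(labels) < 2 or not all(labels):
        return None  # single-label or malformed host
    return '.'.join(labels[-2:])

print(hyphen_match)
print(sender_domain('Foo <foo@a.b.c.google.com>'))
print(sender_domain('news@mail.my-shop.com'))
```

Neither approach understands multi-part suffixes: an address at `shop.co.uk` comes back as `co.uk` either way. Handling that properly needs the Public Suffix List, which is more than this notebook needs right now.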


I decided it would be good to encapsulate some of this parsing logic in a class. Eventually this can be moved out of the notebook but it's easier to work with in here.

In [47]:
import collections

class ParsedMessage:
  # Compiled once for the class rather than once per message.
  DOMAIN_PATTERN = re.compile(r"[@\.](\w+?\.\w+)\Z")

  def __init__(self, message):
    self.sender_domain = self._parse_sender_domain(message)

  def _parse_sender_domain(self, message):
    # A missing From header comes back as None; treat it as an empty address.
    parsed_email = email.utils.parseaddr(message['From'] or '')[1]
    groups = self.DOMAIN_PATTERN.findall(parsed_email)
    if len(groups) == 0:
      return None

    return groups.pop()

# Calling the class itself (not ParsedMessage.__new__) runs __init__ for each message.
parsed_messages = map(ParsedMessage, promotions_mbox.itervalues())

# Count messages per sender domain, skipping senders that could not be parsed.
messages_by_domain = collections.Counter(
    m.sender_domain for m in parsed_messages if m.sender_domain is not None)

print(messages_by_domain)
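
A note on counting per domain: `itertools.groupby` is a tempting tool here, but it only groups consecutive equal keys, and each group is an iterator with no `len()`. On unsorted input the same domain shows up as several separate runs. A small sketch with invented domains, comparing it to `collections.Counter`:

```python
import collections
import itertools

# Made-up sender domains standing in for ParsedMessage.sender_domain values.
domains = ['shop.com', 'news.org', 'shop.com', None, 'shop.com', 'news.org']

# groupby on unsorted input: each domain is split into separate runs.
# The group is an iterator, so it has to be materialized before counting.
runs = [(key, len(list(group))) for key, group in itertools.groupby(domains)]

# Counter tallies by key regardless of order; None entries are skipped.
counts = collections.Counter(d for d in domains if d is not None)

print(runs)
print(counts.most_common())
```

Sorting before `groupby` would also work, but `None` cannot be sorted against strings, so `Counter` is the lower-friction choice here.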