## Info

This is a notebook for a take home project from Way2B1.

The specifications for the assigment are:

Way2B1: Machine Learning Take Home Exercise
Project: Start benefiting from having too much email (or analyze your email inbox). Goal is to
connect to your email, apply a ML technique and produce a useful insight. This is very open
ended, so reach out for clarification if that’s overwhelming. Expected time commitment: 4-6
hours, but if useful to your portfolio then longer. If short on time, then you can write up how
you plan to approach each step of the project.

**Details:**

    ● Open a gmail account if you don’t already have one that you can use ( you can use any
    email provider that has an API, if you don’t want to use gmail)
    ○ If a new account, then you’ll need to figure out how to get some email into via
    some subscriptions, or sending test messages. Maybe forward non-sensitive, yet
    substantive (non-spam) messages from your primary email account

    ● Connect to Gmail’s API (https://developers.google.com/gmail/api/quickstart/python)

    ● Model: Choose a model or framework that you think will be useful in producing an insight
    related to your emails

    ● Algorithm: Select an algorithm that can be used to produce an insight or prediction from
    your emails. Spam detection is the canonical email analysis use case, you’re welcome to
    do it, but we’d prefer to see something else, like needs a reply, has an important event,
    or classifying emails into meaningful categories. It’s up to you and creativity is great.

    ● Features (if applicable): Identify useful features from each email (example: sender,
    sender domain, has attachments, keywords of email message etc). You don’t need to
    identify every possible feature, just some interesting ones that will support your
    insights.

    ● Data (sets depend on approach): Select a certain set of emails as your training set, a
    certain set as a validation set (optional), and a certain set as a test set (or just any
    arbitrary new email). You can probably use gmail labels to achieve this, but use whatever
    method is easiest for you.
    
**Languages: Python or JS, also you’re welcome to use any ML platform or infrastructure that you’re comfortable with**

**Submission:**

    ● Share a Github or Gitlab repo containing the code from the take-home

    ● README summarizing your approach and findings

    ● You’re free to use this code for any other purpose that you see fit outside of the take home evaluation 

**Evaluation:**

    ● Quality of implementation of algorithm

    ● Understandable explanation in README

    ● Legible, well designed code

    ● We’ll discuss the takehome in subsequent interviews.

**As an aside, this is something I have been meaning to play around with for some time.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy as sp
import time
import os

In [2]:
## google imports

from __future__ import print_function
import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

In [3]:
def googletest(ret='labels'):
    # If modifying these scopes, delete the file token.json.
    SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']

    creds = None
    # The file token.json stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())

    try:
        # Call the Gmail API
        service = build('gmail', 'v1', credentials=creds)
        
        if ret == 'labels':
            results = service.users().labels().list(userId='me').execute()
            labels = results.get('labels', [])
            
            if not labels:
                print('No labels found.')
                return
            print('Labels:')
            for label in labels:
                print(label['name'])
        
        else:
            results = service.users().messages().list(maxResults=200, userId='me').execute()
            messages = results.get('messages')

            if not messages:
                print('No messages found.')
                return
            
            message_list = []
            for msg in messages:
                # Get the message from its id
                txt = service.users().messages().get(userId='me', id=msg['id']).execute()

                # Use try-except to avoid any Errors
                try:
                    # Get value of 'payload' from dictionary 'txt'
                    payload = txt['payload']
                    headers = payload['headers']
                    message_list.append([payload, headers])
                except:
                    pass

    except HttpError as error:
        # TODO(developer) - Handle errors from gmail API.
        print(f'An error occurred: {error}')

In [4]:
googletest(ret='msg')

In [5]:
import quickstart

In [6]:
quickstart.main()

Labels:
CHAT
SENT
INBOX
IMPORTANT
TRASH
DRAFT
SPAM
CATEGORY_FORUMS
CATEGORY_UPDATES
CATEGORY_PERSONAL
CATEGORY_PROMOTIONS
CATEGORY_SOCIAL
STARRED
UNREAD
Personal
Receipts
Work
Notes
Daily Coding Problems
Stuff for Later!


**lets run through this manually**

In [106]:
SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']
creds = Credentials.from_authorized_user_file('token.json', SCOPES)
service = build('gmail', 'v1', credentials=creds)

In [107]:
# help(service.users().messages().list)
## the max is 500 it turns out

In [108]:
results.get("nextPageToken")

'13247078207197137520'

In [109]:
from bs4 import BeautifulSoup
import base64
import email

In [304]:
# def check_next_page_token(results):
#     npt = results.get("nextPageToken")
#     if npt:
#         results = service.users().messages().list(maxResults=n, userId='me', pageToken=npt).execute()
#         return results

In [308]:
def initial_gmail_sync(service, iters=10):
    pageToken = None
    messages_left = True

    all_msgs = []
    # Get messages
    while (messages_left) and (iters > 0):
        messages = service.users().messages().list(userId="me", pageToken=pageToken).execute()
        pageToken = messages.get('nextPageToken')
        # do something with the messages! Importing them to your database for example
        for message in messages:
#             email_id = message['id']
            dirty_message = service.users().messages().get(userId="me", id=message['id'], format='metadata').execute()
            clean_message = extract_data(dirty_message)
            # and now you can do what you want
            # clean_message contains 'from', 'subject', 'to', 'date', 'id' and 'history_id'
            all_msgs.append(clean_message)
        iters -= 1
        if not pageToken:
            messages_left = False
            # you've reached the end of the inbox!
    
    return all_msgs

In [309]:
def get_results(n_results=500, pageToken=None):
    ''' Gets Results
        Params:
        n_results = number of results to fetch
            (max is 500 for display)
        pageToken = to get next list of results from previous pull of 500
        Returns:
        results
    '''
    ## these are alredy confirmed to exist so skipping the check here
    SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']
    creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    service = build('gmail', 'v1', credentials=creds)
    
    results = initial_gmail_sync(service, iters=5)
    
#     ## maxes out at 500 so set it and forget it
#     if pageToken:
#         results = service.users().messages().list(maxResults=n_results, userId='me', pageToken=pageToken).execute()
#     else:
#         results = service.users().messages().list(maxResults=n_results, userId='me').execute()
    return results

In [310]:
def get_subject_and_body(results):
    ''' Function to pull in subject and body from messages.
        Params:
        results = results from pull
        Returns:
        List of Message Subject and Body
    '''   
    messages = results.get('messages')
    all_msgs = []
    start_time = time.time()
    for msg in messages:
        # Get the message from its id
        txt = service.users().messages().get(userId='me', id=msg['id'],
                                             format="full", metadataHeaders=None).execute()    
        payload = txt['payload']
        headers = payload['headers']
        snippet = txt['snippet']

        # Look for Subject and Sender Email in the headers
        for d in headers:
            if d['name'] == 'Subject':
                subject = d['value']
            if d['name'] == 'From':
                sender = d['value']
            
        if not subject:
            subject = None
        if not sender:
            sender = None
        
        try:
            parts = payload.get('parts')[0]
            data = parts['body']['data']
            data = data.replace("-","+").replace("_","/")
            decoded_data = base64.b64decode(data)
            soup = BeautifulSoup(decoded_data , "lxml")
            body = soup.body()     
        except:
            body = None

        all_msgs.append([subject, body, sender, snippet])

    print(f"time to complete = {(time.time() - start_time)/60 :.3f} mins")
        
    return all_msgs

- https://stackoverflow.com/questions/43269704/gmail-api-getting-all-gmail-inbox-messages-limits-to-500

In [263]:
# res1 = get_results(1000)
# msg1 = get_subject_and_body(res1)

In [264]:
# res1.get("resultSizeEstimate")

In [265]:
num_msgs_to_retrieve = 1500

all_res = []
# ## 500 max limit
# res_tmp = get_results(num_msgs_to_retrieve)
# all_res.append(res_tmp)
# itr = 0
# while (res_tmp.get("nextPageToken")) and (itr<10):
#     itr += 1
#     res_tmp = get_results(num_msgs_to_retrieve)
#     all_res.append(res_tmp)

pageToken = None
messages_left = True
iters = 3
while (messages_left) and (iters > 0):
    messages = service.users().messages().list(userId="me", pageToken=pageToken).execute()
    pageToken = messages.get('nextPageToken')
    # do something with the messages! Importing them to your database for example
    for message in messages:
#             email_id = message['id']
        dirty_message = service.users().messages().get(userId="me", id=message['id'], format='metadata').execute()
        clean_message = extract_data(dirty_message)
        # and now you can do what you want
        # clean_message contains 'from', 'subject', 'to', 'date', 'id' and 'history_id'
        all_msgs.append(clean_message)
    iters -= 1
    if not pageToken:
        messages_left = False

In [311]:
all_res = get_results(500)

TypeError: string indices must be integers

In [266]:
len(all_res)

11

In [267]:
all_res[-1].get("nextPageToken")

'18215689330561977979'

In [268]:
# num_msgs_to_retrieve = 3000

# all_res = []
# ## 500 max limit
# for i in range(int(num_msgs_to_retrieve/500)):
#     res_tmp = get_results(num_msgs_to_retrieve)
#     all_res.append(res_tmp)
#     if res_tmp.get("nextPageToken"):
#             res_tmp = service.users().messages().list(userId='me', 
#                                                       pageToken=res_tmp.get("nextPageToken")).execute()   
#             all_res.append(res_tmp)

In [269]:
all_msg = []
for r in all_res:
    msg_tmp = get_subject_and_body(r)
    mdf = pd.DataFrame(msg_tmp, columns=['subject','body','sender','snippet'])
    all_msg.append(mdf)


Bill
" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
" looks like a URL

time to complete = 0.851 mins
time to complete = 0.770 mins
time to complete = 0.743 mins
time to complete = 0.745 mins
time to complete = 0.806 mins
time to complete = 0.748 mins
time to complete = 0.752 mins
time to complete = 0.751 mins
time to complete = 0.807 mins
time to complete = 0.854 mins
time to complete = 0.751 mins


In [270]:
all_msg = pd.concat(all_msg, axis=0)
print(all_msg.shape)

(5500, 4)


In [271]:
all_msg.head(10)

Unnamed: 0,subject,body,sender,snippet
0,Reflecting on the final two weeks.,"[[Hi Bill,\r\n\r\nIn a moment, I am going to s...",Chris Deluzio <info@chrisforpa.com>,Fighting for our common good alongside all of ...
1,"Spend $500, Get $75 ENDS TONIGHT + Preview Tom...",,Costco Wholesale <Costco@online.costco.com>,"$150 - $300 OFF Apple iPad Pro (5th Gen), $300..."
2,check this out…,,Team McMullin <info@evanmcmullin.com>,‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ...
3,TIED,"[[B--\r\n\r\nBefore you go any further, I need...",Tim Ryan <info@fightforprogress.org>,"Keep in mind, national Republicans have poured..."
4,RSVP FOR TONIGHT: CELEBRATE DIWALI,[[https://cadem.org/ [https://cadem.org/] ...,CADEM <info@cadem.org>,"California Democratic Party Logo Democrats, Th..."
5,Immunity for COVID Vax/ Almond Mistake/ Herbal...,[[ \r\n \r\nCOVID Vax Handed Permanent Immun...,ANH USA <office@anh-usa.org>,COVID Vax Handed Permanent Immunity &amp; Fauc...
6,BREAKING: VoteAmerica named one of the top “Br...,,"""Emma, VoteAmerica"" <news@voteamerica.com>",This year there will be a record high number o...
7,"Bill, your $3 contribution with early voting ...","[[[1]Tammy Duckworth\r\n\r\n Hi, it's Mandel...",Mandela Barnes <tammy@e.tammyduckworth.com>,"This is a pivotal moment for our campaign, and..."
8,"US Antisemitism: Inexcusable, dangerous and sp...",[[[ ]J Street [ ]\r\nBill --\r\n\r\n The i...,"""Jeremy Ben-Ami, J Street"" <info@jstreet.org>",Bill -- The increased pervasiveness of antisem...
9,Save on Apple AirPods.,[[============================================...,Target <targetnews@em.target.com>,Want contactless shopping? Try Same Day Delive...


**This data pull is by no means perfect, there are a lot of missing values and we could likely have worked out better ways in dealing with the errors. That in mind, we have approxiamtely 5,500 instances from one of my email addresses that we can work through.**

In [272]:
all_msg.to_csv("data/pull.csv")

- lets use some text cleaning techniques
- we will examine the frequencies of subject matter, sender, and snippet
- we will do a little digging in the body

In [276]:
all_msg.subject.nunique()

105

**There arent a lot of unique subjects in the messages.**

In [277]:
all_msg.snippet.nunique()

469

**So there must be a lot of duplicates**

In [280]:
# all_msg.drop_duplicates()
## cant drop dupes

In [281]:
mdf = pd.DataFrame(msg1, columns=['subject','body','sender'])
mdf.shape


(500, 3)

In [282]:
mdf.head()

Unnamed: 0,subject,body,sender
0,A step closer to protecting this special place,[[We're working towards securing permanent pro...,Environment America <action@environmentamerica...
1,"Join 7 Brigade members at ""Virtual Work Night:...",[[Code for Pittsburgh\r\ninvites you to keep c...,Code for Pittsburgh <info@email.meetup.com>
2,We're betting on Stacey Abrams,[[Brian Kemp has got to go.\r\n\r\nView this e...,"""Lori D’Orazio, ReBuild USA"" <info@rebuildusa...."
3,Clarence Thomas has zero business ruling on ca...,,Patriotic Millionaires <info@patrioticmilliona...
4,"Military airstrike in Kachin State, Myanmar, k...",,"""Simon Billenness, International Campaign for ..."


In [285]:
mdf.subject.nunique()

135

In [283]:
for item in mdf.subject:
    if 'donat' in item.lower():
        print(item)

Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are five races where your donation will have maximum impact:
Here are f

- https://stackoverflow.com/questions/55144261/python-how-to-get-the-subject-of-an-email-from-gmail-api
- https://stackoverflow.com/questions/43269704/gmail-api-getting-all-gmail-inbox-messages-limits-to-500
- https://www.geeksforgeeks.org/how-to-read-emails-from-gmail-using-gmail-api-in-python/
- https://stackoverflow.com/questions/41761245/gmail-api-pagination-use-of-nextpagetoken
- https://medium.com/nat-personal-relationship-manager/full-gmail-api-guide-how-to-retrieve-email-metadata-to-use-it-in-your-app-511c77017326

In [288]:
import emaila
import base64

messageraw = service.users().messages().get(
    userId="me", 
    id=msg["id"], 
    format="raw", 
    metadataHeaders=None
).execute()

email_message = email.message_from_bytes(
    base64.urlsafe_b64decode(messageraw['raw'])
)

In [298]:
email_message['subject']

'=?UTF-8?B?V2hhdOKAmXMgeW91ciBjb21wdXRlcuKAmXM=?= favorite metric? |\r\n Cassie Kozyrkov in Towards Data Science'

In [303]:
email_message.items()

[('Delivered-To', 'wild.bill.schill@gmail.com'),
 ('Received',
  'by 2002:a05:6638:4710:0:0:0:0 with SMTP id cs16csp2544588jab;\r\n        Mon, 17 Oct 2022 04:40:01 -0700 (PDT)'),
 ('X-Google-Smtp-Source',
  'AMsMyM7zzL4obmQgoKsHsQeZeBvLNQKxele8AfZ7Yczph34NhYphyDqhLMZIl3YCT72Dbn78H8lH'),
 ('X-Received',
  'by 2002:a05:620a:461e:b0:6ee:e645:662d with SMTP id br30-20020a05620a461e00b006eee645662dmr2412708qkb.631.1666006800822;\r\n        Mon, 17 Oct 2022 04:40:00 -0700 (PDT)'),
 ('ARC-Seal',
  'i=1; a=rsa-sha256; t=1666006800; cv=none;\r\n        d=google.com; s=arc-20160816;\r\n        b=P0QbkFF2iVEx8s7PFnLyDvXhaFnJSKG14Fisruw+EFJQwA3EqDjZscFRwHDhcVtIQq\r\n         GwxqPsI+KFbfxsFbPy5He0K4HOdVG8SuN98G/32PwmdZBR9RNL04HslkB5ozHklHpLir\r\n         BYjzRFdAumLffSPbHz8g4RcB9C4FdLHfq9pOyrHpDPlaL+k4k/wNQvDVEqFT8W7fo3db\r\n         CCCgczimOYlQ2MU3v19jTokKGyiLu64P+s/GGEHu8bGE3oETuBYl6T7jwftUUN3GxnVl\r\n         xoBbfMAKLeLKDKgqfvAh7G4qg7Gh20O7D0pAsoY3YEwBQwbpsTlLjKVAkZfEll3qlyJ7\r\n         3sS

### OLD CODE BELOW

In [115]:
# msgs = []
# for msg in messages:
#     # Get the message from its id
#     txt = service.users().messages().get(userId='me', id=msg['id']).execute()    
#     payload = txt['payload']
#     headers = payload['headers']

#     # Look for Subject and Sender Email in the headers
#     for d in headers:
#         if d['name'] == 'Subject':
#             subject = d['value']
#         if d['name'] == 'From':
#             sender = d['value']
            
#     try:
#         parts = payload.get('parts')[0]
#         data = parts['body']['data']
#         data = data.replace("-","+").replace("_","/")
#         decoded_data = base64.b64decode(data)

#         soup = BeautifulSoup(decoded_data , "lxml")
#         body = soup.body()
#     except:
#         body = None
            
#     msgs.append([subject, body])