# Email Sorting
Royce Schultz

Files available on my [GitHub](https://github.com/royceschultz/DataProject)

## Abstract
I currently have 1,460 unread emails across my two primary accounts. Sorting all of these messages is no quick task. The simplest method is a series of if statements like:
```
# `from` is a reserved word in Python, so the sender is read from an attribute instead
if message.sender == 'no-reply@piazza.com': return label('piazza')
```
But more nuanced cases may require more care. For example, email lists may come from many different people. Additionally, those people may send you other emails in different contexts.
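
As a rough illustration of the sender-based approach and where it runs out, here is a minimal sketch. The `SENDER_RULES` table, the `label_by_sender` helper, and the `example.com` addresses are made up for this example and are not part of the project code.

```python
# Hypothetical rule table: sender address -> label.
SENDER_RULES = {
    'no-reply@piazza.com': 'Piazza',
    'recruiter@example.com': 'Job Applications',  # made-up address
}

def label_by_sender(sender, default='INBOX'):
    """Return the label for a sender, or the default when no rule matches."""
    return SENDER_RULES.get(sender.lower(), default)

print(label_by_sender('no-reply@piazza.com'))   # Piazza
print(label_by_sender('friend@example.com'))    # INBOX
```

A table like this only ever sees the address, so a denial and an offer from the same sender would get the same label. That is why the rest of this notebook looks at message content instead.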

I will be looking specifically at identifying responses to job applications. My goal is to distinguish between confirmation emails, code tests, denials, offers, and spam recruiters.

In [1]:
from model import Gmail
import matplotlib.pyplot as plt
import numpy as np
import random

## The Gmail Class

This custom class handles communication with the Gmail API. It can fetch live messages and labels from the associated Gmail account.

In [9]:
G = Gmail()

In [10]:
LABEL_NAMES = G.labels.names()
DEFAULT_LABELS = ['CATEGORY_PERSONAL','CATEGORY_SOCIAL','CATEGORY_FORUMS','IMPORTANT','CATEGORY_UPDATES'
                  ,'CHAT','SENT','INBOX','TRASH','CATEGORY_PROMOTIONS','DRAFT','SPAM','STARRED','UNREAD']
# Anything Gmail does not provide by default is a user-created label
CUSTOM_LABELS = [label for label in LABEL_NAMES if label not in DEFAULT_LABELS]
CUSTOM_LABELS

['Github', 'Cycling', 'Canvas', 'Job Applications', 'Piazza']

In [13]:
reset = False
idx = [1]
RESET_LABELS = [CUSTOM_LABELS[i] for i in idx]
SET_LABELS = ['INBOX','UNREAD']
if reset:
    for label in RESET_LABELS:
        print(label)
        G.labels.clearLabel([label],[label],SET_LABELS)
print('done')

done


Let's pick a label and identify characteristic words in this group.

In [None]:
myLabels = ['Job Applications']

In [None]:
messages = G.labels.match(myLabels)

In [None]:
print(len(messages))
print(messages[0])

### Creating a descriptive hash
The hash function shingles every message it is given and keeps the shingles that appear in at least a fraction **freq** of those messages (so `freq=0.15` means 15%).
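
To make the idea concrete, here is a minimal sketch of shingling and frequency filtering. It is not the code behind `G.getHash`. I am assuming here that `k` is the shingle length in characters and that `freq` is a fraction between 0 and 1. The helper names `shingles` and `frequent_shingles` and the sample sentences are made up for illustration.

```python
from collections import Counter

def shingles(text, k=8):
    """Return the set of all k-character substrings of text."""
    text = ' '.join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def frequent_shingles(texts, freq=0.15, k=8):
    """Keep shingles that occur in at least a fraction `freq` of the texts."""
    counts = Counter()
    for text in texts:
        counts.update(shingles(text, k))  # a set, so each text counts once
    threshold = freq * len(texts)
    return {s for s, n in counts.items() if n >= threshold}

# Made-up messages: two share a template opening, one is unrelated.
docs = [
    'Thank you for applying to Acme.',
    'Thank you for applying to Initech.',
    'Your package has shipped.',
]
common = frequent_shingles(docs, freq=0.6, k=8)
print(sorted(common)[:3])
```

With `freq=0.6` a shingle has to show up in at least two of the three messages, so only pieces of the shared opening survive.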

In [None]:
myHash = G.getHash(random.sample(messages,50), freq=0.15, k=8)

In [None]:
print(len(myHash))
print(myHash)

In [None]:
myHash.filter(.3)
print(len(myHash))
print(myHash)

### Evaluating the hash
Let's see how the messages in the group score against the hash.

In [None]:
scores = []
for i, message in enumerate(messages):
    print(i, end='\r')
    content = G.messages.parseMessage(message)
    s = myHash.sim(content)
    scores.append(s)
plt.hist(scores)
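
For intuition about what a score like this could mean, here is a minimal sketch of one possible similarity measure: the fraction of the hash's shingles that also appear in a message. This is an assumption for illustration. The notebook does not show how `myHash.sim` is computed, and the `containment` helper and sample strings below are hypothetical.

```python
def shingles(text, k=8):
    """Return the set of all k-character substrings of text."""
    text = ' '.join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def containment(hash_shingles, text, k=8):
    """Fraction of the hash's shingles that also occur in text (0.0 to 1.0)."""
    if not hash_shingles:
        return 0.0
    return len(hash_shingles & shingles(text, k)) / len(hash_shingles)

# Stand-in for a label hash: the shingles of one characteristic phrase.
label_hash = shingles('thank you for applying')
print(containment(label_hash, 'Thank you for applying to Acme.'))  # 1.0
print(containment(label_hash, 'Your package has shipped.'))        # 0.0
```

Under this definition a score near 1 means the message contains nearly every characteristic shingle, and a score near 0 means it shares almost none.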

In [None]:
ranks = np.argsort(scores)

for i in ranks[:5]:
    print(G.messages.readMessage(messages[i])[:100])

In [None]:
for i in ranks[-5:]:
    print(G.messages.readMessage(messages[i])[:100])

### Huh, that might be a problem
This method is very sensitive to initialization, meaning the random sample of messages used to build the hash. It favors 'template' emails that are literally identical across large chunks of their content.
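
The effect is easy to reproduce on toy data. The sketch below is a simplified stand-in, not the project's `getHash` or `sim`. It assumes character shingles and a containment-style score, and the template and messages are made up. The helpers are defined inline so the cell runs on its own.

```python
from collections import Counter

def shingles(text, k=8):
    text = ' '.join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def frequent_shingles(texts, freq, k=8):
    counts = Counter()
    for text in texts:
        counts.update(shingles(text, k))
    return {s for s, n in counts.items() if n >= freq * len(texts)}

def containment(hash_shingles, text, k=8):
    if not hash_shingles:
        return 0.0
    return len(hash_shingles & shingles(text, k)) / len(hash_shingles)

# A sample that happens to contain only template confirmations.
template = 'We received your application for the {} role.'
sample = [template.format(role) for role in ('data', 'backend', 'intern')]

# A hand-written reply that belongs under the same label.
personal = 'Hi, great portfolio. Are you free to talk on Monday?'

label_hash = frequent_shingles(sample, freq=0.5)
template_score = containment(label_hash, template.format('frontend'))
personal_score = containment(label_hash, personal)
print(template_score, personal_score)
```

A new email built from the same template matches every shingle in the hash, while the hand-written message matches none, even though both are about a job application.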

### Identifying new messages
Now that we've done the hard part by sampling lots of examples of a label, matching the label to a new message is pretty easy.

In [None]:
for i in range(100):
    print(i,end='\r')
    message = G.messages.popMessage()
    content = G.messages.parseMessage(message)
    s = myHash.sim(content)
    if s > .3:
        print(content[:100])

### Preliminary conclusion
This method successfully identifies closely related emails. This should be a significant step toward organizing my inbox, but further tests are needed to measure how well it works on emails with more diverse language.