# Job Rejection Data Collection
My aim is to explore job rejection emails and find meaningful insights in them. Before I can do that, I need to collect the data. To do so, I am going to use Python's built-in `imaplib` and `email` modules to pull messages directly from Gmail over IMAP. Here are some resources that were helpful:
* https://gist.github.com/jasonrdsouza/1674794
* https://gist.github.com/robulouski/7441883

In [173]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
import email, getpass, imaplib, os, re
import email.header, email.utils  # submodules used below; import them explicitly
from datetime import datetime
%matplotlib inline
sns.set_style('whitegrid')

In [174]:
# Collect Gmail login credentials
detach_dir = '.'
user = input("Enter your GMail username:")
pwd = getpass.getpass("Enter your password: ")

Enter your GMail username:cdewey96@vt.edu
Enter your password: ········


In [175]:
# Connect to the Gmail IMAP server
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)

('OK', [b'cdewey96@vt.edu authenticated (Success)'])

In [176]:
# Select the "Job Rejection" label (Gmail exposes labels as IMAP mailboxes)
m.select('"Job Rejection"')

('OK', [b'86'])
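
The label name above is wrapped in an extra pair of double quotes because it contains a space, and as far as I know `imaplib` sends the mailbox name to the server as given. Here is a minimal sketch of a helper that does the quoting; `quote_mailbox` is a hypothetical name, not part of `imaplib`, and it does not handle non-ASCII label names.

```python
def quote_mailbox(name):
    """Wrap an IMAP mailbox name in double quotes so spaces survive.

    Backslashes and double quotes inside the name are backslash-escaped,
    which is how IMAP quoted strings expect them.
    """
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


print(quote_mailbox("Job Rejection"))
```

With a live connection this could be used as `m.select(quote_mailbox("Job Rejection"), readonly=True)`. Passing `readonly=True` is optional, but it should keep the later `RFC822` fetches from marking the messages as read.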

In [177]:
# Get the email ids
resp, items = m.search(None, "All")
items = items[0].split()

In [208]:
# Initialize lists
text = []
dates = []
subjects = []

In [209]:
# Collect data
for emailid in items:
    
    # Fetch everything from the id
    resp, data = m.fetch(emailid, "(RFC822)")
    
    # Get the content
    email_body = data[0][1]
    
    # Convert to mail object
    mail = email.message_from_bytes(email_body)
    
    # Get subject
    subjects.append(email.header.decode_header(mail['Subject'])[0][0])
    
    # Get date
    date_tuple = email.utils.parsedate_tz(mail['Date'])
    dates.append(datetime.fromtimestamp(email.utils.mktime_tz(date_tuple)))
    
    # Get text
    if mail.is_multipart():
        text.append(mail.get_payload(0).get_payload())
    else:
        text.append(mail.get_payload())
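
One caveat with the loop above: `decode_header(...)[0][0]` returns `bytes` when the subject is RFC 2047 encoded and only looks at the first chunk. Below is a small sketch of a more defensive way to read the two headers. The helper names `decode_subject` and `parse_date` are my own, and the sample header values are made up for illustration.

```python
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime


def decode_subject(raw_subject):
    """Return the Subject header as a str, decoding any encoded words."""
    if raw_subject is None:
        return ""
    # make_header joins all decoded chunks and applies their charsets
    return str(make_header(decode_header(raw_subject)))


def parse_date(raw_date):
    """Parse a Date header into a datetime that keeps the sender's UTC offset."""
    return parsedate_to_datetime(raw_date)


# Made-up sample values, not taken from the real mailbox
print(decode_subject("=?utf-8?q?Caf=C3=A9_application?="))
print(parse_date("Thu, 12 Jul 2018 12:31:08 -0400").isoformat())
```

Note that `parse_date` keeps the offset from the header, whereas the loop above converts every timestamp to the local time of the machine running the notebook. Which one is preferable depends on the questions being asked of the data.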

In [216]:
# Convert to dataframe
df = pd.DataFrame(data={'Date': dates, 'Subject': subjects, 'Text': text})
df.head()

|   | Date | Subject | Text |
|---|---|---|---|
| 0 | 2018-07-12 12:31:08 | Your IBM Application | `Ref: 110127BR - 2018 Data Scientist Internship...` |
| 1 | 2018-06-12 16:30:28 | Thank you from Workday! | `<!doctype html><html xmlns:o=3D"urn:schemas-mi...` |
| 2 | 2018-05-17 08:43:38 | An Update Regarding Your Visa Job Application | `\r\nDear Conor,\r\nThank you for giving us the...` |
| 3 | 2018-05-01 15:21:05 | Thank you for your interest in Zynga for Inter... | `<html><head>\r\n<meta http-equiv=3DContent-Typ...` |
| 4 | 2018-04-26 14:49:02 | Your Application with Cambia Health Solutions | `Dear Conor,\r\n=C2=A0\r\nThank you for the int...` |
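
The `Text` column in the preview still contains transfer-encoding residue such as `=3D` and `=C2=A0`, because `get_payload()` without arguments returns the body exactly as it was sent. Here is a hedged sketch of one way to get decoded text instead. `extract_text` is a hypothetical helper, the sample message is made up, and preferring `text/plain` over `text/html` is my assumption about what is most useful here.

```python
import email


def extract_text(mail):
    """Return the decoded body of a message, preferring text/plain over text/html."""
    fallback = ""
    for part in mail.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and attachments
        payload = part.get_payload(decode=True)  # undoes quoted-printable / base64
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            body = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8
            body = payload.decode("utf-8", errors="replace")
        if part.get_content_type() == "text/plain":
            return body
        if not fallback:
            fallback = body
    return fallback


# Made-up quoted-printable message for illustration
raw = (
    b"Subject: Sample\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Dear applicant,=C2=A0thank you for your interest.\r\n"
)
mail = email.message_from_bytes(raw)
print(repr(mail.get_payload()))   # raw body, still quoted-printable
print(repr(extract_text(mail)))   # decoded body
```

In the collection loop, `text.append(extract_text(mail))` could stand in for the `is_multipart()` branch. HTML-only messages would still come back as markup and need a separate cleaning step.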


In [251]:
# Break up date column
df['Time'] = df['Date'].apply(lambda x: x.time())
df['Day'] = df['Date'].apply(lambda x: x.weekday()).map({0:'Mon', 1:'Tues', 2:'Weds', 
                                                         3:'Thurs', 4:'Fri', 5:'Sat', 6:'Sun'})
df['Hour'] = df['Time'].apply(lambda x: x.hour)
df = df[['Date', 'Time', 'Day', 'Hour', 'Subject', 'Text']]

In [255]:
# Preview data
df.head()

|   | Date | Time | Day | Hour | Subject | Text |
|---|---|---|---|---|---|---|
| 0 | 2018-07-12 12:31:08 | 12:31:08 | Thurs | 12 | Your IBM Application | `Ref: 110127BR - 2018 Data Scientist Internship...` |
| 1 | 2018-06-12 16:30:28 | 16:30:28 | Tues | 16 | Thank you from Workday! | `<!doctype html><html xmlns:o=3D"urn:schemas-mi...` |
| 2 | 2018-05-17 08:43:38 | 08:43:38 | Thurs | 8 | An Update Regarding Your Visa Job Application | `\r\nDear Conor,\r\nThank you for giving us the...` |
| 3 | 2018-05-01 15:21:05 | 15:21:05 | Tues | 15 | Thank you for your interest in Zynga for Inter... | `<html><head>\r\n<meta http-equiv=3DContent-Typ...` |
| 4 | 2018-04-26 14:49:02 | 14:49:02 | Thurs | 14 | Your Application with Cambia Health Solutions | `Dear Conor,\r\n=C2=A0\r\nThank you for the int...` |


In [257]:
# Data overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 6 columns):
Date       86 non-null datetime64[ns]
Time       86 non-null object
Day        86 non-null object
Hour       86 non-null int64
Subject    86 non-null object
Text       86 non-null object
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 4.1+ KB


In [258]:
# Export to csv
df.to_csv('rejections.csv', index=False)

Note that if you want to replicate this data for yourself, there are a couple of nuances to deal with, including two-factor authentication and IMAP access. I found that the easiest solution is to generate an app-specific password (you need two-factor authentication enabled to do this) and then use that password when logging in. That's all for now! In my next notebook, I'll dive deeper into the data itself and try to answer some interesting questions.
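
For anyone trying this, the login step with an app-specific password can be sketched roughly as follows. `connect_gmail` is a hypothetical helper, and the environment variable names in the usage comment are assumptions of mine. The notebook itself prompts for the password with `getpass`, which works just as well.

```python
import imaplib


def connect_gmail(user, app_password, host="imap.gmail.com"):
    """Open an IMAP-over-SSL connection and log in with an app-specific password."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, app_password)
    return conn


# Example usage (needs real credentials, so it is left as a comment):
#
#   import os
#   conn = connect_gmail(os.environ["GMAIL_USER"], os.environ["GMAIL_APP_PASSWORD"])
#   try:
#       print(conn.select('"Job Rejection"', readonly=True))
#   finally:
#       conn.logout()
```

Either way, the app-specific password takes the place of the normal account password in `login`, and the rest of the notebook runs unchanged.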