<a href="https://colab.research.google.com/github/gabrielborja/python_data_analysis/blob/main/email_analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Working with mbox files

Message format <hr>
The basic Internet message format used for email is defined by RFC 5322.
An Internet email message consists of two sections, the header and the body, which together form the message content.
The header is structured into fields such as From, To, CC, Subject, and Date, plus other information about the email.
The body contains the message as unstructured text, sometimes ending with a signature block. A blank line separates the header from the body.
More information can be found in the [Wikipedia article on email header fields](https://en.wikipedia.org/wiki/Email#Header_fields).
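
To make the header/body split concrete, here is a minimal sketch using Python's standard `email` package. The message text, the addresses and the variable names are made up for illustration and are not taken from the dataset.

```python
from email import message_from_string

# A tiny, invented message: header fields, one blank line, then the body.
RAW_MESSAGE = (
    "From: Example Sender <sender@example.com>\n"
    "To: recipient@example.com\n"
    "Subject: Hello\n"
    "Date: Mon, 07 Jun 2021 18:16:00 +0000\n"
    "\n"
    "This is the body.\n"
)

msg = message_from_string(RAW_MESSAGE)

# Header fields are available by name; the body is the payload.
header_names = msg.keys()
subject = msg["Subject"]
body = msg.get_payload()

print(header_names)
print(subject)
print(repr(body))
```

Everything before the first blank line is parsed as header fields, and the rest is returned as the payload.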

## Uploading initial packages and data

In [1]:
#Importing preliminary packages
import numpy as np
import pandas as pd

In [2]:
#Remove any previously uploaded copy of the JSON file (-f avoids an error when none exists)
!rm -f email_list.json

In [3]:
#Uploading file from local drive
from google.colab import files
uploaded = files.upload()

Saving email_list.json to email_list.json


In [None]:
#Storing json in a Pandas Dataframe
import io
df = pd.read_json(io.BytesIO(uploaded['email_list.json']), encoding='utf-8')
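
The notebook starts from an already exported `email_list.json` and does not show how that file was produced. As a rough sketch, assuming the mail was first saved as an mbox file, the standard-library `mailbox` module could collect the same three header fields. `mbox_to_records`, the demo message and the file name `demo.mbox` are hypothetical names used only for this example.

```python
import json
import mailbox
import os
import tempfile
from email.message import Message

FIELDS = ("Date", "From", "Subject")

def mbox_to_records(path):
    """Collect the Date, From and Subject headers of every message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [{field: message[field] for field in FIELDS} for message in box]
    finally:
        box.close()

# Build a throwaway mbox with one invented message so the sketch runs without real mail.
demo = Message()
demo["Date"] = "Mon, 07 Jun 2021 18:16:00 +0000"
demo["From"] = '"Example Sender" <sender@example.com>'
demo["Subject"] = "Hello"
demo.set_payload("This is the body.\n")

with tempfile.TemporaryDirectory() as tmp_dir:
    mbox_path = os.path.join(tmp_dir, "demo.mbox")
    box = mailbox.mbox(mbox_path)  # creates the file because it does not exist yet
    box.add(demo)
    box.close()
    records = mbox_to_records(mbox_path)

# A list of records like this is a JSON layout that pd.read_json can load.
records_json = json.dumps(records)
print(records_json)
```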

In [5]:
#Checking the dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     1480 non-null   object
 1   From     1480 non-null   object
 2   Subject  1480 non-null   object
dtypes: object(3)
memory usage: 34.8+ KB


##Initial Data cleaning

###Converting to datetime

In [6]:
#Create a copy of the dataframe to perform data cleaning and manipulation
emails = df.copy()

In [7]:
#Replace unknown timezone abbreviations (i.e. CDT, CST) flagged as warnings by read_json (UnknownTimezoneWarning)
emails['Date'] = emails['Date'].replace(to_replace=r'CDT|CST', value='UTC', regex=True)
#df[df['Date'].str.contains('CDT|CST', regex=True)] #==>Check how many rows contain unknown timezone

In [8]:
#Replace invalid timezone stamps that will result in NaT values
emails['Date'] = emails['Date'].replace(to_replace=r'\(-05\)|\(GMT-05:00\)', value='(ECT)', regex=True)
#emails[emails['Date'].str.contains(r'\(-05\)|\(GMT-05:00\)', regex=True)] #==> Check how many rows contain invalid timezone

In [9]:
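#Parsing strings to datetime objects and converting to the Oslo time zone
#Note: the format string does not match the RFC 5322 style dates in this data; the result relies on infer_datetime_format=True letting the parser fall back to inference, a behavior that newer pandas releases may not reproduce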
emails = emails.assign(Event = emails['Date'])
emails['Event'] = pd.to_datetime(emails['Date'], errors='coerce', utc=True, format='%Y-%m-%d %H:%M:%S', infer_datetime_format=True).dt.tz_convert('Europe/Oslo')
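
As an alternative sketch for the date parsing, the standard library already understands RFC 5322 date strings, including a trailing comment such as `(ECT)`. The helper name `parse_email_date` is an assumption made for this example and is not part of the notebook.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_email_date(value):
    """Return an aware UTC datetime for an RFC 5322 date string, or None if it cannot be parsed."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError, for invalid input.
        return None
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime whose clock time is already UTC.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(parse_email_date("Mon, 7 Jun 2021 13:57:53 -0500 (ECT)"))
print(parse_email_date("Mon, 07 Jun 2021 18:16:00 +0000"))
print(parse_email_date("not a date"))
```

In the notebook, a helper like this could be mapped over `emails['Date']` before handing the result to `pd.to_datetime(..., utc=True)` and `.dt.tz_convert('Europe/Oslo')`.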

In [10]:
#Sorting the 'Event' column
emails = emails.sort_values(by=['Event']).reset_index(drop=True)
emails.tail()

Unnamed: 0,Date,From,Subject,Event
1475,"Mon, 07 Jun 2021 18:16:00 +0000","""Humble Bundle"" <contact@mailer.humblebundle.com>",Get Data Smart with Humble Bundle and Mercury ...,2021-06-07 20:16:00+02:00
1476,"Mon, 7 Jun 2021 13:57:53 -0500 (ECT)",banco@pichincha.com,NOTIFICACION BANCO PICHINCHA,2021-06-07 20:57:53+02:00
1477,"Mon, 07 Jun 2021 14:03:55 -0600","""Comunicados Banco Pichincha"" <comunicados@pic...",=?UTF-8?Q?Informaci=C3=B3n_de_COSEDE?=,2021-06-07 22:03:55+02:00
1478,"Mon, 07 Jun 2021 22:17:08 +0000",Diners Club del Ecuador <gestion@comunicacione...,Accede al reembolso de tus compras,2021-06-08 00:17:08+02:00
1479,"Mon, 7 Jun 2021 23:13:14 +0000 (UTC)",LinkedIn Job Recommendations <jobs-listings@li...,Harnham is looking for: Data Scientist.,2021-06-08 01:13:14+02:00


In [11]:
#Check that the new dataframe info has all valid datetime
emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype                      
---  ------   --------------  -----                      
 0   Date     1480 non-null   object                     
 1   From     1480 non-null   object                     
 2   Subject  1480 non-null   object                     
 3   Event    1480 non-null   datetime64[ns, Europe/Oslo]
dtypes: datetime64[ns, Europe/Oslo](1), object(3)
memory usage: 46.4+ KB


In [None]:
#Checking the presence of () in date
par_date = [d for d in emails['Date'] if '(' in d]
par_date[:10]

['Tue, 25 May 2021 17:14:59 +0000 (UTC)',
 'Thu, 03 Jun 2021 05:48:41 +0000 (UTC)',
 'Sun, 6 Jun 2021 11:41:03 -0500 (ECT)',
 'Sun, 6 Jun 2021 09:02:17 -0500 (ECT)',
 'Sun, 6 Jun 2021 09:01:17 -0500 (ECT)',
 'Sun, 6 Jun 2021 08:57:38 -0500 (ECT)',
 'Fri, 21 May 2021 05:52:58 +0000 (UTC)',
 'Thu, 20 May 2021 20:12:50 +0000 (UTC)',
 'Thu, 03 Jun 2021 17:16:37 +0000 (UTC)',
 'Fri, 7 May 2021 23:51:12 +0000 (UTC)']

In [13]:
#Checking for invalid timestamps (NaT) in 'Event' column: summing over the matching rows gives all zeros when there are none
emails[emails['Event'].isnull()].sum()

Date       0.0
From       0.0
Subject    0.0
Event      0.0
dtype: float64

###Transforming ASCII quoted-printable to UTF-8

Encode and Decode MIME <hr>
Email data often contains characters outside the regular ASCII range, for example when a message is written in a language other than English. Python provides modules for handling such text based on MIME (Multipurpose Internet Mail Extensions).

Quoted-printable in headers <hr>
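The character sequence =?UTF-8?Q? marks the start of a MIME encoded-word that uses the Q encoding, which is closely modeled on quoted-printable. Encoded-words are legitimately used to carry non-ASCII characters in internet headers, since headers can contain only ASCII (RFC 2047, the successor to RFC 1342). The Q encoding is particularly useful when the content is mostly ASCII, so for example Chris España could be encoded as =?UTF-8?Q?Chris_Espa=C3=B1a?= (the space becomes an underscore and ñ becomes its two UTF-8 bytes).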
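
For comparison with the `quopri` approach, here is a minimal sketch that decodes encoded-words with the standard `email.header` module, which handles the charset and both the Q and B encodings. The function name `decode_mime_header` is an assumption made for this example.

```python
from email.header import decode_header, make_header

def decode_mime_header(value):
    """Decode any RFC 2047 encoded-words in a header value to a plain str."""
    return str(make_header(decode_header(value)))

print(decode_mime_header("=?UTF-8?Q?Chris_Espa=C3=B1a?="))
print(decode_mime_header("=?iso-8859-1?Q?Chris_Espa=F1a?="))
print(decode_mime_header("=?UTF-8?Q?Informaci=C3=B3n_de_COSEDE?="))
print(decode_mime_header("NOTIFICACION BANCO PICHINCHA"))
```

Values without encoded-words pass through unchanged, so the function could be applied to a whole column.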



In [14]:
#Import encode and decode MIME quoted-printable package
import quopri

In [15]:
#Create function to transform 'From' and 'Subject' columns from ASCII (quoted-printable) to UTF-8

def parse_quoted_printable(df):
  """Parse from ASCII (quoted-printable) to UTF-8"""

  #Decode one string, returning it unchanged when it cannot be decoded
  def decode_quoted(x):
    try:
      return quopri.decodestring(x).decode(encoding='utf-8')
    except ValueError: #==> Covers UnicodeDecodeError and non-ASCII input rejected by quopri
      return x

  #Return dataframe with both columns decoded to UTF-8
  return df.assign(From = [decode_quoted(i) for i in df['From']],
                   Subject = [decode_quoted(i) for i in df['Subject']])

In [16]:
#Apply function to parse from ASCII (quoted-printable) to UTF-8
emails = parse_quoted_printable(emails)

In [None]:
#Check dataframe after ASCII text transformation
emails.head(10)

In [18]:
#Checking which encoded-word prefixes (e.g. '=?UTF-8?Q?') remain at the start of 'From' values
my_utf = emails[emails['From'].str.startswith('=?')]['From'].to_list()
my_utf = [i[:10] for i in my_utf]
pd.Series(my_utf).unique()

array(['=?utf-8?Q?', '=?UTF-8?B?', '=?UTF-8?Q?', '=?iso-8859',
       '=?utf-8?B?'], dtype=object)

In [19]:
#Checking the unique first characters in 'From' column
begins = emails['From'].to_list()
begins = [i[:1] for i in begins]
pd.Series(begins).unique()

array(['b', 's', '=', 'S', 'R', '"', 'T', 'G', 'D', 'n', 'I', 'E', 'K',
       'A', 'C', '<', 'W', 'N', 'X', 'B', '', 'L', 'P', 'l', 'U', 'f',
       'i', 'F', 'Y', 'M', 'H', 'V', 'J', 'O', 'g', 'd', 'v', 'Q', 'z',
       'a', 'j', 'h'], dtype=object)

In [20]:
#Use regex to remove prefixes such as =?utf-8?Q? =?UTF-8?Q? =?UTF-8?B? =?utf-8?B? =?iso-8859
emails['From'] = emails['From'].replace(to_replace=r'=\?\w*-\w[\?|\w]\w[\?|\w]', value='', regex=True)
emails['Subject'] = emails['Subject'].replace(to_replace=r'=\?\w*-\w[\?|\w]\w[\?|\w]', value='', regex=True)
#emails[emails['From'].str.contains(r'=\?\w*-\w[\?|\w]\w[\?|\w]', regex=True)]

In [21]:
#Use regex to remove the closing ?= delimiter (and any = signs directly before it)
emails['From'] = emails['From'].replace(to_replace=r'=*\?=', value='', regex=True)
emails['Subject'] = emails['Subject'].replace(to_replace=r'=*\?=', value='', regex=True)
#emails[emails['Subject'].str.contains(r'=*\?=', regex=True)]
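
To illustrate what the prefix pattern above actually matches, the sketch below compares it with a stricter alternative on two invented strings. `STRICT_PREFIX` and the sample values are assumptions made for this example, not part of the original notebook.

```python
import re

# Prefix pattern from the cell above, written as a raw string.
# Note: inside [...] the | is a literal character, not an alternation.
NOTEBOOK_PREFIX = re.compile(r'=\?\w*-\w[\?|\w]\w[\?|\w]')

# Hypothetical stricter pattern: "=?", the charset, "?", Q or B, "?".
STRICT_PREFIX = re.compile(r'=\?[\w-]+\?[QqBb]\?')

samples = ['=?UTF-8?Q?Hola?=', '=?iso-8859-1?Q?Hola?=']

notebook_result = [NOTEBOOK_PREFIX.sub('', s) for s in samples]
strict_result = [STRICT_PREFIX.sub('', s) for s in samples]

print(notebook_result)  # ['Hola?=', '-1?Q?Hola?=']
print(strict_result)    # ['Hola?=', 'Hola?=']
```

The notebook pattern stops after `=?iso-8859`, so part of the ISO-8859-1 prefix is left behind. Note also that stripping the delimiters does not decode the payload, so Base64 (B) encoded-words would still be unreadable afterwards.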

In [None]:
emails

In [None]:
emails['From'][4]

'"Duolingo" <hello@duolingo.com>'

##Data Manipulation

In [23]:
#Import regex
import re

In [31]:
#Show the IPython help text for the re module
re?

In [None]:
#Checking the new dataframe head
emails.head()

In [None]:
#Count the 'From' values that do not contain an angle-bracket address
emails[~emails['From'].str.contains('<')]['From'].value_counts()

In [155]:
#Create functions to parse sender, local-part and domain

def match_local(x):
  """Match local-part from email"""
  regex1 = re.compile(r'(\w+|\w+[.-]\w+|\w+[.-]\w+[.-]\w+)(?:@)')

  try:
    m = re.findall(pattern=regex1, string=x)
    return m[0] #==> Return first group match
  except (IndexError, TypeError): #==> No match, or x is not a string
    return "No match"

def match_sender(x):
  """Match name of sender from email"""
  regex2 = re.compile(r'(.*?)(?:<)')

  try:
    m = re.findall(pattern=regex2, string=x)
    return m[0].strip().replace('"', '') #==> Return first group match and strip space and "
  except (IndexError, TypeError): #==> No match, or x is not a string
    return "No Sender"

def match_domain(x):
  """Match name of domain from email"""
  regex3 = re.compile(r'(?:@)(.*)(?:\.)')

  try:
    m = re.findall(pattern=regex3, string=x)
    return m[0].strip().replace('"','').replace("?",'').replace(">",'') #==> Return first group match and strip space and invalid char
  except (IndexError, TypeError): #==> No match, or x is not a string
    return "No Domain"

In [26]:
#Define a regex function to match sender, local-part and domain
def regex_matcher(x, reg_ex):
  """Match regular expression argument"""
  try:
    m = re.findall(pattern=reg_ex, string=x)
    return m[0].strip().replace('"','').replace("?",'').replace(">",'') #==> Return first group match, strip space and invalid char
  except (IndexError, TypeError): #==> No match, or x is not a string
    return None

In [None]:
#Compile regular expressions and apply regex_matcher
regex1 = re.compile(r'(.*?)(?:<)') #==> Match sender
regex2 = re.compile(r'(\w+|\w+[\.|-]\w+|\w+[\.|-]\w+[\.|-]\w+)(?:@)') #==> Match local-part
regex3 = re.compile(r'(?:@)(.*)(?:>?)') #==> Match domain

emails = emails.assign(Sender = emails['From'].apply(regex_matcher, args=(regex1,)),
                       Local_part = emails['From'].apply(regex_matcher, args=(regex2,)),
                       Domain = emails['From'].apply(regex_matcher, args=(regex3,)))
emails.head(10)
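
As a cross-check for the regular expressions, the standard library's `email.utils.parseaddr` splits a From value into display name and address. This is a minimal sketch; `split_from_header` is a hypothetical helper name, and the dictionary keys simply mirror the column names used above.

```python
from email.utils import parseaddr

def split_from_header(value):
    """Split a From header value into sender name, local-part and domain."""
    name, address = parseaddr(value)
    if "@" in address:
        # Split on the last @ so the domain is everything after it.
        local_part, _, domain = address.rpartition("@")
    else:
        local_part, domain = None, None
    return {"Sender": name or None, "Local_part": local_part, "Domain": domain}

print(split_from_header('"Duolingo" <hello@duolingo.com>'))
print(split_from_header('banco@pichincha.com'))
```

Unlike the domain regex, this keeps only the address inside the angle brackets, so no trailing `>` has to be stripped afterwards.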

In [None]:
#senders = emails['Sender'].value_counts().reset_index()['index'].to_list()
#senders
emails['Sender'].value_counts()

In [None]:
emails['Local_part'].value_counts()

In [None]:
emails['Domain'].value_counts()

##Data Visualization

In [None]:
#Import visualization packages
import matplotlib.pyplot as plt
import seaborn as sns