<a href="https://colab.research.google.com/github/gabrielborja/python_data_analysis/blob/main/email_analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Working with mbox files

Message format <hr>
The basic Internet message format used for email is defined by RFC 5322.
Internet email messages consist of two sections, 'header' and 'body'. These are known as 'content'.
The header is structured into fields such as From, To, CC, Subject, Date, and other information about the email.
The body contains the message, as unstructured text, sometimes containing a signature block at the end. The header is separated from the body by a blank line.
More information can be found [here](https://en.wikipedia.org/wiki/Email#Header_fields)

## Uploading initial packages and data

In [2]:
#Importing preliminary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Remove previous versions of the uploaded excel file
!rm email_list.json

In [4]:
#Uploading file from local drive
from google.colab import files
uploaded = files.upload()

Saving email_list.json to email_list.json


In [5]:
#Storing json in a Pandas Dataframe
import io
df = pd.read_json(io.BytesIO(uploaded['email_list.json']), encoding='utf-8')



In [6]:
#Checking the dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     1480 non-null   object
 1   From     1480 non-null   object
 2   Subject  1480 non-null   object
dtypes: object(3)
memory usage: 34.8+ KB


##Data cleaning and manipulation

In [7]:
#Create a copy of the dataframe to perform data cleaning and manipulation
emails = df.copy()

In [8]:
#Replace Unknown Timezones: i.e. CDT, CST flaged as warnings by read_json (UnknownTimezoneWarning)
#df[df['Date'].str.contains('CDT|CST', regex=True)] #==>Check how many rows contain unknown timezone
emails['Date'] = emails['Date'].replace(to_replace=r'CDT|CST', value='UTC', regex=True)

In [9]:
#Parsing strings to datetime object and converting to Oslo time zone
emails = emails.assign(Event = emails['Date'])
emails['Event'] = pd.to_datetime(emails['Date'], errors='coerce', utc=True, format='%Y-%m-%d %H:%M:%S', infer_datetime_format=True).dt.tz_convert('Europe/Oslo')

In [None]:
#Checking the new dataframe head
emails.head()

Encode and Decode MIME <hr>
Many times we need to deal with data which not always has the regular ASCII characters. For example, an email in a different language other than English. Python has mechanism to deal with such characters by using MIME (Multipurpose Internet Mail Extensions) based module.

Quoted-printable in headers <hr>
The character sequence =?UTF-8?Q? is called quoted-printable, and is legitimately used to encode UTF-8 characters in internet headers, since they can contain only ASCII (rfc1342). Quoted-printable is particularly useful for when the content is mostly ASCII, so for example Chris España could be encoded as =?UTF-8?Q?Chris Espa=F1a?=



In [11]:
#Import encode and decode MIME quoted-printable package
import quopri

In [19]:
#Crete function to Transform From and Subject columns from ASCII (quoted-printable) to UTF-8

def parse_quoted_printable(df):
  """Parse from ASCII (quoted-printable) to UTF-8"""
  
  #Split data to lists
  l1_from = df['From'].copy()
  l2_subj = df['Subject'].copy()

  #Create function to decode string
  def decode_quoted(x):
    try:
      return quopri.decodestring(x).decode(encoding='utf-8')
    except UnicodeDecodeError:
      return x

  #Decode each list with quopri
  l1_from = [decode_quoted(i) for i in l1_from]
  l2_subj = [decode_quoted(i) for i in l2_subj]

  #Return dataframe with converted columns to UTF-8
  df = df.assign(From = l1_from,
                 Subject = l2_subj)

  return df
#emails = emails.assign(From = quopri.decodestring(emails['From']).decode(encoding='utf-8'),
#                       Subject = quopri.decodestring(emails['Subject']).decode(encoding='utf-8'))

In [20]:
#Apply function to parse from ASCII (quoted-printable) to UTF-8
emails = parse_quoted_printable(emails)

In [None]:
#Check dataframe after ASCII text transformation
emails.head(10)

In [54]:
#Checking how many values start with text '=?UTF-8?Q?' in 'From' column
my_utf = emails[emails['From'].str.startswith('=?')]['From'].to_list()
my_utf = [i[:10] for i in my_utf]
pd.Series(my_utf).unique()

array(['=?utf-8?Q?', '=?UTF-8?Q?', '=?UTF-8?B?', '=?utf-8?B?',
       '=?iso-8859'], dtype=object)

In [55]:
#Checking how many unique first character are in 'From' column
begins = emails['From'].to_list()
begins = [i[:1] for i in begins]
pd.Series(begins).unique()

array(['=', 'G', '"', 'h', 'Y', 'R', 'A', 'B', 'I', 'n', 'L', 'F', 'b',
       'D', 'M', 'T', 'd', 'E', 'S', '<', 'H', 'C', 'Q', 'J', 'N', 'j',
       's', 'K', 'U', 'g', 'i', 'O', 'z', '', 'X', 'P', 'V', 'v', 'f',
       'W', 'l', 'a'], dtype=object)

In [59]:
#Use regex to replace text starting with =?utf-8?Q? =?UTF-8?Q? =?UTF-8?B? =?utf-8?B? =?iso-8859
emails['From'] = emails['From'].replace(to_replace='=\?\w*-\w[\?|\w]\w[\?|\w]', value='', regex=True)
emails['Subject'] = emails['Subject'].replace(to_replace='=\?\w*-\w[\?|\w]\w[\?|\w]', value='', regex=True)
#emails[emails['From'].str.contains('=\?\w*-\w[\?|\w]\w[\?|\w]', regex=True)]

In [72]:
#Use regex to replace text containing =?=
emails['From'] = emails['From'].replace(to_replace='=*\?=', value='', regex=True)
emails['Subject'] = emails['Subject'].replace(to_replace='=*\?=', value='', regex=True)
#emails[emails['Subject'].str.contains('=*\?=', regex=True)]

In [None]:
emails

In [65]:
emails['From'][4]

'"Duolingo" <hello@duolingo.com>'

In [None]:
#Checking dataframe info after formatting dates
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype                      
---  ------   --------------  -----                      
 0   Date     1480 non-null   object                     
 1   From     1480 non-null   object                     
 2   Subject  1480 non-null   object                     
 3   Event    1469 non-null   datetime64[ns, Europe/Oslo]
dtypes: datetime64[ns, Europe/Oslo](1), object(3)
memory usage: 46.4+ KB


In [None]:
#Checking invalid timestamps
df[df['Event'].isnull()]

Unnamed: 0,Date,From,Subject,Event
240,"Tue, 18 May 2021 09:41:04 -0500 (GMT-05:00)",servicios@discover.com.ec,Agradecemos tu pago,NaT
351,"Wed, 21 Oct 2020 05:00:41 -0500 (-05)",no-reply@biodimed.com,=?UTF-8?Q?Feliz_Cumplea=C3=B1os_EDUARDO_BORJA?=,NaT
384,"Wed, 14 Oct 2020 13:09:24 -0500 (GMT-05:00)",servicios@dinersclub.com.ec,Envio Clave Temporal,NaT
967,"Wed, 14 Oct 2020 13:10:22 -0500 (GMT-05:00)",servicios@interdin.com.ec,=?ISO-8859-1?Q?Recuperaci=F3n_de_contrase=F1a?=,NaT
1050,"Fri, 20 Mar 2020 20:05:00 -0500 (-05)",no-reply@biodimed.com,=?UTF-8?Q?BOLET=C3=8DN_2._Preguntas_y_respues?...,NaT
1062,"Wed, 18 Dec 2019 07:34:46 -0500 (GMT-05:00)",servicios@interdin.com.ec,Agradecemos su pago,NaT
1254,"Sat, 23 Nov 2019 08:01:25 -0500 (GMT-05:00)",servicios@discover.com.ec,Agradecemos tu pago,NaT
1290,"Sat, 21 Mar 2020 18:00:48 -0500 (-05)",no-reply@biodimed.com,=?UTF-8?Q?BOLET=C3=8DN_3._Preguntas_y_respuest...,NaT
1338,"Tue, 26 May 2020 08:00:11 -0500 (GMT-05:00)",servicios@discover.com.ec,Agradecemos tu pago,NaT
1443,"Thu, 10 Oct 2019 04:55:06 -0500 (GMT-05:00)",servicios@titanium.com.ec,=?ISO-8859-1?Q?Notificaci=F3n_de_Consumos?=,NaT


In [None]:
wrong_date = [d for d in df['Date'] if '(' in d]
wrong_date[:11]

['Tue, 25 May 2021 17:14:59 +0000 (UTC)',
 'Thu, 03 Jun 2021 05:48:41 +0000 (UTC)',
 'Sun, 6 Jun 2021 11:41:03 -0500 (ECT)',
 'Sun, 6 Jun 2021 09:02:17 -0500 (ECT)',
 'Sun, 6 Jun 2021 09:01:17 -0500 (ECT)',
 'Sun, 6 Jun 2021 08:57:38 -0500 (ECT)',
 'Fri, 21 May 2021 05:52:58 +0000 (UTC)',
 'Thu, 20 May 2021 20:12:50 +0000 (UTC)',
 'Thu, 03 Jun 2021 17:16:37 +0000 (UTC)',
 'Fri, 7 May 2021 23:51:12 +0000 (UTC)',
 'Wed, 5 May 2021 23:19:08 +0000 (UTC)']