<a href="https://colab.research.google.com/github/gabrielborja/python_data_analysis/blob/main/email_analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with mbox files

Message format <hr>
The basic Internet message format used for email is defined by RFC 5322.
An Internet email message consists of two sections, the header and the body, which together are known as its content.
The header is structured into fields such as From, To, CC, Subject, Date, and other information about the email.
The body contains the message, as unstructured text, sometimes containing a signature block at the end. The header is separated from the body by a blank line.
More information is available in the Wikipedia article on [email header fields](https://en.wikipedia.org/wiki/Email#Header_fields).
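As a small illustration of the header and body split described above, the sketch below parses a raw message with Python's standard `email` package. The message text and the addresses in it are made up for the example; they do not come from the dataset used in this notebook.

```python
from email import message_from_string

# Hypothetical raw message: header fields, one blank line, then the body.
raw_message = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Subject: Lunch on Friday\n"
    "Date: Mon, 07 Jun 2021 06:55:46 +0000\n"
    "\n"
    "Hi Bob,\n"
    "Are you free for lunch on Friday?\n"
)

msg = message_from_string(raw_message)

# Header fields are looked up by name; the lookup is case-insensitive.
sender = msg["From"]
subject = msg["subject"]
date_header = msg["Date"]

# Everything after the first blank line is the body.
body = msg.get_payload()

print(sender)
print(subject)
print(date_header)
print(body)
```

For a message that is not multipart, `get_payload()` returns the body as a single string.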

## Importing packages and uploading data

In [1]:
#Importing necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mailbox

In [2]:
#Remove previous versions of the uploaded JSON file (-f avoids an error if the file does not exist)
!rm -f email_list.json

In [3]:
#Uploading file from local drive
from google.colab import files
uploaded = files.upload()

Saving email_list.json to email_list.json


In [None]:
#Storing json in a Pandas Dataframe
import io
df = pd.read_json(io.BytesIO(uploaded['email_list.json']))
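The notebook does not show how `email_list.json` was produced. One possible route, sketched here under that assumption, is to read an mbox file with the standard `mailbox` module and keep the `Date`, `From` and `Subject` headers of each message. The helper name `mbox_to_records` and the throwaway demo mailbox are hypothetical and exist only so the sketch runs without a real mail export.

```python
import json
import mailbox
import os
import tempfile
from email.message import Message


def mbox_to_records(mbox_path):
    """Return one dict per message with the Date, From and Subject headers."""
    box = mailbox.mbox(mbox_path)
    try:
        return [
            {
                "Date": str(message.get("Date", "")),
                "From": str(message.get("From", "")),
                "Subject": str(message.get("Subject", "")),
            }
            for message in box
        ]
    finally:
        box.close()


# Build a tiny throwaway mbox so the sketch runs without a real export.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
demo_box = mailbox.mbox(demo_path)
demo_msg = Message()
demo_msg["Date"] = "Mon, 07 Jun 2021 06:55:46 +0000"
demo_msg["From"] = "Alice <alice@example.com>"
demo_msg["Subject"] = "Hello"
demo_msg.set_payload("Body text\n")
demo_box.add(demo_msg)
demo_box.flush()
demo_box.close()

records = mbox_to_records(demo_path)
json_text = json.dumps(records, indent=2)
print(json_text)
```

A list of records in this shape loads into a three-column frame with `pd.read_json`, which matches the columns reported by `df.info()` in the next section.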

## Data cleaning

In [5]:
#Checking the dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     1480 non-null   object
 1   From     1480 non-null   object
 2   Subject  1480 non-null   object
dtypes: object(3)
memory usage: 34.8+ KB


In [46]:
#Showing the IPython help for Series.replace
df.Date.replace?

In [6]:
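#Replace the timezone abbreviations CDT and CST with UTC so the strings can be parsed
#Note: CDT is UTC-5 and CST is UTC-6, so a string that carries only the abbreviation is shifted by 5 or 6 hours
#df[df['Date'].str.contains('CDT|CST', regex=True)]
df['Date'] = df['Date'].replace(to_replace=r'CDT|CST', value='UTC', regex=True)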
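An alternative that keeps the original clock offset is to swap each abbreviation for its numeric offset instead of relabeling it as UTC. The sketch below assumes that CDT and CST mean US Central time (CST is also used for other zones, such as China Standard Time), and the sample date strings are hypothetical. The names `TZ_OFFSETS` and `abbreviation_to_offset` are illustrative only.

```python
import re

# Assumed meaning of the abbreviations: US Central daylight and standard time.
TZ_OFFSETS = {"CDT": "-0500", "CST": "-0600"}

# Match any known abbreviation as a whole word.
TZ_PATTERN = re.compile(r"\b(" + "|".join(TZ_OFFSETS) + r")\b")


def abbreviation_to_offset(value):
    """Replace a known timezone abbreviation with its numeric UTC offset."""
    return TZ_PATTERN.sub(lambda match: TZ_OFFSETS[match.group(1)], value)


converted = abbreviation_to_offset("Mon, 7 Jun 2021 10:15:00 CDT")
unchanged = abbreviation_to_offset("Mon, 7 Jun 2021 06:55:46 +0000")
print(converted)
print(unchanged)
```

On the DataFrame, the same function could be applied with `df['Date'].map(abbreviation_to_offset)`.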

In [7]:
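#Parsing strings to datetime objects and converting to the Oslo time zone
#Strings that cannot be parsed become NaT because errors='coerce'
#Note: the Date strings are not in the '%Y-%m-%d %H:%M:%S' layout, and infer_datetime_format is deprecated since pandas 2.0
df['Event'] = pd.to_datetime(df['Date'], errors='coerce', utc=True, format='%Y-%m-%d %H:%M:%S', infer_datetime_format=True).dt.tz_convert('Europe/Oslo')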

In [8]:
#Checking the dataframe head
df.head()

Unnamed: 0,Date,From,Subject,Event
0,6 Jun 2021 11:12:43 -0500,=?utf-8?Q?Produbanco_enl=C3=ADnea?= <bancaenli...,=?utf-8?B?Tm90aWZpY2FjacOzbiBUcmFuc2ZlcmVuY2lh...,2021-06-06 18:12:43+02:00
1,"Sat, 05 Jun 2021 12:57:48 -0700",Google Maps Timeline <noreply-maps-timeline@go...,=?UTF-8?Q?=F0=9F=8C=8D_Gabriel=2C_your_May_upd...,2021-06-05 21:57:48+02:00
2,"Mon, 7 Jun 2021 06:55:46 +0000",=?UTF-8?Q?Elkj=C3=B8p?= <elkjop@email.elkjop.no>,=?UTF-8?B?RsOlIG1ha3MgdXQgYXYgc29tbWVyZW5zIHNw...,2021-06-07 08:55:46+02:00
3,"Mon, 07 Jun 2021 14:03:55 -0600","""Comunicados Banco Pichincha"" <comunicados@pic...",=?UTF-8?Q?Informaci=C3=B3n_de_COSEDE?=,2021-06-07 22:03:55+02:00
4,"Sat, 29 May 2021 07:02:53 +0000","""Duolingo"" <hello@duolingo.com>",=?utf-8?q?=F0=9F=93=9D_Your_weekly_progress_re...,2021-05-29 09:02:53+02:00
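Several `From` and `Subject` values in the output above are RFC 2047 encoded words (the `=?charset?Q?...?=` and `=?charset?B?...?=` forms). The sketch below shows one way to turn them into readable text with the standard `email.header` module. The helper name `decode_mime_words` is hypothetical, and the notebook itself does not perform this step.

```python
from email.header import decode_header, make_header


def decode_mime_words(value):
    """Decode RFC 2047 encoded words; plain text passes through unchanged."""
    return str(make_header(decode_header(value)))


# Complete values copied from rows 3 and 2 of the table above.
decoded_subject = decode_mime_words("=?UTF-8?Q?Informaci=C3=B3n_de_COSEDE?=")
decoded_sender = decode_mime_words("=?UTF-8?Q?Elkj=C3=B8p?= <elkjop@email.elkjop.no>")
plain_subject = decode_mime_words("Agradecemos tu pago")

print(decoded_subject)  # Información de COSEDE
print(decoded_sender)
print(plain_subject)    # Agradecemos tu pago
```

The helper could be applied to a whole column with `df['Subject'].map(decode_mime_words)`. Values that are truncated in the display above (those ending in `...`) are only shortened by the table rendering, so the full strings in the DataFrame are the ones to decode.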


In [9]:
#Checking dataframe info after formatting dates
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype                      
---  ------   --------------  -----                      
 0   Date     1480 non-null   object                     
 1   From     1480 non-null   object                     
 2   Subject  1480 non-null   object                     
 3   Event    1469 non-null   datetime64[ns, Europe/Oslo]
dtypes: datetime64[ns, Europe/Oslo](1), object(3)
memory usage: 46.4+ KB


In [10]:
#Checking invalid timestamps
df[df['Event'].isnull()]

Unnamed: 0,Date,From,Subject,Event
240,"Tue, 18 May 2021 09:41:04 -0500 (GMT-05:00)",servicios@discover.com.ec,Agradecemos tu pago,NaT
351,"Wed, 21 Oct 2020 05:00:41 -0500 (-05)",no-reply@biodimed.com,=?UTF-8?Q?Feliz_Cumplea=C3=B1os_EDUARDO_BORJA?=,NaT
384,"Wed, 14 Oct 2020 13:09:24 -0500 (GMT-05:00)",servicios@dinersclub.com.ec,Envio Clave Temporal,NaT
967,"Wed, 14 Oct 2020 13:10:22 -0500 (GMT-05:00)",servicios@interdin.com.ec,=?ISO-8859-1?Q?Recuperaci=F3n_de_contrase=F1a?=,NaT
1050,"Fri, 20 Mar 2020 20:05:00 -0500 (-05)",no-reply@biodimed.com,=?UTF-8?Q?BOLET=C3=8DN_2._Preguntas_y_respues?...,NaT
1062,"Wed, 18 Dec 2019 07:34:46 -0500 (GMT-05:00)",servicios@interdin.com.ec,Agradecemos su pago,NaT
1254,"Sat, 23 Nov 2019 08:01:25 -0500 (GMT-05:00)",servicios@discover.com.ec,Agradecemos tu pago,NaT
1290,"Sat, 21 Mar 2020 18:00:48 -0500 (-05)",no-reply@biodimed.com,=?UTF-8?Q?BOLET=C3=8DN_3._Preguntas_y_respuest...,NaT
1338,"Tue, 26 May 2020 08:00:11 -0500 (GMT-05:00)",servicios@discover.com.ec,Agradecemos tu pago,NaT
1443,"Thu, 10 Oct 2019 04:55:06 -0500 (GMT-05:00)",servicios@titanium.com.ec,=?ISO-8859-1?Q?Notificaci=F3n_de_Consumos?=,NaT
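All ten rows above end in a parenthesized comment such as `(GMT-05:00)` or `(-05)`. As an alternative to `pd.to_datetime`, the standard library's `email.utils.parsedate_to_datetime` is written for RFC 5322 style dates and reads only the numeric offset, ignoring the trailing comment. The following is a minimal sketch; the helper name `parse_rfc5322_date` is hypothetical and the result is normalized to UTC rather than Oslo time to keep the example free of extra dependencies.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_rfc5322_date(value):
    """Return an aware UTC datetime, or None if the header cannot be parsed."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC here.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# Date strings copied from rows 240 and 351 above.
with_comment = parse_rfc5322_date("Tue, 18 May 2021 09:41:04 -0500 (GMT-05:00)")
short_comment = parse_rfc5322_date("Wed, 21 Oct 2020 05:00:41 -0500 (-05)")
not_a_date = parse_rfc5322_date("not a date")

print(with_comment.isoformat())   # 2021-05-18T14:41:04+00:00
print(short_comment.isoformat())  # 2020-10-21T10:00:41+00:00
print(not_a_date)                 # None
```

A column of such values could then be passed to `pd.to_datetime(..., utc=True)` and converted with `.dt.tz_convert('Europe/Oslo')`, as in the parsing cell earlier in this section.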


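In [74]:
#Listing the dates that contain a parenthesized comment, e.g. (UTC) or (ECT)
#Not all of these failed to parse: only the (GMT-05:00) and (-05) variants became NaT above
wrong_date = [d for d in df['Date'] if '(' in d]
wrong_date[:11]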

['Tue, 25 May 2021 17:14:59 +0000 (UTC)',
 'Thu, 03 Jun 2021 05:48:41 +0000 (UTC)',
 'Sun, 6 Jun 2021 11:41:03 -0500 (ECT)',
 'Sun, 6 Jun 2021 09:02:17 -0500 (ECT)',
 'Sun, 6 Jun 2021 09:01:17 -0500 (ECT)',
 'Sun, 6 Jun 2021 08:57:38 -0500 (ECT)',
 'Fri, 21 May 2021 05:52:58 +0000 (UTC)',
 'Thu, 20 May 2021 20:12:50 +0000 (UTC)',
 'Thu, 03 Jun 2021 17:16:37 +0000 (UTC)',
 'Fri, 7 May 2021 23:51:12 +0000 (UTC)',
 'Wed, 5 May 2021 23:19:08 +0000 (UTC)']
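Because the parenthesized part only repeats the numeric offset, one option is to strip it before parsing. The sketch below does this with a regular expression on a few of the strings shown in this section; the names `TRAILING_COMMENT` and `strip_trailing_comment` are hypothetical, and the pattern assumes a single, non-nested comment at the end of the string.

```python
import re

# One parenthesized comment at the end of the string, together with any
# whitespace before it, e.g. " (UTC)" or " (GMT-05:00)".
TRAILING_COMMENT = re.compile(r"\s*\([^()]*\)\s*$")


def strip_trailing_comment(value):
    """Remove a trailing parenthesized comment from a date string."""
    return TRAILING_COMMENT.sub("", value)


samples = [
    "Tue, 25 May 2021 17:14:59 +0000 (UTC)",
    "Sun, 6 Jun 2021 11:41:03 -0500 (ECT)",
    "Tue, 18 May 2021 09:41:04 -0500 (GMT-05:00)",
    "Mon, 7 Jun 2021 06:55:46 +0000",
]
cleaned = [strip_trailing_comment(sample) for sample in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

On the DataFrame, the same pattern could be used as `df['Date'].str.replace(r'\s*\([^()]*\)\s*$', '', regex=True)` before calling `pd.to_datetime`.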