# __Data Preparation for Training__

In this notebook, we will explore and analyze our dataset to identify the best machine learning model for detecting whether a given text is `spam` or not.

Our dataset is in CSV format, and we will primarily use Pandas for analysis.

## __Loading the data__

In [2]:
import pandas as pd

DATA_PATH = "../data/spam_ham_dataset.csv"

df = pd.read_csv(DATA_PATH)

## __Visualization__

In [3]:
df.head()

| | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


In [4]:
df.tail()

| | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 5166 | 1518 | ham | Subject: put the 10 on the ft\r\nthe transport... | 0 |
| 5167 | 404 | ham | Subject: 3 / 4 / 2000 and following noms\r\nhp... | 0 |
| 5168 | 2933 | ham | Subject: calpine daily gas nomination\r\n>\r\n... | 0 |
| 5169 | 1409 | ham | Subject: industrial worksheets for august 2000... | 0 |
| 5170 | 4807 | spam | Subject: important online banking alert\r\ndea... | 1 |


In [6]:
df.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')
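The name `Unnamed: 0` is what pandas assigns to a CSV column whose header cell is empty, which often means an index was written to the file when it was saved. Its values (605, 2349, ...) do not look like a feature, so one option is to drop the column before modeling. A minimal sketch on a toy frame (the `toy` and `cleaned` names and the shortened texts are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the real dataframe: same column layout, shortened texts.
toy = pd.DataFrame(
    {
        "Unnamed: 0": [605, 2349, 4685],
        "label": ["ham", "ham", "spam"],
        "text": ["Subject: a", "Subject: b", "Subject: c"],
        "label_num": [0, 0, 1],
    }
)

# Drop the leftover index column; errors="ignore" makes the cell safe to re-run.
cleaned = toy.drop(columns=["Unnamed: 0"], errors="ignore")
print(list(cleaned.columns))
```

Whether the column is truly redundant should be checked against the dataset's documentation before discarding it.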

In [7]:
df.columns.value_counts()

Unnamed: 0    1
label         1
text          1
label_num     1
Name: count, dtype: int64

In [8]:
df.shape

(5171, 4)

Our `CSV` data contains 4 columns and 5171 rows. Let's push our analysis further.
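Before going further, it can be worth checking for missing values and repeated messages, since both affect training. The sketch below shows the two checks on a small made-up frame (the `toy` data and variable names are hypothetical; on the real data the same calls would be applied to `df`):

```python
import pandas as pd

# Made-up rows: one repeated text and one missing text.
toy = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham", "spam"],
        "text": ["Subject: hello", "Subject: win money", "Subject: hello", None],
    }
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows whose text already appeared in an earlier row.
duplicate_texts = int(toy.duplicated(subset="text").sum())

print(missing_per_column)
print(duplicate_texts)
```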

In [9]:
df.value_counts  # note: without parentheses this evaluates to the bound method itself, not to any counts

<bound method DataFrame.value_counts of       Unnamed: 0 label                                               text  \
0            605   ham  Subject: enron methanol ; meter # : 988291\r\n...   
1           2349   ham  Subject: hpl nom for january 9 , 2001\r\n( see...   
2           3624   ham  Subject: neon retreat\r\nho ho ho , we ' re ar...   
3           4685  spam  Subject: photoshop , windows , office . cheap ...   
4           2030   ham  Subject: re : indian springs\r\nthis deal is t...   
...          ...   ...                                                ...   
5166        1518   ham  Subject: put the 10 on the ft\r\nthe transport...   
5167         404   ham  Subject: 3 / 4 / 2000 and following noms\r\nhp...   
5168        2933   ham  Subject: calpine daily gas nomination\r\n>\r\n...   
5169        1409   ham  Subject: industrial worksheets for august 2000...   
5170        4807  spam  Subject: important online banking alert\r\ndea...   

      label_num  
0             0  

In [10]:
df["label"].value_counts()

label
ham     3672
spam    1499
Name: count, dtype: int64
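The class proportions can also be obtained directly from pandas by passing `normalize=True` to `value_counts`, which avoids hard-coding the counts. A small sketch that rebuilds a label column from the counts shown above (the `labels` series is a reconstruction for illustration, not the real column):

```python
import pandas as pd

# Rebuild a label column with the same class counts as the output above.
labels = pd.Series(["ham"] * 3672 + ["spam"] * 1499, name="label")

# normalize=True returns fractions of the total instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions.round(4))
```

On the real data, the equivalent call would be `df["label"].value_counts(normalize=True)`.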

In [12]:
# Class proportions (fractions between 0 and 1), using the counts above
HAM_PER = 3672 / 5171
SPAM_PER = 1499 / 5171
HAM_PER, SPAM_PER

(0.7101140978534133, 0.2898859021465867)

The dataset is imbalanced: about __71%__ of the messages are **HAM** and **29%** are **SPAM**. Let's keep that in mind when splitting the data into training and validation sets, so that both sets preserve roughly the same ratio.
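One way to keep the class ratio in both splits is to sample within each label. The sketch below does this with pandas only, on a made-up frame with a 70/30 class mix; the 20% validation fraction and the `random_state` value are arbitrary assumptions, not choices made in this notebook:

```python
import pandas as pd

# Made-up frame with a 70/30 class mix, roughly like the real data.
toy = pd.DataFrame(
    {
        "label": ["ham"] * 70 + ["spam"] * 30,
        "text": [f"message {i}" for i in range(100)],
    }
)

# Draw 20% of each class for validation so the class ratio is preserved.
val = toy.groupby("label").sample(frac=0.2, random_state=42)

# Everything not drawn for validation becomes the training set.
train = toy.drop(val.index)

print(train["label"].value_counts())
print(val["label"].value_counts())
```

scikit-learn's `train_test_split` offers a `stratify` argument that serves the same purpose, if that library is available.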

In [13]:
df["text"].head()

0    Subject: enron methanol ; meter # : 988291\r\n...
1    Subject: hpl nom for january 9 , 2001\r\n( see...
2    Subject: neon retreat\r\nho ho ho , we ' re ar...
3    Subject: photoshop , windows , office . cheap ...
4    Subject: re : indian springs\r\nthis deal is t...
Name: text, dtype: object
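The previews show that the messages start with a `Subject: ` prefix and contain `\r\n` line breaks. A possible cleaning step, sketched under the assumption that every message follows this pattern (`clean_text` is a hypothetical helper and the sample strings are invented, not taken from the dataset):

```python
import re

import pandas as pd


def clean_text(raw: str) -> str:
    """Hypothetical helper: drop a leading 'Subject:' and normalize whitespace."""
    # Remove the prefix only when it appears at the very start of the message.
    text = re.sub(r"^subject:\s*", "", raw, flags=re.IGNORECASE)
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Invented messages that mimic the format seen in the previews.
toy = pd.Series(
    [
        "Subject: team lunch\r\nsee you   at noon",
        "Subject: cheap software\r\nbuy now",
    ]
)

cleaned = toy.map(clean_text)
print(cleaned.tolist())
```

Whether to remove the subject prefix at all is a modeling decision; since it seems to appear in both classes, it likely carries little signal.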
