# Exploratory Data Analysis

This notebook contains the exploratory data analysis (EDA) for the email spam detection project.

## Installing Requirements and Importing Data


In [None]:
import numpy as np
import pandas as pd 

%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

In [None]:
# Importing data
data = pd.read_csv('../data/spam.csv', encoding='latin-1')
data.head()

## Feature Engineering

In [None]:
# Removing unnecessary columns
data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)

# Renaming cols
data.rename(columns={
    'v1' : 'class',
    'v2' : 'text'
}, inplace=True)

data.head()

## Data Exploration

In [None]:
# Get the shape of the data
data.shape


### Checks to perform on data

* Check for missing values
* Check for duplicate values
* Check the dtype of each column
* Get the descriptive statistics of the data (we can skip this step because our data doesn't have any numeric columns)
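
The checks listed above can also be gathered into a single helper, which makes it easy to rerun them after each cleaning step. The following is a minimal sketch: the function name `summarize_checks` and the tiny `sample` frame are hypothetical and only mimic the `class` / `text` layout of this notebook.

```python
import pandas as pd


def summarize_checks(df: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks in one dictionary."""
    return {
        # Number of missing values in each column
        "missing_per_column": df.isnull().sum().to_dict(),
        # Number of rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # dtype of each column, as a readable string
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Made-up data for illustration only (not taken from spam.csv)
sample = pd.DataFrame(
    {
        "class": ["ham", "spam", "ham", "ham"],
        "text": ["see you soon", "win a prize now", "see you soon", None],
    }
)
report = summarize_checks(sample)
print(report)
```

Calling `summarize_checks(data)` before and after `drop_duplicates` would show whether the cleaning step had the intended effect.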

#### Check for missing values

In [None]:
# check for missing values
data.isnull().sum()

Our dataset doesn't contain any missing values.


#### Check for duplicate values

In [None]:
# check for duplicate values
data.duplicated().sum()

Our dataset contains duplicate rows, so let's remove them.


In [None]:
data.drop_duplicates(inplace=True)

# check for duplicate values again
data.duplicated().sum()

#### Check the data types in the dataset

In [None]:
data.info()

## Data Visualization

In [None]:
# Let's visualize the class column
plt.figure(figsize=(10, 7))
sns.countplot(data=data, x="class")
plt.show()


The plot above shows that our data contains more `ham` (not spam) examples than `spam` examples.
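
The imbalance seen in the plot can also be expressed as numbers. The sketch below uses a made-up `labels` series rather than the real `class` column, so the printed values are illustrative only. On the notebook's data, the same calls would be applied to `data["class"]`.

```python
import pandas as pd

# Hypothetical labels for illustration; not the real class distribution
labels = pd.Series(["ham", "ham", "ham", "spam"], name="class")

# Absolute count of each class
counts = labels.value_counts()

# Share of each class as a fraction of all rows
shares = labels.value_counts(normalize=True)

# How many majority-class rows exist per minority-class row
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A high ratio suggests that plain accuracy may be a misleading metric later on, since always predicting the majority class would already score well.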

In [None]:
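# Build a word cloud from all messages joined into a single string
wordcloud = WordCloud(background_color="white", width=1600, height=800).generate(" ".join(data["text"].tolist()))
plt.figure(figsize=(20, 10), facecolor="k")
plt.imshow(wordcloud)
plt.axis("off")  # hide the axis ticks, which carry no meaning for an image
plt.show()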
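
A word cloud over all messages mixes both classes together. As a possible complement, the most frequent words can be counted per class. This is a rough sketch under stated assumptions: the helper `top_words`, the whitespace tokenization, and the `sample` frame are all hypothetical and not part of the original analysis.

```python
from collections import Counter

import pandas as pd


def top_words(texts, n=3):
    """Return the n most frequent lowercase, whitespace-separated tokens."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    return counter.most_common(n)


# Made-up messages for illustration only
sample = pd.DataFrame(
    {
        "class": ["ham", "ham", "spam", "spam"],
        "text": [
            "see you at lunch",
            "see you later",
            "free prize now",
            "claim free prize",
        ],
    }
)

# Most frequent words within each class
top_by_class = {
    label: top_words(group["text"], n=2)
    for label, group in sample.groupby("class")
}
print(top_by_class)
```

On real messages, common stop words would likely dominate both lists, so some filtering would be needed before the counts become informative.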

In [None]:
# Save the cleaned data to the data folder
data.to_csv("../data/data_cleaned.csv", index=False, header=True)