# Authors: Aishwarya Mathew, Vikram Yabannavar

# Final Project - Spamifier

## Introduction

Data science is all around us. It is an interdisciplinary field of study that strives to discover meaning behind data. Finding a good first data science project to work on is a hard task. You may know a lot of data science concepts but you get stuck when you want to apply your skills to real world tasks. However, you've come to the right place! Have you ever wanted to make an application that could distinguish between some bad thing and a good thing? This tutorial will do just that. It's actually a basic introduction to the world of data science. This project will teach you how to create a spam classifier/filter that will distinguish between spam vs. not spam (we call this ham) SMS messages. We will take you through the entire data science lifecycle which includes data collection, data processing, exploratory data analysis and visualization, analysis, hypothesis testing, machine learning and insight/policy decision.

### Tutorial Content

--Installing Libraries
--Downloading and Preparing the Data

In [26]:
import pandas as pd
import csv
import re

## Downloading The Data (Data Collection)

The first step of any data science project is Data Collection and for that, you need data. Kaggle is a dataset website that has a lot of real world data. To get started, you need to click on this link, https://www.kaggle.com/uciml/sms-spam-collection-dataset , to download the SMS spam vs. ham dataset from Kaggle to your local disk. You will have to create a user account on Kaggle to download any of their datasets.  

## Data Processing

Once you have the dataset (a csv file) on your local server, we can start the next step, in which, we prepare our data. The code below is going to export the spam vs. ham dataset from the local server to a pandas dataframe. 

In [3]:
#reading the data into a pandas dataframe
spamham_data = pd.read_csv("spam.csv", encoding='latin-1')

#removing unnecessary columns
del spamham_data['Unnamed: 2']
del spamham_data['Unnamed: 3']
del spamham_data['Unnamed: 4']
#renaming the remaining two columns
spamham_data.columns = ['Spam or Ham','SMS Message']

spamham_data

Unnamed: 0,Spam or Ham,SMS Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


## Exploratory Analysis and Data Visualization

In [28]:
word_ct_df = pd.DataFrame(columns=('word', 'count'))
index = 0

# Loop through the above data frame and check if message is spam. 
# If it is, we convert each message to lower case, then split each message by spaces. 
# Then remove punctuation and add it to the table if not present. Otherwise,
# you increment the corresponding count of occurrences.
for item,row in spamham_data.iterrows():
    if row['Spam or Ham'] == 'spam': 
        split = row['SMS Message'].lower().split()
        for word in split:
            re.sub(r'[^\w]','',word) #remove anything not a word char or space
            if word in word_ct_df['word']:
                temp_row = word_ct_df[['word']==word]
                temp_row['count'] = temp_row['count']+1
            else:
                word_ct_df.loc[index] = [word,1]
                index += 1
            
word_ct_df

Unnamed: 0,word,count
0,free,1.0
1,entry,1.0
2,in,1.0
3,2,1.0
4,a,1.0
5,wkly,1.0
6,comp,1.0
7,to,1.0
8,win,1.0
9,fa,1.0
