# Assignment: Spam Classification

## Task: Detect Spam in SMS messages   

Kaggle challenge: https://www.kaggle.com/uciml/sms-spam-collection-dataset

### Problem description
**Context**
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

## Data
The files contain one message per line. Each line consists of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

This corpus has been collected from free or free-for-research sources on the Internet.



# Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set... 
* What assumptions can we make about the data?
* What problems are we expecting?

# Task 2: First Data Analysis and Cleaning
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...

see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
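The import and cleaning steps listed above can be sketched on a small in-memory example. The rows below are a hypothetical stand-in for `spam.csv`, not taken from the real file, and the names `raw_df`, `df`, `label`, and `text` are illustrative assumptions. The trailing empty fields mimic the mostly empty extra columns that show up when the real file is loaded.

```python
import io
import pandas as pd

# Toy stand-in for spam.csv (hypothetical rows). With the real file you
# would pass the file name and encoding="ISO-8859-1" instead of a buffer.
raw = io.StringIO(
    "v1,v2,,,\n"
    "ham,See you at lunch,,,\n"
    "spam,WIN a free prize now,,,\n"
    'ham,"Sorry, I\'ll call later",,,\n'
)

raw_df = pd.read_csv(raw)

# Columns without a header name are labelled "Unnamed: N" by pandas.
null_counts = raw_df.isnull().sum()
print(null_counts)

# Keep only the label and text columns and give them readable names.
df = raw_df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(df)
```

Selecting the two known columns by name is one option; dropping every column that contains missing values is another, and it is only safe if the label and text columns themselves are complete.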

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sea
import string
import nltk
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Varinder\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [6]:
data = pd.read_csv('spam.csv' , encoding = "ISO-8859-1")

In [8]:
data.head()
data.describe()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | GE | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |
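A few quick numbers can be read off the `describe()` output above. The sketch below only does arithmetic on the counts shown there (5572 rows, 4825 of them ham, 5169 unique texts); the variable names are illustrative.

```python
# Counts copied from the describe() output above.
n_messages = 5572
n_ham = 4825
n_unique_texts = 5169

n_spam = n_messages - n_ham                  # messages labelled spam
ham_share = n_ham / n_messages               # accuracy of "always predict ham"
n_duplicates = n_messages - n_unique_texts   # rows repeating an earlier text

print(f"spam messages:       {n_spam}")
print(f"ham share:           {ham_share:.3f}")
print(f"duplicate text rows: {n_duplicates}")
```

The ham share is worth keeping in mind later: a classifier that always answers "ham" already reaches roughly that accuracy, so accuracy alone says little about how well spam is detected.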


In [None]:
data.columns
data.isnull().sum()

In [None]:
data['Unnamed: 2'].unique()
data['Unnamed: 3'].unique()
data['Unnamed: 4'].unique()

In [None]:
# Class labels Frequency 

spam = data['v1']
colors = ["yellow", "red"]
# Pass the labels as a keyword argument; newer seaborn versions require x=
sea.countplot(x=spam, palette = colors)
plt.title("Distribution of Spam", fontsize = 14)




# Task 3: Feature Extraction
## Hint: see the lecture of week 6
* How can we handle text?
* Discuss possible features for a numerical representation!
* How can we obtain a compact and non-sparse representation?

See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
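One possible answer to the questions above is a bag-of-words count matrix followed by a truncated SVD, which maps the sparse word counts to a small dense vector per message. This is a sketch on a hypothetical toy corpus, not the approach prescribed by the assignment; `TruncatedSVD` and the choice of two components are assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the data set.
docs = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still meeting for lunch",
    "call me when you get home",
    "claim your free ringtone now",
    "see you at home later",
]

# Bag of words: one column per vocabulary term, stored as a sparse matrix.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print("sparse shape:", counts.shape)

# Truncated SVD projects the sparse counts onto a few dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
dense = svd.fit_transform(counts)
print("dense shape:", dense.shape)
```

The dense representation is compact, but its components are no longer individual words, which makes a later feature importance analysis harder to interpret.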

In [28]:
data.dropna(how="any", axis=1, inplace=True)
data.columns=['a', 'b']
data.head()

(5572, 8672)

In [None]:
# Map the text labels to numbers: ham -> 0, spam -> 1
data['a_num'] = data['a'].map({'ham':0 ,'spam':1})
data.head()

In [None]:
# Length (in characters) of each message text in column b
data['b_len']=data.b.apply(len)
data.head()

In [None]:
#Showing basic stats on the characteristics of b based on ham and spam
data[data.a_num==0].describe()



In [None]:
data[data.a_num==1].describe()

In [None]:
# Remove punctuation and stopwords from the message text in column b.

# Work on a copy of the text column (a Series) so `data` stays unchanged

data_1 = data['b'].copy()



In [None]:
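from nltk.corpus import stopwords

# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(text):
    # Strip punctuation, then drop English stopwords (case-insensitive)
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)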
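To see what this kind of cleaning does to a single message, here is a self-contained sketch. It uses a tiny hand-written stopword set (`TOY_STOPWORDS`, an assumption for illustration) instead of the NLTK list so that it runs without downloading any corpus; the function name `clean_message` and the sample message are also hypothetical.

```python
import string

# Tiny illustrative stopword set; the NLTK English list is much longer.
TOY_STOPWORDS = {"a", "the", "are", "you", "how", "to", "is"}

def clean_message(text):
    # Remove every punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only words whose lowercase form is not a stopword.
    kept = [w for w in text.split() if w.lower() not in TOY_STOPWORDS]
    return " ".join(kept)

sample = "Hello, how are you? Call now to WIN a prize!!!"
cleaned = clean_message(sample)
print(cleaned)  # Hello Call now WIN prize
```

Note that the original casing of the remaining words is preserved; lowercasing happens later, inside the vectorizer, if its default settings are used.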

In [None]:
data_1=data_1.apply(text_process)
data_1.head()

In [None]:
# stop_words must be passed by keyword; the first positional argument is `input`
vectorizer = TfidfVectorizer(stop_words="english")

In [None]:
#A compressed sparse matrix
features = vectorizer.fit_transform(data_1)
features
data.head()

In [None]:
#splitting features to train and test set

features_train, features_test, Class_train, Class_test = train_test_split(features, data['a_num'], test_size=0.2, random_state=111)
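Because ham is much more frequent than spam, it can be worth checking how the split distributes the two classes. The sketch below uses hypothetical toy labels (80 zeros, 20 ones) and the `stratify` option of `train_test_split`, which the cell above does not use; it is shown here only as one possible variation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "ham" (0) and 20 "spam" (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column

# stratify=y keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=111, stratify=y
)

print("spam share in train:", y_tr.mean())
print("spam share in test: ", y_te.mean())
```

Without `stratify`, the spam share in the test set can drift away from the overall share purely by chance, which makes precision and recall noisier on small test sets.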

# Task 4: Train a Random Forest Model
* Train and evaluate the model using the approach from task 3
* Discuss the results -> possible improvements?
* Use RF feature importance to see which features are driving the RF decision

See: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Use a separate name so the DataFrame `data` is not overwritten
rf = RandomForestClassifier(n_estimators=31, random_state=111)

In [None]:
qr = {'RF': rf}

In [None]:
# Helper functions to fit a classifier and make predictions

def train_classifier(clf, feature_train, Class_train):
    clf.fit(feature_train, Class_train)

def predict_labels(clf, features):
    return clf.predict(features)

In [None]:
pred_scores = []
for k,v in qr.items():
    train_classifier(v, features_train, Class_train)
    pred = predict_labels(v,features_test)
    pred_scores.append((k, [accuracy_score(Class_test,pred)]))

In [None]:
pred_scores
pred=pd.DataFrame(pred)
pred
Class_train
features_test
features_train

In [None]:
pred
Class_test

In [None]:
# Imports for the confusion matrix and the classification report

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
matrix = confusion_matrix(Class_test,pred)

In [None]:
confusion_matrix?

In [None]:
print('Confusion matrix : \n',matrix)

In [None]:
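# For binary labels, ravel() returns the counts in the order
# true negatives (x), false positives (y), false negatives (z), true positives (q)
x, y, z, q = confusion_matrix(Class_test, pred).ravel()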

In [None]:
x

In [None]:
y

In [None]:
z

In [None]:
q

In [None]:
print('Precision',q/(q+y))

In [None]:
print('Recall',q/(q+z))
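The manual precision and recall computation can be cross-checked against scikit-learn's own metric functions. The sketch below uses hypothetical toy labels rather than the notebook's predictions, and names the four confusion matrix cells explicitly.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was found

print("Precision", precision, precision_score(y_true, y_pred))
print("Recall   ", recall, recall_score(y_true, y_pred))
```

For spam filtering, precision is often the more critical of the two, since a false positive means a legitimate message ends up flagged as spam.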