# Exercise 2: Spam Detection
### Spam Data Set: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
The objective is to train a model that can automatically detect spam messages.<br>
We rely on the following observations from experience:
- messages containing words like 'free', 'win', 'winner', 'cash', 'prize' and the like are usually spam
- spam messages tend to have words written in all capitals
- spam messages also tend to use a lot of exclamation marks
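
As a rough illustration of these three observations, the sketch below counts each signal in a single message. The helper `spam_signals`, the `SPAM_WORDS` set and the sample message are hypothetical names introduced here for illustration; they are not part of the exercise. The model trained later in this notebook learns from word counts rather than from hand-written rules like these.

```python
import re

# Hypothetical keyword list, taken from the examples in the text above
SPAM_WORDS = {'free', 'win', 'winner', 'cash', 'prize'}

def spam_signals(message):
    """Count three rough spam indicators in a single message."""
    words = re.findall(r"[A-Za-z']+", message)
    return {
        # Words that match the keyword list, ignoring case
        'keyword_hits': sum(w.lower() in SPAM_WORDS for w in words),
        # Words of at least two letters written entirely in capitals
        'all_caps_words': sum(w.isupper() and len(w) > 1 for w in words),
        # Raw count of exclamation marks
        'exclamation_marks': message.count('!'),
    }

signals = spam_signals('WINNER!! You have won a FREE cash prize!')
print(signals)
```

Note that `CountVectorizer`, used below with its default settings, lowercases the text and drops punctuation, so the capitalization and exclamation-mark signals are not visible to the model as trained in this exercise.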

## Step 1: Get to Know the Dataset
We will be using a [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning repository.

In [None]:
import pandas as pd
# It is a pre-processed table with two columns - a label and a message
# Import the table into a pandas dataframe using the read_table method
df = pd.read_table('/Users/tdi/Documents/Teaching/DS/Data/SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])
df.shape

In [None]:
# Print the first 200 rows to get an idea of the data
df.head(200)

## Step 2: Data Preprocessing

### 2.1 Encode the Labels

In [None]:
# Convert the labels into numerical values, map 'ham' to 0 and 'spam' to 1
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head() 

### 2.2 Bag-of-Words Processing

Bag-of-words is a model that represents a piece of text, such as a sentence or a document, as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
The words are stored as tokens, together with a count of how often each one appears.

1. Convert strings to lower case
2. Remove punctuation
3. Tokenize the message and give an integer ID to each token
4. Count frequencies
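
The four steps above can be sketched by hand with the standard library. This is a minimal illustration on two made-up sentences, assuming whitespace tokenization; the names `documents`, `bag_of_words` and `vocabulary` are introduced here and are not part of the exercise. `CountVectorizer`, used in the next cells, performs comparable steps with its own tokenization rules.

```python
import string
from collections import Counter

# Two made-up example messages
documents = ['Hello, how are you!', 'Win money, win from home.']

def bag_of_words(text):
    """Return the token counts for a single piece of text."""
    lowered = text.lower()                                        # 1. lower case
    cleaned = lowered.translate(
        str.maketrans('', '', string.punctuation))                # 2. remove punctuation
    tokens = cleaned.split()                                      # 3. tokenize on whitespace
    return Counter(tokens)                                        # 4. count frequencies

bags = [bag_of_words(doc) for doc in documents]

# 3 (continued). Give an integer ID to each distinct token, in alphabetical order
all_tokens = sorted(set().union(*bags))
vocabulary = {token: idx for idx, token in enumerate(all_tokens)}

# One row of counts per document, one column per token ID
matrix = [[bag[token] for token in all_tokens] for bag in bags]

print(vocabulary)
print(matrix)
```

Each row of `matrix` corresponds to one message and each column to one token, which is the same layout as the document-term matrix returned by `CountVectorizer`.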

In [None]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

In [None]:
# Create an instance of CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the training data and return the document-term matrix
training_data = count_vector.fit_transform(X_train)
training_data.shape

In [None]:
# Transform the test data using the vocabulary learned from the training data
# Note: we do not fit the CountVectorizer on the test data
test_data = count_vector.transform(X_test)
test_data.shape
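
To see why the test data is only transformed and never fitted, the following sketch uses three made-up messages (the names `train_messages`, `test_messages` and `vectorizer` are illustrative only). The vocabulary is fixed by the training messages, so a word that appears only in the test message has no column and is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration
train_messages = ['free prize inside', 'see you at lunch']
test_messages = ['free lunch tomorrow']

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training messages only
train_matrix = vectorizer.fit_transform(train_messages)

# transform reuses that vocabulary, so both matrices have the same columns
test_matrix = vectorizer.transform(test_messages)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())
```

Here 'tomorrow' never occurred in training, so only 'free' and 'lunch' are counted in the test row. Fitting on the test data as well would change the columns and leak information about the test set into the features.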

## Step 3: Train and Test

In [None]:
# Create a Multinomial Naive Bayes classifier and train the model
from sklearn.naive_bayes import MultinomialNB
myNB = MultinomialNB()
myNB.fit(training_data, y_train)

In [None]:
# Predict the labels of the test data
predictions = myNB.predict(test_data)

In [None]:
# my_data = count_vector.transform([" free"])

In [None]:
# predictions = myNB.predict(my_data)

In [None]:
predictions

In [None]:
predictions.shape

## Step 4: Validate

In [None]:
# Validate the accuracy of the predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))
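
To make the four scores concrete, the sketch below recomputes them by hand from the counts of true and false positives and negatives. The label lists `y_true` and `y_pred` are made-up toy values, not results from the SMS dataset; label 1 stands for spam, as in the mapping used earlier.

```python
# Made-up labels: 1 = spam, 0 = ham
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham correctly passed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that slipped through

accuracy = (tp + tn) / len(pairs)      # share of all messages classified correctly
precision = tp / (tp + fp)             # share of flagged messages that really are spam
recall = tp / (tp + fn)                # share of real spam that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Accuracy score: {}'.format(accuracy))
print('Precision score: {}'.format(precision))
print('Recall score: {}'.format(recall))
print('F1 score: {}'.format(f1))
```

Because ham messages greatly outnumber spam in typical SMS data, accuracy alone can look good even when much of the spam is missed, which is why precision, recall and F1 are reported alongside it.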

## <span style="color:red">Task</span>
Repeat the training, testing and validation with the Decision Tree method you researched previously.
Upload your answer to the following question in the Assignment section: which of the two methods gives better results?
Support your answer with evidence.