## Data Loading 

First, download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and loaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR)

In [655]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2022-02-11 03:19:42--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 142.251.40.110
Connecting to docs.google.com (docs.google.com)|142.251.40.110|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/tf5sbh75u0t5rl7c19ec9qtab3dhg0r0/1644567525000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2022-02-11 03:19:42--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/tf5sbh75u0t5rl7c19ec9qtab3dhg0r0/1644567525000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 142.251.40.97
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)|142.251.40.97|:443... connected.
HTTP reque

In [656]:
#!ls

There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [657]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


split into train, val, and test 

In [658]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

np.random.seed(12)


indices = np.arange(int(df.shape[0]))
shuffled_indices = np.random.permutation(indices)

train_size = int(df.shape[0] * 0.7)
train_val_size = train_size + val_size


train_indices = shuffled_indices[0:train_size]
dev_indices = shuffled_indices[train_size:train_val_size]
test_indicies = shuffled_indices[train_val_size:]


train = df.loc[train_indices]
val = df.loc[dev_indices]
test = df.loc[test_indicies]


        
            
train_texts, train_labels = train.loc[:,"v2"].tolist(), train.loc[:,"v1"]
val_texts, val_labels     = val.loc[:,"v2"].tolist(), val.loc[:,"v1"]
test_texts, test_labels  = test.loc[:,"v2"].tolist(), test.loc[:,"v1"]


## Data Processing 

The task is to create bag-of-words features: tokenize the text, index each token, represent the sentence as a dictionary of tokens and their counts, limit the vocabulary to $n$ most frequent tokens. 


Function `preprocess_data` takes the list of texts and returns list of (lists of tokens). 


Class `Vectorizer` is used to vectorize the text and to create a matrix of features.


In [659]:
import nltk
nltk.download('punkt')
nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/corrinazhang1220/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7fe80f415f50>

In [660]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def preprocess_data(data):
    # This function returns a list of lists of preprocessed tokens for each message
   
    
    preprocessed_data = []
    for i in range(len(data)):
        token_list = nltk.word_tokenize(data[i])
        #change the words to lower cases and ignore the punctuations. 
        revised_list = [word.lower() for word in token_list if word.isalnum()]
        preprocessed_data.append(token_list)
        
            
    
    
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [661]:
import numpy as np
from collections import Counter

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None
        
    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will map each token in self.vocab 
        # to its corresponding index in self.vocab_list
      
        
        data = [x for line in dataset for x in line]
        
        most_frequent_words = Counter(data).most_common(self.max_features)
        self.vocab_list = [token[0]for token in most_frequent_words]
        self.token_to_index = {ch: i for i, ch in enumerate(self.vocab_list)}
        
    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
     
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        
        for i in range(len(dataset)):
            for word in dataset[i]:
                if word in self.vocab_list:
                    index = int(self.token_to_index[word])
                    data_matrix[i][index] += 1 
        
        return data_matrix

In [662]:
max_features = 750 
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


## Model

We train logistic regression model and save prediction for train, val and test.


In [664]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

The task is to report train, val, test accuracies and F1 scores. 

In [665]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
   
    TP = np.sum(y_true & y_pred)
    FP = np.sum(y_pred) - TP
    TN = np.sum(y_true == 0 ) - FP
    FN = np.sum(y_true) - TP
    accuracy = (TP + TN)/(TP + FP + FN + TN)
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
  
    TP = np.sum(y_true & y_pred)
    FP = np.sum(y_pred) - TP
    TN = np.sum(y_true == 0 ) - FP
    FN = np.sum(y_true) - TP
    
    f1 = TP/(TP + (FP + FN)/2)
    return f1

In [666]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.989, F1 score: 0.958
Validation accuracy: 0.983, F1 score: 0.929
Test accuracy: 0.982, F1 score: 0.934


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

**Your answer:** No. There are other metric that should be optimized while training. Precision = Of all positive predictions, how many are really positive. Recall = Of all real positive cases, how many are predicted positive. Precision measures the error caused by FP, and recall measures the error caused by FN. A model prefers both minimum FN and FP, as well as a balance between FN and FP. Thus, we want to optimize F1 score which measures the harmonic mean of FN and FP. 

**Question.**
In general, does having 0.99 accuracy on test means that the model is great? If no, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** 
No. If the dataset is imbalanced and 99%of the instances are postive, and 1% of the instances are negative. The model could just predict every instances to be positive, and get accuracy of 99%. However, this model is not good, since it does not predict base on features, instead it just predicts all the instances to be postive. 

### Exploration of predicitons 

Show a few examples with true+predicted labels on the train and val sets.

In [667]:

# 1 - spam, 0 - ham

def true_pairs(y_true, y_pred):
    true_predict = []
    for i in range(len(y_true)):
        if y_true[i] == y_pred[i]:
            true_predict.append(i)
    return true_predict
        

y_train_true_pairs = true_pairs(y_train, y_train_pred)
y_val_true_pairs = true_pairs(y_val, y_val_pred)


for pair in (y_train_true_pairs[0:10]):
    print("index",pair, "is truly predicted", "y_train:", y_train[pair], "y_train_pred:", y_train_pred[pair])
    print(train_texts[pair],"\n")
    
    
for pair in (y_val_true_pairs[0:10]):
    print("index",pair, "is truly predicted", "y_val:", y_val[pair], "y_val_pred:", y_val_pred[pair])
    print(val_texts[pair],"\n")
    
    





index 0 is truly predicted y_train: 0 y_train_pred: 0
Wat makes some people dearer is not just de happiness dat u feel when u meet them but de pain u feel when u miss dem!!! 

index 1 is truly predicted y_train: 0 y_train_pred: 0
Thanks for being there for me just to talk to on saturday. You are very dear to me. I cherish having you as a brother and role model. 

index 2 is truly predicted y_train: 0 y_train_pred: 0
She was supposed to be but couldn't make it, she's still in town though 

index 4 is truly predicted y_train: 0 y_train_pred: 0
Thank You for calling.Forgot to say Happy Onam to you Sirji.I am fine here and remembered you when i met an insurance person.Meet You in Qatar Insha Allah.Rakhesh, ex Tata AIG who joined TISSCO,Tayseer. 

index 5 is truly predicted y_train: 0 y_train_pred: 0
K..then come wenever u lik to come and also tel vikky to come by getting free time..:-) 

index 6 is truly predicted y_train: 0 y_train_pred: 0
I will cme i want to go to hos 2morow. After that

**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** When I checked the raw contents from the val sets which were labeled incorrectly by the model. I found out that some of the contents contain some numbers or some random combinations of characters. These words did not appear in the bag of words because they rarely appear in all the texts. From this perspective, after vectorizer transform the text using "bag of words", these "fail predicted" text might not have shown something drastically differnt from other ham texts because the random combination of characters were not counted, and some other words that might appear in ham texts are counted. Thus, the model might not be able to lable these "fail predicted" text correctly.  

In [669]:

def fail_pairs(y_true, y_pred):
    fail_predict = []
    for i in range(len(y_true)):
        if y_true[i] != y_pred[i]:
            fail_predict.append(i)
    return fail_predict
        

y_val_fail_pairs = fail_pairs(y_val, y_val_pred)



    
for pair in (y_val_fail_pairs[0:10]):
    print("index",pair, "fail predicted", "y_val:", y_val[pair], "y_val_pred:", y_val_pred[pair])
    print(val_texts[pair],"\n")
    



index 163 fail predicted y_val: 1 y_val_pred: 0
You can donate å£2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. å£2.50 will be added to your next bill 

index 171 fail predicted y_val: 0 y_val_pred: 1
Derp. Which is worse, a dude who always wants to party or a dude who files a complaint about the three drug abusers he lives with 

index 196 fail predicted y_val: 0 y_val_pred: 1
Hi! You just spoke to MANEESHA V. We'd like to know if you were satisfied with the experience. Reply Toll Free with Yes or No. 

index 198 fail predicted y_val: 1 y_val_pred: 0
TBS/PERSOLVO. been chasing us since Sept forå£38 definitely not paying now thanks to your information. We will ignore them. Kath. Manchester. 

index 213 fail predicted y_val: 1 y_val_pred: 0
SMS. ac sun0819 posts HELLO:\You seem cool 

index 277 fail predicted y_val: 1 y_val_pred: 0
England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND