# Part 1.

The deadline for Part 1 is **2 pm Feb 11, 2021**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---
Spam filtering is a well-studied NLP classification problem that is used in many commercial products nowadays, including email clients and mobile SMS apps.

In this assignment we will train a logistic regression model to classify each text in the SMS Spam Collection dataset as either spam or legitimate. This dataset consists of 5,574 English SMS messages, each tagged as either spam or ham (legitimate). The column 'v1' contains the label and the column 'v2' contains the SMS text. We will pre-process and convert the texts into bag-of-words features for our model.

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset originally comes from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#); the command below downloads a copy hosted on [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) and saves it as `spam.csv`.

In [1]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2021-02-07 23:18:49--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 216.58.196.110
Connecting to docs.google.com (docs.google.com)|216.58.196.110|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/u9laobobv30sggjlpv0plslrs8kif3tb/1612720125000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2021-02-07 23:18:50--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/u9laobobv30sggjlpv0plslrs8kif3tb/1612720125000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 216.58.200.161
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)|216.58.200.161|:443... connected.


In [2]:
!ls

HW_1_Part_1_Spam_Prediction.ipynb spam.csv


Now we preview the data. There are two columns: `v1` -- the label, which indicates whether the text is spam or ham (legitimate), and `v2` -- the text of the message.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head(10)


| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | 0 | Even my brother is not like to speak with me. ... |
| 7 | 0 | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | 1 | WINNER!! As a valued network customer you have... |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... |


In [4]:
df.info()
df.describe()
print(df['v1'][5])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
v1    5572 non-null int64
v2    5572 non-null object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB
1


Your task is to randomly split the data into training, validation, and test sets. Make sure that each row appears in only one of the splits and that training contains 70% of the data, validation contains 15%, and test contains 15%. You can compute the size of each split afterwards to sanity check your code. **You may use numpy and pandas functions, but please do not use sklearn.**

In [5]:
print(df.loc[[5]])

   v1                                                 v2
5   1  FreeMsg Hey there darling it's been 3 week's n...


In [6]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)
train_size = int(df.shape[0] * 0.7)
"""
YOUR CODE GOES HERE
"""

# Randomly split the data into train (70%), validation (15%), and test (15%).
# Each row must appear in exactly one split.

# Shuffle the rows once, then cut the shuffled frame at the 70% and 85% marks.
shuffled = df.sample(frac=1, random_state=42)
train_end = int(df.shape[0] * 0.7)
val_end = int(df.shape[0] * 0.85)
train_df = shuffled.iloc[:train_end]
val_df = shuffled.iloc[train_end:val_end]
test_df = shuffled.iloc[val_end:]

# Convert each split to a NumPy array: column 0 holds the label (v1),
# column 1 holds the message text (v2).
train_arr = np.array(train_df)
val_arr = np.array(val_df)
test_arr = np.array(test_df)

train_texts, train_labels = train_arr[:, 1], train_arr[:, 0]
val_texts, val_labels     = val_arr[:, 1], val_arr[:, 0]
test_texts, test_labels   = test_arr[:, 1], test_arr[:, 0]


#Check in the end if there is no overlap
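
The overlap check mentioned in the last comment could be sketched as follows. This is a minimal illustration on a toy range of row positions rather than the notebook's data; `n_rows`, `perm`, and the `*_idx` names are assumptions introduced here.

```python
import numpy as np

n_rows = 20  # toy stand-in for df.shape[0]
rng = np.random.default_rng(42)
perm = rng.permutation(n_rows)  # shuffled row positions 0..n_rows-1

# Cut the shuffled positions at the 70% and 85% marks.
train_end = int(n_rows * 0.7)
val_end = int(n_rows * 0.85)
train_idx, val_idx, test_idx = np.split(perm, [train_end, val_end])

# No row position may appear in two splits ...
assert len(np.intersect1d(train_idx, val_idx)) == 0
assert len(np.intersect1d(train_idx, test_idx)) == 0
assert len(np.intersect1d(val_idx, test_idx)) == 0
# ... and together the splits must cover every row.
assert len(train_idx) + len(val_idx) + len(test_idx) == n_rows

print(len(train_idx), len(val_idx), len(test_idx))  # 14 3 3
```

With a DataFrame, the same idea should apply to the index of each split, since `df.sample` keeps the original row labels.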



In [7]:
test_texts

array(['We have new local dates in your area - Lots of new people registered in YOUR AREA. Reply DATE to start now! 18 only www.flirtparty.us REPLYS150',
       "Lol I have to take it. member how I said my aunt flow didn't visit for 6 months? It's cause I developed ovarian cysts. Bc is the only way to shrink them.",
       'Okay. No no, just shining on. That was meant to be signing, but that sounds better.',
       'You sure your neighbors didnt pick it up',
       'I actually did for the first time in a while. I went to bed not too long after i spoke with you. Woke up at 7. How was your night?',
       'Hows the street where the end of library walk is?',
       'Neshanth..tel me who r u?',
       'Are you willing to go for apps class.',
       'Freemsg: 1-month unlimited free calls! Activate SmartCall Txt: CALL to No: 68866. Subscriptn3gbp/wk unlimited calls Help: 08448714184 Stop?txt stop landlineonly',
       "K..k..i'm also fine:)when will you complete the course?",
       'When yo

In [8]:
train_labels

array([0, 0, 1, ..., 0, 0, 0], dtype=object)

In [9]:
train_texts

array(['Funny fact Nobody teaches volcanoes 2 erupt, tsunamis 2 arise, hurricanes 2 sway aroundn no 1 teaches hw 2 choose a wife Natural disasters just happens',
       'I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones',
       'We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p',
       ...,
       "How are you doing. How's the queen. Are you going for the royal wedding",
       'Er yeah, i will b there at 15:26, sorry! Just tell me which pub/cafe to sit in and come wen u can',
       'Yep then is fine 7.30 or 8.30 for ice age.'], dtype=object)

In [10]:
print("The size of the train_texts is : " + str(len(train_texts)))
print("\n")
print("The size of the train_labels is : " + str(len(train_labels)))
print("\n")

print("The size of the val_texts is : " + str(len(val_texts)))
print("\n")

print("The size of the val_labels is : " + str(len(val_labels)))
print("\n")


print("The size of the test_texts is : " + str(len(test_texts)))
print("\n")

print("The size of the test_labels is : " + str(len(test_labels)))
print("\n")

The size of the train_texts is : 3900


The size of the train_labels is : 3900


The size of the val_texts is : 836


The size of the val_labels is : 836


The size of the test_texts is : 836


The size of the test_labels is : 836




## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent each sentence as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In Lab 2 we will use the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`.
**In this HW, you are required to implement the class `Vectorizer` on your own without using `sklearn` built-in functions.**

The function `preprocess_data` takes a list of texts and returns a list of lists of tokens.
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function.

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
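
To make the bag-of-words steps concrete, here is a minimal sketch on a toy, already tokenized corpus. It uses `collections.Counter` for brevity and is not the required `Vectorizer` implementation; `docs`, `corpus_counts`, and `rows` are hypothetical names.

```python
from collections import Counter

# Toy corpus: each document is already tokenized (hypothetical data).
docs = [["free", "entry", "free"], ["ok", "free"], ["ok", "lar"]]

# Count every token across the corpus and keep the n most frequent ones.
corpus_counts = Counter(token for doc in docs for token in doc)
n = 2
vocab = [token for token, _ in corpus_counts.most_common(n)]
token_to_index = {token: index for index, token in enumerate(vocab)}

# Each document as a dictionary of tokens and their counts.
doc_counts = [Counter(doc) for doc in docs]

# Binary bag-of-words rows: 1 if the document contains the vocab token.
rows = [[1 if token in counts else 0 for token in vocab] for counts in doc_counts]

print(vocab)  # ['free', 'ok']
print(rows)   # [[1, 0], [1, 1], [0, 1]]
```

Tokens outside the vocabulary (`entry` and `lar` here) simply contribute no feature.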


In [12]:
import nltk


In [13]:
def preprocess_data(data):
    # This function should return a list of lists of pre-processed tokens, where
    # each nested list contains the tokens for a single message. To pre-process
    # each message, tokenize the message and lowercase all text.
    """
    YOUR CODE GOES HERE
    """
    
    # Input: a list of texts, e.g. ["Hello my Name is Yash", "Ok lar... Joking wif u oni..."]
    
    # Output: a list of token lists, one per message: [[tokens of message 1], [tokens of message 2], ...]
    
    # Each message is lowercased before it is tokenized.
    preprocessed_data = []
    for sentence in data:
        preprocessed_data.append(nltk.word_tokenize(sentence.lower()))
        
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)
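
`nltk.word_tokenize` depends on NLTK tokenizer data that has to be downloaded separately. If that data is unavailable, a rough regex-based tokenizer is one possible fallback. The sketch below is an assumption-laden illustration (`TOKEN_PATTERN` and `simple_tokenize` are hypothetical names), and its output differs from NLTK's, for example in splitting `...` into three separate tokens.

```python
import re

# One token is either a run of word characters or a single
# non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("Free entry!"))  # ['free', 'entry', '!']
```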

In [14]:
train_data

[['funny',
  'fact',
  'nobody',
  'teaches',
  'volcanoes',
  '2',
  'erupt',
  ',',
  'tsunamis',
  '2',
  'arise',
  ',',
  'hurricanes',
  '2',
  'sway',
  'aroundn',
  'no',
  '1',
  'teaches',
  'hw',
  '2',
  'choose',
  'a',
  'wife',
  'natural',
  'disasters',
  'just',
  'happens'],
 ['i',
  'sent',
  'my',
  'scores',
  'to',
  'sophas',
  'and',
  'i',
  'had',
  'to',
  'do',
  'secondary',
  'application',
  'for',
  'a',
  'few',
  'schools',
  '.',
  'i',
  'think',
  'if',
  'you',
  'are',
  'thinking',
  'of',
  'applying',
  ',',
  'do',
  'a',
  'research',
  'on',
  'cost',
  'also',
  '.',
  'contact',
  'joke',
  'ogunrinde',
  ',',
  'her',
  'school',
  'is',
  'one',
  'me',
  'the',
  'less',
  'expensive',
  'ones'],
 ['we',
  'know',
  'someone',
  'who',
  'you',
  'know',
  'that',
  'fancies',
  'you',
  '.',
  'call',
  '09058097218',
  'to',
  'find',
  'out',
  'who',
  '.',
  'pobox',
  '6',
  ',',
  'ls15hb',
  '150p'],
 ['only',
  'if',
  'you',


In [15]:
import numpy as np
import string

#Bag of Words Model
class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Given a list of lists of tokens, create a vocab list, self.vocab_list, 
        # using the most frequent "max_features" tokens. Also create a token 
        # indexer, self.token_to_index, that will return the index of the token 
        # in self.vocab_list.
        
        
        #vectorize and make a matrix of features
        
        
        """
        YOUR CODE GOES HERE
        """
        # Flatten the dataset into one list holding every token occurrence.
        words = [token for tokens in dataset for token in tokens]

        # Count how often each unique token occurs.
        unique, counts = np.unique(np.array(words), return_counts=True)
        vocab_counts = dict(zip(unique.tolist(), counts.tolist()))

        # Sort the tokens by ascending frequency, keep the last (most frequent)
        # max_features of them, and reverse so the most frequent comes first.
        sorted_tokens = sorted(vocab_counts, key=vocab_counts.get)
        start = max(0, len(sorted_tokens) - self.max_features)
        most_frequent = sorted_tokens[start:][::-1]

        # Drop single-character punctuation tokens by building a new list
        # (removing items from a list while iterating over it skips elements).
        punctuation = set(string.punctuation)
        self.vocab_list = [t for t in most_frequent if t not in punctuation]

        # Map each token to its index in self.vocab_list.
        self.token_to_index = {token: index for index, token in enumerate(self.vocab_list)}
        
           

    
    def transform(self, dataset):
        # This function transforms the text dataset (a list of lists of tokens) 
        # into a matrix, "data_matrix," where the entry located at (i, j) is 1 
        # if sample i of the dataset contains token j in the vocab list, and 0 
        # otherwise.
        """
        YOUR CODE GOES HERE
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for i, tokens in enumerate(dataset):
            for token in tokens:
                # Set entry (i, j) to 1 if the token is in the vocabulary. The
                # dictionary lookup avoids scanning the whole vocab list.
                index = self.token_to_index.get(token)
                if index is not None:
                    data_matrix[i, index] = 1

        return data_matrix

In [33]:
max_features = 500  # number of most frequent tokens to keep in the vocabulary
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)
# The labels come from object-dtype arrays, so cast them to int explicitly.
y_train = np.array(train_labels).astype('int')
y_val = np.array(val_labels).astype('int')
y_test = np.array(test_labels).astype('int')
vocab = vectorizer.vocab_list


You can add more features to the feature matrix. (Not required, but worth extra credit points.)
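
One possible way to add features, shown as a sketch under stated assumptions: compute a few hand-crafted statistics per message and append them as extra columns with `np.hstack`. The specific features (length, digit count, uppercase count) and the names `extra_features`, `X_bow`, and `X_combined` are illustrative choices, not part of the assignment.

```python
import numpy as np

# Toy raw messages and a toy stand-in for the bag-of-words matrix.
texts = ["WINNER!! Call 09058097218 now", "ok see you at 7"]
X_bow = np.array([[1.0, 0.0], [0.0, 1.0]])

def extra_features(text):
    """Return [number of characters, number of digits, number of uppercase letters]."""
    n_chars = len(text)
    n_digits = sum(ch.isdigit() for ch in text)
    n_upper = sum(ch.isupper() for ch in text)
    return [n_chars, n_digits, n_upper]

# Stack the extra columns to the right of the bag-of-words columns.
X_extra = np.array([extra_features(t) for t in texts], dtype=float)
X_combined = np.hstack([X_bow, X_extra])

print(X_combined.shape)  # (2, 5)
```

Such features would have to be computed from the raw texts (before lowercasing) and applied identically to the training, validation, and test sets.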

In [34]:
#Do later to improve the F1 and accuracy scores

#ADDING MORE NEW FEATURES


"""
YOUR CODE GOES HERE
"""

'\nYOUR CODE GOES HERE\n'

## Model

We train a logistic regression model on the bag-of-words features and save predictions for the training, validation, and test sets.

In [35]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)


# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model

Your task is to report the training, validation, and test accuracies and F1 scores.
**You are required to implement the `accuracy_score` and `f1_score` functions yourself, without using built-in Python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.
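
As a small worked example of the two metrics, the sketch below counts true positives, false positives, and false negatives on eight toy labels and plugs them into the standard formulas. The label vectors are made up for illustration, with 1 standing for spam and 0 for ham.

```python
# Toy labels (hypothetical): 1 = spam, 0 = ham.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 3
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 3
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 1

accuracy = (tp + tn) / len(pairs)                    # 6 / 8 = 0.75
precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75

print(accuracy, f1)  # 0.75 0.75
```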

In [36]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    # Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the fraction of
    # predictions that match the true labels.
    correct = 0
    for pred, true in zip(y_pred, y_true):
        if pred == true:
            correct += 1

    return correct / len(y_true)

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    # F1 = 2 * (precision * recall) / (precision + recall), where
    # precision = TP / (TP + FP) and recall = TP / (TP + FN).
    tp, fp, fn = 0, 0, 0
    for pred, true in zip(y_pred, y_true):
        if pred == 1 and true == 1:
            tp += 1
        elif pred == 1 and true == 0:
            fp += 1
        elif pred == 0 and true == 1:
            fn += 1

    # Without true positives, precision and recall are zero or undefined.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    return 2 * (precision * recall) / (precision + recall)

In [37]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.989, F1 score: 0.957
Validation accuracy: 0.981, F1 score: 0.925
Test accuracy: 0.980, F1 score: 0.919


**Question.**
Is accuracy the metric that logistic regression optimizes during training? If not, which metric does logistic regression optimize?

**Your answer:** 

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case where the accuracy is high but the model is not good? (Hint: why do we use the F1 score?)

**Your answer:** 

### Exploration of predictions

Show a few examples with correctly predicted labels from the training and validation sets.
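
One way to pick out such examples is a boolean mask over the predictions. The sketch below uses toy arrays (hypothetical data) in place of `train_texts`, `y_train`, and `y_train_pred`.

```python
import numpy as np

# Toy stand-ins for the texts, true labels, and predicted labels.
texts = np.array(["win cash now", "see you soon", "free prize", "call me later"], dtype=object)
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0])

# True where the prediction matches the label.
correct = y_true == y_pred

# Print up to three correctly classified examples with their labels.
for text, label in zip(texts[correct][:3], y_true[correct][:3]):
    print(label, text)
```

Negating the mask (`~correct`) selects the misclassified examples instead.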

In [None]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham


**Question.** Print 10 examples from the validation set that the model labeled incorrectly. Why do you think the model got them wrong?

**Your answer:** 

In [None]:
"""
YOUR CODE GOES HERE
"""

## End of Part 1.
