# Part 1.

The deadline for Part 1 is **1:30 pm Feb 6, 2020**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---


In this part we will preprocess the SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection.

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [2]:
!pip install wget
import wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=aceaf363b1b32d3ffd1a8a277db170167e2bafa84a3062987674de76ddc4b46e
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [3]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2020-02-03 21:18:38--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 108.177.125.101, 108.177.125.100, 108.177.125.138, ...
Connecting to docs.google.com (docs.google.com)|108.177.125.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/aie1bsnedlj43tn9bb8la3928q9gtbvn/1580760000000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2020-02-03 21:18:39--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/aie1bsnedlj43tn9bb8la3928q9gtbvn/1580760000000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 74.125.23.132, 2404:6800:4008:c02::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-0

In [4]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [5]:
!pip install pandas



In [6]:

import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

   v1                                                 v2
0   0  Go until jurong point, crazy.. Available only ...
1   0                      Ok lar... Joking wif u oni...
2   1  Free entry in 2 a wkly comp to win FA Cup fina...
3   0  U dun say so early hor... U c already then say...
4   0  Nah I don't think he goes to usf, he lives aro...


Your task is to split the data into train/dev/test sets. Make sure that each row appears in exactly one of the splits.

In [7]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
YOUR CODE GOES HERE
"""
# val = df.iloc[0:val_size]
# test = df.iloc[val_size:val_size+test_size]
# train = df.iloc[val_size+test_size:]

train_texts, train_labels = None, None
val_texts, val_labels     = None, None
test_texts, test_labels   = None, None
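print(df)
# Shuffle the rows so that every split is a random sample of the data
# (pass random_state=<int> to df.sample for a reproducible shuffle)
df = df.sample(frac=1).reset_index(drop=True)
print(df)

# Take consecutive, non-overlapping slices: val first, then test, the rest is train
val_texts = df.iloc[0:val_size].v2
val_labels = df.iloc[0:val_size].v1

test_texts = df.iloc[val_size:val_size+test_size].v2
test_labels = df.iloc[val_size:val_size+test_size].v1

train_texts = df.iloc[val_size+test_size:].v2
train_labels = df.iloc[val_size+test_size:].v1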

      v1                                                 v2
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
...   ..                                                ...
5567   1  This is the 2nd time we have tried 2 contact u...
5568   0              Will Ì_ b going to esplanade fr home?
5569   0  Pity, * was in mood for that. So...any other s...
5570   0  The guy did some bitching but I acted like i'd...
5571   0                         Rofl. Its true to its name

[5572 rows x 2 columns]
      v1                                                 v2
0      0      Anything lar then Ì_ not going home 4 dinner?
1      0    Yo, any way we could pick something up tonight?
2      0                             Yes.i'm in office da:)
3      0  If i 
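
A shuffled split can also be expressed on row positions alone, which makes the "each row appears in exactly one split" requirement easy to check. The sketch below is only an illustration under assumptions: `split_indices` and its `seed` parameter are hypothetical names that do not appear in the assignment, and the 0.15/0.15/0.7 proportions are taken from the cell above.

```python
import random

def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Return disjoint (train, val, test) lists of row positions."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible across runs
    random.Random(seed).shuffle(indices)
    val_size = int(n_rows * val_frac)
    test_size = int(n_rows * test_frac)
    val_idx = indices[:val_size]
    test_idx = indices[val_size:val_size + test_size]
    train_idx = indices[val_size + test_size:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(5572)

# Sanity checks: the splits are pairwise disjoint and together cover every row
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
assert len(train_idx) + len(val_idx) + len(test_idx) == 5572
print(len(train_idx), len(val_idx), len(test_idx))
```

The position lists could then be applied to the DataFrame with `df.iloc[train_idx]`, `df.iloc[val_idx]`, and `df.iloc[test_idx]`.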

## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent each sentence as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In the lab we used the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

The function `preprocess_data` takes a list of texts and returns a list of token lists (one list of tokens per text). 
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function. 
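
To make the expected input and output shapes concrete, here is a minimal sketch of such a function that uses only the standard library. It is a simplified stand-in for a `spacy` or `nltk` tokenizer, not a replacement for one: the names `simple_preprocess` and `TOKEN_PATTERN` are hypothetical, and lowercasing is an assumption that the assignment does not require.

```python
import re

# A word (optionally followed by an apostrophe part) or a single non-space symbol
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]")

def simple_preprocess(texts):
    """Lowercase each message and split it into word and punctuation tokens."""
    return [TOKEN_PATTERN.findall(text.lower()) for text in texts]

tokens = simple_preprocess(["Free entry in 2 a wkly comp!", "Ok lar... Joking wif u oni"])
print(tokens[0])
print(tokens[1])
```

Lowercasing merges variants such as `Free` and `free` into one vocabulary entry, which may or may not help a spam classifier, since capitalization can itself be a signal.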

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
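
The bag-of-words idea can be illustrated on a toy corpus before writing the full class. This is a hedged sketch, not the required `Vectorizer`: the helper names `build_vocab` and `to_count_vector` and the three toy documents are made up for illustration.

```python
from collections import Counter

def build_vocab(token_lists, max_features):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token: i for i, (token, _) in enumerate(counts.most_common(max_features))}

def to_count_vector(tokens, token_to_index):
    """Count in-vocabulary tokens; out-of-vocabulary tokens are dropped."""
    vector = [0] * len(token_to_index)
    for token in tokens:
        if token in token_to_index:
            vector[token_to_index[token]] += 1
    return vector

docs = [["free", "win", "free"], ["call", "me"], ["win", "free", "call"]]
token_to_index = build_vocab(docs, max_features=3)
print(token_to_index)
print(to_count_vector(["free", "free", "prize"], token_to_index))
```

Note that the vocabulary is built from the training documents only, so a token such as `prize` that never made it into the vocabulary contributes nothing to the vector.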


In [8]:
!pip install nltk
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    YOUR CODE GOES HERE
    """
    
    preprocessed_data = None
    preprocessed_data = []
    for message in data:
      preprocessed_data.append(word_tokenize(message))
    print(type(preprocessed_data))
    return preprocessed_data
    

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
<class 'list'>
<class 'list'>
<class 'list'>


In [0]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will return the index of the token in self.vocab_list
        """
        YOUR CODE GOES HERE
        """
        # Count how often each token occurs in the training data
        vocab_dict = {}
        for tokens in dataset:
            for token in tokens:
                vocab_dict[token] = vocab_dict.get(token, 0) + 1

        # Keep the max_features most frequent tokens (ties keep first-seen order)
        most = sorted(vocab_dict, key=vocab_dict.get, reverse=True)
        self.vocab_list = most[:self.max_features]

        # Map each kept token to its column index in the feature matrix
        self.token_to_index = {token: index for index, token in enumerate(self.vocab_list)}

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        # One pass over each message instead of scanning the whole vocabulary per message
        for row, tokens in enumerate(dataset):
            for token in tokens:
                index = self.token_to_index.get(token)
                # Tokens outside the vocabulary are ignored
                if index is not None:
                    data_matrix[row, index] += 1
        return data_matrix

In [0]:
max_features = 3000 # vocabulary size: keep the 3000 most frequent training tokens
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list


You can add more features to the feature matrix.
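
As one possible direction, surface statistics of the raw message can be computed alongside the token counts. The sketch below is an assumption-laden illustration: the function name `extra_features` and the three chosen features (length, digit ratio, uppercase ratio) are hypothetical and are not prescribed by the assignment.

```python
def extra_features(texts):
    """Return one row of hand-crafted features per raw message."""
    rows = []
    for text in texts:
        n_chars = len(text)
        n_digits = sum(ch.isdigit() for ch in text)
        n_upper = sum(ch.isupper() for ch in text)
        # Ratios keep the features comparable across short and long messages
        rows.append([n_chars, n_digits / max(n_chars, 1), n_upper / max(n_chars, 1)])
    return rows

rows = extra_features(["Call 0906 NOW", "ok see you"])
print(rows)
```

With the NumPy matrices used in this notebook, such rows could be appended column-wise, for example `np.hstack([X_train, extra_features(train_texts)])`, with the same call repeated for the val and test sets.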

In [11]:
"""
YOUR CODE GOES HERE
"""

'\nYOUR CODE GOES HERE\n'

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [0]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

## Performance of the model

Your task is to report train, val, test accuracies and F1 scores.
**You are required to implement the `accuracy_score` and `f1_score` methods without using built-in Python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.
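
For reference, both metrics can be written in terms of confusion-matrix counts. The notation is an assumption introduced here, not part of the assignment: spam (label 1) is treated as the positive class, and $TP$, $TN$, $FP$, $FN$ denote true positives, true negatives, false positives, and false negatives.

```latex
$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
$$
```

The last form of $F_1$ shows that true negatives do not enter the score at all, and that the score is undefined when $TP + FP + FN = 0$; a common convention is to report 0 in that case.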

In [0]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    accuracy = None
    correct = 0
    for i in range(len(y_true)):
      if y_true[i] == y_pred[i]:
        correct += 1
    accuracy = correct/len(y_true)
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    # Count true positives, false positives and false negatives in one pass
    tp = fp = fn = 0
    for i in range(len(y_true)):
        if y_true[i] == 1 and y_pred[i] == 1:
            tp += 1
        elif y_true[i] == 0 and y_pred[i] == 1:
            fp += 1
        elif y_true[i] == 1 and y_pred[i] == 0:
            fn += 1

    # 2 * precision * recall / (precision + recall) simplifies to 2*tp / (2*tp + fp + fn).
    # Return 0 when the denominator is 0 instead of raising ZeroDivisionError.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator > 0 else 0.0

    return f1

In [14]:
print(y_train)
print(y_train_pred)
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Training accuracy: 0.994, F1 score: 0.979
Validation accuracy: 0.981, F1 score: 0.926
Test accuracy: 0.980, F1 score: 0.917


**Question.**
Is accuracy the metric that logistic regression optimizes during training? If not, which metric does logistic regression optimize?

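**Your answer:** No. Logistic regression for binary classification does not optimize accuracy directly; it minimizes a loss function, the log loss (binary cross-entropy, i.e. the negative log-likelihood of the labels), to which `LogisticRegression` adds an L2 regularization term by default. This loss tells us how well the predicted probabilities of our hypothesis match the true labels of the training points.  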
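
The loss usually associated with logistic regression is the log loss (binary cross-entropy). The following sketch, with made-up labels and probabilities and hypothetical helper names `log_loss` and `accuracy`, illustrates why it differs from accuracy: two sets of predictions can have identical accuracy and still have different losses.

```python
import math

def log_loss(y_true, probs):
    """Average negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

def accuracy(y_true, probs, threshold=0.5):
    """Fraction of labels matched after thresholding the probabilities."""
    return sum((p >= threshold) == (y == 1) for y, p in zip(y_true, probs)) / len(y_true)

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # correct and far from the 0.5 threshold
hesitant = [0.6, 0.4, 0.55, 0.45]   # correct but close to the threshold

print(accuracy(y_true, confident), accuracy(y_true, hesitant))
print(log_loss(y_true, confident), log_loss(y_true, hesitant))
```

Both prediction sets classify every example correctly, yet the hesitant one has the larger log loss, so training keeps pushing probabilities toward the correct labels even after accuracy stops changing.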

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case where the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** High accuracy does not imply that the model is great, because accuracy can hide errors on a rare class when the classes are imbalanced. For example, suppose we have a dataset of 100 days, 85 without rain and 15 with rain. A model that always predicts "no rain" reaches 85% accuracy, which looks high, yet it misses all 15 rainy days (15 false negatives), so it is useless for predicting rain. The F1 score alleviates this problem by combining precision and recall for the positive class: here recall is 0, so F1 is 0, which gives a more appropriate picture of the model's quality. 
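
The rain example can be checked numerically. This is a small sketch of that scenario only; treating F1 as 0 when there are no true positives is a convention assumed here to avoid a division by zero.

```python
# 100 days: 85 without rain (label 0) and 15 with rain (label 1)
y_true = [0] * 85 + [1] * 15
# A degenerate model that always predicts "no rain"
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# With no true positives, F1 is taken to be 0 (this also avoids dividing by zero)
f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

print(accuracy, f1)
```

The accuracy comes out at 0.85 while F1 is 0, which matches the intuition that the model never detects the class we care about.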

### Exploration of predictions

Show a few examples with true+predicted labels on the train and val sets.

In [31]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham
print("Training set examples: ")
for i in range(2353,2360):
  print(train_data[i])
  print("y_train label:", y_train[i])
  print("y_train_pred prediction: ", y_train_pred[i])

print("\n")
print("Validation set examples: ")
for i in range(1,8):
  print(val_data[i])
  print("y_val label:", y_val[i])
  print("y_val_pred prediction: ", y_val_pred[i])

# print(train_data[2353:2360])
# print(y_train[2353:2360])
# print(y_train_pred[2353:2360])

Training set examples: 
['Urgent', 'Ur', 'å£500', 'guaranteed', 'award', 'is', 'still', 'unclaimed', '!', 'Call', '09066368327', 'NOW', 'closingdate04/09/02', 'claimcode', 'M39M51', 'å£1.50pmmorefrommobile2Bremoved-MobyPOBox734LS27YF']
y_train label: 1
y_train_pred prediction:  1
['Free', '1st', 'week', 'entry', '2', 'TEXTPOD', '4', 'a', 'chance', '2', 'win', '40GB', 'iPod', 'or', 'å£250', 'cash', 'every', 'wk', '.', 'Txt', 'POD', 'to', '84128', 'Ts', '&', 'Cs', 'www.textpod.net', 'custcare', '08712405020', '.']
y_train label: 1
y_train_pred prediction:  1
['No', 'other', 'Valentines', 'huh', '?', 'The', 'proof', 'is', 'on', 'your', 'fb', 'page', '.', 'Ugh', 'I', "'m", 'so', 'glad', 'I', 'really', 'DID', "N'T", 'watch', 'your', 'rupaul', 'show', 'you', 'TOOL', '!']
y_train label: 0
y_train_pred prediction:  0
['Aaooooright', 'are', 'you', 'at', 'work', '?']
y_train label: 0
y_train_pred prediction:  0
['Well', 'done', 'ENGLAND', '!', 'Get', 'the', 'official', 'poly', 'ringtone', 'or', 

**Question.** Print 10 examples from the val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

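**Your answer:** In the examples shown below, the model predicted not spam (0) although the true label was spam (1), so all of these errors are false negatives. On closer examination, several of the texts could pass as not spam because they are worded like ordinary text messages or personal emails. Others consist mostly of rare tokens such as phone numbers, codes, and all-caps words; since the tokenizer is case-sensitive and the vocabulary is limited to 3000 tokens, many of these tokens are probably out of vocabulary and give the model little evidence of spam. 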

In [32]:
"""
YOUR CODE GOES HERE
"""

count = 0
for i in range(len(val_data)):
  if y_val[i] != y_val_pred[i]:
    print(val_data[i])
    print("y_val label:", y_val[i])
    print("y_val_pred prediction: ", y_val_pred[i])
    count += 1
    # Stop once 10 misclassified examples have been printed
    if count == 10:
      break
    

['FROM', '88066', 'LOST', 'å£12', 'HELP']
y_val label: 1
y_val_pred prediction:  0
['100', 'dating', 'service', 'cal', ';', 'l', '09064012103', 'box334sk38ch']
y_val label: 1
y_val_pred prediction:  0
['0A', '$', 'NETWORKS', 'allow', 'companies', 'to', 'bill', 'for', 'SMS', ',', 'so', 'they', 'are', 'responsible', 'for', 'their', '\\suppliers\\', "''"]
y_val label: 1
y_val_pred prediction:  0
['For', 'sale', '-', 'arsenal', 'dartboard', '.', 'Good', 'condition', 'but', 'no', 'doubles', 'or', 'trebles', '!']
y_val label: 1
y_val_pred prediction:  0
['XCLUSIVE', '@', 'CLUBSAISAI', '2MOROW', '28/5', 'SOIREE', 'SPECIALE', 'ZOUK', 'WITH', 'NICHOLS', 'FROM', 'PARIS.FREE', 'ROSES', '2', 'ALL', 'LADIES', '!', '!', '!', 'info', ':', '07946746291/07880867867']
y_val label: 1
y_val_pred prediction:  0
['Hi', 'this', 'is', 'Amy', ',', 'we', 'will', 'be', 'sending', 'you', 'a', 'free', 'phone', 'number', 'in', 'a', 'couple', 'of', 'days', ',', 'which', 'will', 'give', 'you', 'an', 'access', 'to', '
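
One way to probe such errors is to check which tokens of a misclassified message fall outside the vocabulary and are therefore invisible to the model. The sketch below is illustrative only: `out_of_vocab` is a hypothetical helper and `toy_token_to_index` is a made-up four-token vocabulary; in the notebook one would pass `vectorizer.token_to_index` instead.

```python
def out_of_vocab(tokens, token_to_index):
    """Return the tokens of one message that the vectorizer would ignore."""
    return [token for token in tokens if token not in token_to_index]

# Toy vocabulary for illustration only (assumption, not the fitted vocabulary)
toy_token_to_index = {"to": 0, "call": 1, "free": 2, "service": 3}
# Tokens of the second misclassified validation message printed above
message = ["100", "dating", "service", "cal", ";", "l", "09064012103", "box334sk38ch"]

missing = out_of_vocab(message, toy_token_to_index)
print(missing)
print(f"{len(missing)} of {len(message)} tokens are outside the vocabulary")
```

A high share of ignored tokens would suggest that the error comes from vocabulary coverage rather than from the learned weights.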

## End of Part 1.
