# Homework 4 (Due Friday, Nov. 19th, 11:59pm PST)

1. Identify **three pairs of documents** in the McDonalds review dataset that have over 0.85 cosine similarity using average token word2vec embeddings from spacy.

Let's load the dependencies and our data, then inspect it.

In [1]:
import re
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from matplotlib.pyplot import figure

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

import spacy
from spacy import displacy

from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate

from tqdm.notebook import tqdm
mcd = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding="ISO-8859-1")
nlp = spacy.load("en_core_web_md")



In [2]:
mcd.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


## Cleaning data 

While spaCy handles the tokenization, POS tagging, stopword, and lemmatization steps of data cleaning, we can still benefit from consolidating concepts: the logic below maps menu items and common themes back to single concepts.

In [3]:
def consolidate_concepts(text):
    cleaned_reviews = []
    for review in text['review']:
        review = re.sub(r"(?:mcdonald's?|mcdonalds?|macdonalds?|mcds?)",'_MCDONALD_', review, flags=re.IGNORECASE)
        review = re.sub(r"(?:burgers?|cheeseburgers?|hamburgers?|hamburgersandwiches?)",'_HAMBURGER_', review, flags=re.IGNORECASE)
        review = re.sub(r"(?:McNuggets?|nuggets?|nugs?)",'_NUGGET_', review, flags=re.IGNORECASE)
        review = re.sub(r"(?:fries?|frys?|french fries?)",'_FRIES_', review, flags=re.IGNORECASE)
        cleaned_reviews.append(review)
    
    text['review_cleaned'] = cleaned_reviews
    return text

In [4]:
cleaned_mcd = consolidate_concepts(mcd)
cleaned_mcd.head()

Unnamed: 0,_unit_id,city,review,review_cleaned
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be...","I'm not a huge _MCDONALD_ lover, but I've been..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave...","First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo...","Well, it's _MCDONALD_, so you know what the fo..."
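
The patterns in `consolidate_concepts` match anywhere inside a word, so `fries?` also matches the start of "friend" and `nugs?` matches inside "snug". Below is a minimal sketch of one way to avoid this, assuming word-boundary anchors (`\b`) are acceptable for this data. The helper name `consolidate_with_boundaries` and the sample sentence are illustrative and not part of the original notebook.

```python
import re

# Same alternations as in consolidate_concepts, wrapped in \b ... \b so that
# only whole words are replaced (an assumption, not the notebook's behavior).
BOUNDED_PATTERNS = [
    (re.compile(r"\b(?:McNuggets?|nuggets?|nugs?)\b", re.IGNORECASE), "_NUGGET_"),
    (re.compile(r"\b(?:fries?|frys?|french fries?)\b", re.IGNORECASE), "_FRIES_"),
]

def consolidate_with_boundaries(review):
    """Replace whole-word menu-item mentions with a single concept token."""
    for pattern, concept in BOUNDED_PATTERNS:
        review = pattern.sub(concept, review)
    return review

sample = "My friend had a snug seat and cold french fries"

# The unanchored pattern from the notebook also rewrites part of "friend"
unbounded = re.sub(r"(?:fries?|frys?|french fries?)", "_FRIES_", sample, flags=re.IGNORECASE)
bounded = consolidate_with_boundaries(sample)

print(unbounded)
print(bounded)
```

Applying this to the real data would change the cleaned text, so any similarity scores would need to be recomputed.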


## Run reviews through spaCy pipeline
    
Let's make the data simple for analysis by creating columns for each spaCy NLP attribute we want.

While running the pipeline, let's compare the document embedding of each review with the reviews that follow it, which gives pairwise comparisons between documents (cut short once enough matches are found, since the full comparison is slow). Pairs with cosine similarity above 0.85 are written out for further analysis.

In [5]:
def add_spaCy_cols(text):
    similar_reviews=pd.DataFrame()
    review_1 = []
    review_2 = []
    similarity_l = []
    for n in tqdm(range(len(text['review_cleaned']))):
        target_raw_review=text.iloc[n].review
        target_review=text.iloc[n].review_cleaned
        nlp_target_review=nlp(target_review)
        clean_review_list=list(text['review_cleaned'])
        for object_review in text.iloc[n+1:].review_cleaned:
            object_idx = clean_review_list.index(object_review)
            object_raw_review=text.iloc[object_idx].review
            nlp_object_review = nlp(object_review)
            similarity = nlp_target_review.similarity(nlp_object_review)
            if similarity > 0.85:
                review_1.append(target_raw_review)
                review_2.append(object_raw_review)
                similarity_l.append(similarity)
            if len(review_1) > 2:  # Comparing every pair is slow: once 3 matches have been collected in total, stop scanning further reviews for this target
                break
    similar_reviews['review_1']=review_1
    similar_reviews['review_2']=review_2
    similar_reviews['similarity']=similarity_l
    return similar_reviews

In [9]:
similar_reviews = add_spaCy_cols(mcd)

  0%|          | 0/1525 [00:00<?, ?it/s]
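
The nested loop above re-parses every later review for each target review, which is why the full run is slow. A minimal sketch of an alternative, assuming each review has already been reduced to one averaged vector (for example `doc.vector` from a single pass through `nlp`): compute all pairwise cosine similarities in one call and keep the pairs above the threshold. The function name `similar_pairs` and the toy vectors are illustrative stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similar_pairs(vectors, threshold=0.85):
    """Return (i, j, similarity) for every pair i < j above the threshold."""
    sims = cosine_similarity(vectors)                  # full n x n similarity matrix
    rows, cols = np.triu_indices(len(vectors), k=1)    # upper triangle: each pair once
    mask = sims[rows, cols] > threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(rows[mask], cols[mask])]

# Toy stand-ins for document vectors (hypothetical values)
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])

pairs = similar_pairs(toy_vectors)
print(pairs)
```

With real data, `vectors` would be a matrix with one row per review, and the returned indices can be used to look up the raw review text.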

## Let's get 3 random review pairs with over 0.85 similarity

Because the reviews all describe negative experiences at McDonald's, many of them have high similarity to one another. If these reviews were mixed with reviews of other sentiment, or with reviews of other establishments, this code would be more effective at generating insights.

In [10]:
for idx, row in similar_reviews.sample(3).iterrows():
    print("-----------Review 1-------------", row['review_1'][:300], sep ='\n')
    print("-----------Review 2-------------", row['review_2'][:300], sep ='\n')
    print("SIMILARITY: ", round(row['similarity'], 5))
    print('')


-----------Review 1-------------
Y'all want $15 an hour but I'm waiting 30 minutes for mcnuggets. Y'all got me McHeated.
-----------Review 2-------------
Can't expect much from a McDonald's but this is one of the worst I've been to. The employees are rude, they always get your orders wrong and then have an attitude when they have to fix it. Freshness is an issue as well. Wait time in the drive thru isn't too bad, but don't make the mistake of seeing 
SIMILARITY:  0.85966

-----------Review 1-------------
I work directly across from this place and go for coffee usually - rarely the breakfast. They ALWAYS get my order wrong - especially through the drive-through. Be careful for overcharging - and be specific with your order, otherwise they will give you a larger size than what you asked for. It has ha
-----------Review 2-------------
3 stars is about as good a review as a McDonald's can get. It's McDonald's, what more can you say? I reviewed this one because their coca-cola had the best 

# Using the `SMS_test` and `SMS_train` datasets, build a classification model 

(You can simply use the `sklearn.linear_model.LogisticRegression` model.) Please attempt at least two of the vectorization techniques below:
* `CountVectorization`
* `TfIdfVectorization`
* `word2vec` spacy document-level vectors

Make sure you perform the following:
* use train/test split
* use proper model evaluation metrics
* text preprocessing (regex, stemming/lemmatization, stopword removal, grouping entities, etc.)

A discussion of the following:
* **What techniques** you tried to improve the performance of your model.
* What you would try to do, given more time, that would improve the performance of your model.
* Provide an example of two **error cases** - a false positive and a false negative - that your model got wrong, and why the model did not predict the correct answer.

In [11]:
sms_test = pd.read_csv('SMS_test.csv',  encoding="ISO-8859-1")
sms_train = pd.read_csv('SMS_train.csv',  encoding="ISO-8859-1")
sms_train.head(5)


# map the Spam / Non-Spam labels to binary (1 = spam, 0 = non-spam)
sms_train['Label'] = sms_train['Label'].map({'Spam':1, 'Non-Spam':0})
sms_test['Label'] = sms_test['Label'].map({'Spam':1, 'Non-Spam':0})

# separate our y labels for future training
y_train = sms_train['Label']
y_test = sms_test['Label']


In [12]:
print(f"Our test data is ~{round((sms_test.shape[0] / (sms_train.shape[0]+sms_test.shape[0]))*100)}\
% of the total available data")

Our test data is ~12% of the total available data


### Functions for Model Evaluation: (Grid search, CV)

In [13]:
## stole this code from another class

import itertools as it

def clf_score(clf, x_train, y_train, cv=20, n_jobs=-1):
    labels = []
    train_scores = []
    test_scores = []
    score = cross_validate(clf, x_train, y_train, scoring='f1', cv=cv, n_jobs=n_jobs,
                           return_train_score=True, return_estimator=True)
    train_scores.append(score['train_score'])
    test_scores.append(score['test_score'])
    print(f"Mean Train Score:{np.mean(score['train_score'])} \n Mean Test Score:{np.mean(score['test_score'])}")
    return(np.mean(score['test_score']))

def get_paramsList(params_grid):
    """
    Create all possible combinations of params.
    Returns a list of all param names and a list of all param combinations.
    """
    allNames = sorted(params_grid)
    combinations = it.product(*(params_grid[Name] for Name in allNames))
    all_params = list(combinations)
    return allNames, all_params

def param_search(model, X, y, param_grid, verbose = True, scoring = 'ks', 
                 smote = True, stacking = False, models = None):
    """
    Brute force search through param_grid to find the parameter combination with the best
    mean cross-validated F1 score, as computed by clf_score (20-fold CV).
    The scoring, smote, stacking and models arguments are carried over from the original
    version of this helper and are not used here.
    """
    param_names, all_params = get_paramsList(param_grid)
    print("Total combination:", len(all_params))
    best_score = 0
    best_param = None
    best_smote = None
    best_scores = None
    
    count = 0
    labels = []
    train_scores = []
    test_scores = []
    param_list = []
    score_list = []
    for cur_params in all_params:
        params = dict(zip(param_names, cur_params))
        model.set_params(**params)
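        # NOTE: despite its name, r2 holds the mean cross-validated F1 score from clf_score
        r2 = clf_score(model, X, y, 20, n_jobs=-1)

        if verbose:
            # the "R2" label is historical; the value printed is the F1 score
            print("\t", params, f"R2: {r2:.3f}")

        score=r2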
        
        param_list.append(params)
        score_list.append(score)
        
        if score > best_score:
            best_score = score
            best_param = params
            best_smote = smote
            best_scores = [r2]

        count += 1
        if count%10 == 0:
            print(f"{count} combinations searched")
    all_scores = pd.DataFrame([param_list,score_list])
    print("Best param:", best_param)
    print("Best scores (f1):", best_scores)
    return best_param, best_scores, all_scores
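
The hand-rolled search above can also be expressed with scikit-learn's built-in `GridSearchCV`, which handles the parameter combinations, cross-validation, and score bookkeeping. This is a minimal sketch rather than a drop-in replacement: the synthetic data from `make_classification`, the `LogisticRegression` estimator, the `C` grid, and `cv=5` are all assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the SMS features
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},   # hypothetical grid
    scoring="f1",                          # same metric as clf_score
    cv=5,
    return_train_score=True,
)
search.fit(X_toy, y_toy)

print("Best params:", search.best_params_)
print("Best mean CV F1:", round(search.best_score_, 3))
```

The per-combination results are available afterwards in `search.cv_results_`, which plays the role of the `all_scores` frame returned by `param_search`.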

In [14]:
def find_unique_characters(regex, lines):
    """
    Finds unique characters from a list of strings, almost certainly inefficiently 
    
    """
    # Collect every match of the given regex in each string (a list of lists of matching characters)
    potential_malforms = [re.findall(regex, review) for review in lines]

    # let's whittle this list of lists down to a unique set, btw this took me way longer than it needed to
    unique_malforms = set([char for review in potential_malforms for char in review])
    
    print(F"Number of unique potential Malformed Characters: {len(unique_malforms)}, \n\nCandidates: {unique_malforms}")
    return unique_malforms

def clean_sms(df):
    cleaned_message = []
    for message in df['Message_body']:
        cleaned_message.append(re.sub(r"[^A-Za-z0-9 ]",'',message))
    
    df['cleaned_messages'] = cleaned_message
    return df 

def sms_spaCy_cols(text):
    word_embeddings = []
    ents = []
    ent_type = []
    for review in text['Message_body']:
        doc = nlp(review)
        word_embeddings.append(doc.vector)
        ents.append(doc.ents)
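        for token in doc:
            # NOTE: ent_list is re-created for every token, so after the loop it holds
            # only the entity type of the last token ('' when that token is not an entity)
            ent_list = []
            ent_list.append(token.ent_type_)      
        ent_type.append(ent_list)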
    text["doc_embeddings"] = word_embeddings
    text["entities"] = ents
    text['entity_types'] = ent_type
        
    return text

#part of speech logic stolen from: https://www.programiz.com/python-programming/methods/set/update

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def tfidf_process_documents(tot_reviews):
    """ For each document, clean punctuation, tag with POS, lemmatize each word and remove stopwords"""
    cleaned_sms = []
    for review in tot_reviews:
        # Clean punctuation
        clean_review = re.sub(r"[^A-Za-z0-9 ]",'', review)

        # Tokenize into words 
        lemmatized_word = []
        #Tag words with part of speech 
        nltk_tagged = nltk.pos_tag(nltk.word_tokenize(clean_review))  
        wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
        lemmatizer = WordNetLemmatizer()

        # lemmatize, using the part of speech tag if available
        for word, tag in wordnet_tagged:
            if tag is None:
                #if there is no available tag, append the token as is
                lemmatized_word.append(word)
            else:        
                #else use the tag to lemmatize the token
                lemmatized_word.append(lemmatizer.lemmatize(word, tag))

        words_clean = []
        for word in lemmatized_word:
            # nltk_stopwords is a module-level list that must be defined before this function is called
            if word in nltk_stopwords:
                continue
            words_clean.append(word)
        cleaned_review = " ".join(words_clean)
        cleaned_sms.append((cleaned_review))

    return cleaned_sms

## Word to Vec Embeddings

### Feature engineering (one hot encoding entity types)
The entity types mentioned in a message may be useful features for detecting spam. To this end, we create one-hot encodings of the entity types found in each SMS message.

In [15]:
# clean sms
sms_train = clean_sms(sms_train)
sms_test = clean_sms(sms_test)

# run through spaCy pipeline
sms_train_spacy = sms_spaCy_cols(sms_train)
sms_test_spacy = sms_spaCy_cols(sms_test)


In [16]:
# Stole list one hot encoding code from: https://stackoverflow.com/questions/52189126

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

test_onehot = pd.DataFrame(mlb.fit_transform(sms_test_spacy['entity_types']),
                   columns=mlb.classes_,
                   index=sms_test_spacy['entity_types'].index)

train_onehot = pd.DataFrame(mlb.fit_transform(sms_train_spacy['entity_types']),
                   columns=mlb.classes_,
                   index=sms_train_spacy['entity_types'].index)

# drop junk column
train_onehot.drop(columns =[''], inplace=True)
test_onehot.drop(columns =[''], inplace=True)

# join back to main dfs 
sms_train_spacy = train_onehot.join(sms_train_spacy)
sms_test_spacy = test_onehot.join(sms_test_spacy)
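
Calling `fit_transform` separately on the train and test sets lets each set define its own columns, so the two one-hot frames are not guaranteed to line up. A minimal sketch of the usual alternative, assuming the encoder should learn its classes from the training data only: fit once on train, then call `transform` on test. The entity lists below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical entity-type lists, one list per message
train_types = [["PERSON"], ["CARDINAL", "ORG"], []]
test_types = [["ORG"], ["GPE"]]   # "GPE" never appears in the training data

mlb = MultiLabelBinarizer()
train_onehot = pd.DataFrame(mlb.fit_transform(train_types), columns=mlb.classes_)

# transform() reuses the classes learned from train; unseen classes are
# ignored (scikit-learn emits a warning) instead of creating new columns
test_onehot = pd.DataFrame(mlb.transform(test_types), columns=mlb.classes_)

print(train_onehot.columns.tolist())
print(test_onehot)
```

Both frames then share the same columns in the same order, so a model trained on one can score the other.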

### Train test preprocessing and vector flattening

In [17]:
X_train = pd.concat([pd.DataFrame(sms_train_spacy['doc_embeddings'].values.flat),sms_train_spacy.iloc[:,:8]], axis = 1)
X_test = pd.concat([pd.DataFrame(sms_test_spacy['doc_embeddings'].values.flat),sms_test_spacy.iloc[:,:8]], axis = 1)

#lets join all our train test data for CV
X_trntst = pd.concat([X_train, X_test], ignore_index=True).fillna(0).drop(columns = ['S. No.'])
Y_trntst = pd.concat([y_train, y_test], ignore_index=True).fillna(0).drop(columns = ['S. No.'])

In [18]:
%%time
param_grid = {
    'hidden_layer_sizes': [300, 100],
    'activation': ['relu', 'tanh'],
    'solver': ['sgd'],
    'learning_rate':['adaptive'],
    'max_iter':[200],
    'learning_rate_init':[0.1],
    'alpha': [0.0001]
}


model = MLPClassifier()
params, scores, dataframe = param_search(model, X_trntst, Y_trntst, param_grid, verbose = True)

Total combination: 4
Mean Train Score:1.0 
 Mean Test Score:0.9108965372432862
	 {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': 300, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.911
Mean Train Score:1.0 
 Mean Test Score:0.9048032533865081
	 {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': 100, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.905
Mean Train Score:1.0 
 Mean Test Score:0.897537113729064
	 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': 300, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.898
Mean Train Score:1.0 
 Mean Test Score:0.8996799708719212
	 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': 100, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.900
Best param: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes'

## TFIDF Classifier

A decent classifier: the best mean cross-validated F1 score is in the mid 0.70s (0.745), still worse than the word-embedding model (0.911).

In [19]:
def tfidf(df):
    vectorizer = TfidfVectorizer(ngram_range=(2, 2),
                                 max_features=500)
    X = vectorizer.fit_transform(df['Message_body'])
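    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    terms = vectorizer.get_feature_names_out()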
    tf_idf = pd.DataFrame(X.toarray(), columns=terms)

    return tf_idf 

In [20]:
# Cleaning with regex, POS tagging, lemmatization, stopword removal 

nltk_stopwords=list(set(stopwords.words('english')))

tf_sms_train = tfidf_process_documents(sms_train['Message_body'])
tf_sms_test = tfidf_process_documents(sms_test['Message_body'])

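# TFIDF vectorize (note: tfidf() reads the raw Message_body column, so the cleaned tf_sms_* lists above are not used here)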
tf_X_train = tfidf(sms_train)
tf_X_test = tfidf(sms_test)

tf_X_trntst = pd.concat([tf_X_train, tf_X_test], ignore_index=True).fillna(0)
tf_Y_trntst = pd.concat([y_train, y_test], ignore_index=True).fillna(0)

In [21]:
%%time
param_grid = {
    'hidden_layer_sizes': [300, 100],
    'activation': ['relu', 'tanh'],
    'solver': ['sgd'],
    'learning_rate':['adaptive'],
    'max_iter':[200],
    'learning_rate_init':[0.1],
    'alpha': [0.0001]
}


model = MLPClassifier()
params, scores, dataframe = param_search(model, tf_X_trntst, tf_Y_trntst, param_grid, verbose = True)

Total combination: 4
Mean Train Score:0.9525317270239011 
 Mean Test Score:0.7263279522335249
	 {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': 300, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.726
Mean Train Score:0.9520854966581428 
 Mean Test Score:0.7453664799253035
	 {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': 100, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.745
Mean Train Score:0.9524412601004186 
 Mean Test Score:0.7286340237849525
	 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': 300, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.729
Mean Train Score:0.9525196214589249 
 Mean Test Score:0.740422870902747
	 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': 100, 'learning_rate': 'adaptive', 'learning_rate_init': 0.1, 'max_iter': 200, 'solver': 'sgd'} R2: 0.740
Best param: 

### Rebuild model to find misclassified messages 


In [22]:
model = MLPClassifier(activation= 'relu', alpha = 0.0001, hidden_layer_sizes = 100,\
                      learning_rate='adaptive', learning_rate_init= 0.1, max_iter= 200, solver= 'sgd')
model.fit(tf_X_train, y_train)


# Let's find some misclassified messages

predictions = model.predict(tf_X_test)

sms_test['tfidf_pred'] = predictions

sms_test[sms_test['tfidf_pred'] != sms_test['Label']]



Unnamed: 0,S. No.,Message_body,Label,cleaned_messages,doc_embeddings,entities,entity_types,tfidf_pred
0,1,"UpgrdCentre Orange customer, you may now claim...",1,UpgrdCentre Orange customer you may now claim ...,"[-0.061032206, 0.19988048, -0.07019498, -0.023...","((UpgrdCentre, Orange), (0207, 153, 9153), (26...",[],0
1,2,"Loan for any purpose £500 - £75,000. Homeowner...",1,Loan for any purpose 500 75000 Homeowners Te...,"[-0.11778805, 0.2355668, -0.1783199, -0.033316...",(),[],0
2,3,Congrats! Nokia 3650 video camera phone is you...,1,Congrats Nokia 3650 video camera phone is your...,"[-0.16864648, 0.17816488, 0.012670864, -0.0110...","((Nokia), (3650), (16, +), (300603), (BCM4284,...",[PERSON],0
3,4,URGENT! Your Mobile number has been awarded wi...,1,URGENT Your Mobile number has been awarded wit...,"[-0.10755558, 0.18265584, -0.024203632, -0.094...","((2000), (09058094455), (3030), (12hrs))",[],0
4,5,Someone has contacted our dating service and e...,1,Someone has contacted our dating service and e...,"[-0.012037433, 0.21381874, -0.17289515, 0.0333...","((09111032124),)",[],0
...,...,...,...,...,...,...,...,...
99,100,Congratulations ur awarded 500 of CD vouchers ...,1,Congratulations ur awarded 500 of CD vouchers ...,"[-0.18412688, 0.0780921, -0.108404435, 0.00680...","((500), (125gift), (&, Free), (2, 100), (87066))",[CARDINAL],0
100,101,Not directly behind... Abt 4 rows behind ü...,0,Not directly behind Abt 4 rows behind,"[0.03910618, 0.051001094, -0.2445561, -0.08481...","((4),)",[],1
102,103,URGENT! This is the 2nd attempt to contact U!U...,1,URGENT This is the 2nd attempt to contact UU h...,"[-0.0497304, 0.16805999, -0.019893367, -0.0411...","((U!U), (£, 1000CALL), (09071512432), (max£7),...",[CARDINAL],0
105,106,Wanna have a laugh? Try CHIT-CHAT on your mobi...,1,Wanna have a laugh Try CHITCHAT on your mobile...,"[-0.03521142, 0.12336205, -0.15226537, 0.05641...","((4217), (London), (16))",[],0
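
The filter above returns every misclassified row together. To pull out false positives and false negatives separately, the two conditions can be combined with boolean masks. A minimal sketch on a toy frame; the column name `pred` and the messages are made up, while `Label` follows the notebook's convention (1 = spam, 0 = non-spam).

```python
import pandas as pd

toy = pd.DataFrame({
    "Message_body": ["win a free phone", "see you at 5",
                     "call this number now", "lunch tomorrow?"],
    "Label": [1, 0, 1, 0],   # true labels
    "pred":  [1, 1, 0, 0],   # hypothetical model predictions
})

# False positive: predicted spam, actually non-spam
false_positives = toy[(toy["pred"] == 1) & (toy["Label"] == 0)]
# False negative: predicted non-spam, actually spam
false_negatives = toy[(toy["pred"] == 0) & (toy["Label"] == 1)]

print(false_positives["Message_body"].tolist())
print(false_negatives["Message_body"].tolist())
```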


In [23]:
sms_test.iloc[109]['Message_body']

'Should i buy him a blackberry bold 2 or torch. Should i buy him new or used. Let me know. Plus are you saying i should buy the  &lt;#&gt; g wifi ipad. And what are you saying about the about the  &lt;#&gt; g?'

In [24]:
sms_test.iloc[4]['Message_body']

'Someone has contacted our dating service and entered your phone because they fancy you! To find out who it is call from a landline 09111032124 . PoBox12n146tf150p'

## Summary

**Best Test F1 Score:**

TFIDF: 0.745

Word2Vec: 0.911
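
For reference, the F1 score reported here is the harmonic mean of precision and recall for the spam class. A minimal sketch with made-up labels, checking scikit-learn's `f1_score` against the formula; none of these numbers come from the notebook's models.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = non-spam
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1, manual_f1)
```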

--------------------------------------------------------------------------
--------------------------------------------------------------------------    

* *What techniques* you tried to improve the performance of your model.
    - Param grid search 
    - 20-fold CV to ensure accurate score reporting
    - Turning named entity types into one hot encoding categorical variables (word2vec)


* What you would try to do, given more time, that would improve the performance of your model.
    - More model frameworks (logistic regression, random forest, boosted trees, light gradient boosting)
    - Tinker with params in grid search
    - Tinker with CV to get different train data splits
    - More feature engineering 


* Provide an example of two **error cases** - a false positive and a false negative - that your model got wrong, and why the model did not predict the correct answer.

**Below mis-classifications are from TFIDF:**

    - False Positive Original Message (S. No. 110): 
    
<i>
Should i buy him a blackberry bold 2 or torch. Should i buy him new or used. Let me know. Plus are you saying i should buy the  &lt;#&gt; g wifi ipad. And what are you saying about the about the  &lt;#&gt; g?</i>


    
We can infer that the model has learned that offers of electronics (blackberry, ipad) are often spam, a pattern that is frequent among the spam messages in the raw data. That is likely why this non-spam discussion of electronics gifts was classified as spam.

--------------------------------------------------------------------------    
    - False Negative Original Message (S. No. 5)
    
<i>
'Someone has contacted our dating service and entered your phone because they fancy you! To find out who it is call from a landline 09111032124 . PoBox12n146tf150p'
</i>



    
The false negative is harder to explain, since to a human reader this message is very clearly spam. We can hypothesize that there are not enough mentions of dating services in our spam dataset, so the model did not learn this relationship. This could likely be avoided in the future with a larger dataset that contains more of these common dating spam messages.

    
    