# Homework 4 (Due Friday, Nov. 19th, 11:59pm PST)

1. Identify **three pairs of documents** in the McDonalds review dataset that have over 0.85 cosine similarity using average token word2vec embeddings from spacy.

Lets load dependencies, our data, and inspect it

In [102]:
import re
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from matplotlib.pyplot import figure


import spacy
from spacy import displacy

from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


mcd = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding="ISO-8859-1")
nlp = spacy.load("en_core_web_md")

In [2]:
mcd.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


## Cleaning data 

While spacy handles tokenization, POS, stopword, and lemmatiziation steps of data cleaning we can still benefit from consolidating concepts using the logic below to map menu items and common themes back to single concepts

In [3]:
def consolidate_concepts(text):
    cleaned_reviews = []
    for review in text['review']:
        review = re.sub(r"(?:mcdonald's?|mcdonalds?|macdonalds?|mcds?)",'_MCDONALD_', review, flags=re.IGNORECASE)
        review = re.sub(r"(?:burgers?|cheeseburgers?|hamburgers?|hamburgersandwiches?)",'_HAMBURGER_', review, flags=re.IGNORECASE)
        review = re.sub(r"(?:McNuggets?|nuggets?|nugs?)",'_NUGGET_', review, flags=re.IGNORECASE)
        review = re.sub(r"(?:fries?|frys?|french fries?)",'_FRIES_', review, flags=re.IGNORECASE)
        cleaned_reviews.append(review)
    
    text['review_cleaned'] = cleaned_reviews
    return text

In [4]:
cleaned_mcd = consolidate_concepts(mcd)
cleaned_mcd.head()

Unnamed: 0,_unit_id,city,review,review_cleaned
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be...","I'm not a huge _MCDONALD_ lover, but I've been..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave...","First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo...","Well, it's _MCDONALD_, so you know what the fo..."


## Run reviews through spaCy pipeline
    
Lets make the data simple for analysis by creating columns for each spacy NLP attribute we want.

While running the pipeline lets compare similarity of document embeddings to each document that has already been processed, this means we will have pairwise comparisons bewteen all documents. we will write reviews with above .85 cos similarity out to be further analyzed

In [51]:
def add_spaCy_cols(text, similarity_df):
    word_embeddings = []
    ents = []
    docs = []
    for review in text['review']:
        doc = nlp(review)
        word_embeddings.append(doc.vector)
        ents.append(doc.ents)
        for count, vector in enumerate(doc):
            if doc.similarity(vector) > .85:
                similarity_df = similarity_df.append({'review_1':review, 'review_2':text.iloc[count, 2],\
                                                      'similarity':doc.similarity(vector)},ignore_index=True)
        docs.append(doc)
    text["doc_embeddings"] = word_embeddings
    text["entities"] = ents

        
    return text, similarity_df

In [52]:
similar_reviews = pd.DataFrame(columns = ['review_1', 'review_2', 'similarity'])

mcd, similar_reviews = add_spaCy_cols(mcd, similar_reviews)

  if doc.similarity(vector) > .85:


## Lets get 3 random reviews that were over .85 similarity

In [53]:
for idx, row in similar_reviews.sample(3).iterrows():
    print("-----------Review 1-------------", row['review_1'][:300], sep ='\n')
    print("-----------Review 2-------------", row['review_2'][:300], sep ='\n')
    print("SIMILARITY: ", round(row['similarity'], 5))
    print('')



-----------Review 1-------------
So my first experience with this McDonald's was the day that I moved in. I had a couple friends helping me. This McDonald's is directly across the street from my apartment.So, after witnessing a guy smash his van (we later were told by the po-po that it was stolen) into the light pole right at the c
-----------Review 2-------------
Worst McDonalds in the history of McDonalds. All of their employees are rude, serve cold food, and have bad attitudes.. Your better off with Taco Bell a mile away. Don't waste your time, or money.
SIMILARITY:  0.85908

-----------Review 1-------------
OK. so. When we first moved here..... We had a bunch of friends visit and come stay at our brand new house back in 2008? I don't remember... a while back. But this joint was still fairly new. We came here on the way to the airport to drop off our weekend guests at the McCarran for a quick bite to ea
-----------Review 2-------------
We came here the day after Christmas with our 2

# Using the `SMS_test` and `SMS_train` datasets, build a classification model 

(you can simply use the `sklearn.linear_model.LogisticRegression` model used. Please attempt at least two of the vectorization techniques below:
* `CountVectorization`
* `TfIdfVectorization`
* `word2vec` spacy document-level vectors

Make sure you perform the following:
* use train/test split
* use proper model evaluation metrics
* text preprocessing (regex, stemming/lemmatization, stopword removal, grouping entities, etc.)

A discussion of the following:
* **What techniques** you tried to improve the performance of your model.
* What you would try to do, given more time, that would improve the performance of your model.
* Provide an example of two **error cases** - a false positive and a false negative - that your model got wrong, and why the model did not predict the correct answer.

In [6]:
sms_test = pd.read_csv('SMS_test.csv',  encoding="ISO-8859-1")
sms_train = pd.read_csv('SMS_train.csv',  encoding="ISO-8859-1")
sms_train.head(5)

Unnamed: 0,S. No.,Message_body,Label
0,1,Rofl. Its true to its name,Non-Spam
1,2,The guy did some bitching but I acted like i'd...,Non-Spam
2,3,"Pity, * was in mood for that. So...any other s...",Non-Spam
3,4,Will ü b going to esplanade fr home?,Non-Spam
4,5,This is the 2nd time we have tried 2 contact u...,Spam


In [69]:
print(f"Our test data is ~{round((sms_test.shape[0] / (sms_train.shape[0]+sms_test.shape[0]))*100)}\
% of the total available data")

Our test data is ~12% of the total available data


In [7]:
def find_unique_characters(regex, lines):
    """
    Finds unique characters from a list of strings, almost certainly inefficiently 
    
    """
    #Match anything that is non alpha-numeric or whitespace, creates list of lists of matching characters
    potential_malforms = [re.findall(regex, review) for review in lines]

    #lets whittle down this list of lists to a unqiue list, btw this took me way longer than it needed to
    unique_malforms = set([char for review in potential_malforms for char in review])
    
    print(F"Number of unique potential Malformed Characters: {len(unique_malforms)}, \n\nCandidates: {unique_malforms}")
    return unique_malforms

In [8]:
def clean_sms(df):
    cleaned_message = []
    for message in df['Message_body']:
        cleaned_message.append(re.sub(r"[^A-Za-z0-9 ]",'',message))
    
    df['cleaned_messages'] = cleaned_message
    return df 

In [72]:
sms_train = clean_sms(sms_train)
sms_test = clean_sms(sms_test)

In [73]:
def sms_spaCy_cols(text):
    word_embeddings = []
    ents = []
    ent_type = []
    for review in text['Message_body']:
        doc = nlp(review)
        word_embeddings.append(doc.vector)
        ents.append(doc.ents)
        for token in doc:
            ent_list = []
            ent_list.append(token.ent_type_)      
        ent_type.append(ent_list)
    text["doc_embeddings"] = word_embeddings
    text["entities"] = ents
    text['entity_types'] = ent_type
        
    return text

In [74]:
sms_train_spacy = sms_spaCy_cols(sms_train)
sms_test_spacy = sms_spaCy_cols(sms_test)

In [75]:
str(sms_train_spacy[sms_train_spacy['S. No.']==953]['Message_body'].values)

'["hows my favourite person today? r u workin hard? couldn\'t sleep again last nite nearly rang u at 4.30"]'

## Feature engineering 

### One hot encoding entity types
 It may be a variable of interest to have entity types including in aiding the detection of spam, to this end we have created one hot encodings of the entity types mentioned in each sms message

In [100]:
# Stole list one hot encoding code from: https://stackoverflow.com/questions/52189126

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

test_onehot = pd.DataFrame(mlb.fit_transform(sms_test_spacy['entity_types']),
                   columns=mlb.classes_,
                   index=sms_test_spacy['entity_types'].index)

train_onehot = pd.DataFrame(mlb.fit_transform(sms_train_spacy['entity_types']),
                   columns=mlb.classes_,
                   index=sms_train_spacy['entity_types'].index)

#fill columns that dont exist in test data set and fill with 0s 
test_onehot['GPE'] = 0 
test_onehot['LAW'] = 0

# drop junk column
train_onehot.drop(columns =[''], inplace=True)
test_onehot.drop(columns =[''], inplace=True)

# join back to main dfs 
sms_train_spacy = train_onehot.join(sms_train_spacy)
sms_test_spacy = test_onehot.join(sms_test_spacy)

In [101]:
sms_train_spacy

Unnamed: 0,CARDINAL,DATE,GPE,LAW,MONEY,ORG,PERSON,PRODUCT,TIME,S. No.,Message_body,Label,cleaned_messages,doc_embeddings,entities,entity_types
0,0,0,0,0,0,0,0,0,0,1,Rofl. Its true to its name,Non-Spam,Rofl Its true to its name,"[0.09184885, 0.14416684, -0.2082083, -0.357044...",(),[]
1,0,0,0,0,0,0,0,0,0,2,The guy did some bitching but I acted like i'd...,Non-Spam,The guy did some bitching but I acted like id ...,"[-0.06460267, 0.17402254, -0.21391848, -0.0767...","((next, week),)",[]
2,0,0,0,0,0,0,0,0,0,3,"Pity, * was in mood for that. So...any other s...",Non-Spam,Pity was in mood for that Soany other suggest...,"[-0.05494839, 0.19570266, -0.13729948, -0.1639...",(),[]
3,0,0,0,0,0,0,0,0,0,4,Will ü b going to esplanade fr home?,Non-Spam,Will b going to esplanade fr home,"[0.082603335, 0.08576301, -0.27380592, -0.1264...",(),[]
4,0,0,0,0,0,0,0,0,0,5,This is the 2nd time we have tried 2 contact u...,Spam,This is the 2nd time we have tried 2 contact u...,"[-0.14296803, 0.27839, -0.02023539, -0.0788021...","((2nd), (2), (£, 750, Pound), (2), (Only, 10p)...",[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
952,1,0,0,0,0,0,0,0,0,953,hows my favourite person today? r u workin har...,Non-Spam,hows my favourite person today r u workin hard...,"[-0.07621735, 0.2518853, -0.15996766, -0.09141...","((today), (rang, u), (4.30))",[CARDINAL]
953,0,0,0,0,0,0,0,0,0,954,How much you got for cleaning,Non-Spam,How much you got for cleaning,"[-0.19166715, 0.31868, -0.28999516, -0.1143298...",(),[]
954,0,0,0,0,0,0,0,0,0,955,Sorry da. I gone mad so many pending works wha...,Non-Spam,Sorry da I gone mad so many pending works what...,"[0.01588447, 0.05694315, -0.2272758, -0.262425...",(),[]
955,0,0,0,0,0,0,0,0,0,956,Wat time ü finish?,Non-Spam,Wat time finish,"[-0.0678088, 0.165192, -0.058875404, -0.404218...",(),[]


In [107]:
x = sms_train_spacy['doc_embeddings'].values

In [108]:
pd.DataFrame(x)

Unnamed: 0,0
0,"[0.09184885, 0.14416684, -0.2082083, -0.357044..."
1,"[-0.06460267, 0.17402254, -0.21391848, -0.0767..."
2,"[-0.05494839, 0.19570266, -0.13729948, -0.1639..."
3,"[0.082603335, 0.08576301, -0.27380592, -0.1264..."
4,"[-0.14296803, 0.27839, -0.02023539, -0.0788021..."
...,...
952,"[-0.07621735, 0.2518853, -0.15996766, -0.09141..."
953,"[-0.19166715, 0.31868, -0.28999516, -0.1143298..."
954,"[0.01588447, 0.05694315, -0.2272758, -0.262425..."
955,"[-0.0678088, 0.165192, -0.058875404, -0.404218..."


In [111]:
pd.DataFrame(x[0])

Unnamed: 0,0
0,0.091849
1,0.144167
2,-0.208208
3,-0.357044
4,0.185335
...,...
295,-0.141610
296,-0.105271
297,0.032140
298,-0.047698
