# spaCy NLP Solution
## Description
This solution takes a more conventional Python NLP approach, using the spaCy package together with spaczz, which adds fuzzy matching to spaCy. spaCy is a Python NLP library for named entity recognition (NER), part-of-speech (POS) tagging and other NLP tasks. I will use it to find content titles in the survey responses.

In [180]:
import pandas as pd
import re
import spacy
from spacy.matcher import PhraseMatcher
from spaczz.matcher import SimilarityMatcher
from spaczz.pipeline import SpaczzRuler
# Try to load the required language model; download it first if it is missing
try:
    nlp = spacy.load("en_core_web_md")
except OSError:
    spacy.cli.download("en_core_web_md")
    nlp = spacy.load("en_core_web_md")

## Importing and Transforming the Data
This time I won't apply an initial transformation to the content titles, and I will transform the survey responses only lightly. spaCy takes sentence punctuation into account, so stripping it would likely harm performance.

In [181]:
content_titles_df = pd.read_csv('ProvidedFiles/Content sample.csv')
survey_response_df = pd.read_csv('ProvidedFiles/Survey response sample data.csv')
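# Tidy spacing only: newlines become spaces, commas and hyphens get a trailing space, runs of spaces collapse
survey_response_df['transformed'] = survey_response_df['Response'].apply(
    lambda text: re.sub(' +', ' ', text.replace('\n', ' ').replace(',', ', ').replace('-', '- ')))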
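
To make the effect of this light clean-up concrete, here is a minimal standalone sketch of the same string operations applied to a fragment of one survey response. The helper name `normalise_response` is hypothetical and not part of the notebook; it only mirrors the replacements used above.

```python
import re

def normalise_response(text):
    """Keep punctuation, but tidy the spacing around it (hypothetical helper)."""
    # Newlines become spaces; commas and hyphens get a trailing space
    spaced = text.replace('\n', ' ').replace(',', ', ').replace('-', '- ')
    # Collapse any runs of spaces introduced by the replacements
    return re.sub(' +', ' ', spaced)

sample = "History drama-Vikings,Kid friendly-Casper"
print(normalise_response(sample))
# History drama- Vikings, Kid friendly- Casper
```

The titles end up separated from the neighbouring words by whitespace, so the tokenizer is more likely to see them as separate tokens, while the commas and hyphens themselves are kept.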

## Building the NLP Matcher
For each content name I generate several patterns (four here); with a few more regex replacements I could add others to make matching more robust. I'm using spaczz's SimilarityMatcher for this task.

In [182]:
matcher = SimilarityMatcher(nlp.vocab)
for _, row in content_titles_df.iterrows():
    name = row['Content_name']
    patterns = [
        nlp(name),                                                      # title as given
        nlp(name.lower()),                                              # lowercased
        nlp(re.sub(' +', ' ', re.sub(r'[^\w ]', '', name).lower())),    # punctuation removed
        nlp(re.sub(' +', ' ', re.sub(r'[^\w ]', ' ', name).lower())),   # punctuation replaced by spaces
    ]
    matcher.add(name, patterns)
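
As an illustration of what the four patterns look like for a single title, the sketch below builds them as plain strings, without spaCy. The function name `title_variants` is hypothetical. The character class is written here as `[^\w ]`, on the assumption that a literal `|` inside the brackets is not meant to be preserved.

```python
import re

def title_variants(title):
    """Return the four surface forms of a title as plain strings (hypothetical helper)."""
    lowered = title.lower()
    # Drop punctuation entirely, then collapse repeated spaces
    stripped = re.sub(' +', ' ', re.sub(r'[^\w ]', '', title).lower())
    # Replace punctuation with spaces, then collapse repeated spaces
    spaced = re.sub(' +', ' ', re.sub(r'[^\w ]', ' ', title).lower())
    return [title, lowered, stripped, spaced]

for variant in title_variants("C.B. Strike"):
    print(repr(variant))
# 'C.B. Strike'
# 'c.b. strike'
# 'cb strike'
# 'c b strike'
```

The third variant is the one that lines up with a response that writes the title as "CB Strike".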



## Generating the Results

In [183]:
customer_id_column = []
response_column = []
title_column = []
score_column = []
for index, row in survey_response_df.iterrows():
    doc = nlp(row['transformed'])
    matches = matcher(doc)
    for label, start, end, ratio in matches:
        customer_id_column.append(row['Customer_id'])
        response_column.append(row['Response'])
        title_column.append(label)  # the label is the content name passed to matcher.add
        score_column.append(ratio)
result_dict = {
    'Customer_id':customer_id_column,
    'Response':response_column,
    'Title':title_column,
    'Score': score_column}
result_df = pd.DataFrame(result_dict).sort_values(by=['Score'], ascending=False)
result_df.head()



| | Customer_id | Response | Title | Score |
|---|---|---|---|---|
| 0 | 1 | Fear the walking dead,Supernatural (huge fan a... | Fear the Walking Dead | 100 |
| 10 | 3 | Miss scarlet and the duke,knifes out,Dublin mu... | Knives Out | 100 |
| 22 | 5 | The Undoing,Game of thrones,Outlander, Vikings... | C.B. Strike | 100 |
| 21 | 5 | The Undoing,Game of thrones,Outlander, Vikings... | Vikings | 100 |
| 20 | 5 | The Undoing,Game of thrones,Outlander, Vikings... | Outlander | 100 |


## Wrapping Up
You may have noticed that the output table is not quite in the requested format of [customer ID, response, names of content found]. It is reasonably straightforward to write a function that transforms result_df above into the required format:

In [184]:
def results_transformer(old_result_df, threshold):
    """Collapse per-match rows into one row per response, keeping matches that score at least `threshold`."""
    response_list = []
    titles_found_list = []
    customer_id_list = []
    for _, row in old_result_df.iterrows():
        if row['Score'] >= threshold:
            if row['Response'] not in response_list:
                # First qualifying match for this response: start a new output row
                response_list.append(row['Response'])
                titles_found_list.append(row['Title'])
                customer_id_list.append(row['Customer_id'])
            else:
                # Response already seen: append the title to its existing row
                out_index = response_list.index(row['Response'])
                titles_found_list[out_index] += ", {}".format(row['Title'])
    new_result_dict = {
        'Customer_id': customer_id_list,
        'Response': response_list,
        'Titles Found': titles_found_list}
    return pd.DataFrame(new_result_dict)
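
The same collapse can also be sketched with a plain dictionary keyed on the customer ID rather than on the response text. This is only an alternative sketch, not the notebook's method: the function name `collapse_matches` is hypothetical, it assumes each customer has a single response, it additionally drops repeated titles, and the scores in the sample rows are made up for illustration.

```python
def collapse_matches(rows, threshold):
    """Group (customer_id, response, title, score) tuples into one record per customer."""
    grouped = {}
    for customer_id, response, title, score in rows:
        if score < threshold:
            continue  # ignore weak matches
        record = grouped.setdefault(
            customer_id, {"Response": response, "Titles Found": []})
        if title not in record["Titles Found"]:
            record["Titles Found"].append(title)  # keep each title once
    return [
        {"Customer_id": cid,
         "Response": rec["Response"],
         "Titles Found": ", ".join(rec["Titles Found"])}
        for cid, rec in grouped.items()
    ]

# Illustrative rows only; the scores are invented for this example
rows = [
    (3, "knifes out,Dublin murders", "Knives Out", 100),
    (3, "knifes out,Dublin murders", "Dublin Murders", 96),
    (3, "knifes out,Dublin murders", "Ma", 60),
]
print(collapse_matches(rows, 85))
# [{'Customer_id': 3, 'Response': 'knifes out,Dublin murders', 'Titles Found': 'Knives Out, Dublin Murders'}]
```

Keying on the customer ID means two customers who typed identical responses would still get separate rows, which the response-keyed version above would merge.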
            

In [185]:
new_result_df = results_transformer(result_df, 85)
df_styler = new_result_df.style.set_properties(**{'text-align': 'left'})
df_styler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

| | Customer_id | Response | Titles Found |
|---|---|---|---|
| 0 | 1 | Fear the walking dead,Supernatural (huge fan and sad it has finished),The Gentlemen, Outlander | Fear the Walking Dead, Supernatural, Outlander, The Gentlemen |
| 1 | 3 | Miss scarlet and the duke,knifes out,Dublin murders | Knives Out, Dublin Murders, Miss Scarlet and the Duke |
| 2 | 5 | The Undoing,Game of thrones,Outlander, Vikings,CB Strike (and most all British dramas) Westworld | C.B. Strike, Vikings, Outlander, Game of Thrones, The Undoing, Westworld |
| 3 | 4 | History drama-Vikings,Kid friendly-Casper,Sometimes the conversations while watching Neon can get serious but we all end up having fun together, :) | Casper, Vikings |
| 4 | 2 | A lot! -good doctor -gangs of London - the gentleman -ma -spies in disguise | Spies in Disguise, Ma, Gangs of London, The Good Doctor, The Gentlemen |
