## Comparing baseline to NER for Kaggle data

Here we compare the accuracy of named entity recognition (NER) for brands with that of our baseline search on the adapted [Kaggle data](https://www.kaggle.com/kevinhartman0/advertisement-transcripts-from-various-industries) in `data/additional_data.json`.

In [46]:
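import pandas as pd
import numpy as np

# Model predictions and baseline results for the test transcriptions
df = pd.read_json('test_data_pred.json')
df_baseline = pd.read_json('baseline_corrects.json')

# Join on the shared index; the inner join keeps only rows present in both frames
df_comparison = df[['transcription', 'brand', 'seen_in_training', 'predictions', 'corrects']].join(df_baseline, how='inner')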

In [47]:
def print_accuracy(correct_predictions, label):
    """Print incorrect/correct counts and the fraction correct for a boolean Series."""
    n_total = correct_predictions.shape[0]
    n_correct = correct_predictions.sum()
    print('{} incorrect, {} correct {} ({:.2f} correct)'.format(
        n_total - n_correct, n_correct, label, n_correct / n_total))

print('Model:')
print_accuracy(df_comparison['corrects'], 'overall')
unseen = df_comparison[~df_comparison['seen_in_training']]
print_accuracy(unseen['corrects'], 'in unseen')
print_accuracy(df_comparison[df_comparison['seen_in_training']]['corrects'], 'in seen')
print('---------------------')
print('Baseline:')
print_accuracy(df_comparison['bl_corrects'], 'overall')
print_accuracy(unseen['bl_corrects'], 'in unseen')
print_accuracy(df_comparison[df_comparison['seen_in_training']]['bl_corrects'], 'in seen')
print('---------------------')

Model:
100 incorrect, 162 correct overall (0.62 correct)
95 incorrect, 52 correct in unseen (0.35 correct)
5 incorrect, 110 correct in seen (0.96 correct)
---------------------
Baseline:
149 incorrect, 113 correct overall (0.43 correct)
147 incorrect, 0 correct in unseen (0.00 correct)
2 incorrect, 113 correct in seen (0.98 correct)
---------------------
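
The per-subset figures above can also be computed in a single pass, since the mean of a boolean column is the fraction of `True` values. The following is a minimal sketch on a made-up stand-in for `df_comparison`; the `toy` frame and its values are assumptions for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for df_comparison; the values are invented for illustration only.
toy = pd.DataFrame({
    'seen_in_training': [True, True, False, False, False],
    'corrects':         [True, True, True, False, False],
    'bl_corrects':      [True, False, False, False, False],
})

# The mean of a boolean column is its fraction of True values, i.e. the accuracy.
accuracy_by_subset = toy.groupby('seen_in_training')[['corrects', 'bl_corrects']].mean()
overall_accuracy = toy[['corrects', 'bl_corrects']].mean()

print(accuracy_by_subset)
print(overall_accuracy)
```

This gives model and baseline accuracy side by side for the seen and unseen subsets, which may be easier to scan than the repeated `print_accuracy` calls.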


We can focus specifically on the cases where the brand was already seen but the model returned the wrong brand:

In [48]:
df_comparison[df_comparison['seen_in_training'] & ~df_comparison['corrects']]

| | transcription | brand | seen_in_training | predictions | corrects | bl_preds | bl_corrects |
|---|---|---|---|---|---|---|---|
| 94 | Children diagnosed with cancer need specialize... | Memorial Sloan Kettering Cancer Center | True | [Memorial Sloan, Memorial Sloan, Best Cancer] | False | [memorial sloan kettering cancer center] | True |
| 680 | Could there be a new IRA in your future? Perha... | American Express | True | [Roth IRA, American Express, Roth IRA] | False | [roth ira] | False |
| 1697 | Long distance service you thought couldnt get ... | AT&T | True | [] | False | [at&t] | True |
| 257 | Benzel-Busch Motor Car Corporation is one of t... | Benzel-Busch Motor Car Corporation | True | [Car Corporation, Mercedes-Benz] | False | [benzel busch motor car corporation] | True |
| 256 | For nearly half a century, Benzel-Busch Motor ... | Benzel-Busch Motor Car Corporation | True | [Mercedes-Benz Dealers] | False | [benzel busch motor car corporation, mercedes ... | True |
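
To see at a glance how often the model and the baseline agree or disagree, the two correctness columns can be cross-tabulated. This is a small sketch on invented data: the `toy` frame is a hypothetical stand-in for `df_comparison`, and only the column names `corrects` and `bl_corrects` come from the notebook.

```python
import pandas as pd

# Toy stand-in for df_comparison; the values are invented for illustration only.
toy = pd.DataFrame({
    'corrects':    [True, True, False, False, True],
    'bl_corrects': [True, False, True, False, False],
})

# Rows: whether the model was correct; columns: whether the baseline was correct.
agreement = pd.crosstab(toy['corrects'], toy['bl_corrects'])
print(agreement)

# Rows where the baseline was right but the model was wrong.
baseline_only = toy[~toy['corrects'] & toy['bl_corrects']]
print(baseline_only)
```

Applied to the real frame, the same filter could be combined with `seen_in_training` to isolate the cases listed above where only the baseline found the brand.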


Conversely, here are ten randomly sampled transcriptions where the model succeeded even though it had not seen the brand in training:

In [58]:
df_comparison[~df_comparison['seen_in_training'] & df_comparison['corrects']].sample(10)

| | transcription | brand | seen_in_training | predictions | corrects | bl_preds | bl_corrects |
|---|---|---|---|---|---|---|---|
| 1922 | The feast is Europe, and you are the chef. Wor... | World Airways | False | [World Airways, World Airways] | True | [] | False |
| 911 | My memories of Grandmother always include her ... | Gold Medal Flour | False | [Grandmas, GOLD MEDAL FLOUR] | True | [always] | False |
| 844 | California almonds are in! When youre talking ... | California Almonds | False | [California Almonds] | True | [] | False |
| 1224 | I have a confession to make. I really dont lik... | Green Thumb | False | [Green Thumb, Green Thumb] | True | [] | False |
| 1546 | When youre looking for the hottest basketball ... | Just For Feet | False | [Just For Feet] | True | [] | False |
| 452 | A curated line up...packed with features like ... | Mitsubishi Motors | False | [Mitsubishi Motors] | True | [] | False |
| 1675 | Its time to get struck...with your 2012-13 Wic... | Wichita Thunder | False | [Berry Conference Champion Thunder, Central Ho... | True | [] | False |
| 1391 | Find your inspiration. Biography Magazine. In ... | Biography Magazine | False | [Biography Magazine, Ernest Shakletons Antarct... | True | [] | False |
| 1099 | Sick and tired of feeling sick and tired? Use ... | Proleva | False | [Proleva, Proleva, Proleva, Proleva] | True | [] | False |
| 1555 | This week at Publix, enjoy savory Boars Head R... | Publix | False | [Publix, Roast] | True | [] | False |


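One of these rows (index 1391) is worth a closer look, since its predictions include other named entities from the transcription besides the brand:

In [60]: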
print(df_comparison.loc[1391,'predictions'])
df_comparison.loc[1391,'transcription']

['Biography Magazine', 'Ernest Shakletons Antarctica', 'Kathleen Turner', 'Biography Magazine']


'Find your inspiration. Biography Magazine. In the April issue, Sandra Bullock. Plus, Ernest Shakletons Antarctica. Anne Heche. Kathleen Turner. And much more. For whoever you are, find your inspiration in Biography Magazine. Every life has a story.'
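
The prediction list above mixes the brand with other entities, but the brand is the only one that repeats. The notebook does not show how `corrects` is derived from `predictions`, so the following is only a hypothetical sketch of one way to reduce such a list to a single guess: take the most frequent entity, ignoring case. The helper `most_frequent_entity` is an assumption introduced for illustration, not a function from this project.

```python
from collections import Counter


def most_frequent_entity(predictions):
    """Return the most common prediction, lower-cased, or None for an empty list."""
    if not predictions:
        return None
    counts = Counter(p.lower() for p in predictions)
    return counts.most_common(1)[0][0]


# Predictions for row 1391, as printed above.
predictions = ['Biography Magazine', 'Ernest Shakletons Antarctica',
               'Kathleen Turner', 'Biography Magazine']
top = most_frequent_entity(predictions)

print(top)
print(top == 'Biography Magazine'.lower())
```

A frequency-based choice like this relies on advertisers repeating their own name, which holds for this transcription but would need checking against the rest of the data.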