## Comparing baseline to NER for Kaggle data

Here we compare the accuracy of named entity recognition for brands to that of our baseline search, over the adapted [Kaggle data](https://www.kaggle.com/kevinhartman0/advertisement-transcripts-from-various-industries) in `data/additional_data.json`. The output of the baseline calculation is stored in `baseline_corrects.json`, while the scored model predictions are stored in `test_data_pred.json`.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_json('test_data_pred.json')
df_baseline = pd.read_json('baseline_corrects.json')
df_comparison = df[['transcription', 'brand', 'seen_in_training', 'predictions', 'corrects']].join(df_baseline, how='inner')
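The join above relies on both files sharing the same row index. As a minimal sketch of that behaviour, the toy frames below (hypothetical values, not taken from the data) show that `DataFrame.join` with `how='inner'` keeps only the index labels present in both frames:

```python
import pandas as pd

# Hypothetical toy data: three model rows, but only two have a baseline result.
preds = pd.DataFrame({"corrects": [True, False, True]}, index=[10, 11, 12])
baseline = pd.DataFrame({"bl_corrects": [True, True]}, index=[11, 12])

# join() aligns on the index; how="inner" drops labels missing from either side.
joined = preds.join(baseline, how="inner")
print(joined)
```

If the two JSON files did not cover exactly the same rows, comparing `len(df)` with `len(df_comparison)` after the join would reveal how many rows were dropped.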

In [2]:
def print_accuracy(correct_predictions, string):
    """Print incorrect/correct counts and the fraction correct for a boolean Series."""
    n_total = correct_predictions.shape[0]
    n_correct = correct_predictions.sum()
    print('{} incorrect, {} correct {} ({:.2f} correct)'.format(
        n_total - n_correct, n_correct, string, n_correct / n_total))

# Split once into brands that were / were not seen during training
seen = df_comparison[df_comparison['seen_in_training']]
unseen = df_comparison[~df_comparison['seen_in_training']]

print('Model:')
print_accuracy(df_comparison['corrects'], 'overall')
print_accuracy(unseen['corrects'], 'in unseen')
print_accuracy(seen['corrects'], 'in seen')
print('---------------------')
print('Baseline:')
print_accuracy(df_comparison['bl_corrects'], 'overall')
print_accuracy(unseen['bl_corrects'], 'in unseen')
print_accuracy(seen['bl_corrects'], 'in seen')
print('---------------------')

Model:
99 incorrect, 163 correct overall (0.62 correct)
95 incorrect, 52 correct in unseen (0.35 correct)
4 incorrect, 111 correct in seen (0.97 correct)
---------------------
Baseline:
149 incorrect, 113 correct overall (0.43 correct)
147 incorrect, 0 correct in unseen (0.00 correct)
2 incorrect, 113 correct in seen (0.98 correct)
---------------------
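The same seen/unseen breakdown can also be produced as a single table. The sketch below is one possible way to do it with `groupby`; the helper name `accuracy_table` and the toy frame are illustrative assumptions, and only the column names are taken from the cells above.

```python
import pandas as pd

def accuracy_table(df, correct_cols=("corrects", "bl_corrects"), by="seen_in_training"):
    """Fraction correct per group and overall, for each boolean correctness column."""
    cols = list(correct_cols)
    scores = df[cols].astype(float)            # True/False -> 1.0/0.0
    per_group = scores.groupby(df[by]).mean()  # one row per value of `by`
    per_group.index = per_group.index.map({True: "seen", False: "unseen"})
    overall = scores.mean().to_frame().T       # single row with the overall means
    overall.index = ["overall"]
    return pd.concat([per_group, overall])

# Hypothetical toy data with the same column names as df_comparison.
toy = pd.DataFrame({
    "seen_in_training": [True, True, False, False],
    "corrects": [True, True, True, False],
    "bl_corrects": [True, False, False, False],
})
summary = accuracy_table(toy)
print(summary)
```

Calling `accuracy_table(df_comparison)` should give the fractions printed above, with model and baseline side by side.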


We can focus specifically on the cases where the brand was already seen but the model returned the wrong brand:

In [3]:
df_comparison[df_comparison['seen_in_training'] & ~df_comparison['corrects']]

| | transcription | brand | seen_in_training | predictions | corrects | bl_preds | bl_corrects |
|---|---|---|---|---|---|---|---|
| 618 | Has your hair lost its luster? Missing its bou... | Va-Va-Va-Voom | True | [Va-va, Va-Va] | False | [va va va voom] | True |
| 1697 | Long distance service you thought couldnt get ... | AT&T | True | [] | False | [at&t] | True |
| 257 | Benzel-Busch Motor Car Corporation is one of t... | Benzel-Busch Motor Car Corporation | True | [Benzel-Busch, Mercedes-Benz] | False | [benzel busch motor car corporation] | True |
| 256 | For nearly half a century, Benzel-Busch Motor ... | Benzel-Busch Motor Car Corporation | True | [Benzel-Busch, Benzel-Busch] | False | [benzel busch motor car corporation, mercedes ... | True |
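Three of these four failures look like near-misses: the model extracted only part of the brand name (`Benzel-Busch`, `Va-Va`). This notebook does not show how `corrects` was scored, so the sketch below is not that scoring function. It is a hypothetical helper, `is_partial_match`, that flags rows where a prediction is a proper word-level fragment of the brand. The normalization (lowercase, punctuation replaced by spaces) is an assumption, loosely modelled on how the `bl_preds` values appear.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics into a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def is_partial_match(brand, predictions):
    """True if some prediction is a proper, word-aligned fragment of the brand."""
    target = normalize(brand)
    for prediction in predictions:
        candidate = normalize(prediction)
        if not candidate or candidate == target:
            continue  # skip empty predictions and exact matches
        # Pad with spaces so that matches respect word boundaries.
        if f" {candidate} " in f" {target} ":
            return True
    return False

print(is_partial_match("Benzel-Busch Motor Car Corporation", ["Benzel-Busch", "Mercedes-Benz"]))
print(is_partial_match("AT&T", []))
```

Applied row-wise to `brand` and `predictions`, a check like this would separate truncated-span errors from cases such as the AT&T row, where the model predicted nothing at all.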


Conversely, here is a random sample of transcriptions where the model succeeded even though the brand was not seen in training:

In [4]:
df_comparison[~df_comparison['seen_in_training'] & df_comparison['corrects']].sample(10)

| | transcription | brand | seen_in_training | predictions | corrects | bl_preds | bl_corrects |
|---|---|---|---|---|---|---|---|
| 1019 | Sunchips lovers believe that wholeness is the ... | Sunchips | False | [SunChips, SunChips, SunChips, SunChips, Sunch... | True | [] | False |
| 1757 | Cancun is the #1 Spring Break destination, per... | Cancun | False | [Cancun, Yucatan Peninsula] | True | [] | False |
| 631 | Does your child have difficulty reading? Is ma... | Lumleys Learning Center | False | [Lumleys Learning Center, Lumleys Learning Cen... | True | [] | False |
| 666 | The intensity of our concentration cannot be o... | American Century | False | [American Century] | True | [always] | False |
| 1546 | When youre looking for the hottest basketball ... | Just For Feet | False | [Just For Feet] | True | [] | False |
| 703 | Pretty much everyone hates big banks because t... | Compass Bank | False | [Pretty, Compass Bank] | True | [] | False |
| 452 | A curated line up...packed with features like ... | Mitsubishi Motors | False | [Mitsubishi Crossover Family, Mitsubishi Motors] | True | [] | False |
| 1534 | If you need to escape the daily grind, come to... | Best Buy | False | [BEST BUY, BEST BUY] | True | [] | False |
| 50 | Today, in the United States, over 134,000 chil... | Dave Thomas Foundation For Adoption | False | [Dave Thomas Foundation For Adoption] | True | [] | False |
| 623 | Whatever your goals in life, wherever your car... | Bryant University | False | [Bryant University] | True | [] | False |


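We can also inspect the predictions for a single row (index 1391) alongside its full transcription:

In [5]: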
print(df_comparison.loc[1391,'predictions'])
df_comparison.loc[1391,'transcription']

['Biography Magazine', 'Ernest Shakletons Antarctica', 'Kathleen Turner', 'Biography Magazine']


'Find your inspiration. Biography Magazine. In the April issue, Sandra Bullock. Plus, Ernest Shakletons Antarctica. Anne Heche. Kathleen Turner. And much more. For whoever you are, find your inspiration in Biography Magazine. Every life has a story.'