This notebook explores the results of predicting the `click_rate` from a `source_article` to a `target_article` using different models: Doc2Vec, Wikipedia2Vec, Smash-RNN Paragraph Level, and Smash-RNN Word Level. Results for Smash-RNN Sentence Level are pending.

The `ResultsAnalyzer` class encapsulates the logic for computing the results. Main features:
- `get_top_5_predicted_by_article_and_model(source_article, model)`: returns the top 5 predictions for the `source_article`. The `is_in_top_5` column shows whether the `target_article` is in the **actual** top 5 by click rate.
- `ResultsAnalyzer.results`: a pandas DataFrame containing the consolidated results.
- `get_sample_source_articles`: samples 10 random source articles, which can be used to check the results manually.
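
The source of `ResultsAnalyzer` is not shown in this notebook, so the following is only a minimal sketch of how the top 5 lookup and the `is_in_top_5` flag could be computed with pandas. The function name `top_5_predicted` and the toy data are hypothetical; the column names are taken from the `results` DataFrame shown in this notebook, and the real implementation may differ.

```python
import pandas as pd


def top_5_predicted(results: pd.DataFrame, source_article: str, model: str) -> pd.DataFrame:
    """Hypothetical stand-in for get_top_5_predicted_by_article_and_model."""
    # Keep only the rows for the requested source article and model.
    mask = (results["source_article"] == source_article) & (results["model"] == model)
    subset = results[mask]

    # Target articles with the 5 highest *actual* click rates.
    actual_top_5 = set(subset.nlargest(5, "actual_click_rate")["target_article"])

    # The 5 rows the model ranks highest, flagged by membership in the actual top 5.
    predicted_top_5 = subset.nlargest(5, "predicted_click_rate").copy()
    predicted_top_5["is_in_top_5"] = predicted_top_5["target_article"].isin(actual_top_5)
    return predicted_top_5


# Toy data (made up for illustration, not taken from the real results).
toy = pd.DataFrame({
    "model": ["paragraph"] * 7,
    "source_article": ["A"] * 7,
    "target_article": list("BCDEFGH"),
    "actual_click_rate": [0.30, 0.01, 0.02, 0.05, 0.04, 0.03, 0.001],
    "predicted_click_rate": [0.04, 0.03, 0.001, 0.02, 0.002, 0.025, 0.05],
})

top_5 = top_5_predicted(toy, "A", "paragraph")
print(top_5[["target_article", "predicted_click_rate", "is_in_top_5"]])
```

In the toy data, targets `H` and `C` are ranked highly by the model but have the two lowest actual click rates, so they are flagged `False`.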

In [1]:
import pandas as pd
from results_analyzer import ResultsAnalyzer

results_analyzer = ResultsAnalyzer()

Getting a sample of the results

In [2]:
results_analyzer.results.sample(n=10)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate
7508,word,John Bonham,Dave Grohl,0.013362,0.051661
7082,doc2vec,Paul Giamatti,Man on the Moon (film),0.012309,0.030798
5601,paragraph,Gene Hackman,The Hunting Party (1971 film),0.005673,0.052326
295,wikipedia2vec,Homer,Ionia,0.012511,0.016803
2446,wikipedia2vec,1995 UEFA Champions League Final,Finidi George,0.040576,0.05712
10466,wikipedia2vec,Barrymore family,Maurice Barrymore,0.043892,0.05712
1234,doc2vec,A Wizard of Earthsea,Earthsea (universe),0.069578,0.03899
6671,wikipedia2vec,Serial killer,Robert Hansen,0.007918,0.027473
3741,wikipedia2vec,Carol Kane,Gene Wilder,0.009337,0.063328
1363,word,George S. Patton,Denazification,0.01014,0.023474


Getting a sample of the source articles

In [3]:
results_analyzer.get_sample_source_articles()

2520                                  Willow Smith
3421                 Band of Brothers (miniseries)
2531        List of mobile phone brands by country
6095                                 Marx Brothers
955                              Arrow (TV series)
5183                       Prabhu Deva filmography
6918                 Caroline, Princess of Hanover
1597    Sabrina the Teenage Witch (1996 TV series)
3963                Transformers: Dark of the Moon
4359                                  Tenochtitlan
Name: source_article, dtype: object

Getting all the available models (the `word` and `paragraph` models refer to Smash-RNN; the `sentence` level is pending)

In [5]:
results_analyzer.get_models()

array(['doc2vec', 'wikipedia2vec', 'word', 'paragraph'], dtype=object)

Getting the top 5 predictions for a `source_article` and a `model`

In [7]:
sample_source_article = "Shania Twain"
model = "paragraph"

results_analyzer.get_top_5_predicted_by_article_and_model(sample_source_article, model)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate,is_in_top_5
3902,paragraph,Shania Twain,"Robert John ""Mutt"" Lange",0.312118,0.03925,True
4815,paragraph,Shania Twain,RIAA certification,0.002681,0.028766,False
3025,paragraph,Shania Twain,List of highest-certified music artists in the...,0.006869,0.028324,False
4442,paragraph,Shania Twain,"Windsor, Ontario",0.01015,0.026451,True
370,paragraph,Shania Twain,Come On Over,0.045549,0.026083,True


Next steps:
- Include the `sentence` level for `Smash-RNN`
- Create analytics to better understand the results for each model (I will need help here!)
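
As one possible starting point for the per-model analytics, the sketch below computes the mean absolute error between `predicted_click_rate` and `actual_click_rate` for each model. The helper `summarize_by_model` and the toy data are hypothetical, and mean absolute error is only one candidate metric chosen for illustration; it is not something the notebook has settled on.

```python
import pandas as pd


def summarize_by_model(results: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: mean absolute click-rate error per model."""
    # Absolute difference between predicted and actual click rate, per row.
    errors = (results["predicted_click_rate"] - results["actual_click_rate"]).abs()
    # Average the per-row errors within each model.
    return errors.groupby(results["model"]).mean().rename("mean_absolute_error")


# Toy data (made up for illustration, not taken from the real results).
toy = pd.DataFrame({
    "model": ["doc2vec", "doc2vec", "word", "word"],
    "actual_click_rate": [0.10, 0.20, 0.10, 0.20],
    "predicted_click_rate": [0.15, 0.10, 0.10, 0.25],
})

summary = summarize_by_model(toy)
print(summary)
```

With the real data, the same call would be `summarize_by_model(results_analyzer.results)`, assuming the column names match those shown in the sample above.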