This notebook explores the results of predicting the `click_rate` from a `source_article` to a `target_article` using different models (Doc2Vec, Wikipedia2Vec, Smash-RNN Paragraph Level, Smash-RNN Sentence Level and Smash-RNN Word Level).

The class `ResultsAnalyzer` encapsulates the logic to compute the results. Main features:
- `get_top_5_predicted_by_article_and_model(source_article, model)`: Gets the top 5 predictions for the `source_article`. The column `is_in_top_5` shows whether the `target_article` is in the **actual** top 5 by click rate.
- `ResultsAnalyzer.results`: A pandas DataFrame containing the consolidated results.
- `get_sample_source_articles`: Samples 10 random `source_article` values. Can be used to check the results manually.

In [1]:
import pandas as pd
from results_analyzer import ResultsAnalyzer

results_analyzer = ResultsAnalyzer()

Getting a sample of the results

In [2]:
results_analyzer.results.sample(n=10)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate
5200,paragraph,Westworld,Westworld (disambiguation),0.128287,0.051356
4099,paragraph,Brian Quinn (comedian),Monsignor Farrell High School,0.020822,0.05272
5079,doc2vec,Audrey Hepburn,Charade (1963 film),0.003548,0.036424
10402,paragraph,Alternative for Germany,Landtag of Thuringia,0.008003,0.041198
5155,doc2vec,Florence Pugh,The Little Drummer Girl,0.001353,0.038599
2078,wikipedia2vec,The Aeronauts (film),Anne Reid,0.004713,0.05712
5520,sentence,Schutzstaffel,Majdanek concentration camp,0.0062,0.016807
9174,sentence,List of television series based on DC Comics p...,The New Batman Adventures,0.005354,0.023697
2001,doc2vec,Second Chechen War,Shamil Basayev,0.010172,0.04005
4522,word,Bombing of Tokyo (10 March 1945),Treaty of San Francisco,0.005189,0.023185


Getting a sample of the source articles

In [3]:
results_analyzer.get_sample_source_articles()

3302                                  The Big Bang Theory
9359                                         Joe Coulombe
6369                                          Mexico City
5046    United States Public Health Service Commission...
1267                                      American Hustle
5539             Ned's Declassified School Survival Guide
8336                        Rainier III, Prince of Monaco
7171                                           Just Mercy
8078                       The Goldbergs (2013 TV series)
1431                                            T-54/T-55
Name: source_article, dtype: object

Getting all the available models (`paragraph`, `sentence` and `word` refer to the Smash-RNN levels)

In [8]:
results_analyzer.get_models()

array(['doc2vec', 'wikipedia2vec', 'word', 'sentence', 'paragraph'],
      dtype=object)

Getting the top 5 predictions for a `source_article` and a `model`

In [7]:
sample_source_article = "Gerald Ford"
model = "sentence"

results_analyzer.get_top_5_predicted_by_article_and_model(sample_source_article, model)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate,is_in_top_5
9100,sentence,Gerald Ford,Twenty-fifth Amendment to the United States Co...,0.014005,0.034134,True
9991,sentence,Gerald Ford,Pardon of Richard Nixon,0.004515,0.031816,True
7565,sentence,Gerald Ford,Détente,0.003849,0.031134,False
4379,sentence,Gerald Ford,Betty Ford,0.044354,0.029983,True
6301,sentence,Gerald Ford,Bob Dole,0.001809,0.028259,False
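
The `is_in_top_5` column in the output above can be illustrated with a minimal sketch. This is not the actual `ResultsAnalyzer` code: the helper `top_k_predicted` and the toy data are hypothetical, and the sketch only assumes a DataFrame with the same columns as `results_analyzer.results`.

```python
import pandas as pd

# Toy data with the same columns as `results_analyzer.results`; all values are made up.
toy_results = pd.DataFrame({
    "model": ["sentence"] * 7,
    "source_article": ["A"] * 7,
    "target_article": ["t1", "t2", "t3", "t4", "t5", "t6", "t7"],
    "actual_click_rate": [0.30, 0.20, 0.15, 0.10, 0.08, 0.05, 0.01],
    "predicted_click_rate": [0.05, 0.09, 0.02, 0.08, 0.07, 0.06, 0.04],
})


def top_k_predicted(results, source_article, model, k=5):
    """Hypothetical helper: top-k predictions, flagged against the actual top k."""
    # Keep only the rows for this source article and model.
    subset = results[
        (results["source_article"] == source_article) & (results["model"] == model)
    ]
    # Targets with the k highest actual click rates.
    actual_top = set(subset.nlargest(k, "actual_click_rate")["target_article"])
    # Rows with the k highest predicted click rates, in descending order.
    predicted_top = subset.nlargest(k, "predicted_click_rate").copy()
    # Flag each predicted target that also appears in the actual top k.
    predicted_top[f"is_in_top_{k}"] = predicted_top["target_article"].isin(actual_top)
    return predicted_top


top5 = top_k_predicted(toy_results, "A", "sentence")
```

With this reading, a row is flagged `True` when the predicted target is also one of the five targets with the highest actual click rate, regardless of its exact rank.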


Next steps:
- Create analytics to better understand the results for each model (I will need help here!)
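
One possible starting point for these analytics is the mean absolute error between predicted and actual click rates, grouped by model. The sketch below is only a suggestion under assumptions: it uses made-up toy data with the same columns as `results_analyzer.results`, and the `abs_error` column and `mae_by_model` name are introduced here for illustration.

```python
import pandas as pd

# Toy data with the same columns as `results_analyzer.results`; all values are made up.
toy_results = pd.DataFrame({
    "model": ["doc2vec", "doc2vec", "word", "word"],
    "source_article": ["A", "A", "A", "A"],
    "target_article": ["t1", "t2", "t1", "t2"],
    "actual_click_rate": [0.10, 0.02, 0.10, 0.02],
    "predicted_click_rate": [0.04, 0.05, 0.08, 0.03],
})

# Absolute difference between prediction and ground truth for every row.
toy_results["abs_error"] = (
    toy_results["predicted_click_rate"] - toy_results["actual_click_rate"]
).abs()

# Mean absolute error per model, lowest (best) first.
mae_by_model = toy_results.groupby("model")["abs_error"].mean().sort_values()
print(mae_by_model)
```

The same `groupby("model")` pattern could be applied to `results_analyzer.results` directly, or extended with other metrics such as the share of `is_in_top_5` hits per model.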