This notebook explores the results of predicting the `click_rate` from a `source_article` to a `target_article` using different models (Doc2Vec, Wikipedia2Vec, Smash-RNN Paragraph Level, Smash-RNN Sentence Level and Smash-RNN Word Level).

The class `ResultsAnalyzer` encapsulates the logic to compute the results. Main features:
- `get_ndcg_for_all_models`: Calculates the Normalized Discounted Cumulative Gain for each model
- `get_map_for_all_models`: Calculates the Mean Average Precision for each model
- `get_top_10_predicted_by_article_and_model(source_article, model)`: Gets the top 10 predictions for the `source_article`. The column `is_in_top_articles` shows whether the `target_article` is among the **actual** top click rates for that source article.
- `ResultsAnalyzer.results`: A pandas DataFrame containing the consolidated results
- `get_sample_source_articles`: Samples 10 random `source_article` values, which can be used to check the results manually

In [1]:
import pandas as pd
from results_analyzer import ResultsAnalyzer

results_analyzer = ResultsAnalyzer()

Getting NDCG for all models:

In [2]:
results_analyzer.get_ndcg_for_all_models()

[2020-07-27 08:24:03,181] [INFO] Calculating NDCG for each model (get_ndcg_for_all_models@results_analyzer.py:209)
100%|██████████| 10/10 [01:26<00:00,  8.66s/it]


{'word': 0.6186564561748155,
 'sentence': 0.7520079111676228,
 'paragraph': 0.5865783418266132,
 'word_no_sigmoid': 0.7548936484994487,
 'sentence_no_sigmoid': 0.7606452202937065,
 'paragraph_no_sigmoid': 0.7490948963972344,
 'doc2vec': 0.5971752752809486,
 'doc2vec_no_sigmoid': 0.616989843034837,
 'wikipedia2vec': 0.8075977354419501,
 'wikipedia2vec_no_sigmoid': 0.8057212116790428}
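For reference, the metric itself can be sketched in a few lines. This is only an illustration of NDCG, not the code in `results_analyzer.py`: the function names, the cutoff `k=10`, and the use of the raw click rate as a linear gain are all assumptions.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains (rank 1 is not discounted)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(actual, predicted, k=10):
    """NDCG@k: order items by predicted score and use the actual values as gains.

    Assumption: gains are the actual click rates, used linearly.
    """
    # Actual values reordered by descending predicted score.
    pairs = sorted(zip(predicted, actual), key=lambda p: p[0], reverse=True)
    ranked = [a for _, a in pairs]
    # Best achievable ordering: actual values sorted descending.
    ideal_dcg = dcg_at_k(sorted(actual, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy values (hypothetical): the predicted order reverses the ideal order.
print(ndcg_at_k(actual=[1, 2, 3], predicted=[0.9, 0.5, 0.1]))
```

Under these assumptions, a per-model score would be the mean of `ndcg_at_k` over all source articles.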

Getting MAP for all models:

In [3]:
results_analyzer.get_map_for_all_models()

[2020-07-27 08:25:29,939] [INFO] Calculating MAP for each model (get_map_for_all_models@results_analyzer.py:189)
100%|██████████| 10/10 [01:26<00:00,  8.66s/it]


{'word': 0.4675,
 'sentence': 0.6679,
 'paragraph': 0.431,
 'word_no_sigmoid': 0.6692,
 'sentence_no_sigmoid': 0.6851,
 'paragraph_no_sigmoid': 0.669,
 'doc2vec': 0.4533,
 'doc2vec_no_sigmoid': 0.4916,
 'wikipedia2vec': 0.7515,
 'wikipedia2vec_no_sigmoid': 0.7455}
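MAP needs a binary notion of relevance, which this notebook does not spell out. The sketch below assumes a target is relevant when it is among the actual top-k click rates for its source article. The function names and `k=5` are hypothetical and may differ from what `results_analyzer.py` does.

```python
def average_precision_at_k(relevant, ranked, k=5):
    """AP@k: average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant items that could fit in the top k.
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0


def mean_average_precision(relevant_by_source, ranked_by_source, k=5):
    """Mean of AP@k over all source articles."""
    scores = [
        average_precision_at_k(relevant_by_source[s], ranked_by_source[s], k)
        for s in relevant_by_source
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical): relevant items "a" and "b" found at ranks 1 and 3.
print(average_precision_at_k(["a", "b"], ["a", "x", "b"], k=3))
```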

Getting a sample of the results:

In [5]:
results_analyzer.results.sample(n=10)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate
6128,sentence,Colombiana,Federal Bureau of Investigation,0.0,0.007938
5175,doc2vec_no_sigmoid,List of supporting Harry Potter characters,The Tales of Beedle the Bard,0.0,0.018912
1598,wikipedia2vec_no_sigmoid,Kenneth Branagh,Tony Slattery,0.0,0.004912
8406,doc2vec_no_sigmoid,Vietnam War,Assassination of John F. Kennedy,0.003388,0.019636
10432,doc2vec,Christopher Nolan,The Criterion Collection,0.0,0.018804
5292,word,Indiana Jones (franchise),Nicolas Roeg,0.0,0.023488
18129,sentence,Super Bowl XLVI,.tv,0.0,0.001054
1499,doc2vec,Kraftwerk,Klaus Dinger,0.016262,0.016582
19536,word,A. R. Rahman,Water (2005 film),0.0,0.023951
18136,doc2vec_no_sigmoid,2020 Big Ten Conference Men's Basketball Tourn...,College Basketball on CBS,0.0,0.015419


Getting a sample of the source articles:

In [6]:
results_analyzer.get_sample_source_articles()

15479                Harvey Weinstein
8889              American Pie (film)
7411                      Colin Hanks
3154         Little Women (2019 film)
17637                   Wyatt Russell
358                      Eiffel Tower
11833                 Brandon Flowers
11253                    F(x) (group)
1357     Eurovision Song Contest 2019
15756                         Ireland
Name: source_article, dtype: object

Getting all the available models (`paragraph`, `sentence` and `word` refer to the Smash-RNN levels):

In [7]:
results_analyzer.get_models()

array(['word', 'sentence', 'paragraph', 'word_no_sigmoid',
       'sentence_no_sigmoid', 'paragraph_no_sigmoid', 'doc2vec',
       'doc2vec_no_sigmoid', 'wikipedia2vec', 'wikipedia2vec_no_sigmoid'],
      dtype=object)

Getting the top 10 predictions for a `source_article` and a `model`:

In [14]:
sample_source_article = "Ireland"
model = "wikipedia2vec"

results_analyzer.get_top_10_predicted_by_article_and_model(sample_source_article, model)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate,is_in_top_articles
19146,wikipedia2vec,Ireland,British Isles,0.011749,0.032819,True
2729,wikipedia2vec,Ireland,Great Famine (Ireland),0.005367,0.030813,False
19366,wikipedia2vec,Ireland,Northern Ireland,0.071868,0.020845,True
17664,wikipedia2vec,Ireland,Republic of Ireland,0.317914,0.018831,True
6291,wikipedia2vec,Ireland,Dublin,0.027776,0.015403,True
9209,wikipedia2vec,Ireland,Saint Patrick,0.006274,0.014847,False
1824,wikipedia2vec,Ireland,Irish Famine (1740–41),0.002198,0.014399,False
7879,wikipedia2vec,Ireland,Northwestern Europe,0.00302,0.014337,False
18139,wikipedia2vec,Ireland,Partition of Ireland,0.003041,0.013425,False
7960,wikipedia2vec,Ireland,Irish Free State,0.00413,0.012884,False
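A table like the one above could be produced from `results` roughly as follows. This is a sketch under assumptions, not the actual implementation: the helper `top_n_predicted` and its `n_actual` parameter are hypothetical, and the toy data is made up. Only the column names come from the output above.

```python
import pandas as pd


def top_n_predicted(results, source_article, model, n=10, n_actual=5):
    """Top-n predicted targets for one source/model pair, flagged against the actual top."""
    mask = (results["source_article"] == source_article) & (results["model"] == model)
    subset = results.loc[mask]
    # Targets with the highest actual click rate (n_actual is an assumption).
    actual_top = set(subset.nlargest(n_actual, "actual_click_rate")["target_article"])
    top = subset.nlargest(n, "predicted_click_rate").copy()
    top["is_in_top_articles"] = top["target_article"].isin(actual_top)
    return top


# Hypothetical toy data, not taken from the real results.
toy = pd.DataFrame({
    "model": ["m"] * 4,
    "source_article": ["S"] * 4,
    "target_article": ["A", "B", "C", "D"],
    "actual_click_rate": [0.30, 0.01, 0.20, 0.05],
    "predicted_click_rate": [0.02, 0.09, 0.07, 0.01],
})
top = top_n_predicted(toy, "S", "m", n=3, n_actual=2)
print(top[["target_article", "is_in_top_articles"]])
```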


Next steps:
- Create some analytics to better understand the results for each model (I will need help here!)
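
One possible starting point for those analytics is to measure how much each model gains or loses when the sigmoid is removed. The sketch below uses the NDCG values printed earlier in this notebook. The helper `sigmoid_effect` is hypothetical, and it assumes that each `*_no_sigmoid` entry is the same model without the sigmoid.

```python
# NDCG values copied from the output of get_ndcg_for_all_models() above.
ndcg = {
    "word": 0.6186564561748155,
    "sentence": 0.7520079111676228,
    "paragraph": 0.5865783418266132,
    "word_no_sigmoid": 0.7548936484994487,
    "sentence_no_sigmoid": 0.7606452202937065,
    "paragraph_no_sigmoid": 0.7490948963972344,
    "doc2vec": 0.5971752752809486,
    "doc2vec_no_sigmoid": 0.616989843034837,
    "wikipedia2vec": 0.8075977354419501,
    "wikipedia2vec_no_sigmoid": 0.8057212116790428,
}


def sigmoid_effect(scores, suffix="_no_sigmoid"):
    """Return {base_model: score_without_sigmoid - score_with_sigmoid}."""
    effect = {}
    for name, score in scores.items():
        if name.endswith(suffix):
            continue  # skip the variants, they are looked up from the base name
        variant = name + suffix
        if variant in scores:
            effect[name] = scores[variant] - score
    return effect


# Models ordered from best to worst NDCG.
ranking = sorted(ndcg, key=ndcg.get, reverse=True)
effect = sigmoid_effect(ndcg)

print("Best model by NDCG:", ranking[0])
for model, delta in sorted(effect.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:>14}: {delta:+.4f}")
```

The same helper can be applied to the MAP dictionary to check whether both metrics agree.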