This notebook explores the results of predicting the `click_rate` from a `source_article` to a `target_article` using different models (Doc2Vec, Wikipedia2Vec, Smash-RNN Paragraph Level, Smash-RNN Sentence Level and Smash-RNN Word Level).

The class `ResultsAnalyzer` encapsulates the logic to compute the results. Main features:
- `get_ndcg_for_all_models`: Calculates the Normalized Discounted Cumulative Gain for each model
- `get_map_for_all_models`: Calculates the Mean Average Precision for each model
- `get_top_5_predicted_by_article_and_model(source_article, model)`: Gets the top 5 predictions for the `source_article`. The column `is_in_top_5` shows whether the `target_article` is among the **actual** top 5 by click rate.
- `ResultsAnalyzer.results`: A pandas DataFrame containing the consolidated results
- `get_sample_source_articles`: Samples 10 random `source_articles`. Useful for manually checking the results

In [5]:
import pandas as pd
from results_analyzer import ResultsAnalyzer

results_analyzer = ResultsAnalyzer()

Getting NDCG for all models:

In [6]:
results_analyzer.get_ndcg_for_all_models()

[2020-07-17 14:23:51,333] [INFO] Calculating NDCG for each model (get_ndcg_for_all_models@results_analyzer.py:195)
100%|██████████| 2/2 [00:06<00:00,  3.20s/it]


{'word': 0.6186564561748155, 'paragraph': 0.5865783418266132}
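For reference, NDCG for a single `source_article` can be sketched as below. This is only a minimal illustration and not necessarily what `ResultsAnalyzer` does internally: it assumes the `actual_click_rate` is used directly as the relevance (linear gain) and that the targets are ranked by `predicted_click_rate`. The helper names `dcg` and `ndcg` and the toy values are made up for this example.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(actual, predicted, k=None):
    """NDCG for one source article (hypothetical helper, linear gain).

    actual    -- actual click rates of the target articles
    predicted -- predicted click rates, in the same order as `actual`
    k         -- optional cutoff; None uses the full ranking
    """
    # Rank the targets by predicted click rate, highest first.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    ranked = [actual[i] for i in order]
    # The ideal ranking sorts the targets by their actual click rate.
    ideal = sorted(actual, reverse=True)
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    # Without any clicks the ideal DCG is 0; return 0.0 instead of dividing by zero.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy values, not taken from the dataset.
perfect = ndcg([0.3, 0.2, 0.1], [0.9, 0.5, 0.1])
swapped = ndcg([1.0, 0.0], [0.1, 0.9])
```

A ranking that matches the actual order scores 1.0, and placing the only clicked target second lowers the score. The per-model values above would then be an aggregate of such per-article scores, if the implementation follows this scheme.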

Getting MAP for all models:

In [3]:
results_analyzer.get_map_for_all_models()

[2020-07-17 09:29:23,583] [INFO] Calculating MAP for each model (get_map_for_all_models@results_analyzer.py:175)
100%|██████████| 2/2 [00:06<00:00,  3.17s/it]


{'word': 0.4675, 'paragraph': 0.4297}
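As a rough sketch of the metric, MAP averages the per-article Average Precision. Average Precision needs a binary notion of relevance, and how `ResultsAnalyzer` derives it is not shown here. The example below simply takes a set of relevant targets as input (for instance the actual top 5 by click rate, which is an assumption). The function names and toy articles are hypothetical.

```python
def average_precision(relevant, ranked):
    """Average Precision for one source article.

    relevant -- collection of target articles considered relevant (assumed given)
    ranked   -- target articles ordered by predicted click rate, highest first
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, target in enumerate(ranked, start=1):
        if target in relevant:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of the Average Precision over (relevant, ranked) pairs."""
    scores = [average_precision(relevant, ranked) for relevant, ranked in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, not taken from the dataset.
queries = [
    ({"a", "c"}, ["a", "b", "c"]),  # AP = (1/1 + 2/3) / 2
    ({"a"}, ["b", "a"]),            # AP = 1/2
]
toy_map = mean_average_precision(queries)
```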

Getting a sample of the results

In [4]:
results_analyzer.results.sample(n=10)

Unnamed: 0,model,source_article,target_article,actual_click_rate,predicted_click_rate
8415,word,European theatre of World War II,British occupation of the Faroe Islands,0.0,0.014172
4187,word,Uzbekistan,Landlocked country,0.066685,0.011574
17638,word,Clitoris,Internal pudendal artery,0.006496,0.016869
601,paragraph,Deion Sanders,San Francisco 49ers,0.007007,0.015056
3402,paragraph,Brandon Flowers,Posttraumatic stress disorder,0.024914,0.021252
2112,paragraph,Deion Sanders,Carlton Fisk,0.008333,0.016277
3015,paragraph,Dishonored,Gregg Berger,0.0,0.023677
16595,paragraph,Wayne's World (film),Rob Lowe,0.040094,0.019406
17775,word,USS Constitution,Junk (ship),0.0,0.015668
15628,paragraph,Dean Koontz bibliography,Dragonfly (Koontz novel),0.008052,0.017964


Getting a sample of the source articles

In [None]:
results_analyzer.get_sample_source_articles()

Getting all the available models (the `paragraph`, `sentence` and `word` models refer to the Smash-RNN levels)

In [None]:
results_analyzer.get_models()

Getting the top 5 predictions for a `source_article` and a `model`

In [None]:
sample_source_article = "Gerald Ford"
model = "sentence"

results_analyzer.get_top_5_predicted_by_article_and_model(sample_source_article, model)
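The same kind of lookup can be approximated directly on `results_analyzer.results`. The sketch below is not the actual implementation of `get_top_5_predicted_by_article_and_model`. It assumes the column names shown in the sample above, and the function `top_k_predicted` is a hypothetical helper written for illustration.

```python
import pandas as pd


def top_k_predicted(results, source_article, model, k=5):
    """Top-k predicted targets for one source article and model (illustrative only)."""
    subset = results[
        (results["source_article"] == source_article) & (results["model"] == model)
    ]
    # Targets that are in the actual top k by click rate.
    actual_top = set(subset.nlargest(k, "actual_click_rate")["target_article"])
    # Targets the model ranks highest, flagged when they are also in the actual top k.
    predicted_top = subset.nlargest(k, "predicted_click_rate").copy()
    predicted_top[f"is_in_top_{k}"] = predicted_top["target_article"].isin(actual_top)
    return predicted_top.reset_index(drop=True)
```

With the real data this could be called as `top_k_predicted(results_analyzer.results, "Gerald Ford", "sentence")`, provided the `sentence` model is present in the results.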

Next steps:
- Create some analytics to better understand the results for each model (I will need help here!)