This notebook explores the results of predicting the `click_rate` from a `source_article` to a `target_article` using different models (Doc2Vec, Wikipedia2Vec, Smash-RNN Paragraph Level, Smash-RNN Sentence Level and Smash-RNN Word Level).

The `ResultsAnalyzer` class encapsulates the logic for computing the results. Main features:
- `get_ndcg_for_all_models`: Calculates the Normalized Discounted Cumulative Gain for each model
- `get_map_for_all_models`: Calculates the Mean Average Precision for each model
- `get_top_5_predicted_by_article_and_model(source_article, model)`: Gets the top 5 predictions for the `source_article`. The column `is_in_top_5` shows whether the `target_article` is in the **actual** top 5 by click rate.
- `ResultsAnalyzer.results`: A pandas DataFrame containing the consolidated results
- `get_sample_source_articles`: Samples 10 random `source_articles`, which can be used to check the results manually
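
As a point of reference for the NDCG metric listed above, here is a minimal sketch of NDCG@k for a single source article. It is not taken from `ResultsAnalyzer`; the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and it assumes the actual click rate is used directly as the relevance (linear gain) with a log2 position discount.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k relevance values."""
    # The item at 0-based position `pos` is discounted by log2(pos + 2)
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(actual, predicted, k=5):
    """NDCG@k, assuming `actual` click rates act as relevance (an assumption)."""
    # Order the actual click rates by predicted score, highest first
    ranked = [
        a for _, a in sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    ]
    # The ideal ranking sorts by the actual click rates themselves
    ideal_dcg = dcg_at_k(sorted(actual, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg_at_k([0.5, 0.3, 0.2], [0.9, 0.5, 0.1])
reversed_order = ndcg_at_k([0.2, 0.3, 0.5], [0.9, 0.5, 0.1])
```

A ranking that matches the actual order scores 1, and any other ordering scores lower, which is consistent with the 0 to 1 range in the `describe()` output of this notebook.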

In [1]:
import pandas as pd
from results_analyzer import ResultsAnalyzer

results_analyzer = ResultsAnalyzer()

Getting NDCG for all models:

In [2]:
results = results_analyzer.calculate_statistics_per_group()

[2020-09-26 15:19:19,266] [INFO] Getting features from DB (calculate_statistics_per_group@results_analyzer.py:342)
[2020-09-26 15:19:34,708] [INFO] Getting predictions by model (calculate_statistics_per_group@results_analyzer.py:358)
[2020-09-26 15:19:34,740] [INFO] Aggregating predictions for each model (get_predictions_by_model@results_analyzer.py:266)
100%|██████████| 29/29 [00:21<00:00,  1.33it/s]
[2020-09-26 15:19:56,571] [INFO] Calculating results by model (calculate_statistics_per_group@results_analyzer.py:395)
100%|██████████| 474/474 [00:06<00:00, 68.67it/s]
100%|██████████| 474/474 [00:06<00:00, 68.71it/s]
100%|██████████| 474/474 [00:06<00:00, 69.01it/s]
100%|██████████| 474/474 [00:06<00:00, 68.60it/s]
100%|██████████| 474/474 [00:06<00:00, 68.85it/s]
100%|██████████| 474/474 [00:06<00:00, 68.47it/s]
100%|██████████| 474/474 [00:06<00:00, 68.45it/s]
100%|██████████| 474/474 [00:06<00:00, 68.92it/s]
100%|██████████| 474/474 [00:06<00:00, 68.93it/s]
100%|██████████| 474/474 [

In [10]:
models = ["paragraph_level_50d_concat_v2_introduction_only", "paragraph_level_50d_concat_introduction_only", "paragraph_level_200d_introduction_only"]

results[models].describe()

| | paragraph_level_50d_concat_v2_introduction_only | paragraph_level_50d_concat_introduction_only | paragraph_level_200d_introduction_only |
|---|---|---|---|
| count | 474.0 | 474.0 | 474.0 |
| mean | 0.575693 | 0.584052 | 0.636067 |
| std | 0.341723 | 0.344866 | 0.345378 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.386853 | 0.386853 | 0.430677 |
| 50% | 0.624051 | 0.62117 | 0.679731 |
| 75% | 0.88546 | 0.906025 | 0.95583 |
| max | 1.0 | 1.0 | 1.0 |


In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

WORD_COUNT_BIN = "word_count_bin"
WORD_COUNT_COLUMN = "word_count"
OUT_LINKS_BIN = "out_links_bin"
OUT_LINKS_COLUMN = "out_links_count"
IN_LINKS_BIN = "in_links_bin"
IN_LINKS_COLUMN = "in_links_count_column"
PARAGRAPH_COUNT_COLUMN = "paragraph_count"
PARAGRAPH_COUNT_BIN = "paragraph_count_bin"
SENTENCE_COUNT_COLUMN = "sentence_count"
SENTENCE_COUNT_BIN = "sentence_count_bin"
MODEL_COLUMN = "model"

ALL_FEATURES = [WORD_COUNT_COLUMN, OUT_LINKS_COLUMN, IN_LINKS_COLUMN]

selected_models = [
    "doc2vec_no_sigmoid",
    "wikipedia2vec_no_sigmoid",
    "word_no_sigmoid",
    "sentence_no_sigmoid",
    "paragraph_no_sigmoid",
]

clean_model_names = {
    "doc2vec_no_sigmoid": "Doc2Vec",
    "paragraph_no_sigmoid": "SMASH RNN (P + S + W)",
    "sentence_no_sigmoid": "SMASH RNN (P + S)",
    "wikipedia2vec_no_sigmoid": "Wikipedia2Vec",
    "word_no_sigmoid": "SMASH RNN (P)",
}

SMASH_HATCH = '//'
DOC2VEC_HATCH = '' 
WIKIPEDIA2VEC_HATCH = ''

system_styles = {
    'doc2vec_no_sigmoid': dict(color='lightcoral', hatch=DOC2VEC_HATCH),
        
    'wikipedia2vec_no_sigmoid': dict(color='yellow', hatch=WIKIPEDIA2VEC_HATCH),
    
    'paragraph_no_sigmoid': dict(color='blue', hatch=SMASH_HATCH),
    'sentence_no_sigmoid': dict(color='green', hatch=SMASH_HATCH),
    'word_no_sigmoid': dict(color='red', hatch=SMASH_HATCH),
}

SMALL_SIZE = 10
MEDIUM_SIZE = 12
BIGGER_SIZE = 14

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE+1)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE+1)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE+1)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

plt.rc('pdf', fonttype=42)
plt.rc('ps', fonttype=42)

plt.rc('text', usetex=False)
plt.rc('font', family='serif')

def get_performance_figure(
    results,
    models,
    feature_column,
    x_label,
    y_label=None,
    figsize=(13, 6),
    legend_columns_count=3,
    buckets_count=5,
    save_file_path=None,
):
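    # Split the feature into quantile buckets holding roughly equal numbers of articles
    bin_column = f"{feature_column}_bin"
    bins = pd.qcut(results[feature_column], q=buckets_count)

    results[bin_column] = bins
    # Select the model columns before averaging so non-numeric columns are never aggregated
    result_by_model = results.groupby(bin_column)[models].mean()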

    fig = plt.figure(figsize=figsize)

    ax = result_by_model.plot(
        kind="bar", ax=fig.gca(), rot=0, width=0.7, alpha=0.9, edgecolor=["black"],
    )

    box = ax.get_position()
    ax.set_position([box.x0, box.y0 + box.height * 0.25, box.width, box.height * 0.75])

    # Formats the bars
    for container in ax.containers:
        container_system = container.get_label()
        
        style = system_styles[container_system]
        for patch in container.patches:
            if 'color' in style:
                patch.set_color(style['color'])
            if 'hatch' in style:
                patch.set_hatch(style['hatch'])
            if 'linewidth' in style:
                patch.set_linewidth(style['linewidth'])
            if 'edgecolor' in style:
                patch.set_edgecolor(style['edgecolor'])
            else:
                patch.set_edgecolor('black')

    
    # Legend entries follow the plotted columns, i.e. the `models` argument
    model_names = [clean_model_names[model] for model in models]

    ax.legend(
        model_names,
        ncol=legend_columns_count,
        loc="upper center",
        fancybox=True,
        shadow=False,
        bbox_to_anchor=(0.5, 1.2),
    )

    # Formats the x label as "(lower, upper]"
    ax.set_xticklabels(
        [f"({int(i.left)}, {int(i.right)}]" for i in bins.cat.categories]
    )

    # Fall back to the default label only when the caller gives none
    if y_label is None:
        y_label = "NDCG@k (k=5)"
    ax.set_xlabel(x_label % len(result_by_model))
    ax.set_ylabel(y_label)
    
    if save_file_path:
        pdf_dpi = 300

        plt.savefig(save_file_path, bbox_inches="tight", dpi=pdf_dpi)
        # Report where the figure was written
        print(f"Saved to {save_file_path}")

    plt.show()
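
The bucketing step inside `get_performance_figure` can be tried in isolation. The sketch below uses made-up data (the `word_count` and `ndcg` values are purely illustrative, not results from this notebook) to show how `pd.qcut` produces equal-sized buckets and how the score is then averaged per bucket.

```python
import pandas as pd

# Toy data: one row per source article (values are illustrative only)
scores = pd.DataFrame(
    {
        "word_count": [120, 340, 560, 800, 1500, 2300, 4100, 5200, 7600, 9900],
        "ndcg": [0.2, 0.4, 0.3, 0.5, 0.6, 0.7, 0.6, 0.8, 0.9, 1.0],
    }
)

# qcut picks bucket edges from quantiles, so each bucket holds about the same number of rows
buckets = pd.qcut(scores["word_count"], q=5)
bucket_sizes = buckets.value_counts()

# Mean score per bucket, which is the quantity plotted as bars
mean_by_bucket = scores.groupby(buckets, observed=True)["ndcg"].mean()
print(mean_by_bucket)
```

Because the edges come from quantiles rather than fixed widths, the buckets cover very different ranges when the feature is skewed, which is worth keeping in mind when reading the x-axis labels.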

In [None]:
get_performance_figure(results, selected_models, WORD_COUNT_COLUMN, "Text length as word count (%s equal-sized buckets)")
get_performance_figure(results, selected_models, SENTENCE_COUNT_COLUMN, "Text length as sentence count (%s equal-sized buckets)")
get_performance_figure(results, selected_models, PARAGRAPH_COUNT_COLUMN, "Text length as paragraph count (%s equal-sized buckets)")


# results

In [None]:
models = ["wikipedia2vec_no_sigmoid", "word_no_sigmoid"]
get_performance_figure(results, models, WORD_COUNT_COLUMN, 'Text length as word count (%s equal-sized buckets)')

In [None]:
results_analyzer.get_map_for_all_models()
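
For readers unfamiliar with Mean Average Precision, the following is a minimal sketch, not the `ResultsAnalyzer` implementation. It assumes binary relevance, where a target counts as relevant if it is in the actual top k for its source article; the function names and the example article labels are hypothetical.

```python
def average_precision(relevant, ranked, k=5):
    """Average precision at k for one source article."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this cut-off, counted only at relevant positions
            precision_sum += hits / position
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(relevant_by_source, ranked_by_source, k=5):
    """Mean of the per-article average precision values."""
    scores = [
        average_precision(relevant_by_source[source], ranked_by_source[source], k)
        for source in relevant_by_source
    ]
    return sum(scores) / len(scores) if scores else 0.0


relevant_by_source = {"s1": {"A", "B"}, "s2": {"X"}}
ranked_by_source = {"s1": ["A", "C", "B"], "s2": ["Y", "X"]}
map_score = mean_average_precision(relevant_by_source, ranked_by_source)
```

Unlike NDCG, this treats every relevant target equally, so it ignores how large the actual click rate is.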

Getting a sample of the results

In [None]:
results_analyzer.results.sample(n=10)

Getting a sample of the source articles

In [None]:
results_analyzer.get_sample_source_articles()

Getting all the available models (the `paragraph`, `sentence` and `word` models refer to Smash-RNN levels)

In [None]:
results_analyzer.get_models()

Getting the top 5 predictions for a `source_article` and a `model`

In [None]:
sample_source_article = "Quantum mechanics"
model = "wikipedia2vec"

results_analyzer.get_top_10_predicted_by_article_and_model(sample_source_article, model)
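
The `is_in_top_5` column described at the top of this notebook could be derived roughly as follows. This is a sketch on made-up data; the column names `actual_click_rate` and `predicted_click_rate` are assumptions, and the real `ResultsAnalyzer` may compute the flag differently.

```python
import pandas as pd

# Toy predictions for a single source article (values are illustrative only)
pairs = pd.DataFrame(
    {
        "target_article": ["A", "B", "C", "D", "E", "F", "G"],
        "actual_click_rate": [0.30, 0.25, 0.15, 0.10, 0.08, 0.07, 0.05],
        "predicted_click_rate": [0.20, 0.05, 0.30, 0.02, 0.25, 0.10, 0.08],
    }
)

# Targets that are really among the 5 most clicked
actual_top_5 = set(pairs.nlargest(5, "actual_click_rate")["target_article"])

# The model's 5 highest-scored targets, flagged by membership in the actual top 5
predicted_top_5 = pairs.nlargest(5, "predicted_click_rate").copy()
predicted_top_5["is_in_top_5"] = predicted_top_5["target_article"].isin(actual_top_5)
print(predicted_top_5)
```

In this toy example three of the five predicted targets are also in the actual top 5.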

Next steps:
- Create analytics to better understand the results for each model (I will need help here!)