# Investigate discrepancy between metrics of old and new implementations

In [1]:
import pandas as pd

In [2]:
experiment_results = pd.read_csv("experiment_results.csv")
experiment_results.head()

Unnamed: 0,dataset,seed,num_models_old,num_models_new,all_models_sets_equal,best_models_equal,best_model_old,best_model_new,best_model_score_old,best_model_score_new,time_elapsed_sec_old,time_elapsed_sec_new
0,cancer,7816,143,143,True,True,"[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...","[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...",17.537315,17.537315,2.729563,0.925814
1,cancer,3578,19,19,True,True,"[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...","[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...",17.095751,17.095751,2.016327,0.524491
2,cancer,2656,1,1,True,True,"[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0...","[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0...",12.126952,12.126952,0.666444,0.13445
3,cancer,2688,9,9,True,True,"[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...","[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...",17.13001,17.13001,3.290134,0.55071
4,cancer,2494,35,35,True,True,"[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...","[[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0...",17.174563,17.174563,2.470846,0.595394


**Check that the model sets returned by the two implementations are the same in all cases**

In [None]:
assert experiment_results["all_models_sets_equal"].all(), "Model sets are not equal in all cases"

The two implementations are equivalent in this respect: they return the same model set in every case.
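As an illustration of what an order-independent comparison of model sets can look like, here is a minimal sketch. It assumes each model is a nested list (a matrix), as the `best_model_old` and `best_model_new` columns suggest. The helpers `to_hashable` and `model_sets_equal` are hypothetical names, not the code that produced `all_models_sets_equal`.

```python
def to_hashable(model):
    """Convert a nested-list matrix into a tuple of tuples so it can be put in a set."""
    return tuple(tuple(row) for row in model)


def model_sets_equal(models_a, models_b):
    """Return True when both collections contain the same models, ignoring order."""
    return {to_hashable(m) for m in models_a} == {to_hashable(m) for m in models_b}


# Toy models (hypothetical): the same two matrices, returned in a different order.
old_models = [[[0, 0], [1, 0]], [[0, 1], [0, 0]]]
new_models = [[[0, 1], [0, 0]], [[0, 0], [1, 0]]]

sets_match = model_sets_equal(old_models, new_models)
print(sets_match)
```

Note that a set-based comparison also ignores duplicates, so it only answers whether the same distinct models were found.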

**Check that in cases where the best models differ, they have the same score**

In [None]:
different_best_model_cases = experiment_results[~experiment_results["best_models_equal"]]
assert all(different_best_model_cases["best_model_score_old"] == different_best_model_cases["best_model_score_new"]), \
    "Best models differ in strength in some cases"

In [5]:
different_best_model_cases[['dataset', 'seed', 'all_models_sets_equal', 'best_models_equal', 'best_model_score_old', 'best_model_score_new']]

Unnamed: 0,dataset,seed,all_models_sets_equal,best_models_equal,best_model_score_old,best_model_score_new
19,cancer,5303,True,False,17.349464,17.349464
23,cancer,2145,True,False,17.169093,17.169093
52,earthquake,2656,True,False,33.825122,33.825122
79,earthquake,7166,True,False,34.145325,34.145325
103,survey,2688,True,False,13.723063,13.723063
108,survey,316,True,False,15.382594,15.382594
112,survey,2250,True,False,14.779825,14.779825
115,survey,3595,True,False,13.644357,13.644357
117,survey,4135,True,False,16.091896,16.091896
120,survey,3942,True,False,15.642678,15.642678
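The assertion on the scores uses exact `==` on floating-point values. That passes here, but if the two implementations ever accumulate the score in a different order, tiny rounding differences could trip it. A hedged alternative is to compare with a tolerance via `numpy.isclose`. The sketch below uses a toy frame with hypothetical values, not the experiment data.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values). The second row differs only by float rounding.
toy = pd.DataFrame({
    "best_model_score_old": [17.25, 0.1 + 0.2],
    "best_model_score_new": [17.25, 0.3],
})

# Exact comparison: fails on the second row because 0.1 + 0.2 != 0.3 in binary floats.
exact_match = (toy["best_model_score_old"] == toy["best_model_score_new"]).all()

# Tolerance-based comparison: treats values within the default tolerances as equal.
close_match = np.isclose(
    toy["best_model_score_old"], toy["best_model_score_new"]
).all()

print(exact_match, close_match)
```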


In these cases the best models differ but have the same score. When several models share the maximum score, the best model is selected randomly among them. The difference is therefore not caused by a bug; it follows naturally from the way we rank and choose the best model.
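The effect of random tie-breaking can be sketched as follows. This is a minimal illustration, not the actual selection code: `pick_best`, the model names, and the scores are hypothetical.

```python
import random


def pick_best(models_with_scores, rng):
    """Return one model chosen uniformly at random among those with the maximum score."""
    top_score = max(score for _, score in models_with_scores)
    tied = [model for model, score in models_with_scores if score == top_score]
    return rng.choice(tied), top_score


# Hypothetical candidates: two models tie for the top score.
candidates = [("model_a", 17.25), ("model_b", 17.25), ("model_c", 12.5)]

# Repeat the selection with different random states, as two runs might do.
picks = {pick_best(candidates, random.Random(seed)) for seed in range(20)}
best_names = {name for name, _ in picks}
best_scores = {score for _, score in picks}

# The chosen model may vary between runs, but the reported score cannot.
print(best_names, best_scores)
```

A deterministic tie-break (for example, by a canonical ordering of the models) would make the two implementations agree on the best model as well, at the cost of an arbitrary but fixed preference.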

**Finally, check whether any cases returned more than 10,000 models**

In [6]:
experiment_results[experiment_results["num_models_new"] > 10000]

Unnamed: 0,dataset,seed,num_models_old,num_models_new,all_models_sets_equal,best_models_equal,best_model_old,best_model_new,best_model_score_old,best_model_score_new,time_elapsed_sec_old,time_elapsed_sec_new
148,survey,3094,71233,71233,True,True,"[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 1...","[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 1...",15.848744,15.848744,183.649269,302.43388


# Conclusion

1. The two implementations return identical model sets for each dataset and fact-set case.
2. The slightly different metric values arise because the best models returned by the two implementations sometimes differ.
3. However, whenever the best models differ, they have the same score. The difference in the best model is therefore not a bug, as the models are equally good according to the model ranking heuristic.
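The conclusions above can be summarised per dataset in one table. The sketch below assumes the column names from `experiment_results.csv` and runs on a toy frame with hypothetical values, so the numbers it prints are not experiment results.

```python
import pandas as pd

# Toy stand-in for experiment_results (hypothetical values, real column names).
toy = pd.DataFrame({
    "dataset": ["cancer", "cancer", "survey", "survey"],
    "best_models_equal": [True, False, False, True],
    "best_model_score_old": [17.5, 17.25, 13.75, 15.5],
    "best_model_score_new": [17.5, 17.25, 13.75, 15.5],
})

# Count, per dataset, how often the best model differs and how often the score differs.
summary = (
    toy.assign(
        best_model_differs=~toy["best_models_equal"],
        score_differs=toy["best_model_score_old"] != toy["best_model_score_new"],
    )
    .groupby("dataset")[["best_model_differs", "score_differs"]]
    .sum()
)

print(summary)
```

On the real data, a `score_differs` column of all zeros alongside nonzero `best_model_differs` counts would correspond to points 2 and 3.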