### In this notebook, we implement different strategies for debugging our model, trying to detect the training instances that cause the most confusion across classes.

In [2]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

FEATURES = ['query_num_of_columns', 'query_num_of_rows', 'query_row_column_ratio', 'query_max_skewness', 
            'query_max_kurtosis', 'query_max_unique', 'candidate_num_rows', 'candidate_row_column_ratio', 
            'candidate_max_skewness', 'candidate_max_kurtosis', 'candidate_max_unique', 'query_target_max_pearson', 
            'query_target_max_spearman', 'query_target_max_covariance', 'query_target_max_mutual_info', 
            'candidate_target_max_pearson', 'candidate_target_max_spearman', 'candidate_target_max_covariance', 
            'candidate_target_max_mutual_info', 'max_pearson_difference', 'containment_fraction']
TARGET = 'class'

In [14]:
def train_and_test_over_same_data(data):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(data[FEATURES], data[TARGET])
    preds = rf.predict(data[FEATURES])
    print(classification_report(data[TARGET], preds))
    return rf

In [3]:
dataset_2 = pd.read_csv('training-simplified-data-generation.csv')
# Vectorized labeling: positive gain -> 'good_gain', anything else -> 'loss'.
dataset_2[TARGET] = (dataset_2['gain_in_r2_score'] > 0).map({True: 'good_gain', False: 'loss'})
dataset_3 = pd.read_csv('training-simplified-data-generation-many-candidates-per-query_with_median_and_mean_based_classes.csv')
dataset_3[TARGET] = (dataset_3['gain_in_r2_score'] > 0).map({True: 'good_gain', False: 'loss'})

KeyboardInterrupt: 

In [15]:
rf_dataset_2 = train_and_test_over_same_data(dataset_2)

              precision    recall  f1-score   support

   good_gain       1.00      1.00      1.00      5707
        loss       1.00      1.00      1.00      4177

   micro avg       1.00      1.00      1.00      9884
   macro avg       1.00      1.00      1.00      9884
weighted avg       1.00      1.00      1.00      9884



In [17]:
rf_dataset_3 = train_and_test_over_same_data(dataset_3)

              precision    recall  f1-score   support

   good_gain       1.00      1.00      1.00   1019700
        loss       1.00      1.00      1.00   1097156

   micro avg       1.00      1.00      1.00   2116856
   macro avg       1.00      1.00      1.00   2116856
weighted avg       1.00      1.00      1.00   2116856



### So it seems that both models fit their own training instances "perfectly", i.e., the in-training error is 0. Let's double-check how these models behave on the college use case, which is our focus now.
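A perfect in-training score mostly shows that the forest can memorize its inputs; it says little about generalization. The following is a minimal sketch on synthetic data (not the notebook's datasets; `make_classification` and its parameters are stand-ins chosen purely for illustration) that contrasts training accuracy with cross-validated accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with label noise (flip_y), so that memorizing
# the training set does not imply generalizing well.
X, y = make_classification(n_samples=400, n_features=21, n_informative=5,
                           flip_y=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Accuracy on the same data the model was fitted on.
train_accuracy = rf.score(X, y)

# Mean accuracy over 5 held-out folds, each scored by a freshly fitted model.
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5
).mean()

print('training accuracy:', round(train_accuracy, 3))
print('5-fold CV accuracy:', round(cv_accuracy, 3))
```

The gap between the two numbers, rather than the training score alone, is the quantity worth watching.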

In [21]:
college = pd.read_csv('college-debt-records-features-single-column-w-class')
college[TARGET] = (college['gain_in_r2_score'] > 0).map({True: 'good_gain', False: 'loss'})

In [18]:
def create_model(data):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(data[FEATURES], data[TARGET])
    return rf

In [19]:
college_preds_dataset_2 = rf_dataset_2.predict(college[FEATURES])
college_preds_dataset_3 = rf_dataset_3.predict(college[FEATURES])

In [20]:
print(classification_report(college[TARGET], college_preds_dataset_2))

              precision    recall  f1-score   support

   good_gain       0.16      0.97      0.27       130
        loss       0.99      0.30      0.46       973

   micro avg       0.38      0.38      0.38      1103
   macro avg       0.57      0.63      0.36      1103
weighted avg       0.89      0.38      0.44      1103



In [22]:
print(classification_report(college[TARGET], college_preds_dataset_3))

              precision    recall  f1-score   support

   good_gain       0.12      0.88      0.21       130
        loss       0.90      0.15      0.26       973

   micro avg       0.23      0.23      0.23      1103
   macro avg       0.51      0.51      0.23      1103
weighted avg       0.81      0.23      0.25      1103



### Let's check how often each model predicts the positive class. My impression is that both models generate too many 'good_gain' predictions.

In [23]:
print('number of \'good_gain\' predictions using dataset_2:', len([i for i in college_preds_dataset_2 if i == 'good_gain']))
print('number of \'good_gain\' predictions using dataset_3:', len([i for i in college_preds_dataset_3 if i == 'good_gain']))

("number of 'good_gain' predictions using dataset_2:", 808)
("number of 'good_gain' predictions using dataset_3:", 942)


### So yes: for dataset_2, we have 808 out of 1103 'good_gain' predictions (73%); for dataset_3, 942 out of 1103 (85%). 
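The percentages above can be recomputed from the printed counts, and the counting itself can be done in one pass with `collections.Counter`. A small sketch; `example_preds` is a made-up list standing in for an array such as `college_preds_dataset_2`:

```python
from collections import Counter

# Shares quoted above, recomputed from the printed counts.
total = 1103
for name, positives in [('dataset_2', 808), ('dataset_3', 942)]:
    print(name, 'good_gain share: {:.1%}'.format(positives / total))

# Counting every label in one pass instead of building a filtered list.
# `example_preds` is made up for illustration.
example_preds = ['good_gain', 'loss', 'good_gain', 'good_gain']
counts = Counter(example_preds)
print(counts['good_gain'], counts['loss'])
```

Note that this sketch assumes Python 3 division; under Python 2, `positives / total` would need a float cast.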

### What does the classification report look like for the synthetic test set that was generated along with dataset_2? Are most predictions also positive there?

In [24]:
test_for_dataset_2 = pd.read_csv('test-simplified-data-generation.csv')
test_for_dataset_2[TARGET] = (test_for_dataset_2['gain_in_r2_score'] > 0).map({True: 'good_gain', False: 'loss'})
test_for_dataset_2_preds = rf_dataset_2.predict(test_for_dataset_2[FEATURES])
print(classification_report(test_for_dataset_2[TARGET], test_for_dataset_2_preds))

              precision    recall  f1-score   support

   good_gain       0.67      0.73      0.70      2496
        loss       0.57      0.50      0.53      1780

   micro avg       0.64      0.64      0.64      4276
   macro avg       0.62      0.62      0.62      4276
weighted avg       0.63      0.64      0.63      4276



In [27]:
print('number of \'good_gain\' predictions using dataset_2:', 
      len([i for i in test_for_dataset_2_preds if i == 'good_gain']))

("number of 'good_gain' predictions using dataset_2:", 2716)


### Note that the problem of predicting 'good_gain' far more often than warranted did not happen for the synthetic test data: there are 2716 'good_gain' predictions against 2496 actual 'good_gain' instances. Is it possible to remove some instances from dataset_2 and get good results for both the college and the synthetic test datasets?

### Let's first try removing every instance whose relative gain (gain_in_r2_score) falls in the interval [0, 0.025].

In [105]:
filtered_dataset_2 = dataset_2.loc[(dataset_2['gain_in_r2_score'] > 0.025) | (dataset_2['gain_in_r2_score'] < 0)]
filtered_dataset_2.head()

| Unnamed: 0 | query | target | candidate | query_num_of_columns | query_num_of_rows | query_row_column_ratio | query_max_mean | query_max_outlier_percentage | query_max_skewness | query_max_kurtosis | ... | candidate_target_max_mutual_info | max_pearson_difference | containment_fraction | decrease_in_mae | decrease_in_mse | decrease_in_medae | gain_in_r2_score | r2_score_before | r2_score_after | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 36dcadcb-1b0d-4429-886e-1ec1b8b65b94 | -2.1218834e+000 | caefd73a-d31a-420e-bb46-56019fa79bec | 6 | 99 | 16.5 | 0.021433 | 0.010101 | 0.895298 | 1.355716 | ... | 0.024278 | -0.250439 | 1.0 | -0.017368 | -0.022395 | 0.003941 | 0.206048 | 0.098032 | 0.118231 | good_gain |
| 2 | 0f33be7a-4627-4057-a8e5-5845991e7d77 | 4.1085606e-001 | 47b0d820-e6bb-4482-9422-972f20c3d0bd | 9 | 99 | 11.0 | 0.014549 | 0.0 | 0.480805 | 1.331217 | ... | 1.0 | -0.46889 | 1.0 | 0.001482 | 0.018816 | 0.165079 | -0.009627 | 0.661532 | 0.655164 | loss |
| 3 | 108e3ab1-a27b-44f4-bc31-2a3732d775d0 | 1.0913311e-001 | 50da1264-30ba-45d2-9898-29686873b921 | 19 | 999 | 52.578947 | 0.001559 | 0.004004 | 0.752183 | 1.245622 | ... | 0.183599 | -0.428453 | 0.414414 | 0.002859 | 0.003995 | -0.025124 | -0.007954 | 0.334363 | 0.331704 | loss |
| 5 | 60040170-eb69-4e6b-afab-a7b201eb4728 | class | 16ff95e0-84d5-441b-bcc9-75f9cd5d96d3 | 9 | 270 | 30.0 | 0.66129 | 0.007407 | 1.262893 | 1.964043 | ... | 0.097627 | -0.187404 | 1.0 | 0.012919 | 0.000148 | 0.075 | -0.000191 | 0.435986 | 0.435903 | loss |
| 6 | e1989182-d772-4410-94b4-9e31f1973d00 | fat | 44666071-33bf-4ea2-bbc8-cc08ccab3fe0 | 60 | 240 | 4.0 | 62.845375 | 0.020833 | 1.231087 | 6.12341 | ... | 0.946263 | -0.960529 | 1.0 | 0.030768 | 0.021367 | 0.095869 | -0.000313 | 0.985557 | 0.985248 | loss |


In [106]:
rf_filtered_dataset_2 = create_model(filtered_dataset_2)

In [107]:
college_preds_filtered_dataset_2 = rf_filtered_dataset_2.predict(college[FEATURES])
print(classification_report(college[TARGET], college_preds_filtered_dataset_2))
print(len([pred for pred in college_preds_filtered_dataset_2 if pred == 'good_gain']))

              precision    recall  f1-score   support

   good_gain       0.63      0.72      0.67       130
        loss       0.96      0.94      0.95       973

   micro avg       0.92      0.92      0.92      1103
   macro avg       0.79      0.83      0.81      1103
weighted avg       0.92      0.92      0.92      1103

150


In [110]:
test_for_filtered_dataset_2_preds = rf_filtered_dataset_2.predict(test_for_dataset_2[FEATURES])
print(classification_report(test_for_dataset_2[TARGET], test_for_filtered_dataset_2_preds))
print(len([pred for pred in test_for_filtered_dataset_2_preds if pred == 'good_gain']))

              precision    recall  f1-score   support

   good_gain       0.73      0.48      0.58      2496
        loss       0.51      0.75      0.61      1780

   micro avg       0.59      0.59      0.59      4276
   macro avg       0.62      0.61      0.59      4276
weighted avg       0.64      0.59      0.59      4276

1623


#### What if we filter the synthetic test?

In [111]:
filtered_test_for_dataset_2 = test_for_dataset_2.loc[(test_for_dataset_2['gain_in_r2_score'] > 0.025) | (test_for_dataset_2['gain_in_r2_score'] < 0)]
filtered_test_for_filtered_dataset_2_preds = rf_filtered_dataset_2.predict(filtered_test_for_dataset_2[FEATURES])
print(classification_report(filtered_test_for_dataset_2[TARGET], filtered_test_for_filtered_dataset_2_preds))
print(len([pred for pred in filtered_test_for_filtered_dataset_2_preds if pred == 'good_gain']))

              precision    recall  f1-score   support

   good_gain       0.70      0.65      0.67      1560
        loss       0.71      0.75      0.73      1768

   micro avg       0.70      0.70      0.70      3328
   macro avg       0.70      0.70      0.70      3328
weighted avg       0.70      0.70      0.70      3328

1448


#### OK. So the results become more compatible with the college use case when we filter *both* the training data and the synthetic test data. FYI, the maximum gain_in_r2_score for college is 0.36 and the minimum is -0.10. There's a large concentration of gains around zero and slightly above -0.1.
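Since the same interval is dropped from both the training and the test data, the mask can live in one helper. This is only a sketch: `drop_gain_interval` is a hypothetical name, the toy frame is made up, and the default bounds mirror the 0 and 0.025 thresholds used in the cells above.

```python
import pandas as pd

def drop_gain_interval(data, low=0.0, high=0.025, column='gain_in_r2_score'):
    """Keep rows whose gain is below `low` or above `high`.

    Rows with low <= gain <= high (the ambiguous near-zero gains) are dropped.
    """
    gain = data[column]
    return data.loc[(gain > high) | (gain < low)]

# Made-up gains, including both boundary values of the interval.
toy = pd.DataFrame({'gain_in_r2_score': [-0.1, 0.0, 0.01, 0.025, 0.2]})
filtered = drop_gain_interval(toy)
print(filtered['gain_in_r2_score'].tolist())
```

Applied to the notebook's frames, the same call would serve for `dataset_2` and `test_for_dataset_2`, which keeps the two filters from drifting apart.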

#### My hunch is that the problem is not gains close to zero but features that overlap across classes. Using t-SNE, I noticed that the features 'query_row_column_ratio', 'candidate_target_max_pearson', 'candidate_target_max_spearman', 'max_pearson_difference', and 'containment_fraction' seem to separate the data better than all features taken together. That said, when we train on dataset_2 with just these features, the results are very similar.
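For reference, a projection restricted to the five features named above could be produced along these lines. This is a hedged sketch, not the original analysis: the feature matrix is random stand-in data, and the `perplexity` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.manifold import TSNE

TOP_FEATURES = ['query_row_column_ratio', 'candidate_target_max_pearson',
                'candidate_target_max_spearman', 'max_pearson_difference',
                'containment_fraction']

rng = np.random.RandomState(42)
# Made-up matrix: 60 instances, one column per top feature. With the real
# data this would be something like data[TOP_FEATURES].values.
X_top = rng.normal(size=(60, len(TOP_FEATURES)))

# perplexity must be smaller than the number of instances.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=42).fit_transform(X_top)
print(embedding.shape)
```

Coloring the two embedding columns by class in a scatter plot is then enough to eyeball how much the classes overlap.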

#### Let me use HDBSCAN and see if I can remove instances from clusters with high entropy, using both top and all features...

#### For college, clustering on just the top five features yields 30 clusters, one of which is a bit dirty: 109 instances, 77% positive and 23% negative. With all features, we get 65 clusters; some of them are dirty, but none is as big as that one. With all features there is also a "non-cluster" (label = -1) with 285 instances. In other words, around 26% of the 1103 instances are too messy to be assigned to a proper cluster, so maybe these features simply don't separate these cases well. In this "non-cluster", 12% of the instances are positive and 88% are negative. With just the top features, the "non-cluster" has 200 instances (18% of the data), 4% positive and 96% negative, i.e., it is not really "dirty" even though its instances were left unclustered (maybe by chance?).
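One way to quantify how "dirty" a cluster is would be to compute its class fractions and Shannon entropy. The sketch below assumes the cluster labels have already been produced (for example by HDBSCAN, where -1 marks unclustered points); the labels and classes are made up, and `cluster_purity` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict

def cluster_purity(cluster_labels, classes, positive='good_gain'):
    """Return {cluster: (size, positive_fraction, entropy_in_bits)}."""
    members = defaultdict(list)
    for cluster, cls in zip(cluster_labels, classes):
        members[cluster].append(cls)
    summary = {}
    for cluster, cls_list in members.items():
        size = len(cls_list)
        fractions = [n / size for n in Counter(cls_list).values()]
        # Shannon entropy: 0 for a pure cluster, 1 bit for a 50/50 split.
        entropy = sum(p * math.log2(1 / p) for p in fractions)
        positive_fraction = cls_list.count(positive) / size
        summary[cluster] = (size, positive_fraction, entropy)
    return summary

# Made-up labels: cluster 0 is mostly positive, cluster 1 is pure,
# and the unclustered points (-1) are split evenly.
labels = [0, 0, 0, 0, 1, 1, -1, -1]
classes = ['good_gain', 'good_gain', 'good_gain', 'loss',
           'loss', 'loss', 'good_gain', 'loss']
summary = cluster_purity(labels, classes)
for cluster, (size, pos, ent) in sorted(summary.items()):
    print(cluster, size, round(pos, 2), round(ent, 2))
```

Sorting the summary by entropy (or by size times entropy) would surface large mixed clusters such as the 109-instance one first.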

#### When we use all features to cluster dataset_2, we get 494 clusters; with the top features, 450. With the top features, 2462 out of 9884 instances (about 25%) receive the label -1 (the label for samples that look too noisy to be assigned to any cluster). This "cluster" is 62% positive and 38% negative. With all features, the "cluster" with label -1 has 2342 instances: 54% positive and 46% negative. If we remove these very messy instances, what do we get for college? With all features, the results don't really get any better, not even when we also remove other messy clusters (i.e., those where the proportional_difference between positive and negative fractions is below 0.6). With only the top features, the results for college are only marginally better.
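The pruning described above could look like the following sketch. Two assumptions are made here: `proportional_difference` is read as the absolute difference between the positive and negative fractions of a cluster, and the `cluster` column is assumed to hold precomputed labels with -1 for unclustered instances. `drop_messy_clusters` and the toy frame are hypothetical.

```python
import pandas as pd

def drop_messy_clusters(data, min_difference=0.6, cluster_col='cluster',
                        target_col='class', positive='good_gain'):
    """Drop unclustered rows (label -1) and rows in mixed clusters."""
    is_positive = (data[target_col] == positive).astype(float)
    # Fraction of positive instances in each row's cluster.
    positive_fraction = is_positive.groupby(data[cluster_col]).transform('mean')
    # For two classes: |pos - neg| = |pos - (1 - pos)| = |2 * pos - 1|.
    difference = (2 * positive_fraction - 1).abs()
    keep = (data[cluster_col] != -1) & (difference >= min_difference)
    return data.loc[keep]

# Made-up data: cluster 0 is mostly positive, cluster 1 is a 50/50 mix,
# and two instances are unclustered.
toy = pd.DataFrame({
    'cluster': [0] * 6 + [1, 1] + [-1, -1],
    'class': ['good_gain'] * 5 + ['loss']
             + ['good_gain', 'loss']
             + ['loss', 'loss'],
})
cleaned = drop_messy_clusters(toy)
print(sorted(cleaned['cluster'].unique()))
```

The cleaned frame could then be passed to `create_model` in place of the full training set.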

In [1]:
# This cell ran in a fresh kernel, so the import has to be repeated.
from sklearn.ensemble import RandomForestClassifier

rf_just_five_features = RandomForestClassifier(n_estimators=100, random_state=42)