# Ensembles

Question: Can we use data from minority solvents? Does it add value or relevant information?

Description:
- 3 solvents available: one majority and two minority
- sample from the majority and from each minority: 3 samples
- train on the majority
- evaluate on the 3 samples and compare results: is there any particular improvement on the majority?
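
The protocol above can be sketched on synthetic data. This is only an illustration under stated assumptions: the column names `solvent` and `crystallized`, the feature names, and the choice of `RandomForestClassifier` are hypothetical and do not come from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)

# Synthetic stand-in for the reaction data; 'solvent' and 'crystallized'
# are hypothetical column names, not the notebook's real schema.
n = 600
df = pd.DataFrame({
    'f1': rng.normal(size=n),
    'f2': rng.normal(size=n),
    'solvent': rng.choice(['major', 'minor_a', 'minor_b'], size=n, p=[0.7, 0.15, 0.15]),
})
noise = rng.normal(scale=0.5, size=n)
df['crystallized'] = (df['f1'] + 0.5 * df['f2'] + noise > 0).astype(int)
features = ['f1', 'f2']

# Hold out part of the majority solvent so it can be evaluated too.
major = df[df['solvent'] == 'major']
major_train = major.sample(frac=0.7, random_state=0)
major_eval = major.drop(major_train.index)

# Train on the majority solvent only.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(major_train[features], major_train['crystallized'])

# Evaluate on one sample per solvent and compare.
eval_sets = {
    'major': major_eval,
    'minor_a': df[df['solvent'] == 'minor_a'],
    'minor_b': df[df['solvent'] == 'minor_b'],
}
scores = {
    name: matthews_corrcoef(part['crystallized'], model.predict(part[features]))
    for name, part in eval_sets.items()
}
print(scores)
```

Here all three solvents share the same synthetic labelling rule, so the scores should be similar; with real data, a gap between the majority and minority scores would be the quantity of interest.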

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from src.config import chemical_inventory_path, raw_data_path
from src.data import notebook_utils as utils
from src.constants import GBL_INCHI_KEY, DMSO_INCHI_KEY, DMF_INCHI_KEY, \
                        INCHI_TO_CHEMNAME, TARGET_COL, RXN_FEAT_NAME, ORGANOAMONIUM_INCHI_KEY_COL
from src import plot_utils

In [6]:
import matplotlib.ticker as mtick
import matplotlib.patches as mpatches
plt.style.reload_library()  # plt is already imported in the first cell

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, GroupKFold, StratifiedShuffleSplit
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report, confusion_matrix
import sklearn.ensemble as ensamble_models
import sklearn.neighbors as neighbors_models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import linear_model as linear_models
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import sklearn.svm as svm

from imblearn.metrics import classification_report_imbalanced

In [4]:
%cd ../..

/Users/mticona/Documents/tesis/licentiate-thesis-repo


In [9]:
PATH_DATA = 'data/ensemble/'
FILE_EVAL = 'validation_set.csv'
FILE_TEST_TRAIN = 'train_test_set.csv'

SEED = 2

In [10]:
df_test_train = pd.read_csv(PATH_DATA+FILE_TEST_TRAIN)

df_test = pd.read_csv(PATH_DATA+FILE_EVAL)

In [11]:
df_test.shape

(567, 63)

In [12]:
df_test_train.shape

(5077, 63)

In [13]:
df_test.columns

Index(['Unnamed: 0', '_feat_WienerPolarity', '_feat_BondCount', '_feat_fr_NH0',
       '_feat_Refractivity', '_feat_LargestRingSize',
       '_feat_HeteroaliphaticRingCount', '_feat_fr_quatN',
       '_feat_AromaticAtomCount', '_feat_AtomCount_C', '_feat_fr_amidine',
       '_feat_CyclomaticNumber', '_feat_LengthPerpendicularToTheMinArea',
       '_feat_fr_guanido', '_feat_donorcount', '_feat_fr_NH2',
       '_feat_minimalprojectionsize', '_feat_AtomCount_N', '_feat_WienerIndex',
       '_feat_AvgPol', '_feat_donsitecount', '_feat_Hacceptorcount',
       '_feat_ASA-', '_feat_fr_Ar_NH', '_feat_HeteroaromaticRing Count',
       '_feat_Accsitecount', '_feat_acceptorcount', '_feat_ASA',
       '_feat_CarboaromaticRingCount', '_feat_BalabanIndex',
       '_feat_SmallestRingSize', '_feat_RingAtomCount',
       '_feat_PolarSurfaceArea', '_feat_MinimalProjectionArea',
       '_feat_MaximalProjectionArea', '_feat_CarboRingCount',
       '_feat_CarboaliphaticRingCount', '_feat_VanderWaalsSurface

In [14]:
df_test_train = df_test_train.drop(['Unnamed: 0', '_rxn_organic-inchikey'], axis=1)
df_test = df_test.drop(['Unnamed: 0', '_rxn_organic-inchikey'], axis=1)

Studying perovskite crystallization with machine learning runs into challenges that are common when working with experimental data, whose availability is usually very limited because the data are hard to generate. Hence the relevance of studying techniques that yield models that are robust and not biased by the way a small number of samples is represented. This chapter studies the application of the ensemble paradigm by comparing traditional homogeneous models based on decision trees with heterogeneous models specific to the perovskite crystallization problem.

## Model definitions

Within the ensemble paradigm, methods that use decision trees as the base estimator are the ones that have gained the most use and popularity over the last decade. Characteristics such as handling categorical data, insensitivity to standardization, and XX make these algorithms attractive for their versatility and for the number of problems to which they can be applied.

In general, tree-based methods are sensitive to the particularities of the training set, which leads to overfitting. That is why the most commonly used ensemble methods rely on this type of estimator.
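
The sensitivity of a single tree to its training set can be illustrated with a small sketch on synthetic data. The dataset produced by `make_classification` and all hyperparameters below are assumptions chosen for illustration, not the notebook's data or tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic problem: flip_y randomly flips about 10% of the labels.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A fully grown tree memorises the training set, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A forest averages many trees fit on bootstrap resamples.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, est in [('tree', tree), ('forest', forest)]:
    print(name,
          'train acc:', round(est.score(X_train, y_train), 3),
          'test acc:', round(est.score(X_test, y_test), 3))
```

The gap between train and test accuracy of the single tree is the overfitting described above; averaging over resampled trees typically narrows it, although the exact numbers depend on the data.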

# compare rf, gbc, 

Currently, among the most popular algorithms for their strong generalization ability 
Traditionally, 


In [15]:
def make_model(model_name, model_config=None):
    """Instantiate a scikit-learn estimator from its class name."""
    # Try each candidate module in order; svm is the last fallback.
    for module in (neighbors_models, ensamble_models, linear_models, svm):
        if module is svm:
            print('svm')  # debug trace: lookup reached the svm fallback
        model_method = getattr(module, model_name, None)
        if model_method is not None:
            return model_method(**(model_config or {}))
    # Fail loudly instead of hitting an undefined variable.
    raise ValueError(f'Unknown model name: {model_name}')
                
def split_X_y(df):
    X = df.drop([TARGET_COL], axis=1).values
    y = df[TARGET_COL].values
    return X, y

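def get_out_groups(df):
    # One integer group id per row, shared by rows with the same organoammonium.
    # ngroup() replaces the private grouper.group_info API and does not mutate df.
    groups = df.groupby(ORGANOAMONIUM_INCHI_KEY_COL).ngroup().tolist()
    df = df.drop([ORGANOAMONIUM_INCHI_KEY_COL], axis=1)
    return df, groups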

def proof_concept(model_name, df_train, df_test, model_config = {}):
    model = make_model(model_name, model_config)
    
    X_test, y_test = split_X_y(df_test)
    X_train, y_train = split_X_y(df_train)
    
    pipeline_steps = [('std', StandardScaler()), 
                      ('model', model)
                     ]
    
    pipeline = Pipeline(steps=pipeline_steps)
    
    pipeline.fit(X_train, y_train)
    
    y_pred =  pipeline.predict(X_test)
    
    matt = matthews_corrcoef(y_test, y_pred)
    
    report = classification_report(y_test, y_pred, labels=[0,1], 
                                   output_dict=True, target_names=["No cristaliza", "Cristaliza"])
    
    report_df =  pd.DataFrame(report).transpose()
    
    report_df["matthew"] = matt
    
    return report_df

def proof_concept_model(model_name, df_train, df_test, model_config = {}):
    model = make_model(model_name, model_config)
    
    X_test, y_test = split_X_y(df_test)
    X_train, y_train = split_X_y(df_train)
    
    pipeline_steps = [('std', StandardScaler()), 
                      ('model', model)
                     ]
    
    pipeline = Pipeline(steps=pipeline_steps)
    
    pipeline.fit(X_train, y_train)
    
    return pipeline
    
def full_pipeline(model_name, df, params):
    
    model = make_model(model_name)
    
    X, y = split_X_y(df)
    
    pipeline_steps = [('std', StandardScaler()), 
                      ('model', model)
                     ]
    pipeline = Pipeline(steps=pipeline_steps)
    
    k_fold_config = params['k_fold_config']
    
    params_search = params['params_search']
    
    cv = StratifiedShuffleSplit(**k_fold_config)

    scoring={
            'recall': 'recall', 
            'f1': 'f1',
            'precision': 'precision',
            'matthew': make_scorer(matthews_corrcoef)
    }
    optimizing_metric = 'matthew'
    
    clf = RandomizedSearchCV(pipeline,
                             params_search, 
                             cv=cv,
                             scoring=scoring,
                             random_state=SEED,
                             n_jobs=-1,
                             refit=optimizing_metric,
                             error_score=0,
                             n_iter=params['n_iter'],
                            )
    clf.fit(X, y)
    
    return clf


def full_voting_pipeline(models_params, df, voting='soft'):
    models_voting = []
    for model_name, args in models_params.items():
        model = make_model(models[model_name], args)
        pipeline_steps = [
            ('std', StandardScaler()), 
            ('model', model)
            ]
        pipeline = Pipeline(steps=pipeline_steps)
        models_voting.append((model_name,pipeline))
        
    voting_clf = VotingClassifier(estimators=models_voting, voting=voting, n_jobs=-1)
        
    X, y = split_X_y(df)
    
    # No hyperparameter search here: the ensemble is fit once on the full set.
    voting_clf.fit(X, y)
    
    return voting_clf

def run_evaluation_voting(models_params, df, df_test, 
                          voting='soft',
                          file_name='no_file_name.csv'):
    models_voting = []
    df_reports = []
    X, y = split_X_y(df)
    X_test, y_test = split_X_y(df_test)
    
    for model_name, args in models_params.items():
        model = make_model(models[model_name], args)
        pipeline_steps = [
            ('std', StandardScaler()), 
            ('model', model)
            ]
        pipeline = Pipeline(steps=pipeline_steps)
        
        pipeline.fit(X, y)
        y_pred = pipeline.predict(X_test)
        
        df_class = pd.DataFrame.from_dict(classification_report_imbalanced(y_test, y_pred,
                                                                   target_names=['No cristaliza', 'Cristaliza'],
                                                                   output_dict=True)).T

        df_class['matthew'] = matthews_corrcoef(y_test, y_pred)

        df_report = df_class.reset_index().rename({'index':'metrica'}, axis=1)
        df_report['estimator'] = model_name
        df_reports.append(df_report)
        
    
    
    # Forward the voting mode so the 'voting_' + voting label matches the model.
    clf_voting = full_voting_pipeline(models_params, df, voting=voting)
    y_pred = clf_voting.predict(X_test)
    
    df_class = pd.DataFrame.from_dict(classification_report_imbalanced(y_test, y_pred,
                                                                   target_names=['No cristaliza', 'Cristaliza'],
                                                                   output_dict=True)).T

    df_class['matthew'] = matthews_corrcoef(y_test, y_pred)

    df_report = df_class.reset_index().rename({'index':'metrica'}, axis=1)

    df_report['estimator'] = 'voting_' + voting

    df_reports.append(df_report)
    
    df_final_report = pd.concat(df_reports, axis=0)
    
    df_final_report.to_csv(results_path+file_name, index=None)
    
    return df_final_report    
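
To make the difference between the two `voting` modes concrete, here is a minimal sketch on synthetic data. The three estimators and their settings are assumptions for illustration, not the tuned configurations used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

def build_estimators():
    # SVC needs probability=True to take part in soft voting.
    return [
        ('lg', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ]

# Hard voting: majority of the predicted labels.
hard = VotingClassifier(build_estimators(), voting='hard').fit(X, y)
# Soft voting: argmax of the averaged class probabilities.
soft = VotingClassifier(build_estimators(), voting='soft').fit(X, y)

# Reproduce soft voting by hand from the fitted members.
mean_proba = np.mean([est.predict_proba(X) for est in soft.estimators_], axis=0)
manual_soft = soft.classes_[np.argmax(mean_proba, axis=1)]

print('soft == manual:', np.array_equal(manual_soft, soft.predict(X)))
print('hard/soft disagreements:', int(np.sum(hard.predict(X) != soft.predict(X))))
```

Soft voting lets a confident estimator outweigh two hesitant ones, which is why the two modes can disagree on individual samples even with identical members.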

In [61]:
results_path = 'results/ensamble/grid_search_single_estimators/'

models = {
    'knn':'KNeighborsClassifier',
    'lg': 'LogisticRegression',
    'rf':'RandomForestClassifier',
    'gbc':'GradientBoostingClassifier',
    'svm':'SVC',
    'svm_linear': 'SVC',
    'svm_poly': 'SVC',
    'svm_rbf': 'SVC',
    'bagg': 'BaggingClassifier'
}


grid_params_lg_default = dict(
                      dual=False,
                      class_weight='balanced',
                      #model__penalty=['l1','l2'],
                      penalty='l1',
                      random_state=SEED,
                      solver='saga',
                      n_jobs=-1
                     )

grid_params_rf_default = dict(
                      min_samples_split=10,
                      #model__min_samples_split=[10,7,15,20],
                      min_samples_leaf=3,
                      warm_start=False,
                     )

grid_params_gbc_default = dict(
                      subsample=0.9,
                      #model__min_samples_split=[10,7,15,20],
                      min_samples_split=7,
                      min_samples_leaf=5,
                      random_state=SEED,
                     )


grid_params_svm_default = dict(
                            class_weight='balanced',
                            degree=3,
                            probability=True,
                            random_state=SEED)

grid_params_knn_default = dict(
                            n_jobs=-1,
                            weights='distance')


grid_params_bagg_default = dict(
                            model__n_jobs=-1,
                            model__weights='distance',
)

k_splits = 2

k_fold_config_default = {
    'random_state': SEED,
    #'shuffle': True,
    'n_splits': k_splits
}

params_example = {
    'k_fold_config': k_fold_config_default,
    'params_search': 130,
    'n_iter': 50
}
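
`full_pipeline` passes `params['params_search']` to `RandomizedSearchCV`, which expects a dictionary of parameter distributions keyed with the `<step>__<param>` convention of the pipeline. The sketch below shows one possible shape for it; the search space, the synthetic data and the split sizes are hypothetical.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (roughly 80/20), standing in for the real set.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

pipeline = Pipeline([('std', StandardScaler()),
                     ('model', RandomForestClassifier(random_state=0))])

# Hypothetical search space; 'model__' targets the 'model' pipeline step.
params_search = {
    'model__n_estimators': randint(20, 80),
    'model__max_depth': randint(3, 10),
}

cv = StratifiedShuffleSplit(n_splits=2, test_size=0.25, random_state=0)
search = RandomizedSearchCV(
    pipeline, params_search, n_iter=5, cv=cv,
    scoring={'matthew': make_scorer(matthews_corrcoef), 'recall': 'recall'},
    refit='matthew',  # the best candidate is chosen by MCC, as in full_pipeline
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```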

In [122]:
# KNN
grid_params_knn = grid_params_knn_default.copy()
grid_params_knn['n_neighbors'] = 6

# LG
grid_params_lg = grid_params_lg_default.copy()
grid_params_lg['C'] = 0.15
grid_params_lg['max_iter'] = 3000

# SVM linear
grid_params_svm_linear = grid_params_svm_default.copy()
grid_params_svm_linear['C'] = 10
grid_params_svm_linear['kernel']='linear'

# SVM poly
grid_params_svm_poly = grid_params_svm_default.copy()
grid_params_svm_poly['C'] = 10
grid_params_svm_poly['kernel']='poly'

# SVM rbf
grid_params_svm_rbf = grid_params_svm_default.copy()
grid_params_svm_rbf['C'] = 50
grid_params_svm_rbf['kernel']='rbf'

# GBC
grid_params_gbc = grid_params_gbc_default.copy()
grid_params_gbc['n_estimators'] = 110
grid_params_gbc['max_depth'] = 7
grid_params_gbc['learning_rate'] = 0.15

# RF
grid_params_rf = grid_params_rf_default.copy()
grid_params_rf['n_estimators'] = 60
grid_params_rf['class_weight']='balanced'
grid_params_rf['max_depth'] = 8

# Bagging base estimator

grid_params_bagg = {}
grid_params_bagg['n_estimators'] = 300
#grid_params_bagg['base_estimator'] = [make_model(models[base_estimator_name], model_config=base_estimator_param)]
grid_params_bagg['n_jobs'] = -1
grid_params_bagg['random_state'] = SEED

    
models_params_simple_estimators = {
  #  'knn': grid_params_knn,
    'lg': grid_params_lg,
    'svm_linear': grid_params_svm_linear,
    'svm_poly': grid_params_svm_poly,
    'svm_rbf': grid_params_svm_rbf,
   'rf': grid_params_rf,
   'gbc': grid_params_gbc,
}

models_params_ensamble_estimators = {
    'rf': grid_params_rf,
    'gbc': grid_params_gbc,
}

In [123]:
def confusion_matrix_scorer(y, y_pred):
    cm = confusion_matrix(y, y_pred)
    return {'tn': cm[0, 0], 'fp': cm[0, 1],
            'fn': cm[1, 0], 'tp': cm[1, 1]}
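
As a small worked example, the four confusion-matrix counts are enough to recover the Matthews correlation coefficient used throughout the reports. The labels below are made up for illustration.

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy labels, invented for this example.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

# MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(tn, fp, fn, tp)   # 4 1 1 2
print(mcc_manual)       # 7/15, about 0.4667
```

Unlike accuracy, MCC uses all four counts, so it stays informative when one class dominates, as with the 455 vs. 112 split in the evaluation set.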

In [68]:
def report_bagging_pipeline(models_params, df, df_test, voting='soft'):
    # Forward the voting mode; otherwise 'soft' is always used.
    clf_voting = full_voting_pipeline(models_params, df, voting=voting)
    X_test, y_test = split_X_y(df_test)
    y_pred = clf_voting.predict(X_test)
    
    df_class = pd.DataFrame.from_dict(classification_report_imbalanced(y_test, y_pred,
                                                                   target_names=['No cristaliza', 'Cristaliza'],
                                                                   output_dict=True)).T

    df_class['matthew'] = matthews_corrcoef(y_test, y_pred)

    df_report = df_class.reset_index().rename({'index':'metrica'}, axis=1)

    df_report['estimator'] = 'voting_' + voting
    return df_report

In [69]:
def report_pipeline(models_params, df, df_test, voting='soft'):
    # Forward the voting mode; otherwise 'soft' is always used.
    clf_voting = full_voting_pipeline(models_params, df, voting=voting)
    X_test, y_test = split_X_y(df_test)
    y_pred = clf_voting.predict(X_test)
    
    df_class = pd.DataFrame.from_dict(classification_report_imbalanced(y_test, y_pred,
                                                                   target_names=['No cristaliza', 'Cristaliza'],
                                                                   output_dict=True)).T

    df_class['matthew'] = matthews_corrcoef(y_test, y_pred)

    df_report = df_class.reset_index().rename({'index':'metrica'}, axis=1)

    df_report['estimator'] = 'voting_' + voting
    return df_report

In [188]:
cm = confusion_matrix_scorer(y_test, y_pred)

In [124]:
single_voting_estimators = run_evaluation_voting(models_params_simple_estimators, df_test_train, df_test,
                                                 file_name="all_svm_con_gbc_rf.csv")

svm
svm
svm
svm
svm
svm


In [119]:
single_voting_estimators.tail(30)

| | metrica | pre | rec | spe | f1 | geo | iba | sup | matthew | estimator |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | avg_geo | 0.766306 | 0.766306 | 0.766306 | 0.766306 | 0.766306 | 0.766306 | 0.766306 | 0.449199 | svm_linear |
| 7 | avg_iba | 0.576173 | 0.576173 | 0.576173 | 0.576173 | 0.576173 | 0.576173 | 0.576173 | 0.449199 | svm_linear |
| 8 | total_support | 567.0 | 567.0 | 567.0 | 567.0 | 567.0 | 567.0 | 567.0 | 0.449199 | svm_linear |
| 0 | 0 | 0.959786 | 0.786813 | 0.866071 | 0.864734 | 0.825492 | 0.676035 | 455.0 | 0.547893 | rf |
| 1 | 1 | 0.5 | 0.866071 | 0.786813 | 0.633987 | 0.825492 | 0.686837 | 112.0 | 0.547893 | rf |
| 2 | avg_pre | 0.868964 | 0.868964 | 0.868964 | 0.868964 | 0.868964 | 0.868964 | 0.868964 | 0.547893 | rf |
| 3 | avg_rec | 0.802469 | 0.802469 | 0.802469 | 0.802469 | 0.802469 | 0.802469 | 0.802469 | 0.547893 | rf |
| 4 | avg_spe | 0.850415 | 0.850415 | 0.850415 | 0.850415 | 0.850415 | 0.850415 | 0.850415 | 0.547893 | rf |
| 5 | avg_f1 | 0.819155 | 0.819155 | 0.819155 | 0.819155 | 0.819155 | 0.819155 | 0.819155 | 0.547893 | rf |
| 6 | avg_geo | 0.825492 | 0.825492 | 0.825492 | 0.825492 | 0.825492 | 0.825492 | 0.825492 | 0.547893 | rf |


In [96]:
single_voting_estimators

| | metrica | pre | rec | spe | f1 | geo | iba | sup | matthew | estimator |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.913607 | 0.92967 | 0.642857 | 0.921569 | 0.773075 | 0.614786 | 455.0 | 0.588985 | knn |
| 1 | 1 | 0.692308 | 0.642857 | 0.92967 | 0.666667 | 0.773075 | 0.580504 | 112.0 | 0.588985 | knn |
| 2 | avg_pre | 0.869893 | 0.869893 | 0.869893 | 0.869893 | 0.869893 | 0.869893 | 0.869893 | 0.588985 | knn |
| 3 | avg_rec | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.588985 | knn |
| 4 | avg_spe | 0.699512 | 0.699512 | 0.699512 | 0.699512 | 0.699512 | 0.699512 | 0.699512 | 0.588985 | knn |
| 5 | avg_f1 | 0.871218 | 0.871218 | 0.871218 | 0.871218 | 0.871218 | 0.871218 | 0.871218 | 0.588985 | knn |
| 6 | avg_geo | 0.773075 | 0.773075 | 0.773075 | 0.773075 | 0.773075 | 0.773075 | 0.773075 | 0.588985 | knn |
| 7 | avg_iba | 0.608015 | 0.608015 | 0.608015 | 0.608015 | 0.608015 | 0.608015 | 0.608015 | 0.588985 | knn |
| 8 | total_support | 567.0 | 567.0 | 567.0 | 567.0 | 567.0 | 567.0 | 567.0 | 0.588985 | knn |
| 0 | 0 | 0.941176 | 0.668132 | 0.830357 | 0.781491 | 0.744841 | 0.545788 | 455.0 | 0.400843 | lg |


In [249]:
df_report_hard = report_pipeline(models_params, df_test_train, df_test, voting='hard')

svm
svm
svm


In [234]:
df_reports_estimators = run_evaluation_voting(models_params, df_test_train, df_test)

svm
svm
svm


In [251]:
ensamble_report = pd.concat([df_report_soft, df_report_hard, df_reports_estimators], axis=0)

In [273]:
ensamble_report[~ensamble_report['metrica'].isin(['total_support', 0])].sort_values('rec', ascending=False)

| | metrica | pre | rec | spe | f1 | geo | iba | sup | matthew | estimator |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.381818 | 0.9375 | 0.626374 | 0.542636 | 0.766306 | 0.605495 | 112.0 | 0.449199 | svm_linear |
| 1 | 1 | 0.437768 | 0.910714 | 0.712088 | 0.591304 | 0.8053 | 0.66139 | 112.0 | 0.50398 | svm_poly |
| 1 | 1 | 0.471963 | 0.901786 | 0.751648 | 0.619632 | 0.823302 | 0.688002 | 112.0 | 0.536688 | svm_rbf |
| 3 | avg_rec | 0.883598 | 0.883598 | 0.883598 | 0.883598 | 0.883598 | 0.883598 | 0.883598 | 0.607005 | gbc |
| 5 | avg_f1 | 0.878109 | 0.878109 | 0.878109 | 0.878109 | 0.878109 | 0.878109 | 0.878109 | 0.607005 | gbc |
| 2 | avg_pre | 0.877473 | 0.877473 | 0.877473 | 0.877473 | 0.877473 | 0.877473 | 0.877473 | 0.607005 | gbc |
| 4 | avg_spe | 0.876043 | 0.876043 | 0.876043 | 0.876043 | 0.876043 | 0.876043 | 0.876043 | 0.449199 | svm_linear |
| 1 | 1 | 0.485149 | 0.875 | 0.771429 | 0.624204 | 0.821584 | 0.681991 | 112.0 | 0.537419 | rf |
| 3 | avg_rec | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.873016 | 0.576909 | voting_soft |
| 4 | avg_spe | 0.872129 | 0.872129 | 0.872129 | 0.872129 | 0.872129 | 0.872129 | 0.872129 | 0.536688 | svm_rbf |


In [267]:
cols=['estimator', 'matthew', 'rec']
ensamble_report[cols].drop_duplicates().sort_values('rec', ascending=False)[8:].head(15)

| | estimator | matthew | rec |
|---|---|---|---|
| 0 | gbc | 0.607005 | 0.953846 |
| 0 | voting_hard | 0.56979 | 0.940659 |
| 0 | voting_soft | 0.576909 | 0.940659 |
| 1 | svm_linear | 0.449199 | 0.9375 |
| 1 | svm_poly | 0.50398 | 0.910714 |
| 1 | svm_rbf | 0.536688 | 0.901786 |
| 3 | gbc | 0.607005 | 0.883598 |
| 5 | gbc | 0.607005 | 0.878109 |
| 2 | gbc | 0.607005 | 0.877473 |
| 4 | svm_linear | 0.449199 | 0.876043 |
