## **A**utomated **L**earning for **I**nsightful **C**omparison and **E**valuation - (ALICE)

This demo notebook walks through the main functionality of the proposed Python framework.

For demonstration purposes, I use a small, simple dataset called California Housing, which has 8 predictors and 1 target (`MedHouseVal`, the median house value). Eight predictors is few for our purposes, but a small dataset lets the framework run quickly, which makes it well suited to a demo.

Because all of the variables in the dataset are continuous, I derive a binary column from `MedHouseVal` that indicates whether the house value is at or above the sample mean.

I also generate fake categorical variables to demonstrate that the framework can treat the $n$ columns obtained from dummy-encoding one categorical variable as a single variable in the feature selection process.
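
To make the idea of treating dummy columns as one variable concrete, here is a minimal sketch. The helper names `group_dummies` and `flatten` are hypothetical and written only for this illustration; the framework's own helpers may work differently.

```python
# Hypothetical helpers for illustration only; they are not the alice API.

def group_dummies(feature_list, dummy_groups):
    """Replace each group of dummy columns with a single nested-list entry."""
    grouped = []
    seen = set()
    for name in feature_list:
        if name in seen:
            continue
        # Find the dummy group (if any) that this column belongs to
        group = next((g for g in dummy_groups if name in g), None)
        if group is None:
            grouped.append(name)
        else:
            grouped.append(list(group))
            seen.update(group)
    return grouped


def flatten(feature_list):
    """Expand nested dummy groups back into a flat list of column names."""
    flat = []
    for item in feature_list:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


columns = ['MedInc', 'HouseEval_nice', 'HouseEval_not_nice', 'HouseAge']
groups = [['HouseEval_nice', 'HouseEval_not_nice']]

# Three candidates for elimination: two plain columns and one dummy group
candidates = group_dummies(columns, groups)

# Dropping the categorical variable removes both of its dummy columns at once
remaining = [c for c in candidates if c != groups[0]]
print(flatten(remaining))
```

With this representation, a feature selection loop iterates over `candidates`, so a categorical variable is either kept or dropped as a whole.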

Further details are given in comments and markdown notes throughout the notebook.

In [1]:
# Import numpy for mathematical operations
import numpy as np
# Import pandas for handling data tables
import pandas as pd
# Import stats from SciPy, the core statistics package for Python (built on NumPy)
from scipy import stats
# Store the current working directory
import os
cur_dir = os.getcwd()



Loading the dataset. California Housing can be loaded through scikit-learn (`sklearn`), the go-to Python library for ML and predictive modeling.

In [2]:
## Here I load the dataset
from sklearn.datasets import fetch_california_housing

# I save the dataset as data
data = fetch_california_housing()

# Save predictors as variable X
X = pd.DataFrame(data=data.data, columns=data.feature_names)
# Save target as y
y = pd.DataFrame(data=data.target, columns=data.target_names)
# Combine X and y into a dataframe called df
df = pd.concat([X,y], axis=1)

The data table can be inspected by evaluating its name.

In [3]:
# A brief look at the dataframe
df

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [4]:
# Look at descriptive statistics
df.describe()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


Since every feature here is numerical, I generate two fake categorical features and encode them as dummies to demonstrate that the framework can handle a group of dummy columns together.

In [5]:
# Fake categorical labels for one categorical variable
fake_labels = ['nice', 'not_nice', 'mid']
# Generate the new column by randomly assigning one of the fake labels to each observation
df['HouseEval'] = np.random.choice(fake_labels, size=len(df))

# Second set of fake labels for the second categorical variable
fake_labels_2 = ['white', 'black', 'gray']
# Again generate set of fake labels for the second categorical variable
df['WallColors'] = np.random.choice(fake_labels_2, size=len(df))

In [6]:
# Get dummies for the two columns, dropping the first level of each as the base category
df = pd.get_dummies(df, columns=['HouseEval', 'WallColors'], drop_first=True)

In [7]:
# We can again look at the dataframe
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,HouseEval_nice,HouseEval_not_nice,WallColors_gray,WallColors_white
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526,1,0,1,0
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585,0,0,0,0
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521,0,1,0,1
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413,0,0,1,0
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422,0,1,0,0


For ease of demonstration, as mentioned above, I generate a second target variable. It is binary: 1 if the house value is at or above the sample mean and 0 otherwise.

Having two datasets, one with a continuous target and one with a binary target, lets us demonstrate the framework on both regression and classification tasks.

In [8]:
# Obtain mean of the target variable
mean_target = df['MedHouseVal'].mean()
# Create copy of the dataframe and name it df_discrete
df_discrete = df.copy()
# Generate the new target variable
df_discrete['AboveMean'] = (df_discrete['MedHouseVal'] >= mean_target).astype(int)
# From the new dataframe, drop the continuous target variable
df_discrete.drop('MedHouseVal', axis=1, inplace=True)

### Demonstration of the Framework.

Below I demonstrate the functionalities of the python framework I am working on.

The module is built so that you can either import the whole package in one go or import individual pieces (such as the functions that compute correlation or Cohen's kappa) separately.
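
For readers unfamiliar with the two agreement measures mentioned above, the following is a minimal pure-Python sketch of Pearson correlation (for regression predictions) and Cohen's kappa (for class predictions). The function names `pearson_sketch` and `cohen_kappa_sketch` are made up for this illustration; they are not the `alice` implementations, which may differ in details.

```python
import math

# Illustrative re-implementations; the actual alice functions may differ.

def pearson_sketch(a, b):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cohen_kappa_sketch(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    a, b = list(a), list(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies
    expected = sum((a.count(label) / n) * (b.count(label) / n)
                   for label in set(a) | set(b))
    # Undefined (division by zero) if both raters always emit one identical label
    return (observed - expected) / (1 - expected)


# Predictions from two hypothetical regression models
reg_m1 = [1.0, 2.0, 3.0, 4.0]
reg_m2 = [2.0, 4.0, 6.0, 8.0]
# Predictions from two hypothetical binary classifiers
clf_m1 = [1, 1, 0, 0, 1, 0, 1, 0]
clf_m2 = [1, 1, 0, 1, 1, 0, 0, 0]

r = pearson_sketch(reg_m1, reg_m2)
kappa = cohen_kappa_sketch(clf_m1, clf_m2)
print(f'Pearson r: {r:.3f}, Cohen kappa: {kappa:.3f}')
```

In this toy data the regression predictions are perfectly linearly related, so `r` is 1, while the classifiers agree on 6 of 8 labels against a chance level of 0.5, which gives a kappa of 0.5.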

**Entire library:**

In [9]:
# Can just import entire module
import alice

**Individual Functions:**

In [10]:
# Or import individual functions

# Import regression metrics
from alice.metrics.regress import mse, rmse, mae
# Import classification metrics
from alice.metrics.classify import accuracy, precision, recall, f1

# Import regression agreeability metric
from alice.agreeability.regress import pearson
# Import classification agreeability metric
from alice.agreeability.classify import cohen_kappa

# Import regression tests
from alice.testing.regress import t_test
# Import classification tests
from alice.testing.classify import mcnemar_binomial, mcnemar_chisquare

**Algorithm That Combines Backwards Feature Elimination with Inter-Rater Reliability:**

In [10]:
# Import our demo search algorithm 
from alice.search_and_compare.sequential import BackEliminator

For reference, the next notebook cell is a verbatim copy of the `BackEliminator` source code, which lives in the `sequential.py` file in the `search_and_compare` folder.
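
Before reading the full class, it may help to see the core idea in a few lines. The sketch below is a simplified stand-in, not the `BackEliminator` source: it assumes a regression task scored by RMSE, uses closed-form ridge regression as a placeholder for arbitrary models, and measures agreement with `np.corrcoef`. All function names (`ridge_predict`, `best_drop`, `compare_best`) and the synthetic data are hypothetical.

```python
import numpy as np


def ridge_predict(X_train, y_train, X_eval, alpha):
    """Closed-form ridge regression on centered data (alpha=0 gives OLS)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_eval - x_mean) @ coef + y_mean


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def best_drop(X, y, X_val, y_val, cols, alpha):
    """Try dropping each column in turn; return (dropped, score, preds) with the lowest RMSE."""
    candidates = []
    for col in cols:
        kept = [c for c in cols if c != col]
        preds = ridge_predict(X[:, kept], y, X_val[:, kept], alpha)
        candidates.append((col, rmse(y_val, preds), preds))
    return min(candidates, key=lambda t: t[1])


def compare_best(X, y, X_val, y_val, alpha_1, alpha_2):
    """Run backward elimination for two models and record their agreement per round."""
    cols_1 = list(range(X.shape[1]))
    cols_2 = list(range(X.shape[1]))
    rows = []
    while len(cols_1) > 1 and len(cols_2) > 1:
        drop_1, score_1, preds_1 = best_drop(X, y, X_val, y_val, cols_1, alpha_1)
        drop_2, score_2, preds_2 = best_drop(X, y, X_val, y_val, cols_2, alpha_2)
        cols_1.remove(drop_1)
        cols_2.remove(drop_2)
        # Pearson correlation between the two best models' validation predictions
        agreement = float(np.corrcoef(preds_1, preds_2)[0, 1])
        rows.append((list(cols_1), score_1, list(cols_2), score_2, agreement))
    return rows


# Synthetic data: column 0 is a strong predictor, column 1 a weak one, column 2 pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

rows = compare_best(X[:300], y[:300], X[300:], y[300:], alpha_1=0.0, alpha_2=10.0)
for kept_1, score_1, kept_2, score_2, agreement in rows:
    print(f'M1 kept {kept_1} (RMSE {score_1:.3f}) | '
          f'M2 kept {kept_2} (RMSE {score_2:.3f}) | agreement {agreement:.4f}')
```

Each round refits every model once per candidate feature, drops the feature whose removal hurts least, and then compares the two surviving models' predictions. The full class adds dummy grouping, fixed features, classification metrics, and statistics over all candidates rather than only the best one.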

In [11]:
from alice.metrics.regress import mse, rmse, mae
from alice.metrics.classify import accuracy, precision, recall, f1
from alice.agreeability.regress import pearson
from alice.agreeability.classify import cohen_kappa
from alice.testing.classify import mcnemar_binomial, mcnemar_chisquare
from alice.testing.regress import t_test
from alice.utils.feature_lists import dummy_grouper
from alice.utils.feature_lists import feature_fixer
from alice.utils.feature_lists import feature_list_flatten
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import numpy as np

class BackEliminator():
    '''
    The class is built for conducting backwards feature elimination in combination with model agreeability. A more detailed documentation will follow.
    
    Args:
        X (pd.DataFrame): A pandas dataframe containing predictors.
        y (pd.DataFrame): A pandas dataframe containing target.
        validation_data (tuple): A tuple of validation data (X_val, y_val).
        task_type (str): String for task type. Available options - 'classification' or 'regression'.
        criterion (str): String for intra-model evaluation criterion. Available options: 'mse', 'rmse', 'mae', 'accuracy', 'precision', 'recall', 'f1'.
        agreeability (str): String for inter-model comparison. Available options: 'pearson', 'cohen_kappa'
        dummy_list (list): List of lists containing column names (str) of dummy features generated from a categorical variable. (Optional).
        features_to_fix (list): List containing column names (str) of features that will be excluded from feature elimination and thus always included in modeling. (Optional)
    
    Regression Example:
        seeker = BackEliminator(
            X=X_train,
            y=y_train,
            validation_data=(X_val, y_val),
            task_type='regression',
            criterion='rmse',
            agreeability='pearson',
            dummy_list=[
                ['dummy_1_from_variable_1', 'dummy_2_from_variable_1'],
                ['dummy_1_from_variable_2', 'dummy_2_from_variable_2', 'dummy_3_from_variable_2']
            ],
            features_to_fix=[
                'variable_3',
                'variable_4'
            ]
        )
    
    Classification Example:
        seeker = BackEliminator(
            X=X_train,
            y=y_train,
            validation_data=(X_val, y_val),
            task_type='classification',
            criterion='f1',
            agreeability='cohen_kappa',
            dummy_list=[
                ['dummy_1_from_variable_1', 'dummy_2_from_variable_1'],
                ['dummy_1_from_variable_2', 'dummy_2_from_variable_2', 'dummy_3_from_variable_2']
            ],
            features_to_fix=[
                'variable_3',
                'variable_4'
            ]
        )    
    '''

    def __init__(self,
                 X=None,
                 y=None, 
                 validation_data=None,
                 task_type=None,
                 criterion=None,
                 agreeability=None,
                 dummy_list=None,
                 features_to_fix=None
                 ):

        self.X = X
        self.y = y
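        # Always store validation_data so later `if self.validation_data:` checks
        # do not raise AttributeError when no validation set is passed
        self.validation_data = validation_data
        if validation_data:
            self.X_val = self.validation_data[0]
            self.y_val = self.validation_data[1]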
        self.criterion_registry = {
            'mse': mse,
            'rmse': rmse,
            'mae': mae,
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1
            }
        self.criterion = criterion
        self.agreeability_registry = {
            'pearson': pearson,
            'cohen_kappa': cohen_kappa
        }
        self.testing_registry = {
            'mcnemar_binomial': mcnemar_binomial,
            'mcnemar_chisquare': mcnemar_chisquare,
            't_test': t_test
        }
        self.agreeability = agreeability
        # To append all scores per dropped feature for all iterations of while loop
        self.scores_and_preds_m1 = []
        self.scores_and_preds_m2 = []
    
        #### =========================================================================================== ####
        #### NEW SORTING DEFINED                                                                            #
        #### ---------------------------------------------------------------------------------------------- #
        #### Rationale:                                                                                     #
        #### In Classification metrics: Higher Score <=> Better Predictive Performance                      #
        #### Worst feature will be that whose removal led to the highest score in iteration                 #
        #### In Regression metrics: Lower score <=> Better Predictive Performance                           #
        #### Worst feature will be that whose removal led to the lowest score in iteration                  #
        if task_type == 'classification':                                                                   #
            # Get the entry which has highest score (in second column - [1]) - used in compare_best_models  #
            self.find_worst_feature = lambda scores: max(scores, key=lambda x: x[1])                        #
            # Order the container in descending from max score to min score - used in compare_all_models    #
            self.sort_scores = lambda scores: sorted(scores, key=lambda x: x[1], reverse=True)              #
        elif task_type == 'regression':                                                                     #
            # Get the entry which has lowest score (in second column - [1])                                 #
            self.find_worst_feature = lambda scores: min(scores, key=lambda x: x[1])                        #
            # Order the container in ascending - min score on top max score on bottom                       #
            self.sort_scores = lambda scores: sorted(scores, key=lambda x: x[1])                            #
        else:                                                                                               #
            raise ValueError("Invalid task type specified. Choose 'regression' or 'classification'.")       #
        #### Return will be (worst_feature, best_score, best_preds) from iteration                          #
        #### =========================================================================================== ####

        # Handle feature lists
        # Will default to None if not provided
        self.dummy_list = dummy_list
        # Will default to None if not provided
        self.features_to_fix = features_to_fix
        # Group columns obtained from a one-hot-encoded variable together
        if self.dummy_list:
            self.initial_feature_list = dummy_grouper(feature_list=list(self.X.columns), dummy_list=self.dummy_list)
        else:
            self.initial_feature_list = list(self.X.columns)
        # Remove features we want to fix from the feature list
        if self.features_to_fix:
            self.initial_feature_list = feature_fixer(self.initial_feature_list, self.features_to_fix)
        
    
    # Method to be called in the main method of back elimination
    def _deselect_feature(self,
                          feature_list,
                          model):
        # Empty list for scores
        score_per_dropped_feature = []
        # Iterate over all features
        for feature in feature_list:
            # Generate temporary feature set to manipulate
            temporary_set = feature_list.copy()
            # Drop feature from set
            temporary_set.remove(feature)
            # Flatten list
            temporary_set = feature_list_flatten(temporary_set)
            # Train
            model.fit(self.X[temporary_set], self.y)
            # Predict on validation set
            if self.validation_data:
                y_preds = model.predict(self.X_val[temporary_set])
                # Evaluate
                score = self.criterion_registry[self.criterion](self.y_val, y_preds)
            # Predict on training set
            else:
                y_preds = model.predict(self.X[temporary_set])
                score = self.criterion_registry[self.criterion](self.y, y_preds)
            # Append feature name, score after dropping it, y_preds after dropping it
            score_per_dropped_feature.append((feature, score, y_preds))

        #### Deprecated ####
        # At the end of loop, identify feature
        # which led to the worst score when 
        # feature dropped
        # Descending sort based on score, (x[1])
        #score_per_dropped_feature = self.sort_scores(score_per_dropped_feature) #### REMOVE THIS

        # For ease of read
        #worst_feature = score_per_dropped_feature[0][0] #### REMOVE THIS
        #best_score = score_per_dropped_feature[0][1] ##### REMOVE THIS
        #best_preds = score_per_dropped_feature[0][2] ##### REMOVE THIS

        #del score_per_dropped_feature #### RETURN THIS
        # Return feature name
        #return worst_feature, best_score, best_preds
        
        #### =========================================================================================== ####
        #### NEW RETURN DEFINED                                                                             #
        #### ---------------------------------------------------------------------------------------------- #
        return score_per_dropped_feature                                                                    #
        #### Returns a list of tuples with three entries: str(feature_name), float(score), np.array(preds)  #
        #### =========================================================================================== ####
        ### TO DO ###
        # Add functionality to possibly save trained models 
        # Will take up large memory, may be unfeasible
        ### TO DO ###
    
    def compare_best_models(
            self,
            m1,
            m2
        ): 
        # Copy all features initially
        # for both models
        new_feature_list_m1 = self.initial_feature_list.copy()
        new_feature_list_m2 = self.initial_feature_list.copy()
        # Agreeability scores
        results = []
        # First fit models w/o any removed features
        # Flat lists for fitting
        full_fit_m1 = feature_list_flatten(new_feature_list_m1)
        full_fit_m2 = feature_list_flatten(new_feature_list_m2)
        m1.fit(self.X[full_fit_m1], self.y)
        m2.fit(self.X[full_fit_m2], self.y)
        # Predict on validation set
        if self.validation_data:
            # Model 1
            m1_preds = m1.predict(self.X_val[full_fit_m1])
            m1_score = self.criterion_registry[self.criterion](self.y_val, m1_preds)
            # Model 2
            m2_preds = m2.predict(self.X_val[full_fit_m2])
            m2_score = self.criterion_registry[self.criterion](self.y_val, m2_preds)
            # Agreeability score
            agreeability_coeff = self.agreeability_registry[self.agreeability](m1_preds, m2_preds)
        # Predict on training set
        else:
            # Model 1
            m1_preds = m1.predict(self.X[full_fit_m1])
            m1_score = self.criterion_registry[self.criterion](self.y, m1_preds)
            # Model 2
            m2_preds = m2.predict(self.X[full_fit_m2])
            m2_score = self.criterion_registry[self.criterion](self.y, m2_preds)
            # Agreeability score
            agreeability_coeff = self.agreeability_registry[self.agreeability](m1_preds, m2_preds)
        
        # Append to results
        
        results.append({
            f'Best: M1 Included Features': full_fit_m1.copy(),
            f'Best: M1 {self.criterion.upper()}': m1_score,
            f'Best: M2 Included Features': full_fit_m2.copy(),
            f'Best: M2 {self.criterion.upper()}': m2_score,
            f'Best: Agreeability ({self.agreeability})': agreeability_coeff,
            })            

        ### DEBUG PRINTS
        print(f'Initial run: fitted both models with full feature set.')
        print(f'-' * 150)
        print(f'Model 1 included: {new_feature_list_m1}. {self.criterion.upper()}: {m1_score}')
        print(f'Model 2 included: {new_feature_list_m2}. {self.criterion.upper()}: {m2_score}')
        print(f'-' * 150)
        print(f'Agreeability Coefficient ({self.agreeability}): {agreeability_coeff}')
        print(f'=' * 150)
        ### DEBUG PRINTS   
        
        ### DEBUG
        counter = 0
        ### DEBUG

        # Begin loop to deselect and evaluate
        while len(new_feature_list_m1) > 1 and len(new_feature_list_m2) > 1:

            ### DEBUG
            counter += 1    
            ### DEBUG    

            # Obtain worst_feature, score and preds from deselect_feature functions
            #worst_feature_m1, m1_score, m1_preds = self._deselect_feature(new_feature_list_m1, m1)
            #worst_feature_m2, m2_score, m2_preds = self._deselect_feature(new_feature_list_m2, m2)
            # Update included feature lists
            #new_feature_list_m1.remove(worst_feature_m1) 
            #new_feature_list_m2.remove(worst_feature_m2)

            # Obtain the score lists (removed feature, corresponding score, corresponding preds)
            score_per_dropped_feature_m1 = self._deselect_feature(new_feature_list_m1, m1)
            score_per_dropped_feature_m2 = self._deselect_feature(new_feature_list_m2, m2)

            # Get the worst_feature, best_score, best_preds
            worst_feature_m1, m1_score, m1_preds = self.find_worst_feature(score_per_dropped_feature_m1)
            worst_feature_m2, m2_score, m2_preds = self.find_worst_feature(score_per_dropped_feature_m2)

            # Update included feature lists
            new_feature_list_m1.remove(worst_feature_m1)
            new_feature_list_m2.remove(worst_feature_m2)
            # Flat lists to append to results
            flat_feature_list_m1 = feature_list_flatten(new_feature_list_m1)
            flat_feature_list_m2 = feature_list_flatten(new_feature_list_m2)

            # Compute agreeability
            agreeability_coeff = self.agreeability_registry[self.agreeability](m1_preds, m2_preds)
            # Append to results
            results.append({
                'Model 1 Included Features': flat_feature_list_m1.copy(),
                f'Model 1 {self.criterion.upper()}': m1_score,
                'Model 2 Included Features': flat_feature_list_m2.copy(),
                f'Model 2 {self.criterion.upper()}': m2_score,
                f'Agreeability Coefficient ({self.agreeability})': agreeability_coeff
            })

            ### DEBUG PRINTS
            print(f'Iteration {counter}:')
            print(f'-' * 150)
            print(f'Model 1 included: {new_feature_list_m1}. {self.criterion.upper()}: {m1_score}')
            print(f'Model 2 included: {new_feature_list_m2}. {self.criterion.upper()}: {m2_score}')
            print(f'-' * 150)
            print(f'Agreeability Coefficient ({self.agreeability}): {agreeability_coeff}')
            print(f'=' * 150)
            ### DEBUG PRINTS
        # Save results
        self.results = results
        # Return results
        return results
    
### Order for best for best    
    def compare_all_models(
            self,
            m1,
            m2
        ):
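        '''
        Backward elimination for two models that also summarizes every candidate model in each iteration.

        Both models are first fitted on the full feature set. Then, while each model has more than one
        feature left, every remaining feature is dropped in turn, the candidates are sorted from best to
        worst score, and the feature whose removal gave the best score is eliminated. For each iteration
        the best score, the mean and standard deviation of all candidate scores, and the agreeability
        between rank-matched candidates of the two models (best with best, and so on) are recorded.

        Args:
            m1: First model; must expose fit() and predict().
            m2: Second model; must expose fit() and predict().

        Returns:
            list: One dictionary of results per iteration (also stored in self.results).
        '''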
        # Copy all features initially
        # for both models
        new_feature_list_m1 = self.initial_feature_list.copy()
        new_feature_list_m2 = self.initial_feature_list.copy()
        # Agreeability scores
        results = []
        # Flat lists for fitting
        full_fit_m1 = feature_list_flatten(new_feature_list_m1)
        full_fit_m2 = feature_list_flatten(new_feature_list_m2)
        # First fit models w/o any removed features
        m1.fit(self.X[full_fit_m1], self.y)
        m2.fit(self.X[full_fit_m2], self.y)
        # Predict on validation set
        if self.validation_data:
            # Model 1
            m1_preds = m1.predict(self.X_val[full_fit_m1])
            best_score_m1 = self.criterion_registry[self.criterion](self.y_val, m1_preds)
            # Model 2
            m2_preds = m2.predict(self.X_val[full_fit_m2])
            best_score_m2 = self.criterion_registry[self.criterion](self.y_val, m2_preds)
            # Agreeability score
            agreeability_coeff = self.agreeability_registry[self.agreeability](m1_preds, m2_preds)
        # Predict on training set
        else:
            # Model 1
            m1_preds = m1.predict(self.X[full_fit_m1])
            best_score_m1 = self.criterion_registry[self.criterion](self.y, m1_preds)
            # Model 2
            m2_preds = m2.predict(self.X[full_fit_m2])
            best_score_m2 = self.criterion_registry[self.criterion](self.y, m2_preds)
            # Agreeability score
            agreeability_coeff = self.agreeability_registry[self.agreeability](m1_preds, m2_preds)
        
        # Append to results
        #### TO FIX
        #### Since the first run is on entire dataset, - mean agreeability == agreeability, stdev == 0
        #results.append({
            #f'Best: M1 Included Features': new_feature_list_m1.copy(),
            #f'Best: M1 {self.criterion.upper()}': best_score_m1,
            #f'Best: M2 Included Features': new_feature_list_m2.copy(),
            #f'Best: M2 {self.criterion.upper()}': best_score_m2,
            #f'Best: Agreeability ({self.agreeability})': agreeability_coeff,
            #f'All: Mean Agreeability ({self.agreeability})': np.mean(agreeability_coeff),
            #f'All: Agreeability St. Dev.': np.std(agreeability_coeff)
        #})          

        results.append({
            f'Best: M1 Included Features': full_fit_m1.copy(),
            f'Best: M1 {self.criterion}': best_score_m1,
            f'Best: M2 Included Features': full_fit_m2.copy(),
            f'Best: M2 {self.criterion}': best_score_m2,
            f'Best: Agreeability ({self.agreeability})': agreeability_coeff,
            f'All: M1 Mean {self.criterion}': best_score_m1,
            f'All: M1 STD {self.criterion}': 0,
            f'All: M2 Mean {self.criterion}': best_score_m2,
            f'All: M2 STD {self.criterion}': 0,
            f'All: Mean Agreeability ({self.agreeability})': agreeability_coeff,
            f'All: Agreeability St. Dev.': 0
            })      

        ### DEBUG PRINTS
        print(f'Initial run: fitted both models with full feature set.')
        print(f'-' * 150)
        print(f'Model 1 included: {new_feature_list_m1}. {self.criterion.upper()}: {best_score_m1:.4f}')
        print(f'Model 2 included: {new_feature_list_m2}. {self.criterion.upper()}: {best_score_m2:.4f}')
        print(f'-' * 150)
        print(f'Agreeability Coefficient ({self.agreeability}): {agreeability_coeff:.4f}')
        print(f'=' * 150)
        ### DEBUG PRINTS   
        
        ### DEBUG
        counter = 0
        ### DEBUG

        # Begin loop to deselect and evaluate
        while len(new_feature_list_m1) > 1 and len(new_feature_list_m2) > 1:

            ### DEBUG
            counter += 1    
            ### DEBUG    

            # Obtain worst_feature, score and preds from deselect_feature functions
            #worst_feature_m1, m1_score, m1_preds = self._deselect_feature(new_feature_list_m1, m1)
            #worst_feature_m2, m2_score, m2_preds = self._deselect_feature(new_feature_list_m2, m2)
            # Update included feature lists
            #new_feature_list_m1.remove(worst_feature_m1) 
            #new_feature_list_m2.remove(worst_feature_m2)

            # Obtain the score lists (removed feature, score, preds)
            score_per_dropped_feature_m1 = self._deselect_feature(new_feature_list_m1, m1)
            score_per_dropped_feature_m2 = self._deselect_feature(new_feature_list_m2, m2)

            # Sort the list
            # Note that after sorting row results will not match iteration for iteration in _deselect_feature runs for m1 and m2
            score_per_dropped_feature_m1 = self.sort_scores(score_per_dropped_feature_m1)
            score_per_dropped_feature_m2 = self.sort_scores(score_per_dropped_feature_m2)

            ####################################################################################################################
            ############################################### HANDLE SCORES ######################################################
            ####################################################################################################################
            
            # Obtain all scores for m1 and m2
            all_scores_m1 = [row[1] for row in score_per_dropped_feature_m1]
            all_scores_m2 = [row[1] for row in score_per_dropped_feature_m2]
            # Obtain all preds for m1 and m2
            all_preds_m1 = [row[2] for row in score_per_dropped_feature_m1]
            all_preds_m2 = [row[2] for row in score_per_dropped_feature_m2]
            # Append to respective containers ####### TO BE USED IN A NEW METHOD FOR TESTING #########
            self.scores_and_preds_m1.append((all_scores_m1, all_preds_m1))
            self.scores_and_preds_m2.append((all_scores_m2, all_preds_m2))
            # Get best scores 
            best_score_m1 = all_scores_m1[0]
            best_score_m2 = all_scores_m2[0]
            # Average of all scores
            mean_score_m1 = np.mean(all_scores_m1)
            mean_score_m2 = np.mean(all_scores_m2)
            # Get population std-s of all scores (computed manually to reuse the means instead of recomputing them in np.std())
            std_score_m1 = np.sqrt(np.mean((np.asarray(all_scores_m1) - mean_score_m1) ** 2))
            std_score_m2 = np.sqrt(np.mean((np.asarray(all_scores_m2) - mean_score_m2) ** 2))

            ####################################################################################################################
            ############################################ HANDLE AGREEABILITY ###################################################
            ####################################################################################################################

            # Get all predictions from both models as a list of lists
            # This will iterate row for row in the third column of the containers, where prediction arrays are given. 
            all_preds_m1 = [row[2] for row in score_per_dropped_feature_m1]
            all_preds_m2 = [row[2] for row in score_per_dropped_feature_m2]

            # Get agreeability measures row for row
            # Result will be ordered s.t. entry on top is from the two models with best performance going all the way down to worst

            all_agreeabilities = [self.agreeability_registry[self.agreeability](all_preds_m1[i], all_preds_m2[i]) for i in range(len(all_preds_m1))]
            # Grab the agreeability coefficient between the predictions of best models
            agreeability_coeff = all_agreeabilities[0]
            # Takes average of all agreeability coeffs
            mean_agreeability = np.mean(all_agreeabilities)
            std_agreeability = np.std(all_agreeabilities)

            ####################################################################################################################
            ############################################## HANDLE FEATURES #####################################################
            #################################################################################################################### 

            #### FOR BETTER READABILITY DEFINE ALL VARIABLES INDIVIDUALLY
            worst_feature_m1 = score_per_dropped_feature_m1[0][0]
            worst_feature_m2 = score_per_dropped_feature_m2[0][0]
            # Update included feature lists
            new_feature_list_m1.remove(worst_feature_m1)
            new_feature_list_m2.remove(worst_feature_m2)
            # Flat lists to append to results
            flat_feature_list_m1 = feature_list_flatten(new_feature_list_m1)
            flat_feature_list_m2 = feature_list_flatten(new_feature_list_m2)
            #### ADD A TOPRINT METHOD SOMEWHERE TO MAKE SURE WE ARE NOT calling .upper() uselessly -- for the time being removed uppers.
            # Append to results
            results.append({
                f'Best: M1 Included Features': flat_feature_list_m1.copy(),
                f'Best: M1 {self.criterion}': best_score_m1,
                f'Best: M2 Included Features': flat_feature_list_m2.copy(),
                f'Best: M2 {self.criterion}': best_score_m2,
                f'Best: Agreeability ({self.agreeability})': agreeability_coeff,
                f'All: M1 Mean {self.criterion}': mean_score_m1,
                f'All: M1 STD {self.criterion}': std_score_m1,
                f'All: M2 Mean {self.criterion}': mean_score_m2,
                f'All: M2 STD {self.criterion}': std_score_m2,
                f'All: Mean Agreeability ({self.agreeability})': mean_agreeability,
                f'All: Agreeability St. Dev.': std_agreeability
            })  

        
            ### DEBUG PRINTS
            print(f'Iteration {counter}:')
            print(f'-' * 150)
            print(f'Results from best models:')
            print(f'Best Model 1 included: {new_feature_list_m1}. {self.criterion.upper()}: {best_score_m1:.4f}')
            print(f'Best Model 2 included: {new_feature_list_m2}. {self.criterion.upper()}: {best_score_m2:.4f}')
            print(f'Agreeability Coefficient ({self.agreeability}) between best models: {agreeability_coeff}')
            print(f'-' * 150)
            print(f'Results from all models:')
            print(f'M1 mean score: {mean_score_m1:.4f}. Standard deviation: {std_score_m1:.4f}')
            print(f'M2 mean score: {mean_score_m2:.4f}. Standard deviation: {std_score_m2:.4f}')
            print(f'Mean agreeability coefficient ({self.agreeability}): {mean_agreeability:.4f}. Standard deviation: {std_agreeability:.4f}')
            print(f'=' * 150)
            ### DEBUG PRINTS
        # Save results
        self.results = results
        # Return results
        return results

    def compare_n_best(self, 
                       n=None,
                       test=None):
        '''
        Method for pair-wise comparison of n amount of best predictions obtained by the models.
        The pairwise tests are conducted within the predictions of each models and will test if predictions obtained are statistically significantly different from each other.
        
        Args:
            n (int): How many best results to compare.
            test (str): Statistical test to use. Options: 'mcnemar_binomial' and 'mcnemar_chisquare' for binary classification. 't_test' for regression.
        
        Returns:
            None. pval_and_stats_m1 and pval_and_stats_m2 are callable lists containing corresponding test statistics and p-values.
        
        Example: Setting n=3 will test:
                - M1: best predictions against second best predictions; second best predictions and third best predictions.
                - M2: best predictions against second best predictions; second best predictions and third best predictions. 
        '''
        # Make sure the search has already been run and results are there.
        if not self.scores_and_preds_m1 or not self.scores_and_preds_m2:
            raise ValueError('No predictions found. Run a comparison algorithm first.')
        # Make sure n does not exceed the number of available best predictions
        if n > len(self.scores_and_preds_m1):
            raise ValueError(f'Picked n is more than available amount of best predictions. Use n <= {len(self.scores_and_preds_m1)}.')
        # Make sure test supported
        if test not in self.testing_registry:
            raise ValueError("Test not supported. Please use 'mcnemar_binomial' or 'mcnemar_chisquare' for classification or 't_test' for regression.")
        # Empty containers
        self.pval_and_stats_m1 = []
        self.pval_and_stats_m2 = []
        # Iterate n-1 times
        for i in range(n-1):
            # Get result for model 1
            pval_m1, stat_m1 = self.testing_registry[test](
                self.scores_and_preds_m1[i][1][0],
                self.scores_and_preds_m1[i+1][1][0],
                self.y_val
            )
            self.pval_and_stats_m1.append((pval_m1, stat_m1))
            # Get result for model 2
            pval_m2, stat_m2 = self.testing_registry[test](
                self.scores_and_preds_m2[i][1][0],
                self.scores_and_preds_m2[i+1][1][0],
                self.y_val
            )
            self.pval_and_stats_m2.append((pval_m2, stat_m2))
            print(f'Model 1: Results for No. {i+1} and No. {i+2} best predictions: P-value: {pval_m1:.8f}. Test statistic: {stat_m1:.8f}.')
            print(f'Model 2: Results for No. {i+1} and No. {i+2} best predictions: P-value: {pval_m2:.8f}. Test statistic: {stat_m2:.8f}.')
            print('='*120)

        #### REMOVE DESELECT_INPROG REMOVE DESELECT_INPROG REMOVE DESELECT_INPROG

    # Method to turn results into a df
    def dataframe_from_results(self):
        '''
        Return results as a dataframe.
        '''
        # Check if results exist
        if not self.results:
            raise ValueError("There are no results available. Make sure to run compare_models first.")
        # Return results
        return pd.DataFrame(self.results)
    
    # Method to turn results into an interactive plot
    def plot_from_results(self):
        '''
        Makes an interactive plot from the results.
        '''
        if not self.results:
            raise ValueError("There are no results available. Make sure to run compare_models first.")
        df = pd.DataFrame(self.results)

        df['Summary_Agreeability'] = df.apply(lambda row: f"<br> {df.columns[4]}: <br> {row.iloc[4]:.4f} <br> {df.columns[9]}: <br> {row.iloc[9]:.4f} <br> {df.columns[10]}: <br> {row.iloc[10]:.4f}", axis=1)
        df['Summary_M1'] = df.apply(lambda row: f"<br> {df.columns[1]}: <br> {row.iloc[1]:.4f} <br> {df.columns[0]}: <br> {', '.join(row.iloc[0])} <br> {df.columns[5]}: <br> {row.iloc[5]:.4f} <br> {df.columns[6]}: <br> {row.iloc[6]:.4f}", axis=1)
        df['Summary_M2'] = df.apply(lambda row: f"<br> {df.columns[3]}: <br> {row.iloc[3]:.4f} <br> {df.columns[2]}: <br> {', '.join(row.iloc[2])} <br> {df.columns[7]}: <br> {row.iloc[7]:.4f} <br> {df.columns[8]}: <br> {row.iloc[8]:.4f}", axis=1)


        fig = make_subplots(
            specs=[[{'secondary_y': True}]]
        )

        # Plot agreeability
        fig.add_trace(
            go.Scatter(
            x=df.index + 1,
            y=df.iloc[:, 4],
            name=f'{df.columns[4]}',
            mode='lines+markers',
            hovertext=df['Summary_Agreeability'],
            hoverinfo='text' 
            ),
            secondary_y=False
        )

        # Plot model 1 score
        fig.add_trace(
            go.Scatter(
                x=df.index + 1,
                y=df.iloc[:, 1],
                name=f'{df.columns[1]}',
                mode='lines+markers',
                hovertext=df['Summary_M1'],
                hoverinfo='text'
            ),
            secondary_y=True
        )

        # Plot model 2 score
        fig.add_trace(
            go.Scatter(
                x=df.index+1,
                y=df.iloc[:, 3],
                name=f'{df.columns[3]}',
                mode='lines+markers',
                hovertext=df['Summary_M2'],
                hoverinfo='text'
            ),
            secondary_y=True
        )

        fig.update_layout(
            title='Agreeability Coefficients and Model Scores Over Algorithm Iterations',
            xaxis_title='Iteration',
            yaxis_title='Agreeability',
            yaxis2_title='Model Scores',
            hovermode='closest'
        )

        fig.update_xaxes(type='category')
        fig.show()
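
The agreeability summary that `compare_all_models` stores per iteration (the coefficient for the best pair, plus the mean and standard deviation over all pairs) can be illustrated on toy data. This is a minimal sketch, assuming that the `'pearson'` entry of the agreeability registry behaves like a plain Pearson correlation; `pearson_agreeability` and the toy prediction vectors are hypothetical and are not taken from the framework.

```python
import numpy as np
from scipy import stats


def pearson_agreeability(preds_a, preds_b):
    """Pearson correlation between two prediction vectors (illustrative stand-in)."""
    return stats.pearsonr(preds_a, preds_b)[0]


# Toy predictions: one vector per candidate fit, ordered best-first for each model
all_preds_m1 = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 2.0, 3.0, 5.0])]
all_preds_m2 = [np.array([2.0, 4.0, 6.0, 8.0]), np.array([4.0, 3.0, 2.0, 1.0])]

# One coefficient per pair of same-rank fits
all_agreeabilities = [
    pearson_agreeability(a, b) for a, b in zip(all_preds_m1, all_preds_m2)
]

# Coefficient between the two best fits, then the summary over all pairs
agreeability_coeff = all_agreeabilities[0]
mean_agreeability = np.mean(all_agreeabilities)
std_agreeability = np.std(all_agreeabilities)  # population standard deviation (ddof=0)

print(agreeability_coeff, mean_agreeability, std_agreeability)
```

As in the method above, `np.std` is used with its default `ddof=0`, so the reported spread is the population standard deviation rather than the sample one.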

Before running the algorithm I just look at all the columns I have:

In [12]:
df.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal', 'HouseEval_nice',
       'HouseEval_not_nice', 'WallColors_gray', 'WallColors_white'],
      dtype='object')

Below I demonstrate two of the new functionalities I added:
- **Grouping dummy-encoded variables:** It would not make sense to treat the dummy columns generated from one categorical variable as individual features. The framework therefore lets you provide the column names of each group, and the algorithm treats each group as one feature.<br>
- **Fixing features:** You can choose variables to exclude from feature selection (in other words, to include in every model that is built and fitted).<br>

The dummy groups must be provided as a list of lists. Below I show a list with two sub-lists. Each sub-list contains the exact names (as strings) of the dummy columns that belong to one categorical variable. The algorithm then treats each sub-list as *one* feature when selecting features to drop.

In [13]:
dummy_list = [
    ['HouseEval_nice', 'HouseEval_not_nice'],
    ['WallColors_gray', 'WallColors_white']
]
dummy_list

[['HouseEval_nice', 'HouseEval_not_nice'],
 ['WallColors_gray', 'WallColors_white']]
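
To make the grouping concrete, here is a minimal sketch of how a grouped feature list could be built from the column names and later flattened again for fitting. The helpers `group_features` and `flatten_features` are hypothetical stand-ins written for this illustration; the framework's own `feature_list_flatten` may be implemented differently.

```python
def group_features(columns, dummy_groups):
    """Keep ordinary columns as strings and append each dummy group as one list."""
    dummy_members = {name for group in dummy_groups for name in group}
    grouped = [col for col in columns if col not in dummy_members]
    grouped.extend(list(group) for group in dummy_groups)
    return grouped


def flatten_features(features):
    """Expand grouped entries back into a flat list of column names."""
    flat = []
    for item in features:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


columns = ['MedInc', 'Latitude', 'HouseEval_nice', 'HouseEval_not_nice',
           'WallColors_gray', 'WallColors_white']
dummy_groups = [['HouseEval_nice', 'HouseEval_not_nice'],
                ['WallColors_gray', 'WallColors_white']]

grouped = group_features(columns, dummy_groups)
flat = flatten_features(grouped)

print(len(grouped))  # 4 candidate features: two plain columns and two groups
print(flat)
```

Dropping one entry of `grouped` removes either a single column or a whole dummy group, which is the behaviour described above.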

The features to fix must be provided as a list containing the names (as strings) of the features to always keep in the models. Additionally, a list of dummy column names can be included to fix that whole group in place as well. 

*Note* that eventually when I run the algorithm I do not provide any features to fix, even though I demonstrate it here.

In [14]:
ftofix = [
    'Latitude',
    'Longitude',
    ['WallColors_gray', 'WallColors_white']
]
ftofix

['Latitude', 'Longitude', ['WallColors_gray', 'WallColors_white']]
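
One way to picture the effect of fixing features is to separate the candidates that may still be dropped from the ones that always stay. This is a sketch only; `droppable_features` is a hypothetical helper, and the framework may handle fixed features differently internally.

```python
def droppable_features(grouped_features, features_to_fix):
    """Return the features (or dummy groups) that are still candidates for removal."""
    return [feat for feat in grouped_features if feat not in features_to_fix]


grouped_features = [
    'MedInc',
    'Latitude',
    'Longitude',
    ['HouseEval_nice', 'HouseEval_not_nice'],
    ['WallColors_gray', 'WallColors_white'],
]
features_to_fix = ['Latitude', 'Longitude', ['WallColors_gray', 'WallColors_white']]

candidates = droppable_features(grouped_features, features_to_fix)
print(candidates)
```

Because a dummy group is compared as a whole list, a group is only treated as fixed when its sub-list appears in `features_to_fix` with the same names in the same order.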

### Running a regression task

In [15]:
# Import the train-test splitting function from sklearn
from sklearn.model_selection import train_test_split

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MedInc              20640 non-null  float64
 1   HouseAge            20640 non-null  float64
 2   AveRooms            20640 non-null  float64
 3   AveBedrms           20640 non-null  float64
 4   Population          20640 non-null  float64
 5   AveOccup            20640 non-null  float64
 6   Latitude            20640 non-null  float64
 7   Longitude           20640 non-null  float64
 8   MedHouseVal         20640 non-null  float64
 9   HouseEval_nice      20640 non-null  uint8  
 10  HouseEval_not_nice  20640 non-null  uint8  
 11  WallColors_gray     20640 non-null  uint8  
 12  WallColors_white    20640 non-null  uint8  
dtypes: float64(9), uint8(4)
memory usage: 1.5 MB


In [17]:
# Save the target variable as y
y = df['MedHouseVal']
# Save the rest of the predictors as X
X = df.drop('MedHouseVal', axis=1)


In [18]:
# Obtain training and validation sets in a 0.8-0.2 proportion. random_state=66 ensures I get the same split whenever I re-run the train-test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=66)

In [19]:
# Import linear regression from sklearn
from sklearn.linear_model import LinearRegression
# Import decision tree regressor from sklearn
from sklearn.tree import DecisionTreeRegressor

In [20]:
# set up model1
m1 = LinearRegression()
# set up model2
m2 = DecisionTreeRegressor()

In [21]:
# We can look at my documentation for the BackEliminator
BackEliminator?

Init signature:
BackEliminator(
    X=None,
    y=None,
    validation_data=None,
    task_type=None,
    criterion=None,
    agreeability=None,
    dummy_list=None,
    features_to_fix=None,
)
Docstring:
The class is built for conducting backwards feature elimination in combination with model agreeability. A more detailed documentation will follow.

Args:
    X (pd.DataFrame): A pandas d

In [22]:
# Initialize the python class.
seeker = BackEliminator(
    X=X_train,
    y=y_train,
    validation_data=(X_val, y_val),
    task_type='regression',
    criterion='rmse',
    agreeability='pearson',
    dummy_list=dummy_list,
)

In [23]:
# Run the algorithm by providing model 1 (in this case linreg) and model 2 (decision tree)
results = seeker.compare_all_models(
    m1=m1,
    m2=m2
)

Initial run: fitted both models with full feature set.
------------------------------------------------------------------------------------------------------------------------------------------------------
Model 1 included: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', ['HouseEval_nice', 'HouseEval_not_nice'], ['WallColors_gray', 'WallColors_white']]. RMSE: 0.7314
Model 2 included: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', ['HouseEval_nice', 'HouseEval_not_nice'], ['WallColors_gray', 'WallColors_white']]. RMSE: 0.7327
------------------------------------------------------------------------------------------------------------------------------------------------------
Agreeability Coefficient (pearson): 0.7568
Iteration 1:
------------------------------------------------------------------------------------------------------------------------------------------------------
Results 

We can print the raw list of dictionaries carrying the results (one dictionary per iteration):

In [24]:
results

[{'Best: M1 Included Features': ['MedInc',
   'HouseAge',
   'AveRooms',
   'AveBedrms',
   'Population',
   'AveOccup',
   'Latitude',
   'Longitude',
   'HouseEval_nice',
   'HouseEval_not_nice',
   'WallColors_gray',
   'WallColors_white'],
  'Best: M1 rmse': 0.7313631839067938,
  'Best: M2 Included Features': ['MedInc',
   'HouseAge',
   'AveRooms',
   'AveBedrms',
   'Population',
   'AveOccup',
   'Latitude',
   'Longitude',
   'HouseEval_nice',
   'HouseEval_not_nice',
   'WallColors_gray',
   'WallColors_white'],
  'Best: M2 rmse': 0.7327078108794725,
  'Best: Agreeability (pearson)': 0.7567790142007598,
  'All: M1 Mean rmse': 0.7313631839067938,
  'All: M1 STD rmse': 0,
  'All: M2 Mean rmse': 0.7327078108794725,
  'All: M2 STD rmse': 0,
  'All: Mean Agreeability (pearson)': 0.7567790142007598,
  'All: Agreeability St. Dev.': 0},
 {'Best: M1 Included Features': ['MedInc',
   'HouseAge',
   'AveRooms',
   'AveBedrms',
   'Population',
   'Latitude',
   'Longitude',
   'HouseEval

Or we can turn the results into a readable pandas dataframe with one of the framework's built-in methods.

In [25]:
# For pandas to allow viewing full column width
pd.set_option('display.max_colwidth', None)

In [26]:
# Generate the dataframe
results_df = seeker.dataframe_from_results()

In [27]:
results_df

Unnamed: 0,Best: M1 Included Features,Best: M1 rmse,Best: M2 Included Features,Best: M2 rmse,Best: Agreeability (pearson),All: M1 Mean rmse,All: M1 STD rmse,All: M2 Mean rmse,All: M2 STD rmse,All: Mean Agreeability (pearson),All: Agreeability St. Dev.
0,"[MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.731363,"[MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.732708,0.756779,0.731363,0.0,0.732708,0.0,0.756779,0.0
1,"[MedInc, HouseAge, AveRooms, AveBedrms, Population, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.724195,"[MedInc, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.69461,0.768204,0.764343,0.055009,0.741457,0.047692,0.721909,0.069739
2,"[MedInc, HouseAge, AveRooms, AveBedrms, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.724195,"[MedInc, AveRooms, AveBedrms, Population, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.673339,0.778682,0.759649,0.054841,0.739133,0.070063,0.727457,0.072945
3,"[MedInc, HouseAge, AveRooms, AveBedrms, Latitude, Longitude, WallColors_gray, WallColors_white]",0.72423,"[MedInc, AveRooms, AveBedrms, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.656888,0.782832,0.764257,0.056605,0.745178,0.120096,0.724589,0.083219
4,"[MedInc, HouseAge, AveRooms, AveBedrms, Latitude, Longitude]",0.724321,"[AveRooms, AveBedrms, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.649506,0.741606,0.770011,0.058328,0.740899,0.134446,0.722776,0.085416
5,"[MedInc, HouseAge, AveBedrms, Latitude, Longitude]",0.729908,"[AveRooms, AveBedrms, Latitude, Longitude, WallColors_gray, WallColors_white]",0.642569,0.742414,0.777709,0.05971,0.839496,0.242373,0.631184,0.114599
6,"[MedInc, HouseAge, Latitude, Longitude]",0.734025,"[AveRooms, AveBedrms, Latitude, Longitude]",0.632129,0.743102,0.822217,0.105224,0.868696,0.247616,0.572183,0.16918
7,"[MedInc, Latitude, Longitude]",0.741958,"[AveRooms, Latitude, Longitude]",0.649022,0.719247,0.845597,0.106522,0.922881,0.241907,0.525762,0.16354
8,"[MedInc, Latitude]",0.830187,"[Latitude, Longitude]",0.633092,0.597978,0.897744,0.089312,1.019702,0.277711,0.441698,0.149115
9,[MedInc],0.839073,[Longitude],1.00587,0.326271,1.0013,0.162228,1.025579,0.01971,0.307457,0.018813
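
Since the results are an ordinary dataframe, the iteration with the best score can be looked up directly. The sketch below rebuilds only the two `Best: ... rmse` columns, with the values copied from the table above, so that it runs on its own.

```python
import pandas as pd

# Scores copied from the results table above (rows 0 to 9)
scores = pd.DataFrame({
    'Best: M1 rmse': [0.731363, 0.724195, 0.724195, 0.72423, 0.724321,
                      0.729908, 0.734025, 0.741958, 0.830187, 0.839073],
    'Best: M2 rmse': [0.732708, 0.69461, 0.673339, 0.656888, 0.649506,
                      0.642569, 0.632129, 0.649022, 0.633092, 1.00587],
})

# RMSE is an error, so lower is better; idxmin returns the first row holding the minimum
best_rows = scores.idxmin()
best_m1_row = best_rows['Best: M1 rmse']
best_m2_row = best_rows['Best: M2 rmse']

print(best_m1_row, best_m2_row)
```

With the real dataframe the same lookup would be `results_df['Best: M2 rmse'].idxmin()`. Note that rows 1 and 2 of model 1 are tied at the displayed precision, so the row reported for model 1 here depends on rounding.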


In [28]:
# We can now look at the interactive plot
seeker.plot_from_results()

**New functionality:**

I have added a new method to the framework that tests whether the $n$ best results of each model differ statistically from each other. With $n=2$, the best result of model 1 is tested against the second-best result of model 1, and the best result of model 2 against the second-best result of model 2.

Some experiments on the small demonstration dataset (a t-test in this regression task, and McNemar's test in the classification task further down in the notebook) suggest that the test statistics from comparing predictions of linear / logistic regressors are stable, while the test statistics from comparing decision tree models tend to vary considerably as different features are dropped.
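
For intuition, a pairwise test of this kind on a regression task could look like the sketch below. It is an assumption on my side that a paired t-test on the squared errors is a reasonable stand-in; the framework's own `'t_test'` entry may be defined differently. `paired_error_t_test` and the toy data are hypothetical, and the function returns `(p_value, statistic)` only to mirror the order in which `compare_n_best` unpacks its results.

```python
import numpy as np
from scipy import stats


def paired_error_t_test(preds_a, preds_b, y_true):
    """Paired t-test on the squared errors of two prediction vectors."""
    y_true = np.asarray(y_true, dtype=float)
    err_a = (y_true - np.asarray(preds_a, dtype=float)) ** 2
    err_b = (y_true - np.asarray(preds_b, dtype=float)) ** 2
    stat, pval = stats.ttest_rel(err_a, err_b)
    return pval, stat


# Toy validation targets and two sets of predictions
y_val_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
preds_best = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
preds_second = [1.5, 2.6, 2.2, 4.9, 5.7, 5.0]

pval, stat = paired_error_t_test(preds_best, preds_second, y_val_toy)
print(f'P-value: {pval:.4f}. Test statistic: {stat:.4f}.')
```

A negative statistic here means the first set of predictions has the smaller squared errors on average.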

In [29]:
# You can further inspect my documentation for this functionality
seeker.compare_n_best?

Signature: seeker.compare_n_best(n=None, test=None)
Docstring:
Method for pair-wise comparison of n amount of best predictions obtained by the models.
The pairwise tests are conducted within the predictions of each models and will test if predictions obtained are statistically significantly different from each other.

Args:
    n (int): How many best results to compare.
    test (str): Statistical test to use. Options: 'mcnemar_binomial' and 'mcnemar_chisquare' for binary classification. 't_test' for regression.

Returns:
    None. pval_and_stats_m1 and pval_and_stats_m2 are callable lists containing corresponding test statistics and p-values.

Example: Setting n=3 will test:
        - M1: best predictions against second best predictions; second best predictions and third best predictions.
        - M2: best predictions 

In [30]:
seeker.compare_n_best(n=9, test='t_test')

Model 1: Results for No. 1 and No. 2 best predictions: P-value: 0.70239521. Test statistic: -0.38210314.
Model 2: Results for No. 1 and No. 2 best predictions: P-value: 0.77188139. Test statistic: -0.28992436.
Model 1: Results for No. 2 and No. 3 best predictions: P-value: 0.70189896. Test statistic: -0.38277231.
Model 2: Results for No. 2 and No. 3 best predictions: P-value: 0.94458274. Test statistic: 0.06951329.
Model 1: Results for No. 3 and No. 4 best predictions: P-value: 0.70389361. Test statistic: -0.38008364.
Model 2: Results for No. 3 and No. 4 best predictions: P-value: 0.48370647. Test statistic: -0.70038528.
Model 1: Results for No. 4 and No. 5 best predictions: P-value: 0.69560290. Test statistic: -0.39127734.
Model 2: Results for No. 4 and No. 5 best predictions: P-value: 0.67762958. Test statistic: -0.41571472.
Model 1: Results for No. 5 and No. 6 best predictions: P-value: 0.73395124. Test statistic: -0.33988637.
Model 2: Results for No. 5 and No. 6 best predictions: P

In [31]:
# The pvalues and test statistics can be saved as variables (list of tuples (p_value, test_statistic))
pval_and_stats_m1 = seeker.pval_and_stats_m1
pval_and_stats_m2 = seeker.pval_and_stats_m2

(A note for self, disregard). How the indexing works in scores_and_preds: the resulting list has three levels of indexing.

$\text{score} = \text{list}[i][j][k]$, where $i$ ranges from $0$ to the number of iterations of the search algorithm (however many features were dropped before ending up with one feature), $j$ is $0$ or $1$, with $0$ grabbing the scores and $1$ grabbing the predictions, and $k$ ranges from $0$ to however many features could be dropped during that iteration (for example, at the very first iteration any of the $n$ features can be dropped, so $k$ takes that many values).
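
The three index levels can be checked on a small hand-made list. The nested structure below is a toy stand-in that follows the description above; the values are made up and the real `scores_and_preds_m1` holds full prediction arrays.

```python
# Toy stand-in: one (scores, preds) pair per iteration, each ordered best-first
scores_and_preds = [
    # iteration 0: two candidate fits
    ([0.70, 0.75], [[1.0, 2.0], [1.5, 2.5]]),
    # iteration 1: one candidate fit
    ([0.68], [[0.9, 2.1]]),
]

best_score_iter0 = scores_and_preds[0][0][0]  # i=0, j=0 (scores), k=0 (best fit)
best_preds_iter0 = scores_and_preds[0][1][0]  # i=0, j=1 (preds), k=0 (best fit)
best_preds_iter1 = scores_and_preds[1][1][0]  # i=1, j=1 (preds), k=0 (best fit)

print(best_score_iter0, best_preds_iter0, best_preds_iter1)
```

This is the same `[i][1][0]` pattern that `compare_n_best` uses to pick the best predictions of consecutive iterations.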

### Check functionality on a classification task

Below I do not include detailed comments for now, as we just re-run everything with a different (binary) target.

In [32]:
df_discrete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MedInc              20640 non-null  float64
 1   HouseAge            20640 non-null  float64
 2   AveRooms            20640 non-null  float64
 3   AveBedrms           20640 non-null  float64
 4   Population          20640 non-null  float64
 5   AveOccup            20640 non-null  float64
 6   Latitude            20640 non-null  float64
 7   Longitude           20640 non-null  float64
 8   HouseEval_nice      20640 non-null  uint8  
 9   HouseEval_not_nice  20640 non-null  uint8  
 10  WallColors_gray     20640 non-null  uint8  
 11  WallColors_white    20640 non-null  uint8  
 12  AboveMean           20640 non-null  int64  
dtypes: float64(8), int64(1), uint8(4)
memory usage: 1.5 MB


In [33]:
# Use binary target
y = df_discrete['AboveMean']
X = df_discrete.drop('AboveMean', axis=1)


In [34]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=66)

In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [36]:
m1 = LogisticRegression(solver='liblinear')
m2 = DecisionTreeClassifier()

In [37]:
# Save this class under a different name
seeker_2 = BackEliminator(
    X=X_train,
    y=y_train,
    validation_data=(X_val, y_val),
    task_type='classification',
    criterion='f1',
    agreeability='cohen_kappa',
    dummy_list=dummy_list
)

In [38]:
results_2 = seeker_2.compare_all_models(
    m1=m1,
    m2=m2
)

Initial run: fitted both models with full feature set.
------------------------------------------------------------------------------------------------------------------------------------------------------
Model 1 included: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', ['HouseEval_nice', 'HouseEval_not_nice'], ['WallColors_gray', 'WallColors_white']]. F1: 0.7773
Model 2 included: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', ['HouseEval_nice', 'HouseEval_not_nice'], ['WallColors_gray', 'WallColors_white']]. F1: 0.7961
------------------------------------------------------------------------------------------------------------------------------------------------------
Agreeability Coefficient (cohen_kappa): 0.6436
Iteration 1:
------------------------------------------------------------------------------------------------------------------------------------------------------
Results 

In [41]:
results_2

[{'Best: M1 Included Features': ['MedInc',
   'HouseAge',
   'AveRooms',
   'AveBedrms',
   'Population',
   'AveOccup',
   'Latitude',
   'Longitude',
   'HouseEval_nice',
   'HouseEval_not_nice',
   'WallColors_gray',
   'WallColors_white'],
  'Best: M1 f1': 0.7772713460954117,
  'Best: M2 Included Features': ['MedInc',
   'HouseAge',
   'AveRooms',
   'AveBedrms',
   'Population',
   'AveOccup',
   'Latitude',
   'Longitude',
   'HouseEval_nice',
   'HouseEval_not_nice',
   'WallColors_gray',
   'WallColors_white'],
  'Best: M2 f1': 0.7960877296976882,
  'Best: Agreeability (cohen_kappa)': 0.643642219845067,
  'All: M1 Mean f1': 0.7772713460954117,
  'All: M1 STD f1': 0,
  'All: M2 Mean f1': 0.7960877296976882,
  'All: M2 STD f1': 0,
  'All: Mean Agreeability (cohen_kappa)': 0.643642219845067,
  'All: Agreeability St. Dev.': 0},
 {'Best: M1 Included Features': ['MedInc',
   'HouseAge',
   'AveRooms',
   'AveBedrms',
   'AveOccup',
   'Latitude',
   'Longitude',
   'HouseEval_nice',


In [42]:
results_df_2 = seeker_2.dataframe_from_results()

In [43]:
results_df_2

Unnamed: 0,Best: M1 Included Features,Best: M1 f1,Best: M2 Included Features,Best: M2 f1,Best: Agreeability (cohen_kappa),All: M1 Mean f1,All: M1 STD f1,All: M2 Mean f1,All: M2 STD f1,All: Mean Agreeability (cohen_kappa),All: Agreeability St. Dev.
0,"[MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.777271,"[MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.796088,0.643642,0.777271,0.0,0.796088,0.0,0.643642,0.0
1,"[MedInc, HouseAge, AveRooms, AveBedrms, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.781449,"[HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.81211,0.566387,0.760891,0.030383,0.786765,0.019806,0.592901,0.057391
2,"[MedInc, HouseAge, AveRooms, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.782005,"[HouseAge, AveRooms, Population, AveOccup, Latitude, Longitude, HouseEval_nice, HouseEval_not_nice, WallColors_gray, WallColors_white]",0.821877,0.586078,0.76493,0.025402,0.788823,0.042269,0.535367,0.0557
3,"[MedInc, HouseAge, AveRooms, AveOccup, Latitude, Longitude, WallColors_gray, WallColors_white]",0.782399,"[HouseAge, AveRooms, Population, AveOccup, Latitude, Longitude, WallColors_gray, WallColors_white]",0.829749,0.591471,0.753303,0.042888,0.787817,0.04478,0.51661,0.075515
4,"[MedInc, HouseAge, AveRooms, AveOccup, Latitude, Longitude]",0.780797,"[HouseAge, AveRooms, Population, Latitude, Longitude, WallColors_gray, WallColors_white]",0.826139,0.566378,0.749392,0.044615,0.783741,0.049784,0.503428,0.076939
5,"[MedInc, HouseAge, AveOccup, Latitude, Longitude]",0.776399,"[HouseAge, AveRooms, Population, Latitude, Longitude]",0.828125,0.582425,0.743861,0.045557,0.761082,0.082902,0.469425,0.125381
6,"[MedInc, AveOccup, Latitude, Longitude]",0.767023,"[HouseAge, AveRooms, Latitude, Longitude]",0.827896,0.557864,0.71762,0.073182,0.753872,0.083753,0.418438,0.165025
7,"[MedInc, AveOccup, Latitude]",0.745134,"[HouseAge, Latitude, Longitude]",0.830325,0.49331,0.68527,0.076862,0.74209,0.08606,0.371728,0.163367
8,"[MedInc, AveOccup]",0.738155,"[Latitude, Longitude]",0.840012,0.510645,0.615847,0.137868,0.666151,0.123696,0.257927,0.185833
9,[MedInc],0.691214,[Longitude],0.615832,0.220298,0.527757,0.163457,0.608412,0.00742,0.141172,0.079127


In [44]:
seeker_2.plot_from_results()

In [45]:
seeker_2.compare_n_best(n=9, test='mcnemar_chisquare')

Model 1: Results for No. 1 and No. 2 best predictions: P-value: 0.00000000. Test statistic: 461.12049861.
Model 2: Results for No. 1 and No. 2 best predictions: P-value: 0.00000000. Test statistic: 87.98825503.
Model 1: Results for No. 2 and No. 3 best predictions: P-value: 0.00000000. Test statistic: 708.06786704.
Model 2: Results for No. 2 and No. 3 best predictions: P-value: 0.00000000. Test statistic: 296.35263158.
Model 1: Results for No. 3 and No. 4 best predictions: P-value: 0.00000000. Test statistic: 665.32369146.
Model 2: Results for No. 3 and No. 4 best predictions: P-value: 0.77131199. Test statistic: 0.08448276.
Model 1: Results for No. 4 and No. 5 best predictions: P-value: 0.00000000. Test statistic: 578.77672530.
Model 2: Results for No. 4 and No. 5 best predictions: P-value: 0.00000000. Test statistic: 309.86188811.
Model 1: Results for No. 5 and No. 6 best predictions: P-value: 0.00000000. Test statistic: 233.39973788.
Model 2: Results for No. 5 and No. 6 best predict
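
For reference, McNemar's chi-square test on two sets of binary predictions can be sketched as follows. It only uses the discordant cases, where exactly one of the two models is correct. Using the continuity-corrected form is an assumption on my side; `mcnemar_chisquare_sketch` and the toy data are hypothetical and are not the framework's implementation.

```python
import numpy as np
from scipy import stats


def mcnemar_chisquare_sketch(preds_a, preds_b, y_true):
    """McNemar's chi-square test with continuity correction; returns (p_value, statistic)."""
    y_true = np.asarray(y_true)
    correct_a = np.asarray(preds_a) == y_true
    correct_b = np.asarray(preds_b) == y_true
    # Discordant counts: only one of the two prediction sets is correct
    b = int(np.sum(correct_a & ~correct_b))
    c = int(np.sum(~correct_a & correct_b))
    if b + c == 0:
        # No discordant cases: nothing distinguishes the two prediction sets
        return 1.0, 0.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    pval = stats.chi2.sf(stat, df=1)
    return pval, stat


# Toy labels and predictions: A is right where B is wrong 5 times, the reverse once
y_toy = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
preds_a = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
preds_b = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

pval, stat = mcnemar_chisquare_sketch(preds_a, preds_b, y_toy)
print(f'P-value: {pval:.8f}. Test statistic: {stat:.8f}.')
```

With many discordant cases that mostly favour one model, the statistic becomes large and the p-value is printed as zero at eight decimals, as in several lines of the output above.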

### Working Code for 3D Plot

```python

# Create a 3D scatter plot
fig = go.Figure(data=[
    go.Scatter3d(
        x=df.index + 1,
        y=df.iloc[:, 4],
        z=df.iloc[:, 1], 
        mode='lines+markers',
        name=f'{df.columns[1]}',
        text=df['Summary_M1'],
        hoverinfo='text'
    ),
    go.Scatter3d(
        x=df.index + 1,
        y=df.iloc[:, 4],
        z=df.iloc[:, 3], 
        mode='lines+markers',
        name=f'{df.columns[3]}',
        text=df['Summary_M2'],
        hoverinfo='text'
    )
])

# Update layout
fig.update_layout(
    title='Agreeability Coefficients and Model Scores Over Algorithm Iterations',
    scene=dict(
        xaxis_title='Iteration',
        yaxis_title='Agreeability',
        zaxis_title='Model Scores'
    ),
    hovermode='closest'
)

# Show the plot
fig.show()

```

### Dependencies

In [30]:
import numpy as np
import pandas as pd
import statsmodels
import sklearn
import scipy
import plotly
import matplotlib
import seaborn as sns
import tensorflow as tf

In [32]:
print(f'pandas: {pd.__version__}')
print(f'numpy: {np.__version__}')
print(f'statsmodels: {statsmodels.__version__}')
print(f'sklearn: {sklearn.__version__}')
print(f'scipy: {scipy.__version__}')
print(f'plotly: {plotly.__version__}')
print(f'matplotlib: {matplotlib.__version__}')
print(f'seaborn: {sns.__version__}')
print(f'tensorflow: {tf.__version__}')


pandas: 1.5.3
numpy: 1.20.3
statsmodels: 0.13.5
sklearn: 1.2.2
scipy: 1.10.0
plotly: 5.18.0
matplotlib: 3.3.4
seaborn: 0.11.1
tensorflow: 2.10.1
