This notebook handles training and performance evaluation for the experimental datasets. All of the models trained here are classifiers: Support Vector Machine, Decision Tree, and Random Forest. Validation uses 10-fold cross validation with accuracy and area under the ROC curve as metrics.

Unsupervised discretization was deployed to reduce the potential model complexity of large, continuous, and often highly correlated variables, so the goal is to determine the efficacy of different unsupervised approaches to discretization in building classifier models. To that end, the experimental groups were devised by experimentally optimizing the number of bins per variable and per discretization technique, using either inter-variable variance or inter-variable entropy (cite).

To further optimize the discretization process, ensembles of discretization techniques were deployed: for each continuous predictive variable, a distinct sequence of transformations was created using constant frequency, constant width, and kmeans discretization techniques. For instance, the first continuous variable might have been binned (i.e., discretized) using kmeans with the optimized number of bins for that variable, the next might have been binned using constant width, and so on. Each candidate sequence's correlation was compared to that of the continuous data, and the sequence with the greatest reduction in correlation was selected. This process was repeated using mutual information as well.

Ultimately, 6 experimental datasets were created and grouped by whether variance or entropy was used to optimize the number of bins. In group 1, variance was used to find bin numbers, and the resulting candidate transformations were assessed first with correlation analysis (1a) and then with mutual information (1b). Dataset (1c) consisted of randomly selected bin numbers paired with randomly selected techniques; it is meant to determine whether any improvements might be explained by random chance. The second group followed an identical procedure, but with entropy as the measure used to choose bin numbers for candidate transformations. The transformed variable sequences for (2a) and (2b) were likewise selected using correlation and mutual information, respectively, while the randomly derived dataset was the same one used in group 1, so that the performance of both groups can be compared relative to it. As a point of comparison, the models are also trained on the original continuous data, which offers a performance baseline.
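
The sequence-selection idea described above can be sketched on synthetic data. This is only an illustration under stated assumptions, not the pipeline that produced the TestGroup files: the data, the bin counts in `n_bins`, and the helper names `apply_sequence` and `mean_abs_corr` are hypothetical, and scikit-learn's `KBinsDiscretizer` is assumed as the binning tool (`uniform` for constant width, `quantile` for constant frequency, `kmeans` for kmeans).

```python
from itertools import product

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(33)
# Synthetic stand-in for two correlated continuous predictors.
base = rng.normal(size=1000)
X_demo = np.column_stack([base, 0.8 * base + rng.normal(scale=0.5, size=1000)])

# Hypothetical bin counts; the notebook's optimized values are not shown here.
n_bins = [5, 4]
# sklearn strategy names: uniform = constant width, quantile = constant frequency.
strategies = ["uniform", "quantile", "kmeans"]


def apply_sequence(X, sequence, n_bins):
    """Discretize column j of X with strategy sequence[j] and n_bins[j] bins."""
    columns = []
    for j, (strategy, bins) in enumerate(zip(sequence, n_bins)):
        disc = KBinsDiscretizer(n_bins=bins, encode="ordinal", strategy=strategy)
        columns.append(disc.fit_transform(X[:, [j]]))
    return np.hstack(columns)


def mean_abs_corr(data):
    """Mean absolute off-diagonal Pearson correlation between columns."""
    corr = np.corrcoef(data, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))


# Score every candidate sequence (3 techniques x 2 variables = 9 sequences).
results = {
    seq: mean_abs_corr(apply_sequence(X_demo, seq, n_bins))
    for seq in product(strategies, repeat=X_demo.shape[1])
}
# The lowest post-binning correlation is the greatest reduction.
best_sequence = min(results, key=results.get)

print(f"continuous data: {mean_abs_corr(X_demo):.3f}")
print(f"best sequence {best_sequence}: {results[best_sequence]:.3f}")
```

With eight predictors the exhaustive search grows to 3^8 = 6561 candidate sequences, which is still small enough to enumerate. A mutual-information criterion could replace `mean_abs_corr` in the same loop.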

importing dependencies

In [2]:
from sklearn import tree
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from glob import glob
from collections import OrderedDict
import os
from prettytable import PrettyTable
import pickle

importing data

In [23]:
control_df = pd.read_csv("../Dataset/preprocessed_data.csv")
y = control_df['class']
control_df.drop('class', axis=1, inplace= True)
control_df.head()

Unnamed: 0,alpha,delta,u,g,r,i,z,redshift
0,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,0.634794
1,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,0.779136
2,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,0.644195
3,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,0.932346
4,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,0.116123


In [29]:
file_list = glob('../Dataset/*.csv')

print(file_list)

file_list = [f for f in file_list if "TestGroup_" in f]

print(file_list)
# general comprehension: [expression for item in iterable if condition] 

['../Dataset/TestGroup_2c.csv', '../Dataset/TestGroup_2b.csv', '../Dataset/TestGroup1.csv', '../Dataset/Control_Group.csv', '../Dataset/TestGroup2.csv', '../Dataset/TestGroup_1c.csv', '../Dataset/TestGroup_2a.csv', '../Dataset/TestGroup_1a.csv', '../Dataset/preprocessed_data.csv', '../Dataset/TestGroup_1b.csv', '../Dataset/star_classification.csv']
['../Dataset/TestGroup_2c.csv', '../Dataset/TestGroup_2b.csv', '../Dataset/TestGroup_1c.csv', '../Dataset/TestGroup_2a.csv', '../Dataset/TestGroup_1a.csv', '../Dataset/TestGroup_1b.csv']


In [None]:
df_dict = {}
for file in file_list:
    file_name = os.path.splitext(os.path.basename(file))[0]
    df = pd.read_csv(file)
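    # every column of the discretized file is a predictor; the class labels
    # (y) come from the control data loaded above
    predict = df.iloc[:, :]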

    df_dict[f"{file_name}_df"] = {'df' : df,
                                  'x' : predict,
                                  'y' : y}
    
# orders df_dict objects by key name
df_dict = OrderedDict(sorted(df_dict.items()))


adds the control group to the dictionary

In [31]:
df_dict["control_df"] = {'df' : control_df,
                         'x' : control_df,
                         'y' : y}

In [34]:
print(df_dict.keys())

# saves key names for iterating through
key_names = list(df_dict.keys())

odict_keys(['TestGroup_1a_df', 'TestGroup_1b_df', 'TestGroup_1c_df', 'TestGroup_2a_df', 'TestGroup_2b_df', 'TestGroup_2c_df', 'control_df'])


Creates training/testing groups for each dataset so that they are constant across all tests. The standard 80%/20% training-to-testing allocation is used. Because some of the binning techniques have variable bin sizes, stratified sampling is used. Note that the split is stratified on the class label (`stratify=y`), so 20% of each class goes to testing and 80% to training, which keeps the class distribution of the training/testing groups consistent with the full set of source data.


In [35]:
for group_name, values in df_dict.items():
    X = values['x']
    y = values['y']

    X_train, X_test, y_train, y_test = train_test_split(
        X,y,test_size=0.2, random_state = 33, stratify= y
    )

    values.update(
        {'X_train': X_train,
         'X_test': X_test,
         'y_train': y_train,
         "y_test": y_test}
    )
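
A quick way to check what the stratified split preserves is to compare class proportions before and after splitting. The sketch below uses synthetic data; the class names `A`, `B`, `C` and the single `feature` column are placeholders, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(33)
# Synthetic, imbalanced three-class label (placeholder class names).
y_demo = pd.Series(rng.choice(["A", "B", "C"], size=1000, p=[0.6, 0.2, 0.2]))
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})

# Same arguments as the split above: 80/20, fixed seed, stratified on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33, stratify=y_demo
)

# Class shares in the full data and in each subset, aligned by class name.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```

Because `random_state` is fixed and every group shares the same label vector, the same rows land in the test set for every dataset, provided the rows of each file are in the same order.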

___
### Decision Tree model training
___

Tests the purity measures (gini, entropy, log_loss) across a range of tree depths to determine whether accuracy varies between them.

In [37]:
test_object1 = df_dict[key_names[1]]

criterion = ['gini', 'entropy','log_loss']
max_depth = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30]
score = {
    'gini' : [],
    'entropy' : [],
    'log_loss' : []
}
for criteria in criterion:  
    for depth in max_depth:
        X = test_object1['X_train']
        Y =  test_object1['y_train']
        clf = tree.DecisionTreeClassifier(criterion=criteria, max_depth=depth)
        score[criteria].append(np.mean(cross_val_score(clf, X, Y, cv=10)))
    
print(f"gini: {score['gini']}")
print(f"entropy: {score['entropy']}")
print(f"log_loss: {score['log_loss']}")


gini: [np.float64(0.9362691723908754), np.float64(0.953508593048975), np.float64(0.9600516227958087), np.float64(0.9646138790253508), np.float64(0.9648439289368037), np.float64(0.962492523896343), np.float64(0.9598344046918236), np.float64(0.9568056964211211), np.float64(0.9533169209867329), np.float64(0.9522178810861478), np.float64(0.9510677474691146), np.float64(0.9507227109765187), np.float64(0.9502881980193816), np.float64(0.950761043102822), np.float64(0.9504671133903617)]
entropy: [np.float64(0.9362691723908754), np.float64(0.9537769473263937), np.float64(0.9627864160507), np.float64(0.9652911740907879), np.float64(0.9659556994012748), np.float64(0.9638087704701865), np.float64(0.9610995526503364), np.float64(0.9579174652526312), np.float64(0.9556043890725515), np.float64(0.9539302872623328), np.float64(0.952984647717243), np.float64(0.9522945502376367), np.float64(0.9518728494924348), np.float64(0.9519750663186283), np.float64(0.9523584382034491)]
log_loss: [np.float64(0.936269

Because the cross-validated scores are comparable across all purity measures at each depth, gini will be used.
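
The long printed lists are easier to compare once reduced to the best depth per criterion. A minimal sketch follows, using the first six gini and entropy values printed above rounded to four decimals. The `log_loss` list is left out because its output is truncated, and the helper name `best_per_criterion` is hypothetical.

```python
max_depth = [2, 4, 6, 8, 10, 12]
# Mean 10-fold CV accuracies, rounded from the first six entries printed above.
score = {
    "gini": [0.9363, 0.9535, 0.9601, 0.9646, 0.9648, 0.9625],
    "entropy": [0.9363, 0.9538, 0.9628, 0.9653, 0.9660, 0.9638],
}


def best_per_criterion(score, depths):
    """Return {criterion: (best_depth, best_mean_accuracy)}."""
    summary = {}
    for criterion, values in score.items():
        best_idx = max(range(len(values)), key=values.__getitem__)
        summary[criterion] = (depths[best_idx], values[best_idx])
    return summary


summary = best_per_criterion(score, max_depth)
for criterion, (depth, acc) in summary.items():
    print(f"{criterion}: best depth {depth}, mean accuracy {acc:.4f}")

# Gap between the best and worst criterion at their respective best depths.
best_scores = [acc for _, acc in summary.values()]
spread = max(best_scores) - min(best_scores)
print(f"spread between criteria: {spread:.4f}")
```

Both criteria peak at the same depth in these values, and the gap between them is on the order of 0.001, which is consistent with treating them as comparable.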

In [38]:
for group_name, values in df_dict.items():

    dt_model = tree.DecisionTreeClassifier(criterion = 'gini',random_state = 33)
    dt_model.fit(values['X_train'], values['y_train'])

    values.update(
        {'dt_model': dt_model}
    )

Accuracy, AUC, and complexity are evaluated and stored for each test dataset. This includes the distribution of AUC scores and their means.

In [39]:
for group_name, values in df_dict.items():
    # Get the trained decision tree model and test data
    dt_model = values['dt_model']
    X_test = values['X_test']
    y_test = values['y_test']

    # Predict the labels and probabilities for the test set
    y_pred = dt_model.predict(X_test)
    y_proba = dt_model.predict_proba(X_test)

    # Evaluate accuracy and AUC for multi-class
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')

    # Model Complexity Measures for Decision Trees
    tree_depth = dt_model.get_depth()  # Depth of the decision tree
    num_nodes = dt_model.tree_.node_count  # Number of nodes in the tree
    num_features_used = len(dt_model.feature_importances_[dt_model.feature_importances_ > 0])  # Number of features used
  

    # Create a DataFrame for complexity measures
    complexity_df = pd.DataFrame({
        'tree_depth': [tree_depth],
        'num_nodes': [num_nodes],
        'num_features_used': [num_features_used]
    })

    # Add performance and complexity data to the values dictionary
    values.update({
        'dt_y_pred': y_pred,
        'dt_y_proba': y_proba,
        'dt_accuracy': acc,
        'dt_auc': auc,
        'dt_complexity_metrics': complexity_df  
    })
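
The cell above scores each model on a single held-out test split. One way a per-fold distribution of AUC and accuracy could be obtained is 10-fold stratified cross validation, sketched below on synthetic data. The dataset from `make_classification` is a stand-in for one group's training split, and the `roc_auc_ovr` scoring string is assumed to match the one-vs-rest, macro-averaged AUC used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for one group's training split.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=33
)

# Ten stratified folds with a fixed seed so the folds are reproducible.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=33)
clf = DecisionTreeClassifier(criterion="gini", random_state=33)

# One score per fold for each metric.
auc_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc_ovr")
acc_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"AUC per fold: {np.round(auc_scores, 3)}")
print(f"mean AUC: {auc_scores.mean():.3f} (std {auc_scores.std():.3f})")
print(f"mean accuracy: {acc_scores.mean():.3f} (std {acc_scores.std():.3f})")
```

In the notebook, `X_demo` and `y_demo` would be replaced by a group's `X_train` and `y_train`, and the per-fold arrays could be stored next to `dt_accuracy` and `dt_auc`.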

In [40]:
table = PrettyTable()
table.field_names = ["Group", "Tree Depth", "Node Count", "Features Used"]

for group_name, values in df_dict.items():
    dt_model = values['dt_model']
    table.add_row([
        group_name,
        dt_model.get_depth(),
        dt_model.tree_.node_count,
        len(dt_model.feature_importances_[dt_model.feature_importances_ > 0])
    ])

print(table)

+-----------------+------------+------------+---------------+
|      Group      | Tree Depth | Node Count | Features Used |
+-----------------+------------+------------+---------------+
| TestGroup_1a_df |     33     |   17299    |       8       |
| TestGroup_1b_df |     30     |    8585    |       8       |
| TestGroup_1c_df |     30     |   14625    |       8       |
| TestGroup_2a_df |     25     |    9691    |       8       |
| TestGroup_2b_df |     30     |    8585    |       8       |
| TestGroup_2c_df |     30     |   14625    |       8       |
|    control_df   |     33     |    3807    |       8       |
+-----------------+------------+------------+---------------+


___
### Random Forest model training 
___

Tests purity measures for random forests.

In [41]:
test_object1 = df_dict[key_names[1]]

criterion = ['gini', 'entropy','log_loss']
estimators = [20,50,100]
max_depth = [2,4,6,8,10,12,14,16,18,20]
score = {
    'gini' : [],
    'entropy' : [],
    'log_loss' : []
}
for criteria in criterion:
    for estimator in estimators:  
        for depth in max_depth:
            X = test_object1['X_train']
            Y =  test_object1['y_train']
            rf_model = RandomForestClassifier(n_estimators=estimator, criterion=criteria, max_depth=depth, n_jobs=24)
            score[criteria].append(np.mean(cross_val_score(rf_model, X, Y, cv=10)))
    
print(f"gini: {score['gini']}")
print(f"entropy: {score['entropy']}")
print(f"log_loss: {score['log_loss']}")

gini: [np.float64(0.8508005819872972), np.float64(0.9467227518005437), np.float64(0.9584669321316918), np.float64(0.9651250431714062), np.float64(0.967847045442855), np.float64(0.968588256234441), np.float64(0.9688182979810888), np.float64(0.968843858719481), np.float64(0.9681409669905101), np.float64(0.9679492818645802), np.float64(0.8398356424762546), np.float64(0.945636480021947), np.float64(0.9602177210559706), np.float64(0.9657128846337555), np.float64(0.9682304271254416), np.float64(0.9692655708954095), np.float64(0.9692655676294878), np.float64(0.9693805787050456), np.float64(0.9687671650735773), np.float64(0.9684860263445596), np.float64(0.8573690728619029), np.float64(0.9461348417538327), np.float64(0.9600260408289236), np.float64(0.9659045779244902), np.float64(0.968434891804087), np.float64(0.9692399987262906), np.float64(0.969738398016279), np.float64(0.9693550195996143), np.float64(0.9689077417864104), np.float64(0.9690355405794889)]
entropy: [np.float64(0.8577272497915933

Again, the measures are comparable, but this was worth checking since it is not guaranteed. Gini is used again for simplicity in training.

In [42]:
for group_name, values in df_dict.items():
    rf_model = RandomForestClassifier(
        n_estimators=100,
        criterion='gini',
        max_depth=None,
        random_state=33,
        n_jobs=24  # use up to 24 cores; yay memory update!
    )
    rf_model.fit(values['X_train'], values['y_train'])

    values.update({'rf_model': rf_model})

In [43]:
df_dict[key_names[3]]['rf_model']

Accuracy, AUC, and complexity are evaluated and stored for each test dataset. This includes the distribution of AUC scores and their means.

In [44]:
for group_name, values in df_dict.items():
    # Get the trained random forest model and test data
    rf_model = values['rf_model']
    X_test = values['X_test']
    y_test = values['y_test']

    # Predict the labels and probabilities for the test set
    y_pred = rf_model.predict(X_test)
    y_proba = rf_model.predict_proba(X_test)

    # Evaluate accuracy and AUC for multi-class
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')

    # Model Complexity Measures for Random Forests
    tree_depths = [estimator.tree_.max_depth for estimator in rf_model.estimators_]
    num_nodes_list = [estimator.tree_.node_count for estimator in rf_model.estimators_]

    avg_depth = np.mean(tree_depths)
    avg_nodes = np.mean(num_nodes_list)
    num_features_used = len([i for i in rf_model.feature_importances_ if i > 0])
    num_estimators = rf_model.n_estimators

    # Create a DataFrame for complexity measures
    complexity_df = pd.DataFrame({
        'avg_tree_depth': [avg_depth],
        'avg_num_nodes': [avg_nodes],
        'num_features_used': [num_features_used],
        'num_estimators': [num_estimators]
    })

    # Add performance and complexity data to the values dictionary
    values.update({
        'rf_y_pred': y_pred,
        'rf_y_proba': y_proba,
        'rf_accuracy': acc,
        'rf_auc': auc,
        'rf_complexity_metrics': complexity_df
    })

In [45]:
table = PrettyTable()
table.field_names = ["Group", "Ave Tree Depth", " Ave Node Count", "Features Used", "Estimators"]

for group_name, values in df_dict.items():
    rf_compl_df = values['rf_complexity_metrics'].iloc[0]
    table.add_row([
        group_name,
        rf_compl_df['avg_tree_depth'],
        rf_compl_df['avg_num_nodes'],
        rf_compl_df['num_features_used'],
        rf_compl_df['num_estimators'],
    ])

print(table)

+-----------------+----------------+-----------------+---------------+------------+
|      Group      | Ave Tree Depth |  Ave Node Count | Features Used | Estimators |
+-----------------+----------------+-----------------+---------------+------------+
| TestGroup_1a_df |     31.73      |     15270.56    |      8.0      |   100.0    |
| TestGroup_1b_df |     29.69      |     8674.02     |      8.0      |   100.0    |
| TestGroup_1c_df |     29.41      |     13026.78    |      8.0      |   100.0    |
| TestGroup_2a_df |     23.95      |      8291.8     |      8.0      |   100.0    |
| TestGroup_2b_df |     29.69      |     8674.02     |      8.0      |   100.0    |
| TestGroup_2c_df |     29.41      |     13026.78    |      8.0      |   100.0    |
|    control_df   |     30.27      |      4343.3     |      8.0      |   100.0    |
+-----------------+----------------+-----------------+---------------+------------+


___
### Support Vector Machine model training
___

fitting models

In [None]:
for group_name, values in df_dict.items():

    svm_model = svm.SVC(probability=True, kernel='rbf', random_state=33)

    svm_model.fit(values['X_train'], values['y_train'])

    values.update({'svm_model': svm_model})

evaluating models

In [None]:
for group_name, values in df_dict.items():

    svm_model = values['svm_model']
    X_test = values['X_test']
    y_test = values['y_test']


    # Predict on the test data
    y_pred = svm_model.predict(X_test)
    y_proba = svm_model.predict_proba(X_test)

    # Evaluate on the test set: accuracy and AUC
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
    # model complexity
    num_support_vecs = len(svm_model.support_vectors_)
    # Store the results
    values.update({
        'svm_y_pred': y_pred,
        'svm_y_proba': y_proba,
        'svm_accuracy': acc,
        'svm_auc': auc,
        'svm_complexity_metrics': num_support_vecs
    })

In [None]:
with open('../Dataset/Testgroup_results.pkl', 'wb') as f:
    pickle.dump(df_dict, f)
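
Once the results are pickled, a later session can reload them and collect the stored metrics into one table. The sketch below shows the round trip with a small placeholder dictionary in a temporary directory. The metric values are made up for illustration and are not results from this notebook; only the key names (`dt_accuracy`, `dt_auc`) follow the cells above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Placeholder structure mimicking df_dict; the metric values are made up.
demo_results = {
    "TestGroup_1a_df": {"dt_accuracy": 0.90, "dt_auc": 0.95},
    "control_df": {"dt_accuracy": 0.91, "dt_auc": 0.96},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_results.pkl")
    # Write and read back, mirroring the pickle.dump call above.
    with open(path, "wb") as f:
        pickle.dump(demo_results, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# One row per group, one column per stored metric.
metric_keys = ["dt_accuracy", "dt_auc"]
summary = pd.DataFrame(
    {group: {k: values[k] for k in metric_keys} for group, values in loaded.items()}
).T
print(summary)
```

Since `df_dict` also holds fitted models, the pickle is best reloaded with the same scikit-learn version that created it.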