Christian Erikson

Math 76 Homework 4

# Decision trees, interpretability, and algorithmic bias

## Objective

In this week's project, you will explore the COMPAS data set. COMPAS stands for "Correctional Offender Management Profiling for Alternative Sanctions". It is a software tool that is used to assess the risk that a registered offender will commit another offense. Although researchers and journalists have pointed to [various problems](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) with this algorithm over many years, the algorithm is still used to inform sentences and parole decisions in several US states. 
You can learn more about the COMPAS data set [here](https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis). 

Through this project, you will practice fitting and validating several classification models and you will explore some distinct benefits of using decision trees in machine learning. As part of that exploration, you are going to audit your model for demographic biases via a "closed box" and an "open box" approach.

The COMPAS data set is a favorite example among critics of machine learning because it demonstrates several shortcomings and failure modes of machine learning techniques. The lessons learned from this project might be discouraging, and they are important. Keep in mind, however, that what you see here does not generalize to all data sets. 

This project has four parts.

### Part 1: Prepare the COMPAS data set  (PARTIALLY FOR YOU TO COMPLETE)

In this part, you will load the COMPAS data set, explore its content, and select several variables as features (i.e., queries) or class labels (i.e., responses). Some of these features are not numerical, so you will need to replace some categorical values with zeros and ones. Your features will include a categorical variable with more than two categories. You will use 1-hot encoding to include this feature in your data set. 

This part includes four steps:
1. Load and explore data set
2. Select features and response variables
3. Construct numerical coding for categorical features
4. Split the data

### Part 2: Train and validate a decision tree  (PARTIALLY FOR YOU TO COMPLETE)

In this part, you will fit a decision tree to your data. You will examine the effect of tuning the complexity of the tree via the "maximum number of leaves" parameter and use 5-fold cross-validation to find an optimal value.

This part includes three steps:

1. Fit a decision tree on the training data
2. Tune the parameter "maximum number of leaves"
3. Calculate the selected model's test performance


### Part 3: Auditing a decision tree for demographic biases  (PARTIALLY FOR YOU TO COMPLETE)

Your training data includes several demographic variables (i.e., age, sex, race). A crude way to assess whether a model has some demographic bias is to remove the corresponding variables from your training data and explore how that removal affects your model's performance. Decision trees have the advantage of being interpretable machine learning models. By going through the decision nodes (i.e., branching points), you can "open the black box and look inside". Specifically, you can assess how each feature is used in the decision making process.

This part includes three steps:

1. Fit a decision tree
2. Check for racial bias via performance assessment
3. Check for racial bias via decision rules

### Part 4: Comparison to other linear classifiers (FOR YOU TO COMPLETE)

For some types of data, decision trees tend to achieve lower prediction accuracies than other classifiers. In this part, you will train and tune several classifiers on the COMPAS data. You will then compare their performance on your test set.

This part includes four steps:

1. Fit LDA and logistic regression
2. Tune and fit ensemble methods
3. Tune and fit SVC
4. Compare performance metrics for all models 

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## Part 1: Prepare the COMPAS data set

>In this part, you will load the COMPAS data set, explore its content, and select several variables as features (i.e., queries) or class labels (i.e., responses). Some of these features are not numerical, so you will need to replace some categorical values with zeros and ones. Your features will include a categorical variable with more than two categories. You will use 1-hot encoding to include this feature in your data set.
>
>This part includes four steps:
>1. Load and explore data set
>2. Select features and response variables
>3. Construct numerical coding for categorical features
>4. Split the data



### Part 1, Step 1: Load and explore data set

This folder includes the 'compas-scores-two-years.csv' file. The COMPAS data that you will use for this project is in this file. It is always a good idea to look at the raw data before proceeding with one's machine learning pipeline.

In [2]:
# load data
raw_data = pd.read_csv('compas-scores-two-years.csv')
# print a list of variable names
print(raw_data.columns)
# look at the first 5 rows 
raw_data.head(5)

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')


| | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | start | end | event | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 | 327 | 0 | 0 |
| 1 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 9 | 159 | 1 | 1 |
| 2 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 0 | 63 | 0 | 1 |
| 3 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | 6 | Medium | 2013-01-13 | | | 1 | 0 | 1174 | 0 | 0 |
| 4 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | 1 | Low | 2013-03-26 | | | 2 | 0 | 1102 | 0 | 0 |


The data set includes 53 variables that carry different types of information:
* personal data (e.g., name, first name ("first"), last name ("last")) 
* demographic data (i.e., sex, age, age category ("age_cat"), and race)
* data related to the person's history of committed offenses (e.g., juvenile felony count ("juv_fel_count"), juvenile misdemeanor count ("juv_misd_count"), and prior offenses count ("priors_count"))
* data related to the charge against the person (e.g., charge offense date ("c_offense_date"), charge arrest date ("c_arrest_date"), charge degree ("c_charge_degree"), and description of charge ("c_charge_desc"))
* recidivism scores assigned by the COMPAS algorithm (e.g., "decile_score", "score_text", "v_decile_score", "v_score_text")
* data related to an actual recidivism charge (e.g., degree of recidivism charge ("r_charge_degree"), date of recidivism offense ("r_offense_date"), description of recidivism charge ("r_charge_desc"))
* data related to an actual violent recidivism charge (e.g., degree of violent recidivism charge ("vr_charge_degree"), date of violent recidivism offense ("vr_offense_date"), description of violent recidivism charge ("vr_charge_desc")).

### Part 1, Step 2: Select features and response variables

The ProPublica article was assessing bias in the COMPAS scores. Here, you will ignore the COMPAS scores and instead explore the challenges of predicting recidivism based on the survey data. What variables seem like sensible predictors? What variables would be sensible outcome variables? The code in the cell below selects some numerical and categorical variables for you to include in your model.

In [3]:
# Select features and response variables

# Features by type
numerical_features = ['juv_misd_count', 'juv_other_count', 'juv_fel_count', 
    'priors_count', 'age']
binary_categorical_features = ['sex', 'c_charge_degree']
other_categorical_features = ['race']
all_features = binary_categorical_features + other_categorical_features + numerical_features

# Possible response variables
response_variables = ['is_recid', 'is_violent_recid', 'two_year_recid']

# Variables that are used for data cleaning
check_variables = ['days_b_screening_arrest']

ProPublica filtered some observations (i.e., rows in the data frame). See their explanation below. Let's follow their procedure.


> There are a number of reasons remove rows because of missing data:
>
> * If the charge date of a defendants Compas scored crime was not within 30 days from when the person was arrested, we assume that because of data quality reasons, that we do not have the right offense.
> * We coded the recidivist flag -- is_recid -- to be -1 if we could not find a compas case at all.
> * In a similar vein, ordinary traffic offenses -- those with a c_charge_degree of 'O' -- will not result in Jail time are removed (only two of them).
> * We filtered the underlying data from Broward county to include only those rows representing people who had either recidivated in two years, or had at least two years outside of a correctional facility.


In [4]:
# Subselect data
df = raw_data[all_features+response_variables+check_variables]

# Apply filters
df = df[(df['days_b_screening_arrest'] <= 30) & 
        (df['days_b_screening_arrest'] >= -30) & 
        (df['is_recid'] != -1) & 
        (df['c_charge_degree'] != 'O')]

df = df[all_features+response_variables]
print('Dataframe has {} rows and {} columns.'.format(df.shape[0], df.shape[1]))

Dataframe has 6172 rows and 11 columns.


### Part 1, Step 3: Construct numerical coding for categorical features

Some of the features in the subselected data are not numerical, so you will need to replace some categorical values with zeros and ones. Your features will include "race", which was surveyed as one categorical variable with more than two categories. You will use [1-hot encoding](https://en.wikipedia.org/wiki/One-hot) to include this feature in your data set. 

In [5]:
# Code binary features as 0 and 1
for x in binary_categorical_features:
    for new_value, value in enumerate(set(df[x])):
        print("Replace {} with {}.".format(value, new_value))
        df = df.replace(value, new_value)

Replace Male with 0.
Replace Female with 1.
Replace M with 0.
Replace F with 1.



In [6]:
# Use 1-hot encoding for other categorical variables
one_hot_features = []
for x in other_categorical_features:
    for new_feature, value in enumerate(set(df[x])):
        feature_name = "{}_is_{}".format(x,value)
        df.insert(3, feature_name, df[x]==value)
        one_hot_features += [feature_name]

# Check what the data frame looks like now
df.head(10)

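| | sex | c_charge_degree | race | race_is_Caucasian | race_is_Asian | race_is_Native American | race_is_African-American | race_is_Other | race_is_Hispanic | juv_misd_count | juv_other_count | juv_fel_count | priors_count | age | is_recid | is_violent_recid | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Other | False | False | False | False | True | False | 0 | 0 | 0 | 0 | 69 | 0 | 0 | 0 |
| 1 | 0 | 1 | African-American | False | False | False | True | False | False | 0 | 0 | 0 | 0 | 34 | 1 | 1 | 1 |
| 2 | 0 | 1 | African-American | False | False | False | True | False | False | 0 | 1 | 0 | 4 | 24 | 1 | 0 | 1 |
| 5 | 0 | 0 | Other | False | False | False | False | True | False | 0 | 0 | 0 | 0 | 44 | 0 | 0 | 0 |
| 6 | 0 | 1 | Caucasian | True | False | False | False | False | False | 0 | 0 | 0 | 14 | 41 | 1 | 0 | 1 |
| 7 | 0 | 1 | Other | False | False | False | False | True | False | 0 | 0 | 0 | 3 | 43 | 0 | 0 | 0 |
| 8 | 1 | 0 | Caucasian | True | False | False | False | False | False | 0 | 0 | 0 | 0 | 39 | 0 | 0 | 0 |
| 10 | 0 | 1 | Caucasian | True | False | False | False | False | False | 0 | 0 | 0 | 0 | 27 | 0 | 0 | 0 |
| 11 | 0 | 0 | African-American | False | False | False | True | False | False | 0 | 0 | 0 | 3 | 23 | 1 | 0 | 1 |
| 12 | 1 | 0 | Caucasian | True | False | False | False | False | False | 0 | 0 | 0 | 0 | 37 | 0 | 0 | 0 |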
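
For comparison, the same kind of 1-hot encoding can be sketched with `pandas.get_dummies`. This is an illustrative alternative on a small toy frame, not the code used in this notebook; the frame `toy` and its values are assumptions made up for the example.

```python
import pandas as pd

# Toy frame standing in for the 'race' column (values are illustrative only)
toy = pd.DataFrame({"race": ["Other", "African-American", "Caucasian", "Other"]})

# One indicator column per category, named in the same style as above:
# race_is_<value>. get_dummies sorts the categories alphabetically.
one_hot = pd.get_dummies(toy["race"], prefix="race_is", prefix_sep="_")

print(one_hot.columns.tolist())
# Each row has exactly one active indicator
print(one_hot.sum(axis=1).tolist())
```

Unlike iterating over `set(df[x])`, this produces the indicator columns in a reproducible order.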


### Part 1, Step 4: Split the data

Let's collect the features in one data frame and the responses in another data frame. After that, you will set a small portion of the data set aside for testing.

In [7]:
# list of features
features = numerical_features + binary_categorical_features + one_hot_features

# features data frame
X = df[features]

# responses data frame
Y = df[response_variables]

# Split the data into a training set containing 90% of the data
# and test set containing 10% of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=7)

# Part 2: Train and validate a decision tree

>In this part, you will fit a decision tree to your data. You will examine the effect of tuning the complexity of the tree via the "maximum number of leaves" parameter and use 5-fold cross-validation to find an optimal value.
>
>This part includes three steps:
>
>1. Fit a decision tree on the training data
>2. Tune the parameter "maximum number of leaves"
>3. Calculate the selected model's test performance

### Part 2, Step 1: Fit a decision tree on the training data

Start by fitting a decision tree to your training data. Assess its training accuracy and its size.

In [8]:
# Create a model
model = DecisionTreeClassifier(random_state=7)
    
# Fit model to training data
model.fit(X_train, Y_train)

# Evaluate training accuracy
y_pred = model.predict(X_train)

accuracy = accuracy_score(Y_train, y_pred)

# Check size of decision tree
num_leaves = model.get_n_leaves()

# Report results
print('Trained decision tree with {} leaves and training accuracy {:.2f}.'.format(num_leaves, accuracy))

Trained decision tree with 2038 leaves and training accuracy 0.79.


Your tree has a good training accuracy by the standards of tabular data prediction problems, but its size is enormous! It has so many leaves that, on average, there is one leaf for roughly every 3 training observations. It is very probable that this tree is overfitting.
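
The average number of training observations per leaf can be checked with a short calculation from the numbers printed above. The sketch assumes that `train_test_split` rounds the test set size up, which is how the training set size is derived here.

```python
import math

n_rows = 6172        # rows after filtering (reported above)
test_fraction = 0.1  # test_size passed to train_test_split
n_leaves = 2038      # leaves of the unconstrained tree (reported above)

# Assumption: the test set size is rounded up, the rest is training data
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

obs_per_leaf = n_train / n_leaves
print(n_train, n_test, round(obs_per_leaf, 2))
```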

### Part 2, Step 2: Tune the parameter "maximum number of leaves"

Let's try to constrain the complexity of a decision tree during training by setting a value for the argument `max_leaf_nodes` (the maximum number of leaves). You can use scikit-learn's `cross_val_score` function to quickly assess the out-of-sample performance of trees of varying complexity.

In [9]:
# Perform 5-fold cross-validation for different tree sizes

print('Leaves\tMean accuracy')
print('---------------------')
for num_leaves in range(10,1800,10):

    # Trees must have at least 2 leaves
    if num_leaves >= 2:

        # construct a classifier with a limit on its number of leaves
        tree = DecisionTreeClassifier(max_leaf_nodes=num_leaves, random_state=7)

        # Get validation accuracy via 5-fold cross-validation
        scores = cross_val_score(tree, X_train, Y_train, cv=5)
    
        print("{}\t{:.3f}".format(num_leaves,scores.mean()))

Leaves	Mean accuracy
---------------------
10	0.550
20	0.582
30	0.593
40	0.592
50	0.585
60	0.583
70	0.584
80	0.582
90	0.583
100	0.581
110	0.580
120	0.579
130	0.578
140	0.578
150	0.579
160	0.579
170	0.577
180	0.578
190	0.574
200	0.573
210	0.574
220	0.572
230	0.571
240	0.570
250	0.570
260	0.568
270	0.568
280	0.567
290	0.565
300	0.565
310	0.565
320	0.566
330	0.565
340	0.562
350	0.563
360	0.560
370	0.560
380	0.559
390	0.559
400	0.558
410	0.555
420	0.553
430	0.551
440	0.551
450	0.553
460	0.555
470	0.554
480	0.552
490	0.551
500	0.550
510	0.551
520	0.551
530	0.550
540	0.549
550	0.548
560	0.548
570	0.548
580	0.547
590	0.548
600	0.547
610	0.546
620	0.544
630	0.544
640	0.545
650	0.545
660	0.546
670	0.546
680	0.545
690	0.541
700	0.541
710	0.541
720	0.542
730	0.541
740	0.541
750	0.541
760	0.541
770	0.539
780	0.537
790	0.537
800	0.535
810	0.535
820	0.535
830	0.532
840	0.532
850	0.531
860	0.531
870	0.531
880	0.530
890	0.530
900	0.529
910	0.527
920	0.527
930	0.526
940	0.527
950	0.527
960	0.527
970	0.

Adjust the range of values for `max_leaf_nodes` in the cell above, to identify the best value.
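
The same search can also be written with `GridSearchCV`, which runs the cross-validation loop and keeps track of the best value. The following is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the COMPAS data); the grid of candidate values is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (not the COMPAS data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)

# Candidate values for the maximum number of leaves (illustrative grid)
param_grid = {"max_leaf_nodes": list(range(2, 61, 2))}

# 5-fold cross-validation over the grid, scored by accuracy
search = GridSearchCV(DecisionTreeClassifier(random_state=7), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_leaves = search.best_params_["max_leaf_nodes"]
print(best_leaves, round(search.best_score_, 3))
```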

### Part 2, Step 3: Calculate the selected model's test performance

Train a decision tree using your selected value of `max_leaf_nodes` on the full training set. Assess its accuracy on your test set.

In [10]:
# Create a model
model = DecisionTreeClassifier(max_leaf_nodes=30, random_state=7)
    
# Fit model to training data
model.fit(X_train, Y_train)

# Evaluate test accuracy
Y_pred = model.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)

# Check size of decision tree
num_leaves = model.get_n_leaves()

# Report results
print('Trained decision tree with {} leaves and test accuracy {:.2f}.'.format(num_leaves, accuracy))

Trained decision tree with 30 leaves and test accuracy 0.59.
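
Note that this tree was fit on `Y_train`, which holds all three response columns. For a two-dimensional indicator target like this, scikit-learn's `accuracy_score` computes subset accuracy: a row counts as correct only if every column is predicted correctly. The toy arrays below are made up to illustrate the difference from per-column accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy targets with three binary response columns (illustrative values)
y_true = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [1, 1, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 1],
                   [1, 1, 0]])

# Subset accuracy: fraction of rows where all three columns match
subset_acc = accuracy_score(y_true, y_pred)

# Per-column accuracy: fraction of matches within each column
per_column_acc = (y_true == y_pred).mean(axis=0)

print(subset_acc, per_column_acc)
```

Subset accuracy can never exceed the accuracy of the hardest single column, so it is not directly comparable to the accuracy of a model that predicts one response only.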


# Part 3: Auditing a decision tree for demographic biases

>Your training data includes several demographic variables (i.e., age, sex, race). A crude way to assess whether a model has some demographic bias is to remove the corresponding variables from your training data and explore how that removal affects your model's performance. Decision trees have the advantage of being interpretable machine learning models. By going through the decision nodes (i.e., branching points), you can "open the black box and look inside". Specifically, you can assess how each feature is used in the decision making process.
>
>This part includes three steps:
>
>1. Fit a decision tree
>2. Check for racial bias via performance assessment
>3. Check for racial bias via decision rules
  
### Part 3, Step 2: Check for racial bias via performance assessment
A simple approach to identifying demographic biases in machine learning is the following: (i) Train and validate the model on the full training set, (ii) train and validate the model on a subset of training variables that excludes the variables related to a potential demographic bias, (iii) compare the results. 

You have noticed that the validation accuracy of your model can vary for different holdout set selections. To account for these variations, you are going to compare the mean validation accuracy over 100 trees. (You have completed (i) in the previous cell already. Continue now with (ii).)

In [11]:
# Create subset of training data without information on race. 
# (The information on race was encoded in the one-hot features.)
remaining_features = [v for v in X.columns if v not in one_hot_features]
X_train_sub = X_train[remaining_features]
X_test_sub = X_test[remaining_features]

# Create a model
dtc = DecisionTreeClassifier(max_leaf_nodes=39)
    
# Fit model to training data
dtc.fit(X_train_sub, Y_train['two_year_recid'])

# Evaluate test accuracy
y_pred = dtc.predict(X_test_sub)
accuracy = (y_pred == Y_test['two_year_recid']).mean()

# Check size of decision tree
num_leaves = dtc.get_n_leaves()

# Report results
print('Trained decision tree with {} leaves and test accuracy {:.2f}.'.format(num_leaves, accuracy))

Trained decision tree with 39 leaves and test accuracy 0.68.


Comparing the mean accuracy values on all features versus the subselected feature set, what do you conclude about the importance of racial information in this classification problem?

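The test accuracy is higher when racial information is removed from the data set (0.68 versus 0.59). Because race is not needed to reach this accuracy, it does not appear to be very important to the classification problem. The two numbers are not strictly comparable, however: the first tree was fit on all three response variables with `max_leaf_nodes=30`, while the second was fit on `two_year_recid` only with `max_leaf_nodes=39`.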
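
A more careful version of this comparison averages the validation accuracy over 100 random holdout splits, using the same response variable and the same `max_leaf_nodes` for both feature sets. The sketch below shows the procedure on synthetic data; the helper `mean_holdout_accuracy` and the choice of `dropped_columns` are hypothetical and only stand in for the one-hot race features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and a single response (not COMPAS data)
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     n_informative=4, random_state=7)

# Hypothetical columns playing the role of the one-hot race features
dropped_columns = [6, 7]
kept_columns = [j for j in range(X_demo.shape[1]) if j not in dropped_columns]


def mean_holdout_accuracy(X, y, n_repeats=100, max_leaf_nodes=30):
    """Mean validation accuracy over n_repeats random holdout splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes,
                                      random_state=seed)
        tree.fit(X_tr, y_tr)
        scores.append(tree.score(X_val, y_val))
    return float(np.mean(scores))


acc_all = mean_holdout_accuracy(X_demo, y_demo)
acc_sub = mean_holdout_accuracy(X_demo[:, kept_columns], y_demo)
print("all features: {:.3f}, subset: {:.3f}".format(acc_all, acc_sub))
```

Because both calls use the same seeds, the two models are evaluated on identical splits, so the difference in means reflects the removed columns rather than the choice of holdout set.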

### Part 3, Step 3: Check for racial bias via decision rules
The interpretability of decision trees allows for an alternative approach to detecting racial bias. You can simply look at the decision rules. Use scikit-learn's function `export_text` to get your decision tree in text format. Compare the decision rules of your tree with all features and your tree fitted on the subset without racial information. Do you find any indication of racial bias in the decision rules of the first tree?

In [12]:
rules = export_text(model, feature_names=X.columns.tolist())
print(rules)

|--- priors_count <= 2.50
|   |--- age <= 22.50
|   |   |--- age <= 20.50
|   |   |   |--- age <= 19.50
|   |   |   |   |--- class: 1
|   |   |   |--- age >  19.50
|   |   |   |   |--- priors_count <= 1.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- priors_count >  1.50
|   |   |   |   |   |--- class: 1
|   |   |--- age >  20.50
|   |   |   |--- sex <= 0.50
|   |   |   |   |--- priors_count <= 1.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- priors_count >  1.50
|   |   |   |   |   |--- class: 1
|   |   |   |--- sex >  0.50
|   |   |   |   |--- class: 1
|   |--- age >  22.50
|   |   |--- priors_count <= 0.50
|   |   |   |--- age <= 35.50
|   |   |   |   |--- sex <= 0.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- sex >  0.50
|   |   |   |   |   |--- age <= 33.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- age >  33.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |--- age >  35.50
|   |   |   |   |--- class: 1
|   |   |--- priors_

In [13]:
rules_sub = export_text(dtc, feature_names=X_test_sub.columns.tolist())
print(rules_sub)

|--- priors_count <= 2.50
|   |--- age <= 22.50
|   |   |--- age <= 20.50
|   |   |   |--- priors_count <= 1.50
|   |   |   |   |--- age <= 19.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- age >  19.50
|   |   |   |   |   |--- sex <= 0.50
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- sex >  0.50
|   |   |   |   |   |   |--- class: 0
|   |   |   |--- priors_count >  1.50
|   |   |   |   |--- class: 1
|   |   |--- age >  20.50
|   |   |   |--- sex <= 0.50
|   |   |   |   |--- priors_count <= 1.50
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- priors_count >  1.50
|   |   |   |   |   |--- class: 1
|   |   |   |--- sex >  0.50
|   |   |   |   |--- age <= 21.50
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- age >  21.50
|   |   |   |   |   |--- class: 0
|   |--- age >  22.50
|   |   |--- priors_count <= 0.50
|   |   |   |--- age <= 35.50
|   |   |   |   |--- sex <= 0.50
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- sex >  0.50
|   |   | 

There is a decision node in the original model that splits the data using the `race_is_Other` feature. Individuals who have "Other" listed for race are more likely to be classified as recidivists by the model. This node is not included in the model fitted on the subset of the features.
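
Instead of reading the printed rules by eye, the features that appear in any decision node can be listed from the fitted tree. The sketch below uses synthetic data and made-up feature names (`f0`, `f1`, ...); with the COMPAS trees, one would look for names starting with `race_is_` in the resulting list.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the COMPAS data)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=7)
feature_names = ["f{}".format(j) for j in range(X_demo.shape[1])]  # hypothetical names

tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=7).fit(X_demo, y_demo)

# tree_.feature holds the feature index used at each node;
# negative values mark leaves, which do not split on any feature
split_indices = sorted({int(j) for j in tree.tree_.feature if j >= 0})
used_features = [feature_names[j] for j in split_indices]

# Impurity-based importances, normalized to sum to 1
importances = dict(zip(feature_names, tree.feature_importances_))

for name in used_features:
    print(name, round(importances[name], 3))
```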

# Part 4: Comparison to other linear classifiers

>For some types of data, decision trees tend to achieve lower prediction accuracies than other classifiers. In this part, you will train and tune several classifiers on the COMPAS data. You will then compare their performance on your test set.
>
>This part includes four steps:
>
>1. Fit LDA and logistic regression
>2. Tune and fit ensemble methods
>3. Tune and fit SVC
>4. Compare test accuracy of all your models 

In [14]:
# list of features
features = numerical_features + binary_categorical_features + one_hot_features

# features data frame
X = df[features]

# responses data frame
Y = df[response_variables]

# Split the data into a training set containing 90% of the data
# and test set containing 10% of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=7)

In [22]:
# Select one outcome to predict from the Y set
Y_train_rec = Y_train.iloc[:,0]

In [28]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# LDA
scaler = StandardScaler()
X_train_st = scaler.fit_transform(X_train)
X_test_st = scaler.transform(X_test)

lda = LinearDiscriminantAnalysis()

lda.fit(X_train_st, Y_train_rec)

y_pred_lda = lda.predict(X_test_st)

accuracy = accuracy_score(Y_test.iloc[:,0], y_pred_lda)

results = pd.DataFrame({
    'Method': ['LDA'],
    'Score': [accuracy]
})

In [30]:
from sklearn.linear_model import LogisticRegression

# Logistic
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_st, Y_train_rec)

# Predict the test set results
y_pred_log_reg = log_reg.predict(X_test_st)

# Evaluate the model
accuracy_log_reg = accuracy_score(Y_test.iloc[:,0], y_pred_log_reg)

new_results = pd.DataFrame({
    'Method': ['Logistic Regression'],
    'Score': [accuracy_log_reg]
})

results = pd.concat([results, new_results], ignore_index=True)


In [32]:
from sklearn.model_selection import GridSearchCV

# Ensemble Methods
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

param_grid_gb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5]
}

param_grid_bg = {
    'n_estimators': [10, 50, 100],
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0]
}

ensemble_methods = {
    'Random Forest': (RandomForestClassifier(), param_grid_rf),
    'Gradient Boosting': (GradientBoostingClassifier(), param_grid_gb),
    'Bagging': (BaggingClassifier(), param_grid_bg)
}

# 5-fold cross-validation
for method_name, (model, param_grid) in ensemble_methods.items():
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train_st, Y_train_rec)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_st)
    accuracy = accuracy_score(Y_test.iloc[:,0], y_pred)
    print(f'{method_name} Accuracy: {accuracy * 100:.2f}%')

    # Append the results to the DataFrame
    new_results = pd.DataFrame({
        'Method': [method_name],
        'Score': [accuracy]
    })
    results = pd.concat([results, new_results], ignore_index=True)


Random Forest Accuracy: 66.83%
Gradient Boosting Accuracy: 67.64%
Bagging Accuracy: 65.37%


In [34]:
#SVC
param_grid_svc = {
    'C': [0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1],
    'kernel': ['rbf', 'linear']
}

# Perform grid search with 5-fold cross-validation
svc = SVC()
grid_search = GridSearchCV(svc, param_grid_svc, cv=5, scoring='accuracy')
grid_search.fit(X_train_st, Y_train.iloc[:,0])

best_svc = grid_search.best_estimator_

y_pred_svc = best_svc.predict(X_test_st)

# Evaluate the model
accuracy_svc = accuracy_score(Y_test.iloc[:,0], y_pred_svc)
print(f'SVC Accuracy: {accuracy_svc * 100:.2f}%')

# Create a DataFrame to store the method and score
new_results = pd.DataFrame({
    'Method': ['SVC'],
    'Score': [accuracy_svc]
})
results = pd.concat([results, new_results], ignore_index=True)

SVC Accuracy: 67.48%


In [35]:
results

| | Method | Score |
|---|---|---|
| 0 | LDA | 0.66343 |
| 1 | Logistic Regression | 0.673139 |
| 2 | Random Forest | 0.668285 |
| 3 | Gradient Boosting | 0.676375 |
| 4 | Bagging | 0.653722 |
| 5 | SVC | 0.674757 |


Gradient Boosting has the highest test accuracy (0.676), although SVC (0.675) and Logistic Regression (0.673) perform similarly.
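
To judge how meaningful these differences are, one can estimate the sampling uncertainty of a test accuracy with the binomial standard error $\sqrt{p(1-p)/n}$. This is a rough sketch that treats test predictions as independent; the test set size `n_test = 618` is an assumption derived from a 10% split of 6172 rows, rounded up.

```python
import math

n_test = 618  # assumption: ceil(0.1 * 6172) test observations

# Test accuracies from the results table above
scores = {
    "LDA": 0.66343,
    "Logistic Regression": 0.673139,
    "Random Forest": 0.668285,
    "Gradient Boosting": 0.676375,
    "Bagging": 0.653722,
    "SVC": 0.674757,
}

# Binomial standard error of each accuracy estimate
std_errors = {m: math.sqrt(p * (1 - p) / n_test) for m, p in scores.items()}

best = max(scores, key=scores.get)
se_best = std_errors[best]
gap_to_logreg = scores[best] - scores["Logistic Regression"]

print(best, round(se_best, 4), round(gap_to_logreg, 4))
```

Under these assumptions, the standard error is close to 0.02, so the gaps between the top three methods are well within one standard error.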