# Lab 4: Dry Beans Classification

## Objectives
**The purpose of this lab is to gain experience training and evaluating multiple models on a classification problem.**


Gabriel Eze

Nolan Johnson

DSC 340 S25

Lab 4: Dry Beans Classification

In [None]:
# Import Python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

## Exploratory Data Analysis

Read the dataset into a DataFrame by passing its absolute file path to the read_excel function from the pandas library.

In [None]:
f_file = pd.read_excel('C:/Users/HP/OneDrive/Desktop/DSC 340/labs/Lab4/DryBeanDataset/Dry_Bean_Dataset.xlsx')
f_file
# f_file.info() # Check for missing data
# f_file.duplicated().sum()

Using the histplot function from the seaborn library, we can plot the normalized distribution of the class labels. Violin plots of ShapeFactor1 and Area by class then help reveal possible outliers.

In [None]:
# Check class distribution
plt.figure(figsize = (9, 5))
sns.histplot(x = 'Class', data = f_file, stat = 'density', kde = True)
plt.title('Density Plot for Classes')
plt.show()


# Inspect possible outliers
plt.figure(figsize = (9, 5))
sns.violinplot(x = 'Class', y = 'ShapeFactor1', data = f_file)
plt.title('ShapeFactor1 Distribution by Bean Type')
plt.show()

plt.figure(figsize = (9, 5))
sns.violinplot(x = 'Class', y = 'Area', data = f_file)
plt.title('Area Distribution by Bean Type')
plt.show()
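
As a numeric complement to the class distribution plot, the counts and shares per class can also be tabulated. The sketch below is only illustrative: it uses a small made-up DataFrame with placeholder class names ('A', 'B', 'C') rather than the bean dataset, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels standing in for the 'Class' column (not the real data)
toy = pd.DataFrame({'Class': ['A', 'A', 'A', 'B', 'B', 'C']})

# Absolute number of rows per class
class_counts = toy['Class'].value_counts()

# Fraction of rows per class; the fractions sum to 1
class_share = toy['Class'].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

Applied to the real data, the same two calls on the 'Class' column would show how imbalanced the classes are, which is the motivation for stratifying the train-test split.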

Furthermore, using the hist method of the pandas DataFrame, we can plot a histogram for every numeric feature.

In [None]:
f_numeric_features = f_file.select_dtypes(exclude=object) # Extract numeric portion of dataset

f_numeric_features.hist(figsize = (15, 10), bins = 20) 
plt.show()



## Data Preparation

Prepare the data for model building by removing duplicate rows and encoding the categorical target.

In [None]:
# Clean dataframe: drop_duplicates returns a new DataFrame, so assign the result back
f_file = f_file.drop_duplicates()

# Encode the string class labels as integers in a new 'Class_encoded' column
f_file['Class_encoded'] = LabelEncoder().fit_transform(f_file['Class'])




Perform a stratified train-test split so that both sets keep the class proportions of the full dataset.

In [None]:
# Collect variables
X = f_file.drop(columns = ['Class', 'Class_encoded']) # Features
y = f_file['Class_encoded'] # Target

# Perform split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, stratify = y, random_state = 0)
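
To illustrate what the stratify argument does, here is a minimal sketch on synthetic, deliberately imbalanced labels (an assumption for demonstration, not the bean data). It compares the class shares in the full data, the training part, and the test part; all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 / 15 / 5 samples in three classes
y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # one dummy feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Share of each class in the full data and in each part of the split
full_share = np.bincount(y_toy) / len(y_toy)
train_share = np.bincount(y_tr, minlength=3) / len(y_tr)
test_share = np.bincount(y_te, minlength=3) / len(y_te)

print('full :', full_share)
print('train:', train_share)
print('test :', test_share)
```

Without stratify, a rare class could end up under-represented or even missing in the test set purely by chance.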


## Pipeline tool

Define one pipeline per classifier. Each pipeline standardizes the features before fitting the model.

In [None]:


pipelines = {
    'knn' : Pipeline([('scaler', StandardScaler()), ('classifier', KNeighborsClassifier())]),
    'logistic_regression' : Pipeline([('scaler', StandardScaler()), ('classifier', LogisticRegression(max_iter = 1000))]),
    'decision_tree' : Pipeline([('scaler', StandardScaler()), ('classifier', DecisionTreeClassifier(random_state = 0))]),
    'random_forest' : Pipeline([('scaler', StandardScaler()), ('classifier', RandomForestClassifier())]),
    'svm' : Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
}
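
The step names chosen in a pipeline matter later, because hyperparameters are addressed as the step name, two underscores, and the parameter name. The following sketch, which rebuilds one pipeline of the same shape so that it runs on its own, shows how those names can be listed and set.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])

# Every tunable setting is exposed as '<step name>__<parameter name>'
param_names = sorted(pipe.get_params().keys())
svc_params = [p for p in param_names if p.startswith('classifier__')]
print(svc_params)

# The same naming scheme sets a value on the wrapped estimator
pipe.set_params(classifier__C=10)
print(pipe.named_steps['classifier'].C)
```

This is why the keys of a parameter grid for these pipelines start with classifier__.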




## Gridsearch with K-fold CV

Define the hyperparameter grid for each pipeline.

In [None]:

grid_params = {
    'knn' : {'classifier__n_neighbors' : range(4, 7), 
            'classifier__weights' : ['distance', 'uniform']  
            },
    'logistic_regression' : {'classifier__C' : [.1, 10], 
                            'classifier__solver' : ['liblinear', 'lbfgs'],
                            'classifier__random_state' : [0] 
                            },
    'decision_tree' : {'classifier__max_leaf_nodes' : [5, None],
                       'classifier__max_depth' : [3, 4, 13],
                       'classifier__min_samples_split' : [2, 4, 8],
                       'classifier__min_samples_leaf' : [1, 3, 7]
                       },
    'random_forest' : {'classifier__n_estimators' : [3, 10, 30], 
                      'classifier__max_depth' : [5, None],
                      'classifier__min_samples_leaf' : [4, 15, 100], 
                      'classifier__max_leaf_nodes' : [4, 16], 
                      'classifier__random_state' : [0] 
                      }, 
    'svm' : {'classifier__C' : [.1, 1, 10], 
            'classifier__kernel' : ['linear', 'rbf'] 
            }
}
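
To get a rough sense of how much work a grid implies, the number of candidate combinations can be counted with ParameterGrid. The sketch below copies two of the grids above so that it is self-contained, writes the neighbor range out as a list, and assumes 4 cross-validation folds; the names grids, counts, and n_folds are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Two of the lab's grids, repeated here so the snippet runs on its own
grids = {
    'knn': {'classifier__n_neighbors': [4, 5, 6],
            'classifier__weights': ['distance', 'uniform']},
    'svm': {'classifier__C': [.1, 1, 10],
            'classifier__kernel': ['linear', 'rbf']},
}
n_folds = 4  # assumed number of cross-validation folds

# Number of hyperparameter combinations per pipeline
counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

for name, n_candidates in counts.items():
    # Each candidate is fit once per fold; the final refit adds one more fit
    print(name, ':', n_candidates, 'candidates,', n_candidates * n_folds, 'cross-validation fits')
```

The count grows multiplicatively with every added parameter, so the decision tree and random forest grids are considerably larger than the two shown here.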



Perform Grid Search with cross-validation for each pipeline.

In [None]:
best_models = {} # Initialize dictionary to store the best models for each classifier

for name, pipeline in pipelines.items():
    print()
    grid_search = GridSearchCV(pipeline, grid_params[name], cv = 4, scoring = 'accuracy', n_jobs = -1)
    grid_search.fit(X_train, y_train)
    best_models[name] = grid_search.best_estimator_ # Store the pipeline refit with the best hyperparameter combination
    print('Best parameters for ' + name + ': ' + str(grid_search.best_params_))
   


When using GridSearchCV with k-fold cross-validation, the training data is split into k roughly equal parts (folds). For every hyperparameter combination in a given pipeline, the estimator is trained k times, each time on k - 1 folds and scored on the remaining fold.

The scoring metric (here, accuracy) is averaged over the k folds to represent how well that hyperparameter combination performed. Once all candidates have been evaluated, GridSearchCV selects the combination with the highest mean score and, by default, refits it on the entire training set. That refit model is what best_estimator_ returns.
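
The averaging described above can be checked directly from the cv_results_ attribute. The sketch below is a small illustration under stated assumptions: it uses scikit-learn's built-in iris data and a single KNN grid instead of the bean data, and it recomputes the mean score per candidate from the per-fold scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the iris dataset bundled with scikit-learn
X_demo, y_demo = load_iris(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', KNeighborsClassifier())])
grid = {'classifier__n_neighbors': [3, 5, 7]}

search = GridSearchCV(pipe, grid, cv=4, scoring='accuracy')
search.fit(X_demo, y_demo)

results = search.cv_results_

# One row per candidate, one column per fold
fold_scores = np.column_stack([results[f'split{i}_test_score'] for i in range(4)])

# Averaging across folds reproduces the mean score GridSearchCV ranks by
manual_mean = fold_scores.mean(axis=1)

print(manual_mean)
print(results['mean_test_score'])
print(search.best_params_, search.best_score_)
```

Inspecting the per-fold scores this way also shows how much a candidate's accuracy varies between folds, which the mean alone hides.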

## Evaluating Models

Evaluate each best model on the held-out test set.

In [None]:

# Original target class
l_class = f_file['Class'].unique()
l_class.sort()


# Evaluate and compare model performances
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average = 'weighted') # Aggregate F1 weighted by class support  
    report = classification_report(y_test, y_pred, target_names = l_class)

    print('\n')
    print(name, 'Results:')
    print('Accuracy:', round(accuracy, 4))
    print('F1:', round(f1, 3))

    print(report)
    
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize = (7.8, 7.8)) # Specify confusion matrix plot figure and axes size
    cm_disp = ConfusionMatrixDisplay(cm, display_labels = l_class) 
    cm_disp.plot(ax = ax) # Plot confusion matrix onto the given axes
    
    plt.title('Confusion Matrix (' + name + ' model on test set)')
    plt.tight_layout() # Adjust spacing so tick labels and title do not overlap
    plt.show()
    print()
    if name != 'svm':
        print('-' * 100)
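
The weighted F1 used in the loop above is the average of the per-class F1 scores, weighted by how many true samples each class has (its support). A minimal sketch on made-up labels (not the lab's predictions) confirms this by computing it both ways.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for three classes
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

# F1 for each class separately, in sorted label order
per_class_f1 = f1_score(y_true, y_hat, average=None)

# Support: number of true samples per class
support = np.bincount(y_true)

# Support-weighted mean of the per-class scores
manual_weighted = np.sum(per_class_f1 * support) / support.sum()

# The same quantity computed by scikit-learn
weighted = f1_score(y_true, y_hat, average='weighted')

print(per_class_f1, manual_weighted, weighted)
```

Because of the weighting, large classes dominate this score. The per-class rows of the classification report remain the place to spot a small class that is predicted poorly.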


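Since the LabelEncoder class from the scikit-learn library assigns integer labels in ascending (sorted) order of the string target classes, the sorted array of unique class names lines up with the encoded integers. We can therefore pass it as target_names and display_labels to replace the integer labels in the classification report and confusion matrix with the original class names.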
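
The ordering behavior of LabelEncoder can be seen in a short sketch. The labels here are arbitrary placeholders rather than the bean classes, and the variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder string labels, intentionally not in alphabetical order
labels = ['pear', 'apple', 'fig', 'apple']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the unique labels in sorted order; position i maps to integer i
print(list(encoder.classes_))

# Encoded version of the original labels
print(list(codes))

# Integers can be mapped back to their string labels
print(list(encoder.inverse_transform([0, 1, 2])))
```

One alternative design, if the fitted encoder is kept in a variable, is to pass its classes_ attribute as the class names instead of sorting the unique values by hand.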

## Conclusion



For this lab, we used scikit-learn pipelines with grid-search cross-validation to train multiple classifiers for predicting dry bean species. With a limited set of hyperparameter combinations, a parallel 4-fold cross-validation (in each round, three quarters of the training data is used for training and the remainder for validation) across the 5 pipelines takes about a minute.

Both raw accuracy and weighted F1 scores are used to determine the best model among the top estimators, after fitting the hyperparameter grid for each pipeline and evaluating the resulting models on the test set. The confusion matrix and classification report for these top-performing models were analyzed to identify the specific classes where each model failed to predict the correct target.
