# Overview
This notebook builds a multi-class classification model that predicts the type of dry bean from 16 features (12 dimensions and 4 shape forms). The dataset used is the "Dry Bean Dataset", which is available at https://archive.ics.uci.edu/dataset/602/dry+bean+dataset. Here is the UCI description of the dataset:

*Seven different types of dry beans were used in this research, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features; 12 dimensions and 4 shape forms, were obtained from the grains.*

More information about the dataset can be found in the paper "Multiclass classification of dry beans using computer vision and machine learning techniques" by Murat Koklu and Ilker Ali Ozkan in Computers and Electronics in Agriculture, 2020.

The dataset contains 13611 rows and 17 columns. The columns are as follows:
1. Area (integer, denote A): The area of a bean zone and the number of pixels within its boundaries.
2. Perimeter (float, denote P): Bean circumference is defined as the length of its border.
3. MajorAxisLength (float, denote L): The distance between the ends of the longest line that can be drawn from a bean.
4. MinorAxisLength (float, denote l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. AspectRatio (float, denote K; the code in this notebook refers to this column as `AspectRation`): Defines the relationship between MajorAxisLength and MinorAxisLength. $K = \frac{L}{l}$
6. Eccentricity (float, denote Ec): Eccentricity of the ellipse having the same moments as the region.
7. ConvexArea (integer, denote C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. EquivDiameter (float, denote Ed): Equivalent diameter. The diameter of a circle having the same area as a bean seed area. $Ed = \sqrt{\frac{4A}{\pi}}$
9. Extent (float, denote Ex): The ratio of the pixels in the bounding box to the bean area.
$$Ex = \frac{A}{A_B}, \quad \text{where } A_B = \text{area of the bounding rectangle}$$
10. Solidity (float, denote S): Also known as convexity. The ratio of the pixels in the convex shell to those found in beans. $S = \frac{A}{C}$
11. Roundness (float, denote R; the code in this notebook refers to this column as `roundness`): Calculated with the following formula: $R = \frac{4 \pi A}{P^2}$.
12. Compactness (float, denote CO): Measures the roundness of an object. $CO = \frac{Ed}{L}$
13. ShapeFactor1 (float, denote SF1): Defines the relationship between MajorAxisLength and Area. $\frac{L}{A}$
14. ShapeFactor2 (float, denote SF2): Defines the relationship between MinorAxisLength and Area. $\frac{l}{A}$
15. ShapeFactor3 (float, denote SF3): Defines the relationship between Area and area of a circle having the same MajorAxisLength as the bean. $\frac{A}{(\frac{L}{2}*\frac{L}{2}*\pi)}$
16. ShapeFactor4 (float, denote SF4): Defines the relationship between Area and area of a circle having the same MinorAxisLength as the bean. $\frac{A}{(\frac{L}{2}*\frac{l}{2}*\pi)}$
17. Class (string): Type of a bean. Value of class variable is one of: BARBUNYA, BOMBAY, CALI, DERMASON, HOROZ, SEKER, SIRA.
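
As a quick sanity check on the formulas above, the derived features can be recomputed from the four basic measurements. The sketch below is illustrative only: the helper name `shape_features` and the perfect-circle input are assumptions, not part of the dataset or the paper's code. For a circle, every ratio that measures roundness should come out as 1.

```python
import math


def shape_features(area: float, perimeter: float, major: float, minor: float) -> dict:
    """Recompute some derived features from the listed formulas (hypothetical helper)."""
    equiv_diameter = math.sqrt(4 * area / math.pi)            # Ed = sqrt(4A / pi)
    return {
        "AspectRatio": major / minor,                         # K = L / l
        "EquivDiameter": equiv_diameter,
        "Roundness": 4 * math.pi * area / perimeter ** 2,     # R = 4*pi*A / P^2
        "Compactness": equiv_diameter / major,                # CO = Ed / L
        "ShapeFactor1": major / area,                         # SF1 = L / A
        "ShapeFactor2": minor / area,                         # SF2 = l / A
        "ShapeFactor3": area / (math.pi * (major / 2) ** 2),
        "ShapeFactor4": area / (math.pi * (major / 2) * (minor / 2)),
    }


# Assumed input: a perfect circle with a radius of 10 pixels.
r = 10.0
circle = shape_features(
    area=math.pi * r ** 2, perimeter=2 * math.pi * r, major=2 * r, minor=2 * r
)
for name, value in circle.items():
    print(f"{name}: {value:.4f}")
```

Elongated beans move AspectRatio above 1 and push Roundness, Compactness, and ShapeFactor3 below 1, which is why these ratios can help separate round varieties from long ones.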

Here are some descriptions of each class, which might help us determine the features that are important for classification:


**BARBUNYA**; Beige-colored background with red stripes or variegated, speckled color, its seeds are large, physical shape is oval close to the round.

**BOMBAY**; It is white in color, its seeds are very big and its physical structure is oval and bulging.

**CALI**; It is white in color, its seeds are slightly plump and slightly larger than dry beans and in shape of kidney.

**DERMASON**; This type of dry beans, which are fuller flat, is white in color and one end is round and the other ends are round.

**HOROZ**; Dry beans of this type are long, cylindrical, white in color and generally medium in size.

**SEKER**; Large seeds, white in color, physical shape is round.

**SIRA**; Its seeds are small, white in color, physical structure is flat, one end is flat, and the other end is round.

The model is created using the following steps:
1. Data Exploration
2. Data Preprocessing
3. Model Training and Evaluation

# Install Python Packages/Libraries

Install the specified Python packages. Here's a breakdown of each package:

`joblib`: A set of tools for pipelining Python jobs. It provides utilities for saving and loading Python objects that make it possible to save scikit-learn models in a format that can be used in production.

`matplotlib`: A plotting library for creating visualizations in Python. It is often used in conjunction with other libraries for data analysis and machine learning.

`numpy`: A fundamental package for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays.

`pandas`: A powerful data manipulation and analysis library. It provides data structures for efficiently storing and manipulating large datasets.

`seaborn`: A statistical data visualization library based on Matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics.

`scikit-learn`: A machine learning library that provides simple and efficient tools for data analysis and modeling. It includes various algorithms for classification, regression, clustering, and more.


In [None]:
%pip install joblib==1.3.2 matplotlib==3.7.1 numpy==1.23.5 pandas==1.5.3 plotly==5.15.0 scikit-learn==1.2.2 seaborn==0.13.1 "nbformat>=4.2.0"

# Import Packages/Libraries

In addition to the packages/libraries installed above, we will also import:

`typing`: A module that provides support for type hints. Type hints allow you to specify the type of a variable, function parameter, or return value. This helps improve the readability of your code and allows you to catch errors early.

In [None]:
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer


# Data Exploration

We want to explore the data to get a better understanding of the dataset. We will use the pandas library to load the dataset into a pandas DataFrame. A DataFrame is a two-dimensional data structure that stores data of different types (characters, integers, floating-point values, categorical data, and more) in columns, similar to a spreadsheet or an SQL table. It also has powerful built-in methods for exploring and manipulating data.

We will first look at the structure of the dataset. Then we will check for missing values, skewness, and outliers. Finally, we will look for correlations among the features and for relationships between the features and the target variable. We will use the matplotlib and seaborn libraries to plot the data.
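
To make the DataFrame idea concrete, here is a minimal sketch that reads a tiny inline CSV instead of the real file. The three rows and their values are made up for illustration; only the column names echo the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV with two numeric features and a string label.
toy_csv = io.StringIO(
    "Area,Perimeter,Class\n"
    "100,40.5,SEKER\n"
    "120,44.0,SEKER\n"
    "300,80.25,BOMBAY\n"
)
toy_df = pd.read_csv(toy_csv)

print(toy_df.head())    # first rows (at most 5)
print(toy_df.shape)     # (rows, columns)
print(toy_df.dtypes)    # one dtype per column
```

Each column keeps its own type: `Area` is parsed as an integer, `Perimeter` as a float, and `Class` as text.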

Load the `Dry_Bean_Dataset.csv` file into a pandas DataFrame.

In [None]:
csv_path = # TODO: create variable for dataset path
df =  # TODO: read dataset into dataframe

Display the first 5 rows of the dataframe using `head()` method of the dataframe.

We can see that the data contains 17 columns where the first 16 columns are numerical values and the last column is a categorical value. The target variable is the `Class` column.

In [None]:
# TODO: show the first 5 rows of dataframe


Obtain a concise summary of the dataframe. The `info()` method provides information about the dataframe, including the index range, the data types of each column, the number of non-null values, and memory usage.

In [None]:
# TODO: show the info of dataframe


We can retrieve the columns of the dataframe using the `columns` attribute of the dataframe.

In [None]:
# TODO: show the columns of dataframe


Retrieve the dimensions (rows, columns) of the dataframe using the `shape` attribute of the dataframe.

In [None]:
# TODO: retrieve the number of rows and columns


Check for missing values in the pandas DataFrame.

We see that there are no missing values in the dataset.
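
For comparison, this is what a missing-value check looks like on a frame that does contain a gap. The frame below is a made-up example, and dropping rows is only one of several possible ways to handle missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column "b".
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

missing_per_column = toy_df.isna().sum()   # number of NaN values per column
print(missing_per_column)

# One simple way to handle missing values: drop the affected rows.
cleaned_df = toy_df.dropna()
print(cleaned_df.shape)
```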

In [None]:
# TODO: check the number of missing values in each column (fill in the blank)
print('Missing values in each column:\n', _________________, sep='')
print('Original data shape: ', df.shape)

Check the number of unique values in each column of dataframe using the `nunique()` method of the dataframe.

In [None]:
# TODO: check the number of unique values in each column


We will create a list to keep track of the feature columns and a string variable for the target column. This will be useful when we visualize the data.

In [None]:
# Create a variable to store the target column name
target = 'Class'

# Create a list to keep track of the feature columns
all_features = ['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
       'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
       'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
       'ShapeFactor3', 'ShapeFactor4']

We will display a pie chart to visualize the distribution of the target `Class` column using the `plot.pie()` method.

In [None]:
# TODO: set the figure size and background color to white in case your editor GUI uses a dark theme (fill in the blank)
plt.figure(figsize=______, facecolor=_______)

# TODO: show the percentage of each type of bean using a pie chart
# plot.pie() plots a pie chart on the figure attached to the current cell
# autopct='%1.2f%%' shows the percentage of each category with 2 decimal places

plt.ylabel('')

plt.show()

We will use histograms to visualize the distribution of numerical feature columns using the `hist()` method of the dataframe.

In [None]:
# Display the distribution of each feature
df[all_features].hist(figsize=(10, 10), bins=100)

plt.tight_layout()
plt.show()

Check for correlation between the features using the `corr()` method of the dataframe. The `corr()` method computes pairwise correlation of columns, excluding NA/null values. The correlation coefficient ranges from -1 to 1. When it is close to 1, it means that there is a strong positive correlation; when the coefficient is close to -1, it means that there is a strong negative correlation; when it is close to zero, it means that there is no linear correlation.

Notice that the `Class` column is not included in the correlation matrix because it is a categorical value.

We will look for highly correlated features and determine which features to keep and which to drop.
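
The sketch below shows the same idea on synthetic data, where the answer is known in advance. The column names and the 0.9 threshold are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy_df = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + 1,            # perfectly correlated with x
    "x_negated": -x,                  # perfectly anti-correlated with x
    "noise": rng.normal(size=200),    # unrelated to x
})

corr = toy_df.corr()
print(corr.round(2))

# Boolean mask that is True on and above the diagonal (the mirrored half).
mask = np.triu(np.ones_like(corr, dtype=bool))

# List the feature pairs below the diagonal whose |correlation| exceeds the threshold.
threshold = 0.9
high_pairs = [
    (corr.index[i], corr.columns[j])
    for i in range(len(corr))
    for j in range(i)
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

Features that appear together in `high_pairs` carry nearly the same linear information, so keeping one feature from each group is often enough.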

In [None]:
# TODO: create a figure


# TODO: compute the correlation matrix using the `.corr()` method (fill in the blank)
correlation_matrix = _________

# Create a mask to block the upper triangle of the correlation matrix
# as it is a mirror image of the lower triangle
# use `np.triu()` to create an upper triangle matrix of 1s and 0s
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# TODO: plot the heatmap using seaborn's `heatmap()` function
# set `annot=True` to show the correlation values on the heatmap
# set `fmt='.2f'` to round the correlation values to 2 decimal places
# set `mask=mask` to block the upper triangle of the correlation matrix


# show the plot
plt.show()


A violin plot is used to visualize the distribution of the data and its probability density. This chart is a combination of a Box Plot and a Density Plot that is rotated and placed on each side, to show the distribution shape of the data. We will use the `violinplot()` method of the seaborn library to plot the violin plot.

We will use the violin plot to help us determine which features to keep and which to drop. We will drop the features that have similar distributions for all the classes. We will keep the features that have different distributions for different classes.

In [None]:
# TODO: sns.violinplot() plots a violin plot on the figure attached to the current cell (fill in the blank)
# x=target specifies the column to use for the x axis
# y=feature specifies the column to use for the y axis
# hue=target specifies the column to use for grouping
for feature in ['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'AspectRation',
                'Eccentricity', 'Extent', 'Solidity', 'roundness', 'ShapeFactor1', 'ShapeFactor2',
                'ShapeFactor3', 'ShapeFactor4']:
    sns.violinplot(________, _________, hue=target, data=df)
    plt.show()



# Data Preprocessing

Data splitting. We are given one CSV file containing all the data. We will split the data into training, validation, and test sets.
- The training dataset is used to train the machine learning model. The model learns the patterns and relationships within the data from this set. The model adjusts its parameters to minimize the difference between its predictions and the ground truth values provided with this dataset.
- The validation dataset is used to assess the model's performance during training and guide the adjustment of the hyperparameters. Hyperparameters are training/model configurations that the programmer can manually adjust. Since the validation dataset is an independent dataset not used in directly tuning the model's internal parameters, we can use it to assess whether a model is under-fitting or over-fitting.
- The test dataset is kept separate from both the training and validation sets. It is used to assess the final performance of the trained model on unseen data. The test set provides an unbiased evaluation of the model's generalization to new, previously unseen examples.

Building pipelines to preprocess the data. The preprocessing pipeline provides a systematic and efficient way to streamline and automate the data preprocessing steps. It ensures consistent application of preprocessing steps to training, validation, and test data, enhancing model reproducibility, readability, and ease of deployment. In this section, we will build a preprocessing pipeline to perform the following steps:
 - Use log transformation to reduce skewness
 - Standardize the data using StandardScaler
 - Drop low correlation features
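
The first two steps can be illustrated without scikit-learn. The sketch below uses an assumed log-normal sample as a stand-in for a right-skewed feature; it is not taken from the bean data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Assumed right-skewed sample standing in for a skewed feature.
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# Step 1: the log transformation pulls in the long right tail.
log_values = np.log(skewed)

# Step 2: standardize to zero mean and unit variance
# (population standard deviation, ddof=0, as StandardScaler uses).
standardized = (log_values - log_values.mean()) / log_values.std(ddof=0)

print(f"skewness before: {skewed.skew():.2f}")
print(f"skewness after log: {log_values.skew():.2f}")
print(f"mean after scaling: {standardized.mean():.2f}")
print(f"std after scaling: {standardized.std(ddof=0):.2f}")
```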

Reload the data into a Pandas DataFrame to reset the changes made to the dataframe in the previous section. Additionally, we will declare variables to store all feature columns and the label (target) column. We will use these variables to select the columns from the dataframe in the preprocessing pipeline.

In [None]:
# TODO: reload the dataset
csv_path =
df =

# Declare the feature columns and target column
label = 'Class'
all_features = ['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
       'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
       'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
       'ShapeFactor3', 'ShapeFactor4']

Split the data into training, validation, and test sets. The training set will be used to train the model, the validation set will be used to evaluate the model during the training process, and the test set will be used to test the final model.

We will be using Scikit-Learn's `train_test_split` function to split the data into training, validation, and test sets. The `train_test_split` function takes in the dataframe as argument and returns the training set and test set. We will further split the training set into training and validation sets. The final split is 60% training, 20% validation, and 20% test.
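
The arithmetic of the two-stage split is easiest to see on a frame with exactly 100 rows. The frame below is a made-up stand-in; the 0.2 and 0.25 fractions are the ones described above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 100 rows, so the split sizes are easy to read.
toy_df = pd.DataFrame({"feature": np.arange(100), "Class": ["a", "b"] * 50})

# First split: 80% train+validation, 20% test.
train_valid, test = train_test_split(toy_df, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% equals 20% of the original data.
train, valid = train_test_split(train_valid, test_size=0.25, random_state=42)

print(len(train), len(valid), len(test))
```

When classes are imbalanced, passing the label column to the `stratify` argument of `train_test_split` is a common way to keep class proportions similar across the splits.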

In [None]:
# Use a random seed so that we can reproduce the results
# this is important when you want to compare different models
random_seed = 42

# TODO: split the data into training and test sets. 80/20 split.
train_set, test_set =

# TODO: split the data into training and validation sets.
# 75/25 split of the training set, which is 60/20 of the original set.
train_set, valid_set =

Separate the feature columns from the label column

In [None]:
train_X, train_y = _________[____________], _________[_____]
valid_X, valid_y = _________[____________], _________[_____]
test_X, test_y = ________[____________], ________[_____]

Check the feature distributions in the training, validation, and test sets. We want to ensure that each feature is distributed similarly in all three sets.

In [None]:
# Check the distribution of the training set, validation set, and test set
# For each feature, overlay the histograms of the three sets on the same subplot

# TODO: create a figure with 3 rows and 6 columns
fig, axes =

# For each feature, plot the histograms of the three sets on the same subplot
for ax, col in zip(axes.flat, all_features):
    ax.hist(train_X[col], bins=50, alpha=0.2, label='train')
    ax.hist(valid_X[col], bins=50, alpha=0.4, label='valid')
    ax.hist(test_X[col], bins=50, alpha=0.3, label='test')
    ax.set_title(col)

# Add a legend to the figure, anchored to the top right corner
handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(1.05, 0.95))

# TODO: use `plt.tight_layout()` to adjust the padding between and around subplots.

plt.show()


Check the label distribution in the training, validation, and test sets.

In [None]:
# TODO: check the label distribution of the training set, validation set, and test set
# overlay the histograms of the three sets (fill in the blank)

plt.figure(figsize=(10, 5))
plt.hist(_______, bins=50, alpha=0.2, label=_______)
plt.hist(_______, bins=50, alpha=0.4, label=_______)
plt.hist(______, bins=50, alpha=0.3, label=______)
plt.title(label)
plt.legend()
plt.show()


## Create Preprocessing Pipeline

 - Use log transformation to reduce skewness
 - Standardize the data using StandardScaler
 - Drop low correlation features

`ColumnTransformer` allows us to apply different preprocessing steps to different columns in the dataset. It is particularly useful when we want to apply specific transformations to specific subsets of features.
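
Here is a minimal sketch of that behavior on a made-up four-row frame. The column names `size`, `ratio`, and `unused` and the step names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical frame: "size" grows multiplicatively, "ratio" grows linearly,
# and "unused" should be dropped by the transformer.
toy_df = pd.DataFrame({
    "size": [10.0, 100.0, 1000.0, 10000.0],
    "ratio": [0.1, 0.2, 0.3, 0.4],
    "unused": [1.0, 1.0, 1.0, 1.0],
})

log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log)),
    ("scale", StandardScaler()),
])

toy_preprocess = ColumnTransformer(
    [
        ("log_scaled", log_then_scale, ["size"]),
        ("scaled", StandardScaler(), ["ratio"]),
    ],
    remainder="drop",   # columns not listed above are discarded
)

transformed = toy_preprocess.fit_transform(toy_df)
print(transformed.shape)
print(transformed.round(3))
```

After the log step, `size` is evenly spaced just like `ratio`, so both output columns end up with the same standardized values, and `unused` is gone.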

In [None]:
# Create log_transformer to apply log transformation to the data
# using `FunctionTransformer()` and `np.log()` function
log_transformer = FunctionTransformer(np.log, validate=True)

In [None]:
# parameters for the preprocess pipelines
# A list of features to apply for log transformation and standardization
# Another list of features to apply for standardization
log_scale_features = ['Area']
scale_features = ['AspectRation', 'Eccentricity',  'roundness', 'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3']

# Create a preprocess pipeline
# use `ColumnTransformer()` to apply the log_transformer and StandardScaler() to the features
#  - use `log_std_scaler` as the name for the log_transformer and StandardScaler() pipeline
#  - use `std_scaler` as the name for the StandardScaler() only pipeline
# use `remainder='drop'` to drop the columns that are not specified in the lists above
preprocess_pipeline = ColumnTransformer(
    [
        ('log_std_scaler', Pipeline([('log_scaler', log_transformer),
                                     ('std_scaler', StandardScaler())]
                                   ),  log_scale_features),
        ("std_scaler", StandardScaler(), scale_features),
    ],
    remainder="drop")

Use the full pipeline to transform the training and validation data

In [None]:
# TODO: fit the pipeline to the training set and transform it
train_X =

# TODO: transform the validation set
valid_X =

# Train and Evaluate Models

This is perhaps the easier part of the workflow in terms of coding. Scikit-learn provides a wide range of machine learning tools that we can use to train and evaluate our models. Their APIs are very consistent, so we can use the same code to train and evaluate different models by simply changing the model class.

Before we start, we will implement an evaluation function. This function `predict_and_print_metrics` takes a multiclass classifier model, input features (X), target values (y), and a dataset name. It then predicts the target values using the model, calculates and prints the accuracy and confusion matrix.

Accuracy is computed as the proportion of correct predictions ($y_i = \hat{y}_i$) among the total number of samples examined: $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(y_i = \hat{y}_i)$, where $N$ is the total number of samples.

A confusion matrix is a table that summarizes the performance of a classification model. It compares the actual values with the predicted values.
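
A small worked example may help when reading these two metrics. The five labels below are made up; only the class names are borrowed from the dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for five samples.
y_true = ["SEKER", "SEKER", "SIRA", "SIRA", "BOMBAY"]
y_pred = ["SEKER", "SIRA", "SIRA", "SIRA", "BOMBAY"]
classes = ["BOMBAY", "SEKER", "SIRA"]

accuracy = accuracy_score(y_true, y_pred)               # 4 of 5 predictions match
cm = confusion_matrix(y_true, y_pred, labels=classes)   # fixes the row/column order

print(f"accuracy = {accuracy:.2f}")
print(cm)   # rows are true classes, columns are predicted classes
```

The single off-diagonal count sits in the SEKER row and SIRA column: one SEKER sample was predicted as SIRA.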


In [None]:
#TODO: import the accuracy_score and confusion_matrix functions (fill in the blank)


def plot_confusion_matrix(cm: np.ndarray, classes: list, cmap=plt.cm.Blues) -> None:
    """
    This function plots the confusion matrix using seaborn.
    """
    plt.figure(figsize=(5, 5))
    sns.heatmap(cm, annot=True, cmap=cmap, fmt='g', xticklabels=classes, yticklabels=classes)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()

# TODO: create a function to predict the target values and print the metrics
# include accuracy and confusion matrix
def predict_and_print_metrics(model, X: np.ndarray, y: np.ndarray, name: str) -> None:
    # predict the target values
    y_pred =

    # compute accuracy
    accuracy =

    # compute confusion matrix
    classes = ['BARBUNYA', 'BOMBAY', 'CALI', 'DERMASON', 'HOROZ', 'SEKER', 'SIRA']
    cm =

    print(f"Model: {name}")
    print(f"Accuracy: {accuracy:.4f}")

    print(f"Confusion Matrix:\n")
    plot_confusion_matrix(cm, classes)

Train a Decision Tree model. The training can be done in three lines of code. First, we import the model class from the Scikit-Learn library. Then, we create an instance of the model class. Third, we call the `fit` method of the model instance to train the model. The `fit` method takes in the training features and labels as arguments. The model learns the patterns and relationships within the data. Then, we use the trained model to predict the target values for the training and validation set, and we evaluate the model using the evaluation metrics mentioned above.

In [None]:
# TODO: import the decision tree classifier


# TODO: instantiate and train the model
tree_clf =
tree_clf.fit(train_X, train_y)

# TODO: evaluate the model on the training set and validation set



We will train and evaluate other models (K-Nearest Neighbors Classifier and Logistic Regression). We will use the same code to train and evaluate different models by simply changing the model class.
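
The consistency of the API can be sketched with a loop over several estimators. To keep the example self-contained it uses the Iris dataset bundled with scikit-learn as a stand-in, not the bean data, and the hyperparameter values shown are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: Iris (bundled with scikit-learn), split 75/25.
X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}

valid_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    valid_accuracy[name] = accuracy_score(y_valid, model.predict(X_valid))
    print(f"{name}: {valid_accuracy[name]:.3f}")
```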

In [None]:
# Import the k-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

# TODO: instantiate and train the model
neigh =
neigh.fit(train_X, train_y)

# TODO: evaluate the model on the training set and validation set


In [None]:
# Import the logistic regression classifier
from sklearn.linear_model import LogisticRegression

# TODO: instantiate and train the model
log_reg =
log_reg.fit(train_X, train_y)

# TODO: evaluate the model on the training set and validation set


In the case where the model has hyperparameters, we can either write our own code to search for the best hyperparameters or use Scikit-Learn's `GridSearchCV` class to search for the best hyperparameters. We start off by writing our own code to search for the best hyperparameters.

In [None]:
# train a decision tree classifier with search for the best hyperparameters

train_scores = []
valid_scores = []
depths = range(2,30,2)
for depth in depths:
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree_clf.fit(train_X, train_y)

    # evaluate the model on the train set
    train_y_pred = tree_clf.predict(train_X)
    train_acc = accuracy_score(train_y, train_y_pred)
    train_scores.append(train_acc)

    # evaluate the model on the validation set
    valid_y_pred = tree_clf.predict(valid_X)
    valid_acc = accuracy_score(valid_y, valid_y_pred)
    valid_scores.append(valid_acc)

# plot the training and validation accuracy against max_depth
plt.plot(depths, train_scores, label='train')
plt.plot(depths, valid_scores, label='valid')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
plt.show()

# print the depths and the corresponding scores
for depth, train_score, valid_score in zip(depths, train_scores, valid_scores):
    print(f'depth: {depth}, train score: {train_score}, valid score: {valid_score}')

Use `GridSearchCV` to search for the best hyperparameters. `GridSearchCV` takes an estimator instance, a hyperparameter grid, an evaluation metric (scorer), and the number of folds as arguments, then searches for the hyperparameters that give the best score. We will use the same evaluation metric as before, accuracy. Because `GridSearchCV` uses cross-validation to evaluate the model, we combine the training and validation data into one dataset. With 5-fold cross-validation, the dataset is split into 5 folds, and for each hyperparameter setting the model is trained and evaluated 5 times, each time holding out a different fold for evaluation. The score for that setting is the average of the 5 evaluations.
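
A compact sketch of the search on stand-in data is shown here. It uses the Iris dataset bundled with scikit-learn rather than the bean data, and the candidate depths are assumed values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: Iris (bundled with scikit-learn).
X, y = load_iris(return_X_y=True)

toy_grid = {"max_depth": [1, 2, 3, 4, 5]}   # assumed candidate depths
toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    toy_grid,
    scoring="accuracy",   # built-in scorer name; a make_scorer object also works
    cv=5,
)
toy_search.fit(X, y)

print(toy_search.best_params_)
print(toy_search.cv_results_["mean_test_score"].round(3))
```

`cv_results_["mean_test_score"]` holds one averaged score per grid entry, and `best_params_` is the entry with the highest one.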

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Combine the train and validation sets for GridSearchCV
train_valid_X = np.concatenate([train_X, valid_X])
train_valid_y = np.concatenate([train_y, valid_y])

# TODO: Specify the parameter grid for the grid search (fill in the blank)
param_grid = {
    'max_depth': range(_,__,_),
}


# TODO: Make the metric a scorer using make_scorer (fill in the blank)
scorer = make_scorer(accuracy_score, greater_is_better=____)


# TODO: Create the DecisionTreeClassifier
tree_clf =

# TODO: Create the GridSearchCV object with custom scoring (fill in the blank)
grid_search = GridSearchCV(
    ________,
    __________,
    scoring=______,
    cv=5  # You can specify the number of folds for cross-validation
)

# TODO: Fit the grid search to your data


# Access the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Print the results
print(f"Best Model: {best_model}")
print(f"Best Parameters: {best_params}")

# You can also access other information like grid search results, etc.
best_index = grid_search.best_index_
print(grid_search.cv_results_['mean_test_score'])

# Put It All Together in a Pipeline and Export the Pipeline

We will now combine the preprocessing pipeline and the best model into one pipeline.

In [None]:
# TODO: combine the preprocessing pipeline and the best model into a new pipeline
pipeline = Pipeline([
    ('preprocess', ___________________),
    ('model', __________)
])

Save the pipeline using `joblib.dump`, then load it back to make predictions on the test set.
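
The save-and-load round trip can be sketched as follows. This example uses the Iris dataset and a temporary directory as stand-ins; the file name `toy_pipeline.joblib` and the pipeline contents are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and a small preprocessing + model pipeline.
X, y = load_iris(return_X_y=True)
toy_pipeline = Pipeline([
    ("preprocess", StandardScaler()),
    ("model", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
toy_pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")   # hypothetical file name
    joblib.dump(toy_pipeline, path)                       # serialize to disk
    restored = joblib.load(path)                          # read it back

# The restored pipeline should predict exactly what the original predicts.
same_predictions = np.array_equal(toy_pipeline.predict(X), restored.predict(X))
print(same_predictions)
```

Because the preprocessing step is stored inside the pipeline, the restored object accepts raw features directly. Loading is generally safest with the same library versions that were used for saving.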

In [None]:
# TODO: save the pipeline
joblib.dump(________, ____________________________________)

In [None]:
# TODO: load the pipeline
pipeline = joblib.load(____________________________________)

# TODO: evaluate the pipeline on the test set

