# Dry Beans Classification

## Source of Data

> https://www.kaggle.com/datasets/muratkokludataset/dry-bean-dataset

## About Dataset

**Data Set Name**: Dry Bean Dataset

**Abstract**:

Images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

**Relevant Information:**

Seven different types of dry beans were used in this research, taking into account features such as form, shape, type, and structure, as determined by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by the computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

## Data Dictionary

* **Area (A)**: The area of a bean zone and the number of pixels within its boundaries.
* **Perimeter (P)**: Bean circumference is defined as the length of its border.
* **Major axis length (L)**: The distance between the ends of the longest line that can be drawn from a bean.
* **Minor axis length (l)**: The longest line that can be drawn from the bean while standing perpendicular to the main axis.
* **Aspect ratio (K)**: The ratio of the major axis length to the minor axis length: L/l
* **Eccentricity (Ec)**: Eccentricity of the ellipse having the same moments as the region.
* **Convex area (C)**: Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
* **Equivalent diameter (Ed)**: The diameter of a circle having the same area as a bean seed area.
* **Extent (Ex)**: The ratio of the bean area to the area of its bounding box, in pixels.
* **Solidity (S)**: Also known as convexity. The ratio of the pixels in the bean to those in its convex shell.
* **Roundness (R)**: Calculated with the following formula: (4πA)/(P^2)
* **Compactness (CO)**: Measures the roundness of an object: Ed/L
* **ShapeFactor1 (SF1)**
* **ShapeFactor2 (SF2)**
* **ShapeFactor3 (SF3)**
* **ShapeFactor4 (SF4)**
* **Class**: The bean variety (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
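Several of the entries above are simple functions of the four basic measurements (A, P, L, l). As a minimal sketch, the function below recomputes them from the formulas in the list; the function name `derived_shape_features` is hypothetical, and the input values are taken from the first row of the notebook's `df.head()` output.

```python
import math


def derived_shape_features(area, perimeter, major_axis, minor_axis):
    """Recompute the ratio-based features from the four basic measurements."""
    # Diameter of a circle with the same area as the bean: Ed = sqrt(4A / pi)
    equiv_diameter = math.sqrt(4 * area / math.pi)
    return {
        "aspect_ratio": major_axis / minor_axis,            # K = L / l
        "equiv_diameter": equiv_diameter,                   # Ed
        "roundness": 4 * math.pi * area / perimeter ** 2,   # R = 4*pi*A / P^2
        "compactness": equiv_diameter / major_axis,         # CO = Ed / L
    }


# Area, Perimeter, MajorAxisLength, MinorAxisLength of the first row
features = derived_shape_features(28395, 610.291, 208.178117, 173.888747)
for name, value in features.items():
    print(f"{name}: {value:.6f}")
```

The printed values should closely match the `AspectRation`, `EquivDiameter`, `roundness` and `Compactness` columns stored for that row, which is a quick way to sanity-check the definitions.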


## Problem Statement

1. **Model Training**: The objective of classification is to predict the class of a dry bean from its features.
2. **Model Evaluation**: The objective of model evaluation is to measure the performance of the trained model using accuracy, precision, recall and F1 score.
3. **Hyperparameter Tuning**: The objective of hyperparameter tuning is to find the best-performing model by trying different hyperparameter settings.
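For a multi-class problem, precision, recall and F1 have to be averaged over the classes. The sketch below uses made-up labels (not the bean data) to show how the four metrics behave with `average="macro"`, which gives every class equal weight.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels for three classes; these values are illustrative only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)                      # 4 of 6 correct
prec = precision_score(y_true, y_pred, average="macro")   # mean of 1/2, 2/3, 1
rec = recall_score(y_true, y_pred, average="macro")       # mean of 1/2, 1, 1/2
f1 = f1_score(y_true, y_pred, average="macro")            # mean of per-class F1

print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall: {rec:.3f}")
print(f"F1 Score: {f1:.3f}")
```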

## Load Libraries

In [19]:
# Load General Libraries
import pandas as pd
import numpy as np
import warnings

# Preprocessing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Load Model Libraries
from sklearn.linear_model import LogisticRegression

# Hyperparameter tuning libraries
from sklearn.model_selection import GridSearchCV

# Load metrics libraries
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Settings

In [20]:
warnings.filterwarnings("ignore")

## Load Dataset

In [34]:
# Load Data
# file_path = "db_class_selected.csv"
file_path = "db_class_oversampled.csv"
df = pd.read_csv(file_path)

In [36]:
# Show 1st 5 rows
df.head()

| Unnamed: 0 | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | roundness | Compactness | ShapeFactor3 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715 | 190.141097 | 0.763923 | 0.958027 | 0.913358 | 0.834222 | 5 |
| 1 | 28734 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172 | 191.27275 | 0.783968 | 0.887034 | 0.953861 | 0.909851 | 5 |
| 2 | 29380 | 624.11 | 212.82613 | 175.931143 | 1.209713 | 0.562727 | 29690 | 193.410904 | 0.778113 | 0.947849 | 0.908774 | 0.825871 | 5 |
| 3 | 30008 | 645.884 | 210.557999 | 182.516516 | 1.153638 | 0.498616 | 30724 | 195.467062 | 0.782681 | 0.903936 | 0.928329 | 0.861794 | 5 |
| 4 | 30140 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.33368 | 30417 | 195.896503 | 0.773098 | 0.984877 | 0.970516 | 0.9419 | 5 |
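The `Class` column is already integer-coded in this file, and the notebook does not show how the codes were produced. One plausible scheme, shown here purely as an assumption, is scikit-learn's `LabelEncoder`, which numbers the class names in sorted order. The upper-case spellings below are also an assumption about the raw labels.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed raw class names; the exact spelling in the source CSV may differ
class_names = ["SEKER", "BARBUNYA", "BOMBAY", "CALI", "DERMASON", "HOROZ", "SIRA"]

# LabelEncoder sorts the distinct labels and assigns 0..n-1 in that order
encoder = LabelEncoder().fit(class_names)
codes = encoder.transform(encoder.classes_)
mapping = {name: int(code) for name, code in zip(encoder.classes_, codes)}
print(mapping)
```

Under this assumed encoding Seker maps to 5, which would be consistent with the first five rows shown above, but the original preprocessing step would need to be checked to confirm it.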


## Split the Data

In [37]:
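# Separate independent and target features
# (assumes the target is the last column; if the CSV carries a saved index
# column such as "Unnamed: 0", it ends up in X as a feature)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [38]:
# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)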
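The split above is purely random. A common variation, not used in this notebook, is to pass `stratify=y` so that each class keeps the same share in both splits. The sketch below demonstrates this on a small synthetic label vector; `X_toy` and `y_toy` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 70 samples of class 0 and 30 samples of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy keeps the 70/30 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train counts:", np.bincount(y_tr))
print("test counts:", np.bincount(y_te))
```

Since the file loaded here is the oversampled version, the classes are presumably already balanced, so stratification would matter less than on the raw data.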

In [39]:
# Scaling the data(Standardize)
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

In [40]:
# Scaling the data(Normalize)
ms = MinMaxScaler()
X_train = ms.fit_transform(X_train)
X_test = ms.transform(X_test)
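The scaler is fitted on the training data only and then reused on the test data. A small sketch with made-up numbers shows what that means in practice: the training minimum and maximum define the mapping to [0, 1], so test values outside that range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10, max=30 from train
test_scaled = scaler.transform(test)        # applies (x - 10) / (30 - 10)

print(train_scaled.ravel())
print(test_scaled.ravel())
```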

## Model Building and Evaluation

In [41]:
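def build_evaluate(model):
    """Fit `model` on the training split and print macro-averaged
    metrics for both the training and the test split.

    Relies on the global X_train, X_test, y_train and y_test.
    """
    # Train the model
    model.fit(X_train, y_train)

    # Predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)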

    # Evaluate scores
    print("=" * 60)
    print("Model Evaluation on Train Data")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred): .2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred, average='macro'): .2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred, average='macro'): .2f}")
    print(f"F1 Score: {f1_score(y_train, y_train_pred, average='macro'): .2f}")
    print("=" * 60)
    print("Model Evaluation on Test Data")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred): .2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred, average='macro'): .2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred, average='macro'): .2f}")
    print(f"F1 Score: {f1_score(y_test, y_test_pred, average='macro'): .2f}")

In [42]:
# Try Logistic Regression
lr = LogisticRegression()
build_evaluate(lr)

Model Evaluation on Train Data
Accuracy:  0.93
Precision:  0.93
Recall:  0.93
F1 Score:  0.93
Model Evaluation on Test Data
Accuracy:  0.93
Precision:  0.93
Recall:  0.93
F1 Score:  0.93


## Hyperparameter Tuning

In [43]:
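def tune_hyperparameter_gs(model, param):
    """Run a 5-fold grid search over `param` and return the best parameters."""
    # Setup tuner (for classifiers the default score is accuracy)
    gs = GridSearchCV(estimator=model,
                      param_grid=param,
                      cv=5,
                      verbose=1)
    # Train the model
    gs.fit(X_train, y_train)

    # Mean cross-validated score of the best parameter combination
    print(f"Best Score:{gs.best_score_: .2f}")

    return gs.best_params_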
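The function above searches over data that was scaled once, before cross-validation. An alternative, not used in this notebook, is to put the scaler and the model into a `Pipeline` so the scaler is refitted inside each fold. The sketch below runs on synthetic data from `make_classification`; the grid over `C` and the names `pipe` and `search` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class data standing in for the bean features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Scaler and classifier are fitted together, separately for every CV fold
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best Score: {search.best_score_:.2f}")
```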

In [44]:
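# Note: the string "none" was accepted for `penalty` in older scikit-learn
# releases; newer releases expect None instead.
# Not every solver supports every penalty (e.g. "lbfgs" and "sag" do not
# support "l1"); such combinations fail to fit and get a NaN score.
param_dict = {
    "penalty": ["l1", "l2", "none"],
    "max_iter": [100, 150, 200, 300, 500, 800],
    "solver": ["lbfgs", "liblinear", "sag"]
}
lr = LogisticRegression()
best_params = tune_hyperparameter_gs(lr, param_dict)
print(best_params)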

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best Score: 0.94
{'max_iter': 800, 'penalty': 'none', 'solver': 'lbfgs'}
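The 54 candidates reported above are the full Cartesian product of the grid (3 penalties × 6 iteration limits × 3 solvers), including solver and penalty pairs that `LogisticRegression` does not support. As a sketch, `ParameterGrid` can be used to count candidates, and a list of dictionaries restricts the search to supported pairs. The `valid_grid` below is an illustrative alternative rather than the grid used in this notebook, and it writes the no-penalty option as `None`, which assumes a recent scikit-learn release.

```python
from sklearn.model_selection import ParameterGrid

max_iters = [100, 150, 200, 300, 500, 800]

# The grid as defined in the notebook
full_grid = {
    "penalty": ["l1", "l2", "none"],
    "max_iter": max_iters,
    "solver": ["lbfgs", "liblinear", "sag"],
}

# Only solver/penalty pairs that LogisticRegression supports
valid_grid = [
    {"solver": ["lbfgs", "sag"], "penalty": ["l2", None], "max_iter": max_iters},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "max_iter": max_iters},
]

print(len(ParameterGrid(full_grid)))   # 54
print(len(ParameterGrid(valid_grid)))  # 36
```

Passing `valid_grid` as `param_grid` would skip the combinations that cannot be fitted, which shortens the search without changing which valid candidates are compared.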


In [45]:
model = LogisticRegression(**best_params)
build_evaluate(model)

Model Evaluation on Train Data
Accuracy:  0.94
Precision:  0.94
Recall:  0.94
F1 Score:  0.94
Model Evaluation on Test Data
Accuracy:  0.94
Precision:  0.94
Recall:  0.94
F1 Score:  0.94
