# Dry Beans Classification

## Source of Data

> https://www.kaggle.com/datasets/muratkokludataset/dry-bean-dataset

## About Dataset

**Data Set Name**: Dry Bean Dataset

**Abstract**:

Images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

**Relevant Information:**

Seven types of dry beans were used in this research, chosen by taking into account features such as form, shape, type, and structure in light of the market situation. A computer vision system was developed to distinguish seven registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of the 7 registered varieties were taken with a high-resolution camera. The bean images obtained by the computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

## Data Dictionary

* **Area (A)**: The area of a bean zone and the number of pixels within its boundaries.
* **Perimeter (P)**: Bean circumference is defined as the length of its border.
* **Major axis length (L)**: The distance between the ends of the longest line that can be drawn from a bean.
* **Minor axis length (l)**: The longest line that can be drawn from the bean while standing perpendicular to the main axis.
* **Aspect ratio (K)**: Defines the relationship between L and l.
* **Eccentricity (Ec)**: Eccentricity of the ellipse having the same moments as the region.
* **Convex area (C)**: Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
* **Equivalent diameter (Ed)**: The diameter of a circle having the same area as a bean seed area.
* **Extent (Ex)**: The ratio of the pixels in the bounding box to the bean area.
* **Solidity (S)**: Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
* **Roundness (R)**: Calculated with the following formula: (4πA)/(P^2)
* **Compactness (CO)**: Measures the roundness of an object: Ed/L
* **ShapeFactor1 (SF1)**
* **ShapeFactor2 (SF2)**
* **ShapeFactor3 (SF3)**
* **ShapeFactor4 (SF4)**
* **Class**: (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
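
The derived shape features can be recomputed from the basic measurements. The sketch below does this for the first row of the dataset (the values appear in the `df.head()` output further down). It assumes the aspect ratio is K = L/l and the equivalent diameter is Ed = sqrt(4A/π), the diameter of a circle with area A. Neither formula is spelled out in the dictionary, so treat both as assumptions.

```python
import math

# Basic measurements of the first bean in the dataset (a SEKER grain).
area = 28395              # A, in pixels
perimeter = 610.291       # P
major_axis = 208.178117   # L
minor_axis = 173.888747   # l

# Assumed formula: aspect ratio K = L / l.
aspect_ratio = major_axis / minor_axis

# Assumed formula: diameter of the circle whose area equals A.
equiv_diameter = math.sqrt(4 * area / math.pi)

# Formulas stated in the data dictionary.
roundness = 4 * math.pi * area / perimeter ** 2
compactness = equiv_diameter / major_axis

print(f"AspectRation  {aspect_ratio:.6f}")
print(f"EquivDiameter {equiv_diameter:.6f}")
print(f"roundness     {roundness:.6f}")
print(f"Compactness   {compactness:.6f}")
```

The results agree, to within rounding, with the values stored for that row in the `AspectRation`, `EquivDiameter`, `roundness` and `Compactness` columns (1.197191, 190.141097, 0.958027 and 0.913358).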


## Problem Statement

1. **Feature Engineering**: The objective of feature engineering is to encode the categorical features into numerical values.
2. **Feature Selection**: The objective of feature selection is to select the most influential features for prediction.

## Load Necessary Libraries

In [45]:
import pandas as pd
import numpy as np
# Preprocessing
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from imblearn.over_sampling import SMOTE

## Load Dataset

In [9]:
# Load data
file_path = "Dry_Bean_Dataset.xlsx"
df = pd.read_excel(file_path)

In [19]:
df.head()

| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | SEKER |
| 1 | 28734 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172 | 191.27275 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.99843 | SEKER |
| 2 | 29380 | 624.11 | 212.82613 | 175.931143 | 1.209713 | 0.562727 | 29690 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 | SEKER |
| 3 | 30008 | 645.884 | 210.557999 | 182.516516 | 1.153638 | 0.498616 | 30724 | 195.467062 | 0.782681 | 0.976696 | 0.903936 | 0.928329 | 0.007017 | 0.003215 | 0.861794 | 0.994199 | SEKER |
| 4 | 30140 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.33368 | 30417 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.9419 | 0.999166 | SEKER |


In [10]:
# Encode the target
le = LabelEncoder()
df_en = df.copy()
df_en["Class"] = le.fit_transform(df_en["Class"])
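
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `SEKER` shows up as `5` in later outputs. The standalone sketch below illustrates the mapping. It assumes the seven labels are the upper-case strings listed in the code; only `SEKER` is visible in the notebook output, so the other spellings are assumptions.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spellings; only "SEKER" is confirmed by the notebook output.
classes = ["SEKER", "BARBUNYA", "BOMBAY", "CALI", "DERMASON", "HOROZ", "SIRA"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(classes)

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original names from integer codes,
# which is useful for reading model predictions later on.
decoded = list(le_demo.inverse_transform([5, 0]))
print(decoded)
```

Keeping the fitted encoder around (here `le_demo`, in the notebook `le`) is what makes it possible to translate predicted integers back into bean names.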

### Feature Selection

Select the features that are most influential for predicting the class.

In [12]:
# Separate Input and output features for feature selection
X = df_en.iloc[:, :-1]
y = df_en.iloc[:, -1]

In [40]:
# Keep the 12 features with the highest chi-squared scores against the target
fs = SelectKBest(score_func=chi2, k=12)
fs.fit(X, y)

In [41]:
selected_feature_indices = fs.get_support(indices=True)
selected_feature_indices

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 14])
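
To see what `SelectKBest` with `chi2` is doing, it can help to look at the per-feature scores on a small example. The sketch below uses hypothetical toy data (the names `X_demo`, `y_demo` and `selector` are illustrative and not part of the notebook): one column depends on the class and two do not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, two classes, three non-negative features.
y_demo = np.repeat([0, 1], 100)
informative = np.where(y_demo == 0, 1.0, 5.0) + rng.random(200)  # depends on class
noise_a = rng.random(200)                                         # unrelated to class
noise_b = rng.random(200)
X_demo = np.column_stack([noise_a, informative, noise_b])

# Fit the selector and keep only the single best-scoring column.
selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

print(selector.scores_)                    # one chi-squared score per column
print(selector.get_support(indices=True))  # index of the selected column
```

The class-dependent column (index 1) receives by far the largest score and is the one selected. Note that `chi2` requires non-negative inputs, which holds for the geometric measurements in this dataset.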

In [42]:
# Get the Best selected features
selected_features = list(df_en.columns[selected_feature_indices])
selected_features.append("Class")
selected_features

['Area',
 'Perimeter',
 'MajorAxisLength',
 'MinorAxisLength',
 'AspectRation',
 'Eccentricity',
 'ConvexArea',
 'EquivDiameter',
 'Extent',
 'roundness',
 'Compactness',
 'ShapeFactor3',
 'Class']

In [43]:
# Dataframe with selected features
df_selected = df_en[selected_features]
df_selected.head()

| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | roundness | Compactness | ShapeFactor3 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715 | 190.141097 | 0.763923 | 0.958027 | 0.913358 | 0.834222 | 5 |
| 1 | 28734 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172 | 191.27275 | 0.783968 | 0.887034 | 0.953861 | 0.909851 | 5 |
| 2 | 29380 | 624.11 | 212.82613 | 175.931143 | 1.209713 | 0.562727 | 29690 | 193.410904 | 0.778113 | 0.947849 | 0.908774 | 0.825871 | 5 |
| 3 | 30008 | 645.884 | 210.557999 | 182.516516 | 1.153638 | 0.498616 | 30724 | 195.467062 | 0.782681 | 0.903936 | 0.928329 | 0.861794 | 5 |
| 4 | 30140 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.33368 | 30417 | 195.896503 | 0.773098 | 0.984877 | 0.970516 | 0.9419 | 5 |


In [44]:
# Write to CSV
df_selected.to_csv("db_class_selected.csv", index=False)

### Data Oversampling

The classes in the dataset are imbalanced. To counteract the resulting bias toward the larger classes, we oversample the minority classes, which also increases the number of instances fed to the model.

We use the Synthetic Minority Over-sampling Technique (SMOTE), which creates new minority-class samples by interpolating between existing samples and their nearest neighbours instead of duplicating rows, reducing the risk of overfitting to repeated points.
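
The core idea behind SMOTE can be sketched in a few lines of NumPy. This is an illustrative simplification and not the `imblearn` implementation; the function name `smote_like_sample` and the toy points are hypothetical, and the sketch assumes the minority samples contain no duplicate rows.

```python
import numpy as np

def smote_like_sample(X_minority, k_neighbors, rng):
    """Create one synthetic point between a random sample and one of its near neighbours."""
    base = X_minority[rng.integers(len(X_minority))]
    # Euclidean distances from the base point to every minority sample.
    dists = np.linalg.norm(X_minority - base, axis=1)
    # Position 0 of the sorted order is the base point itself (distance 0), so skip it.
    neighbour_idx = np.argsort(dists)[1:k_neighbors + 1]
    neighbour = X_minority[rng.choice(neighbour_idx)]
    # Pick a random point on the segment between base and neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)

rng = np.random.default_rng(42)

# Hypothetical minority class: the four corners of a unit square.
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])

synthetic = np.array(
    [smote_like_sample(X_min, k_neighbors=2, rng=rng) for _ in range(5)]
)
print(synthetic)
```

Every synthetic point lies on a segment joining two real samples, so it stays inside the region the minority class already occupies. With its default sampling strategy, `imblearn`'s `SMOTE` applies this kind of step to each class other than the majority until all classes have the same count.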

In [47]:
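# Oversample every minority class up to the size of the largest class,
# interpolating between each sample and its 2 nearest neighbours
s = SMOTE(k_neighbors=2)
X_o, y_o = s.fit_resample(X, y)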

In [48]:
# Create oversample dataframe
df_y = pd.DataFrame(y_o, columns=["Class"])
df_oversampled = pd.concat([X_o, df_y], axis=1)

In [49]:
df_oversampled.shape

(24822, 17)

In [50]:
df_oversampled["Class"].value_counts()

Class
5    3546
0    3546
1    3546
2    3546
4    3546
6    3546
3    3546
Name: count, dtype: int64

In [53]:
# Save the oversampled data
df_oversampled_selected = df_oversampled[selected_features]
df_oversampled_selected.to_csv("db_class_oversampled.csv", index=False)