# **Overview**

In this project, I identify the best hyperparameters for classifying two varieties of raisin. The dataset contains two raisin grain types, Kecimen and Besni, and seven numerical predictor variables for each of its 900 samples. I use this dataset to implement two hyperparameter tuning methods:

1. Grid Search method to tune a Decision Tree Classifier

2. Random Search method to tune a Logistic Regression Classifier

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [21]:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving Raisin_Dataset.xlsx to Raisin_Dataset.xlsx
User uploaded file "Raisin_Dataset.xlsx" with length 84629 bytes


In [22]:
df = pd.read_excel('Raisin_Dataset.xlsx')
df.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 | Kecimen |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 | Kecimen |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 | Kecimen |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 | Kecimen |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 | Kecimen |


In [23]:
# Encode the class labels as integers: Kecimen = 0, Besni = 1

df['Class'] = df['Class'].map({'Kecimen': 0, 'Besni': 1})

In [24]:
# Prepare the data
X = df.drop('Class', axis=1)
y = df['Class']

In [27]:
print('Predictors: ', X.columns)
print('Number of predictors: ', X.shape[1])
print('Total number of samples: ', len(X))
print('Samples belonging to class 1 (Besni): ', df['Class'].sum())

Predictors:  Index(['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity',
       'ConvexArea', 'Extent', 'Perimeter'],
      dtype='object')
Number of predictors:  7
Total number of samples:  900
Samples belonging to class 1 (Besni):  450


In [28]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# **Grid Search selection with Decision Tree Classifier**

In [30]:
# Decision Tree Classifier with GridSearchCV
tree_clf = DecisionTreeClassifier()
parameters = {'max_depth': [3,5,7], 'min_samples_split': [2,3,4]}

grid_search_tree = GridSearchCV(estimator=tree_clf, param_grid=parameters)
grid_search_tree.fit(X_train, y_train)

print("Best hyperparameters for Decision Tree:", grid_search_tree.best_params_)
print("Best score for Decision Tree:", grid_search_tree.best_score_)
print('the accuracy of the final model: ', grid_search_tree.score(X_test, y_test))

Best hyperparameters for Decision Tree: {'max_depth': 3, 'min_samples_split': 2}
Best score for Decision Tree: 0.8597222222222223
the accuracy of the final model:  0.8444444444444444


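The decision tree performs best with a maximum depth of 3. At that depth, `min_samples_split` has no effect: the values 2, 3, and 4 all give the same cross-validation accuracy, and 2 is simply the first of the tied candidates. The mean cross-validation accuracy on the training set is 86.0%, and the accuracy on the held-out test set is 84.4%, so the model generalizes reasonably well.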
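To give a sense of how much work this grid search does, the sketch below enumerates the same grid using only the standard library. It assumes scikit-learn's default of 5-fold cross-validation, since `cv` was not set in the call above. The variable names are illustrative and not part of scikit-learn.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV above
param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 3, 4]}

# Assumption: cv was left at the scikit-learn default of 5 folds
n_folds = 5

# Cartesian product of all value lists, one dict per candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_cv_fits = len(candidates) * n_folds
n_total_fits = n_cv_fits + 1  # one extra refit of the best candidate on all training data

for candidate in candidates:
    print(candidate)
print("candidates:", len(candidates))
print("cross-validation fits:", n_cv_fits)
print("total fits including the refit:", n_total_fits)
```

The cost grows multiplicatively: adding a third hyperparameter with three values would triple the number of candidates.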

In [31]:
# A table summarizing the results of GridSearchCV
results = grid_search_tree.cv_results_
params = pd.DataFrame(results['params'])
acc_scores= pd.DataFrame(results['mean_test_score'], columns=['Accuracy'])
summary = pd.concat([params, acc_scores], axis=1)
summary

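| | max_depth | min_samples_split | Accuracy |
|---|---|---|---|
| 0 | 3 | 2 | 0.859722 |
| 1 | 3 | 3 | 0.859722 |
| 2 | 3 | 4 | 0.859722 |
| 3 | 5 | 2 | 0.854167 |
| 4 | 5 | 3 | 0.851389 |
| 5 | 5 | 4 | 0.848611 |
| 6 | 7 | 2 | 0.831944 |
| 7 | 7 | 3 | 0.834722 |
| 8 | 7 | 4 | 0.834722 |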


# **Random Search with Logistic Regression**

In [32]:
# The logistic regression model (the liblinear solver supports both the L1 and L2 penalties)
lr = LogisticRegression(solver='liblinear', max_iter=1000)

In [33]:
from scipy.stats import uniform
# penalty is chosen from the list; C is drawn uniformly between 0 and 100
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

clf = RandomizedSearchCV(estimator=lr, param_distributions=distributions)
clf.fit(X_train, y_train)

In [35]:
print("Best hyperparameters for Logistic Regression:", clf.best_params_)
print("Best score for Logistic Regression:", clf.best_score_)
print('the accuracy of the final model: ', clf.score(X_test, y_test))

Best hyperparameters for Logistic Regression: {'C': 93.38939210254031, 'penalty': 'l2'}
Best score for Logistic Regression: 0.8541666666666666
the accuracy of the final model:  0.8555555555555555


The search finds the best cross-validation score with the L2 (ridge-style) penalty and C ≈ 93.4. With this setup, the mean cross-validation accuracy is 85.4% and the test accuracy is 85.6%.
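To illustrate what `uniform(loc=0, scale=100)` implies for the search, the sketch below draws candidates the same way using only the standard library. It assumes the default of 10 sampled candidates, since `n_iter` was not set above. `sample_candidate` is a hypothetical helper and the seed is arbitrary, so the draws will not match the notebook's.

```python
import random

rng = random.Random(0)  # fixed, arbitrary seed so the sketch is repeatable

n_iter = 10  # assumption: RandomizedSearchCV's default number of candidates


def sample_candidate(rng):
    """Draw one candidate: a penalty from the list, C uniformly between 0 and 100."""
    return {
        "penalty": rng.choice(["l1", "l2"]),
        "C": rng.uniform(0.0, 100.0),
    }


candidates = [sample_candidate(rng) for _ in range(n_iter)]
for candidate in candidates:
    print(candidate)

# How often does a uniform draw land in the strongly regularized region C < 1?
n_draws = 10_000
n_small = sum(1 for _ in range(n_draws) if rng.uniform(0.0, 100.0) < 1.0)
fraction_small = n_small / n_draws
print("fraction of draws with C < 1:", fraction_small)
```

Only about 1% of uniform draws fall below C = 1, so a 10-candidate search over this range is unlikely to try strong regularization at all. Sampling C on a logarithmic scale is a common alternative, though it is not what this notebook does.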

In [43]:
# A table summarizing the results of RandomizedSearchCV
results_clf = clf.cv_results_
params_clf = pd.DataFrame(results_clf['params'])
acc_scores_clf= pd.DataFrame(results_clf['mean_test_score'], columns=['Accuracy'])
summary_clf = pd.concat([params_clf, acc_scores_clf], axis=1)
summary_clf = summary_clf.sort_values(by='Accuracy', ascending= False)
summary_clf

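| | C | penalty | Accuracy |
|---|---|---|---|
| 2 | 93.389392 | l2 | 0.854167 |
| 4 | 37.203343 | l2 | 0.854167 |
| 6 | 24.827243 | l2 | 0.854167 |
| 9 | 48.192054 | l2 | 0.854167 |
| 5 | 59.665956 | l1 | 0.852778 |
| 7 | 23.628031 | l1 | 0.852778 |
| 8 | 65.064191 | l1 | 0.852778 |
| 0 | 82.686802 | l1 | 0.850000 |
| 1 | 88.067091 | l1 | 0.850000 |
| 3 | 40.174126 | l1 | 0.850000 |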


# **Overall**

Decision Tree Classifier with Grid Search:

Best hyperparameters: max_depth: 3, min_samples_split: 2

Best score (cross-validation accuracy): 0.860

Test accuracy: 0.844

Reasoning: The Grid Search evaluated all nine combinations of max_depth and min_samples_split. A depth of 3 gave the highest cross-validation accuracy, and accuracy fell as the depth increased to 5 and 7, which suggests that deeper trees begin to overfit this dataset. The three min_samples_split values tied at depth 3, and min_samples_split = 2 is the scikit-learn default, so this parameter adds no constraint here. The test accuracy is about 1.5 percentage points below the cross-validation accuracy, indicating reasonable generalization.
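As a rough check on how close these accuracies are, the sketch below converts the reported figures back into sample counts. It assumes that the 80/20 split of 900 samples gives 720 training and 180 test samples, and that the five cross-validation folds (the scikit-learn default) are of equal size. `correct_count` is a hypothetical helper.

```python
def correct_count(accuracy, n_samples):
    """Convert an accuracy into the number of correctly classified samples."""
    return round(accuracy * n_samples)


n_total = 900
n_test = round(n_total * 0.2)  # held-out samples from test_size=0.2
n_train = n_total - n_test     # samples available for cross-validation

# Accuracies as printed in the notebook outputs
tree_cv_correct = correct_count(0.8597222222222223, n_train)
tree_test_correct = correct_count(0.8444444444444444, n_test)
lr_test_correct = correct_count(0.8555555555555555, n_test)

# Accuracy change, in percentage points, caused by a single test sample
points_per_test_sample = 100 / n_test

print("decision tree, cross-validation:", tree_cv_correct, "of", n_train)
print("decision tree, test:", tree_test_correct, "of", n_test)
print("logistic regression, test:", lr_test_correct, "of", n_test)
print("one test sample is worth", round(points_per_test_sample, 2), "percentage points")
```

With only 180 test samples, each one is worth about 0.56 percentage points, so the gap between the two models on the test set amounts to two samples and should not be over-interpreted.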

Logistic Regression Classifier with Random Search:

Best hyperparameters: penalty: 'l2', C: 93.4

Best score (cross-validation accuracy): 0.854

Test accuracy: 0.856

Reasoning: The Random Search sampled 10 combinations of penalty and C. The best cross-validation accuracy came from the L2 (ridge-style) penalty with C ≈ 93.4. In scikit-learn, C is the inverse of the regularization strength, so a value this high means weak regularization. All four sampled L2 candidates, with C between roughly 25 and 93, reached the same cross-validation accuracy, so the result is not sensitive to C within this range. The L1 candidates scored slightly lower. The test accuracy is very close to the cross-validation accuracy, indicating good generalization.
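The role of C is easier to see in the objective itself. The following is the standard textbook form of L2-penalized logistic regression, written with labels recoded to -1 and +1. It is a sketch for intuition, not a statement of exactly what the liblinear solver minimizes (for example, how the intercept $b$ is treated is not shown).

```latex
\min_{w,\,b} \;\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \,(x_i^{\top} w + b)\right)\right),
\qquad y_i \in \{-1, +1\}
```

Because C multiplies the data-fit term, dividing through by C gives the same objective with the penalty weighted by 1/C. A value of C ≈ 93.4 therefore corresponds to a penalty weight of roughly 0.011, which is weak regularization.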