# RiceGraininator 5000 Deluxe
## Introduction

Congratulations, aspiring data scientists! You’ve been recruited by none other than Dr. Ingrain Doofennutrientz, the legendary (and slightly misunderstood) inventor, to help bring his latest masterpiece to life: the **RiceGraininator 5000 Deluxe**! This marvelous machine is designed to solve one of humanity’s most pressing issues—accurately identifying rice varieties.  

Why? Because Dr. Doofennutrientz once lost an all-you-can-eat sushi contest after mixing up Arborio and Jasmine rice. Determined to never face such a mix-up again, he vowed to build the ultimate rice-classifying robot. But alas, the **RiceGraininator 5000 Deluxe** is only as good as its algorithms—and that’s where you come in!  

## Dataset Description

Your task is to train and evaluate the *RiceGraininator 5000 Deluxe* using a carefully curated dataset. This subset consists of **5,000 observations**, with approximately **1,000 grains per rice species**: Jasmine, Basmati, Arborio, Karacadag, and Ipsala. Each grain was meticulously processed to extract **106 features** derived from advanced image processing techniques:  

- **12 morphological features**  
- **4 shape features** derived from morphological characteristics  
- **90 color features** from five different color spaces (RGB, HSV, L\*a\*b\*, YCbCr, XYZ)  

The objective is to first perform a logistic regression to classify the rice varieties. Then, you will apply **Principal Component Analysis (PCA)** to reduce feature complexity and compare the results. Will your algorithm achieve culinary excellence and secure Dr. Doofennutrientz's legacy? Time—and your coding—will tell!  

## References
1. KOKLU, M., CINAR, I., & TASPINAR, Y. S. (2021). Classification of rice varieties with deep learning methods. *Computers and Electronics in Agriculture, 187,* 106285. DOI: [10.1016/j.compag.2021.106285](https://doi.org/10.1016/j.compag.2021.106285)  
2. CINAR, I., & KOKLU, M. (2021). Determination of Effective and Specific Physical Features of Rice Varieties by Computer Vision In Exterior Quality Inspection. *Selcuk Journal of Agriculture and Food Sciences, 35*(3), 229-243. DOI: [10.15316/SJAFS.2021.252](https://doi.org/10.15316/SJAFS.2021.252)  
3. CINAR, I., & KOKLU, M. (2022). Identification of Rice Varieties Using Machine Learning Algorithms. *Journal of Agricultural Sciences, 28*(2), 307-325. DOI: [10.15832/ankutbd.862482](https://doi.org/10.15832/ankutbd.862482)  
4. CINAR, I., & KOKLU, M. (2019). Classification of Rice Varieties Using Artificial Intelligence Methods. *International Journal of Intelligent Systems and Applications in Engineering, 7*(3), 188-194. DOI: [10.18201/ijisae.2019355381](https://doi.org/10.18201/ijisae.2019355381)  

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [24]:
# target variable is called CLASS

df = pd.read_csv('https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/Unsupervised%20Learning/Dimensionality%20reduction/data/rice.csv')
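The `df.head()` output further down shows two columns named `Unnamed: 0.1` and `Unnamed: 0`, which look like row indices saved along with the data. If that reading is right, they are not grain features and probably should not be fed to the models. A minimal sketch of how one might remove them, using a tiny stand-in frame built from a few values in that output (the helper name `drop_index_columns` is hypothetical, not part of the original notebook):

```python
import pandas as pd

# Stand-in frame with a handful of columns and rows taken from the df.head() output.
demo = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2, 3],
    "Unnamed: 0": [57306, 11882, 66977, 39053],
    "AREA": [6918, 7511, 5698, 14220],
    "PERIMETER": [311.803, 343.753, 284.073, 481.78],
    "CLASS": ["Karacadag", "Arborio", "Karacadag", "Ipsala"],
})


def drop_index_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without columns whose name starts with 'Unnamed'."""
    leftover = [col for col in frame.columns if col.startswith("Unnamed")]
    return frame.drop(columns=leftover)


clean = drop_index_columns(demo)
print(clean.shape)                    # rows and remaining columns
print(clean["CLASS"].value_counts())  # quick look at the class balance
```

On the real data the equivalent call would be `df = drop_index_columns(df)` right after loading. The accuracies reported in this notebook were obtained without this step, so they may change if it is applied.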

In [25]:
# Define the target variable and features
target = 'CLASS'  # Adjust this if your target column has a different name
X = df.drop(columns=[target])
y = df[target]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
# Calculate the accuracy_score
# print classification report to have an idea of the results
logreg = LogisticRegression(max_iter = 1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")



Accuracy: 0.84


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [26]:
# Apply PCA
# Remember to choose the number of components and fit PCA only on the training dataset
# Transform both the train and test datasets
# Train the model again
# Evaluate the results
pca = PCA(n_components=23)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
logreg_pca = LogisticRegression(max_iter = 1000)
logreg_pca.fit(X_train_pca, y_train)
y_pred_pca = logreg_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"Accuracy with PCA: {accuracy_pca}")



Accuracy with PCA: 0.92


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
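The cell above fixes `n_components=23` without showing how that number was chosen. One possible way to pick it is to look at the cumulative explained variance on the training set and keep the smallest number of components that reaches a target such as 95%. The sketch below does this on synthetic stand-in data; the 95% threshold and the scaling step are assumptions, not choices taken from the original notebook. Scaling is included because PCA is sensitive to feature scale: unscaled features with large variance dominate the leading components.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with many correlated (redundant) features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=106, n_informative=20, n_redundant=60,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler and the PCA on the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

pca_full = PCA().fit(X_tr_s)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components needed for 95% of the variance: {n_95}")

pca_95 = PCA(n_components=n_95).fit(X_tr_s)
X_tr_pca = pca_95.transform(X_tr_s)
X_te_pca = pca_95.transform(X_te_s)  # test data is only transformed, never fitted
```

scikit-learn also accepts a fraction directly, for example `PCA(n_components=0.95, svd_solver="full")`, which selects the component count by a similar variance criterion.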


## Other approaches

In [27]:
# Now let's use another model that is less sensitive to the number of variables, such as a random forest, and see the results.
# What can you conclude from this? PCA can be used for feature engineering, but sometimes it is simply easier
# to use a model that is robust to the number of variables.

random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy with Random Forest: {accuracy_rf}")

# Random forest on the PCA-transformed features

random_forest_pca = RandomForestClassifier()
random_forest_pca.fit(X_train_pca, y_train)
y_pred_rf_pca = random_forest_pca.predict(X_test_pca)
accuracy_rf_pca = accuracy_score(y_test, y_pred_rf_pca)
print(f"Accuracy with PCA and Random Forest: {accuracy_rf_pca}")



Accuracy with Random Forest: 0.997
Accuracy with PCA and Random Forest: 0.989
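The two random forest scores above come from a single train/test split and an unseeded model, so a gap of under one percentage point may not be stable from run to run. A hedged way to check is to cross-validate both variants with a fixed `random_state`, wrapping the PCA in a pipeline so it is refitted on each training fold. The sketch below uses synthetic stand-in data and an arbitrary tree count, so its numbers say nothing about the rice dataset itself:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the rice data: 5 classes, 106 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=106, n_informative=20,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=42,
)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "PCA + random forest": make_pipeline(
        PCA(n_components=23),
        RandomForestClassifier(n_estimators=50, random_state=42),
    ),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation; the pipeline refits PCA inside every fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and spread of the fold scores gives a better sense of whether one variant is really ahead than a single accuracy figure does.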


In [17]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | AREA | PERIMETER | MAJOR_AXIS | MINOR_AXIS | ECCENTRICITY | EQDIASQ | SOLIDITY | CONVEX_AREA | EXTENT | ... | ALLdaub4L | ALLdaub4a | ALLdaub4b | ALLdaub4Y | ALLdaub4Cb | ALLdaub4Cr | ALLdaub4XX | ALLdaub4YY | ALLdaub4ZZ | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57306 | 6918 | 311.803 | 121.2235 | 73.0577 | 0.798 | 93.8524 | 0.9842 | 7029 | 0.7161 | ... | 110.1504 | 65.3372 | 60.2958 | 101.2892 | 67.2338 | 63.6733 | 0.3402 | 0.3501 | 0.4305 | Karacadag |
| 1 | 11882 | 7511 | 343.753 | 141.4876 | 68.468 | 0.8751 | 97.7921 | 0.9751 | 7703 | 0.6209 | ... | 107.9923 | 64.1808 | 62.3097 | 98.9438 | 65.276 | 63.5397 | 0.3161 | 0.3305 | 0.3793 | Arborio |
| 2 | 66977 | 5698 | 284.073 | 108.1013 | 67.7707 | 0.7791 | 85.1758 | 0.9845 | 5788 | 0.6967 | ... | 104.4907 | 66.1045 | 58.0732 | 96.1341 | 69.0929 | 63.4813 | 0.3022 | 0.3064 | 0.4078 | Karacadag |
| 3 | 39053 | 14220 | 481.78 | 200.911 | 91.1209 | 0.8912 | 134.5565 | 0.9796 | 14516 | 0.6511 | ... | 123.1018 | 63.7406 | 64.1887 | 113.1888 | 63.8173 | 63.7732 | 0.4341 | 0.4583 | 0.4953 | Ipsala |
| 4 | 13516 | 6868 | 303.003 | 115.2292 | 76.3037 | 0.7493 | 93.5126 | 0.9882 | 6950 | 0.7927 | ... | 105.8298 | 65.6465 | 58.888 | 97.2428 | 68.4333 | 63.3245 | 0.3075 | 0.3142 | 0.4064 | Karacadag |


In [28]:
# Drop five of the morphological features and check whether the random forest still performs well
df_reduced = df.drop(columns=['AREA', 'PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'SOLIDITY'])

target = 'CLASS'
X = df_reduced.drop(columns=[target])
y = df_reduced[target]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy with Random Forest with 5 fewer features: {accuracy_rf}")



Accuracy with Random Forest with 5 fewer features: 0.995
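The five columns dropped above were picked by hand. If one wanted a data-driven starting point instead, the fitted forest exposes `feature_importances_`, which can be ranked to see which features the trees rely on most. The sketch below runs on a small synthetic frame with hypothetical column names (`feature_0` to `feature_11`), where only the first four columns carry signal by construction:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 4 informative columns first; the other 8 are noise.
X_arr, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, shuffle=False, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger means used more by the trees.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.head(5))

# Candidate features to drop: the least important ones.
least_important = importances.tail(5).index.tolist()
print(least_important)
```

Impurity-based importances can be misleading when features are strongly correlated, which is likely for the rice color features, so `sklearn.inspection.permutation_importance` on held-out data may be a useful cross-check.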
