## Orange/Grapefruit Classification

Given *data about citrus fruits* (diameter, weight, and RGB color values), let's try to classify the **type** of a given fruit: orange or grapefruit.

We will use a logistic regression model to make our predictions.
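
As a rough illustration of what such a model computes: logistic regression passes a weighted sum of the features through the sigmoid function to get a probability for class 1, then thresholds it at 0.5. The weights, bias, and feature values below are made up for illustration; they are not fitted to the citrus data.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights and bias, chosen only for illustration
weights = [1.5, -0.5]
bias = 0.0

# Weighted sum: 1.5 * 2.0 - 0.5 * 1.0 = 2.5
p = predict_proba([2.0, 1.0], weights, bias)
label = int(p >= 0.5)  # threshold the probability at 0.5
print(round(p, 4), label)
```

Training amounts to choosing the weights and bias so that these probabilities match the labels as closely as possible on the training rows.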

Data source: https://www.kaggle.com/datasets/joshmcadams/oranges-vs-grapefruit

### Importing Libraries

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

In [3]:
data = pd.read_csv('citrus.csv')
data

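|      | name       | diameter | weight | red | green | blue |
|------|------------|----------|--------|-----|-------|------|
| 0    | orange     | 2.96     | 86.76  | 172 | 85    | 2    |
| 1    | orange     | 3.91     | 88.05  | 166 | 78    | 3    |
| 2    | orange     | 4.42     | 95.17  | 156 | 81    | 2    |
| 3    | orange     | 4.47     | 95.60  | 163 | 81    | 4    |
| 4    | orange     | 4.48     | 95.76  | 161 | 72    | 9    |
| ...  | ...        | ...      | ...    | ... | ...   | ...  |
| 9995 | grapefruit | 15.35    | 253.89 | 149 | 77    | 20   |
| 9996 | grapefruit | 15.41    | 254.67 | 148 | 68    | 7    |
| 9997 | grapefruit | 15.59    | 256.50 | 168 | 82    | 20   |
| 9998 | grapefruit | 15.92    | 260.14 | 142 | 72    | 11   |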


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      10000 non-null  object 
 1   diameter  10000 non-null  float64
 2   weight    10000 non-null  float64
 3   red       10000 non-null  int64  
 4   green     10000 non-null  int64  
 5   blue      10000 non-null  int64  
dtypes: float64(2), int64(3), object(1)
memory usage: 468.9+ KB


### Encoding Labels

In [5]:
data['name'].value_counts()

name
orange        5000
grapefruit    5000
Name: count, dtype: int64

In [6]:
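label_mapping = {'orange': 0, 'grapefruit': 1}

# map() yields an integer column directly; replace() on an object column
# relies on downcasting behavior that pandas has deprecated
data['name'] = data['name'].map(label_mapping)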


In [7]:
data

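|      | name | diameter | weight | red | green | blue |
|------|------|----------|--------|-----|-------|------|
| 0    | 0    | 2.96     | 86.76  | 172 | 85    | 2    |
| 1    | 0    | 3.91     | 88.05  | 166 | 78    | 3    |
| 2    | 0    | 4.42     | 95.17  | 156 | 81    | 2    |
| 3    | 0    | 4.47     | 95.60  | 163 | 81    | 4    |
| 4    | 0    | 4.48     | 95.76  | 161 | 72    | 9    |
| ...  | ...  | ...      | ...    | ... | ...   | ...  |
| 9995 | 1    | 15.35    | 253.89 | 149 | 77    | 20   |
| 9996 | 1    | 15.41    | 254.67 | 148 | 68    | 7    |
| 9997 | 1    | 15.59    | 256.50 | 168 | 82    | 20   |
| 9998 | 1    | 15.92    | 260.14 | 142 | 72    | 11   |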


### Splitting/Scaling

In [8]:
y = data['name'].copy()
X = data.drop('name', axis=1).copy()

In [9]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
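
For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. A minimal sketch of the same computation on one toy column (the values are invented, not taken from `citrus.csv`):

```python
import statistics


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]


toy_diameters = [4.0, 6.0, 8.0, 10.0, 12.0]  # invented values
z = standardize(toy_diameters)
print(z)
```

After this transformation, features measured on very different scales (for example weight in the hundreds and diameter in single digits) contribute on a comparable scale to the model.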

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=56)
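
Note that the scaler in this notebook is fitted on the full dataset before the split, so the test rows contribute to the mean and variance used for scaling. One common alternative, sketched below, is to split first and wrap the scaler and model in a pipeline so the scaler only ever sees training rows. The synthetic data from `make_classification` is a stand-in for the citrus features (an assumption for the sake of a self-contained example), so the accuracy it prints is not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five citrus features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=56
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_tr, y_tr)

print("Accuracy: {:.4f}".format(pipeline.score(X_te, y_te)))
```

With 10,000 rows the effect of this leakage on the reported accuracy is probably small, but the pipeline form also keeps the preprocessing and the model together when predicting on new fruit.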

### Training/Results

In [11]:
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

base_acc = base_model.score(X_test, y_test)

print("Accuracy: {:.4f}".format(base_acc))

Accuracy: 0.9383


In [12]:
cv_model = LogisticRegressionCV()
cv_model.fit(X_train, y_train)

cv_acc = cv_model.score(X_test, y_test)

print("Accuracy: {:.4f}".format(cv_acc))

Accuracy: 0.9613
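
`LogisticRegressionCV` differs from `LogisticRegression` in that it cross-validates over a grid of regularization strengths `C` on the training set and refits with the selected value. A minimal sketch, again on synthetic stand-in data rather than the citrus dataset, of how the selected value could be inspected alongside a confusion matrix; the `demo_` names and the explicit `Cs`, `cv`, and `max_iter` settings are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the citrus features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=56
)

# Try 10 candidate C values with 5-fold cross-validation on the training rows
demo_cv_model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
demo_cv_model.fit(X_tr, y_tr)

print("Candidate C values:", demo_cv_model.Cs_)
print("Selected C:", demo_cv_model.C_)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_cv_model.predict(X_te))
print(cm)
```

A confusion matrix shows whether the remaining errors are concentrated in one class, which a single accuracy figure does not reveal.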
