# Model Details
Creator: Carlotta Kopietz

The model is a logistic regression trained with scikit-learn 0.24.1. It uses the default hyperparameters except for `max_iter=1000` and a fixed `random_state=23`.

# Intended Use
This model is intended to predict the variety of raisins grown in Turkey (Kecimen or Besni) from their measured shape features.

# Metrics
The model was evaluated using the F1 score on the held-out validation split, where it reaches 0.8831168831168832 (about 0.88).
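
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw confusion counts. The function name `f1_from_counts` and the example counts are illustrative assumptions and are not taken from this model's evaluation.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


# Made-up counts for illustration only.
print(f1_from_counts(tp=3, fp=1, fn=1))  # 0.75
print(f1_from_counts(tp=0, fp=0, fn=4))  # 0.0
```

Note that a slice with no true positives scores 0 regardless of how many negatives are classified correctly.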

# Data
The data was obtained from Kaggle (https://www.kaggle.com/datasets/muratkokludataset/raisin-dataset).

# Bias
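According to the sliced F1 scores, the model shows a strong imbalance across values of `MajorAxisLength`: for samples above the validation-set mean the F1 score is 0, compared with about 0.91 for samples at or below the mean. A second imbalance appears for `Area`: for samples above the mean the F1 score only reaches 0.3636363636363636, compared with about 0.91 at or below it. Slicing by `Eccentricity` shows a much smaller gap (about 0.86 above the mean versus 0.89 at or below it).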

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report, f1_score

In [9]:
# Read the dataset
data = pd.read_csv("./data/Raisin_Dataset.csv")
data.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 87524 | 442.246011 | 253.291155 | 0.819738 | 90546 | 0.758651 | 1184.04 | Kecimen |
| 1 | 75166 | 406.690687 | 243.032436 | 0.801805 | 78789 | 0.68413 | 1121.786 | Kecimen |
| 2 | 90856 | 442.267048 | 266.328318 | 0.798354 | 93717 | 0.637613 | 1208.575 | Kecimen |
| 3 | 45928 | 286.540559 | 208.760042 | 0.684989 | 47336 | 0.699599 | 844.162 | Kecimen |
| 4 | 79408 | 352.19077 | 290.827533 | 0.564011 | 81463 | 0.792772 | 1073.251 | Kecimen |


In [10]:
# Separate labels
y = data.pop("Class")

# Split the data into train and validation, stratifying on the target feature.
X_train, X_val, y_train, y_val = train_test_split(data, y, stratify=y, random_state=23)

In [11]:
# Get a high level overview of the data. This will be useful for slicing.
X_train.describe()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|---|
| count | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 | 675.0 |
| mean | 87210.494815 | 427.650555 | 254.414345 | 0.779895 | 90407.262222 | 0.701092 | 1159.625772 |
| std | 38388.571707 | 110.506268 | 49.752074 | 0.088938 | 39602.352484 | 0.050807 | 261.820857 |
| min | 25387.0 | 225.629541 | 144.618672 | 0.34873 | 26139.0 | 0.454189 | 619.074 |
| 25% | 59032.5 | 343.732369 | 218.692197 | 0.740516 | 61466.5 | 0.671134 | 964.8355 |
| 50% | 79057.0 | 405.936594 | 247.352044 | 0.797864 | 81779.0 | 0.709949 | 1117.107 |
| 75% | 103790.5 | 493.185891 | 280.180509 | 0.840452 | 108022.5 | 0.735886 | 1302.4165 |
| max | 235047.0 | 843.956653 | 492.275279 | 0.92377 | 239093.0 | 0.830632 | 2253.557 |


In [12]:
X_train.head()

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|---|
| 248 | 62064 | 352.36867 | 227.864144 | 0.762775 | 64811 | 0.650566 | 1004.245 |
| 383 | 66797 | 358.198918 | 240.782694 | 0.740366 | 68732 | 0.697663 | 1006.375 |
| 149 | 66568 | 342.250361 | 249.5505 | 0.684358 | 68078 | 0.759917 | 993.455 |
| 595 | 80481 | 481.063953 | 217.561151 | 0.891891 | 85153 | 0.714974 | 1219.105 |
| 401 | 39368 | 296.655948 | 171.208165 | 0.816654 | 41361 | 0.619969 | 798.546 |


In [13]:
lr = LogisticRegression(max_iter=1000, random_state=23)
lb = LabelBinarizer()

# Binarize the target feature.
y_train = lb.fit_transform(y_train)
y_val = lb.transform(y_val)

# Train Logistic Regression.
lr.fit(X_train, y_train.ravel())

LogisticRegression(max_iter=1000, random_state=23)
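
`f1_score` reports the score for the positive class (label 1) by default, so it helps to know which variety the binarizer mapped to 1. The following is a minimal sketch using a few hand-written labels rather than the dataset; `lb_demo` is a name introduced only for this illustration. `LabelBinarizer` stores the class names in sorted order, so `Kecimen` should end up as the positive class.

```python
from sklearn.preprocessing import LabelBinarizer

# Hand-written labels for illustration, not rows from the dataset.
labels = ["Kecimen", "Besni", "Kecimen"]

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(labels)

# classes_ lists the original labels in sorted order; index 1 is the positive class.
print(lb_demo.classes_)
# The encoded column uses 1 for the class at index 1 and 0 for the other.
print(encoded.ravel())
```

In the notebook itself, inspecting `lb.classes_` after the cell above should give the same information for the real labels.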

In [14]:
# Calculate F1 score overall 
y_val_pred = lr.predict(X_val)
f1_score_overall = f1_score(y_val, y_val_pred)
print(f1_score_overall)

0.8831168831168832


In [25]:
# Calculate F1 score for 'Area' split by average
area_mean = X_val['Area'].mean()
X_val_area_high = X_val[X_val['Area']>area_mean]
X_val_area_low = X_val[X_val['Area']<=area_mean]
y_val_area_high = y_val[X_val['Area']>area_mean]
y_val_area_low = y_val[X_val['Area']<=area_mean]
f1_score_area_high = f1_score(y_val_area_high, lr.predict(X_val_area_high))
f1_score_area_low = f1_score(y_val_area_low, lr.predict(X_val_area_low))
print(f'For areas over average ({area_mean}) the F1 score is {f1_score_area_high} and for lower average {f1_score_area_low}.')

For areas over average (89585.02666666667) the F1 score is 0.3636363636363636 and for lower average 0.9090909090909091.


In [26]:
# Calculate F1 score for 'MajorAxisLength' split by average
major_mean = X_val['MajorAxisLength'].mean()
X_val_major_high = X_val[X_val['MajorAxisLength']>major_mean]
X_val_major_low = X_val[X_val['MajorAxisLength']<=major_mean]
y_val_major_high = y_val[X_val['MajorAxisLength']>major_mean]
y_val_major_low = y_val[X_val['MajorAxisLength']<=major_mean]
f1_score_major_high = f1_score(y_val_major_high, lr.predict(X_val_major_high))
f1_score_major_low = f1_score(y_val_major_low, lr.predict(X_val_major_low))
print(f'For major axis length over average ({major_mean}) the F1 score is {f1_score_major_high} and for lower average {f1_score_major_low}.')

For major axis length over average (440.76813653333335) the F1 score is 0.0 and for lower average 0.9066666666666666.
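
An F1 score of exactly 0 means the model produced no true positives in that slice. One possible explanation, not verified in this notebook, is that the slice contains very few samples of the positive class and the model predicts none of them. The toy example below uses made-up labels to show how such a slice can score 0 on F1 while accuracy still looks reasonable.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up slice: one positive sample out of four, and the model predicts all negatives.
y_true_slice = [0, 0, 0, 1]
y_pred_slice = [0, 0, 0, 0]

# zero_division=0 treats the undefined precision (no predicted positives) as 0 without a warning.
f1 = f1_score(y_true_slice, y_pred_slice, zero_division=0)
acc = accuracy_score(y_true_slice, y_pred_slice)
print(f1, acc)
```

Comparing `y_val_major_high.sum()` with `lr.predict(X_val_major_high).sum()` would show whether this is what happens for the high `MajorAxisLength` slice.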


In [27]:
# Calculate F1 score for 'Eccentricity' split by average
eccentricity_mean = X_val['Eccentricity'].mean()
X_val_eccentricity_high = X_val[X_val['Eccentricity']>eccentricity_mean]
X_val_eccentricity_low = X_val[X_val['Eccentricity']<=eccentricity_mean]
y_val_eccentricity_high = y_val[X_val['Eccentricity']>eccentricity_mean]
y_val_eccentricity_low = y_val[X_val['Eccentricity']<=eccentricity_mean]
f1_score_eccentricity_high = f1_score(y_val_eccentricity_high, lr.predict(X_val_eccentricity_high))
f1_score_eccentricity_low = f1_score(y_val_eccentricity_low, lr.predict(X_val_eccentricity_low))
print(f'For eccentricity over average ({eccentricity_mean}) the F1 score is {f1_score_eccentricity_high} and for lower average {f1_score_eccentricity_low}.')

For eccentricity over average (0.7864845224844446) the F1 score is 0.8641975308641975 and for lower average 0.8933333333333333.
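
The three slicing cells above repeat the same steps for different columns. As a sketch, they could be folded into one helper. `sliced_f1` is a hypothetical name, and reporting the sample and positive-class counts per slice is an addition meant to make small or one-sided slices visible. The demo data is made up and is not the raisin dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score


def sliced_f1(X, y_true, y_pred, feature, threshold=None):
    """Return F1 and sample counts above and at-or-below a threshold (default: the feature mean)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    if threshold is None:
        threshold = X[feature].mean()
    # Boolean mask for rows strictly above the threshold; the rest form the "low" slice.
    high = (X[feature] > threshold).to_numpy()
    result = {"threshold": float(threshold)}
    for name, mask in (("high", high), ("low", ~high)):
        result[name] = {
            "n": int(mask.sum()),
            "n_positive": int(y_true[mask].sum()),
            "f1": float(f1_score(y_true[mask], y_pred[mask], zero_division=0)),
        }
    return result


# Tiny hypothetical example, not the raisin data.
X_demo = pd.DataFrame({"Area": [10, 20, 30, 40, 50, 60]})
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 0]
report = sliced_f1(X_demo, y_true_demo, y_pred_demo, "Area")
print(report)
```

With the notebook's variables, a call such as `sliced_f1(X_val, y_val, lr.predict(X_val), "MajorAxisLength")` should reproduce the split used above, since the default threshold is the mean of the column in the frame passed in.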
