# Machine Learning
---
## Assignment: Implementation of Logistic Regression and LinearSVM  
**Group 5**

---

### Student Details  
- **Name:** MUGOT, CHRIS JALLAINE S.   
- **Section:** DS3A  
- **Date of Submission:** MAY 6, 2025



### Instructions
[![image.png](https://i.postimg.cc/TP05x7F7/image.png)](https://postimg.cc/WDD16n2g)


#### Objectives:
1. Understand and implement Logistic Regression and Linear Support Vector Machine (SVM) classifiers.  
2. Evaluate the performance of both models using standard metrics such as accuracy, precision, recall, and F1-score.  
3. Compare and analyze the results between the two models on a given dataset.

#### Dataset Overview:

The dataset is consists of **59 entries** and **7 features** describing different types of fruits based on physical and visual characteristics.


### Feature Description

| Column Name     | Description                               | Data Type |
|-----------------|-------------------------------------------|-----------|
| `fruit_label`   | Numerical label representing fruit type   | int64     |
| `fruit_name`    | Name of the fruit (e.g., apple, lemon)    | object    |
| `fruit_subtype` | Subtype of the fruit                      | object    |
| `mass`          | Mass of the fruit in grams                | int64     |
| `width`         | Width of the fruit in cm                  | float64   |
| `height`        | Height of the fruit in cm                 | float64   |
| `color_score`   | Numerical representation of color quality | float64   |

---


> This dataset is ideal for classification tasks using models such as Logistic Regression and Linear SVM due to its structured numerical and categorical features.

In [97]:
#-----------------------
#    DEPENDENCIES
#-----------------------

import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import plotly.graph_objs as go
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.svm import LinearSVC
import plotly.figure_factory as ff

In [62]:
#------------------------------
#         LOAD DATA 
#------------------------------

df = pd.read_csv('fruit_data_with_colors.csv')

In [63]:
#-----------------------------
#   EXPLORE THE DATA
#-----------------------------

print("Dataset Information:")
print()
print(df.info())

Dataset Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   fruit_label    59 non-null     int64  
 1   fruit_name     59 non-null     object 
 2   fruit_subtype  59 non-null     object 
 3   mass           59 non-null     int64  
 4   width          59 non-null     float64
 5   height         59 non-null     float64
 6   color_score    59 non-null     float64
dtypes: float64(3), int64(2), object(2)
memory usage: 3.4+ KB
None


In [64]:
print("\nSample data:")
print()
print(df.head())


Sample data:

   fruit_label fruit_name fruit_subtype  mass  width  height  color_score
0            1      apple  granny_smith   192    8.4     7.3         0.55
1            1      apple  granny_smith   180    8.0     6.8         0.59
2            1      apple  granny_smith   176    7.4     7.2         0.60
3            2   mandarin      mandarin    86    6.2     4.7         0.80
4            2   mandarin      mandarin    84    6.0     4.6         0.79


In [65]:
print("\nBasic statistics:")
print()
print(df.describe())


Basic statistics:

       fruit_label        mass      width     height  color_score
count    59.000000   59.000000  59.000000  59.000000    59.000000
mean      2.542373  163.118644   7.105085   7.693220     0.762881
std       1.208048   55.018832   0.816938   1.361017     0.076857
min       1.000000   76.000000   5.800000   4.000000     0.550000
25%       1.000000  140.000000   6.600000   7.200000     0.720000
50%       3.000000  158.000000   7.200000   7.600000     0.750000
75%       4.000000  177.000000   7.500000   8.200000     0.810000
max       4.000000  362.000000   9.600000  10.500000     0.930000


In [66]:
print("\nUnique values in target variable:")
print()
print(df['fruit_label'].value_counts())


Unique values in target variable:

fruit_label
1    19
3    19
4    16
2     5
Name: count, dtype: int64


In [67]:
print("\nFruit names corresponding to labels:")
print(df[['fruit_label', 'fruit_name']].drop_duplicates().sort_values('fruit_label'))


Fruit names corresponding to labels:
    fruit_label fruit_name
0             1      apple
3             2   mandarin
24            3     orange
43            4      lemon


In [68]:
#---------------------------
# VISUALIZE THE DATA 
# -------------------------

unique_fruits = df['fruit_name'].unique()
# Define pastel colors for each position
pastel_colors = {
    0: "#981b00",  
    1: "#ff8c3c",  
    2: "#ff9300",  
    3: "#7eef46"   
}
# Create the base histogram with custom color assignments
fig = px.histogram(
    df, 
    x='width', 
    color='fruit_name',
    facet_col='fruit_name',
    facet_col_wrap=2,
    title='Width Distribution per Fruit Type',
    nbins=20,
    opacity=0.75,
    color_discrete_map={fruit: pastel_colors[i] for i, fruit in enumerate(unique_fruits)}
)
fig.update_traces(marker_line_color='black', marker_line_width=1)
# No distribution curve lines
fig.update_layout(showlegend=False)
fig.update_xaxes(matches=None)
fig.show()

#### Analysis of Width Distribution per Fruit Type

> The histogram above illustrate the distribution of width measurements across four distinct fruit types: apple, mandarin, orange, and lemon. Each subplot offers a visual representation of the frequency of different width values observed for a specific fruit. Upon examination, notable variations in the width distributions are evident among the different categories. For instance, apples exhibit a relatively broad distribution, suggesting a greater variability in their width compared to mandarins, which appear to have a more concentrated distribution around a smaller range of width values. Oranges present a bimodal tendency, indicating the presence of two prominent clusters of width measurements within the sampled population. In contrast, lemons display a distribution skewed towards smaller width values, with a noticeable peak in the lower range and a rapid decline in frequency as width increases. These differences in the shape and spread of the distributions underscore the inherent morphological variations that exist between these fruit types with respect to their width dimension. Further statistical analysis could quantify these observed differences and provide a more rigorous understanding of the central tendencies and dispersions of these width measurements for each fruit category.

In [69]:
# create histogram for height distribution across all fruit kind

fig = px.histogram(
    df, 
    x='height', 
    color='fruit_name',
    facet_col='fruit_name',
    facet_col_wrap=2,
    title='Height Distribution per Fruit Type',
    nbins=20,
    opacity=0.75,
    color_discrete_map={fruit: pastel_colors[i] for i, fruit in enumerate(unique_fruits)}
)
fig.update_traces(marker_line_color='black', marker_line_width=1)
# No distribution curve lines
fig.update_layout(showlegend=False)
fig.update_xaxes(matches=None)
fig.show()

#### Analysis of Height Distribution per Fruit Type

> The subsequent set of histograms delineates the distribution of height measurements for the same four fruit types: apple, mandarin, orange, and lemon. Similar to the width analysis, a visual inspection of these height distributions reveals distinct patterns characteristic of each fruit. Apples exhibit a distribution that appears somewhat uniform across a specific range of height values, suggesting a consistent height profile within the sampled apples. Mandarins, conversely, show a distribution concentrated towards lower height values, with a less pronounced spread compared to apples. Oranges display a more dispersed height distribution, with observations spanning a wider range and exhibiting a less defined central tendency. Interestingly, lemons present a multimodal distribution for height, indicating the presence of multiple peaks in the frequency of observed height values. These variations in the distributional characteristics of height across the fruit types highlight the dimensional diversity present within this fruit dataset. Quantitative statistical measures, such as variance and skewness, could provide further clarity on the extent and nature of these differences in height profiles.

In [70]:
# create histogram for mass distribution across all fruit kind

fig = px.histogram(
    df, 
    x='mass', 
    color='fruit_name',
    facet_col='fruit_name',
    facet_col_wrap=2,
    title='Mass Distribution per Fruit Type',
    nbins=20,
    opacity=0.75,
    color_discrete_map={fruit: pastel_colors[i] for i, fruit in enumerate(unique_fruits)}
)
fig.update_traces(marker_line_color='black', marker_line_width=1)
# No distribution curve lines
fig.update_layout(showlegend=False)
fig.update_xaxes(matches=None)
fig.show()

#### Analysis of Mass Distribution per Fruit Type

> The third set of histograms presents the distribution of mass measurements for apples, mandarins, oranges, and lemons. Mass, a fundamental physical property, exhibits distinct distributional patterns across the different fruit categories. Apples demonstrate a mass distribution that is skewed towards higher values, indicating that the majority of the sampled apples possess a relatively greater mass, with a tail extending towards lower mass values. Mandarins, in contrast, are characterized by a mass distribution concentrated in the lower range, suggesting a generally lighter weight compared to apples. Oranges display a mass distribution with a prominent peak at a lower mass range, followed by a secondary, smaller peak at a higher mass value, hinting at potential subgroups within the orange samples. Lemons exhibit a mass distribution that is predominantly concentrated at the lower end of the mass spectrum, signifying their relatively lighter weight compared to the other fruit types considered. The observed differences in mass distributions are crucial for understanding the overall physical characteristics of these fruits and could be relevant in various applications, including sorting, packaging, and quality assessment.

In [71]:
# create histogram for color_score distribution across all fruit kind

# Create the base histogram with custom color assignments
fig = px.histogram(
    df, 
    x='color_score', 
    color='fruit_name',
    facet_col='fruit_name',
    facet_col_wrap=2,
    title='Color Score Distribution per Fruit Type',
    nbins=20,
    opacity=0.75,
    color_discrete_map={fruit: pastel_colors[i] for i, fruit in enumerate(unique_fruits)}
)
fig.update_traces(marker_line_color='black', marker_line_width=1)
# No distribution curve lines
fig.update_layout(showlegend=False)
fig.update_xaxes(matches=None)
fig.show()

#### Analysis of Color Score Distribution per Fruit Type

> The final set of histograms illustrates the distribution of a "color score" for apples, mandarins, oranges, and lemons. This color score, likely a quantitative measure of fruit coloration, reveals interesting differences in the color profiles of the four fruit types. Apples exhibit a color score distribution that appears somewhat scattered across a moderate range, suggesting variability in their coloration. Mandarins show a distribution concentrated towards the lower end of the color score spectrum, indicating a tendency towards a specific color range. Oranges display a bimodal distribution of color scores, suggesting the presence of two distinct clusters of coloration within the orange samples. Lemons, notably, present a color score distribution highly concentrated within a narrow, relatively low range, implying a more uniform and distinct coloration profile compared to the other fruits. The variations in these color score distributions underscore the differences in the visual attributes of these fruits, which can be an important factor in consumer perception and quality grading. Further investigation into the specific scale and meaning of the color score would provide deeper insights into these observed patterns.

#### Visualize Relationships between features

In [72]:
# These are only numeric columns.
numeric_cols = ['mass', 'width', 'height', 'color_score']

# Create a scatter matrix
fig = px.scatter_matrix(
    df,
    dimensions=numeric_cols,
    color='fruit_name',
    title="Scatterplot Matrix of Fruit Features",
    color_discrete_map={fruit: pastel_colors[i] for i, fruit in enumerate(unique_fruits)},
    height=800,
    width=800
)

# Update marker style
fig.update_traces(diagonal_visible=True, marker=dict(size=7, opacity=0.7, line=dict(width=0.5, color='black')))
fig.update_layout(showlegend=True)
fig.show()

#### Rigorous Essayistic Analysis of the Scatterplot Matrix of Fruit Features

> The presented scatterplot matrix furnishes a multifaceted perspective on the interrelationships among four continuous biometric attributes—namely, mass, width, height, and a quantitative color score—across a cohort comprising four distinct fruit cultivars: *apple*, *mandarin*, *orange*, and *lemon*. Each constituent off-diagonal subplot within this matrix serves as a bivariate Cartesian plane, meticulously plotting the joint distribution of paired features, wherein individual data points are chromatically differentiated to denote their respective fruit type. While the principal diagonal of the matrix does not explicitly depict univariate density estimations, it implicitly represents the marginal distribution of each solitary feature. This structural arrangement facilitates a simultaneous and integrated appraisal of pairwise correlational tendencies and the inherent separability of fruit categories predicated upon specific feature combinations.

> Upon meticulous examination of the bivariate scatterplots, several salient patterns and potential associations emerge. The relationship between **mass** and **width**, as depicted in the upper-left subplot, evinces a general positive covariation, suggesting a tendency for fruits with greater lateral dimensions to also possess a larger mass. However, the strength and linearity of this association appear to be modulated by the specific fruit variety. For instance, the cluster representing apples manifests a relatively robust positive linear trend, spanning a considerable range across both the abscissa (width) and the ordinate (mass). Conversely, mandarins are predominantly localized within the lower strata of both dimensional scales, exhibiting a more constrained distribution. Oranges present a more dispersed configuration within this bivariate space, demonstrating a degree of overlap with both apples and mandarins, potentially indicating a greater inherent variability within this cultivar or the presence of sub-groups. Finally, lemons are largely concentrated within the lower echelons of mass measurements but exhibit a more expansive range of width values in comparison to mandarins, hinting at a potentially different allometric scaling relationship.

> Akin to the mass-width nexus, the bivariate plot of **mass** against **height** (upper-middle-left) reveals a positive correlative trend, wherein fruits of greater vertical extent tend to exhibit larger mass values. Apples once again occupy the upper echelons of both axes, demonstrating a discernible positive linear association. Mandarins are consistently clustered within the lower quadrant of this feature space. Oranges display a more scattered disposition, intermingling with both apple and mandarin data points. Lemons, while generally characterized by lower mass values, span a moderate range of heights, suggesting that height might be a more discriminating feature for lemons compared to mass when distinguishing them from mandarins.

> The bivariate relationship between mass and the quantitative color score (upper-right) appears less straightforward and potentially non-linear. Apples exhibit a relatively broad dispersion of color scores across their mass spectrum. Mandarins are concentrated at lower mass values with a comparatively restricted range of color scores. Oranges demonstrate a wider range of color scores, with a possible subtle inclination towards higher color scores at greater mass values. Lemons are predominantly found at lower mass values and within a specific, relatively low interval of the color score, suggesting that color score might be a key differentiator for this fruit type.

> Examining the association between **width** and **height** (middle-left), a strong positive linear correlation is evident across the fruit dataset. Fruits with larger widths tend to also possess greater heights. Apples and oranges populate the higher ranges of both dimensions, exhibiting a clear positive linear trajectory. Mandarins are tightly clustered at lower width and height values. Lemons present a more nuanced relationship, with some overlap with mandarins at the lower end of the spectrum but also extending to slightly greater width values relative to their height compared to mandarins.

> The bivariate plot of **width** against **color score** (middle-right) reveals a less distinct pattern. Apples exhibit a moderate range of color scores across their width distribution. Mandarins are clustered at lower width values with a specific color score range. Oranges display a more extensive distribution of color scores across their width range. Lemons are generally located at lower width values and within a narrow, low range of color scores.

> Finally, the relationship between height and color score (bottom-right) similarly lacks a strong linear trend across all fruit types. Apples show a moderate dispersion of color scores across their height range. Mandarins are clustered at lower height values with a specific color score range. Oranges exhibit a broader distribution of color scores across their height range. Lemons are concentrated at lower height values and within a distinct, low range of color scores.

In terms of the separability of fruit types within this multidimensional feature space, the scatterplot matrix offers valuable preliminary insights. Apples tend to occupy a region characterized by higher mass, width, and height, with a more heterogeneous distribution of color scores. Mandarins are generally confined to a lower-dimensional subspace defined by smaller mass, width, and height, and a specific color score range. Oranges exhibit a more intermediate and often overlapping distribution, potentially reflecting a greater intra-species variability or the presence of distinct sub-varieties. Lemons appear to be somewhat distinguishable by their lower mass values and a consistently low range of color scores, although their width and height may overlap with those of mandarins at the lower end of the spectrum.

### Model Implementation

#### KNN REGRESSION <br>

**Context :** Predicting the mass based on other features.

In [73]:
# ------------------------
#   KNN REGRESSION 
# ------------------------

#Prepare Data

# Drop non-numeric columns (they're not needed for regression on mass)
df_clean = df.drop(['fruit_name', 'fruit_subtype'], axis=1)

In [74]:
# Define features and target
X = df_clean.drop('mass', axis=1)
y = df_clean['mass']

In [75]:
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [76]:
# Hyperparameter range
k_range = range(1, 21)

In [77]:
# Initialize metrics storage

best_k_scores = []
train_r2_scores = []
test_r2_scores = []
train_mae_scores = []
test_mae_scores = []
train_rmse_scores = []
test_rmse_scores = []

In [78]:
# Run 50 randomized trials
for trial in tqdm(range(50), desc="Running Trials"):
    # Random split (no fixed random_state)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=None)

    best_score = float('-inf')
    best_k = None
    best_train_preds = None
    best_test_preds = None
    y_train_actual = y_train  # Needed for reuse

    # Try all k values
    for k in k_range:
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_train, y_train)

        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)

        r2_test = r2_score(y_test, y_test_pred)

        if r2_test > best_score:
            best_score = r2_test
            best_k = k
            best_train_preds = y_train_pred
            best_test_preds = y_test_pred

# Store best K and metrics
    best_k_scores.append(best_k)

    train_r2_scores.append(r2_score(y_train_actual, best_train_preds))
    test_r2_scores.append(r2_score(y_test, best_test_preds))

    train_mae_scores.append(mean_absolute_error(y_train_actual, best_train_preds))
    test_mae_scores.append(mean_absolute_error(y_test, best_test_preds))

    train_rmse_scores.append(np.sqrt(mean_squared_error(y_train_actual, best_train_preds)))
    test_rmse_scores.append(np.sqrt(mean_squared_error(y_test, best_test_preds)))


Running Trials: 100%|██████████| 50/50 [00:19<00:00,  2.58it/s]


In [79]:
# -------------------------
#   MODEL EVALUATION
# --------------------------

# MAE Plot
fig_mae = go.Figure()
fig_mae.add_trace(go.Scatter(y=train_mae_scores, mode='lines+markers', name='Train MAE'))
fig_mae.add_trace(go.Scatter(y=test_mae_scores, mode='lines+markers', name='Test MAE'))
fig_mae.update_layout(title='KNN Regression: MAE across 50 Trials',
                      xaxis_title='Trial',
                      yaxis_title='Mean Absolute Error',
                      template='plotly_dark')
fig_mae.show()

#### Discussion:

> The line graph illustrates the Mean Absolute Error (MAE) of a K-Nearest Neighbors (KNN) regression model evaluated over 50 distinct trials. Two trend lines are depicted: one tracing the MAE on the training subset (rendered in blue) and the other charting the MAE on the held-out testing subset (rendered in red). The abscissa represents the sequential trial number, ranging from 0 to 50, while the ordinate quantifies the magnitude of the Mean Absolute Error, spanning a scale from 0 to 20.

A prominent feature of the visualization is the dynamic and oscillating behavior of both the training and testing MAE across the series of trials. As anticipated, the training MAE generally registers values lower than its testing counterpart, a common characteristic arising from the model's direct exposure to the training data during the learning phase. However, the substantial volatility observed in the training MAE suggests an inconsistent level of predictive accuracy on the data it was trained on, potentially attributable to variations in data splits or other trial-specific conditions.

Similarly, the testing MAE exhibits considerable fluctuation, with several instances of pronounced spikes, indicating a marked inconsistency in the model's capacity to generalize to unseen data across the different trials. The persistent disparity between the training and testing MAE values across numerous trials hints at the presence of overfitting in certain iterations, where the model becomes excessively attuned to the nuances of the training data, thereby compromising its performance on novel, unencountered data points. In essence, the graph underscores the performance instability of the KNN regression model across varying experimental runs, necessitating a deeper inquiry into the underlying factors contributing to this variability and the exploration of techniques aimed at enhancing the model's reliability and generalization capability.

In [80]:
# RMSE Plot
fig_rmse = go.Figure()
fig_rmse.add_trace(go.Scatter(y=train_rmse_scores, mode='lines+markers', name='Train RMSE'))
fig_rmse.add_trace(go.Scatter(y=test_rmse_scores, mode='lines+markers', name='Test RMSE'))
fig_rmse.update_layout(title='KNN Regression: RMSE across 50 Trials',
                       xaxis_title='Trial',
                       yaxis_title='Root Mean Squared Error',
                       template='plotly_dark')
fig_rmse.show()

#### Description:

> The line graph displays the Root Mean Squared Error (RMSE) of a KNN regression model over 50 trials, showing both training (blue) and testing (red) RMSE. Similar to the MAE graph, both training and testing RMSE fluctuate considerably across trials, indicating inconsistent performance. The training RMSE generally remains lower than the testing RMSE, but both exhibit substantial variability, suggesting instability in the model's learning and generalization. Spikes in the testing RMSE point to instances of poor performance on unseen data. The persistent gap between training and testing RMSE suggests potential overfitting in some trials. Overall, the model's performance, as measured by RMSE, is not stable across different runs.

In [81]:
# R² Plot
fig_r2 = go.Figure()
fig_r2.add_trace(go.Scatter(y=train_r2_scores, mode='lines+markers', name='Train R²'))
fig_r2.add_trace(go.Scatter(y=test_r2_scores, mode='lines+markers', name='Test R²'))
fig_r2.update_layout(title='KNN Regression: R² Score across 50 Trials',
                     xaxis_title='Trial',
                     yaxis_title='R² Score',
                     template='plotly_dark')
fig_r2.show()

The line graph above illustrates the R² score of a KNN regression model across 50 trials, displaying both training (blue) and testing (red) R² values. The R² score, representing the proportion of variance explained by the model, shows considerable variability across trials for both training and testing sets, indicating inconsistent model fit. The training R² generally hovers at a high level, often near 1, suggesting a good fit to the training data. However, the testing R² fluctuates more significantly, with drops to lower values, indicating inconsistent generalization performance on unseen data. The gap between training and testing R² in several trials suggests potential overfitting, where the model fits the training data well but fails to generalize effectively. The instability in the R² score across trials highlights the sensitivity of the KNN regression model to variations in the data or trial-specific conditions.

In [82]:
print(f"Average Best K: {np.mean(best_k_scores):.2f}")
print(f"Average Training R²: {np.mean(train_r2_scores):.3f}")
print(f"Average Testing R²: {np.mean(test_r2_scores):.3f}")
print(f"Average Training MAE: {np.mean(train_mae_scores):.3f}")
print(f"Average Testing MAE: {np.mean(test_mae_scores):.3f}")
print(f"Average Training RMSE: {np.mean(train_rmse_scores):.3f}")
print(f"Average Testing RMSE: {np.mean(test_rmse_scores):.3f}")

Average Best K: 2.54
Average Training R²: 0.959
Average Testing R²: 0.902
Average Training MAE: 6.230
Average Testing MAE: 10.075
Average Training RMSE: 9.229
Average Testing RMSE: 13.325


---

In [83]:
print(df['fruit_label'].value_counts())

fruit_label
1    19
3    19
4    16
2     5
Name: count, dtype: int64


#### LOGISTIC REGRESSION W/ L2 REGULARIZATION

**Context:** classify:

1 = Apple

0 = Not Apple

In [84]:
# -----------------------------
# LOGISTIC REGRESSION
# -----------------------------

# Create binary target: 1 if apple, 0 otherwise
df_binary = df.copy()
df_binary['is_apple'] = df_binary['fruit_name'].apply(lambda x: 1 if x == 'apple' else 0)


In [85]:
# Features and target setup
X = df_binary[['mass', 'width', 'height', 'color_score']]
y = df_binary['is_apple']

In [86]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [87]:
# C values to try (inverse of regularization strength)
C_values = [1e-8, 1e-4, 1e-3, 0.1, 0.2, 0.4, 0.75, 1, 1.5, 3, 5, 10, 15, 20, 100, 300, 1000, 5000]
lahat_train_lr = pd.DataFrame()
lahat_test_lr = pd.DataFrame()

In [88]:
# Metric storage
best_c_list = []
train_acc, test_acc = [], []
train_precision, test_precision = [], []
train_recall, test_recall = [], []
train_f1, test_f1 = [], []
train_r2, test_r2 = [], []

In [89]:
# 50 trials
for seedN in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, stratify=y, random_state=seedN)
    acc_train, acc_test = [], []
    for c in C_values:
        model = LogisticRegression(C=c, penalty='l2', solver='liblinear')
        model.fit(X_train, y_train)
        acc_train.append(model.score(X_train, y_train))
        acc_test.append(model.score(X_test, y_test))
    lahat_train_lr[seedN] = acc_train
    lahat_test_lr[seedN] = acc_test

    best_index = np.argmax(acc_test)
    best_c = C_values[best_index]
    best_c_list.append(best_c)

    best_model = LogisticRegression(C=best_c, penalty='l2', solver='liblinear')
    best_model.fit(X_train, y_train)
    y_train_pred = best_model.predict(X_train)
    y_test_pred = best_model.predict(X_test)

    train_acc.append(accuracy_score(y_train, y_train_pred))
    test_acc.append(accuracy_score(y_test, y_test_pred))
    train_precision.append(precision_score(y_train, y_train_pred, zero_division=0))
    test_precision.append(precision_score(y_test, y_test_pred, zero_division=0))
    train_recall.append(recall_score(y_train, y_train_pred, zero_division=0))
    test_recall.append(recall_score(y_test, y_test_pred, zero_division=0))
    train_f1.append(f1_score(y_train, y_train_pred, zero_division=0))
    test_f1.append(f1_score(y_test, y_test_pred, zero_division=0))
    train_r2.append(r2_score(y_train, y_train_pred))
    test_r2.append(r2_score(y_test, y_test_pred))

In [90]:
# ------------------------
#    MODEL EVALUTAION
# -----------------------

# Accuracy Plot
fig_lr = go.Figure()
fig_lr.add_trace(go.Scatter(x=C_values, y=lahat_train_lr.mean(axis=1),
                            error_y=dict(type='data', array=lahat_train_lr.var(axis=1)),
                            mode='lines+markers', name='Training Accuracy'))
fig_lr.add_trace(go.Scatter(x=C_values, y=lahat_test_lr.mean(axis=1),
                            error_y=dict(type='data', array=lahat_test_lr.var(axis=1)),
                            mode='lines+markers', name='Testing Accuracy'))
fig_lr.update_layout(title='Logistic Regression (L2) Accuracy vs C (50 Trials)',
                     xaxis_title='C (Regularization)',
                     yaxis_title='Accuracy',
                     xaxis_type='log',
                     template='plotly_dark')
fig_lr.show()


### Analysis of Logistic Regression (L2) Accuracy vs. C

> This graph illustrates the relationship between the regularization parameter 'C' (inverse of regularization strength, $\lambda$) in an L2-regularized Logistic Regression model and its accuracy on training and testing datasets across 50 trials. The x-axis represents the value of 'C', spanning from very strong regularization ($10^{-9}$) to very weak regularization ($10^4$). The y-axis displays the accuracy. The blue line shows the training accuracy, while the red line represents the testing accuracy, with error bars indicating the standard deviation across the 50 trials.

**Observations:**

The training accuracy generally increases as 'C' increases (regularization weakens). This is expected, as a less constrained model can better fit the training data. It eventually plateaus at a high accuracy.

The testing accuracy initially improves with increasing 'C', suggesting that reducing strong regularization helps the model generalize better. However, beyond a certain point (around C=1), the testing accuracy plateaus or even slightly decreases. This indicates the onset of overfitting, where the model starts to learn the noise in the training data, hindering its performance on unseen data.

The error bars for the testing accuracy tend to widen at higher 'C' values. This suggests that the model's performance on unseen data becomes more variable and less reliable when regularization is weak and overfitting is more likely.

**Interpretation:**

The graph demonstrates the crucial bias-variance trade-off in machine learning. Strong regularization (low 'C') leads to high bias and underfitting, where the model is too simple to capture the underlying patterns. Weak regularization (high 'C') leads to low bias but high variance and overfitting, where the model learns the training data too well, including its noise, and performs poorly on new data.

The optimal value of 'C' lies in the region where the testing accuracy is maximized, representing a balance between bias and variance that allows for good generalization to unseen data. This visualization emphasizes the importance of hyperparameter tuning to find the appropriate level of regularization for a given problem.

In [91]:
# R² Plot
fig_r2 = go.Figure()
fig_r2.add_trace(go.Scatter(y=train_r2, mode='lines+markers', name='Training R²'))
fig_r2.add_trace(go.Scatter(y=test_r2, mode='lines+markers', name='Testing R²'))
fig_r2.update_layout(title='Logistic Regression (L2): Training vs Testing R² across 50 Trials',
                     xaxis_title='Trial',
                     yaxis_title='R² Score',
                     template='plotly_dark')
fig_r2.show()

print("Highest Test Set Accuracy (LogReg L2):", np.max(lahat_test_lr.mean(axis=1)))
print("Best C Parameter (LogReg L2):", C_values[np.argmax(lahat_test_lr.mean(axis=1))])

Highest Test Set Accuracy (LogReg L2): 0.7533333333333337
Best C Parameter (LogReg L2): 100


### Analysis of Logistic Regression (L2): Training vs. Testing R² across 50 Trials

> This graph displays the R² scores for both the training (blue line) and testing (red line) datasets of an L2-regularized Logistic Regression model over 50 independent trials. The x-axis represents each of the 50 trials, while the y-axis shows the corresponding R² score, a measure of the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² score closer to 1 indicates a better fit.

**Observations:**

Both the training and testing R² scores exhibit considerable variability across the 50 trials. The scores fluctuate significantly, ranging from negative values to positive values below 0.6.

There isn't a consistent pattern of the training R² being significantly higher than the testing R² across all trials. In some trials, the training R² is higher, as expected, while in others, the testing R² approaches or even exceeds the training R².

The negative R² scores observed in several trials, particularly for the testing set, indicate that the model performs worse than simply predicting the mean of the target variable. This suggests a poor fit for those specific trials.

It has also indicated that the highest test set accuracy achieved was approximately 0.753, and this occurred with a best 'C' parameter (inverse of regularization strength) of 100.

In [92]:
# F1 Score
fig_f1 = go.Figure()
fig_f1.add_trace(go.Scatter(y=train_f1, mode='lines+markers', name='Training F1'))
fig_f1.add_trace(go.Scatter(y=test_f1, mode='lines+markers', name='Testing F1'))
fig_f1.update_layout(title='Logistic Regression: F1 Score across 50 Trials',
                     xaxis_title='Trial',
                     yaxis_title='F1 Score',
                     template='plotly_dark')
fig_f1.show()

In [93]:
# Precision Score
fig_f1 = go.Figure()
fig_f1.add_trace(go.Scatter(y=train_precision, mode='lines+markers', name='Training Precision'))
fig_f1.add_trace(go.Scatter(y=train_precision, mode='lines+markers', name='Testing Precision'))
fig_f1.update_layout(title='Logistic Regression: Precision Score across 50 Trials',
                     xaxis_title='Trial',
                     yaxis_title='Precision Score',
                     template='plotly_dark')
fig_f1.show()

In [94]:
# Recall Score
fig_f1 = go.Figure()
fig_f1.add_trace(go.Scatter(y=train_recall, mode='lines+markers', name='Training Recall'))
fig_f1.add_trace(go.Scatter(y=train_recall, mode='lines+markers', name='Testing Recall'))
fig_f1.update_layout(title='Logistic Regression: Recall Score across 50 Trials',
                     xaxis_title='Trial',
                     yaxis_title='Recall Score',
                     template='plotly_dark')
fig_f1.show()

In [96]:
# Final Model Performance Summary:

print(f"Average Best C: {np.mean(best_c_list):.4f}")
print(f"Average Training Accuracy: {np.mean(train_acc):.3f}")
print(f"Average Testing Accuracy: {np.mean(test_acc):.3f}")
print(f"Average Testing Precision: {np.mean(test_precision):.3f}")
print(f"Average Testing Recall: {np.mean(test_recall):.3f}")
print(f"Average Testing F1: {np.mean(test_f1):.3f}")
print()
print("Highest Test Set Accuracy (LogReg L2):", np.max(lahat_test_lr.mean(axis=1)))
print("Best C Parameter (LogReg L2):", C_values[np.argmax(lahat_test_lr.mean(axis=1))])

Average Best C: 3.7980
Average Training Accuracy: 0.816
Average Testing Accuracy: 0.785
Average Testing Precision: 0.792
Average Testing Recall: 0.548
Average Testing F1: 0.611

Highest Test Set Accuracy (LogReg L2): 0.7533333333333337
Best C Parameter (LogReg L2): 100


#### Linear SVM W/ L2 REGULARIZATION

In [98]:
# Features and target setup (same as before)
X = df_binary[['mass', 'width', 'height', 'color_score']]
y = df_binary['is_apple']

In [99]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [100]:
lahat_train_svm = pd.DataFrame()
lahat_test_svm = pd.DataFrame()

In [101]:
# Metric storage for SVM
best_c_list_svm = []
train_acc_svm, test_acc_svm = [], []
train_precision_svm, test_precision_svm = [], []
train_recall_svm, test_recall_svm = [], []
train_f1_svm, test_f1_svm = [], []
train_r2_svm, test_r2_svm = [], []

In [102]:
# Run 50 randomized trials
for seedN in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, stratify=y, random_state=seedN)
    acc_train, acc_test = [], []
    for c in C_values:
        model = LinearSVC(C=c, penalty='l2', dual=False, max_iter=10000)
        model.fit(X_train, y_train)
        acc_train.append(model.score(X_train, y_train))
        acc_test.append(model.score(X_test, y_test))
    lahat_train_svm[seedN] = acc_train
    lahat_test_svm[seedN] = acc_test

    best_index = np.argmax(acc_test)
    best_c = C_values[best_index]
    best_c_list_svm.append(best_c)

    best_model = LinearSVC(C=best_c, penalty='l2', dual=False, max_iter=10000)
    best_model.fit(X_train, y_train)
    y_train_pred = best_model.predict(X_train)
    y_test_pred = best_model.predict(X_test)

    train_acc_svm.append(accuracy_score(y_train, y_train_pred))
    test_acc_svm.append(accuracy_score(y_test, y_test_pred))
    train_precision_svm.append(precision_score(y_train, y_train_pred, zero_division=0))
    test_precision_svm.append(precision_score(y_test, y_test_pred, zero_division=0))
    train_recall_svm.append(recall_score(y_train, y_train_pred, zero_division=0))
    test_recall_svm.append(recall_score(y_test, y_test_pred, zero_division=0))
    train_f1_svm.append(f1_score(y_train, y_train_pred, zero_division=0))
    test_f1_svm.append(f1_score(y_test, y_test_pred, zero_division=0))
    train_r2_svm.append(r2_score(y_train, y_train_pred))
    test_r2_svm.append(r2_score(y_test, y_test_pred))

In [103]:
# Accuracy Plot
fig_svm = go.Figure()
fig_svm.add_trace(go.Scatter(x=C_values, y=lahat_train_svm.mean(axis=1),
                             error_y=dict(type='data', array=lahat_train_svm.var(axis=1)),
                             mode='lines+markers', name='Training Accuracy'))
fig_svm.add_trace(go.Scatter(x=C_values, y=lahat_test_svm.mean(axis=1),
                             error_y=dict(type='data', array=lahat_test_svm.var(axis=1)),
                             mode='lines+markers', name='Testing Accuracy'))
fig_svm.update_layout(title='Linear SVM (L2) Accuracy vs C (50 Trials)',
                      xaxis_title='C (Regularization)',
                      yaxis_title='Accuracy',
                      xaxis_type='log',
                      template='plotly_dark')
fig_svm.show()

### Analysis of Linear SVM (L2) Accuracy vs. C (50 Trials)

? The graphical representation above elucidates the intricate relationship between the regularization parameter 'C' and the generalization capability of a Linear Support Vector Machine employing L2 regularization. By plotting the training and testing accuracies across 50 independent trials against a spectrum of 'C' values, we gain direct insights into the model's learning behavior and its ability to perform on unseen data.

Observing the trajectory of the training accuracy (depicted in blue), a clear trend emerges: as the value of 'C' increases, signifying a reduction in the strength of the L2 penalty, the model's performance on the training data generally improves. Starting from a moderate accuracy level under conditions of strong regularization (very low 'C'), the training accuracy exhibits a notable surge around a 'C' value of 0.01. Subsequently, it plateaus at a high accuracy level, hovering around 0.84, as 'C' continues to increase and the regularization constraint weakens. This behavior is consistent with the expectation that a less constrained model possesses a greater capacity to fit the intricacies of the training dataset.

The testing accuracy (illustrated in red) presents a more nuanced picture concerning the model's ability to generalize. Similar to the training accuracy, an initial increase in 'C' from very small values corresponds to an improvement in the testing performance, with a significant rise also observed around C=0.01. However, the testing accuracy reaches its apex, approximately at 0.76, within the 'C' range of roughly 1 to 100. Beyond this optimal zone, further increases in 'C' do not yield substantial improvements in testing accuracy; instead, a plateau or even a slight decline appears, suggesting the onset of overfitting where the model begins to learn noise specific to the training data.

In [104]:
# R² Plot
fig_r2_svm = go.Figure()
fig_r2_svm.add_trace(go.Scatter(y=train_r2_svm, mode='lines+markers', name='Training R²'))
fig_r2_svm.add_trace(go.Scatter(y=test_r2_svm, mode='lines+markers', name='Testing R²'))
fig_r2_svm.update_layout(title='Linear SVM (L2): Training vs Testing R² across 50 Trials',
                         xaxis_title='Trial',
                         yaxis_title='R² Score',
                         template='plotly_dark')
fig_r2_svm.show()

print("Highest Test Set Accuracy (Linear SVM):", np.max(lahat_test_svm.mean(axis=1)))
print("Best C Parameter (Linear SVM):", C_values[np.argmax(lahat_test_svm.mean(axis=1))])

Highest Test Set Accuracy (Linear SVM): 0.7600000000000002
Best C Parameter (Linear SVM): 300


### Analysis of Linear SVM (L2): Training vs. Testing R² across 50 Trials

> The most striking observation from the graph above is the substantial variability in both the training and testing R² scores across the 50 trials. The scores fluctuate widely, spanning from negative values to positive values below 0.6. This indicates a considerable inconsistency in the model's performance as measured by R² across different trials.

There is no clear and consistent pattern of the training R² scores being significantly higher than the testing R² scores across all trials. In several instances, the testing R² score either approaches or even surpasses the corresponding training R² score. This lack of a consistent gap suggests that the model's ability to generalize, as indicated by R², is not uniformly worse than its fit to the training data.

The presence of negative R² scores, particularly prominent in the testing set, is a critical observation. A negative R² implies that the model's predictions are worse than simply predicting the average value of the target variable for those specific trials. This strongly suggests a poor model fit and a failure to capture the underlying relationships in the data for those instances.

It has also been indicated that the highest test set accuracy achieved by this Linear SVM was approximately 0.76, and this occurred with a best 'C' parameter (again as an inverse of regularization strength) set to 300. 


In [105]:
# Average Best C and other metrics
print(f"Average Best C: {np.mean(best_c_list_svm):.4f}")
print(f"Average Training Accuracy: {np.mean(train_acc):.3f}")
print(f"Average Testing Accuracy: {np.mean(test_acc):.3f}")
print(f"Average Testing Precision: {np.mean(test_precision):.3f}")
print(f"Average Testing Recall: {np.mean(test_recall):.3f}")
print(f"Average Testing F1: {np.mean(test_f1):.3f}")

Average Best C: 12.3470
Average Training Accuracy: 0.816
Average Testing Accuracy: 0.785
Average Testing Precision: 0.792
Average Testing Recall: 0.548
Average Testing F1: 0.611


---

### Summary

| Model                  | Accuracy Score                               | Best Parameter(s)   |
|------------------------|----------------------------------------------|---------------------|
| KNN Regression         | Average Training R²: 0.959 <br> Average Testing R²: 0.902 | Average Best K: 2.54|
| Logistic Regression (L2)| Accuracy: 0.753                              | Best C Param =100  |
| Linear SVM (L2)        | Average Training Accuracy: 0.816 <br> Average Testing Accuracy: 0.785 | Average Best C: 12.3470 |

### Discussion: The Comparison of Results



In evaluating the three models, It was observed that **KNN regression** achieved the highest average R² scores for both training (0.959) and testing (0.902), but its performance was highly unstable across trials, indicating sensitivity to data splits and potential overfitting. 

**Logistic Regression** with L2 regularization reached a peak test accuracy of 0.753 at \( C = 100 \), but the R² scores fluctuated widely, with several negative values, pointing to inconsistent generalization. 

Similarly, the **Linear SVM** model demonstrated strong training accuracy (~0.816) and reached a peak test accuracy of 0.76 at \( C = 300 \), though its R² scores were also erratic, with negative values appearing frequently. Across all models, the training performance was generally stronger than the testing performance, which is indicative of overfitting tendencies, especially when regularization was weak. The variability in R² across trials reinforces the importance of cross-validation and robust data partitioning strategies. Ultimately, while KNN showed superior average predictive power, the SVM offered a more balanced performance between training and testing accuracies at a moderately regularized setting.


---