# PenguinDemo: Machine Learning Training 🐧

This notebook demonstrates machine learning model training for penguin species classification using the Palmer Penguins dataset.

## Models
- Neural Network (Flux.jl)
- Logistic Regression (Flux.jl)
- K-Nearest Neighbors (NearestNeighbors.jl)


## 1. Setup and Imports


In [1]:
using DataFrames, CSV, Plots, Statistics, LinearAlgebra, Random, Flux, StatsPlots, Printf, NearestNeighbors

# Set random seed for reproducibility
Random.seed!(42)

# Include utility modules
include("src/data_loading.jl")
include("src/ml_models.jl")
include("src/utils.jl")


get_predictions (generic function with 1 method)



## 2. Load and Prepare Data


In [None]:
# Load the dataset
df = load_penguin_data()

# Display basic info
println("Columns: $(names(df))")
println("Missing values per column:")
for col in names(df)
    missing_count = count(ismissing, df[!, col])
    if missing_count > 0
        println("  $col: $missing_count")
    end
end


Loaded 344 penguin records from Palmer Station LTER!
Columns: ["species", "island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex", "year"]
Missing values per column:
  bill_length_mm: 2
  bill_depth_mm: 2
  flipper_length_mm: 2
  body_mass_g: 2
  sex: 11


## 3. Preprocess Data


In [None]:
# Prepare data for machine learning
required_cols = [:body_mass_g, :flipper_length_mm, :bill_length_mm, :bill_depth_mm, :species]
df_clean = preprocess_data(df, required_cols)

# Prepare features and split data
numeric_cols = [:bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g]
X_train, y_train, X_val, y_val, species_unique, species_dict = prepare_features(df_clean, numeric_cols)


Dataset split:
  Training samples: 273
  Validation samples: 69
  Features: 4
  Species: 3
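The helper `prepare_features` is defined in `src/` and is not shown in this notebook. The output above suggests that it standardizes the four numeric features, one-hot encodes the species, and holds out about 20% of the 342 complete rows (344 records minus the 2 with missing measurements). The following is a minimal sketch of those three steps under those assumptions; `standardize`, `one_hot`, and `split_indices` are hypothetical names, the data are random stand-ins, and the real helper may differ in detail.

```julia
using Random, Statistics

# Hypothetical helper: z-score each feature (row) using training statistics only,
# so no information from the validation set leaks into the scaling.
function standardize(X_train::AbstractMatrix, X_val::AbstractMatrix)
    μ = mean(X_train, dims=2)
    σ = std(X_train, dims=2)
    return (X_train .- μ) ./ σ, (X_val .- μ) ./ σ
end

# Hypothetical helper: one-hot encode integer labels as a classes × samples matrix.
function one_hot(labels::AbstractVector{<:Integer}, num_classes::Integer)
    Y = zeros(Float32, num_classes, length(labels))
    for (j, c) in enumerate(labels)
        Y[c, j] = 1f0
    end
    return Y
end

# Hypothetical helper: shuffle sample indices and split them into train/validation.
function split_indices(rng::AbstractRNG, n::Integer; train_frac=0.8)
    perm = randperm(rng, n)
    n_train = floor(Int, train_frac * n)
    return perm[1:n_train], perm[n_train+1:end]
end

rng = MersenneTwister(42)
demo_X = rand(rng, Float32, 4, 342)      # stand-in for the 4 numeric features
demo_labels = rand(rng, 1:3, 342)        # stand-in for species indices 1..3

train_idx, val_idx = split_indices(rng, 342)
demo_X_train, demo_X_val = standardize(demo_X[:, train_idx], demo_X[:, val_idx])
demo_y_train = one_hot(demo_labels[train_idx], 3)
demo_y_val = one_hot(demo_labels[val_idx], 3)

println(size(demo_X_train), " ", size(demo_X_val))   # (4, 273) (4, 69)
```

With 342 rows, `floor(Int, 0.8 * 342)` gives 273 training samples and leaves 69 for validation, which matches the split reported above. Whether the original helper rounds the same way is an assumption.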

## 4. Train Models


### 4.1 Neural Network


In [None]:
# Train Neural Network
num_species = length(species_unique)
nn_model = train_neural_network(X_train, y_train, X_val, y_val, num_species, 20)

# Evaluate Neural Network
nn_train_pred = get_predictions(nn_model(X_train))
nn_val_pred = get_predictions(nn_model(X_val))
nn_train_true = get_predictions(y_train)
nn_val_true = get_predictions(y_val)

nn_train_acc = calculate_accuracy(nn_train_pred, nn_train_true)
nn_val_acc = calculate_accuracy(nn_val_pred, nn_val_true)

println("\nNeural Network Results:")
println("  Train Accuracy: $(round(nn_train_acc*100, digits=1))%")
println("  Validation Accuracy: $(round(nn_val_acc*100, digits=1))%")



MODEL 1: Neural Network (Flux)
Architecture: 4 → 16 → 8 → 3 (ReLU + Softmax)
Epoch 5: Train Loss: 1.064, Val Loss: 1.098
Epoch 10: Train Loss: 1.048, Val Loss: 1.082
Epoch 15: Train Loss: 1.032, Val Loss: 1.066
Epoch 20: Train Loss: 1.016, Val Loss: 1.051

Neural Network Results:
  Train Accuracy: 55.3%
  Validation Accuracy: 49.3%
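One way to read the training log above is to compare it with the loss of a classifier that knows nothing. Assuming the reported loss is the mean cross-entropy with natural logarithms (the loss used inside `train_neural_network` is not shown, so this is an assumption), a model that assigns probability 1/3 to each of the three species scores ln 3 ≈ 1.099. The sketch below computes that baseline; `cross_entropy` is a hypothetical helper written for this illustration.

```julia
# Mean cross-entropy between predicted class probabilities and one-hot targets
# (both classes × samples). Assumed to match the loss reported in the log.
cross_entropy(p, y) = -sum(y .* log.(p)) / size(y, 2)

num_classes = 3
y_demo = Float32[1 0 0 1; 0 1 0 0; 0 0 1 0]           # four one-hot targets
p_uniform = fill(1f0 / num_classes, size(y_demo))      # "no information" prediction

baseline = cross_entropy(p_uniform, y_demo)
println("Chance-level loss: ", round(baseline, digits=3))   # ln(3) ≈ 1.099
```

Under that assumption, a validation loss that moves only from 1.098 to 1.051 over 20 epochs is still close to the chance level, which is consistent with the roughly 50% accuracy reported. This points to an under-trained network (too few epochs or too small a learning rate) rather than to a problem with the data, since the other two models reach high accuracy on the same features.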


### 4.2 Logistic Regression


In [None]:
# Train Logistic Regression
lr_W, lr_b = train_logistic_regression(X_train, y_train)

# Evaluate Logistic Regression
lr_train_pred = predict_lr_flux(X_train, lr_W, lr_b)
lr_val_pred = predict_lr_flux(X_val, lr_W, lr_b)
lr_train_true = get_predictions(y_train)
lr_val_true = get_predictions(y_val)

lr_train_acc = calculate_accuracy(lr_train_pred, lr_train_true)
lr_val_acc = calculate_accuracy(lr_val_pred, lr_val_true)

println("\nLogistic Regression Results:")
println("  Train Accuracy: $(round(lr_train_acc*100, digits=1))%")
println("  Validation Accuracy: $(round(lr_val_acc*100, digits=1))%")



MODEL 2: Logistic Regression (Flux)
Training logistic regression (Flux style)...

Logistic Regression Results:
  Train Accuracy: 97.4%
  Validation Accuracy: 100.0%
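The helpers `predict_lr_flux`, `get_predictions`, and `calculate_accuracy` come from `src/` and are not shown here. The sketch below illustrates what such helpers typically do, assuming `W` is a classes × features matrix, `b` is a per-class bias vector, and class labels are recovered with a column-wise argmax. The function names `argmax_columns`, `predict_linear`, and `accuracy` are hypothetical.

```julia
# Column-wise argmax: turns a classes × samples score (or one-hot) matrix
# into a vector of class indices. Assumed behaviour of `get_predictions`.
argmax_columns(scores::AbstractMatrix) = [argmax(col) for col in eachcol(scores)]

# Linear scores followed by argmax. Softmax is monotonic, so it does not
# change which class wins and is skipped here.
predict_linear(X, W, b) = argmax_columns(W * X .+ b)

# Fraction of predictions equal to the true labels.
accuracy(pred, truth) = count(pred .== truth) / length(truth)

W_demo = Float32[1 0; 0 1; -1 -1]       # 3 classes × 2 features (toy weights)
b_demo = Float32[0, 0, 0]
X_demo = Float32[2 0 -1; 0 3 -2]        # 2 features × 3 samples

pred_demo = predict_linear(X_demo, W_demo, b_demo)
println(pred_demo)                       # [1, 2, 3]
println(accuracy(pred_demo, [1, 2, 1]))  # 0.6666666666666666
```

The same `argmax_columns` call works on a one-hot label matrix, which would explain why the notebook applies `get_predictions` to both model outputs and `y_train`.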


### 4.3 K-Nearest Neighbors


In [None]:
# Train K-Nearest Neighbors
kdtree = train_knn(X_train, y_train, num_species, 5)

# Evaluate KNN
knn_train_pred = predict_knn(kdtree, X_train, y_train, X_train, num_species, 5)
knn_val_pred = predict_knn(kdtree, X_train, y_train, X_val, num_species, 5)
knn_train_true = get_predictions(y_train)
knn_val_true = get_predictions(y_val)

knn_train_acc = calculate_accuracy(knn_train_pred, knn_train_true)
knn_val_acc = calculate_accuracy(knn_val_pred, knn_val_true)

println("\nK-Nearest Neighbors Results:")
println("  Train Accuracy: $(round(knn_train_acc*100, digits=1))%")
println("  Validation Accuracy: $(round(knn_val_acc*100, digits=1))%")



MODEL 3: K-Nearest Neighbors (NearestNeighbors.jl)
Building KDTree for efficient nearest neighbor search...
Making predictions with k=5...
Making predictions with k=5...

K-Nearest Neighbors Results:
  Train Accuracy: 99.3%
  Validation Accuracy: 100.0%
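The notebook's `predict_knn` uses a KDTree from NearestNeighbors.jl and is not shown. To illustrate the voting step without that dependency, the sketch below swaps in a brute-force distance search. It also assumes integer class labels instead of the one-hot `y_train` matrix; `knn_predict` is a hypothetical name.

```julia
# Brute-force k-nearest-neighbour classification with majority voting.
# X_ref: features × samples, labels_ref: integer class per reference sample.
function knn_predict(X_ref, labels_ref, X_query, num_classes, k)
    preds = Vector{Int}(undef, size(X_query, 2))
    for (j, q) in enumerate(eachcol(X_query))
        # Squared Euclidean distance from the query to every reference point
        dists = vec(sum(abs2, X_ref .- q, dims=1))
        nearest = partialsortperm(dists, 1:k)
        # Count votes per class; argmax breaks ties toward the lowest class index
        votes = zeros(Int, num_classes)
        for i in nearest
            votes[labels_ref[i]] += 1
        end
        preds[j] = argmax(votes)
    end
    return preds
end

X_ref = Float32[0 0.1 0.2 5 5.1 5.2; 0 0.1 0 5 5 5.1]   # 2 features × 6 samples
labels_ref = [1, 1, 1, 2, 2, 2]
X_query = Float32[0.05 4.9; 0.05 5.0]                    # 2 query points

println(knn_predict(X_ref, labels_ref, X_query, 2, 3))   # [1, 2]
```

When the training set is queried against itself, each point is its own nearest neighbor unless the search excludes it, so training accuracy for KNN tends to be optimistic. Whether `predict_knn` excludes the query point is not visible from this notebook.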


## 5. Compare Models


In [None]:
# Store results for comparison
results = Dict(
    "Neural Network" => (nn_train_acc, nn_val_acc),
    "Logistic Regression (Flux)" => (lr_train_acc, lr_val_acc),
    "K-Nearest Neighbors" => (knn_train_acc, knn_val_acc)
)

# Compare models
best_model_name, best_val_acc = compare_models(results)



MODEL COMPARISON RESULTS
Model                      | Train Acc | Val Acc  |
---------------------------------------------------
K-Nearest Neighbors        |     99.3% |   100.0% |
Logistic Regression (Flux) |     97.4% |   100.0% |
Neural Network             |     55.3% |    49.3% |

🏆 Best Model: K-Nearest Neighbors (Validation Accuracy: 100.0%)
("K-Nearest Neighbors", 1.0)
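Two models tie at 100.0% validation accuracy in the table above, and the notebook does not show how `compare_models` resolves the tie. Because a Julia `Dict` does not guarantee an iteration order, a plain maximum over the dictionary could return either model. The sketch below shows one deterministic alternative: rank by validation accuracy, then by training accuracy, then by name. `pick_best` is a hypothetical function, and the accuracies are the rounded values from the table.

```julia
# Hypothetical stand-in for `compare_models`: rank by validation accuracy and
# break ties by training accuracy, then by name, so the result is deterministic.
function pick_best(results::AbstractDict)
    ranked = sort(collect(results); by = kv -> (-kv[2][2], -kv[2][1], kv[1]))
    best_name, (_, best_val) = first(ranked)
    return best_name, best_val
end

results_demo = Dict(
    "Neural Network" => (0.553, 0.493),
    "Logistic Regression (Flux)" => (0.974, 1.0),
    "K-Nearest Neighbors" => (0.993, 1.0),
)

println(pick_best(results_demo))   # ("K-Nearest Neighbors", 1.0)
```

With only 69 validation samples, a tie at 100% does not separate the two models, so the choice between them rests on the tie-breaking rule and not on the data.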


## 6. Sample Predictions


In [None]:
# Show sample predictions from best model
println("\nSample predictions from best model ($best_model_name):")

if best_model_name == "Neural Network"
    best_predictions = nn_val_pred
    best_true_labels = nn_val_true
elseif best_model_name == "Logistic Regression (Flux)"
    best_predictions = lr_val_pred
    best_true_labels = lr_val_true
else
    best_predictions = knn_val_pred
    best_true_labels = knn_val_true
end

for i in 1:min(5, length(best_predictions))
    pred_species = species_unique[best_predictions[i]]
    true_species = species_unique[best_true_labels[i]]
    correct = pred_species == true_species ? "✓" : "✗"
    println("  Sample $i: Predicted: $pred_species, Actual: $true_species $correct")
end

println("\nML training complete! 🐧")



Sample predictions from best model (K-Nearest Neighbors):
  Sample 1: Predicted: Gentoo, Actual: Gentoo ✓
  Sample 2: Predicted: Chinstrap, Actual: Chinstrap ✓
  Sample 3: Predicted: Gentoo, Actual: Gentoo ✓
  Sample 4: Predicted: Gentoo, Actual: Gentoo ✓
  Sample 5: Predicted: Gentoo, Actual: Gentoo ✓

ML training complete! 🐧
