# 🌲 Random Forest on MFCC Features → ONNX Export
**Dataset:** `mfcc_with_labels_balanced_2723_samples.csv`  
**Task:** Binary classification (already balanced, so no extra resampling is needed)  
**Goal:** Train, evaluate, save model, then export to ONNX safely.

## 1️⃣ Install Dependencies

In [None]:
# Install skl2onnx (sklearn → ONNX conversion) and onnxruntime (ONNX inference)
!pip install -q skl2onnx onnxruntime

## 2️⃣ Upload Your Dataset
Run this cell to upload `mfcc_with_labels_balanced_2723_samples.csv` from your computer.

In [None]:
from google.colab import files
uploaded = files.upload()  # Upload: mfcc_with_labels_balanced_2723_samples.csv

## 3️⃣ Load & Explore Data

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('mfcc_with_labels_balanced_2723_samples.csv')

print('Shape:', df.shape)
print('\nLabel distribution:')
print(df['label'].value_counts())
print('\nFirst 3 rows:')
df.head(3)
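
Before splitting, it can help to confirm that the table is clean, since the next step casts every feature column to `float32`. The sketch below is one possible check, not part of the original notebook: `summarize_dataset` is a hypothetical helper, and the tiny `demo` frame stands in for the real CSV.

```python
import numpy as np
import pandas as pd


def summarize_dataset(df, label_col='label'):
    """Collect basic sanity-check facts about a feature table (hypothetical helper)."""
    features = df.drop(columns=[label_col])
    # Columns that would not cast cleanly to float32
    non_numeric = [
        col for col in features.columns
        if not pd.api.types.is_numeric_dtype(features[col])
    ]
    return {
        'n_rows': len(df),
        'n_features': features.shape[1],
        'n_missing': int(df.isna().sum().sum()),
        'non_numeric_columns': non_numeric,
        'class_counts': df[label_col].value_counts().to_dict(),
    }


# Tiny made-up frame for illustration only (not the MFCC dataset)
demo = pd.DataFrame({
    'mfcc_0': [0.1, 0.4, np.nan, -0.3],
    'mfcc_1': [1.2, 0.8, 0.5, 0.9],
    'label': [0, 1, 0, 1],
})
summary = summarize_dataset(demo)
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_dataset(df)` right after loading the CSV.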

## 4️⃣ Prepare Features & Labels

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['label']).values.astype(np.float32)
y = df['label'].values

# Train/val/test split: 70% / 15% / 15%
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print(f'Train: {X_train.shape[0]} | Val: {X_val.shape[0]} | Test: {X_test.shape[0]}')
print(f'Features: {X_train.shape[1]}')
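
The two `stratify=` arguments are what keep the class ratio the same in all three splits. The following sketch shows one way to verify that on synthetic data; `class_fractions` is a hypothetical helper, and the random arrays are stand-ins for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def class_fractions(y):
    """Return {label: fraction of samples} for a 1-D label array (hypothetical helper)."""
    labels, counts = np.unique(y, return_counts=True)
    return {label.item(): count / len(y) for label, count in zip(labels, counts)}


# Synthetic stand-in: 200 samples, 13 features, perfectly balanced labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 13)).astype(np.float32)
y_demo = np.repeat([0, 1], 100)

# Same 70% / 15% / 15% scheme as the cell above
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

split_fractions = {
    name: class_fractions(y)
    for name, y in [('train', y_tr), ('val', y_va), ('test', y_te)]
}
for name, fractions in split_fractions.items():
    print(name, fractions)
```

Applied to the notebook's own arrays, `class_fractions(y_train)`, `class_fractions(y_val)` and `class_fractions(y_test)` should report nearly identical ratios.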

## 5️⃣ Train Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=None,         # grow full trees (let it learn)
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',    # standard for classification
    class_weight='balanced',  # near no-op on balanced data; guards against slight imbalance
    random_state=42,
    n_jobs=-1               # use all CPU cores
)

rf.fit(X_train, y_train)
print('✅ Training complete!')

## 6️⃣ Evaluate the Model

In [None]:
# --- Validation Set ---
y_val_pred = rf.predict(X_val)
print('=== Validation Set ===')
print(classification_report(y_val, y_val_pred, digits=4))

# --- Test Set ---
y_test_pred = rf.predict(X_test)
print('=== Test Set ===')
print(classification_report(y_test, y_test_pred, digits=4))

# Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title('Test Set Confusion Matrix')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()
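
The report and confusion matrix above score hard label predictions only. As an optional complement, a threshold-independent metric such as ROC AUC can be computed from `predict_proba`. The sketch below assumes synthetic data from `make_classification` and a small demo forest; it is not the notebook's trained model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the MFCC dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=13, n_informative=6, class_sep=2.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rf.fit(X_tr, y_tr)

# predict_proba columns follow demo_rf.classes_; pick the column for the positive class (label 1)
pos_col = list(demo_rf.classes_).index(1)
pos_scores = demo_rf.predict_proba(X_te)[:, pos_col]

demo_auc = roc_auc_score(y_te, pos_scores)
print(f'ROC AUC: {demo_auc:.4f}')
```

With the notebook's objects, the equivalent call would use `rf.predict_proba(X_test)` and `y_test`, provided the positive class is looked up in `rf.classes_` the same way.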

## 7️⃣ Feature Importance

In [None]:
feature_names = [f'mfcc_{i}' for i in range(X_train.shape[1])]
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)

plt.figure(figsize=(12, 4))
importances.plot(kind='bar')
plt.title('Feature Importances')
plt.ylabel('Importance')
plt.tight_layout()
plt.savefig('feature_importances.png', dpi=150)
plt.show()
print(importances.head(10))

## 8️⃣ Save the Sklearn Model (Pickle)
Always save the original sklearn model **before** ONNX conversion; it is your safe backup.

In [None]:
import joblib

joblib.dump(rf, 'random_forest_mfcc.pkl')
print('✅ Sklearn model saved → random_forest_mfcc.pkl')

# Verify it loads correctly
rf_loaded = joblib.load('random_forest_mfcc.pkl')
assert (rf_loaded.predict(X_test) == y_test_pred).all(), 'Model integrity check FAILED!'
print('✅ Integrity check passed: sklearn model is intact.')
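
A pickled sklearn model is only a reliable backup if it is later loaded with a compatible scikit-learn version, so recording the version next to the file can save trouble. The sketch below is one possible approach: `save_with_metadata` and the `.json` sidecar layout are assumptions made for illustration, and the demo model is trained on random data.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier


def save_with_metadata(model, model_path):
    """Dump the model with joblib and write a JSON sidecar describing it (hypothetical helper)."""
    model_path = Path(model_path)
    joblib.dump(model, model_path)
    metadata = {
        'sklearn_version': sklearn.__version__,
        'n_features_in': int(model.n_features_in_),
        'classes': [c.item() for c in model.classes_],
    }
    meta_path = model_path.with_suffix('.json')
    meta_path.write_text(json.dumps(metadata, indent=2))
    return meta_path


# Small demo model on random data (not the notebook's trained forest)
rng = np.random.default_rng(0)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit(rng.normal(size=(40, 13)), np.repeat([0, 1], 20))

with tempfile.TemporaryDirectory() as tmp_dir:
    meta_file = save_with_metadata(demo_model, Path(tmp_dir) / 'demo_rf.pkl')
    saved_meta = json.loads(meta_file.read_text())

print(saved_meta)
```

The recorded feature count also doubles as a quick check that the ONNX input shape matches what the forest was trained on.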

## 9️⃣ Export to ONNX
We use `skl2onnx`, which converts the sklearn model to ONNX **without** modifying the original.

In [None]:
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Reload fresh from disk just to be safe
rf_for_export = joblib.load('random_forest_mfcc.pkl')

n_features = X_train.shape[1]
initial_type = [('mfcc_input', FloatTensorType([None, n_features]))]

onnx_model = convert_sklearn(
    rf_for_export,
    initial_types=initial_type,
    target_opset=17,   # lower this if your onnxruntime build does not support opset 17
    options={id(rf_for_export): {'zipmap': False}}  # output plain numpy arrays, not dicts
)

with open('random_forest_mfcc.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

print('✅ ONNX model saved → random_forest_mfcc.onnx')

## 🔟 Validate ONNX Output Matches Sklearn
This is the critical step: it confirms the ONNX model predicts **identically** to the original.

In [None]:
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession('random_forest_mfcc.onnx')
input_name = sess.get_inputs()[0].name

# Run inference on test set
onnx_preds = sess.run(None, {input_name: X_test.astype(np.float32)})
onnx_labels = onnx_preds[0]          # predicted class
onnx_probs  = onnx_preds[1]          # probabilities [N, 2]

sklearn_labels = rf.predict(X_test)

match = (onnx_labels == sklearn_labels).mean() * 100
print(f'✅ ONNX vs Sklearn agreement: {match:.2f}%')

if match == 100.0:
    print('🎉 Perfect match: ONNX model is verified safe to deploy!')
else:
    print(f'⚠️  {100-match:.2f}% mismatch: check opset version or data types.')

# Peek at ONNX outputs
print('\nSample ONNX output labels:', onnx_labels[:10])
print('Sample ONNX probabilities:\n', onnx_probs[:5].round(4))
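
The check above compares predicted labels only. Because the ONNX graph takes `float32` input, its probabilities may differ slightly from sklearn's, so a tolerance-based comparison of the probability arrays can be a useful second check. The sketch below uses NumPy only; `compare_predictions` is a hypothetical helper, the `atol=1e-5` default is an arbitrary assumption, and the demo arrays are made up.

```python
import numpy as np


def compare_predictions(ref_labels, ref_probs, other_labels, other_probs, atol=1e-5):
    """Compare two sets of labels and class probabilities (hypothetical helper)."""
    ref_probs = np.asarray(ref_probs, dtype=np.float64)
    other_probs = np.asarray(other_probs, dtype=np.float64)
    max_diff = float(np.abs(ref_probs - other_probs).max())
    agreement = float(np.mean(np.asarray(ref_labels) == np.asarray(other_labels)))
    return {
        'label_agreement': agreement,      # fraction of identical labels
        'max_abs_prob_diff': max_diff,     # worst-case probability gap
        'probs_close': bool(max_diff <= atol),
    }


# Made-up reference probabilities and a float32 copy that mimics precision loss
ref_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
other_probs = ref_probs.astype(np.float32)
ref_labels = ref_probs.argmax(axis=1)
other_labels = other_probs.argmax(axis=1)

report = compare_predictions(ref_labels, ref_probs, other_labels, other_probs)
print(report)
```

In this notebook the call would be `compare_predictions(rf.predict(X_test), rf.predict_proba(X_test), onnx_labels, onnx_probs)`.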

## 1️⃣1️⃣ Download All Files

In [None]:
from google.colab import files

files.download('random_forest_mfcc.pkl')       # Original sklearn model
files.download('random_forest_mfcc.onnx')      # ONNX model for deployment
files.download('confusion_matrix.png')         # Evaluation plots
files.download('feature_importances.png')

## 1️⃣2️⃣ (Optional) Hyperparameter Tuning with RandomizedSearchCV
Run this if you want to squeeze out extra performance. It takes a few minutes on Colab.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators':      [100, 200, 300, 500],
    'max_depth':         [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':  [1, 2, 4],
    'max_features':      ['sqrt', 'log2', 0.5],
}

rf_search = RandomForestClassifier(class_weight='balanced', random_state=42, n_jobs=-1)

search = RandomizedSearchCV(
    rf_search,
    param_distributions=param_dist,
    n_iter=30,
    cv=5,
    scoring='f1_weighted',
    verbose=1,
    random_state=42,
    n_jobs=-1
)
search.fit(X_train, y_train)

print('Best params:', search.best_params_)
print('Best CV F1:', search.best_score_.round(4))

best_rf = search.best_estimator_
print('\nTest set report with best model:')
print(classification_report(y_test, best_rf.predict(X_test), digits=4))
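
To put the search cost in context, the arithmetic below counts how many settings the grid in `param_dist` contains and how many model fits the randomized search performs with `n_iter=30` and `cv=5`. It is plain Python and trains nothing; the variable names `cv_folds`, `n_combinations`, `n_fits` and `coverage` are introduced here for illustration.

```python
from math import prod

# Same search space as the tuning cell above
param_dist = {
    'n_estimators':      [100, 200, 300, 500],
    'max_depth':         [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':  [1, 2, 4],
    'max_features':      ['sqrt', 'log2', 0.5],
}
n_iter = 30
cv_folds = 5

# Size of the full grid an exhaustive search would have to cover
n_combinations = prod(len(values) for values in param_dist.values())

# Fits done by the randomized search (excluding the final refit on all training data)
n_fits = n_iter * cv_folds

# Share of the grid that the random sample actually tries
coverage = n_iter / n_combinations

print(f'Grid size: {n_combinations} combinations')
print(f'Cross-validation fits: {n_fits}')
print(f'Grid coverage: {coverage:.1%}')
```

Raising `n_iter` increases coverage linearly and increases run time in the same proportion.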