# Machine Learning Exercise Template

This notebook provides a template for machine learning exercises in the Advanced ML course.

**Exercise Name:** [Your Exercise Name]

**Date:** [Date]

**Objective:** [Brief description of the exercise objective]

## 1. Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

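# Import course utility functions from the parent directory
# (e.g. compare_models, used in section 6, and the optional preprocessing helpers in section 4)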
import sys
sys.path.append('..')
from utils import *

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## 2. Load Data

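Load your dataset here. pandas can read CSV, Excel, and other formats. Until you point it at a real file, the cell below builds a synthetic classification dataset so the template runs end to end.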

In [None]:
# Load data
# df = pd.read_csv('../data/your_dataset.csv')

# For demonstration, create a sample dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, random_state=RANDOM_STATE)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

print(f"Dataset shape: {df.shape}")
df.head()
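
If your exercises mix file formats, a small dispatcher keyed on the file extension can keep the loading cell uniform. The sketch below is one possible approach, not part of the course utilities: `load_dataset` and the `_READERS` table are hypothetical names, and the Excel and Parquet readers assume the optional pandas dependencies (such as `openpyxl` or `pyarrow`) are installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical lookup table: file extension -> pandas reader function.
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,      # requires an Excel engine such as openpyxl
    ".xls": pd.read_excel,
    ".parquet": pd.read_parquet,  # requires pyarrow or fastparquet
    ".json": pd.read_json,
}


def load_dataset(path, **kwargs):
    """Load a tabular file into a DataFrame, choosing the reader by extension.

    Extra keyword arguments are forwarded to the underlying pandas reader.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return _READERS[suffix](path, **kwargs)
```

With this in place, the commented-out line above would become something like `df = load_dataset('../data/your_dataset.csv')`.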

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Basic information
print("Dataset Info:")
df.info()  # prints its summary directly and returns None, so it needs no print() wrapper
print("\nBasic Statistics:")
df.describe()

In [None]:
# Check for missing values
missing = df.isnull().sum()
if missing.sum() > 0:
    print("Missing Values:")
    print(missing[missing > 0])
else:
    print("No missing values found!")

In [None]:
# Target variable distribution
plt.figure(figsize=(8, 5))
df['target'].value_counts().plot(kind='bar')
plt.title('Target Variable Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(12, 10))
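# numeric_only=True (pandas >= 1.5) skips non-numeric columns, which would otherwise raise in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap='coolwarm', center=0)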
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

## 4. Data Preprocessing and Feature Engineering

In [None]:
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Handle missing values (if any)
# X = handle_missing_values(X, strategy='mean')

# Encode categorical features (if any)
# X = encode_categorical_features(X, columns=['cat_col1', 'cat_col2'], method='onehot')

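# Scale features
# Note: fitting the scaler on the full dataset before the split in section 5
# lets test-set statistics leak into training. For a rigorous evaluation,
# fit the scaler on the training split only.
scaler = StandardScaler()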
X_scaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns,
    index=X.index
)

print("Features preprocessed successfully!")
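
The commented-out helpers above come from the course `utils` module, whose implementation is not shown here. As a rough equivalent built only from scikit-learn, mean imputation, one-hot encoding, and scaling can be combined in a `ColumnTransformer`. This is a sketch under assumptions: the toy frame and the column names `num_col` and `cat_col` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names: one numeric gap, one categorical column.
toy = pd.DataFrame({
    "num_col": [1.0, np.nan, 3.0, 4.0],
    "cat_col": ["a", "b", "a", "c"],
})

# Numeric columns: fill gaps with the column mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Route each column group to its own transformation.
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["num_col"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_col"]),
])

toy_processed = preprocessor.fit_transform(toy)
print(toy_processed.shape)  # one scaled numeric column plus three one-hot columns
```

Because the transformer is a single estimator, it can later be fitted on the training split alone and reused on the test split.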

## 5. Train-Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
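
The template scales the full feature matrix in section 4 and splits afterwards, which is convenient but lets the scaler see test rows. A leak-free ordering splits first and fits the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data; the `_demo`, `_tr`, and `_te` names are placeholders chosen to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's dataset.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# 1. Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. Learn the scaling statistics from the training rows only.
scaler_demo = StandardScaler().fit(X_tr)

# 3. Apply the same statistics to both splits.
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(np.round(X_tr_scaled.mean(axis=0), 6))  # training means are ~0 by construction
print(np.round(X_te_scaled.mean(axis=0), 3))  # test means are close to, but not exactly, 0
```

The test-set means differ slightly from zero because the test rows did not contribute to the fitted statistics, which is the intended behavior.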

## 6. Model Training

In [None]:
# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=RANDOM_STATE, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=RANDOM_STATE),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
}

In [None]:
# Train and compare models
results = compare_models(models, X_train, y_train, X_test, y_test, task='classification')
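
`compare_models` is provided by the course `utils` module and its implementation is not shown in this notebook. If you want to see what such a helper might do, the following is a minimal stand-in for the classification case. It is an assumption about the behavior, not the actual utility: `compare_models_sketch` is a hypothetical name, and it reports only train and test accuracy.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models_sketch(models, X_train, y_train, X_test, y_test):
    """Fit each model and return a table of accuracies, best test score first."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        rows.append({
            "model": name,
            "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
            "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
        })
    table = pd.DataFrame(rows, columns=["model", "train_accuracy", "test_accuracy"])
    return table.sort_values("test_accuracy", ascending=False).reset_index(drop=True)


# Small synthetic demonstration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
comparison = compare_models_sketch(demo_models, X_tr, y_tr, X_te, y_te)
print(comparison)
```

Reporting train and test accuracy side by side makes overfitting easy to spot: an unpruned decision tree typically reaches near-perfect training accuracy while its test accuracy lags.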

## 7. Model Tuning

In [None]:
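# Define hyperparameter grid. This example assumes Random Forest scored best in
# section 6; swap in the estimator and grid that match your own results.
# The grid has 3 x 4 x 3 = 36 combinations, so cv=5 runs 180 fits plus a final refit.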
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
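
Beyond the single best score, `GridSearchCV` stores per-combination results in its `cv_results_` attribute, which is useful for checking how close the runners-up are. The sketch below shows one way to tabulate them; it uses a deliberately small grid and synthetic data so it runs quickly, and `small_search` and `cv_table` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# A reduced grid (2 x 2 = 4 combinations) to keep the example fast.
small_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
small_search.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
cv_table = (
    pd.DataFrame(small_search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

If several combinations fall within one standard deviation of the top score, the simpler or cheaper configuration is often a reasonable choice.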

## 8. Final Model Evaluation

In [None]:
# Get best model
best_model = grid_search.best_estimator_

# Predictions
y_pred = best_model.predict(X_test)

# Evaluation (accuracy_score and classification_report were imported in section 1)
from sklearn.metrics import confusion_matrix

print("Test Set Performance:")
print("="*50)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

In [None]:
# Visualize confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test, ax=ax, cmap='Blues')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

## 9. Feature Importance (if applicable)

In [None]:
# For tree-based models, visualize feature importance
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    top_features = feature_importance.head(10)  # the ten highest-ranked features
    plt.figure(figsize=(10, 6))
    plt.barh(top_features['feature'], top_features['importance'])
    plt.xlabel('Importance')
    plt.title('Top 10 Feature Importances')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    print("Top Features:")
    print(feature_importance.head(10))
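
When the best model has no `feature_importances_` attribute (for example, logistic regression), permutation importance from `sklearn.inspection` is one model-agnostic alternative: it measures how much the score drops when a feature's values are shuffled. The sketch below is an optional illustration on synthetic data, with `linear_model` and `perm_table` as placeholder names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A model without feature_importances_.
linear_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the accuracy drop.
perm = permutation_importance(
    linear_model, X_te, y_te, n_repeats=10, random_state=42, scoring="accuracy"
)

perm_table = (
    pd.DataFrame({
        "feature": feature_names,
        "importance_mean": perm.importances_mean,
        "importance_std": perm.importances_std,
    })
    .sort_values("importance_mean", ascending=False)
    .reset_index(drop=True)
)
print(perm_table)
```

Values near zero (or slightly negative) indicate features whose shuffling did not hurt held-out accuracy.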

## 10. Save Model

In [None]:
# Save the trained model (uncomment to use; joblib.dump fails if the target directory does not exist)
# import joblib
# joblib.dump(best_model, '../models/exercise_name/best_model.pkl')
# print("Model saved successfully!")
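
A quick round trip is a cheap way to confirm that a saved model restores correctly. The sketch below creates the target directory first, dumps a small model with `joblib`, reloads it, and compares predictions. It writes to a temporary directory rather than the notebook's `../models/` path, and `demo_model` stands in for `best_model`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "models" / "best_model.joblib"
    # Create missing parent directories before writing.
    model_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(demo_model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Saved models are generally only safe to reload with the same scikit-learn version that produced them, so it is worth recording the version alongside the file.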

## 11. Conclusions

**Key Findings:**
- [Finding 1]
- [Finding 2]
- [Finding 3]

**Model Performance:**
- [Summary of best model performance]

**Next Steps:**
- [Potential improvements]
- [Additional experiments to try]