# 🤖 Notebook 05: Machine Learning with scikit-learn

**Week 1-2: Python & ML Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

By the end of this notebook, you will master:
1. ✅ ML fundamentals and workflow
2. ✅ Classification algorithms
3. ✅ Regression models
4. ✅ Model evaluation metrics
5. ✅ Cross-validation and hyperparameter tuning
6. ✅ Building a complete ML pipeline

**Estimated Time:** 3-4 hours

---

## 📚 Why scikit-learn?

scikit-learn is one of the most widely used Python libraries for classical ML:
- 🎯 **Simple API**: Easy to learn and use
- 🔧 **Complete Toolkit**: Preprocessing, models, evaluation
- 📊 **Industry Standard**: Used in production worldwide
- 🚀 **Fast Prototyping**: Test ideas quickly

Let's build our first ML models! 🚀

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score

# Suppress FutureWarning noise (e.g., from seaborn/pandas); keep other warnings visible
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries imported successfully!")

## 1️⃣ Understanding Machine Learning

### ML Workflow

```
1. Data Collection → 2. Data Preparation → 3. Model Selection
                                ↓
6. Deployment ← 5. Evaluation ← 4. Training
```

### Types of ML
- **Supervised**: Learn from labeled data (Classification, Regression)
- **Unsupervised**: Find patterns in unlabeled data (Clustering)
- **Reinforcement**: Learn from rewards/penalties
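
In scikit-learn terms, the practical difference between the first two types shows up in `fit`: a supervised estimator receives features and labels, while an unsupervised one receives features only. The sketch below uses made-up toy data; the arrays and the choice of `KMeans` as the clustering model are illustrative assumptions, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: two well-separated groups of 1-D points (illustrative values only)
X_toy = np.array([[1.0], [1.2], [0.8], [9.0], [9.3], [8.7]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Supervised: fit() sees both features and labels
clf = LogisticRegression().fit(X_toy, y_toy)
supervised_pred = clf.predict([[1.1], [9.1]])

# Unsupervised: fit() sees features only and assigns its own group ids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = km.labels_

print("Supervised predictions:", supervised_pred)
print("Cluster ids (numbering is arbitrary):", cluster_ids)
```

The cluster ids carry no meaning beyond "same group" or "different group", which is why clustering results cannot be scored with plain accuracy against labels.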

Today we'll focus on **Supervised Learning**!

## 2️⃣ Classification: Quality Defect Prediction

### Load and Prepare Data

In [None]:
# Create manufacturing quality dataset
np.random.seed(42)
n_samples = 500

# Features: temperature, pressure, humidity, vibration
temperature = np.random.normal(75, 10, n_samples)
pressure = np.random.normal(120, 15, n_samples)
humidity = np.random.normal(50, 10, n_samples)
vibration = np.random.normal(0.5, 0.2, n_samples)

# Target: defective (1) or not (0)
# Higher temperature, extreme pressure, high vibration → more defects
defect_probability = (
    0.1 + 
    0.01 * (temperature - 75) + 
    0.005 * abs(pressure - 120) + 
    0.2 * vibration
)
defect_probability = np.clip(defect_probability, 0, 1)
is_defective = (np.random.random(n_samples) < defect_probability).astype(int)

# Create DataFrame
df = pd.DataFrame({
    'temperature': temperature,
    'pressure': pressure,
    'humidity': humidity,
    'vibration': vibration,
    'is_defective': is_defective
})

print("📊 Manufacturing Quality Dataset")
print(f"Total samples: {len(df)}")
print(f"Defective: {df['is_defective'].sum()} ({df['is_defective'].mean():.1%})")
print(f"Non-defective: {(1-df['is_defective']).sum()}")
print("\nFirst 5 rows:")
print(df.head())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Class distribution (sort_index keeps the bar order 0, 1 so it matches the tick labels below)
df['is_defective'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Class Distribution', fontweight='bold')
axes[0].set_xlabel('Is Defective')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No', 'Yes'], rotation=0)

# Feature relationship
for defect in [0, 1]:
    mask = df['is_defective'] == defect
    axes[1].scatter(df[mask]['temperature'], df[mask]['vibration'], 
                   label=f'Defective={defect}', alpha=0.6, s=50)
axes[1].set_title('Temperature vs Vibration', fontweight='bold')
axes[1].set_xlabel('Temperature (°C)')
axes[1].set_ylabel('Vibration')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Train-Test Split and Preprocessing

In [None]:
# Separate features and target
X = df[['temperature', 'pressure', 'humidity', 'vibration']]
y = df['is_defective']

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("📊 Data Split:")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nTraining set defect rate: {y_train.mean():.1%}")
print(f"Test set defect rate: {y_test.mean():.1%}")

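# Feature scaling: matters for distance- and gradient-based models (KNN, SVM, logistic regression).
# Fit the scaler on the training set only, then reuse it on the test set to avoid data leakage.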
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n✅ Data preprocessed and ready for training!")
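
The cell above calls `fit_transform` on the training split and only `transform` on the test split. A minimal sketch of what that means, using a hypothetical single feature (`X_demo` and the other names here are illustrative): the scaler's statistics come from the training rows alone, so the scaled test set is close to, but not exactly, mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=75, scale=10, size=(200, 1))  # hypothetical single feature

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler_demo = StandardScaler().fit(X_tr)  # statistics learned from training rows only
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)      # test rows reuse the training mean/std

print(f"Scaler mean: {scaler_demo.mean_[0]:.2f} (train mean: {X_tr.mean():.2f})")
print(f"Scaled train mean/std: {X_tr_s.mean():.3f} / {X_tr_s.std():.3f}")
print(f"Scaled test mean/std:  {X_te_s.mean():.3f} / {X_te_s.std():.3f}")
```

Fitting the scaler on the full dataset instead would let test-set statistics influence training, which tends to make test scores look slightly better than they should.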

### Model 1: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Train model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred_train = log_reg.predict(X_train_scaled)
y_pred_test = log_reg.predict(X_test_scaled)

# Evaluate
train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("🎯 Logistic Regression Results")
print("=" * 50)
print(f"Training Accuracy: {train_acc:.2%}")
print(f"Test Accuracy: {test_acc:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_test, target_names=['Non-defective', 'Defective']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Non-defective', 'Defective'],
            yticklabels=['Non-defective', 'Defective'])
plt.title('Confusion Matrix - Logistic Regression', fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
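
The precision, recall, and F1 values in the classification report can all be derived from the four cells of the confusion matrix. The sketch below works this out on a small hand-made label vector (the `y_true_demo` and `y_pred_demo` arrays are illustrative, not model output from this notebook) and checks the result against scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the predicted defects, how many were real
recall = tp / (tp + fn)     # of the real defects, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, accuracy={accuracy:.2f}")

# Cross-check against scikit-learn's implementations
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

In a defect-detection setting, recall on the defective class is often the number to watch, since a missed defect (false negative) usually costs more than a false alarm.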

### Model 2: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
rf_clf.fit(X_train_scaled, y_train)

# Predictions
y_pred_rf = rf_clf.predict(X_test_scaled)

# Evaluate
rf_acc = accuracy_score(y_test, y_pred_rf)

print("🌳 Random Forest Results")
print("=" * 50)
print(f"Test Accuracy: {rf_acc:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Non-defective', 'Defective']))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_clf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature', palette='viridis')
plt.title('Feature Importance - Random Forest', fontweight='bold')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

print("\n📊 Feature Importance:")
print(feature_importance)
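
`feature_importances_` is an impurity-based measure computed from the training data, so it is worth cross-checking against something measured on held-out data. One common option is permutation importance, sketched below on a synthetic dataset from `make_classification` (an assumption for illustration; with `shuffle=False` the first two columns are the informative ones and the last two are noise).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: columns 0-1 informative, columns 2-3 pure noise
X_pi, y_pi = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0, stratify=y_pi
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

for i in range(X_pi.shape[1]):
    print(f"feature {i}: impurity={forest.feature_importances_[i]:.3f}, "
          f"permutation={result.importances_mean[i]:.3f} "
          f"(+/- {result.importances_std[i]:.3f})")
```

Impurity-based importances sum to 1 by construction, whereas permutation importances are score drops and can be near zero or slightly negative for features the model does not rely on.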

### Model Comparison

In [None]:
# Compare multiple models
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

results = []

for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = model.predict(X_test_scaled)
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'Accuracy': accuracy
    })

results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)

print("🏆 Model Comparison")
print("=" * 50)
print(results_df.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
colors = ['gold', 'silver', '#CD7F32', 'lightblue', 'lightgreen']
sns.barplot(data=results_df, y='Model', x='Accuracy', palette=colors)
plt.title('Model Performance Comparison', fontweight='bold', fontsize=14)
plt.xlabel('Accuracy', fontweight='bold')
plt.ylabel('Model', fontweight='bold')
plt.xlim(0.5, 1.0)
plt.axvline(x=0.9, color='red', linestyle='--', label='90% threshold')
plt.legend()
plt.tight_layout()
plt.show()
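
Accuracy figures like these are easier to interpret next to a trivial baseline, because with imbalanced classes a model that always predicts the majority class can already score well. The sketch below uses a made-up 75/25 label split (an assumption; the actual defect rate is whatever the dataset cell printed) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 75 non-defective, 25 defective
y_imb = np.array([0] * 75 + [1] * 25)
X_imb = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_imb, y_imb)
baseline_acc = baseline.score(X_imb, y_imb)

print(f"Majority-class baseline accuracy: {baseline_acc:.2%}")
```

A real model is only adding value to the extent that it beats this number, so adding the baseline as an extra row in a comparison table is a cheap sanity check.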

## 3️⃣ Regression: Production Time Prediction

### Create Regression Dataset

In [None]:
# Create production time prediction dataset
np.random.seed(42)
n_samples = 300

# Features
complexity = np.random.uniform(1, 10, n_samples)
material_quality = np.random.uniform(5, 10, n_samples)
worker_experience = np.random.uniform(1, 15, n_samples)
machine_age = np.random.uniform(0, 20, n_samples)

# Target: production time (in minutes)
# More complex, older machine, less experience → longer time
production_time = (
    20 + 
    5 * complexity + 
    2 * machine_age - 
    3 * worker_experience - 
    1 * material_quality + 
    np.random.normal(0, 5, n_samples)
)

# Create DataFrame
df_reg = pd.DataFrame({
    'complexity': complexity,
    'material_quality': material_quality,
    'worker_experience': worker_experience,
    'machine_age': machine_age,
    'production_time': production_time
})

print("📊 Production Time Prediction Dataset")
print(df_reg.describe())

# Visualize relationships
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
features = ['complexity', 'material_quality', 'worker_experience', 'machine_age']

for idx, feature in enumerate(features):
    row = idx // 2
    col = idx % 2
    axes[row, col].scatter(df_reg[feature], df_reg['production_time'], alpha=0.6)
    axes[row, col].set_xlabel(feature.replace('_', ' ').title(), fontweight='bold')
    axes[row, col].set_ylabel('Production Time (min)', fontweight='bold')
    axes[row, col].set_title(f'{feature.replace("_", " ").title()} vs Production Time', 
                            fontweight='bold')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
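
Because the target is generated from a known linear formula, a linear model fitted on the unscaled features should recover coefficients close to the ones used above (5, -1, -3, and 2 for complexity, material quality, worker experience, and machine age). The sketch below regenerates data with the same formula but its own random generator and a larger sample (both are assumptions made to keep the estimates tight), so the numbers will not match the notebook's `df_reg` exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000  # larger than the notebook's 300 so the estimates are tighter

complexity_d = rng.uniform(1, 10, n)
material_d = rng.uniform(5, 10, n)
experience_d = rng.uniform(1, 15, n)
age_d = rng.uniform(0, 20, n)

# Same generating formula as the dataset cell above
time_d = (20 + 5 * complexity_d + 2 * age_d
          - 3 * experience_d - 1 * material_d
          + rng.normal(0, 5, n))

X_d = np.column_stack([complexity_d, material_d, experience_d, age_d])
lin = LinearRegression().fit(X_d, time_d)

names = ['complexity', 'material_quality', 'worker_experience', 'machine_age']
for name, coef in zip(names, lin.coef_):
    print(f"{name:>18}: {coef:+.2f}")
print(f"{'intercept':>18}: {lin.intercept_:+.2f}")
```

After `StandardScaler`, the coefficients are expressed per standard deviation of each feature instead, so they will no longer match the formula directly.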

### Train Regression Models

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Prepare data
X_reg = df_reg[['complexity', 'material_quality', 'worker_experience', 'machine_age']]
y_reg = df_reg['production_time']

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

# Train models
regression_models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

regression_results = []

for name, model in regression_models.items():
    # Train
    model.fit(X_train_reg_scaled, y_train_reg)
    
    # Predict
    y_pred_train = model.predict(X_train_reg_scaled)
    y_pred_test = model.predict(X_test_reg_scaled)
    
    # Evaluate
    train_r2 = r2_score(y_train_reg, y_pred_train)
    test_r2 = r2_score(y_test_reg, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_test))
    
    regression_results.append({
        'Model': name,
        'Train R²': train_r2,
        'Test R²': test_r2,
        'RMSE': test_rmse
    })

reg_results_df = pd.DataFrame(regression_results).sort_values('Test R²', ascending=False)

print("📊 Regression Model Comparison")
print("=" * 70)
print(reg_results_df.to_string(index=False))

# Visualize predictions vs actual for the model with the highest test R²
best_name = reg_results_df.iloc[0]['Model']
best_model = regression_models[best_name]
y_pred_best = best_model.predict(X_test_reg_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(y_test_reg, y_pred_best, alpha=0.6, edgecolors='black')
plt.plot([y_test_reg.min(), y_test_reg.max()], 
         [y_test_reg.min(), y_test_reg.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Production Time (min)', fontweight='bold')
plt.ylabel('Predicted Production Time (min)', fontweight='bold')
plt.title(f'Actual vs Predicted Production Time\n({best_name})', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
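
The two regression metrics used above follow directly from the residuals: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A small worked example on hand-picked numbers (the arrays are illustrative), checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_r = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_r = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true_r - y_pred_r
ss_res = np.sum(residuals ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true_r - y_true_r.mean()) ** 2)  # total sum of squares

rmse_manual = np.sqrt(ss_res / len(y_true_r))
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: manual={rmse_manual:.4f}, "
      f"sklearn={np.sqrt(mean_squared_error(y_true_r, y_pred_r)):.4f}")
print(f"R²:   manual={r2_manual:.4f}, sklearn={r2_score(y_true_r, y_pred_r):.4f}")
```

RMSE is in the same units as the target (minutes here), while R² is unitless and compares the model against always predicting the mean.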

## 4️⃣ Cross-Validation and Hyperparameter Tuning

### Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation
model = RandomForestClassifier(n_estimators=100, random_state=42)

cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')

print("🔄 5-Fold Cross-Validation Results")
print("=" * 50)
print(f"Fold scores: {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.2%}")
print(f"Std Deviation: {cv_scores.std():.4f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), cv_scores, marker='o', markersize=10, linewidth=2)
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', 
            label=f'Mean: {cv_scores.mean():.2%}')
plt.fill_between(range(1, 6), 
                 cv_scores.mean() - cv_scores.std(), 
                 cv_scores.mean() + cv_scores.std(), 
                 alpha=0.2, color='blue')
plt.xlabel('Fold', fontweight='bold')
plt.ylabel('Accuracy', fontweight='bold')
plt.title('Cross-Validation Scores', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 6))
plt.tight_layout()
plt.show()
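
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds, so each validation fold keeps roughly the same class ratio as the full training set. The sketch below makes the splitting visible with `StratifiedKFold` on made-up labels (80 negatives and 20 positives, an illustrative assumption).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 80 non-defective, 20 defective
y_cv = np.array([0] * 80 + [1] * 20)
X_cv = np.arange(100).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5)

fold_summaries = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_cv, y_cv), start=1):
    n_pos = int(y_cv[val_idx].sum())
    fold_summaries.append((len(train_idx), len(val_idx), n_pos))
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"defective in val={n_pos}")
```

Every sample lands in exactly one validation fold, which is why the mean of the fold scores is a more stable estimate than a single train/test split.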

### Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

# Create GridSearchCV object
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf, param_grid, cv=5, scoring='accuracy', 
    n_jobs=-1, verbose=1
)

print("🔍 Performing Grid Search...")
grid_search.fit(X_train_scaled, y_train)

print("\n✅ Grid Search Complete!")
print("=" * 50)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2%}")

# Test best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
best_test_acc = accuracy_score(y_test, y_pred_best)

print(f"Test accuracy with best model: {best_test_acc:.2%}")

# Compare with default model
default_model = RandomForestClassifier(random_state=42)
default_model.fit(X_train_scaled, y_train)
y_pred_default = default_model.predict(X_test_scaled)
default_acc = accuracy_score(y_test, y_pred_default)

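print("\n📊 Comparison:")
print(f"Default model accuracy: {default_acc:.2%}")
print(f"Tuned model accuracy: {best_test_acc:.2%}")
# The difference can be negative: a higher CV score does not guarantee a higher test score
print(f"Difference (tuned - default): {(best_test_acc - default_acc):+.2%}")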
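
Grid search cost grows multiplicatively: the grid above has 3 × 4 × 3 = 36 parameter combinations, and with `cv=5` that means 180 model fits plus one final refit. The sketch below counts the candidates with `ParameterGrid` and then runs a much smaller search on synthetic data to show how `cv_results_` can be inspected; the small grid and the `make_classification` data are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in the cell above
param_grid_full = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}
n_candidates = len(ParameterGrid(param_grid_full))
n_fits = n_candidates * 5  # 5-fold CV, excluding the final refit
print(f"Candidates: {n_candidates}, fits with cv=5: {n_fits}")

# A deliberately small search to keep the example fast
X_gs, y_gs = make_classification(n_samples=200, n_features=4, random_state=0)
small_grid = {'max_depth': [3, None], 'min_samples_split': [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    small_grid, cv=3, scoring='accuracy'
).fit(X_gs, y_gs)

# One row per candidate, ranked by mean validation score
cv_table = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(cv_table.to_string(index=False))
```

Looking at the whole table rather than only `best_params_` shows whether the winner is clearly ahead or within noise of the other candidates.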

## 5️⃣ Complete ML Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred_pipeline = pipeline.predict(X_test)
pipeline_acc = accuracy_score(y_test, y_pred_pipeline)

print("🔧 ML Pipeline Results")
print("=" * 50)
print(f"Accuracy: {pipeline_acc:.2%}")

# Make predictions on new data
new_data = pd.DataFrame({
    'temperature': [80, 70, 85],
    'pressure': [130, 115, 140],
    'humidity': [55, 45, 60],
    'vibration': [0.8, 0.3, 1.0]
})

predictions = pipeline.predict(new_data)
probabilities = pipeline.predict_proba(new_data)

print("\n🔮 Predictions on New Data:")
for i, (pred, prob) in enumerate(zip(predictions, probabilities), 1):
    status = "DEFECTIVE" if pred == 1 else "NON-DEFECTIVE"
    confidence = prob[pred] * 100
    print(f"Sample {i}: {status} (confidence: {confidence:.1f}%)")
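
The loop above indexes each probability row with the predicted label, which works because the classes are exactly 0 and 1. More generally, the columns of `predict_proba` follow the order of the final estimator's `classes_` attribute. A minimal sketch on synthetic data (the `pipe_demo` pipeline and `make_classification` data are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_p, y_p = make_classification(n_samples=200, n_features=4, random_state=0)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=0)),
]).fit(X_p, y_p)

proba = pipe_demo.predict_proba(X_p[:5])
classes = pipe_demo.named_steps['classifier'].classes_

# Column j of proba is the probability of classes[j]
pred_from_proba = classes[np.argmax(proba, axis=1)]

print("classes_:", classes)
print("probabilities:\n", proba.round(2))
print("argmax labels:", pred_from_proba)
print("predict():    ", pipe_demo.predict(X_p[:5]))
```

Mapping through `classes_` keeps the code correct if the labels are later changed to strings such as `'ok'` and `'defect'`.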

## 6️⃣ Model Persistence

In [None]:
import joblib

# Save model
model_filename = 'quality_defect_classifier.pkl'
joblib.dump(pipeline, model_filename)
print(f"✅ Model saved as {model_filename}")

# Load model
loaded_model = joblib.load(model_filename)
print("✅ Model loaded successfully")

# Test loaded model
test_prediction = loaded_model.predict(new_data)
print(f"\n🔮 Test prediction with loaded model: {test_prediction}")

print("\n💾 The saved pipeline can now be reused or packaged for deployment!")
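
Before relying on a saved model, it is worth confirming that the reloaded object behaves identically to the original. The sketch below does a save/load round trip in a temporary directory and compares predictions; the file name, the small `LogisticRegression` pipeline, and the synthetic data are illustrative assumptions. It also prints the scikit-learn version, since persisted models are generally only expected to load reliably under the library version that created them.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_j, y_j = make_classification(n_samples=100, n_features=4, random_state=0)

pipe_j = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
]).fit(X_j, y_j)

# Round trip through a temporary file so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.joblib')
    joblib.dump(pipe_j, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(pipe_j.predict(X_j), restored.predict(X_j))

print(f"Predictions identical after reload: {same_predictions}")
print(f"Saved with scikit-learn {sklearn.__version__}")
```

Only load model files from sources you trust: joblib files are pickle-based, and unpickling can execute arbitrary code.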

## 🎉 Summary

Congratulations! You've mastered classical machine learning with scikit-learn!

### Key Concepts Learned

#### Classification
- ✅ Logistic Regression
- ✅ Decision Trees & Random Forests
- ✅ SVM and KNN
- ✅ Model evaluation (accuracy, precision, recall, F1)
- ✅ Confusion matrices

#### Regression
- ✅ Linear, Ridge, and Lasso Regression
- ✅ Random Forest Regressor
- ✅ Evaluation metrics (R², RMSE)

#### Best Practices
- ✅ Train-test split
- ✅ Feature scaling
- ✅ Cross-validation
- ✅ Hyperparameter tuning
- ✅ ML pipelines
- ✅ Model persistence

### ML Workflow Checklist
1. ✅ Load and explore data
2. ✅ Split into train/test sets
3. ✅ Preprocess and scale features (fit the scaler on the training set only)
4. ✅ Train multiple models
5. ✅ Evaluate and compare
6. ✅ Tune hyperparameters
7. ✅ Create production pipeline
8. ✅ Save model for deployment

---

### 📚 Week 1-2 Complete!

You've completed:
- ✅ Environment setup
- ✅ Python essentials
- ✅ NumPy & Pandas
- ✅ Data visualization
- ✅ Machine learning fundamentals

**Next:** Complete the **Week 1-2 Homework** to apply everything you've learned!

<div align="center">
<b>🎓 Foundation Complete! Ready for Deep Learning! 🚀</b>
</div>