# Fundamentals of Machine Learning

## Learning Objectives
- Understand data components (features, labels, datasets)
- Learn about the bias-variance tradeoff
- Understand underfitting vs overfitting
- Master train-validation-test splits and cross-validation

## Understanding Data

### Key Components
- **Features (X)**: Input variables used to make predictions
- **Labels (y)**: Target variable we want to predict
- **Training Set**: Data used to train the model
- **Test Set**: Data used to evaluate final model performance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression, make_classification

# Example dataset
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

# Create DataFrame for better visualization
df = pd.DataFrame(X, columns=['Feature_1', 'Feature_2'])
df['Target'] = y

print("Dataset Structure:")
print(f"Features (X): {X.shape}")
print(f"Labels (y): {y.shape}")
print("\nFirst 5 rows:")
print(df.head())

## Bias vs. Variance

### Bias
- **Definition**: Error due to overly simplistic assumptions
- **High Bias**: Model consistently misses relevant patterns (underfitting)
- **Low Bias**: Model captures underlying patterns well

### Variance
- **Definition**: Error due to sensitivity to small fluctuations in training data
- **High Variance**: Model varies significantly with different training sets (overfitting)
- **Low Variance**: Model is consistent across different training sets

### The Tradeoff
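Expected Error (squared loss) = Bias² + Variance + Irreducible Error

More complex models tend to have lower bias but higher variance, and simpler models the reverse. Irreducible error comes from noise in the data and cannot be reduced by any model.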
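
The decomposition above can be checked empirically. The following is a minimal sketch rather than part of the original notebook: it assumes the same noise-free signal as the polynomial demo in this section, and the helper names `true_fn` and `estimate_bias_variance` are hypothetical. It refits a polynomial on many fresh noisy samples, then uses the spread and the average of the predictions to approximate variance and squared bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fn(x):
    # Noise-free signal (assumed to match the polynomial demo in this section)
    return 1.5 * x + 0.5 * np.sin(2 * np.pi * x)


def estimate_bias_variance(degree, n_repeats=200, n_samples=50, noise_sd=0.1, seed=0):
    """Approximate squared bias and variance of a polynomial fit by resampling the noise."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_samples).reshape(-1, 1)
    x_eval = np.linspace(0.05, 0.95, 40).reshape(-1, 1)
    preds = np.empty((n_repeats, len(x_eval)))
    for r in range(n_repeats):
        # Each repeat draws a new noisy training set from the same signal
        y_train = true_fn(x_train.ravel()) + rng.normal(0, noise_sd, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[r] = model.fit(x_train, y_train).predict(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_eval.ravel())) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


bv_results = {d: estimate_bias_variance(d) for d in (1, 4, 15)}
for d, (bias_sq, variance) in bv_results.items():
    print(f"degree {d:>2}: bias^2 = {bias_sq:.5f}, variance = {variance:.5f}")
```

In this setup the degree-1 model should show the largest squared bias and the degree-15 model the largest variance, which is the pattern the tradeoff describes.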

In [None]:
# Visualizing Bias-Variance Tradeoff
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Generate simple dataset
np.random.seed(42)
X_simple = np.linspace(0, 1, 50).reshape(-1, 1)
y_true = 1.5 * X_simple.ravel() + 0.5 * np.sin(2 * np.pi * X_simple.ravel())
y_simple = y_true + np.random.normal(0, 0.1, X_simple.shape[0])

# Different model complexities
degrees = [1, 4, 15]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, degree in enumerate(degrees):
    # Fit polynomial model
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    poly_model.fit(X_simple, y_simple)
    
    # Predictions
    X_plot = np.linspace(0, 1, 100).reshape(-1, 1)
    y_pred = poly_model.predict(X_plot)
    
    # Plot
    axes[i].scatter(X_simple, y_simple, alpha=0.6, label='Data')
    axes[i].plot(X_plot, y_pred, 'r-', label=f'Degree {degree}')
    axes[i].set_title(f'Polynomial Degree {degree}\n' + 
                     ('High Bias' if degree == 1 else 
                      'Good Fit' if degree == 4 else 'High Variance'))
    axes[i].legend()
    axes[i].set_xlabel('X')
    axes[i].set_ylabel('y')

plt.tight_layout()
plt.show()

## Underfitting vs. Overfitting

### Underfitting (High Bias)
- Model is too simple to capture underlying patterns
- Poor performance on both training and test data
- **Solutions**: Increase model complexity, add features, reduce regularization

### Overfitting (High Variance)
- Model memorizes training data including noise
- Good performance on training data, poor on test data
- **Solutions**: Reduce model complexity, add regularization, get more data

### Good Fit
- Model captures underlying patterns without memorizing noise
- Good performance on both training and test data
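
One of the overfitting remedies listed above, regularization, can be sketched as follows. This is an illustrative example under assumed settings: the synthetic data, the 50/50 split, and `alpha=1e-3` are choices made here, not values from the original notebook. It compares training and held-out error for a too-simple model, a high-degree polynomial, and the same polynomial with a ridge penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small noisy dataset (assumed for illustration)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, 40).reshape(-1, 1)
y_demo = (1.5 * X_demo.ravel()
          + 0.5 * np.sin(2 * np.pi * X_demo.ravel())
          + rng.normal(0, 0.1, 40))
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

candidates = {
    'degree 1': make_pipeline(PolynomialFeatures(1), LinearRegression()),
    'degree 15': make_pipeline(PolynomialFeatures(15), LinearRegression()),
    'degree 15 + ridge': make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)),
}

fit_report = {}
for name, model in candidates.items():
    model.fit(X_fit, y_fit)
    fit_report[name] = {
        'train': mean_squared_error(y_fit, model.predict(X_fit)),
        'test': mean_squared_error(y_held, model.predict(X_held)),
    }
    print(f"{name:<18} train MSE = {fit_report[name]['train']:.4f}, "
          f"test MSE = {fit_report[name]['test']:.4f}")
```

A large gap between training and held-out error for the unregularized degree-15 model is the signature of overfitting. Whether the ridge penalty closes that gap depends on `alpha`, which would normally be tuned on validation data.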

In [None]:
# Demonstrating underfitting and overfitting with a validation curve
from sklearn.model_selection import validation_curve

# Generate dataset
X_curve, y_curve = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Validation curve for polynomial degrees
degrees = range(1, 16)
train_scores, val_scores = validation_curve(
    Pipeline([('poly', PolynomialFeatures()), ('linear', LinearRegression())]),
    X_curve, y_curve, param_name='poly__degree', param_range=degrees,
    cv=5, scoring='neg_mean_squared_error'
)

# Convert to positive MSE
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_std = val_scores.std(axis=1)

# Plot validation curve (error as a function of model complexity)
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_mse, 'o-', label='Training Error', color='blue')
plt.fill_between(degrees, train_mse - train_std, train_mse + train_std, alpha=0.2, color='blue')
plt.plot(degrees, val_mse, 'o-', label='Validation Error', color='red')
plt.fill_between(degrees, val_mse - val_std, val_mse + val_std, alpha=0.2, color='red')

plt.xlabel('Polynomial Degree (Model Complexity)')
plt.ylabel('Mean Squared Error')
plt.title('Validation Curve: Underfitting vs Overfitting')
plt.legend()
plt.grid(True, alpha=0.3)

# Add annotations
plt.annotate('Underfitting\n(High Bias)', xy=(2, train_mse[1]), xytext=(3, train_mse[1] + 500),
            arrowprops=dict(arrowstyle='->', color='black'), fontsize=10)
plt.annotate('Overfitting\n(High Variance)', xy=(12, val_mse[11]), xytext=(10, val_mse[11] + 500),
            arrowprops=dict(arrowstyle='->', color='black'), fontsize=10)

plt.show()

## Train-Validation-Test Splits

### Purpose of Each Set
- **Training Set (60-70%)**: Used to train the model
- **Validation Set (15-20%)**: Used for hyperparameter tuning and model selection
- **Test Set (15-20%)**: Used for final, unbiased evaluation

### Why Three Sets?
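- Keeps tuning decisions from leaking into the final evaluation, because the test set is never used to choose a model
- Provides an unbiased estimate of performance on unseen data
- Lets hyperparameters be tuned on the validation set without touching the test set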
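
The roles of the three sets can be sketched as a small model-selection loop. This is a hedged example rather than the notebook's own workflow: the candidate `max_depth` values and the variable names (`X_rest`, `X_fit`, `X_valid`, `X_hold`) are assumptions made for illustration. The validation set picks the hyperparameter, and the test set is used exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_all, y_all = make_classification(
    n_samples=1000, n_features=10, n_classes=2, random_state=42
)

# Hold out 20% as the test set, then split the rest 75/25 into train/validation
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

# Model selection: compare candidates on the validation set only
val_accuracy_by_depth = {}
for depth in (2, 4, 8):
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    clf.fit(X_fit, y_fit)
    val_accuracy_by_depth[depth] = clf.score(X_valid, y_valid)

best_depth = max(val_accuracy_by_depth, key=val_accuracy_by_depth.get)
print("Validation accuracy by depth:", val_accuracy_by_depth)

# Final evaluation: refit on train + validation, score once on the test set
final_clf = RandomForestClassifier(
    n_estimators=50, max_depth=best_depth, random_state=42
).fit(X_rest, y_rest)
test_accuracy = final_clf.score(X_hold, y_hold)
print(f"Chosen max_depth = {best_depth}, test accuracy = {test_accuracy:.3f}")
```

Refitting on the combined training and validation data before the final test is a common choice, not a requirement. The key point is that the test score plays no part in choosing `best_depth`.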

In [None]:
# Train-Validation-Test Split
from sklearn.model_selection import train_test_split

# Generate sample dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% becomes validation (60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print("Dataset Splits:")
print(f"Total samples: {len(X)}")
print(f"Training set: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set: {len(X_val)} ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

# Visualize the split
fig, ax = plt.subplots(figsize=(10, 2))
splits = ['Training', 'Validation', 'Test']
sizes = [len(X_train), len(X_val), len(X_test)]
colors = ['skyblue', 'lightgreen', 'lightcoral']

left = 0
for split, size, color in zip(splits, sizes, colors):
    ax.barh(0, size, left=left, color=color, label=f'{split} ({size})')
    ax.text(left + size/2, 0, f'{split}\n{size}', ha='center', va='center')
    left += size

ax.set_xlim(0, len(X))
ax.set_ylim(-0.5, 0.5)
ax.set_xlabel('Number of Samples')
ax.set_title('Train-Validation-Test Split')
ax.set_yticks([])
plt.tight_layout()
plt.show()

## Cross-Validation

### K-Fold Cross-Validation
- Divide data into k folds of roughly equal size
- Train on k-1 folds, validate on the remaining fold
- Repeat k times so each fold is used for validation exactly once, then average the results
- More robust than a single train-validation split

### Benefits
- Better use of limited data
- More reliable performance estimates
- Reduces variance in evaluation
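
The k-fold procedure described above can be written out by hand to make each step visible. This is a minimal sketch under stated assumptions: the helper `manual_kfold_scores` is hypothetical, and `LogisticRegression` on a synthetic dataset is used only to keep the example small. In practice `cross_val_score` does this work.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def manual_kfold_scores(model, X, y, k=5, seed=42):
    """Shuffle the indices, cut them into k folds, and validate on each fold once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)  # fold sizes differ by at most one
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[val_idx], y[val_idx]))
    return np.array(scores), folds


X_kf, y_kf = make_classification(n_samples=203, n_features=10, random_state=42)
kf_scores, kf_folds = manual_kfold_scores(LogisticRegression(max_iter=1000), X_kf, y_kf)

print("Fold sizes:", [len(f) for f in kf_folds])
print("Fold accuracies:", np.round(kf_scores, 3))
print(f"Mean accuracy: {kf_scores.mean():.3f}")
```

With 203 samples the folds cannot be exactly equal, which is why the description says "roughly equal size".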

In [None]:
# Cross-Validation Example
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

# Create model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-Fold Cross-Validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

print("5-Fold Cross-Validation Results:")
print(f"Fold scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Visualize CV folds (illustrative shuffled KFold; cross_val_score above used its default stratified, unshuffled folds)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fig, ax = plt.subplots(figsize=(12, 6))

for i, (train_idx, val_idx) in enumerate(kf.split(X_train)):
    # Create array to show train/validation split
    split_array = np.zeros(len(X_train))
    split_array[val_idx] = 1  # 1 for validation, 0 for training
    
    ax.imshow(split_array.reshape(1, -1), cmap='RdYlBu', aspect='auto', 
              extent=[0, len(X_train), i, i+1])
    ax.text(-20, i+0.5, f'Fold {i+1}', va='center', ha='right')

ax.set_xlim(0, len(X_train))
ax.set_ylim(0, 5)
ax.set_xlabel('Sample Index')
ax.set_ylabel('CV Fold')
ax.set_title('5-Fold Cross-Validation Splits\n(Red: Training, Blue: Validation)')
ax.set_yticks([])
plt.tight_layout()
plt.show()

## Best Practices

### Data Splitting
1. **Stratify**: Maintain class distribution in splits
2. **Random State**: Ensure reproducible results
3. **Time Series**: Use temporal splits for time-dependent data
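
Two of the splitting practices above, stratification and temporal splits, can be illustrated on toy data. This is a sketch with assumed inputs: the 90/10 label imbalance and the 100-step index are made up for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Stratify: imbalanced toy labels (90 negatives, 10 positives)
X_imb = np.arange(100).reshape(-1, 1)
y_imb = np.array([0] * 90 + [1] * 10)
X_a, X_b, y_a, y_b = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)
print(f"Positive rate: train {y_a.mean():.2f}, test {y_b.mean():.2f}")

# Time series: every training window ends before its test window begins
tscv = TimeSeriesSplit(n_splits=3)
ts_folds = list(tscv.split(X_imb))
for fold, (train_idx, test_idx) in enumerate(ts_folds, start=1):
    print(f"Fold {fold}: train 0..{train_idx.max()}, "
          f"test {test_idx.min()}..{test_idx.max()}")
```

`stratify=y_imb` keeps the 10% positive rate in both parts of the split, and `TimeSeriesSplit` never trains on samples that come after the ones it tests on.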

### Model Evaluation
1. **Never touch the test set** until final evaluation
2. **Use cross-validation** for model selection
3. **Monitor both training and validation performance** to spot underfitting and overfitting
4. **Consider multiple metrics** for comprehensive evaluation
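
Points 2 to 4 above can be combined in a single call to `cross_validate`, which accepts several scorers and can report training scores alongside validation scores. The example is a sketch under assumptions: the imbalanced synthetic dataset (`weights=[0.8, 0.2]`) and the choice of `LogisticRegression` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Imbalanced synthetic data, where accuracy alone can be misleading
X_m, y_m = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

metric_names = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_m, y_m,
    cv=5, scoring=metric_names, return_train_score=True
)

for name in metric_names:
    train_mean = cv_results[f'train_{name}'].mean()
    val_mean = cv_results[f'test_{name}'].mean()
    print(f"{name:<10} train = {train_mean:.3f}, validation = {val_mean:.3f}")
```

In the result dictionary, keys prefixed with `test_` hold the scores on each validation fold, not on a separate test set.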

In [None]:
# Complete workflow example
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

# 1. Hyperparameter tuning using cross-validation
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy'
)

# Fit on training data (cross-validation builds its own validation folds, so X_val is not used here)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)

# 2. Final evaluation on test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"\nFinal test accuracy: {test_score:.3f}")

# 3. Detailed evaluation
y_pred = best_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

## Summary

Understanding ML fundamentals is crucial for building effective models:

- **Data Understanding**: Know your features, labels, and data splits
- **Bias-Variance Tradeoff**: Balance model complexity
- **Overfitting Prevention**: Use proper validation techniques
- **Cross-Validation**: Get reliable performance estimates

## Next Steps
- Review Python libraries for ML (NumPy, Pandas, Scikit-learn)
- Learn mathematical foundations (calculus, linear algebra, statistics)
- Explore specific ML algorithms and their implementations