# ML/AI Learning Project - Interactive Tutorial

Welcome to the interactive ML/AI learning tutorial! This notebook will guide you through:
1. Understanding the data
2. Data preprocessing
3. Training multiple models
4. Evaluating and comparing models
5. Making predictions

## 🎯 Learning Goals
- Learn the complete ML workflow
- Understand different ML algorithms
- Practice with real code examples
- Experiment with hyperparameters

## Setup
First, let's import all necessary libraries and modules.

In [None]:
import sys
from pathlib import Path

# Add src/ to the import path (assumes the notebook lives one level below the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Our custom modules
from ml_project.generate_data import generate_housing_data
from ml_project.data_loader import load_data
from ml_project.preprocessing import DataPreprocessor, split_data
from ml_project.models import LinearRegressionModel, RandomForestModel, NeuralNetworkModel
from ml_project.evaluate import evaluate_model, compare_models
from ml_project.visualize import (
    plot_feature_correlations,
    plot_predictions_vs_actual,
    plot_residuals,
    plot_feature_importance,
    plot_model_comparison,
    plot_price_distribution
)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All imports successful!")

## Step 1: Generate and Explore Data

Let's create synthetic housing data and explore it.

In [None]:
# Generate housing data
data = generate_housing_data(n_samples=1000, seed=42)

print(f"Dataset shape: {data.shape}")
print("\nFirst few rows:")
data.head()

In [None]:
# Basic statistics
print("Descriptive Statistics:")
data.describe()

In [None]:
# Visualize price distribution
plot_price_distribution(data)

In [None]:
# Check for correlations
plot_feature_correlations(data)

## Step 2: Data Preprocessing

Before training models, we need to:
1. Encode categorical variables
2. Scale numerical features
3. Split data into train/test sets
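
The encoding and scaling steps listed above can be sketched without any libraries. This is a minimal illustration, not the actual `DataPreprocessor` implementation: it assumes one-hot encoding for categorical columns and z-score standardization for numeric ones. The helper names `one_hot` and `standardize` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Turn a categorical column into 0/1 indicator rows (categories in sorted order)."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy rows (hypothetical values, not taken from the generated dataset)
neighborhoods = ["Suburbs", "Downtown", "Suburbs", "Rural"]
square_feet = [2800, 1200, 2000, 1600]

categories, encoded = one_hot(neighborhoods)
scaled = standardize(square_feet)

print("Categories:", categories)
print("One-hot rows:", encoded)
print("Scaled square_feet:", [round(s, 3) for s in scaled])
```

After scaling, every numeric column has a comparable range, which matters most for models that are sensitive to feature magnitude, such as neural networks.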

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor()

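# Fit and transform data
# Note: fitting on the full dataset lets test-set statistics influence the scaling;
# for a stricter evaluation, fit the preprocessor on the training rows only.
X, y = preprocessor.fit_transform(data)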

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature names: {preprocessor.feature_names}")

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2, random_state=42)

# Further split the training set to hold out a validation set (20% of the remaining 80%)
X_train, X_val, y_train, y_val = split_data(X_train, y_train, test_size=0.2, random_state=42)

print(f"Training set: {len(X_train)} samples")
print(f"Validation set: {len(X_val)} samples")
print(f"Test set: {len(X_test)} samples")

## Step 3: Train Models

Let's train three different types of models:
1. **Linear Regression**: Simple baseline model
2. **Random Forest**: Ensemble tree-based model
3. **Neural Network**: Deep learning approach

### 3.1 Linear Regression

In [None]:
# Train Linear Regression
lr_model = LinearRegressionModel()
lr_model.train(X_train, y_train)

# Evaluate
lr_metrics, lr_pred = evaluate_model(lr_model, X_test, y_test)

print("\nLinear Regression Results:")
print(f"RMSE: ${lr_metrics['rmse']:,.2f}")
print(f"R² Score: {lr_metrics['r2']:.4f}")

### 3.2 Random Forest

In [None]:
# Train Random Forest
rf_model = RandomForestModel()
rf_model.train(X_train, y_train)

# Evaluate
rf_metrics, rf_pred = evaluate_model(rf_model, X_test, y_test)

print("\nRandom Forest Results:")
print(f"RMSE: ${rf_metrics['rmse']:,.2f}")
print(f"R² Score: {rf_metrics['r2']:.4f}")

### 3.3 Neural Network

In [None]:
# Train Neural Network
nn_model = NeuralNetworkModel()
nn_model.build_model(X_train.shape[1])

# Display model architecture
print("Neural Network Architecture:")
nn_model.model.summary()

# Train (this may take a minute)
history = nn_model.train(X_train, y_train, X_val, y_val, verbose=1)

# Evaluate
nn_metrics, nn_pred = evaluate_model(nn_model, X_test, y_test)

print("\nNeural Network Results:")
print(f"RMSE: ${nn_metrics['rmse']:,.2f}")
print(f"R² Score: {nn_metrics['r2']:.4f}")

## Step 4: Compare Models

Let's compare all three models side-by-side.
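
For reference, the two metrics reported in Step 3 can be computed by hand. The sketch below assumes `evaluate_model` follows the standard definitions of RMSE and R²; the helper names `rmse` and `r2_score` and the prices are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: the typical prediction error, in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical house prices and model predictions
actual = [300_000, 450_000, 250_000, 500_000]
predicted = [310_000, 430_000, 260_000, 520_000]

print(f"RMSE: ${rmse(actual, predicted):,.2f}")
print(f"R² Score: {r2_score(actual, predicted):.4f}")
```

A lower RMSE and an R² closer to 1 both indicate a better fit, so the two metrics usually rank models the same way.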

In [None]:
# Compile results
results = {
    'Linear Regression': lr_metrics,
    'Random Forest': rf_metrics,
    'Neural Network': nn_metrics
}

# Compare
best_model = compare_models(results)

# Visualize comparison
plot_model_comparison(results)

## Step 5: Analyze Best Model

Let's dive deeper into the best performing model.

In [None]:
# Analyze Random Forest here (it is typically the strongest of the three;
# swap in another model if the comparison above ranks it higher)
best_model = rf_model
best_pred = rf_pred

# Predictions vs Actual
plot_predictions_vs_actual(y_test, best_pred, 'Random Forest')

In [None]:
# Residual analysis
plot_residuals(y_test, best_pred, 'Random Forest')

In [None]:
# Feature importance
importance = rf_model.get_feature_importance(preprocessor.feature_names)
plot_feature_importance(importance, top_n=8, model_name='Random Forest')

## Step 6: Make Predictions

Now let's use our trained model to predict prices for new houses.

In [None]:
# Create a new house
new_house = pd.DataFrame([{
    'square_feet': 2800,
    'bedrooms': 4,
    'bathrooms': 3,
    'year_built': 2018,
    'lot_size': 12000,
    'garage_spaces': 2,
    'neighborhood': 'Suburbs',
    'house_type': 'Single-Family'
}])

# Preprocess
new_house_processed = preprocessor.transform(new_house)

# Predict with all models
print("Price Predictions for New House:")
print("="*50)
print(f"Linear Regression:  ${lr_model.predict(new_house_processed)[0]:,.2f}")
print(f"Random Forest:      ${rf_model.predict(new_house_processed)[0]:,.2f}")
print(f"Neural Network:     ${nn_model.predict(new_house_processed)[0]:,.2f}")
print("="*50)

## 🎓 Exercises for Learning

Try these exercises to deepen your understanding:

1. **Experiment with hyperparameters**:
   - Change `n_estimators` in Random Forest
   - Modify neural network architecture
   - Adjust learning rate

2. **Feature engineering**:
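   - Create new features from the inputs (e.g., house age from `year_built`, or square feet per bedroom); avoid features derived from the price itself, since that leaks the target
   - Try removing features and observe the impact on the metrics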

3. **Different models**:
   - Try XGBoost or LightGBM
   - Implement ensemble methods

4. **Cross-validation**:
   - Implement k-fold cross-validation
   - Compare results across folds

5. **Real data**:
   - Use a real dataset from Kaggle
   - Apply the same workflow
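
As a starting point for the cross-validation exercise, here is a minimal sketch of how k-fold splitting can be implemented. The function name `k_fold_indices` is hypothetical and not part of `ml_project`; it only produces index lists, so you would still train and evaluate a model inside the loop.

```python
import random


def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} train / {len(val_idx)} validation samples")
```

Within each fold, fit a fresh model on the training indices, score it on the validation indices, and then average the scores across folds to get a more stable estimate than a single split provides.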

## 📚 Key Takeaways

1. **Data preprocessing is crucial**: Proper scaling and encoding can significantly impact model performance
2. **Different models have different strengths**: Tree-based models often work well without much tuning
3. **Always hold out a test set**: Metrics computed on the training data overstate how well a model generalizes
4. **Visualize your results**: Plots help identify issues and communicate findings
5. **Start simple**: Begin with simple models before moving to complex ones
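
The third takeaway can be demonstrated with a model that simply memorizes its training data. The sketch below is an illustration on hypothetical synthetic data, not part of the project code: a 1-nearest-neighbor predictor scores perfectly on the rows it was trained on, while its error on held-out rows is much larger.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def nearest_neighbor_predict(train_x, train_y, query_x):
    """Predict by copying the target of the closest training point (1-NN)."""
    predictions = []
    for q in query_x:
        closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        predictions.append(train_y[closest])
    return predictions


rng = random.Random(0)
# Hypothetical data: price grows with size, plus random noise
sizes = [rng.uniform(800, 3500) for _ in range(200)]
prices = [150 * s + rng.gauss(0, 40_000) for s in sizes]

# Hold out the last 20% of rows as a test set
train_x, test_x = sizes[:160], sizes[160:]
train_y, test_y = prices[:160], prices[160:]

train_rmse = rmse(train_y, nearest_neighbor_predict(train_x, train_y, train_x))
test_rmse = rmse(test_y, nearest_neighbor_predict(train_x, train_y, test_x))

print(f"Training RMSE: ${train_rmse:,.2f}")
print(f"Test RMSE:     ${test_rmse:,.2f}")
```

The training error is zero because each training point is its own nearest neighbor, which says nothing about how the model handles unseen houses.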

## 🚀 Next Steps

- Explore the codebase in `src/ml_project/`
- Run the standalone scripts: `train.py`, `predict.py`
- Read about ML concepts in the resources listed in the README
- Work on your own ML project!

**Happy Learning! 🎉**