# Regression Model - FintelHub Capstone

**Objective**: Build and evaluate a regression model to predict continuous values in financial data.

**Choose Your Problem:**
- **Option A**: Transaction Amount Prediction (PaySim dataset)
- **Option B**: Loan Amount Prediction (Credit Risk dataset)
- **Option C**: Balance/Credit Score Prediction (Customer Churn or Credit Risk dataset)

---

## Table of Contents
1. [Setup & Imports](#setup)
2. [Data Loading & Exploration](#data-loading)
3. [Data Engineering](#data-engineering)
4. [Model Training](#model-training)
5. [Model Evaluation](#model-evaluation)
6. [Model Saving](#model-saving)
7. [Conclusions](#conclusions)

---
## 1. Setup & Imports <a id='setup'></a>

### Import Required Libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Helper modules (in src folder)
import sys
sys.path.append('../src')
from data_engineering import *
from model_utils import *

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')

print("All libraries imported successfully!")

---
## 2. Data Loading & Exploration <a id='data-loading'></a>

### 2.1 Load Dataset

**Instructions**: Update the file path based on your chosen dataset:
- Transaction Amount: `'../data/raw/fraud_data.csv'`
- Loan Amount: `'../data/raw/credit_risk.csv'`
- Balance: `'../data/raw/customer_churn.csv'`

In [None]:
# TODO: Update with your chosen dataset path
DATA_PATH = '../data/raw/credit_risk.csv'  # Change this!

# Load data (nrows=None reads every row; pass an integer to cap large datasets)
df = load_data(DATA_PATH, nrows=None)

# Display first few rows
df.head()

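`load_data` is imported from `../src/data_engineering.py` and its body is not shown in this notebook. The sketch below is one plausible shape for it, assuming it is a thin wrapper around `pandas.read_csv`; the implementation and the in-memory demo data are illustrative guesses, not the actual helper.

```python
import io
import pandas as pd

def load_data(path_or_buffer, nrows=None):
    """Hypothetical stand-in for the src helper: read a CSV into a DataFrame.

    nrows=None reads the whole file; an integer caps the number of rows,
    which helps when a raw file is too large to explore comfortably.
    """
    df = pd.read_csv(path_or_buffer, nrows=nrows)
    print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
    return df

# Demo on an in-memory CSV so the example runs without any data files
csv_text = "loan_amnt,person_income\n1000,50000\n2500,62000\n4000,71000\n"
sample = load_data(io.StringIO(csv_text), nrows=2)
```

Capping `nrows` while exploring and switching back to `None` for the final run keeps iteration fast without changing any downstream code.
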
### 2.2 Basic Data Exploration (GUIDED)

In [None]:
# Dataset shape
print(f"Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns\n")

# Column information
print("Column Information:")
print(df.info())

# Statistical summary
print("\nStatistical Summary:")
df.describe()

### 2.3 Identify Target Variable

**Target Variable Names (choose what to predict):**
- Transaction Amount: `amount`
- Loan Amount: `loan_amnt`
- Balance: `balance`
- Income: `person_income`, `estimated_salary`

In [None]:
# TODO: Set your target column name
TARGET_COLUMN = 'loan_amnt'  # Change this based on your dataset!

# Check target distribution
print(f"Target Variable: {TARGET_COLUMN}")
print(f"\nBasic Statistics:")
print(df[TARGET_COLUMN].describe())

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df[TARGET_COLUMN], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel(TARGET_COLUMN)
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Distribution of {TARGET_COLUMN}')
axes[0].grid(alpha=0.3)

# Boxplot
axes[1].boxplot(df[TARGET_COLUMN])
axes[1].set_ylabel(TARGET_COLUMN)
axes[1].set_title(f'Boxplot of {TARGET_COLUMN}')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Check for outliers
Q1 = df[TARGET_COLUMN].quantile(0.25)
Q3 = df[TARGET_COLUMN].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df[TARGET_COLUMN] < (Q1 - 1.5 * IQR)) | (df[TARGET_COLUMN] > (Q3 + 1.5 * IQR))).sum()
print(f"\nNumber of outliers: {outliers} ({outliers/len(df)*100:.2f}%)")

### 2.4 Check for Missing Values (GUIDED)

In [None]:
# Check missing values
missing_summary = check_missing_values(df)

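Section 3.1 later tests `len(missing_summary) > 0`, so `check_missing_values` presumably returns something whose length is the number of affected columns. A minimal sketch under that assumption follows; the function body, the output column names, and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd

def check_missing_values(df):
    """Hypothetical stand-in: per-column missing counts and percentages.

    Only columns with at least one missing value are returned, so an
    empty result means there is nothing to impute.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Illustrative data only; the column names are made up for the demo
demo = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000, 5500],
    "person_emp_length": [1.0, np.nan, 3.0, np.nan],
    "loan_int_rate": [11.1, 12.9, np.nan, 15.2],
})
missing_demo = check_missing_values(demo)
print(missing_demo)
```
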
### 2.5 Identify Feature Types

In [None]:
# TODO: Use the helper function to identify numerical and categorical features
feature_types = get_feature_types(df)

numerical_features = feature_types['numerical']
categorical_features = feature_types['categorical']

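The cell above relies on `get_feature_types` returning a dict with `'numerical'` and `'categorical'` keys. One way such a helper could be written, assuming it splits columns purely by dtype (a guess at the implementation, with made-up demo columns):

```python
import pandas as pd

def get_feature_types(df):
    """Hypothetical stand-in: split column names by dtype."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    # Everything non-numeric: strings, categoricals, booleans (and datetimes,
    # which you would normally drop or convert before modeling)
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return {"numerical": numerical, "categorical": categorical}

demo = pd.DataFrame({
    "loan_amnt": [1000, 2500],
    "loan_int_rate": [11.1, 12.9],
    "loan_intent": ["EDUCATION", "MEDICAL"],
})
types_demo = get_feature_types(demo)
print(types_demo)
```

A dtype-based split treats integer-coded categories (for example a numeric grade code) as numerical, so it is worth scanning the resulting lists by eye.
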
### 2.6 Correlation Analysis (GUIDED)

In [None]:
# Correlation with target variable
correlations = df[numerical_features].corr()[TARGET_COLUMN].sort_values(ascending=False)
print("Correlation with target variable:")
print(correlations)

# Visualize top correlations
plt.figure(figsize=(10, 6))
correlations.drop(TARGET_COLUMN).plot(kind='barh')
plt.xlabel('Correlation with Target')
plt.title(f'Feature Correlations with {TARGET_COLUMN}')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 📊 Checkpoint 1
Before proceeding, ensure:
- ✅ Data is loaded successfully
- ✅ You understand the target variable distribution
- ✅ You've identified missing values (if any)
- ✅ You know which features are numerical vs categorical
- ✅ You've examined correlations with target

---
## 3. Data Engineering <a id='data-engineering'></a>

### 3.1 Handle Missing Values

In [None]:
# TODO: Handle missing values if any were found
# Choose strategy: 'mean', 'median', 'mode', or 'drop'

if len(missing_summary) > 0:
    # Example: Fill numerical columns with median
    df = handle_missing_values(df, strategy='median')
else:
    print("No missing values to handle!")

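The four strategies named in the cell above can be sketched as follows. This is an assumed implementation of `handle_missing_values`, not the code in `src`; in particular, falling back to the mode for non-numeric columns is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

def handle_missing_values(df, strategy="median"):
    """Hypothetical stand-in covering 'mean', 'median', 'mode' and 'drop'."""
    df = df.copy()
    if strategy == "drop":
        return df.dropna()
    for col in df.columns:
        if not df[col].isnull().any():
            continue
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if strategy == "mean" and is_numeric:
            fill = df[col].mean()
        elif strategy == "median" and is_numeric:
            fill = df[col].median()
        else:
            # 'mode', or a non-numeric column where mean/median are undefined
            fill = df[col].mode().iloc[0]
        df[col] = df[col].fillna(fill)
    return df

# Illustrative data only
demo = pd.DataFrame({
    "person_income": [40000.0, np.nan, 60000.0, 100000.0],
    "home_ownership": ["RENT", "OWN", None, "RENT"],
})
filled = handle_missing_values(demo, strategy="median")
print(filled)
```

The median is usually a safer default than the mean for skewed financial amounts, because a handful of very large values does not pull it upward.
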
### 3.2 Drop Irrelevant Columns

**Common columns to drop:**
- ID columns
- Timestamp columns
- Columns with data leakage (e.g., future information)

In [None]:
# TODO: Drop irrelevant columns
columns_to_drop = []  # Add column names here

if columns_to_drop:
    df = df.drop(columns=columns_to_drop)
    print(f"Dropped columns: {columns_to_drop}")

print(f"\nRemaining columns: {df.shape[1]}")
print(df.columns.tolist())

### 3.3 Handle Outliers (Optional)

Be careful with outliers in regression - they might be legitimate extreme values!

In [None]:
# TODO: Remove outliers if necessary
# Uncomment and modify if needed

# columns_to_check = [TARGET_COLUMN]  # Add other columns if needed
# df = remove_outliers(df, columns_to_check, method='iqr', threshold=1.5)

print(f"Current dataset shape: {df.shape}")

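If you do uncomment the call above, it helps to know what an IQR filter does. The sketch below assumes `remove_outliers` keeps rows that fall inside the `Q1 - threshold * IQR` to `Q3 + threshold * IQR` fences for every listed column, the same rule used to count outliers in section 2.3; the function body is a guess at the helper.

```python
import pandas as pd

def remove_outliers(df, columns, method="iqr", threshold=1.5):
    """Hypothetical stand-in: keep rows inside the IQR fences of each column."""
    if method != "iqr":
        raise ValueError("this sketch only implements method='iqr'")
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive; rows with NaN in the column are dropped too
        keep &= df[col].between(q1 - threshold * iqr, q3 + threshold * iqr)
    return df[keep]

# Illustrative data: five ordinary amounts and one extreme value
demo = pd.DataFrame({"loan_amnt": [1000, 1200, 1100, 1300, 1250, 50000]})
trimmed = remove_outliers(demo, ["loan_amnt"])
print(f"{len(demo)} rows before, {len(trimmed)} rows after")
```

Filtering on the target changes the question the model answers: it will no longer be trained or evaluated on the largest amounts, so record that decision in your conclusions.
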
### 3.4 Encode Categorical Variables

In [None]:
# TODO: Encode categorical variables
# Update categorical_features list if you dropped any columns

# Remove target column from categorical features if present
if TARGET_COLUMN in categorical_features:
    categorical_features.remove(TARGET_COLUMN)

if categorical_features:
    print(f"Encoding categorical features: {categorical_features}")
    df = encode_categorical(df, categorical_features, method='onehot')
else:
    print("No categorical features to encode")

print(f"\nShape after encoding: {df.shape}")

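For reference, one-hot encoding can be done directly with `pandas.get_dummies`. The sketch below is an assumed version of `encode_categorical`; whether the real helper drops the first level of each feature is unknown, and `drop_first=True` is chosen here only to illustrate the option.

```python
import pandas as pd

def encode_categorical(df, columns, method="onehot"):
    """Hypothetical stand-in: one-hot encode the listed columns."""
    if method != "onehot":
        raise ValueError("this sketch only implements method='onehot'")
    # drop_first=True removes one level per feature, which avoids perfectly
    # collinear dummy columns in the linear models
    return pd.get_dummies(df, columns=columns, drop_first=True, dtype=int)

# Illustrative data only
demo = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000],
    "loan_intent": ["EDUCATION", "MEDICAL", "VENTURE"],
})
encoded = encode_categorical(demo, ["loan_intent"])
print(encoded)
```

One-hot encoding adds a column per category level, so a high-cardinality feature such as a name or ID column can inflate the feature count considerably; drop such columns in section 3.2 first.
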
### 3.5 Separate Features and Target

In [None]:
# TODO: Separate features (X) and target (y)
X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns ({len(X.columns)}):")
print(X.columns.tolist())

### 3.6 Train-Test Split (GUIDED)

In [None]:
# Split data: 80% training, 20% testing
# No stratification needed for regression
X_train, X_test, y_train, y_test = create_train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Check target distribution in both sets
print(f"\nTraining target - Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}")
print(f"Test target - Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}")

### 3.7 Feature Scaling

In [None]:
# TODO: Scale numerical features
X_train_scaled, X_test_scaled, scaler = scale_features(
    X_train, X_test, 
    method='standard'
)

# Convert back to DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print("\nFeatures scaled successfully!")

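The key point of this step is that scaling statistics come from the training set only and are then reused on the test set, so no test information leaks into training. The sketch below spells out what `method='standard'` presumably means, using plain pandas; the function name is hypothetical, and the real `scale_features` helper most likely wraps `StandardScaler`, which applies the same z-score formula.

```python
import pandas as pd

def standardize_train_test(X_train, X_test):
    """Z-score both sets using statistics from the training set only."""
    mean = X_train.mean()
    std = X_train.std(ddof=0)     # population std, as StandardScaler uses
    std = std.replace(0, 1.0)     # constant columns: avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std

# Illustrative data only
X_train_demo = pd.DataFrame({"income": [40.0, 60.0, 80.0], "age": [20.0, 30.0, 40.0]})
X_test_demo = pd.DataFrame({"income": [100.0], "age": [30.0]})
train_z, test_z = standardize_train_test(X_train_demo, X_test_demo)
print(test_z)
```

The test row is not centered on zero after scaling, and that is expected: it is expressed relative to the training distribution.
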
### 📊 Checkpoint 2
Before proceeding to modeling, verify:
- ✅ Missing values are handled
- ✅ Outliers are addressed (if needed)
- ✅ Categorical variables are encoded
- ✅ Data is split into train/test sets
- ✅ Features are scaled

---
## 4. Model Training <a id='model-training'></a>

### 4.1 Train Multiple Models (GUIDED)

In [None]:
# Train four regression models
trained_models = train_regression_models(X_train_scaled, y_train)

print("\nAll models trained successfully!")
print(f"Models: {list(trained_models.keys())}")

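`train_regression_models` is not shown here, but the imports in section 1 suggest it fits `LinearRegression`, `Ridge`, `RandomForestRegressor`, and `XGBRegressor`. A reduced sketch of that pattern follows, assuming the helper returns a dict keyed by model name. It leaves out XGBoost so that it depends only on scikit-learn, and the hyperparameters and synthetic data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge

def train_regression_models(X_train, y_train, random_state=42):
    """Hypothetical stand-in: fit each candidate and return them by name."""
    models = {
        "Linear Regression": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=random_state),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"Trained {name}")
    return models

# Synthetic data: the target is a noisy linear function of two features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
models_demo = train_regression_models(X_demo, y_demo)
```

Fixing `random_state` on the stochastic models matters more than `np.random.seed(42)` in the setup cell, because scikit-learn estimators only pick up the global seed when their own `random_state` is left as `None`.
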
### 4.2 Compare Models on Test Set

In [None]:
# TODO: Compare all models and create comparison dataframe
comparison_df = compare_regression_models(trained_models, X_test_scaled, y_test)

# Display comparison
comparison_df

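The next cell indexes `comparison_df` by the columns `'Model'` and `'r2'`, so the helper must produce at least those. The sketch below shows one way to build such a table, computing each metric by hand with NumPy; the `'rmse'` and `'mae'` column names and the toy `ConstantModel` class are assumptions made for the demo.

```python
import numpy as np
import pandas as pd

def compare_regression_models(models, X_test, y_test):
    """Hypothetical stand-in: one row of test metrics per fitted model."""
    y_true = np.asarray(y_test, dtype=float)
    rows = []
    for name, model in models.items():
        residuals = y_true - np.asarray(model.predict(X_test), dtype=float)
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        rows.append({
            "Model": name,
            "rmse": np.sqrt(np.mean(residuals ** 2)),
            "mae": np.mean(np.abs(residuals)),
            "r2": 1.0 - ss_res / ss_tot,
        })
    return pd.DataFrame(rows).sort_values("r2", ascending=False).reset_index(drop=True)

class ConstantModel:
    """Toy model used only for this demo: always predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

X_demo = np.zeros((4, 1))
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
table = compare_regression_models(
    {"Predict mean": ConstantModel(2.5), "Predict zero": ConstantModel(0.0)},
    X_demo, y_demo,
)
print(table)
```

Predicting the mean of the test target gives an R² of exactly 0, which is a useful baseline: any model scoring at or below that has learned nothing useful.
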
### 4.3 Select Best Model

In [None]:
# TODO: Select best model based on R² score (or RMSE)
# Higher R² is better, Lower RMSE is better

best_model_name = comparison_df.loc[comparison_df['r2'].idxmax(), 'Model']
best_model = trained_models[best_model_name]

print(f"✅ Best Model: {best_model_name}")
print(f"\nBest Model Metrics:")
print(comparison_df[comparison_df['Model'] == best_model_name].to_string(index=False))

---
## 5. Model Evaluation <a id='model-evaluation'></a>

### 5.1 Detailed Evaluation of Best Model

In [None]:
# Get predictions
y_pred = best_model.predict(X_test_scaled)

# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Detailed Evaluation - {best_model_name}")
print("="*60)
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE):      {mae:.4f}")
print(f"R² Score:                       {r2:.4f}")
print("\nInterpretation:")
print(f"- On average, predictions are off by {mae:.2f} units")
print(f"- Model explains {r2*100:.2f}% of the variance in the target")

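For reference, the three metrics printed above follow the standard definitions below. The symbols are introduced here for illustration: $y_i$ is an actual test value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the actual test values, and $n$ the number of test samples.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i \right\rvert, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are in the units of the target. RMSE weights large errors more heavily, so a wide gap between the two suggests a few very large misses. R² can be negative on a test set when the model does worse than always predicting the mean.
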
### 5.2 Actual vs Predicted Plot

In [None]:
# TODO: Plot actual vs predicted values
plot_predictions(y_test, y_pred, best_model_name)

### 5.3 Residual Analysis

In [None]:
# TODO: Plot residuals
plot_residuals(y_test, y_pred, best_model_name)

### 5.4 Error Distribution Analysis

In [None]:
# Analyze prediction errors
errors = y_test - y_pred
# Percentage error is undefined where the actual value is 0, so mask those rows
percent_errors = (errors / y_test.replace(0, np.nan)) * 100

print("Error Analysis:")
print(f"Mean Error: {errors.mean():.4f}")
print(f"Std Error: {errors.std():.4f}")
print(f"Mean Absolute Percentage Error: {np.abs(percent_errors).mean():.2f}%")

# Find worst predictions
worst_predictions = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred,
    'Error': np.abs(errors)
}).sort_values('Error', ascending=False).head(10)

print("\nTop 10 Worst Predictions:")
print(worst_predictions)

### 5.5 Feature Importance

In [None]:
# TODO: Get and plot feature importance (for tree-based models)
if hasattr(best_model, 'feature_importances_'):
    feature_importance_df = get_feature_importance(
        best_model, 
        X_train_scaled.columns, 
        top_n=10
    )
else:
    print(f"{best_model_name} does not have feature importance attribute")
    print("\nFor Linear/Ridge models, you can examine coefficients:")
    if hasattr(best_model, 'coef_'):
        coef_df = pd.DataFrame({
            'Feature': X_train_scaled.columns,
            'Coefficient': best_model.coef_
        }).sort_values('Coefficient', key=abs, ascending=False)
        print(coef_df.head(10))

---
## 6. Model Saving <a id='model-saving'></a>

### 6.1 Save the Best Model

In [None]:
# TODO: Save your best model
model_filename = f"../models/regression_{best_model_name.lower().replace(' ', '_')}.pkl"
save_model(best_model, model_filename)

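`save_model` here and `load_model` in section 7.2 come from `../src/model_utils.py`. The round trip can be sketched as follows, assuming plain `pickle` serialization to match the `.pkl` extension. The real helpers may use `joblib` instead, as the scaler cell below does, and the dict standing in for a fitted estimator is only a placeholder.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Hypothetical stand-in: pickle a fitted model, creating the folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    print(f"Model saved to {path}")

def load_model(path):
    """Hypothetical stand-in: load a model written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a plain dict standing in for a fitted estimator
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "models", "regression_demo.pkl")
    save_model({"coef": [1.5, -2.0], "intercept": 0.3}, demo_path)
    restored = load_model(demo_path)

print(restored)
```

Pickle files can execute arbitrary code when loaded, so only load model files that you or a trusted pipeline produced, and load them with the same library versions used for training.
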
### 6.2 Save Preprocessing Artifacts

In [None]:
# TODO: Save scaler for use in production
import joblib
joblib.dump(scaler, '../models/scaler_regression.pkl')
print("✅ Scaler saved to ../models/scaler_regression.pkl")

### 6.3 Save Model Performance Summary

In [None]:
# Save comparison results
comparison_df.to_csv('../models/regression_model_comparison.csv', index=False)
print("✅ Model comparison saved to ../models/regression_model_comparison.csv")

# Save predictions
save_predictions(y_test, y_pred, '../models/regression_predictions.csv')

---
## 7. Conclusions <a id='conclusions'></a>

### 7.1 Summary of Results

**TODO: Write a brief summary of your findings**

1. **Dataset**: [Describe which dataset you used]

2. **Target Variable**: [What are you predicting?]

3. **Data Challenges**: [Note any issues like outliers, correlations, etc.]

4. **Best Model**: [State which model performed best and why]

5. **Key Metrics**: 
   - RMSE: [value]
   - MAE: [value]
   - R²: [value]

6. **Prediction Accuracy**: [How accurate are the predictions? What's the average error?]

7. **Important Features**: [List top 3-5 features that influenced predictions]

8. **Model Limitations**: [Note any limitations, residual patterns, or concerns]

9. **Next Steps**: [Suggest improvements or further analysis]

### 7.2 Test Model Loading

In [None]:
# Verify model can be loaded
loaded_model = load_model(model_filename)

# Test prediction
test_prediction = loaded_model.predict(X_test_scaled[:5])
print(f"\nTest predictions: {test_prediction}")
print(f"Actual values: {y_test[:5].values}")
print(f"Errors: {y_test[:5].values - test_prediction}")
print("\n✅ Model loaded and tested successfully!")

---
## 🎉 Congratulations!

You've successfully completed the Regression Model notebook!

**Next Steps:**
1. Review both classification and regression models
2. Compare performance across different approaches
3. Document your learnings
4. Prepare for model deployment to H2O
5. Consider advanced techniques (feature engineering, hyperparameter tuning)