# **3. Regression Modeling**

## *Table of Contents*

1. [Data Cleaning](../01_Data_Cleaning/1_Data_Cleaning.ipynb)
2. [EDA and Feature Engineering](../02_Exploratory_Data_Analysis/2_Exploratory_Data_Analysis.ipynb)
3. [**Regression Modeling**](./3_Regression_Modeling.ipynb)
4. [Time Series](../04_Time_Series_Analysis/4_Time_Series.ipynb)

## **Library Imports**

### Standard library imports

In [5]:
import sys  # Access to interpreter state, such as the module search path (sys.path)
import os  # For interacting with the operating system

### Third-party imports

In [6]:
import matplotlib.pyplot as plt  # For creating visualizations
import numpy as np  # For numerical computations
import pandas as pd  # For data manipulation and analysis
import seaborn as sns  # For high-level data visualization
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV, mutual_info_regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import (train_test_split, GridSearchCV, cross_val_score, KFold)
# plot_partial_dependence was removed in scikit-learn 1.2; PartialDependenceDisplay replaces it
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

### Local application imports

In [7]:
# Add the parent of the current working directory (the project root) to the import path
# so that the shared `utils` package can be imported from this notebook's subfolder
parent_dir = os.path.dirname(os.getcwd())
sys.path.insert(0, parent_dir)

# Local application imports
from utils import func_utils

## **File Importation**

In [8]:
# Resolve the project root: the parent of the current working directory (this notebook's folder)
script_dir = os.path.dirname(os.getcwd())

# Construct the path to the data file
data_path = os.path.join(script_dir, '01_Data_Cleaning', '1_cleaned_melb_data.csv')

# Load dataset containing cleaned Melbourne housing data
melb_data = pd.read_csv(data_path)

## **Feature Selection**

In [None]:
# Identify categorical features for dummy variable creation.
categorical_features = [
    'Postcode', 'Suburb', 'Regionname', 'CouncilArea', 'Type', 'SellerG',
    'Method', 'Year', 'Month'
]

# Create dummy variables for categorical features and ensure column names are strings.
melb_fs_df = func_utils.concat_dummies(melb_data, categorical_features)
melb_fs_df.columns = melb_fs_df.columns.astype(str)

# Define columns to exclude from the feature set.
excluded_columns = [
    'Address', 'Postcode', 'Suburb', 'Regionname', 'CouncilArea', 'Type',
    'SellerG', 'Method', 'Date', 'Year', 'Month'
]

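# Prepare the feature matrix (X) and target vector (y).
# Drop the target and any excluded raw columns still present after dummy encoding.
X_fs = melb_fs_df.drop(columns=excluded_columns + ['Price'], errors='ignore')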
y_fs = melb_fs_df['Price']

# Split the dataset into training and test sets with a test size of 20% and a fixed random state for reproducibility.
X_train_fs, X_test_fs, y_train_fs, y_test_fs = train_test_split(
    X_fs, y_fs, test_size=0.2, random_state=42
)
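
The helper `func_utils.concat_dummies` is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical `concat_dummies_sketch` below appends one-hot columns to the frame while leaving the original categorical columns in place; the function name, the toy data, and the keep-originals behavior are all assumptions, not the actual implementation.

```python
import pandas as pd


def concat_dummies_sketch(df, columns):
    """Hypothetical stand-in for func_utils.concat_dummies.

    Appends one-hot columns for `columns` and keeps the originals in place.
    """
    # Cast to str so numeric codes (e.g. a postcode) are treated as categories
    dummies = pd.get_dummies(df[columns].astype(str), columns=columns, dtype=int)
    return pd.concat([df, dummies], axis=1)


# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Type": ["h", "u", "h"],
    "Rooms": [3, 2, 4],
    "Price": [1.0, 0.5, 1.2],
})

encoded = concat_dummies_sketch(toy, ["Type"])

# The raw categorical column and the target are removed before modeling
X_toy = encoded.drop(columns=["Type", "Price"])
print(X_toy.columns.tolist())
```

If the real helper behaves this way, the raw categorical columns remain in the frame after encoding, which is why an explicit exclusion list is needed before building the feature matrix.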

### Mutual Information (MI)

In [None]:
# Calculate Mutual Information (MI) scores to select informative features.
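# The estimator adds small random noise internally, so fix random_state for repeatable scores.
mi_scores = mutual_info_regression(X_train_fs, y_train_fs, random_state=42)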
mi_scores = pd.Series(mi_scores, name="MI Scores", index=X_train_fs.columns)
mi_scores = mi_scores.sort_values(ascending=False)
print(mi_scores.head(100).to_string())
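
To illustrate why MI is useful alongside correlation-based screening, here is a minimal sketch on synthetic data (not the Melbourne dataset). The variable names and the quadratic relationship are assumptions chosen for the demonstration: the target depends on one feature through a square, which linear correlation largely misses but MI picks up.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

signal = rng.normal(size=n)   # feature the target depends on
noise = rng.normal(size=n)    # feature unrelated to the target
X_demo = np.column_stack([signal, noise])

# Nonlinear (quadratic) dependence on the first feature only
y_demo = signal ** 2 + 0.1 * rng.normal(size=n)

mi_demo = mutual_info_regression(X_demo, y_demo, random_state=0)
corr_demo = [abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1]) for j in range(2)]

print("MI scores:        ", mi_demo)
print("|Pearson r| values:", corr_demo)
```

Note that with the default `discrete_features='auto'`, a dense feature matrix is treated as entirely continuous, so one-hot columns are scored as continuous variables unless a mask is passed.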

### Recursive Feature Elimination (RFE)

In [None]:
# Initialize a RandomForestRegressor as the estimator for RFE.
estimator = RandomForestRegressor(n_jobs=-1, random_state=42)

#### Apply RFECV:

In [None]:
# Define cross-validation strategy
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize RFECV, eliminating one feature per iteration (step=1)
rfecv = RFECV(estimator, step=1, cv=cv_strategy, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit RFECV
rfecv.fit(X_train_fs, y_train_fs)

#### Review the Results:

In [None]:
# Print the optimal number of features and the best features.
print("Optimal number of features: ", rfecv.n_features_)
print("Best features: ", X_train_fs.columns[rfecv.support_])

# Get the feature rankings
feature_rankings = rfecv.ranking_

# Create a series with feature names and their corresponding rankings
ranking_series = pd.Series(feature_rankings, index=X_train_fs.columns)

# Sort ascending so that rank-1 (selected) features come first
sorted_ranking_series = ranking_series.sort_values()

# Filter the series to get only features with rank 1
top_features = sorted_ranking_series[sorted_ranking_series == 1].index.tolist()

print(top_features)

#### Plot the CV Score vs. Number of Features:

In [None]:
# Plot the CV score as a function of the number of features.
if hasattr(rfecv, "cv_results_"):
    scores = rfecv.cv_results_['mean_test_score']
    plt.figure(figsize=(12, 6))  # You can adjust the figure size to your preference
    plt.xlabel("Number of features selected")
    plt.ylabel("Cross-validation score (neg_mean_squared_error)")
    plt.plot(range(1, len(scores) + 1), scores, marker='o', markersize=3)  # Add marker for each point
    plt.grid(True)  # Add gridlines for better precision in viewing
    plt.title('RFECV - Number of features vs CV Score')
    
    # Set x-axis ticks to increments of 25
    plt.xticks(np.arange(0, len(scores) + 1, 25))
    
    plt.axvline(x=rfecv.n_features_, color='r', linestyle='--', label='Optimal number of features')
    plt.legend()
    plt.tight_layout()  # Call after all artists are added so the layout accounts for them
    plt.show()

## **Preprocessing for Model Training**

## **GridSearchCV for Random Forest**
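
A minimal sketch of a grid search over random forest hyperparameters, run on synthetic data. The parameter grid, fold count, and dataset are illustrative assumptions; the grid actually used for the housing model is not shown in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)

# Illustrative grid only; real searches usually cover more values
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV score (neg MSE):", search.best_score_)
```

With the default `refit=True`, `search.best_estimator_` is retrained on the full training data with the best parameters and can be used directly for prediction.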

## **Random Forest Model Fitting**

## **Model Diagnostics**

### R² (R-squared)
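
As a worked illustration on made-up numbers, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and checked against `r2_score`. The toy arrays below are assumptions for the example, not model output.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (not taken from the housing model)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2_manual = 1.0 - ss_res / ss_tot

print("Manual R²:  ", r2_manual)
print("r2_score R²:", r2_score(y_true, y_pred))
```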

### Cross Validation
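
A minimal sketch of k-fold cross-validation with `cross_val_score`, again on synthetic data. The fold count mirrors the `KFold` settings used earlier for RFECV; the dataset, forest size, and choice of R² as the scorer are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=30, random_state=0)

# One R² value per fold; the spread indicates how stable the estimate is
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print("Per-fold R²:", scores)
print("Mean ± std: ", scores.mean(), scores.std())
```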

### RMSE Calculation

### MSE & MAE Calculation
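
A small worked example of the error metrics on made-up predictions. RMSE is taken as the square root of MSE with NumPy rather than through a keyword argument, since the `squared` parameter of `mean_squared_error` is not available in every scikit-learn release. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values (not taken from the housing model)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # same units as the target

print("MSE: ", mse)
print("MAE: ", mae)
print("RMSE:", rmse)
```

MSE penalizes large errors more heavily than MAE, so a large gap between RMSE and MAE suggests that a few predictions are far off.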

### Partial Dependence Plots

### Permutation Feature Importance
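
A minimal sketch of permutation importance on synthetic data where only the first feature drives the target. The data, model size, and repeat count are assumptions; the idea is that shuffling an informative column on held-out data degrades the score, while shuffling an irrelevant one barely changes it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only column 0 influences the target
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Importance is measured on held-out data as the drop in score after shuffling
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"± {result.importances_std[j]:.3f}")
```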

### Feature Importance Analysis