# House price regression

![houses](https://storage.googleapis.com/kaggle-media/competitions/House%20Prices/kaggle_5407_media_housesbanner.png)

### Description

We will now put together our knowledge of data preprocessing and regression to build a machine learning model for house price prediction. We will focus on feature selection and compare a few regression models (linear regression, a decision tree, and a random forest), with the goal of training the best predictive model.

### Dataset

Like yesterday, we will take our dataset from Kaggle. This is a relatively wide dataset, with 79 features describing houses in Ames, Iowa.

### Aims

1. Explore the data to see what preprocessing is required
2. Look into the relationships between features and the target variable to select features
3. Train three different regression models and compare results
4. Try again with different selected features

Each time we start a new Colab session, we need to upload our kaggle.json file (the Kaggle API token) to access the dataset.

In [None]:
# Then, we need to move the kaggle.json file to the expected location  

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Import dataset
Housing prices

In [None]:
!kaggle competitions download -c house-prices-advanced-regression-techniques
!unzip house-prices-advanced-regression-techniques.zip
df = pd.read_csv('train.csv')

## Exploratory data analysis (EDA)

In [None]:
# Display first few rows and basic information about the dataset
df.head()

In [None]:
df.info()

In [None]:
# Summary statistics
print(df.describe())

In [None]:
# Plot relationship between chosen features and target variable
plt.figure(figsize=(10, 6))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df)
plt.title('Relationship between GrLivArea (Above Grade Living Area) and Sale Price')
plt.xlabel('GrLivArea')
plt.ylabel('Sale Price')
plt.show()

Alternatively, we can look at the relationship between every feature and the target variable at once by computing the correlations between variables.

To do this with all the features in the dataset, every feature must be numerical. We can convert the categorical features to numerical ones using one-hot encoding. `pd.get_dummies` replaces the original categorical columns with the encoded ones, so there is no need to drop them separately. We can also drop the Id column, as it is not useful for our analysis.
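
To see what one-hot encoding does before applying it to the full dataset, here is a minimal sketch on a toy DataFrame. The toy values are made up for illustration; only the column names are borrowed from the housing data.

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    'GrLivArea': [1500, 2000, 1200],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

# One indicator column per category; the original 'Street' column is replaced
full = pd.get_dummies(toy, columns=['Street'])

# drop_first=True drops the first category of each feature, since it is
# implied whenever all the remaining indicators are False
reduced = pd.get_dummies(toy, columns=['Street'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping the first category avoids a set of indicator columns that always sum to one, which matters most for linear models.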

In [None]:
# Identify non-numeric columns
non_numeric_columns = df.select_dtypes(include=['object']).columns
print("Non-numeric columns:", non_numeric_columns)

In [None]:
# Convert non-numeric columns using one-hot encoding
df_encoded = pd.get_dummies(df, columns=non_numeric_columns, drop_first=True)

In [None]:
# Drop the Id column
df_encoded.drop('Id', axis=1, inplace=True)

In [None]:
# Get correlation between variables with respect to SalePrice
correlation = df_encoded.corr()

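# Get the 10 most positively and 10 most negatively correlated features.
# SalePrice correlates perfectly with itself and always ranks first, so take
# 11 entries from the top to keep the target plus its top 10 features.
top_corr = correlation['SalePrice'].sort_values(ascending=False)[:11]
bottom_corr = correlation['SalePrice'].sort_values(ascending=True)[:10]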

# Concatenate top and bottom correlations
all_corr = pd.concat([top_corr, bottom_corr])

In [None]:
# Plot heatmap of correlation matrix for top 10 positively and top 10 negatively correlated features
plt.figure(figsize=(12, 10))
sns.heatmap(correlation.loc[all_corr.index, all_corr.index], annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Features Most Correlated with SalePrice')
plt.show()

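With the correlation matrix, you can see which features are correlated with SalePrice (these are candidate predictors) and which features are strongly correlated with each other. Features that are highly correlated with each other carry overlapping information, so it is often enough to keep one of them. Keep in mind that the correlation coefficient only measures linear relationships.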
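
The reading of the correlation matrix described above can also be done programmatically. The following is a minimal sketch on synthetic data standing in for `df_encoded`. The column names and the cut-offs of 0.5 and 0.8 are arbitrary assumptions for illustration, not recommended values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_encoded
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, 200)
toy = pd.DataFrame({
    'area': area,
    'rooms': area / 250 + rng.normal(0, 0.2, 200),  # nearly duplicates 'area'
    'noise': rng.normal(0, 1, 200),                 # unrelated to the price
})
toy['SalePrice'] = 100 * toy['area'] + rng.normal(0, 20000, 200)

corr = toy.corr()

# Rank features by absolute correlation with the target
target_corr = corr['SalePrice'].drop('SalePrice').abs().sort_values(ascending=False)
selected = target_corr[target_corr > 0.5].index.tolist()  # arbitrary cut-off

# Flag pairs of features that are strongly correlated with each other,
# looking only at the upper triangle so each pair appears once
features = corr.drop(index='SalePrice', columns='SalePrice')
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
collinear_pairs = upper.stack()
collinear_pairs = collinear_pairs[collinear_pairs.abs() > 0.8]  # arbitrary cut-off

print(selected)
print(collinear_pairs)
```

Here `area` and `rooms` both pass the target cut-off but are also flagged as a collinear pair, so one of them could be dropped from the feature set.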

## Preprocessing

In [None]:
# Check for missing values
missing = df.isnull().sum()

In [None]:
# Show only the columns that have missing values
missing_columns = missing[missing > 0]

# Convert to a percentage to see % missing values for each column
missing_percentage = (missing_columns / len(df)) * 100
print(missing_percentage)

In [None]:
# Drop columns with more than 50% missing values (or consider a lower threshold)
# thresh is the minimum number of non-missing values a column needs to be kept
df_cleaned = df.dropna(thresh=int(0.5 * len(df)), axis=1)
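
Columns that survive the threshold can still contain missing values, and most scikit-learn models will not accept them. One common option is simple imputation, sketched below on a toy DataFrame. The values are made up, and median/mode filling is an assumption for illustration rather than the right choice for every column. In the Ames data, a missing categorical value often means the feature is absent (for example, no garage), so check the data description first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in one numeric and one categorical column
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'GarageType': ['Attchd', 'Detchd', None, 'Attchd'],
})

numeric_cols = toy.select_dtypes(include='number').columns
categorical_cols = toy.select_dtypes(exclude='number').columns

filled = toy.copy()
# Numeric gaps: fill with the column median (robust to outliers)
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
# Categorical gaps: fill with the most frequent value in each column
for col in categorical_cols:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

When you use imputation in a real model, compute the medians and modes on the training set only and reuse them for the test set.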

In [None]:
# Check for duplicate rows
duplicates = df_cleaned.duplicated().sum()
print("Number of duplicate rows:", duplicates)

Can you think of any other checks to do on the data?
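
As a starting point for this question, here is a sketch of three possible extra checks: outliers, implausible values, and skewness of the target. The toy values and the helper `iqr_outliers` are hypothetical. On the real data you would pass columns of `df_cleaned` instead.

```python
import pandas as pd

# Hypothetical toy data; on the real data you would use df_cleaned instead
toy = pd.DataFrame({
    'GrLivArea': [1200, 1500, 1700, 1600, 1400, 5600],
    'YearBuilt': [1995, 2003, 1978, 2010, 1961, 2008],
    'SalePrice': [150000, 180000, 165000, 210000, 120000, 160000],
})

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

# 1. Unusually large or small values in a feature
outliers = iqr_outliers(toy['GrLivArea'])

# 2. Implausible values, e.g. non-positive areas or prices
n_non_positive = int((toy[['GrLivArea', 'SalePrice']] <= 0).sum().sum())

# 3. Skewness of the target (values far from 0 suggest a long tail)
target_skew = toy['SalePrice'].skew()

print(outliers)
print("Non-positive values:", n_non_positive)
print(f"SalePrice skewness: {target_skew:.2f}")
```

Whether a flagged value is an error or a genuinely unusual house is a judgment call, so inspect outliers before removing them.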

In [None]:
# Select features and target variable, e.g. 
X = df_cleaned[['GrLivArea', 'YearBuilt']]
y = df_cleaned['SalePrice']

In [None]:
# Encode any categorical variables among the selected features
# (purely numeric features such as GrLivArea and YearBuilt are left unchanged)
X = pd.get_dummies(X)

In [None]:
# Split data into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
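# Feature scaling: fit the scaler on the training set only, then apply the same
# transformation to the test set so no test-set information leaks into training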
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Regression model training

In [None]:
# Import regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42)
}

In [None]:
# Train and evaluate models
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Model: {name}")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R^2 Score: {r2:.2f}")
    print("-------------------")

## Model evaluation and selection

In [None]:
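# Compare models visually
plt.figure(figsize=(10, 6))
for name, model in models.items():
    y_pred = model.predict(X_test_scaled)
    plt.scatter(y_test, y_pred, label=name, alpha=0.5)
# Dashed diagonal marks perfect predictions (predicted == actual)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'k--', label='Perfect prediction')
plt.xlabel('Actual Sale Price')
plt.ylabel('Predicted Sale Price')
plt.title('Actual vs Predicted Sale Prices')
plt.legend()
plt.show()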

# Discuss model performance and selection criteria (e.g., MSE, R^2)
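
To support this discussion, it can help to collect the metrics in one table. The sketch below uses synthetic data and only two of the models so that it runs on its own. The data and variable names are assumptions for illustration. It also reports RMSE, the square root of MSE, which is in the same units as the sale price and so easier to interpret. The Kaggle competition itself scores submissions by RMSE on the logarithm of the price.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for the scaled housing features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 180000 + 50000 * X[:, 0] + 20000 * X[:, 1] + rng.normal(0, 10000, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rows.append({
        'model': name,
        'MSE': mse,
        'RMSE': np.sqrt(mse),  # same units as the target
        'R2': r2_score(y_test, y_pred),
    })

# One row per model, lowest RMSE first
results = pd.DataFrame(rows).set_index('model').sort_values('RMSE')
print(results.round(2))
```

On this synthetic data the target is linear in the features, so linear regression comes out ahead. On the housing data the ranking may differ.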

## Advanced: optimise model hyperparameters

You can read more about the hyperparameters of the models here: \
Decision trees:\
https://scikit-learn.org/stable/modules/tree.html \
Random forest:\
https://scikit-learn.org/stable/modules/ensemble.html#forest \
Finding hyperparameters and GridSearch: \
https://scikit-learn.org/stable/modules/grid_search.html

In [None]:
# Run the models again with hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Hyperparameters for Decision Tree
param_grid_dt = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid Search for Decision Tree (5-fold cross-validation; with no scoring argument,
# GridSearchCV ranks regressors by R^2)
grid_search_dt = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid_dt, cv=5)
grid_search_dt.fit(X_train_scaled, y_train)
best_dt = grid_search_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test_scaled)
mse_dt = mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)
print("Decision Tree:")
print(f"Best Parameters: {grid_search_dt.best_params_}")
print(f"Mean Squared Error: {mse_dt:.2f}")
print(f"R^2 Score: {r2_dt:.2f}")

In [None]:
# Hyperparameters for Random Forest
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

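# Grid Search for Random Forest
# 81 parameter combinations x 5 folds = 405 fits, so this cell can take a while;
# n_jobs=-1 runs the fits in parallel on all available cores
grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid_rf, cv=5, n_jobs=-1)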
grid_search_rf.fit(X_train_scaled, y_train)
best_rf = grid_search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test_scaled)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print("Random Forest:")
print(f"Best Parameters: {grid_search_rf.best_params_}")
print(f"Mean Squared Error: {mse_rf:.2f}")
print(f"R^2 Score: {r2_rf:.2f}")
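
A possible variation on the grid searches above is to wrap scaling and the model in a `Pipeline`, so the scaler is refitted on each training fold, and to score by RMSE instead of the default R^2. The sketch below uses synthetic data standing in for `X_train` and `y_train`. The grid values and the names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for X_train and y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 180000 + 50000 * X_demo[:, 0] + rng.normal(0, 10000, 200)

# Scaling inside the pipeline is refitted on each training fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeRegressor(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'tree__max_depth': [3, 5, 10],
    'tree__min_samples_leaf': [1, 4],
}

# scikit-learn maximises scores, so RMSE is exposed as a negative value
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(X_demo, y_demo)

# One row per parameter combination, best first
cv_table = (pd.DataFrame(search.cv_results_)
            [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
            .sort_values('rank_test_score'))

print(search.best_params_)
print(f"Best cross-validated RMSE: {-search.best_score_:.0f}")
print(cv_table)
```

Looking at the full `cv_results_` table, rather than only the best parameters, shows how close the runner-up settings are and how much the score varies between folds.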


Optional extra: Find and implement additional regression models available in scikit-learn: https://scikit-learn.org/stable/supervised_learning.html