# House Price Prediction with Linear Regression

In this project, I'm going to predict the price of a house using information such as its location, area, and number of rooms. I'll use the dataset from the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition on [Kaggle](https://kaggle.com). I'll follow a step-by-step process to train my model:

1. Download and explore the data
2. Prepare the dataset for training
3. Train a linear regression model
4. Make predictions and evaluate the model

## Step 1 - Download and Explore the Data

The dataset is available as a ZIP file at the following URL:

In [None]:
dataset_url = 'https://github.com/alkatomar19/MachineLearning/tree/main/HousingPrediction/dataset/house-prices-advanced-regression-techniques.zip'

In [None]:
from urllib.request import urlretrieve

In [None]:
urlretrieve(dataset_url, 'house-prices.zip')

In [None]:
from zipfile import ZipFile

In [None]:
with ZipFile('house-prices.zip') as f:
    f.extractall(path='house-prices')
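
Before extracting, it can be worth confirming that the downloaded file really is a ZIP archive, since a URL that points at a web page instead of the raw file would yield HTML. The following is a minimal sketch using `zipfile.is_zipfile` from the standard library; the file names and contents are made up for the demonstration.

```python
import os
import tempfile
from zipfile import ZipFile, is_zipfile

# Hypothetical demo files created in a temporary directory.
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, 'good.zip')
bad_path = os.path.join(tmp_dir, 'not-a-zip.zip')

# A real (tiny) ZIP archive containing one made-up CSV file.
with ZipFile(good_path, 'w') as f:
    f.writestr('train.csv', 'Id,SalePrice\n1,100000\n')

# An HTML page saved with a .zip extension, as a wrong URL might produce.
with open(bad_path, 'w') as f:
    f.write('<html><body>Not a ZIP file</body></html>')

good_is_zip = is_zipfile(good_path)
bad_is_zip = is_zipfile(bad_path)
print(good_is_zip, bad_is_zip)
```

If the same check on `house-prices.zip` reports that the file is not a ZIP archive, the download URL is the first thing to inspect.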

The dataset is extracted to the folder `house-prices`. Let's view the contents of the folder using the [`os`](https://docs.python.org/3/library/os.html) module.

In [None]:
import os

In [None]:
data_dir = 'house-prices'

In [None]:
os.listdir(data_dir)

We'll use the data in the file `train.csv` for training our model. We can load it for processing using the [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html) library.

In [None]:
import pandas as pd
pd.options.display.max_columns = 200
pd.options.display.max_rows = 200

In [None]:
train_csv_path = data_dir + '/train.csv'
train_csv_path

In [None]:
# Load the data from the file `train.csv` into a Pandas data frame.
prices_df = pd.read_csv(train_csv_path)

In [None]:
# Show the columns and data types within the dataset.
prices_df.info()

In [None]:
# How many rows and columns does the dataset contain?
n_rows = prices_df.shape[0]
n_cols = prices_df.shape[1]
print('The dataset contains {} rows and {} columns.'.format(n_rows, n_cols))


Next, let's explore and visualize data from the various columns within the dataset and study their relationship with the price of the house (using scatter plots and correlations).

In [None]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# The following settings improve the default style and font sizes for our charts.

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
fig = px.scatter(prices_df, 
                 x='TotalBsmtSF', 
                 y='SalePrice', 
                 title='Basement Size vs. Sales Price')
fig.update_traces(marker_size=5)
fig.show()
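
The scatter plot gives a visual impression of the relationship; a correlation coefficient puts a number on it. As a rough sketch, here is how correlations with the target could be computed with Pandas. The small `toy_df` below is invented for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up values, only meant to show the mechanics.
toy_df = pd.DataFrame({
    'TotalBsmtSF': [800, 950, 1200, 1500, 1700],
    'YrSold': [2008, 2010, 2007, 2009, 2006],
    'SalePrice': [120000, 140000, 185000, 230000, 260000],
})

# Pearson correlation of every column with the target, strongest positive first.
price_corr = (toy_df.corr()['SalePrice']
              .drop('SalePrice')
              .sort_values(ascending=False))
print(price_corr)
```

On the real data, the same idea would look like `prices_df.corr(numeric_only=True)['SalePrice']`, assuming a Pandas version that supports the `numeric_only` argument.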


## Step 2 - Prepare the Dataset for Training

Before we can train the model, we need to prepare the dataset. Here are the steps we'll follow:

1. Identify the input and target column(s) for training the model.
2. Identify numeric and categorical input columns.
3. [Impute](https://scikit-learn.org/stable/modules/impute.html) (fill) missing values in numeric columns.
4. [Scale](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) values in numeric columns to the $[0, 1]$ range.
5. [Encode](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) categorical data into one-hot vectors.
6. Split the dataset into training and validation sets.


### Identify Inputs and Targets

While the dataset contains 81 columns, not all of them are useful for modeling. 

- The first column `Id` is a unique ID for each house and isn't useful for training the model.
- The last column `SalePrice` contains the value we need to predict i.e. it's the target column.
- Data from all the other columns (except the first and the last column) can be used as inputs to the model.
 

In [None]:
prices_df.head()

In [None]:
# Identify the input columns (a list of column names; `difference` returns them sorted alphabetically)
input_cols = prices_df.columns.difference(['Id', 'SalePrice']).tolist()

In [None]:
# Identify the name of the target column (a single string, not a list)
target_col = 'SalePrice'

Make sure that the `Id` and `SalePrice` columns are not included in `input_cols`.

Now that we've identified the input and target columns, we can separate input & target data.

In [None]:
inputs_df = prices_df[input_cols].copy()

In [None]:
targets = prices_df[target_col]

In [None]:
inputs_df.head()

In [None]:
targets

### Identify Numeric and Categorical Data

The next step in data preparation is to identify numeric and categorical columns. We can do this by looking at the data type of each column.

In [None]:
prices_df.info()

We'll create two lists, `numeric_cols` and `categorical_cols`, containing the names of the numeric and categorical input columns in the data frame, respectively. Numeric columns have data types `int64` and `float64`, whereas categorical columns have the data type `object`.

In [None]:
import numpy as np
numeric_cols = inputs_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = inputs_df.select_dtypes(include=['object']).columns.tolist()

In [None]:
print(list(numeric_cols))

In [None]:
print(list(categorical_cols))

### Impute Numerical Data

Let's check which numeric columns contain missing values.

In [None]:
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

I'll replace the missing values with the average value of each column using the `SimpleImputer` class from `sklearn.impute`.


In [None]:
from sklearn.impute import SimpleImputer
# 1. Create the imputer
imputer = SimpleImputer(strategy='mean')
# 2. Fit the imputer to the numeric columns
imputer.fit(inputs_df[numeric_cols])
# 3. Transform and replace the numeric columns
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])

# After imputation, none of the numeric columns should contain any missing values.
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]  # should be an empty Series
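
To make the effect of mean imputation concrete, here is a small sketch on a made-up array. Each `nan` is replaced by the mean of the observed values in its own column; the numbers are invented and unrelated to the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: column 0 is missing one value, column 1 is missing another.
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

toy_imputer = SimpleImputer(strategy='mean')
filled = toy_imputer.fit_transform(toy)

# The learned per-column means are stored on the fitted imputer.
column_means = toy_imputer.statistics_
print(column_means)
print(filled)
```

The fitted means are what get reused later when the same imputer transforms new inputs, so new data is filled with training-set averages.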

### Scale Numerical Values

The numeric columns in our dataset have varying ranges.

In [None]:
inputs_df[numeric_cols].describe().loc[['min', 'max']]

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Create the scaler
scaler = MinMaxScaler()
# Fit the scaler to the numeric columns
scaler.fit(inputs_df[numeric_cols])
# Transform and replace the numeric columns
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])


In [None]:
# After scaling, every numeric column should lie in the range [0, 1].
inputs_df[numeric_cols].describe().loc[['min', 'max']]
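
Min-max scaling applies the formula `(x - min) / (max - min)` to each column independently. The hand-written function below is only a sketch of that formula on made-up numbers, not the scikit-learn implementation; `min_max_scale` and `lot_areas` are hypothetical names.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

lot_areas = [5000, 7500, 10000, 15000]  # made-up values
scaled = min_max_scale(lot_areas)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, which is why the `min` and `max` rows of the scaled data frame are expected to show 0 and 1.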

### Encode Categorical Columns

Our dataset contains several categorical columns, each with a different number of categories.

In [None]:
inputs_df[categorical_cols].nunique().sort_values(ascending=False)

In [None]:
from sklearn.preprocessing import OneHotEncoder
# 1. Create the encoder (`sparse_output` replaces the older `sparse` argument in scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# 2. Fit the encoder to the categorical columns
encoder.fit(inputs_df[categorical_cols])
# 3. Generate column names for each category (`get_feature_names_out` replaces `get_feature_names`)
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
print(len(encoded_cols))
# 4. Transform and add new one-hot category columns
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])
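
One-hot encoding turns a single categorical column into one 0/1 column per category. The pure-Python sketch below shows the idea on a tiny made-up column; it is a hand-rolled illustration with a hypothetical helper `one_hot_encode`, not how `OneHotEncoder` is implemented.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    vectors = [[1.0 if value == cat else 0.0 for cat in categories]
               for value in values]
    return categories, vectors

streets = ['Pave', 'Grvl', 'Pave']  # toy column
categories, vectors = one_hot_encode(streets)

# New column names follow a "<column>_<category>" pattern.
new_cols = ['Street_' + cat for cat in categories]
print(new_cols)
print(vectors)
```

Each row has exactly one 1, in the position of its category, which is why a column with many categories expands into many new columns.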


### Training and Validation Set

Finally, let's split the dataset into a training set and a validation set. I'll use a randomly selected 25% subset of the data for validation. Also, we'll use just the numeric and encoded columns, since the inputs to our model must be numbers.

In [None]:
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df[numeric_cols + encoded_cols], targets, test_size=0.25, random_state=42)
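
As a quick sanity check of what `test_size=0.25` does, here is a sketch on eight made-up rows: two rows go to the validation set and six stay in the training set, and each input keeps its matching target. The `toy_` and shortened variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

toy_inputs = [[i] for i in range(8)]      # 8 made-up rows with one feature
toy_targets = [10 * i for i in range(8)]  # matching made-up targets

tr_in, va_in, tr_tg, va_tg = train_test_split(
    toy_inputs, toy_targets, test_size=0.25, random_state=42)

print(len(tr_in), len(va_in))
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the validation set on every run.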

## Step 3 - Train a Linear Regression Model

We're now ready to train the model. I'll use Ridge Regression, a variant of linear regression that uses L2 regularization: it adds a penalty on the squared size of the weights to the loss, which discourages large weights and often helps the model generalize better.

In [None]:
from sklearn.linear_model import Ridge
# Create the model
model = Ridge(random_state=42)
# Fit the model using inputs and targets
model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)
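
To see what the L2 penalty does, the sketch below solves a small ridge problem in closed form with NumPy and compares the weights with and without the penalty. It is a simplified illustration on synthetic data: there is no intercept term, and `ridge_weights` is a hypothetical helper, not the solver scikit-learn uses.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 in closed form (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: 20 rows, 3 features, known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

w_plain = ridge_weights(X, y, alpha=0.0)   # ordinary least squares
w_ridge = ridge_weights(X, y, alpha=10.0)  # with L2 penalty

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The penalized weights have a smaller norm than the unpenalized ones; a larger `alpha` shrinks them further, trading some fit on the training data for stability.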



## Step 4 - Make Predictions and Evaluate Your Model

The model is now trained, and we can use it to generate predictions for the training and validation inputs. We can evaluate the model's performance using the RMSE (root mean squared error) loss function.

In [None]:
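import numpy as np
from sklearn.metrics import mean_squared_error
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
train_preds = model.predict(X_train)
# RMSE is the square root of the mean squared error (computed explicitly, since the `squared=False` shortcut is deprecated in newer scikit-learn versions)
train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
print('The RMSE loss for the training set is ${:,.2f}.'.format(train_rmse))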


In [None]:
val_preds = model.predict(X_val)
# Take the square root of the mean squared error to get the RMSE
val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))
print('The RMSE loss for the validation set is ${:,.2f}.'.format(val_rmse))
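
RMSE is the square root of the average squared difference between targets and predictions, so it is expressed in the same unit as the target (dollars here). A minimal hand-computed sketch on made-up numbers, with a hypothetical helper `rmse`:

```python
import math

def rmse(targets, preds):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, preds)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

toy_targets = [200000, 150000, 300000]  # made-up sale prices
toy_preds = [210000, 140000, 300000]    # made-up predictions
toy_rmse = rmse(toy_targets, toy_preds)
print(round(toy_rmse, 2))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.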


### Feature Importance

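Let's look at the weights assigned to different columns to get a sense of which columns in the dataset matter most. Sorting in descending order shows the columns that push the predicted price up the most; large negative weights are just as influential.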

In [None]:
weights = model.coef_
weights_df = pd.DataFrame({
    'columns': train_inputs.columns,
    'weight': weights
}).sort_values('weight', ascending=False)
weights_df.head()
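
Since a large negative weight matters as much as a large positive one, it can also help to rank columns by the absolute value of their weight. The sketch below does this on an invented table; the feature names and numbers are placeholders, not results from the model.

```python
import pandas as pd

toy_weights_df = pd.DataFrame({
    'columns': ['feature_a', 'feature_b', 'feature_c', 'feature_d'],
    'weight': [90000.0, -250000.0, -500.0, 60000.0],  # invented numbers
})

# Sort by magnitude, ignoring the sign of each weight.
by_magnitude = toy_weights_df.sort_values(
    'weight', key=lambda s: s.abs(), ascending=False)
print(by_magnitude)
```

Applied to `weights_df`, the same `sort_values` call would surface the columns with the strongest influence in either direction.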

### Making Predictions

The model can be used to make predictions on new inputs using the following helper function:

In [None]:
def predict_input(single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols].values)
    X_input = input_df[numeric_cols + encoded_cols]
    return model.predict(X_input)[0]

In [None]:
sample_input = { 'MSSubClass': 20, 'MSZoning': 'RL', 'LotFrontage': 77.0, 'LotArea': 9320,
 'Street': 'Pave', 'Alley': None, 'LotShape': 'IR1', 'LandContour': 'Lvl', 'Utilities': 'AllPub',
 'LotConfig': 'Inside', 'LandSlope': 'Gtl', 'Neighborhood': 'NAmes', 'Condition1': 'Norm', 'Condition2': 'Norm',
 'BldgType': '1Fam', 'HouseStyle': '1Story', 'OverallQual': 4, 'OverallCond': 5, 'YearBuilt': 1959,
 'YearRemodAdd': 1959, 'RoofStyle': 'Gable', 'RoofMatl': 'CompShg', 'Exterior1st': 'Plywood',
 'Exterior2nd': 'Plywood', 'MasVnrType': 'None','MasVnrArea': 0.0,'ExterQual': 'TA','ExterCond': 'TA',
 'Foundation': 'CBlock','BsmtQual': 'TA','BsmtCond': 'TA','BsmtExposure': 'No','BsmtFinType1': 'ALQ',
 'BsmtFinSF1': 569,'BsmtFinType2': 'Unf','BsmtFinSF2': 0,'BsmtUnfSF': 381,
 'TotalBsmtSF': 950,'Heating': 'GasA','HeatingQC': 'Fa','CentralAir': 'Y','Electrical': 'SBrkr', '1stFlrSF': 1225,
 '2ndFlrSF': 0, 'LowQualFinSF': 0, 'GrLivArea': 1225, 'BsmtFullBath': 1, 'BsmtHalfBath': 0, 'FullBath': 1,
 'HalfBath': 1, 'BedroomAbvGr': 3, 'KitchenAbvGr': 1,'KitchenQual': 'TA','TotRmsAbvGrd': 6,'Functional': 'Typ',
 'Fireplaces': 0,'FireplaceQu': np.nan,'GarageType': np.nan,'GarageYrBlt': np.nan,'GarageFinish': np.nan,'GarageCars': 0,
 'GarageArea': 0,'GarageQual': np.nan,'GarageCond': np.nan,'PavedDrive': 'Y', 'WoodDeckSF': 352, 'OpenPorchSF': 0,
 'EnclosedPorch': 0,'3SsnPorch': 0, 'ScreenPorch': 0, 'PoolArea': 0, 'PoolQC': np.nan, 'Fence': np.nan, 'MiscFeature': 'Shed',
 'MiscVal': 400, 'MoSold': 1, 'YrSold': 2010, 'SaleType': 'WD', 'SaleCondition': 'Normal'}

In [None]:
predicted_price = predict_input(sample_input)
print('The predicted sale price of the house is ${}'.format(predicted_price))

### Saving the model

Let's save the model (along with other useful objects) to disk, so that we can use it for making predictions without retraining.

In [None]:
import joblib
house_price_predictor = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}
joblib.dump(house_price_predictor, 'house_price_predictor.joblib')
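
Loading works the same way in reverse with `joblib.load`. The sketch below round-trips a small stand-in dictionary through a temporary file; the bundle contents and file name are made up so that the example runs on its own.

```python
import os
import tempfile
import joblib

# A stand-in bundle; the real one also holds the model and preprocessors.
bundle = {'numeric_cols': ['LotArea', 'YearBuilt'], 'target_col': 'SalePrice'}

path = os.path.join(tempfile.mkdtemp(), 'demo_bundle.joblib')
joblib.dump(bundle, path)

# Later, possibly in another session: read the objects back from disk.
restored = joblib.load(path)
print(restored['target_col'])
```

For the file saved above, `joblib.load('house_price_predictor.joblib')` would return the same dictionary of objects. It is generally safest to load it with the same scikit-learn version that was used for training.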