## Intro to Machine Learning: Linear Regression
#### Example using the Home Price Data from Kaggle
##### Datasets can be downloaded here: 
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

#### Problem Description:

We are trying to predict the sale price for a group of houses. We have two datasets. The first contains information about a set of houses, along with the price each home actually sold for. The second contains the same information about a different set of homes, but does NOT include the sale price.

The goal of this notebook is to predict the house price using two different models: a linear regression and a random forest regressor.

We will then evaluate our models, and submit the predictions from the one with the best performance to Kaggle!

#### Notebook Overview:

1) Read in the data and prepare for model building.
2) Create model(s) and train them on the 'train' dataset.
3) Evaluate their success using a metric called "Mean Absolute Error" (MAE for short).
4) Use the model with the best performance to make predictions and send to Kaggle to see how we stack up. 

#### 1) Read in the data and prepare for model building:

In [1]:
# Python libraries we'll need to manipulate the data and create predictions

import numpy as np
import sklearn
import pandas as pd

# Allow us to view entire dataset without it being truncated
#pd.set_option('display.max_columns', 500)



In [2]:
# Uncomment and run this cell block if the cell above gives an error. This will install the libraries

#!pip install -U scikit-learn
#!pip install -U pandas
#!pip install -U numpy

In [3]:
# To download the data, go to:
# https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

# Scroll to the bottom, click on 'download all' in the bottom right, unzip the files,
# and save them in the same directory as this notebook.

In [4]:
# Once the files are downloaded, unzipped, and saved in the right location, we can run this code to read them into our notebook:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Note that we are given two files: train.csv and test.csv

# The two files are similar, but not identical. One row is one house that was sold, and both contain 80 columns describing
# info about the home. 

# The training dataset has one extra column, called SalePrice. We will use that column, along with the features in our training set,
# to create a model that predicts the home value for each of the records in the test set. 

In [5]:
# Display the first 5 rows of the training set:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [6]:
# There are a lot of columns, so let's keep just a few that have numeric values:

numeric_columns = ['LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd', 'GrLivArea', 
                   'FullBath','YrSold']


# Keep the Sale Price column, we'll need this later.
sale_price = train['SalePrice']

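# Drop all but a few of the columns for both the train and the test datasets.
# .copy() makes train an independent DataFrame, so adding SalePrice back below
# does not trigger pandas' SettingWithCopyWarning:
train = train[numeric_columns].copy()
test = test[numeric_columns]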

# Add the sale price column back to the training dataset
train['SalePrice'] = sale_price


# Display first 5 rows of the training dataset:
train.head()

Unnamed: 0,LotArea,OverallQual,YearBuilt,YearRemodAdd,GrLivArea,FullBath,YrSold,SalePrice
0,8450,7,2003,2003,1710,2,2008,208500
1,9600,6,1976,1976,1262,2,2007,181500
2,11250,7,2001,2002,1786,2,2008,223500
3,9550,7,1915,1970,1717,1,2006,140000
4,14260,8,2000,2000,2198,2,2008,250000


#### 2) Create two models to predict the SalePrice for houses in our testing dataset. 

First, we need to separate out the columns we're using to predict (typically assigned the variable name 'X') from the column we're trying to predict (in this case SalePrice, typically called 'y').

Next we'll need to split up our training dataset into two datasets: one called a training dataset and one called a validation dataset.

We'll use the training dataset to train two models, and then we'll use the validation dataset to see which one performed better.

Cell [7] does both steps: it separates X from y, then splits each one into a training half and a validation half.
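
To make the two steps concrete before running them on the real data, here is a minimal sketch on a tiny made-up table. The values and the `toy_` variable names are assumptions for illustration only; they do not come from the Kaggle files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical six-house table, made up for illustration (NOT the Kaggle data)
toy = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 2500, 3000, 3500],
    'FullBath': [1, 2, 2, 2, 3, 3],
    'SalePrice': [100000, 150000, 200000, 250000, 300000, 350000],
})

# Step 1: separate the target column (y) from the predictor columns (X)
toy_y = toy['SalePrice']
toy_X = toy.drop('SalePrice', axis=1)

# Step 2: split X and y into training and validation halves
toy_X_train, toy_X_validate, toy_y_train, toy_y_validate = train_test_split(
    toy_X, toy_y, test_size=0.50, random_state=0)

print(toy_X_train.shape, toy_X_validate.shape)
```

With six rows and `test_size=0.50`, three rows land in each half, and no row appears in both.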

In [7]:
# Separate the training datasets into X, y:

y = train['SalePrice']
X = train.drop('SalePrice', axis = 1)


# Split the datasets into a training and validation dataset using sklearn's pre-built function:
from sklearn.model_selection import train_test_split

X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size=0.50, random_state=0)

In [8]:
# Create a Linear Regression using sklearn.

# Create a model and fit it to our training dataset
# We could 'tune' the model here by changing its parameters, but let's just use the default parameters that sklearn provides.

from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression().fit(X_train, y_train)

In [9]:
# Create another regression model using a Random forest

from sklearn.ensemble import RandomForestRegressor
rf_regression = RandomForestRegressor(random_state=0).fit(X_train, y_train)

In [10]:
# Let's create predictions using our model on the validation dataset.

# Predictions for Linear Regression (LR):
y_pred_LR = linear_regression.predict(X_validate)

In [11]:
# Let's create predictions using our model on the validation dataset.

# Predictions for Random Forest Regressor (RF):
y_pred_RF = rf_regression.predict(X_validate)

#### 3) Evaluate the models using Mean Absolute Error (MAE)

In [12]:
# Calculate Mean Absolute Error (MAE) for each model.
# MAE is a metric measuring "how far off" our predictions are from the actual values. We want this to be a LOW VALUE.

# We'll use sklearn's pre-built function to calculate it:
from sklearn.metrics import mean_absolute_error

LR_MAE = mean_absolute_error(y_validate, y_pred_LR)
print('Mean Absolute Error of the Linear Regression: ' + str(LR_MAE))

RF_MAE = mean_absolute_error(y_validate, y_pred_RF)
print('Mean Absolute Error of the Random Forest Regression: ' + str(RF_MAE))

Mean Absolute Error of the Linear Regression: 26686.246576391557
Mean Absolute Error of the Random Forest Regression: 22505.776952511416


The Random Forest performed better on the validation set: its predictions were off by about 22.5k on average, while the Linear Regression's predictions were off by about 26.7k on average.
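
To see what "off by X on average" means, here is a small sketch that computes MAE by hand and compares it to sklearn's function. The four prices and predictions below are made up for illustration; they are not taken from the models above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual prices and predictions, made up for illustration only
y_true = np.array([200000, 150000, 320000, 275000])
y_pred = np.array([210000, 140000, 300000, 280000])

# MAE: take the absolute difference for each house, then average them
manual_mae = np.mean(np.abs(y_true - y_pred))

# sklearn's built-in function should give the same number
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae)
print(sklearn_mae)
```

The absolute errors here are 10000, 10000, 20000 and 5000, so both calculations give 11250.0.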

#### 4) Use the Random Forest Regressor to predict SalePrice for the testing set.

In [13]:
# Predict SalePrice for every house in the test set using the trained Random Forest:
y_pred = rf_regression.predict(test)

In [14]:
# Optional: Look at sample submissions so we know how to submit our final predictions:

pd.read_csv('sample_submission.csv').head()

Unnamed: 0,Id,SalePrice
0,1461,169277.052498
1,1462,187758.393989
2,1463,183583.68357
3,1464,179317.477511
4,1465,150730.079977


Lastly, we'll format our predictions in the way Kaggle wants, export the data as a CSV, and upload back to Kaggle to see how we did:

In [15]:
# Get the Id column from the test set
test_ids = pd.read_csv('test.csv')['Id']

# Zip the id column and our predictions together:
df_final = pd.DataFrame(zip(test_ids, y_pred), columns = ['Id', 'SalePrice'])

df_final.to_csv('submission.csv', index = False)

In [16]:
# Read back in and display to make sure formatting looks good:

pd.read_csv('submission.csv')

Unnamed: 0,Id,SalePrice
0,1461,126645.830000
1,1462,161941.700000
2,1463,186874.976667
3,1464,180405.000000
4,1465,224520.000000
...,...,...
1454,2915,83049.750000
1455,2916,83049.750000
1456,2917,161540.500000
1457,2918,118675.000000


#### Submit your scores!

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

#### Not satisfied with the results? There's a lot more we can do to improve the quality of the model, including:

1) Include more of the features. We only included features that don't have missing values. We could include other features, but we'd have to employ some imputation (i.e., fill in missing values with something) to use them for modeling.
2) Conduct feature engineering to ensure only the most important information is being included.
3) Tune the model's parameters to improve MAE.
4) Try different validation techniques, such as k-fold cross-validation, instead of a single train/validation split.
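
As a starting point for ideas 1 and 4, here is a minimal sketch that combines median imputation with 5-fold cross-validation. It runs on synthetic data rather than the Kaggle files, so the numbers it prints say nothing about the real competition. The `demo` table, the price formula, and the choice of `LotFrontage` as the column with gaps are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the housing data (NOT the Kaggle files)
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    'GrLivArea': rng.integers(600, 3000, size=n_rows).astype(float),
    'LotFrontage': rng.integers(20, 120, size=n_rows).astype(float),
})

# Made-up price formula plus noise, just so the model has something to learn
demo_price = (100 * demo['GrLivArea'] + 500 * demo['LotFrontage']
              + rng.normal(0, 10000, size=n_rows))

# Blank out 30 LotFrontage values to mimic a column with missing data
missing_rows = rng.choice(n_rows, size=30, replace=False)
demo.loc[missing_rows, 'LotFrontage'] = np.nan

# The pipeline fills missing values with the column median, then fits the forest.
# Inside cross-validation, the median is learned from each training fold only.
pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

# sklearn reports MAE as a negative score (higher is better), so flip the sign
scores = cross_val_score(pipeline, demo, demo_price, cv=5,
                         scoring='neg_mean_absolute_error')
cv_mae = -scores.mean()

print('Cross-validated MAE: ' + str(cv_mae))
```

Averaging MAE over five folds gives a steadier estimate than a single split, at the cost of training the model five times.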