# Predicting House Prices with Linear Regression


This notebook shows how to train and evaluate a linear regression model that predicts house prices from two features: the *size* of the house (in square feet) and its *number of bedrooms*.

We start by importing the required functions and classes:

* The `pyplot` module from the `matplotlib` library, for creating the visualizations.
* The `pandas` library, for loading the data from a CSV file.
* The `train_test_split()` function from `sklearn.model_selection`, which we use to split our data into training and testing sets.
* The `LinearRegression` class from `sklearn.linear_model`, which we use to create the object that fits our model.
* The `mean_squared_error()` and `r2_score()` functions from `sklearn.metrics`, which we use to evaluate the regression model.


In [None]:
import matplotlib.pyplot as plt                          # the visualization framework
import pandas as pd                                      # data processing, CSV file I/O
from sklearn.model_selection import train_test_split     # split the data in train/test
from sklearn.linear_model import LinearRegression        # the linear regression model
from sklearn.metrics import mean_squared_error, r2_score # evaluation metrics 

## Loading the Data

The data set we use here is stored in the CSV file `./data/house_prices.csv`. We call the `read_csv()` function through the conventional `pandas` alias `pd`, passing the file path as an argument. The function returns a data frame, which we assign to the variable `data`. We then show the first five rows with the `head()` method and print the summary statistics (count, mean, standard deviation, etc.) with the `describe()` method.
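
To try these calls without the data file, here is a minimal sketch that reads a small CSV from an in-memory string. The rows and the names `csv_text` and `toy` are invented for illustration and are not taken from the real data set.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for ./data/house_prices.csv;
# the rows are made up for illustration only.
csv_text = """Size,Bedrooms,Price
1000,2,200000
1500,3,280000
2000,3,350000
2500,4,430000
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))      # first two rows instead of the default five
print(toy.describe())   # count, mean, std, min, quartiles, max per column
```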

In [None]:
# load the data set

data = pd.read_csv("./data/house_prices.csv")

print("First five rows:")
print(data.head())  # display the first 5 rows

print("\n\n")

print("Summary statistics:")
print(data.describe())  # show the summary statistics for the data set

As you can see, the dataset has three columns:

* **Size**: the size of the house in square feet.
* **Bedrooms**: the number of bedrooms in the house.
* **Price**: the price of the house in USD (our *target label*).

The number of examples in the data set (as shown in the summary statistics) is 47. 

## Splitting the Data into Train/Test

Before splitting the data, we take two slices from the data frame `data`. The first slice contains the features (the `Size` and `Bedrooms` columns), which we assign to a variable named `X`. The second slice contains the target (the `Price` column), which we assign to the variable `y`.
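
The two slices can be sketched on a tiny frame. The values and the names `toy`, `X_toy`, `y_toy`, and `X_by_name` are hypothetical; only the column layout mirrors the data set described above.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as the data set
toy = pd.DataFrame({
    "Size": [1000, 1500, 2000],
    "Bedrooms": [2, 3, 3],
    "Price": [200000, 280000, 350000],
})

X_toy = toy.iloc[:, :-1]   # every row, all columns except the last -> DataFrame
y_toy = toy.iloc[:, -1]    # every row, last column only -> Series

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)

# Selecting by column name gives the same frame without relying on column order
X_by_name = toy[["Size", "Bedrooms"]]
print(X_toy.equals(X_by_name))
```

Positional slicing is compact, but it silently picks the wrong target if the column order of the file ever changes; selecting by name is one way to guard against that.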

We then use `train_test_split()` to randomly split our data into training and testing sets. The function accepts any number of equal-length sequences; here we pass two:

* `X` sequence containing the features.
* `y` sequence containing the target label.

In addition to the sequences, we pass two parameters:

* `test_size` defines the size of the test set. A float between 0 and 1 is interpreted as the fraction of the data to hold out; an integer is interpreted as the absolute number of test examples.
* `random_state` is an integer that seeds the random shuffling applied before the split.  **To make your tests reproducible, it is essential to set this parameter. Otherwise, you will get different splits each time you run your code.**

`train_test_split()` performs the split and returns four sequences in this order:

1. `X_train`: The training part of the first sequence (`X`)
2. `X_test`: The test part of the first sequence (`X`)
3. `y_train`: The training part of the second sequence (`y`)
4. `y_test`: The test part of the second sequence (`y`)
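
The effect of `test_size` and `random_state` can be checked on a small example. This is only a sketch: the arrays `X_demo` and `y_demo` and the other variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 examples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# A float test_size is read as a fraction: 0.30 of 10 examples -> 3 test examples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)
print(np.array_equal(X_te, X_te_again))
```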

In [None]:
X = data.iloc[:, :-1]   # features: all columns except the last one
y = data.iloc[:, -1]    # the target label

# split the data into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.30, 
                                                    random_state=0)



## Visualizing the Data 

Before you fit the model, it is a good idea to explore the relationships between the variables in your data. Doing so can reveal which independent variables (features) might serve as good predictors of the target label. Based on such insights, you can keep a variable, remove it, or combine it with others.



In [None]:
# let us visualize the data

fig, axs = plt.subplots(1, 2, figsize=(14, 6))   # create a plot grid with 1 row and 2 columns


for i, ax in enumerate(axs):
    ax.scatter(X_train.iloc[:, i], y_train, s=30) # plot the feature against the label (i.e., price)
    ax.set_xlabel(data.columns[i])        
    ax.set_ylabel(data.columns[-1])

Here we make one scatter plot for each of the two feature variables against the label. As you can see, both *Size* and *Bedrooms* are positively correlated with *Price*.
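
The visual impression can be backed by a number. The sketch below computes the Pearson correlation of each feature with the target on a small made-up frame; the values and the names `toy` and `corr_with_price` are hypothetical.

```python
import pandas as pd

# Hypothetical values, invented for illustration
toy = pd.DataFrame({
    "Size": [1000, 1500, 2000, 2500, 3000],
    "Bedrooms": [2, 3, 3, 4, 4],
    "Price": [200000, 290000, 340000, 450000, 480000],
})

# Pearson correlation of each feature with the target; values lie in [-1, 1]
corr_with_price = toy.corr()["Price"].drop("Price")
print(corr_with_price)
```

Applied to the notebook's own `data` frame, the same `corr()` call would quantify the relationships seen in the scatter plots.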

## Creating the Model
Now that the data is split, it is time to fit the model. With scikit-learn, we do this in two lines of code. The first line creates a regressor object (i.e., `reg`) from the `LinearRegression` class we imported earlier. The second line calls the `reg` object's `fit()` method, passing two arguments: `X_train` and `y_train`.

In [None]:
# create the model
reg = LinearRegression()

# fit the model
reg.fit(X_train, y_train)

## The Model Parameters
After fitting the model, it is time to inspect what our model has learned. Here we are interested in the model parameters. Recall that a linear regression model has the following form:
$y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$

After learning, the vector $w = [w_0, w_1, \dots, w_n]$ will hold the model parameters. Using scikit-learn terminology,  the values $w_1, \dots, w_n$ are called the **coefficients**, and $w_0$ is called the **intercept**. 

The code below shows how to get the coefficients and the intercept of the learned regressor. The output should look like the following:

```
coefficient:  [  131.77059948 -4570.48794717]
intercept/bias: 89793.79977782266
```

Given this, our model function can be written as:

$\text{price} = 89793.8 + 131.8 \times \text{size} - 4570.5 \times \text{bedrooms}$

In [None]:
# let us print the model parameters
print("coefficient: ", reg.coef_)
print("intercept/bias:", reg.intercept_)
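
To build intuition for what `coef_` and `intercept_` hold, here is a minimal sketch that fits a model on noise-free synthetic data generated from a known linear function. The data and the names `X_demo`, `y_demo`, and `demo_reg` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 3 + 2*x1 - 1*x2
X_demo = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 5.0],
                   [4.0, 3.0],
                   [5.0, 8.0]])
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_reg = LinearRegression().fit(X_demo, y_demo)

print("coefficients:", demo_reg.coef_)     # close to [2, -1], i.e. w_1 and w_2
print("intercept:", demo_reg.intercept_)   # close to 3, i.e. w_0
```

Because the data contain no noise, the fitted parameters match the generating function up to floating-point rounding. On real data such as the house prices, they are the least-squares best fit instead.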

## Making Predictions
Making predictions with our model is as easy as fitting it in the first place. For example, let us say we want to predict the price of a house whose size is 400 square feet and which has four bedrooms. To do so, we create a data frame that contains a single row with these values and pass it to the `predict()` method of our regression object `reg`, which calculates the corresponding $y$ value based on the learned model.

In [None]:
# let us predict the value of the house with
# 400 square feet
# 4 bedrooms

# sample data dictionary
sample_data = {"Size": [400],
              "Bedrooms": [4]}

# create a data frame for our sample data
sample_df = pd.DataFrame(sample_data, columns=["Size", "Bedrooms"])

value = reg.predict(sample_df)

print("The predicted price for a 400 sq ft house with 4 bedrooms is: ", value[0])
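
As a cross-check, the prediction can be reproduced by hand from the model function. The sketch below assumes the coefficient and intercept values from the sample output shown earlier (your fitted values may differ). The names `coef`, `intercept`, `sample`, and `manual_price` are chosen only for illustration.

```python
import numpy as np

# Parameters copied from the sample output shown earlier; a different
# data file or split would give different numbers.
coef = np.array([131.77059948, -4570.48794717])   # weights for Size, Bedrooms
intercept = 89793.79977782266

sample = np.array([400, 4])   # 400 square feet, 4 bedrooms

# price = w0 + w1 * size + w2 * bedrooms
manual_price = intercept + sample @ coef
print(f"Manual prediction: {manual_price:.2f}")
```

If `reg` holds the same parameters, the value returned by `reg.predict()` should agree with `manual_price` up to floating-point rounding.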

## Model Evaluation
Last, we evaluate our model using several regression metrics that compare our model's predictions against the ground truth. We first call the `predict()` method, passing the `X_test` split of the data. The returned predictions for the examples in `X_test` are saved in the `y_pred` array. We then pass `y_pred` along with the ground truth (i.e., `y_test`) to the scoring functions `mean_squared_error()` and `r2_score()` to calculate the mean squared error (MSE) and the coefficient of determination $R^2$; the root mean squared error (RMSE) is the square root of the MSE. These metrics give some indication of how well the `LinearRegression` model performs on the given data set.
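
To make the three metrics concrete, the following sketch computes them by hand with NumPy and compares the results with the scikit-learn functions. The arrays `y_true` and `y_hat` are hypothetical values, not output from the house price model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical ground truth and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

mse_manual = np.mean(residuals ** 2)             # average squared error
rmse_manual = np.sqrt(mse_manual)                # same units as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot                # share of variance explained

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because the RMSE is expressed in the units of the target (USD for the house prices), it is usually easier to interpret than the MSE.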

Note that these metrics are of little value unless the performance of the model is compared with that of other models. Therefore, I encourage you to train another model (perhaps `sklearn.svm.LinearSVR`) and compare it against the linear regression model.

In [None]:
# now we evaluate the model
y_pred = reg.predict(X_test)

# calculate the MSE, RMSE and R^2
mse = mean_squared_error(y_test, y_pred)
print(f"MSE = {mse:.2f}")

rmse = mse ** 0.5     # square root of the MSE (the `squared` argument is deprecated since scikit-learn 1.4)
print(f"RMSE = {rmse:.2f}")

r2score = r2_score(y_test, y_pred)
print(f"R2 Score = {r2score:.2f}")


**Congratulations!**

You have just built your first machine learning workflow using the scikit-learn library. Don't stop here. Play around with the provided code and try to build other regression models. See if you can find a model that performs better than the one obtained here.