_____________
# 06. Modeling Housing Prices - `Linear Regression Example`
This notebook walks through an example of modeling USA housing prices with a linear regression model.
_______________

<img src="img/linear_regression.jpg" style="width: 600px;"/>

Let's first add the imports we need for data loading and plotting:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
RANDOM_STATE = 11

## Data Loading & Exploration
Now load the provided data and take a quick look at it.

In [None]:
usa_df = pd.read_csv('data/USA_Housing.csv', low_memory=False)

In [None]:
usa_df.head().T

In [None]:
usa_df.info()

In [None]:
usa_df.describe()

In [None]:
usa_df.columns

You may also find it useful to write a few general-purpose helpers that quickly gather summary information or plots about your data.

In [None]:
def get_info(df):
    return df.head().T, df.describe(), df.columns

head, descr, cols = get_info(usa_df)

In [None]:
head

In [None]:
descr

In [None]:
cols

Such helpers are especially useful when applied row-wise or column-wise:

In [None]:
def get_info_col(df_col):
    
    col_description = df_col.describe()
    col_value_counts = df_col.value_counts()
    
    return col_description, col_value_counts

In [None]:
# Example: summary statistics and value counts for the target column
get_info_col(usa_df['Price'])

In [None]:
# What else?

In [None]:
# List over columns + Pretty printing 
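
One possible way to approach the "list over columns + pretty printing" idea is sketched below. It is only an illustration: `toy_df` is a made-up stand-in for `usa_df`, `summarize_columns` is a hypothetical helper name, and `get_info_col` is repeated so that the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for usa_df; column names and values are invented for illustration.
toy_df = pd.DataFrame({
    "Price": [100.0, 150.0, 150.0, 200.0],
    "Rooms": [3, 4, 4, 5],
})


def get_info_col(df_col):
    # Same helper as above: summary statistics plus value frequencies.
    return df_col.describe(), df_col.value_counts()


def summarize_columns(df):
    """Collect the get_info_col results for every column, keyed by column name."""
    return {col: get_info_col(df[col]) for col in df.columns}


summaries = summarize_columns(toy_df)

# Pretty-print one labelled block per column.
for col, (description, value_counts) in summaries.items():
    print(f"=== {col} ===")
    print(description.to_string())
    print("Most frequent values:")
    print(value_counts.head(3).to_string(), end="\n\n")
```

With the real data you would call `summarize_columns(usa_df)` instead of using the toy frame.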

Plot your data, and plot it often, but make sure you understand what each plot shows. Even better, make sure that a person with no data science knowledge could understand it.

In [None]:
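# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(usa_df['Price'], kde=True)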

## `Correlation`
Make sure to check the correlations between your variables, as this may affect your predictive model's performance quite a lot. 

<img src="img/correlation.jpg" style="width: 400px;"/>

    Correlation is used to test the relationships between quantitative or categorical variables. Essentially, it is a measure of how things relate to each other. The correlation coefficient ranges from -1 to 1 and describes how two variables vary together and the direction of their association.

<img src="img/cov.png" style="width: 600px;"/>

See more about correlation [here](https://medium.com/swlh/covariance-correlation-r-sqaured-5cbefc5cbe1c) and, most importantly, note that **correlation does not imply causation**:

<img src="img/pirates.png" style="width: 600px;"/>
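
To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand, as the covariance divided by the product of the standard deviations. The data is synthetic and `pearson_r` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # constructed to depend on x
z = rng.normal(size=200)                       # generated independently of x


def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    covariance = (a_centered * b_centered).mean()
    return covariance / (a.std() * b.std())


r_xy = pearson_r(x, y)  # close to 1: x and y move together
r_xz = pearson_r(x, z)  # close to 0: no linear relationship
print(r_xy, r_xz)

# NumPy's built-in estimate should agree with the manual computation.
print(np.corrcoef(x, y)[0, 1])
```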


In [None]:
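# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
usa_df.corr(numeric_only=True)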

In [None]:
usa_df.plot.scatter('Avg. Area Income', 'Price')

In [None]:
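import itertools
# Scatter plots only make sense for numeric columns, so pair up those
numeric_cols = usa_df.select_dtypes(include='number').columns
comb = list(itertools.combinations(numeric_cols, 2))
comb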

In [None]:
for fst, snd in comb:
    
    usa_df.plot.scatter(fst, snd)

In [None]:
sns.pairplot(usa_df)

In [None]:
sns.heatmap(usa_df.corr(numeric_only=True))

## Feature Selection & Train|Test Splitting

In [None]:
features = usa_df[
    ['Avg. Area Income', 
     'Avg. Area House Age', 
     'Avg. Area Number of Rooms',
     'Avg. Area Number of Bedrooms', 
     'Area Population'
    ]
]
target = usa_df['Price']

In [None]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_targets, test_targets = train_test_split(
    features, target, test_size=0.2, random_state=RANDOM_STATE)

In [None]:
for x in (train_features, test_features, train_targets, test_targets):
    print(x.shape)
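
The behaviour of `train_test_split` is easier to see on a tiny example. The sketch below uses made-up arrays (not the housing data) to show the resulting sizes for `test_size=0.2`, that features and targets stay aligned, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each; row i is [2*i, 2*i + 1] and its target is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Rows are shuffled, but each feature row still travels with its own target.
print(np.array_equal(X_train[:, 0], 2 * y_train))

# Repeating the call with the same random_state gives the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=11)
print(np.array_equal(X_test, X_test_again))
```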

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(train_features, train_targets);

In [None]:
model.intercept_

In [None]:
model.coef_

<img src="img/lr_real.jpg" style="width: 600px;"/>

In [None]:
coeff_df = pd.DataFrame(model.coef_, train_features.columns, columns=['Coefficient'])
coeff_df
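
Each coefficient is the change in the predicted target for a one-unit increase in that feature, with the other features held fixed. A small sanity check of this reading is to fit the model on synthetic data whose true coefficients are known. Everything in the sketch below (the feature names, the true values 3 and 20, the intercept 10, the noise level) is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
toy_features = pd.DataFrame({
    "income": rng.uniform(30, 100, size=500),
    "rooms": rng.uniform(2, 8, size=500),
})

# Assumed ground truth: price = 10 + 3 * income + 20 * rooms + small noise
toy_price = (10
             + 3 * toy_features["income"]
             + 20 * toy_features["rooms"]
             + rng.normal(scale=1.0, size=500))

toy_model = LinearRegression().fit(toy_features, toy_price)

# Label the fitted coefficients with their feature names, as done above.
toy_coeffs = pd.Series(toy_model.coef_, index=toy_features.columns, name="Coefficient")
print(toy_coeffs)
print("intercept:", toy_model.intercept_)
```

The fitted values should land close to the ones used to generate the data, which is what makes the coefficient table interpretable.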

## Inference
At this point, we can test our trained model on the test data we prepared earlier. Let's pick a single example first to better understand how the prediction works:

In [None]:
test_features.head(1)

In [None]:
model.coef_

In [None]:
test_features.head(1).values * model.coef_

In [None]:
(test_features.head(1).values * model.coef_).sum()

In [None]:
(test_features.head(1).values * model.coef_).sum() + model.intercept_

In [None]:
model.predict(test_features.head(1))
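
The single-row computation above generalizes to all rows at once: a prediction is the dot product of the feature matrix with the coefficients, plus the intercept. The following sketch checks this on synthetic data; `model_sketch` and the generated arrays are assumptions made for the example, not the housing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

model_sketch = LinearRegression().fit(X, y)

# Manual prediction for every row: features times coefficients, plus intercept.
manual_predictions = X @ model_sketch.coef_ + model_sketch.intercept_
library_predictions = model_sketch.predict(X)

print(np.allclose(manual_predictions, library_predictions))
```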

## Evaluation
At this point, we know how to use our model on unseen data, but we still don't know anything about its performance. Let's evaluate it using a few common metrics, starting with the MAE (Mean Absolute Error) and MSE (Mean Squared Error).

### MAE (Mean Absolute Error) & MSE (Mean Squared Error)
__________
- **MAE** - calculate the residual for every data point, take the absolute value of each (so that negative and positive residuals do not cancel out), then average these absolute residuals. Effectively, MAE describes the typical magnitude of the residuals, in the same units as the target.
- **MSE** - like the MAE, but squares each residual instead of taking its absolute value before averaging, so large errors are penalized more heavily.
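
Both definitions fit in a couple of lines of NumPy. The sketch below uses four made-up target/prediction pairs to compute $\mathrm{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ and $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ by hand.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred        # [0.5, -0.5, 0.0, -1.0]

mae = np.abs(residuals).mean()     # average absolute residual
mse = (residuals ** 2).mean()      # average squared residual

print(mae, mse)  # 0.5 0.375
```

Note how the single residual of size 1 contributes half of the MAE total but two thirds of the MSE total, which is the "large errors are penalized more" effect in action.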

<table><tr>
<td> <img src="img/mae.jpg" style="width: 630px;"/> </td>
<td> <img src="img/mse.jpg" style="width: 710px;"/> </td>
</tr></table>

<table><tr>
<td> <img src="img/mae2.jpg" style="width: 700px;"/> </td>
<td> <img src="img/mse2.jpg" style="width: 700px;"/> </td>
</tr></table>

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
test_predictions = model.predict(test_features)

In [None]:
mean_absolute_error(test_targets, test_predictions)

In [None]:
mean_squared_error(test_targets, test_predictions)

In [None]:
test_rmse = np.sqrt(mean_squared_error(test_targets, test_predictions))

In [None]:
test_rmse

In [None]:
test_r2 = model.score(test_features, test_targets)

In [None]:
test_r2
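
The last two numbers are the RMSE (the square root of the MSE, which brings the error back to the units of the target) and R², the coefficient of determination returned by `model.score`. As a rough sketch of what these measure, here they are computed by hand on made-up values; R² compares the model's squared error with that of always predicting the mean of the targets.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(((y_true - y_pred) ** 2).mean())

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

An R² of 1 means perfect predictions, while 0 means the model does no better than predicting the mean.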

In [None]:
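from sklearn.dummy import DummyRegressor
# Baseline: always predict the mean of the training targets, ignoring the features
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(train_features, train_targets)
# Predict from the test features (not the targets); use a separate name so the
# linear model's test_predictions are not overwritten
dummy_predictions = dummy_regr.predict(test_features)
mean_absolute_error(test_targets, dummy_predictions)

In [None]:
mean_squared_error(test_targets, dummy_predictions)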
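
The dummy regressor gives a baseline that any useful model should beat. To see what it actually does, the sketch below (synthetic data, hypothetical variable names) confirms that `DummyRegressor(strategy="mean")` predicts the training-set mean for every row, whatever the features are, and that its R² on the training data is therefore zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(11)
X_train = rng.normal(size=(80, 2))
y_train = 5.0 + 3.0 * X_train[:, 0] + rng.normal(size=80)
X_test = rng.normal(size=(20, 2))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_predictions = baseline.predict(X_test)

# One prediction per test row, each equal to the mean of the training targets.
print(baseline_predictions.shape)
print(np.allclose(baseline_predictions, y_train.mean()))

# Predicting the mean explains none of the variance, so R^2 on the training data is 0.
baseline_r2_train = baseline.score(X_train, y_train)
print(baseline_r2_train)
```

Comparing the linear model's MAE and MSE with the baseline's values shows how much of the error reduction comes from actually using the features.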

For additional material on linear regression models and their evaluation methods, check this [tutorial](https://www.dataquest.io/blog/understanding-regression-error-metrics/), from which most of the images were taken.