# The dataset

<img src="https://www.cityofames.org/Home/ShowImage?id=6334&t=635943415687730000">

(Image source: [City of Ames homepage](https://www.cityofames.org/about-ames))

We will use the "Ames Housing" dataset that describes properties in Ames (Iowa) together with their estimated value.
The list and explanation of all features can be found [here](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt), but we will concentrate on only a few columns, namely:

- Lot Area (Continuous): Lot size in square feet
- Gr Liv Area (Continuous): Above grade (ground) living area square feet
- Total Bsmt SF (Continuous): Total square feet of basement area
- Garage Cars (Discrete): Size of garage in car capacity
- Overall Qual (Ordinal): Rates the overall material and finish of the house
- Overall Cond (Ordinal): Rates the overall condition of the house
- SalePrice (Continuous): Sale price


Our task is to predict the sale price from the other variables.

# Task: Loading data and normalizing column names

In [None]:
! wget "https://drive.google.com/uc?export=download&id=1PZT1MrswHXYuNUiYxRcPcBZe81uVdPM9" -O AmesHousing.csv

In [None]:
import numpy as np
random_seed = 111222
np.random.seed(random_seed)  # This is good practice for reproducibility

In [None]:
# Please load the dataset from AmesHousing.csv with Pandas
# and look into it

...

In [None]:
## nothing to do, just check the column names
df.columns

In [None]:
# Let's keep only the numeric columns listed above
df = df[["Lot Area", "Gr Liv Area", "Total Bsmt SF", "Overall Qual", "Overall Cond", "Garage Cars", "SalePrice"]]
df.head()

In [None]:
## NOTE: THIS CELL IS AN **OPTIONAL** TASK FOR YOU TO PRACTICE DATA PREPARATION
## IF YOU WANT TO JUST FOCUS ON REGRESSION, SKIP **THIS CELL**

# Please normalize the column names by lowercasing them and replacing spaces with '_'.
# Write a function and pass it to the dataframe's rename method
...

# Let's see how nice it is! :-)
df.head()
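
The normalization step described above can be sketched on a toy dataframe. This is only an illustration, not the expected solution: the helper name `normalize_column_name` and the frame `toy` are hypothetical, and the values are made up.

```python
import pandas as pd


def normalize_column_name(name: str) -> str:
    """Lowercase a column name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


# Toy frame with Ames-style headers (hypothetical, made-up values).
toy = pd.DataFrame(
    {"Lot Area": [8450, 9600], "Gr Liv Area": [1710, 1262], "SalePrice": [1, 2]}
)

# rename accepts a callable that is applied to every column label.
toy = toy.rename(columns=normalize_column_name)
print(list(toy.columns))
```

Passing the function itself (rather than a dictionary of old and new names) keeps the rule in one place, so it applies to any column that is added later.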

In [None]:
## run this cell to make sure that later, regression-related cells work for you, 
## which refer to the lowercase saleprice variable
df.rename(columns=str.lower, inplace=True)

In [None]:
df.describe()

In [None]:
# Please drop the rows that contain "empty" (missing) values.
# Remember to use the "inplace" syntax for dropping,
# and that you are dropping rows, not columns (dropping rows is the default in pandas).

...

df.describe()
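
As a rough illustration of what dropping rows with missing values does, here is a minimal sketch on a toy dataframe. The frame `toy` and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (hypothetical, made-up values).
toy = pd.DataFrame(
    {
        "total_bsmt_sf": [856.0, np.nan, 920.0, 756.0],
        "garage_cars": [2.0, 2.0, np.nan, 3.0],
        "saleprice": [208500, 181500, 223500, 140000],
    }
)

# Count the missing values per column before dropping.
print(toy.isna().sum())

# axis=0 drops rows (the default); inplace=True modifies toy directly.
toy.dropna(axis=0, inplace=True)
print(toy.shape)
```

Checking `isna().sum()` before and the shape after is a cheap way to confirm how many rows were lost.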

# Task: Dividing the data

Please ALWAYS follow this pattern in your projects: without held-out validation and test data you have no estimate of real performance!

In [None]:
from sklearn.model_selection import train_test_split
# This is THE gold standard, use it at all times!!!!!!
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

# Please use the above imported function to do a train-valid-test split of 80%-10%-10%
# So the trick will be to use train_test_split **twice**, choosing the test_size well!
# Make sure to set the random_state parameter to the random_seed.
# Note that we here just use train_test_split on the data "as is", not on separate X and y data
# (consequently, train_test_split will return just TWO things, not four).
# The assertions below should be satisfied, that means, 
# the code will not run further if the shapes and names don't fit!


...

## the three variables below should be the name of the train, validation and test subdatasets!
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)

assert df_train.shape==(2342, 7)
assert df_valid.shape==(293, 7)
assert df_test.shape==(293, 7)
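
The "split twice" idea can be sketched on a toy dataframe of 100 rows, where the expected sizes are easy to verify by hand. The names `toy`, `toy_rest` and friends are hypothetical and only serve this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

random_seed = 111222
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# First split: keep 80% for training and hold out the remaining 20%.
toy_train, toy_rest = train_test_split(toy, test_size=0.2, random_state=random_seed)

# Second split: halve the held-out part into validation and test (10% each).
toy_valid, toy_test = train_test_split(toy_rest, test_size=0.5, random_state=random_seed)

print(toy_train.shape, toy_valid.shape, toy_test.shape)
```

The second `test_size` is a fraction of the held-out part, not of the full dataset, which is why it is 0.5 rather than 0.1.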


# Task: Fitting a linear regression model

In [None]:
# Please build a pipeline under the variable "pipe" that consists of the application of a scaler 
# and a linear regression model from Scikit, with the named step of "lr".
# Do the appropriate imports, of course.

...



In [None]:
# Please train (fit) the pipe on the df_train data.
# Remember, your target variable should be the saleprice.
# Don't forget to remove the target variable (saleprice) from the input, 
# and store the input in the variable train_input.
# The training should run on train_input as X and the saleprice column of your dataframe as y.

...


## nothing to do here, we just display the coefficients:
coefs = pipe.named_steps["lr"].coef_
coefs

In [None]:
# No task here. :-)
# This is just an intermediate step that pairs the coefficients with their feature names and sorts them in descending order.
# (df.columns lines up with coefs because saleprice is the last column of df.)
names_and_coefs = [(df.columns[i], coefs[i]) for i, _ in enumerate(coefs)] # This is a list comprehension, https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
sorted(names_and_coefs, key=lambda x: x[1], reverse=True) # And this is a lambda, https://www.w3schools.com/python/python_lambda.asp
# These are by no means mandatory elements of Python, but they may come in handy at times.

In [None]:
pipe.named_steps["lr"].intercept_ # nothing to do, we just display the intercept here
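
To get a feeling for what the coefficients and the intercept of a scaled pipeline mean, here is a minimal sketch on synthetic data. It assumes `StandardScaler` as the scaler (the one used in the regularized cells further down); the data and the name `toy_pipe` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales, and a noiseless linear target.
X = rng.normal(size=(50, 2)) * [100.0, 1.0]
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0

toy_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("lr", LinearRegression())])
toy_pipe.fit(X, y)

# Coefficients are expressed per standard deviation of each feature.
print(toy_pipe.named_steps["lr"].coef_)
# With standardized (zero-mean) inputs, the intercept is the mean of the target.
print(toy_pipe.named_steps["lr"].intercept_, y.mean())
```

Because the features are standardized, the coefficients are comparable with each other, which is what makes sorting them by size meaningful.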

# Task: Predicting with the model 

In [None]:
#Please predict on the training input first! (train_input)
train_lr_prediction = ...

# Please import and calculate the mean squared error and mean absolute error metrics for the training!
# Use Scikit's metrics
mse=...
mae=...

print("Train mean squared error:", mse)
print("Train mean abs. error:", mae)
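
As a small worked example of the two metrics, the sketch below evaluates them on three made-up values, so that the result can be checked by hand. The names `toy_mse` and `toy_mae` are hypothetical.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
toy_mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3
# The absolute errors are 1, 0 and 2.
toy_mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3

print(toy_mse, toy_mae)
```

MAE is in the same unit as the target (dollars for the sale price), while MSE is in squared units and reacts more strongly to large errors.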

In [None]:
# Please repeat the procedure and calculate MAE on the validation dataset!

...

print("Valid mean abs. error:", valid_mae)

# Regularized versions

Below we have implemented the regularized versions of the regression.

The task here is just to observe their behavior and discuss it.
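
For reference, these are the objectives that scikit-learn's `Ridge` and `Lasso` minimize, written with $X$ as the (scaled) input matrix, $y$ as the target, $w$ as the coefficient vector, $n$ as the number of training samples and $\alpha$ as the regularization strength. The intercept is left out of the formulas for brevity and is not penalized.

$$
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\hat{w}_{\text{lasso}} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Note that the data term is scaled differently in the two objectives, so the `alpha` values of the two models are not directly comparable.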



## Ridge

In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("ridge", Ridge(alpha=1000.))])
pipe.fit(train_input, df_train.saleprice)
coefs = pipe.named_steps["ridge"].coef_
names_and_coefs = [(df.columns[i], coefs[i]) for i, _ in enumerate(coefs)]
sorted(names_and_coefs, key=lambda x: x[1], reverse=True)

In [None]:
train_ridge_prediction = pipe.predict(train_input)
print("Train mean squared error:", mean_squared_error(df_train.saleprice, train_ridge_prediction))
print("Train mean abs. error:", mean_absolute_error(df_train.saleprice, train_ridge_prediction))
valid_ridge_prediction = pipe.predict(valid_input)             
print("Valid mean abs. error:", mean_absolute_error(df_valid.saleprice, valid_ridge_prediction))

## Lasso

In [None]:
from sklearn.linear_model import Lasso
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("lasso", Lasso(alpha=3000))])
pipe.fit(train_input, df_train.saleprice)
coefs = pipe.named_steps["lasso"].coef_
names_and_coefs = [(df.columns[i], coefs[i]) for i, _ in enumerate(coefs)]
sorted(names_and_coefs, key=lambda x: x[1], reverse=True)

In [None]:
train_lasso_prediction = pipe.predict(train_input)
print("Train mean squared error:", mean_squared_error(df_train.saleprice, train_lasso_prediction))
print("Train mean abs. error:", mean_absolute_error(df_train.saleprice, train_lasso_prediction))
valid_lasso_prediction = pipe.predict(valid_input)             
print("Valid mean abs. error:", mean_absolute_error(df_valid.saleprice, valid_lasso_prediction))
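
Independently of the housing data, the qualitative difference between the two penalties can be sketched on synthetic data where the true coefficients are known. Everything here (the data, the `alpha` values, the names `toy_ridge` and `toy_lasso`) is made up for illustration and is not meant to mirror the results above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# One strong feature, one very weak feature, one irrelevant feature.
y = 10.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=200)

toy_ridge = Ridge(alpha=100.0).fit(X, y)
toy_lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks all coefficients towards zero without removing any of them.
print("ridge:", toy_ridge.coef_)
# Lasso can set weak coefficients to exactly zero.
print("lasso:", toy_lasso.coef_)
```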

# Task: Observe!

**Please observe the training of Ridge and LASSO on the same data!**

**Can you mention some interesting / specific observations about the training of these methods?**

# Task: Look at residuals

(For a more sophisticated inspection of residuals you can use [Yellowbrick](http://www.scikit-yb.org/en/latest/index.html), but let's stick to a manual approach for now.)

In [None]:
import matplotlib.pyplot as plt

# Please visualize the residuals of the LASSO prediction with a scatterplot.
# For now, let's define a residual as the real target value minus the prediction.
# Plot the residuals against the y value!
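
The mechanics of such a plot can be sketched with a handful of made-up numbers. This is only an illustration of the residual definition and the plotting calls; the arrays `y_true` and `y_pred` are hypothetical and unrelated to the housing data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up target values and predictions.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 230.0])

# Residual = real target value minus prediction.
residuals = y_true - y_pred

fig, ax = plt.subplots()
ax.scatter(y_true, residuals)
# A horizontal line at zero marks perfect predictions.
ax.axhline(0.0, color="gray", linestyle="--")
ax.set_xlabel("true target value")
ax.set_ylabel("residual (true - predicted)")
```

With this sign convention, points above the zero line are underestimated by the model and points below it are overestimated.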


**Please observe and share with us the conclusions!**