## Assignment for Module 4, Training Models

In this assignment you will train different models on a given data set and find the one that performs best.

### Getting the data for the assignment (similar to the notebook from chapter 2 of Hands-On...)

In [1]:
import os
import tarfile
import urllib.request
import pandas as pd


DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if needed, then download and unpack the archive.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

### Fix the categories in the categorical variable

In [2]:
d = {'<1H OCEAN':'LESS_1H_OCEAN', 'INLAND':'INLAND', 'ISLAND':'ISLAND', 'NEAR BAY':'NEAR_BAY', 'NEAR OCEAN':'NEAR_OCEAN'}
housing['ocean_proximity'] = housing['ocean_proximity'].map(lambda s: d[s])

### Add 2 more features

In [3]:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["population_per_household"] = housing["population"] / housing["households"]

### Fix missing data

In [4]:
median = housing["total_bedrooms"].median()
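# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)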

### Create dummy variables based on the categorical variable

In [5]:
one_hot = pd.get_dummies(housing['ocean_proximity'])
housing = housing.drop('ocean_proximity', axis=1)
housing = housing.join(one_hot)

### Check the data

In [6]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   longitude                 20640 non-null  float64
 1   latitude                  20640 non-null  float64
 2   housing_median_age        20640 non-null  float64
 3   total_rooms               20640 non-null  float64
 4   total_bedrooms            20640 non-null  float64
 5   population                20640 non-null  float64
 6   households                20640 non-null  float64
 7   median_income             20640 non-null  float64
 8   median_house_value        20640 non-null  float64
 9   rooms_per_household       20640 non-null  float64
 10  population_per_household  20640 non-null  float64
 11  INLAND                    20640 non-null  uint8  
 12  ISLAND                    20640 non-null  uint8  
 13  LESS_1H_OCEAN             20640 non-null  uint8  
 14  NEAR_B

# ASSIGNMENT

Using the familiar California housing dataset (target = 'median_house_value'), train several models using various regularization techniques to improve model accuracy. 

### 1.1 Partition into train and test

Use `train_test_split` from `sklearn.model_selection` to partition the dataset into 70% for training and 30% for testing.

With cross-validation, the 70% training partition can serve as both the training set and the validation set.
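
As a rough illustration of the idea (not the assignment solution), the sketch below splits a small synthetic dataset 70/30 and then cross-validates on the 70% only. The data, the `_demo` variable names, and the choice of 5 folds are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the housing data (assumption: 1000 rows, 3 features).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Hold out 30% for the final test; random_state makes the split reproducible.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# 5-fold cross-validation on the 70% only: each fold takes a turn as validation set.
cv_scores = cross_val_score(
    LinearRegression(), X_tr_demo, y_tr_demo,
    scoring="neg_mean_squared_error", cv=5,
)
cv_rmse = np.sqrt(-cv_scores)  # scores are negated MSE, so flip the sign first
print(X_tr_demo.shape, X_te_demo.shape)
print("CV RMSE per fold:", cv_rmse)
```

The test partition is never touched during cross-validation, so it remains an unbiased check for the final model.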


In [None]:
from sklearn.model_selection import train_test_split
target_col = 'median_house_value'
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

### 1.2 Polynomial transformations

Use `PolynomialFeatures` from `sklearn.preprocessing`.

In [7]:
from sklearn.preprocessing import PolynomialFeatures
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

##### You should obtain X_train and X_test with 136 columns each, since you started with 15 features.

##### With $m$ original features, a degree-2 expansion adds $(m^2-m)/2$ interaction terms, $m$ squared terms, and 1 bias column: $(m^2-m)/2+m+1$ new columns.

##### These, plus the $m$ original features, give a total of $(m^2-m)/2+2m+1$ columns. For $m=15$: $105+30+1=136$.
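
The count can be checked numerically. The sketch below applies `PolynomialFeatures` to random data with 15 columns and compares the resulting width with the formula. The random array and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

m = 15  # number of original features, as in the assignment
X_demo = np.random.default_rng(0).normal(size=(5, m))

# degree=2 with the default include_bias=True yields:
# bias column + original features + squared terms + pairwise interactions.
poly = PolynomialFeatures(degree=2)
X_poly_demo = poly.fit_transform(X_demo)

expected_new = (m**2 - m) // 2 + m + 1   # interactions + squares + bias
expected_total = expected_new + m        # plus the original features
print(X_poly_demo.shape[1], expected_total)  # both are 136
```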

In [None]:
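# Assumes `features` holds the original feature names and `X_tr` is your transformed training matrix.
print("Original number of features: "+str(len(features)))
print("Final number of features: "+str(X_tr.shape[1]))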

### 1.3 Scaling features

Similarly, use `StandardScaler` from `sklearn.preprocessing` to standardize the training and testing data. Fit the scaler on the training data only, then apply it to both sets.
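
A minimal sketch of the fit-on-train, transform-both pattern, using made-up arrays (the `_demo` data and its shapes are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

print("Train means:", X_train_scaled.mean(axis=0))  # approximately zero
print("Test means:", X_test_scaled.mean(axis=0))    # close to zero, but not exactly
```

Calling `fit` on the test data as well would leak information about the test distribution into the preprocessing step.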

In [7]:
from sklearn.preprocessing import StandardScaler
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

#### Comparing models

Use this function to display your cross-validation scores, or write your own custom function.

**Either way it is important to display your results as you train new models.**

In [8]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
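
One possible way to feed this helper is sketched below on synthetic data. `cross_val_score` with `scoring="neg_mean_squared_error"` returns negated MSE values, so they are negated and square-rooted to obtain RMSE per fold. The dataset, the 10 folds, and the `_demo` names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    # Same helper as in the cell above, repeated so this sketch runs on its own.
    print("Scores:", scores)
    print("Mean:", scores.mean())

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.5, size=300)

neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=10,
)
rmse_scores = np.sqrt(-neg_mse)  # convert negated MSE to RMSE
display_scores(rmse_scores)
```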

### 2.1 Linear regression on original features (no transformations) --- benchmark

Train a simple linear regression model using `cross_val_score` with no regularization or feature transformations. This model will serve as your benchmark.

In [None]:
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

### 2.2 Linear regression  (on transformed features: polynomial transformation + scaling)

Now do as in **2.1**, but with the original and transformed features (136 features).

In [None]:
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

If the cross-validation error is much higher than the benchmark in **2.1** (or than the training error), the model is probably overfitting, and regularization is needed.

### 2.3 Ridge regression

Using the same transformed dataset from **2.2**, train another linear model, but this time apply L2 regularization. Run the model through grid search to find the optimal regularization hyperparameters. Print the results.
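
The general shape of such a search might look like the sketch below, run here on synthetic data. The `alpha` grid, the 5 folds, and the `_demo` names are assumptions for illustration, not recommended values for the housing data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

# Candidate L2 penalty strengths (hypothetical grid, spaced logarithmically).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
ridge_search = GridSearchCV(
    Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5
)
ridge_search.fit(X_demo, y_demo)

print("Best alpha:", ridge_search.best_params_["alpha"])
print("Best CV RMSE:", np.sqrt(-ridge_search.best_score_))
```

`best_score_` is the mean negated MSE across folds for the best setting, so the printed value is the square root of a mean MSE rather than a mean of per-fold RMSEs.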

In [None]:
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

### 2.4 Lasso regression

Now do the same as in **2.3**, but with Lasso (L1 regularization).
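
One practical difference from Ridge is that the L1 penalty can drive coefficients to exactly zero. The sketch below illustrates this on synthetic data where only the first feature matters. The fixed `alpha=0.5` and the `_demo` data are assumptions chosen to make the effect visible, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)  # only feature 0 is informative

lasso = Lasso(alpha=0.5)  # illustrative penalty strength, not tuned
lasso.fit(X_demo, y_demo)

n_nonzero = np.count_nonzero(lasso.coef_)
print("Coefficients:", np.round(lasso.coef_, 3))
print("Non-zero coefficients:", n_nonzero, "of", lasso.coef_.size)
```

On the 136-feature dataset, counting non-zero coefficients of the best estimator is a quick way to see how many features Lasso kept.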

In [None]:
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

### 2.5 Elastic Net regression

Do the same as in **2.3** and **2.4**, but now with Elastic Net. This time the grid search should cover both `alpha` and the L1 ratio. Use just 3 values for the L1 ratio.
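
A two-parameter grid can be sketched as follows on synthetic data. In scikit-learn's `ElasticNet` the L1 ratio is the `l1_ratio` parameter. The specific grid values, `max_iter`, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(11)
X_demo = rng.normal(size=(200, 8))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# 3 alphas x 3 L1 ratios = 9 candidate settings (hypothetical values).
param_grid = {
    "alpha": [0.01, 0.1, 1.0],
    "l1_ratio": [0.2, 0.5, 0.8],
}
enet_search = GridSearchCV(
    ElasticNet(max_iter=10000), param_grid,
    scoring="neg_mean_squared_error", cv=5,
)
enet_search.fit(X_demo, y_demo)

print("Best params:", enet_search.best_params_)
print("Best CV RMSE:", np.sqrt(-enet_search.best_score_))
```

An `l1_ratio` near 1 behaves like Lasso and a value near 0 behaves like Ridge, so the three values span the range between the two earlier models.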

In [None]:
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

### 3.1  Expected Results

Before you compute `final_rmse` on the test data using your best model, pause and reflect:
- Does your best model have high variance?
- Why did your best model perform better than the others?
- What RMSE do you expect on your test data?

##### YOUR ANSWER HERE 

### 3.2 Evaluating your best model on TESTING data

Of the models you created above, choose the best one to test on the test set.

In [None]:
import numpy as np  # needed for np.sqrt below
from sklearn.metrics import mean_squared_error

## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

final_mse = mean_squared_error(y_test, YOUR_PREDICTIONS) # swap 'YOUR_PREDICTIONS' with the predictions values your model produces
final_rmse = np.sqrt(final_mse)
print(final_rmse)

In [None]:
# Plot the results
import matplotlib.pyplot as plt

plt.scatter(y_test, YOUR_PREDICTIONS) # swap 'YOUR_PREDICTIONS' with the prediction values your model produces
plt.xlim([-200000,800000])
plt.ylim([-200000,800000])
plt.show()

### 4.1 Try a more advanced model

Try a more complex algorithm (SVR, RandomForest, etc.) and see if the accuracy improves (train on the full training set, and then test on the full test set). We have already done this in one of the earlier assignments so this should be easy!

Why does the accuracy improve when using a more complex algorithm in this case? Write a very brief answer in the cell below, following your code.

In [None]:
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##
## YOUR CODE HERE ##

##### YOUR ANSWER HERE 

# BONUS QUESTIONS

#[Optional]
Why does the matrix X appear transposed in the normal equation for linear regression (Equation 4.4)? Start from Equation 4.3.



#[Optional]
Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?



#[Optional]
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?



#[Optional]
Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?




