### Your name:

### Collaborators:

<pre> Enter the names of the people you worked with, if any</pre>


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
from utils_3253 import CategoricalEncoder
from utils_3253 import DataFrameSelector

Open the housing data


In [3]:
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract it into housing_path."""
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

## Question 1: Build a full pipeline for the data analysis, following the example in the notebook.
Hint: the main change required is the algorithm used (Lasso regression).

If you want to learn more about Lasso regression, see the resources below:
- http://scikit-learn.org/stable/modules/linear_model.html#lasso
- https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

#### Considerations for building the pipeline:

- Split the data into training and testing sets.
- Convert all categorical data to one-hot vectors.
- Normalize all non-categorical data.
- Perform Lasso regression using a variety of values for $\alpha$ between 0 and 1 via a grid search, where *housing_labels* is the output and all other features are the inputs (similar to what was shown in lecture two).
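
For reference, the role of $\alpha$ in the grid search can be read off the objective that scikit-learn documents for `Lasso`. The symbols below are notation assumed here, not taken from the assignment: $X$ is the prepared feature matrix, $y$ the labels, $w$ the coefficient vector and $n$ the number of training samples.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

Larger values of $\alpha$ penalize large coefficients more strongly and push more of them to exactly zero; as $\alpha$ approaches 0 the problem approaches ordinary least squares.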

In [4]:
from sklearn.linear_model import Lasso
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
# Note: Imputer was removed in scikit-learn 0.22; newer versions provide
# sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import MinMaxScaler

## Answer 1

### Data Pre-Processing for stratified split

In [5]:
# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
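
As a small illustration of what the two lines above do, the sketch below applies the same divide, round up, and cap steps to a handful of made-up income values. The input numbers are hypothetical and chosen only to show each category; `np.where` is used here in place of the in-place `Series.where` call.

```python
import numpy as np

# Hypothetical median incomes, for illustration only.
median_income = np.array([0.9, 1.5, 2.7, 4.4, 6.0, 8.3, 15.0])

# Dividing by 1.5 and rounding up yields categories 1, 2, 3, ...
income_cat = np.ceil(median_income / 1.5)

# Keep categories below 5 and label everything else as 5.
income_cat = np.where(income_cat < 5, income_cat, 5.0)

print(income_cat)
```

Every value ends up in one of five categories, which keeps each stratum large enough for the stratified split.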

### Stratified Split

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
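
The idea behind a stratified split can be sketched with NumPy alone: sample the same fraction from every category so that the test set keeps the category proportions of the full data. This is a simplified illustration on made-up labels, not how `StratifiedShuffleSplit` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical strata labels: 1000 samples in 3 imbalanced categories.
strata = np.repeat([1, 2, 3], [100, 600, 300])

test_size = 0.2
test_idx = []
for category in np.unique(strata):
    # Shuffle the members of this category and take the same fraction of each.
    members = np.flatnonzero(strata == category)
    rng.shuffle(members)
    n_test = int(round(test_size * len(members)))
    test_idx.extend(members[:n_test])
test_idx = np.array(test_idx)

def proportions(labels):
    """Return the share of each category in labels."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / len(labels)).tolist()))

print(proportions(strata))
print(proportions(strata[test_idx]))
```

Both printed dictionaries show the same category shares, which is the property the comparison table in the next cells checks for the housing data.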

In [7]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
}).sort_index()
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [8]:
compare_props

| income_cat | Overall | Stratified | Strat. %error |
|---|---|---|---|
| 1.0 | 0.039826 | 0.039729 | -0.243309 |
| 2.0 | 0.318847 | 0.318798 | -0.015195 |
| 3.0 | 0.350581 | 0.350533 | -0.01382 |
| 4.0 | 0.176308 | 0.176357 | 0.02748 |
| 5.0 | 0.114438 | 0.114583 | 0.127011 |


In [9]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

In [10]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

In [11]:
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
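
To make the column arithmetic in `transform` concrete, here is a minimal sketch on two made-up districts. The column order matches the numeric attributes used in this notebook (longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income); the values themselves are hypothetical.

```python
import numpy as np

# Two hypothetical districts, columns in the order listed above.
X = np.array([
    [-122.0, 37.0, 30.0, 1000.0, 200.0, 500.0, 250.0, 4.0],
    [-118.0, 34.0, 15.0, 3000.0, 750.0, 1800.0, 600.0, 2.5],
])

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

# Same ratios as in CombinedAttributesAdder.transform.
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

# np.c_ appends the three new columns to the right of the original ones.
X_extended = np.c_[X, rooms_per_household, population_per_household,
                   bedrooms_per_room]

print(X_extended.shape)  # (2, 11)
print(X_extended[:, -3:])
```

The eight input columns become eleven, which is why the prepared data later has three more numeric columns than the raw data.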

### Data Preparation Pipeline

In [12]:
housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('nrm_scaler', MinMaxScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
    ])

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
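
The two main transformations in this pipeline, min-max scaling and one-hot encoding, can be sketched by hand with NumPy. This is only an illustration on made-up values of what `MinMaxScaler` and a dense one-hot encoder compute, not a replacement for the pipeline above; the category strings are illustrative.

```python
import numpy as np

# Min-max scaling: map each column to [0, 1] using that column's min and max.
X_num = np.array([[1.0, 200.0],
                  [3.0, 400.0],
                  [5.0, 1000.0]])
col_min = X_num.min(axis=0)
col_max = X_num.max(axis=0)
X_scaled = (X_num - col_min) / (col_max - col_min)

# One-hot encoding: one indicator column per distinct category.
raw_categories = np.array(["INLAND", "NEAR BAY", "INLAND"])
categories, codes = np.unique(raw_categories, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Side-by-side concatenation, analogous to what FeatureUnion does.
X_prepared = np.hstack([X_scaled, one_hot])
print(X_prepared)
```

In a real workflow the minimum and maximum must come from the training set only and then be reused on the test set, which is what calling `fit_transform` on the training data and `transform` on the test data achieves.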

In [13]:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

array([[ 0.24501992,  0.50478215,  0.7254902 , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.24103586,  0.47927736,  0.25490196, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.71215139,  0.02444208,  0.58823529, ...,  0.        ,
         0.        ,  1.        ],
       ..., 
       [ 0.79183267,  0.16471838,  0.15686275, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.6314741 ,  0.1360255 ,  0.58823529, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.18924303,  0.55579171,  1.        , ...,  0.        ,
         1.        ,  0.        ]])

In [14]:
housing_prepared.shape

(16512, 16)

### Full pipeline for preparation as well as Lasso regression

In [15]:
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [39]:
param_grid = [{'alpha': [0.5, 0.8, 0.9, 1.0, 5.0, 10.0], 'max_iter': [5000]}]
grid_search = GridSearchCV(linear_model.Lasso(), param_grid, cv=5,
                           scoring='neg_mean_squared_error')

In [40]:
full_lasso_pipeline_with_predictor = Pipeline([("preparation", full_pipeline),
        ("lasso", grid_search)])
full_lasso_pipeline_with_predictor.fit(housing, housing_labels)



Pipeline(memory=None,
     steps=[('preparation', FeatureUnion(n_jobs=1,
       transformer_list=[('num_pipeline', Pipeline(memory=None,
     steps=[('selector', DataFrameSelector(attribute_names=['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income'])), ('...*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0))])

In [28]:
grid_search.best_params_

{'alpha': 10.0}

In [29]:
grid_search.best_estimator_

Lasso(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [30]:
cvres_lasso2 = grid_search.cv_results_
for mean_score, params in zip(cvres_lasso2["mean_test_score"], cvres_lasso2["params"]):
    print(np.sqrt(-mean_score), params)

69133.6641407 {'alpha': 0.5}
69125.9627212 {'alpha': 0.8}
69123.4405753 {'alpha': 0.9}
69121.1761625 {'alpha': 1.0}
69065.7338444 {'alpha': 5.0}
69027.465631 {'alpha': 10.0}
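
Because the grid search uses `scoring='neg_mean_squared_error'`, every score is a negated MSE, and the loop above turns it into an RMSE with `np.sqrt(-mean_score)`. The sketch below repeats that conversion on three `mean_test_score` values copied from the `cv_results_` table in this notebook; the dictionary layout is only for illustration.

```python
import numpy as np

# alpha -> mean_test_score (negated MSE), as reported in cv_results_.
mean_test_scores = {0.5: -4779464000.0, 1.0: -4777737000.0, 10.0: -4764791000.0}

# Negate to recover the MSE, then take the square root to get the RMSE.
rmse = {alpha: float(np.sqrt(-score)) for alpha, score in mean_test_scores.items()}

# The best alpha is the one with the lowest RMSE.
best_alpha = min(rmse, key=rmse.get)

for alpha, value in rmse.items():
    print(alpha, round(value, 2))
print("best alpha:", best_alpha)
```

The RMSE is in the same units as the labels, which makes it easier to interpret than the raw negated MSE.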


In [31]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_alpha,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,...,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.249627,0.000822,-4779464000.0,-4702929000.0,0.5,{'alpha': 0.5},6,-4475847000.0,-4771999000.0,-5268332000.0,...,-4849860000.0,-4679440000.0,-4440286000.0,-4778704000.0,-4862937000.0,-4684841000.0,0.016265,6.9e-05,302716000.0,66416190.0
1,0.266107,0.00081,-4778399000.0,-4702958000.0,0.8,{'alpha': 0.8},5,-4475573000.0,-4772016000.0,-5264647000.0,...,-4849594000.0,-4679463000.0,-4440380000.0,-4778724000.0,-4861743000.0,-4684876000.0,0.02043,0.000166,301481900.0,66406120.0
2,0.264028,0.000832,-4778050000.0,-4702970000.0,0.9,{'alpha': 0.9},4,-4475484000.0,-4772024000.0,-5263425000.0,...,-4849509000.0,-4679473000.0,-4440413000.0,-4778733000.0,-4861364000.0,-4684891000.0,0.008084,0.000198,301073500.0,66401730.0
3,0.319577,0.000819,-4777737000.0,-4702983000.0,1.0,{'alpha': 1.0},3,-4475394000.0,-4772032000.0,-5262369000.0,...,-4849426000.0,-4679484000.0,-4440446000.0,-4778742000.0,-4860994000.0,-4684907000.0,0.073133,0.00023,300719100.0,66398250.0
4,0.149538,0.000664,-4770076000.0,-4704384000.0,5.0,{'alpha': 5.0},2,-4475377000.0,-4772928000.0,-5230865000.0,...,-4847682000.0,-4680914000.0,-4442363000.0,-4779965000.0,-4854041000.0,-4687023000.0,0.016985,3.9e-05,289718000.0,66230420.0
5,0.068062,0.000633,-4764791000.0,-4708333000.0,10.0,{'alpha': 10.0},1,-4476472000.0,-4775146000.0,-5191221000.0,...,-4849396000.0,-4684931000.0,-4445823000.0,-4783467000.0,-4861001000.0,-4692648000.0,0.013302,1.4e-05,276824500.0,65549210.0


In [32]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [33]:
final_rmse

67008.079442467657

## Question 2: Why is it necessary to normalize all continuous variables before performing Lasso? (OPTIONAL)

## Answer 2:
<pre> There are three reasons. In summary:
1) Interpretability of the coefficients.
2) Ability to rank feature importance by the relative magnitude of the post-shrinkage coefficient estimates.
3) No need for an intercept.
They can be explained as follows:
Lasso regression constrains the size of the coefficient associated with each variable. However, the size of a coefficient depends on the scale of its variable, so the same penalty shrinks variables measured on different scales unequally. It is therefore necessary to center and scale (standardize) the variables before fitting.
Once the variables (and the labels) are centered, the intercept is zero and can be dropped. </pre>
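
The scale dependence can be seen in a minimal single-feature sketch. It assumes the objective $\frac{1}{2n}\lVert y - xw\rVert_2^2 + \alpha\lvert w\rvert$ with no intercept, for which the Lasso coefficient has a closed form based on soft-thresholding. The helper names `soft_threshold` and `lasso_1d` are hypothetical, and this is not scikit-learn's implementation.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return 0 when |rho| <= alpha."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_1d(x, y, alpha):
    """Closed-form Lasso coefficient for one feature and no intercept,
    minimizing (1 / (2n)) * ||y - x*w||^2 + alpha * |w|."""
    n = len(x)
    rho = x @ y / n   # average correlation between feature and target
    z = x @ x / n     # average squared feature value
    return soft_threshold(rho, alpha) / z

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x           # the target depends perfectly on x
alpha = 1.0

w_original = lasso_1d(x, y, alpha)
w_rescaled = lasso_1d(x / 100.0, y, alpha)  # same feature in different units

print(w_original)
print(w_rescaled)
```

With the original units the coefficient is shrunk but kept; after dividing the feature by 100 the same penalty sets it to exactly zero, even though the feature is just as informative. Putting all features on a common scale removes this dependence on units.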

## Question 3:  Conclusions
For what values of $\alpha$ does Lasso perform best? Does it perform as well on the housing data as the linear regressor from the lectures? Why do you think this is?

## Answer 3:
<pre> Within the range the assignment asked for (α up to 1), Lasso performs best at α = 1.0 (cross-validation RMSE of about 69,121). In the wider grid tried above, α = 10.0 gave the lowest cross-validation RMSE (about 69,027).
It doesn't perform better than linear regression, primarily because Lasso is a regularization (simplification) method, so it works better when there is a large number of features. In this case there are only 16 features, so it does not outperform linear regression.
</pre>

## Question 4: Read Appendix B

- Reflect on your last data project, read appendix B. Then, write down a few of the checklist items that your last data project could have used. If you have not yet done a data project, then write down a few of the items that you found most interesting.


## Answer 4:
<pre>Your own answer </pre>

### Submit your notebook

Submit your solution here
https://goo.gl/forms/VKD7Zwu54oHjutDc2
Make sure you rename your notebook to    
W2_UTORid.ipynb    
Example W2_adfasd01.ipynb
