## Lasso Regression

Lasso regression is a linear regression technique that is useful on datasets with many features. In addition to minimizing the error between predicted and actual observations, it penalizes the magnitude of the feature coefficients. This helps prevent overfitting and can shrink some coefficients exactly to zero, which yields a simpler fitted model.
- Performs L1 regularization, i.e. adds a penalty proportional to the sum of the absolute values of the coefficients
- Minimization objective = LS Obj + α * (sum of absolute values of the coefficients), where LS Obj is the least-squares objective and α sets the strength of the penalty
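
To make the effect of the L1 penalty concrete, here is a minimal sketch on synthetic data (not the dengue dataset). The data, the seed, and `alpha=0.5` are assumptions chosen for illustration. scikit-learn's ``Lasso`` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so the least-squares term carries a scaling factor that the shorthand above leaves out.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: 10 features, but only the first two influence the target.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("OLS zero coefficients:  ", n_zero_ols)
print("Lasso zero coefficients:", n_zero_lasso)
```

Ordinary least squares assigns a small non-zero weight to every feature, while the L1 penalty tends to set the weights of uninformative features to exactly zero and to shrink the remaining ones toward zero.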

In [411]:
import pandas as pd
import numpy as np

In [412]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

In [413]:
%matplotlib inline

In [414]:
sj = pd.read_csv('../data/sj.csv', index_col=0)

In [415]:
sj.isnull().sum()

city                                       0
year                                       0
weekofyear                                 0
week_start_date                            0
ndvi_ne                                  191
ndvi_nw                                   49
ndvi_se                                   19
ndvi_sw                                   19
precipitation_amt_mm                       9
reanalysis_air_temp_k                      6
reanalysis_avg_temp_k                      6
reanalysis_dew_point_temp_k                6
reanalysis_max_air_temp_k                  6
reanalysis_min_air_temp_k                  6
reanalysis_precip_amt_kg_per_m2            6
reanalysis_relative_humidity_percent       6
reanalysis_specific_humidity_g_per_kg      6
reanalysis_tdtr_k                          6
station_precip_mm                          6
total_cases                                0
station_min_temp_k                         6
station_avg_temp_k                         6
station_ma

As before, I drop the ``ndvi`` columns, since many of their values are missing and they correlate only weakly with the target, along with ``city``, ``year``, and ``week_start_date``. The low-correlation columns ``station_precip_mm``, ``precipitation_amt_mm``, ``reanalysis_tdtr_k``, and ``station_diur_temp_rng_k`` were also candidates for removal, but the cell below drops only the first group, so they remain in the data.

In [416]:
sj.drop(sj.columns[[0, 1, 3, 4, 5, 6, 7]], axis=1, inplace=True)

In [417]:
sj.columns

Index(['weekofyear', 'precipitation_amt_mm', 'reanalysis_air_temp_k',
       'reanalysis_avg_temp_k', 'reanalysis_dew_point_temp_k',
       'reanalysis_max_air_temp_k', 'reanalysis_min_air_temp_k',
       'reanalysis_precip_amt_kg_per_m2',
       'reanalysis_relative_humidity_percent',
       'reanalysis_specific_humidity_g_per_kg', 'reanalysis_tdtr_k',
       'station_precip_mm', 'total_cases', 'station_min_temp_k',
       'station_avg_temp_k', 'station_max_temp_k', 'station_diur_temp_rng_k'],
      dtype='object')
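
Dropping by position works only as long as the column order in the CSV never changes. A possible alternative is to drop by name. The sketch below uses a toy frame with made-up values; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
df = pd.DataFrame({
    "city": ["sj", "sj"],
    "year": [1990, 1990],
    "weekofyear": [18, 19],
    "week_start_date": ["1990-04-30", "1990-05-07"],
    "ndvi_ne": [0.12, 0.17],
    "ndvi_nw": [0.10, 0.14],
    "ndvi_se": [0.20, 0.16],
    "ndvi_sw": [0.18, 0.16],
    "total_cases": [4, 5],
})

# Collect the columns to drop by name instead of by position.
ndvi_cols = [c for c in df.columns if c.startswith("ndvi_")]
to_drop = ["city", "year", "week_start_date"] + ndvi_cols

trimmed = df.drop(columns=to_drop)
print(list(trimmed.columns))
```

Because the names are spelled out, a typo or a renamed column raises a `KeyError` instead of silently dropping the wrong data.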

In [418]:
sj.isnull().sum()

weekofyear                               0
precipitation_amt_mm                     9
reanalysis_air_temp_k                    6
reanalysis_avg_temp_k                    6
reanalysis_dew_point_temp_k              6
reanalysis_max_air_temp_k                6
reanalysis_min_air_temp_k                6
reanalysis_precip_amt_kg_per_m2          6
reanalysis_relative_humidity_percent     6
reanalysis_specific_humidity_g_per_kg    6
reanalysis_tdtr_k                        6
station_precip_mm                        6
total_cases                              0
station_min_temp_k                       6
station_avg_temp_k                       6
station_max_temp_k                       6
station_diur_temp_rng_k                  6
dtype: int64

In [419]:
from sklearn.linear_model import Lasso

In [420]:
from sklearn.model_selection import train_test_split

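Use scikit-learn's ``Imputer`` class to fill in the NaN values with the mean. Note that ``axis=1`` makes it impute along rows, so each gap receives the mean of the other values in its row rather than the mean of its column (``axis=0``). ``Imputer`` was removed in scikit-learn 0.22 in favor of ``sklearn.impute.SimpleImputer``, so the cells below need an older release to run as written.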

In [421]:
from sklearn.preprocessing import Imputer

In [422]:
fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)

In [423]:
imputed_sj = pd.DataFrame(fill_NaN.fit_transform(sj))
imputed_sj.columns = sj.columns
imputed_sj.index = sj.index
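
On current scikit-learn releases, column-wise mean imputation might look like the sketch below. The toy frame and its values are assumptions standing in for `sj`. This is not identical to the cell above: ``SimpleImputer`` always works per column, which is usually what is wanted when columns hold different physical quantities.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical values standing in for `sj`.
toy = pd.DataFrame({
    "reanalysis_air_temp_k": [297.0, np.nan, 299.0],
    "station_precip_mm": [10.0, 20.0, np.nan],
})

# SimpleImputer fills each gap with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_toy = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(imputed_toy)
```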

In [424]:
imputed_sj.isnull().sum()

weekofyear                               0
precipitation_amt_mm                     0
reanalysis_air_temp_k                    0
reanalysis_avg_temp_k                    0
reanalysis_dew_point_temp_k              0
reanalysis_max_air_temp_k                0
reanalysis_min_air_temp_k                0
reanalysis_precip_amt_kg_per_m2          0
reanalysis_relative_humidity_percent     0
reanalysis_specific_humidity_g_per_kg    0
reanalysis_tdtr_k                        0
station_precip_mm                        0
total_cases                              0
station_min_temp_k                       0
station_avg_temp_k                       0
station_max_temp_k                       0
station_diur_temp_rng_k                  0
dtype: int64

In [425]:
# set target variable
y = imputed_sj['total_cases']

In [426]:
# drop the target variable from the dataset
imputed_sj.drop('total_cases', inplace=True, axis=1)

In [427]:
len(imputed_sj), len(y)

(936, 936)

In [428]:
X_train, X_test, y_train, y_test = train_test_split(imputed_sj, y, test_size=0.3)
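
The split above is random, so the scores further down change from run to run. One way to make them repeatable is to pass `random_state`. The sketch uses hypothetical arrays in place of `imputed_sj` and `y`, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for `imputed_sj` and `y`.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Fixing random_state makes the shuffle, and so the split, repeatable.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print(np.array_equal(X_te1, X_te2))
```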

In [429]:
from sklearn.linear_model import Lasso

In [430]:
# note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2
lreg = Lasso(alpha=0.3, normalize=True)

In [431]:
lreg.fit(X_train,y_train)

Lasso(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=True, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
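
Since newer scikit-learn releases no longer accept `normalize=True`, a common substitute is to standardize the features with ``StandardScaler`` in a pipeline. The sketch below uses synthetic data with hypothetical column names. Note that ``StandardScaler`` scales differently from the old `normalize` option, so the same `alpha` does not give the same amount of shrinkage and would need to be re-tuned.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for X_train / y_train: three features on very
# different scales, of which only the first two drive the target.
X_demo = pd.DataFrame({
    "temp_k": rng.normal(298.0, 2.0, size=150),
    "precip_mm": rng.normal(30.0, 15.0, size=150),
    "noise": rng.normal(0.0, 1.0, size=150),
})
y_demo = (
    5.0 * (X_demo["temp_k"] - 298.0)
    + 0.5 * X_demo["precip_mm"]
    + rng.normal(0.0, 1.0, size=150)
)

# Standardize inside the pipeline so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.3))
model.fit(X_demo, y_demo)

# Coefficients are on the standardized scale; zeros mark dropped features.
coefs = pd.Series(model.named_steps["lasso"].coef_, index=X_demo.columns)
print(coefs)
```

Inspecting the fitted coefficients this way also shows which features the penalty has removed from the model.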

In [432]:
pred = lreg.predict(X_test)

In [433]:
# mean squared error (same as np.mean((pred - y_test)**2))
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, pred)

2429.4293846932228

In [434]:
# mean absolute error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, pred)

26.23213316644366

In [435]:
# calculate R-squared on the test set
# about 6% of the variance in total cases is explained by our features
lreg.score(X_test, y_test)

0.056996260252192843
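
For reference, the three metrics used above can be written out by hand. The sketch uses hypothetical values in place of `y_test` and `pred` and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values standing in for y_test and pred.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)        # mean squared error
mae = np.mean(np.abs(residuals))     # mean absolute error

# R^2 = 1 - SS_res / SS_tot: the share of variance the model explains.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The manual values agree with scikit-learn's implementations.
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Taking the square root of the MSE reported above gives an RMSE of roughly 49 cases, well above the MAE of about 26, which suggests that a few large misses dominate the squared error.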