# Lasso Algorithm

One limitation of the XGBoost algorithm in particular is its inability to extrapolate, so a linear model can better predict sale prices that fall outside the range of prices in our training set. Hence, we use a regularized linear model, which adds a penalty term to the error being minimized; the penalty weight is often written lambda and is called `alpha` in scikit-learn. LASSO (Least Absolute Shrinkage and Selection Operator) is a regression model that performs both variable selection and regularization. Its penalty is proportional to the sum of the absolute values of the coefficients, which discourages fitting too many variables. It can shrink coefficients to exactly 0, so those variables have no effect in the model, which effectively reduces dimensionality.
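
To make the shrinkage behaviour concrete, here is a minimal sketch on synthetic data (not the housing data). The arrays `X`, `y`, `true_coef` and the helper `count_nonzero_coefs` are made up for illustration. scikit-learn's `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so a larger `alpha` should drive more coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 20 features, only the first 3 influence y.
rng = np.random.RandomState(0)
n_samples, n_features = 200, 20
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + 0.1 * rng.randn(n_samples)


def count_nonzero_coefs(alpha):
    """Fit a Lasso with the given alpha and count coefficients that are not exactly 0."""
    model = Lasso(alpha=alpha, max_iter=50000)
    model.fit(X, y)
    return int(np.sum(model.coef_ != 0))


# Compare a weak, a moderate, and a strong penalty.
nonzero_by_alpha = {alpha: count_nonzero_coefs(alpha) for alpha in (0.0001, 0.01, 0.5)}
print(nonzero_by_alpha)
```

With a very small `alpha`, such as the 0.0001 used in this notebook, the model behaves much like ordinary least squares and keeps most variables; the selection effect becomes visible only as `alpha` grows.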

In [122]:
import os
import numpy as np 
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error

In [123]:
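# Note: os.system('cd ...') runs in a child shell, so it does not change this
# process's working directory (os.chdir() would). The relative paths below
# resolve against the directory printed by os.getcwd().
os.system('cd /home/mcheruvu/notebook/code')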

print(os.getcwd())
print("")

train = pd.read_csv("../data/train_after_feature_engineering.csv")
test = pd.read_csv("../data/test_after_feature_engineering.csv")


print('The train data has {0} rows and {1} columns'.format(train.shape[0], train.shape[1]))
print('The test data has {0} rows and {1} columns'.format(test.shape[0], test.shape[1]))


/home/mcheruvu/notebook

The train data has 1460 rows and 307 columns
The test data has 1459 rows and 306 columns


In [124]:
np.random.seed(1234)

# Best alpha value, found through cross-validation
_best_alpha = 0.0001

_lasso_algo = Lasso(alpha = _best_alpha, max_iter = 50000)
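
The notebook does not show the cross-validation that produced this value. A search along the following lines is one plausible way to do it, using the `GridSearchCV` class imported earlier. This is only a sketch: the synthetic `X_demo` / `y_demo` data, the candidate values in `alpha_grid`, and the choice of 5 folds are assumptions, not the author's actual setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption); in the notebook this would be the training
# features and the log1p-transformed SalePrice.
rng = np.random.RandomState(1234)
X_demo = rng.randn(150, 10)
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng.randn(150)

# Hypothetical candidate penalties to compare.
alpha_grid = {"alpha": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]}

search = GridSearchCV(
    Lasso(max_iter=50000),
    param_grid=alpha_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
# Square root of the mean cross-validated MSE for the best alpha.
best_cv_rmse = np.sqrt(-search.best_score_)
print(best_alpha, best_cv_rmse)
```

scikit-learn also provides `LassoCV`, which fits the whole regularization path more efficiently when `alpha` is the only parameter being tuned.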

# Fit the Model

In [125]:
target_vector = pd.DataFrame(index = train.index, columns=["SalePrice"])
target_vector["SalePrice"] = train["SalePrice"]

target_vector["SalePrice"] = np.log1p(target_vector["SalePrice"])  # log(1 + SalePrice)

train.drop(['SalePrice'], axis=1, inplace=True)

_lasso_algo.fit(train, target_vector)


Lasso(alpha=0.0001, copy_X=True, fit_intercept=True, max_iter=50000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
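
Because the model is fitted on `log1p(SalePrice)`, its predictions are in log space and have to be mapped back before they can be read as prices. The small check below, using made-up prices, illustrates that `np.log1p(x)` equals `log(1 + x)` and that `np.expm1` is its inverse.

```python
import numpy as np

# Hypothetical sale prices in dollars, for illustration only.
sale_prices = np.array([34900.0, 163000.0, 755000.0])

log_prices = np.log1p(sale_prices)  # log(1 + price), the scale the model is trained on
recovered = np.expm1(log_prices)    # exp(value) - 1, back to dollars

print(log_prices)
print(recovered)
```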

# Predict the Test Sale Price

In [126]:
y_train = target_vector
y_train_pred = _lasso_algo.predict(train)
    
# RMSE is computed in log space, on log1p(SalePrice), not in dollars
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))

print("Lasso score on training set: ", rmse_train)

y_test_pred = _lasso_algo.predict(test)

# Note: [5:] prints every prediction from index 5 onward; use [:5] for the first five
print(y_test_pred[5:])

('Lasso score on training set: ', 0.10151498075749298)
[ 12.07016228  12.08523428  12.00229156 ...,  12.03431156  11.66466558
  12.34821441]
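
The score printed above is computed on the same rows the model was fitted on, so it likely understates the error on unseen houses. A cross-validated estimate is one common way to check this. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions) and compares the training RMSE with a 5-fold cross-validated RMSE.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Stand-in data (assumption): many features relative to the number of rows,
# which is the situation where training error is most optimistic.
rng = np.random.RandomState(1234)
X_demo = rng.randn(120, 40)
y_demo = X_demo[:, :5].sum(axis=1) + 0.5 * rng.randn(120)

model = Lasso(alpha=0.0001, max_iter=50000)

# Error on the data the model was fitted on.
model.fit(X_demo, y_demo)
train_rmse = np.sqrt(np.mean((y_demo - model.predict(X_demo)) ** 2))

# Error on held-out folds; each fold is predicted by a model that never saw it.
neg_mse_per_fold = cross_val_score(
    model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5
)
cv_rmse = np.sqrt(-neg_mse_per_fold.mean())

print("train RMSE:", train_rmse)
print("5-fold CV RMSE:", cv_rmse)
```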


# Save Predictions

In [127]:
# np.expm1 inverts the log1p transform applied to SalePrice before fitting
df_predict = pd.DataFrame({'Id': test["Id"], 'SalePrice': np.expm1(y_test_pred)})
#df_predict = pd.DataFrame({'Id': id_vector, 'SalePrice': sale_price_vector})

df_predict.to_csv('../data/kaggle_python_lasso.csv', header=True, index=False)

print("..file saved")

..file saved
