## Running a Lasso Regression Analysis

***

### Project Description

This week’s assignment involves running a lasso regression analysis. Lasso regression is a shrinkage and variable selection method for linear regression models. Its goal is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters (a penalty on the sum of their absolute values) that causes the regression coefficients of some variables to shrink toward zero. Variables whose coefficients are exactly zero after shrinkage are excluded from the model, while variables with non-zero coefficients are those most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.
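
To make the shrinkage behaviour concrete, here is a minimal sketch on synthetic data (not the Gapminder data used in this notebook). The data set from `make_regression`, the penalty values, and all `*_demo` names are illustrative assumptions. It counts how many coefficients become exactly zero as the penalty strength `alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 predictors, of which only 3 carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit the lasso at increasing penalty strengths and count the zeroed coefficients.
zero_counts = {}
for alpha in (0.1, 1.0, 10.0, 50.0):
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha:>5}: {zero_counts[alpha]} of {X_demo.shape[1]} coefficients are exactly zero")
```

A stronger penalty typically removes more predictors; cross-validation is then used to pick the penalty that predicts best on held-out data.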

Your assignment is to run a lasso regression analysis using k-fold cross-validation to identify the subset of predictors, from a larger pool of predictor variables, that best predicts a quantitative response variable.

## Data Dictionary

| Field                | Description                                                   |
|----------------------|---------------------------------------------------------------|
| incomeperperson      | 2010 Gross Domestic Product per capita in constant 2000 US$   |
| alcconsumption       | 2008 alcohol consumption per adult (age 15+), litres          |
| armedforcesrate      | Armed forces personnel (% of total labor force)               |
| breastcancerper100th | 2002 breast cancer new cases per 100,000 females              |
| co2emissions         | 2006 cumulative CO2 emission (metric tons)                    |
| femaleemployrate     | 2007 female employees age 15+ (% of population)               |
| employrate           | 2007 total employees age 15+ (% of population)                |
| hivrate              | 2009 estimated HIV prevalence (%)                             |
| internetuserate      | 2010 Internet users (per 100 people)                          |
| lifeexpectancy       | 2011 life expectancy at birth (years)                         |
| oilperperson         | 2010 oil consumption per capita (tonnes per year and person)  |
| polityscore          | 2009 Democracy score (Polity)                                 |
| relectricperperson   | 2008 residential electricity consumption, per person (kWh)    |
| suicideper100th      | 2005 suicide, age adjusted, per 100,000                       |
| urbanrate            | 2008 urban population (% of total)                            |

## Summary

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
# import shap
# import statsmodels.api as sm
# import datetime
# from datetime import datetime, timedelta
# import scipy.stats
# import pandas_profiling
# from pandas_profiling import ProfileReport
# import graphviz

# import xgboost as xgb
# from xgboost import XGBClassifier, XGBRegressor
# from xgboost import to_graphviz, plot_importance

#from sklearn.experimental import enable_hist_gradient_boosting
#from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
#from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor
#from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier, HistGradientBoostingRegressor


%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
#sns.set_style('dark')
#sns.set(font_scale=1.2)

plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

#from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoLarsCV

#from sklearn.pipeline import Pipeline
#from sklearn.model_selection import RepeatedStratifiedKFold
#from sklearn.feature_selection import RFE, RFECV, SelectKBest, f_classif, f_regression, chi2

#from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, OneHotEncoder
#from sklearn.pipeline import Pipeline
from sklearn.tree import export_graphviz, plot_tree
from sklearn.metrics import confusion_matrix, classification_report, mean_absolute_error, mean_squared_error,r2_score
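# plot_confusion_matrix, plot_precision_recall_curve and plot_roc_curve were removed in
# scikit-learn 1.2 (and are not used in this notebook), so only accuracy_score is imported here.
from sklearn.metrics import accuracy_score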
from sklearn.metrics import auc, f1_score, precision_score, recall_score, roc_auc_score


#from tpot import TPOTClassifier, TPOTRegressor
#from imblearn.under_sampling import RandomUnderSampler
#from imblearn.over_sampling import RandomOverSampler
#from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

# import pickle
# from pickle import dump, load

# Use Folium library to plot values on a map.
#import folium

# Use Feature-Engine library

#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce
#from feature_engine.discretisation import EqualWidthDiscretiser, EqualFrequencyDiscretiser, DecisionTreeDiscretiser
#from feature_engine.encoding import OrdinalEncoder

np.random.seed(0)

#from pycaret.classification import *
#from pycaret.clustering import *
#from pycaret.regression import *

pd.set_option('display.max_columns',100)
#pd.set_option('display.max_rows',100)
pd.set_option('display.width', 1000)
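# Note: pd.option_context only takes effect inside a `with` block,
# so this call does not change how floats are displayed below.
pd.option_context('float_format', '{:.2f}'.format)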
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Exploratory Data Analysis

In [2]:
df = pd.read_csv("gapminderfinal6.csv")

In [3]:
df

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate,demoscorecat,co2cat,incomecat,alccat,electricat,politycat
0,8740.97,0.03,0.57,27,76.0,25.60,1.94,4,49,1.48,0,1173.18,7,55.70,24.04,1,1,3,0,0.155844,0
1,1915.00,7.29,1.02,57,224.0,42.10,1.94,45,77,1.48,9,636.34,8,51.40,46.72,3,2,1,3,0.101449,1
2,2231.99,0.69,2.31,24,2932.0,31.70,0.10,12,73,0.42,2,590.51,5,50.50,65.22,2,3,2,0,0.101449,1
3,21943.34,10.17,1.44,37,5033.0,47.55,1.94,81,70,1.48,4,1173.18,5,58.64,88.92,2,4,4,4,0.155844,1
4,1381.00,5.57,1.46,23,248.0,69.40,2.00,10,51,1.48,-2,173.00,15,75.70,56.70,1,2,1,2,0.101449,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208,722.81,3.91,1.09,16,1425.0,67.60,0.40,28,75,1.48,-7,302.73,12,71.00,27.84,0,3,1,1,0.101449,0
209,8740.97,6.69,5.94,37,14.0,11.30,1.94,36,73,1.48,4,1173.18,10,32.00,71.90,2,0,3,2,0.155844,1
210,610.36,0.20,2.32,35,235.0,20.30,1.94,12,65,1.48,-2,130.06,6,39.00,30.64,1,2,0,0,0.101449,0
211,432.23,3.56,0.34,13,132.0,53.50,13.50,10,49,1.48,7,168.62,12,61.00,35.42,3,2,0,1,0.101449,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   incomeperperson       213 non-null    float64
 1   alcconsumption        213 non-null    float64
 2   armedforcesrate       213 non-null    float64
 3   breastcancerper100th  213 non-null    int64  
 4   co2emissions          213 non-null    float64
 5   femaleemployrate      213 non-null    float64
 6   hivrate               213 non-null    float64
 7   internetuserate       213 non-null    int64  
 8   lifeexpectancy        213 non-null    int64  
 9   oilperperson          213 non-null    float64
 10  polityscore           213 non-null    int64  
 11  relectricperperson    213 non-null    float64
 12  suicideper100th       213 non-null    int64  
 13  employrate            213 non-null    float64
 14  urbanrate             213 non-null    float64
 15  demoscorecat          2

In [5]:
df.shape

(213, 21)

In [6]:
df.columns

Index(['incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate', 'demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat', 'politycat'], dtype='object')

## Data Preprocessing

### Drop unwanted features

In [7]:
df.columns

Index(['incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate', 'demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat', 'politycat'], dtype='object')

In [8]:
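# Drop the derived categorical columns; polityscore is dropped too, so it is not a candidate predictor.
df.drop(['demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat', 'polityscore', 'politycat'], axis=1, inplace=True)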

In [9]:
df.head()

Unnamed: 0,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,relectricperperson,suicideper100th,employrate,urbanrate
0,8740.97,0.03,0.57,27,76.0,25.6,1.94,4,49,1.48,1173.18,7,55.7,24.04
1,1915.0,7.29,1.02,57,224.0,42.1,1.94,45,77,1.48,636.34,8,51.4,46.72
2,2231.99,0.69,2.31,24,2932.0,31.7,0.1,12,73,0.42,590.51,5,50.5,65.22
3,21943.34,10.17,1.44,37,5033.0,47.55,1.94,81,70,1.48,1173.18,5,58.64,88.92
4,1381.0,5.57,1.46,23,248.0,69.4,2.0,10,51,1.48,173.0,15,75.7,56.7


### Treat Missing Values

In [10]:
df.isnull().sum()

incomeperperson         0
alcconsumption          0
armedforcesrate         0
breastcancerper100th    0
co2emissions            0
femaleemployrate        0
hivrate                 0
internetuserate         0
lifeexpectancy          0
oilperperson            0
relectricperperson      0
suicideper100th         0
employrate              0
urbanrate               0
dtype: int64

### Treat Duplicate Values

In [11]:
df.duplicated(keep='first').sum()

0

### Treat Data Types

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   incomeperperson       213 non-null    float64
 1   alcconsumption        213 non-null    float64
 2   armedforcesrate       213 non-null    float64
 3   breastcancerper100th  213 non-null    int64  
 4   co2emissions          213 non-null    float64
 5   femaleemployrate      213 non-null    float64
 6   hivrate               213 non-null    float64
 7   internetuserate       213 non-null    int64  
 8   lifeexpectancy        213 non-null    int64  
 9   oilperperson          213 non-null    float64
 10  relectricperperson    213 non-null    float64
 11  suicideper100th       213 non-null    int64  
 12  employrate            213 non-null    float64
 13  urbanrate             213 non-null    float64
dtypes: float64(10), int64(4)
memory usage: 23.4 KB


### Train Test Split

In [13]:
df.shape

(213, 14)

In [14]:
# Predictors: the 13 columns after the first one.
X = df.iloc[:,1:14]
# Response: incomeperperson, the first column.
y = df.iloc[:,0]
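
The positional split above depends on `incomeperperson` being the first column. As a hedged alternative, the sketch below selects the response by name instead. The frame `demo` is a stand-in built from the first three rows and three of the columns shown earlier; `target`, `X_demo`, and `y_demo` are names introduced only for this illustration.

```python
import pandas as pd

# Stand-in frame: first three rows of three columns from the output shown earlier.
demo = pd.DataFrame({
    "incomeperperson": [8740.97, 1915.00, 2231.99],
    "alcconsumption": [0.03, 7.29, 0.69],
    "urbanrate": [24.04, 46.72, 65.22],
})

target = "incomeperperson"             # response variable, selected by name
y_demo = demo[target]
X_demo = demo.drop(columns=[target])   # every other column is a predictor

print(list(X_demo.columns))
```

Selecting by name keeps the split correct even if the column order of the CSV changes.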

### Train Test Split Cont'd

In [15]:
X.values, y.values

(array([[ 0.03,  0.57, 27.  , ...,  7.  , 55.7 , 24.04],
        [ 7.29,  1.02, 57.  , ...,  8.  , 51.4 , 46.72],
        [ 0.69,  2.31, 24.  , ...,  5.  , 50.5 , 65.22],
        ...,
        [ 0.2 ,  2.32, 35.  , ...,  6.  , 39.  , 30.64],
        [ 3.56,  0.34, 13.  , ..., 12.  , 61.  , 35.42],
        [ 4.96,  1.03, 19.  , ..., 14.  , 66.8 , 37.34]]),
 array([  8740.97,   1915.  ,   2231.99,  21943.34,   1381.  ,  11894.46,
         10749.42,   1326.74,   8740.97,  25249.99,  26692.98,   2344.9 ,
         19630.54,  12505.21,    558.06,   9243.59,   2737.67,  24496.05,
          3545.65,    377.04,  62682.15,   1324.19,   1232.79,   2183.34,
          4189.44,   4699.41,  17092.46,   2549.56,    276.2 ,    115.31,
           557.95,    713.64,  25575.35,   1959.84,   8740.97,    239.52,
           275.88,   6334.11,   2425.47,   3233.42,    336.37,    103.78,
          1253.29,   8740.97,   5188.9 ,    591.07,   6338.49,   4495.05,
         15313.86,   7381.31,  30532.28,    895.32,

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [17]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((170, 13), (43, 13), (170,), (43,))

### Feature Scaling

In [18]:
X_train

Unnamed: 0,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,relectricperperson,suicideper100th,employrate,urbanrate
16,18.85,3.71,36,1000.0,48.6,0.30,32,70,0.69,614.91,27,53.4,73.46
135,2.42,1.01,22,53.0,54.6,0.40,8,69,1.48,31.54,12,61.8,17.24
122,0.11,1.55,28,57.0,45.3,0.70,3,59,1.48,1173.18,7,46.9,41.00
22,5.78,1.88,25,255.0,61.6,0.20,20,67,1.48,213.06,2,70.4,65.58
80,8.70,0.88,30,69.0,41.8,1.20,30,70,1.48,1173.18,36,58.9,28.38
...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,3.58,0.13,6,7.0,65.7,2.00,9,58,1.48,1173.18,6,71.7,56.42
192,1.92,0.34,28,32.0,48.4,3.20,5,57,1.48,66.24,6,63.9,42.00
117,6.69,1.44,37,8.0,42.1,0.06,28,77,1.48,1173.18,22,56.9,37.86
47,5.12,1.50,31,1287.0,43.7,0.10,16,79,1.48,528.79,11,56.0,75.66


In [19]:
scaler = StandardScaler()

In [20]:
X_train_scaled = scaler.fit_transform(X_train)

In [21]:
X_test_scaled = scaler.transform(X_test)
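
The scaler is fitted on the training set only and then reused on the test set, so no information from the test rows leaks into the scaling parameters. The sketch below illustrates this on random data; the array `X_demo` and the `*_demo`, `X_tr`, `X_te` names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data (assumption): 100 rows, 3 features centred near 50.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # learns mean and std from the training rows
X_te_s = scaler_demo.transform(X_te)      # applies the training mean and std unchanged

train_means = X_tr_s.mean(axis=0)
train_stds = X_tr_s.std(axis=0)
test_means = X_te_s.mean(axis=0)

print("train means:", train_means.round(3))  # zero up to rounding
print("train stds: ", train_stds.round(3))   # one up to rounding
print("test means: ", test_means.round(3))   # close to, but generally not exactly, zero
```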

In [22]:
X_train_scaled

array([[ 2.77667345,  1.48793291, -0.03438884, ...,  2.88198276,
        -0.56881929,  0.72721402],
       [-0.88708946, -0.29739988, -0.71643414, ...,  0.37719202,
         0.31039828, -1.70341109],
       [-1.40220159,  0.05966668, -0.42412901, ..., -0.45773823,
        -1.24916623, -0.67616718],
       ...,
       [ 0.06508751, -0.0130691 ,  0.01432868, ...,  2.04705251,
        -0.20247864, -0.81192248],
       [-0.28501035,  0.02660496, -0.27797645, ...,  0.21020597,
        -0.29668052,  0.8223292 ],
       [ 0.68723593, -0.62140472, -0.42412901, ...,  0.37719202,
         0.5197358 , -0.8162459 ]])

In [23]:
X_test_scaled

array([[-0.7532941 ,  0.6084912 , -0.71643414,  0.0066039 , -1.90230782,
        -0.56211855,  0.23156383,  0.51700201, -1.61166644, -0.50557267,
        -0.95869638, -1.67830814,  0.52055468],
       [ 0.53783111, -0.07919254,  0.35535134, -0.13228115, -0.89378085,
        -0.45300598,  0.42472305,  1.05620655, -0.88232713, -0.52167688,
         0.04321992, -0.82002432,  1.37486191],
       [-1.32192438,  4.15270741, -0.22925893, -0.11117873, -2.60679356,
         0.04120982, -1.23644629, -0.02220254,  0.07978004, -0.99339382,
         1.21212227, -2.24351944,  0.43062761],
       [ 0.45755389, -0.0130691 ,  0.01432868, -0.19877368, -0.0001963 ,
         0.04120982, -0.57970492, -0.45356618,  0.07978004,  0.04583265,
         0.04321992, -0.020355  ,  0.16949321],
       [ 0.33713807, -0.376748  , -0.13182388, -0.19540743,  1.31607971,
        -0.48509791, -0.54107308,  0.30132019,  0.07978004, -0.52510501,
        -0.95869638,  1.49315527,  0.15825233],
       [ 0.06508751, -0.013069

### Model Training

### Using Regression or Classification Models

In [24]:
lasso = LassoLarsCV(verbose=True, cv=5)

In [25]:
lasso.fit(X_train_scaled,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


LassoLarsCV(cv=5, verbose=True)
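
After fitting, a `LassoLarsCV` object exposes the penalty it selected and the cross-validated error along the regularization path. The sketch below shows how these attributes might be inspected; it uses synthetic data from `make_regression` rather than the Gapminder training set, and `lasso_demo`, `X_demo`, and `y_demo` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) with the same shape as the training set: 170 rows, 13 predictors.
X_demo, y_demo = make_regression(
    n_samples=170, n_features=13, n_informative=5, noise=20.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

lasso_demo = LassoLarsCV(cv=5).fit(X_demo, y_demo)

# mse_path_ has one row per candidate alpha and one column per fold.
mean_cv_mse = lasso_demo.mse_path_.mean(axis=1)

print("selected alpha:       ", lasso_demo.alpha_)
print("lowest mean CV MSE:   ", mean_cv_mse.min())
print("non-zero coefficients:", int((lasso_demo.coef_ != 0).sum()))
```

Plotting `mean_cv_mse` against `lasso_demo.cv_alphas_` is a common way to see where the cross-validated error bottoms out.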

In [26]:
y_pred = lasso.predict(X_test_scaled)

In [27]:
y_pred

array([ 7899.12753004, 10837.7764023 , -2572.22773687,  2953.57596371,
        4838.9472499 ,  6036.75004839,  6166.5082613 ,  9918.40261599,
       21236.17806371, 23157.26849425,  3311.59578222, 30446.44435468,
        8333.53376654, 18622.95892137,  4714.3333953 , 29554.57036142,
         305.51317127, 14252.15738487,  5709.13098974,  6278.66910912,
        4739.79194075,  -856.29696417, -3064.5530701 , -1825.08264503,
        7973.37518113,   622.61933921, 30961.32959658, 21254.01486584,
       -1841.20486384, 18238.84882892,  6050.36723372, 24871.93572702,
       -2579.27987521,  8182.97282499, 30242.65548198, 19006.34076646,
        6715.35129425, 12892.74326482, 18185.6477461 ,  2768.34243279,
        3292.36937041, 16938.25749688, 27648.58245033])

In [28]:
y_pred = y_pred.round(2)

### Model Evaluation

In [29]:
lasso.coef_

array([-1906.64492429,   -18.11559052,  1142.00715297,   844.83707806,
           0.        ,  1010.56499734,  8403.75601814,  -317.69110004,
         343.77231823,     0.        ,     0.        ,  1503.64955834,
        1378.59577031])

In [30]:
X.columns

Index(['alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'], dtype='object')

In [31]:
pd.DataFrame({"Columns": X.columns, "Regression Coefficients": lasso.coef_})

Unnamed: 0,Columns,Regression Coefficients
0,alcconsumption,-1906.644924
1,armedforcesrate,-18.115591
2,breastcancerper100th,1142.007153
3,co2emissions,844.837078
4,femaleemployrate,0.0
5,hivrate,1010.564997
6,internetuserate,8403.756018
7,lifeexpectancy,-317.6911
8,oilperperson,343.772318
9,relectricperperson,0.0
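
To read the selection result at a glance, the coefficients can be split into retained and dropped predictors and ranked by absolute size. The sketch below hard-codes the `lasso.coef_` values and the `X.columns` names printed above; the names `coef_table`, `retained`, `dropped`, and `ranked` are introduced only for this illustration.

```python
import pandas as pd

# Column names and coefficients copied from the X.columns and lasso.coef_ outputs above.
columns = [
    "alcconsumption", "armedforcesrate", "breastcancerper100th", "co2emissions",
    "femaleemployrate", "hivrate", "internetuserate", "lifeexpectancy",
    "oilperperson", "relectricperperson", "suicideper100th", "employrate", "urbanrate",
]
coefs = [
    -1906.64492429, -18.11559052, 1142.00715297, 844.83707806,
    0.0, 1010.56499734, 8403.75601814, -317.69110004,
    343.77231823, 0.0, 0.0, 1503.64955834, 1378.59577031,
]

coef_table = pd.DataFrame({"predictor": columns, "coefficient": coefs})

# Predictors the lasso kept (non-zero) and removed (exactly zero).
retained = coef_table[coef_table["coefficient"] != 0]
dropped = coef_table.loc[coef_table["coefficient"] == 0, "predictor"].tolist()

# Rank retained predictors by the magnitude of their standardized coefficient.
order = retained["coefficient"].abs().sort_values(ascending=False).index
ranked = retained.loc[order]

print(ranked.to_string(index=False))
print("dropped:", dropped)
```

Because the predictors were standardized before fitting, the magnitudes are comparable across variables: `internetuserate` has by far the largest coefficient, and `femaleemployrate`, `relectricperperson`, and `suicideper100th` were removed from the model.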


In [32]:
mse_test = mean_squared_error(y_test, y_pred)

In [33]:
mse_test

38403119.36073256

In [34]:
y_pred_train = lasso.predict(X_train_scaled)

In [35]:
mse_train = mean_squared_error(y_train, y_pred_train)

In [36]:
mse_train

109948200.40027761

In [37]:
r2_test = r2_score(y_test, y_pred)
r2_test

0.6861879081444817

In [38]:
r2_train = r2_score(y_train, y_pred_train)
r2_train

0.4354082802355824
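
The mean squared errors above are expressed in squared units of `incomeperperson`, which makes them hard to interpret. Taking the square root gives a root mean squared error (RMSE) in the same units as the response. The sketch below does this for the two MSE values reported above; `rmse_test` and `rmse_train` are names added for the example.

```python
import math

# MSE values copied from the outputs of In [33] and In [36] above.
mse_test = 38403119.36073256
mse_train = 109948200.40027761

# RMSE is in the same units as incomeperperson (constant 2000 US$).
rmse_test = math.sqrt(mse_test)
rmse_train = math.sqrt(mse_train)

print(f"test RMSE:  {rmse_test:,.0f}")
print(f"train RMSE: {rmse_train:,.0f}")
```

Here the test error is lower than the training error and the test R² is higher, which is unusual. With only 213 observations, this may simply reflect which countries happened to land in the 43-row test split rather than a genuinely better fit on unseen data.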

### Cross-Validation

In [39]:
cv = cross_val_score(lasso,X,y,cv=5,verbose=1,scoring='neg_mean_squared_error')

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


In [40]:
cv.mean()

-104408943.11007066
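
Note that the `cross_val_score` call above receives the unscaled `X`, whereas the model evaluated earlier was trained on standardized features. One possible way to keep the two consistent is to wrap the scaler and the model in a pipeline, so that scaling is re-fitted inside each fold. The sketch below shows the idea on synthetic data; `pipe`, `X_demo`, and `y_demo` are illustrative names, and this is not the procedure used to produce the numbers in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) with the same shape as the full data set: 213 rows, 13 predictors.
X_demo, y_demo = make_regression(
    n_samples=213, n_features=13, n_informative=5, noise=20.0, random_state=0
)

# The scaler is fitted on the training part of each outer fold only.
pipe = make_pipeline(StandardScaler(), LassoLarsCV(cv=5))

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
mean_mse = -scores.mean()  # flip the sign back to a positive MSE

print("per-fold MSE:", (-scores).round(1))
print("mean MSE:    ", round(mean_mse, 1))
```

scikit-learn reports this metric as a negative number so that larger is always better, which is why `cv.mean()` above is negative.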

#### Python code done by Dennis Lam