# Lasso regression
## By Austin Kaliher
### For APRD6342 (Digital Advertising)

In [1]:
# This code creates a Lasso regression to analyze digital advertising data. It was originally created for APRD6342
# (Digital Advertising) at CU Boulder. The data can be found in my GitHub repository

import pandas as pd
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error

# import the data set
alldata = pd.read_csv('finalmaster-ratios.csv')

# get a list of the column headers for analysis
allvariablenames = list(alldata.columns.values)

# skip the first 8 column headers; the remaining columns are the predictors
listofallpredictors = allvariablenames[8:]

#load predictors into dataframe
predictors = alldata[listofallpredictors]  

#load target into dataframe
target = alldata['# Purchases'] 
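
The column slicing in the cell above can be illustrated on a toy table. This is only a sketch: the column names `market`, `pred_1`, `pred_2`, and `pred_3` are made up for illustration, and the toy table has 2 leading non-predictor columns where the real file has 8.

```python
import pandas as pd

# Hypothetical toy table: 2 leading non-predictor columns, then 3 predictors.
# The real file has 8 leading columns; all names except '# Purchases' are invented.
toy = pd.DataFrame({
    "market": ["A", "B", "C"],
    "# Purchases": [5, 2, 9],
    "pred_1": [0.10, 0.20, 0.30],
    "pred_2": [0.40, 0.10, 0.20],
    "pred_3": [0.00, 0.05, 0.10],
})

n_leading = 2  # the notebook uses 8

# Everything after the leading columns is treated as a predictor.
predictor_names = list(toy.columns[n_leading:])
predictors = toy[predictor_names]

# The target is pulled out by name, so it never ends up among the predictors.
target = toy["# Purchases"]

print(predictor_names)
print(predictors.shape, target.shape)
```

Slicing by position is compact, but it silently depends on the column order of the CSV; selecting the target by name, as the notebook does, avoids that dependence for the target at least.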



In [2]:
# split the data into training and testing data

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)

# create the model using LASSO
        
model = LassoLarsCV(cv = 10, precompute = False).fit(pred_train, tar_train)

# build the coefficient chart

predictors_model = pd.DataFrame(listofallpredictors)
predictors_model.columns = ['label']
predictors_model['coeff'] = model.coef_
print(' ')
print(' ')
print(' ')
print('Coefficient Chart:')
print(' ')
print(' ')
print(' ')
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)



 
 
 
Coefficient Chart:
 
 
 
['B01001036' 2.7861365955132507]
['B01001037' 0.9200572652790069]
['B01001038' 0.9459340522644333]
['B02001005' 0.39156809216155525]
['B13014026' 0.22056164158451835]
['B13014027' 0.05049787197081092]
['B19001017' 1.6062678580473928]
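
The fit-and-chart step above can be reproduced on synthetic data as a rough sketch. Everything here is an assumption for illustration: the predictors `x0` to `x5`, the true coefficients, and the noise level are invented, and the chart is filtered on nonzero coefficients rather than on `> 0` as in the cell above, so that a negative coefficient also shows up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only x0 and x3 actually drive the target.
rng = np.random.default_rng(0)
names = [f"x{i}" for i in range(6)]
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=names)
y = 3.0 * X["x0"] - 1.5 * X["x3"] + rng.normal(scale=0.5, size=200)

# Same split and model settings as the notebook cell.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)
model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

# One row per predictor: its name and its fitted coefficient.
chart = pd.DataFrame({"label": names, "coeff": model.coef_})

# Lasso sets the coefficients of unhelpful predictors to exactly zero,
# so the nonzero rows are the predictors the model kept.
selected = chart[chart["coeff"] != 0]
print(selected)
```

Filtering on `> 0` instead would drop `x3` here even though the model kept it, which is worth keeping in mind when reading a chart that shows only positive coefficients.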


In [3]:
# Question #1
    
print(' ')   
print(' ')   
print(' ')   
print('============================== Question #1 ==============================')
print(' ')
print(' ')
print(' ')
print('predictors_model=pd.DataFrame(listofallpredictors)')
print('predictors_model.columns = [\'label\']')
print('predictors_model[\'coeff\'] = model.coef_')
print(' ')
print('for index, row in predictors_model.iterrows():')
print('    if row[\'coeff\'] > 0:')
print('        print(row.values)')
print(' ')
print(' ')
print('In your own words, explain what the above lines of code') 
print('are doing. Why am I doing it? Explain each line.')
print(' ')
print(' ')
print(' ')
print('Answer:')
print(' ')
print('The first line of code creates a predictors_model data frame that holds')
print('the predictor names. The second line names that column "label". The')
print('third line adds a new column called coeff that holds the coefficient')
print('associated with each predictor. After these first lines, a loop prints')
print('the label and coefficient for each variable that has a coefficient')
print('greater than 0.')

# Question #2 section. Look up the variable names.

# B01001036: Sex by age, female between 30 - 34 years
# B01001037: Sex by age, female between 35 - 39 years
# B01001038: Sex by age, female between 40 - 44 years
# B02001005: Race, Asian alone
# B13014026: Women who have not had a baby in the past 12 months, unmarried, with a bachelor's degree
# B13014027: Women who have not had a baby in the past 12 months, unmarried, with a graduate or professional degree
# B19001017: Household income from past 12 months, $200,000 +

print(' ')   
print(' ')   
print(' ')   
print('============================== Question #2 ==============================')
print(' ')
print(' ')
print(' ')
print('There are a few variables that have been identified by this model as')
print('predicting sales. The first three categories are women between the ages')
print('of 30 and 44 years of age. After this, we have people of Asian descent')
print('and no other ethnicity. After that, we have females that have not had a')
print('baby in the past 12 months who have either a bachelor\'s degree or a')
print('post-grad degree. Finally, we have households that have a combined income')
print('of over $200,000 per year. The practical implication of these groups is')
print('that Bobo is more likely to sell Bobo Bars in markets where these groups')
print('of people live.')
print(' ')   
print(' ')   
print(' ')   
print('============================== Question #3 ==============================')
print(' ')
print(' ')
print(' ')
print('The two groups with the highest coefficients in the model are females')
print('between 30 and 34 years old and households with combined income of over')
print('$200,000 per year. These are the groups that best predict sales data.')
print(' ')   
print(' ')   
print(' ')   
print('============================== Question #4 ==============================')
print(' ')
print(' ')
print(' ')
train_error = mean_squared_error(tar_train, model.predict(pred_train))
print('training data MSE')
print(train_error)
print(' ')
# keep the test error in its own variable so the training value is not overwritten
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('test data MSE')
print(test_error)
print('')
print('The MSE is not the same for the two data sets. The MSE for the training')
print('data set is smaller than the MSE for the test data set. This means that')
print('our model is a better predictor of the training data set than it is of')
print('the test data set.')
print(' ')   
print(' ')   
print(' ')   
print('============================== Question #5 ==============================')
print(' ')
print(' ')
print(' ')
rsquared_train = model.score(pred_train, tar_train)
print('training data R-square')
print(rsquared_train)
print(' ')
# keep the test score in its own variable so the training value is not overwritten
rsquared_test = model.score(pred_test, tar_test)
print('test data R-square')
print(rsquared_test)
print(' ')
print(' ')
print(' ')
print('The census data does not predict sales very well. The training data')
print('has an R-squared of .22 and the test data has an R-squared of .17. If we')
print('were looking to make a predictive model that does a good job of predicting')
print('sales, we would want to see R-squared values much higher than these.')
print(' ')   
print(' ')   
print(' ')   
print('============================== Question #6 ==============================')
print(' ')
print(' ')
print(' ')
print("y intercept:")
print(model.intercept_)
print(' ')
print(' ')
print(' ')
print('The y intercept for this model is about 2.7 bars. From a conceptual')
print('standpoint, this is the predicted number of bars sold when every predictor')
print('in the equation is zero. From a practical standpoint, the model predicts')
print('about 2.7 bars sold in a market that does not include anyone in the')
print('categories identified above.')

 
 
 
 
 
 
predictors_model=pd.DataFrame(listofallpredictors)
predictors_model.columns = ['label']
predictors_model['coeff'] = model.coef_
 
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)
 
 
In your own words, explain what the above lines of code
are doing. Why am I doing it? Explain each line.
 
 
 
Answer:
 
The first line of code creates a predictors_model data frame that holds
the predictor names. The second line names that column "label". The
third line adds a new column called coeff that holds the coefficient
associated with each predictor. After these first lines, a loop prints
the label and coefficient for each variable that has a coefficient
greater than 0.
 
 
 
 
 
 
There are a few variables that have been identified by this model as
predicting sales. The first three categories are women between the ages
of 30 and 44 years of age. After this, we have people of Asian descent
and no