### Week 3 Application Assignment

Let’s reconsider the customer reward program dataset. In this exercise, you will complete a predictive modeling task where the target variable is binary.

The dataset also contains a column IndustryType, which is created based on the column Industry in the raw data. Note that Industry has many categories. The analyst who prepared the data chose to combine some categories, which resulted in the column IndustryType. IndustryType has five categories: Department, Discount, Grocery, Restaurants, Specialty. You can create a set of dummy variables based on IndustryType in XLMiner by using the Transform functions.
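Outside XLMiner, the same dummy variables can be built in pandas. The following is a minimal sketch on a small hypothetical frame (the `toy` rows are made up for illustration and are not taken from the dataset); it assumes the column is named `IndustryType` and uses the five category labels listed above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the real dataset.
toy = pd.DataFrame(
    {"IndustryType": ["Department", "Discount", "Grocery",
                      "Restaurants", "Specialty", "Grocery"]}
)

# One 0/1 column per category, named IndustryType_<category>.
dummies = pd.get_dummies(toy["IndustryType"], prefix="IndustryType", dtype=int)

# Keep only the two indicators the assignment asks for.
selected = dummies[["IndustryType_Discount", "IndustryType_Grocery"]]
print(selected)
```

Retailers in the three remaining categories get 0 in both columns, so together they act as the baseline group in the regression.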

Part I.

Consider logistic regression models with the Reward column as the target variable. Fit the model with two indicator variables: one indicating whether a retailer is a discount store (i.e., IndustryType is Discount), and the other indicating whether a retailer is a grocery store (i.e., IndustryType is Grocery). Report the coefficient estimates in the next three questions. [Hint: After you create the dummy variables, use them as Selected Variables (instead of Categorical Variables) in the first step of Logistic Regression.]

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import datetime

%matplotlib inline
sns.set_style('dark')
sns.set(font_scale=1.2)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from pycaret.classification import *

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)

In [2]:
df = pd.read_csv("crp_cleandata3.csv")

In [3]:
df.head()

Unnamed: 0,Retailer,Salerank,X2013USSales,X2013WorldSales,ProfitMargin,NumStores,Industry,Reward,ProgramName,RewardType,RewardStructure,RewardSize,ExpirationMonth,IndustryType
0,A&P,74,5.831,5.831,48.85,0.277,"Discount, Variety Stores",0,No rewards program,-,-,-,-,Discount
1,Albertsons,21,19.452,19.452,69.02,1.024,Grocery Stores,0,No rewards program,-,-,-,-,Grocery
2,Aldi,38,10.898,10.65,69.41,1.328,Grocery Stores,0,No rewards program,-,-,-,-,Grocery
3,Alimentation Couche Tard (Circle K),82,4.755,8.551,68.03,3.826,Grocery Stores,0,No rewards program,-,-,-,-,Grocery
4,Apple Stores,15,26.648,30.736,11.07,0.254,ElectronicEquipment,0,No rewards program,-,-,-,-,Specialty


In [4]:
df.describe(include='all')

Unnamed: 0,Retailer,Salerank,X2013USSales,X2013WorldSales,ProfitMargin,NumStores,Industry,Reward,ProgramName,RewardType,RewardStructure,RewardSize,ExpirationMonth,IndustryType
count,100,100.0,100.0,100.0,100.0,100.0,100,100.0,100,96,100,100,100,100
unique,100,,,,,,17,,58,11,57,27,11,5
top,Wakefern / Shoprite,,,,,,Grocery Stores,,No rewards program,-,-,-,-,Specialty
freq,1,,,,,,18,,40,37,41,44,44,43
mean,,50.5,18.3735,24.13154,45.273,2.69876,,0.55,,,,,,
std,,29.011492,36.476003,50.845864,29.23139,3.997641,,0.5,,,,,,
min,,1.0,3.6,3.6,1.02,0.0,,0.0,,,,,,
25%,,25.75,5.20675,6.10825,19.445,0.3385,,0.0,,,,,,
50%,,50.5,8.3485,9.629,42.02,1.3315,,1.0,,,,,,
75%,,75.25,16.841,22.1315,69.5725,3.51975,,1.0,,,,,,


### Exploratory Data Analysis

In [5]:
df.columns

Index(['Retailer', 'Salerank', 'X2013USSales', 'X2013WorldSales',
       'ProfitMargin', 'NumStores', 'Industry', 'Reward', 'ProgramName',
       'RewardType', 'RewardStructure', 'RewardSize', 'ExpirationMonth',
       'IndustryType'],
      dtype='object')

In [6]:
df.drop(['Retailer','Salerank', 'X2013USSales', 'X2013WorldSales','ProfitMargin', 'NumStores',
         'Industry', 'ProgramName','RewardType', 'RewardStructure', 'RewardSize', 'ExpirationMonth'],axis=1,inplace=True)

In [7]:
df.head()

Unnamed: 0,Reward,IndustryType
0,0,Discount
1,0,Grocery
2,0,Grocery
3,0,Grocery
4,0,Specialty


In [8]:
df["IndustryType_Discount"] = np.where(df["IndustryType"] == "Discount",1,0)

In [9]:
df["IndustryType_Grocery"] = np.where(df["IndustryType"] == "Grocery",1,0)

In [10]:
df

Unnamed: 0,Reward,IndustryType,IndustryType_Discount,IndustryType_Grocery
0,0,Discount,1,0
1,0,Grocery,0,1
2,0,Grocery,0,1
3,0,Grocery,0,1
4,0,Specialty,0,0
...,...,...,...,...
95,1,Specialty,0,0
96,1,Specialty,0,0
97,1,Specialty,0,0
98,1,Specialty,0,0


In [11]:
df["Reward"].value_counts()

1    55
0    45
Name: Reward, dtype: int64

### Linear Regression

In [12]:
df.columns

Index(['Reward', 'IndustryType', 'IndustryType_Discount',
       'IndustryType_Grocery'],
      dtype='object')

In [13]:
y = df['Reward']
x1 = df[['IndustryType_Discount', 'IndustryType_Grocery']]

In [14]:
x = sm.add_constant(x1)

In [15]:
results = sm.OLS(y,x).fit()

In [16]:
results.summary()

0,1,2,3
Dep. Variable:,Reward,R-squared:,0.042
Model:,OLS,Adj. R-squared:,0.022
Method:,Least Squares,F-statistic:,2.101
Date:,"Fri, 21 Aug 2020",Prob (F-statistic):,0.128
Time:,18:36:15,Log-Likelihood:,-69.956
No. Observations:,100,AIC:,145.9
Df Residuals:,97,BIC:,153.7
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.6250,0.062,10.111,0.000,0.502,0.748
IndustryType_Discount,-0.2361,0.132,-1.790,0.077,-0.498,0.026
IndustryType_Grocery,-0.1806,0.132,-1.368,0.174,-0.442,0.081

0,1,2,3
Omnibus:,1138.476,Durbin-Watson:,0.131
Prob(Omnibus):,0.0,Jarque-Bera (JB):,13.95
Skew:,-0.2,Prob(JB):,0.000935
Kurtosis:,1.214,Cond. No.,3.16
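
The two dummies split the retailers into three groups (Discount, Grocery, and all others), and the model has one parameter per group. The OLS fit above therefore reproduces each group's reward proportion, and an unpenalized logistic regression on the same predictors would reproduce the same proportions on the log-odds scale. The following is a minimal sketch of that conversion, using only the coefficients printed in the summary; the helper name `log_odds` is introduced here for illustration, and the results inherit the rounding of the printed values.

```python
import math

# OLS coefficients copied from the summary table above.
const, b_discount, b_grocery = 0.6250, -0.2361, -0.1806


def log_odds(p):
    """Convert a probability to log-odds (illustrative helper)."""
    return math.log(p / (1 - p))


# Fitted reward proportion in each group.
p_other = const
p_discount = const + b_discount
p_grocery = const + b_grocery

# Implied logistic coefficients: baseline log-odds, then differences from it.
logit_const = log_odds(p_other)
logit_discount = log_odds(p_discount) - logit_const
logit_grocery = log_odds(p_grocery) - logit_const

print(round(logit_const, 3), round(logit_discount, 3), round(logit_grocery, 3))
```

This shortcut only works because the predictors are mutually exclusive indicators; with continuous predictors, the OLS and logistic fits would not map onto each other this way.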


In [17]:
df.head()

Unnamed: 0,Reward,IndustryType,IndustryType_Discount,IndustryType_Grocery
0,0,Discount,1,0
1,0,Grocery,0,1
2,0,Grocery,0,1
3,0,Grocery,0,1
4,0,Specialty,0,0


In [18]:
df2 = df.drop(['IndustryType'],axis=1)

In [19]:
df2.head()

Unnamed: 0,Reward,IndustryType_Discount,IndustryType_Grocery
0,0,1,0
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,0


In [20]:
X = df2.iloc[:,1:]
y = df2.iloc[:,0]

In [21]:
X.values, y.values

(array([[1, 0],
        [0, 1],
        [0, 1],
        [0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0],
        [0, 0],
        [1, 0],
        [1, 0],
        [0, 0],
        [0, 0],
        [0, 1],
        [0, 0],
        [1, 0],
        [1, 0],
        [1, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0],
        [0, 0],
        [0, 0],
        [0, 1],
        [0, 0],
        [0, 0],
        [1, 0],
        [1, 0],
        [0, 1],
        [0, 0],
        [0, 1],
        [1, 0],
        [0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [0, 0],
        [1, 0],
        [0, 0],
        [0, 0],
        [0, 1],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [1, 0],
        [1, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        [0, 0],
        ...

In [22]:
#train_test_split()

In [23]:
logistic = LogisticRegression()

In [24]:
logistic.fit(X,y)

LogisticRegression()
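
The fitted estimates are stored in the `intercept_` and `coef_` attributes. Note that scikit-learn's `LogisticRegression` applies an L2 penalty by default (`C=1.0`), so its coefficients are shrunk toward zero and will generally not match an unpenalized fit such as the one a spreadsheet tool would be expected to report. The following is a minimal sketch on hypothetical toy data (the counts are made up, not taken from the dataset) that compares the default fit with a nearly unpenalized one obtained by setting a very large `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: columns are [Discount dummy, Grocery dummy].
X_toy = np.array([[0, 0]] * 8 + [[1, 0]] * 6 + [[0, 1]] * 6)
y_toy = np.array([1] * 5 + [0] * 3     # baseline group: 5 of 8 rewarded
                 + [1] * 2 + [0] * 4   # discount group: 2 of 6 rewarded
                 + [1] * 3 + [0] * 3)  # grocery group: 3 of 6 rewarded

# A very large C makes the L2 penalty negligible (approximate MLE).
mle = LogisticRegression(C=1e6, max_iter=1000).fit(X_toy, y_toy)

# Default settings: L2 penalty with C=1.0.
default = LogisticRegression().fit(X_toy, y_toy)

print("near-unpenalized:", mle.intercept_, mle.coef_)
print("default (L2):    ", default.intercept_, default.coef_)
```

In the near-unpenalized fit, the intercept should be close to the baseline group's log-odds, log(5/3), while the default fit returns coefficients of smaller magnitude.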

### Use PyCaret

Part II.

Split the dataset into training and validation sets using a 60:40 split (set the seed for partitioning to 12345; this should be the default value if you have not changed it). [Hint: Note that there are two Partition buttons in the XLMiner ribbon. You should use Partition->Standard Partition in the Data Mining group.] Report the new coefficient estimates in the next three questions. Use the same two predictor variables as in Part I.
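
A 60:40 partition can also be made explicitly with scikit-learn's `train_test_split`. The following is a minimal sketch on a hypothetical 100-row frame (the values are placeholders, not the real data). Reusing the seed 12345 makes the split repeatable in Python, but it should not be assumed to select the same rows as XLMiner's partition, since the two tools use different random number generators; coefficient estimates may therefore differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame with the same column names as df2.
toy = pd.DataFrame({
    "Reward": np.tile([0, 1], 50),
    "IndustryType_Discount": np.tile([1, 0, 0, 0, 0], 20),
    "IndustryType_Grocery": np.tile([0, 1, 0, 0, 0], 20),
})

X_all = toy[["IndustryType_Discount", "IndustryType_Grocery"]]
y_all = toy["Reward"]

# 60% training, 40% validation, with a fixed seed for repeatability.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_all, y_all, train_size=0.6, random_state=12345
)

print(len(X_train), len(X_valid))
```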

In [25]:
exp = setup(data=df2, train_size=0.6, target = 'Reward', session_id=12345, 
            categorical_features=['IndustryType_Discount','IndustryType_Grocery'])

Setup Succesfully Completed!


Unnamed: 0,Description,Value
0,session_id,12345
1,Target Type,Binary
2,Label Encoded,
3,Original Data,"(100, 3)"
4,Missing Values,False
5,Numeric Features,0
6,Categorical Features,2
7,Ordinal Features,False
8,High Cardinality Features,False
9,High Cardinality Method,


In [26]:
lr = create_model('lr')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.5,0.375,0.75,0.6,0.6667,-0.2857,-0.3162
1,0.6667,1.0,1.0,0.6667,0.8,0.0,0.0
2,0.6667,0.5625,0.75,0.75,0.75,0.25,0.25
3,0.6667,0.6667,0.6667,0.6667,0.6667,0.3333,0.3333
4,0.6667,0.6667,1.0,0.6,0.75,0.3333,0.4472
5,0.3333,0.3333,0.6667,0.4,0.5,-0.3333,-0.4472
6,0.6667,0.6667,0.6667,0.6667,0.6667,0.3333,0.3333
7,0.6667,0.6667,1.0,0.6,0.75,0.3333,0.4472
8,0.8333,0.8333,1.0,0.75,0.8571,0.6667,0.7071
9,0.3333,0.3889,0.3333,0.3333,0.3333,-0.3333,-0.3333


In [27]:
evaluate_model(lr)
