# Starbucks Capstone Challenge
Notebook 4 of 4

# Linear Regression
In order to analyze our data and determine the impact of each promotion, we will use several different models for linear regression to understand how viewing each promotion impacted customer total spending. Total_amount_purchased will be our dependent variable in all models. 

- **Model 1:** Our baseline will be the simplest, with the original features of portfolio.json and the different promotions as independent variables to determine the coefficients for each promotion and whether they are statistically significant.
- **Model 2:** Our final model will actually be a collection of models. We will separate the data by customer segment and run each segment separately to understand the impact of each promotion on the individual segment. This one should be the most interesting as it will support future targeting efforts for the marketing department. 

*Special note*: when using categories as columns, promotions in our example, we must avoid overfitting by ensuring that there is data if a customer has a 0 for all categories. Since there were customers who did not view any promotions, we have a base case for promotions automatically. However, if this were not he case then we would need to remove 1 of the promotions and use that as the base case.

## ***Methodology/Assumptions***
Linear regression is a complicated topic which is beyond the scope of this Nanodegree program. Optimization of this model is not the purpose of this project and we will make important assumptions as follows to complete this project:
- We will assume that backward, stepwise, OLS regression is an appropriate model.
- We will assume that relationships are linear and not polynomial.
- We will use R-squared adjusted values from the training data and MAPE values on the test data to evaluate overall model performance. We will give both metrics equal weight, meaning that a model performs better if it has both a higher R-squared adjusted and a lower MAPE. If only 1 metric is better and the other is worse when comparing models, then our evaluation is inconclusive.

In [1]:
#Let's get the data

import pandas as pd
import numpy as np
import math
from IPython.display import display, HTML

import matplotlib.pyplot as plt
import matplotlib
#% matplotlib inline

prefix = 'regression_ready_data'

def import_csvs(prefix, filename):
    df = pd.read_csv(prefix + filename)
    df.index = df['cust_id']
    df = df.drop(columns ='cust_id')    
    return df
    
customers = import_csvs(prefix,'/customers.csv')

print("customers: {} rows and {} columns".format(customers.shape[0],customers.shape[1]))
display(HTML(customers.iloc[0:1].to_html())) 

customers: 14288 rows and 15 columns


Unnamed: 0_level_0,age,income,age of account,customer_segment,prom_0,prom_1,prom_2,prom_3,prom_4,prom_5,prom_6,prom_7,prom_8,prom_9,total_amount_purchased
cust_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
e2127556f4f64592b11af22de27a7932,68,70,2,3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,57.73


# Test and Training Data
Let's go ahead and do the split now.

In [2]:
from sklearn.model_selection import train_test_split

idx = customers.shape[1]-1

X = pd.DataFrame(customers.iloc[:,:idx])
y = pd.Series(customers['total_amount_purchased'])

model_1_features = ['age', 'income', 'age of account', 'prom_0', 'prom_1', 'prom_2', 'prom_3',
                    'prom_4', 'prom_5', 'prom_6', 'prom_7', 'prom_8', 'prom_9']

X_1 = X.loc[:,model_1_features]

X_train, X_test, y_train, y_test = train_test_split(X_1, y, 
                                                    test_size=0.3, 
                                                    random_state=0)

## Model 1 - Baseline
We need to remove extra columns and only keep the original features and promotions.

In [3]:
display(HTML(X_train.iloc[0:1].to_html())) 

Unnamed: 0_level_0,age,income,age of account,prom_0,prom_1,prom_2,prom_3,prom_4,prom_5,prom_6,prom_7,prom_8,prom_9
cust_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
48c1f1b492d3451b804b81877bf957f5,67,77,4,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0


In [4]:
#Uncomment and run this code if necessary on your machine
#!pip install git+https://github.com/statsmodels/statsmodels

In [5]:
import statsmodels.api as sm

def backward_step(X,y, sig):
    #This function conducts backward stepwise linear regression and returns the final data
    vars = X.shape[1]
    for i in range(0, vars):
        regressor_OLS = sm.OLS(y, X).fit()
        maxVar = max(regressor_OLS.pvalues)
        if maxVar > sig:
            for j in range(0, vars - i):
                #print(vars, i)
                try:
                    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                        X = X.drop(X.columns[j], axis=1)
                except:
                    pass
    regressor_OLS.summary()
    return X

sig = 0.05 #standard assumption
X_model = backward_step(X_train,y_train, sig) 

Now we need to compare our model with the test data.

In [6]:
def eval_model(regressor, x_train, x_model, x_test, y_test, verbose = True):    
    new_row = [None]*12
    
    new_row[0] = round(regressor.rsquared_adj,2)
    
    x_test_model = x_test.loc[:,x_model.columns]
    y_pred = regressor.predict(x_test_model) 
    MAPE = np.mean(np.abs((y_test - y_pred)) / y_test) * 100
    new_row[1] = round(MAPE,1)
    
    for i,var in zip(range(len(x_model.columns)),x_model.columns):
        if var[:1]=='p':
            prom = int(var[-1:])
            new_row[prom+2] = round(regressor.params[i-2],1)
    
    if verbose == True: 
        print("\nInsignificant variables in our model are {}.\n".format(list(set(x_train.columns) - set(x_model.columns))))
        print("\nSignificant variables in our model are {}.\n".format(x_model.columns))
        print("Coefficients for our models are \n{}\n".format(regressor.params))
        print("R-squared adjusted: {}\n".format(new_row[0]))
        print("MAPE: {}%\n".format(new_row[1]))
    
    return new_row

Let's also start saving the model outputs for later.

In [7]:
model_columns = ["R2_adj", "MAPE", "Coeff_prom_0","Coeff_prom_1","Coeff_prom_2","Coeff_prom_3","Coeff_prom_4",
                 "Coeff_prom_5","Coeff_prom_6","Coeff_prom_7","Coeff_prom_8","Coeff_prom_9"]
model_results = pd.DataFrame(columns=model_columns)

In [8]:
regressor_OLS = sm.OLS(endog = y_train, exog = X_model).fit()
model_results.loc[len(model_results)] = eval_model(regressor_OLS, X_train, X_model, X_test, y_test, verbose = True)
display(HTML(model_results.to_html())) 


Insignificant variables in our model are ['prom_7'].


Significant variables in our model are Index(['age', 'income', 'age of account', 'prom_0', 'prom_1', 'prom_2',
       'prom_3', 'prom_4', 'prom_5', 'prom_6', 'prom_8', 'prom_9'],
      dtype='object').

Coefficients for our models are 
age               -0.456435
income             1.449694
age of account     1.017666
prom_0            10.837081
prom_1             8.460135
prom_2             5.750428
prom_3            15.123807
prom_4            26.350214
prom_5            10.179714
prom_6            16.257850
prom_8             8.478514
prom_9            20.283172
dtype: float64

R-squared adjusted: 0.54

MAPE: 217.1%



Unnamed: 0,R2_adj,MAPE,Coeff_prom_0,Coeff_prom_1,Coeff_prom_2,Coeff_prom_3,Coeff_prom_4,Coeff_prom_5,Coeff_prom_6,Coeff_prom_7,Coeff_prom_8,Coeff_prom_9
0,0.54,217.1,1.4,1.0,10.8,8.5,5.8,15.1,26.4,,10.2,16.3


In our base model, backward stepwise regression only eliminated promotion 7 from our model due to an insignificant p-value. All remaining promotions have positive coefficients, implying that they are increasing overall spending. It also appears that those with higher incomes, who are younger, and have had an account for longer are spending more overall.

Promotions with the largest coefficients were promotion 4, and 9. Promotion 2, 1 and 8 had the smallest coefficients.

## Model 2 - individual models per customer segment
Step 1: split our datasets per segment.

Step 2: split each segment into training and test sets.

Step 3: fit model for each segment.

Step 4: evaluate model on test set.

Step 5: compare models with baseline.


In [9]:
#Step 1: create list of our datasets per segment.
cust_by_segment = []
segments = list(range(12))
#cust_with_segments = customers.drop(columns = ['age','age of account','income'])

for seg in segments:
    #cust_by_segment.append(cust_with_segments[cust_with_segments['customer_segment']==seg].drop(columns = 'customer_segment'))
    cust_by_segment.append(customers[customers['customer_segment']==seg].drop(columns = 'customer_segment'))

In [10]:
#Step 2: split each segment into training and test sets. Outputs list of dictionaries of training and test sets
segments_training_test_data = []
idx = cust_by_segment[0].shape[1]-1

for seg in cust_by_segment:
    X = pd.DataFrame(seg.iloc[:,:idx])
    y = pd.Series(seg['total_amount_purchased'])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    segments_training_test_data.append({'X_train':X_train, 'X_test': X_test, 'y_train':y_train,'y_test':y_test})

In [11]:
#Step 3: fit model for each segment

segment_models = []
sig = .05

for data in segments_training_test_data:
    X_model = backward_step(data['X_train'],data['y_train'], sig) 
    segment_models.append(X_model)

In [12]:
#Step 4: evaluate model on test
for i in range(len(segment_models)):
    X_train, X_test, y_train, y_test = segments_training_test_data[i]['X_train'], segments_training_test_data[i]['X_test'], segments_training_test_data[i]['y_train'], segments_training_test_data[i]['y_test']
    X_model = segment_models[i]
    regressor_OLS = sm.OLS(endog = y_train, exog = X_model).fit()
    model_results.loc[len(model_results)] = eval_model(regressor_OLS, X_train, X_model, X_test, y_test, verbose = False)

In [13]:
model_names = ["baseline", "seg_0","seg_1","seg_2","seg_3","seg_4","seg_5","seg_6","seg_7","seg_8","seg_9","seg_10","seg_11"]
model_results.index = model_names
display(HTML(model_results.to_html())) 

Unnamed: 0,R2_adj,MAPE,Coeff_prom_0,Coeff_prom_1,Coeff_prom_2,Coeff_prom_3,Coeff_prom_4,Coeff_prom_5,Coeff_prom_6,Coeff_prom_7,Coeff_prom_8,Coeff_prom_9
baseline,0.54,217.1,1.4,1.0,10.8,8.5,5.8,15.1,26.4,,10.2,16.3
seg_0,0.42,239.5,0.5,,3.2,20.3,24.0,37.0,32.7,,,13.1
seg_1,0.59,76.9,42.6,4.1,,54.3,,45.1,40.0,32.5,60.0,36.2
seg_2,0.67,74.4,2.8,,,-2.8,,24.9,14.2,16.7,20.7,13.4
seg_3,0.45,270.3,,,,,0.7,,5.0,,,
seg_4,0.48,211.8,0.8,3.7,13.8,9.6,23.3,,20.1,,23.0,25.1
seg_5,0.62,67.1,1.6,-1.6,20.5,19.3,31.9,19.5,31.6,24.8,30.7,9.2
seg_6,0.7,55.7,,,,,,3.7,-2.8,,36.1,
seg_7,0.68,48.6,1.9,-2.1,,17.6,32.8,25.6,32.4,,21.5,46.5
seg_8,0.53,89.8,,,,,,0.9,,,3.9,


A quick glance shows that many of our models for individual customer segments are much more interesting than the baseline model. The majority of the models have higher R-squared adjusted values and lower MAPE. What is even more interesting is the amount of NaN coefficients, the magnitude of the numerical coefficients, and the existence of several negative coefficients.  

### R-squared Adjusted and MAPE
Unfortunately, we cannot directly compare the R-squared adjusted and MAPE values between the models since the models are not based on the same underlying data (generally speaking it is of course the same data, but due slicing of the data for the segment models, it is not comparable). Again, the finer points of linear regression are beyond the scope of this project and course and we will not dive further into detail here. We will simply make a generalization as follows: if the R-squared adjusted and MAPE are *significantly* larger or smaller than the baseline model, then we should at least be aware of it. We'll define signficant as a 20% difference - so an R-squared 20% different than .54 would be less than .43 or greater than .65 and a 20% difference on the MAPE would be less than 173% or greater than 260%. Only 1 of our models has a MAPE worse than 260% but it's R-squared is still .45. Only 1 model has an R-squared less than .43 but it's MAPE is still 239%. Therefore, for our analysis we will state that none of our individual models were considerably less accurate than the baseline. This means we can comfortably analyze the individual coefficients and base our conclusions on the coefficients of the promotions in the individual models.

### NaN coefficients
A NaN coefficient in our matrix implies than the promotion was considered to be insigificant. Or, more formally, we could not reject the null hypothesis that the given promotion had an impact on the y-variable (total amount purchased). For the Starbucks marketing department, this would imply that their promotion was a waste of time, effort, and money and should not be continued. If we focus only on the baseline, we see that promotion 7 has the only NaN, which would suggest they should cancel promotion 7 and keep the others (the others all had positive coefficients after all). 

However, looking at the individual models provides much more interesting coefficients. We very quickly find that many segments are indifferent to many promotions. Reading the matrix column-wise reveals that every promotion is ineffective on at least 1 customer segment. We also see that promotion 7 which was ineffective when measurd on the entire population is now effective for 3 customer segments (with fairly large coefficients). Only 1 segment has no Nans - segment 5 - which means they are responsive to all promotions. Many segments have many NaNs, with the most NaNs (row-wise) appearing in segment 8, for which only 2 promotions have impacts. 

### Magnitude of coefficients
The larger the absolute value of the coefficient, the larger the impact of the promotion is on the total amount purchased by the consumer. Focusing on the larger coefficients is the best strategy for increasing total amount purchased by consumers. 

The closer the coefficient is to 0, the less the overall impact. For those with a smaller magnitude, the marketing department may wish to evaluate the overall cost of the promotion (time, effort, and money) and may determine that those with the smallest coefficients are not worth the effort.

Another interesting observation is the difference in magnitude of coefficients for the same promotion across different models. For example, promotion 0 has a very small coefficient - 1.4 - in our baseline model. However, in customer segment 1 the coefficient for the same promotion is very high at 42.6. While our baseline model may suggest that the promotion is barely worth the time and effort, our segmentation has indicated that this promotion is very valuable on a customer segment 1.

### Negative coefficients
These are especially interesting for the marketing department because they imply that the promotion actually *decreased* the total amount purchased by the consumer. This indicates that the promotion should immediately be canceled for a given customer segment! We actually find 5 negative coefficients in our data - fortunately their magnitudes are small - but they still should be immediately reconsidered.

# Final Recommendation
The final recommendation will be based on the most impactful findings of our model. While a deeper analysis and fine tuning of our model could produce more insights, the main findings of our current model can already provide a significant improvement in the marketing strategy. Since we do not know anything about the cost of the promotions, we will need to make the following assumption - that a coefficient of 20 is 'most impactful'.

### DO
Customer targeting for the largest coefficients ('most impactful') will yield higher total amounts purchased by consumers. For our models, the promotions that should be used for each segment are as follows:
- Segment 0: 3, 4, 5, 6
- Segment 1: 0, 3, 5, 6, 7, 8, 9
- Segment 2: 5, 8
- Segment 3: *None*
- Segment 4: 4, 6, 8, 9
- Segment 5: 2, 4, 6, 7, 8
- Segment 6: 8
- Segment 7: 4, 5, 6, 8, 9
- Segment 8: *None*
- Segment 9: *None*
- Segment 10: 8
- Segment 11: 9

### DO NOT
Avoid at all costs giving a promotion to a customer segment which had a negative coefficient. In our models, the promotions to avoid for specific segments were:
- Segment 2: promotion 3
- Segment 5: promotion 1
- Segment 6: promotion 6
- Segment 7: promotion 1
- Segment 10: promotion 5

### Reconsider
Promotion 1 was not 'most impactful' for any segment and had very small coefficients for those for which it was significant. Internally, the marketing department should re-evaluate this promotion and consider canceling it entirely.

In [14]:
#Show again the matrix
display(HTML(model_results.to_html())) 

Unnamed: 0,R2_adj,MAPE,Coeff_prom_0,Coeff_prom_1,Coeff_prom_2,Coeff_prom_3,Coeff_prom_4,Coeff_prom_5,Coeff_prom_6,Coeff_prom_7,Coeff_prom_8,Coeff_prom_9
baseline,0.54,217.1,1.4,1.0,10.8,8.5,5.8,15.1,26.4,,10.2,16.3
seg_0,0.42,239.5,0.5,,3.2,20.3,24.0,37.0,32.7,,,13.1
seg_1,0.59,76.9,42.6,4.1,,54.3,,45.1,40.0,32.5,60.0,36.2
seg_2,0.67,74.4,2.8,,,-2.8,,24.9,14.2,16.7,20.7,13.4
seg_3,0.45,270.3,,,,,0.7,,5.0,,,
seg_4,0.48,211.8,0.8,3.7,13.8,9.6,23.3,,20.1,,23.0,25.1
seg_5,0.62,67.1,1.6,-1.6,20.5,19.3,31.9,19.5,31.6,24.8,30.7,9.2
seg_6,0.7,55.7,,,,,,3.7,-2.8,,36.1,
seg_7,0.68,48.6,1.9,-2.1,,17.6,32.8,25.6,32.4,,21.5,46.5
seg_8,0.53,89.8,,,,,,0.9,,,3.9,
