# Part 1 - Exploratory Data Analysis
In this section, we will:
- Import necessary packages for executing the code
- Load the data
- Convert qualitative predictor variables to the *category* data type
- Conduct EDA on the data set using various visualizations and pivot tables

In [None]:
# Import 'numpy' and 'pandas' for working with numbers and data frames
import numpy as np
import pandas as pd

# Import 'pyplot' from 'matplotlib' and 'seaborn' for visualizations
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
# Load the data and take a look at it
df = pd.read_csv('health_nuts_data.csv')
df.head()

Note that the target variable here is *Spending* whereas the other variables are treated as predictors.

In [None]:
# Look at the specifics of the data frame
df.info()

Note that *Salary*, *Age* and *Spending* are numeric variables whereas the others are of the data type *object*.

In [None]:
# Convert qualitative predictors to the 'category' data type
categorical_columns = ['State', 'Gender', 'Marital Status', 'Repeat', 'Type', 'Coupon']
df[categorical_columns] = df[categorical_columns].astype('category')

In [None]:
# Look at the specifics of the data frame
df.info()

Note that the qualitative variables are now of the *category* data type.
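To make the effect of the conversion concrete, here is a minimal sketch on a made-up data frame. The column names mirror the notebook, but the values and the `toy` frame are assumptions used only for illustration. It converts with a dictionary passed to `astype`, which is one of several equivalent ways to do this.

```python
import pandas as pd

# Hypothetical toy data: values are invented, only the column names follow the notebook
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Coupon': ['None', 'Email', 'USPS', 'None'],
    'Spending': [120.0, 85.5, 110.0, 90.0],
})

# Convert the qualitative columns to the 'category' data type
cat_cols = ['Gender', 'Coupon']
toy = toy.astype({col: 'category' for col in cat_cols})

# A categorical column stores a list of levels plus an integer code per row
gender_levels = list(toy['Gender'].cat.categories)
gender_codes = list(toy['Gender'].cat.codes)

print(toy.dtypes)
print(gender_levels, gender_codes)
```

By default the levels are sorted alphabetically, which is why *Female* ends up as the first (reference) level in the regression models later on.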

## EDA: Visualizations

In [None]:
# Create count plots for the categorical variables
plt.figure(figsize = (14, 6))

for fignum, featurename in enumerate(categorical_columns, start = 1):
    plt.subplot(2, 3, fignum)
    sns.countplot(data = df, x = featurename)

plt.tight_layout();

Count plots show how often each level of a categorical variable occurs in the data set.
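The numbers behind a count plot can also be read off directly. The sketch below uses `value_counts` on a made-up column; the `toy` frame and its values are assumptions, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy column; the real levels of 'Coupon' may differ
toy = pd.DataFrame({'Coupon': ['None', 'Email', 'None', 'USPS', 'None', 'Email']})

# Absolute counts per level (these are the bar heights of a count plot)
counts = toy['Coupon'].value_counts()

# Relative frequencies per level
shares = toy['Coupon'].value_counts(normalize = True)

print(counts)
print(shares.round(2))
```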

In [None]:
# Create bar plots for the categorical variables with mean 'Spending' on the Y-axis
# Note: 'errorbar = None' (seaborn 0.12 and later) replaces the deprecated 'ci = None'
plt.figure(figsize = (14, 6))

for fignum, featurename in enumerate(categorical_columns, start = 1):
    plt.subplot(2, 3, fignum)
    sns.barplot(data = df, x = featurename, y = 'Spending', errorbar = None)

plt.tight_layout();

The bar plots here help us compare the mean value of *Spending* across the levels of each categorical variable.

In [None]:
# Create a pair plot for the numerical features in the data set
sns.pairplot(df);

In [None]:
# Create scatter plots of 'Spending' versus 'Salary', one colored by 'Gender' and the other by 'Repeat'
plt.figure(figsize = (14, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(data = df, x = 'Salary', y = 'Spending', hue = 'Gender')
plt.subplot(1, 2, 2)
sns.scatterplot(data = df, x = 'Salary', y = 'Spending', hue = 'Repeat');

It is interesting to note that the people with higher salaries and lower spending are men, whereas the people with lower salaries and higher spending are women.

## EDA: Pivot Tables

In [None]:
# Create a pivot table of count of 'Spending' with respect to 'Gender' and 'Repeat'
pd.pivot_table(data = df, values = 'Spending', index = 'Gender', columns = 'Repeat', aggfunc = 'count', margins = True)

The entries in this pivot table are the counts (frequencies) of each combination of levels of the categorical variables *Gender* and *Repeat*.
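A count pivot table is a cross-tabulation, so `pd.crosstab` should give the same numbers without needing a `values` column. The sketch below checks this on made-up data; the `toy` frame and the *Yes*/*No* levels of *Repeat* are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the level names are invented for illustration
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
    'Repeat': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Spending': [130.0, 115.0, 90.0, 95.0, 80.0, 125.0],
})

# Count of 'Spending' entries for each Gender/Repeat combination
by_pivot = pd.pivot_table(data = toy, values = 'Spending', index = 'Gender',
                          columns = 'Repeat', aggfunc = 'count', margins = True)

# The same table built as a cross-tabulation of the two categorical columns
by_crosstab = pd.crosstab(toy['Gender'], toy['Repeat'], margins = True)

print(by_pivot)
print(by_crosstab)
```

The two tables agree only when *Spending* has no missing values, because `aggfunc = 'count'` skips missing entries whereas `crosstab` counts rows.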

In [None]:
# Create a pivot table of mean 'Spending' with respect to 'Gender' and 'Coupon'
pd.pivot_table(data = df, values = 'Spending', index = 'Gender', columns = 'Coupon', aggfunc = 'mean', margins = True)

The entries in this pivot table are mean *Spending* values for the different divisions in the table.

In [None]:
# Create a bar plot corresponding to the pivot table above
plt.figure(figsize = (8, 4))
# Note: 'errorbar = None' (seaborn 0.12 and later) replaces the deprecated 'ci = None'
sns.barplot(data = df, x = 'Gender', y = 'Spending', hue = 'Coupon', errorbar = None);

# Part 2 - Linear Regression Models
In this section, we will:
- Import necessary packages for executing the code
- Train and evaluate linear regression models for the data

In [None]:
# Import method for regression from 'statsmodels'
import statsmodels.formula.api as smf

## Model 1

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Gender'
lr_model_1 = ##### CODE HERE #####
lr_model_1 = ##### CODE HERE #####
print(##### CODE HERE #####)

According to this model:
- Men spend 37 dollars less than women on average
- The p-value for *Gender[T.Male]* here is less than 0.05, so *Gender* is statistically significant in explaining the variation in *Spending*
- About 50% of the variation in *Spending* is explained by *Gender*

Note that the base or reference category for this model is *Female*, so the average spending of women is recorded as the intercept of the model.
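To see why the intercept equals the reference group's mean, here is a minimal sketch on made-up data. The `toy` frame and its values are assumptions; only the formula syntax follows `statsmodels.formula.api`. With a single binary predictor, the intercept is the mean of the reference level and the coefficient is the difference between the two group means.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: three women and three men with invented spending values
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Spending': [140.0, 150.0, 160.0, 100.0, 110.0, 120.0],
})
toy['Gender'] = toy['Gender'].astype('category')

# Regress 'Spending' on 'Gender'; 'Female' is the first level, so it is the reference
model = smf.ols('Spending ~ Gender', data = toy).fit()

# Group means for comparison with the fitted coefficients
group_means = toy.groupby('Gender', observed = True)['Spending'].mean()

print(model.params)
print(group_means)
```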

We can confirm the average spending values of men and women using the following pivot table.

In [None]:
# Create a pivot table of mean 'Spending' with respect to 'Gender'
table = pd.pivot_table(data = df, values = 'Spending', index = 'Gender', aggfunc = 'mean', margins = True)
table

In [None]:
# Compute the difference between the average spending of women and men
table.loc['Female', 'Spending'] - table.loc['Male', 'Spending']

Women spend about 37.07 dollars more than men on average, which matches the magnitude of the *Gender[T.Male]* coefficient in *lr_model_1*.

## Model 2

In [None]:
# Create a data frame with dummy variables for 'Gender' and take a look at it
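# Note: 'dtype = int' gives 0/1 columns; recent pandas versions return True/False by default
df_dummy = pd.get_dummies(df['Gender'], dtype = int)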
df_dummy['Spending'] = df['Spending']
df_dummy.head()

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Male'
lr_model_2 = ##### CODE HERE #####
lr_model_2 = ##### CODE HERE #####
print(##### CODE HERE #####)

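The coefficient estimates of *lr_model_2* are exactly the same as those of *lr_model_1*, because the *Male* dummy column is the same indicator that the formula interface creates internally for *Gender[T.Male]*.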
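The equivalence between a categorical predictor and a hand-made dummy column can be checked on made-up data. In this sketch the `toy` frame is an assumption, and the *Male* column is built with a comparison rather than `get_dummies`, which yields the same 0/1 indicator.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with invented values
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Spending': [140.0, 150.0, 160.0, 100.0, 110.0, 120.0],
})
toy['Gender'] = toy['Gender'].astype('category')

# Hand-made 0/1 indicator for men (the same values as the 'Male' dummy column)
toy['Male'] = (toy['Gender'] == 'Male').astype(int)

# Model using the categorical column versus model using the dummy column
by_formula = smf.ols('Spending ~ Gender', data = toy).fit()
by_dummy = smf.ols('Spending ~ Male', data = toy).fit()

print(by_formula.params)
print(by_dummy.params)
```

Only the label of the coefficient differs (*Gender[T.Male]* versus *Male*); the estimates and the R-squared value coincide.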

## Model 3

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Gender' with 'Male' as the reference category
df['Gender'] = ##### CODE HERE #####
lr_model_3 = ##### CODE HERE #####
lr_model_3 = ##### CODE HERE #####
print(##### CODE HERE #####)

The models *lr_model_1* (or *lr_model_2*) and *lr_model_3* are essentially the same. The only difference lies in which category of the *Gender* variable is set as the reference category. The magnitude of the *Gender* coefficient stays the same while its sign flips, and the intercept now records the average spending of men instead of women.
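The sign flip can be reproduced on made-up data. The sketch below is one possible way to change the reference category, using `cat.reorder_categories` so that *Male* becomes the first level; the `toy` frame and its values are assumptions, and other approaches (such as specifying the reference inside the formula) exist as well.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with invented values
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Spending': [140.0, 150.0, 160.0, 100.0, 110.0, 120.0],
})
toy['Gender'] = toy['Gender'].astype('category')

# Default ordering: 'Female' is the first level and therefore the reference
base_female = smf.ols('Spending ~ Gender', data = toy).fit()

# Put 'Male' first so that it becomes the reference category
toy_male_base = toy.assign(
    Gender = toy['Gender'].cat.reorder_categories(['Male', 'Female'])
)
base_male = smf.ols('Spending ~ Gender', data = toy_male_base).fit()

print(base_female.params)
print(base_male.params)
```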

## Model 4

In [None]:
# Create a data frame with dummy variables for 'Gender' and take a look at it
# Note: The base category for 'Gender' in the original data frame is currently 'Male'
df_dummy = pd.get_dummies(df['Gender'], dtype = int)  # 0/1 columns instead of True/False
df_dummy['Spending'] = df['Spending']
df_dummy.head()

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Male'
lr_model_4 = ##### CODE HERE #####
lr_model_4 = ##### CODE HERE #####
print(##### CODE HERE #####)

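*lr_model_4* is exactly the same as *lr_model_1* and *lr_model_2*: the *Male* dummy column does not depend on which category is listed first, so changing the reference category of *Gender* has no effect here.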

Let's switch back to the original category ordering for the *Gender* feature.

In [None]:
# Set the base category of 'Gender' to 'Female'
df['Gender'] = ##### CODE HERE #####

## Model 5

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Coupon' with 'None' as the reference category
df['Coupon'] = ##### CODE HERE #####
lr_model_5 = ##### CODE HERE #####
lr_model_5 = ##### CODE HERE #####
print(##### CODE HERE #####)

According to this model:
- People who did not receive or use a coupon spent about 124.53 dollars on average
- People who used a coupon that arrived by the USPS spent on average about 9.35 dollars less than those who did not receive or use a coupon
- People who used a coupon that arrived by email spent on average about 4.4 dollars more than those who did not receive or use a coupon
- About 4% of the variation in *Spending* is explained by *Coupon*
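With more than two levels, each coefficient is the difference between that level's mean and the reference level's mean. Here is a minimal sketch on made-up data; the `toy` frame, the spending values and the level names *USPS* and *Email* are assumptions chosen to echo the interpretation above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: two customers per coupon type, invented spending values
toy = pd.DataFrame({
    'Coupon': ['None', 'None', 'USPS', 'USPS', 'Email', 'Email'],
    'Spending': [120.0, 130.0, 110.0, 120.0, 125.0, 135.0],
})

# List 'None' first so that it becomes the reference category
toy['Coupon'] = (toy['Coupon'].astype('category')
                 .cat.reorder_categories(['None', 'USPS', 'Email']))

model = smf.ols('Spending ~ Coupon', data = toy).fit()
group_means = toy.groupby('Coupon', observed = True)['Spending'].mean()

print(model.params)
print(group_means)
```

In this toy example the group means are 125, 115 and 130, so the intercept is 125 and the two coefficients are -10 and +5.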

We can confirm the average spending values broken up by *Coupon* using the following pivot table.

In [None]:
# Create a pivot table of mean 'Spending' with respect to 'Coupon'
pd.pivot_table(data = df, values = 'Spending', index = 'Coupon', aggfunc = 'mean', margins = True)

## Model 6

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Salary'
lr_model_6 = ##### CODE HERE #####
lr_model_6 = ##### CODE HERE #####
print(##### CODE HERE #####)

According to this model:
- People with higher salaries generally spend less
- *Salary* is statistically significant
- About 23% of the variation in *Spending* is explained by *Salary*

## Model 7

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Salary' and 'Gender'
lr_model_7 = ##### CODE HERE #####
lr_model_7 = ##### CODE HERE #####
print(##### CODE HERE #####)

According to this model:
- Women in general spend more than men regardless of salary
- *Gender* is statistically significant whereas *Salary* is not
- About 50% of the variation in *Spending* is explained by *Salary* and *Gender*

In [None]:
# Create scatter plots of 'Spending' versus 'Salary', one without any categorical division and the other colored by 'Gender'
plt.figure(figsize = (14, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(data = df, x = 'Salary', y = 'Spending')
plt.subplot(1, 2, 2)
sns.scatterplot(data = df, x = 'Salary', y = 'Spending', hue = 'Gender');

If *Gender* is ignored (left plot), *Salary* and *Spending* appear negatively correlated. Within each gender (right plot), however, there is no clear relationship between *Salary* and *Spending*, which is consistent with *Salary* losing its statistical significance in *lr_model_7*.
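This pattern can be reproduced with a small constructed example. In the sketch below the `toy` data are invented so that women have lower salaries and higher spending than men, while salary has no effect within either group. It is an illustration under those assumptions, not a reconstruction of the notebook's data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: within each gender, spending does not trend with salary
toy = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Salary': [30.0, 40.0, 50.0, 70.0, 80.0, 90.0],
    'Spending': [150.0, 140.0, 150.0, 110.0, 100.0, 110.0],
})
toy['Gender'] = toy['Gender'].astype('category')

# Ignoring gender, salary appears to reduce spending
salary_only = smf.ols('Spending ~ Salary', data = toy).fit()

# Accounting for gender, the salary coefficient drops to (numerically) zero
with_gender = smf.ols('Spending ~ Salary + Gender', data = toy).fit()

print(salary_only.params)
print(with_gender.params)
```

The negative slope in the first model is entirely due to the difference between the two groups, which the second model absorbs into the *Gender* coefficient.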

## Model 8

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using 'Age' and 'Gender'
lr_model_8 = ##### CODE HERE #####
lr_model_8 = ##### CODE HERE #####
print(##### CODE HERE #####)

According to this model:
- Older people tend to spend less
- Both *Age* and *Gender* are statistically significant
- About 58% of the variation in *Spending* is explained by *Age* and *Gender*

## Model 9

In [None]:
# Create and train a linear regression model for the data and view its summary
# Note: The objective is to predict 'Spending' using all the predictor variables in the data set except 'State'
lr_model_9 = ##### CODE HERE #####
lr_model_9 = ##### CODE HERE #####
print(##### CODE HERE #####)
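One practical detail when several predictors go into a formula: a column name containing a space, such as *Marital Status*, cannot be written directly in the formula string. A common approach is to wrap it in `Q()`. The sketch below shows this on made-up data; the `toy` frame, its values and the levels *Single* and *Married* are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with invented values and level names
toy = pd.DataFrame({
    'Marital Status': ['Single', 'Married', 'Single', 'Married',
                       'Single', 'Married', 'Single', 'Married'],
    'Age': [25, 40, 31, 52, 28, 45, 36, 60],
    'Spending': [150.0, 120.0, 142.0, 101.0, 149.0, 117.0, 133.0, 95.0],
})
toy['Marital Status'] = toy['Marital Status'].astype('category')

# Q('...') quotes a column name that is not a valid Python identifier
model = smf.ols("Spending ~ Age + Q('Marital Status')", data = toy).fit()

param_names = list(model.params.index)
print(model.params)
```

Renaming the column (for example to `Marital_Status`) before fitting is an alternative that keeps the formula and the coefficient labels shorter.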

# Part 3 - Diagnostic Plots
In this section, we will:
- Import necessary packages for executing the code
- Create and analyze diagnostic plots for *lr_model_9*

In [None]:
# Import methods for regression diagnostic plots from 'statsmodels'
from statsmodels.api import ProbPlot

In [None]:
# Create a scatter plot between the fitted and actual values of 'Spending'
plt.figure(figsize = (5, 5))
sns.scatterplot(x = lr_model_9.fittedvalues, y = df['Spending'])
plt.axline((100,100), slope = 1, linestyle = '--', linewidth = 1, color = 'r')
plt.xlabel('Fitted Values of Spending')
plt.ylabel('Actual Values of Spending');

In [None]:
# Create a scatter plot between the fitted values of 'Spending' and the residuals
plt.figure(figsize = (8, 4))
sns.scatterplot(x = lr_model_9.fittedvalues, y = lr_model_9.resid)
plt.axhline(y = 0, xmin = 0, xmax = 1, linewidth = 1, color = 'k')
plt.xlabel('Fitted Values of Spending')
plt.ylabel('Residuals');
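When reading the residual plot, it helps to know what least squares guarantees by construction. For a model with an intercept, the residuals average to zero and are uncorrelated with the fitted values, so any curvature or funnel shape in the plot points to a problem with the model rather than with the fitting procedure. The sketch below checks these two properties on made-up data; the `toy` frame is an assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with invented values
toy = pd.DataFrame({
    'Salary': [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    'Spending': [150.0, 138.0, 135.0, 120.0, 118.0, 104.0],
})

model = smf.ols('Spending ~ Salary', data = toy).fit()

# Residuals are the actual values minus the fitted values
resid_mean = model.resid.mean()
resid_fitted_corr = np.corrcoef(model.fittedvalues, model.resid)[0, 1]

print(resid_mean)
print(resid_fitted_corr)
```

Both printed values are zero up to floating-point error.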

In [None]:
# Create a histogram of the residuals
plt.figure(figsize = (8, 4))
sns.histplot(x = lr_model_9.resid, color = 'lightgray')
plt.xlabel('Residual Value')
plt.ylabel('Frequency');

In [None]:
# Create a Q-Q plot of the standardized residuals against the normal distribution
QQ = ProbPlot(lr_model_9.get_influence().resid_studentized_internal)
fig = QQ.qqplot(line = '45', alpha = 0.5, lw = 1)
fig.set_size_inches(5, 5)
fig.gca().set_title('Normal Q-Q')
fig.gca().set_xlabel('Theoretical Quantiles')
fig.gca().set_ylabel('Standardized Residuals');