# Customer spending: Regression Model Evaluation with cross-validation
- Use “Spending” data.

In [1]:
import pandas as pd
from sklearn import model_selection
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

### 1. Divide the data into 75% training and 25% test set (use random_state=10) and run the following OLS models on training data. Report regression coefficients of the two models.
- Dependent variable = 'cur_year_spend'
- M1: Independent variables = all other variables
- M2: Independent variables = all other variables except 'prev_year_spend'

In [2]:
# Read data

df = pd.read_csv('Spending.csv')
df.head()

|   | cur_year_spend | prev_year_spend | days_since_last_purchase | days_since_first_purchase | total_transactions | age | income | engagement_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 5536.46 | 1681.26 | 7 | 61 | 34 | 61 | 97914.93 | -0.652392 |
| 1 | 871.41 | 1366.74 | 12 | 34 | 33 | 68 | 30904.69 | 0.007327 |
| 2 | 2046.74 | 1419.38 | 10 | 81 | 22 | 54 | 48194.59 | 0.221666 |
| 3 | 4662.7 | 1561.21 | 12 | 32 | 34 | 49 | 93551.98 | 1.149641 |
| 4 | 3539.46 | 1397.6 | 17 | 72 | 34 | 66 | 66267.57 | 0.835834 |


- `pd.read_csv`: Reads data from a CSV file (here `Spending.csv`) and returns a DataFrame.

In [3]:
# Define x, y for M1

y=df.cur_year_spend
x1=df[['prev_year_spend', 'days_since_last_purchase', 'days_since_first_purchase',
       'total_transactions', 'age', 'income', 'engagement_score']]

x1 = df[df.columns[1:]]  # Equivalent alternative: all rows, second through last columns (by column label)
x1 = df.iloc[:, 1:]  # Equivalent alternative: all rows, second through last columns (by integer position)

- `y = df.cur_year_spend`: Assigns the 'cur_year_spend' column of df to y (the dependent variable).
- `x1 = df[[...]]`: Creates the DataFrame x1 by selecting the listed columns by name.
- `x1 = df[df.columns[1:]]`: Selects all rows and every column from the second (index 1) through the last. This gives the same x1 because 'cur_year_spend' is the first column.
- `x1 = df.iloc[:, 1:]`: Does the same with integer-position indexing: `:` selects all rows and `1:` selects the second through last columns.
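The three selection styles can be checked against each other on a small frame. This is a minimal sketch with made-up values; `toy` and its contents are hypothetical, and only the column names are borrowed from the data above.

```python
import pandas as pd

# Hypothetical toy frame: the target is the first column, as in df
toy = pd.DataFrame({
    "cur_year_spend": [100.0, 200.0, 300.0],
    "prev_year_spend": [90.0, 210.0, 280.0],
    "age": [30, 40, 50],
})

by_name = toy[["prev_year_spend", "age"]]    # select by explicit column names
by_columns = toy[toy.columns[1:]]            # select by slicing the column labels
by_position = toy.iloc[:, 1:]                # select by integer position
by_drop = toy.drop(columns="cur_year_spend") # drop the target by name

# All four approaches should give the same predictors
print(by_name.equals(by_columns), by_name.equals(by_position), by_name.equals(by_drop))
```

The positional versions rely on the target being the first column. Dropping the target by name does not depend on column order, which makes it a safer choice if the file layout might change.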

In [4]:
# Divide data into training and test sets. Fit regression model 1 and obtain coefficients.

xtrain1, xtest1, ytrain, ytest = train_test_split(x1, y, random_state=10)  # default: 75%, 25% split

m1 = LinearRegression().fit(xtrain1, ytrain)
m1.coef_

array([ 0.8563393 , -0.19767076, 15.55976917, 48.90062143, -0.18158702,
        0.05973838,  1.41221236])

- `xtrain1, xtest1, ytrain, ytest = train_test_split(x1, y, random_state=10)`: Splits the data into training and test sets. Because `test_size` is not specified, the default of 25% test data (75% training) applies.
    - `random_state=10` fixes the seed so the split is identical on every run with the same dataset. Any integer works, but a different value gives a different split.
- `m1 = LinearRegression().fit(xtrain1, ytrain)`: Creates a linear regression model (m1) and fits it to the training data (xtrain1, ytrain).
- `m1.coef_`: Returns the fitted coefficients, in the same order as the columns of xtrain1. The intercept is stored separately in `m1.intercept_`.
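A bare coefficient array is hard to read, so it can help to pair each coefficient with its column name. The sketch below uses synthetic, noiseless data (the columns `a` and `b` and the true coefficients are assumptions for illustration), so the fitted values should recover 3, -2, and the intercept 5 up to floating-point error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors and a noiseless target (illustrative only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["b"] + 5.0

# Explicit test_size=0.25 matches the default used in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)

model = LinearRegression().fit(X_train, y_train)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(model.coef_, index=X_train.columns)
print(coef_table)
print("intercept:", model.intercept_)
```

Applied to the notebook's objects, the same pattern would be `pd.Series(m1.coef_, index=xtrain1.columns)`.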

In [5]:
# Define x for M2

xtrain2 = xtrain1.drop('prev_year_spend', axis=1)
xtest2 = xtest1.drop('prev_year_spend', axis=1)

- `xtrain2 = xtrain1.drop('prev_year_spend', axis=1)`: Creates a new DataFrame xtrain2 by dropping the 'prev_year_spend' column from the xtrain1 DataFrame. 
    - `axis=1` indicates that you're removing a column.
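As a quick check of how `drop` behaves, the sketch below uses a hypothetical three-column frame. It shows that `drop` returns a new DataFrame and leaves the original untouched, and that `columns=` is an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical frame with made-up values
frame = pd.DataFrame({
    "prev_year_spend": [1.0, 2.0],
    "age": [30, 40],
    "income": [5.0, 6.0],
})

reduced = frame.drop("prev_year_spend", axis=1)       # style used in the cell above
same_result = frame.drop(columns="prev_year_spend")   # equivalent keyword form

print(list(reduced.columns))   # remaining columns
print(list(frame.columns))     # the original frame still has all columns
```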

In [6]:
# Fit model 2 and obtain coefficients.

m2 = LinearRegression().fit(xtrain2, ytrain)
m2.coef_

array([  2.57941059,  15.65593588,  50.36041994,  -2.39007632,
         0.05957811, -14.05342288])

### 2. Compare MSE, MAE, and R2 of the two models on test data. Summarize your findings

In [7]:
# Predict y

pred1 = m1.predict(xtest1)
pred2 = m2.predict(xtest2)

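- `m1.predict(xtest1)`: Uses the fitted model m1 to predict y for the test set xtest1. `m2.predict(xtest2)` does the same for M2, which must be given the test set without 'prev_year_spend'.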

In [8]:
# Obtain MSE, MAE, R2 for M1

print('MSE1', mean_squared_error(ytest, pred1))
print('MAE1', mean_absolute_error(ytest, pred1))
print('R2_1', r2_score(ytest, pred1))

MSE1 2856.6489804141784
MAE1 43.05753616199802
R2_1 0.9979974884016146


- `mean_squared_error()`: calculates the Mean Squared Error.
- `mean_absolute_error()`: calculates the Mean Absolute Error.
- `r2_score()`: calculates the R-squared (R^2) score.
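To make the three metrics concrete, the sketch below computes them by hand with NumPy on four made-up values and compares the results with the scikit-learn functions. The arrays `y_true` and `y_pred` are illustrative, not taken from the Spending data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
errors = y_true - y_pred                       # [1, 0, -1, -2]

# MSE: mean of squared errors = (1 + 0 + 1 + 4) / 4
mse_manual = np.mean(errors ** 2)
# MAE: mean of absolute errors = (1 + 0 + 1 + 2) / 4
mae_manual = np.mean(np.abs(errors))
# R2: 1 - (residual sum of squares) / (total sum of squares) = 1 - 6 / 20
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MSE penalizes large errors more heavily than MAE because the errors are squared, while R2 measures the share of the variance in y that the predictions explain.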

In [9]:
# Obtain MSE, MAE, R2 for M2

print('MSE2', mean_squared_error(ytest, pred2))
print('MAE2', mean_absolute_error(ytest, pred2))
print('R2_2', r2_score(ytest, pred2))

MSE2 181266.34045003506
MAE2 340.7897642596115
R2_2 0.8729322532670999


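- M1 is a better model than M2 on the test data: MSE (about 2,857 vs. 181,266) and MAE (about 43.1 vs. 340.8) are much smaller for M1, and R2 is larger (0.998 vs. 0.873). Dropping 'prev_year_spend' sharply reduces predictive accuracy.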

### 3. Compare MSE and MAE of the two models using 10-fold cross validation. Summarize your findings

In [10]:
# Define linear regression and 10 fold CV

ols = LinearRegression()
cv10 = model_selection.KFold(n_splits=10)

- `ols = LinearRegression()`: Initializes a linear regression model.
- `cv10 = model_selection.KFold(n_splits=10)`: This line sets up a 10-fold cross-validation scheme.
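To see what a `KFold` object does, the sketch below lists the test indices of each fold on a tiny array (10 rows and 5 folds, chosen only to keep the output short). Without `shuffle=True`, the folds are contiguous blocks in the original row order.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)   # 10 illustrative samples

kf = KFold(n_splits=5)                  # no shuffling, as with cv10 above
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(X_demo)]

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test={test_idx.tolist()} train size={len(train_idx)}")

# Every sample appears in exactly one test fold
all_test = sorted(idx for _, test_idx in folds for idx in test_idx.tolist())
```

If the rows of a dataset are sorted in some meaningful order, `KFold(n_splits=10, shuffle=True, random_state=10)` is an option for randomizing fold membership while keeping the result reproducible.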

In [11]:
# Obtain MSE, MAE for M1
mse1=-(model_selection.cross_val_score(ols, x1, y, cv=cv10, scoring='neg_mean_squared_error'))
mae1=-(model_selection.cross_val_score(ols, x1, y, cv=cv10, scoring='neg_mean_absolute_error'))

# scores of the estimator for each run of CV
print('MSE1', mse1)  
print('MAE1', mae1)

MSE1 [2783.85302023 2987.38416973 3524.19810311 2853.96659648 2571.39931392
 2413.61338146 2931.00530014 2535.94415572 3041.52995316 2599.38586265]
MAE1 [42.47136833 44.69460957 48.78478122 40.70813086 41.15156675 39.15389552
 42.65617524 41.37091645 43.22787964 41.94108309]


- `model_selection.cross_val_score()`: Evaluates a model with cross-validation and returns an array with one score per fold (here, ten negative-MSE values).
- `ols`: The linear regression model to evaluate. It is refit from scratch on the training portion of each fold.
- `scoring='neg_mean_squared_error'`: Specifies the metric. scikit-learn scorers follow a "higher is better" convention, so error metrics are returned negated; the leading minus sign in the code converts them back to positive MSE (and MAE) values.
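The sign convention is easy to verify on synthetic data. The sketch below also uses `cross_validate`, which accepts several metrics in one call, as a possible alternative to calling `cross_val_score` twice; the data and true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=10),
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error"],
)

# Raw scores are negative (or zero); flip the sign to get ordinary errors
mse_folds = -results["test_neg_mean_squared_error"]
mae_folds = -results["test_neg_mean_absolute_error"]

print("MSE per fold:", mse_folds)
print("MAE per fold:", mae_folds)
```

Both metrics are computed from the same fitted models here, so each fold is fit once rather than once per metric.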

In [12]:
# Average scores of all the runs of CV

print('MSE1', mse1.mean())
print('MAE1', mae1.mean())

MSE1 2824.2279856612613
MAE1 42.616040668265896


- `mse1.mean()`: Averages the ten per-fold MSE values into a single cross-validated estimate; `mae1.mean()` does the same for MAE.

In [13]:
# Define x for M2. Obtain MSE, MAE for M2
x2 = x1.drop('prev_year_spend', axis=1)
mse2=-(model_selection.cross_val_score(ols, x2, y, cv=cv10, scoring='neg_mean_squared_error'))
mae2=-(model_selection.cross_val_score(ols, x2, y, cv=cv10, scoring='neg_mean_absolute_error'))

# scores of the estimator for each run of CV
print(mse2)
print(mae2)

[183946.64271733 193666.38951621 160690.84408008 169823.02726641
 170530.90864855 188506.12389994 187050.48843526 167547.91321476
 146522.43349646 244990.06184977]
[350.10248198 351.72924476 311.59696414 340.2184894  340.42371759
 343.83641777 355.17271902 344.30419057 297.3769624  411.30379277]


In [14]:
# Average scores of all the runs of CV

print('MSE2', mse2.mean())
print('MAE2', mae2.mean())

MSE2 181327.48331247782
MAE2 344.6064980394614


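- M1 is a better model than M2: averaged over the 10 folds, MSE (about 2,824 vs. 181,327) and MAE (about 42.6 vs. 344.6) are both much smaller for M1. These values are close to the single test-set results in section 2.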
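MSE is in squared units, so it can be easier to interpret after taking a square root (RMSE), which is on the same scale as 'cur_year_spend'. The sketch below does this with the 10-fold averages printed above. Note that the square root of the average MSE is not exactly the same as the average of per-fold RMSE values, so treat it as a rough summary.

```python
import math

# 10-fold averages copied from the output above
cv_mse = {"M1": 2824.2279856612613, "M2": 181327.48331247782}
cv_mae = {"M1": 42.616040668265896, "M2": 344.6064980394614}

# Square root of the average MSE, in the same units as cur_year_spend
cv_rmse = {name: math.sqrt(mse) for name, mse in cv_mse.items()}

# How many times larger M2's errors are than M1's
mse_ratio = cv_mse["M2"] / cv_mse["M1"]
mae_ratio = cv_mae["M2"] / cv_mae["M1"]

for name in cv_mse:
    print(f"{name}: RMSE={cv_rmse[name]:.1f}  MAE={cv_mae[name]:.1f}")
print(f"MSE ratio M2/M1: {mse_ratio:.1f}, MAE ratio M2/M1: {mae_ratio:.1f}")
```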

### 4. Compare MSE and MAE of the two models using leave-one-out cross validation. Summarize your findings

In [15]:
# Define OLS and Leave-one-out CV

ols = LinearRegression()
loo = model_selection.LeaveOneOut()

- `loo = model_selection.LeaveOneOut()`: Creates a Leave-One-Out (LOO) cross-validation object.
- LOO is a specific type of cross-validation where each data point is used as the test set exactly once, while the rest of the data is used for training.
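On a small array it is easy to confirm that leave-one-out produces one split per sample and matches `KFold` with `n_splits` equal to the number of samples. The six-row array below is illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(12).reshape(6, 2)    # 6 illustrative samples

loo_demo = LeaveOneOut()
n_splits = loo_demo.get_n_splits(X_demo)  # one split per sample

# Test indices for each split: a single sample every time
loo_tests = [test_idx.tolist() for _, test_idx in loo_demo.split(X_demo)]

# KFold with as many folds as samples yields the same test sets
kf_tests = [test_idx.tolist() for _, test_idx in KFold(n_splits=len(X_demo)).split(X_demo)]

print(n_splits)
print(loo_tests)
```

Because each test set holds a single observation, R2 cannot be computed per fold, which is consistent with this section comparing only MSE and MAE. LOO also fits the model once per observation, so it can be slow on large datasets.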

In [16]:
# Obtain MSE, MAE for M1: Scores of the estimator for each run of CV

mse1=-(model_selection.cross_val_score(ols, x1, y, cv=loo, scoring='neg_mean_squared_error'))
mae1=-(model_selection.cross_val_score(ols, x1, y, cv=loo, scoring='neg_mean_absolute_error'))

In [17]:
# Average scores of all the runs of CV

print('MSE1', mse1.mean())
print('MAE1', mae1.mean())

MSE1 2827.971634160528
MAE1 42.67415903955856


In [18]:
# Obtain MSE, MAE for M2: Scores of the estimator for each run of CV

mse2=-(model_selection.cross_val_score(ols, x2, y, cv=loo, scoring='neg_mean_squared_error'))
mae2=-(model_selection.cross_val_score(ols, x2, y, cv=loo, scoring='neg_mean_absolute_error'))

In [19]:
# Average scores of all the runs of CV

print('MSE2', mse2.mean())
print('MAE2', mae2.mean())

MSE2 180421.42538956885
MAE2 343.61913594548764


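- M1 is a better model than M2: MSE (about 2,828 vs. 180,421) and MAE (about 42.7 vs. 343.6) are smaller for M1. The leave-one-out averages are close to the 10-fold averages, so the conclusion does not depend on the cross-validation scheme.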