# Walmart Sales: linear regressor training

In this notebook I trained a first linear regressor on the Walmart Sales data ([Kaggle competition](https://www.kaggle.com/competitions/walmart-sales-forecasting/overview)) to predict weekly sales from multiple explanatory variables.  
 
 In more detail, I:  
- Pre-processed train and test sets before modeling:  
    - **Imputed** certain **missing** explanatory **variables**
    - **Scaled** any numerical explanatory variables and **encoded** categorical variables  
- Applied a first **multivariate linear regressor** using:  
    - Basic explanatory variables  
    - Feature engineered variables  


## Table of Contents  
1. Train and test set split
2. Process variables: impute missing values / scale / one-hot encode
3. Train model: Linear regressor  
4. Feature importance  
5. Export processed training data
6. Conclusions

## Import libraries

In [29]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# IterativeImputer is experimental: this enabling import must come first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from feature_engine.imputation import RandomSampleImputer
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.feature_selection import f_regression

import scipy.sparse

## Import data

### Target variable

In [30]:
filename = 'data/interim/Walmart_Store_sales-targetvar.csv'
with open(filename) as file:
    Y = [float(line.rstrip()) for line in file]

print('Target variable length:',len(Y))
Y[0:5]

Target variable length: 131


[1572117.54, 1807545.43, 1244390.03, 1644470.66, 1857533.7]

### Basic explanatory variables

In [31]:
X_basic_df = pd.read_csv('data/interim/Walmart_Store_sales-expvar-basic.csv')
X_basic_df.drop(['year'], axis=1, inplace=True)
print('Basic explanatory variables shape:', X_basic_df.shape)

basic_vars_ls = X_basic_df.columns.tolist()
print('Basic explanatory variables:', basic_vars_ls)

X_basic = X_basic_df.values

X_basic[0:3,:]

Basic explanatory variables shape: (131, 8)
Basic explanatory variables: ['Store_str', 'quarter', 'Fuel_Price', 'weekofyear', 'Holiday_Flag', 'Temperature', 'CPI', 'Unemployment']


array([[  6.        ,   1.        ,   3.045     ,   7.        ,
                 nan,  15.33888889, 214.7775231 ,   6.858     ],
       [ 13.        ,   1.        ,   3.435     ,  12.        ,
          0.        ,   5.76666667, 128.6160645 ,   7.47      ],
       [ 11.        ,          nan,          nan,          nan,
          0.        ,  29.20555556, 214.5564968 ,   7.346     ]])

### Engineered explanatory variables

In [32]:
X_eng_df = pd.read_csv('data/interim/Walmart_Store_sales-expvar-feateng.csv')
X_eng_df.drop(['year'], axis=1, inplace=True)
print('Engineered explanatory variables shape:', X_eng_df.shape)

eng_vars_ls = X_eng_df.columns.tolist()
print('Engineered explanatory variables:', eng_vars_ls)

X_eng = X_eng_df.values
X_eng[0:3,:]

Engineered explanatory variables shape: (131, 6)
Engineered explanatory variables: ['quarter_str', 'Fuel_Price', 'Temperature_group', 'Store_group_CPI', 'Store_group_unemp', 'weekofyear_holiday']


array([['q1', 3.045, 'mean_temp', 'highsales_highCPI',
        'highsales_lowunemp', nan],
       ['q1', 3.435, 'low_temp', 'highsales_lowCPI',
        'highsales_highunemp', 0.0],
       [nan, nan, 'high_temp', 'lowsales_highCPI', 'lowsales_lowunemp',
        0.0]], dtype=object)

## 1. Train and test set split  
With only 131 samples, the test set is kept at 20% of the data (slightly below scikit-learn's default of 25%) so that more samples remain for training.

In [33]:
# Split both feature sets and the target in a single call so that they all
# share exactly the same train/test rows
(X_basic_train, X_basic_test,
 X_eng_train, X_eng_test,
 Y_train, Y_test) = train_test_split(X_basic, X_eng, Y, test_size=0.2, random_state=0)

print('X_basic_train shape:', X_basic_train.shape)
print('X_basic_test shape:', X_basic_test.shape)

print('X_eng_train shape:', X_eng_train.shape)
print('X_eng_test shape:', X_eng_test.shape)

X_basic_train shape: (104, 8)
X_basic_test shape: (27, 8)
X_eng_train shape: (104, 6)
X_eng_test shape: (27, 6)
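
The shapes above follow from how `train_test_split` handles a fractional `test_size`: the number of test rows is rounded up and the training set receives the remainder. A quick check of that arithmetic, as a small sketch (the variable names are only illustrative):

```python
import math

n_samples = 131   # rows in the data set, as printed when loading the data
test_size = 0.2   # fraction passed to train_test_split

# The test count is rounded up; the training set gets the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```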


## 2. Process variables  
Impute missing values / scale / one-hot encode

In [34]:
# Print basic variables
basic_vars_ls

['Store_str',
 'quarter',
 'Fuel_Price',
 'weekofyear',
 'Holiday_Flag',
 'Temperature',
 'CPI',
 'Unemployment']

In [35]:
# Engineered variables
eng_vars_ls

['quarter_str',
 'Fuel_Price',
 'Temperature_group',
 'Store_group_CPI',
 'Store_group_unemp',
 'weekofyear_holiday']

### Processing pipelines

In [36]:
# Transformers for missing value imputation, scaling and one-hot encoding,
# plus the column lists they apply to (e.g. basic_num_feats, cat_feat)
from src.features.build_features import *

In [37]:
# Create pre-processor objects

basic_preprocessor = ColumnTransformer(
    transformers=[
        ('num', basic_num_transformer, basic_num_feats),
        ('cat', cat_transformer, cat_feat),
        ('freqcat', cpi_transformer, cpi_feat),
    ])


eng_preprocessor = ColumnTransformer(
    transformers=[
        ('num', basic_num_transformer, eng_num_feats),
        ('cat', eng_cat_transformer, eng_cat_feats),
        ('rand', eng_rand_transformer, eng_rand_feats),
    ])
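
The transformers and column lists used above are defined in `src.features.build_features`, which is not shown in this notebook. Purely as an illustration, the sketch below shows one way an impute / scale / one-hot encode preprocessor could be assembled on toy data. Every name in it (`toy_df`, `num_cols`, `cat_cols`, `toy_preprocessor`) and the chosen imputation strategies are assumptions, not the project's actual definitions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with missing values (hypothetical, not the Walmart data)
toy_df = pd.DataFrame({
    'Fuel_Price': [3.0, np.nan, 3.4, 2.9],
    'Temperature': [15.3, 5.8, np.nan, 20.1],
    'quarter': [1.0, 1.0, np.nan, 2.0],
})
num_cols = ['Fuel_Price', 'Temperature']
cat_cols = ['quarter']

# Numerical columns: impute with the mean, then standardize
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Categorical columns: impute with the most frequent value, then one-hot encode
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

toy_preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols),
])

# 2 scaled numerical columns + 2 one-hot columns (quarter 1 and quarter 2)
toy_t = toy_preprocessor.fit_transform(toy_df)
print(toy_t.shape)
```

Wrapping the steps in a `ColumnTransformer` keeps the imputation and scaling statistics tied to the data passed to `fit`, so the same fitted object can later be applied to the test set with `transform`.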

## 3. Train model: Linear regressor

In [38]:
# Define full pipeline with pre-processing and linear regressor
basic_ref_pipeline = Pipeline([
        ('preprocessing', basic_preprocessor),
        ('lin_reg', LinearRegression())
    ])

eng_ref_pipeline = Pipeline([
        ('preprocessing', eng_preprocessor),
        ('lin_reg', LinearRegression())
    ])

In [39]:
# Preprocess data and fit models
basic_ref_pipeline.fit(X_basic_train, Y_train)
eng_ref_pipeline.fit(X_eng_train, Y_train)

# Prediction on train 
y_basic_pred_train = basic_ref_pipeline.predict(X_basic_train)
y_eng_pred_train = eng_ref_pipeline.predict(X_eng_train)

# Prediction on test set
y_basic_pred_test = basic_ref_pipeline.predict(X_basic_test)
y_eng_pred_test = eng_ref_pipeline.predict(X_eng_test)

### Model performance: R^2

In [40]:
# Compare R^2 scores
print("R2 score on training set (basic) : ", r2_score(Y_train, y_basic_pred_train))
print("R2 score on test set (basic): ", r2_score(Y_test, y_basic_pred_test))

R2 score on training set (basic) :  0.9721495512292556
R2 score on test set (basic):  0.9277245415698245


In [41]:
print("R2 score on training set (engineered) : ", r2_score(Y_train, y_eng_pred_train))
print("R2 score on test set (engineered): ", r2_score(Y_test, y_eng_pred_test))

R2 score on training set (engineered) :  0.7773985675148699
R2 score on test set (engineered):  0.5285512874810852


## 4. Feature importance  

Run a univariate F-test (`f_regression`) on the pre-processed training data to estimate which features contribute most to the prediction.

In [42]:
# Preprocess Xtrain 
X_basic_train_t = basic_preprocessor.fit_transform(X_basic_train)
X_eng_train_t = eng_preprocessor.fit_transform(X_eng_train)

# Linear regression F-statistic
feat_basic_importance = f_regression(X_basic_train_t, np.array(Y_train))
feat_eng_importance = f_regression(X_eng_train_t, np.array(Y_train))
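
`f_regression` scores each column on its own: it derives an F-statistic from the Pearson correlation r between that column and the target, F = r<sup>2</sup> / (1 - r<sup>2</sup>) * (n - 2). The sketch below reproduces the score by hand on synthetic data; the variables `x`, `y` and `f_manual` are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Univariate F-statistic computed from the Pearson correlation
r = np.corrcoef(x, y)[0, 1]
f_manual = r**2 / (1 - r**2) * (n - 2)

# f_regression expects a 2-D feature matrix, one column per feature
f_sklearn, p_values = f_regression(x.reshape(-1, 1), y)

print(f_manual, f_sklearn[0])
```

Because every column is tested separately, this ranking ignores correlations between features, so it does not have to agree with the coefficient magnitudes of the multivariate model.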


### Basic features

In [43]:
basic_preprocessor._columns

[[2, 3, 5, 7], [0, 1, 4], [6]]

In [44]:
# Features used by the preprocessor, in the order they are passed to its transformers
basic_preprocessor_cols_ls = [val for sublist in basic_preprocessor._columns for val in sublist]
[basic_vars_ls[i] for i in basic_preprocessor_cols_ls]

['Fuel_Price',
 'weekofyear',
 'Temperature',
 'Unemployment',
 'Store_str',
 'quarter',
 'Holiday_Flag',
 'CPI']

In [45]:
# Get slices for each feature
print('Basic preprocessor:', basic_preprocessor.output_indices_)

# Create repeated feature names based on slices
basic_feats_ls = (['num_fuelprice_week_temp_unemp']* 4) + (['cat_store_quarter_hol']* 22) + (['rand_cpi']* 1)

Basic preprocessor: {'num': slice(0, 4, None), 'cat': slice(4, 26, None), 'freqcat': slice(26, 27, None), 'remainder': slice(0, 0, None)}


In [46]:
# Dataframes of feature importance

# Create DataFrame with feature importance
feat_basic_ranking = pd.DataFrame(columns=basic_feats_ls, data=feat_basic_importance, index=["f-score", "p-value"])
# Reshape DataFrame and create column with feature names
feat_basic_ranking = feat_basic_ranking.transpose().reset_index().rename(columns = {'index': 'feature'})
# Sort by f-score
feat_basic_ranking = feat_basic_ranking.sort_values(["f-score", "p-value"], ascending=False)
feat_basic_ranking


| | feature | f-score | p-value |
|---|---|---|---|
| 7 | cat_store_quarter_hol | 16.725497 | 8.6e-05 |
| 5 | cat_store_quarter_hol | 15.963695 | 0.000122 |
| 6 | cat_store_quarter_hol | 12.425462 | 0.000636 |
| 26 | rand_cpi | 11.997708 | 0.00078 |
| 14 | cat_store_quarter_hol | 11.270007 | 0.001108 |
| 15 | cat_store_quarter_hol | 10.735497 | 0.001437 |
| 9 | cat_store_quarter_hol | 10.735495 | 0.001437 |
| 4 | cat_store_quarter_hol | 7.256979 | 0.008259 |
| 17 | cat_store_quarter_hol | 6.027657 | 0.015775 |
| 21 | cat_store_quarter_hol | 5.557957 | 0.020305 |


In [47]:
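# Compare to coefficients from regressor
regressor = LinearRegression()
regressor.fit(X_basic_train_t, Y_train)

# Caveat: regressor.coef_ has one entry per transformed column (27), whereas
# basic_vars_ls has 8 names, so zip() keeps only the first 8 coefficients
# (4 scaled numerical columns + the first 4 one-hot columns); the labels in
# the table do not identify the variable each coefficient belongs to
regcoefs = list(zip(basic_vars_ls, abs(regressor.coef_)))
pd.DataFrame(regcoefs, columns =['Feature', 'reg_coefficient']).sort_values('reg_coefficient', ascending=False)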

| | Feature | reg_coefficient |
|---|---|---|
| 7 | Unemployment | 1382658.0 |
| 5 | Temperature | 1267858.0 |
| 4 | Holiday_Flag | 416371.0 |
| 6 | CPI | 338001.3 |
| 1 | quarter | 150246.4 |
| 3 | weekofyear | 104130.6 |
| 0 | Store_str | 55267.83 |
| 2 | Fuel_Price | 32666.44 |


### Engineered features

In [48]:
eng_preprocessor._columns 

[[1], [2, 3, 4], [0, 5]]

In [49]:
# Features used by the preprocessor, in the order they are passed to its transformers
eng_preprocessor_cols_ls = [val for sublist in eng_preprocessor._columns for val in sublist]
[eng_vars_ls[i] for i in eng_preprocessor_cols_ls]

['Fuel_Price',
 'Temperature_group',
 'Store_group_CPI',
 'Store_group_unemp',
 'quarter_str',
 'weekofyear_holiday']

In [50]:
# Get slices for each feature
print('Eng preprocessor:', eng_preprocessor.output_indices_)

feats_eng_ls = (['num_fuelprice']* 1) + (['cat_temp_store_cpi_unemp']* 8) + (['rand_quarter_weekhol']* 3) 

Eng preprocessor: {'num': slice(0, 1, None), 'cat': slice(1, 9, None), 'rand': slice(9, 12, None), 'remainder': slice(0, 0, None)}


In [51]:
feat_eng_ranking = pd.DataFrame(columns = feats_eng_ls, data=feat_eng_importance, index=["f-score", "p-value"])
# Reshape DataFrame and create column with feature names
feat_eng_ranking = feat_eng_ranking.transpose().reset_index().rename(columns = {'index': 'feature'})
# Sort by f-score
feat_eng_ranking = feat_eng_ranking.sort_values(["f-score", "p-value"], ascending=False)
feat_eng_ranking

| | feature | f-score | p-value |
|---|---|---|---|
| 8 | cat_temp_store_cpi_unemp | 67.484818 | 6.972269e-13 |
| 4 | cat_temp_store_cpi_unemp | 66.166872 | 1.044125e-12 |
| 3 | cat_temp_store_cpi_unemp | 28.352257 | 6.046165e-07 |
| 6 | cat_temp_store_cpi_unemp | 26.179941 | 1.466537e-06 |
| 7 | cat_temp_store_cpi_unemp | 15.451197 | 0.0001544155 |
| 1 | cat_temp_store_cpi_unemp | 5.664253 | 0.01917225 |
| 5 | cat_temp_store_cpi_unemp | 5.662016 | 0.01919539 |
| 9 | rand_quarter_weekhol | 0.810601 | 0.3700633 |
| 2 | cat_temp_store_cpi_unemp | 0.671389 | 0.4144784 |
| 0 | num_fuelprice | 0.201955 | 0.6541005 |


In [52]:
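# Compare to coefficients from regressor
regressor = LinearRegression()
regressor.fit(X_eng_train_t, Y_train)

# Caveat: regressor.coef_ has one entry per transformed column (12), whereas
# eng_vars_ls has 6 names, so zip() keeps only the first 6 coefficients
# (1 scaled numerical column + the first 5 columns of the 'cat' block); the
# labels in the table do not identify the variable each coefficient belongs to
regcoefs = list(zip(eng_vars_ls, abs(regressor.coef_)))
pd.DataFrame(regcoefs, columns =['Feature', 'reg_coefficient']).sort_values('reg_coefficient', ascending=False)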

| | Feature | reg_coefficient |
|---|---|---|
| 4 | Store_group_unemp | 396921.825819 |
| 1 | Fuel_Price | 320307.850386 |
| 5 | weekofyear_holiday | 228792.931694 |
| 3 | Store_group_CPI | 119892.721675 |
| 0 | quarter_str | 58243.714697 |
| 2 | Temperature_group | 9752.77337 |


## 5. Export processed training data

In [57]:
X_basic_train_t

<104x27 sparse matrix of type '<class 'numpy.float64'>'
	with 705 stored elements in Compressed Sparse Row format>

In [61]:
# Export processed X_train/X_test and y_train/y_test for further evaluation of other models
scipy.sparse.save_npz('data/processed/Walmart_Store_sales-expvar-train-basic.npz', X_basic_train_t)

X_basic_test_t = basic_preprocessor.transform(X_basic_test)
scipy.sparse.save_npz('data/processed/Walmart_Store_sales-expvar-test-basic.npz', X_basic_test_t)


np.savetxt('data/processed/Walmart_Store_sales-target-train-basic.csv', Y_train, delimiter=",")
np.savetxt('data/processed/Walmart_Store_sales-target-test-basic.csv', Y_test, delimiter=",")
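
For the later modelling steps, the exported files can be read back with `scipy.sparse.load_npz` and `np.loadtxt`. The sketch below shows the round trip on a tiny made-up matrix written to a temporary directory; the file names and the `X_demo` / `y_demo` variables are hypothetical.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Small sparse matrix and target vector for illustration only
X_demo = scipy.sparse.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))
y_demo = [10.0, 20.0]

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_demo.npz')  # hypothetical file names
    y_path = os.path.join(tmp_dir, 'y_demo.csv')

    scipy.sparse.save_npz(x_path, X_demo)
    np.savetxt(y_path, y_demo, delimiter=',')

    # Reload: load_npz restores the sparse matrix, loadtxt the 1-D target
    X_loaded = scipy.sparse.load_npz(x_path)
    y_loaded = np.loadtxt(y_path, delimiter=',')

print(X_loaded.shape, y_loaded)
```

Note that `save_npz` only accepts SciPy sparse matrices, so this export relies on the preprocessor returning sparse output, as it does for `X_basic_train_t` above.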


## 6. Conclusions  

**Basic explanatory variables regressor**  

 
 - The linear regressor trained with the basic features *Fuel_Price*, *weekofyear*, *Temperature*, *Unemployment*, *Store_str*, *quarter*, *Holiday_Flag* and *CPI* performed well on the train set (R<sup>2</sup> = 0.97) but was slightly overfitting (R<sup>2</sup> on test set = 0.93)  

 - According to the F-scores, the most important features behind the basic regressor's prediction were the store ID and the CPI  

<br>

**Engineered explanatory variables regressor**   
 - The linear regressor trained with the engineered features *Fuel_Price*, *Temperature_group*, *Store_group_CPI*, *Store_group_unemp*, *quarter_str* and *weekofyear_holiday* did not perform as well (R<sup>2</sup> on train set = 0.78 vs R<sup>2</sup> on test set = 0.53): it overfits more strongly and explains only about half of the variance in the test set  
 
 - This poor performance could be due to the quantitative information carried by temperature, CPI and unemployment rate being lost through categorization. A possible way of evaluating this would be to encode these groups through an ordinal transformation, which preserves their order  
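
As a starting point for that follow-up, here is a minimal sketch of an ordinal encoding of the temperature groups with scikit-learn's `OrdinalEncoder`. The category labels are the ones visible in the `Temperature_group` column earlier in the notebook; the assumed ordering low < mean < high and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Labels taken from the Temperature_group column shown earlier;
# the low < mean < high ordering is an assumption about their meaning
temp_order = ['low_temp', 'mean_temp', 'high_temp']
encoder = OrdinalEncoder(categories=[temp_order])

temp_groups = np.array([['mean_temp'], ['low_temp'], ['high_temp']], dtype=object)
encoded = encoder.fit_transform(temp_groups)

# Each group is mapped to its position in temp_order
print(encoded.ravel())
```

Passing the category order explicitly matters here: by default the encoder sorts the labels alphabetically, which would not reflect the intended low-to-high ranking.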
 