# Walmart Sales: Basic linear regressor and regularization

In this notebook I trained a first linear regressor on the Walmart Sales data ([Kaggle competition](https://www.kaggle.com/competitions/walmart-sales-forecasting/overview)) to predict weekly sales from multiple explanatory variables.  

In more detail, I:  
- Pre-processed train and test sets before modeling:  
    - **Imputed** certain **missing** explanatory **variables**
    - **Scaled** any numerical explanatory variables and **encoded** categorical variables  
- Applied a first **multivariate linear regressor** using:  
    - Basic explanatory variables  
    - Feature engineered variables  
- Evaluated **model performance** through **cross-validation**  
- **Optimized** my linear **model** via **regularization** and **grid search** for hyperparameter tuning
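
As a rough illustration of the last two steps, here is a minimal sketch on synthetic data. It assumes scikit-learn's `LinearRegression`, `Ridge`, `cross_val_score` and `GridSearchCV`; the `alpha` grid, the `r2` scoring and the names `X_demo` / `y_demo` are placeholders chosen for this example, not the settings used on the Walmart data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (hypothetical): 111 rows, 5 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(111, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=111)

# 5-fold cross-validation of a plain linear regressor
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

# Grid search over the regularization strength of a ridge regressor
grid = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},  # placeholder grid
    cv=5,
    scoring='r2',
)
grid.fit(X_demo, y_demo)

print('Mean CV R2 (linear):', cv_scores.mean())
print('Best alpha (ridge):', grid.best_params_['alpha'])
```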


## Table of Contents  
1. Train and test set split
2. Process variables: impute missing values / scale / onehot encode
3. 

## Import libraries

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# enable_iterative_imputer must be imported before IterativeImputer (experimental API)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.compose import ColumnTransformer

from feature_engine.imputation import RandomSampleImputer


## Import data

### Target variable

In [2]:
filename = 'data/processed/Walmart_Store_sales-targetvar.csv'
# Read one target value per line; the values are kept as strings at this point
with open(filename) as file:
    Y = [line.rstrip() for line in file]

print('Target variable length:',len(Y))
Y[0:5]

Target variable length: 131


['1572117.54', '1807545.43', '1244390.03', '1644470.66', '1857533.7']
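
The target values printed above are strings. If a numeric target is needed downstream, one possible conversion is sketched below. It reuses only the five values shown in the output; `Y_sample` and `Y_numeric` are names introduced for this example.

```python
import numpy as np

# First five targets exactly as printed above (strings read from the file)
Y_sample = ['1572117.54', '1807545.43', '1244390.03', '1644470.66', '1857533.7']

# Cast to float64 so the values can serve as a numeric regression target
Y_numeric = np.asarray(Y_sample, dtype=float)

print(Y_numeric.dtype)
print(Y_numeric)
```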

### Basic explanatory variables

In [3]:
X_basic = pd.read_csv('data/processed/Walmart_Store_sales-expvar-basic.csv')
print('Basic explanatory variables shape:', X_basic.shape)

basic_vars_ls = X_basic.columns.tolist()
print('Basic explanatory variables:', basic_vars_ls)

X_basic = X_basic.values

X_basic[0:3,:]

Basic explanatory variables shape: (131, 7)
Basic explanatory variables: ['quarter', 'year', 'weekofyear', 'Holiday_Flag', 'Temperature', 'CPI', 'Unemployment']


array([[1.0, 'y2', 7.0, nan, 15.33888888888889, 214.7775231, 6.858],
       [1.0, 'y2', 12.0, 0.0, 5.766666666666668, 128.6160645, 7.47],
       [nan, nan, nan, 0.0, 29.205555555555552, 214.55649680000005,
        7.346]], dtype=object)

### Engineered explanatory variables

In [4]:
X_eng = pd.read_csv('data/processed/Walmart_Store_sales-expvar-feateng.csv')
print('Engineered explanatory variables shape:', X_eng.shape)

# Map certain categorical values to numerical values for missing value imputation


eng_vars_ls = X_eng.columns.tolist()
print('Engineered explanatory variables:', eng_vars_ls)

X_eng = X_eng.values
X_eng[0:3,:]

Engineered explanatory variables shape: (131, 6)
Engineered explanatory variables: ['quarter_str', 'year', 'Temperature_group', 'Store_group_CPI', 'Store_group_unemp', 'weekofyear_holiday']


array([['q1', 'y2', 'mean_temp', 'highsales_highCPI',
        'highsales_lowunemp', nan],
       ['q1', 'y2', 'low_temp', 'highsales_lowCPI',
        'highsales_highunemp', 0.0],
       [nan, nan, 'high_temp', 'lowsales_highCPI', 'lowsales_lowunemp',
        0.0]], dtype=object)

## 1. Train and test set split  
I chose a slightly smaller test size (15%) because of the low number of samples (131).

In [5]:
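# Split both feature sets with the same test_size and random_state so that
# their rows (and the Y_train / Y_test targets) stay aligned
X_basic_train, X_basic_test, Y_train, Y_test = train_test_split(X_basic, Y, test_size=0.15, random_state=0)

X_eng_train, X_eng_test, Y_train, Y_test = train_test_split(X_eng, Y, test_size=0.15, random_state=0)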

print('X_basic_train shape:', X_basic_train.shape)
print('X_basic_test shape:', X_basic_test.shape)

print('X_eng_train shape:', X_eng_train.shape)
print('X_eng_test shape:', X_eng_test.shape)

X_basic_train shape: (111, 7)
X_basic_test shape: (20, 7)
X_eng_train shape: (111, 6)
X_eng_test shape: (20, 6)
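
Both feature sets are split with the same `test_size` and `random_state`, which should select the same rows in each call. The sketch below checks this on stand-in arrays with 131 rows; `X_a`, `X_b` and `row_ids` are hypothetical names used only for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 131  # same number of rows as in this notebook
row_ids = np.arange(n_samples)

# Stand-ins for X_basic and X_eng; the first column stores the original row id
X_a = np.column_stack([row_ids, row_ids * 2])
X_b = np.column_stack([row_ids, row_ids * 3])
y = row_ids.astype(float)

# Two separate calls with identical test_size and random_state
X_a_train, X_a_test, y_train_1, y_test_1 = train_test_split(
    X_a, y, test_size=0.15, random_state=0)
X_b_train, X_b_test, y_train_2, y_test_2 = train_test_split(
    X_b, y, test_size=0.15, random_state=0)

# The same rows end up in both train sets, so the targets stay aligned
same_rows = np.array_equal(X_a_train[:, 0], X_b_train[:, 0])
same_targets = np.array_equal(y_train_1, y_train_2)

print(same_rows, same_targets)
print(X_a_train.shape, X_a_test.shape)
```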


## 2. Process variables:  
Impute missing values / scale / onehot encode

In [6]:
# Basic variables
basic_vars_ls

['quarter',
 'year',
 'weekofyear',
 'Holiday_Flag',
 'Temperature',
 'CPI',
 'Unemployment']

In [7]:
# Engineered variables
eng_vars_ls

['quarter_str',
 'year',
 'Temperature_group',
 'Store_group_CPI',
 'Store_group_unemp',
 'weekofyear_holiday']

### Processing pipelines

In [8]:
# Pipelines for missing value imputations / scaling and one hot encoding

# Categorical year
# Impute less frequent: from EDA, year with least entries is 2010 (2010 has almost 52 weeks, 2011 37 weeks) 
# One hot encode
year_feat = [1]
year_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='y2')), 
    ('encoder', OneHotEncoder(drop='first'))  # drop the first category to avoid perfectly collinear one-hot columns
    ])


# Basic numerical variables: Holiday_Flag, Temperature, Unemployment
# Based on the distribution of the data, use the median for imputation
basic_num_feats = [3, 4, 6] 
basic_num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])


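# Basic variables with more complex distributions: quarter, weekofyear and CPI (bimodal distribution)
# Multivariate imputation with Bayesian ridge (IterativeImputer's default estimator):
# Temperature is included to aid imputation of the other variables.
# Note: ColumnTransformer passes the raw columns to each transformer, so Temperature
# is imputed again here and appears a second time in the output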
basic_multivar_feats = [0, 2, 5, 4]
basic_multivar_transformer = Pipeline(steps=[
    ('imputer', IterativeImputer(max_iter=100, random_state=0)),
    ('scaler', StandardScaler())
])


# Engineered categorical variables: Temperature_group, Store_group_CPI, Store_group_unemp
# Impute most frequent and one hot encode 
eng_cat_feats = [2, 3, 4]
eng_cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first'))
])


# Random sample imputer on categorical quarter, year and weekofyear_holiday (since they are difficult to impute precisely)
eng_rand_feats = [0, 1, 5] 
eng_rand_transformer = Pipeline(steps=[
    ('imputer', RandomSampleImputer(random_state=0, variables=['0', '1', '2'])),
    ('encoder', OneHotEncoder(drop='first'))
])

# # Categorical variables difficult to impute: quarter, week and weekofyear_holiday
# # Impute with a 'missing_value' constant and add missing value indicator
# eng_catmiss_feats = [0, 2]
# eng_catmiss_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='constant', fill_value="missing_value", add_indicator=True)),
#     ('encoder', dpOH(categorical_columns=["0","1"]))
# ])
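
To make the multivariate imputation step more concrete, here is a small sketch of `IterativeImputer` on toy data. The array `X_toy` is invented for this example: its second column is roughly twice the first, so the missing entry can be estimated from the other column.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

# Toy data (hypothetical): column 1 is roughly 2 x column 0, with one missing value
X_toy = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, np.nan],
    [5.0, 10.1],
])

# Same settings as in the pipeline above; the default estimator is BayesianRidge
imputer = IterativeImputer(max_iter=100, random_state=0)
X_filled = imputer.fit_transform(X_toy)

# Observed values are left unchanged; only the missing entry is estimated
print(X_filled[3])
```

A univariate strategy such as the median would ignore the relationship between the two columns, which is the motivation for using a multivariate imputer on related variables.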


### Column Transformer

In [9]:
# Create pre-processor objects

basic_preprocessor = ColumnTransformer(
    transformers=[
        ('bnum', basic_num_transformer, basic_num_feats),
        ('year', year_transformer, year_feat),
        ('bmulti', basic_multivar_transformer, basic_multivar_feats)
    ])

eng_preprocessor = ColumnTransformer(
    transformers=[
        ('ecat', eng_cat_transformer, eng_cat_feats),
        ('year', year_transformer, year_feat),
        ('rand', eng_rand_transformer, eng_rand_feats),
    ])
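
One detail worth checking: `ColumnTransformer` hands each transformer the original columns, not the output of a previous transformer, and a column listed in two transformers appears twice in the result. The sketch below illustrates this on a toy array; `X_toy`, `median_pipe` and `mean_pipe` are names made up for this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): column 1 has one missing value
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
])

median_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
mean_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

ct = ColumnTransformer(transformers=[
    ('a', median_pipe, [1]),
    ('b', mean_pipe, [0, 1]),  # column 1 is listed a second time here
])
X_out = ct.fit_transform(X_toy)

# 1 column from 'a' + 2 columns from 'b' = 3 output columns
print(X_out.shape)
# 'b' received the raw NaN and filled it with the mean of 10, 30 and 40,
# not with the median already computed inside 'a'
print(X_out[1, 2])
```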

### Pre-process train and test sets

In [10]:
# Preprocess the basic-features train set (fit the preprocessor on train only)
print("Performing preprocessings on basic train set...")
print(X_basic_train[0:5,:])
X_basic_train = basic_preprocessor.fit_transform(X_basic_train)
print('...Done.')
print(X_basic_train[0:5,:])
print()

# Preprocess the basic-features test set (reuse the preprocessor fitted on train)
print("Performing preprocessings on basic test set...")
print(X_basic_test[0:5,:])
X_basic_test = basic_preprocessor.transform(X_basic_test)
print('...Done.')
print(X_basic_test[0:5,:])
print()

Performing preprocessings on basic train set...
[[3.0 'y1' 34.0 0.0 23.84444444444445 214.9362793 6.315]
 [2.0 'y3' 19.0 0.0 23.994444444444444 225.2351496 6.664]
 [1.0 'y3' 2.0 0.0 11.033333333333331 nan 6.832999999999998]
 [3.0 'y2' 38.0 0.0 17.555555555555557 129.5183333 6.877000000000002]
 [3.0 'y3' 27.0 0.0 30.48333333333333 130.7196333 7.17]]
...Done.
[[-2.78693206e-01  8.66209495e-01 -1.11791716e+00  0.00000000e+00
   0.00000000e+00  5.67863327e-01  6.64587166e-01  9.33458314e-01
   8.74571747e-01]
 [-2.78693206e-01  8.82265989e-01 -7.39519951e-01  0.00000000e+00
   1.00000000e+00 -4.05285581e-01 -4.75946209e-01  1.20681619e+00
   8.90608485e-01]
 [-2.78693206e-01 -5.05133951e-01 -5.56284627e-01  0.00000000e+00
   1.00000000e+00 -1.37843449e+00 -1.76855070e+00 -7.26871374e-05
  -4.95084497e-01]
 [-2.78693206e-01  1.93026156e-01 -5.08578389e-01  1.00000000e+00
   0.00000000e+00  5.67863327e-01  9.68729399e-01 -1.33374846e+00
   2.02216644e-01]
 [-2.78693206e-01  1.57685799e+00 -1

In [11]:
# Preprocess the engineered-features train set (fit the preprocessor on train only)
print("Performing preprocessings on eng train set...")
print(X_eng_train[0:5,:])
X_eng_train = eng_preprocessor.fit_transform(X_eng_train)
print('...Done.')
print(X_eng_train[0:5,:])
print()

# Preprocess the engineered-features test set (reuse the preprocessor fitted on train)
print("Performing preprocessings on eng test set...")
print(X_eng_test[0:5,:])
X_eng_test = eng_preprocessor.transform(X_eng_test)
print('...Done.')
print(X_eng_test[0:5,:])
print()

Performing preprocessings on eng train set...
[['q3' 'y1' 'mean_temp' 'lowsales_highCPI' 'lowsales_lowunemp' 0.0]
 ['q2' 'y3' 'mean_temp' 'lowsales_highCPI' 'lowsales_lowunemp' 0.0]
 ['q1' 'y3' 'mean_temp' nan 'lowsales_lowunemp' 0.0]
 ['q3' 'y2' 'mean_temp' 'highsales_lowCPI' 'highsales_lowunemp' 0.0]
 ['q3' 'y3' 'high_temp' 'highsales_lowCPI' 'highsales_lowunemp' 1.0]]
...Done.
  (0, 1)	1.0
  (0, 3)	1.0
  (0, 8)	1.0
  (0, 13)	1.0
  (1, 1)	1.0
  (1, 3)	1.0
  (1, 8)	1.0
  (1, 11)	1.0
  (1, 12)	1.0
  (1, 16)	1.0
  (2, 1)	1.0
  (2, 3)	1.0
  (2, 8)	1.0
  (2, 11)	1.0
  (2, 16)	1.0
  (3, 1)	1.0
  (3, 2)	1.0
  (3, 6)	1.0
  (3, 10)	1.0
  (3, 13)	1.0
  (3, 15)	1.0
  (4, 2)	1.0
  (4, 6)	1.0
  (4, 11)	1.0
  (4, 13)	1.0
  (4, 16)	1.0
  (4, 17)	1.0

Performing preprocessings on eng test set...
[[nan nan 'mean_temp' 'lowsales_highCPI' 'lowsales_lowunemp' 0.0]
 ['q1' 'y1' 'mean_temp' 'lowsales_highCPI' 'lowsales_lowunemp' 1.0]
 [nan nan 'mean_temp' 'lowsales_lowCPI' 'lowsales_highunemp' 0.0]
 ['q3' 