# Objectif:

Walmart's marketing service has asked you to build a machine learning model able to estimate the weekly sales in their stores, with the best precision possible on the predictions made. Such a model would help them understand better how the sales are influenced by economic indicators, and might be used to plan future marketing campaigns

## 0) Import useful modules 

In [221]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer


## 1) File reading and basic exploration

In [222]:
dataset_init = pd.read_csv('Data/Walmart_Store_sales.csv')

In [223]:
dataset=dataset_init

In [224]:
dataset.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In [225]:
dataset.describe(include='all')

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,150.0,132,136.0,138.0,132.0,136.0,138.0,135.0
unique,,85,,,,,,
top,,19-10-2012,,,,,,
freq,,4,,,,,,
mean,9.866667,,1249536.0,0.07971,61.398106,3.320853,179.898509,7.59843
std,6.231191,,647463.0,0.271831,18.378901,0.478149,40.274956,1.577173
min,1.0,,268929.0,0.0,18.79,2.514,126.111903,5.143
25%,4.0,,605075.7,0.0,45.5875,2.85225,131.970831,6.5975
50%,9.0,,1261424.0,0.0,62.985,3.451,197.908893,7.47
75%,15.75,,1806386.0,0.0,76.345,3.70625,214.934616,8.15


In [226]:
print("Percentage of missing values: ")
100*dataset.isnull().sum()/dataset.shape[0]

Percentage of missing values: 


Store            0.000000
Date            12.000000
Weekly_Sales     9.333333
Holiday_Flag     8.000000
Temperature     12.000000
Fuel_Price       9.333333
CPI              8.000000
Unemployment    10.000000
dtype: float64

----------- Cleaning our Dataset ------------

Since our target variable is Weekly_Sales, we see that it has a lot of Nan values. Since we don't want to add any bias in our predictions, we will drop those lines

In [227]:
dataset=dataset.loc[dataset['Weekly_Sales'].notnull()]

In [228]:
print("Percentage of missing values: ")
100*dataset.isnull().sum()/dataset.shape[0]

Percentage of missing values: 


Store            0.000000
Date            13.235294
Weekly_Sales     0.000000
Holiday_Flag     8.088235
Temperature     11.029412
Fuel_Price       8.823529
CPI              8.088235
Unemployment    10.294118
dtype: float64

All Good for our target variable

Let's convert the date column and extract different infos : Year, Month, day, day_of_week

In [229]:
dataset['Date']=pd.to_datetime(dataset['Date'])

In [230]:
dataset['Year']=dataset['Date'].dt.year
dataset['Month']=dataset['Date'].dt.month
dataset['Day']=dataset['Date'].dt.day
dataset['Day_of_week']=dataset['Date'].dt.dayofweek

In [231]:
dataset

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Day_of_week
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,4.0
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.470,2011.0,3.0,25.0,4.0
3,11.0,NaT,1244390.03,0.0,84.57,,214.556497,7.346,,,,
4,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,4.0
5,4.0,2010-05-28,1857533.70,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...
145,14.0,2010-06-18,2248645.59,0.0,72.62,2.780,182.442420,8.899,2010.0,6.0,18.0,4.0
146,7.0,NaT,716388.81,,20.74,2.778,,,,,,
147,17.0,2010-11-06,845252.21,0.0,57.14,2.841,126.111903,,2010.0,11.0,6.0,5.0
148,8.0,2011-12-08,856796.10,0.0,86.05,3.638,219.007525,,2011.0,12.0,8.0,3.0


In this project, will be considered as outliers all the numeric features that don't fall within the range : $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This concerns the columns : *Temperature*, *Fuel_price*, *CPI* and *Unemployment*

In [232]:
column_concerned=['Temperature','Fuel_Price','CPI','Unemployment']

for column in column_concerned:
    dataset=dataset.loc[(dataset[column]>dataset[column].mean()-3*dataset[column].std()) & (dataset[column]<dataset[column].mean()+3*dataset[column].std()) | (dataset[column].isnull())]

In [233]:
print("Percentage of missing values: ")
100*dataset.isnull().sum()/dataset.shape[0]

Percentage of missing values: 


Store            0.000000
Date            13.740458
Weekly_Sales     0.000000
Holiday_Flag     8.396947
Temperature     10.687023
Fuel_Price       9.160305
CPI              8.396947
Unemployment    10.687023
Year            13.740458
Month           13.740458
Day             13.740458
Day_of_week     13.740458
dtype: float64

-------- Some Visualisations ---------

In [234]:
# Correlation matrix
corr_matrix = dataset.corr()

import plotly.figure_factory as ff

fig = ff.create_annotated_heatmap(corr_matrix.values,
                                  x = corr_matrix.columns.values.tolist(),
                                  y = corr_matrix.index.values.tolist())


fig.show()

Let's  Visualize the impact of each variable on price with Plotly

In [235]:
import plotly.express as px
import plotly.graph_objects as go

In [236]:
data_temp=dataset.groupby('Store').mean().reset_index()

In [237]:
px.bar(data_temp,x='Store',y='Weekly_Sales')

Store 4 has the most sales in average

Let's see the evolution of the weekly_price per date 

In [238]:
dataset=dataset.sort_values(by="Date")

In [239]:
data_temp=dataset.groupby('Date').mean().reset_index()

In [240]:
# Create figure
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=data_temp['Date'],
        y=data_temp['CPI']
                ))

# Set title
fig.update_layout(
    title_text="Sum of weekly sales per date"
)

# Add range slider
fig.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=3,
                     label="3y",
                     step="year",
                     stepmode="backward"),
                dict(count=2,
                     label="2y",
                     step="year",
                     stepmode="backward"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)
#todate

fig.show()

Since there is no regularities in the extractions, we can't say a lot from this representation. All we can say, is that in some periodes as in January, June and December the weekly sales incresed for 3 years in a row

In [241]:
dataset.groupby(['Year','Store']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Month,Day,Day_of_week
Year,Store,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2010.0,1.0,3,3,2,3,3,3,3,3,3,3
2010.0,2.0,2,2,2,2,2,2,1,2,2,2
2010.0,3.0,3,3,3,3,3,3,3,3,3,3
2010.0,4.0,2,2,2,1,2,2,2,2,2,2
2010.0,5.0,3,3,3,3,1,3,2,3,3,3
2010.0,6.0,3,3,2,2,3,3,3,3,3,3
2010.0,7.0,1,1,1,1,1,1,1,1,1,1
2010.0,8.0,4,4,4,4,3,4,4,4,4,4
2010.0,9.0,4,4,2,4,4,4,3,4,4,4
2010.0,10.0,1,1,1,1,0,1,1,1,1,1


This table prouves what we saw in the graph : for each year not all stores have given a report of weekly_sales and they all have different day of extraction 

-- impact of holidays on weekly sales ---

In [242]:
dataset.groupby(['Store']).count().sort_values(by="Weekly_Sales")

Unnamed: 0_level_0,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Day_of_week
Store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
11.0,1,3,3,3,2,3,3,1,1,1,1
16.0,4,4,4,3,4,4,4,4,4,4,4
15.0,3,4,4,3,4,4,3,3,3,3,3
9.0,4,4,2,4,4,4,3,4,4,4,4
10.0,3,5,4,5,4,5,5,3,3,3,3
20.0,4,5,5,5,4,4,5,4,4,4,4
8.0,6,6,6,6,5,6,5,6,6,6,6
4.0,6,6,5,5,6,6,6,6,6,6,6
6.0,6,6,4,4,6,6,6,6,6,6,6
17.0,5,7,7,5,7,6,5,5,5,5,5


In [243]:
dataset.loc[dataset['Holiday_Flag']==1].groupby(['Store']).count().sort_values(by="Weekly_Sales")

Unnamed: 0_level_0,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Day_of_week
Store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
8.0,1,1,1,1,1,1,1,1,1,1,1
11.0,1,1,1,1,1,1,1,1,1,1,1
14.0,1,1,1,1,0,0,1,1,1,1,1
20.0,1,1,1,1,1,1,1,1,1,1,1
1.0,1,2,2,1,2,2,2,1,1,1,1
7.0,2,2,2,2,2,2,2,2,2,2,2


Since the store 7 is a one that has the most values with holidays, let's focus on it for more stats

In [244]:
data_temp= dataset.loc[dataset['Store']==7]

In [245]:
data_temp=data_temp.groupby('Holiday_Flag').mean().reset_index()

In [246]:
px.bar(data_temp, x= 'Holiday_Flag',y='Weekly_Sales')

Like what we expected, holidays has good impact on sales

-- impact of temperature on weekly sales ---

To evaluate the impact of temp on sales, we're going to choose store 3 since it has the most data

In [247]:
# Store 3
data_temp= dataset.loc[dataset['Store']==3]
data_temp.groupby(['Year','Temperature']).mean().reset_index()

Unnamed: 0,Year,Temperature,Store,Weekly_Sales,Holiday_Flag,Fuel_Price,CPI,Unemployment,Month,Day,Day_of_week
0,2010.0,45.71,3.0,461622.22,0.0,2.572,214.424881,7.368,5.0,2.0,6.0
1,2010.0,78.53,3.0,396968.8,0.0,2.705,214.495838,7.343,4.0,6.0,1.0
2,2010.0,83.52,3.0,364076.85,0.0,2.637,214.785826,7.343,6.0,18.0,4.0
3,2011.0,63.91,3.0,398838.97,0.0,3.308,221.643285,7.197,11.0,18.0,4.0
4,2011.0,75.54,3.0,403342.4,0.0,3.285,,7.197,7.0,10.0,6.0
5,2011.0,80.19,3.0,365248.94,0.0,3.467,219.741491,7.567,9.0,23.0,4.0
6,2012.0,51.86,3.0,367438.62,0.0,3.261,,6.833,1.0,13.0,4.0
7,2012.0,73.44,3.0,424513.08,0.0,3.594,226.968844,6.034,10.0,19.0,4.0
8,2012.0,75.19,3.0,431985.36,0.0,3.688,225.23515,6.664,11.0,5.0,0.0
9,2012.0,82.7,3.0,419497.95,0.0,3.346,225.306861,6.664,6.0,22.0,4.0


In [248]:
# Create figure
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=data_temp.loc[data_temp['Year']==2010]['Temperature'],
        y=data_temp.loc[data_temp['Year']==2010]['Weekly_Sales'],
        name='2010'
                ))

fig.add_trace(
    go.Scatter(
        x=data_temp.loc[data_temp['Year']==2011]['Temperature'],
        y=data_temp.loc[data_temp['Year']==2011]['Weekly_Sales'],
        name='2011'
                ))

fig.add_trace(
    go.Scatter(
        x=data_temp.loc[data_temp['Year']==2012]['Temperature'],
        y=data_temp.loc[data_temp['Year']==2012]['Weekly_Sales'],
        name='2012'
                )) 

fig.update_layout(
    title="Strore 3 : Evolution of sales with temperatures",
    xaxis_title="Temperatures",
    yaxis_title="Weekly Sales",
    legend_title="Year",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="black"
    )
)

fig.show()

We can see here that the colder it is the more sale are made for the store 3

let's check our conslusion with an oder store with high values count

In [249]:
# Store 13
data_temp= dataset.loc[dataset['Store']==13]
data_temp=data_temp.groupby(['Year','Temperature']).mean().reset_index()

In [250]:
# Create figure
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=data_temp.loc[data_temp['Year']==2010]['Temperature'],
        y=data_temp.loc[data_temp['Year']==2010]['Weekly_Sales'],
        name='2010'
                ))

fig.add_trace(
    go.Scatter(
        x=data_temp.loc[data_temp['Year']==2011]['Temperature'],
        y=data_temp.loc[data_temp['Year']==2011]['Weekly_Sales'],
        name='2011'
                ))

fig.add_trace(
    go.Scatter(
        x=data_temp.loc[data_temp['Year']==2012]['Temperature'],
        y=data_temp.loc[data_temp['Year']==2012]['Weekly_Sales'],
        name='2012'
                )) 

fig.update_layout(
    title="Strore 13 : Evolution of sales with temperatures ",
    xaxis_title="Temperatures",
    yaxis_title="Weekly Sales",
    legend_title="Year",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="black"
    )
)

fig.show()

From this graph and for 2011 we see diffrent results than Store 3 
Since we don't have a lot of data, we can't have any conclution about it

------- Cleaning Dataset with scikit-learn ------------

In [251]:
# Separate target variable Y from features X
target_name = 'Weekly_Sales'

print("Separating labels from features...")
Y = dataset.loc[:,target_name]
X = dataset.loc[:,[c for c in dataset.columns if c!=target_name]] # All columns are kept, except the target
print("...Done.")

Separating labels from features...
...Done.


In [252]:
X.head()

Unnamed: 0,Store,Date,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Day_of_week
14,17.0,2010-01-10,0.0,60.07,2.853,126.2346,6.885,2010.0,1.0,10.0,6.0
20,7.0,2010-02-04,0.0,38.26,2.725,189.704822,8.963,2010.0,2.0,4.0,3.0
47,19.0,2010-02-07,0.0,66.25,2.958,132.521867,8.099,2010.0,2.0,7.0,6.0
99,13.0,2010-02-07,0.0,78.82,2.814,126.1392,7.951,2010.0,2.0,7.0,6.0
115,15.0,2010-02-19,0.0,,2.909,131.637,,2010.0,2.0,19.0,4.0


In [253]:
print("Percentage of missing values: ")
100*X.isnull().sum()/X.shape[0]

Percentage of missing values: 


Store            0.000000
Date            13.740458
Holiday_Flag     8.396947
Temperature     10.687023
Fuel_Price       9.160305
CPI              8.396947
Unemployment    10.687023
Year            13.740458
Month           13.740458
Day             13.740458
Day_of_week     13.740458
dtype: float64

In [254]:
Numerical_variables=['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
Categorical_variables=['Store', 'Year', 'Month']
Binary_variables=['Holiday_Flag']

In [255]:
X=X[Numerical_variables+Categorical_variables+Binary_variables]

In [256]:
X.head()

Unnamed: 0,Temperature,Fuel_Price,CPI,Unemployment,Store,Year,Month,Holiday_Flag
14,60.07,2.853,126.2346,6.885,17.0,2010.0,1.0,0.0
20,38.26,2.725,189.704822,8.963,7.0,2010.0,2.0,0.0
47,66.25,2.958,132.521867,8.099,19.0,2010.0,2.0,0.0
99,78.82,2.814,126.1392,7.951,13.0,2010.0,2.0,0.0
115,,2.909,131.637,,15.0,2010.0,2.0,0.0


In [257]:
columns=X.columns.tolist()
Numerical_features=[columns.index(i) for i in Numerical_variables]
Categorical_features=[columns.index(i) for i in Categorical_variables]
Binary_features=[columns.index(i) for i in Binary_variables]

In [258]:
# First : divide dataset into train set & test set !!
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [259]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    
    def __init__(self, fill_vals=None):
        '''
        Imputer to fill in missing data with specific values. Imputation strategies include mean, median, and most frequent values. 
    
        fill_vals (dictionary): 
                - key is column name with missing data
                - value is one of three:  
                    1: str of imputation strategy ('mean', 'median', 'most_frequent') 
                        - this will impute missing values based on entire column e.g. fill missing values of feature x w/mean of feature x
                    2: tuple of column to groupby and str of imputation strategy
                        - this will impute missing values based off groupby column 
                        e.g. fill missing values of feature x with mean of x grouped by column y: ('y', 'mean')
                    3: custom value such as 0 or a string
                        
        Returns DataFrame with filled in values
        '''
        
        self.fill_vals = fill_vals
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        for col, val in self.fill_vals.items():
            if val == 'mean':
                X[col] = X[col].fillna(X[col].mean())
                
            elif val == 'median':
                X[col] = X[col].fillna(X[col].median())
                
            elif val == 'most_frequent':
                X[col] = X[col].fillna(X[col].mode()[0])
                    
            elif type(val) == tuple:
                grpby_col = val[0]
                strategy = val[1]
                
                if strategy == 'mean':
                    X[col] = X.groupby(grpby_col)[col].transform(lambda x: x.fillna(x.mean()))
                    
                elif strategy == 'median':
                    X[col] = X.groupby(grpby_col)[col].transform(lambda x: x.fillna(x.median()))
                    
                elif strategy == 'most_frequent':
                    X[col] = X.groupby(grpby_col)[col].transform(lambda x: x.fillna(x.mode()[0]))
            
            else:
                X[col] = X[col].fillna(value=val)    

                
        return X

In [260]:
# for holiday_flag we take the most common value and for Year and Month we're taking the first month of the last registred year
fill_cat_vals = {'Store':'most_frequent','Holiday_Flag':0,'Year':2012,'Month':1}

# for Temperature we're taking the average of the month of the Nan value, and for the other values we're taking the average of the year of the Nan value
fill_num_vals = {'Temperature': ('Month','mean'), 'Fuel_Price': ('Year','mean'), 'CPI': ('Year','mean'), 'Unemployment': ('Year','mean')} 


In [261]:
#Test_imputer
#imputer = CustomImputer(fill_vals)
#X_train = imputer.fit_transform(X_train)

In [262]:
#print("Percentage of missing values: ")
#100*X_train.isnull().sum()/X_train.shape[0]

CustomImputer works fine !!!

In [263]:
# Create pipeline for categorical features
binary_features = Binary_variables # Positions of binary columns in X_train/X_test
binary_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # first column will be dropped to avoid creating correlations between features
    ])

categorical_features = Categorical_variables # Positions of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder()) # first column will be dropped to avoid creating correlations between features
    ])

In [264]:
# Create pipeline for numeric features
numeric_features = Numerical_variables # Positions of numeric columns in X_train/X_test
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values will be replaced by columns' median
    ('scaler', StandardScaler())
])

In [265]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bin', binary_transformer, binary_features)
    ])

In [266]:
# Preprocessings on train set with the pipeline
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')


# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test) # Don't fit again !! The test set is used for validating decisions
# we made based on the training set, therefore we can only apply transformations that were parametered using the training set.
# Otherwise this creates what is called a leak from the test set which will introduce a bias in all your results.
print('...Done.')


Performing preprocessings on train set...
...Done.
Performing preprocessings on test set...
...Done.


In [267]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Train model
print("Train model...")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


In [268]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = regressor.predict(X_train)
print("...Done.")

Predictions on training set...
...Done.


In [269]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = regressor.predict(X_test)
print("...Done.")


Predictions on test set...
...Done.


In [270]:
from sklearn.metrics import r2_score

# Print R^2 scores
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.9679229914442715
R2 score on test set :  0.9250967972411387


Our score are high, it might be an overfitting !!

We're going to check first the weight of variables and than check the overfitting with a cross_validation and a gridsearch with a ridge and then a lasso

Let's get the coefficients of our model to check the weight of each variable in the prediction of the weekly_sales

In [271]:
#Getting coefficients labels of each coefficient
Categorical_coefs_var=[]
for category in Categorical_variables:
    x=np.sort(X[category].unique())
    x=x[~pd.isna(x)]
    x=[category+'_'+str(int(i)) for i in x]
    Categorical_coefs_var+=x

Binary_coefs_var=[]
for category in Binary_variables:
    x=np.sort(X[category].unique())
    x=x[~pd.isna(x)][1:]
    x=[category+'_'+str(int(i)) for i in x]
    Binary_coefs_var+=x

In [272]:
coefs_labels=Numerical_variables+Categorical_coefs_var+Binary_coefs_var
#coefs_variables=[str(i) for i in coefs_variables]

In [273]:
px.bar(x=coefs_labels,y=regressor.coef_)

We can see from this graph that the weekly sales value is more store dependent than any other factor (economical or date). It will be mor interesting if we can do this analysis by store like we preconize it in the graphic analysis.

Other than that, if we have to make some preconisations to walmart: It's better to invest on marketing on holidays, December, October, June and March. In the other hand, they need to expect less sales when temperature or fuel_price or unemployment get high, but better sales when CPI get high which is expectable.

--- Ridge -----

In [281]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, GridSearchCV

# Perform grid search
print("Grid search...")
regressor = Ridge()
# Grid of values to be tested
params = {
    'alpha': [0 , 0.01, 0.05, 0.1, 0.5] # 0 corresponds to no regularization
}
gridsearch = GridSearchCV(regressor, param_grid = params, cv = 5) # cv : the number of folds to be used for CV
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score : ", gridsearch.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 0}
Best R2 score :  0.8777213146739982


In [279]:
px.bar(x=coefs_labels,y=gridsearch.best_estimator_.coef_)

----- Lasso ------

In [282]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV

# Perform grid search
print("Grid search...")
regressor = Lasso()
# Grid of values to be tested
params = {
    'alpha': [0 , 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 1000, 1500] # 0 corresponds to no regularization
}
gridsearch = GridSearchCV(regressor, param_grid = params, cv = 5) # cv : the number of folds to be used for CV
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best R2 score : ", gridsearch.best_score_)

Grid search...



With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator


Objective did not converge. You might want to increase the number of iterations. Duality gap: 503200599630.38916, tolerance: 3883253642.6710815


With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator


Objective did not converge. You might want to increase the number of iterations. Duality gap: 535360606053.94495, tolerance: 3730324159.8282595


With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator


Objective did not converge. You might want to increase the number of iterations. Duality gap: 465218348699.1094, tolerance: 3699084629.269926


With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator


Objective did not converge. You might want to increase the number of iterations. Duality gap: 330305631466.5039, tolerance: 332

...Done.
Best hyperparameters :  {'alpha': 1500}
Best R2 score :  0.8839647046140982



Objective did not converge. You might want to increase the number of iterations. Duality gap: 22780246260.083008, tolerance: 3730324159.8282595


Objective did not converge. You might want to increase the number of iterations. Duality gap: 39108252260.21228, tolerance: 3699084629.269926


Objective did not converge. You might want to increase the number of iterations. Duality gap: 5888494676.423462, tolerance: 3320830567.6336107


Objective did not converge. You might want to increase the number of iterations. Duality gap: 15880415918.104614, tolerance: 3566197335.2392426


Objective did not converge. You might want to increase the number of iterations. Duality gap: 11721051656.140259, tolerance: 3883253642.6710815


Objective did not converge. You might want to increase the number of iterations. Duality gap: 12826386471.008667, tolerance: 3730324159.8282595


Objective did not converge. You might want to increase the number of iterations. Duality gap: 13767743779.251465, tolerance: 3

In [283]:
px.bar(x=coefs_labels,y=gridsearch.best_estimator_.coef_)

Same conclusion that for the other models