### Importing useful modules

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"
pio.renderers.default = "vscode"
from IPython.display import display

### File reading and basic exploration

In [5]:
# Importing dataset
walmart = pd.read_csv("Walmart_Store_sales.csv")

# Displaying the first rows of the dataset
walmart.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.47
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092


In order to predict weekly sales from other information in the dataset, we will consider the following variables:
- Target variable (Y): 'Weekly_Sales'.
- Explanatory variables (X): 'Store', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment'. The 'Date' column will be used to create the columns 'Year', 'Month', 'Day', 'Week_Number' and 'Day_of_Week', which will be considered as candidate explanatory variables (X).

*Extract from description of the original dataset on Kaggle:*

"This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:



- *Store* - the store number
- *Date* - the week of sales
- *Weekly_Sales* - sales for the given store
- *Holiday_Flag* - whether the week is a special holiday week 1 – Holiday week 0 – Non-holiday week
- *Temperature* - Temperature on the day of sale
- *Fuel_Price* - Cost of fuel in the region
- *CPI* - Prevailing consumer price index
- *Unemployment* - Prevailing unemployment rate


Holiday Events:
- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13"

In [6]:
#Checking data types in the dataframe
walmart.dtypes

Store           float64
Date             object
Weekly_Sales    float64
Holiday_Flag    float64
Temperature     float64
Fuel_Price      float64
CPI             float64
Unemployment    float64
dtype: object

We see that the 'Date' column contains values in string ('object') format. These dates will need to be converted to datetime format to be useful for analysis.

In [13]:
# Basic statistics
print("Number of rows : {}".format(walmart.shape[0]))
print("Number of columns : {}".format(walmart.shape[1]))
print()

print("Basics statistics: ")
data_desc = walmart.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values: ")
display(100*walmart.isnull().sum()/walmart.shape[0])

Number of rows : 150
Number of columns : 8

Basics statistics: 


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
count,150.0,132,136.0,138.0,132.0,136.0,138.0,135.0
unique,,85,,,,,,
top,,19-10-2012,,,,,,
freq,,4,,,,,,
mean,9.866667,,1249536.0,0.07971,61.398106,3.320853,179.898509,7.59843
std,6.231191,,647463.0,0.271831,18.378901,0.478149,40.274956,1.577173
min,1.0,,268929.0,0.0,18.79,2.514,126.111903,5.143
25%,4.0,,605075.7,0.0,45.5875,2.85225,131.970831,6.5975
50%,9.0,,1261424.0,0.0,62.985,3.451,197.908893,7.47
75%,15.75,,1806386.0,0.0,76.345,3.70625,214.934616,8.15



Percentage of missing values: 


Store            0.000000
Date            12.000000
Weekly_Sales     9.333333
Holiday_Flag     8.000000
Temperature     12.000000
Fuel_Price       9.333333
CPI              8.000000
Unemployment    10.000000
dtype: float64

Some ideas for dealing with missing values: 
- The rows with the missing values in the column 'Weekly_Sales' (target variable) will be dropped as they will not be useful for the model.
- If there is a row where the date is known but the holiday flag is missing, it may be possible to deduce the value of the holiday flag from the date (i.e. determine whether there was a holiday in the given week).
- It does not seem likely that missing dates could be easily inferred from the data in other columns.

### Visualizing the distribution of numeric features

In [16]:
num_features = ['Weekly_Sales', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
for feature in num_features:
    fig = px.histogram(walmart[feature])
    fig.show()

### Visualizing correlation between variables

In [7]:
# Correlation matrix
corr_matrix = walmart.corr(numeric_only=True).round(2)

import plotly.figure_factory as ff

fig = ff.create_annotated_heatmap(corr_matrix.values,
                                  x = corr_matrix.columns.tolist(),
                                  y = corr_matrix.index.tolist())


fig.show()

### Visualizing pairwise dependencies between variables

In [23]:
# Visualize pairwise dependencies
fig = px.scatter_matrix(walmart)
fig.update_layout(
        title = go.layout.Title(text = "Bivariate analysis", x = 0.5), showlegend = False, 
            autosize=False, height=800, width = 800)
fig.show()





### Preprocessing with Pandas

In [24]:
# Dropping rows where the value of the target variable 'Weekly_Sales' is missing
rows_to_keep = (~(walmart['Weekly_Sales'].isnull()))
walmart = walmart.loc[rows_to_keep,:].reset_index(drop = True)

In [26]:
# Converting values in the column 'Date' from object data type to datetime
walmart["Date"] = pd.to_datetime(walmart["Date"], format='%d-%m-%Y')

# Creating columns with information about year, month, day, week number and day of the week information
walmart["Year"] = pd.DatetimeIndex(walmart["Date"]).year
walmart["Month"] = pd.DatetimeIndex(walmart["Date"]).month
walmart["Day"] = pd.DatetimeIndex(walmart["Date"]).day
walmart["Week_Number"] = walmart["Date"].dt.isocalendar().week
walmart['Day_of_Week'] = pd.to_datetime(walmart['Date']).dt.day_name()

# Checking the resulting dataframe
walmart.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_Number,Day_of_Week
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,7.0,Friday
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,12.0,Friday
2,11.0,NaT,1244390.03,0.0,84.57,,214.556497,7.346,,,,,
3,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,21.0,Friday
4,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,21.0,Friday


It looks like the day of the week is always the same for all rows where the date is indicated. We will check this idea using pandas method 'nunique()'.

In [17]:
#checking the number of unique values for each column:
print(walmart.nunique())

Store            20
Date             85
Weekly_Sales    136
Holiday_Flag      2
Temperature     130
Fuel_Price      120
CPI             135
Unemployment    104
Year              3
Month            12
Day              30
Week_Number      46
Day_of_Week       1
dtype: int64


It does turn out that all dates in the dataset fall on the same day of the week, Friday. I suppose the reason is that the different stores included in the dataset are required to report weekly sales data on Fridays. This way, the date indicated for a row in the dataset is simply the date of the Friday when a report was filed and in itself has no particular interest for explaining the amount of sales. The same goes for the day of the week, as it is always fixed.

For this reason, it is not useful to include the columns 'Date' and 'Day_of_Week' as explanatory variables for training a machine learning model.

The 'Date' column can still be used to infer some missing values in the column 'Holiday_Flag'. I will consider that the stores report their sales each Friday for the period running from the previous Friday to this week's Thursday, excluding this week's Friday (the day of the report). So if a store files a weekly report on the 8th of February, and there was no holiday during the week covered by the report (i.e. from the 1st of February to the 7th of February inclusive), the holiday flag for that week will be set to 0.

In [29]:
no_flags_df = walmart[~walmart['Date'].isna() & walmart['Holiday_Flag'].isna()]
no_flags_df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_Number,Day_of_Week
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,7,Friday
14,6.0,2010-04-30,1498080.16,,68.91,2.78,211.894272,7.092,2010.0,4.0,30.0,17,Friday
40,7.0,2011-08-26,629994.47,,57.6,3.485,194.379637,8.622,2011.0,8.0,26.0,34,Friday
45,1.0,2011-08-05,1624383.75,,91.65,3.684,215.544618,7.962,2011.0,8.0,5.0,31,Friday
50,14.0,2011-03-25,1879451.23,,41.76,3.625,184.994368,8.549,2011.0,3.0,25.0,12,Friday
67,1.0,2010-08-27,1449142.92,,85.22,2.619,211.567306,7.787,2010.0,8.0,27.0,34,Friday
82,9.0,2010-07-09,485389.15,,78.51,2.642,214.65643,6.442,2010.0,7.0,9.0,27,Friday
108,9.0,2010-06-18,513073.87,,82.99,2.637,215.016648,6.384,2010.0,6.0,18.0,24,Friday
123,4.0,2011-07-08,2066541.86,,84.59,3.469,129.1125,5.644,2011.0,7.0,8.0,27,Friday


Creating the start and end dates of the reporting period:

In [38]:
# Creating columns that hold the first and last day of the reporting period
walmart['Start_of_reporting_period'] = walmart['Date'] - pd.offsets.Week()
walmart['End_of_reporting_period'] = walmart['Date'] - pd.offsets.Day()

In [39]:
walmart.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_Number,Day_of_Week,Start_of_reporting_period,End_of_reporting_period
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,7.0,Friday,2011-02-11,2011-02-17
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,12.0,Friday,2011-03-18,2011-03-24
2,11.0,NaT,1244390.03,0.0,84.57,,214.556497,7.346,,,,,,NaT,NaT
3,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,21.0,Friday,2010-05-21,2010-05-27
4,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,21.0,Friday,2010-05-21,2010-05-27


In [35]:
dates = no_flags_df['Date']
type(dates[0])

for date in dates:
    pass  # TODO: check whether a holiday falls within the reporting period that ends the day before `date`
    

pandas._libs.tslibs.timestamps.Timestamp

In [None]:
"""
Planned logic (not implemented yet):
for each date in dates:
    if a holiday falls within the week before the date:
        holiday flag = 1
    else:
        holiday flag = 0
"""
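
The reporting-period rule described above (a report filed on a Friday covers the previous Friday through the Thursday before the report) could be sketched as follows. The holiday dates are the ones quoted from the Kaggle description; the helper name `infer_holiday_flag` and the small `toy` frame are hypothetical and only stand in for `walmart`. Whether this rule matches how the flags were originally assigned is an assumption that should be checked against rows where the flag is known.

```python
import pandas as pd

# Holiday dates quoted in the Kaggle description (day-month-year).
HOLIDAYS = pd.to_datetime(
    [
        "12-02-2010", "11-02-2011", "10-02-2012", "08-02-2013",  # Super Bowl
        "10-09-2010", "09-09-2011", "07-09-2012", "06-09-2013",  # Labour Day
        "26-11-2010", "25-11-2011", "23-11-2012", "29-11-2013",  # Thanksgiving
        "31-12-2010", "30-12-2011", "28-12-2012", "27-12-2013",  # Christmas
    ],
    format="%d-%m-%Y",
)


def infer_holiday_flag(report_date):
    """Return 1.0 if a listed holiday falls in the 7 days before report_date, else 0.0."""
    start = report_date - pd.Timedelta(days=7)  # previous Friday (included)
    end = report_date - pd.Timedelta(days=1)    # Thursday before the report (included)
    return float(((HOLIDAYS >= start) & (HOLIDAYS <= end)).any())


# Hypothetical toy frame standing in for `walmart`.
toy = pd.DataFrame({
    "Date": [pd.Timestamp("2011-02-18"), pd.Timestamp("2011-08-26"), pd.NaT],
    "Holiday_Flag": [float("nan"), float("nan"), float("nan")],
})

# Only fill rows where the flag is missing and the date is known.
mask = toy["Holiday_Flag"].isna() & toy["Date"].notna()
toy.loc[mask, "Holiday_Flag"] = toy.loc[mask, "Date"].apply(infer_holiday_flag)
print(toy)
```

Rows without a date keep a missing flag, which is consistent with the earlier remark that missing dates cannot easily be inferred from the other columns.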

In [19]:
print(walmart["Unemployment"].min())
print(walmart["Unemployment"].mean() - 3*walmart["Unemployment"].std())

5.143
2.807296842152213


In [20]:
print(walmart["Unemployment"].max())
print(walmart["Unemployment"].mean() + 3*walmart["Unemployment"].std())


14.313
12.523867092274017


In [22]:
# After extracting the day of the week for each date, we can conclude that the column Weekly_Sales contains sales
# per week: there is only one value for the day of the week, Friday.
# It seems that all stores reported their weekly sales on Fridays. Therefore, the day of the month does not seem to influence the sales,
# as it is simply the date that falls on a Friday. We can leave it out of the analysis.
# Therefore, we will drop the Date column as well as Day_of_Week.
#columns_to_drop = ['Date']
  
print("Dropping useless columns...")  
#walmart = walmart.drop(columns_to_drop, axis=1) # axis = 1 indicates that we are dropping along the column axis  

Dropping useless columns...


In [23]:
walmart.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_Number,Day_of_Week
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,7.0,Friday
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,12.0,Friday
2,11.0,NaT,1244390.03,0.0,84.57,,214.556497,7.346,,,,,
3,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,21.0,Friday
4,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,21.0,Friday


In [24]:
print(walmart.shape)

(136, 13)


In [25]:
walmart['Week_Number'] = walmart['Week_Number'].astype(float)

In [31]:
# Deleting rows with outliers
columns_with_outliers = ["Temperature", "Fuel_Price", "CPI", "Unemployment"]
for column in columns_with_outliers:
    mask_outliers = np.abs(walmart[column]-walmart[column].mean()) <= (3*walmart[column].std())
    mask_null = walmart[column].isnull()
    walmart = walmart[mask_outliers|mask_null]
    print(column)
    print(walmart.shape)

print()
print(walmart.shape)

Temperature
(136, 13)
Fuel_Price
(136, 13)
CPI
(136, 13)
Unemployment
(131, 13)

(131, 13)
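
The loop above filters `walmart` in place, so the mean and standard deviation of each later column are computed on rows that survived the earlier filters. A minimal sketch of a variant that computes every bound on the same unfiltered data is shown below; the function name `drop_three_sigma_outliers` and the toy values are hypothetical, and this is one possible reading of the three-sigma rule rather than the method used in the notebook.

```python
import numpy as np
import pandas as pd


def drop_three_sigma_outliers(df, columns):
    """Keep rows within mean +/- 3*std for every listed column; missing values are kept."""
    keep = pd.Series(True, index=df.index)
    for column in columns:
        mean, std = df[column].mean(), df[column].std()  # NaN are skipped by default
        within = (df[column] - mean).abs() <= 3 * std
        keep &= within | df[column].isna()
    return df[keep]


# Hypothetical toy data: 19 typical values, one extreme value and one missing value.
toy = pd.DataFrame({"Unemployment": [7.0] * 19 + [14.0, np.nan]})
filtered = drop_three_sigma_outliers(toy, ["Unemployment"])
print(toy.shape, "->", filtered.shape)
```

With a single column both approaches give the same result; they can differ only when several columns contain outliers.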


In [26]:
# Separate target variable Y from features X
print("Separating labels from features...")
features_list = ['Store', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Year', 'Month', 'Week_Number']
target_variable = "Weekly_Sales"

X = walmart.loc[:,features_list]
Y = walmart.loc[:,target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

Y : 
0    1572117.54
1    1807545.43
2    1244390.03
3    1644470.66
4    1857533.70
Name: Weekly_Sales, dtype: float64

X :
   Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
0    6.0           NaN        59.61       3.045  214.777523         6.858   
1   13.0           0.0        42.38       3.435  128.616064         7.470   
2   11.0           0.0        84.57         NaN  214.556497         7.346   
3    6.0           0.0        78.89       2.759  212.412888         7.092   
4    4.0           0.0          NaN       2.756  126.160226         7.896   

     Year  Month  Week_Number  
0  2011.0    2.0          7.0  
1  2011.0    3.0         12.0  
2     NaN    NaN          NaN  
3  2010.0    5.0         21.0  
4  2010.0    5.0         21.0  


In [27]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [28]:
#walmart[np.abs(walmart["Temperature"]-walmart["Temperature"].mean()) <= (3*walmart["Temperature"].std())]

#mask_outliers = np.abs(walmart["Temperature"]-walmart["Temperature"].mean()) <= (3*walmart["Temperature"].std())

In [29]:
#mask_null = walmart["Temperature"].isnull()



In [30]:
#walmart = walmart[mask_outliers|mask_null]

walmart.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Week_Number,Day_of_Week
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011.0,2.0,18.0,7.0,Friday
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011.0,3.0,25.0,12.0,Friday
2,11.0,NaT,1244390.03,0.0,84.57,,214.556497,7.346,,,,,
3,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010.0,5.0,28.0,21.0,Friday
4,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010.0,5.0,28.0,21.0,Friday


### Preprocessing with Scikit-Learn

In [8]:
# We decide which features will be treated as numeric and which features will be treated as categorical
numeric_features = ["Temperature", "Fuel_Price", "CPI", "Unemployment", "Year", "Month", "Week_Number"]
categorical_features = ["Store", "Holiday_Flag"]

print('Numeric features ', numeric_features)
print('Categorical features ', categorical_features)

Numeric features  ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Year', 'Month', 'Week_Number']
Categorical features  ['Store', 'Holiday_Flag']


In [9]:
# Create pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values will be replaced by columns' mean
    ('scaler', StandardScaler())
])

In [10]:
# Create pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
    ('encoder', OneHotEncoder(drop='first')) # first column will be dropped to avoid creating correlations between features
    ])

In [11]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])


In [None]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5]) # slicing syntax: X_train is now a SciPy sparse matrix, not a pandas DataFrame
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test.head()) 
X_test = preprocessor.transform(X_test) # Don't fit again! The test set is used to validate decisions
# made on the training set, so we may only apply transformations whose parameters were learned from the training set.
# Otherwise we create a leak from the test set, which biases all the results.
print('...Done.')
print(X_test[0:5,:]) # slicing syntax: X_test is now a SciPy sparse matrix, not a pandas DataFrame
print()

Performing preprocessings on train set...
     Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
108    9.0           NaN        82.99       2.637  215.016648         6.384   
10    18.0           0.0        52.02       2.878  132.763355         9.331   
2     11.0           0.0        84.57         NaN  214.556497         7.346   
51    10.0           0.0        86.87       3.666  130.719633         7.170   
100   18.0           0.0        69.12       2.906  132.293936           NaN   

       Year  Month  Week_Number  
108  2010.0    6.0         24.0  
10   2010.0   10.0         41.0  
2       NaN    NaN          NaN  
51   2012.0    7.0         27.0  
100  2010.0    5.0         21.0  
...Done.
  (0, 0)	1.2095885387164262
  (0, 1)	-1.5030447983421744
  (0, 2)	0.9817446167889817
  (0, 3)	-0.7772480351323763
  (0, 4)	-1.0929987380203936
  (0, 5)	-0.08547043234472847
  (0, 6)	-0.08247078083024066
  (0, 14)	1.0
  (0, 27)	1.0
  (1, 0)	-0.5819095875543866
  (1, 1)	-

In [None]:
type(X_train)

scipy.sparse._csr.csr_matrix

In [None]:
type(X_test)

scipy.sparse._csr.csr_matrix

### Train model

In [None]:
# Train model
print("Train model...")
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


### Performance assessment

In [None]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = regressor.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[ 500569.51603206 1237331.82151387 1385042.96071659 1917373.70451348
 1111569.33950361  543048.76997223 2035767.51086657  325972.2504496
  629653.46233845 1626142.26595641 1463611.05345894  801352.09356539
  511397.2338083   950108.60934864  443815.04265417 2081917.2764612
 1105789.61949898  406133.93039103 2035734.92584917  425403.50580818
  489202.22623596 2095009.72524357 2473439.05952228 1628007.93015076
  638267.86359533  951287.27406269 1974509.13194652 1991226.71775267
 2248965.04104031  963911.89917096  146571.31631684 1630582.53163378
 1800886.58329832  502026.39043306  790028.16887628 1399061.31614107
  554113.33731753 1035948.64084071  551372.02882724 1945801.25958607
  361022.97742297 1623580.65390985 1024818.26483245 1924892.84597287
 1773957.73389721 2049520.50919287 1909637.02578186 2027585.44246001
 2158071.34473775  561917.11811558  669268.4086758   923833.67102555
  397856.26135784 1974220.70730508 1561818.77693021  985295.86074

In [None]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = regressor.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on test set...
...Done.
[ 840799.03830228  452566.33246259 1763994.21068185 1519352.18882509
  268778.00368232  410265.12393465 1365981.39969738 1979946.28799135
 1443154.13323911  237053.4225129   809910.58171014 1706712.37654931
 1963454.42869825 1740059.8960244  2032068.491468   1400945.47726072
  615710.76159071 2319221.41284497 2028673.07128784  877990.45299961
 2012290.25406866  575950.80384428 1935210.3473146   380156.66511916
 1947693.97003954 2167548.74874529 2436862.52008686 1242639.680926  ]



In [None]:
# Print R^2 scores
print("R2 score on training set : ", r2_score(Y_train, Y_train_pred))
print("R2 score on test set : ", r2_score(Y_test, Y_test_pred))

R2 score on training set :  0.9731135098085808
R2 score on test set :  0.9265989216777618
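
As a reminder of what these scores measure, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below recomputes it by hand on a few made-up values (they are not taken from the Walmart data) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.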


### Interpreting the model's coefficients
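As the numeric features were standardized, the magnitudes of the regression coefficients give a rough estimate of the importance of each feature for the prediction (coefficients of one-hot encoded categories are on a different scale and should be compared with care). The model's parameters are saved in a `.coef_` attribute: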

In [None]:
regressor.coef_

array([  -72775.1536421 ,   -17997.7773473 ,    68664.971783  ,
        -103048.05153977,   -24174.58469344,   201579.22173998,
        -120001.14140768,   416873.40375303, -1204201.73920573,
         607533.6627719 , -1360210.88602864,    77094.31840043,
        -998380.89963036,  -785159.58888251, -1076627.36566318,
         565314.74603851,  -129029.81892276,   -28522.75818051,
         501024.22431814,   578728.28350411,  -725392.87811951,
       -1125087.16014412,  -711016.26623718,  -297465.80015826,
         -19907.63037901,   400603.7239188 ,    23609.31757418,
         -50154.31408147])

In [None]:
column_names = []
for name, pipeline, features_list in preprocessor.transformers_: # loop over pipelines
    if name == 'num': # if pipeline is for numeric variables
        features = features_list # just get the names of columns to which it has been applied
    else: # if pipeline is for categorical variables
        features = pipeline.named_steps['encoder'].get_feature_names_out() # get output columns names from OneHotEncoder
    column_names.extend(features) # concatenate features names
        
print("Names of columns corresponding to each coefficient: ", column_names)

Names of columns corresponding to each coefficient:  ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Year', 'Month', 'Week_Number', 'Store_2.0', 'Store_3.0', 'Store_4.0', 'Store_5.0', 'Store_6.0', 'Store_7.0', 'Store_8.0', 'Store_9.0', 'Store_10.0', 'Store_11.0', 'Store_12.0', 'Store_13.0', 'Store_14.0', 'Store_15.0', 'Store_16.0', 'Store_17.0', 'Store_18.0', 'Store_19.0', 'Store_20.0', 'Holiday_Flag_1.0', 'Holiday_Flag_nan']


In [None]:
# Create a pandas DataFrame
coefs = pd.DataFrame(index = column_names, data = regressor.coef_.transpose(), columns=["coefficients"])
coefs

Unnamed: 0,coefficients
Temperature,-72775.15
Fuel_Price,-17997.78
CPI,68664.97
Unemployment,-103048.1
Year,-24174.58
Month,201579.2
Week_Number,-120001.1
Store_2.0,416873.4
Store_3.0,-1204202.0
Store_4.0,607533.7


In [None]:
# Compute abs() and sort values
feature_importance = abs(coefs).sort_values(by = 'coefficients')
feature_importance

Unnamed: 0,coefficients
Fuel_Price,17997.78
Store_19.0,19907.63
Holiday_Flag_1.0,23609.32
Year,24174.58
Store_12.0,28522.76
Holiday_Flag_nan,50154.31
CPI,68664.97
Temperature,72775.15
Store_6.0,77094.32
Unemployment,103048.1


In [None]:
# Plot coefficients
fig = px.bar(feature_importance, orientation = 'h')
fig.update_layout(showlegend = False, 
                  margin = {'l': 120} # to avoid cropping of column names
                 )
fig.show()

In [None]:
# Perform 10-fold cross-validation to evaluate the generalized R2 score obtained with a Ridge model
print("10-fold cross-validation...")
regressor_ridge = Ridge()
scores_ridge = cross_val_score(regressor_ridge, X_train, Y_train, cv=10)
print('The cross-validated R2-score is : ', scores_ridge.mean())
print('The standard deviation is : ', scores_ridge.std())

10-fold cross-validation...
The cross-validated R2-score is :  0.8220450861869619
The standard deviation is :  0.16224467591417496


In [None]:
# Perform grid search
print("Grid search...")

# Grid of values to be tested
params_ridge = {
    'alpha': [0.0, 0.1, 0.5, 1.0]
    #'alpha': [0.05, 0.07, 0.1, 0.13, 0.15]
}
gridsearch_ridge = GridSearchCV(regressor_ridge, param_grid = params_ridge, cv = 10) # cv : the number of folds to be used for CV
gridsearch_ridge.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch_ridge.best_params_)
print("Best R2 score : ", gridsearch_ridge.best_score_)

Grid search...
...Done.
Best hyperparameters :  {'alpha': 0.0}
Best R2 score :  0.9286921934124592


In [None]:
# Perform 10-fold cross-validation to evaluate the generalized R2 score obtained with a Lasso model
print("10-fold cross-validation...")
regressor_lasso = Lasso()
scores_lasso = cross_val_score(regressor_lasso, X_train, Y_train, cv=10)
print('The cross-validated R2-score is : ', scores_lasso.mean())
print('The standard deviation is : ', scores_lasso.std())

10-fold cross-validation...
The cross-validated R2-score is :  0.9285595754407432
The standard deviation is :  0.044802976365541224



Objective did not converge. You might want to increase the number of iterations. Duality gap: 417608267482.0883, tolerance: 4083378107.3628845

In [None]:
# Perform grid search
print("Grid search...")

# Grid of values to be tested
params_lasso = {
    'alpha': [0.0, 0.1, 0.5, 1.0]
    #'alpha': [0.05, 0.07, 0.1, 0.13, 0.15]
}
gridsearch_lasso = GridSearchCV(regressor_lasso, param_grid = params_lasso, cv = 10) # cv : the number of folds to be used for CV
gridsearch_lasso.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch_lasso.best_params_)
print("Best R2 score : ", gridsearch_lasso.best_score_)

# max_iter: not enough data to converge to the right solution; there is a default limit
# that prevents an infinite loop
# try max_iter = 10000


With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator


Objective did not converge. You might want to increase the number of iterations. Duality gap: 512326638596.7431, tolerance: 4083378107.3628845

Grid search...

...Done.
Best hyperparameters :  {'alpha': 1.0}
Best R2 score :  0.9285595754407432
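
Following the note about `max_iter`, a minimal sketch of the same grid search with a higher iteration limit is shown below. It runs on synthetic data rather than the Walmart features, and the grid is an assumption: `alpha=0.0` is replaced by a small positive value because scikit-learn advises using `LinearRegression` for that case. Whether 10000 iterations are enough on the real data would still need to be checked.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(scale=0.1, size=100)

# Raise the iteration limit of the coordinate descent solver.
regressor_lasso = Lasso(max_iter=10_000)

# Hypothetical grid: alpha=0 is left out on purpose.
params_lasso = {"alpha": [0.01, 0.1, 0.5, 1.0]}

gridsearch_lasso = GridSearchCV(regressor_lasso, param_grid=params_lasso, cv=10)
gridsearch_lasso.fit(X, y)
print("Best hyperparameters : ", gridsearch_lasso.best_params_)
print("Best R2 score : ", gridsearch_lasso.best_score_)
```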