# Euan Barlow DAIP 3

The following code creates four linear regression models:

1. Base model: the independent variables are IsHoliday, Temperature, Fuel_Price, CPI, Unemployment, Year and Month; the dependent variable is Weekly_Sales.
2. Adds Store as dummy variables.
3. Adds Store and Type as dummy variables.
4. Adds Store, Type and Dept as dummy variables.

#### Source 
https://www.youtube.com/watch?v=BFgbfk3LYtw

### 1. Import Packages

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import LinearRegression, LassoLars, BayesianRidge, HuberRegressor, ARDRegression, PassiveAggressiveRegressor

### 2. Load Datasets

In [2]:
data_by_dept = pd.read_csv("C:/Users/elija/OneDrive - University of Strathclyde/DAIP Project 3/data_by_dept.csv")
data_by_store = pd.read_csv("C:/Users/elija/OneDrive - University of Strathclyde/DAIP Project 3/data_by_store.csv")
# Same file as data_by_store: loaded a second time so that section 6 can keep the Type column
data_by_store_type = pd.read_csv("C:/Users/elija/OneDrive - University of Strathclyde/DAIP Project 3/data_by_store.csv")
data_by_week = pd.read_csv("C:/Users/elija/OneDrive - University of Strathclyde/DAIP Project 3/data_by_week.csv")
data_by_week_with_means = pd.read_csv("C:/Users/elija/OneDrive - University of Strathclyde/DAIP Project 3/data_by_week_with_means.csv")

### 3. Data Exploration

In [3]:
data_by_store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          6435 non-null   object 
 1   Store         6435 non-null   int64  
 2   IsHoliday     6435 non-null   bool   
 3   Temperature   6435 non-null   float64
 4   Fuel_Price    6435 non-null   float64
 5   MarkDown1     2280 non-null   float64
 6   MarkDown2     1637 non-null   float64
 7   MarkDown3     2046 non-null   float64
 8   MarkDown4     1965 non-null   float64
 9   MarkDown5     2295 non-null   float64
 10  CPI           6435 non-null   float64
 11  Unemployment  6435 non-null   float64
 12  Type          6435 non-null   object 
 13  Size          6435 non-null   int64  
 14  Weekly_Sales  6435 non-null   float64
dtypes: bool(1), float64(10), int64(2), object(2)
memory usage: 710.2+ KB
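
The `info()` output shows that the MarkDown columns are populated in only about a third of the rows, which is presumably why they are dropped in the later sections. As a minimal sketch of how that check could be made explicit, the snippet below computes the share of missing values per column on a small made-up frame. The toy values and the 0.5 threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the store data (hypothetical values, not the real CSV).
toy = pd.DataFrame({
    "Temperature": [42.3, 38.5, 39.9, 46.6],
    "MarkDown1": [np.nan, np.nan, 7304.5, np.nan],
    "MarkDown2": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values in each column (0.0 = complete, 1.0 = all missing).
missing_share = toy.isna().mean()

# Columns that are missing in more than half of the rows (threshold is an assumption).
sparse_cols = missing_share[missing_share > 0.5].index.tolist()

# Drop the sparse columns by name.
cleaned = toy.drop(columns=sparse_cols)

print(missing_share)
print(cleaned.columns.tolist())
```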


### 4. Base Model Linear Regression - "Data by week with means"

#### Data Exploration

In [4]:
data_by_week_with_means.head()

Unnamed: 0,Date,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,Weekly_Sales
0,01/04/2011,False,47.942362,3.602446,,,,,,170.335928,8.107779,43458991.19
1,01/06/2012,False,72.566966,3.748265,7304.495617,,,,3091.389303,175.292541,7.395012,48281649.72
2,01/07/2011,False,78.201634,3.678057,,,,,,171.075955,8.057111,47578519.5
3,01/10/2010,False,69.386998,2.733064,,,,,,168.004935,8.43396,42239875.87
4,02/03/2012,False,44.817214,3.695368,16660.722836,,,,3535.804692,174.420717,7.481656,46861034.97


#### Data Manipulation - Drop Markdown Columns

In [5]:
# drop markdown columns
data_by_week_with_means = data_by_week_with_means.drop(["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"], axis=1)

#### Data Manipulation - Year and Month

In [6]:
# create columns for year and month, drop date column
data_by_week_with_means['Year'] = data_by_week_with_means['Date'].apply(lambda x: x[-4:])
data_by_week_with_means['Month'] = data_by_week_with_means['Date'].apply(lambda x: x[3:5])

data_by_week_with_means = data_by_week_with_means.drop('Date', axis=1)
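
The string slicing above relies on every date being exactly in `dd/mm/yyyy` form, and it leaves Year and Month as strings. A possible alternative, sketched here on a toy frame rather than the real data, is to parse the column with `pd.to_datetime` and an explicit format, then read the year and month as integers. The toy dates are made up to match the layout shown in the `head()` output.

```python
import pandas as pd

# Toy frame using the same dd/mm/yyyy layout as the notebook's Date column.
df = pd.DataFrame({"Date": ["01/04/2011", "12/02/2010", "26/02/2010"]})

# Parse with an explicit format so day and month cannot be swapped silently;
# a malformed date raises an error instead of producing a wrong slice.
parsed = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Integer year and month columns, then drop the original string column.
df["Year"] = parsed.dt.year
df["Month"] = parsed.dt.month
df = df.drop("Date", axis=1)

print(df)
```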

#### Split and Scale Data

In [7]:
y = data_by_week_with_means['Weekly_Sales'].copy()
X = data_by_week_with_means.drop('Weekly_Sales', axis=1).copy()

In [8]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
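
In the cells above the scaler is fitted on the full dataset before the split, so the test rows contribute to the scaling statistics, and the split itself changes on every run because no `random_state` is set. The sketch below shows one way to avoid both: split first with a fixed seed, then let a scikit-learn pipeline fit the scaler on the training rows only. The synthetic data and the seed values are assumptions for illustration; the notebook's own CSV files are not used here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's CSV files).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, with a fixed seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# The pipeline fits the scaler on X_train only, then reuses those
# statistics when transforming X_test inside score() and predict().
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print("R^2 on held-out data:", r2)
```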

#### Linear Regression Model

In [10]:
lm_model = LinearRegression()

In [11]:
lm_model.fit(X_train, y_train)

LinearRegression()

In [12]:
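# Note: for regressors, .score() returns R^2 (coefficient of determination), not a classification accuracy
# Note: squared=False is deprecated from scikit-learn 1.4; np.sqrt(mean_squared_error(...)) gives the same RMSE
print("Linear Regression Accuracy:", lm_model.score(X_test, y_test))
print("Linear Regression RMSE: ", mean_squared_error(y_test, lm_model.predict(X_test), squared=False))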

Linear Regression Accuracy: -2.3693754416752255
Linear Regression RMSE:  4344660.222404703
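
The value labelled "Accuracy" above is what `score` returns for a regressor, namely R², and a negative R² means the model predicts the test set worse than simply predicting the mean of the test targets. The sketch below computes R² and RMSE by hand on a few made-up numbers to show how a negative value arises. The helper names `r2_manual` and `rmse_manual` are hypothetical and exist only for this illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean
    return 1.0 - ss_res / ss_tot

def rmse_manual(y_true, y_pred):
    """Root mean squared error (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and three sets of predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])   # close to the targets
mean_only = np.full(4, y_true.mean())   # always predicts the mean
bad = np.array([4.0, 3.0, 2.0, 1.0])    # trend reversed

r2_good = r2_manual(y_true, good)
r2_mean = r2_manual(y_true, mean_only)
r2_bad = r2_manual(y_true, bad)

print("good:", r2_good, rmse_manual(y_true, good))
print("mean:", r2_mean, rmse_manual(y_true, mean_only))
print("bad: ", r2_bad, rmse_manual(y_true, bad))
```

Predicting the mean gives an R² of zero, so any model scoring below zero, like the reversed predictions here, is doing worse than that baseline.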


#### Other Regression Models

In [13]:
models = [LassoLars(), BayesianRidge(), ARDRegression(), PassiveAggressiveRegressor(max_iter=1000000), HuberRegressor()]
for model in models:
    model.fit(X_train, y_train)
    print(model, "Accuracy: ", model.score(X_test, y_test))
    print(model, "RMSE: ", mean_squared_error(y_test, model.predict(X_test), squared=False))

LassoLars() Accuracy:  -2.369359315148841
LassoLars() RMSE:  4344649.825171928
BayesianRidge() Accuracy:  -0.23789571872873005
BayesianRidge() RMSE:  2633438.2857416496
ARDRegression() Accuracy:  -0.23789571872872983
ARDRegression() RMSE:  2633438.2857416496
PassiveAggressiveRegressor(max_iter=1000000) Accuracy:  -0.5769134955289936
PassiveAggressiveRegressor(max_iter=1000000) RMSE:  2972248.0424799426
HuberRegressor() Accuracy:  -0.08093309256962189
HuberRegressor() RMSE:  2460823.81539906


### 5. Linear Regression - "Data by Store"

#### Data Manipulation - Drop Markdown Columns

In [14]:
data_by_store = data_by_store.drop(["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"], axis=1)

#### Data Manipulation - Drop Type Column

In [15]:
data_by_store = data_by_store.drop(["Type"], axis=1)

#### Data Manipulation - Year and Month

In [16]:
data_by_store['Year'] = data_by_store['Date'].apply(lambda x: x[-4:])
data_by_store['Month'] = data_by_store['Date'].apply(lambda x: x[3:5])

data_by_store = data_by_store.drop('Date', axis=1)

#### Data Manipulation - Create Dummies for Stores

In [17]:
#create dummy function

def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies=pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [18]:
#dummies for stores

data_by_store = onehot_encode(data_by_store, column='Store', prefix="store")
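
To make the effect of `onehot_encode` concrete, the sketch below applies the same function to a four-row toy frame. It also shows `drop_first=True`, a `pd.get_dummies` option that omits one category: when a model has an intercept, a full set of dummy columns sums to one in every row and is therefore collinear with it, so dropping one level is a common precaution. The toy values are made up, and the original notebook keeps all dummy columns.

```python
import pandas as pd

def onehot_encode(df, column, prefix):
    # Same helper as in the notebook: replace `column` with its dummy columns.
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

# Toy frame with three distinct stores (hypothetical values).
toy = pd.DataFrame({
    "Store": [1, 2, 1, 3],
    "Weekly_Sales": [10.0, 20.0, 12.0, 30.0],
})

# Full encoding: one column per store.
encoded = onehot_encode(toy, column="Store", prefix="store")
print(encoded.columns.tolist())

# Reduced encoding: the first store becomes the implicit reference level.
reduced = pd.get_dummies(toy, columns=["Store"], prefix="store", drop_first=True)
print(reduced.columns.tolist())
```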

In [19]:
data_by_store

Unnamed: 0,IsHoliday,Temperature,Fuel_Price,CPI,Unemployment,Size,Weekly_Sales,Year,Month,store_1,...,store_36,store_37,store_38,store_39,store_40,store_41,store_42,store_43,store_44,store_45
0,False,59.17,3.524,214.837166,7.682,151315,1495064.75,2011,04,1,...,0,0,0,0,0,0,0,0,0,0
1,False,55.43,3.524,214.488691,7.931,202307,1800171.36,2011,04,0,...,0,0,0,0,0,0,0,0,0,0
2,False,68.76,3.524,218.211418,7.574,37392,374556.08,2011,04,0,...,0,0,0,0,0,0,0,0,0,0
3,False,56.99,3.521,128.719935,5.946,205863,1900246.47,2011,04,0,...,0,0,0,0,0,0,0,0,0,0
4,False,61.50,3.524,215.402441,6.489,34875,314316.55,2011,04,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,True,25.19,2.829,191.255700,7.508,196321,1001790.16,2010,12,0,...,0,0,0,0,0,1,0,0,0,0
6431,True,49.67,3.148,127.087677,9.003,39690,428953.60,2010,12,0,...,0,0,0,0,0,0,1,0,0,0
6432,True,48.61,2.943,203.417684,10.210,41062,534740.30,2010,12,0,...,0,0,0,0,0,0,0,1,0,0
6433,True,26.79,2.868,127.087677,7.610,39910,241937.11,2010,12,0,...,0,0,0,0,0,0,0,0,1,0


#### Split and Scale Data

In [20]:
y = data_by_store['Weekly_Sales'].copy()
X = data_by_store.drop('Weekly_Sales', axis=1).copy()

In [21]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

#### Linear Regression Model

In [23]:
lm_model = LinearRegression()

In [24]:
lm_model.fit(X_train, y_train)

LinearRegression()

In [25]:
print("Linear Regression Accuracy:", lm_model.score(X_test, y_test))
print("Linear Regression RMSE: ", mean_squared_error(y_test, lm_model.predict(X_test), squared=False))

Linear Regression Accuracy: 0.9231717756078652
Linear Regression RMSE:  158293.76980501576


#### Other Regression Models

In [26]:
models = [LassoLars(), BayesianRidge(), ARDRegression(), PassiveAggressiveRegressor(), HuberRegressor()]
for model in models:
    model.fit(X_train, y_train)
    print(model, "Accuracy: ", model.score(X_test, y_test))
    print(model, "RMSE: ", mean_squared_error(y_test, model.predict(X_test), squared=False))

LassoLars() Accuracy:  0.9229911674763196
LassoLars() RMSE:  158479.7194353456
BayesianRidge() Accuracy:  0.9229929204162994
BayesianRidge() RMSE:  158477.915700579
ARDRegression() Accuracy:  0.9224172621343859
ARDRegression() RMSE:  159069.1552725609
PassiveAggressiveRegressor() Accuracy:  0.9118501079840241
PassiveAggressiveRegressor() RMSE:  169556.45043241006
HuberRegressor() Accuracy:  0.9139307183773122
HuberRegressor() RMSE:  167543.47275920422
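
The printed pairs above are easier to compare side by side in a table. As a minimal sketch, the loop below gathers R² and RMSE for each model into a DataFrame sorted by RMSE. It runs on synthetic data with a reduced model list, both of which are assumptions for illustration, and it uses `np.sqrt(mean_squared_error(...))` so that it does not depend on the `squared` argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge, HuberRegressor, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the notebook's CSV files).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Fit each model and record its metrics as one row.
rows = []
for model in [LinearRegression(), BayesianRidge(), HuberRegressor()]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "model": type(model).__name__,
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    })

# One row per model, best (lowest) RMSE first.
results = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
print(results)
```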


### 6. Linear Regression - "Data by Store and Type"

#### Data Manipulation - Drop Markdown Columns

In [27]:
data_by_store_type = data_by_store_type.drop(["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"], axis=1)

#### Data Manipulation - Year and Month

In [28]:
data_by_store_type['Year'] = data_by_store_type['Date'].apply(lambda x: x[-4:])
data_by_store_type['Month'] = data_by_store_type['Date'].apply(lambda x: x[3:5])

data_by_store_type = data_by_store_type.drop('Date', axis=1)

#### Data Manipulation - Create Dummies for Stores and Type

In [29]:
#create dummy function

def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies=pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [30]:
#dummies for stores

data_by_store_type = onehot_encode(data_by_store_type, column='Store', prefix="store")

In [31]:
#dummies for type

data_by_store_type = onehot_encode(data_by_store_type, column='Type', prefix="type")

In [32]:
data_by_store_type

Unnamed: 0,IsHoliday,Temperature,Fuel_Price,CPI,Unemployment,Size,Weekly_Sales,Year,Month,store_1,...,store_39,store_40,store_41,store_42,store_43,store_44,store_45,type_A,type_B,type_C
0,False,59.17,3.524,214.837166,7.682,151315,1495064.75,2011,04,1,...,0,0,0,0,0,0,0,1,0,0
1,False,55.43,3.524,214.488691,7.931,202307,1800171.36,2011,04,0,...,0,0,0,0,0,0,0,1,0,0
2,False,68.76,3.524,218.211418,7.574,37392,374556.08,2011,04,0,...,0,0,0,0,0,0,0,0,1,0
3,False,56.99,3.521,128.719935,5.946,205863,1900246.47,2011,04,0,...,0,0,0,0,0,0,0,1,0,0
4,False,61.50,3.524,215.402441,6.489,34875,314316.55,2011,04,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,True,25.19,2.829,191.255700,7.508,196321,1001790.16,2010,12,0,...,0,0,1,0,0,0,0,1,0,0
6431,True,49.67,3.148,127.087677,9.003,39690,428953.60,2010,12,0,...,0,0,0,1,0,0,0,0,0,1
6432,True,48.61,2.943,203.417684,10.210,41062,534740.30,2010,12,0,...,0,0,0,0,1,0,0,0,0,1
6433,True,26.79,2.868,127.087677,7.610,39910,241937.11,2010,12,0,...,0,0,0,0,0,1,0,0,0,1


#### Split and Scale Data

In [33]:
y = data_by_store_type['Weekly_Sales'].copy()
X = data_by_store_type.drop('Weekly_Sales', axis=1).copy()

In [34]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

#### Linear Regression Model

In [36]:
lm_model = LinearRegression()

In [37]:
lm_model.fit(X_train, y_train)

LinearRegression()

In [38]:
print("Linear Regression Accuracy:", lm_model.score(X_test, y_test))
print("Linear Regression RMSE: ", mean_squared_error(y_test, lm_model.predict(X_test), squared=False))

Linear Regression Accuracy: 0.9211536493394825
Linear Regression RMSE:  159311.5521306117
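
This R² (0.921) is slightly lower than the 0.923 obtained in section 5 even though the Type dummies were added. Because `train_test_split` is called without a `random_state`, a difference of this size may simply reflect a different random split rather than a real change in fit. One way to get a steadier estimate, sketched here on synthetic data, is k-fold cross-validation, which averages the score over several splits. The data, the fold count and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's CSV files).
rng = np.random.default_rng(2)
X = rng.normal(size=(250, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.3, size=250)

# Five shuffled folds with a fixed seed, so the folds are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaling sits inside the pipeline, so it is refitted on each training fold.
model = make_pipeline(StandardScaler(), LinearRegression())

# One R^2 value per fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("R^2 per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

The spread across folds gives a rough sense of how much a single-split score can move by chance.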


#### Other Regression Models

In [39]:
models = [LassoLars(), BayesianRidge(), ARDRegression(), PassiveAggressiveRegressor(), HuberRegressor()]
for model in models:
    model.fit(X_train, y_train)
    print(model, "Accuracy: ", model.score(X_test, y_test))
    print(model, "RMSE: ", mean_squared_error(y_test, model.predict(X_test), squared=False))

LassoLars() Accuracy:  0.9212025961582317
LassoLars() RMSE:  159262.09515403517
BayesianRidge() Accuracy:  0.9212157340392141
BayesianRidge() RMSE:  159248.81772628127
ARDRegression() Accuracy:  0.9202406715427018
ARDRegression() RMSE:  160231.2477547553
PassiveAggressiveRegressor() Accuracy:  0.9093363538612498
PassiveAggressiveRegressor() RMSE:  170833.50862440717
HuberRegressor() Accuracy:  0.9116261924621053
HuberRegressor() RMSE:  168662.39136914324


### 7. Linear Regression - "Data by Store, Type and Dept"

#### Data Exploration

In [40]:
data_by_dept.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,Type,Size
0,1,1,05/02/2010,24924.5,False,42.31,2.572,,,,,,211.096358,8.106,A,151315
1,1,1,12/02/2010,46039.49,True,38.51,2.548,,,,,,211.24217,8.106,A,151315
2,1,1,19/02/2010,41595.55,False,39.93,2.514,,,,,,211.289143,8.106,A,151315
3,1,1,26/02/2010,19403.54,False,46.63,2.561,,,,,,211.319643,8.106,A,151315
4,1,1,05/03/2010,21827.9,False,46.5,2.625,,,,,,211.350143,8.106,A,151315


#### Data Manipulation - Drop Markdown Columns

In [41]:
data_by_dept = data_by_dept.drop(["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"], axis=1)

#### Data Manipulation - Year and Month

In [42]:
data_by_dept['Year'] = data_by_dept['Date'].apply(lambda x: x[-4:])
data_by_dept['Month'] = data_by_dept['Date'].apply(lambda x: x[3:5])

data_by_dept = data_by_dept.drop('Date', axis=1)

#### Data Manipulation - Create Dummies for Stores, Type and Department

In [43]:
#create dummy function

def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies=pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [44]:
#dummies for stores

data_by_dept = onehot_encode(data_by_dept, column='Store', prefix="store")

In [45]:
#dummies for type

data_by_dept = onehot_encode(data_by_dept, column='Type', prefix="type")

In [46]:
#dummies for dept

data_by_dept = onehot_encode(data_by_dept, column='Dept', prefix="dept")

#### Split and Scale Data

In [47]:
y = data_by_dept['Weekly_Sales'].copy()
X = data_by_dept.drop('Weekly_Sales', axis=1).copy()

In [48]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

#### Linear Regression Model

In [50]:
lm_model = LinearRegression()

In [51]:
lm_model.fit(X_train, y_train)

LinearRegression()

In [52]:
print("Linear Regression Accuracy:", lm_model.score(X_test, y_test))
print("Linear Regression RMSE: ", mean_squared_error(y_test, lm_model.predict(X_test), squared=False))

Linear Regression Accuracy: 0.6620206294733166
Linear Regression RMSE:  13214.24253244887


#### Other Regression Models

In [53]:
models = [LassoLars(), BayesianRidge(), ARDRegression(), PassiveAggressiveRegressor(), HuberRegressor()]
for model in models:
    model.fit(X_train, y_train)
    print(model, "Accuracy: ", model.score(X_test, y_test))
    print(model, "RMSE: ", mean_squared_error(y_test, model.predict(X_test), squared=False))

LassoLars() Accuracy:  0.6072990920219425
LassoLars() RMSE:  14243.87421536951
BayesianRidge() Accuracy:  0.6620051034672292
BayesianRidge() RMSE:  13214.54604513596
ARDRegression() Accuracy:  0.6619860471042419
ARDRegression() RMSE:  13214.918562057623
PassiveAggressiveRegressor() Accuracy:  0.6195075089583636
PassiveAggressiveRegressor() RMSE:  14020.716966461781
HuberRegressor() Accuracy:  0.6272912553213021
HuberRegressor() RMSE:  13876.56481086923
