# Prediction of sales

### Problem Statement
This dataset represents sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store are available. The aim is to build a predictive model and find out the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("sales_data.csv")

In [4]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


# Hypothesis generation

### The aim is to build a predictive model and find out the sales of each product at a particular store.

- The geography of the store could influence sales of certain items, as could the season or time of year
  - e.g. snow shovels are not going to sell as well in Florida as they do in Minnesota

- How densely populated the area around the store is
  - e.g. a store in downtown Chicago is going to sell more than a store in a rural area

- The layout of the store
  - e.g. items near the checkout are seen more, so people are more likely to buy them. You can't buy what you can't see

- Item price
  - More expensive items tend to be bought less frequently

# EDA

In [5]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [30]:
df.shape

(8523, 12)

In [6]:
df.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [50]:
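# Executed after the fillna cell (In [49]) under "Data Cleaning", so every count below is 0.
# Before imputation, describe() above shows only 7060 of 8523 Item_Weight values present.
df.isnull().sum()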

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [31]:
df["Item_Weight"].mean()

12.857645184136183

In [32]:
df[["Item_Type", "Item_Weight"]].groupby("Item_Type").mean()

Unnamed: 0_level_0,Item_Weight
Item_Type,Unnamed: 1_level_1
Baking Goods,12.277108
Breads,11.346936
Breakfast,12.768202
Canned,12.305705
Dairy,13.426069
Frozen Foods,12.867061
Fruits and Vegetables,13.224769
Hard Drinks,11.400328
Health and Hygiene,13.142314
Household,13.384736


In [47]:
df[["Outlet_Identifier", "Outlet_Size", "Outlet_Location_Type"]].value_counts()

Outlet_Identifier  Outlet_Size  Outlet_Location_Type
OUT027             Medium       Tier 3                  935
OUT013             High         Tier 3                  932
OUT035             Small        Tier 2                  930
OUT046             Small        Tier 1                  930
OUT049             Medium       Tier 1                  930
OUT018             Medium       Tier 3                  928
OUT019             Small        Tier 1                  528
dtype: int64

In [46]:
df[["Outlet_Identifier", "Item_Outlet_Sales"]].groupby("Outlet_Identifier").mean()

Unnamed: 0_level_0,Item_Outlet_Sales
Outlet_Identifier,Unnamed: 1_level_1
OUT010,339.351662
OUT013,2298.995256
OUT017,2340.675263
OUT018,1995.498739
OUT019,340.329723
OUT027,3694.038558
OUT035,2438.841866
OUT045,2192.384798
OUT046,2277.844267
OUT049,2348.354635


Outlets OUT010, OUT017, and OUT045 have no outlet size specified.
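These outlets can also be found programmatically instead of by comparing the two tables by eye. `DataFrame.value_counts()` drops rows containing missing values by default, which is why they vanish from the three-column count above. Below is a minimal sketch on a toy frame; the rows are made up for illustration and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sales data; the rows are illustrative only.
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT010", "OUT010", "OUT013", "OUT017"],
    "Outlet_Size": ["Medium", np.nan, np.nan, "High", np.nan],
})

# Keep the rows where Outlet_Size is missing and list the distinct outlets.
is_missing = toy["Outlet_Size"].isna()
missing_size = sorted(toy.loc[is_missing, "Outlet_Identifier"].unique())
print(missing_size)
```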

In [14]:
for col in df.columns:
    print(df[col].value_counts())

FDG33    10
FDW13    10
FDW26     9
FDP25     9
FDX20     9
         ..
FDN52     1
FDC23     1
FDO33     1
FDQ60     1
FDY43     1
Name: Item_Identifier, Length: 1559, dtype: int64
12.150    86
17.600    82
13.650    77
11.800    76
15.100    68
          ..
6.775      2
9.420      1
6.520      1
5.400      1
7.685      1
Name: Item_Weight, Length: 415, dtype: int64
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64
0.000000    526
0.076975      3
0.072238      2
0.080766      2
0.058543      2
           ... 
0.024343      1
0.041823      1
0.023154      1
0.047783      1
0.031007      1
Name: Item_Visibility, Length: 7880, dtype: int64
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat             

# Data Cleaning

In [28]:
# Collapse the inconsistent spellings into the two real categories.
df["Item_Fat_Content"] = df["Item_Fat_Content"].map({"LF"     : "Low Fat", 
                                                    "reg"     : "Regular", 
                                                    "low fat" : "Low Fat",
                                                    "Low Fat" : "Low Fat",
                                                    "Regular" : "Regular"})

In [49]:
df.fillna({"Item_Weight": df["Item_Weight"].mean(),
          "Outlet_Size": "Unspecified"},
          axis=0,
          inplace=True)
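The overall mean is one simple choice for `Item_Weight`. Because the same product appears in several stores, another option is to fill a missing weight from the other rows of the same `Item_Identifier` and fall back on the overall mean only when a product has no recorded weight at all. This is an assumption about what might work better, not what the notebook does; the sketch below uses made-up rows.

```python
import numpy as np
import pandas as pd

# Toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, 5.92, 5.92, np.nan],
})

# Mean weight per product, aligned back to every row of that product.
per_item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill from the product's own mean first, then from the overall mean.
overall_mean = toy["Item_Weight"].mean()
toy["Item_Weight"] = toy["Item_Weight"].fillna(per_item_mean).fillna(overall_mean)
print(toy)
```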

# Data Encoding

- normalize the numerical data
- reduce dimensions
- try mean squared error (MSE) and root mean squared error (RMSE) as metrics
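For the two error metrics mentioned in the notes, here is a minimal sketch. The `y_pred` values are made up; in the notebook they would come from a fitted model's `predict`. RMSE is in the same units as `Item_Outlet_Sales`, which makes it easier to read than MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3735.138, 443.4228, 2097.27, 732.38])
y_pred = np.array([3500.0, 600.0, 2000.0, 900.0])  # made-up predictions

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # back in the units of the target
print(mse, rmse)
```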

In [143]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [144]:
normalize = MinMaxScaler()
standardize = StandardScaler()

In [145]:
num_feats = df.select_dtypes("float64")

In [146]:
normalized = normalize.fit_transform(num_feats)
standardized = standardize.fit_transform(num_feats)

In [154]:
df_normalized = df.copy()
df_normalized[num_feats.columns] = normalized

df_standardized = df.copy()
df_standardized[num_feats.columns] = standardized

In [155]:
df_normalized.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,0.282525,Low Fat,0.048866,Dairy,0.927507,OUT049,1999,Medium,Tier 1,Supermarket Type1,0.283587
1,DRC01,0.081274,Regular,0.058705,Soft Drinks,0.072068,OUT018,2009,Medium,Tier 3,Supermarket Type2,0.031419


In [156]:
df_standardized.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,-0.841872,Low Fat,-0.970732,Dairy,1.747454,OUT049,1999,Medium,Tier 1,Supermarket Type1,0.910601
1,DRC01,-1.641706,Regular,-0.908111,Soft Drinks,-1.489023,OUT018,2009,Medium,Tier 3,Supermarket Type2,-1.01844


In [157]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


In [158]:
y = df["Item_Outlet_Sales"]
X = df.drop("Item_Outlet_Sales", axis=1)

y_normalized = df_normalized["Item_Outlet_Sales"]
X_normalized = df_normalized.drop("Item_Outlet_Sales", axis=1)

y_standardized = df_standardized["Item_Outlet_Sales"]
X_standardized = df_standardized.drop("Item_Outlet_Sales", axis=1)

In [159]:
X = pd.get_dummies(X)
X_normalized = pd.get_dummies(X_normalized)
X_standardized = pd.get_dummies(X_standardized)

In [160]:
X.shape

(8523, 1602)
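Most of those 1602 columns come from `Item_Identifier`: `pd.get_dummies` creates one column per distinct value of every text column, and there are 1559 distinct products. Counting the levels first shows how wide the encoded frame will be. The sketch below uses a toy frame with values taken from the first rows of the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular"],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095],
})

# Distinct values per text column = dummy columns that column will produce.
n_levels = toy[["Item_Identifier", "Item_Fat_Content"]].nunique()
print(n_levels)

# 4 product dummies + 2 fat-content dummies + 1 numeric column.
encoded = pd.get_dummies(toy)
print(encoded.shape)
```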

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. Baseline models help us set a benchmark to gauge the performance of our future models. If your new model performs below the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 

In [65]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

## Task
Split your data into an 80% train set and a 20% test set.

In [161]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_normalized, y_normalized, train_size=0.8)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_standardized, y_standardized, train_size=0.8)
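Each of the three `train_test_split` calls above draws its own random split, so the raw, normalized, and standardized models are scored on different test rows. The scalers were also fitted on all rows before splitting. One possible alternative is to split once with a fixed `random_state` and fit the scaler on the training rows only. This is a sketch on synthetic data, not the notebook's code, and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features and the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
y_toy = X_toy @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=50)

# A fixed random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(X_tr)
Xs_tr = scaler.transform(X_tr)
Xs_te = scaler.transform(X_te)
print(X_tr.shape, X_te.shape)
```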

**Baseline Model**

In [125]:
baseline = LinearRegression()
baseline_ridge = Ridge()
baseline_lasso = Lasso()
baseline.fit(X_train, y_train)
baseline_ridge.fit(X_train, y_train)
baseline_lasso.fit(X_train, y_train)

Lasso()

In [126]:
# Linear Regression
print(baseline.score(X_train, y_train))
print(baseline.score(X_test, y_test))

0.657770743340025
-723371847116165.4


In [127]:
# Ridge Regression
print(baseline_ridge.score(X_train, y_train))
print(baseline_ridge.score(X_test, y_test))

0.6526476725553219
0.5168003585940522


In [128]:
# Lasso Regression
print(baseline_lasso.score(X_train, y_train))
print(baseline_lasso.score(X_test, y_test))

0.5611596605869549
0.5800836028005485
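The `score` method of these regressors returns R², that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. It has no lower bound: a model that predicts worse than the constant mean gets a negative value, which is how the linear regression test score above can be so far below zero. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score(y_true, y_true)                       # 1.0
mean_only = r2_score(y_true, np.full(4, y_true.mean()))  # 0.0
# Predictions far from the truth push R² well below zero.
wild = r2_score(y_true, np.array([100.0, -100.0, 100.0, -100.0]))
print(perfect, mean_only, wild)
```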


## Task
Use grid_search to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.

In [77]:
from sklearn.model_selection import GridSearchCV

In [129]:
ridge = Ridge()
params = {"alpha": [52, 54, 56, 58, 60, 62, 64, 66, 68]}
gs_ridge = GridSearchCV(ridge, params, n_jobs=-1)
gs_ridge.fit(X_train, y_train)

GridSearchCV(estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': [52, 54, 56, 58, 60, 62, 64, 66, 68]})

In [130]:
print(gs_ridge.best_params_)
print(gs_ridge.best_score_)

{'alpha': 66}
0.5540155902776314


In [131]:
lasso = Lasso()
params = {"alpha": [1, 5, 10, 15, 20, 50]}
gs_lasso = GridSearchCV(lasso, params, n_jobs=1)
gs_lasso.fit(X_train, y_train)

GridSearchCV(estimator=Lasso(), n_jobs=1,
             param_grid={'alpha': [1, 5, 10, 15, 20, 50]})

In [132]:
print(gs_lasso.best_params_)
print(gs_lasso.best_score_)

{'alpha': 5}
0.5553256265139784
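The grids above cover a fairly narrow range of `alpha`. A common approach is to search several orders of magnitude first and refine afterwards. The sketch below does this on synthetic data standing in for `X_train` and `y_train`; the grid bounds are an assumption, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for X_train / y_train.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, noise=10.0, random_state=0
)

# Log-spaced candidates: 0.01, 0.1, 1, 10, 100, 1000.
param_grid = {"alpha": np.logspace(-2, 3, 6)}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)

# With the default refit=True, best_estimator_ is already refitted on all of X_demo.
best_model = search.best_estimator_
```

Because of `refit=True`, the fitted search object can be used for prediction directly, without building a new model from `best_params_` by hand.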


## Task
Using the model from grid_search, predict the values in the test set and compare against your benchmark.

In [133]:
best_ridge = Ridge(alpha=66)
best_lasso = Lasso(alpha=5)

In [134]:
best_ridge.fit(X_train, y_train)
best_lasso.fit(X_train, y_train)
baseline.fit(X_train, y_train)

LinearRegression()

In [137]:
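# Ridge: baseline (default alpha) vs. tuned (alpha=66)
print(baseline_ridge.score(X_test, y_test))
print(best_ridge.score(X_test, y_test))
print()
# Lasso: tuned (alpha=5) vs. baseline (default alpha)
print(best_lasso.score(X_test, y_test))
print(baseline_lasso.score(X_test, y_test))
print()
# Plain linear regression baseline
print(baseline.score(X_test, y_test))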

0.5168003585940522
0.5784566679985881

0.5796712285977355
0.5800836028005485

-723371847116165.4


In [116]:
pred0 = baseline_ridge.predict(X_test)
pred1 = gs_ridge.predict(X_test)

In [117]:
pred0

array([2207.05918498,  725.12579596, 1601.14535255, ..., 1746.43385678,
        795.83381174, 2544.22388514])

In [118]:
pred1

array([2499.55343181,  881.30922735, 1568.78546121, ..., 1599.99032842,
        995.09063067, 2391.92932747])

In [119]:
y

0       3735.1380
1        443.4228
2       2097.2700
3        732.3800
4        994.7052
          ...    
8518    2778.3834
8519     549.2850
8520    1193.1136
8521    1845.5976
8522     765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64
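Note that `y` is the full target column (8523 rows), while `pred0` and `pred1` cover only the test rows, so the arrays above do not line up row by row. Predictions should be compared with `y_test`. A minimal sketch of a side-by-side comparison, using made-up values in place of `y_test` and the predictions:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for y_test and model.predict(X_test).
y_test_demo = pd.Series([3735.138, 443.4228, 2097.27], index=[10, 11, 12])
pred_demo = np.array([3500.0, 600.0, 2000.0])

# Keep the test index so each row can be traced back to the original data.
comparison = pd.DataFrame(
    {"actual": y_test_demo.to_numpy(), "predicted": pred_demo},
    index=y_test_demo.index,
)
comparison["residual"] = comparison["actual"] - comparison["predicted"]
print(comparison)
```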

### Building a model with normalized data

In [162]:
baseline = LinearRegression()
baseline_ridge = Ridge()
baseline_lasso = Lasso()
baseline.fit(Xn_train, yn_train)
baseline_ridge.fit(Xn_train, yn_train)
baseline_lasso.fit(Xn_train, yn_train)

Lasso()

In [163]:
# Linear Regression
print(baseline.score(Xn_train, yn_train))
print(baseline.score(Xn_test, yn_test))

0.6594944220644912
-2.412660615736065e+17


In [164]:
# Ridge Regression
print(baseline_ridge.score(Xn_train, yn_train))
print(baseline_ridge.score(Xn_test, yn_test))

0.6577792742848529
0.4785630715603981


In [165]:
# Lasso Regression
print(baseline_lasso.score(Xn_train, yn_train))
print(baseline_lasso.score(Xn_test, yn_test))

0.0
-0.00014645010856506602
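A training score of exactly 0.0 means Lasso is predicting the mean for every row, so all coefficients have been shrunk to zero. The most likely reason is that the target was scaled into [0, 1] together with the features, so the default `alpha=1.0` penalty is now very large relative to the data. The sketch below reproduces the effect on synthetic data; the smaller `alpha` is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 3))             # features in [0, 1]
y_demo = 0.8 * X_demo[:, 0] + rng.normal(0.0, 0.01, 200)  # small-scale target

strong = Lasso(alpha=1.0).fit(X_demo, y_demo)   # sklearn's default alpha
weak = Lasso(alpha=0.001).fit(X_demo, y_demo)   # much lighter penalty

# The default penalty zeroes every coefficient; the lighter one keeps the signal.
print(strong.coef_, strong.score(X_demo, y_demo))
print(weak.coef_, weak.score(X_demo, y_demo))
```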


### Building a model with standardized data

In [166]:
baseline = LinearRegression()
baseline_ridge = Ridge()
baseline_lasso = Lasso()
baseline.fit(Xs_train, ys_train)
baseline_ridge.fit(Xs_train, ys_train)
baseline_lasso.fit(Xs_train, ys_train)

Lasso()

In [167]:
# Linear Regression
print(baseline.score(Xs_train, ys_train))
print(baseline.score(Xs_test, ys_test))

0.6621953593422041
-113635881079132.8


In [168]:
# Ridge Regression
print(baseline_ridge.score(Xs_train, ys_train))
print(baseline_ridge.score(Xs_test, ys_test))

0.6570056971346163
0.49463401424248055


In [169]:
# Lasso Regression
print(baseline_lasso.score(Xs_train, ys_train))
print(baseline_lasso.score(Xs_test, ys_test))

0.0
-0.003325948510356058


### Being more selective with features

In [174]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


In [179]:
y1 = df["Item_Outlet_Sales"]
X1 = df.drop(["Item_Outlet_Sales", "Item_Identifier"], axis=1)
X1 = pd.get_dummies(X1)

In [180]:
X1.shape

(8523, 43)
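Dropping `Item_Identifier` by name works here. A more general option is to drop any text column with too many distinct values before encoding. The sketch below uses a toy frame, and the `max_levels` cut-off is a hypothetical value chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2",
                    "Supermarket Type1", "Grocery Store"],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095],
})

categorical = ["Item_Identifier", "Outlet_Type"]
max_levels = 3  # hypothetical cut-off, chosen for this toy example

# Columns with more distinct values than the cut-off are left out of the encoding.
too_wide = [col for col in categorical if toy[col].nunique() > max_levels]
encoded = pd.get_dummies(toy.drop(columns=too_wide))
print(too_wide, encoded.shape)
```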

In [181]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, train_size=0.8)

In [182]:
baseline = LinearRegression()
baseline_ridge = Ridge()
baseline_lasso = Lasso()
baseline.fit(X1_train, y1_train)
baseline_ridge.fit(X1_train, y1_train)
baseline_lasso.fit(X1_train, y1_train)

Lasso()

In [183]:
# Linear Regression
print(baseline.score(X1_train, y1_train))
print(baseline.score(X1_test, y1_test))

0.5618400851115506
0.5681501618717084


In [184]:
# Ridge Regression
print(baseline_ridge.score(X1_train, y1_train))
print(baseline_ridge.score(X1_test, y1_test))

0.5618399898543747
0.5681498194712831


In [185]:
# Lasso Regression
print(baseline_lasso.score(X1_train, y1_train))
print(baseline_lasso.score(X1_test, y1_test))

0.5617090515477721
0.5686031472375463


# Conclusion

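One-hot encoding `Item_Identifier` produced 1,602 features, 1,559 of them product dummies, and this led to the hugely negative test R² with plain linear regression (about -7e14, against a training R² of about 0.66).  
With only the 43 features left after dropping `Item_Identifier`, linear regression behaves normally (test R² of about 0.57).  
Ridge and Lasso regression were robust to the extra dummy variables on the unscaled data, with test R² between roughly 0.52 and 0.58.  
Scaling the data did not help with the full dummy-variable set: linear regression still failed, and Lasso with its default `alpha` scored about 0 on the scaled targets.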