# Perdiction of sales

### Problem Statement
This dataset represents sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store are available. The aim is to build a predictive model and find out the sales of each product at a particular store.

|Variable|Description|
|: ------------- |:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particulat store. This is the outcome variable to be predicted.|

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

---------------------

### Explore the problem in following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd
import numpy as np

**task**
predict Item_Outlet_Sales

In [2]:
df = pd.read_csv("sales_data.csv")

In [3]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


# Data Cleaning

In [4]:
df["Item_Fat_Content"] = df["Item_Fat_Content"].map({"LF"     : "Low Fat", 
                                                    "reg"     : "Regular", 
                                                    "low fat" : "Low Fat",
                                                    "Low Fat" : "Low Fat",
                                                    "Regular" : "Regular"})

In [5]:
df.fillna({"Item_Weight": df["Item_Weight"].mean(),
          "Outlet_Size": "Unspecified"},
          axis=0,
          inplace=True)

In [6]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Unspecified,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


We have covered how to prepare a dataset and the process of feature engineering two weeks ago. In addition, we have already created Lasso and Ridge regressions on Monday. Today, we will be working with the ensemble methods. 

-------------------------
### Model Building: Ensemble Models

Try out the different ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost)
- **Note:** Spend some time on the documention for each of these models.
- **Note:** As you spend time on this challenge, it is suggested to review how each of these models work and how they compare to each other.

Calculate the **mean squared error** on the test set. Explore how different parameters of the model affect the results and the performance of the model. (*Stretch: Create a visualization to display this information*)

- Use GridSearchCV to find optimal paramaters of models.
- Compare agains the Lasso and Ridge Regression models from Monday.

**Questions to answer:**
- Which ensemble model performed the best? 

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

In [8]:
def print_metrics(y_true, y_pred, y_true_train, y_pred_train):
    print("--r2--")
    print("train", metrics.r2_score(y_train, y_pred_train))
    print("test ", metrics.r2_score(y_test, y_pred))
    print("--MSE--")
    print("train", metrics.mean_squared_error(y_train, y_pred_train))
    print("test ", metrics.mean_squared_error(y_test, y_pred))
    print("--MAE--")
    print("train", metrics.mean_absolute_error(y_train, y_pred_train))
    print("test ", metrics.mean_absolute_error(y_test, y_pred))

In [9]:
y = df["Item_Outlet_Sales"]
X = df.drop(["Item_Outlet_Sales", "Item_Identifier", "Outlet_Identifier"], axis=1)
X = pd.get_dummies(X)

In [10]:
X.columns

Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Item_Fat_Content_Low Fat',
       'Item_Fat_Content_Regular', 'Item_Type_Baking Goods',
       'Item_Type_Breads', 'Item_Type_Breakfast', 'Item_Type_Canned',
       'Item_Type_Dairy', 'Item_Type_Frozen Foods',
       'Item_Type_Fruits and Vegetables', 'Item_Type_Hard Drinks',
       'Item_Type_Health and Hygiene', 'Item_Type_Household', 'Item_Type_Meat',
       'Item_Type_Others', 'Item_Type_Seafood', 'Item_Type_Snack Foods',
       'Item_Type_Soft Drinks', 'Item_Type_Starchy Foods', 'Outlet_Size_High',
       'Outlet_Size_Medium', 'Outlet_Size_Small', 'Outlet_Size_Unspecified',
       'Outlet_Location_Type_Tier 1', 'Outlet_Location_Type_Tier 2',
       'Outlet_Location_Type_Tier 3', 'Outlet_Type_Grocery Store',
       'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2',
       'Outlet_Type_Supermarket Type3'],
      dtype='object')

In [11]:
X.shape

(8523, 33)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

### Ultra Baseline

In [13]:
X_tr, X_te, y_tr, y_te = train_test_split(np.array(df["Item_MRP"]).reshape(-1,1), df["Item_Outlet_Sales"], train_size=0.8)

In [14]:
rf1 = RandomForestRegressor()
rf1.fit(X_tr, y_tr)
y_pred1 = rf1.predict(X_te)
y_pred1_tr = rf1.predict(X_tr)
print_metrics(y_te, y_pred1, y_tr, y_pred1_tr)

--r2--
train -0.5917113220744485
test  -0.6051341972109086
--MSE--
train 4627743.5058144005
test  4701109.558757861
--MAE--
train 1656.1533448644436
test  1689.526960459872


### Baseline Random Forest

In [15]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

RandomForestRegressor()

In [16]:
y_pred_train = rf.predict(X_train)
y_pred = rf.predict(X_test)

In [17]:
print_metrics(y_test, y_pred, y_train, y_pred_train)

--r2--
train 0.9366372971951553
test  0.5524117614924162
--MSE--
train 184220.802069694
test  1310894.3477073754
--MAE--
train 295.64040040305076
test  807.0588849196481


In [18]:
rf.feature_importances_

array([0.0509055 , 0.10265819, 0.44542053, 0.0383726 , 0.00487461,
       0.00520374, 0.00425861, 0.00299964, 0.00180888, 0.00470193,
       0.00610574, 0.00558743, 0.00767613, 0.00256023, 0.00408993,
       0.00675756, 0.00296532, 0.0015856 , 0.00133232, 0.0075251 ,
       0.00494739, 0.00251231, 0.00219896, 0.00482057, 0.0064063 ,
       0.00468136, 0.00470253, 0.00373835, 0.00317409, 0.19454021,
       0.00213283, 0.00156587, 0.05718967])

### Grid Search Random Forest

In [19]:
params = {"max_features": ["auto", "sqrt"],
          "n_jobs": [6],
         "n_estimators": [75, 100, 125, 150, 200, 500],
         "min_samples_split": [20, 50, 75, 100],
         "min_samples_leaf": [20, 50, 75, 100]
         }

In [20]:
%%time
rf_grid = GridSearchCV(rf, params)
rf_grid.fit(X_train, y_train)

CPU times: user 2min 21s, sys: 25.2 s, total: 2min 47s
Wall time: 4min 45s


GridSearchCV(estimator=RandomForestRegressor(),
             param_grid={'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [20, 50, 75, 100],
                         'min_samples_split': [20, 50, 75, 100],
                         'n_estimators': [75, 100, 125, 150, 200, 500],
                         'n_jobs': [6]})

In [21]:
print(rf_grid.best_params_)
print(rf_grid.best_score_)

{'max_features': 'auto', 'min_samples_leaf': 50, 'min_samples_split': 50, 'n_estimators': 125, 'n_jobs': 6}
0.600092212964124


In [22]:
y_pred_train = rf_grid.predict(X_train)
y_pred = rf_grid.predict(X_test)

In [23]:
print_metrics(y_test, y_pred, y_train, y_pred_train)

--r2--
train 0.6285851530779407
test  0.5864683375693511
--MSE--
train 1079851.9944976657
test  1211149.6063567519
--MAE--
train 724.1485997455866
test  784.1927078976307


In [26]:
# best_mse_rf = 999999999

In [28]:
mse = metrics.mean_squared_error(y_test, y_pred)
if mse < best_mse_rf:
    best_mse_rf = mse
    best_params_rf = rf_grid.best_params_

print(best_mse_rf)
print(best_params_rf)

1211149.6063567519
{'max_features': 'auto', 'min_samples_leaf': 50, 'min_samples_split': 50, 'n_estimators': 125, 'n_jobs': 6}


### Baseline xgboost

In [29]:
from xgboost import XGBRegressor

In [30]:
xgb = XGBRegressor()
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [31]:
y_pred = xgb.predict(X_test)

In [32]:
print_metrics(y_test, y_pred, y_train, y_pred_train)

--r2--
train 0.6285851530779407
test  0.5376503661681564
--MSE--
train 1079851.9944976657
test  1354127.453562364
--MAE--
train 724.1485997455866
test  826.5906434100864


### Grid search xgboost

In [41]:
params = {"n_estimators": [5, 10, 20, 100],
         "max_depth": [3, 5, 10],
         "booster": ["gbtree", "dart"],
          "n_jobs": [6],
          "gamma": [0, 0.1, 1]
         }

In [42]:
%%time
xgb_grid = GridSearchCV(xgb, params)
xgb_grid.fit(X_train, y_train)

CPU times: user 6min 34s, sys: 18.5 s, total: 6min 53s
Wall time: 1min 10s


GridSearchCV(estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                    colsample_bylevel=1, colsample_bynode=1,
                                    colsample_bytree=1, gamma=0, gpu_id=-1,
                                    importance_type='gain',
                                    interaction_constraints='',
                                    learning_rate=0.300000012, max_delta_step=0,
                                    max_depth=6, min_child_weight=1,
                                    missing=nan, monotone_constraints='()',
                                    n_estimators=100, n_jobs=8,
                                    num_parallel_tree=1, random_state=0,
                                    reg_alpha=0, reg_lambda=1,
                                    scale_pos_weight=1, subsample=1,
                                    tree_method='exact', validate_parameters=1,
                                    verbosity=None),
             param_grid={

In [43]:
print(xgb_grid.best_params_)
print(xgb_grid.best_score_)

{'booster': 'gbtree', 'gamma': 0, 'max_depth': 3, 'n_estimators': 10, 'n_jobs': 6}
0.5978385693596501


In [44]:
y_pred_train = xgb.predict(X_train)
y_pred = xgb_grid.predict(X_test)

In [45]:
print_metrics(y_test, y_pred, y_train, y_pred_train)

--r2--
train 0.8550869105109211
test  0.5867887011654174
--MSE--
train 421320.49919490324
test  1210211.3269491096
--MAE--
train 467.5358646733129
test  781.2580793845437


In [46]:
#best_mse_xgb = 999999999

In [47]:
mse_xgb = metrics.mean_squared_error(y_test, y_pred)
if mse_xgb < best_mse_xgb:
    best_mse_xgb = mse_xgb
    best_params_xgb = xgb_grid.best_params_

print(best_mse_xgb)
print(best_params_xgb)

1206243.8325766828
{'booster': 'dart', 'max_depth': 5, 'n_estimators': 10}
