## Big Mart Sales

## Problem Statement

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.


## About the Dataset
The data consists of a train set (8523 rows) and a test set (5681 rows). The train set contains both the input variables and the output variable; the task is to predict the sales for the test set.

|Features|Description|
|-----|-----|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|
|source|Whether the row comes from the train or the test data set|

 

### Load the data

Note that this data has already been preprocessed: missing values have been imputed, and the features have been engineered and encoded.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

sales_data = pd.read_csv('./data/sales.csv',index_col=False)

sales_data.head()

|Unnamed: 0|Item_Identifier|Item_MRP|Item_Outlet_Sales|Item_Type|Item_Visibility|Item_Weight|Outlet_Establishment_Year|source|Outlet_Years|Item_Fat_Content_0|...|Outlet_Identifier_0|Outlet_Identifier_1|Outlet_Identifier_2|Outlet_Identifier_3|Outlet_Identifier_4|Outlet_Identifier_5|Outlet_Identifier_6|Outlet_Identifier_7|Outlet_Identifier_8|Outlet_Identifier_9|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|0|FDA15|249.8092|3735.138|Dairy|0.016047|9.3|1999|train|14|1|...|0|0|0|0|0|0|0|0|0|1|
|1|DRC01|48.2692|443.4228|Soft Drinks|0.019278|5.92|2009|train|4|0|...|0|0|0|1|0|0|0|0|0|0|
|2|FDN15|141.618|2097.27|Meat|0.01676|17.5|1999|train|14|1|...|0|0|0|0|0|0|0|0|0|1|
|3|FDX07|182.095|732.38|Fruits and Vegetables|0.0|19.2|1998|train|15|0|...|1|0|0|0|0|0|0|0|0|0|
|4|NCD19|53.8614|994.7052|Household|0.0|8.93|1987|train|26|0|...|0|1|0|0|0|0|0|0|0|0|


### Data preparation 

In [2]:
# Drop the original columns that have already been converted into derived features:
sales_data.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)

# Drop columns that are not needed for modelling:
sales_data.drop(['source'],axis=1,inplace=True)

In [3]:
sales_data.head()

|Unnamed: 0|Item_Identifier|Item_MRP|Item_Outlet_Sales|Item_Visibility|Item_Weight|Outlet_Years|Item_Fat_Content_0|Item_Fat_Content_1|Item_Fat_Content_2|Outlet_Location_Type_0|...|Outlet_Identifier_0|Outlet_Identifier_1|Outlet_Identifier_2|Outlet_Identifier_3|Outlet_Identifier_4|Outlet_Identifier_5|Outlet_Identifier_6|Outlet_Identifier_7|Outlet_Identifier_8|Outlet_Identifier_9|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|0|FDA15|249.8092|3735.138|0.016047|9.3|14|1|0|0|1|...|0|0|0|0|0|0|0|0|0|1|
|1|DRC01|48.2692|443.4228|0.019278|5.92|4|0|0|1|0|...|0|0|0|1|0|0|0|0|0|0|
|2|FDN15|141.618|2097.27|0.01676|17.5|14|1|0|0|1|...|0|0|0|0|0|0|0|0|0|1|
|3|FDX07|182.095|732.38|0.0|19.2|15|0|0|1|0|...|1|0|0|0|0|0|0|0|0|0|
|4|NCD19|53.8614|994.7052|0.0|8.93|26|0|1|0|0|...|0|1|0|0|0|0|0|0|0|0|


### Let's plot and fit a model using polynomial features of degree 15
**Note: The polynomial model takes a long time to fit on all the records, so use a subset of rows.**
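
As a rough illustration of what a degree-15 fit does, the sketch below compares polynomial degrees 1, 3 and 15 on a small sample. The data here is a synthetic stand-in (an assumption, not the BigMart file); with the real data you could take a subset such as `sales_data.sample(n=500, random_state=0)` and use `Item_MRP` as the single predictor of `Item_Outlet_Sales`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in (assumption): one predictor with a curved relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for degree in (1, 3, 15):
    # Scaling to [0, 1] first keeps the high powers numerically manageable.
    model = make_pipeline(
        MinMaxScaler(), PolynomialFeatures(degree), LinearRegression()
    )
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    scores[degree] = (train_r2, test_r2)
    print(f"degree={degree:2d}  train R2={train_r2:.3f}  test R2={test_r2:.3f}")
```

Comparing the train and test scores per degree is one simple way to see whether the extra flexibility of degree 15 is actually generalizing or only fitting noise.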

### Advantages of using Polynomial Regression
* A broad range of functions can be fitted with it.
* Polynomials can fit a wide range of curvature.
* A polynomial of sufficiently high degree can closely approximate the relationship between the dependent and independent variables.
### Disadvantages of using Polynomial Regression
* Polynomial models are very sensitive to outliers.
* The presence of one or two outliers in the data can seriously affect the results of a nonlinear analysis.
* In addition, there are fewer model validation tools for detecting outliers in nonlinear regression than there are for linear regression.
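
The outlier sensitivity listed above can be made concrete with a small experiment. The sketch below uses made-up data (an assumption, not the BigMart file): it corrupts a single observation and measures how far the fitted value at that point moves for a straight line versus a degree-9 polynomial.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up, nearly linear data (assumption for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y_clean = 2 * x + rng.normal(scale=0.1, size=x.size)

# Corrupt a single observation (the last one) so it acts as an outlier.
y_outlier = y_clean.copy()
y_outlier[-1] += 10.0


def shift_at_outlier(degree):
    """Return how far the fitted value at the outlier's x moves
    once the outlier is added to the data."""
    clean_fit = Polynomial.fit(x, y_clean, degree)
    outlier_fit = Polynomial.fit(x, y_outlier, degree)
    return abs(outlier_fit(x[-1]) - clean_fit(x[-1]))


linear_shift = shift_at_outlier(1)
poly_shift = shift_at_outlier(9)
print(f"degree 1 shift: {linear_shift:.2f}")
print(f"degree 9 shift: {poly_shift:.2f}")
```

The higher-degree fit bends much further toward the corrupted point, because a more flexible model gives each individual observation more leverage over the curve.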

### Create a baseline regression model and observe its error
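
One possible baseline, sketched under assumptions: predict the mean of the training target for every row, then compare it with a single-feature linear regression. The `mrp` and `sales` arrays below are synthetic stand-ins for `Item_MRP` and `Item_Outlet_Sales`; they are not taken from the real file.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption) for Item_MRP and Item_Outlet_Sales.
rng = np.random.default_rng(42)
mrp = rng.uniform(30, 270, size=500)
sales = 15 * mrp + rng.normal(scale=800, size=500)

X = mrp.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, sales, test_size=0.3, random_state=42
)


def evaluate(model):
    """Fit on the train split and return (RMSE, R2) on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    return rmse, r2_score(y_test, pred)


mean_rmse, mean_r2 = evaluate(DummyRegressor(strategy="mean"))
lin_rmse, lin_r2 = evaluate(LinearRegression())
print(f"mean baseline : RMSE={mean_rmse:.1f}  R2={mean_r2:.3f}")
print(f"linear (1 var): RMSE={lin_rmse:.1f}  R2={lin_r2:.3f}")
```

Any later model should be judged against these numbers: if it cannot beat the mean predictor, the added complexity is not paying off.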

### What happens to the R-squared score if you increase the number of predictors in your model? Use all features for prediction and implement a linear regression model.
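
To explore this question, the sketch below fits ordinary least squares on a growing number of columns, most of which are pure noise. The data is synthetic (an assumption), and `adjusted_r2` is a small helper defined here, not a scikit-learn function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 2 informative columns followed by 20 noise columns.
rng = np.random.default_rng(1)
n = 200
signal = rng.normal(size=(n, 2))
noise_cols = rng.normal(size=(n, 20))
X = np.hstack([signal, noise_cols])
y = 3 * signal[:, 0] - 2 * signal[:, 1] + rng.normal(size=n)


def adjusted_r2(r2, n_samples, n_predictors):
    """Penalize R2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


train_r2 = []
for k in (2, 7, 12, 22):
    model = LinearRegression().fit(X[:, :k], y)
    r2 = model.score(X[:, :k], y)  # R2 on the training data
    train_r2.append(r2)
    print(f"predictors={k:2d}  R2={r2:.4f}  adjusted R2={adjusted_r2(r2, n, k):.4f}")
```

On the training data, R-squared cannot go down when predictors are added, even useless ones, which is why adjusted R-squared or a held-out test score is usually a better guide.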

### Does your model suffer from heteroskedasticity? How can you detect it?
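
A common visual check is to plot residuals against fitted values and look for a funnel shape. As a numeric complement, the sketch below computes a Breusch-Pagan-style LM statistic by hand (Koenker's studentized form: regress the squared residuals on the predictors and use n times the R-squared of that regression). The data is synthetic, with noise that deliberately grows with the predictor, so this is an illustration under assumptions rather than a result for the BigMart model.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): the noise scale grows with the predictor,
# so the errors are heteroskedastic by construction.
rng = np.random.default_rng(7)
n = 500
X = rng.uniform(1, 10, size=(n, 1))
y = 5 * X[:, 0] + rng.normal(scale=X[:, 0], size=n)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Auxiliary regression: do the predictors explain the squared residuals?
squared_residuals = residuals ** 2
aux = LinearRegression().fit(X, squared_residuals)
lm_stat = n * aux.score(X, squared_residuals)

# Under the null of constant variance, lm_stat is approximately
# chi-squared with one degree of freedom per predictor.
p_value = stats.chi2.sf(lm_stat, df=X.shape[1])
print(f"LM statistic={lm_stat:.2f}  p-value={p_value:.4g}")
```

A small p-value suggests the error variance is not constant. If statsmodels is available, its `het_breuschpagan` function offers a ready-made version of this kind of test.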

### Let's have a look at the coefficients of our model
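
One convenient way to inspect coefficients is to pair them with their column names in a pandas Series. The sketch below does this on a synthetic frame (an assumption) that borrows a few column names from the data dictionary; the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic frame (assumption) reusing a few column names from the data set.
rng = np.random.default_rng(3)
n = 500
features = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, size=n),
    "Item_Visibility": rng.uniform(0, 0.3, size=n),
    "Item_Weight": rng.uniform(5, 20, size=n),
    "Outlet_Years": rng.integers(4, 29, size=n),
})
target = (
    15 * features["Item_MRP"]
    - 2000 * features["Item_Visibility"]
    + rng.normal(scale=200, size=n)
)

model = LinearRegression().fit(features, target)

# Label each coefficient with its feature and sort by absolute size.
coefficients = pd.Series(model.coef_, index=features.columns)
coefficients = coefficients.sort_values(key=np.abs, ascending=False)
print(coefficients)
```

Raw coefficient sizes depend on the scale of each feature, so a large value for a small-range column such as `Item_Visibility` does not by itself mean that the feature matters more.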

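### How will you deal with non-linearity in your model?

Hint: Lasso, Ridge and Elastic Net regularization might be of some help. Polynomial features can capture the curvature, and the regularization penalty keeps the expanded model from overfitting.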
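
A minimal sketch of how the hint could be applied, assuming a curved relationship in synthetic data: expand the predictor into polynomial features so the model can bend, then let a regularized regressor control the extra coefficients. The `alpha` and `l1_ratio` values are arbitrary illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): a quadratic relationship a straight line cannot capture.
rng = np.random.default_rng(11)
X = rng.uniform(-2, 2, size=(300, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=11
)

# Plain linear fit for reference.
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

regularized = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.05, max_iter=10000),
    "elastic_net": ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10000),
}
poly_r2 = {}
for name, regressor in regularized.items():
    # Standardize the expanded features so the penalty treats them evenly.
    model = make_pipeline(
        PolynomialFeatures(degree=5, include_bias=False),
        StandardScaler(),
        regressor,
    )
    poly_r2[name] = model.fit(X_train, y_train).score(X_test, y_test)

print(f"plain linear test R2: {linear_r2:.3f}")
for name, r2 in poly_r2.items():
    print(f"poly + {name:11s} test R2: {r2:.3f}")
```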

**LASSO REGRESSION**

### Implementing Lasso regression and understanding how it differs from Ridge regression
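
Assuming the comparison intended here is between Lasso and Ridge, the sketch below fits both on synthetic data in which only 3 of 10 features carry signal, then counts how many coefficients each model sets exactly to zero. The penalty strength `alpha=0.5` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 3 informative features and 7 irrelevant ones.
rng = np.random.default_rng(21)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y = X @ true_coef + rng.normal(size=200)

ridge = Ridge(alpha=0.5).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero in each model.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print(f"exact zeros -> ridge: {ridge_zeros}, lasso: {lasso_zeros}")
```

The L1 penalty in Lasso can drive coefficients all the way to zero and so acts as a form of feature selection, while the L2 penalty in Ridge shrinks coefficients toward zero without usually eliminating them.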

### What if we created a bunch of train/test splits, calculated the testing accuracy for each, and averaged the results together?

Hint: Take a look at the concept of cross-validation.
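
A short sketch of both ideas using scikit-learn: `ShuffleSplit` reproduces the "many random train/test splits" approach described in the question, and `KFold` is standard k-fold cross-validation. The data is synthetic (an assumption), and R-squared is used as the score since this is a regression task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Synthetic regression data (assumption, not the BigMart file).
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = X @ np.array([4.0, -2.0, 0.0, 1.0]) + rng.normal(size=300)

model = LinearRegression()

# Ten independent random train/test splits, scored and averaged.
random_splits = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
split_scores = cross_val_score(model, X, y, cv=random_splits, scoring="r2")

# 5-fold cross-validation: every row is used for testing exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

print(f"random splits: mean R2={split_scores.mean():.3f} (std {split_scores.std():.3f})")
print(f"5-fold CV    : mean R2={kfold_scores.mean():.3f} (std {kfold_scores.std():.3f})")
```

Averaging over several splits gives a more stable estimate than a single train/test split, and the spread of the scores hints at how much the result depends on which rows land in the test set.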