<a href="https://colab.research.google.com/github/giramakshay/retail_sales_prediction/blob/main/ML_Capstone_Regression_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Name: Retail Sales Prediction

##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Contributor**     - Akshay Giram

# **Project Summary**



In this project we will develop a machine learning model to predict sales for the Rossmann drug stores. Store managers need daily sales predictions up to six weeks in advance. The model will take into account various factors that affect sales, such as promotions, competition, school and state holidays, seasonality, and locality.

The dataset contains just over 1 million daily sales records (1,017,209) across 1,115 stores.

We will use linear regression to create a machine learning model for prediction of sales.

# **Problem Statement**

***To generate day-wise sales predictions for the upcoming six weeks from sales data across 1,115 Rossmann drug stores.***

The prediction model will use linear regression as the machine learning algorithm.

### **Overview of project structure:** ###
* EDA:  Understanding the data, features, and their relations
* Data clean up: Handling missing values and outliers
* Feature Engineering: Feature encoding and feature selection
* Preprocessing: scaling/standardization, data wrangling
* Model Implementation: model selection, hyperparameter tuning, regularization
* Model explainability: model performance, feature importance, conclusion

# EDA

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
sales_df = pd.read_csv("https://github.com/giramakshay/retail_sales_prediction/raw/main/dataset/Rossmann%20Stores%20Data.csv",
                       dtype={'StateHoliday':str}, parse_dates=['Date']) # setting dtype to avoid mixed content warning
stores_df = pd.read_csv("https://github.com/giramakshay/retail_sales_prediction/raw/main/dataset/store.csv")
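The `dtype={'StateHoliday': str}` argument above matters because the column mixes the digit `0` with the letters `a`, `b`, and `c`. The following is a minimal sketch of the effect on a tiny in-memory CSV; the three rows in `csv_text` are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real sales file.
csv_text = "Store,StateHoliday\n1,0\n2,a\n3,0\n"

# Forcing str keeps every value as text, so '0' and 'a' share one type.
demo = pd.read_csv(io.StringIO(csv_text), dtype={"StateHoliday": str})
values = sorted(demo["StateHoliday"].unique())
print(values)
```

With the column read as text, later comparisons such as `df['StateHoliday'] == '0'` behave consistently for every row.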

In [3]:
sales_df.sample(3)

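| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 621308 | 1039 | 7 | 2013-12-22 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80890 | 611 | 3 | 2015-05-20 | 6085 | 427 | 1 | 1 | 0 | 0 |
| 625873 | 29 | 2 | 2013-12-17 | 11765 | 873 | 1 | 1 | 0 | 0 |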


In [4]:
stores_df.sample(3)

| | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| 940 | 941 | a | a | 1200.0 | 12.0 | 2011.0 | 1 | 31.0 | 2013.0 | Jan,Apr,Jul,Oct |
| 228 | 229 | d | c | 17410.0 | 4.0 | 2007.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 436 | 437 | c | c | 430.0 | NaN | NaN | 1 | 50.0 | 2010.0 | Jan,Apr,Jul,Oct |


In [5]:
sales_df.shape

(1017209, 9)

In [6]:
stores_df.shape

(1115, 10)
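Both frames share the `Store` column, with many sales rows per store and one stores row per store. One possible way to combine them is a left join on `Store`; this is only a sketch on two made-up frames (`demo_sales`, `demo_stores`), not necessarily the step taken later in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of sales_df and stores_df.
demo_sales = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
demo_stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# Left join keeps every sales row; validate raises an error if a store id
# appears more than once in the right-hand frame.
merged = demo_sales.merge(
    demo_stores, on="Store", how="left", validate="many_to_one"
)
print(merged)
```

The `validate="many_to_one"` check is a cheap guard against accidentally duplicating sales rows during the join.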

In [7]:
sales_df['Date'].dt.year.value_counts()

2013    406974
2014    373855
2015    236380
Name: Date, dtype: int64

We have data for 2013, 2014 and 2015. The lower row count for 2015 suggests that this year is only partially covered.
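Yearly counts alone do not show where the data starts and stops. A minimal sketch of checking the date range directly is given here, using a toy frame `demo` with made-up dates; on the real data the same calls would be applied to `sales_df['Date']`.

```python
import pandas as pd

# Hypothetical dates, used only to illustrate the calls.
demo = pd.DataFrame(
    {"Date": pd.to_datetime(["2013-01-01", "2014-06-15", "2015-07-31"])}
)

# Earliest and latest dates present in the column.
first_day = demo["Date"].min()
last_day = demo["Date"].max()

# Rows per year, ordered chronologically rather than by count.
rows_per_year = demo["Date"].dt.year.value_counts().sort_index()
print(first_day.date(), last_day.date())
print(rows_per_year)
```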

## **Dataset Information**

In [8]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   Store          1017209 non-null  int64         
 1   DayOfWeek      1017209 non-null  int64         
 2   Date           1017209 non-null  datetime64[ns]
 3   Sales          1017209 non-null  int64         
 4   Customers      1017209 non-null  int64         
 5   Open           1017209 non-null  int64         
 6   Promo          1017209 non-null  int64         
 7   StateHoliday   1017209 non-null  object        
 8   SchoolHoliday  1017209 non-null  int64         
dtypes: datetime64[ns](1), int64(7), object(1)
memory usage: 69.8+ MB


In [9]:
stores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
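The `info()` output above shows that several `stores_df` columns have fewer than 1115 non-null values (for example `CompetitionDistance` with 1112 and the `Promo2Since*` columns with 571). A compact way to tabulate such gaps is sketched below on a made-up frame `demo_stores`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical four-store frame with a few gaps.
demo_stores = pd.DataFrame(
    {
        "Store": [1, 2, 3, 4],
        "CompetitionDistance": [1200.0, np.nan, 430.0, 17410.0],
        "Promo2SinceWeek": [31.0, np.nan, np.nan, 14.0],
    }
)

# Count of missing values per column, and the same as a percentage of rows.
missing = demo_stores.isna().sum()
missing_pct = (missing / len(demo_stores) * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```

Applied to `stores_df`, the same two lines would give the per-column counts needed for the data clean-up step.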


## **Understanding variables**

As we can see, the sales dataset has over a million records (1,017,209) and 9 attributes. Brief details of each attribute follow.

### **Sales dataset:**

* **Store** - store id to uniquely identify a store
* **DayOfWeek** - day of the week (in numeric form)
* **Date** - calendar date of the record
* **Sales** - total sales for the given store on the given day (target variable)
* **Customers** - total number of customers in the given store on the given day
* **Open** - whether the store was open: 0 = closed, 1 = open
* **Promo** - whether the store is running a promo on that day: 0 = no, 1 = yes
* **StateHoliday** - indicates a state holiday: a = public holiday, b = Easter holiday, c = Christmas, 0 = none. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends.
* **SchoolHoliday** - whether the (Store, Date) was affected by the closure of public schools: 0 = no, 1 = yes
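The single-character `StateHoliday` codes listed above can be hard to read in plots and tables. A small sketch of translating them into labels with `Series.map` follows; the dictionary name `STATE_HOLIDAY_LABELS` and the sample series are illustrative assumptions, while the code-to-meaning pairs come from the list above.

```python
import pandas as pd

# Code meanings taken from the attribute description; the name is hypothetical.
STATE_HOLIDAY_LABELS = {
    "0": "none",
    "a": "public holiday",
    "b": "Easter holiday",
    "c": "Christmas",
}

# Made-up sample of codes, stored as strings as in sales_df.
demo_codes = pd.Series(["0", "a", "b", "c", "0"], name="StateHoliday")

# map() replaces each code with its label; unknown codes would become NaN.
labels = demo_codes.map(STATE_HOLIDAY_LABELS)
print(labels.tolist())
```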



### **Stores dataset:**

* **Store** - store id to uniquely identify a store
* **StoreType** - differentiates between 4 different store models: a, b, c, d
* **Assortment** - describes an assortment level: a = basic, b = extra, c = extended
* **CompetitionDistance** - distance in meters to the nearest competitor store
* **CompetitionOpenSinceMonth** - the approximate month of the time the nearest competitor was opened
* **CompetitionOpenSinceYear** - the approximate year of the time the nearest competitor was opened
* **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* **Promo2SinceWeek** - the calendar week when the store started participating in Promo2
* **Promo2SinceYear** - the year when the store started participating in Promo2
* **PromoInterval** - the months in which each new round of Promo2 starts. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store
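Since `PromoInterval` is stored as a comma-separated string of month abbreviations, it has to be parsed before it can be compared with a record's date. The helpers below are a hypothetical sketch of that parsing (the function names are not from the notebook). They assume English month abbreviations, truncate each token to three letters in case longer spellings appear, and treat a missing value as "no months".

```python
from datetime import date

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def promo2_restart_months(promo_interval):
    """Return the set of month numbers (1-12) named in a PromoInterval string."""
    # Missing values (NaN/None) and empty strings mean no Promo2 rounds.
    if not isinstance(promo_interval, str) or not promo_interval.strip():
        return set()
    months = set()
    for token in promo_interval.split(","):
        key = token.strip()[:3].title()
        if key in MONTH_ABBR:
            months.add(MONTH_ABBR.index(key) + 1)
    return months


def promo2_restarts_in_month(promo_interval, day):
    """True if a new Promo2 round starts in the month of the given date."""
    return day.month in promo2_restart_months(promo_interval)


print(promo2_restart_months("Feb,May,Aug,Nov"))
print(promo2_restarts_in_month("Jan,Apr,Jul,Oct", date(2015, 7, 31)))
```

A flag built this way would also need `Promo2`, `Promo2SinceWeek`, and `Promo2SinceYear` to confirm that the store had already joined Promo2 on the given date.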