# Introduction
This exercise performs a Time Series Analysis (TSA) on the Retail Data Analytics dataset from Kaggle: https://www.kaggle.com/manjeetsingh/retaildataset

The dataset consists of three CSV files (features, sales, and stores), which are merged into a single table below.

In [1]:
# Import the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Read the datasets
# The Date columns are stored as DD/MM/YYYY, so parse them day-first;
# otherwise ambiguous dates such as 05/02/2010 are read as 2 May instead of 5 February
features = pd.read_csv("Features data set.csv", parse_dates = ['Date'], dayfirst = True)
sales = pd.read_csv("sales data-set.csv", parse_dates = ['Date'], dayfirst = True)
stores = pd.read_csv("stores data-set.csv")

display(features.head())
display(sales.head())
display(stores.head())

Unnamed: 0,Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False


Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday
0,1,1,2010-02-05,24924.5,False
1,1,1,2010-02-12,46039.49,True
2,1,1,2010-02-19,41595.55,False
3,1,1,2010-02-26,19403.54,False
4,1,1,2010-03-05,21827.9,False


Unnamed: 0,Store,Type,Size
0,1,A,151315
1,2,A,202307
2,3,B,37392
3,4,A,205863
4,5,B,34875
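
Date strings such as `05/02/2010` are ambiguous between day-first and month-first readings, so it is worth checking that the parsed `Date` column really advances one week at a time. The sketch below assumes the CSV files store dates as DD/MM/YYYY; the `raw` values are hypothetical stand-ins, not rows read from the files.

```python
import pandas as pd

# Hypothetical raw date strings in DD/MM/YYYY form (an assumption about the CSV contents)
raw = pd.Series(["05/02/2010", "12/02/2010", "19/02/2010", "26/02/2010", "05/03/2010"])

# By default pandas reads an ambiguous string month-first
month_first = pd.to_datetime("05/02/2010")
print(month_first.month)  # month component under the default reading

# An explicit format removes the ambiguity
dates = pd.to_datetime(raw, format="%d/%m/%Y")

# Weekly data should advance in steps of exactly 7 days
steps = dates.diff().dropna()
is_weekly = bool((steps == pd.Timedelta(days=7)).all())
print(dates.dt.strftime("%Y-%m-%d").tolist())
print("weekly spacing:", is_weekly)
```

If the spacing check fails on the real data, the date format is the first thing to inspect.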


## Merge the datasets
This section merges the three datasets into one DataFrame.

NOTE: `IsHoliday` appears in both `features` and `sales`. Including it in the merge key keeps a single `IsHoliday` column instead of two suffixed duplicates.

In [10]:
df = pd.merge(features, sales, on = ['Store', 'Date', 'IsHoliday'], suffixes = ('_features', '_sales'))
df = pd.merge(df, stores, on = 'Store')

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Store         421570 non-null  int64         
 1   Date          421570 non-null  datetime64[ns]
 2   Temperature   421570 non-null  float64       
 3   Fuel_Price    421570 non-null  float64       
 4   MarkDown1     150681 non-null  float64       
 5   MarkDown2     111248 non-null  float64       
 6   MarkDown3     137091 non-null  float64       
 7   MarkDown4     134967 non-null  float64       
 8   MarkDown5     151432 non-null  float64       
 9   CPI           421570 non-null  float64       
 10  Unemployment  421570 non-null  float64       
 11  IsHoliday     421570 non-null  bool          
 12  Dept          421570 non-null  int64         
 13  Weekly_Sales  421570 non-null  float64       
 14  Type          421570 non-null  object        
 15  Size          421
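
A merge on several keys can silently drop or duplicate rows, so a quick check can be useful. The following is a minimal sketch on made-up toy tables (`features_toy` and `sales_toy` are hypothetical names, not the real data). It uses the `validate` and `indicator` options of `pd.merge` and assumes each (`Store`, `Date`, `IsHoliday`) combination occurs once in the features table.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
features_toy = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "IsHoliday": [False, True],
    "CPI": [211.1, 211.2],
})
sales_toy = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Dept": [1, 2, 1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-05", "2010-02-12", "2010-02-12"]),
    "IsHoliday": [False, False, True, True],
    "Weekly_Sales": [100.0, 50.0, 120.0, 60.0],
})

# validate raises pandas.errors.MergeError if the right-hand keys are not unique;
# indicator adds a `_merge` column recording where each row was found
merged = pd.merge(
    sales_toy,
    features_toy,
    on=["Store", "Date", "IsHoliday"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows of sales that found no matching feature row
unmatched = int((merged["_merge"] != "both").sum())
print(len(merged), "rows,", unmatched, "without a feature match")
```

A left merge from the sales side keeps every sales row, so the row count of the result can be compared directly with `len(sales_toy)`.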

In [12]:
print(df.Dept.unique())
print(df.Store.unique())

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31 32 33 34 35 36 37 38 40 41 42 44 45 46 47 48 49 51 52
 54 55 56 58 59 60 67 71 72 74 79 80 81 82 83 85 87 90 91 92 93 94 95 97
 98 78 96 99 77 39 50 43 65]
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45]


## Clean the data
This section cleans the data in three steps:
- Deal with null values
- Deal with outliers
- Deal with unexpected values
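
The three steps above could be sketched as follows. This is an illustration on a made-up `toy` frame, not the cleaning actually applied in this notebook. The choices in it are assumptions: filling missing markdowns with 0, using the 1.5 × IQR rule for outliers, and treating negative sales as unexpected.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "MarkDown1": [np.nan, 500.0, np.nan, 120.0, 80.0, np.nan],
    "Weekly_Sales": [200.0, 220.0, -15.0, 210.0, 5000.0, 190.0],
})

# 1. Null values: assume a missing markdown means no markdown was running
markdown_cols = [c for c in toy.columns if c.startswith("MarkDown")]
toy[markdown_cols] = toy[markdown_cols].fillna(0.0)

# 2. Outliers: flag values outside 1.5 * IQR of the quartiles
q1, q3 = toy["Weekly_Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (toy["Weekly_Sales"] < q1 - 1.5 * iqr) | (toy["Weekly_Sales"] > q3 + 1.5 * iqr)

# 3. Unexpected values: weekly sales should not be negative (assumption)
negative = toy["Weekly_Sales"] < 0

print(toy.assign(outlier=outlier, negative=negative))
```

The sketch only flags rows; whether to drop, cap, or keep them depends on what the flagged values turn out to be.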

# Data exploration and visualizations
- Sales over time, to check for seasonality
- Sales compared to temperature
- Sales compared to unemployment rate
- Sales compared to fuel price
- Average sales for holidays vs non-holidays
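
The comparisons listed above mostly reduce to a few aggregations. A minimal sketch on a made-up `toy` frame that reuses the column names of the merged data (the values are invented for illustration):

```python
import pandas as pd

# Made-up rows with the same column names as the merged DataFrame
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05"] * 2 + ["2010-02-12"] * 2 + ["2010-02-19"] * 2),
    "Store": [1, 2, 1, 2, 1, 2],
    "Weekly_Sales": [100.0, 200.0, 150.0, 250.0, 110.0, 190.0],
    "IsHoliday": [False, False, True, True, False, False],
    "Temperature": [40.0, 35.0, 38.0, 33.0, 45.0, 41.0],
})

# Total sales per week across all stores: the series to inspect for seasonality
weekly_total = toy.groupby("Date")["Weekly_Sales"].sum()

# Average sales for holiday vs non-holiday rows
holiday_mean = toy.groupby("IsHoliday")["Weekly_Sales"].mean()

# Pearson correlation between sales and one candidate driver
corr = toy["Weekly_Sales"].corr(toy["Temperature"])

print(weekly_total)
print(holiday_mean)
print("correlation with temperature:", round(corr, 3))
```

The same `corr` call applies to `Unemployment` and `Fuel_Price`, and `weekly_total` can be passed to a line plot to look for repeating patterns.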

Check whether the target (`Weekly_Sales`) shows seasonality and, if so, what kind.
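
One simple way to look for seasonality is the autocorrelation of the weekly series at candidate lags. The sketch below uses a synthetic series with a built-in 52-week cycle, which is an assumption made for illustration and not a finding about the retail data.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series: level + yearly sine cycle + noise (illustrative only)
rng = np.random.default_rng(0)
weeks = pd.date_range("2010-02-05", periods=156, freq="W-FRI")
t = np.arange(len(weeks))
sales = pd.Series(
    1000 + 100 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 10, len(weeks)),
    index=weeks,
)

# A yearly pattern shows up as high autocorrelation at lag 52
# and strongly negative autocorrelation at the half-period, lag 26
acf_year = sales.autocorr(lag=52)
acf_half = sales.autocorr(lag=26)

print("lag 52:", round(acf_year, 2))
print("lag 26:", round(acf_half, 2))
```

This only addresses the period of the pattern. Whether the seasonal swing stays constant or grows with the sales level (additive vs multiplicative) would need a separate look.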

# Model training and predicting
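
No model has been chosen yet, so the following is only a possible starting point: a chronological train/test split with a seasonal naive baseline, which any later model would have to beat. The synthetic series, the 52-week season, and the 26-week horizon are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series with a yearly cycle (illustrative only)
rng = np.random.default_rng(1)
t = np.arange(156)
series = pd.Series(
    1000 + 100 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 10, t.size),
    index=pd.date_range("2010-02-05", periods=t.size, freq="W-FRI"),
)

season = 52   # assumed season length in weeks
horizon = 26  # assumed number of weeks held out for testing

# Chronological split: never shuffle a time series
train, test = series.iloc[:-horizon], series.iloc[-horizon:]

# Seasonal naive: the forecast for a week is the value observed 52 weeks earlier
seasonal_naive = pd.Series(
    train.iloc[-season:-season + horizon].to_numpy(), index=test.index
)

# Reference forecast: the training mean for every test week
mean_forecast = pd.Series(train.mean(), index=test.index)


def mae(actual, predicted):
    """Mean absolute error between two aligned series."""
    return float((actual - predicted).abs().mean())


print("seasonal naive MAE:", round(mae(test, seasonal_naive), 1))
print("mean forecast MAE:", round(mae(test, mean_forecast), 1))
```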
