## Exploratory analysis and visualisation

### Here we look at the data and try to gain insights from it

We begin by importing the necessary modules.

In [None]:
# Import Dependencies
%matplotlib inline

# Start Python Imports
import math, time, random, datetime

# Data Manipulation
import pandas as pd

# Visualization 
import matplotlib.pyplot as plt
# from quilt.data.ResidentMario import missingno_data
# import missingno as msno

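import seaborn as sns
# The 'seaborn-whitegrid' style name was deprecated in Matplotlib 3.6 and
# later removed, so a similar look is set through seaborn instead.
sns.set_style('whitegrid')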


from IPython.display import display

## Loading the data

In [None]:
# Import train & test data 
train = pd.read_csv('/Users/anthonymiyoro/Desktop/Rossmann-Store-Sales-master/train.csv')
test = pd.read_csv('/Users/anthonymiyoro/Desktop/Rossmann-Store-Sales-master/test.csv')
store=pd.read_csv('/Users/anthonymiyoro/Desktop/Rossmann-Store-Sales-master/store.csv')
# sample_submission = pd.read_csv('C:/Users/David/Desktop/rossmann/sample_submission.csv') # example of what a submission should look like

## Data description

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

##### Id - an Id that represents a (Store, Date) pair within the test set

##### Store - a unique Id for each store

##### Sales - the turnover for any given day (this is what you are predicting)

##### Customers - the number of customers on a given day

##### Open - an indicator for whether the store was open: 0 = closed, 1 = open

##### StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends.  
                      a = public holiday, 
                      b = Easter holiday, 
                      c = Christmas, 
                      0 = None
                      
##### SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools

##### StoreType - differentiates between 4 different store models: a, b, c, d

##### Assortment - describes an assortment level: a = basic, b = extra, c = extended

##### CompetitionDistance - distance in meters to the nearest competitor store

##### CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

##### Promo - indicates whether a store is running a promo on that day

##### Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 
                          0 = store is not participating, 
                          1 = store is participating
                          
##### Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2

##### PromoInterval - describes the consecutive intervals at which Promo2 is started, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store
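
To make the PromoInterval description concrete, here is a minimal sketch that checks whether a Promo2 round starts in the month of a given date. The helper `promo2_round_starts` and the column `Promo2Start` are hypothetical names used only for illustration, and the rows are toy values rather than real data. The month list uses "Sept" because that is the spelling found in the interval strings later in this notebook.

```python
import pandas as pd

# Month abbreviations as they appear in PromoInterval (note "Sept", not "Sep").
MONTH_ABBR = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']


def promo2_round_starts(date, promo_interval):
    """Return True if a Promo2 round starts in the month of `date`."""
    if not isinstance(promo_interval, str):
        # Missing interval: treated here as "store is not in Promo2".
        return False
    return MONTH_ABBR[date.month - 1] in promo_interval.split(',')


# Toy rows, not taken from the real files.
example = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-10', '2015-03-10', '2015-09-01']),
    'PromoInterval': ['Feb,May,Aug,Nov', 'Feb,May,Aug,Nov', 'Mar,Jun,Sept,Dec'],
})
example['Promo2Start'] = [
    promo2_round_starts(d, p)
    for d, p in zip(example['Date'], example['PromoInterval'])
]
print(example)
```

Splitting on commas rather than using a substring test avoids accidental matches between abbreviations.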

## What missing values are there?

Where are the holes in our data?

These are entries where a value is missing (NaN) instead of holding a value like the rest of the column.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
store.isnull().sum()
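
Raw counts are easier to judge next to the size of the table. The sketch below is one possible way to show the count and the percentage of missing values per column. `missing_summary` is a hypothetical helper, and `toy_store` is a small made-up frame standing in for the real `store` table.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (100 * counts / len(df)).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)


# Toy stand-in for the store table (values are illustrative only).
toy_store = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'CompetitionDistance': [1270.0, np.nan, 620.0, 14130.0],
    'PromoInterval': [np.nan, 'Jan,Apr,Jul,Oct', np.nan, np.nan],
})
summary = missing_summary(toy_store)
print(summary)
```

The same function could be called on `train`, `test` and `store` in turn.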

## What datatypes are in the dataframe?

In [None]:
train['Date'] = pd.to_datetime(train['Date'], dayfirst=False)
test['Date'] = pd.to_datetime(test['Date'], dayfirst=False)

In [None]:
# Different data types in the dataset
train.dtypes

In [None]:
# Different data types in the dataset
test.dtypes

## Let's explore each of these features individually

### Target Feature: Sales

Description:  the turnover for any given day

In [None]:
train.head(3)

There are 3 types of state holiday: a, b and c. We will represent them as 1, 2 and 3, with 0 meaning no holiday.

In [None]:
train['StateHoliday'].unique()

We convert the categorical StateHoliday values to numeric data later in the notebook.

### The different values of SchoolHoliday

In [None]:
train['SchoolHoliday'].unique()

### The Open indicator

Open indicates whether the store was open on a given day: 0 = closed, 1 = open.

In [None]:
train['Open'].unique()

## Analysing the test data set

In [None]:
test.shape

In [None]:
# Date was already parsed to datetime above, so the .dt accessor is available
test['Year'] = test['Date'].dt.year
test['Month'] = test['Date'].dt.month

In [None]:
test.head()

### Here we look at all the records for a particular store. Data is recorded for every day of the store's operation over the period covered by the data set (2013 to 2015).

In [None]:
# .ix was removed from pandas; .loc with a boolean mask selects the same rows
print(train.loc[train['Store'] == 22])

### We now take a look at the StateHoliday feature. '0' represents no state holiday while 'a' represents a public holiday. We can change these to 0 and 1

In [None]:
test['StateHoliday'].unique()

In [None]:
test['SchoolHoliday'].unique()

The categorical StateHoliday values can then be converted into numerical data.

In [None]:
test.dtypes

In [None]:
test.head(5)

In [None]:
train.head(5)

# Store Dataset

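#### The store types could be converted to numerical data types as shown here. The conversion is left commented out so that the letters a to d remain available for plotting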

In [None]:
# store.loc[store['StoreType'] == 'a', 'StoreType'] = 1
# store.loc[store['StoreType'] == 'b', 'StoreType'] = 2
# store.loc[store['StoreType'] == 'c', 'StoreType'] = 3
# store.loc[store['StoreType'] == 'd', 'StoreType'] = 4
# store['StoreType'] = store['StoreType'].astype(int, copy=False)

In [None]:
print('levels :', store['StoreType'].unique(), '; data type :', store['StoreType'].dtype)

#### The assortment describes the 3 assortment levels a store can carry. We will convert this categorical data to numerical data

In [None]:
# Map each assortment letter to an integer code.
# Any value outside the mapping becomes NaN, which makes astype(int) fail loudly.
assortment_codes = {'a': 1, 'b': 2, 'c': 3}
store['Assortment'] = store['Assortment'].map(assortment_codes).astype(int)

In [None]:
print('levels :', store['Assortment'].unique(), '; data type :', store['Assortment'].dtype)

In [None]:
store['PromoInterval'].unique()

In [None]:
store.dtypes

#### PromoInterval - describes the consecutive intervals at which Promo2 is started, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store

We will now fill in the blank fields in the store dataset with 0s.

In [None]:
store = store.fillna(0)
store.head(5)
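
One caveat with a blanket fill: after `fillna(0)`, a missing CompetitionDistance looks the same as a competitor at distance zero. The sketch below shows one possible way to keep that information by recording a missingness flag before filling. The column name `CompetitionDistanceMissing` is hypothetical and the frame is toy data, so treat this as an optional idea rather than part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the store table (values are illustrative only).
toy_store = pd.DataFrame({
    'Store': [1, 2, 3],
    'CompetitionDistance': [1270.0, np.nan, 620.0],
    'Promo2SinceYear': [np.nan, 2010.0, np.nan],
})

# Record which rows had no distance before the fill overwrites the NaN.
toy_store['CompetitionDistanceMissing'] = (
    toy_store['CompetitionDistance'].isnull().astype(int)
)

# Same blanket fill as in the notebook.
filled = toy_store.fillna(0)
print(filled)
```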

### There are 3 different promotion intervals. We will convert these to numerical data by representing them with 1, 2 and 3, while 0 means the store has no interval

In [None]:
# Map each interval string to an integer code; the 0 written by fillna stays 0.
# Any value outside the mapping becomes NaN, which makes astype(int) fail loudly.
interval_codes = {0: 0,
                  'Jan,Apr,Jul,Oct': 1,
                  'Feb,May,Aug,Nov': 2,
                  'Mar,Jun,Sept,Dec': 3}
store['PromoInterval'] = store['PromoInterval'].map(interval_codes).astype(int)

### We then merge the train and store datasets and export the result below

In [None]:
train_store = pd.merge(train, store, how = 'left', on='Store')

In [None]:
train_store.head(5)
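
A left join on `Store` should keep every training row and attach exactly one store record to each. The following sketch, on toy frames with made-up values, shows how the `validate` and `indicator` options of `pd.merge` can be used to check this. The frame names are hypothetical stand-ins for `train` and `store`.

```python
import pandas as pd

# Toy stand-ins for the train and store tables (values are illustrative only).
toy_train = pd.DataFrame({
    'Store': [1, 1, 2, 3],
    'Sales': [5263, 5020, 6064, 8314],
})
toy_store = pd.DataFrame({
    'Store': [1, 2],
    'StoreType': ['c', 'a'],
})

# validate='many_to_one' raises if a Store id appears twice in the right frame;
# indicator=True adds a '_merge' column describing where each row came from.
merged = pd.merge(toy_train, toy_store, how='left', on='Store',
                  validate='many_to_one', indicator=True)

# A many-to-one left join keeps the row count of the left frame.
assert len(merged) == len(toy_train)

# Rows marked 'left_only' had no matching store record.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```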

In [None]:
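# to_csv returns None when given a path, so there is nothing to assign.
# index=False leaves the row index out of the file.
train_store.to_csv(r'/Users/anthonymiyoro/Desktop/train_store.csv', index=False, header=True)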

## Visual Exploration

#### We will begin by graphing sales over time, and then sales per store type

In [None]:
train_store.set_index('Date', drop=False, inplace=True)

The total sales per month are shown below:

In [None]:
y = train_store['Sales'].resample('MS').sum()

y['2013':]
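
For readers new to `resample`, here is a small sketch of what the monthly aggregation does. It relies on the frame being indexed by date, which is why `Date` was set as the index above. The frame `toy` holds made-up values; `'MS'` labels each bucket with the first day of its month.

```python
import pandas as pd

# Toy daily records for several stores (values are illustrative only).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2013-01-01', '2013-01-01',
                            '2013-01-15', '2013-02-03']),
    'Store': [1, 2, 1, 1],
    'Sales': [100, 200, 50, 70],
}).set_index('Date')

# Sum sales over all stores and all days within each calendar month.
monthly = toy['Sales'].resample('MS').sum()
print(monthly)
```

Note that the sum runs over every store at once, so the resulting series describes the whole chain rather than a single store.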

In [None]:
y['2013'].plot(figsize=(15, 6))
plt.show()

In [None]:
y['2014'].plot(figsize=(15, 6))
plt.show()

In [None]:
y['2015'].plot(figsize=(15, 6))
plt.show()

In [None]:
sns.barplot(x="StoreType", y="Sales", data=train_store, order=["a", "b", "c", "d"])

Then we will look at the different store assortments.

##### Assortment - describes an assortment level: a = basic, b = extra, c = extended (encoded above as 1, 2 and 3)

In [None]:
sns.barplot(x="Assortment", y="Sales", data=train_store)

We can also look at the different State Holidays

##### StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends.  
                      a = public holiday, 
                      b = Easter holiday, 
                      c = Christmas, 
                      0 = None

In [None]:
sns.barplot(x="StateHoliday", y="Sales", data=train_store)

Next, we look at how the PromoInterval affects the sales.

##### PromoInterval - describes the consecutive intervals at which Promo2 is started, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store

1 == 'Jan,Apr,Jul,Oct' <br>
2 == 'Feb,May,Aug,Nov' <br>
3 == 'Mar,Jun,Sept,Dec'<br>

In [None]:
sns.barplot(x="PromoInterval", y="Sales", data=train_store)

#### We can also look at the average sales per day of the week

It appears as though Sunday has the lowest sales.

In [None]:
sns.barplot(x="DayOfWeek", y="Sales", data=train_store);
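
The bar plot shows a mean per day, and that mean includes days on which a store was closed. If closed days are recorded with zero sales, they pull the average down. The sketch below compares the two views on toy data with made-up values, assuming DayOfWeek runs from 1 (Monday) to 7 (Sunday).

```python
import pandas as pd

# Toy records (values are illustrative only); one Sunday row is a closed store.
toy = pd.DataFrame({
    'DayOfWeek': [1, 1, 6, 7, 7],
    'Open':      [1, 1, 1, 0, 1],
    'Sales':     [6000, 8000, 5000, 0, 3000],
})

# Mean over every row, which is what the bar plot summarises.
mean_all = toy.groupby('DayOfWeek')['Sales'].mean()

# Mean over rows where the store was actually open.
mean_open = toy[toy['Open'] == 1].groupby('DayOfWeek')['Sales'].mean()

print(pd.DataFrame({'all_days': mean_all, 'open_days': mean_open}))
```

Running the same comparison on `train_store` would show how much of the Sunday dip comes from closures rather than from lower sales in open stores.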

There also appear to be more sales when it isn't a public holiday.

In [None]:
# Map the holiday codes to integers: 0 = none, 1 = public, 2 = Easter, 3 = Christmas.
# Casting to str first also covers any zeros parsed as the integer 0
# rather than the string '0'.
holiday_codes = {'0': 0, 'a': 1, 'b': 2, 'c': 3}
train['StateHoliday'] = train['StateHoliday'].astype(str).map(holiday_codes)
train['StateHoliday'] = train['StateHoliday'].astype(int)

In [None]:
train.head(10)

In [None]:
train['Date']=pd.to_datetime(train['Date'],dayfirst=False)

In [None]:
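Date = train['Date']

# Series.dt.week and Series.dt.weekofyear were removed in pandas 2.0;
# isocalendar().week returns the same ISO week number.
iso_week = Date.dt.isocalendar().week

# Break the date into its calendar components
pd.DataFrame({"year": Date.dt.year,
              "month": Date.dt.month,
              "day": Date.dt.day,
              #"hour": Date.dt.hour,
              "dayofyear": Date.dt.dayofyear,
              "week": iso_week,
              "weekofyear": iso_week,
              "dayofweek": Date.dt.dayofweek,
              "weekday": Date.dt.weekday,
              "quarter": Date.dt.quarter,
             })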
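
The frame above is only displayed, not stored. If the components are wanted as features, they can be attached to the table as columns. This is a minimal sketch on toy dates. The column names `Year`, `Month` and `WeekOfYear` are hypothetical, and the `+ 1` assumes the data set numbers DayOfWeek from 1 (Monday) to 7 (Sunday), whereas pandas counts from 0.

```python
import pandas as pd

# Toy dates; the second one shows that ISO week 1 can start in late December.
toy = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2014-12-29'])})

toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month
# isocalendar() returns a nullable UInt32 column, so cast it to a plain int.
toy['WeekOfYear'] = toy['Date'].dt.isocalendar().week.astype(int)
# pandas uses Monday = 0; shift to the assumed 1..7 convention.
toy['DayOfWeek'] = toy['Date'].dt.dayofweek + 1

print(toy)
```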