# Walmart sales dataset exploration
### Authors: Jase Miguel Correa and Ana Maria Pinto

In this notebook we load the Walmart weekly sales dataset from the Kaggle competition to understand what information it provides about the weekly sales of 45 stores in the USA. The dataset covers weekly sales from 2010-02-05 to 2012-11-01.

In [1]:
import pandas as pd

## The Dataset

The Walmart sales forecast dataset has the following four files:
 * train.csv
 * test.csv
 * features.csv
 * stores.csv

The train and test files contain information about each department of the 45 Walmart stores, the date of the week, and whether that week includes a holiday. The difference between them is that train contains the weekly sales, while test does not.
The features file contains extra information about several of the dates in the training set; this additional data can be combined with the train data to obtain better results when predicting sales. Finally, the stores file lists the 45 Walmart stores with their size and type.
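
As a rough sketch of how the files could be combined, the example below joins features on `Store` and `Date` and stores on `Store`. The `*_demo` frames are tiny hand-made stand-ins with illustrative values (not the real files), and dropping the duplicated `IsHoliday` column from features is an assumption about how one might avoid suffixed duplicates.

```python
import pandas as pd

# Hand-made stand-ins for train.csv, features.csv and stores.csv
# (illustrative values only, not taken from the real files).
train_demo = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-05", "2010-02-05"]),
    "Weekly_Sales": [100.0, 200.0, 300.0],
    "IsHoliday": [False, False, False],
})
features_demo = pd.DataFrame({
    "Store": [1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-05"]),
    "Temperature": [42.3, 38.5],
    "IsHoliday": [False, False],
})
stores_demo = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "B"],
    "Size": [1000, 500],
})

# features has one row per (Store, Date); IsHoliday exists in both files,
# so it is dropped from features before merging to avoid duplicate columns.
merged_demo = (
    train_demo
    .merge(features_demo.drop(columns="IsHoliday"), on=["Store", "Date"], how="left")
    .merge(stores_demo, on="Store", how="left")
)
print(merged_demo)
```

A left join keeps every row of train even when a store or date has no match in the other files, which makes missing matches show up as NaN instead of silently dropping sales rows.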

### Loading the data using pandas

#### Train
This is the historical training data, which covers 2010-02-05 to 2012-11-01. Train contains the following columns:

 * Store: the store number
 * Dept: the department number
 * Date: the week
 * Weekly_Sales: sales for the given department in the given store
 * IsHoliday: whether the week is a special holiday week

In [2]:
train = pd.read_csv('data/train.csv', parse_dates=['Date'])
train.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday
0,1,1,2010-02-05,24924.5,False
1,1,1,2010-02-12,46039.49,True
2,1,1,2010-02-19,41595.55,False
3,1,1,2010-02-26,19403.54,False
4,1,1,2010-03-05,21827.9,False


#### Features

The features file contains additional data related to the store, department, and regional activity for the given dates. It has the following columns:

 * Store: the store number
 * Date: the week
 * Temperature: average temperature in the region
 * Fuel_Price: cost of fuel in the region
 * MarkDown1-5: anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
 * CPI: the consumer price index
 * Unemployment: the unemployment rate
 * IsHoliday: whether the week is a special holiday week
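
Because the MarkDown columns are only partly filled, it can help to measure how much is missing before using them. The sketch below works on a small hand-made frame with made-up values (`features_demo` is hypothetical, not the real file); the same calls could be applied to the loaded `features` DataFrame.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for features.csv with made-up MarkDown values.
features_demo = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2011-11-11", "2011-11-18"]),
    "MarkDown1": [np.nan, 10382.9, 6074.12],
    "MarkDown2": [np.nan, 6115.67, np.nan],
})

# Select every MarkDown column by name prefix.
markdown_cols = [c for c in features_demo.columns if c.startswith("MarkDown")]

# Fraction of missing values per MarkDown column.
missing_share = features_demo[markdown_cols].isna().mean()

# Earliest date on which at least one MarkDown value is present.
first_available = features_demo.dropna(subset=markdown_cols, how="all")["Date"].min()

print(missing_share)
print(first_available)
```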


In [3]:
features = pd.read_csv('data/features.csv', parse_dates=['Date'])
features.head()

Unnamed: 0,Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
0,1,2010-02-05,42.31,2.572,,,,,,211.096358,8.106,False
1,1,2010-02-12,38.51,2.548,,,,,,211.24217,8.106,True
2,1,2010-02-19,39.93,2.514,,,,,,211.289143,8.106,False
3,1,2010-02-26,46.63,2.561,,,,,,211.319643,8.106,False
4,1,2010-03-05,46.5,2.625,,,,,,211.350143,8.106,False


#### Stores
The stores file contains three columns:
 * Store: the anonymized store number, from 1 to 45.
 * Type: type of the store.
 * Size: the size of the store.

Walmart does not explain what the store types mean, nor does it give the units of the store size.

In [4]:
stores = pd.read_csv('data/stores.csv')
stores.head()

Unnamed: 0,Store,Type,Size
0,1,A,151315
1,2,A,202307
2,3,B,37392
3,4,A,205863
4,5,B,34875
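
Since the meaning of `Type` is undocumented, one way to get a feel for it is to compare store sizes per type. The sketch below uses only the five rows printed above, so it says nothing definitive about the full file; `stores_demo` and `size_by_type` are names introduced here for illustration.

```python
import pandas as pd

# The five rows shown by stores.head() above.
stores_demo = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "Type": ["A", "A", "B", "A", "B"],
    "Size": [151315, 202307, 37392, 205863, 34875],
})

# Number of stores and basic size statistics for each store type.
size_by_type = stores_demo.groupby("Type")["Size"].agg(["count", "mean", "min", "max"])
print(size_by_type)
```

Running the same aggregation on the full `stores` DataFrame would show whether the types separate cleanly by size across all 45 stores.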


### Plotting the data

In [5]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

In [6]:
train_date_list = train['Date'].unique()
train_date_list.sort()
print(len(train_date_list), "weeks")

143 weeks


In [7]:
train_store_list = train['Store'].unique()
train_store_list.sort()
print(len(train_store_list), "stores, numbered as:\n", train_store_list)

45 stores, numbered as:
 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45]


In [8]:
train_dept_list = train['Dept'].unique()
train_dept_list.sort()
print(len(train_dept_list), "departments, numbered as:\n", train_dept_list)

81 departments, numbered as:
 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 54 55 56 58 59 60 65 67 71 72 74 77 78 79 80 81 82 83 85 87 90
 91 92 93 94 95 96 97 98 99]


In [None]:
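# Sum sales over all stores and departments so there is one bar per week;
# plotting train directly would draw one bar per (store, dept, week) row.
weekly_sales = train.groupby("Date")["Weekly_Sales"].sum()

fig, ax = plt.subplots(figsize=(15, 7))
ax.bar(weekly_sales.index, weekly_sales.values, width=5)  # bar width in days
ax.set_xlabel("Date")
ax.set_ylabel("Total weekly sales")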