# Walmart sales data analysis

## Aim

To predict aggregate monthly sales using regression models on the Walmart dataset.

In [1]:
import pandas as pd

## Loading Data into dataframes

In [2]:
train = pd.read_csv("./data/train.csv")
stores = pd.read_csv("./data/stores.csv")
features = pd.read_csv("./data/features.csv")

## Exploring data

**The `features` dataframe has 8190 rows.**

**It has twelve columns.**

In [3]:
features.info() #can be shown on the web page

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8190 entries, 0 to 8189
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         8190 non-null   int64  
 1   Date          8190 non-null   object 
 2   Temperature   8190 non-null   float64
 3   Fuel_Price    8190 non-null   float64
 4   MarkDown1     4032 non-null   float64
 5   MarkDown2     2921 non-null   float64
 6   MarkDown3     3613 non-null   float64
 7   MarkDown4     3464 non-null   float64
 8   MarkDown5     4050 non-null   float64
 9   CPI           7605 non-null   float64
 10  Unemployment  7605 non-null   float64
 11  IsHoliday     8190 non-null   bool   
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 712.0+ KB


- `Date` is reported with the dtype `object` by pandas.
- This means the column holds generic Python objects (here, strings) and has not been parsed into a datetime type.
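Since the aim is a monthly aggregate, the `Date` strings would presumably need to be parsed before any grouping by month. A minimal sketch, assuming the dates follow the `YYYY-MM-DD` pattern; the small `sample` frame and the `Month` column are made up for illustration and stand in for `features`:

```python
import pandas as pd

# Hypothetical stand-in for the features dataframe (values are illustrative only)
sample = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2012-07-20", "2012-07-27", "2012-07-20"],
})

# Parse the strings into datetime values; the format string is an assumption
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# With a datetime column, calendar parts such as the month become available
sample["Month"] = sample["Date"].dt.month

print(sample.dtypes)
```

Passing an explicit `format` makes the parsing stricter: a value that does not match raises an error instead of being silently guessed.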

### Getting an overview of data

In [4]:
features.describe()
#can also be shown on the web page
# Analysis and calculations regarding quantitative columns

| | Store | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 8190.0 | 8190.0 | 8190.0 | 4032.0 | 2921.0 | 3613.0 | 3464.0 | 4050.0 | 7605.0 | 7605.0 |
| mean | 23.0 | 59.356198 | 3.405992 | 7032.371786 | 3384.176594 | 1760.10018 | 3292.935886 | 4132.216422 | 172.460809 | 7.826821 |
| std | 12.987966 | 18.678607 | 0.431337 | 9262.747448 | 8793.583016 | 11276.462208 | 6792.329861 | 13086.690278 | 39.738346 | 1.877259 |
| min | 1.0 | -7.29 | 2.472 | -2781.45 | -265.76 | -179.26 | 0.22 | -185.17 | 126.064 | 3.684 |
| 25% | 12.0 | 45.9025 | 3.041 | 1577.5325 | 68.88 | 6.6 | 304.6875 | 1440.8275 | 132.364839 | 6.634 |
| 50% | 23.0 | 60.71 | 3.513 | 4743.58 | 364.57 | 36.26 | 1176.425 | 2727.135 | 182.764003 | 7.806 |
| 75% | 34.0 | 73.88 | 3.743 | 8923.31 | 2153.35 | 163.15 | 3310.0075 | 4832.555 | 213.932412 | 8.567 |
| max | 45.0 | 101.95 | 4.468 | 103184.98 | 104519.54 | 149483.31 | 67474.85 | 771448.1 | 228.976456 | 14.313 |


In [5]:
# Including object
# Date column
features.describe(include=object)

| | Date |
|---|---|
| count | 8190 |
| unique | 182 |
| top | 2012-07-20 |
| freq | 45 |


In [6]:
# Including bool
# IsHoliday column
features.describe(include=bool)

| | IsHoliday |
|---|---|
| count | 8190 |
| unique | 2 |
| top | False |
| freq | 7605 |


In [7]:
features.count()

Store           8190
Date            8190
Temperature     8190
Fuel_Price      8190
MarkDown1       4032
MarkDown2       2921
MarkDown3       3613
MarkDown4       3464
MarkDown5       4050
CPI             7605
Unemployment    7605
IsHoliday       8190
dtype: int64

In [8]:
# Counting Null values
features.isna().sum()

Store              0
Date               0
Temperature        0
Fuel_Price         0
MarkDown1       4158
MarkDown2       5269
MarkDown3       4577
MarkDown4       4726
MarkDown5       4140
CPI              585
Unemployment     585
IsHoliday          0
dtype: int64
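
The raw null counts are easier to compare as a share of the 8190 rows; `MarkDown2`, for instance, is missing in 5269 of 8190 rows, roughly 64%. A minimal sketch of that calculation, using a made-up miniature frame (`sample`, with illustrative values) rather than the real `features` data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the features dataframe (values are illustrative only)
sample = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "MarkDown1": [np.nan, 10.5, np.nan, np.nan],
    "CPI": [211.1, 211.2, np.nan, 210.9],
})

# isna() yields booleans; their column-wise mean is the fraction of missing values
missing_pct = (sample.isna().mean() * 100).round(2)

print(missing_pct)
```

The same expression applied to `features` would give the percentage of missing values per column.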

In [9]:
print(len(stores))
stores.isna().sum()

45


Store    0
Type     0
Size     0
dtype: int64

In [10]:
print(len(train))
train.isna().sum()

421570


Store           0
Dept            0
Date            0
Weekly_Sales    0
IsHoliday       0
dtype: int64
