# Data Overview & Exploratory Analysis

This notebook gives a brief overview of the provided data files. It does not go in-depth into any particular topic;
check out [Kaggle - Predict Future Sales Competition](https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/overview) for more information.


#### Module Imports

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# from sklearn import *

import matplotlib.style as style

# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-darkgrid'
style.use('seaborn-darkgrid')
sns.set_context('notebook')
sns.set_palette('gist_heat')

### Overview of Data Files

In [24]:
os.listdir('../csv_folder')

['sales_train.csv',
 'shops.csv',
 'test.csv',
 'item_categories.csv',
 'items.csv']

- We will need to join the ***items*** and the ***sales_train*** dataframes
- The ***shops*** and ***item_categories*** files only map id values to their names
- The final file ***test*** will be used when submitting predictions
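
The join mentioned in the first bullet could be sketched as follows. This is a minimal example, not the final pipeline: the frames `sales_demo` and `items_demo` are hypothetical stand-ins, the sales rows are made up, and only the column names come from the files. It attaches `item_category_id` to each sale with a left merge on `item_id`.

```python
import pandas as pd

# Made-up sales rows using the column names of sales_train.csv
sales_demo = pd.DataFrame({
    'shop_id': [59, 25, 25],
    'item_id': [0, 1, 1],
    'item_cnt_day': [1.0, 1.0, -1.0],
})

# Two id/category pairs in the layout of items.csv
items_demo = pd.DataFrame({
    'item_id': [0, 1],
    'item_category_id': [40, 76],
})

# Left merge keeps every sales row and adds the category of its item;
# validate raises if item_id is not unique on the items side
merged_demo = sales_demo.merge(
    items_demo, on='item_id', how='left', validate='many_to_one'
)
print(merged_demo)
```

A left merge is used here so that no sales rows are dropped if an item is missing from the lookup table; such rows would simply get a missing category.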

---
##### Sales Data

In [25]:
sales = pd.read_csv('../csv_folder/sales_train.csv')
shops = pd.read_csv('../csv_folder/shops.csv')
item_cats = pd.read_csv('../csv_folder/item_categories.csv')
items = pd.read_csv('../csv_folder/items.csv')

sales.head(10)

| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 0 | 02.01.2013 | 0 | 59 | 22154 | 999.0 | 1.0 |
| 1 | 03.01.2013 | 0 | 25 | 2552 | 899.0 | 1.0 |
| 2 | 05.01.2013 | 0 | 25 | 2552 | 899.0 | -1.0 |
| 3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
| 4 | 15.01.2013 | 0 | 25 | 2555 | 1099.0 | 1.0 |
| 5 | 10.01.2013 | 0 | 25 | 2564 | 349.0 | 1.0 |
| 6 | 02.01.2013 | 0 | 25 | 2565 | 549.0 | 1.0 |
| 7 | 04.01.2013 | 0 | 25 | 2572 | 239.0 | 1.0 |
| 8 | 11.01.2013 | 0 | 25 | 2572 | 299.0 | 1.0 |
| 9 | 03.01.2013 | 0 | 25 | 2573 | 299.0 | 3.0 |


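- `date` and `date_block_num` are our time values (`date` is a day-first `dd.mm.yyyy` string, `date_block_num` is a consecutive month index)
- `shop_id` and `item_id` are our keys, identifying which shop sold which item
- `item_price` and `item_cnt_day` are our values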
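
To illustrate how these columns work together, here is a minimal sketch that parses the day-first dates and sums the daily counts per month, shop and item. The rows are copied from the `sales.head(10)` output above. The frame name `sales_demo`, the aggregate name `mean_price`, and the choice to aggregate by month are assumptions for illustration; `item_cnt_month` is borrowed from the dtype mapping used later in this notebook.

```python
import pandas as pd

# A few rows copied from the sales.head(10) output
sales_demo = pd.DataFrame({
    'date': ['03.01.2013', '05.01.2013', '04.01.2013', '11.01.2013'],
    'date_block_num': [0, 0, 0, 0],
    'shop_id': [25, 25, 25, 25],
    'item_id': [2552, 2552, 2572, 2572],
    'item_price': [899.0, 899.0, 239.0, 299.0],
    'item_cnt_day': [1.0, -1.0, 1.0, 1.0],
})

# Dates are day.month.year strings, so pass the explicit format
sales_demo['date'] = pd.to_datetime(sales_demo['date'], format='%d.%m.%Y')

# Sum daily counts and average the price per month block, shop and item
monthly_demo = (
    sales_demo
    .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
    .agg(item_cnt_month=('item_cnt_day', 'sum'),
         mean_price=('item_price', 'mean'))
)
print(monthly_demo)
```

Note that the negative daily count for item 2552 cancels the earlier positive one in the monthly sum, so negative values need a deliberate decision before aggregating.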

---
##### Key Tables (shops, items, categories)

In [19]:
shops.head()

| | shop_name | shop_id |
|---|---|---|
| 0 | !Якутск Орджоникидзе, 56 фран | 0 |
| 1 | !Якутск ТЦ "Центральный" фран | 1 |
| 2 | Адыгея ТЦ "Мега" | 2 |
| 3 | Балашиха ТРК "Октябрь-Киномир" | 3 |
| 4 | Волжский ТЦ "Волга Молл" | 4 |


The shop names are in Russian (Cyrillic), so any step that reads or writes these files needs an encoding that can handle them, such as UTF-8.
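
As a minimal sketch of handling such text, the example below writes two of the shop names shown above to a temporary CSV and reads them back with an explicit `encoding='utf-8'`. The frame `shops_demo` and the file name `shops_demo.csv` are hypothetical. The example only demonstrates the round trip and does not confirm the encoding of the competition files.

```python
import os
import tempfile

import pandas as pd

# Two shop names taken from the shops.head() output
shops_demo = pd.DataFrame({
    'shop_name': ['Адыгея ТЦ "Мега"', 'Волжский ТЦ "Волга Молл"'],
    'shop_id': [2, 4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'shops_demo.csv')  # hypothetical file name
    # Write and read with an explicit encoding so the Cyrillic text survives
    shops_demo.to_csv(demo_path, index=False, encoding='utf-8')
    shops_roundtrip = pd.read_csv(demo_path, encoding='utf-8')

print(shops_roundtrip)
```

Passing the encoding explicitly avoids depending on platform defaults when the files are later written out or opened with other tools.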

In [5]:
item_cats.head()

| | item_category_name | item_category_id |
|---|---|---|
| 0 | PC - Гарнитуры/Наушники | 0 |
| 1 | Аксессуары - PS2 | 1 |
| 2 | Аксессуары - PS3 | 2 |
| 3 | Аксессуары - PS4 | 3 |
| 4 | Аксессуары - PSP | 4 |


As shown above, ***shops*** and ***item_categories*** only map id values to their names. These may be useful later, but for now we will use only the id values.

In [9]:
items.head()

| | item_name | item_id | item_category_id |
|---|---|---|---|
| 0 | ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D | 0 | 40 |
| 1 | !ABBYY FineReader 12 Professional Edition Full... | 1 | 76 |
| 2 | \*\*\*В ЛУЧАХ СЛАВЫ (UNV) D | 2 | 40 |
| 3 | \*\*\*ГОЛУБАЯ ВОЛНА (Univ) D | 3 | 40 |
| 4 | \*\*\*КОРОБКА (СТЕКЛО) D | 4 | 40 |


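Reloading the ***sales_train*** data with compact integer dtypes to reduce memory use

In [7]:
# Load the training data, downcasting the id and month columns
train_df = pd.read_csv('../csv_folder/sales_train.csv',
                       dtype={'shop_id': 'int8',
                              'item_id': 'int16',
                              # no 'item_cnt_month' column appears in the output above (only 'item_cnt_day')
                              'item_cnt_month': 'int32',
                              'date_block_num': 'int8'})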

This is the provided training data set and will be used as the primary dataframe for training.  
We will use the other dataframes to complement the information stored here.
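
The effect of passing compact dtypes to `read_csv` can be sketched on a small in-memory CSV. This is an illustration only: `csv_text`, `default_df` and `compact_df` are hypothetical names, and the three rows are copied from the sales output above rather than read from the real file.

```python
import io

import pandas as pd

# Small CSV snippet with the same columns as sales_train.csv
csv_text = """date,date_block_num,shop_id,item_id,item_price,item_cnt_day
02.01.2013,0,59,22154,999.0,1.0
03.01.2013,0,25,2552,899.0,1.0
05.01.2013,0,25,2552,899.0,-1.0
"""

# Default parsing versus explicit compact integer dtypes
default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'shop_id': 'int8', 'item_id': 'int16', 'date_block_num': 'int8'},
)

# Compare per-column memory use in bytes (index excluded)
numeric_cols = ['date_block_num', 'shop_id', 'item_id']
print(default_df[numeric_cols].memory_usage(index=False))
print(compact_df[numeric_cols].memory_usage(index=False))
```

The smaller types only work while every value fits their range (`int8` holds -128 to 127, `int16` holds -32768 to 32767), so checking each column's maximum on the full file first is a reasonable precaution.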