In [1]:
"""
In this competition, you will predict sales for the thousands of product families sold
at Favorita stores located in Ecuador. The training data includes dates, store and
product information, whether that item was being promoted, as well as the sales numbers.
Additional files include supplementary information that may be useful in building your models.


File Descriptions and Data Field Information


train.csv
The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.
store_nbr identifies the store at which the products are sold.
family identifies the type of product sold.
sales gives the total sales for a product family at a particular store at a given date. 
Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance,
as opposed to 1 bag of chips).
onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.


test.csv
The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
The dates in the test data are for the 15 days after the last date in the training data.


sample_submission.csv
A sample submission file in the correct format.


stores.csv
Store metadata, including city, state, type, and cluster.
cluster is a grouping of similar stores.


oil.csv
Daily oil price. Includes values during both the train and test data timeframes. 
(Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)


holidays_events.csv
Holidays and Events, with metadata
NOTE: Pay special attention to the transferred column. 
A holiday that is transferred officially falls on that calendar
day, but was moved to another date by the government. A transferred 
day is more like a normal day than a holiday. To find the day that it was 
actually celebrated, look for the corresponding row where type is Transfer. 
For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09
to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge 
are extra days that are added to a holiday (e.g., to extend the break across a long weekend). 
These are frequently made up by the type Work Day, which is a day not normally scheduled for work
(e.g., Saturday) that is meant to pay back the Bridge.
Additional holidays are days added to a regular calendar holiday, for example, as typically happens around
Christmas (making Christmas Eve a holiday).


Additional Notes
Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month.
Supermarket sales could be affected by this.
A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating
water and other first-need products, which greatly affected supermarket sales for several weeks after the earthquake.



https://www.kaggle.com/competitions/store-sales-time-series-forecasting

"""
print("ok")

ok
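
The note on public-sector wages in the description above suggests a simple calendar feature. The following is a minimal sketch, not part of the competition data: the helper `add_payday_flag` and the column `is_payday` are hypothetical names, and the sketch assumes the `date` column has already been converted to datetime.

```python
import pandas as pd

def add_payday_flag(df, date_col="date"):
    """Return a copy of df with a boolean is_payday column (hypothetical feature).

    Paydays follow the rule quoted in the description: the 15th and the
    last day of each month.
    """
    out = df.copy()
    dates = out[date_col]
    out["is_payday"] = (dates.dt.day == 15) | dates.dt.is_month_end
    return out

# Toy dates only; the real frames are loaded later in the notebook.
sample = pd.DataFrame(
    {"date": pd.to_datetime(["2017-08-14", "2017-08-15", "2017-08-31", "2016-02-29"])}
)
flagged = add_payday_flag(sample)
print(flagged)
```

Whether the flag helps at all is an empirical question; sales may also react a day or two after payday, which this sketch does not model.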


In [2]:
# help, ?, ??
# ctrl-u - up
# %timeit

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings("ignore")

In [3]:
data_path = "data/"
holiday_events_data = pd.read_csv(data_path + "holidays_events.csv")
oil_data = pd.read_csv(data_path + "oil.csv")
stores_data = pd.read_csv(data_path + "stores.csv")
transaction_data = pd.read_csv(data_path + "transactions.csv")

test_data = pd.read_csv(data_path + "test.csv")
train_data = pd.read_csv(data_path + "train.csv")

In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


In [5]:

holiday_events_data['date'] = pd.to_datetime(holiday_events_data['date'], format = "%Y-%m-%d")
oil_data['date'] = pd.to_datetime(oil_data['date'], format = "%Y-%m-%d")
transaction_data['date'] = pd.to_datetime(transaction_data['date'], format = "%Y-%m-%d")

train_data['date'] = pd.to_datetime(train_data['date'], format = "%Y-%m-%d")
test_data['date'] = pd.to_datetime(test_data['date'], format = "%Y-%m-%d")

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   id           int64         
 1   date         datetime64[ns]
 2   store_nbr    int64         
 3   family       object        
 4   sales        float64       
 5   onpromotion  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 137.4+ MB


In [7]:
holiday_events_data.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
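
The description says a transferred holiday behaves like a normal day and that Work Day rows are make-up working days. One possible way to keep only the days that were actually observed is sketched below. The helper `celebrated_days` is a hypothetical name, and the two Guayaquil rows are toy data built from the example in the description, not copied from the file.

```python
import pandas as pd

# Toy rows in the holidays_events.csv layout (subset of columns).
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2012-03-02"]),
    "type": ["Holiday", "Transfer", "Holiday"],
    "description": ["Independencia de Guayaquil",
                    "Independencia de Guayaquil",
                    "Fundacion de Manta"],
    "transferred": [True, False, False],
})

def celebrated_days(df):
    """Keep rows that were observed as days off (assumed reading of the rules)."""
    not_moved = ~df["transferred"]           # original date of a moved holiday is a normal day
    not_work_day = df["type"] != "Work Day"  # Work Day rows are working days, not days off
    return df[not_moved & not_work_day]

observed = celebrated_days(holidays)
print(observed[["date", "type", "description"]])
```

The row dated 2012-10-09 is dropped and the Transfer row dated 2012-10-12 is kept, matching the example in the description.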


In [8]:
oil_data.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
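
The preview above shows a missing first value and a jump from 2013-01-04 to 2013-01-07, so the oil series has gaps relative to the daily sales data. A minimal sketch of one way to put it on a daily calendar follows. It rebuilds the five previewed rows by hand, and the fill strategy (forward fill, then backward fill for the leading gap) is an assumption, not something the competition prescribes.

```python
import numpy as np
import pandas as pd

# The five rows shown by oil_data.head(), rebuilt so the sketch is self-contained.
oil = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"]
    ),
    "dcoilwtico": [np.nan, 93.14, 92.97, 93.12, 93.20],
})

# Reindex onto every calendar day between the first and last date.
full_range = pd.date_range(oil["date"].min(), oil["date"].max(), freq="D")
daily = oil.set_index("date").reindex(full_range).rename_axis("date")

# Carry the last known price forward; backfill only the leading missing value.
daily["dcoilwtico"] = daily["dcoilwtico"].ffill().bfill()
print(daily)
```

Forward filling avoids using future prices for past dates. The backward fill on the first row does borrow the next day's price, which is a small look-ahead worth keeping in mind.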


In [9]:
stores_data.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [10]:
transaction_data.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [11]:
train_data.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [12]:
test_data.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [13]:
num_of_stores = train_data['store_nbr'].nunique()  # 54 stores
num_of_prods = train_data['family'].nunique()  # 33 product families
print(num_of_stores)
print(num_of_prods)

54
33
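
If every store carried every product family on every date, the row count of the training data would be a multiple of 54 × 33. A quick arithmetic check, using only numbers already printed in this notebook:

```python
n_rows = 3_000_888   # RangeIndex size reported by train_data.info()
n_stores = 54        # printed above
n_families = 33      # printed above

series_count = n_stores * n_families
days_per_series, remainder = divmod(n_rows, series_count)
print(series_count, days_per_series, remainder)  # 1782 1684 0
```

A remainder of zero is consistent with 1,782 store-family series of 1,684 dates each. It does not prove the panel is complete; confirming that would need a count per store and family on the real data.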
