## Objective

The evaluation metric for this competition is ***Root Mean Squared Logarithmic Error***.

The `RMSLE` is calculated as:

$$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$$

where:

- $n$ is the total number of instances,
- $\hat{y}_i$ is the predicted value of the target for instance $i$,
- $y_i$ is the actual value of the target for instance $i$, and
- $\log$ is the natural logarithm.
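
The formula above can be sketched in plain Python. This is a minimal illustration, not the competition's official scorer; the function name `rmsle` and the toy values are assumptions made for the example.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one instance is required")
    # math.log1p(x) computes log(1 + x) using the natural logarithm
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Toy values, not competition data
actual = [0.0, 3.0, 10.0]
predicted = [0.0, 3.0, 10.0]
print(rmsle(actual, predicted))  # a perfect prediction gives 0.0
```

Because the error is computed on logarithms, it penalizes relative rather than absolute differences, so it is only defined for targets and predictions greater than -1.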

For each `id` in the test set, you must predict a value for the `sales` variable. The submission file should contain a header and have the following format:

```
id,sales
3000888,0.0
3000889,0.0
3000890,0.0
3000891,0.0
3000892,0.0
etc.
```
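
A file in this format can be produced with pandas. The sketch below writes to an in-memory buffer so it runs without touching disk; the ids and the all-zero predictions are placeholders for illustration, not real model output.

```python
import io

import pandas as pd

# Placeholder ids and predictions, for illustration only
submission = pd.DataFrame(
    {"id": [3000888, 3000889, 3000890], "sales": [0.0, 0.0, 0.0]}
)

# index=False keeps the DataFrame index out of the file,
# so the header is exactly "id,sales"
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To write an actual file, pass a path instead of the buffer, for example `submission.to_csv("submission.csv", index=False)` (the file name here is an assumption).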


In [1]:
import lux
lux.config.default_display = "lux"
import pandas as pd
from dateutil.parser import *

### holidays

In [2]:
holidays=pd.read_csv("../data/raw/holidays_events.csv", parse_dates=['date'], infer_datetime_format=True, index_col='date')
# holidays['date'] = pd.to_datetime(holidays['date'], format='ISO8601')
holidays


### stores

In [3]:
stores=pd.read_csv("../data/raw/stores.csv")
stores.info()

<class 'lux.core.frame.LuxDataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


In [4]:
stores


### transactions

In [5]:
# No chunksize here: with chunksize, read_csv returns a TextFileReader iterator,
# which has no .shape or .columns and breaks the cells below.
transactions=pd.read_csv("../data/raw/transactions.csv", parse_dates=['date'], infer_datetime_format=True)
# transactions['date'] = pd.to_datetime(transactions['date'], format='ISO8601')
transactions

### train

In [6]:
train=pd.read_csv("../data/raw/train.csv", parse_dates=['date'], infer_datetime_format=True)
# train['date'] = pd.to_datetime(train['date'], format='ISO8601')
train

### test

In [7]:
# Read in full for the same reason as transactions: later cells need a DataFrame.
test=pd.read_csv("../data/raw/test.csv", parse_dates=['date'], infer_datetime_format=True)
# test['date'] = pd.to_datetime(test['date'], format='ISO8601')
test

### oil

In [8]:
oil=pd.read_csv("../data/raw/oil.csv", parse_dates=['date'], infer_datetime_format=True)
# oil['date'] = pd.to_datetime(oil['date'], format='ISO8601')
oil

In [9]:
print(f"{stores.shape = }")
print(f"{transactions.shape = }")
print(f"{oil.shape = }")
print(f"{train.shape = }")
print(f"{test.shape = }")
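
When `chunksize` is passed, `pd.read_csv` returns a `TextFileReader` that yields DataFrames one chunk at a time, so attributes such as `.shape` only exist on the individual chunks. If chunked reading is wanted to limit memory use, the chunks can be concatenated afterwards. A minimal sketch, using a small in-memory CSV with made-up values as a stand-in for a file on disk:

```python
import io

import pandas as pd

# Made-up rows standing in for a CSV file on disk
csv_text = (
    "date,store_nbr,transactions\n"
    "2020-01-01,1,100\n"
    "2020-01-02,1,120\n"
    "2020-01-02,2,90\n"
)

# With chunksize, read_csv returns an iterator over DataFrames
reader = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], chunksize=2)
print(hasattr(reader, "shape"))  # False: the reader itself is not a DataFrame

# Concatenating the chunks yields an ordinary DataFrame
frame = pd.concat(reader, ignore_index=True)
print(frame.shape)  # (3, 3)
```

Concatenating every chunk loads the whole file into memory anyway, so chunking only helps when each chunk is filtered or aggregated before being combined.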

In [None]:
print(f"{set(holidays.columns) = }")
print(f"{set(stores.columns) = }")
print(f"{set(transactions.columns) = }")
print(f"{set(oil.columns) = }")
print(f"{set(train.columns) = }")
print(f"{set(test.columns) = }")

In [None]:
stores_holidays = set(stores.columns) & set(holidays.columns)
stores_holidays

In [None]:
transactions_oil = set(transactions.columns) & set(oil.columns)
transactions_oil

In [None]:
transactions_stores = set(transactions.columns) & set(stores.columns)
transactions_stores

In [None]:
train_transactions_stores = set(train.columns) & set(stores.columns) & set(transactions.columns)
train_transactions_stores
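
The pairwise intersections above look for columns that could serve as join keys. The same check can be run over every pair of tables at once. In this sketch the helper name `shared_columns` is hypothetical; the `stores` columns are taken from the `info()` output earlier, while `table_a` and `table_b` are placeholder tables invented for the example.

```python
from itertools import combinations


def shared_columns(columns_by_table):
    """Map each pair of table names to the set of column names they share."""
    return {
        (left, right): set(columns_by_table[left]) & set(columns_by_table[right])
        for left, right in combinations(columns_by_table, 2)
    }


# `stores` matches the info() output; the other two tables are placeholders
columns_by_table = {
    "stores": ["store_nbr", "city", "state", "type", "cluster"],
    "table_a": ["date", "store_nbr", "value"],
    "table_b": ["date", "price"],
}

for pair, common in shared_columns(columns_by_table).items():
    print(pair, sorted(common))
```

With real DataFrames the input could be built as `{"stores": stores.columns, "oil": oil.columns}`. A column used as the index, as `date` is for `holidays`, does not appear in `.columns` and would need to be added separately.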

## Data cleaning with klib

In [None]:
import klib
klib.missingval_plot(holidays)
print(f"{holidays.shape = }")
holidays_cleaned = klib.data_cleaning(holidays)

In [None]:
klib.missingval_plot(stores)
print(f"{stores.shape = }")
stores_cleaned = klib.data_cleaning(stores)

In [None]:
stores_cleaned

In [None]:
klib.missingval_plot(transactions)
print(f"{transactions.shape = }")
transactions_cleaned = klib.data_cleaning(transactions)

In [None]:
klib.missingval_plot(oil)
print(f"{oil.shape = }")
oil_cleaned = klib.data_cleaning(oil)

In [None]:
klib.missingval_plot(train)
print(f"{train.shape = }")
train_cleaned = klib.data_cleaning(train)