#### Import Required Libraries

In [6]:
import pandas as pd
import numpy as np

## 1. Access the Data

The dataset for this project comes from the [Store Sales - Time Series Forecasting Challenge](https://www.kaggle.com/c/store-sales-time-series-forecasting) hosted on Kaggle.

To access it, you'll first need to sign in to Kaggle and accept the competition's terms of use. Once authenticated, you can download the data directly from Kaggle, unzip it, and store it locally or on your preferred cloud storage. 

Alternatively, for a more automated approach, you can use the provided `data_extract.py` script. This script leverages the Kaggle API to download the data directly and saves it within the project's `data` folder.

In [None]:
# download the data from Kaggle into the project's data folder
%run ../src/data_extract.py
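The contents of `data_extract.py` are not shown here, so the following is only a minimal sketch of what such a script might do, assuming the official `kaggle` Python package is installed and API credentials are configured. The function names `download_competition_zip` and `extract_zip` are hypothetical, and the archive name `<competition>.zip` is an assumption about how the Kaggle client names the download.

```python
import zipfile
from pathlib import Path


def download_competition_zip(competition: str, dest_dir: Path) -> Path:
    """Download a competition archive via the Kaggle API (hypothetical helper)."""
    # Imported lazily so the rest of this module works without the kaggle package.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # requires Kaggle API credentials to be configured
    dest_dir.mkdir(parents=True, exist_ok=True)
    api.competition_download_files(competition, path=str(dest_dir))
    # Assumption: the client saves the archive as "<competition>.zip".
    return dest_dir / f"{competition}.zip"


def extract_zip(zip_path: Path, dest_dir: Path) -> list[str]:
    """Unpack an archive into dest_dir and return the sorted member names."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


# Example call (not executed here, since it needs network access and credentials):
# archive_path = download_competition_zip(
#     "store-sales-time-series-forecasting", Path("../data")
# )
# print(extract_zip(archive_path, Path("../data")))
```

Splitting the download from the extraction keeps the unzip step usable on an archive that was downloaded by hand from the competition page.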

Although the dataset includes multiple files, we'll use only the `train.csv` file for our analysis.

In [9]:
file_path = '../data/train.csv'

# define data schema
dtypes = {
    'date': str, # initially read as string
    'store': int,
    'item': int,
    'sales': int
}

# read data as csv
df = pd.read_csv(
    file_path,
    header=0,
    dtype=dtypes
)

# convert the date column from string to datetime
df['date'] = pd.to_datetime(df['date'])

df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | 1 | 13 |
| 1 | 2013-01-02 | 1 | 1 | 11 |
| 2 | 2013-01-03 | 1 | 1 | 14 |
| 3 | 2013-01-04 | 1 | 1 | 13 |
| 4 | 2013-01-05 | 1 | 1 | 10 |
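As a variation on the loading step, the date conversion can be folded into `pd.read_csv` with `parse_dates`, followed by a quick schema check. This is a sketch rather than the notebook's own method. It reads a few sample rows copied from the preview above from an in-memory buffer, and the variable names are illustrative.

```python
import io

import pandas as pd

# Sample rows copied from the preview; a real run would pass '../data/train.csv'.
sample_csv = io.StringIO(
    "date,store,item,sales\n"
    "2013-01-01,1,1,13\n"
    "2013-01-02,1,1,11\n"
    "2013-01-03,1,1,14\n"
)

sample_df = pd.read_csv(
    sample_csv,
    dtype={"store": "int64", "item": "int64", "sales": "int64"},
    parse_dates=["date"],  # parse the date column while reading
)

# Fail early if the file does not match the expected schema.
expected_columns = ["date", "store", "item", "sales"]
assert list(sample_df.columns) == expected_columns
assert pd.api.types.is_datetime64_any_dtype(sample_df["date"])
```

Checking the schema right after loading surfaces a renamed or missing column immediately instead of during later analysis.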


## 2. EDA

### Understand the data

In [12]:
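# Check structure and data types (info() prints its report directly)
df.info()

# Check for missing values; print the result so it is not discarded mid-cell
print(df.isna().sum())

# Descriptive statistics; kept as the last expression so the notebook displays it
df.describe()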

| | date | store | item | sales |
|---|---|---|---|---|
| count | 913000 | 913000.0 | 913000.0 | 913000.0 |
| mean | 2015-07-02 11:59:59.999999744 | 5.5 | 25.5 | 52.250287 |
| min | 2013-01-01 00:00:00 | 1.0 | 1.0 | 0.0 |
| 25% | 2014-04-02 00:00:00 | 3.0 | 13.0 | 30.0 |
| 50% | 2015-07-02 12:00:00 | 5.5 | 25.5 | 47.0 |
| 75% | 2016-10-01 00:00:00 | 8.0 | 38.0 | 70.0 |
| max | 2017-12-31 00:00:00 | 10.0 | 50.0 | 231.0 |
| std | | 2.872283 | 14.430878 | 28.801144 |
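The summary suggests a simple consistency check. If the store IDs run from 1 to 10 and the item IDs from 1 to 50 without gaps (an assumption, since the summary only shows the minimum and maximum), a complete daily grid from 2013-01-01 to 2017-12-31 would contain exactly the 913,000 rows reported. The sketch below works through that arithmetic and shows how the same check might be written for a DataFrame. `expected_panel_rows` is a hypothetical helper, and the toy frame uses made-up values.

```python
import pandas as pd


def expected_panel_rows(frame: pd.DataFrame) -> int:
    """Rows a complete daily store x item grid would have (hypothetical helper)."""
    n_days = (frame["date"].max() - frame["date"].min()).days + 1
    return frame["store"].nunique() * frame["item"].nunique() * n_days


# Arithmetic from the summary: 10 stores x 50 items x days from 2013 through 2017.
n_days_full = (pd.Timestamp("2017-12-31") - pd.Timestamp("2013-01-01")).days + 1
rows_if_complete = 10 * 50 * n_days_full

# Toy frame (made-up values) to show the check itself: 3 days x 2 stores x 3 items.
toy = pd.DataFrame(
    [
        (day, store, item, 0)
        for day in pd.date_range("2013-01-01", periods=3, freq="D")
        for store in (1, 2)
        for item in (1, 2, 3)
    ],
    columns=["date", "store", "item", "sales"],
)
toy_is_complete = (
    len(toy) == expected_panel_rows(toy)
    and not toy.duplicated(["date", "store", "item"]).any()
)
```

On the real data, the equivalent check would compare `len(df)` with `expected_panel_rows(df)`; a mismatch would point to missing days or duplicated store-item-date combinations.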
