<h1 align = "center">Predict Future Sales</h1>

The challenge is to [*predict future sales*](https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales) for every product and store for the upcoming month. All the data files are kept in the `data` directory, but they are not tracked by Git.

## Objective
The notebook is designed to explore the dataset, engineer features, and finally make some predictions. The code also uses the [**`googletrans`**](https://pypi.org/project/googletrans/) library to translate Russian text to English, and then applies some NLP to build a more meaningful bag of product categories.
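
As a rough illustration of the "bag of product categories" idea, the sketch below counts tokens across category names that are assumed to be already translated to English. The sample names and the helpers `tokenize` and `bag_of_words` are hypothetical and are not taken from the notebook or the dataset; the translation step itself is left out because it needs network access.

```python
import re
from collections import Counter

# Hypothetical, already-translated category names (illustrative only).
translated_categories = [
    "Accessories - PS4",
    "Games - PS4",
    "Games - Xbox One",
    "Books - Audiobooks",
]

def tokenize(name):
    """Lowercase a category name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def bag_of_words(names):
    """Count how often each token occurs across all category names."""
    counts = Counter()
    for name in names:
        counts.update(tokenize(name))
    return counts

bag = bag_of_words(translated_categories)
print(bag.most_common(3))
```

Tokens that recur across many names (here `games` or `ps4`) are candidates for coarser category features, which is one plausible reading of a "more meaningful" bag.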

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%precision 3
%matplotlib inline
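sns.set_style('whitegrid')
plt.style.use('default-style') # NOTE: not a built-in style; assumes a custom `default-style.mplstyle` is installed (the built-in name is 'default')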
np.set_printoptions(precision = 3, threshold = 15)

In [2]:
from time import ctime # used for logging, naming files/output directories etc.
from os import makedirs # create directories dynamically if they do not already exist
from os.path import join # keep paths OS-independent
from copy import deepcopy # `pd.DataFrame` is mutable, so some `df` operations may need a `deepcopy`
from tqdm import tqdm as TQ # progress bars for long-running loops
from uuid import uuid1 as UUID # keep output file names unique
from datetime import datetime as dt # formatting datetime objects

In [3]:
from utilities import read_file

**Global Constants**

In [4]:
ROOT = "."
DATA = join(ROOT, "data")

# define the output directory
# it is named after the current date, e.g. "Wed, Apr 06 2022"
today = dt.now().strftime("%a, %b %d %Y")

OUTPUT_DIR = join(ROOT, "output", today)
makedirs(OUTPUT_DIR, exist_ok = True) # create dir if not exist

# set/change the output file name
OUTPUT_FILE = f"{UUID()}.xlsx" # `uuid1` derives the name from the host ID and timestamp: unique, but not random

# inform the user of the current output file path
print(f"Output File : {join(OUTPUT_DIR, OUTPUT_FILE)}")

Output File : .\output\Wed, Apr 06 2022\24b06545-b59d-11ec-abdd-5405db104a4e.xlsx
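
The dated directory name above contains a comma and spaces, and does not sort chronologically. A minimal alternative sketch, assuming an ISO-formatted date is acceptable, is shown below. The helper `build_output_path` is hypothetical, and `uuid4` (random) is swapped in for the notebook's `uuid1`; this is not the notebook's own method.

```python
from datetime import date
from pathlib import Path
from uuid import uuid4

def build_output_path(root=".", day=None, suffix=".xlsx"):
    """Return <root>/output/<YYYY-MM-DD>/<random uuid><suffix> without touching the disk."""
    day = day or date.today()
    out_dir = Path(root) / "output" / day.isoformat()
    return out_dir / f"{uuid4()}{suffix}"

# A fixed date is used here only to keep the example reproducible.
path = build_output_path(day=date(2022, 4, 6))
print(path.parent)

# To actually create the directory, one could call:
# path.parent.mkdir(parents=True, exist_ok=True)
```

ISO dates sort lexicographically in chronological order, so the output folders list in date order in any file browser.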


## Imports
Import all the data files and process the data for ML modeling.

In [None]:
# check the contents of `data` directory
# this cell is meant to be run on *nix systems
!ls ./data/

In [5]:
train = read_file(join(DATA, "sales_train.csv"))
train.sample(5)

| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 1829057 | 2014-07-22 | 18 | 13 | 4244 | 734.0 | 1.0 |
| 488099 | 2013-05-16 | 4 | 54 | 1939 | 999.0 | 1.0 |
| 1957330 | 2014-09-18 | 20 | 44 | 7070 | 299.0 | 1.0 |
| 1903897 | 2014-08-14 | 19 | 51 | 9786 | 80.0 | 1.0 |
| 1420215 | 2014-02-01 | 13 | 52 | 1829 | 698.0 | 1.0 |
