# Time Series Analysis - Sales Forecasting 

## Team Members

1. Engin Pehlivan  090200769
2. Ozan Yeşil      090190325
3. Abdulsamet Balveren 090190751

## Dataset
In this project we will use the dataset of a Kaggle competition, which consists of several files. The data comes from a supermarket chain in Ecuador.  
Project data: [Data](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)  
There are 7 different datasets. We will use the ones below:  
- `train` and `test` are CSV files with the same layout: the date of the sales, the store, the product family, and the number of items on promotion. Only `train` contains the sales values. The two files cover different date ranges that do not overlap.  
- `stores` contains metadata about the stores that make these sales: city, state, and type.
- `oil` contains the daily oil price, which affects the economy of Ecuador.
- `holidays_events` tells us whether a given day is a work day, a holiday, or an event day.  

In the `train` and `test` datasets there are 5 columns besides the row `id` (the test dataset does not include the `sales` column):  
`date` is the transaction date in yyyy-mm-dd format. `store_nbr` is the store code; every store has a unique store number. `family` is the category of the product that is sold. `sales` is the total sales of that product family at that store on that date. `onpromotion` is the total number of items in a product family that were being promoted at a store on a given date. 

The `transactions` dataset includes 3 columns:  
`date` and `store_nbr` are as described above. `transactions` is the total number of transactions made in each store on a given date.  

The `stores` dataset has 54 rows, one for each store. Each row gives the store's `city`, `state`, and `type`, plus a `cluster` label that groups similar stores.  

The `oil` dataset includes the `date` and `dcoilwtico` columns; `dcoilwtico` is the daily price of oil.  

The `holidays_events` dataset lists the different types of holidays and events in the country. `type` gives the type of the day. `locale` gives the scope of the holiday (local, regional, or national). `locale_name` names the area that is affected. `description` gives the details of that day. `transferred` tells whether the holiday was transferred. A transferred holiday officially falls on that calendar day but was moved to another date by the government.  
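
The files share the `date` and `store_nbr` keys, so they could be combined into one table before modelling. The following is a minimal sketch under assumptions: the frames are tiny hand-made stand-ins with made-up values, and the names `observed`, `holiday_type`, and `is_holiday` are hypothetical. It also ignores `locale`, so a local holiday is flagged for every store; a fuller version would match `locale_name` against the store's city or state.

```python
import pandas as pd

# Tiny hand-made frames that mimic the column layout described above;
# the values are illustrative, not taken from the real files.
train = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-14", "2017-08-15"]),
    "store_nbr": [1, 2, 1],
    "family": ["BEVERAGES", "BEVERAGES", "BEVERAGES"],
    "sales": [120.0, 95.0, 130.0],
    "onpromotion": [3, 0, 5],
})
stores = pd.DataFrame({
    "store_nbr": [1, 2],
    "city": ["Quito", "Guayaquil"],
    "state": ["Pichincha", "Guayas"],
    "type": ["D", "D"],
    "cluster": [13, 13],
})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-15"]),
    "dcoilwtico": [47.59, 47.57],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-15"]),
    "type": ["Holiday"],
    "locale": ["Local"],
    "locale_name": ["Riobamba"],
    "description": ["Fundacion de Riobamba"],
    "transferred": [False],
})

# Keep only holidays that were not moved to another date, one row per date.
# "type" is renamed because the stores table also has a "type" column.
observed = holidays.loc[~holidays["transferred"], ["date", "type"]]
observed = observed.rename(columns={"type": "holiday_type"}).drop_duplicates("date")

# Left joins keep every sales row; validate= guards against duplicated keys.
merged = (
    train
    .merge(stores, on="store_nbr", how="left", validate="many_to_one")
    .merge(oil, on="date", how="left", validate="many_to_one")
    .merge(observed, on="date", how="left", validate="many_to_one")
)
merged["is_holiday"] = merged["holiday_type"].notna()
print(merged[["date", "store_nbr", "city", "dcoilwtico", "is_holiday"]])
```

Left joins are used so that a missing oil price or the absence of a holiday never drops a sales row.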


**Dataset References**  

Store Sales - Time Series Forecasting. Kaggle, accessed 13 November 2022, [Link](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)

## Description of the Problem
In this project we want to predict the sales of the product families sold at **Favorita** stores located in **Ecuador**.

At the end of the project we aim to have answered these questions:  
- What will be the amount of sales in the following 15 days for each store and each product family?
- What kind of effect do oil prices have on sales in Ecuador?
- Is there a noticeable change in sales on holidays?
- Which products should be promoted more on those special days?
- Which stores make the most sales in the country?
- How do stores in the same city compare in terms of sales?
- How can stores be clustered based on similarity?

For prediction we are planning to use: **Linear Regression, XGBoost, RandomForest**  
For clustering we will consider: **k-means, Hierarchical clustering** 
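
To make the clustering idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) written with NumPy only. It is hand-rolled for illustration; in the project a library implementation would likely be used instead. The function `kmeans`, the two per-store features, and all numbers are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical per-store features: [mean daily sales, mean items on promotion].
store_features = np.array([
    [100.0, 1.0], [110.0, 1.5], [105.0, 0.8],     # smaller stores
    [900.0, 9.0], [950.0, 10.0], [920.0, 11.0],   # larger stores
])
# Standardise so both features contribute comparably to the distance.
scaled = (store_features - store_features.mean(axis=0)) / store_features.std(axis=0)
labels, centroids = kmeans(scaled, k=2)
print(labels)
```

Standardising first matters because k-means uses Euclidean distance, so an unscaled sales column would otherwise dominate the promotion column.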

## Project Planning


### Project Pieces

**Literature Review:** We will review existing work on time series to learn the math behind it and to see what new perspectives we can bring.   
**Data Manipulation:** Gathering data from the different datasets and splitting it to make visualization easier. For example, grouping markets and evaluating them separately.  
**Visualization:** Exploratory Data Analysis. Before modelling we will try to understand the data better so that we can choose the models with the best accuracy. We will use pie charts for the distribution of product sales and heatmaps to see the correlation between columns.  
**Modelling:** After understanding the topic better and cleaning our data, we will create our forecasting model. We will then evaluate the model to see whether we can improve it by changing the algorithm or through feature engineering.  
**Analysis:** After all visualization and modelling, we will draw conclusions from our models and graphics to answer our questions.
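
The modelling and evaluation step could be sketched as follows. This is an illustration under assumptions, not our final pipeline: the series is synthetic, the lags (21 and 28 days) and the helper `rmsle` are our own choices, and linear regression is fitted with NumPy least squares instead of a modelling library. RMSLE is used because, to our understanding, it is the metric listed on the competition page.

```python
import numpy as np
import pandas as pd

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; predictions are clipped at zero."""
    y_pred = np.clip(y_pred, 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Synthetic daily sales for one store/family pair: weekly pattern plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", periods=200, freq="D")
weekly = np.array([80, 60, 60, 65, 70, 110, 130])  # Mon..Sun, made-up levels
sales = weekly[dates.dayofweek.to_numpy()] + rng.normal(0, 5, size=len(dates))
df = pd.DataFrame({"date": dates, "sales": sales.clip(min=0)})

HORIZON = 15  # forecast window from the problem statement
# Lags shorter than the horizon would not be known at forecast time,
# so only lags of at least 15 days are used.
for lag in (21, 28):
    df[f"lag_{lag}"] = df["sales"].shift(lag)
dow = pd.get_dummies(df["date"].dt.dayofweek, prefix="dow", dtype=float)
features = pd.concat([df[["lag_21", "lag_28"]], dow], axis=1)

# Drop the first rows, where the lagged values do not exist yet.
valid = features.notna().all(axis=1)
X = features[valid].to_numpy(dtype=float)
y = df.loc[valid, "sales"].to_numpy()

# Time-ordered split: the last 15 days are held out, never shuffled.
X_train, X_val = X[:-HORIZON], X[-HORIZON:]
y_train, y_val = y[:-HORIZON], y[-HORIZON:]

# Ordinary least squares; the seven day-of-week dummies act as the intercept.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_val @ coef
print(f"validation RMSLE: {rmsle(y_val, pred):.3f}")
```

The same split and metric could be reused for XGBoost and RandomForest, so that the three models are compared on identical held-out days.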

### Hardware and Software

We will use the Python programming language, its libraries, and the Jupyter Notebook environment. We will do this project on our laptops, which have 8 GB of RAM; we expect this to be enough.

Several tools help with reading and processing large datasets, such as the Parquet and Feather file formats and the `datatable` library. We have already tried importing and processing our data, so we do not expect any problems with the data size. 
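
One simple way to reduce memory use, independent of the file format, is to load the columns with narrower types. The sketch below assumes the `train.csv` column layout described earlier and uses generated rows with made-up values; the chosen dtypes are our assumptions about the value ranges (for example, that `onpromotion` stays below 32,767).

```python
import io
import itertools
import pandas as pd

# Generate 900 rows in the layout of train.csv; values are illustrative.
families = ["AUTOMOTIVE", "BABY CARE", "BEAUTY"]
dates = pd.date_range("2013-01-01", periods=100, freq="D").strftime("%Y-%m-%d")
rows = list(itertools.product(dates, [1, 2, 3], families))
raw = pd.DataFrame(rows, columns=["date", "store_nbr", "family"])
raw.insert(0, "id", range(len(raw)))
raw["sales"] = 0.0
raw["onpromotion"] = 0
csv_text = raw.to_csv(index=False)

# Default parsing: 64-bit numbers and plain strings.
default = pd.read_csv(io.StringIO(csv_text))

# Narrower types chosen from the column descriptions.
compact = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    dtype={
        "id": "int32",
        "store_nbr": "int8",       # 54 stores fit in 8 bits
        "family": "category",      # a small set of repeated labels
        "sales": "float32",
        "onpromotion": "int16",    # assumed upper bound, see lead-in
    },
)

before = default.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(f"default: {before} bytes, compact: {after} bytes")
```

Saving the compact frame with `DataFrame.to_parquet` would keep these dtypes between sessions, but that call needs `pyarrow` or `fastparquet` installed.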

## Who will do what?

Ozan Yeşil -> Data Manipulation & Literature review: Gathering the different datasets and exploring them before modelling. Encoding where necessary, especially for date columns.  
Engin Pehlivan -> Modelling: Constructing the ML models mentioned above to predict and cluster.  
Abdulsamet Balveren -> Visualization & EDA: Heatmaps, correlation visualizations, and analysis. Feature engineering (e.g. using interactions of features, removing a column, or regularizing coefficients).
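
As an illustration of date encoding and an interaction feature, here is a minimal sketch. The helper `add_date_features`, the derived column names, and the three sample rows are hypothetical; only `date`, `family`, and `onpromotion` come from the dataset description.

```python
import pandas as pd

sample = pd.DataFrame({
    "date": ["2017-08-16", "2017-08-19", "2017-08-31"],
    "family": ["AUTOMOTIVE", "BEAUTY", "AUTOMOTIVE"],
    "onpromotion": [0, 2, 1],
})
sample["date"] = pd.to_datetime(sample["date"])

def add_date_features(df):
    """Return a copy of df with calendar features derived from the date column."""
    out = df.copy()
    out["day_of_week"] = out["date"].dt.dayofweek   # Monday = 0
    out["day_of_month"] = out["date"].dt.day
    out["month"] = out["date"].dt.month
    out["is_weekend"] = out["day_of_week"] >= 5
    out["is_month_end"] = out["date"].dt.is_month_end
    return out

encoded = add_date_features(sample)
# One-hot encode the product family.
encoded = pd.get_dummies(encoded, columns=["family"], prefix="family")
# Interaction feature: promoted items count only when the day is a weekend.
encoded["promo_x_weekend"] = encoded["onpromotion"] * encoded["is_weekend"]
print(encoded.drop(columns="date"))
```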


### Calendar

* Literature review: 1 week
* Data Cleaning & Exploratory Data Analysis: 2 weeks 
* Modelling & Evaluation: 2 weeks
* Final Analysis & Uploading: 10 days

**Literature review will not be done separately; it will run in parallel with the other steps.**

## Sample Datasets

```python
import warnings

from kaggle.api.kaggle_api_extended import KaggleApi

warnings.filterwarnings("ignore")

# Credentials are not hardcoded here: the Kaggle client reads them from the
# KAGGLE_USERNAME and KAGGLE_KEY environment variables or from ~/.kaggle/kaggle.json.
api = KaggleApi()
api.authenticate()

# Download the two sample files into the current directory.
api.competition_download_file('store-sales-time-series-forecasting',
                              'test.csv', path='./')
api.competition_download_file('store-sales-time-series-forecasting',
                              'oil.csv', path='./')
```

```
test.csv: Skipping, found more recently modified local copy (use --force to force download)
oil.csv: Skipping, found more recently modified local copy (use --force to force download)
```


```python
import pandas as pd

data = pd.read_csv('test.csv')
oil = pd.read_csv('oil.csv')
```

```python
data.head()
```

|   | id | date | store_nbr | family | onpromotion |
|---|---|---|---|---|---|
| 0 | 3000888 | 2017-08-16 | 1 | AUTOMOTIVE | 0 |
| 1 | 3000889 | 2017-08-16 | 1 | BABY CARE | 0 |
| 2 | 3000890 | 2017-08-16 | 1 | BEAUTY | 2 |
| 3 | 3000891 | 2017-08-16 | 1 | BEVERAGES | 20 |
| 4 | 3000892 | 2017-08-16 | 1 | BOOKS | 0 |


```python
oil.head()
```

|   | date | dcoilwtico |
|---|---|---|
| 0 | 2013-01-01 | NaN |
| 1 | 2013-01-02 | 93.14 |
| 2 | 2013-01-03 | 92.97 |
| 3 | 2013-01-04 | 93.12 |
| 4 | 2013-01-07 | 93.2 |
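
The oil sample has no price for 2013-01-01 and skips 2013-01-05 and 2013-01-06, while sales are recorded per calendar day. One possible way to align the two, shown here as a sketch and not as a settled modelling decision, is to reindex to daily frequency and interpolate. The name `daily` and the choice of time interpolation plus back-fill are our assumptions.

```python
import numpy as np
import pandas as pd

# The five rows shown in the sample above.
oil = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"]
    ),
    "dcoilwtico": [np.nan, 93.14, 92.97, 93.12, 93.20],
})

daily = (
    oil.set_index("date")
       .asfreq("D")                  # insert the missing calendar days as NaN
       .interpolate(method="time")   # linear in time between known prices
       .bfill()                      # leading gap: copy the first known price
       .reset_index()
)
print(daily)
```

Forward-filling the last known price is a common alternative that avoids using a later price to fill an earlier day.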


## Atabey's notes

I like the detailed descriptions of the problems you'd like to explore. However, there are some important details missing. For example, you need to provide a sample of the dataset(s) and explain the pieces. This is a fairly large dataset (~130 MB) and you must explain its structure. Also, your description of how you are going to apply the ML algorithms to find answers to your specific questions is not clear. I need more details. I also need to know how you are going to solve memory problems if the data is too large for your ML models. Moreover, the division of labor is not specific enough. What do you mean by data manipulation? How are you going to select/engineer features? What do you mean by visualization?

Looks good. Go ahead.