# Cleaning M4 data: Weekly Finance

***

This notebook imports the weekly time series used in the [M4](https://www.sciencedirect.com/journal/international-journal-of-forecasting/vol/36/issue/1) competition and extracts the series from the Finance domain.

**Data Breakdown**

* One dataset per data frequency:
    - Yearly
    - Quarterly
    - Monthly
    - Weekly
    - Daily
    - Hourly
* Each dataset contains series from at least one domain:
    - Micro
    - Industry
    - Macro
    - Finance
    - Demographic
    - Other

*Using this dataset for the paper would allow us to examine not only the effects of data protection on the accuracy of various forecasting models, but also how those effects interact with the traits of the data being forecast (data frequency, domain, seasonality, trend, etc.).*

**Data for Initial Modeling**

To get the forecasting models and protection methods implemented, we select the series from one data frequency and one domain. The `Weekly` frequency, `Finance` domain contains 164 series, which is a manageable number to work with.

***

## Steps

* **Step 1**: Import the weekly training data file `Weekly-train.csv` and the file `M4-info.csv`, which identifies the time series in each domain (e.g., Finance, Micro, etc.).
* **Step 2**: Using the series identifiers in `M4-info.csv`, select the `Finance` series from the `Weekly-train.csv` dataset.
* **Step 3**: Remove any time periods that contain missing values for any series - *we will take this step out in the future and will forecast the M4 testing data using all available training data*.
* **Step 4**: Save the `Weekly` frequency, `Finance` domain series to a `.csv` file.

**Step 1**

In [1]:
# import modules
import pandas as pd

In [2]:
# import training data for weekly time series
weekly_train = pd.read_csv("../../../Data/Train/Weekly-train.csv")

In [3]:
# import identifying file for time series
m4_info = pd.read_csv("../../../Data/M4-info.csv")

**Step 2**

From the M4 paper, we know there are 164 weekly time series in the Finance domain. This is a reasonable number for initially implementing our models and methods.

In [4]:
# store ids for Weekly time series in the Finance category
time_freq = "Weekly"
sector = "Finance"
ts_ids = m4_info.loc[(m4_info.SP == time_freq) & (m4_info.category == sector), "M4id"]

In [5]:
# subset weekly time series data for Finance domain
ts = weekly_train.loc[weekly_train.V1.isin(ts_ids),:]
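The selection in Steps 1 and 2 can be checked on a small synthetic example. The sketch below is a minimal illustration, assuming the column names used above (`M4id`, `category`, `SP`, `V1`). The toy values and the helper name `select_series` are hypothetical and are not part of the M4 files.

```python
import pandas as pd

# Toy stand-in for M4-info.csv (values are made up for illustration).
toy_info = pd.DataFrame({
    "M4id": ["W1", "W2", "W3", "D1"],
    "category": ["Finance", "Micro", "Finance", "Finance"],
    "SP": ["Weekly", "Weekly", "Weekly", "Daily"],
})

# Toy stand-in for Weekly-train.csv: an ID column V1 followed by observations.
toy_train = pd.DataFrame({
    "V1": ["W1", "W2", "W3"],
    "V2": [10.0, 20.0, 30.0],
    "V3": [11.0, 21.0, None],
})


def select_series(train, info, time_freq, sector):
    """Hypothetical helper: rows of `train` matching a frequency and domain."""
    mask = (info["SP"] == time_freq) & (info["category"] == sector)
    ids = info.loc[mask, "M4id"]
    return train.loc[train["V1"].isin(ids), :]


toy_ts = select_series(toy_train, toy_info, "Weekly", "Finance")

# Sanity check: no series ID is selected more than once.
assert toy_ts["V1"].is_unique
print(toy_ts["V1"].tolist())  # ['W1', 'W3']
```

On the real data, the analogous check would be that the number of selected rows matches the 164 series reported for this frequency and domain.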

**Step 3**

In [6]:
# Step 3 is currently disabled (commented out).
# When enabled, it counts the NA values in each column and
# keeps only the columns (time periods) with no missing values.
# ts = ts.loc[:, ts.isna().sum() == 0]
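To make the effect of the filter in Step 3 concrete, here is a minimal sketch on a made-up two-series frame (the values and the names `toy`, `complete`, and `toy_complete` are illustrative assumptions). It keeps only the columns in which no series has a missing value, which is what the commented-out line would do if enabled.

```python
import numpy as np
import pandas as pd

# Two toy series of unequal length; the shorter one is padded with NaN.
toy = pd.DataFrame({
    "V1": ["W1", "W2"],
    "V2": [1.0, 4.0],
    "V3": [2.0, 5.0],
    "V4": [3.0, np.nan],
})

# Boolean Series indexed by column name: True where the column has no NA.
complete = toy.isna().sum() == 0

# Keep only the complete columns (same pattern as the Step 3 line).
toy_complete = toy.loc[:, complete]
print(toy_complete.columns.tolist())  # ['V1', 'V2', 'V3']

# pandas' built-in dropna gives the same result for this case.
assert toy_complete.equals(toy.dropna(axis="columns"))
```

Note that this filter truncates every series to the time periods shared by all of them, so a single short series shortens the whole dataset.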

In [7]:
ts.shape

(164, 2598)

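We are left with 164 series and 2,598 columns: the first column (`V1`) holds the series IDs and the remaining 2,597 columns hold the measurements. Because the missing-value filter in Step 3 is commented out, any missing values remain in the data. Save the file with the series ID column removed.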

**Step 4**

In [8]:
ts.iloc[:, 1:].to_csv("../../../Data/Train/Clean/full_m4_weekly_finance_clean.csv", index = False)
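As a quick check of what the saved file looks like, the following sketch writes a toy frame to an in-memory buffer instead of disk and reads it back. The toy values and variable names are assumptions for illustration. It shows that the ID column is dropped and that missing values survive the round trip as `NaN`.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for `ts`: ID column followed by observations.
toy_ts = pd.DataFrame({
    "V1": ["W1", "W3"],
    "V2": [10.0, 30.0],
    "V3": [11.0, np.nan],
})

# Write everything except the first (ID) column, without the row index.
buffer = io.StringIO()
toy_ts.iloc[:, 1:].to_csv(buffer, index=False)

# Read the CSV text back, as a later notebook would read the saved file.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())  # ['V2', 'V3']
print(int(restored.isna().sum().sum()))  # 1
```

Since the IDs are not written, downstream code has to rely on row order to identify each series.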