# Cleaning M3 data: Monthly Micro

***

## Steps

* **Step 1**: Import the M3 monthly data from `monthly-m3.csv`.
* **Step 2**: Select the `MICRO` series using the `Category` variable.
* **Step 3**: Remove any time periods that contain missing values for any series. *This filter is currently commented out, so all available training data is kept; we plan to drop this step entirely and forecast the M3 test data using all available training data*.
* **Step 4**: Save the M3 `Monthly` `Micro` domain series to a `.csv` file.

**Step 1**

In [1]:
# import modules
import pandas as pd
import numpy as np
import os

In [2]:
# import training data for the M3 monthly time series
train = pd.read_csv("../../../Data/M3/monthly-m3.csv")

**Step 2**

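The `Category` variable identifies the domain of each series. There are 474 monthly series in the Micro category (see the shape check in Step 3), which seems like a decent number for initially implementing our models and methods. Note that the category labels are padded with trailing spaces, so the filter has to match the padded string exactly.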

In [3]:
train.Category.unique()

array(['MICRO       ', 'INDUSTRY    ', 'MACRO       ', 'FINANCE     ',
       'DEMOGRAPHIC ', 'OTHER       '], dtype=object)

In [4]:
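# select the monthly time series in the Micro category
# (the category labels are padded with trailing spaces)
category = "MICRO       "
# drop the first six columns and keep the rest
ts = train.loc[train.Category == category].iloc[:, 6:]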
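
Because the labels are padded, an exact match depends on the number of trailing spaces. A minimal sketch of a padding-independent selection, using a small made-up frame (the `Series` column and all values are illustrative assumptions, not the contents of the M3 file):

```python
import pandas as pd

# toy stand-in for the M3 table; column layout and values are made up
toy = pd.DataFrame({
    "Series": ["A", "B", "C"],
    "Category": ["MICRO       ", "FINANCE     ", "MICRO       "],
    "1": [10.0, 20.0, 30.0],
    "2": [11.0, 21.0, None],
})

# strip the padding before comparing, so the match does not depend on
# the exact number of trailing spaces
is_micro = toy["Category"].str.strip() == "MICRO"

# keep only the observation columns (here: everything after the first two)
micro = toy.loc[is_micro].iloc[:, 2:]
```

The same idea could be applied to `train.Category` in the cell above, although the exact-match version works as long as the padding in the file never changes.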

In [5]:
ts.sum(axis=0)

1      2263079.0
2      1863762.0
3      2057668.0
4      2102386.0
5      2068622.0
         ...    
140          0.0
141          0.0
142          0.0
143          0.0
144          0.0
Length: 144, dtype: float64
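
The zeros in the last columns deserve a second look. By default pandas skips missing values when summing, so a column that contains only `NaN` sums to `0.0`, and a zero here may simply mean that no series is long enough to reach that time period. A small sketch on made-up data showing how to tell the two cases apart:

```python
import numpy as np
import pandas as pd

# toy data: column "3" is entirely missing
toy = pd.DataFrame({
    "1": [5.0, 7.0],
    "2": [3.0, np.nan],
    "3": [np.nan, np.nan],
})

# default behaviour: NaN values are skipped, an all-NaN column sums to 0.0
default_sums = toy.sum(axis=0)

# min_count=1 requires at least one non-missing value, otherwise returns NaN
strict_sums = toy.sum(axis=0, min_count=1)

# number of missing values per column
n_missing = toy.isna().sum()
```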

**Step 3**

In [6]:
# count the NA values in each column and keep only the columns
# (time periods) with no missing values
# currently disabled so that all available data is kept
# ts = ts.loc[:, ts.isna().sum() == 0]
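
For reference, a minimal sketch of what the disabled filter would do, on made-up data. It also shows that the boolean-mask form gives the same result as `dropna(axis=1)`:

```python
import numpy as np
import pandas as pd

# toy data: only column "1" is complete
toy = pd.DataFrame({
    "1": [5.0, 7.0, 9.0],
    "2": [3.0, np.nan, 4.0],
    "3": [np.nan, np.nan, np.nan],
})

# keep the columns whose missing-value count is zero
kept_mask = toy.loc[:, toy.isna().sum() == 0]

# equivalent: drop every column that has at least one missing value
kept_dropna = toy.dropna(axis=1, how="any")
```

With series of different lengths, this kind of filter truncates every series to the length of the shortest one, which is why leaving it disabled keeps more training data.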

In [7]:
ts.shape

(474, 144)

We are left with 474 series and 144 columns. The first column is treated as row IDs, which leaves 143 measurements per series, and some series have missing values. Save the file with the row IDs removed.

**Step 4**

In [9]:
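path = "../../../Data/Train/Clean/"
# create the output directory if it does not already exist
os.makedirs(path, exist_ok=True)

# drop the first column and save without the DataFrame index
ts.iloc[:, 1:].to_csv(path + "full_m3_monthly_micro_clean.csv", index=False)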
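
A quick way to check a save step like this is to read the file back and compare shapes. The sketch below does that on made-up data, with a temporary directory and a hypothetical file name (`toy_clean.csv`) standing in for the real output path:

```python
import os
import tempfile

import pandas as pd

# toy data with one missing value; not the M3 series
toy = pd.DataFrame({"1": [1.0, 2.0], "2": [3.0, None]})

# a temporary directory stands in for the real output path
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "Clean")
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists

    out_file = os.path.join(out_dir, "toy_clean.csv")
    toy.to_csv(out_file, index=False)  # index=False: no extra index column

    # read the file back to confirm what was written
    restored = pd.read_csv(out_file)
```

Missing values are written as empty fields and come back as `NaN`, so the padded series survive the round trip.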