In [1]:
import os
os.chdir('../..')

In [2]:
import epicas

In the previous lab, we imported cleaned data and merged it into Epicas' `StructuredData`. If you have not read that lab and do not know what `StructuredData` is, please see it [here](https://github.com/caominhduy/epicas/docs/ipynb/1_loading_and_merging.ipynb).

In this one, we move on to the next important step of our pipeline: feature engineering. Don't worry, most of these steps are automated, so you should get to modeling very quickly!
![feature engineering meme](https://memegenerator.net/img/instances/70504510/all-feature-engineering-and-no-modeling-makes-quincy-a-dull-boy.jpg)

## Feature Engineering

Epicas implements `EpiData` for this purpose. In this lab, let's try to cover its main features.

First, let's reload `StructuredData`.

In [3]:
jhu = epicas.StructuredData(
        'demo/datasets/covid.xz',
        location = 'FIPS',
        date = 'date',
        incidence = 'confirmed_cases',
        )

mobility = epicas.StructuredData(
        'demo/datasets/mobility.csv.gz',
        location = 'FIPS',
        date = 'date'
        )

population = epicas.StructuredData(
        'Reichlab_Population.csv',
        location = 'location',
        usecols = ['location', 'population']
        )

merged = jhu + mobility + population

Ok, let's start!

### Load `EpiData` from `StructuredData`

To do this, we need to specify at least two arguments: a `StructuredData` object and `y`. `y` is just the name of our target variable (the time-series that we are trying to forecast) in `StructuredData`. However, to make the later parts of our pipeline more accurate and efficient, let's also specify `disease`. In this version, Epicas supports these infectious diseases:

+ 'influenza'
+ 'covid19'
+ 'covid19_alpha'
+ 'covid19_delta'
+ 'sars'
+ 'mers'
+ 'common_cold'
+ 'ebola'
+ 'measles'
+ 'mump'
+ 'hiv'
+ 'hantavirus'
+ 'polio'
+ 'chickenpox'

I strongly recommend picking the disease that is closest to the one you are forecasting. When in doubt, or when none of the above resembles your disease well, choose from the list below based on transmission type. Again, let's try to pick the closest one we can!

+ 'generic' (the worst-case choice, for when you are unsure)
+ 'generic_aerosol'
+ 'generic_body_fluid'
+ 'generic_fecal_oral'
+ 'generic_respiratory' (this includes droplets and aerosol)
+ 'generic_respiratory_droplet'

Each disease has its own typical incubation period, transmission type, etc., which Epicas takes care of. These settings have a smaller effect on training performance, but they should significantly reduce the computational cost of feature engineering. In many cases, they also increase model accuracy!

Since the data we have loaded is COVID-19 data and does not specify the variant, we are going with 'covid19'. We are also trying to forecast incident cases.

In [4]:
merged = epicas.EpiData(merged, y='incidence', disease='covid19')

Done!

### Imputation

Imputation is the process of replacing missing data points with substitutes. Since Epicas is built on top of Pandas, the available options will look familiar:

+ 'median': fill missing values with medians (this option is more robust to outliers)

+ 'mean': fill missing values with mean values

+ 'zero': fill missing values with 0

+ 'ffill': fill missing values by propagating the last valid observation forward

+ 'bfill': fill missing values with the next available observation

If no method is specified, the default is `median`. Let's try...

In [5]:
merged = merged.imputation()

Done! Alternatively, if we want to ffill...

In [6]:
merged = merged.imputation(method='ffill')

Remember: this technique only uses the observations we have to fill in the ones we do not have, which is a very naive solution. It cannot replace a good dataset.
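To make the five options concrete, here is a minimal sketch of what each one does to a single column, written in plain Pandas. The helper `impute` and the toy series are hypothetical and only for illustration; this is not Epicas' internal implementation, which may differ (for example, it may fill per location).

```python
import numpy as np
import pandas as pd

def impute(series: pd.Series, method: str = "median") -> pd.Series:
    """Hypothetical helper: fill missing values with one of five simple strategies."""
    if method == "median":
        return series.fillna(series.median())
    if method == "mean":
        return series.fillna(series.mean())
    if method == "zero":
        return series.fillna(0)
    if method == "ffill":
        return series.ffill()   # carry the last valid observation forward
    if method == "bfill":
        return series.bfill()   # pull the next valid observation backward
    raise ValueError(f"unknown imputation method: {method!r}")

# Toy series with two gaps and one outlier (100.0)
cases = pd.Series([1.0, np.nan, 3.0, 100.0, np.nan])

for name in ("median", "mean", "zero", "ffill", "bfill"):
    print(name, impute(cases, name).tolist())
```

Note how the median fill is unaffected by the outlier while the mean fill is pulled toward it, and how 'bfill' leaves a trailing gap unfilled because there is no later observation to copy.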

### Moving Average of Target Time-series

Many types of data inevitably fluctuate strongly over time. Some examples are stock prices, trade volumes, the price of Dogecoin (just kidding), etc. In our use case, epidemiological data also fluctuates heavily, especially during outbreaks. Taking a moving average (MA) smooths out the data, making trends easier for us humans to read, and may improve the overall performance of our model.

To take the MA of the target time-series:

In [7]:
merged = merged.target_to_ma(3)

In this example, we calculate the MA values of incidence and replace the original values in place. Notice the `3` that I passed as the first argument: it means "take moving averages with a window size of 3."
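For intuition, a window-3 moving average can be sketched in plain Pandas as below. The toy DataFrame, the `incidence_ma` column, the trailing (not centered) window, and `min_periods=1` are all assumptions made for this illustration; Epicas may handle window alignment and the first few rows differently.

```python
import pandas as pd

# Toy data: two locations, five days each (hypothetical values)
df = pd.DataFrame({
    "location": [1001] * 5 + [1003] * 5,
    "date": list(pd.date_range("2020-03-01", periods=5)) * 2,
    "incidence": [0, 3, 6, 3, 9, 10, 10, 40, 10, 10],
})

window = 3
df = df.sort_values(["location", "date"])

# Average each day with the (up to) two preceding days,
# computed separately per location so series do not bleed into each other.
df["incidence_ma"] = (
    df.groupby("location")["incidence"]
      .transform(lambda s: s.rolling(window, min_periods=1).mean())
)

print(df)
```

Grouping by location matters: without it, the first days of one county would be averaged with the last days of the previous one.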

### Normalization

There are many reasons to normalize data. One of them is that normalization makes our model less sensitive to the different scales of features. For example, population data consists of generally large whole numbers, while mobility data consists of real values between -1 and 1. A *very naive* model may put more weight on population data simply because its numbers are larger, which is very, very bad!

To normalize features, simply specify the subset of features! Let's normalize three columns: population, fb_movement_change, fb_stationary.

In [8]:
merged = merged.normalization(subset=['population', 'fb_movement_change', 'fb_stationary'])

Let's see if this worked.

In [9]:
print(merged)

EpiData['location', 'date', 'incidence', 'confirmed_cases_norm', 'fb_movement_change', 'fb_stationary', 'population']

         location       date  incidence  confirmed_cases_norm  \
0            1001 2020-02-15   0.000000                   0.0   
1            1001 2020-02-16   0.000000                   0.0   
2            1001 2020-02-17   0.000000                   0.0   
3            1001 2020-02-18   0.000000                   0.0   
4            1001 2020-02-19   0.000000                   0.0   
...           ...        ...        ...                   ...   
1344613     56045 2021-08-02  10.000000                 129.0   
1344614     56045 2021-08-03  10.000000                 144.0   
1344615     56045 2021-08-04   9.666667                 144.0   
1344616     56045 2021-08-05  10.000000                 144.0   
1344617     56045 2021-08-06   9.666667                 129.0   

         fb_movement_change  fb_stationary  population  
0                  0.439024       0.560976 

Nice, it worked! Notice how population is now narrowed to below 1, on par with mobility data.
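As a rough illustration of what rescaling does, here is a minimal min-max normalization sketch in plain Pandas. That Epicas uses min-max scaling specifically is an assumption on my part; the helper `min_max` and the toy values are hypothetical.

```python
import pandas as pd

def min_max(df: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Hypothetical helper: rescale each column in `subset` to the range [0, 1]."""
    out = df.copy()
    for col in subset:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        # A constant column has no spread; map it to 0 to avoid dividing by zero.
        out[col] = 0.0 if span == 0 else (out[col] - lo) / span
    return out

toy = pd.DataFrame({
    "population": [50_000, 200_000, 1_000_000],
    "fb_movement_change": [-0.5, 0.0, 0.5],
})

scaled = min_max(toy, ["population", "fb_movement_change"])
print(scaled)
```

After scaling, both columns span the same range, so neither dominates just because of its units.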

### Lag Reduction

Finally! We are getting to the fun part. First, what is lag? Lag is the delay between a change in a regressor and the corresponding change in the target variable. For example, intuitively, if mobility peaked on some specific day, the resulting cases would only be recorded n days later (because of the incubation period, testing time, delays in test reports, etc.).

Assume we suspect that it takes up to 21 days for a change in fb_movement_change to be reflected in incident cases.

In [10]:
merged = merged.lag_reduction(subset=['fb_movement_change'], sliding_window=21)

Optimal shift for fb_movement_change: 10


That was quick! As you can see, we do not need to guess the exact number of days: we only give an upper bound, and Epicas returns the best lag it finds within that window. However, a larger window gives it more work to do and takes longer, so choose wisely...

If you do not have a preference, Epicas will pick a window for you, since you have already specified a disease type. Let's do the same with fb_stationary, except that we do not give it a specific range this time.

In [11]:
merged = merged.lag_reduction(subset=['fb_stationary'])

Optimal shift for fb_stationary: 12


Done!
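One common way to find such a shift is a lagged-correlation scan: shift the feature by 0, 1, ..., `max_lag` days and keep the shift whose correlation with the target is strongest. The sketch below illustrates that idea on synthetic data. The helper `best_lag` and the selection criterion (absolute Pearson correlation) are assumptions for illustration; the source does not state which criterion Epicas actually uses.

```python
import numpy as np
import pandas as pd

def best_lag(feature: pd.Series, target: pd.Series, max_lag: int) -> int:
    """Hypothetical helper: shift in 0..max_lag maximizing |Pearson correlation|."""
    scores = {}
    for lag in range(max_lag + 1):
        # shift(lag) aligns the feature value from `lag` days ago with today's target;
        # Series.corr ignores rows where either side is missing.
        r = feature.shift(lag).corr(target)
        if not np.isnan(r):
            scores[lag] = abs(r)
    return max(scores, key=scores.get)

# Synthetic example: the target follows the feature with a 10-day delay plus noise.
rng = np.random.default_rng(0)
mobility = pd.Series(rng.normal(size=200))
cases = mobility.shift(10) + 0.1 * pd.Series(rng.normal(size=200))

print("Optimal shift:", best_lag(mobility, cases, max_lag=21))
```

The loop also shows why a larger window costs more: every extra candidate lag means one more pass over the data.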

### and more...

We have covered most of the important feature engineering techniques that Epicas supports, but that's not everything. I suggest reading our documentation for others, such as:

- Feature Selection
- Outlier Removal
- Cumulative to Incidence Conversion

Thank you for reading and see you next time in modeling with **ARIMA** (**A**uto**r**egressive **I**ntegrated **M**oving **A**verage)!