# Summary

This notebook walks through the process of analyzing time series data in the neighborhood of "events". In this case, the events of interest are windows of time in which individuals report experiencing influenza-like illness (ILI). Specifically, we pair these events with daily features computed from raw data collected by a commercial Fitbit device in order to quantify the impact of ILI on behavior and physiology.


**Learning Objectives:**
1. Combine reported maximal symptom dates with passively measured daily features in order to construct analysis windows for each individual
2. Use time series visualization techniques to better understand individual and population responses to ILI across measurement dimensions
3. Use a fixed effects regression framework to estimate the average ILI impact trajectory in the neighborhood of an ILI event
4. Construct a rudimentary machine learning pipeline to differentiate between ILI and control windows

**Notes**
- For this analysis we will use simulated data rather than actual data pulled from individuals, in order to preserve their privacy. The underlying distribution of features is reasonably similar to what we observe empirically using actual ILI event data.

# Dependencies

We'll be using some standard data analysis libraries for this analysis.

In [None]:
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
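# Use the full option names; abbreviated patterns can match more than one option in newer pandas versions
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)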
sns.set(style='whitegrid')

%matplotlib inline

In [None]:
!pip freeze | grep -i lin

In [None]:
data_dir = '/Users/bbradshaw/'

# Reading in the data

For this analysis there are two fundamental data components:
1. **User reported events:** This table contains one row per user, corresponding to the date on which the user reported that their symptoms were at their worst.
2. **User Fitbit features:** This table contains multiple rows per user, corresponding to features derived from raw Fitbit data. There are 29 days per user. In econometric parlance, we would call this a *balanced panel*.

In [None]:
events = pd.read_csv(os.path.join(data_dir, 'user_events.csv'), parse_dates=['event_date'])
features = pd.read_csv(os.path.join(data_dir, 'features.csv'), parse_dates=['date'])
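The "29 days per user" claim is easy to verify before relying on it. The cell below is a minimal sketch on a hypothetical toy table (`toy_features` and `is_balanced_panel` are illustrative names, not part of the original notebook). It checks only that every user has the same number of distinct dates and no duplicate user-date rows, which is the sense of "balanced" we need here.

```python
import pandas as pd

# Hypothetical toy panel: 3 users x 29 consecutive days each (a stand-in for `features`).
toy_features = pd.DataFrame({
    'user_id': [u for u in ['a', 'b', 'c'] for _ in range(29)],
    'date': list(pd.date_range('2020-01-01', periods=29)) * 3,
})


def is_balanced_panel(df, unit_col='user_id', time_col='date', expected_periods=29):
    """Return True if every unit has exactly `expected_periods` distinct dates and no duplicate rows."""
    no_duplicates = not df.duplicated([unit_col, time_col]).any()
    periods_per_unit = df.groupby(unit_col)[time_col].nunique()
    return bool(no_duplicates and (periods_per_unit == expected_periods).all())


balanced = is_balanced_panel(toy_features)
unbalanced = is_balanced_panel(toy_features.iloc[:-1])  # dropping one row breaks the balance
```

The same function could be pointed at the real `features` table, assuming it has `user_id` and `date` columns as used elsewhere in this notebook.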

## Doing a quick inspection of the data

Let's take a peek at each of our tables to ensure we have an idea of what sort of data we are dealing with.

In [None]:
# Let's inspect the events table
events.head()

In [None]:
# Let's look at the distribution of reported ILI event dates
events.event_date.hist(density=True)

It appears that flu season peaked around February. These event dates are simulated; in actuality you wouldn't see such a symmetric "normal" distribution for flu incidence, since the rate of increase leading up to peak flu season likely won't mirror the rate of decrease after the peak.

In [None]:
events.groupby('user_id').size().max(), events.user_id.nunique()

As expected, there is one row per user. Next, let's take a look at the features.

In [None]:
features.head()

There are three features we will be using throughout the analysis:
1. `steps_sum`: The daily sum of steps walked for a user on a given date
2. `sleep_disturbances`: The estimated number of sleep disturbances measured on a given night's sleep
3. `resting_heart_rate`: A user's estimated resting heart rate on a given date

## Aligning user time series to a common distance from event

One of the issues we have here is that we need to "align" the behavioral and physiological features with the event dates reported by users. The idea is that if we know approximately when flu events took place, we can then investigate interesting anomalies within the time series surrounding those events.

We'll do precisely that! For each user-date, we will generate a "relative index date", which is simply the integer-valued number of days from the reported peak symptom date (negative values imply dates before the reported peak symptom date, positive values after).

In [None]:
# First join each user's event date with features
features = features.merge(events, on='user_id', how='inner')

In [None]:
# Now we have dates and event dates
features.head()

In [None]:
# Whole number of days between each observation and the reported peak symptom date
features['relative_idx'] = (features.date - features.event_date).dt.days

In [None]:
features.head()

Perfect! We now have a `relative_idx` variable that specifies, for each user-day observation, how far (in days) that day is from peak symptom severity. We do this so that we can align the activity data with a consistent notion of when the event occurred across users.
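To make the alignment concrete, here is a minimal sketch on two hypothetical users with different event dates (`toy_events` and `toy_features` are made-up tables that mirror the column names used above). The `validate='many_to_one'` argument is an optional safeguard: pandas raises an error if any user has more than one event row.

```python
import pandas as pd

# Hypothetical toy tables mirroring the columns used in this notebook.
toy_events = pd.DataFrame({
    'user_id': ['a', 'b'],
    'event_date': pd.to_datetime(['2020-02-10', '2020-03-01']),
})
toy_features = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b', 'b', 'b'],
    'date': pd.to_datetime(['2020-02-09', '2020-02-10', '2020-02-11',
                            '2020-02-29', '2020-03-01', '2020-03-02']),
})

# Attach each user's event date to every one of their daily rows.
aligned = toy_features.merge(toy_events, on='user_id', how='inner', validate='many_to_one')

# Days from the reported peak: negative before the peak, zero on it, positive after.
aligned['relative_idx'] = (aligned['date'] - aligned['event_date']).dt.days
```

Although the two users were sick on different calendar dates, both now share the same relative axis running from -1 to 1.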

# Preliminary time series analysis: visualization

In general, the best approach to analysis is to start with the simplest possible approach that makes sense. Many times, *plotting* data in a reasonable manner is a great way to get an understanding of the underlying dynamics of the problem at hand. We'll do just that.

Our approach will be as follows:
- For each user create a `relative_idx` column that specifies how far a day is from the peak reported symptom date (we already did this)
- Plot the mean value across all user time series (for each feature) and use the bootstrap to estimate a confidence interval around the mean
- Take a look at how feature time series change in the neighborhood of ILI events
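The bootstrap step in the list above can be sketched by hand. The cell below is a minimal percentile bootstrap for the mean of a single group of values, which is roughly the calculation we expect the plotting library to repeat at each `relative_idx`. The function name `bootstrap_mean_ci` and the simulated step counts are assumptions made purely for illustration.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of `idx` is one resample (with replacement) of the original observations.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower = (100 - ci) / 2
    low, high = np.percentile(boot_means, [lower, 100 - lower])
    return low, high


# Hypothetical daily step counts for 200 users on a single relative day.
rng = np.random.default_rng(42)
toy_steps = rng.normal(loc=8000, scale=2000, size=200)
low, high = bootstrap_mean_ci(toy_steps)
```

The interval narrows as more users contribute to a given day, which is why the bands in population-level plots can look tight even when individual trajectories are noisy.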

Let's implement!

In [None]:
# Seaborn actually makes this quite easy
for f in ['steps_sum', 'sleep_disturbances', 'resting_heart_rate']:
    plt.figure(figsize=(20,12))
    sns.lineplot(x='relative_idx', y=f, data=features, ci=95, n_boot=10000, color='purple', alpha=0.6)

Wow! So there is clearly some signal here. A few observations:
- Average steps decrease in the neighborhood of a flu event
- Average sleep disturbances increase in the neighborhood of a flu event
- Average resting heart rate increases in the neighborhood of a flu event

Note that here we are making observations about the mean, not about individual responses to ILI. It may be useful to plot a random sample of *individual* time series feature trajectories.

In [None]:
# Overlay a 1% random sample of individual user trajectories on the population mean
for f in ['steps_sum', 'sleep_disturbances', 'resting_heart_rate']:
    plt.figure(figsize=(20,12))
    sns.lineplot(
        x='relative_idx',
        y=f,
        data=features,
        ci=95,
        n_boot=10000,
        color='purple',
        alpha=0.8
    )

    sns.lineplot(
        x='relative_idx',
        y=f,
        data=features.merge(events[['user_id']].sample(frac=0.01, random_state=42), on='user_id'),
        color='black',
        alpha=0.6,
        units='user_id',
        estimator=None
    )

One point the graphs above make is that even though the average of the feature trajectories shows a clear pattern, individual trajectories are quite noisy. This is something to keep in mind if we attempt to build a *prediction* model that distinguishes windows of time containing a flu event from control windows that do not.

# Estimating the impact of ILI events on behavior and physiology: Fixed effects regression

While the bootstrap method above is great for a first-pass exploration, it isn't a robust analytical framework that allows us to make strong inferential claims. One way we can model how a feature changes in the neighborhood of an ILI event is to use what econometricians call the "fixed effects estimator" or the "within estimator". The idea is that since we have multiple measurements per subject, a standard pooled OLS model would be unreliable: its residuals are no longer independent of one another, since blocks of observations are generated by a single individual. We won't go into the details of fixed effects regression modeling (indeed, entire classes are taught on the subject). The idea here is that we will model the average value of the feature as a function of distance from the peak symptom date, while accounting for unobserved heterogeneity that is fixed at the level of the individual.
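To build some intuition for the "within" idea, here is a minimal sketch that applies it by hand to simulated data. Everything in it is an assumption made for illustration: the user baselines, the `true_effect` values, and the noise level are invented, and the sketch reports only point estimates with no standard errors. Each user's own mean is subtracted from both the outcome and the relative-day dummies, which removes anything that is constant within a user, and ordinary least squares is then run on the demeaned data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users = 500
days = np.arange(-3, 4)  # relative_idx from -3 to 3

# Assumed "true" shift in resting heart rate on each relative day (illustrative numbers only).
true_effect = dict(zip(days.tolist(), [0.0, 0.5, 1.5, 3.0, 2.0, 1.0, 0.3]))

toy = pd.DataFrame({
    'user_id': np.repeat(np.arange(n_users), len(days)),
    'relative_idx': np.tile(days, n_users),
})
user_baseline = rng.normal(60, 8, size=n_users)  # unobserved, fixed per user
toy['resting_heart_rate'] = (
    user_baseline[toy['user_id'].to_numpy()]
    + toy['relative_idx'].map(true_effect).to_numpy()
    + rng.normal(0, 1, size=len(toy))
)

# One dummy per relative day, dropping day -3 as the reference category.
X = pd.get_dummies(toy['relative_idx'], prefix='day', dtype=float).drop(columns='day_-3')
y = toy['resting_heart_rate']

# Within transformation: subtract each user's own mean from every column.
X_within = X - X.groupby(toy['user_id']).transform('mean')
y_within = y - y.groupby(toy['user_id']).transform('mean')

# OLS on the demeaned data; each coefficient is relative to day -3.
coefs, *_ = np.linalg.lstsq(X_within.to_numpy(), y_within.to_numpy(), rcond=None)
estimated = pd.Series(coefs, index=X.columns)
```

In this simulation the large differences in user baselines (a standard deviation of 8 beats per minute) drop out entirely, and the estimated day effects land close to the assumed ones even though the per-day noise is small relative to the spread between users.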

Luckily, there is already a Python library that implements the estimation routine for us: `linearmodels`.

In [None]:
import linearmodels