In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%run ../preprocessing.py
pd.set_option("display.max_columns", 8)

# 1. Processing cycles information
- Each cycle of each woman needs to be expanded so that there is one row per calendar day of the cycle, with a boolean indicator for period days, the `cycle_id`, and the day in cycle. The `expand_cycle` function does this, e.g.:

In [3]:
cycles.head(1)

Unnamed: 0,user_id,cycle_id,cycle_start,cycle_length,period_length,expected_cycle_length
0,03009100-a1fa-4fad-bf9b-6102c690f3be,1,2016-11-27,24.0,1.0,29.0


In [4]:
expand_cycle(cycles.iloc[0])

user_id,date,cycle_id,day_in_cycle,period
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-11-27,1,1,1
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-11-28,1,2,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-11-29,1,3,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-11-30,1,4,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-12-01,1,5,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-12-02,1,6,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-12-03,1,7,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-12-04,1,8,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-12-05,1,9,0
03009100-a1fa-4fad-bf9b-6102c690f3be,2016-12-06,1,10,0


- With the function `expand_cycles` we can iterate over all the cycles:

In [5]:
# Expand cycles so that there is a line per date (active or not) with a boolean indicator of period
cycles_processed = expand_cycles(cycles)
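
The real `expand_cycle` and `expand_cycles` live in `preprocessing.py` and are not shown here. The following is only a minimal sketch of how such an expansion could work, assuming the period occupies the first `period_length` days of each cycle. The names `expand_cycle_sketch`, `expand_cycles_sketch`, and `toy_cycles` are hypothetical.

```python
import pandas as pd


def expand_cycle_sketch(cycle):
    """Expand one cycle row into one row per calendar day (hypothetical helper)."""
    n_days = int(cycle["cycle_length"])
    days = list(range(1, n_days + 1))
    expanded = pd.DataFrame(
        {
            "user_id": cycle["user_id"],
            "date": pd.date_range(cycle["cycle_start"], periods=n_days, freq="D"),
            "cycle_id": cycle["cycle_id"],
            "day_in_cycle": days,
            # Assumption: the period covers the first `period_length` days.
            "period": [int(d <= cycle["period_length"]) for d in days],
        }
    )
    return expanded.set_index(["user_id", "date"])


def expand_cycles_sketch(cycles):
    """Apply the single-cycle expansion to every row and stack the results."""
    return pd.concat([expand_cycle_sketch(row) for _, row in cycles.iterrows()])


# Toy input with the same columns the expansion relies on.
toy_cycles = pd.DataFrame(
    {
        "user_id": ["u1", "u1"],
        "cycle_id": [1, 2],
        "cycle_start": pd.to_datetime(["2016-11-27", "2016-12-01"]),
        "cycle_length": [4.0, 3.0],
        "period_length": [1.0, 2.0],
    }
)
toy_expanded = expand_cycles_sketch(toy_cycles)
```

Row-wise iteration is easy to read but slow on large tables, so a production version would probably vectorize this step.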

# 2. Processing tracked symptoms
- We need to one-hot encode the symptoms and aggregate them so that there is one line per user per active day. Each column is a boolean indicator of a tracked symptom for a given user and active day. A line can have more than one tracked symptom.

In [6]:
# Expand tracking so that there is a line per active date with one-hot encoded symptoms.
tracking_processed = process_tracking(tracking)
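
The implementation of `process_tracking` is not shown in this notebook. As a rough sketch of the one-hot encoding and aggregation described above, assume the raw log is in long format with `user_id`, `date`, and `symptom` columns. That schema, and the names `toy_tracking` and `process_tracking_sketch`, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format tracking log: one row per (user, date, symptom).
toy_tracking = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "date": pd.to_datetime(
            ["2016-10-05", "2016-10-05", "2016-10-07", "2016-10-05"]
        ),
        "symptom": ["happy", "pms", "sad", "happy"],
    }
)


def process_tracking_sketch(tracking):
    """One-hot encode symptoms and collapse to one row per user and active day."""
    dummies = pd.get_dummies(tracking["symptom"], dtype=float)
    encoded = pd.concat([tracking[["user_id", "date"]], dummies], axis=1)
    # max() keeps a 1.0 if the symptom was tracked at least once that day.
    return encoded.groupby(["user_id", "date"]).max()


toy_tracking_processed = process_tracking_sketch(toy_tracking)
```

Aggregating with `max` rather than `sum` keeps every column a 0/1 indicator even if the same symptom is logged twice on one day.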

# 3. Merging extracted features from cycles and tracking
- The merge brings in the boolean indicator of period days from the cycles and guarantees that every inactive day is also included. Inactive days matter because an RNN needs to take evolution over time into account. Intuitively speaking, the RNN "memorizes" how many inactive days typically fall between tracking activity.
- The first 81 columns refer to symptoms, where the first 16 are the symptoms to be predicted.
- The last 3 columns refer to cycle information: `cycle_id`, `day_in_cycle` and `period`.

In [7]:
# Merge cycles and tracking information
features = pd.merge(
    tracking_processed,
    cycles_processed,
    left_index=True, right_index=True, how='outer'
).fillna(0)

In [8]:
features

user_id,date,happy,pms,sad,sensitive_emotion,...,pregnancy_test_pos,cycle_id,day_in_cycle,period
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-05,0.0,0.0,0.0,0.0,...,0.0,3,1,1
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-06,0.0,0.0,0.0,0.0,...,0.0,3,2,1
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-07,0.0,0.0,0.0,0.0,...,0.0,3,3,1
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-08,0.0,0.0,0.0,0.0,...,0.0,3,4,1
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-09,0.0,0.0,0.0,0.0,...,0.0,3,5,1
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-10,0.0,0.0,0.0,0.0,...,0.0,3,6,0
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-11,0.0,0.0,0.0,0.0,...,0.0,3,7,0
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-12,0.0,0.0,0.0,0.0,...,0.0,3,8,0
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-13,0.0,0.0,0.0,0.0,...,0.0,3,9,0
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-14,0.0,0.0,0.0,0.0,...,0.0,3,10,0
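
To illustrate why the outer merge keeps inactive days, here is a small self-contained sketch on made-up data. The frames `toy_tracking` and `toy_cycle_days` are hypothetical stand-ins for `tracking_processed` and `cycles_processed`.

```python
import pandas as pd

index_names = ["user_id", "date"]

# Hypothetical tracking features: the user was active on a single day.
toy_tracking = pd.DataFrame(
    {"happy": [1.0]},
    index=pd.MultiIndex.from_tuples(
        [("u1", pd.Timestamp("2016-10-06"))], names=index_names
    ),
)

# Hypothetical expanded cycle: three consecutive calendar days.
toy_cycle_days = pd.DataFrame(
    {"cycle_id": [1, 1, 1], "day_in_cycle": [1, 2, 3], "period": [1, 1, 0]},
    index=pd.MultiIndex.from_product(
        [["u1"], pd.date_range("2016-10-05", periods=3, freq="D")],
        names=index_names,
    ),
)

# The outer join keeps every calendar day; days without tracking get NaN
# symptom values, which fillna(0) turns into "symptom not tracked".
toy_features = pd.merge(
    toy_tracking, toy_cycle_days, left_index=True, right_index=True, how="outer"
).fillna(0)
```

With an inner join only the single active day would survive, and the sequence would lose the gaps between tracking events.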


# 4. Enumerate days for each user
- Tracking behavior evolves with time. Some users might be more active at first and then use the app less. The RNN can use how long a user has been using the app to estimate whether she will keep tracking symptoms.
- The first step for getting this information is to find the first day each user started using the app:

In [9]:
# Find the first day the user started using the app
# Earliest cycle_start per user, kept as a one-column frame named 'first_use'
first_use = (
    cycles.groupby('user_id')['cycle_start']
          .min()
          .rename('first_use')
          .to_frame()
)
features = pd.merge(
    features,
    first_use,
    left_index=True,
    right_index=True
)

- Get the absolute day by subtracting `first_use` from the date on the index

In [10]:
# Find the absolute day for each row, counting from the day the user started using the app (day 1)
flat = features.reset_index()
absolute_day = (flat['date'].dt.normalize() - flat['first_use'].dt.normalize()).dt.days + 1
# Get back the index from features
absolute_day.index = features.index
features['absolute_day'] = absolute_day
# We no longer need 'first_use' column
features = features.drop('first_use', axis=1)

In [11]:
features

user_id,date,happy,pms,sad,sensitive_emotion,...,cycle_id,day_in_cycle,period,absolute_day
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-05,0.0,0.0,0.0,0.0,...,3,1,1,1
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-06,0.0,0.0,0.0,0.0,...,3,2,1,2
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-07,0.0,0.0,0.0,0.0,...,3,3,1,3
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-08,0.0,0.0,0.0,0.0,...,3,4,1,4
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-09,0.0,0.0,0.0,0.0,...,3,5,1,5
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-10,0.0,0.0,0.0,0.0,...,3,6,0,6
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-11,0.0,0.0,0.0,0.0,...,3,7,0,7
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-12,0.0,0.0,0.0,0.0,...,3,8,0,8
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-13,0.0,0.0,0.0,0.0,...,3,9,0,9
00035291-ebfc-4ad3-83da-f83e0cfceeb9,2016-10-14,0.0,0.0,0.0,0.0,...,3,10,0,10
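
The day enumeration above can also be checked on a tiny made-up example. This is only a sketch: `toy_features` and `toy_first_use` are hypothetical, and the lookup goes through the `user_id` index level instead of a merge.

```python
import pandas as pd

# Hypothetical per-day features for two users, indexed like `features`.
toy_index = pd.MultiIndex.from_tuples(
    [
        ("u1", pd.Timestamp("2016-10-05")),
        ("u1", pd.Timestamp("2016-10-07")),
        ("u2", pd.Timestamp("2016-11-01")),
    ],
    names=["user_id", "date"],
)
toy_features = pd.DataFrame({"period": [1, 0, 1]}, index=toy_index)

# Hypothetical first-use dates, one per user.
toy_first_use = pd.Series(
    pd.to_datetime(["2016-10-05", "2016-10-30"]),
    index=pd.Index(["u1", "u2"], name="user_id"),
    name="first_use",
)

# Look up each row's first-use date through its user_id index level.
user_ids = toy_features.index.get_level_values("user_id")
dates = toy_features.index.get_level_values("date")
first_use_per_row = toy_first_use.reindex(user_ids)

# +1 so that the first day of use counts as day 1 rather than day 0.
toy_features["absolute_day"] = (dates - first_use_per_row.to_numpy()).days + 1
```

Here the first user gets absolute days 1 and 3, and the second user, whose first use was two days before the row's date, gets day 3.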
