# Bed occupancy analysis using dcarte 

This notebook shows how to use dcarte to examine bed occupancy in the UKDRI CRT in-house dataset.


In [3]:
!pip install -U dcarte

Collecting dcarte
  Downloading dcarte-0.3.31-py3-none-any.whl (26 kB)
Installing collected packages: dcarte
  Attempting uninstall: dcarte
    Found existing installation: dcarte 0.3.30
    Uninstalling dcarte-0.3.30:
      Successfully uninstalled dcarte-0.3.30
Successfully installed dcarte-0.3.31


In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import os
import sys
# To use a cloned copy of the repository instead of an installed dcarte,
# add its path before this import, e.g. sys.path.insert(0, os.path.abspath("."))
import dcarte
# This line allows live changes to the imported file 
# %load_ext autoreload
# %autoreload 2

## Step 1 - load raw sleep dataset using dcarte

In [8]:
assert dcarte.__version__ == '0.3.31'


In [2]:
dcarte.domains()

Unnamed: 0,RAW,LEGACY,BASE,PROFILE,SLEEP_STUDY,LOOKUP,BED_HABITS
0,Activity,Device_Type,Doors,Activity_Dailies,Diurnal,Device_Types,Bed_Occupancy
1,Appliances,Doors,Entryway,Activity_Weeklies,Nocturnal,Homes,
2,Behavioural,Entryway,Habitat,Light,Whitings_Tidy,Patients,
3,Blood_Pressure,Flags,Kitchen,Physiology_Dailies,Withings_Nights,,
4,Door,Light,Motion,Physiology_Weeklies,Withings_Tidy,,
5,Encounter,Motion,Physiology,Sleep_Dailies,,,
6,Environmental,Observation,Sleep,Sleep_Weeklies,,,
7,Issue,Physiology,Transitions,Temperature,,,
8,Observation_Notes,Temperature,,,,,
9,Procedure,Wellbeing,,,,,


In [9]:
dcarte.load('Patients','lookup')

Finished Loading Patients in:                  0.1 seconds   


Unnamed: 0,id,birth_date,gender,organization,source
0,XvG4pZADpDNdWsQfKrAt9a,1939.0,f,ELM4RQF8R9dsU3CZEgtiTW,patients
1,5uPVPxLHYj7STkzq6uPN7r,1944.0,f,ELM4RQF8R9dsU3CZEgtiTW,patients
2,N4ii3j5ZdckoqLSYheVNuF,1944.0,m,ELM4RQF8R9dsU3CZEgtiTW,patients
3,LR6NDDApNPnbtFhjbhvnwb,1942.0,f,ELM4RQF8R9dsU3CZEgtiTW,patients
4,VVSj98uim1cjDR5ERK6kRy,1942.0,m,ELM4RQF8R9dsU3CZEgtiTW,patients
...,...,...,...,...,...
144,RxCHTWbYDmuRtYBfVxGtfP,1960.0,m,ELM4RQF8R9dsU3CZEgtiTW,patients
145,LkREF5vWKVcNDP5EkPdJu2,1937.0,f,Aqp5EaBSvDSXs6dYgUddWR,patients
146,9MKzv1CrANBHCfwTJ3pKDS,1945.0,f,ELM4RQF8R9dsU3CZEgtiTW,patients
147,KNdFgnVbDHDVXaSNeCoWGG,1935.0,m,ELM4RQF8R9dsU3CZEgtiTW,patients


In [3]:
df = dcarte.load('sleep','base',update=True)

Finished Loading sleep_mat in:                 1.9 seconds   
Finished Loading sleep in:                     3.9 seconds   


## Step 2 - Inspect the dataset and select columns of interest
As we are interested only in bed occupancy, we can discard much of the loaded data.
The original dataset requires more than 700 MB of memory; after the selection below, this drops to less than 200 MB.

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10323249 entries, 0 to 10323248
Data columns (total 9 columns):
 #   Column            Dtype         
---  ------            -----         
 0   start_date        datetime64[ns]
 1   patient_id        object        
 2   home_id           object        
 3   state             category      
 4   heart_rate        float64       
 5   respiratory_rate  float64       
 6   snoring           object        
 7   source            object        
 8   timezone          object        
dtypes: category(1), datetime64[ns](1), float64(2), object(5)
memory usage: 718.7+ MB


### Clean dataset
In this step we:
- Explicitly declare the factors of interest
- Add an `activity` column set to `True` for every row, as evidence that the sensor was activated, using the pandas `assign` function
- Convert the randomly generated patient id to a categorical type

In [22]:
factors = ['patient_id','start_date']
df = df[factors].assign(activity=True)
df.patient_id = df.patient_id.astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10323249 entries, 0 to 10323248
Data columns (total 3 columns):
 #   Column      Dtype         
---  ------      -----         
 0   patient_id  category      
 1   start_date  datetime64[ns]
 2   activity    bool          
dtypes: bool(1), category(1), datetime64[ns](1)
memory usage: 177.2 MB
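
Part of the saving comes from dropping columns; the rest comes from storing the repeated patient ids as a category. The sketch below illustrates the second effect on synthetic data only: the ids, the row count and the variable names are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sleep table: 100,000 rows, 10 hypothetical patient ids
rng = np.random.default_rng(0)
ids = [f"patient_{i:02d}" for i in range(10)]
toy = pd.DataFrame({
    "patient_id": rng.choice(ids, size=100_000),
    "start_date": pd.date_range("2021-01-01", periods=100_000, freq="min"),
})

# deep=True also counts the bytes of the strings held in the column
as_object = toy.patient_id.memory_usage(deep=True)
as_category = toy.patient_id.astype("category").memory_usage(deep=True)
print(f"string ids: {as_object / 1e6:.2f} MB, category ids: {as_category / 1e6:.2f} MB")
```

A categorical column stores each distinct id once plus a small integer code per row, which is why it pays off when a few ids repeat across millions of rows.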


## Step 3 - downsample to day occupancy  
In this step we use method chaining (sometimes called piping) to link several operations that are executed sequentially.  
We do the following:  
- Set the index of the dataframe to start_date, which is of type datetime64, so that we can use pandas datetime functions 
- Use the between_time function to keep only records between 18:00 and 12:00, which filters out naps 
- Use groupby with patient_id to resample each patient independently 
- Use resample to aggregate measures daily, with the offset argument specifying that each window runs from midday to midday
- Apply this only to the activity column: sum all true values in each 24-hour window and divide by 60 to convert the original minute-level samples to hours
- Assign a unit column to clarify the metric
- Finally, rename the new aggregate to Time_in_bed to reflect what we are measuring 

In [23]:
df = (df.set_index('start_date').
         between_time('18:00', '12:00').
         groupby(['patient_id']).
         resample('24h', offset="12h").
         activity.sum()/60).reset_index()
df = df.assign(unit='hour')
df = df.rename(columns={'activity': 'Time_in_bed'})
df

Unnamed: 0,patient_id,start_date,Time_in_bed,unit
0,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-28 12:00:00,4.166667,hour
1,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-29 12:00:00,4.650000,hour
2,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-30 12:00:00,7.716667,hour
3,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-31 12:00:00,0.000000,hour
4,2GN1PHeHwRzNYQ7q4Nvg7g,2021-06-01 12:00:00,0.000000,hour
...,...,...,...,...
22871,YYGGmsEaf2gP4JNN9kMprV,2021-12-25 12:00:00,11.716667,hour
22872,YYGGmsEaf2gP4JNN9kMprV,2021-12-26 12:00:00,8.233333,hour
22873,YYGGmsEaf2gP4JNN9kMprV,2021-12-27 12:00:00,12.133333,hour
22874,YYGGmsEaf2gP4JNN9kMprV,2021-12-28 12:00:00,7.900000,hour
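
To see what the chain above does on a small scale, here is a minimal sketch on hand-made data. It assumes, as the notebook does, that each row stands for one minute in bed; the patient id and timestamps are invented for illustration.

```python
import pandas as pd

# Two hypothetical nights and one afternoon nap, one row per minute in bed
night1 = pd.date_range("2021-01-01 22:00", periods=480, freq="min")  # 22:00 to 05:59
nap = pd.date_range("2021-01-02 14:00", periods=60, freq="min")      # 14:00 to 14:59
night2 = pd.date_range("2021-01-02 23:00", periods=420, freq="min")  # 23:00 to 05:59
toy = pd.DataFrame({
    "patient_id": "toy_patient",
    "start_date": night1.append([nap, night2]),
    "activity": True,
})

daily = (toy.set_index("start_date")
            .between_time("18:00", "12:00")   # drops the 14:00 nap
            .groupby("patient_id")
            .resample("24h", offset="12h")    # windows run midday to midday
            .activity.sum() / 60).reset_index()
print(daily)
```

The nap falls outside the 18:00 to 12:00 window and is discarded, so the two midday-to-midday windows hold 8 and 7 hours respectively. Without the `offset`, each night would be split at midnight across two calendar days.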


## Step 4 - wrap everything we did so far into a processing function 

In [36]:
def process_occupancy(df):
    """Convert raw minute-level sleep_mat activity to daily bed occupancy.

    Aggregates the raw sleep_mat observations into daily summaries,
    keeping only records between 18:00 and 12:00 to ignore naps.

    Args:
        df (pd.DataFrame): raw sleep data with patient_id and start_date columns

    Returns:
        pd.DataFrame: a daily time series of hours spent in bed per patient
    """
    factors = ['patient_id','start_date']
    df = df[factors].assign(activity=True)
    df.patient_id = df.patient_id.astype('category')
    df = (df.set_index('start_date').
            between_time('18:00', '12:00').
            groupby(['patient_id']).
            resample('24h', offset="12h").
            activity.sum()/60).reset_index()
    df = df.assign(unit='hour')
    df = df.rename(columns={'activity': 'Time_in_bed'})
    return df

In [37]:
df = dcarte.load('sleep')
df = process_occupancy(df)
df.head(5)

Finished Loading sleep_mat in:                 4.3 seconds   
Finished Loading sleep in:                     7.7 seconds   


Unnamed: 0,patient_id,start_date,Time_in_bed,unit
0,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-28 12:00:00,4.166667,hour
1,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-29 12:00:00,4.65,hour
2,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-30 12:00:00,7.716667,hour
3,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-31 12:00:00,0.0,hour
4,2GN1PHeHwRzNYQ7q4Nvg7g,2021-06-01 12:00:00,0.0,hour


## Step 5 - bin time using IQR 
In this step we use the interquartile range (IQR) to bin Time_in_bed into labels.

In [32]:
def iqr_outcome(x, labels=['Low TIB', 'Normal TIB', 'High TIB']):
    """Label time in bed using the interquartile range (IQR) of the cohort.

    Provides a naive but meaningful baseline label.

    Args:
        x (pd.Series): hours spent in bed per day
        labels (list, optional): names of the three central bins, ordered low to high.
            Defaults to ['Low TIB', 'Normal TIB', 'High TIB'].

    Returns:
        pd.Series: categorical outcome labels
    """
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3-q1
    bins = [q1-iqr*1.5, q1, q3, q3+iqr*1.5]
    outcome = (pd.cut(x, bins, labels= labels).
               cat.add_categories(['OOB',
                                   'Extremely low',
                                   'Extremely high']))
    
    outcome[x>q3+iqr*1.5]='Extremely high'
    outcome[x<q1-iqr*1.5]='Extremely low'
    outcome[x==0]='OOB'
    outcome = outcome.cat.reorder_categories(['OOB',
                                              'Extremely low',
                                              'Low TIB',
                                              'Normal TIB', 
                                              'High TIB',
                                              'Extremely high'])
    return outcome

In [34]:
df['global_outcome'] = (
        df[['Time_in_bed']].
        apply(iqr_outcome)
    )
df

Unnamed: 0,patient_id,start_date,Time_in_bed,unit,global_outcome
0,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-28 12:00:00,4.166667,hour,Low TIB
1,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-29 12:00:00,4.650000,hour,Low TIB
2,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-30 12:00:00,7.716667,hour,Normal TIB
3,2GN1PHeHwRzNYQ7q4Nvg7g,2021-05-31 12:00:00,0.000000,hour,OOB
4,2GN1PHeHwRzNYQ7q4Nvg7g,2021-06-01 12:00:00,0.000000,hour,OOB
...,...,...,...,...,...
22871,YYGGmsEaf2gP4JNN9kMprV,2021-12-25 12:00:00,11.716667,hour,High TIB
22872,YYGGmsEaf2gP4JNN9kMprV,2021-12-26 12:00:00,8.233333,hour,Normal TIB
22873,YYGGmsEaf2gP4JNN9kMprV,2021-12-27 12:00:00,12.133333,hour,High TIB
22874,YYGGmsEaf2gP4JNN9kMprV,2021-12-28 12:00:00,7.900000,hour,Normal TIB
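
The bin edges used by `iqr_outcome` are easier to follow on a handful of numbers. The sketch below repeats only the edge calculation and the `pd.cut` call on a made-up series of nightly hours; the values are illustrative and not drawn from the cohort.

```python
import numpy as np
import pandas as pd

# Hypothetical nightly hours in bed for a toy cohort
tib = pd.Series([0.0, 2.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20.0])

q1, q3 = np.percentile(tib, [25, 75])   # 5.0 and 9.25 for this series
iqr = q3 - q1                           # 4.25
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
bins = [lower, q1, q3, upper]

# Intervals are right-inclusive: (lower, q1], (q1, q3], (q3, upper]
central = pd.cut(tib, bins, labels=["Low TIB", "Normal TIB", "High TIB"])
print(pd.DataFrame({"Time_in_bed": tib, "label": central}))
```

Values above the upper edge (here 20 hours) are left unlabelled by `pd.cut`, which is why `iqr_outcome` fills them in afterwards as 'Extremely high', and overrides zero hours with 'OOB'. Note that in this toy series the lower edge is negative, so no value can fall below it.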


## Step 6 - make a function that ties everything together 

In [None]:
def process_outcome(self):
    """Approximate alerts based on normative time in bed
    (constructed using all legacy data).
    
    Args:
        self (LocalDataset): the local dataset whose data is processed

    Returns:
        pd.DataFrame: a pandas DataFrame with patient_id, start_date, Time_in_bed,
        unit and global_outcome
    """
    df = self.data
    df = process_occupancy(df)
    df['global_outcome'] = (
        df[['Time_in_bed']].
        apply(iqr_outcome)
    )
    return df

## Step 7 - make all this a local dcarte dataset 
For this we need to create a Python module that contains all the functions, plus one function that ties all of the steps together.


In [38]:
%%writefile bed_occupancy.py
import pandas as pd
import numpy as np
import os
import sys
# This line allows you to clone the repository and skip installing dcarte
sys.path.insert(0, os.path.abspath("."))
import dcarte
from dcarte.local import LocalDataset

def process_occupancy(df):
    """Convert raw minute-level sleep_mat activity to daily bed occupancy.

    Aggregates the raw sleep_mat observations into daily summaries,
    keeping only records between 18:00 and 12:00 to ignore naps.

    Args:
        df (pd.DataFrame): raw sleep data with patient_id and start_date columns

    Returns:
        pd.DataFrame: a daily time series of hours spent in bed per patient
    """
    factors = ['patient_id','start_date']
    df = df[factors].assign(activity=True)
    df.patient_id = df.patient_id.astype('category')
    df = (df.set_index('start_date').
            between_time('18:00', '12:00').
            groupby(['patient_id']).
            resample('24h', offset="12h").
            activity.sum()/60).reset_index()
    df = df.assign(unit='hour')
    df = df.rename(columns={'activity': 'Time_in_bed'})
    return df

def iqr_outcome(x, labels=['Low TIB', 'Normal TIB', 'High TIB']):
    """Label time in bed using the interquartile range (IQR) of the cohort.

    Provides a naive but meaningful baseline label.

    Args:
        x (pd.Series): hours spent in bed per day
        labels (list, optional): names of the three central bins, ordered low to high.
            Defaults to ['Low TIB', 'Normal TIB', 'High TIB'].

    Returns:
        pd.Series: categorical outcome labels
    """
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3-q1
    bins = [q1-iqr*1.5, q1, q3, q3+iqr*1.5]
    outcome = (pd.cut(x, bins, labels= labels).
               cat.add_categories(['OOB',
                                   'Extremely low',
                                   'Extremely high']))
    
    outcome[x>q3+iqr*1.5]='Extremely high'
    outcome[x<q1-iqr*1.5]='Extremely low'
    outcome[x==0]='OOB'
    outcome = outcome.cat.reorder_categories(['OOB',
                                              'Extremely low',
                                              'Low TIB',
                                              'Normal TIB', 
                                              'High TIB',
                                              'Extremely high'])
    return outcome

def process_outcome(obj):
    """Approximate alerts based on normative time in bed
    (constructed using all legacy data).
    
    Args:
        obj (LocalDataset): local dataset object exposing the 'sleep' data via obj.datasets

    Returns:
        pd.DataFrame: a pandas DataFrame with patient_id, start_date, Time_in_bed,
        unit and global_outcome
    """
    df = process_occupancy(obj.datasets['sleep'])
    df['global_outcome'] = (
        df[['Time_in_bed']].
        apply(iqr_outcome)
    )
    return df

def main():
    dataset_name = 'bed_occupancy'
    datasets = {'sleep': dcarte.load('sleep')}
    pipeline = ['process_outcome']
    data_type = 'bed_habits'
    module = 'bed_occupancy'
    module_path = __file__
    LocalDataset(dataset_name = dataset_name,
                 datasets = datasets,
                 pipeline = pipeline,
                 data_type = data_type,
                 module_path = module_path,
                 module = module)

if __name__ == "__main__":
    main()  



Writing bed_occupancy.py


In [39]:
%run bed_occupancy.py

Finished Loading sleep_mat in:                 5.1 seconds   
Finished Loading sleep in:                     8.0 seconds   


## Step 8 - check that the new local dataset was registered properly
We will use the `dcarte.domains()` function to see if our new bed habits domain was registered as a new domain and whether it contains the bed_occupancy dataset we just created.


In [45]:
dcarte.domains()

Unnamed: 0,RAW,BASE,LEGACY,PROFILE,SLEEP_STUDY,LOOKUP,BED_HABITS
0,Activity,Doors,Device_Type,Activity_Dailies,Diurnal,Device_Types,Bed_Occupancy
1,Appliances,Entryway,Doors,Activity_Weeklies,Nocturnal,Homes,
2,Behavioural,Habitat,Entryway,Light,Whitings_Tidy,Patients,
3,Blood_Pressure,Kitchen,Light,Physiology_Dailies,Withings_Nights,,
4,Door,Motion,Motion,Physiology_Weeklies,Withings_Tidy,,
5,Encounter,Physiology,Observation,Sleep_Dailies,,,
6,Environmental,Sleep,Physiology,Sleep_Weeklies,,,
7,Issue,Transitions,Temperature,Temperature,,,
8,Observation_Notes,,,,,,
9,Procedure,,,,,,
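
Rather than reading the table by eye, the same check can be done in code. This is a sketch under one assumption: that `dcarte.domains()` returns a pandas DataFrame with one column per domain, as the rendered output above suggests. The helper `is_registered` and the hand-made `toy_domains` table are hypothetical and not part of dcarte.

```python
import pandas as pd

def is_registered(domains: pd.DataFrame, domain: str, dataset: str) -> bool:
    """Return True if `dataset` is listed under the `domain` column (case-insensitive)."""
    column = domain.upper()
    if column not in domains.columns:
        return False
    listed = domains[column].dropna().astype(str).str.lower()
    return dataset.lower() in set(listed)

# Hand-made stand-in shaped like the table above (not real dcarte output)
toy_domains = pd.DataFrame({
    "LOOKUP": ["Device_Types", "Homes", "Patients"],
    "BED_HABITS": ["Bed_Occupancy", None, None],
})
print(is_registered(toy_domains, "bed_habits", "bed_occupancy"))
```

With a real installation, `toy_domains` would be replaced by the result of `dcarte.domains()`, provided the assumption about its return type holds.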


## Step 9 - load the Bed_Occupancy dataset using dcarte 

In [42]:
df = dcarte.load('bed_occupancy',update=False)

Finished Loading sleep_mat in:                 1.7 seconds   
Finished Loading sleep in:                     4.0 seconds   
Finished Loading bed_occupancy in:             4.2 seconds   
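
A quick sanity check on the loaded dataset is to count how many days fall under each label. The sketch below does this on a small hand-made frame that mimics the `global_outcome` column; the rows are invented, and only the category names come from `iqr_outcome`.

```python
import pandas as pd

order = ["OOB", "Extremely low", "Low TIB", "Normal TIB", "High TIB", "Extremely high"]

# Hypothetical stand-in for the loaded bed_occupancy table
toy = pd.DataFrame({
    "patient_id": ["a", "a", "a", "b", "b", "b"],
    "global_outcome": pd.Categorical(
        ["Low TIB", "Normal TIB", "OOB", "Normal TIB", "High TIB", "Normal TIB"],
        categories=order),
})

# For a categorical column, value_counts keeps unobserved labels with a count of 0;
# sort_index puts the result back into category order
counts = toy.global_outcome.value_counts().sort_index()
print(counts)
```

Because the column is categorical, labels that never occur still appear with a count of zero, which makes gaps in the distribution easy to spot.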
