# CHI Notebook 1 - Take a look at the data
### This notebook is in Python. The goal is to examine and wrangle the CGM data using pandas.

# Continuous Glucose Monitoring and Activity Levels.

![Bryan Gibson](https://securembm.uuhsc.utah.edu/zeus/public/mbm-media/faculty-profile?facultyPK=FM00000406)

[Dr. Bryan Gibson](https://medicine.utah.edu/faculty/mddetail.php?facultyID=u0020501#tabAcademic) is a faculty member in the Department of Biomedical Informatics at the University of Utah developing research projects centered around helping patients monitor and adjust their own lifestyles to achieve better health. An example of this is his work with Type 2 diabetes patients.

Here is Dr. Gibson's description of this project:

**Background:**  Currently most individuals with Type 2 diabetes do not adequately understand their disease or the relation between their lifestyle and their health.  The data collected by wearable sensors could be used to address that problem.

**Goal:** The goal of this project is to provide analysis and interpretation of the data collected by wearable sensors for patients with Type 2 diabetes so that they can understand things like:

1. What problems with my blood sugar am I having each day?
    1. What percent of the day is the person's blood sugar too high?
    1. What percent of the day is the person's blood sugar too low?
1. When are those problems occurring?
    1. Provide a way for patients to visualize the problems they are having with their blood glucose for each day they
    1. Provide a way for patients to visualize the problems they are having with their blood glucose by their "average" day in the last week
1. Which behaviors are associated with those problems? 
    1. sleep?
    1. diet?
    1. activity?
1. For each of these behaviors, provide an analysis of how it is associated with the individual's daily glucose control:
    1. sleep
    1. diet
    1. activity

		
**Definitions:** 
We will define a "blood sugar problem" as either of the following:
* Low blood sugar: < 70 mg/dl as the criterion for hypoglycemia
* High blood sugar: > 180 mg/dl as the criterion for hyperglycemia
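
As a minimal sketch of how these two thresholds translate into code, the snippet below labels individual readings. The helper name `classify_glucose`, the constant names, and the sample readings are assumptions made up for illustration; they are not part of Dr. Gibson's data or of the assignment.

```python
import math

LOW_MG_DL = 70    # readings below this count as hypoglycemia
HIGH_MG_DL = 180  # readings above this count as hyperglycemia


def classify_glucose(value, low=LOW_MG_DL, high=HIGH_MG_DL):
    """Label a single reading (hypothetical helper, for illustration only)."""
    if value is None or math.isnan(value):
        return "missing"
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "in range"


# Made-up readings in mg/dl, including the boundary values and a missing one.
readings = [64.0, 70.0, 142.0, 180.0, 225.0, float("nan")]
labels = [classify_glucose(v) for v in readings]
print(labels)
```

Note that the comparisons are strict, so readings of exactly 70 or 180 mg/dl are treated as in range under the definitions above.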

In this module we will explore a subset of Dr. Gibson's data. In the process we will have the opportunity to practice writing functions and wrangling data with pandas. Specifically, we will look at the data from a [continuous glucose monitor](https://www.niddk.nih.gov/health-information/diabetes/overview/managing-diabetes/continuous-glucose-monitoring) and examine the temporal behavior of these measurements.

In Notebook 2 you will be exploring this data more fully and answering more clinical questions.

In [1]:
from nose.tools import assert_true, assert_false, \
    assert_almost_equal, assert_equal, assert_raises
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import time
import datetime
from ipywidgets import interact, interactive, fixed
from collections import Counter

In [3]:
%matplotlib inline

## Problem 1.  (10 points)

Use pandas to read in the file `CGM_vals.csv`. Assign the created DataFrame to a variable `cgm`. Treat `"Low"` and `"High"` as missing values. Make sure that the column `DisplayTime` contains pandas `Timestamp` types and the column `Value` contains numeric values.

The tail of your data should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ID</th>
      <th>DisplayTime</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>35766</th>
      <td>21</td>
      <td>8/30/17 12:11</td>
      <td>225.0</td>
    </tr>
    <tr>
      <th>35767</th>
      <td>21</td>
      <td>8/30/17 12:16</td>
      <td>218.0</td>
    </tr>
    <tr>
      <th>35768</th>
      <td>21</td>
      <td>8/30/17 12:21</td>
      <td>201.0</td>
    </tr>
    <tr>
      <th>35769</th>
      <td>21</td>
      <td>8/30/17 12:31</td>
      <td>142.0</td>
    </tr>
    <tr>
      <th>35770</th>
      <td>21</td>
      <td>8/30/17 13:51</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
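
The sketch below shows, on a tiny in-memory CSV, how the `na_values` argument of `pd.read_csv` can turn marker strings into missing values so that the column is read as numeric. The sample rows are made up for illustration and only mimic the layout of `CGM_vals.csv`; the real file is not read here.

```python
import io

import pandas as pd

# Made-up rows that imitate the layout of CGM_vals.csv.
sample_csv = io.StringIO(
    "ID,DisplayTime,Value\n"
    "1,12/7/16 21:00,229\n"
    "1,12/7/16 21:05,Low\n"
    "1,12/7/16 21:10,High\n"
)

# Treat "Low" and "High" as missing, but only in the Value column.
sample = pd.read_csv(sample_csv, na_values={"Value": ["Low", "High"]})

# One possible way to get Timestamp values: parse with an explicit format.
sample["DisplayTime"] = pd.to_datetime(sample["DisplayTime"],
                                       format="%m/%d/%y %H:%M")
print(sample.dtypes)
```

Because the marker strings become `NaN`, pandas can store `Value` as a floating-point column instead of falling back to strings.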

In [4]:
### BEGIN SOLUTION
cgm = pd.read_csv("CGM_vals.csv", na_values = {"Value":["Low", "High"]})
cgm.head()
### END SOLUTION

Unnamed: 0,ID,DisplayTime,Value
0,1,12/7/16 21:00,229.0
1,1,12/7/16 21:05,246.0
2,1,12/7/16 21:10,240.0
3,1,12/7/16 21:15,253.0
4,1,12/7/16 21:20,247.0


In [5]:
assert_true(isinstance(cgm,pd.DataFrame))

In [6]:
assert_true(np.isnan(cgm['Value'].iloc[11104]))

In [7]:
assert_true(np.isnan(cgm['Value'].iloc[35770]))


In [8]:
assert_true(cgm.Value.dtype==np.float64)

In [38]:
### BEGIN HIDDEN TESTS
assert_true(np.isnan(cgm['Value'].iloc[32862]))
### END HIDDEN TESTS

## Problem 2 (10 points)

Write a function `ds2dt` that takes a string `s` containing a date and a format string `parse_str` and returns a `datetime` instance.
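
As a quick illustration of how a format string drives the parsing, the sketch below calls the standard library's `datetime.strptime` directly. The sample strings are made up, and the variable names are only for illustration.

```python
from datetime import datetime

# %m = month, %d = day, %y = two-digit year, %H:%M = 24-hour time.
fmt = "%m/%d/%y %H:%M"
parsed = datetime.strptime("12/7/16 21:05", fmt)
print(parsed)

# A different layout needs a different format string.
iso = datetime.strptime("2016-12-07", "%Y-%m-%d")
print(iso)

# A string that does not match the format raises ValueError.
try:
    datetime.strptime("2016-12-07", fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```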

In [10]:
### BEGIN SOLUTION

pstr = """%m/%d/%y %H:%M"""
def ds2dt(s, parse_str):
    return datetime.datetime.strptime(s, parse_str)
### END SOLUTION


In [11]:
assert_true(isinstance(ds2dt("2016-12-07","%Y-%m-%d"), datetime.datetime))

In [12]:
### BEGIN HIDDEN TESTS
assert_equal(ds2dt("Jan 8, 05","%b %d, %y").year,2005)
### END HIDDEN TESTS

## Problem 3 (10 points)

Use the function `ds2dt` and the DataFrame `apply` method to convert the `DisplayTime` column from string values to `datetime.datetime` instances.
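
For comparison, the sketch below applies a parsing function in two ways on a made-up two-row frame: row by row with `DataFrame.apply(..., axis=1)`, and value by value with `Series.apply`. The helper `parse_time` and the frame `toy` are hypothetical stand-ins, not the assignment's `ds2dt` or `cgm`.

```python
import datetime

import pandas as pd


def parse_time(s, parse_str="%m/%d/%y %H:%M"):
    """Hypothetical stand-in for a string-to-datetime helper."""
    return datetime.datetime.strptime(s, parse_str)


toy = pd.DataFrame({"ID": [1, 1],
                    "DisplayTime": ["12/7/16 21:00", "12/7/16 21:05"]})

# DataFrame.apply with axis=1 passes each row to the function.
by_row = toy.apply(lambda row: parse_time(row["DisplayTime"]), axis=1)

# Series.apply passes each value of a single column to the function.
by_value = toy["DisplayTime"].apply(parse_time)

print(by_row)
print(by_value)
```

Both calls produce the same parsed times; the column-wise form avoids building a row object for every record.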

In [13]:
### BEGIN SOLUTION
cgm.DisplayTime = cgm.apply(lambda x: ds2dt(x['DisplayTime'],pstr), axis=1)
### END SOLUTION

In [14]:
assert_false(isinstance(cgm.DisplayTime[0],str))

In [15]:
assert_true(isinstance(cgm.DisplayTime[100],pd.Timestamp))

### Explore the DataFrame

In [16]:
ids = list(cgm.ID.unique())
ids.sort()
@interact(df=fixed(cgm) , _id=ids)
def disp(df, _id):
    df[df.ID==_id].plot(x='DisplayTime', y='Value')

interactive(children=(Dropdown(description='_id', options=(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16,…

## Problem 4 (5 points)

Change the value assigned to `num_unique_ids` from `None` to the number of unique IDs found in the data set (`num_unique_ids` should be of type `int`).
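
Two common ways to count distinct values in a column are sketched below on a made-up Series; the values are not taken from the CGM data.

```python
import pandas as pd

# Made-up subject IDs with repeats.
ids = pd.Series([1, 1, 2, 2, 2, 4])

n_from_unique = len(ids.unique())   # length of the array of distinct values
n_from_nunique = ids.nunique()      # pandas counts the distinct values directly

print(n_from_unique, n_from_nunique)
```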

In [17]:
num_unique_ids = None
### BEGIN SOLUTION
num_unique_ids = len(cgm.ID.unique())
### END SOLUTION

In [18]:
### BEGIN HIDDEN TESTS
assert_equal(num_unique_ids,20)
### END HIDDEN TESTS

## Problem 5 (20 points)

In this problem we will write a function that computes the differences between values in a column for each subject in the DataFrame.

Consider the DataFrame segment shown below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ID</th>
      <th>DisplayTime</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1803</th>
      <td>1</td>
      <td>12/14/16 14:15</td>
      <td>77.0</td>
    </tr>
    <tr>
      <th>1804</th>
      <td>1</td>
      <td>12/14/16 14:20</td>
      <td>71.0</td>
    </tr>
    <tr>
      <th>1805</th>
      <td>1</td>
      <td>12/14/16 14:25</td>
      <td>64.0</td>
    </tr>
    <tr>
      <th>1806</th>
      <td>2</td>
      <td>1/18/17 23:44</td>
      <td>108.0</td>
    </tr>
    <tr>
      <th>1807</th>
      <td>2</td>
      <td>1/18/17 23:49</td>
      <td>103.0</td>
    </tr>
  </tbody>
</table>

We can compute the difference of `Value`, for example, between rows 1805 and 1804, since these are both values for subject 1. However, a difference between rows 1806 and 1805 does not make any sense because the values are for two different subjects (1 and 2). When we cannot compute a meaningful value, we would want to provide a missing value (`np.nan` for our measured values and `pd.NaT` for our times). Similarly, we cannot compute a value at row 0, since there is no prior value to subtract.

Write a function `diff_column` that computes the backward difference between values in a column, if the values are for the same subject. 

\begin{equation}
\text{diff}[i] = 
\begin{cases}
    \text{value}[i]-\text{value}[i-1], & \text{if } \text{ID}[i] = \text{ID}[i-1]\\
    \texttt{np.nan}, & \text{otherwise}
\end{cases}
\end{equation}

We will achieve this by passing a keyword argument `control_column` naming the column used to decide whether two rows belong to the same subject, and another keyword argument `diff_column` naming the column on which to compute the difference.
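
As an alternative sketch (not the approach the assignment asks for), the same per-subject backward difference can be obtained with `groupby(...).diff()`. This assumes that all rows for a subject are contiguous, as in the segment shown above; the frame `toy` reuses those five rows.

```python
import pandas as pd

# The five rows from the segment above, with a fresh 0..4 index.
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "DisplayTime": pd.to_datetime(
        ["2016-12-14 14:15", "2016-12-14 14:20", "2016-12-14 14:25",
         "2017-01-18 23:44", "2017-01-18 23:49"]),
    "Value": [77.0, 71.0, 64.0, 108.0, 103.0],
})

# diff() restarts within each ID group, so the first row of every
# subject gets a missing value (NaN for numbers, NaT for times).
diff_value = toy.groupby("ID")["Value"].diff()
diff_time = toy.groupby("ID")["DisplayTime"].diff()

print(pd.DataFrame({"DiffTime": diff_time, "DiffValue": diff_value}))
```

The missing values appear at positions 0 and 3, which are the first rows of subjects 1 and 2.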



In [39]:
def diff_column(df, control_column= "ID", diff_column="DisplayTime", null_value=pd.NaT):
    """
    Computes the backward difference of a column constrained by equality of values in a control column
    
    Input:
        df: A Pandas DataFrame
        control_column="ID" -- the name of the column to constrain equality
        diff_column="DisplayTime" -- the name of the column from which to compute the backwards difference
        null_value=pd.NaT -- the value to use when a difference cannot be computed
        
    Returns:
    
        A Pandas Series containing the differences
    """
### BEGIN SOLUTION
    deltas = [null_value]
    for i in range(1,df.shape[0]):
        if df.loc[i,control_column] != df.loc[i-1,control_column]:
            deltas.append(null_value)
        else:
            deltas.append(df.loc[i,diff_column]-df.loc[i-1,diff_column])
    return pd.Series(deltas)
### END SOLUTION

In [40]:
assert_equal(diff_column(cgm).shape, (35771,))

In [21]:
assert_true(pd.isnull(diff_column(cgm)[0]))

In [22]:
assert_true(pd.isnull(diff_column(cgm)[1806]))

In [23]:
assert_almost_equal(diff_column(cgm, control_column="ID", diff_column="Value", null_value=np.nan)[4],-6.0)

In [24]:
assert_true(np.isnan(diff_column(cgm, control_column="ID", diff_column="Value", null_value=np.nan)[19843]))

In [25]:
assert_true(np.isnan(diff_column(cgm, control_column="ID", diff_column="Value", null_value=np.nan)[30046]))

In [26]:
cgm["DiffTime"] = diff_column(cgm)
cgm["DiffValue"] = diff_column(cgm, control_column="ID", 
                               diff_column="Value", 
                               null_value=np.nan)
cgm.head()

Unnamed: 0,ID,DisplayTime,Value,DiffTime,DiffValue
0,1,2016-12-07 21:00:00,229.0,NaT,
1,1,2016-12-07 21:05:00,246.0,00:05:00,17.0
2,1,2016-12-07 21:10:00,240.0,00:05:00,-6.0
3,1,2016-12-07 21:15:00,253.0,00:05:00,13.0
4,1,2016-12-07 21:20:00,247.0,00:05:00,-6.0


#### The tail of your modified DataFrame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ID</th>
      <th>DisplayTime</th>
      <th>Value</th>
      <th>DiffTime</th>
      <th>DiffValue</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>35766</th>
      <td>21</td>
      <td>2017-08-30 12:11:00</td>
      <td>225.0</td>
      <td>00:05:00</td>
      <td>-3.0</td>
    </tr>
    <tr>
      <th>35767</th>
      <td>21</td>
      <td>2017-08-30 12:16:00</td>
      <td>218.0</td>
      <td>00:05:00</td>
      <td>-7.0</td>
    </tr>
    <tr>
      <th>35768</th>
      <td>21</td>
      <td>2017-08-30 12:21:00</td>
      <td>201.0</td>
      <td>00:05:00</td>
      <td>-17.0</td>
    </tr>
    <tr>
      <th>35769</th>
      <td>21</td>
      <td>2017-08-30 12:31:00</td>
      <td>142.0</td>
      <td>00:10:00</td>
      <td>-59.0</td>
    </tr>
    <tr>
      <th>35770</th>
      <td>21</td>
      <td>2017-08-30 13:51:00</td>
      <td>NaN</td>
      <td>01:20:00</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>

## Problem 6 (35 Points)

In this problem you will write a function to compute the fraction of time that each subject has a problematic blood glucose measurement ($< 70 \text{ or } > 180$ mg/dl, per Dr. Gibson's definitions above). The fraction is computed by summing the time deltas of the problematic measurements and dividing by the sum of all time deltas, counting only time deltas below some threshold (in seconds). The cell below computes the most common time deltas. Five minutes is by far the most common time delta. Based on this, I've arbitrarily set a time delta threshold of 1000 seconds.

In [27]:
pd.DataFrame(Counter(cgm.DiffTime).most_common(10), columns=["Time Delta", "Count"])

Unnamed: 0,Time Delta,Count
0,00:05:00,35339
1,00:10:00,234
2,00:15:00,49
3,00:04:00,25
4,NaT,20
5,00:20:00,20
6,00:25:00,18
7,00:06:00,16
8,00:40:00,6
9,00:35:00,5
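
The summation described for Problem 6 can be sketched on a toy frame for a single subject. All numbers below are made up for illustration, and the variable names are assumptions; only the thresholds (70, 180, and 1000 seconds) come from the text above.

```python
import pandas as pd

# Toy data: five readings, the first with no prior reading (NaT)
# and the last following an 80-minute gap.
toy = pd.DataFrame({
    "DiffTime": pd.to_timedelta([float("nan"), 300, 300, 300, 4800], unit="s"),
    "Value": [100.0, 200.0, 150.0, 60.0, 65.0],
})

secs = toy["DiffTime"].dt.seconds                  # NaN where the delta is NaT
valid = secs < 1000                                # drops the NaT row and the long gap
problem = (toy["Value"] < 70) | (toy["Value"] > 180)

bad_time = secs[valid & problem].sum()             # 300 + 300 seconds
total_time = secs[valid].sum()                     # 300 + 300 + 300 seconds
ratio = bad_time / total_time
print(ratio)
```

The last reading (65 mg/dl) is problematic but does not count, because its time delta exceeds the threshold.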



Based on the docstring provided for the function and the nature of the provided tests, complete the function `cgm_problem_time_ratio`.

**Hint:** Extracting time values from a column/series with `Timedelta` values is a bit tricky. You will need to use syntax like this:

```Python
my_data_frame["My Delta Time Column"].dt.seconds
```
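
One detail worth keeping in mind: `.dt.seconds` returns only the seconds component of each `Timedelta` (with whole days split off), whereas `.dt.total_seconds()` returns the full duration. The sketch below contrasts the two on made-up values; the provided tests in this notebook were written against `.dt.seconds`.

```python
import pandas as pd

# Two made-up deltas: five minutes, and one day plus five minutes.
deltas = pd.Series(pd.to_timedelta(["00:05:00", "1 days 00:05:00"]))

component = deltas.dt.seconds.tolist()        # seconds part only, days ignored
total = deltas.dt.total_seconds().tolist()    # full duration in seconds

print(component)
print(total)
```

For gaps shorter than one day the two agree, so the distinction matters only for very long gaps between readings.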


In [28]:
def cgm_problem_time_ratio(df, tlow=70, thigh=180, deltat_thresh=1000,
                           delta_time="DiffTime", values="Value",
                           control_column="ID"):
    """
    Computes the ratio of time a given subject has a problematic blood sugar measurement.
    
    Input:
        -- df: The Pandas DataFrame containing the cgm data.
        -- tlow: Numeric: threshold for low glucose
        -- thigh: Numeric: threshold for high glucose
        -- deltat_thresh: Numeric: threshold (in seconds) for adding monitor time.
        -- delta_time -- string: column name for time differences
        -- values -- string: column name with cgm values
        -- control_column -- string: column name for subject IDs
        
    Returns:
        A dictionary with keys subject IDs and values the ratio (decimal number) of the 
        elapsed time (in seconds) with problematic measurements divided by total elapsed time
        (in seconds). Elapsed times are computed by summations of Timedeltas
    """
    ### BEGIN SOLUTION
    assert isinstance(df, pd.DataFrame)
    rslt = {}
    ids = df[control_column].unique()
    for _id in ids:
        tmp = df[df[control_column]==_id]
        # Only count intervals shorter than the threshold; NaT deltas compare False.
        valid = tmp[delta_time].dt.seconds < deltat_thresh
        # Readings above thigh or below tlow are problematic.
        problem = (tmp[values] > thigh) | (tmp[values] < tlow)
        bad_time = sum(tmp[valid & problem][delta_time].dt.seconds)
        total_time = sum(tmp[valid][delta_time].dt.seconds)
        rslt[_id] = bad_time/total_time
    return rslt
    ### END SOLUTION
        

In [29]:
rslt = cgm_problem_time_ratio(cgm)
assert_true(isinstance(rslt, dict))

In [30]:
assert_true(15 in rslt)

In [31]:
assert_almost_equal(rslt[16], 0.013635410111181035)

In [32]:
remap = dict(zip(range(1,22),map(lambda x:"UU " + x.upper(),"abcdefghijklmnopqrstuvwxyz")))

cgm2 = cgm.rename({"ID":"Subject", "DisplayTime":"MeasurementTime",
                   "Value":"Blood Glucose (mg/dl)", "DiffTime":"TimeDelta",
                   "DiffValue":"ValueDelta"}, axis=1)
cgm2.Subject = cgm2.Subject.apply(lambda x: remap[x])
cgm2.head()

Unnamed: 0,Subject,MeasurementTime,Blood Glucose (mg/dl),TimeDelta,ValueDelta
0,UU A,2016-12-07 21:00:00,229.0,NaT,
1,UU A,2016-12-07 21:05:00,246.0,00:05:00,17.0
2,UU A,2016-12-07 21:10:00,240.0,00:05:00,-6.0
3,UU A,2016-12-07 21:15:00,253.0,00:05:00,13.0
4,UU A,2016-12-07 21:20:00,247.0,00:05:00,-6.0


In [33]:
rslt2 = cgm_problem_time_ratio(cgm2, delta_time="TimeDelta", values="Blood Glucose (mg/dl)",
                           control_column="Subject")

In [34]:
assert_almost_equal(rslt2["UU K"], 0.026472177201512695)

## Problem 7 (10 points)

#### We can estimate the rate of change for the glucose measurements by dividing columns

In [35]:
cgm2["dvdt"] = cgm2["ValueDelta"]/cgm2["TimeDelta"].dt.seconds

In [36]:
assert_almost_equal(cgm2[cgm2.Subject=="UU U"]["dvdt"].max(), 0.10333333333333333)
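
To make the units explicit: dividing a change in mg/dl by a time delta in seconds gives a rate in mg/dl per second. The sketch below uses two made-up rows and also converts the rate to mg/dl per minute; the frame `toy` and the variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ValueDelta": [17.0, -6.0],                          # change in mg/dl
    "TimeDelta": pd.to_timedelta(["00:05:00", "00:05:00"]),
})

per_second = toy["ValueDelta"] / toy["TimeDelta"].dt.seconds   # mg/dl per second
per_minute = per_second * 60                                    # mg/dl per minute

print(per_minute.tolist())
```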

### You can use the cell below to explore the data.
![Exploring data](./exploring.png)

In [37]:
ids = list(cgm2.Subject.unique())
ids.sort()
@interact(df=fixed(cgm2) , _id=ids, col=["dvdt", "Blood Glucose (mg/dl)"])
def disp_delta(df, _id, col):
    df[(df.Subject==_id)& (df.TimeDelta.notnull())].plot(x="MeasurementTime", y=col)

interactive(children=(Dropdown(description='_id', options=('UU A', 'UU B', 'UU C', 'UU D', 'UU E', 'UU F', 'UU…