   ANALYSIS OF A DIABETIC PATIENT'S DATA FROM GLUCOSE MONITORS, ACTIVITY SENSORS, SLEEP SENSORS AND LOCATION MONITORS

CONTEXT OF THE PROJECT:
A type 1 diabetic patient uses a Continuous Glucose Monitoring (CGM) device to monitor blood sugar and to keep track of changes in blood glucose levels. These CGM devices can be used alone or in conjunction with digitally connected medical devices for the purpose of managing diabetes. One such digitally connected medical device is an automated insulin pump, typically called an automated insulin dosing (AID) system.
It is logical to conclude that blood glucose levels are influenced by the amount of sugar consumed by the patient.
This project connects data from other health sensors to relate the influence of these parameters to the patient’s blood glucose. Correlating blood glucose levels with multiple features can help in the holistic control of blood glucose levels for the patient.

CONTENTS OF THE DATA:
	Raw data was collected from a type 1 diabetes patient who uses a CGM device for monitoring his blood glucose levels.
	Along with a CGM device, this patient also uses the following health monitors:
1.	A sleep monitoring device (Beddit) which records the features related to the patient’s sleep patterns
2.	An activity tracker
3.	A location monitor
DATA DICTIONARY:
1.	CGM data: Contains the blood glucose data (dependent variable) obtained from the CGM device. The device records a reading every 15 minutes. This data set has 6839 instances with 4 features: 

-	Time: Contained the date time stamp of the event 
-	Record_Type: Contained data about the type of the record.
o	Two types of record: 
	‘0’ means the device automatically records the blood glucose level at the pre-programmed time intervals 
	‘1’ means the blood glucose level is voluntarily measured by the patient.

-	Historic_ Glucose: Contained the data from the automatic measurements from the CGM device

-	Scan_Glucose: This feature contains the data from the voluntary measurements from the CGM device


2.	Beddit data: This data set contains data about the patient’s sleep pattern. It had 30 instances with 18 features. 

-	average_respiration_rate: Contained the data about the respiratory rate of the patient in breaths per minute

-	away_duration: Contained data about the amount of time (in seconds) the patient was out of the bed 	


-	beddit_user_id: Contained the device generated user ID	

-	date: Contained the date stamp of the event	


-	end_timestamp	id: Contained the timestamp of the end of the sleep period.	

-	resting_heart_rate: Contained data of the average heart rate of the patient during sleep and measured as beats/ minute

	
-	score_amount_of_sleep, score_awakenings, score_bed_exits, score_sleep_efficiency, score_sleep_latency, score_snoring: These features were empty (0 values)	
-	sleep_duration: Contained the total sleep time (in seconds)

-	start_timestamp: The time stamp of the sleeping cycle	


-	timezone: Contained the time zone the device operates in (here, the “Berlin time” zone)	

-	updated: Contained the datetime stamp when the data was populated in the device	


-	wake_duration: Contained the time in seconds during which the patient was awake.

3.	Steps data: Contained the data about the patient’s activity profile. This dataset had 9  features and 31 instances.

-	active_time: Data about the patient’s activity duration (seconds)
	
-	date: Contained the date of the event	


-	distance: Data about the user’s distance covered during the activity (meters)
	
-	id: Device generated reference IDs	


-	jawbone_user_id: Device generated reference IDs
	
-	steps: The number of steps taken by the user to cover the recorded distance 
-	steps_id: Device generated reference IDs	

-	timezone: “Berlin Zone”	


-	updated: date the data was saved to the device

INSPIRATION AND OBJECTIVES:
1.	To clean the raw data for analysis and derive insights from the data

2.	To create visualizations of the data using Python and Tableau to help generate insights into the relationships between the various features.

3.	To build a logistic regression model which can be used to decide which of these features has a stronger influence on the patient’s diabetic control apart from his diet.


In [4]:
#Importing the necessary python packages and modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [15]:
#Connecting to the data for analysis
glucose = pd.read_csv('cgm_date.csv')
steps = pd.read_csv('steps_date.csv')
sleep = pd.read_csv('beddit_date.csv')
gps = pd.read_csv('loc_date.csv')

In [6]:
# File naming explanation:

#'glucose' is the file that contains the user's glucose measurements from the sensor

#'steps' contains the data from the activity monitor of the patient

#'sleep' contains the data from the device 'beddit' which tracks information about the user's sleep timings

#'gps' contains the user's location data and the amount of time spent in a particular location.

In [5]:
# Exploratory data analysis

In [8]:
glucose.head(2)

Unnamed: 0,cgmdate,cgmtime,Historic_ Glucose,Scan_Glucose
0,100120170000,10019001734,72.0,
1,100120170000,10019001749,63.0,


In [10]:
steps.head(2)

Unnamed: 0,stepsdate,stepstime,active_time,distance,id,steps
0,102920170000,10019002036,5035,6741,2245,8897
1,102820170000,10019001926,4527,5389,2244,7486


In [11]:
sleep.head(2)

Unnamed: 0.1,Unnamed: 0,start_date,start_time,end_date,end_time,sleep_duration,wake_duration,away_duration,resting_heart_rate,avr_resp_rate
0,0,101720170000,10019002045,101820170000,10019000416,25898,360,0,0.0,13.499142
1,1,101620170000,10019002101,101720170000,10019000440,22611,4263,678,72.802734,14.175988


In [12]:
gps.head(2)

Unnamed: 0,enddate,endtime,startdate,starttime,time_spent H:MM,type
0,100620170000,10019001712,100620170000,10019001228,4:44,place 1
1,110920170000,10019001457,110920170000,10019001300,1:57,place 1
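
The time_spent H:MM values in the output above are strings such as 4:44, which cannot be compared or aggregated numerically as they stand. A minimal sketch of one possible conversion to minutes follows; the helper name hmm_to_minutes is hypothetical and is not part of the original notebook.

```python
def hmm_to_minutes(value: str) -> int:
    """Convert an 'H:MM' string (for example '4:44') to a number of minutes."""
    hours, minutes = value.strip().split(":")
    return int(hours) * 60 + int(minutes)


# Values taken from the gps.head(2) output
print(hmm_to_minutes("4:44"))
print(hmm_to_minutes("1:57"))
```

If this conversion were wanted on the whole column, it could be applied with .map() in the same way the date columns are handled later in the notebook.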


In [13]:
# Checking the null values in the datasets
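
The null check itself is not shown in this cell. One way it could be done, as a sketch only: the helper null_summary is a hypothetical name, and the small toy frame stands in for the real CSV data so that the example runs on its own.

```python
import pandas as pd


def null_summary(frames):
    """Return one row per (dataset, column) with the number of null values."""
    rows = []
    for name, frame in frames.items():
        counts = frame.isnull().sum()  # null count per column
        for column, n_null in counts.items():
            rows.append({"dataset": name, "column": column,
                         "nulls": int(n_null), "rows": len(frame)})
    return pd.DataFrame(rows)


# Toy stand-in shaped like the first rows of the glucose data (assumption)
toy_glucose = pd.DataFrame({
    "cgmdate": [100120170000, 100120170000],
    "Historic_ Glucose": [72.0, 63.0],
    "Scan_Glucose": [float("nan"), float("nan")],
})

summary = null_summary({"glucose": toy_glucose})
print(summary)
```

With the real data, the call would presumably be null_summary({'glucose': glucose, 'steps': steps, 'sleep': sleep, 'gps': gps}).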

In [None]:
#Selecting the relevant features in all the datasets

Features selected in glucose:
    1. cgmdate
    2. Historic_ Glucose
Features selected in sleep:
    1. end_date
    2. sleep_duration
    3. wake_duration
    4. resting_heart_rate
    5. avr_resp_rate
Features selected in steps:
    1. stepsdate
    2. active_time
    3. distance
    4. steps
Features selected in gps (location):
    1. enddate
    2. time_spent H:MM
    3. type

In [18]:
#Using .loc to select the relevant features
glucose = glucose.loc[:,['cgmdate','Historic_ Glucose']]
sleep = sleep.loc [:,['end_date','sleep_duration','wake_duration','resting_heart_rate','avr_resp_rate']]
steps = steps.loc[:,['stepsdate','active_time','steps','distance']]
gps = gps.loc[:,['enddate','time_spent\nH:MM','type']]

In [None]:
#The values in the date features carry an additional trailing '0000' after the date digits.
#A lambda function is used to strip the trailing '0000' from each instance in these features.

In [19]:
glucose['cgmdate'] = glucose['cgmdate'].map(lambda x: str(x)[:-4])
sleep['end_date'] = sleep['end_date'].map(lambda x: str(x)[:-4])
steps['stepsdate'] = steps['stepsdate'].map(lambda x:str(x)[:-4])
gps['enddate'] = gps['enddate'].map(lambda x :str(x)[:-4])
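
After the trailing '0000' is stripped, the values are still plain strings such as 10182017. The sample rows suggest a month-day-year layout (for example 102920170000), so a real date could be recovered as sketched below. This is an assumption about the format, and parse_raw_date is a hypothetical helper, not part of the original notebook. The zero-padding covers the case where the value was read as an integer and lost a leading zero.

```python
from datetime import date, datetime


def parse_raw_date(raw) -> date:
    """Strip the trailing '0000' and parse the rest as MMDDYYYY (assumed layout)."""
    digits = str(raw)[:-4].zfill(8)  # restore a leading zero lost by integer parsing
    return datetime.strptime(digits, "%m%d%Y").date()


print(parse_raw_date(101820170000))
print(parse_raw_date(90120170000))
```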

In [24]:
glucose_sleep= sleep.merge(glucose, how = 'inner', left_on = 'end_date', right_on = 'cgmdate')
glucose_sleep_steps = glucose_sleep.merge(steps, how = 'inner', left_on = 'cgmdate', right_on = 'stepsdate')
glucose_sleep_steps_gps = glucose_sleep_steps.merge(gps, how = 'inner', left_on = 'cgmdate', right_on = 'enddate')
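
Because the join key is a date and several of the datasets hold more than one row per date, these inner merges repeat rows: each sleep record is paired with every glucose reading and every location record of the same day. The sketch below illustrates the effect on toy frames (the values are illustrative, not the real data) and counts the resulting rows per date.

```python
import pandas as pd

# Toy frames: 3 + 2 glucose readings, 1 sleep record per day, 2 + 1 location records
glucose = pd.DataFrame({
    "cgmdate": ["10182017"] * 3 + ["10192017"] * 2,
    "Historic_ Glucose": [142.0, 150.0, 130.0, 100.0, 110.0],
})
sleep = pd.DataFrame({
    "end_date": ["10182017", "10192017"],
    "sleep_duration": [25898, 22611],
})
gps = pd.DataFrame({
    "enddate": ["10182017", "10182017", "10192017"],
    "type": ["place 1", "place 2", "place 1"],
})

merged = (
    sleep.merge(glucose, how="inner", left_on="end_date", right_on="cgmdate")
         .merge(gps, how="inner", left_on="cgmdate", right_on="enddate")
)

# Rows per date = sleep rows x glucose rows x gps rows for that date
rows_per_date = merged.groupby("cgmdate").size()
print(rows_per_date)
```

Checking the row counts in this way before modelling can show whether some days are over-represented simply because they have more location records.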


In [25]:
glucose_sleep_steps_gps.head(2)

Unnamed: 0,end_date,sleep_duration,wake_duration,resting_heart_rate,avr_resp_rate,cgmdate,Historic_ Glucose,stepsdate,active_time,steps,distance,enddate,time_spent H:MM,type
0,10182017,25898,360,0.0,13.499142,10182017,142.0,10182017,8423,17511,16813,10182017,3:02,place 9.6
1,10182017,25898,360,0.0,13.499142,10182017,142.0,10182017,8423,17511,16813,10182017,0:07,place unique


In [27]:
#FEATURE AND DIMENSIONALITY REDUCTION
#In this pre-final dataset, repeated features and irrelevant features were removed from the dataset
#The location column is labelled 'time_spent\nH:MM' (as in the earlier selection cell), so the full label is used here
df = glucose_sleep_steps_gps.loc[:,['cgmdate','steps', 'sleep_duration','time_spent\nH:MM','type','Historic_ Glucose' ]]
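
Selecting columns by label with .loc warns, or in recent pandas versions raises KeyError, when any requested label is missing from the frame. A label that contains an embedded newline is easy to get wrong. A small sketch of checking the labels first is given below; missing_columns is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd


def missing_columns(frame: pd.DataFrame, wanted: list) -> list:
    """Return the requested labels that are not columns of the frame."""
    return [name for name in wanted if name not in frame.columns]


# Toy frame with a column label containing a newline, as in the gps data
toy = pd.DataFrame({
    "cgmdate": ["10182017"],
    "time_spent\nH:MM": ["3:02"],
    "type": ["place 1"],
})

print(missing_columns(toy, ["cgmdate", "time_spent", "type"]))
print(missing_columns(toy, ["cgmdate", "time_spent\nH:MM", "type"]))
```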


In [None]:
#writing the new dataset to a .csv file
df.to_csv('final.csv')