# RISK PREDICTION PROJECT FOR DIABETES PATIENTS - README

## TABLE OF CONTENTS

- Background
- The Problem
- Data Collection
- Data Preparation + EDA
- Model Training & Design
- Evaluation & Deployment

<img src ="DiabetesProjectReadme.png" style = "width:800px; height:300px"/>

## BACKGROUND

The ever-increasing demand for healthcare resources (driven by factors including ageing demographics, multiple chronic conditions per individual, changes in socio-economic status and a pivot towards personalised medicine) far outstrips supply, which is constrained by the clinical, financial and other resources available within a publicly funded healthcare system.

This is due to a transition from acute to chronic illness as the major source of burden on the health system and on the quality of life of people living with disease conditions.

More than 4.9 million people are currently living with diabetes in the UK (1). Diabetes alone results in approximately 10,000 
amputations per year (leg, toe or foot). It is one of the leading causes of preventable sight loss in the UK and 
increases patients' predisposition to strokes, heart attacks, kidney failure etc. It is also recognised as a 
cause of premature death (more than 700 people per week) (1).

1 in 6 patients admitted to hospital has diabetes and diabetic patients are twice as likely to be admitted to hospital. Concurrently, there are cardiac, renal, metabolic and other consequences for patients with poorly controlled diabetes.

The NHS spends approximately £10 billion a year on diabetes (approximately 10% of the entire NHS budget). Up to 80% of the NHS budget for diabetes is used in treating its complications.

Avoidable hospital care provision for diabetic patients cost the NHS £3 billion in 2017/2018 (2). The 
increased hospital costs (40% of which come from non-elective and emergency care) are 3 times higher than current 
costs of diabetes medication. The study also found that only 62% of the diabetic patients admitted had 
controlled their glycaemia in the past 3 months. 

The implication is that hospital admissions for these patients could potentially be reduced if they are identified 
early enough and appropriate care provision is initiated in a timely fashion.

Current preventive solutions include teams of diabetes specialist nurses (DSNs) and General Practitioners 
that co-ordinate care provision for these patients. Such hospital teams range from approximately 4 - 6 DSNs 
(dependent on the patient numbers receiving care provision per NHS Trust). This highlights the mismatch 
described earlier between demand and supply: approximately 4.9 million diabetic patients receive care from 
XX DSNs.

## THE PROBLEM

The focal point of the project is the identification of diabetic patients at risk of avoidable hospital admission due to poor glycaemic control. This is likely due to a range of factors including:

- non-compliance with diabetic medication
- loss to follow-up with their General Practitioners (primary care) and/or with their specialists (secondary care)
- incorrect doses/dosing regimes by patients etc
    
    
The historical approach to identifying such at-risk patients has not been data-driven but instead based on gut instinct, drawing on prior clinical reports, medical and social indicators, and the identification of hospital inpatients with blood glucose levels outside certain target ranges (4 - 20 mM). The result is an effectively random process of selecting at-risk patients. This is further compounded by the demands placed on a team of 4 - 6 DSNs, with finite resources, providing care to thousands of patients within their catchment area. It is thus highly likely that some at-risk patients are not identified before their blood glucose control deteriorates.

> The objective of this project is to improve on the historical approach to identification of these at-risk diabetic patients.
> This clinical problem was further decomposed into a series of sub-problems, which were addressed using a variety of statistical concepts and analytical techniques (including supervised and unsupervised ML algorithms).
> The solutions to the sub-problems were then combined to solve the overall problem at hand.

The problem was framed as:

- A supervised problem with blood glucose level as the target variable. These values were provided within the original data and were therefore available for comparison against the predictions of models trained on the remaining feature inputs.
    - Regression modeling numerically estimates expected future blood glucose values for each patient. Accurate estimates would provide one means of highlighting at-risk diabetic patients likely to deteriorate and require hospital admission.
    - Classification modeling predicts which class each patient belongs to (normal blood glucose level group vs abnormal blood glucose level group).
- An unsupervised problem in which a clustering algorithm was used to group the patients within the cohort, in order to find natural groups by their similarities rather than by a pre-specified target characteristic.
- A time series analysis/forecasting problem in which the focus was on forecasting glucose values for each patient over specific future time periods.
     

## DATA COLLECTION

Datasets used in this project were obtained from the UCI ML repository. They are anonymised datasets collected from 
70 diabetic patients in the early 1990s. Datasets for 30 of these patients were selected and loaded directly 
into a Jupyter notebook. The patient records were obtained from 2 sources: an automatic electronic 
recording device and paper records. 
    
Features recorded in the datasets include:

- DATE (of data collection) in MM-DD-YYYY format
- TIME in HH:MM format
- CODE field inclusive of
    - Insulin types (Regular, Intermediate and Long-Acting Insulins)
    - Glucose Values (pre-breakfast, pre-lunch, pre-supper glucose values etc)
- Other features were specified but had no recorded data (typical meal ingestion, exercise activity); these would have served to increase the accuracy of the trained algorithms, if present
- Other limitations of the data (likely due to data collection for a different purpose) concern attributes that would have further increased model accuracy if present, including
    - HbA1c levels
    - Urea/Creatinine levels
    - Patient demographics
    - Recurrent hospital admissions
    - Renal/Cardiac failure etc
    
The above is highlighted to demonstrate the mismatch between the problem this clinician set out to solve and the 
data available to solve it.
    
It was noted that the datasets collected covered different durations for each of the 70 patients within the 
cohort, which skewed the analysis outputs. 

## DATA PREPARATION 
    
As part of data cleansing, I implemented the following steps:

- removed or inferred missing values within the datasets
- replaced anomalous insulin values (e.g. 163)
- engineered features from attributes in the data as inputs for model training and development
- normalised and scaled the data to ensure compatibility
- selected a subset of the data for statistical analysis
- converted the "Date" and "Time" columns from character to "date" and "datetime" formats respectively
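The date/time conversion and missing-value steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes tab-separated records of date, time, code and value, with the time given as HH:MM. The function names and the sample record are hypothetical.

```python
from datetime import datetime


def parse_record(line):
    """Parse one tab-separated record (assumed layout: date, time, code, value)."""
    date_text, time_text, code_text, value_text = line.rstrip("\n").split("\t")
    # Combine the character "Date" and "Time" fields into one datetime
    timestamp = datetime.strptime(f"{date_text} {time_text}", "%m-%d-%Y %H:%M")
    try:
        value = float(value_text)
    except ValueError:
        value = None  # unparseable reading, treated as missing
    return {"timestamp": timestamp, "code": int(code_text), "value": value}


def drop_missing(records):
    """Remove records whose value could not be parsed."""
    return [r for r in records if r["value"] is not None]


# Hypothetical sample records, for illustration only
sample = [parse_record("04-21-1991\t09:09\t58\t100"),
          parse_record("04-21-1991\t09:09\t33\tn/a")]
cleaned = drop_missing(sample)
```

Dropping is only one option; the missing values could instead be inferred from neighbouring readings, as the list above also mentions.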
    
    
## EXPLORATORY DATA ANALYSIS

Exploratory data analysis was next undertaken to better understand the features present within the selected datasets and to explore potential relationships between the features. It was 
discovered that, despite the advertised presence of numerous features as inputs in the data, the only recorded
data available were on Regular and Intermediate Acting Insulins and blood glucose levels tested at 4 
different times of the day. 
This was the most significant limitation on the ability to predict at-risk diabetic patients in this project.

As shown in the facet plot below, Pre-Breakfast Blood Glucose levels (code 58) were the readings most frequently recorded by the patient, with similar recording counts for Pre-Lunch (code 60) and Pre-Supper (code 62) Blood Glucose levels.

<img src = "GlucoseReadingsimage.png" style = "width:1000px; height:300px"/>

The facet plot of Glucose values by Insulin type showed more frequent Glucose value recordings against Regular Insulin (code 33) than against Intermediate Insulin (code 34).

<img src = "InsulinGlucoseFacetPlot.png" style = "width:1000px; height:300px"/>

The box plot below shows average pre-meal Blood Glucose levels that are slightly higher than normal. Average normal pre-meal Blood Glucose levels range between 80 - 120 mg/dL; for diabetic patients, average pre-meal Blood Glucose levels should be 150 mg/dL or less. Only the median pre-lunch Blood Glucose (code 60) level falls below 150 mg/dL.

<img src = "BoxPlotGlucoseReadings.png" style = "width:800px; height:300px"/>

## INFERENTIAL COHORT ANALYSIS

This was conducted after the data had been described during the EDA stage, to draw conclusions beyond the immediate data outputs alone. 
The focus here was on using datasets from a single patient to compare that individual's average blood glucose values each month with those from the previous month. 
The objective was to ascertain whether there was a statistically significant difference (improvement or deterioration) in average blood glucose levels from month to month.
The dependent (paired) t-test was used for this assessment, with a significance level of 0.05. No comparison produced a p-value at or below this level, so the null hypothesis could not be rejected. It was therefore concluded that there was NO statistically significant monthly change over time in average Blood Glucose levels for the patient. Cohen's d effect sizes likewise indicated no substantial difference between months for the patient.
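As a rough sketch of the dependent t-test and Cohen's d described above, the statistics can be computed from paired readings as shown here. How readings were paired across months is not stated in this README, so pairing by position is an assumption, and the glucose values are made up for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_and_cohens_d(previous_month, current_month):
    """Return (t statistic, Cohen's d for paired data, degrees of freedom)."""
    if len(previous_month) != len(current_month) or len(previous_month) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    # Work on the within-pair differences, as the dependent t-test does
    diffs = [cur - prev for prev, cur in zip(previous_month, current_month)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff  # effect size of the mean difference
    return t_stat, cohens_d, n - 1


# Made-up glucose readings (mg/dL), paired by position for illustration
previous = [150, 160, 170, 180]
current = [152, 158, 174, 180]
t_stat, d, dof = paired_t_and_cohens_d(previous, current)
```

The p-value is then read from a t distribution with `dof` degrees of freedom. For 3 degrees of freedom the two-sided 0.05 critical value is about 3.18, so a t statistic smaller than that in absolute value would not reject the null hypothesis.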


## MODEL TRAINING
    
The focus of this section was on segmenting the cohort into groups of diabetic patients that differed from each other 
with respect to their blood glucose levels, and on predicting the long-term risk of an individual 
patient. The aim of these predictions was to identify the patients who would develop poorly controlled 
blood glucose levels (low or high), for earlier intervention by the health system. 
    
This was done to mimic real-world scenarios, in which there is at present no data-driven approach to ascertain
which patients are likely to have poorly controlled blood sugar levels in the next 3, 6, 9 or 12 months.
A limitation of this project was the lack of relevant features on which to make such predictions (with access
only to insulin levels and glucose levels as inputs). Ideally, this clinician would have included the attributes 
listed under Data Collection to improve the accuracy of the modeling process. 

The objective was to estimate unknown values of glucose (at present or in the future), using models built
and tested using events/data from the past. 
    
Supervised Learning techniques (Classification and Regression) were used to create models that described the 
relationship between the feature-engineered inputs and the pre-defined target variable.
The data was split into subsets (train/test and cross validation) and labelled training data was used as input
for the algorithms that were developed. 
    
Classification models were built to predict which of two classes (poor blood glucose control or normal 
blood glucose control) each diabetic patient within the cohort belonged to.
    
Regression models were built to predict, for each individual, the numerical value of the blood glucose level.
The property predicted here was the blood glucose value that was forecast for each patient. 
    
Unsupervised learning (Clustering) algorithms were developed to group individual diabetic patients in the 
cohort together by their similarities rather than by any pre-specified target. Clustering was done to determine if any natural groups existed within the datasets. This could potentially form the basis for targeted treatment approaches/personalisation of therapeutics to specific patient groups with an increased likelihood of success.

### REGRESSION MODELING 

Datasets from 30 diabetic patients were pooled together for model training and testing. A dataframe of >12,000 observations was obtained. As stated above, features available within the data for model development and testing were limited. This had a significant impact on the accuracy of the models subsequently developed.
A baseline model was fit, with outputs suggesting that each additional dose of Insulin given would raise the blood glucose level by 0.73mM. This is incorrect and not reflective of the pharmacology of insulin action (insulin lowers blood glucose).

Baseline models were subsequently used to make predictions for different time periods.
<img src = "LinearBaselineModelPredictions.png" style = "width:1000px; height:500px"/>

Next, linear models were trained separately on different parts of the training data (morning, afternoon, evening and nighttime datasets).

Next, Random Forest models were fitted using cross validation and subsequently tuned.
Finally, GLMNET models were fitted using cross validation techniques and subsequently used to predict Blood Glucose values.

Summary statistics used to explain model performance were the adjusted R-squared and the residual standard error (RSE).
The adjusted R-squared summarised the proportion of variance in the recorded glucose values explained by the feature inputs, adjusted for the number of predictors. The closer to 1, the better the model fit to the data (and thus, the better the model performance). 
The sigma (RSE) reflected the typical size of the difference between the actual/recorded glucose values and those predicted by the models.
Both metrics were slightly improved with the Random Forest and GLMNET models when compared with the linear model outputs.
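For reference, the two summary statistics can be computed from actual and predicted values as in the sketch below. It uses the standard textbook definitions; the function name and the glucose values are hypothetical and are not taken from the project outputs.

```python
import math


def adjusted_r2_and_rse(actual, predicted, n_predictors):
    """Adjusted R-squared and residual standard error for a fitted model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual and total sums of squares
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - rss / tss
    dof = n - n_predictors - 1  # residual degrees of freedom
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / dof
    rse = math.sqrt(rss / dof)
    return adjusted_r2, rse


# Made-up glucose values (mg/dL) and predictions from a one-predictor model
actual = [100, 120, 140, 160, 180]
predicted = [110, 115, 140, 165, 170]
adj_r2, rse = adjusted_r2_and_rse(actual, predicted, n_predictors=1)
```

The RSE is expressed in the same units as the target, so in this toy case it reads directly as a typical prediction error in mg/dL.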

### Classification Modeling 
These were built to predict, for each data point from each patient, which of 2 classes the data point (and ultimately each patient) would belong to: poor glucose control (<40 or >200 mg/dL) or normal glucose 
control (40 - 200 mg/dL).
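The class labels can be derived from the thresholds stated above. A minimal sketch, assuming both boundaries are inclusive for the normal class (the function and label names are hypothetical):

```python
def glucose_class(value_mg_dl, low=40, high=200):
    """Label a reading as 'normal' (low..high inclusive) or 'poor' control."""
    return "normal" if low <= value_mg_dl <= high else "poor"


# Made-up readings (mg/dL) for illustration
labels = [glucose_class(v) for v in (35, 40, 120, 200, 260)]
```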

A Logistic Regression model was fit on 60% of the data (training set) and used to predict on the remaining 40% (test set).
A Confusion Matrix (2-way frequency table) was created to compare the predicted classes against the actual classes. As the table below shows, the model correctly predicted
409 data points as true positives and 1026 as true negatives. 16 false negatives were also identified. 

<img src = "ClassificationConfusionMatrix.png" style = "width:500px; height:500px"/>
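The confusion matrix counts can be tallied from actual and predicted labels as sketched below. Treating poor control as the positive class is an assumption, and the label lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="poor"):
    """Tally true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn)


# Made-up labels for illustration
actual = ["poor", "poor", "normal", "normal", "poor"]
predicted = ["poor", "normal", "normal", "poor", "poor"]
counts = confusion_counts(actual, predicted)
```

With the figures reported above (409 true positives, 16 false negatives), `recall(409, 16)` is roughly 0.96. Precision and overall accuracy also need the false positive count, which is not quoted in the text.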

An ROC curve was plotted to show the trade-off between the true positive rate and the false positive rate of the model's predictions on the test dataset.

<img src = "ClassificationROCCurve.png" style = "width:1000px; height:500px"/>

The process was repeated with Random Forest and Gradient Boosting models.

### Clustering Modeling
An unsupervised (k-means) clustering algorithm was used to group Diabetic patients into clusters along similar characteristics that were not pre-defined.
2 rounds of k-means clustering were conducted and the output patient clusters, which were of different sizes and distributions, were compared. The implication
here was that the groups within each cluster were likely not made up of patients with similar characteristics.

<img src = "ClusteringFirstCluster.png" style = "width:500px; height:200px"/>
<img src = "ClusteringSecondClust.png" style = "width:500px; height:200px"/>

Finally, an elbow plot was created, which suggested that the optimal number of clusters (k) was 2 or 3, given the available data.
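The idea behind an elbow plot can be sketched with a small one-dimensional k-means. This is an illustration only: the features actually clustered in the project are not stated here, so the single "mean glucose per patient" feature, the deterministic initialisation and the values are all assumptions.

```python
def kmeans_1d(values, k, iterations=20):
    """Run a simple 1-D k-means; return (centroids, within-cluster sum of squares)."""
    ordered = sorted(values)
    # Deterministic start: evenly spaced picks from the sorted values
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    wcss = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, wcss


# Made-up mean glucose values (mg/dL), one per hypothetical patient
mean_glucose = [80, 90, 100, 240, 250, 260]
wcss_by_k = {k: kmeans_1d(mean_glucose, k)[1] for k in range(1, 5)}
```

Plotting `wcss_by_k` against k gives the elbow plot: the within-cluster sum of squares drops sharply up to the natural number of groups (k = 2 in this toy data) and only marginally afterwards.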

### Time Series Modeling
Time Series Analysis & Forecasting modeling focused on datasets from patient 1 alone. 
Average glucose readings were calculated for each day included within the datasets.
Naive time series models, random & automatic ARIMA models and finally Prophet models were built and used to forecast future average daily blood glucose 
values.
<img src = "TimeSeriesProphetModeling.png" style = "width:1000px; height:500px"/>
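The daily averaging step and the simplest of the forecasting baselines can be sketched as follows. This assumes the naive model is a last-value forecast (each future day is predicted to equal the most recent daily average); the function names and readings are hypothetical.

```python
from collections import defaultdict
from datetime import date


def daily_means(readings):
    """Average (day, glucose) readings per day, returned in date order."""
    by_day = defaultdict(list)
    for day, value in readings:
        by_day[day].append(value)
    return [(day, sum(vals) / len(vals)) for day, vals in sorted(by_day.items())]


def naive_forecast(series, horizon):
    """Repeat the last observed daily mean for each of the next `horizon` days."""
    last_value = series[-1][1]
    return [last_value] * horizon


# Made-up readings (mg/dL) for illustration
readings = [(date(1991, 4, 21), 100), (date(1991, 4, 21), 200),
            (date(1991, 4, 22), 150), (date(1991, 4, 22), 170)]
daily = daily_means(readings)
forecast = naive_forecast(daily, horizon=3)
```

A baseline like this mainly serves as a yardstick: ARIMA or Prophet forecasts are only worth their added complexity if they beat it on held-out days.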

## CONCLUSION

This project has focused on the data-driven identification of at-risk diabetic patients likely to require (avoidable) hospital admission in the 
short term (6 months or less). 
It showcases the potential of AI/ML as tech enablers in support of this process. Such work requires clinician engagement from inception, in formulating the clinical problems 
amenable to ML, identifying the features most likely to yield accurate results and co-creating the solutions alongside AI/ML practitioners. This will foster trust and adoption across the 
wider clinical community.

Such potential for the delivery of precision medicine at the patient level exists because each individual patient trajectory can be "charted" using the tech enablers, together with the patient's individual risk profile (genetics etc) and responses to treatment. There is also the ability to compute multiple risk scores for the different clinical conditions present within a single patient (as well as their additive impacts on each other). This permits the development of a holistic view of each patient's health status at a point in time (through static data) as well as over a period of time (through time series data).

Diabetes Mellitus as a chronic condition lends itself well to the model of a personalised treatment approach for individuals living with the disease condition. This is because the treatment response is highly heterogeneous and hard to predict, with associated risks (toxicity etc) if medication is not taken as prescribed. 

Extending the time series model into clinical decision support systems for the personalised treatment of patients with diabetes, it is possible to estimate/predict in advance the individual patient response to diabetic treatment(s), the optimal treatment times for each patient and how to select among multiple diabetic medications over time, all of which would serve to optimise the health
status of the individual.


The focus of this project has been on the identification of patients with diabetes for a more targeted focus on health optimisation by the health system. In the drive to personalise healthcare for patients with disease conditions, technology will help answer questions about the unique characteristics of each patient and estimate treatment effects at the individual level (treatment regimes, specific doses and dosing intervals). This addresses the problem of providing care that is respectful of and responsive to individual patient need. This project has not focused on how we better engage diabetic patients to make the necessary changes to improve their health. This includes a focus on patient preferences and values, which are likely to have a stronger influence on patients' self-motivation. This mindset shift is essential for the successful long-term self-management of chronic conditions, which the health system is not currently adapted to support. 
This is extremely relevant in personalising healthcare for patients required to commit to a lifetime of self-management of chronic disease conditions like Diabetes, which may have no cure.