# Diabetes Assessment
Machine learning project

## Domain Description

Non-technical Description of Key Concepts in Outpatient Monitoring
and Management of Insulin Dependent Diabetes Mellitus (IDDM) for the
AAAI Spring Symposium on Interpreting Clinical Data.


The following text is provided to orient you to the diabetes data
set. It is meant as a quick introduction to the pertinent issues in
this domain for potential participants of the AAAI Spring Symposium on
Interpreting Clinical Data.  However, it is not meant to be a rigorous
or comprehensive review of the subject.

Isaac Kohane, AIM-94 Co-Chair
8/27/1993
aim-94@camis.stanford.edu

------------------------------------------------------------------------

Patients with IDDM are insulin deficient. This can either be due to a)
low or absent production of insulin by the beta islet cells of the
pancreas subsequent to an auto-immune attack or b) insulin-resistance,
typically associated with older age and obesity, which leads to a
relative insulin-deficiency even though the insulin levels might be
normal.

Regardless of cause, the lack of adequate insulin effect has multiple
metabolic effects. However, once a patient is diagnosed and is
receiving regularly scheduled exogenous (externally administered)
insulin, the principal metabolic effect of concern is the potential
for hyperglycemia (high blood glucose). Chronic hyperglycemia over a
period of several years puts a patient at risk for several kinds of
micro- and macrovascular problems (e.g. retinopathy). Consequently, the
goal of therapy for IDDM is to bring the average blood glucose as close
to the normal range as possible. As explained below, current therapy
makes this goal a very challenging (and often frustrating) one for
most patients. One important consideration is that due to the
inevitable variation of blood glucose (BG) around the mean, a lower mean
will result in a higher frequency of unpleasant and sometimes
dangerous low BG levels.
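
To make the trade-off in the last sentence concrete, here is a minimal sketch. It assumes BG readings are roughly normally distributed around the mean, which is a simplification introduced only for illustration and not a claim made by the text. The 50 mg/dl standard deviation, the 60 mg/dl threshold and the name `fraction_below` are all hypothetical.

```python
from statistics import NormalDist


def fraction_below(mean_bg, sd_bg, threshold=60.0):
    """Expected share of readings below `threshold` under a normal model.

    Assumption: BG ~ Normal(mean_bg, sd_bg). Real BG data are skewed,
    so treat the result as a rough illustration only.
    """
    return NormalDist(mu=mean_bg, sigma=sd_bg).cdf(threshold)


# Same (hypothetical) spread, two different means: the lower mean puts
# a larger share of readings under the low-BG threshold.
for mean_bg in (180.0, 130.0):
    print(mean_bg, round(fraction_below(mean_bg, 50.0), 4))
```

With the spread held fixed, lowering the mean moves more of the distribution under the threshold, which is the effect the paragraph describes.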

### OTHER SOURCES OF INFORMATION

If you want to learn more about the outpatient treatment of IDDM, most
of the standard medical or endocrinological textbooks have large
sections on this subject. Alternatively, the local chapters of the
Juvenile Diabetes Foundation and American Diabetes Association may be
able to provide you with some helpful practical information. Finally,
feel free to send e-mail to aim-94@camis.stanford.edu. One of the program
committee members is an endocrinologist and will be pleased to answer
technical/medical questions.

# Source:

Michael Kahn, MD, PhD, Washington University, St. Louis, MO

https://archive.ics.uci.edu/ml/datasets/Diabetes

# Data Set Information:

Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00). Thus paper records have fictitious uniform recording times whereas electronic records have more realistic time stamps.
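
Because paper records only ever carry the four fixed times, a file can be checked heuristically for paper-style timestamps. The sketch below is one possible way to do that; the names `PAPER_SLOTS`, `logical_slot` and `looks_like_paper_record` are hypothetical. An electronic record can also land on one of these times by chance, so this is a heuristic and not a definitive test.

```python
# Fixed times assigned to the logical slots of paper records (from the text).
PAPER_SLOTS = {
    "08:00": "breakfast",
    "12:00": "lunch",
    "18:00": "dinner",
    "22:00": "bedtime",
}


def logical_slot(time_str):
    """Return the logical slot for a fixed paper-record time, else None."""
    hours, minutes = time_str.strip().split(":")
    # Normalize "8:00" and "08:00" to the same key.
    normalized = "{:02d}:{:02d}".format(int(hours), int(minutes))
    return PAPER_SLOTS.get(normalized)


def looks_like_paper_record(times):
    """Heuristic: True if every timestamp matches one of the fixed slots."""
    times = list(times)
    return bool(times) and all(logical_slot(t) is not None for t in times)


print(looks_like_paper_record(["08:00", "12:00", "18:00", "22:00"]))
print(looks_like_paper_record(["9:09", "17:08", "22:51"]))
```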

Diabetes data files (data-[01-70]) consist of data sets covering several weeks' to months' worth of outpatient care on 70 patients.


### OUTPATIENT MANAGEMENT

Outpatient management of IDDM relies principally on three
interventions: diet, exercise and exogenous insulin. Proper treatment
requires careful consideration of all three interventions.

**The Code field is deciphered as follows:**

### INSULIN

- 33 = Regular insulin dose
- 34 = NPH insulin dose
- 35 = UltraLente insulin dose

One of insulin's principal effects is to increase the uptake of
glucose in many of the tissues (e.g. in adipose/fat tissue) and
thereby reduce the concentration of glucose in the blood.  Patients
with IDDM administer insulin to themselves by subcutaneous injection.
Insulin doses are given one or more times a day, typically before
meals and sometimes also at bedtime. Many insulin regimens are devised
to have the peak insulin action coincide with the peak rise in BG
during meals. In order to achieve this, a combination of several
preparations of insulin may be administered. Each insulin formulation
has its own characteristic time of onset of effect (O), time of peak
action (P) and effective duration (D). These times can be significantly
affected by many factors such as the site of injection (e.g. much more
rapid absorption in the abdomen than in the thigh) or whether the
insulin is a human insulin or an animal extract. The times I have
listed below are rough approximations and I am sure that I could find
an endocrinologist with different estimates.

- Regular Insulin: O 15-45 minutes, P 1-3 hours, D 4-6 hours
- NPH Insulin: O 1-3 hours, P 4-6 hours, D 10-14 hours
- Ultralente: O 2-5 hours, P (not much of a peak), D 24-30 hours
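
The onset, peak and duration windows above can be kept in a small lookup keyed by the insulin codes 33-35. This is only a sketch: the names `INSULIN_PROFILES` and `action_phase` are hypothetical, and it assumes that the duration is measured from the time of injection and that the widest bound of each range applies.

```python
# Rough action windows in hours, copied from the list above.
# peak=None marks Ultralente, which has "not much of a peak".
INSULIN_PROFILES = {
    33: {"name": "Regular", "onset": (0.25, 0.75), "peak": (1, 3), "duration": (4, 6)},
    34: {"name": "NPH", "onset": (1, 3), "peak": (4, 6), "duration": (10, 14)},
    35: {"name": "UltraLente", "onset": (2, 5), "peak": None, "duration": (24, 30)},
}


def action_phase(code, hours_since_dose):
    """Classify a dose as 'before onset', 'peak', 'active' or 'worn off'.

    Assumption: uses the earliest onset and the longest duration in each
    range, with duration counted from the injection time.
    """
    profile = INSULIN_PROFILES[code]
    if hours_since_dose < profile["onset"][0]:
        return "before onset"
    if hours_since_dose > profile["duration"][1]:
        return "worn off"
    peak = profile["peak"]
    if peak is not None and peak[0] <= hours_since_dose <= peak[1]:
        return "peak"
    return "active"


print(action_phase(33, 2))   # Regular insulin two hours after the dose
print(action_phase(34, 8))   # NPH eight hours after the dose
```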

### GLUCOSE CONCENTRATIONS

- 48 = Unspecified blood glucose measurement
- 57 = Unspecified blood glucose measurement
- 58 = Pre-breakfast blood glucose measurement
- 59 = Post-breakfast blood glucose measurement
- 60 = Pre-lunch blood glucose measurement
- 61 = Post-lunch blood glucose measurement
- 62 = Pre-supper blood glucose measurement
- 63 = Post-supper blood glucose measurement
- 64 = Pre-snack blood glucose measurement
- 65 = Hypoglycemic symptoms

BG concentration will vary even in individuals with normal pancreatic
hormonal function.  A normal pre-meal BG ranges approximately 80-120 mg/dl. 
A normal post-meal BG ranges 80-140 mg/dl. The target range for an individual 
with diabetes mellitus is very controversial. I will cut the Gordian knot on 
this issue by noting that it would be very desirable to keep 90% of all BG 
measurements < 200 mg/dl and that the average BG should be 150 mg/dl or less. 
Note that it takes a lot of work, attention and (painful) BG checks to reach 
this target range. Conversely, an average BG > 200 mg/dl (over several years) is 
associated with a poor long-term outcome. That is, the risk of vascular 
complications of the high BG is significantly elevated.
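
The working target stated above (90% of measurements below 200 mg/dl and an average of 150 mg/dl or less) is easy to check for a list of readings. The following is a minimal sketch; the function name `meets_bg_targets` and the `TargetCheck` result type are hypothetical, and the sample readings are made up.

```python
from collections import namedtuple

TargetCheck = namedtuple("TargetCheck", ["ok", "share_below", "mean_bg"])


def meets_bg_targets(bg_values, ceiling=200.0, share=0.90, max_mean=150.0):
    """Check the two targets from the text for a sequence of BG readings."""
    values = [float(v) for v in bg_values]
    if not values:
        raise ValueError("need at least one BG measurement")
    # Share of readings strictly below the ceiling (the text says "< 200").
    share_below = sum(v < ceiling for v in values) / len(values)
    mean_bg = sum(values) / len(values)
    ok = share_below >= share and mean_bg <= max_mean
    return TargetCheck(ok, share_below, mean_bg)


# Made-up readings: nine at 100 mg/dl and one at 250 mg/dl.
print(meets_bg_targets([100] * 9 + [250]))
```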

Hypoglycemic (low BG) symptoms fall into two classes. Between 40-80 mg/dl,
the patient feels the effect of the adrenal hormone epinephrine as the BG
regulation systems attempt to reverse the low BG.  These so-called 
adrenergic symptoms (headache, abdominal pain, sweating) are useful, if
unpleasant, cues to the patient that their BG is falling dangerously. Below
40 mg/dl, the patient's brain is inadequately supplied with glucose and
the symptoms become those of poor brain function (neuroglycopenic
symptoms). These include: lethargy, weakness, disorientation, seizures and
passing out.
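
The two symptom classes can be expressed as a simple threshold function. This sketch uses the 40 and 80 mg/dl cut-offs from the paragraph above; treating exactly 40 and exactly 80 as adrenergic is an assumption, since the text does not say which side the boundaries fall on, and the name `hypoglycemia_class` is hypothetical.

```python
def hypoglycemia_class(bg_mg_dl):
    """Map a BG reading to the symptom class described in the text.

    Returns 'neuroglycopenic' below 40 mg/dl, 'adrenergic' for 40-80 mg/dl
    (boundaries included by assumption), and None above 80 mg/dl.
    """
    if bg_mg_dl < 40:
        return "neuroglycopenic"
    if bg_mg_dl <= 80:
        return "adrenergic"
    return None


for reading in (39, 50, 123):
    print(reading, hypoglycemia_class(reading))
```

A reading only indicates which symptoms are plausible; the data set records reported symptoms separately under code 65.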

### DIET

- 66 = Typical meal ingestion
- 67 = More-than-usual meal ingestion
- 68 = Less-than-usual meal ingestion

Diet is another vast subject, but for users of this data set the key
point is that a larger meal leads to a longer and possibly higher
elevation of blood glucose. The actual effect depends on a host of
variables, notably the kind of food ingested. For instance, fat delays
emptying of the stomach and therefore causes a slower rise in BG than a
starchy meal without fat. Missing a meal or eating a smaller meal than
usual puts the patient at risk for low BG in the hours that follow the
meal.

### EXERCISE

- 69 = Typical exercise activity
- 70 = More-than-usual exercise activity
- 71 = Less-than-usual exercise activity
- 72 = Unspecified special event

Exercise appears to have multiple effects on BG control. Two important
effects are: increased caloric expenditure and a possibly independent
increase in the sensitivity of tissues to insulin action.  BG can fall
during exercise but also quite a few hours afterwards. For instance,
strenuous exercise in the mid-afternoon can be associated with low BG
after dinner. Also, overly strenuous exercise with associated mild
dehydration can lead to a transient increase in BG.
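
The code tables in the sections above can be collected into one lookup for use during analysis. The labels below are copied from those lists; the names `CODE_LABELS` and `code_category` are hypothetical, and placing code 72 in its own "special event" category (it is listed under EXERCISE above) is a choice made for this sketch.

```python
# Code -> description, copied from the INSULIN, GLUCOSE, DIET and EXERCISE lists.
CODE_LABELS = {
    33: "Regular insulin dose",
    34: "NPH insulin dose",
    35: "UltraLente insulin dose",
    48: "Unspecified blood glucose measurement",
    57: "Unspecified blood glucose measurement",
    58: "Pre-breakfast blood glucose measurement",
    59: "Post-breakfast blood glucose measurement",
    60: "Pre-lunch blood glucose measurement",
    61: "Post-lunch blood glucose measurement",
    62: "Pre-supper blood glucose measurement",
    63: "Post-supper blood glucose measurement",
    64: "Pre-snack blood glucose measurement",
    65: "Hypoglycemic symptoms",
    66: "Typical meal ingestion",
    67: "More-than-usual meal ingestion",
    68: "Less-than-usual meal ingestion",
    69: "Typical exercise activity",
    70: "More-than-usual exercise activity",
    71: "Less-than-usual exercise activity",
    72: "Unspecified special event",
}


def code_category(code):
    """Group a code into a coarse category; 'unknown' for unlisted codes."""
    if 33 <= code <= 35:
        return "insulin"
    if code in (48, 57) or 58 <= code <= 64:
        return "glucose"
    if code == 65:
        return "hypoglycemic symptoms"
    if 66 <= code <= 68:
        return "diet"
    if 69 <= code <= 71:
        return "exercise"
    if code == 72:
        return "special event"
    return "unknown"


print(code_category(58), "|", CODE_LABELS[58])
```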

# Attribute Information:

Diabetes files consist of four fields per record. Each field is separated by a tab and each record is separated by a newline.

**File Names and format:**
<ol>
<li>Date in MM-DD-YYYY format</li>
<li>Time in XX:YY format</li>
<li>Code</li>
<li>Value</li>
</ol>
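
As an illustration of the record layout, a single tab-separated line can be parsed with the standard library alone. This is a sketch and not the loader used in this notebook; `Record` and `parse_record` are hypothetical names. The value is kept as a string on purpose, because the `value` column turns out to have `object` dtype when loaded, which suggests that not every entry is a clean number.

```python
from collections import namedtuple
from datetime import datetime

Record = namedtuple("Record", ["timestamp", "code", "value"])


def parse_record(line):
    """Parse one 'date<TAB>time<TAB>code<TAB>value' record."""
    date_str, time_str, code_str, value_str = line.rstrip("\n").split("\t")
    # Dates are MM-DD-YYYY; times may lack a leading zero (e.g. "9:09").
    timestamp = datetime.strptime(date_str + " " + time_str, "%m-%d-%Y %H:%M")
    return Record(timestamp, int(code_str), value_str)


# First record of patient 1 as it appears in the loaded data.
print(parse_record("04-21-1991\t9:09\t58\t100\n"))
```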

In [1]:
# Load libraries
import sys, os
import csv, json
import pandas as pd
import glob
from pandas.plotting import scatter_matrix
import scipy as sp
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

In [2]:
# Check the versions of libraries
 
print('Python: {}'.format(sys.version))
print('scipy: {}'.format(sp.__version__))
print('numpy: {}'.format(np.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('pandas: {}'.format(pd.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('csv: {}'.format(csv.__version__))
print('json: {}'.format(json.__version__))

Python: 3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
scipy: 1.3.1
numpy: 1.17.2
matplotlib: 3.1.1
pandas: 0.25.1
sklearn: 0.21.3
csv: 1.0
json: 2.0.9


### Load data from multiple tab-separated value (TSV) files in a single directory

In [14]:
# directory path of the data files in tab delimited format
dpath = '//Users/seckart/Projects/Python/ml-projects/Diabetes-Data/'

# dataframe field headers
names = ['date', 'time', 'code', 'value']
frames = []

# walk through the data files, named data-01, data-02, etc., in sorted order
for data_file in sorted(glob.glob(dpath + 'data-*')):
    # derive the patient identifier from the numeric file name suffix
    patient_no = int(data_file.split('data-')[1])
    print(patient_no)
    # load one tab-delimited (\t) file, using the parsed date as the index
    single_dataset = pd.read_csv(data_file, index_col=0, parse_dates=True, sep='\t', names=names)
    single_dataset.insert(loc=0, column='patient', value=patient_no)
    print(single_dataset.head(10))
    # move the date index back into a column; otherwise ignore_index=True
    # below would silently discard the dates
    frames.append(single_dataset.reset_index())

# combine the per-patient frames into a single dataframe
dataset = pd.concat(frames, ignore_index=True)

# confirm number of rows and columns in final dataframe
print(dataset.shape)

# head
print(dataset.head(10))
print('')
print(dataset.tail(10))

# descriptions
print(dataset.describe())

# class distribution
print(dataset.groupby('code').size())

# save dataframe for later
dataset.to_pickle(os.path.join(dpath,'diabetes_data.pickle'))

1
            patient   time  code  value
date                                   
1991-04-21        1   9:09    58    100
1991-04-21        1   9:09    33      9
1991-04-21        1   9:09    34     13
1991-04-21        1  17:08    62    119
1991-04-21        1  17:08    33      7
1991-04-21        1  22:51    48    123
1991-04-22        1   7:35    58    216
1991-04-22        1   7:35    33     10
1991-04-22        1   7:35    34     13
1991-04-22        1  13:40    33      2
2
            patient   time  code value
date                                  
1989-10-10        2  08:00    58   149
1989-10-10        2  08:00    33   010
1989-10-10        2  12:00    60   116
1989-10-10        2  12:00    33   004
1989-10-10        2  18:00    62   304
1989-10-10        2  18:00    33   010
1989-10-10        2  22:00    48   063
1989-10-10        2  22:00    33   014
1989-10-11        2  08:00    58   171
1989-10-11        2  08:00    33   010
3
...
23
            patient   time  code  value
date                                   
1991-04-27       23  23:02    71      0
1991-04-28       23  08:14    57     98
1991-04-28       23  08:15    33     12
1991-04-28       23  08:15    34     18
1991-04-28       23  08:17    66      0
1991-04-29       23  07:20    57    115
1991-04-29       23  07:22    33     12
1991-04-29       23  07:22    34     18
1991-04-29       23  18:56    57    242
1991-04-29       23  18:57    33      5
24
            patient   time  code  value
date                                   
1991-05-28       24  21:35    57     39
1991-05-28       24  21:38    34     12
1991-05-29       24  07:24    57    278
1991-05-29       24  07:26    33     15
1991-05-29       24  07:26    34     20
1991-05-29       24  17:35    57     50
1991-05-29       24  17:37    33      3
1991-05-29       24  17:37    34     12
1991-05-30       24  07:23    57    293
1991-05-30       24  07:25    33     16
25
...
            patient   time  code  value
date                                   
1991-03-29       46  18:59    34      3
1991-03-29       46  21:15    33      2
1991-03-29       46  21:15    34      6
1991-03-30       46  07:03    58    123
1991-03-30       46  07:06    33      2
1991-03-30       46  07:06    34     20
1991-03-30       46  12:13    33      2
1991-03-30       46  16:55    62     95
1991-03-30       46  17:37    33      4
1991-03-30       46  17:37    34      2
47
            patient   time  code  value
date                                   
1991-05-04       47  01:05    72      0
1991-05-04       47  07:10    58    192
1991-05-04       47  07:13    33      3
1991-05-04       47  07:13    34     20
1991-05-04       47  12:41    58     59
1991-05-04       47  16:14    62    257
1991-05-04       47  16:16    33      3
1991-05-04       47  18:09    33      3
1991-05-04       47  18:09    34      2
1991-05-04       47  21:22    63    251
48
...
            patient   time  code  value
date                                   
1990-01-23       64  07:30    57     90
1990-01-23       64  13:30    57    220
1990-01-23       64  17:00    57    190
1990-01-23       64  22:00    57     90
1990-01-24       64  07:30    57     97
1990-01-24       64  11:30    57    134
1990-01-24       64  17:00    57     51
1990-01-24       64  19:00    57    110
1990-01-24       64  22:30    57     91
1990-01-24       64  14:00    69      0
65
            patient   time  code  value
date                                   
1989-04-17       65  06:35    58    345
1989-04-17       65  06:35    33     19
1989-04-17       65  06:35    35     23
1989-04-17       65  12:15    60    255
1989-04-17       65  12:15    33      9
1989-04-17       65  18:10    62    253
1989-04-17       65  18:10    33     18
1989-04-17       65  22:40    48    370
1989-04-18       65  06:20    58     95
1989-04-18       65  06:20    33     15
66
...

### Presentation of data

In [4]:
# box and whisker plot
# print("Diabetes whisker plot")
# dataset.plot(kind='box', subplots=True, figsize=(16,20), sharex=False, sharey=False)
# plt.show()
# print()

In [5]:
# histogram
# print("Diabetes histogram")
# dataset.hist(figsize=(16,20))
# plt.show()
# print()

In [6]:
# scatter plot matrix
# print("Diabetes scatter plot")
# scatter_matrix(dataset, figsize=(16,20))
# plt.show()
# print()

In [7]:
# confirm values in Code field are numeric
dataset['code']

0        58
1        33
2        34
3        62
4        33
         ..
29325    33
29326    34
29327    34
29328    34
29329    34
Name: code, Length: 29330, dtype: int64

In [8]:
dataset['value']

0        100
1          9
2         13
3        119
4          7
        ... 
29325      1
29326      7
29327      7
29328      7
29329      7
Name: value, Length: 29330, dtype: object

### Try to convert the <em>'value'</em> field to numeric data, forcing unparseable entries to not-a-number (NaN)

In [9]:
pd.to_numeric(dataset['value'], errors='coerce')

0        100.0
1          9.0
2         13.0
3        119.0
4          7.0
         ...  
29325      1.0
29326      7.0
29327      7.0
29328      7.0
29329      7.0
Name: value, Length: 29330, dtype: float64
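
Note that the call above only displays the converted series; nothing is stored. A possible next step, sketched here on a small stand-in series and not on the real `dataset`, is to keep the result and inspect which raw entries were coerced to NaN. The entry `"0Hi"` is a hypothetical example of a non-numeric value, and the variable names are assumptions.

```python
import pandas as pd

# Stand-in for dataset['value']: object dtype with strings, an int,
# and one hypothetical non-numeric entry.
values = pd.Series(["100", "010", "0Hi", 7], dtype=object)

# Keep the converted series so it can be reused later.
numeric = pd.to_numeric(values, errors="coerce")

# Count the entries that could not be parsed and list their raw values.
print(numeric.isna().sum())
print(values[numeric.isna()].unique())
```

Looking at the raw values that failed to parse before dropping or imputing them helps decide whether they are typos or a meaningful marker.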