# General Overview - Data Wrangling

The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

- It is an inpatient encounter (a hospital admission).
- It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
- The length of stay was at least 1 day and at most 14 days.
- Laboratory tests were performed during the encounter.
- Medications were administered during the encounter.

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.*

*Taken from [UC Irvine's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='darkgrid')
%matplotlib inline

In [2]:
data = pd.read_csv('diabetic_data.csv', na_values=["?"], low_memory=False) # import data
# csv contains "?" for missing values

diabetes = data.copy() # save a copy of data as diabetes
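
As a small illustration of what `na_values=["?"]` does, the sketch below reads a made-up three-row sample (not the real file) from memory. The sample contents and the names `sample_csv`, `sample`, and `missing_counts` are assumptions chosen for the example.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for diabetic_data.csv
sample_csv = io.StringIO(
    "encounter_id,race,weight\n"
    "1,Caucasian,?\n"
    "2,?,[75-100)\n"
    "3,Asian,?\n"
)

# Treat "?" as a missing value while parsing
sample = pd.read_csv(sample_csv, na_values=["?"])

# "?" entries are now NaN, so isnull() can count them per column
missing_counts = sample.isnull().sum()
print(missing_counts)
```

Because the placeholders become NaN at load time, later calls such as `isnull()` and `dropna()` see them as missing values rather than as the string "?".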

In [3]:
diabetes.head(10)

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
5,35754,82637451,Caucasian,Male,[50-60),,2,1,2,3,...,No,Steady,No,No,No,No,No,No,Yes,>30
6,55842,84259809,Caucasian,Male,[60-70),,3,1,2,4,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
7,63768,114882984,Caucasian,Male,[70-80),,1,1,7,5,...,No,No,No,No,No,No,No,No,Yes,>30
8,12522,48330783,Caucasian,Female,[80-90),,2,1,4,13,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,15738,63555939,Caucasian,Female,[90-100),,3,3,4,12,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [4]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
encounter_id                101766 non-null int64
patient_nbr                 101766 non-null int64
race                        99493 non-null object
gender                      101766 non-null object
age                         101766 non-null object
weight                      3197 non-null object
admission_type_id           101766 non-null int64
discharge_disposition_id    101766 non-null int64
admission_source_id         101766 non-null int64
time_in_hospital            101766 non-null int64
payer_code                  61510 non-null object
medical_specialty           51817 non-null object
num_lab_procedures          101766 non-null int64
num_procedures              101766 non-null int64
num_medications             101766 non-null int64
number_outpatient           101766 non-null int64
number_emergency            101766 non-null int64
number_inpatient            101766 non

# Column Descriptions

Below is a list of all 50 columns with their data type, description, and possible values. The table is from the research article [Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records](https://www.hindawi.com/journals/bmri/2014/781670/tab1/), which used the larger dataset that this one is drawn from.

| Feature name | Type | Description | Values |
|-------------|------|------------------------|----------|
| Encounter ID | Numeric | Unique identifier of an encounter |
| Patient number | Numeric | Unique identifier of a patient |
| Race | Nominal | | Caucasian, Asian, African American, Hispanic, and other |
| Gender | Nominal | | male, female, and unknown/invalid |
| Age | Nominal | Grouped in 10-year intervals |
| Weight | Numeric | Weight in pounds |
| Admission type | Nominal | Integer identifier corresponding to 9 distinct values | For example: emergency, urgent, elective, newborn, and not available |
| Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values | For example: discharged to home, expired, and not available |
| Admission source | Nominal | Integer identifier corresponding to 21 distinct values | For example: physician referral, emergency room, and transfer from a hospital |
| Time in hospital | Numeric | Integer number of days between admission and discharge |
| Payer code | Nominal | Integer identifier corresponding to 23 distinct values | For example: Blue Cross/Blue Shield, Medicare, and self-pay |
| Medical specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values | For example: cardiology, internal medicine, family/general practice, and surgeon |
| Number of lab procedures | Numeric | Number of lab tests performed during the encounter |
| Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter |
| Number of medications | Numeric | Number of distinct generic names administered during the encounter |
| Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter |
| Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter |
| Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter |
| Diagnosis 1 | Nominal | The primary diagnosis (coded as first three digits of ICD9) | 848 distinct values |
| Diagnosis 2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9) | 923 distinct values |
| Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9) | 954 distinct values |
| Number of diagnoses | Numeric | Number of diagnoses entered to the system |
| Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken | ">200," ">300," "normal," and "none" if not measured |
| A1c test result | Nominal | Indicates the range of the result or if the test was not taken | ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured |
| Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name) | "change" and "no change" |
| Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed | "yes" and "no" |
| 24 features for medications | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage |  "up" if the dosage was increased during the encounter, "down" if the dosage was decreased, "steady" if the dosage did not change, and "no" if the drug was not prescribed |
| Readmitted | Nominal | Days to inpatient readmission |  "<30" if the patient was readmitted in less than 30 days, ">30" if the patient was readmitted in more than 30 days, and "No" for no record of readmission |

# Look For and Drop Duplicates

In [5]:
diabetes = diabetes.drop_duplicates() # drops rows that are identical across all columns

diabetes.shape # no duplicates detected!

(101766, 50)

Some patients were admitted multiple times. To avoid over-representing any individual, only the first encounter for each patient is kept in this dataset.

In [6]:
# total unique patients
diabetes.patient_nbr.nunique()

71518

In [7]:
# count the number of encounters per patient using patient_nbr
diabetes.patient_nbr.value_counts()

88785891     40
43140906     28
23199021     23
1660293      23
88227540     23
             ..
71081460      1
30060018      1
67443444      1
141344240     1
93251151      1
Name: patient_nbr, Length: 71518, dtype: int64

In [8]:
# keep only one record for each patient, the first visit
diabetes = diabetes.drop_duplicates(['patient_nbr'], keep='first')
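
The behaviour of `keep='first'` can be seen on a toy frame. This is only a sketch: the frame `toy` and its values are invented, and "first" here means the first row in the current row order, so whether that is the earliest visit depends on how the file is ordered.

```python
import pandas as pd

# Hypothetical toy frame: patient 10 has two encounters, patient 20 has one
toy = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "patient_nbr": [10, 20, 10],
    "time_in_hospital": [3, 5, 7],
})

# keep="first" retains the first row seen for each patient_nbr
# in the current row order and drops the later ones
first_visits = toy.drop_duplicates(subset=["patient_nbr"], keep="first")
print(first_visits)
```

If the rows were not already in chronological order, they would need to be sorted by a suitable column before dropping duplicates.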

In [9]:
diabetes

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101754,443842016,183087545,Caucasian,Female,[70-80),,1,1,7,9,...,No,Steady,No,No,No,No,No,Ch,Yes,>30
101755,443842022,188574944,Other,Female,[40-50),,1,1,7,14,...,No,Up,No,No,No,No,No,Ch,Yes,>30
101756,443842070,140199494,Other,Female,[60-70),,1,1,7,2,...,No,Steady,No,No,No,No,No,No,Yes,>30
101758,443842340,120975314,Caucasian,Female,[80-90),,1,1,7,5,...,No,Up,No,No,No,No,No,Ch,Yes,NO


# Drop Irrelevant Columns

In the full dataset, weight is missing for about 97% of encounters, so this column can be dropped. Payer code and medical specialty are missing for roughly 40% and 49% of encounters, respectively. We do not need to know how patients paid for their treatment, and there is not enough information to infer which medical specialty admitted them. Since only one encounter per patient remains, the patient number already identifies each row and the encounter ID is no longer needed.
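
One way to find such columns is to compute the share of missing values per column. The sketch below uses an invented four-row frame, and the 40% cutoff is an arbitrary assumption for the example, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "payer_code": ["MC", np.nan, "SP", np.nan],
    "age": ["[50-60)", "[60-70)", "[70-80)", "[80-90)"],
})

# isnull() gives booleans, so the column mean is the fraction missing
missing_share = toy.isnull().mean()

# Assumed cutoff for the example: flag columns with more than 40% missing
threshold = 0.4
mostly_missing = missing_share[missing_share > threshold].index.tolist()
print(missing_share)
print(mostly_missing)
```

A cutoff like this only flags candidates. Whether a flagged column is dropped is still a judgment call about how useful it would be.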

In [10]:
# columns to drop
drop_cols = ['encounter_id', 'weight', 'payer_code', 'medical_specialty']

diabetes = diabetes.drop(columns=drop_cols)

In [11]:
diabetes.columns # confirm drop

Index(['patient_nbr', 'race', 'gender', 'age', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

# Addressing NaN Values + Removing Rows

Most columns are not missing any values. Only race and diagnosis 1, 2, and 3 contain missing values. Since there is no way to infer a patient's race from the other columns, the best option is to remove the rows where race is missing.

In [12]:
diabetes.isnull().sum()

patient_nbr                    0
race                        1948
gender                         0
age                            0
admission_type_id              0
discharge_disposition_id       0
admission_source_id            0
time_in_hospital               0
num_lab_procedures             0
num_procedures                 0
num_medications                0
number_outpatient              0
number_emergency               0
number_inpatient               0
diag_1                        11
diag_2                       294
diag_3                      1225
number_diagnoses               0
max_glu_serum                  0
A1Cresult                      0
metformin                      0
repaglinide                    0
nateglinide                    0
chlorpropamide                 0
glimepiride                    0
acetohexamide                  0
glipizide                      0
glyburide                      0
tolbutamide                    0
pioglitazone                   0
rosiglitaz

In [13]:
# remove rows where race is null
diabetes = diabetes.dropna(axis=0, subset=['race'])

diabetes.shape

(69570, 46)

In [14]:
diabetes.isnull().sum()

patient_nbr                    0
race                           0
gender                         0
age                            0
admission_type_id              0
discharge_disposition_id       0
admission_source_id            0
time_in_hospital               0
num_lab_procedures             0
num_procedures                 0
num_medications                0
number_outpatient              0
number_emergency               0
number_inpatient               0
diag_1                        10
diag_2                       275
diag_3                      1158
number_diagnoses               0
max_glu_serum                  0
A1Cresult                      0
metformin                      0
repaglinide                    0
nateglinide                    0
chlorpropamide                 0
glimepiride                    0
acetohexamide                  0
glipizide                      0
glyburide                      0
tolbutamide                    0
pioglitazone                   0
rosiglitaz

After removing the rows without a value in the race column, three columns still have missing values: diagnosis 1, 2, and 3. Diagnosis 1 is the primary diagnosis made during the patient's visit, diagnosis 2 is the secondary diagnosis, and diagnosis 3 is an additional secondary diagnosis. Most of the rows that are missing a primary diagnosis still have a second or even a third diagnosis. Since it does not make sense to have a second (or third) diagnosis without a primary one, we will remove these rows from the dataset.

The number of diagnoses column shows the total number of conditions a patient was diagnosed with, but only the first three are recorded. Rows that are missing the first diagnosis yet list a second or third are therefore treated as errors.

In [15]:
diabetes[['diag_1', 'diag_2', 'diag_3','number_diagnoses']][diabetes.diag_1.isnull()]

Unnamed: 0,diag_1,diag_2,diag_3,number_diagnoses
518,,780,997,4
1267,,250.82,401,5
1488,,276,594,8
3197,,250.01,428,7
37693,,780,295,9
57058,,V63,414,6
57737,,276,V08,8
60314,,427,486,8
86018,,250.02,438,4
87181,,,,5


In [16]:
# remove rows where diagnosis 1 is missing
diabetes = diabetes.dropna(axis=0, subset=['diag_1'])

diabetes.shape

(69560, 46)

Two diagnosis columns still contain missing values. Each code corresponds to a specific condition, so a missing value may simply mean that the patient has no further diagnosed condition. The number of diagnoses column lists the total number of diagnosed conditions. If that number is one, diagnosis 2 and 3 can be filled in with a 0 to show that there is no additional diagnosis, and if it is two, the same applies to diagnosis 3. If diagnosis 2 is missing and the number of diagnoses is greater than one, or diagnosis 3 is missing and the number is greater than two, then some diagnoses were not recorded and the rows should be removed.
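
The consistency check described here can be written as two boolean masks. The sketch below uses a made-up four-row frame, and `bad_diag_2`, `bad_diag_3`, and `consistent` are names invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame covering the consistent and inconsistent cases
toy = pd.DataFrame({
    "diag_1": ["250.83", "996", "486", "722"],
    "diag_2": [np.nan, np.nan, "428", "729"],
    "diag_3": [np.nan, "250", np.nan, np.nan],
    "number_diagnoses": [1, 3, 2, 3],
})

# A missing diag_2 is only plausible when a single diagnosis was recorded,
# and a missing diag_3 only when at most two were recorded
bad_diag_2 = toy["diag_2"].isnull() & (toy["number_diagnoses"] > 1)
bad_diag_3 = toy["diag_3"].isnull() & (toy["number_diagnoses"] > 2)

# Keep the rows that violate neither rule
consistent = toy[~(bad_diag_2 | bad_diag_3)]
print(consistent)
```

Rows 0 and 2 survive because their missing codes agree with the diagnosis count, while rows 1 and 3 are dropped.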

In [17]:
diabetes[['diag_1', 'diag_2', 'diag_3']].isnull().sum()

diag_1       0
diag_2     274
diag_3    1157
dtype: int64

In [18]:
diabetes[['diag_1','diag_2', 'diag_3','number_diagnoses']][diabetes.diag_2.isnull() & (diabetes.number_diagnoses > 1)].sort_values(by=['number_diagnoses'])

Unnamed: 0,diag_1,diag_2,diag_3,number_diagnoses
26220,250.81,,,2
31671,250.82,,,3
35105,996,,250,3
29386,864,,959,3
32681,824,,,3
...,...,...,...,...
46718,486,,428,8
48265,934,,493,8
40382,576,,276,8
20289,402,,425,9


In [19]:
# remove rows where diagnosis 2 is missing and number of diagnoses is greater than 1
diag_2 = diabetes[(diabetes.diag_2.isnull()) & (diabetes.number_diagnoses > 1)].index
 
diabetes.drop(diag_2, inplace=True)

In [20]:
# remaining rows with missing diagnosis 2 should all have one diagnosed condition
diabetes[['diag_1','diag_2', 'diag_3','number_diagnoses']][diabetes.diag_2.isnull()].sort_values(by=['number_diagnoses'])

Unnamed: 0,diag_1,diag_2,diag_3,number_diagnoses
0,250.83,,,1
23751,250.11,,,1
23870,250.01,,,1
24636,250.12,,,1
24801,250,,,1
...,...,...,...,...
11969,250.01,,,1
12030,250.03,,,1
12452,250.13,,,1
10876,250.13,,,1


In [21]:
diabetes[['diag_1','diag_2', 'diag_3']].isnull().sum()

diag_1       0
diag_2     188
diag_3    1116
dtype: int64

Diagnosis 3 is the last column with unexplained missing values. Patients with only one or two diagnosed conditions legitimately have no value in the diagnosis 3 column. The goal here is to remove the rows where diagnosis 3 is missing even though the number of diagnoses is greater than two.

In [22]:
# list of affected rows
diabetes[['diag_1','diag_2', 'diag_3', 'number_diagnoses']][diabetes.diag_3.isnull() & (diabetes.number_diagnoses > 2)].sort_values(by='number_diagnoses')

Unnamed: 0,diag_1,diag_2,diag_3,number_diagnoses
339,722,729,,3
88159,820,250.02,,3
76149,486,250.81,,3
54990,496,250,,3
28981,922,427,,3
...,...,...,...,...
94739,780,293,,8
97578,574,552,,8
97437,433,599,,8
101560,590,276,,8


In [23]:
# remove rows with missing diagnosis 3 and number of diagnoses is greater than 2
diag_3 = diabetes[(diabetes.diag_3.isnull()) & (diabetes.number_diagnoses > 2)].index
 
diabetes.drop(diag_3, inplace=True)

In [24]:
# remaining rows with missing diagnosis 3
diabetes[['diag_1','diag_2', 'diag_3','number_diagnoses']][diabetes.diag_3.isnull()].sort_values(by=['number_diagnoses'])

Unnamed: 0,diag_1,diag_2,diag_3,number_diagnoses
0,250.83,,,1
15752,250.13,,,1
15661,250.03,,,1
15578,250.83,,,1
15444,250.12,,,1
...,...,...,...,...
18437,311,250,,2
18482,432,250,,2
18563,786,250,,2
18721,648,250.02,,2


In [25]:
diabetes.isnull().sum()

patient_nbr                    0
race                           0
gender                         0
age                            0
admission_type_id              0
discharge_disposition_id       0
admission_source_id            0
time_in_hospital               0
num_lab_procedures             0
num_procedures                 0
num_medications                0
number_outpatient              0
number_emergency               0
number_inpatient               0
diag_1                         0
diag_2                       188
diag_3                      1031
number_diagnoses               0
max_glu_serum                  0
A1Cresult                      0
metformin                      0
repaglinide                    0
nateglinide                    0
chlorpropamide                 0
glimepiride                    0
acetohexamide                  0
glipizide                      0
glyburide                      0
tolbutamide                    0
pioglitazone                   0
rosiglitaz

In [26]:
# replace NaN with 0 in diagnosis 2 and 3 to show there is no additional diagnosis
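
No code is shown for this step. A minimal sketch of one way to do the replacement follows, on an invented three-row frame. It assumes the diagnosis columns hold string codes (values such as `V63` appear in the outputs above), so the placeholder used is the string `"0"` rather than the integer 0. The names `toy` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with string diagnosis codes
toy = pd.DataFrame({
    "diag_1": ["250.83", "786", "428"],
    "diag_2": [np.nan, "250", "276"],
    "diag_3": [np.nan, np.nan, "V08"],
})

# Fill only the two secondary diagnosis columns.
# "0" is a string so the columns keep a single value type.
filled = toy.copy()
filled[["diag_2", "diag_3"]] = filled[["diag_2", "diag_3"]].fillna("0")
print(filled)
```

Using a string placeholder avoids mixing integers and strings in one column, which could cause trouble when the codes are grouped or compared later.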


# Unique Values per Column

Investigate the unique values of each column and look for error entries.

In [27]:
# display the unique entries for each column
# a column with only one unique value is the same in every row
for x in diabetes.columns:
    print('Column Name: ' + x)
    print(diabetes[x].unique())

Column Name: patient_nbr
[  8222157  55629189  86047875 ... 140199494 120975314 175429310]
Column Name: race
['Caucasian' 'AfricanAmerican' 'Other' 'Asian' 'Hispanic']
Column Name: gender
['Female' 'Male' 'Unknown/Invalid']
Column Name: age
['[0-10)' '[10-20)' '[20-30)' '[30-40)' '[40-50)' '[50-60)' '[60-70)'
 '[70-80)' '[80-90)' '[90-100)']
Column Name: admission_type_id
[6 1 2 3 4 5 8 7]
Column Name: discharge_disposition_id
[25  1  3  6  2  5 11  7 10 14  4 18  8 12 13 17 16 22 23  9 15 20 28 24
 19 27]
Column Name: admission_source_id
[ 1  7  2  4  5  6 20  3 17  8  9 14 10 22 11 25 13]
Column Name: time_in_hospital
[ 1  3  2  4  5 13 12  9  7 10 11  6  8 14]
Column Name: num_lab_procedures
[ 41  59  11  44  51  31  70  73  68  33  47  62  60  55  49  75  45  29
  35  42  19  64  25  53  52  87  27  37  46  28  36  48  72  10   2  65
  67  40  58  57  32  83  34  39  69  38  56  22  96  78  61  88  66  43
  50   1  18  82  54   9  63  24  71  77  81  76  90  93   3 103  13  80
  85

In [28]:
# remove rows where gender is Unknown/Invalid
gender = diabetes[diabetes.gender == 'Unknown/Invalid'].index

diabetes.drop(gender, inplace=True)

In [29]:
# confirm removal
diabetes.gender.unique()

array(['Female', 'Male'], dtype=object)

We can remove any columns that hold the same value in every row: metformin-rosiglitazone, glimepiride-pioglitazone, citoglipton, and examide. These medications were not prescribed to any of the remaining patients.
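
Constant columns can also be found programmatically instead of by reading the printed output. The sketch below is illustrative only: the frame `toy` is invented, and `constant_cols` and `trimmed` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame: two columns never vary, one does
toy = pd.DataFrame({
    "insulin": ["No", "Up", "Steady"],
    "examide": ["No", "No", "No"],
    "citoglipton": ["No", "No", "No"],
})

# A column with exactly one distinct value (counting NaN as a value)
# is the same in every row
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

trimmed = toy.drop(columns=constant_cols)
print(constant_cols)
```

Passing `dropna=False` keeps a column that mixes one value with NaN from being mistaken for a constant column.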

In [30]:
# drop columns
meds = ['metformin-rosiglitazone', 'glimepiride-pioglitazone', 'citoglipton', 'examide']

diabetes = diabetes.drop(columns=meds)

In [31]:
diabetes.shape

(69388, 42)

## Admission Type ID

The ID mapping document matches each number with an admission type.

| admission_type_id	| description |
|-------------------|-------------|
| 1	| Emergency |
| 2	| Urgent |
| 3	| Elective |
| 4	| Newborn |
| 5	| Not Available |
| 6	| NULL |
| 7	| Trauma Center |
| 8	| Not Mapped |
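
To make counts easier to read, the IDs could be translated into the descriptions from the table above. This is a sketch on a made-up series: the dictionary is copied from the table, while `toy_ids` and `labels` are names invented for the example.

```python
import pandas as pd

# Mapping copied from the admission type table above
admission_type_names = {
    1: "Emergency",
    2: "Urgent",
    3: "Elective",
    4: "Newborn",
    5: "Not Available",
    6: "NULL",
    7: "Trauma Center",
    8: "Not Mapped",
}

# Hypothetical sample of admission type IDs
toy_ids = pd.Series([1, 3, 6, 1], name="admission_type_id")

# map() looks up each ID in the dictionary
labels = toy_ids.map(admission_type_names)
print(labels.value_counts())
```

Any ID missing from the dictionary would become NaN, which makes unexpected codes easy to spot.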

In [32]:
pd.crosstab(diabetes.readmitted, diabetes.admission_type_id)

readmitted \ admission_type_id,1,2,3,4,5,6,7,8
<30,3183,1099,1127,1,261,445,0,21
>30,11197,3974,3659,2,1028,1843,0,68
NO,21099,7304,8786,6,1833,2233,20,199


In [33]:
diabetes.admission_type_id.value_counts()

1    35479
3    13572
2    12377
6     4521
5     3122
8      288
7       20
4        9
Name: admission_type_id, dtype: int64

In [34]:
# keep this column or no?

## Discharge Disposition

The discharge disposition tells us where a patient went after leaving the hospital. Each number corresponds to a specific outcome. We remove all rows where the discharge disposition is some form of 'expired' or 'hospice', since those patients either died, and so cannot be readmitted, or are terminally ill.

| discharge_disposition_id | description |
|--------------------------|-------------|
| 1 | Discharged to home |
| 2	| Discharged/transferred to another short term hospital |
| 3	| Discharged/transferred to SNF |
| 4	| Discharged/transferred to ICF |
| 5	| Discharged/transferred to another type of inpatient care institution |
| 6	| Discharged/transferred to home with home health service |
| 7	| Left AMA |
| 8 | Discharged/transferred to home under care of Home IV provider |
| 9 | Admitted as an inpatient to this hospital |
| 10 | Neonate discharged to another hospital for neonatal aftercare |
| 11 | Expired |
| 12 | Still patient or expected to return for outpatient services |
| 13 | Hospice / home |
| 14 | Hospice / medical facility |
| 15 | Discharged/transferred within this institution to Medicare approved swing bed |
| 16 | Discharged/transferred/referred another institution for outpatient services |
| 17 | Discharged/transferred/referred to this institution for outpatient services |
| 18 | NULL |
| 19 | Expired at home. Medicaid only, hospice |
| 20 | Expired in a medical facility. Medicaid only, hospice |
| 21 | Expired, place unknown. Medicaid only, hospice |
| 22 | Discharged/transferred to another rehab facility including rehab units of a hospital |
| 23 | Discharged/transferred to a long term care hospital |
| 24 | Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare |
| 25 | Not Mapped |
| 26 | Unknown/Invalid |
| 27 | Discharged/transferred to a federal health care facility |
| 28 | Discharged/transferred/referred to a psychiatric hospital or psychiatric distinct part unit of a hospital |
| 29 | Discharged/transferred to a Critical Access Hospital (CAH) |
| 30 | Discharged/transferred to another type of health care institution not defined elsewhere |
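
Using the IDs from this table, the expired and hospice rows can be filtered in a single step with `isin`. The sketch below runs on an invented five-row frame. The list `expired_or_hospice` is read off the table (expired: 11, 19, 20, 21; hospice: 13, 14), and `toy` and `kept` are hypothetical names.

```python
import pandas as pd

# IDs read off the discharge disposition table:
# expired (11, 19, 20, 21) and hospice (13, 14)
expired_or_hospice = [11, 13, 14, 19, 20, 21]

# Hypothetical toy frame
toy = pd.DataFrame({
    "patient_nbr": [1, 2, 3, 4, 5],
    "discharge_disposition_id": [1, 11, 13, 6, 20],
})

# Keep rows whose discharge disposition is not in the list
kept = toy[~toy["discharge_disposition_id"].isin(expired_or_hospice)]
print(kept)
```

Keeping all the IDs in one list makes it easy to check the filter against the mapping table.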

In [35]:
diabetes.discharge_disposition_id.value_counts()

1     42907
3      8492
6      8102
18     2456
2      1478
22     1397
11     1050
5       875
25      761
4       496
7       398
23      255
13      242
14      215
28       89
8        72
15       40
24       25
9         9
17        8
10        6
19        6
27        3
16        3
12        2
20        1
Name: discharge_disposition_id, dtype: int64

In [36]:
# expired - remove rows with discharge disposition IDs of 11, 19
discharge = diabetes[(diabetes.discharge_disposition_id == 11) | (diabetes.discharge_disposition_id == 19)].index
 
diabetes.drop(discharge, inplace=True)

In [37]:
# expired - remove rows with discharge disposition IDs of 20, 21
discharge = diabetes[(diabetes.discharge_disposition_id == 20) | (diabetes.discharge_disposition_id == 21)].index
 
diabetes.drop(discharge, inplace=True)

In [38]:
diabetes[diabetes.discharge_disposition_id == 13]

Unnamed: 0,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,num_procedures,...,miglitol,troglitazone,tolazamide,insulin,glyburide-metformin,glipizide-metformin,metformin-pioglitazone,change,diabetesMed,readmitted
11693,105371712,Caucasian,Female,[80-90),1,13,7,10,83,0,...,No,No,No,No,No,No,No,Ch,Yes,NO
12929,23904567,Caucasian,Female,[80-90),5,13,17,3,22,0,...,No,No,No,No,No,No,No,No,No,NO
16430,93417543,Caucasian,Female,[80-90),6,13,7,8,96,1,...,No,No,No,No,No,No,No,No,Yes,NO
17868,6356610,Caucasian,Male,[80-90),2,13,1,8,44,0,...,No,No,No,Steady,No,No,No,Ch,Yes,NO
18266,28125558,Caucasian,Male,[60-70),1,13,7,4,54,0,...,No,No,No,Steady,No,No,No,Ch,Yes,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100297,44636130,AfricanAmerican,Female,[50-60),1,13,7,4,57,0,...,No,No,No,No,No,No,No,No,Yes,NO
100729,181334759,Caucasian,Male,[80-90),1,13,7,4,19,1,...,No,No,No,No,No,No,No,No,No,NO
100960,58716846,AfricanAmerican,Female,[60-70),1,13,7,9,74,2,...,No,No,No,Steady,No,No,No,No,Yes,NO
101241,41514156,Caucasian,Female,[70-80),1,13,7,5,60,3,...,No,No,No,Steady,No,No,No,Ch,Yes,NO


In [39]:
diabetes[diabetes.discharge_disposition_id == 14]

Unnamed: 0,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,num_procedures,...,miglitol,troglitazone,tolazamide,insulin,glyburide-metformin,glipizide-metformin,metformin-pioglitazone,change,diabetesMed,readmitted
1086,103126950,Caucasian,Male,[80-90),6,14,7,11,63,0,...,No,No,No,Steady,No,No,No,Ch,Yes,<30
3702,10900818,Caucasian,Female,[70-80),3,14,1,3,29,2,...,No,No,No,No,No,No,No,No,No,<30
4771,4095531,Caucasian,Female,[70-80),1,14,7,3,63,1,...,No,No,No,No,No,No,No,No,No,NO
4791,4690782,AfricanAmerican,Female,[60-70),1,14,7,4,45,2,...,No,No,No,No,No,No,No,No,Yes,NO
4922,4947471,Caucasian,Female,[80-90),1,14,7,5,50,0,...,No,No,No,No,No,No,No,No,No,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98299,131309663,AfricanAmerican,Female,[90-100),1,14,7,2,21,0,...,No,No,No,No,No,No,No,No,No,NO
99517,67374603,Caucasian,Male,[80-90),1,14,7,2,56,1,...,No,No,No,Steady,No,No,No,No,Yes,NO
100452,160070594,Caucasian,Female,[70-80),1,14,7,3,71,1,...,No,No,No,Steady,No,No,No,No,Yes,NO
100770,39610080,Caucasian,Female,[80-90),1,14,7,1,67,0,...,No,No,No,No,No,No,No,No,No,NO


In [40]:
# hospice care - remove rows with discharge disposition IDs of 13, 14
discharge = diabetes[(diabetes.discharge_disposition_id == 13) | (diabetes.discharge_disposition_id == 14)].index
 
diabetes.drop(discharge, inplace=True)

In [41]:
diabetes.discharge_disposition_id.unique()

array([25,  1,  3,  6,  2,  5,  7, 10,  4, 18,  8, 12, 17, 16, 22, 23,  9,
       15, 28, 24, 27])

## Admission Source

| admission_source_id | description |
|---------------------|-------------|
| 1 | Physician Referral |
| 2 | Clinic Referral |
| 3 | HMO Referral |
| 4 | Transfer from a hospital |
| 5 | Transfer from a Skilled Nursing Facility (SNF) |
| 6 | Transfer from another health care facility |
| 7 | Emergency Room |
| 8 | Court/Law Enforcement |
| 9 | Not Available |
| 10 | Transfer from critical access hospital |
| 11 | Normal Delivery |
| 12 | Premature Delivery |
| 13 | Sick Baby |
| 14 | Extramural Birth |
| 15 | Not Available |
| 17 | NULL |
| 18 | Transfer From Another Home Health Agency |
| 19 | Readmission to Same Home Health Agency |
| 20 | Not Mapped |
| 21 | Unknown/Invalid |
| 22 | Transfer from hospital inpt/same fac reslt in a sep claim |
| 23 | Born inside this hospital |
| 24 | Born outside this hospital |
| 25 | Transfer from Ambulatory Surgery Center |
| 26 | Transfer from Hospice |

In [42]:
diabetes.admission_source_id.value_counts()

7     36528
1     21050
17     4743
4      2359
6      1490
2       850
5       504
20      152
3       135
9        36
8        11
10        6
22        4
14        2
25        2
11        1
13        1
Name: admission_source_id, dtype: int64

In [43]:
# reset the index after dropping rows
diabetes = diabetes.reset_index(drop=True)

diabetes

Unnamed: 0,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,num_procedures,...,miglitol,troglitazone,tolazamide,insulin,glyburide-metformin,glipizide-metformin,metformin-pioglitazone,change,diabetesMed,readmitted
0,8222157,Caucasian,Female,[0-10),6,25,1,1,41,0,...,No,No,No,No,No,No,No,No,No,NO
1,55629189,Caucasian,Female,[10-20),1,1,7,3,59,0,...,No,No,No,Up,No,No,No,Ch,Yes,>30
2,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,11,5,...,No,No,No,No,No,No,No,No,Yes,NO
3,82442376,Caucasian,Male,[30-40),1,1,7,2,44,1,...,No,No,No,Up,No,No,No,Ch,Yes,NO
4,42519267,Caucasian,Male,[40-50),1,1,7,1,51,0,...,No,No,No,Steady,No,No,No,Ch,Yes,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67869,183087545,Caucasian,Female,[70-80),1,1,7,9,50,2,...,No,No,No,Steady,No,No,No,Ch,Yes,>30
67870,188574944,Other,Female,[40-50),1,1,7,14,73,6,...,No,No,No,Up,No,No,No,Ch,Yes,>30
67871,140199494,Other,Female,[60-70),1,1,7,2,46,6,...,No,No,No,Steady,No,No,No,No,Yes,>30
67872,120975314,Caucasian,Female,[80-90),1,1,7,5,76,1,...,No,No,No,Up,No,No,No,Ch,Yes,NO


# Saving Cleaned Data

In [44]:
# save cleaned dataset to new file for storytelling and visualization

# diabetes.to_csv('diabetes_cleaned.csv')
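
When the save is enabled, passing `index=False` may be worth considering, since otherwise the row index is written as an extra unnamed column. The sketch below shows a round trip on a made-up two-row frame, with an in-memory buffer standing in for the output file. The names `toy`, `buffer`, and `reloaded` are invented for the example.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
toy = pd.DataFrame({
    "patient_nbr": [8222157, 55629189],
    "readmitted": ["NO", ">30"],
})

# In-memory stand-in for a file such as diabetes_cleaned.csv
buffer = io.StringIO()

# index=False leaves the row index out of the written file
toy.to_csv(buffer, index=False)

# Read the data back to confirm the columns survive unchanged
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```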