## Causes of Readmission in Patients with Diabetes
Unplanned hospital readmissions impose a substantial financial burden of approximately $15-20 billion annually on healthcare providers in the United States. Reducing readmission rates among patients with diabetes has twofold potential: improving healthcare systems and enhancing the overall well-being of patients. Precise management of blood sugar levels in individuals with diabetes yields a noteworthy decrease in both mortality and morbidity rates.

This analysis examines historical trends in diabetes care for patients hospitalized in the US, with the aim of guiding the formulation of medical protocols for better patient care. Particular emphasis is placed on the significance of the HbA1c marker in managing patients with diabetes mellitus. The hypothesis is that measuring HbA1c is linked to a reduction in readmission rates for hospitalized individuals.

## Problem Statement
Hospital readmissions are detrimental to patients with diabetes in terms of health, finances and time. It is important to understand the variables that contribute to these readmissions in order to improve the quality of care for these patients.

## Objectives
1. To identify the most common primary diagnosis by age group
2. To identify the effect of a diabetes diagnosis on readmission rates
3. To identify groups of patients with high probability of readmission
4. To assess the significance of the HbA1c marker in readmission rates

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LogisticRegression

In [31]:
#loading data
df = pd.read_csv('C:/Users/Wendy/Documents/readmission_ds/Diabetes_Readmissions/diabetic_data.csv')
df

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101761,443847548,100162476,AfricanAmerican,Male,[70-80),?,1,3,7,3,...,No,Down,No,No,No,No,No,Ch,Yes,>30
101762,443847782,74694222,AfricanAmerican,Female,[80-90),?,1,4,5,5,...,No,Steady,No,No,No,No,No,No,Yes,NO
101763,443854148,41088789,Caucasian,Male,[70-80),?,1,1,7,1,...,No,Down,No,No,No,No,No,Ch,Yes,NO
101764,443857166,31693671,Caucasian,Female,[80-90),?,2,3,7,10,...,No,Up,No,No,No,No,No,Ch,Yes,NO


In [32]:
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [33]:
df.tail()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
101761,443847548,100162476,AfricanAmerican,Male,[70-80),?,1,3,7,3,...,No,Down,No,No,No,No,No,Ch,Yes,>30
101762,443847782,74694222,AfricanAmerican,Female,[80-90),?,1,4,5,5,...,No,Steady,No,No,No,No,No,No,Yes,NO
101763,443854148,41088789,Caucasian,Male,[70-80),?,1,1,7,1,...,No,Down,No,No,No,No,No,Ch,Yes,NO
101764,443857166,31693671,Caucasian,Female,[80-90),?,2,3,7,10,...,No,Up,No,No,No,No,No,Ch,Yes,NO
101765,443867222,175429310,Caucasian,Male,[70-80),?,1,1,7,6,...,No,No,No,No,No,No,No,No,No,NO


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [35]:
def inspect_df(df):
    """Return a summary of the inspection of the data."""

    n_duplicates = df.duplicated().sum()
    missing_pct = (df.isna().sum().sum() / (df.shape[0] * df.shape[1])) * 100
    return {"Dimensions": f"This data set has {df.shape[0]} rows and {df.shape[1]} columns",
            "Duplicates": f"The data has {n_duplicates} duplicated entries and {len(df) - n_duplicates} non duplicated entries",
            "Missing values (%)": f"{missing_pct} % of the data has missing values",
            "Summary statistics": df.describe().T}


total_rows = len(df)  # row count of the raw data


def inspect_col(df, col):
    """Return a summary of the inspection of the column in focus."""

    # Use the current row count so the percentage stays correct after rows are dropped
    missing_pct = df[col].isna().sum() / len(df) * 100
    return {"Top 5 value counts": df[col].value_counts()[:5],
            "Number of unique values": df[col].nunique(),
            "data type": df[col].dtype,
            "Missing Values": f"{col} has {missing_pct:.2f} % missing values."}


# Summary data of the data frame
inspect_df(df)

{'Dimensions': 'This data set has 101766 rows and 50 columns',
 'Duplicates': 'The data has 0 duplicated entries and 101766 non duplicated entries',
 'Missing values (%)': '0.0 % of the data has missing values',
 'Summary statistics':                              count          mean           std      min  \
 encounter_id              101766.0  1.652016e+08  1.026403e+08  12522.0   
 patient_nbr               101766.0  5.433040e+07  3.869636e+07    135.0   
 admission_type_id         101766.0  2.024006e+00  1.445403e+00      1.0   
 discharge_disposition_id  101766.0  3.715642e+00  5.280166e+00      1.0   
 admission_source_id       101766.0  5.754437e+00  4.064081e+00      1.0   
 time_in_hospital          101766.0  4.395987e+00  2.985108e+00      1.0   
 num_lab_procedures        101766.0  4.309564e+01  1.967436e+01      1.0   
 num_procedures            101766.0  1.339730e+00  1.705807e+00      0.0   
 num_medications           101766.0  1.602184e+01  8.127566e+00      1.0   
 numbe

In the weight column, missing values are represented by a '?' symbol, which pandas does not recognize as missing by default. To capture the missing data accurately, these '?' symbols are replaced with NaN (Not a Number), the standard representation for missing values in a dataframe. This step gives a more precise picture of the extent of missing data within the dataset.

In [36]:
# Missing values (for example in the weight column) are encoded as '?', so replace them with NaN
df.replace('?', np.nan, inplace=True)
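
As an alternative to replacing the symbol after loading, the '?' placeholder can be declared as missing while the file is parsed. The sketch below uses a small in-memory CSV as a stand-in for the real file; the toy rows and the name `toy` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Toy CSV standing in for diabetic_data.csv (values are illustrative only)
csv_text = """encounter_id,race,weight
1,Caucasian,?
2,?,[75-100)
3,AfricanAmerican,?
"""

# na_values='?' tells pandas to parse every '?' as NaN at load time
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

# Count of missing entries per column after parsing
print(toy.isna().sum())
```

Either approach should give the same missing-value counts; handling it in `read_csv` simply avoids a separate replacement step.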

In [37]:
inspect_df(df)

{'Dimensions': 'This data set has 101766 rows and 50 columns',
 'Duplicates': 'The data has 0 duplicated entries and 101766 non duplicated entries',
 'Missing values (%)': '3.790047756618124 % of the data has missing values',
 'Summary statistics':                              count          mean           std      min  \
 encounter_id              101766.0  1.652016e+08  1.026403e+08  12522.0   
 patient_nbr               101766.0  5.433040e+07  3.869636e+07    135.0   
 admission_type_id         101766.0  2.024006e+00  1.445403e+00      1.0   
 discharge_disposition_id  101766.0  3.715642e+00  5.280166e+00      1.0   
 admission_source_id       101766.0  5.754437e+00  4.064081e+00      1.0   
 time_in_hospital          101766.0  4.395987e+00  2.985108e+00      1.0   
 num_lab_procedures        101766.0  4.309564e+01  1.967436e+01      1.0   
 num_procedures            101766.0  1.339730e+00  1.705807e+00      0.0   
 num_medications           101766.0  1.602184e+01  8.127566e+00     

In [38]:
# Inspecting each column
for col in df.columns:
    print(col)
    print(inspect_col(df, col))
    print("="*70)

encounter_id
{'Top 5 value counts': 96210942     1
89943846     1
384306986    1
94650156     1
83156784     1
Name: encounter_id, dtype: int64, 'Number of unique values': 101766, 'data type': dtype('int64'), 'Missing Values': 'encounter_id has 0.00 % missing values.'}
patient_nbr
{'Top 5 value counts': 88785891    40
43140906    28
23199021    23
1660293     23
88227540    23
Name: patient_nbr, dtype: int64, 'Number of unique values': 71518, 'data type': dtype('int64'), 'Missing Values': 'patient_nbr has 0.00 % missing values.'}
race
{'Top 5 value counts': Caucasian          76099
AfricanAmerican    19210
Hispanic            2037
Other               1506
Asian                641
Name: race, dtype: int64, 'Number of unique values': 5, 'data type': dtype('O'), 'Missing Values': 'race has 2.23 % missing values.'}
gender
{'Top 5 value counts': Female             54708
Male               47055
Unknown/Invalid        3
Name: gender, dtype: int64, 'Number of unique values': 3, 'data type': d

{'Top 5 value counts': No        101680
Steady        79
Up             6
Down           1
Name: chlorpropamide, dtype: int64, 'Number of unique values': 4, 'data type': dtype('O'), 'Missing Values': 'chlorpropamide has 0.00 % missing values.'}
glimepiride
{'Top 5 value counts': No        96575
Steady     4670
Up          327
Down        194
Name: glimepiride, dtype: int64, 'Number of unique values': 4, 'data type': dtype('O'), 'Missing Values': 'glimepiride has 0.00 % missing values.'}
acetohexamide
{'Top 5 value counts': No        101765
Steady         1
Name: acetohexamide, dtype: int64, 'Number of unique values': 2, 'data type': dtype('O'), 'Missing Values': 'acetohexamide has 0.00 % missing values.'}
glipizide
{'Top 5 value counts': No        89080
Steady    11356
Up          770
Down        560
Name: glipizide, dtype: int64, 'Number of unique values': 4, 'data type': dtype('O'), 'Missing Values': 'glipizide has 0.00 % missing values.'}
glyburide
{'Top 5 value counts': No        9

## Observations
There are 101,766 unique encounter records but only 71,518 unique patients, which shows that some patients have had two or more encounters that required hospitalization.
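
The repeat-encounter observation can be checked by counting encounters per patient. The following is a minimal sketch on a toy frame; only the column names `encounter_id` and `patient_nbr` come from the dataset, while the values and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame: five encounters spread over three patients (illustrative values)
toy = pd.DataFrame({
    "encounter_id": [101, 102, 103, 104, 105],
    "patient_nbr": [1, 1, 2, 3, 3],
})

# Number of encounters recorded for each patient
encounters_per_patient = toy.groupby("patient_nbr")["encounter_id"].count()

# Patients with more than one encounter
repeat_patients = encounters_per_patient[encounters_per_patient > 1]
n_repeat = len(repeat_patients)
print(f"{n_repeat} patients have more than one encounter")
```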

Missing values form 3.79 % of this data and there are no duplicates in the data.

Some columns have been assigned inappropriate types. For instance, encounter_id and patient_nbr are read as integers, yet they are identifiers of the encounter and the patient respectively rather than numeric quantities.

gender has a category known as 'Unknown/Invalid' which hints at missing values.

weight has over 96% of the column missing.

admission_type_id, discharge_disposition_id and admission_source_id are not really IDs in the conventional sense but nominal categorical variables, so these features will be cast to category.

payer_code has about 40% missing values and medical_specialty has about 50% missing values.

The remaining columns with missing values, namely race, diag_1, diag_2 and diag_3, each have less than 3% missing, so the affected rows can be dropped with little loss. These columns should also be categorical because they take on a limited number of categories without any inherent order. The study referred to these files for a better understanding: 'https://github.com/WendyMwiti/Diabetes_Readmissions/blob/7a78a50202d6044b16bc601093845088f86266be/data/references'

Columns 24 to 46 are drug names. They will be cast to category because they contain only a few repeated categorical values.

Features of type object that are categorical in nature will be cast to category.

The target column in this task is readmitted, which tells us whether a patient was readmitted within 30 days, after more than 30 days, or not at all.

Rows whose discharge_disposition_id indicates that the patient was discharged to hospice or died will be removed, since these patients cannot be readmitted and the rows do not help the model. They are denoted by the values 11, 13, 14, 19, 20 and 21.
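
The removal of hospice and death discharges described above could be sketched as follows. The disposition codes are the ones listed in the observation; the toy frame and the names `HOSPICE_OR_DEATH` and `kept` are assumptions for illustration.

```python
import pandas as pd

# Codes listed in the observation above as hospice or death
HOSPICE_OR_DEATH = [11, 13, 14, 19, 20, 21]

# Toy frame with two encounters that should be removed (illustrative values)
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "discharge_disposition_id": [1, 11, 6, 20],
})

# Keep only encounters whose disposition is not hospice or death
kept = toy[~toy["discharge_disposition_id"].isin(HOSPICE_OR_DEATH)].copy()
print(kept)
```

Taking a `.copy()` of the filtered frame keeps later column assignments from touching a slice of the original.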

## Data Preparation 

### Completeness
Completeness refers to whether all the required data is present. Missing data can lead to inaccurate or biased results. A completeness check involves:
1. Identifying missing values in all the columns
2. Deciding how to handle them, for example by removal or imputation

Several columns within the dataset contain missing values. Specifically, the following columns have missing data: race (2.23%), weight (96.86%), payer_code (39.56%), medical_specialty (49.08%), diag_1 (0.02%), diag_2 (0.35%), and diag_3 (1.40%).
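
Per-column percentages like the ones listed above can be obtained in one pass by averaging the missing-value mask. This is a minimal sketch on a toy frame; the values and the names `missing_pct` and `missing_only` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (illustrative values)
toy = pd.DataFrame({
    "race": ["Caucasian", np.nan, "Asian", "Hispanic"],
    "weight": [np.nan, np.nan, np.nan, "[50-75)"],
    "age": ["[0-10)", "[10-20)", "[20-30)", "[30-40)"],
})

# The mean of the boolean mask is the fraction of missing values per column
missing_pct = toy.isna().mean().mul(100).round(2)

# Keep only columns that actually have missing values, largest share first
missing_only = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_only)
```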

The following operations shall be performed on the dataset:

**Removing Columns with Significant Missing Values**

The 'weight' and 'payer_code' columns have a significant proportion of missing values (96.86% and 39.56% respectively) and shall therefore be dropped from the dataset.

**Dropping Rows**

'race', 'diag_1', 'diag_2', and 'diag_3' are critical columns for this analysis and have a very low percentage of missing values. Therefore, rows with missing values in any of these columns shall be dropped from the dataset.

**Filling Missing Values**

Although the 'medical_specialty' feature has a substantial 49% of missing values, it shall be retained in the dataset. Rather than discarding it, the missing entries shall be replaced with the label 'missing', because the feature is considered significant in influencing the study's outcomes.


In [39]:
def preprocess_data(df):
    """Handle missing data. Note that the frame passed in is modified in place."""
    # Drop columns with a large share of missing values
    columns_to_drop = ['weight', 'payer_code']
    df.drop(columns=columns_to_drop, inplace=True)

    # Drop rows with missing values in the critical columns
    columns_to_check = ['race', 'diag_1', 'diag_2', 'diag_3']
    df.dropna(subset=columns_to_check, inplace=True)

    # Replace missing values in 'medical_specialty' with the label 'missing'
    df['medical_specialty'] = df['medical_specialty'].fillna('missing')

    return df


In [40]:
df = preprocess_data(df)
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 98053 entries, 1 to 101765
Data columns (total 48 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   encounter_id              98053 non-null  int64 
 1   patient_nbr               98053 non-null  int64 
 2   race                      98053 non-null  object
 3   gender                    98053 non-null  object
 4   age                       98053 non-null  object
 5   admission_type_id         98053 non-null  int64 
 6   discharge_disposition_id  98053 non-null  int64 
 7   admission_source_id       98053 non-null  int64 
 8   time_in_hospital          98053 non-null  int64 
 9   medical_specialty         98053 non-null  object
 10  num_lab_procedures        98053 non-null  int64 
 11  num_procedures            98053 non-null  int64 
 12  num_medications           98053 non-null  int64 
 13  number_outpatient         98053 non-null  int64 
 14  number_emergency     

### Validity
Validity refers to how accurately and correctly the data represents the events it is meant to capture. Here, the aim is to ensure that data types are appropriate for the information they represent.

In [41]:
columns_to_cast = [
    'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
    'race', 'diag_1', 'diag_2', 'diag_3', 'metformin', 'repaglinide',
    'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
    'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
    'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
    'tolazamide', 'examide', 'citoglipton', 'insulin',
    'glyburide-metformin', 'glipizide-metformin',
    'glimepiride-pioglitazone', 'metformin-rosiglitazone',
    'metformin-pioglitazone', 'age', 'gender', 'medical_specialty',
    'max_glu_serum', 'A1Cresult', 'change', 'diabetesMed', 'readmitted'
]


In [42]:
for column in columns_to_cast:
    df[column] = df[column].astype('category')
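
The cast above produces unordered categories, which suits nominal features. The age brackets, however, have a natural order, so an ordered categorical is one optional refinement. The sketch below is not part of the original notebook: the bracket list is generated on the assumption that ages run in ten-year bins from `[0-10)` to `[90-100)`, and the toy frame and variable names are illustrative.

```python
import pandas as pd

# Assumed ten-year brackets in ascending order, formatted like the age column
age_brackets = [f"[{lo}-{lo + 10})" for lo in range(0, 100, 10)]

# Toy frame with a few age brackets (illustrative values)
toy = pd.DataFrame({"age": ["[70-80)", "[0-10)", "[40-50)"]})

# An ordered dtype lets pandas compare and sort brackets by their position
age_dtype = pd.CategoricalDtype(categories=age_brackets, ordered=True)
toy["age"] = toy["age"].astype(age_dtype)

oldest = toy["age"].max()
older_than_40 = toy["age"] > "[30-40)"
print(oldest, older_than_40.tolist())
```

With an ordered dtype, sorting and comparisons follow the bracket order rather than alphabetical order, which can be convenient when grouping results by age.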

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98053 entries, 1 to 101765
Data columns (total 48 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   encounter_id              98053 non-null  int64   
 1   patient_nbr               98053 non-null  int64   
 2   race                      98053 non-null  category
 3   gender                    98053 non-null  category
 4   age                       98053 non-null  category
 5   admission_type_id         98053 non-null  category
 6   discharge_disposition_id  98053 non-null  category
 7   admission_source_id       98053 non-null  category
 8   time_in_hospital          98053 non-null  int64   
 9   medical_specialty         98053 non-null  category
 10  num_lab_procedures        98053 non-null  int64   
 11  num_procedures            98053 non-null  int64   
 12  num_medications           98053 non-null  int64   
 13  number_outpatient         98053 non-null  int