## Readmission Causes of Patients with Diabetes
Unplanned hospital readmissions cost healthcare providers in the United States approximately $15-20 billion annually. Reducing readmission rates among patients with diabetes has a twofold benefit: it eases the burden on healthcare systems and improves patients' overall well-being. Precise management of blood sugar levels in people with diabetes yields a noteworthy decrease in both mortality and morbidity rates.

This analysis examines historical trends in diabetes care for patients hospitalized in the US, with the aim of informing medical protocols that improve patient care. Particular emphasis is placed on the role of the HbA1c marker in managing patients with diabetes mellitus. The hypothesis is that measuring HbA1c is linked to a reduction in readmission rates for hospitalized patients.

## Problem Statement
Hospital readmissions harm patients with diabetes in terms of health, finances, and time. Understanding the variables that contribute to these readmissions is important for improving the quality of care these patients receive.

## Objectives
1. To identify the most common primary diagnosis by age group
2. To identify the effect of a diabetes diagnosis on readmission rates
3. To identify groups of patients with high probability of readmission
4. To assess the significance of the HbA1c marker in readmission rates
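
As an illustration of the first objective, the following is a minimal sketch of how the most common primary diagnosis per age group could be computed. It assumes the `age` and `diag_1` columns of this dataset; the toy rows and diagnosis codes are hypothetical and are not results from the data.

```python
import pandas as pd

# Toy data (hypothetical values); only the column names `age` and `diag_1`
# are taken from the dataset used in this notebook.
toy = pd.DataFrame({
    "age": ["[50-60)", "[50-60)", "[50-60)", "[70-80)", "[70-80)", "[70-80)"],
    "diag_1": ["414", "414", "250", "428", "428", "414"],
})

# Count each (age, diag_1) pair.
counts = toy.groupby(["age", "diag_1"]).size().reset_index(name="n")

# Within each age group, keep the diagnosis code with the highest count.
top_diag = (
    counts.sort_values(["age", "n"], ascending=[True, False])
          .drop_duplicates("age")
          .set_index("age")["diag_1"]
)
print(top_diag)
```

Ties within an age group are broken arbitrarily in this sketch; a real analysis may want to report all tied codes.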

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LogisticRegression

In [20]:
#loading data
df = pd.read_csv('C:/Users/Wendy/Documents/readmission_ds/Diabetes_Readmissions/diabetic_data.csv')

In [31]:
def inspect_df(df):
    """Print df.info() and return a dictionary summarising the data set."""
    n_duplicates = df.duplicated().sum()
    # Share of all cells (rows x columns) that are missing.
    # Dividing the number of missing cells by the number of rows instead
    # can exceed 100 % whenever rows have more than one missing value.
    pct_missing = df.isna().sum().sum() / df.size * 100

    # df.info() prints its report and returns None, so call it for its
    # side effect rather than storing the result in the summary.
    df.info()

    return {"Dimensions": f"This data set has {df.shape[0]} rows and {df.shape[1]} columns",
            "Duplicates": f"The data has {n_duplicates} duplicated entries and {len(df) - n_duplicates} non duplicated entries",
            "Missing values (%)": f"{pct_missing:.2f} % of the cells are missing",
            "Summary statistics": df.describe().T}


def inspect_col(df, col):
    """Return a summary of the column in focus."""
    # Use the length of the frame passed in rather than a global row count.
    pct_missing = df[col].isna().mean() * 100
    return {"Top 5 value counts": df[col].value_counts()[:5],
            "Number of unique values": df[col].nunique(),
            "Missing Values": f"{col} has {pct_missing:.2f} % missing values."}

In [32]:
inspect_df(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      99493 non-null   object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    3197 non-null    object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                61510 non-null   object
 11  medical_specialty         51817 non-null   object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

{'Dimensions': 'This data set has 101766 rows and 50 columns',
 'Duplicates': 'The data has 0 duplicated entries and 101766 non duplicated entries',
 'Missing values (%)': '3.79 % of the cells are missing',
 'Summary statistics':                              count          mean           std      min  \
 encounter_id              101766.0  1.652016e+08  1.026403e+08  12522.0   
 patient_nbr               101766.0  5.433040e+07  3.869636e+07    135.0   
 admission_type_id         101766.0  2.024006e+00  1.445403e+00      1.0   
 discharge_disposition_id  101766.0  3.715642e+00  5.280166e+00      1.0   
 admission_source_id       101766.0  5.754437e+00  4.064081e+00      1.0   
 time_in_hospital          101766.0  4.395987e+00  2.985108e+00      1.0   
 num_lab_procedures        101766.0  4.309564e+01  1.967436e+01      1.0   
 num_procedures            101766.0  1.339730e+00  1.705807e+00      0.0   
 num_medications           101766.0  1.602184e+01  8.127566e+00     

In the weight column, missing values are represented by a '?' symbol, which pandas does not recognize as missing by default. To capture the missing data accurately, we replace each '?' with NaN (Not a Number), the standard representation for missing values in a DataFrame. This gives a more precise picture of the extent of missing data in the dataset.

In [33]:
# Missing values (e.g. in the weight column) are encoded as '?', so replace them with NaN
df.replace('?', np.nan, inplace=True)
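
To make the effect of this replacement concrete, here is a small sketch on a toy frame (hypothetical rows, not the real data). It also shows an alternative that is an assumption on my part rather than what this notebook does: passing `na_values="?"` to `pd.read_csv` so that the placeholders are converted to NaN while loading.

```python
import io

import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) where '?' marks a missing weight.
toy = pd.DataFrame({
    "weight": ["?", "[75-100)", "?"],
    "gender": ["Female", "Male", "Female"],
})

# Before the replacement, '?' is an ordinary string, so nothing counts as missing.
missing_before = int(toy["weight"].isna().sum())

# After the replacement, the placeholders are real NaN values.
cleaned = toy.replace("?", np.nan)
missing_after = int(cleaned["weight"].isna().sum())

# Alternative: let read_csv treat '?' as missing while loading.
csv_text = "weight,gender\n?,Female\n[75-100),Male\n?,Female\n"
loaded = pd.read_csv(io.StringIO(csv_text), na_values="?")
missing_on_read = int(loaded["weight"].isna().sum())

print(missing_before, missing_after, missing_on_read)
```

Handling the placeholder at load time avoids a separate in-place step, but either approach should give the same missing-value counts.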

In [34]:
# Summarise every column in turn
for col in df.columns:
    print(inspect_col(df, col))


{'Top 5 value counts': 96210942     1
89943846     1
384306986    1
94650156     1
83156784     1
Name: encounter_id, dtype: int64, 'Number of unique values': 101766, 'Missing Values': 'encounter_id has 0.00 % missing values.'}
{'Top 5 value counts': 88785891    40
43140906    28
23199021    23
1660293     23
88227540    23
Name: patient_nbr, dtype: int64, 'Number of unique values': 71518, 'Missing Values': 'patient_nbr has 0.00 % missing values.'}
{'Top 5 value counts': Caucasian          76099
AfricanAmerican    19210
Hispanic            2037
Other               1506
Asian                641
Name: race, dtype: int64, 'Number of unique values': 5, 'Missing Values': 'race has 2.23 % missing values.'}
{'Top 5 value counts': Female             54708
Male               47055
Unknown/Invalid        3
Name: gender, dtype: int64, 'Number of unique values': 3, 'Missing Values': 'gender has 0.00 % missing values.'}
{'Top 5 value counts': [70-80)    26068
[60-70)    22483
[50-60)    17256
[80-9

Several columns within the dataset contained missing values. Specifically, the following columns had missing data: race (2.23%), weight (96.86%), payer_code (39.56%), medical_specialty (49.08%), diag_1 (0.02%), diag_2 (0.35%), and diag_3 (1.40%).
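
Percentages like these can be obtained in one pass rather than column by column. The sketch below uses a toy frame with hypothetical rows; only the column names are taken from this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with gaps in two of three columns.
toy = pd.DataFrame({
    "race": ["Caucasian", np.nan, "Asian", "Hispanic"],
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "gender": ["Female", "Male", "Female", "Male"],
})

# isna().mean() gives the fraction of missing entries per column.
missing_pct = toy.isna().mean().mul(100).round(2)

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```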

Considering the extraordinarily high percentage of missing values in the weight column, it is impractical to include it in the subsequent analysis. Therefore, the decision has been made to omit this column from further consideration.

The payer_code column, in addition to having a substantial proportion of missing values, does not align with the primary objective of this project. Consequently, it will be removed from the dataset.

The medical_specialty column also has a large share of missing values. Here a prudent approach is to fill the missing entries with the placeholder 'missing', which retains the available information while acknowledging the gaps.

Given the small number of missing values in the race, diag_1, diag_2, and diag_3 columns, it is reasonable to exclude the affected records, preserving the integrity of the remaining data for analysis.
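
The four cleaning decisions described above could be sketched as a single pandas chain. This is an illustration on a toy frame with hypothetical rows, not necessarily the exact code used later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) using the column names discussed above.
toy = pd.DataFrame({
    "race": ["Caucasian", np.nan, "Asian"],
    "weight": [np.nan, np.nan, "[75-100)"],
    "payer_code": ["MC", np.nan, np.nan],
    "medical_specialty": [np.nan, "Cardiology", np.nan],
    "diag_1": ["428", "414", np.nan],
    "diag_2": ["250", "250", "401"],
    "diag_3": ["401", "428", "250"],
})

cleaned = (
    toy.drop(columns=["weight", "payer_code"])          # drop the two columns
       .fillna({"medical_specialty": "missing"})        # placeholder for unknown specialty
       .dropna(subset=["race", "diag_1", "diag_2", "diag_3"])  # drop rows with gaps here
       .reset_index(drop=True)
)
print(cleaned)
```

Filling medical_specialty before dropping rows ensures that no record is discarded merely because its specialty is unknown.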
