# Analysis of Effective Treatment for Diabetic Patients



### About data
The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks for diabetic patients.<br> 
It includes over 50 features representing patient and hospital outcomes.



Thanks to John Clore, Krzysztof Cios, Jon DeShazo & Beata Strack for creating and donating this data to UCI Machine Learning Repository.<br> 
[UCI Link](https://archive-beta.ics.uci.edu/ml/datasets/diabetes+130+us+hospitals+for+years+1999+2008#Descriptive)

“Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.<br> 
[Research Article Link](https://downloads.hindawi.com/journals/bmri/2014/781670.pdf)


### Importing required libraries

In [2]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Importing data
- Importing diabetic_data.csv and IDs_mapping.csv files for analysis

In [17]:
# Importing diabetic_data.csv dataset file to pandas dataframe
diabetic_data = pd.read_csv('datasets/diabetic_data.csv')

### Analysing diabetic_data

In [52]:
# Checking row and column count of diabetic_data dataset
diabetic_data.shape

(101766, 50)

The diabetic dataset has 101,766 records and 50 columns.

In [55]:
# Checking first 5 rows of diabetic_data
diabetic_data.head().T

Unnamed: 0,0,1,2,3,4
encounter_id,2278392,149190,64410,500364,16680
patient_nbr,8222157,55629189,86047875,82442376,42519267
race,Caucasian,Caucasian,AfricanAmerican,Caucasian,Caucasian
gender,Female,Female,Female,Male,Male
age,[0-10),[10-20),[20-30),[30-40),[40-50)
weight,?,?,?,?,?
admission_type_id,6,1,1,1,1
discharge_disposition_id,25,1,1,1,1
admission_source_id,1,7,7,7,7
time_in_hospital,1,3,2,2,1


In [53]:
# Checking columns of diabetic_data
diabetic_data.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [25]:
# Checking column data type and info for diabetic_data
diabetic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [29]:
# Checking whether there are any null values
diabetic_data.isnull().values.any()

False

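There are no null values in the diabetic_data dataset, although `weight` shows `?` in the first rows, so missing values may be encoded as placeholders.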
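The `?` entries in `weight` would not be caught by `isnull()`, because they are ordinary strings. A minimal sketch of counting such placeholders and converting them to `NaN`, using a small made-up frame (`sample`, its values, and the assumption that `?` always marks a missing value are illustrative, not taken from the dataset documentation):

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for diabetic_data; '?' is assumed to mark a missing value
sample = pd.DataFrame({
    'race': ['Caucasian', '?', 'AfricanAmerican'],
    'weight': ['?', '?', '[75-100)'],
    'time_in_hospital': [1, 3, 2],
})

# isnull() sees no missing values because '?' is an ordinary string
print(sample.isnull().values.any())

# Count '?' placeholders per column
placeholder_counts = (sample == '?').sum()
print(placeholder_counts)

# Convert the placeholders to NaN so pandas treats them as missing
cleaned = sample.replace('?', np.nan)
print(cleaned.isnull().sum())
```

The same two steps could be applied to `diabetic_data` to see which columns are affected before deciding how to handle them.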

In [127]:
# Importing IDs_mapping.csv file to pandas dataframe
#id_mapping = pd.read_csv('datasets/IDs_mapping.csv',header=None)
id_mapping = pd.read_csv('datasets/IDs_mapping.csv')

In [128]:
id_mapping.shape

(67, 2)

In [129]:
id_mapping.head(11)

Unnamed: 0,admission_type_id,description
0,1,Emergency
1,2,Urgent
2,3,Elective
3,4,Newborn
4,5,Not Available
5,6,
6,7,Trauma Center
7,8,Not Mapped
8,,
9,discharge_disposition_id,description


In [130]:
id_mapping.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   admission_type_id  65 non-null     object
 1   description        62 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


In [109]:
# id_mapping stacks 3 lookup tables separated by all-null rows, so split it into 3 dataframes.
# The header handling below assumes the file was read with header=None (the commented-out
# read_csv call above), so that each table's column names sit in its first row.
df_list = np.split(id_mapping, id_mapping[id_mapping.isnull().all(axis=1)].index)

admission_type_id = df_list[0]

# Promote the first row to column names, then drop that row from the data
admission_type_id.columns = admission_type_id.iloc[0]
admission_type_id = admission_type_id[1:]

admission_type_id

Unnamed: 0,admission_type_id,description
1,1,Emergency
2,2,Urgent
3,3,Elective
4,4,Newborn
5,5,Not Available
6,6,
7,7,Trauma Center
8,8,Not Mapped


In [125]:
admission_type_id.info()

In [122]:
discharge_disposition_id = df_list[1]

# Drop the all-null separator row that starts this chunk
discharge_disposition_id = discharge_disposition_id[1:]

# Promote the first remaining row to column names, then drop it from the data
discharge_disposition_id.columns = discharge_disposition_id.iloc[0]
discharge_disposition_id = discharge_disposition_id[1:]

discharge_disposition_id

10,discharge_disposition_id,description
11,1,Discharged to home
12,2,Discharged/transferred to another short term h...
13,3,Discharged/transferred to SNF
14,4,Discharged/transferred to ICF
15,5,Discharged/transferred to another type of inpa...
16,6,Discharged/transferred to home with home healt...
17,7,Left AMA
18,8,Discharged/transferred to home under care of H...
19,9,Admitted as an inpatient to this hospital
20,10,Neonate discharged to another hospital for neo...


In [123]:
discharge_disposition_id.columns

Index(['discharge_disposition_id', 'description'], dtype='object', name=10)

In [124]:
discharge_disposition_id.info()

In [None]:
# Look up the description for discharge_disposition_id '2' (ids are stored as strings here)
discharge_disposition_id.loc[discharge_disposition_id['discharge_disposition_id'] == '2', 'description']
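
A lookup table like this can also be turned into a dictionary, which makes it easy to attach descriptions to every encounter at once. A minimal sketch, assuming the ids in the lookup table are strings (as the `== '2'` comparison suggests) while the ids in `diabetic_data` are integers (as `info()` reports); `lookup`, `encounters` and `id_to_description` are hypothetical names with made-up rows:

```python
import pandas as pd

# Hypothetical stand-in for the discharge_disposition_id table (ids stored as strings)
lookup = pd.DataFrame({
    'discharge_disposition_id': ['1', '3', '7'],
    'description': ['Discharged to home', 'Discharged/transferred to SNF', 'Left AMA'],
})

# Build {id: description}; cast keys to int so they match integer ids in the main data
id_to_description = dict(zip(lookup['discharge_disposition_id'].astype(int),
                             lookup['description']))

# Hypothetical encounters with integer ids; 25 is missing from the small lookup above
encounters = pd.DataFrame({'discharge_disposition_id': [1, 7, 25]})
encounters['discharge_description'] = (
    encounters['discharge_disposition_id'].map(id_to_description)
)
print(encounters)
```

Ids without an entry in the dictionary come back as `NaN`, which makes unmapped codes easy to spot.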

In [99]:
admission_source_id = df_list[2]

# Drop the all-null separator row that starts this chunk
admission_source_id = admission_source_id[1:]

# Promote the first remaining row to column names, then drop it from the data
admission_source_id.columns = admission_source_id.iloc[0]
admission_source_id = admission_source_id[1:]

admission_source_id

42,admission_source_id,description
43,1,Physician Referral
44,2,Clinic Referral
45,3,HMO Referral
46,4,Transfer from a hospital
47,5,Transfer from a Skilled Nursing Facility (SNF)
48,6,Transfer from another health care facility
49,7,Emergency Room
50,8,Court/Law Enforcement
51,9,Not Available
52,10,Transfer from critial access hospital
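
The three splitting cells above repeat the same steps: skip the blank separator row, promote the first row to a header, and drop that row from the data. A sketch of how this could be wrapped in one helper, assuming the file is read with `header=None` and that all-null rows separate the tables. The name `split_stacked_tables` is hypothetical, and the inline CSV is a shortened stand-in for `IDs_mapping.csv`:

```python
import io
import pandas as pd

def split_stacked_tables(raw):
    """Split a frame of stacked tables (separated by all-null rows) into a dict
    keyed by the first header cell of each table."""
    is_blank = raw.isnull().all(axis=1)
    # Each blank row starts a new group; cumsum gives a group id per row
    group_ids = is_blank.cumsum()
    tables = {}
    for _, chunk in raw[~is_blank].groupby(group_ids[~is_blank]):
        header = chunk.iloc[0].tolist()   # first row holds the column names
        table = chunk.iloc[1:].copy()     # remaining rows are data
        table.columns = header
        tables[header[0]] = table.reset_index(drop=True)
    return tables

# Shortened, hypothetical stand-in for datasets/IDs_mapping.csv
csv_text = """admission_type_id,description
1,Emergency
2,Urgent
,
discharge_disposition_id,description
1,Discharged to home
,
admission_source_id,description
1,Physician Referral
2,Clinic Referral
"""

raw = pd.read_csv(io.StringIO(csv_text), header=None)
tables = split_stacked_tables(raw)
print(list(tables))
print(tables['admission_source_id'])
```

Only rows where every cell is null count as separators, so a row with an id but no description (such as id 6 in the admission type table) stays in its table.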


### Python
- Define a custom function to create reusable code (10)
- NumPy (10)
- Dictionary or Lists (10)

### Machine Learning (60)
- Predict a target variable with Supervised or Unsupervised algorithm
- You are free to choose any algorithm
- Perform hyperparameter tuning or boosting, whichever is relevant to your model. If it is not relevant, justify that in your report and Python comments

### Visualise
- Present two charts with Seaborn or Matplotlib (20)

### Generate valuable insights
- 5 insights from the project (20)

1. Gain insight into scoping in Python and be able to write functions with multiple parameters and multiple return values, along with default arguments and variable-length arguments.

2. Have a clear understanding of iterators, objects, list comprehensions and generators.

3. Identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced and deal with improper data types, validate that data is in the correct range, handle missing data and perform record linkage.

4. Understand string manipulation using regular expression and work with datasets containing movie reviews or streamed tweets that can be used to determine opinion, as well as with raw text scraped from the web.

5. Understand the two principles of statistical inference, parameter estimation and hypothesis testing, and work on real-world datasets to solve real inference problems.

6. Use tools a data scientist needs to clean and validate data, to visualize distributions and relationships between variables, and to use regression models to predict and visualize.

7. Understand and build supervised predictive models, tune their parameters, and determine how well they will perform with unseen data on real-world datasets.

8. Work on unlabeled datasets using unsupervised clustering algorithms to transform, visualize and extract insights, and build a recommender system on a real-world use case.

9. Use deep learning to optimize natural language processing, image and speech recognition, robotics and many more.

10. Understand the advantages and shortcomings of trees and demonstrate how ensembling can alleviate these shortcomings.

In [48]:
def df_analysis(df):
    '''Run a basic analysis of any given dataframe.

    Displays the shape (rows and columns), whether any null values exist,
    the column names, the info summary, and the first 5 records.

    Args:
        df : pandas dataframe
    '''
    # Printing shape (rows and columns) of dataframe
    print("Data has Rows = {}, Columns = {}".format(df.shape[0], df.shape[1]))
    # Checking null values in dataframe
    print("Any null values:", df.isnull().values.any())
    # Printing columns
    print(df.columns)
    # Printing info (info() prints directly and returns None, so it is not wrapped in print)
    df.info()
    # Printing head (first 5 rows)
    print(df.head())
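
A function that only prints is hard to reuse in later steps. As a sketch of an alternative, the same checks could be returned as a dictionary, with a default argument controlling how many rows are previewed. The name `summarize_df`, its `n_rows` parameter and the `demo` frame are hypothetical and not part of the original notebook:

```python
import pandas as pd

def summarize_df(df, n_rows=5):
    """Return basic facts about a dataframe as a dictionary instead of printing them."""
    return {
        'n_rows': df.shape[0],
        'n_columns': df.shape[1],
        'has_nulls': bool(df.isnull().values.any()),
        'columns': list(df.columns),
        'head': df.head(n_rows),
    }

# Hypothetical two-row frame for illustration
demo = pd.DataFrame({'encounter_id': [2278392, 149190], 'gender': ['Female', None]})
summary = summarize_df(demo, n_rows=1)
print(summary['n_rows'], summary['n_columns'], summary['has_nulls'])
```

Returning values this way lets the results be compared across dataframes or checked in assertions.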