## Wiki Project

This project uses a diabetes dataset to evaluate which factors predict or explain readmissions of diabetic patients. A journal article related to the dataset is attached for background. For additional information about the dataset, see:

https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008


1. Preprocess your data.
    1. Handle missing values*
    1. Encode your categorical variables (e.g., using `get_dummies()` or bag of words)*
    1. Bin your numerical variables*
    1. Conduct correlation analysis*
    1. Combine variables
    1. Apply data reduction techniques for numerical variables (e.g., PCA, SVD)
    1. Balance your classes
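
The last two preprocessing steps (data reduction and class balancing) could look roughly like the sketch below. It is a minimal illustration rather than the project's chosen method: the toy frame, its column names and values, and the helpers `pca_via_svd` and `downsample_to_minority` are all assumptions made up for the example. PCA is computed directly from an SVD of the centered data, without scaling the columns first, and balancing is done by random downsampling to the smallest class.

```python
import numpy as np
import pandas as pd


def pca_via_svd(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of X onto the top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


def downsample_to_minority(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n_min = int(df[target].value_counts().min())
    return df.groupby(target).sample(n=n_min, random_state=seed)


# Toy data (hypothetical columns and values, for illustration only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=100),
    "num_lab_procedures": rng.integers(1, 100, size=100),
    "num_medications": rng.integers(1, 40, size=100),
    "readmitted": ["NO"] * 80 + ["<30"] * 20,
})

numeric_cols = ["time_in_hospital", "num_lab_procedures", "num_medications"]
components = pca_via_svd(toy[numeric_cols].to_numpy(dtype=float), n_components=2)
balanced = downsample_to_minority(toy, "readmitted")

print(components.shape)
print(balanced["readmitted"].value_counts())
```

Downsampling discards rows from the majority class; oversampling the minority class is the usual alternative when data is scarce. Standardizing the numeric columns before the SVD is also worth considering, since otherwise the column with the largest scale dominates the components.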

In [None]:
from zipfile import ZipFile
import pandas as pd


In [None]:
# Inputs
datafile="dataset_diabetes.zip"

In [None]:
mapping=[]
diabetic=[]

# Read both CSV files straight from the zip archive
with ZipFile(datafile) as rawData:
    for info in rawData.infolist():
        print("Reading: ", info.filename)
        with rawData.open(info.filename) as f:
            if 'IDs_mapping' in info.filename:
                mapping = pd.read_csv(f)
            else:
                diabetic = pd.read_csv(f)

# Remove columns with too many unknown values
dropped_Cols = ['weight', 'medical_specialty', 'payer_code']
diabetic.drop(columns=dropped_Cols, inplace=True)
# Drop remaining rows that contain an unknown value ('?')
diabetic.drop(index=diabetic[diabetic.isin(['?']).any(axis=1)].index, inplace=True)
# Let's check what's left (a bare expression mid-cell would not be displayed)
print(diabetic.shape)

# One-hot encode the remaining categorical columns
diabetic = pd.get_dummies(diabetic, prefix_sep='_', drop_first=False)



Reading:  dataset_diabetes/diabetic_data.csv
Reading:  dataset_diabetes/IDs_mapping.csv
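
Since the dataset marks unknown values with `?`, one alternative for the missing-value step is to let pandas convert that placeholder to `NaN` while reading, then measure how much of each column is missing before deciding what to drop. The sketch below uses a tiny made-up CSV in place of `diabetic_data.csv`, and the 50% cutoff is a hypothetical threshold, not a value taken from the project.

```python
import io

import pandas as pd

# Tiny stand-in for diabetic_data.csv; the rows are made up for illustration.
raw_csv = io.StringIO(
    "encounter_id,race,weight,payer_code\n"
    "1,Caucasian,?,MC\n"
    "2,?,?,?\n"
    "3,AfricanAmerican,?,HM\n"
    "4,Caucasian,[75-100),?\n"
)

# na_values="?" turns the placeholder into NaN at read time.
sample = pd.read_csv(raw_csv, na_values="?")

# Share of missing values per column, highest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: drop columns that are more than half missing,
# then drop any rows that still contain a missing value.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = sample.drop(columns=to_drop).dropna()
print(to_drop, cleaned.shape)
```

Working with `NaN` instead of the literal `?` also makes `isnull().sum()` report the real number of missing values per column.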


In [None]:
pd.DataFrame({'mean': diabetic.mean(),
              'sd': diabetic.std(),
              'min': diabetic.min(),
              'max': diabetic.max(),
              'median': diabetic.median(),
              'length': len(diabetic),
              'miss.val': diabetic.isnull().sum(),
             })

| | mean | sd | min | max | median | length | miss.val |
|---|---|---|---|---|---|---|---|
| encounter_id | 1.658294e+08 | 1.024322e+08 | 12522 | 443867222 | 153301920.0 | 98053 | 0 |
| patient_nbr | 5.484792e+07 | 3.866175e+07 | 135 | 189502619 | 46877904.0 | 98053 | 0 |
| admission_type_id | 2.025813e+00 | 1.450117e+00 | 1 | 8 | 1.0 | 98053 | 0 |
| discharge_disposition_id | 3.753368e+00 | 5.309392e+00 | 1 | 28 | 1.0 | 98053 | 0 |
| admission_source_id | 5.776692e+00 | 4.071640e+00 | 1 | 25 | 7.0 | 98053 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| diabetesMed_No | 2.315278e-01 | 4.218110e-01 | 0 | 1 | 0.0 | 98053 | 0 |
| diabetesMed_Yes | 7.684722e-01 | 4.218110e-01 | 0 | 1 | 1.0 | 98053 | 0 |
| readmitted_<30 | 1.128573e-01 | 3.164199e-01 | 0 | 1 | 0.0 | 98053 | 0 |
| readmitted_>30 | 3.533701e-01 | 4.780188e-01 | 0 | 1 | 0.0 | 98053 | 0 |
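
In the summary, `diabetesMed_No` and `diabetesMed_Yes` have means that sum to 1 and identical standard deviations, which is what two perfectly anti-correlated dummy columns look like when `drop_first=False`. A correlation analysis would flag such pairs. The sketch below is one possible way to list them; the helper `high_correlation_pairs`, the 0.8 threshold, and the toy values are assumptions for illustration.

```python
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs


# Toy frame with made-up values; the two dummy columns are exact complements.
toy = pd.DataFrame({
    "diabetesMed_No": [1, 0, 0, 1, 0, 0],
    "diabetesMed_Yes": [0, 1, 1, 0, 1, 1],
    "admission_type_id": [1, 1, 2, 3, 1, 2],
})

for col_a, col_b, r in high_correlation_pairs(toy):
    print(f"{col_a} vs {col_b}: r = {r:.3f}")
```

For two-level categories, passing `drop_first=True` to `get_dummies()` removes this redundancy at the encoding stage.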


In [None]:
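# Bin encounter_id into a small number of equal-width intervals (10 is an arbitrary choice).
# Passing range(0, 500000000) as the bin edges asks pd.cut for roughly 500 million bins.
diabetic['encounter_bin'] = pd.cut(diabetic.encounter_id, bins=10, labels=False)
diabetic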


| | admission_type_id | description |
|---|---|---|
| 0 | 1 | Emergency |
| 1 | 2 | Urgent |
| 2 | 3 | Elective |
| 3 | 4 | Newborn |
| 4 | 5 | Not Available |
| ... | ... | ... |
| 62 | 22 | Transfer from hospital inpt/same fac reslt in... |
| 63 | 23 | Born inside this hospital |
| 64 | 24 | Born outside this hospital |
| 65 | 25 | Transfer from Ambulatory Surgery Center |


KernelInterrupted: Execution interrupted by the Jupyter kernel.

(26755, 49)
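
For the binning step, `pd.cut` accepts either a bin count or an explicit list of edges, and `pd.qcut` bins by quantiles instead. The sketch below compares the three on a short made-up series. The series name `time_in_hospital`, its values, the edges, and the labels are illustrative assumptions.

```python
import pandas as pd

# Made-up values standing in for a numeric column.
values = pd.Series([1, 3, 4, 7, 8, 12], name="time_in_hospital")

# Equal-width bins: the observed range is split into 3 intervals of equal width.
equal_width = pd.cut(values, bins=3, labels=False)

# Equal-frequency bins: roughly the same number of rows lands in each bin.
equal_freq = pd.qcut(values, q=3, labels=False)

# Explicit edges with readable labels (edges chosen arbitrarily for the example).
labeled = pd.cut(values, bins=[0, 4, 9, 14], labels=["short", "medium", "long"])

print(pd.DataFrame({
    "value": values,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
    "labeled": labeled,
}))
```

Binning makes most sense for measured quantities. Identifier columns such as `encounter_id` carry no numeric meaning, so their bins are unlikely to help a model.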

In [None]:
diabetic.shape
