In [1]:
import pandas as pd 

# Importing and Cleaning CKD data (from UCI's ML repository)

**Data Source: Rubini, L., Soundarapandian, P., & Eswaran, P. (2015). Chronic Kidney Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5G020.**

In [6]:
df = pd.read_csv('kidney_disease.csv')

# Drop rows that are missing a value in any feature of interest or in the target
df_clean = df.dropna(subset=['age', 'bp', 'htn', 'sc', 'hemo', 'classification', 'bgr', 'dm']).copy()

df_clean = df_clean.map(lambda x: x.strip() if isinstance(x, str) else x)

# Encoding the categorical variables into binary for heatmap + later analysis
pd.set_option('future.no_silent_downcasting', True)
df_clean.replace(
    {'yes': 1, 'no': 0,
     'present': 1, 'notpresent': 0,
     'poor': 1, 'good': 0,
     'ckd':1, 'notckd':0,
     'abnormal':1, 'normal':0}, inplace=True)
df_clean = df_clean.apply(pd.to_numeric, errors='coerce')

# Note: 'id' is the person's record number (1-400) in the dataset. It is dropped because it carries no predictive information.
df_clean = df_clean.drop(columns=['id'])
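The strip, replace, and numeric-coercion steps in the cell above can be checked on a small toy frame. This is a minimal sketch: the raw values, including the stray whitespace and the `?` placeholder, are hypothetical and not taken from the dataset, and it strips with `.str.strip()` per column rather than the element-wise `DataFrame.map` used above.

```python
import pandas as pd

# Hypothetical raw values with stray whitespace and a non-numeric placeholder.
raw = pd.DataFrame(
    {
        "dm": [" yes", "no", "\tno"],
        "classification": ["ckd\t", "notckd", "ckd"],
        "pcv": ["44", "\t?", "31"],
    },
    dtype=object,
)

mapping = {"yes": 1, "no": 0, "ckd": 1, "notckd": 0}

# 1) Strip whitespace so labels match the mapping keys exactly.
cleaned = raw.apply(lambda col: col.str.strip())
# 2) Encode the labels, then 3) coerce everything to numeric (unparseable -> NaN).
encoded = cleaned.replace(mapping).apply(pd.to_numeric, errors="coerce")

print(encoded)
```

Without the stripping step, a label such as `"ckd\t"` would not match the mapping key and would be coerced to `NaN` by `pd.to_numeric` instead of being encoded as 1.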

**To keep as many data points as possible, rows were dropped only when a value was missing in a column used for EDA and model training. The features of interest in this study are hemoglobin (hemo), serum creatinine (sc), blood pressure (bp), hypertension (htn), blood glucose random (bgr), diabetes mellitus (dm), and age (age); the target is the CKD label (classification).**
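The effect of restricting `dropna` to a subset of columns can be illustrated on a toy frame. This is only a sketch: the values and the two-feature list are hypothetical, chosen to show how a gap in an unused column (`rbc` here) no longer costs a row.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in modeled and unused columns.
toy = pd.DataFrame({
    "age": [48.0, 62.0, np.nan, 51.0],
    "bp": [80.0, 80.0, 70.0, np.nan],
    "rbc": [np.nan, "normal", "normal", np.nan],  # not a feature of interest
})

features = ["age", "bp"]

# Drop a row only if one of the modeled features is missing.
rows_subset = len(toy.dropna(subset=features))
# Drop a row if anything at all is missing.
rows_all = len(toy.dropna())

print(f"subset-only: {rows_subset} rows, all columns: {rows_all} rows")
```

Here the subset-only drop keeps 2 rows while dropping on all columns keeps 1, because the first row is missing only `rbc`.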

In [7]:
df_clean.head()

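|   | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 121.0 | ... | 44.0 | 7800.0 | 5.2 | 1 | 1 | 0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 423.0 | ... | 31.0 | 7500.0 | NaN | 0 | 1 | 0 | 1.0 | 0.0 | 1.0 | 1 |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 117.0 | ... | 32.0 | 6700.0 | 3.9 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 1 |
| 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 106.0 | ... | 35.0 | 7300.0 | 4.6 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 |
| 5 | 60.0 | 90.0 | 1.015 | 3.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 74.0 | ... | 39.0 | 7800.0 | 4.4 | 1 | 1 | 0 | 0.0 | 1.0 | 0.0 | 1 |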


In [8]:
df_clean['classification'].value_counts()

classification
1    161
0    132
Name: count, dtype: int64

**The cleaned data retains 293 records: 161 CKD (about 55%) and 132 non-CKD (about 45%). The classes are therefore roughly balanced, which helps when training and evaluating a classifier.**
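The class shares can be derived directly from the counts printed above. The snippet below is a small sketch that hard-codes those two counts so it runs on its own; in the notebook, `df_clean['classification'].value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above (1 = ckd, 0 = notckd).
counts = pd.Series({1: 161, 0: 132}, name="count")

# Share of each class among all retained rows.
shares = (counts / counts.sum()).round(3)

print(f"total rows: {counts.sum()}")
print(shares)
```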

In [9]:
df_clean.to_csv('clean.csv', index=False)