# Chronic Kidney Disease (CKD) Data Analysis

## Introduction
Chronic Kidney Disease (CKD) is a global health problem that affects millions of people every year. 
This project applies **data analytics techniques** to explore and analyze the CKD dataset, with the goal of 
identifying key risk factors and drawing insights that can support early detection and prevention.

## Objectives
- To clean and prepare the CKD dataset for analysis.  
- To perform exploratory data analysis (EDA) and identify important trends and patterns.  
- To visualize risk factors associated with CKD.  
- To interpret findings and provide data-driven conclusions.  

In [2]:
import pandas as pd
file = "Data.xlsx"
sheets =  pd.read_excel(file,sheet_name=None)
print(sheets.keys())

dict_keys(['Data Description', 'CKD Risk Data', 'Correlation', 'Regression Model'])


In [3]:
df = sheets["CKD Risk Data"]
df.head()

| | ID | Age | Residence | CKD Risk Score | Diabetes Mellitus | Blood Pressure | Hypertension | Blood Sugar | Blood Glucose | Reticulocyte Count | Packed Cell Volume | Haemoglobin | Red Blood Cell Count | White Blood Cell Count | Blood Urea | Serum Creatinine | Alcohol Intake | Physical Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 86 | Urban | 82.194168 | No | 123 | No | 2 | 119 | 1.071754 | 20 | 14.7 | 6.05 | 6817 | 3.282934 | 41.590812 | High | Active |
| 1 | 2 | 66 | Urban | 43.231758 | Yes | 103 | No | 1 | 238 | 1.19621 | 23 | 15.1 | 5.04 | 6438 | 1.819046 | 84.642535 | High | Active |
| 2 | 3 | 24 | Rural | 71.791997 | Yes | 151 | No | 0 | 372 | 2.531091 | 40 | 13.6 | 2.67 | 11071 | 2.495975 | 157.521097 | High | Active |
| 3 | 4 | 40 | Rural | 39.146159 | Yes | 165 | No | 2 | 448 | 1.980966 | 30 | 11.8 | 6.3 | 6594 | 1.378014 | 190.107385 | High | Active |
| 4 | 5 | 43 | Urban | 64.029626 | No | 113 | No | 1 | 301 | 2.948063 | 52 | 15.4 | 4.35 | 12708 | 2.264839 | 152.740807 | High | Active |


In [4]:
df.shape

(250, 18)

In [5]:
df.isnull().sum()

ID                         0
Age                        0
Residence                  0
CKD Risk Score             0
Diabetes Mellitus          0
Blood Pressure             0
Hypertension               0
Blood Sugar                0
Blood Glucose              0
Reticulocyte Count         0
Packed Cell Volume         0
Haemoglobin                0
Red Blood Cell Count       0
White Blood Cell Count     0
Blood Urea                 0
Serum Creatinine           0
Alcohol Intake            74
Physical Activity          0
dtype: int64

In [6]:
df["Alcohol Intake"].unique()

array(['High', 'Moderate', nan], dtype=object)

In [7]:
df["Alcohol Intake"] = df["Alcohol Intake"].fillna("None")
df.isnull().sum()

ID                        0
Age                       0
Residence                 0
CKD Risk Score            0
Diabetes Mellitus         0
Blood Pressure            0
Hypertension              0
Blood Sugar               0
Blood Glucose             0
Reticulocyte Count        0
Packed Cell Volume        0
Haemoglobin               0
Red Blood Cell Count      0
White Blood Cell Count    0
Blood Urea                0
Serum Creatinine          0
Alcohol Intake            0
Physical Activity         0
dtype: int64

In [8]:
df.describe()

| | ID | Age | CKD Risk Score | Blood Pressure | Blood Sugar | Blood Glucose | Reticulocyte Count | Packed Cell Volume | Haemoglobin | Red Blood Cell Count | White Blood Cell Count | Blood Urea | Serum Creatinine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 |
| mean | 125.5 | 53.26 | 57.983781 | 120.78 | 2.472 | 279.832 | 2.065026 | 36.708 | 11.0008 | 4.6186 | 9378.124 | 2.39125 | 117.983519 |
| std | 72.312977 | 20.759098 | 17.737879 | 35.155701 | 1.763421 | 122.550882 | 0.585775 | 10.149423 | 3.661961 | 1.076358 | 3227.389973 | 0.819981 | 40.637299 |
| min | 1.0 | 20.0 | 4.323176 | 60.0 | 0.0 | 70.0 | 0.874126 | 20.0 | 5.0 | 2.52 | 4018.0 | 0.770744 | 38.237198 |
| 25% | 63.25 | 35.25 | 46.676049 | 91.0 | 1.0 | 175.25 | 1.653822 | 27.25 | 8.1 | 3.8925 | 6564.0 | 1.718965 | 84.701509 |
| 50% | 125.5 | 52.5 | 58.138066 | 119.5 | 2.0 | 269.5 | 2.056248 | 37.0 | 10.9 | 4.62 | 9296.0 | 2.383499 | 117.69169 |
| 75% | 187.75 | 72.0 | 71.345585 | 153.0 | 4.0 | 380.5 | 2.485562 | 45.0 | 14.3 | 5.5075 | 12051.75 | 3.079916 | 151.857328 |
| max | 250.0 | 89.0 | 96.128144 | 179.0 | 5.0 | 499.0 | 3.409239 | 54.0 | 17.5 | 6.46 | 14996.0 | 4.172717 | 205.755871 |


In [9]:
df['CKD Risk Score'].describe()

count    250.000000
mean      57.983781
std       17.737879
min        4.323176
25%       46.676049
50%       58.138066
75%       71.345585
max       96.128144
Name: CKD Risk Score, dtype: float64

In [10]:
df["CKD Risk Score"].skew()

np.float64(-0.2431038939876664)

In [11]:
df.kurtosis(numeric_only=True)

ID                       -1.200000
Age                      -1.238321
CKD Risk Score           -0.264414
Blood Pressure           -1.234523
Blood Sugar              -1.341916
Blood Glucose            -1.244434
Reticulocyte Count       -0.678210
Packed Cell Volume       -1.186234
Haemoglobin              -1.192259
Red Blood Cell Count     -0.923208
White Blood Cell Count   -1.197195
Blood Urea               -0.948485
Serum Creatinine         -0.965084
dtype: float64

In [12]:
df.skew(numeric_only=True)

ID                        0.000000
Age                      -0.011807
CKD Risk Score           -0.243104
Blood Pressure           -0.000313
Blood Sugar               0.046731
Blood Glucose             0.105384
Reticulocyte Count        0.088754
Packed Cell Volume       -0.010880
Haemoglobin               0.019327
Red Blood Cell Count     -0.198384
White Blood Cell Count    0.052796
Blood Urea               -0.043078
Serum Creatinine         -0.038642
dtype: float64

## Step 1: Concept — What is Linear Regression?

Linear regression is a method we use to model the relationship between one dependent variable (Y) and one or more independent variables (X₁, X₂, X₃, …).

The goal is to find a line (or a hyperplane, when there are several predictors) that best fits the data:

$$
\text{CKD Risk Score} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon
$$

Where:

- β₀ = intercept (baseline CKD risk when all predictors = 0)
- β₁, β₂, …, βₙ = coefficients showing how much the predicted CKD risk changes for a one-unit increase in each variable, holding the others fixed
- ε = error term (unexplained variation)
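To make the equation concrete, here is a minimal sketch that fits the coefficients by ordinary least squares with NumPy. The data are synthetic: the predictors `x1` and `x2`, the chosen true coefficients (50, 3, -2), and the `demo_` names are assumptions made for illustration and are not taken from the CKD dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors (synthetic, not CKD columns)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Synthetic response: beta0 = 50, beta1 = 3, beta2 = -2, plus a small error term
demo_y = 50.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept
demo_X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimate of (beta0, beta1, beta2)
beta_hat, *_ = np.linalg.lstsq(demo_X, demo_y, rcond=None)

# Residuals play the role of the error term for the fitted model
residuals = demo_y - demo_X @ beta_hat

print(beta_hat)
```

The estimates should land close to the coefficients used to generate the data, and the residuals average to zero because the model includes an intercept.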

## Step 2: Our CKD Context

The dependent variable (Y) is **CKD Risk Score**.

Our independent variables (X) include:

Blood Pressure, Blood Sugar, Age, Residence, Diabetes Mellitus, Hypertension, Alcohol Intake, Physical Activity, etc.

So the model tries to answer:

“How do these health and lifestyle factors combine to predict a person’s CKD Risk Score?”

## Step 3: Multiple vs Simple Regression

Simple regression: uses one predictor, e.g. only Blood Pressure.

Multiple regression: uses many predictors, e.g. BP + Age + Diabetes + Alcohol Intake.

Our assignment says to do multiple regression, so we’ll use every column as a predictor except `ID` (a patient identifier with no clinical meaning) and `CKD Risk Score` (the dependent variable itself).
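The difference between the two can be sketched by comparing R² for a one-predictor fit and a two-predictor fit. Everything here is synthetic: the `bp`, `age`, and `score` arrays and the helper `r_squared` are hypothetical stand-ins, not values or functions from the notebook.

```python
import numpy as np

def r_squared(X, y):
    """Fit OLS by least squares and return R^2 (X must include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
n = 300

# Hypothetical predictors and response (synthetic values)
bp = rng.normal(120, 35, n)
age = rng.normal(53, 20, n)
score = 10 + 0.2 * bp + 0.4 * age + rng.normal(0, 5, n)

ones = np.ones(n)
r2_simple = r_squared(np.column_stack([ones, bp]), score)         # one predictor
r2_multiple = r_squared(np.column_stack([ones, bp, age]), score)  # two predictors

print(r2_simple, r2_multiple)
```

Adding a predictor never lowers R² on the data used for fitting, so a higher value alone does not show that the extra variable matters; adjusted R² or the coefficient's p-value is a better guide.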


In [15]:
# Step 1: Define dependent and independent variables
y = df["CKD Risk Score"]  # dependent variable

# Exclude CKD Risk Score and Patient ID from predictors
X = df.drop(["ID", "CKD Risk Score"], axis=1)

# Step 2: Define categorical variables
categorical_vars = ["Residence", "Diabetes Mellitus", "Hypertension", "Alcohol Intake", "Physical Activity"]

# Step 3: Encode categorical variables into dummy variables
X_encoded = pd.get_dummies(X, columns=categorical_vars, drop_first=True)

# Step 4: Check the new columns
X_encoded.head()


| | Age | Blood Pressure | Blood Sugar | Blood Glucose | Reticulocyte Count | Packed Cell Volume | Haemoglobin | Red Blood Cell Count | White Blood Cell Count | Blood Urea | Serum Creatinine | Residence_Urban | Diabetes Mellitus_Yes | Hypertension_Yes | Alcohol Intake_Moderate | Alcohol Intake_None | Physical Activity_Inactive | Physical Activity_Typical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 86 | 123 | 2 | 119 | 1.071754 | 20 | 14.7 | 6.05 | 6817 | 3.282934 | 41.590812 | True | False | False | False | False | False | False |
| 1 | 66 | 103 | 1 | 238 | 1.19621 | 23 | 15.1 | 5.04 | 6438 | 1.819046 | 84.642535 | True | True | False | False | False | False | False |
| 2 | 24 | 151 | 0 | 372 | 2.531091 | 40 | 13.6 | 2.67 | 11071 | 2.495975 | 157.521097 | False | True | False | False | False | False | False |
| 3 | 40 | 165 | 2 | 448 | 1.980966 | 30 | 11.8 | 6.3 | 6594 | 1.378014 | 190.107385 | False | True | False | False | False | False | False |
| 4 | 43 | 113 | 1 | 301 | 2.948063 | 52 | 15.4 | 4.35 | 12708 | 2.264839 | 152.740807 | True | False | False | False | False | False | False |
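The effect of `drop_first=True` is easiest to see on a tiny frame. The sketch below uses a hypothetical four-row `toy` DataFrame with the same three `Alcohol Intake` levels. With all three dummy columns kept, each row sums to 1, which duplicates the intercept column; dropping the first level (`High`, alphabetically) removes that redundancy and makes `High` the reference category.

```python
import pandas as pd

# Hypothetical toy data with the three Alcohol Intake levels
toy = pd.DataFrame({"Alcohol Intake": ["High", "Moderate", "None", "High"]})

# One column per level: the columns always sum to 1 across a row
full = pd.get_dummies(toy, columns=["Alcohol Intake"])

# First level (alphabetically "High") dropped: it becomes the baseline
reduced = pd.get_dummies(toy, columns=["Alcohol Intake"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Each remaining coefficient is then read relative to the baseline; for example, the coefficient on `Alcohol Intake_None` estimates the difference in CKD Risk Score between the `None` and `High` groups, other predictors held fixed.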


In [20]:
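# Attempt to coerce every column to a numeric dtype.
# X_with_const is created by sm.add_constant(X_encoded) in the statsmodels cell further down, so that cell must run first.
# pd.to_numeric leaves bool columns as bool, so the dummy columns are not converted here (see the dtypes output below).
X_with_const = X_with_const.apply(pd.to_numeric, errors='coerce')
y = pd.to_numeric(y, errors='coerce')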


In [22]:
X_with_const.dtypes


const                         float64
Age                             int64
Blood Pressure                  int64
Blood Sugar                     int64
Blood Glucose                   int64
Reticulocyte Count            float64
Packed Cell Volume              int64
Haemoglobin                   float64
Red Blood Cell Count          float64
White Blood Cell Count          int64
Blood Urea                    float64
Serum Creatinine              float64
Residence_Urban                  bool
Diabetes Mellitus_Yes            bool
Hypertension_Yes                 bool
Alcohol Intake_Moderate          bool
Alcohol Intake_None              bool
Physical Activity_Inactive       bool
Physical Activity_Typical        bool
dtype: object

In [23]:
import statsmodels.api as sm

# Add constant (intercept)
X_with_const = sm.add_constant(X_encoded)

# Fit the model
model = sm.OLS(y, X_with_const).fit()

# View the summary
print(model.summary())


ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
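statsmodels raises this `ValueError` when the design matrix converts to a NumPy array of dtype `object`. A likely cause here, judging by the dtypes listed earlier, is that the dummy columns from `pd.get_dummies` are `bool` while the other columns are `int64` or `float64`. pandas has no common numeric dtype for that mix and falls back to `object`. The sketch below reproduces the dtype problem on a hypothetical six-row `toy` frame (made-up values) and shows one possible fix, casting everything to `float` before fitting. The statsmodels step is wrapped in a `try` so the dtype check still runs where statsmodels is not installed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_encoded: numeric columns plus one bool dummy
toy = pd.DataFrame({
    "Age": [86, 66, 24, 40, 43, 55],
    "Haemoglobin": [14.7, 15.1, 13.6, 11.8, 15.4, 12.2],
    "Residence_Urban": [True, True, False, False, True, False],
})
toy_y = pd.Series([82.2, 43.2, 71.8, 39.1, 64.0, 50.5], name="CKD Risk Score")

# Mixed int/float/bool columns convert to an object array
mixed_dtype = np.asarray(toy).dtype

# Casting to float gives a plain numeric array (True -> 1.0, False -> 0.0)
toy_float = toy.astype(float)
fixed_dtype = np.asarray(toy_float).dtype

print(mixed_dtype, fixed_dtype)

try:
    import statsmodels.api as sm
    toy_model = sm.OLS(toy_y, sm.add_constant(toy_float)).fit()
    n_params = len(toy_model.params)  # intercept + 3 predictors
except ImportError:
    n_params = None  # statsmodels not available in this environment
```

In the notebook, the analogous change would be `sm.add_constant(X_encoded.astype(float))`, or passing `dtype=float` to `pd.get_dummies` so the dummy columns are numeric from the start. Either should let `sm.OLS(y, X_with_const).fit()` run, though this has not been verified against the actual `Data.xlsx`.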