---
title: STATS 3DA3
subtitle: Homework Assignment 6
author: "Pratheepa Jeganathan"
date: 04/04/2024
format: pdf
header-includes:
   - \usepackage{amsmath}
   - \usepackage{bbm}
   - \usepackage{array}
   - \usepackage{multirow}
   - \usepackage{graphicx}
   - \usepackage{float}
   - \usepackage{apacite}
   - \usepackage{natbib}
execute: 
  echo: true
fontsize: 11pt
geometry: margin = 1in
linestretch: 1.5
bibliography: ass6.bib
---

## Chronic Kidney Disease Classification Challenge

### Overview

Engage with the dataset from the [Early Stage of Indians Chronic Kidney Disease (CKD)](https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease) project, which comprises data on 250 early-stage CKD patients and 150 healthy controls.

For foundational knowledge on the subject, refer to "Predict, diagnose, and treat chronic kidney disease with machine learning: a systematic literature review" by [Sanmarchi et al., (2023)](https://link.springer.com/article/10.1007/s40620-023-01573-4).

### Objectives

Analyze the dataset using two classification algorithms, focusing on exploratory data analysis, feature selection, engineering, and especially on handling missing values and outliers. Summarize your findings with insightful conclusions.

**Classifier Requirement:** Ensure at least one of the classifiers is interpretable, to facilitate in-depth analysis and inference.

### Guidelines

- **Teamwork:** Group submissions should compile the workflow (Python codes and interpretations) into a single PDF, including a GitHub repository link. The contributions listed should reflect the GitHub activity.
- **Content:** Address the following questions in your submission, offering detailed insights and conclusions from your analysis.

### Assignment Questions

1. **Classification Problem Identification:** Define and describe a classification problem based on the dataset.
2. **Variable Transformation:** Implement any transformations chosen or justify the absence of such modifications.
3. **Dataset Overview:** Provide a detailed description of the dataset, covering variables, summaries, observation counts, data types, and distributions (at least three statements).
4. **Association Between Variables:** Analyze variable relationships and their implications for feature selection or extraction (at least three statements).
5. **Missing Value Analysis and Handling:** Implement your strategy for identifying and addressing missing values in the dataset, or provide reasons for not addressing them.
6. **Outlier Analysis:** Implement your approach for identifying and managing outliers, or provide reasons for not addressing them.
7. **Sub-group Analysis:** Explore potential sub-groups within the data, employing appropriate data science methods to find the sub-groups of patients and visualize the sub-groups. The sub-group analysis must not include the labels (for CKD patients and healthy controls).
8. **Data Splitting:** Segregate 30% of the data for testing, using a random seed of 1. Use the remaining 70% for training and model selection.
9. **Classifier Choices:** Identify the two classifiers you have chosen and justify your selections.
10. **Performance Metrics:** Outline the two metrics for comparing the performance of the classifiers.
11. **Feature Selection/Extraction:** Implement methods to enhance the performance of at least one classifier in (9). The answer for this question can be included in (12).
12. **Classifier Comparison:** Utilize the selected metrics to compare the classifiers based on the test set. Discuss your findings (at least two statements).
13. **Interpretable Classifier Insight:** After re-training the interpretable classifier with all available data, analyze and interpret the significance of predictor variables in the context of the data and the challenge (at least two statements).
14. **[Bonus]** Sub-group Improvement Strategy: If sub-groups were identified, propose and implement a method to further improve the performance of one classifier. Compare the performance of the new classifier with the results in (12).
15. **Team Contributions:** Document each team member's specific contributions related to the questions above.
16. **Link** to the public GitHub repository.

### Notes

- This assignment encourages you to apply sophisticated machine learning methods to a vital healthcare challenge, promoting the development of critical analytical skills, teamwork, and practical problem-solving abilities in the context of chronic kidney disease diagnosis and treatment.
- Students can choose one classifier not covered in the lectures.

In [522]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from patsy import dmatrices, dmatrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

In [523]:
pip install ucimlrepo

Note: you may need to restart the kernel to use updated packages.


In [524]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
chronic_kidney_disease = fetch_ucirepo(id=336) 
  
# data (as pandas dataframes) 
X = chronic_kidney_disease.data.features 
y = chronic_kidney_disease.data.targets 
  
# metadata 
print(chronic_kidney_disease.metadata) 
  
# variable information 
print(chronic_kidney_disease.variables) 


{'uci_id': 336, 'name': 'Chronic Kidney Disease', 'repository_url': 'https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease', 'data_url': 'https://archive.ics.uci.edu/static/public/336/data.csv', 'abstract': 'This dataset can be used to predict the chronic kidney disease and it can be collected from the hospital nearly 2 months of period.', 'area': 'Other', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 400, 'num_features': 24, 'feature_types': ['Real'], 'demographics': ['Age'], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2015, 'last_updated': 'Mon Mar 04 2024', 'dataset_doi': '10.24432/C5G020', 'creators': ['L. Rubini', 'P. Soundarapandian', 'P. Eswaran'], 'intro_paper': None, 'additional_info': {'summary': 'We use the following representation to collect the dataset\r\n                        age\t\t-\tage\t\r\n\t\t\tbp\t\t-\tblood pressure\r\n\t\t\tsg\t

In [525]:
X

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,121.0,...,15.4,44.0,7800.0,5.2,yes,yes,no,good,no,no
1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,,...,11.3,38.0,6000.0,,no,no,no,good,no,no
2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,9.6,31.0,7500.0,,no,yes,no,poor,no,yes
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,11.2,32.0,6700.0,3.9,yes,no,no,poor,yes,yes
4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,11.6,35.0,7300.0,4.6,no,no,no,good,no,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,140.0,...,15.7,47.0,6700.0,4.9,no,no,no,good,no,no
396,42.0,70.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,75.0,...,16.5,54.0,7800.0,6.2,no,no,no,good,no,no
397,12.0,80.0,1.020,0.0,0.0,normal,normal,notpresent,notpresent,100.0,...,15.8,49.0,6600.0,5.4,no,no,no,good,no,no
398,17.0,60.0,1.025,0.0,0.0,normal,normal,notpresent,notpresent,114.0,...,14.2,51.0,7200.0,5.9,no,no,no,good,no,no


In [526]:
y

Unnamed: 0,class
0,ckd
1,ckd
2,ckd
3,ckd
4,ckd
...,...
395,notckd
396,notckd
397,notckd
398,notckd


In [527]:
X.dtypes

age      float64
bp       float64
sg       float64
al       float64
su       float64
rbc       object
pc        object
pcc       object
ba        object
bgr      float64
bu       float64
sc       float64
sod      float64
pot      float64
hemo     float64
pcv      float64
wbcc     float64
rbcc     float64
htn       object
dm        object
cad       object
appet     object
pe        object
ane       object
dtype: object

In [528]:
y.dtypes

class    object
dtype: object

In [529]:
X['rbc'].value_counts()

rbc
normal      201
abnormal     47
Name: count, dtype: int64

In [530]:
X['pc'].value_counts()

pc
normal      259
abnormal     76
Name: count, dtype: int64

In [531]:
X['pcc'].value_counts()

pcc
notpresent    354
present        42
Name: count, dtype: int64

In [532]:
X['ba'].value_counts()

ba
notpresent    374
present        22
Name: count, dtype: int64

In [533]:
X['htn'].value_counts()

htn
no     251
yes    147
Name: count, dtype: int64

In [534]:
X['dm'].value_counts()

dm
no      260
yes     137
\tno      1
Name: count, dtype: int64
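
The `\tno` entry in the counts above suggests that at least one label carries a stray tab character, which makes pandas treat it as a separate category. A minimal sketch of one way to normalize such labels is shown below. The toy frame and the helper name `strip_labels` are hypothetical and only mimic the pattern seen in `dm`; the same idea could be applied to `X` before any encoding.

```python
import pandas as pd

# Hypothetical toy column mimicking the stray-tab label seen in `dm`.
toy = pd.DataFrame({"dm": ["no", "yes", "\tno", None, "yes"]})


def strip_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with surrounding whitespace removed from string values."""
    out = df.copy()
    for col in out.columns:
        # Only touch non-numeric columns; missing values are passed through unchanged.
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


clean = strip_labels(toy)
print(clean["dm"].value_counts())
```

After stripping, the tab-prefixed entry is counted together with the ordinary `no` labels, and the missing value stays missing.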

In [535]:
X['cad'].value_counts()

cad
no     364
yes     34
Name: count, dtype: int64

In [536]:
X['appet'].value_counts()

appet
good    317
poor     82
Name: count, dtype: int64

In [537]:
X['pe'].value_counts()

pe
no     323
yes     76
Name: count, dtype: int64

In [538]:
X['ane'].value_counts()

ane
no     339
yes     60
Name: count, dtype: int64

In [539]:
X.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64

In [540]:
y.isnull().sum()

class    0
dtype: int64

In [564]:
X_drop= X.dropna()
X_drop

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
3,48.0,70.0,1.005,4.0,0.0,1.0,0.0,1.0,0.0,117.0,...,11.2,32.0,6700.0,3.9,1.0,0,0.0,0.0,1.0,1.0
9,53.0,90.0,1.020,2.0,0.0,0.0,0.0,1.0,0.0,70.0,...,9.5,29.0,12100.0,3.7,1.0,1,0.0,0.0,0.0,1.0
11,63.0,70.0,1.010,3.0,0.0,0.0,0.0,1.0,0.0,380.0,...,10.8,32.0,4500.0,3.8,1.0,1,0.0,0.0,1.0,0.0
14,68.0,80.0,1.010,3.0,2.0,1.0,0.0,1.0,1.0,157.0,...,5.6,16.0,11000.0,2.6,1.0,1,1.0,0.0,1.0,0.0
20,61.0,80.0,1.015,2.0,0.0,0.0,0.0,0.0,0.0,173.0,...,7.7,24.0,9200.0,3.2,1.0,1,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,140.0,...,15.7,47.0,6700.0,4.9,0.0,0,0.0,1.0,0.0,0.0
396,42.0,70.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,75.0,...,16.5,54.0,7800.0,6.2,0.0,0,0.0,1.0,0.0,0.0
397,12.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,100.0,...,15.8,49.0,6600.0,5.4,0.0,0,0.0,1.0,0.0,0.0
398,17.0,60.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,114.0,...,14.2,51.0,7200.0,5.9,0.0,0,0.0,1.0,0.0,0.0


In [565]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning.
X_drop = X_drop.copy()
# Note: 'bu' (blood urea) is a continuous measurement; treating it as categorical may be unintended.
columns_to_convert = ['sg', 'al','su','rbc','pc','pcc','ba','bu','htn','dm','cad','appet','pe','ane']
for col in columns_to_convert:
    X_drop[col] = pd.Categorical(X_drop[col])


In [543]:
X_drop.dtypes

age       float64
bp        float64
sg       category
al       category
su       category
rbc      category
pc       category
pcc      category
ba       category
bgr       float64
bu       category
sc        float64
sod       float64
pot       float64
hemo      float64
pcv       float64
wbcc      float64
rbcc      float64
htn      category
dm       category
cad      category
appet    category
pe       category
ane      category
dtype: object

# Algorithm1:

1. **Classification Problem Identification:** Define and describe a classification problem based on the dataset.

**Question:** Should we convert the variables to categorical, as we did in Assignment 4?

There are 400 observations and 25 variables in the Chronic Kidney Disease dataset.
There are 14 float64 variables: "age", "bp", "sg", "al", "su", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", and "rbcc".
There are 11 object variables: "rbc", "pc", "pcc", "ba", "htn", "dm", "cad", "appet", "pe", "ane", and "class".
Of these 25 variables, the 24 in X are covariates (14 float64 and 10 object), and the single variable in y, "class", is the response. The response is an object variable with only two categories, "ckd" and "notckd", so the response is binary.

The number of missing values per variable is: 9 in "age", 12 in "bp", 47 in "sg", 46 in "al", 49 in "su", 152 in "rbc", 65 in "pc", 4 in "pcc", 4 in "ba", 44 in "bgr", 19 in "bu", 17 in "sc", 87 in "sod", 88 in "pot", 52 in "hemo", 71 in "pcv", 106 in "wbcc", 131 in "rbcc", 2 in "htn", 2 in "dm", 2 in "cad", 1 in "appet", 1 in "pe", and 1 in "ane".

There are no missing values in "class".

2. **Variable Transformation:** Implement any transformations chosen or justify the absence of such modifications.

In [544]:
cat = ['sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']
X_drop = pd.get_dummies(X_drop,columns=cat)

In [545]:
caty = ['class']
y = pd.get_dummies(y,columns=caty)

In [566]:
X_drop

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
3,48.0,70.0,1.005,4.0,0.0,1.0,0.0,1.0,0.0,117.0,...,11.2,32.0,6700.0,3.9,1.0,0,0.0,0.0,1.0,1.0
9,53.0,90.0,1.020,2.0,0.0,0.0,0.0,1.0,0.0,70.0,...,9.5,29.0,12100.0,3.7,1.0,1,0.0,0.0,0.0,1.0
11,63.0,70.0,1.010,3.0,0.0,0.0,0.0,1.0,0.0,380.0,...,10.8,32.0,4500.0,3.8,1.0,1,0.0,0.0,1.0,0.0
14,68.0,80.0,1.010,3.0,2.0,1.0,0.0,1.0,1.0,157.0,...,5.6,16.0,11000.0,2.6,1.0,1,1.0,0.0,1.0,0.0
20,61.0,80.0,1.015,2.0,0.0,0.0,0.0,0.0,0.0,173.0,...,7.7,24.0,9200.0,3.2,1.0,1,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,140.0,...,15.7,47.0,6700.0,4.9,0.0,0,0.0,1.0,0.0,0.0
396,42.0,70.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,75.0,...,16.5,54.0,7800.0,6.2,0.0,0,0.0,1.0,0.0,0.0
397,12.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,100.0,...,15.8,49.0,6600.0,5.4,0.0,0,0.0,1.0,0.0,0.0
398,17.0,60.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,114.0,...,14.2,51.0,7200.0,5.9,0.0,0,0.0,1.0,0.0,0.0


In [567]:
y.head()

Unnamed: 0,class_ckd,class_ckd\t,class_notckd
0,True,False,False
1,True,False,False
2,True,False,False
3,True,False,False
4,True,False,False
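
The `class_ckd\t` column above appears because at least one label carries a trailing tab, so `get_dummies` treats it as a third category. For a binary response, a single 0/1 vector is usually easier to pass to a classifier than several indicator columns. The sketch below shows one way to build it; the toy labels are hypothetical, and coding `ckd` as 1 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy target mimicking the labels above, including the stray "ckd\t".
toy_y = pd.DataFrame({"class": ["ckd", "ckd\t", "notckd", "ckd", "notckd"]})

# Strip surrounding whitespace first so "ckd\t" and "ckd" become the same label.
labels = toy_y["class"].str.strip()

# Map to one 0/1 column (1 = ckd, 0 = notckd; the coding is an assumption).
y_bin = labels.map({"ckd": 1, "notckd": 0}).astype(int)
print(y_bin.tolist())
```

With this coding, the mean of `y_bin` is the proportion of CKD cases, which is a quick check on class balance.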


In [568]:
float_col = X_drop.select_dtypes(include=['float64']).columns
object_col = X_drop.select_dtypes(include=['object']).columns

In [569]:
binary = {'rbc':{'normal':1,'abnormal':0},
          'pc':{'normal':1,'abnormal':0},
          'pcc':{'present':1,'notpresent':0},
          'ba':{'present':1,'notpresent':0},
          'htn':{'yes':1,'no':0},
          'dm':{'yes':1,'no':0},
          'cad':{'yes':1,'no':0},
          'appet':{'good':1,'poor':0},
          'pe':{'yes':1,'no':0},
          'ane':{'yes':1,'no':0}
          
          }

In [570]:
# Map each two-level column to 0/1 on an explicit copy (avoids SettingWithCopyWarning).
X_drop = X_drop.copy()
for col, mapping in binary.items():
    X_drop[col] = X_drop[col].replace(mapping)


In [571]:
X_drop

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
3,48.0,70.0,1.005,4.0,0.0,1.0,0.0,1.0,0.0,117.0,...,11.2,32.0,6700.0,3.9,1.0,0,0.0,0.0,1.0,1.0
9,53.0,90.0,1.020,2.0,0.0,0.0,0.0,1.0,0.0,70.0,...,9.5,29.0,12100.0,3.7,1.0,1,0.0,0.0,0.0,1.0
11,63.0,70.0,1.010,3.0,0.0,0.0,0.0,1.0,0.0,380.0,...,10.8,32.0,4500.0,3.8,1.0,1,0.0,0.0,1.0,0.0
14,68.0,80.0,1.010,3.0,2.0,1.0,0.0,1.0,1.0,157.0,...,5.6,16.0,11000.0,2.6,1.0,1,1.0,0.0,1.0,0.0
20,61.0,80.0,1.015,2.0,0.0,0.0,0.0,0.0,0.0,173.0,...,7.7,24.0,9200.0,3.2,1.0,1,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,140.0,...,15.7,47.0,6700.0,4.9,0.0,0,0.0,1.0,0.0,0.0
396,42.0,70.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,75.0,...,16.5,54.0,7800.0,6.2,0.0,0,0.0,1.0,0.0,0.0
397,12.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,100.0,...,15.8,49.0,6600.0,5.4,0.0,0,0.0,1.0,0.0,0.0
398,17.0,60.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,114.0,...,14.2,51.0,7200.0,5.9,0.0,0,0.0,1.0,0.0,0.0


In [573]:
# Standardize the float columns on an explicit copy (avoids SettingWithCopyWarning).
X_drop = X_drop.copy()
scale = StandardScaler()
X_drop[float_col] = scale.fit_transform(X_drop[float_col])
X_drop.head()


Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
3,-0.101098,-0.363613,1.005,4.0,0.0,1.0,0.0,1.0,0.0,-0.221549,...,-0.865744,-1.092705,-0.569768,-0.976025,1.0,0,0.0,0.0,1.0,1.0
9,0.222253,1.431726,1.02,2.0,0.0,0.0,0.0,1.0,0.0,-0.947597,...,-1.457446,-1.423236,1.162684,-1.17285,1.0,1,0.0,0.0,0.0,1.0
11,0.868954,-0.363613,1.01,3.0,0.0,0.0,0.0,1.0,0.0,3.841231,...,-1.004968,-1.092705,-1.275582,-1.074438,1.0,1,0.0,0.0,1.0,0.0
14,1.192305,0.534056,1.01,3.0,2.0,1.0,0.0,1.0,1.0,0.396364,...,-2.814879,-2.855537,0.809777,-2.255385,1.0,1,1.0,0.0,1.0,0.0
20,0.739614,0.534056,1.015,2.0,0.0,0.0,0.0,0.0,0.0,0.643529,...,-2.083954,-1.974121,0.232293,-1.664911,1.0,1,1.0,0.0,1.0,1.0


3. **Dataset Overview:** Provide a detailed description of the dataset, covering variables, summaries, observation counts, data types, and distributions (at least three statements).

In [557]:
X_drop.describe()

Unnamed: 0,age,bp,bgr,sc,sod,pot,hemo,pcv,wbcc,rbcc
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,8.432074000000001e-17,5.846238e-16,-4.4971060000000007e-17,0.0,9.893633e-16,5.621382000000001e-17,2.698264e-16,-4.4971060000000007e-17,-4.4971060000000007e-17,1.349132e-16
std,1.00318,1.00318,1.00318,1.00318,1.00318,1.00318,1.00318,1.00318,1.00318,1.00318
min,-2.817246,-2.158952,-0.9475974,-0.583015,-3.730148,-0.6165957,-3.685029,-3.626776,-1.500159,-2.747446
25%,-0.6669624,-1.261282,-0.5305059,-0.485227,-0.5154386,-0.2703085,-0.3784601,-0.4867313,-0.6259123,-0.3855519
50%,0.06057713,0.5340564,-0.244721,-0.354843,0.02034626,-0.03945044,0.1958388,0.2294192,-0.2168611,0.05730335
75%,0.6749439,0.5340564,0.006306235,-0.191863,0.6900774,0.07597862,0.7266301,0.6701272,0.4167672,0.6969831
max,2.162358,3.227064,5.540492,4.241194,1.493755,12.22489,1.431451,1.331189,5.750474,3.058878


In [558]:
X_drop.dtypes

age                float64
bp                 float64
bgr                float64
bu                category
sc                 float64
sod                float64
pot                float64
hemo               float64
pcv                float64
wbcc               float64
rbcc               float64
sg_1.005              bool
sg_1.01               bool
sg_1.015              bool
sg_1.02               bool
sg_1.025              bool
al_0.0                bool
al_1.0                bool
al_2.0                bool
al_3.0                bool
al_4.0                bool
su_0.0                bool
su_1.0                bool
su_2.0                bool
su_3.0                bool
su_4.0                bool
su_5.0                bool
rbc_abnormal          bool
rbc_normal            bool
pc_abnormal           bool
pc_normal             bool
pcc_notpresent        bool
pcc_present           bool
ba_notpresent         bool
ba_present            bool
htn_no                bool
htn_yes               bool
d

4. **Association Between Variables:** Analyze variable relationships and their implications for feature selection or extraction (at least three statements).
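
One common starting point for this question is a correlation matrix of the continuous variables, followed by a list of strongly correlated pairs that may be redundant for feature selection. The sketch below illustrates the idea on synthetic data. The column names are borrowed from the dataset, but the values are simulated (with `pcv` built to track `hemo`), and the 0.8 threshold is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; these are NOT the real measurements.
rng = np.random.default_rng(0)
n = 200
hemo = rng.normal(12.5, 2.5, n)
toy = pd.DataFrame({
    "hemo": hemo,
    "pcv": 3.0 * hemo + rng.normal(0, 1.0, n),  # constructed to follow hemo closely
    "age": rng.normal(50, 15, n),                # generated independently
})

corr = toy.corr()  # Pearson correlation by default

# Collect each unordered pair whose absolute correlation exceeds the threshold.
threshold = 0.8
cols = corr.columns
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

On real data, `sns.heatmap(corr, annot=True)` gives a visual version of the same matrix; pairs flagged this way are candidates for dropping one member or combining them.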

5. **Missing Value Analysis and Handling:** Implement your strategy for identifying and addressing missing values in the dataset, or provide reasons for not addressing them.

In [None]:
X.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64

In [None]:
y.isnull().sum()

class_ckd       0
class_ckd\t     0
class_notckd    0
dtype: int64

In [None]:
X_drop= X.dropna()
X_drop

Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,hemo,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane
3,-0.203139,-0.473370,-2.173584,2.208413,-0.410106,1.0,0.0,1.0,0.0,-0.392022,...,-0.456071,-0.766953,-0.580420,-0.788961,1.0,0,0.0,0.0,1.0,1.0
9,0.088445,0.990117,0.454071,0.727772,-0.410106,0.0,0.0,1.0,0.0,-0.985679,...,-1.040585,-1.101161,1.256651,-0.984385,1.0,1,0.0,0.0,0.0,1.0
11,0.671612,-0.473370,-1.297699,1.468092,-0.410106,0.0,0.0,1.0,0.0,2.929931,...,-0.593604,-0.766953,-1.328856,-0.886673,1.0,1,0.0,0.0,1.0,0.0
14,0.963195,0.258373,-1.297699,1.468092,1.412011,1.0,0.0,1.0,1.0,0.113218,...,-2.381529,-2.549398,0.882433,-2.059217,1.0,1,1.0,0.0,1.0,0.0
20,0.554978,0.258373,-0.421814,0.727772,-0.410106,0.0,0.0,0.0,0.0,0.315314,...,-1.659482,-1.658175,0.270076,-1.472945,1.0,1,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,0.205078,0.258373,0.454071,-0.752868,-0.410106,1.0,1.0,0.0,0.0,-0.101509,...,1.091172,0.904090,-0.580420,0.188159,0.0,0,0.0,1.0,0.0,0.0
396,-0.553039,-0.473370,1.329955,-0.752868,-0.410106,1.0,1.0,0.0,0.0,-0.922524,...,1.366237,1.683910,-0.206202,1.458415,0.0,0,0.0,1.0,0.0,0.0
397,-2.302541,0.258373,0.454071,-0.752868,-0.410106,1.0,1.0,0.0,0.0,-0.606749,...,1.125555,1.126896,-0.614440,0.676719,0.0,0,0.0,1.0,0.0,0.0
398,-2.010957,-1.205114,1.329955,-0.752868,-0.410106,1.0,1.0,0.0,0.0,-0.429915,...,0.575424,1.349701,-0.410321,1.165279,0.0,0,0.0,1.0,0.0,0.0
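
Dropping every row with a missing value leaves 158 complete cases out of 400 (the count shown by `describe()` earlier), so a large share of the data is discarded. An alternative worth comparing is simple imputation: the median for continuous variables and the most frequent level for categorical ones. The sketch below uses a small hypothetical frame; in a real workflow the fill values would ideally be computed on the training split only, to avoid leaking information from the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of gaps as the dataset.
toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "bp": [80.0, 50.0, np.nan, 80.0],
    "htn": ["yes", "no", np.nan, "no"],
})

num_cols = ["age", "bp"]
cat_cols = ["htn"]

imputed = toy.copy()
for col in num_cols:
    # The median is less sensitive to extreme values than the mean.
    imputed[col] = imputed[col].fillna(imputed[col].median())
for col in cat_cols:
    # mode() ignores missing values; take the first mode if there are ties.
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

Whether imputation or complete-case analysis is preferable depends on why the values are missing, so the choice should be justified in the answer.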


6. **Outlier Analysis:** Implement your approach for identifying and managing outliers, or provide reasons for not addressing them.
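
A simple way to screen for outliers is the interquartile-range (IQR) rule, which flags values more than 1.5 IQRs beyond the quartiles. The summary table earlier shows a standardized maximum of about 12.2 for `pot`, which is the kind of value this rule would flag. The sketch below is illustrative only: the helper `iqr_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k IQRs outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Hypothetical potassium-like values with one extreme entry.
toy_pot = pd.Series([4.0, 4.2, 3.9, 4.5, 4.1, 47.0], name="pot")
mask = iqr_outlier_mask(toy_pot)
print(toy_pot[mask])
```

Flagged values are not necessarily errors; in a clinical dataset an extreme measurement may be a genuine sign of disease, so removal should be argued rather than automatic.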

7. **Sub-group Analysis:** Explore potential sub-groups within the data, employing appropriate data science methods to find the sub-groups of patients and visualize the sub-groups. The sub-group analysis must not include the labels (for CKD patients and healthy controls).
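
One possible approach here is to standardize the features, project them onto two principal components, and run k-means on the result, all without using the class labels. The sketch below demonstrates the pipeline on synthetic data with two well-separated groups; the data, the choice of two clusters, and the seed are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features: two groups of "patients" with shifted means (no labels used).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(60, 5))
group_b = rng.normal(loc=6.0, scale=1.0, size=(40, 5))
features = np.vstack([group_a, group_b])

# Standardize, reduce to two components for plotting, then cluster.
scaled = StandardScaler().fit_transform(features)
scores = PCA(n_components=2).fit_transform(scaled)
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(scores)

cluster_sizes = np.bincount(km.labels_)
print(cluster_sizes)
```

The sub-groups can then be visualized with `plt.scatter(scores[:, 0], scores[:, 1], c=km.labels_)`. On the real data the number of clusters would need to be chosen, for example with an elbow plot or silhouette scores.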

8. **Data Splitting:** Segregate 30% of the data for testing, using a random seed of 1. Use the remaining 70% for training and model selection.
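
The split described above maps directly onto `train_test_split` with `test_size=0.3` and `random_state=1`. The sketch below uses hypothetical toy arrays in place of the cleaned features and response. The `stratify` argument is an addition not required by the question; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, 60/40 class balance.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 60 + [0] * 40)

# 30% held out for testing, random seed 1; stratification is optional.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)
print(X_train.shape, X_test.shape)
```

Any scaling or imputation parameters would then be estimated on `X_train` and applied unchanged to `X_test`.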


9. **Classifier Choices:** Identify the two classifiers you have chosen and justify your selections.

10. **Performance Metrics:** Outline the two metrics for comparing the performance of the classifiers.
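
As an illustration of how two metrics can be computed side by side, the sketch below uses accuracy and the F1 score on hypothetical predictions. This pair is only one possible choice; recall may matter more if missing a CKD case is considered the costlier error.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true labels and predictions (1 = ckd, 0 = notckd).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)  # share of correct predictions
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(acc, f1, (tn, fp, fn, tp))
```

Here one CKD case is missed and one healthy control is misclassified, so precision and recall are both 3/4 and the accuracy is 8/10.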

11. **Feature Selection/Extraction:** Implement methods to enhance the performance of at least one classifier in (9). The answer for this question can be included in (12).

12. **Classifier Comparison:** Utilize the selected metrics to compare the classifiers based on the test set. Discuss your findings (at least two statements).

13. **Interpretable Classifier Insight:** After re-training the interpretable classifier with all available data, analyze and interpret the significance of predictor variables in the context of the data and the challenge (at least two statements).
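
If logistic regression is the interpretable classifier, its coefficients on standardized predictors can be read as changes in log-odds per one standard deviation, and exponentiating them gives odds ratios. The sketch below shows this on simulated data. The predictor names `hemo_z` and `noise_z` are hypothetical, and the response is generated so that only `hemo_z` matters.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated standardized predictors; only hemo_z drives the outcome.
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "hemo_z": rng.normal(size=n),
    "noise_z": rng.normal(size=n),
})
log_odds = -2.0 * toy["hemo_z"]  # lower hemo_z -> higher chance of class 1
y_toy = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = LogisticRegression().fit(toy, y_toy)

# exp(coefficient) = multiplicative change in odds per one-unit increase.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=toy.columns)
print(odds_ratios)
```

An odds ratio well below 1 for `hemo_z` means higher values are associated with lower odds of class 1, while a ratio near 1 for `noise_z` indicates little association. Note that scikit-learn applies L2 regularization by default, so the coefficients are shrunk slightly toward zero.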

14. **[Bonus]** Sub-group Improvement Strategy: If sub-groups were identified, propose and implement a method to further improve the performance of one classifier. Compare the performance of the new classifier with the results in (12).

15. **Team Contributions:** Document each team member's specific contributions related to the questions above.

16. **Link** to the public GitHub repository.

# Algorithm2:

1. **Classification Problem Identification:** Define and describe a classification problem based on the dataset.

2. **Variable Transformation:** Implement any transformations chosen or justify the absence of such modifications.

3. **Dataset Overview:** Provide a detailed description of the dataset, covering variables, summaries, observation counts, data types, and distributions (at least three statements).

4. **Association Between Variables:** Analyze variable relationships and their implications for feature selection or extraction (at least three statements).

5. **Missing Value Analysis and Handling:** Implement your strategy for identifying and addressing missing values in the dataset, or provide reasons for not addressing them.

6. **Outlier Analysis:** Implement your approach for identifying and managing outliers, or provide reasons for not addressing them.

7. **Sub-group Analysis:** Explore potential sub-groups within the data, employing appropriate data science methods to find the sub-groups of patients and visualize the sub-groups. The sub-group analysis must not include the labels (for CKD patients and healthy controls).

8. **Data Splitting:** Segregate 30% of the data for testing, using a random seed of 1. Use the remaining 70% for training and model selection.

9. **Classifier Choices:** Identify the two classifiers you have chosen and justify your selections.

10. **Performance Metrics:** Outline the two metrics for comparing the performance of the classifiers.

11. **Feature Selection/Extraction:** Implement methods to enhance the performance of at least one classifier in (9). The answer for this question can be included in (12).

12. **Classifier Comparison:** Utilize the selected metrics to compare the classifiers based on the test set. Discuss your findings (at least two statements).

13. **Interpretable Classifier Insight:** After re-training the interpretable classifier with all available data, analyze and interpret the significance of predictor variables in the context of the data and the challenge (at least two statements).

14. **[Bonus]** Sub-group Improvement Strategy: If sub-groups were identified, propose and implement a method to further improve the performance of one classifier. Compare the performance of the new classifier with the results in (12).

15. **Team Contributions:** Document each team member's specific contributions related to the questions above.

16. **Link** to the public GitHub repository.

\newpage

## Grading scheme 

\begin{table}[H]
\begin{tabular}{p{0.15\textwidth}  p{0.65\textwidth}}
1.   & Answer [1]\\
2.   & Codes [2] \\
     & OR answer [2]\\
3.   & Codes [3] and answer [3]\\
4.   & Codes [2] and answer [3]\\
5.   & Codes [2]\\
     & OR answer [2]\\
6.   & Codes [2] \\
     & OR answer [2]\\
7.   & Codes [3] and Plot [1]\\
8.   & Codes [1]\\
9.   & Answers [2]\\
10.   & Describe the two metrics [2]\\
11.   & Codes [2] \\
      & these codes can be included in (12)\\
12.   & Codes (two classifiers training,\\
     & model selection for each classifier, \\
     & classifiers comparisons) [5] and answer [2]\\
13.   & Codes [1] and answers [2]\\
14.   & Codes and comparison will \\
     & give \textbf{bonus 2 points for the final grade}.\\
\end{tabular}
\end{table}

**The maximum score for this assignment is 39 points. We will convert this to 100%.**

**All group members will receive the same grade if they contribute equally.**
