# Classification Assignment

## 1. Problem Statement
The hospital management requires a predictive model to classify whether a patient has Chronic Kidney Disease (CKD) based on medical parameters.

## 2. Dataset Information
- Total Rows: 399
- Total Columns: 25
- Columns: ['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hrmo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'classification']


In [None]:

import pandas as pd

# Load dataset
df = pd.read_csv('Classification Assignment.csv')
df.head()


|  | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 76.459948 | c | 3.0 | 0.0 | normal | abnormal | notpresent | notpresent | 148.112676 | ... | 38.868902 | 8408.191126 | 4.705597 | no | no | no | yes | yes | no | yes |
| 1 | 3.0 | 76.459948 | c | 2.0 | 0.0 | normal | normal | notpresent | notpresent | 148.112676 | ... | 34.0 | 12300.0 | 4.705597 | no | no | no | yes | poor | no | yes |
| 2 | 4.0 | 76.459948 | a | 1.0 | 0.0 | normal | normal | notpresent | notpresent | 99.0 | ... | 34.0 | 8408.191126 | 4.705597 | no | no | no | yes | poor | no | yes |
| 3 | 5.0 | 76.459948 | d | 1.0 | 0.0 | normal | normal | notpresent | notpresent | 148.112676 | ... | 38.868902 | 8408.191126 | 4.705597 | no | no | no | yes | poor | yes | yes |
| 4 | 5.0 | 50.0 | c | 0.0 | 0.0 | normal | normal | notpresent | notpresent | 148.112676 | ... | 36.0 | 12400.0 | 4.705597 | no | no | no | yes | poor | no | yes |



## 3. Preprocessing
Steps:
- Check for and handle missing values
- Convert categorical variables to numeric codes with label encoding
- Split the dataset into training (80%) and testing (20%) sets
- Feature scaling: needed by Logistic Regression and SVM, but not applied in the cells below
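The steps listed above can also be bundled into a single scikit-learn preprocessor. The following is a minimal sketch, not the notebook's own method: the helper name `build_preprocessor`, the median/most-frequent imputation strategies, the use of `OrdinalEncoder`, and the tiny `demo` frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Hypothetical helper: impute + scale numeric columns, impute + encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


# Made-up rows that only mimic a few of the dataset's column names.
demo = pd.DataFrame({
    "age": [48.0, 62.0, np.nan, 51.0],
    "bp": [80.0, 70.0, 90.0, np.nan],
    "htn": ["yes", "no", "no", "yes"],
    "appet": ["good", "poor", "good", "good"],
})
preprocessor = build_preprocessor(["age", "bp"], ["htn", "appet"])
features = preprocessor.fit_transform(demo)
print(features.shape)  # (4, 4)
```

Fitting the preprocessor on the training split only, and reusing it to transform the test split, keeps test-set statistics out of the imputation and scaling.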


In [12]:

# Check missing values
df.isnull().sum()

age               0
bp                0
sg                0
al                0
su                0
rbc               0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sc                0
sod               0
pot               0
hrmo              0
pcv               0
wc                0
rc                0
htn               0
dm                0
cad               0
appet             0
pe                0
ane               0
classification    0
dtype: int64

In [13]:
# Fill missing values with 0 (a no-op here: the check above found none)

df = df.fillna(0)
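Because the check above reports no missing values, `fillna(0)` changes nothing in this dataset. If gaps were present, a zero would be an implausible value for measurements such as `bp`. A minimal sketch of a column-wise alternative follows; the helper name `fill_missing` and the `demo` frame are hypothetical, and the median/mode choice is an assumption rather than the notebook's method.

```python
import numpy as np
import pandas as pd


def fill_missing(frame):
    """Fill numeric gaps with the column median and other gaps with the most frequent value."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # Assumes the column has at least one non-missing value.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Made-up rows with one gap per column.
demo = pd.DataFrame({
    "bp": [80.0, np.nan, 70.0, 90.0],
    "rbc": ["normal", "normal", np.nan, "abnormal"],
})
clean = fill_missing(demo)
print(clean.isnull().sum().sum())  # 0
```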

In [14]:
# Convert categorical columns (including the target) to integer codes
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
    
df.head()


See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  for col in df.select_dtypes(include='object').columns:


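|  | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 76.459948 | 2 | 3.0 | 0.0 | 1 | 0 | 0 | 0 | 148.112676 | ... | 38.868902 | 8408.191126 | 4.705597 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 3.0 | 76.459948 | 2 | 2.0 | 0.0 | 1 | 1 | 0 | 0 | 148.112676 | ... | 34.0 | 12300.0 | 4.705597 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 4.0 | 76.459948 | 0 | 1.0 | 0.0 | 1 | 1 | 0 | 0 | 99.0 | ... | 34.0 | 8408.191126 | 4.705597 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 5.0 | 76.459948 | 3 | 1.0 | 0.0 | 1 | 1 | 0 | 0 | 148.112676 | ... | 38.868902 | 8408.191126 | 4.705597 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 4 | 5.0 | 50.0 | 2 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 148.112676 | ... | 36.0 | 12400.0 | 4.705597 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |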
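The encoding loop does not keep a fitted encoder for each column, so the integer codes cannot later be translated back to labels such as `yes`/`no`. One possible variant, sketched here with a hypothetical helper `encode_categoricals` and made-up rows, stores one `LabelEncoder` per column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(frame, columns):
    """Label-encode the given columns and return the fitted encoder for each one."""
    encoded = frame.copy()
    encoders = {}
    for col in columns:
        encoder = LabelEncoder()
        encoded[col] = encoder.fit_transform(encoded[col])
        encoders[col] = encoder
    return encoded, encoders


# Made-up rows using two of the dataset's column names.
demo = pd.DataFrame({
    "pc": ["abnormal", "normal", "normal"],
    "classification": ["yes", "no", "yes"],
})
encoded, encoders = encode_categoricals(demo, ["pc", "classification"])

# Classes are sorted alphabetically, so 'no' -> 0 and 'yes' -> 1.
print(encoders["classification"].classes_.tolist())        # ['no', 'yes']
print(encoders["pc"].inverse_transform([0, 1]).tolist())   # ['abnormal', 'normal']
```

Keeping the encoders also makes it possible to apply the same mapping to new patient records with `encoders[col].transform(...)`.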


In [15]:

from sklearn.model_selection import train_test_split

X = df.drop('classification', axis=1)
y = df['classification']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
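The split above is purely random, so the share of CKD cases can differ between the training and test sets. A stratified split keeps the class proportions equal in both. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo` are assumptions, not the real data) to show the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 60% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo preserves the 60/40 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())  # 0.6 0.6
```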


## 4. Model Development and Evaluation

In [16]:

from sklearn.metrics import accuracy_score, classification_report

# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_pred))

# SVM (default RBF kernel)
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))

# Random Forest (no random_state set, so results can vary between runs)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Accuracy: 0.975
SVM Accuracy: 0.5125
Random Forest Accuracy: 1.0
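The convergence warning and the low SVM accuracy are both consistent with unscaled features: columns such as `wc` are in the thousands while encoded columns are 0 or 1. A common remedy is to put a `StandardScaler` in front of each scale-sensitive model. The sketch below runs on synthetic data (one informative feature plus one large-scale noise feature, both assumptions), so it says nothing about the scores the real dataset would produce. It also prints a `classification_report`, which the cell above imports but does not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: the label depends on `signal`; `noise` has a much larger scale.
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(scale=5000.0, size=n)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each pipeline learns the scaling from the training split only.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(name, "accuracy:", scores[name])

# Per-class precision, recall and F1 for one of the models.
print(classification_report(y_te, models["SVM"].predict(X_te)))
```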



## 5. Final Model Selection

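Random Forest is selected as the final model because:
- It achieved the highest test accuracy (1.0, versus 0.975 for Logistic Regression and 0.5125 for SVM)
- It works directly on label-encoded categorical features and needs no feature scaling
- Averaging many decision trees reduces the overfitting that a single tree is prone to
- It is robust to outliers and skewed distributions in individual measurements, which are common in clinical data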
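The accuracies above come from a single 80/20 split. Cross-validation is one way to check whether the ranking holds across several splits. The following is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the CKD features), using 5-fold stratified cross-validation with a fixed `random_state` for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 24 features, matching the number of predictors only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=24, n_informative=6, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# One accuracy value per fold; the spread shows how stable the estimate is.
scores = cross_val_score(rf, X_demo, y_demo, cv=cv, scoring="accuracy")
print("fold accuracies:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

Running the same loop for Logistic Regression and SVM would give a like-for-like comparison of means and standard deviations.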
