# Chronic Kidney Disease Analysis and Prediction

![](https://th.bing.com/th/id/R.7508e9f5ce7c05d2f2feb20cf9ea4436?rik=ap1KNSWzuYqBqQ&riu=http%3a%2f%2fwww.assignmentpoint.com%2fwp-content%2fuploads%2f2015%2f10%2fBioinformatics.jpg&ehk=jfYuZa%2b8bJCbDO2yzjFlWxGkCkkz6U%2b%2fN05aWJ95koQ%3d&risl=&pid=ImgRaw&r=0&sres=1&sresct=1)

# Table of contents
1. [Import packages](#1)
1. [Import data](#2)
1. [Data Cleaning](#3)
1. [EDA](#4)
1. [Variance Inflation Factor](#5)
1. [Principal Component Analysis](#6)  
1. [Decision Tree model](#7)   
1. [GridSearch](#8)    
1. [Feature Importance](#9)    
1. [Random Forest model](#10)    
1. [XgBoost model](#11)    
1. [TabNet model](#12)    
1. [TabNet + Optuna](#13)    
1. [Results](#14)    
1. [Conclusion](#15)    

<a id="1"></a>
# Import packages

In [89]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np

In [90]:
import warnings
warnings.filterwarnings("ignore")

<a id="2"></a>
# Import data

In [91]:
df_orig = pd.read_csv('../input/ckdisease/kidney_disease.csv')

In [92]:
df_orig.head()

In [93]:
df_orig.info()

In [94]:
df_orig.describe()

In [95]:
df_orig.shape

In [96]:
(df_orig.isnull().sum() / df_orig.shape[0] * 100.00).round(2)

<a id="3"></a>
# Data Cleaning

## Transform column names

In [98]:
df_orig.drop('id', axis = 1, inplace = True)

In [99]:
df_orig.columns = ['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar', 'red_blood_cells', 'pus_cell',
              'pus_cell_clumps', 'bacteria', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
              'potassium', 'haemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count',
              'hypertension', 'diabetes_mellitus', 'coronary_artery_disease', 'appetite', 'peda_edema',
              'aanemia', 'class']

In [100]:
df_orig.head()

## Analyze categorical columns

In [101]:
df_orig['packed_cell_volume'] = pd.to_numeric(df_orig['packed_cell_volume'], errors='coerce')
df_orig['white_blood_cell_count'] = pd.to_numeric(df_orig['white_blood_cell_count'], errors='coerce')
df_orig['red_blood_cell_count'] = pd.to_numeric(df_orig['red_blood_cell_count'], errors='coerce')

In [102]:
cat_col=[col for col in df_orig.columns if df_orig[col].dtype=='object']
for col in cat_col:
    print('{} has {} values '.format(col,df_orig[col].unique()))
    print('\n')

In [103]:
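# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df_orig['diabetes_mellitus'] = df_orig['diabetes_mellitus'].replace(to_replace = {'\tno':'no','\tyes':'yes',' yes':'yes'})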

df_orig['coronary_artery_disease'] = df_orig['coronary_artery_disease'].replace(to_replace = '\tno', value='no')

df_orig['class'] = df_orig['class'].replace(to_replace = 'ckd\t', value = 'ckd')

df_orig['class'] = df_orig['class'].replace(to_replace = 'notckd', value = 'not ckd')


for col in cat_col:
    print('{} has {} values  '.format(col, df_orig[col].unique()))
    print('\n')

In [104]:
df_orig['class'] = df_orig['class'].map({'ckd': 0, 'not ckd': 1})
df_orig['class'] = pd.to_numeric(df_orig['class'], errors='coerce')
cat_cols = [col for col in df_orig.columns if df_orig[col].dtype == 'object']
for col in cat_cols:
    print(f"{col} has {df_orig[col].unique()} values\n")

In [105]:
cat_cols = [col for col in df_orig.columns if df_orig[col].dtype == 'object']
num_cols = [col for col in df_orig.columns if df_orig[col].dtype != 'object']
num_cols = num_cols[:-1]
print(cat_cols)
print(num_cols)

## Replace NaN values

1) Replace NaN with the mode for categorical columns and the mean for numerical columns:

In [106]:
# Keep an independent copy for the KNN variant; plain assignment would only create an alias
df_orig_1 = df_orig.copy()

def mean_value_imputation(feature):
    mean = df_orig[feature].mean()
    df_orig[feature] = df_orig[feature].fillna(mean)
    
    
for col in num_cols:
    mean_value_imputation(col)

In [107]:
def impute_mode(feature):
    mode = df_orig[feature].mode()[0]
    df_orig[feature] = df_orig[feature].fillna(mode)
    
for col in cat_cols:
    impute_mode(col)

In [108]:
(df_orig.isnull().sum() / df_orig.shape[0] * 100.00).round(2)

2) use knn to replace nan values:

In [109]:
from sklearn.impute import KNNImputer

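imputer = KNNImputer(n_neighbors=11)
# fit_transform returns a new array, so write the imputed values back into the frame
df_orig_1[num_cols] = imputer.fit_transform(df_orig_1[num_cols])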

def impute_mode(feature):
    mode = df_orig_1[feature].mode()[0]
    df_orig_1[feature] = df_orig_1[feature].fillna(mode)
    
for col in cat_cols:
    impute_mode(col)

In [110]:
(df_orig_1.isnull().sum() / df_orig_1.shape[0] * 100.00).round(2)

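Both methods remove the missing values. KNN imputation fills each gap from similar rows rather than from a single column-wide mean, so it may be the better choice for model accuracy.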
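A minimal sketch of KNN imputation on a toy frame, to show that `KNNImputer.fit_transform` returns a new array that has to be assigned back. The values and `n_neighbors=2` are made up for illustration; only the column names are borrowed from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical values; column names borrowed from the notebook.
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0, 60.0],
    "blood_urea": [36.0, 18.0, 53.0, np.nan, 25.0],
    "sodium": [137.0, np.nan, 140.0, 111.0, 142.0],
})

imputer = KNNImputer(n_neighbors=2)

# fit_transform does not modify `toy`; it returns a NumPy array,
# which is wrapped back into a DataFrame with the same labels.
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_imputed.isnull().sum())
```

Observed values are left unchanged; only the missing cells are filled from the nearest rows.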

<a id="4"></a>
# EDA

*matplotlib does not render plots because of GUI issues (I don't know what's wrong yet and am working on it), so I used plotly. pandas_profiling also works fine; I have hidden its output because it takes too long to load.*

In [111]:
import pandas_profiling as pp

profile = pp.ProfileReport(df_orig, title="Chronic Kidney Disease Dataset Profile", html={"style": {"full_width": True}}, sort=None)
#profile

Check for outliers (there are apparently a lot of them):

In [112]:
num_cols1 = num_cols[:-2]
fig = px.box(df_orig[num_cols1], y=num_cols1)
fig.show()

In [113]:
fig = px.box(df_orig['white_blood_cell_count'], y='white_blood_cell_count')
fig.show()

In [114]:
fig = px.box(df_orig['red_blood_cell_count'], y='red_blood_cell_count')
fig.show()

In [115]:
import plotly.graph_objects as go
df_px = df_orig[num_cols]
fig = px.imshow(df_px.corr())
fig.show()

In [116]:
df_px = df_orig[['class', 'red_blood_cell_count', 'white_blood_cell_count', 'specific_gravity', 'packed_cell_volume']]
fig = px.scatter_matrix(df_px, 
    dimensions = ['red_blood_cell_count', 'white_blood_cell_count', 'specific_gravity', 'packed_cell_volume'],
    color="class")
fig.show()

In [117]:
fig = go.Figure([go.Histogram(x = df_orig['age'])])
fig.show()

From my perspective scaling age is not necessary, because the distribution is close to normal.

In [118]:
fig = go.Figure([go.Histogram(x = df_orig['white_blood_cell_count'])])
fig.show()

In [119]:
fig = go.Figure([go.Histogram(x = df_orig['red_blood_cell_count'])])
fig.show() 

In [120]:
fig = go.Figure([go.Histogram(x = df_orig['blood_glucose_random'])])
fig.show() 

Judging by the plots I would check the outliers and also use scaling for the columns mentioned above.

Let's check if diabetes and kidney disease are connected:

In [121]:
fig = px.scatter(df_orig, 
    x = df_orig['age'], y = df_orig['diabetes_mellitus'],
    color="class")
fig.show()

Since the plot shows no instances with both diabetes and CKD, we cannot assume that there is any correlation between them.
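A scatter plot of a categorical column can hide overlapping points, so a contingency table is one way to count co-occurrences directly. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from the dataset) and `pd.crosstab`.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "diabetes_mellitus": ["yes", "yes", "no", "no", "no", "yes"],
    "class": ["ckd", "ckd", "not ckd", "not ckd", "ckd", "ckd"],
})

# Absolute counts for every (diabetes, class) combination.
counts = pd.crosstab(toy["diabetes_mellitus"], toy["class"])

# Share of each class within every diabetes group (rows sum to 1).
row_share = pd.crosstab(toy["diabetes_mellitus"], toy["class"], normalize="index")

print(counts)
print(row_share)
```

Applied to the real frame, the same two calls would show how many patients fall into each combination before any conclusion about the relationship is drawn.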

## Deal with Outliers

In [122]:
import numpy as np
# IQR
def IQR_outliers(col):

    Q1 = np.percentile(df_orig[col], 25,
                       interpolation = 'midpoint')

    Q3 = np.percentile(df_orig[col], 75,
                       interpolation = 'midpoint')
    
    per_95 = np.percentile(df_orig[col], 95,
                       interpolation = 'midpoint')
    
    IQR = Q3 - Q1
    
    upper = Q3+1.5*IQR
    lower = Q1-1.5*IQR
    
    df_orig[col] = np.where(df_orig[col] > upper, per_95, df_orig[col])
    df_orig[col] = np.where(df_orig[col] < lower, lower, df_orig[col])

    return df_orig



for col in num_cols:
    df_orig = IQR_outliers(col)

### Check outliers

In [123]:
num_cols1 = num_cols[:-2]
fig = px.box(df_orig[num_cols1], y=num_cols1)
fig.show()

In [124]:
fig = px.box(df_orig['white_blood_cell_count'], y='white_blood_cell_count')
fig.show()

In [125]:
fig = px.box(df_orig['red_blood_cell_count'], y='red_blood_cell_count')
fig.show()

## Use LabelEncoder for categorical values

In [126]:
for col in cat_cols:
    print(f"{col} has {df_orig[col].nunique()} categories\n")

In [127]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for col in cat_cols:
    df_orig[col] = le.fit_transform(df_orig[col])

In [128]:
df_orig.head()

In [129]:
from sklearn.model_selection import train_test_split
df, df_test = train_test_split(df_orig, test_size = 0.15, random_state = 42)

<a id="5"></a>
# Check VIF (Variance Inflation Factor)

In [130]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[num_cols]

vif_info = pd.DataFrame()
vif_info['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_info['Column'] = X.columns
vif_info.sort_values('VIF', ascending=False)

Only 3 features have a VIF below 5, so they show little multicollinearity; the remaining features have higher values and appear to be problematic.
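For reference, the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all other features. The sketch below computes it with plain NumPy on synthetic data (all names and values are illustrative). It adds an intercept to each regression, which statsmodels' `variance_inflation_factor` does not do unless the design matrix already contains a constant column, so the numbers need not match the table above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                  # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

The two nearly identical columns get a large VIF, while the independent one stays close to 1.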

## Split the train/test data

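The two `train_test_split` calls (15% held out for test, then 20% of the remainder for validation) give a train/validation/test split of roughly 68/17/15%. Because the dataset is small, I would consider using a 70/20/10% split instead.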
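Because the second split is applied to what is left after the first, the effective shares are products of the two `test_size` values. A quick check of the arithmetic for the values used in this notebook (0.15, then 0.20), plus the second-step `test_size` that a 70/20/10 split would need:

```python
# Shares used in this notebook: 15% test, then 20% of the rest for validation.
test_share = 0.15
valid_share_of_rest = 0.20

valid_share = (1 - test_share) * valid_share_of_rest
train_share = (1 - test_share) * (1 - valid_share_of_rest)
print(round(train_share, 2), round(valid_share, 2), test_share)

# For a 70/20/10 split, hold out 10% first; the second call then needs
# test_size = 0.20 / 0.90 so that validation is 20% of the full data.
second_step_size = 0.20 / (1 - 0.10)
print(round(second_step_size, 3))
```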

In [131]:
ind_col = [col for col in df.columns if col != 'class']
dep_col = 'class'

X = df[ind_col]
y = df[dep_col]

X_test = df_test[ind_col]
y_test = df_test[dep_col]

In [132]:
print(X.shape)
print(X_test.shape)

In [133]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.20, random_state = 0)

<a id="6"></a>
# Perform PCA on data

In [134]:
from sklearn.decomposition import PCA
pca = PCA(n_components=24)
principalComponents = pca.fit_transform(X_train)
print (pca.explained_variance_ratio_.cumsum())

*PCA did not improve model performance, so I did not use it; n_components = 24 remains the best choice if it is used.*
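One common way to choose `n_components` is to keep the smallest number of components whose cumulative explained variance reaches a threshold. The sketch below uses synthetic data and an assumed 95% threshold (neither comes from this notebook). It also shows that scikit-learn accepts a float `n_components` for the same purpose.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

# Fit with all components and inspect the cumulative explained variance.
pca_full = PCA().fit(X)
cumulative = pca_full.explained_variance_ratio_.cumsum()

# Smallest number of components that reaches the (assumed) 95% threshold.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# A float in (0, 1) asks PCA to make the same selection itself.
pca_95 = PCA(n_components=0.95).fit(X)

print(cumulative.round(3), n_keep, pca_95.n_components_)
```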

In [135]:
pca.fit(X_train)
X_valid_new = pca.transform(X_valid)

In [136]:
#pca.fit(X_train)
X_train_new = pca.transform(X_train)

In [137]:
X_train_new.shape

In [138]:
X_valid_new.shape

In [139]:
print(X_valid_new)

### KMeans Clustering

Let's check best k for kmeans clustering:

In [140]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
model = KElbowVisualizer(KMeans(), k=15)
model.fit(X_train)
model.show()

In [141]:
from sklearn.cluster import KMeans
kmeans = KMeans(
    init="random",
    n_clusters=5,
    n_init=10,
    max_iter=300,
    random_state=42
)
kmeans.fit(X_train)

from sklearn.metrics import silhouette_score
# Score the labels against the data that was clustered, not the cluster-distance matrix
score = silhouette_score(X_train, kmeans.labels_)
print(score)

# Modeling

<a id="7"></a>
# Build the Decision Tree model

In [142]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

dtc = DecisionTreeClassifier()
dtc.fit(X_train_new, y_train)

# accuracy score, confusion matrix and classification report of decision tree

dtc_acc = accuracy_score(y_valid, dtc.predict(X_valid_new))
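# The tree was trained on PCA-transformed features, so the test set needs the same transform
dtc_acc_test = accuracy_score(y_test, dtc.predict(pca.transform(X_test)))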

print(f"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, dtc.predict(X_train_new))}")
print(f"Test Accuracy of Decision Tree Classifier is {dtc_acc_test} \n")
print(f"Validation Accuracy of Decision Tree Classifier is {accuracy_score(y_valid, dtc.predict(X_valid_new))}")

print(f"Confusion Matrix :- \n{confusion_matrix(y_valid, dtc.predict(X_valid_new))}\n")
print(f"Classification Report :- \n {classification_report(y_valid, dtc.predict(X_valid_new))}")

<a id="8"></a>
# Use GridSearch

In [143]:
from sklearn.model_selection import GridSearchCV
grid_param = {
    'criterion' : ['gini', 'entropy'],
    'max_depth' : [3, 5, 7, 10],
    'splitter' : ['best', 'random'],
    'min_samples_leaf' : [1, 2, 3, 5, 7],
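    'min_samples_split' : [2, 3, 5, 7],  # integer values must be at least 2
    'max_features' : ['sqrt', 'log2']  # 'auto' was an alias of 'sqrt' and is rejected by recent scikit-learn releases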
}

grid_search_dtc = GridSearchCV(dtc, grid_param, cv = 11, n_jobs = -1, verbose = 1)
grid_search_dtc.fit(X_train, y_train)
print(grid_search_dtc.best_params_)
print(grid_search_dtc.best_score_)

##### Best parameters after using GridSearch are: 
*{'criterion': 'gini', 'max_depth': 7, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 7, 'splitter': 'best'}*

In [144]:
dtc = grid_search_dtc.best_estimator_

print(dtc)

# accuracy score, confusion matrix and classification report of grid search

dtc_gs_acc = accuracy_score(y_valid, dtc.predict(X_valid))
dtc_gs_acc_test = accuracy_score(y_test, dtc.predict(X_test))

print(f"Training Accuracy of Decision Tree Classifier is {accuracy_score(y_train, dtc.predict(X_train))}")
print(f"Test Accuracy of Decision Tree Classifier is {dtc_gs_acc_test}")
print(f"Validation Accuracy of Decision Tree Classifier is {accuracy_score(y_valid, dtc.predict(X_valid))} \n")

# The grid-searched tree was fitted on the untransformed features, so evaluate on X_valid
print(f"Confusion Matrix :- \n{confusion_matrix(y_valid, dtc.predict(X_valid))}\n")
print(f"Classification Report :- \n {classification_report(y_valid, dtc.predict(X_valid))}")

<a id="9"></a>
# Feature Importance

According to National Health Service UK:
> Chronic kidney disease is usually caused by other conditions that put a strain on the kidneys. Often it's the result of a combination of different problems.

NHS gives further descriptions of possible CKD causes, such as:

>*CKD can be caused by:*
>* high blood pressure – over time, this can put strain on the small blood vessels in the kidneys and stop the kidneys working properly
>* diabetes – too much glucose in your blood can damage the tiny filters in the kidneys
>* high cholesterol – this can cause a build-up of fatty deposits in the blood vessels supplying your kidneys, which can make it harder for them to work properly
>* kidney infections
>* glomerulonephritis – kidney inflammation
>* polycystic kidney disease – an inherited condition where growths called cysts develop in the kidneys
>* blockages in the flow of urine – for example, from kidney stones that keep coming back, or an enlarged prostate
>* long-term, regular use of certain medicines – such as lithium and non-steroidal anti-inflammatory drugs (NSAIDs)


3 methods were used to analyze feature importance for GridSearch model:

---> using built-in function

In [145]:
plt.figure(figsize=(12,3))
features = X_test.columns.values.tolist()
importance = dtc.feature_importances_.tolist()
ft_imp = pd.DataFrame()
ft_imp['feature'] = features
ft_imp['importance'] = importance
ft_imp.sort_values(by=['importance'], ascending = False, inplace=True)
print(ft_imp)
feature_series = pd.Series(data=importance,index=features)
feature_series.plot.bar()
plt.title('Feature Importance')

---> using LIME

In [146]:
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_test.columns.values.tolist(),
                                                  class_names=['ckd', 'not ckd'], verbose=True, mode='classification')

In [147]:
j = len(X_train) - 1
exp = explainer.explain_instance(X_train.values[j], grid_search_dtc.predict_proba, num_features=10)

In [148]:
exp.show_in_notebook(show_table=True)

---> using boruta-shap

In [149]:
!pip install borutashap

In [150]:
from BorutaShap import BorutaShap

# The target is a class label, so use the classification setting
Feature_Selector = BorutaShap(importance_measure='shap', classification=True)

Feature_Selector.fit(X=X_train, y=y_train, n_trials=50, random_state=0)

In [151]:
Feature_Selector.plot(which_features='all', figsize=(16,12))

selected_columns = list()
selected_columns.append(sorted(Feature_Selector.Subset().columns))
    
print(f"Selected features are: {selected_columns[-1]}")

##### The methods may differ slightly in their results, but all of them are consistent with the medically recognized causes of CKD. In conclusion, anomalies in specific gravity (the kidney's ability to concentrate urine), blood pressure and the presence of diabetes can be good indicators of CKD.

<a id="10"></a>
# Random Forest

In [152]:
from sklearn.ensemble import RandomForestClassifier

# 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn releases reject 'auto'
rd_clf = RandomForestClassifier(criterion = 'entropy', max_depth = 10, max_features = 'sqrt', min_samples_leaf = 2, min_samples_split = 7, n_estimators = 12)
rd_clf.fit(X_train_new, y_train)

# accuracy score, confusion matrix and classification report of random forest

rd_clf_acc = accuracy_score(y_valid, rd_clf.predict(X_valid_new))
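# The forest was trained on PCA-transformed features, so the test set needs the same transform
rd_clf_acc_test = accuracy_score(y_test, rd_clf.predict(pca.transform(X_test)))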

print(f"Training Accuracy of Random Forest Classifier is {accuracy_score(y_train, rd_clf.predict(X_train_new))}")
print(f"Test Accuracy of Random Forest Classifier is {rd_clf_acc_test} \n")
print(f"Validation Accuracy of Random Forest Classifier is {accuracy_score(y_valid, rd_clf.predict(X_valid_new))}")

print(f"Confusion Matrix :- \n{confusion_matrix(y_valid, rd_clf.predict(X_valid_new))}\n")
print(f"Classification Report :- \n {classification_report(y_valid, rd_clf.predict(X_valid_new))}")

<a id="11"></a>
# XGBoost

In [153]:
from xgboost import XGBClassifier

xgb = XGBClassifier(objective = 'binary:logistic', learning_rate = 0.5, max_depth = 5, n_estimators = 150)
xgb.fit(X_train_new, y_train)

# accuracy score, confusion matrix and classification report of xgboost

xgb_acc = accuracy_score(y_valid, xgb.predict(X_valid_new))
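# The model was trained on PCA-transformed features, so the test set needs the same transform
xgb_acc_test = accuracy_score(y_test, xgb.predict(pca.transform(X_test)))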

print(f"Training Accuracy of XgBoost is {accuracy_score(y_train, xgb.predict(X_train_new))}")
print(f"Test Accuracy of XgBoost is {xgb_acc_test}")
print(f"Validation Accuracy of XgBoost is {accuracy_score(y_valid, xgb.predict(X_valid_new))} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_valid, xgb.predict(X_valid_new))}\n")
print(f"Classification Report :- \n {classification_report(y_valid, xgb.predict(X_valid_new))}")

<a id="12"></a>
# TabNet

In [154]:
!pip install pytorch-tabnet

In [155]:
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.model_selection import KFold
import torch

In [156]:
clf = TabNetClassifier(verbose=1,seed=42)

In [157]:
clf.fit(X_train=X_train_new, y_train=y_train,
               eval_metric=['auc'])

In [158]:
tbnt_acc = accuracy_score(y_valid, clf.predict(X_valid_new))
#tbnt_acc_test = accuracy_score(y_test, clf.predict(X_test))

print(f"Training Accuracy of TabNet is {accuracy_score(y_train, clf.predict(X_train_new))}")
#print(f"Test Accuracy of TabNet is {tbnt_acc_test} \n")
print(f"Validation Accuracy of TabNet is {accuracy_score(y_valid, clf.predict(X_valid_new))}")

print(f"Confusion Matrix :- \n{confusion_matrix(y_valid, clf.predict(X_valid_new))}\n")
print(f"Classification Report :- \n {classification_report(y_valid, clf.predict(X_valid_new))}")

<a id="13"></a>
# TabNet Optimization (Optuna)

In [159]:
import optuna
from optuna import Trial

In [160]:
EPOCHS = 30
BATCH_SIZE = 32

def objective(trial):
    # parameter set by optuna
    N_D = trial.suggest_int('N_D', 8, 32)
    N_A = N_D
    GAMMA = trial.suggest_float('GAMMA', 1.0, 2.0)
    N_STEPS = trial.suggest_int('N_STEPS', 1, 3, step=1)
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    LAMBDA_SPARSE = trial.suggest_float("LAMBDA_SPARSE", 1e-5, 1e-1, log=True)
    
    # changes
    # introduced lambda-sparse
    clf = TabNetClassifier(
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":4,
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='sparsemax',
                          n_d = N_D,
                          n_a = N_A,
                          gamma = GAMMA,
                          n_steps = N_STEPS,
                          lambda_sparse = LAMBDA_SPARSE)
    
    clf.fit(X_train_new, y_train,
        eval_set=[(X_train_new, y_train),(X_valid_new, y_valid)],
        max_epochs = EPOCHS,
        batch_size = BATCH_SIZE,
        patience = 5,
        eval_name=['train', 'valid'],
        eval_metric=['auc']
           )
    
    # changed: the score is now the maximum validation AUC
    score = np.max(clf.history['valid_auc'])
    
    return score

In [161]:
study = optuna.create_study(direction='maximize', study_name = 'tabnet-study')

study.optimize(objective, n_trials=150, timeout = 3600*8)

print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [162]:
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [163]:
from optuna.visualization import plot_optimization_history

plot_optimization_history(study)

In [164]:
from optuna.visualization import plot_param_importances

plot_param_importances(study)

Predict on valid/test using best parameters:

In [165]:
params_tb = study.best_trial.params
print(params_tb)

In [166]:
clf_opt = TabNetClassifier(verbose=1, seed=42,
                        optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=2e-2),
                       scheduler_params={"step_size":4,
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='sparsemax',
                          n_d = params_tb['N_D'],
                          n_a = params_tb['N_D'],
                          gamma = params_tb['GAMMA'],
                          n_steps = params_tb['N_STEPS'],
                          lambda_sparse = params_tb['LAMBDA_SPARSE'])

In [167]:
clf_opt.fit(X_train=X_train_new, y_train=y_train,
        max_epochs = 100,
        batch_size = BATCH_SIZE,
        patience = 5,
        eval_metric=['auc']
       )

In [168]:
tbnt_opt_acc = accuracy_score(y_valid, clf_opt.predict(X_valid_new))
#tbnt_opt_acc_test = accuracy_score(y_test, clf_opt.predict(X_test))

print(f"Training Accuracy of TabNet is {accuracy_score(y_train, clf_opt.predict(X_train_new))}")
#print(f"Test Accuracy of TabNet is {tbnt_opt_acc_test} \n")
print(f"Validation Accuracy of TabNet is {accuracy_score(y_valid, clf_opt.predict(X_valid_new))}")

print(f"Confusion Matrix :- \n{confusion_matrix(y_valid, clf_opt.predict(X_valid_new))}\n")
print(f"Classification Report :- \n {classification_report(y_valid, clf_opt.predict(X_valid_new))}")

<a id="14"></a>
# Plot results

In [169]:
models = pd.DataFrame({
    'Model' : [ 'Decision Tree Classifier', 'TabNet', 'Tabnet + Optuna', 'Decision Tree + GridSearch', 'Random Forest Classifier',
             'XgBoost'],
    'Score' : [dtc_acc, tbnt_acc, tbnt_opt_acc, dtc_gs_acc, rd_clf_acc, xgb_acc]
})


models = models.sort_values(by = 'Score', ascending = True)

In [170]:
px.bar(data_frame = models, x = 'Score', y = 'Model', color = 'Score', template = 'seaborn', 
       title = 'Models Validation Score Comparison')

<a id="15"></a>
# Conclusions

**In this notebook:**
* analyzed Chronic Kidney Disease dataset
* performed EDA
* tried out Principal Component Analysis (PCA)
* used different models for prediction: Decision Tree (with and without GridSearch), Random Forest, XgBoost, TabNet (with and without Optuna optimization)
* used 3 different methods for feature importance analysis: built-in feature importances, LIME, Boruta-SHAP

In conclusion, Decision Trees and TabNet perform well once optimized (in our case with GridSearch and Optuna respectively), whereas XgBoost is capable of outperforming them on its own. TabNet has proven to work well with tabular data, which is common among classic neural network models, but tree models and boosting are still more efficient in this case: although they are simpler and easier to implement, they maintain high performance on data like that used here. When analyzing feature importance, we can see that the important features are the same ones that are considered to be the main causes of CKD.

#### Thank you for making it till the end of my notebook! I hope you enjoyed the content above, please feel free to comment and upvote :)