_______________________

# **About The Model**

This project develops a classification model that predicts liver disease risk from demographic, lifestyle, and clinical indicators, including Age, BMI, Alcohol Consumption, Physical Activity, and Liver Function Test results. For each individual, the model outputs a predicted diagnosis and the probability of liver disease, offering insight into how lifestyle factors and demographics relate to liver health. Such a model could support preventive healthcare by helping to identify individuals who may be at higher risk.

<div style="text-align: left;">

<h2>Feature Descriptions</h2>

<table style="width:100%; border-collapse: collapse; text-align:left; table-layout: auto;">
  <tr>
    <th style="border: 1px solid black; padding: 8px; background-color: #f2f2f2; width: 30%;">Feature</th>
    <th style="border: 1px solid black; padding: 8px; background-color: #f2f2f2; width: 70%;">Description</th>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Age</td>
    <td style="border: 1px solid black; padding: 8px;">The age of the individual.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Gender</td>
    <td style="border: 1px solid black; padding: 8px;">The gender of the individual (binary variable: 0 for male, 1 for female).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">BMI</td>
    <td style="border: 1px solid black; padding: 8px;">Body Mass Index, a measure of body fat based on height and weight.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">AlcoholConsumption</td>
    <td style="border: 1px solid black; padding: 8px;">Amount of alcohol consumed.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Smoking</td>
    <td style="border: 1px solid black; padding: 8px;">Whether the individual smokes (binary variable: 0 for non-smoker, 1 for smoker).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">GeneticRisk</td>
    <td style="border: 1px solid black; padding: 8px;">Genetic predisposition to liver disease (binary variable: 0 for low risk, 1 for high risk).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">PhysicalActivity</td>
    <td style="border: 1px solid black; padding: 8px;">Level of physical activity.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Diabetes</td>
    <td style="border: 1px solid black; padding: 8px;">Whether the individual has diabetes (binary variable: 0 for no, 1 for yes).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Hypertension</td>
    <td style="border: 1px solid black; padding: 8px;">Whether the individual has hypertension (binary variable: 0 for no, 1 for yes).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">LiverFunctionTest</td>
    <td style="border: 1px solid black; padding: 8px;">A measure of liver function.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Diagnosis</td>
    <td style="border: 1px solid black; padding: 8px;">Target variable indicating whether the individual is diagnosed with liver disease (binary variable: 0 for no, 1 for yes).</td>
  </tr>
</table>

</div>

# **Importing Libraries & Models**

In [None]:
# For data manipulation and visualisation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Importing Models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Machine Learning Libraries

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# **Importing Dataset**

In [None]:
df = pd.read_csv("/kaggle/input/predict-liver-disease-1700-records-dataset/Liver_disease_data.csv")

In [None]:
df.head()

# **Inspecting The Dataset**

In [None]:
df.info()

In [None]:
df.duplicated().sum()

Every column has the correct data type, and no duplicate rows were found.

# **Checking for Null Values**

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull(), yticklabels=False)

No null values were found in the dataset.

# **Descriptive Statistics**

In [None]:
df.describe()

# **EDA**

In [None]:
import warnings # To remove warnings
warnings.filterwarnings('ignore')

# Visualizing the data points on a box plot to check for outliers
def boxplot_with_points(df, columns):
    for column in columns:
        plt.figure(figsize=(8, 4))
        sns.boxplot(x=df[column], showfliers=False)
        sns.stripplot(x=df[column], color='red', alpha=0.5)
        plt.title(f'Boxplot with Data Points for {column}')
        plt.show()

# List of columns to check
columns_for_outlier_check = ['Age', 'BMI', 'AlcoholConsumption', 'PhysicalActivity', 'LiverFunctionTest']

# Binary columns are not included in the outlier check because a box plot is not meaningful for 0/1 values.

# Boxplots with data points
boxplot_with_points(df, columns_for_outlier_check)

All data points fall within the whisker range, so no outliers were detected.
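
The visual check can be backed by a numeric one. The following is a minimal sketch, assuming the 1.5 × IQR whisker rule that seaborn's `boxplot` applies by default. The helper name `count_iqr_outliers` and the small synthetic frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(df, columns, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR] for each column."""
    counts = {}
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - whis * iqr, q3 + whis * iqr
        counts[column] = int(((df[column] < lower) | (df[column] > upper)).sum())
    return pd.Series(counts, name="n_outliers")

# Synthetic demo data (hypothetical values, not the liver dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(20, 80, size=200),
    "BMI": np.append(rng.uniform(15, 40, size=199), 95.0),  # one planted outlier
})
outlier_counts = count_iqr_outliers(demo, ["Age", "BMI"])
print(outlier_counts)
```

On the notebook's data the call would be `count_iqr_outliers(df, columns_for_outlier_check)`; a count of zero in every column would agree with the plots.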

# **Check for Multicollinearity**

In [None]:
correlation_matrix = df.corr()

plt.figure(figsize=(8, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap')
plt.show()

The data looks fine, so we can now proceed to model building.
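
To turn the heatmap into an explicit check, one option is to list the feature pairs whose absolute correlation exceeds a cutoff. This is only a sketch: the 0.8 threshold is a common rule of thumb chosen here as an assumption, `high_correlation_pairs` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |r| above threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN comparisons are False, so masked-out entries are dropped here as well
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic demo (hypothetical data): "b" is almost a copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),
    "c": rng.normal(size=300),
})
flagged = high_correlation_pairs(demo)
print(flagged)
```

For the notebook's data, the target would normally be excluded first, for example `high_correlation_pairs(df.drop(columns="Diagnosis"))`. An empty list would mean no feature pair crosses the chosen threshold.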

# **Train-Test Split**

In [None]:
X = df.drop("Diagnosis", axis=1)
y = df["Diagnosis"]

In [None]:
# Splitting The Dataset for Training Models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
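
If the two `Diagnosis` classes are not evenly represented, a stratified split keeps their proportions the same in both subsets. The following is a minimal sketch on synthetic labels; the 30/70 class ratio is an assumption for illustration, not the dataset's actual ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (hypothetical: 30% positives)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

# stratify=y_demo preserves the class ratio in both the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(f"Positive share in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```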

# **Implementing Feature Scaling**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
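
Scaling mainly matters for distance- and margin-based models such as `SVC` and `KNeighborsClassifier`. One way to make sure such a model always receives scaled inputs is to bundle the scaler with it in a scikit-learn pipeline. The block below is a sketch on synthetic data; the pipeline is an alternative shown for illustration and is not used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and target (hypothetical data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, then applied before SVC at predict time
svc_pipeline = make_pipeline(StandardScaler(), SVC())
svc_pipeline.fit(X_tr, y_tr)
pipeline_accuracy = svc_pipeline.score(X_te, y_te)
print(f"Scaled SVC accuracy: {pipeline_accuracy:.4f}")
```

Because the scaler travels with the model, the same object can later be given raw feature values without a separate `scaler.transform` step.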

# **Building Models**

In [None]:
# Defining a dictionary of models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Support Vector Classifier": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier()
}

# Function to train, evaluate, and print stats for each model
def evaluate_models(models, X_train, X_test, y_train, y_test):
    results = {}
    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train, y_train)
        
        # Predictions
        y_pred = model.predict(X_test)
        
        # Evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        
        print(f"{name} Accuracy: {accuracy:.4f}")
        print(f"{name} Classification Report:\n{report}")
        print(f"{name} Confusion Matrix:\n{conf_matrix}")
        print("-" * 50)
        
        results[name] = accuracy
    
    # Finding the best model based on accuracy
    best_model_name = max(results, key=results.get)
    best_model_accuracy = results[best_model_name]
    print(f"Best Model: {best_model_name} with Accuracy: {best_model_accuracy:.4f}")

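# Evaluating all models
# Note: the models are trained on the unscaled features; X_train_scaled and X_test_scaled are not used here
evaluate_models(models, X_train, X_test, y_train, y_test)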
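
A single 80/20 split yields one accuracy figure per model, and that figure can shift with the random seed. Cross-validation is a common complement. The sketch below runs it on synthetic data for two of the model types from the dictionary; the 5-fold setting and the `demo_models` name are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (hypothetical data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Five stratified folds; each model is refitted once per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The standard deviation across folds gives a rough sense of how much a model's accuracy varies with the choice of split.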


# Model Performance Summary

The table below lists the test-set accuracy of each machine learning model trained to predict the liver disease diagnosis.

<div style="text-align: left;">

## Model Accuracies

<table style="width:100%; text-align:left;">
  <tr>
    <th>Model</th>
    <th>Accuracy</th>
  </tr>
  <tr>
    <td>Logistic Regression</td>
    <td>0.8088</td>
  </tr>
  <tr>
    <td>Decision Tree</td>
    <td>0.8529</td>
  </tr>
  <tr>
    <td>Random Forest</td>
    <td>0.9029</td>
  </tr>
  <tr>
    <td>Gradient Boosting</td>
    <td>0.9088</td>
  </tr>
  <tr>
    <td>Support Vector Classifier</td>
    <td>0.7735</td>
  </tr>
  <tr>
    <td>K-Nearest Neighbors</td>
    <td>0.7735</td>
  </tr>
  <tr>
    <td>Naive Bayes</td>
    <td>0.8029</td>
  </tr>
  <tr>
    <td>CatBoost</td>
    <td>0.9147</td>
  </tr>
  <tr>
    <td>XGBoost</td>
    <td>0.8912</td>
  </tr>
  <tr>
    <td>LightGBM</td>
    <td>0.8853</td>
  </tr>
</table>

</div>

## Best Model

- **Best Model:** CatBoost
- **Accuracy:** 0.9147

CatBoost achieved the highest test accuracy (0.9147), so it is selected as the final model for this classification task.
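
The gap between the top models is small relative to the test-set size. As a rough worked check, the block below computes a normal-approximation 95% confidence interval for an accuracy value. It assumes a test set of 340 rows (20% of the 1,700 records suggested by the dataset name); that count is inferred and is not printed by the notebook.

```python
import math

def accuracy_confidence_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n_samples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * standard_error, accuracy + z * standard_error

N_TEST = 340  # assumed: 0.2 * 1700 records

catboost_low, catboost_high = accuracy_confidence_interval(0.9147, N_TEST)
print(f"CatBoost 95% CI: [{catboost_low:.3f}, {catboost_high:.3f}]")

# Check whether Gradient Boosting's reported accuracy (0.9088) lies inside the interval
print(catboost_low <= 0.9088 <= catboost_high)
```

Under these assumptions the interval spans roughly 3 percentage points on either side of 0.9147, so the ranking among the top boosting models should be read as tentative.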

In [None]:
# Save the model
models["CatBoost"].save_model('catboost_model.cbm')

# **Predicting Liver Health Risk With Example Data**

In [None]:
from catboost import CatBoostClassifier

# Loading trained CatBoost model
model = CatBoostClassifier()
model.load_model('catboost_model.cbm')

# Sample data for prediction
sample_data = {
    'Age': [45],
    'Gender': [1],
    'BMI': [26.5],
    'AlcoholConsumption': [7],
    'Smoking': [0],
    'GeneticRisk': [1],
    'PhysicalActivity': [4],
    'Diabetes': [0],
    'Hypertension': [1],
    'LiverFunctionTest': [55]
}

# Converting sample data to DataFrame
df_sample = pd.DataFrame(sample_data)

# Predicting using the loaded CatBoost model
predictions = model.predict(df_sample)
pred_proba = model.predict_proba(df_sample)

# Extracting probabilities for disease class
prob_disease = pred_proba[0][1] * 100

# Printing the result
print(f"Prediction: {'Liver Disease' if predictions[0] == 1 else 'No Liver Disease'}")
print(f"Prediction Probability: {pred_proba[0]}")
print(f"Probability of Having Disease: {prob_disease:.2f}%")
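
When predicting for new records, the input columns should match the training features in both name and order. A small helper can enforce this before `model.predict` is called. The sketch below is hypothetical: `build_input_frame` and `FEATURE_ORDER` are not part of the notebook, and the feature list is copied from the sample data above.

```python
import pandas as pd

# Feature order as used in the sample data (assumed to match the training columns)
FEATURE_ORDER = [
    "Age", "Gender", "BMI", "AlcoholConsumption", "Smoking",
    "GeneticRisk", "PhysicalActivity", "Diabetes", "Hypertension",
    "LiverFunctionTest",
]

def build_input_frame(record, feature_order=FEATURE_ORDER):
    """Turn a dict of feature values into a one-row DataFrame in training order."""
    missing = [name for name in feature_order if name not in record]
    unexpected = [name for name in record if name not in feature_order]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return pd.DataFrame([[record[name] for name in feature_order]], columns=feature_order)

# Keys given in a different order still produce correctly ordered columns
record = {
    "LiverFunctionTest": 55, "Age": 45, "Gender": 1, "BMI": 26.5,
    "AlcoholConsumption": 7, "Smoking": 0, "GeneticRisk": 1,
    "PhysicalActivity": 4, "Diabetes": 0, "Hypertension": 1,
}
df_input = build_input_frame(record)
print(df_input)
```

The resulting frame could then be passed to the loaded model in place of `df_sample`, for example `model.predict_proba(df_input)`.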