_______________________

# **About The Model**

This project develops a classification model that predicts liver disease from demographic, lifestyle, and clinical indicators such as Age, BMI, Alcohol Consumption, Physical Activity, and Liver Function Test results (the full feature list is in the table below). From these features, the model estimates the probability that an individual has liver disease, which also gives some insight into how lifestyle factors and demographics relate to liver health. Such a model can support preventive healthcare by helping to identify individuals who may be at higher risk.

<div style="text-align: left;">

<h2>Feature Descriptions</h2>

<table style="width:100%; border-collapse: collapse; text-align:left; table-layout: auto;">
  <tr>
    <th style="border: 1px solid black; padding: 8px; background-color: #f2f2f2; width: 30%;"><strong>Feature</strong></th>
    <th style="border: 1px solid black; padding: 8px; background-color: #f2f2f2; width: 70%;"><strong>Description</strong></th>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Age</td>
    <td style="border: 1px solid black; padding: 8px;">The age of the individual.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Gender</td>
    <td style="border: 1px solid black; padding: 8px;">The gender of the individual (binary variable: 0 for male, 1 for female).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">BMI</td>
    <td style="border: 1px solid black; padding: 8px;">Body Mass Index, a measure of body fat based on height and weight.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">AlcoholConsumption</td>
    <td style="border: 1px solid black; padding: 8px;">Amount of alcohol consumed.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Smoking</td>
    <td style="border: 1px solid black; padding: 8px;">Whether the individual smokes (binary variable: 0 for non-smoker, 1 for smoker).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">GeneticRisk</td>
    <td style="border: 1px solid black; padding: 8px;">Genetic predisposition to liver disease (binary variable: 0 for low risk, 1 for high risk).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">PhysicalActivity</td>
    <td style="border: 1px solid black; padding: 8px;">Level of physical activity.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Diabetes</td>
    <td style="border: 1px solid black; padding: 8px;">Whether the individual has diabetes (binary variable: 0 for no, 1 for yes).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Hypertension</td>
    <td style="border: 1px solid black; padding: 8px;">Whether the individual has hypertension (binary variable: 0 for no, 1 for yes).</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">LiverFunctionTest</td>
    <td style="border: 1px solid black; padding: 8px;">A measure of liver function.</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 8px;">Diagnosis</td>
    <td style="border: 1px solid black; padding: 8px;">Target variable indicating whether the individual is diagnosed with liver disease (binary variable: 0 for no, 1 for yes).</td>
  </tr>
</table>

</div>

# **Importing Libraries**

In [None]:
# For data manipulation and visualization

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Machine Learning Libraries

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

import warnings # To remove warnings
warnings.filterwarnings('ignore')

# **Importing Dataset**

In [None]:
df = pd.read_csv("/kaggle/input/predict-liver-disease-1700-records-dataset/Liver_disease_data.csv")

In [None]:
df.head()

# **Inspecting The Dataset**

In [None]:
df.info()

In [None]:
df.duplicated().sum()

Every column has the expected data type, and no duplicate rows were found.
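
Beyond data types, the 0/1 encodings listed in the feature table can be checked directly. The following is a minimal sketch, assuming the column names from that table; `find_non_binary` is a hypothetical helper and the small frame is made-up illustration data, not the Kaggle dataset.

```python
import pandas as pd

# Columns the feature table describes as binary (0/1).
BINARY_COLUMNS = ["Gender", "Smoking", "GeneticRisk", "Diabetes", "Hypertension", "Diagnosis"]

def find_non_binary(frame, columns=BINARY_COLUMNS):
    """Return the columns (among those present) that hold values other than 0 or 1."""
    bad = []
    for col in columns:
        if col in frame.columns and not frame[col].dropna().isin([0, 1]).all():
            bad.append(col)
    return bad

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Gender": [0, 1, 1],
    "Smoking": [0, 2, 1],   # 2 is an invalid code, included on purpose
    "Diabetes": [0, 0, 1],
})
print(find_non_binary(toy))
```

On the real data the same call would be `find_non_binary(df)`; an empty list means every binary column matches its documented encoding.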

# **Checking for Null Values**

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull(),yticklabels=False)

No null values were found in the dataset.

# **Descriptive Statistics**

In [None]:
df.describe()

# **EDA**

### Checking Distribution

In [None]:
# Distribution of Age
plt.figure(figsize=(8, 4))
sns.histplot(df['Age'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
print('Median:',df['Age'].median())
print('Mean:',df['Age'].mean())
print('\nThe median is close to the mean, which suggests a roughly symmetric distribution with no strong pull from extreme values, so we can proceed.')

In [None]:
# Distribution of Gender
gender_counts = df['Gender'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(gender_counts, labels=gender_counts.index, colors=['skyblue', 'lightcoral'], autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title('Gender Distribution')
plt.show()

In [None]:
# Distribution of BMI
plt.figure(figsize=(8, 4))
sns.histplot(df['BMI'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of BMI')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.show()

In [None]:
print('Median:',df['BMI'].median())
print('Mean:',df['BMI'].mean())
print('\nThe median is close to the mean, which suggests a roughly symmetric distribution with no strong pull from extreme values, so we can proceed.')

In [None]:
#Distribution of AlcoholConsumption
plt.figure(figsize=(8, 4))
sns.histplot(df['AlcoholConsumption'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of AlcoholConsumption')
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency')
plt.show()

In [None]:
print('Median:',df['AlcoholConsumption'].median())
print('Mean:',df['AlcoholConsumption'].mean())
print('\nThe median is close to the mean, which suggests a roughly symmetric distribution with no strong pull from extreme values, so we can proceed.')

In [None]:
#Distribution of Smoking
smoking_counts = df['Smoking'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(smoking_counts, labels=smoking_counts.index, colors=['skyblue', 'lightcoral'], autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title('Smoking Distribution')
plt.show()

In [None]:
#Distribution of Genetic Risk
genetic_counts = df['GeneticRisk'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(genetic_counts, labels=genetic_counts.index, colors=['skyblue', 'lightcoral'], autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title('GeneticRisk Distribution')
plt.show()

In [None]:
#Distribution of PhysicalActivity

plt.figure(figsize=(8, 4))
sns.histplot(df['PhysicalActivity'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of PhysicalActivity')
plt.xlabel('PhysicalActivity')
plt.ylabel('Frequency')
plt.show()

In [None]:
print('Median:',df['PhysicalActivity'].median())
print('Mean:',df['PhysicalActivity'].mean())
print('\nThe median is close to the mean, which suggests a roughly symmetric distribution with no strong pull from extreme values, so we can proceed.')

In [None]:
#Distribution of Diabetes
diabetes_counts = df['Diabetes'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(diabetes_counts, labels=diabetes_counts.index, colors=['skyblue', 'lightcoral'], autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title('Diabetes Distribution')
plt.show()

In [None]:
#Distribution of Hypertension
hypertension_counts = df['Hypertension'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(hypertension_counts, labels=hypertension_counts.index, colors=['skyblue', 'lightcoral'], autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title('Hypertension Distribution')
plt.show()

In [None]:
#Distribution of LiverFunctionTest

plt.figure(figsize=(8, 4))
sns.histplot(df['LiverFunctionTest'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of LiverFunctionTest')
plt.xlabel('LiverFunctionTest')
plt.ylabel('Frequency')
plt.show()

In [None]:
print('Median:',df['LiverFunctionTest'].median())
print('Mean:',df['LiverFunctionTest'].mean())
print('\nThe median is close to the mean, which suggests a roughly symmetric distribution with no strong pull from extreme values, so we can proceed.')

In [None]:
#Distribution of Diagnosis
diagnosis_counts = df['Diagnosis'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(diagnosis_counts, labels=diagnosis_counts.index, colors=['skyblue', 'lightcoral'], autopct='%1.1f%%', startangle=90, wedgeprops=dict(width=0.3))
plt.title('Diagnosis Distribution')
plt.show()

The distribution of each feature looks reasonable. The next step is multivariate analysis, to help decide which models to use.
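
Comparing mean and median is only a rough symmetry check: a uniform distribution also has mean close to median without being normal. Sample skewness and excess kurtosis give a more direct picture. The following is a minimal sketch on synthetic columns (not the liver dataset), using pandas' `skew()` and `kurt()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns for illustration only: one roughly uniform, one right-skewed.
toy = pd.DataFrame({
    "uniform_like": rng.uniform(20, 80, size=2000),
    "right_skewed": rng.exponential(scale=5.0, size=2000),
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),              # near 0 for a symmetric distribution
    "excess_kurtosis": toy.kurt(),   # near 0 for a normal distribution
})
print(summary.round(2))
```

On the real data, `df[['Age', 'BMI', 'AlcoholConsumption', 'PhysicalActivity', 'LiverFunctionTest']]` would take the place of `toy`. The tree-based models used later do not require normally distributed inputs, so this is a descriptive check rather than a prerequisite.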

# Multivariate Analysis

In [None]:
# Pair plot of all columns based on the different categories in the Diagnosis column.
sns.pairplot(df, hue='Diagnosis', palette='Set2', diag_kind='kde')
plt.suptitle('Pair Plot', y=1.02)
plt.show()


**Note: Open the image in New Tab for better visualization & understanding!!**

**Observation:**

Heavy overlap between the data points of the two classes is observed, so Logistic Regression and KNN are unlikely to be the best choices for this classification; they are included for comparison only.

<table style="float: left; border-collapse: collapse; width: auto;">
    <thead>
        <tr>
            <th style="text-align: left; border: 1px solid #ddd; padding: 8px; width: 1%;"><strong>Model</strong></th>
            <th style="text-align: left; border: 1px solid #ddd; padding: 8px;"><strong>Suitability with overlapping classes</strong></th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">Logistic Regression</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, but struggles with intense overlap as it finds linear decision boundaries. Errors may be high due to misclassification in overlapping regions.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">K-Nearest Neighbors (KNN)</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, handles overlap by considering the local neighborhood of points, but performance decreases with intense overlap, especially if k is not chosen carefully.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">Decision Trees</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, creates complex decision boundaries, but may overfit with intense overlap, requiring careful pruning or tuning.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">Random Forest</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, reduces overfitting by averaging multiple decision trees, making it more robust with overlap, but still requires careful tuning to balance variance and bias.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">Gradient Boosting Machines (GBM)</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, focuses on difficult cases iteratively, creating nuanced decision boundaries that handle overlap well. Requires tuning to avoid overfitting.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">XGBoost</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, handles overlap effectively with regularization and iterative improvements. Generally performs well in a variety of overlapping scenarios.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">LightGBM</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, efficient and scalable, handles large datasets and complex distributions well. Performs well with overlap, especially in large datasets.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">CatBoost</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, effective in handling categorical variables and robust in overlapping data scenarios. Performs particularly well with categorical features and requires less tuning.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">Support Vector Classifier (SVC)</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Can be used, especially with kernels (like RBF) to handle non-linear overlap, but can be computationally expensive, especially with large datasets.</td>
        </tr>
        <tr>
            <td style="border: 1px solid #ddd; padding: 8px; white-space: nowrap;">Naive Bayes</td>
            <td style="border: 1px solid #ddd; padding: 8px;">Not recommended with intense overlap, as its independence assumption often fails, leading to poor performance in overlapping scenarios.</td>
        </tr>
    </tbody>
</table>

# Check for Outliers

In [None]:
# Visualizing the data points on a box plot to check for outliers
def boxplot_with_points(df, columns):
    for column in columns:
        plt.figure(figsize=(8, 4))
        sns.boxplot(x=df[column], showfliers=False)  # Do not show fliers initially
        sns.stripplot(x=df[column], color='red', alpha=0.5)  # Overlay data points
        plt.title(f'Boxplot with Data Points for {column}')
        plt.show()

# List of columns to check
columns_for_outlier_check = ['Age', 'BMI', 'AlcoholConsumption', 'PhysicalActivity', 'LiverFunctionTest']

# Binary columns are not included in the outlier check because the concept does not apply to 0/1 values.

# Boxplots with data points
boxplot_with_points(df, columns_for_outlier_check)

All data points fall within the whiskers, so no outliers are detected and we can proceed.
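
The visual check can be backed by a count. Box plot whiskers conventionally extend 1.5 × IQR beyond the quartiles, so values outside that range are the ones a box plot would draw as fliers. The following is a minimal sketch, assuming that convention; `iqr_outlier_counts` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((frame[col] < lower) | (frame[col] > upper)).sum())
    return counts

# Made-up values; 95.0 is an implausible BMI included on purpose.
toy = pd.DataFrame({
    "BMI": [22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 95.0],
    "Age": [30, 35, 40, 45, 50, 55, 60],
})
print(iqr_outlier_counts(toy, ["BMI", "Age"]))  # {'BMI': 1, 'Age': 0}
```

On the real data the call would be `iqr_outlier_counts(df, columns_for_outlier_check)`; all zeros would agree with the box plots.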

# **Check for Multicollinearity**

In [None]:
# Before training, check the features for multicollinearity.
correlation_matrix = df.corr()

plt.figure(figsize=(8, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap')
plt.show()

The heatmap shows no problematic correlations between features, so we can proceed to model building.
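
A pairwise heatmap can miss collinearity that involves several features at once. The variance inflation factor (VIF) covers that case: for feature i it equals 1 / (1 - R²), where R² comes from regressing feature i on the other features, and it can be read off the diagonal of the inverse correlation matrix. The following is a minimal sketch on synthetic data; `variance_inflation_factors` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(frame):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)

# Synthetic features: "a_plus_noise" is nearly a copy of "a".
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.1 * rng.normal(size=500),
})
vif = variance_inflation_factors(toy)
print(vif.round(2))
```

On the real data the target would be dropped first, for example `variance_inflation_factors(df.drop(columns='Diagnosis'))`. Values near 1 indicate little collinearity; a common rule of thumb treats values above roughly 5 to 10 as worth a closer look.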

# **Test Train Split**

In [None]:
X = df.drop("Diagnosis", axis=1)
y = df["Diagnosis"]

In [None]:
# Splitting The Dataset for Training Models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
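
If the two diagnosis classes are not equally frequent, a plain random split can leave the train and test sets with slightly different class ratios. Passing `stratify` to `train_test_split` keeps the ratio the same in both. The following is a minimal sketch on synthetic labels; note that it would produce a different split from the one above, so the accuracies reported later in this notebook would not carry over unchanged.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data for illustration only: 30% positives, 70% negatives.
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 300 + [0] * 700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both shares should match the overall positive rate.
print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```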

# **Implementing Feature Scaling**

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Import all the models

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Importing Metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# **Building Models**

In [None]:
# Defining a dictionary of models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
    "Support Vector Classifier": SVC(),
    "Naive Bayes": GaussianNB()
    
}

# Function to train, evaluate, and print stats for each model
def evaluate_models(models, X_train, X_test, y_train, y_test):
    results = {}
    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train, y_train)
        
        # Predictions
        y_pred = model.predict(X_test)
        
        # Evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        
        print(f"{name} Accuracy: {accuracy:.4f}")
        print(f"{name} Classification Report:\n{report}")
        print(f"{name} Confusion Matrix:\n{conf_matrix}")
        print("-" * 50)
        
        results[name] = accuracy
    
    # Finding the best model based on accuracy
    best_model_name = max(results, key=results.get)
    best_model_accuracy = results[best_model_name]
    print(f"Best Model: {best_model_name} with Accuracy: {best_model_accuracy:.4f}")

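# Evaluating all models.
# Note: this passes the unscaled X_train / X_test; X_train_scaled and X_test_scaled
# from the feature scaling step are not used here.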
evaluate_models(models, X_train, X_test, y_train, y_test)
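
Distance-based and kernel-based models (KNN, SVC) and regularized linear models are usually sensitive to feature scale, while tree-based models are not. One way to scale only where it matters is to wrap those models in a pipeline, so the scaler is fitted on the training folds only. The following is a minimal sketch on synthetic data, assuming 5-fold cross-validation; it illustrates the pattern and says nothing about how much scaling would change the results on the liver dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; one feature is inflated to mimic mixed units.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=0
)
X_toy[:, 0] *= 1000.0

raw_knn = KNeighborsClassifier()
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_scores = cross_val_score(raw_knn, X_toy, y_toy, cv=5, scoring="accuracy")
scaled_scores = cross_val_score(scaled_knn, X_toy, y_toy, cv=5, scoring="accuracy")

print(f"KNN on raw features:    {raw_scores.mean():.3f}")
print(f"KNN with StandardScaler: {scaled_scores.mean():.3f}")
```

In the `models` dictionary this would mean replacing, for example, `KNeighborsClassifier()` with `make_pipeline(StandardScaler(), KNeighborsClassifier())`, leaving the tree-based entries as they are.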


# Model Performance Summary

This table presents the test-set accuracy of each machine learning model used to classify liver disease.

<div style="text-align: left;">

## Model Accuracies

<table style="width:auto; text-align:left;">
  <tr>
    <th><strong>Model</strong></th>
    <th><strong>Accuracy</strong></th>
    <th><strong>Justification</strong></th>
  </tr>
  <tr>
    <td>Logistic Regression</td>
    <td>0.8088</td>
    <td>Struggles with dense overlap due to its linear decision boundaries, which are inadequate for distinguishing between overlapping classes. This results in relatively lower accuracy compared to more sophisticated models.</td>
  </tr>
    
   <tr>
    <td>K-Nearest Neighbors</td>
    <td>0.7735</td>
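    <td>Performs poorly under intense overlap because it relies on local neighborhoods, which are hard to define accurately in dense regions. The features were also passed to the model unscaled, which typically hurts distance-based methods. Both limitations are consistent with its lower accuracy in this case.</td>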
  </tr>
  
  <tr>
    <td>Decision Tree</td>
    <td>0.8529</td>
    <td>Handles overlap better than linear models by creating complex, non-linear decision boundaries. It captures intricate data patterns and performed well in this case, although a single tree can still overfit dense regions.</td>
  </tr>
  <tr>
    <td>Random Forest</td>
    <td>0.9029</td>
    <td>Effective in handling dense overlap by averaging multiple decision trees, which helps it generalize better across complex data distributions. Its high accuracy makes it one of the better choices in this densely overlapped case.</td>
  </tr>
  <tr>
    <td>Gradient Boosting</td>
    <td>0.9088</td>
    <td>Handles overlap well by iteratively focusing on the most challenging cases. Its ability to refine decision boundaries in dense areas leads to improved performance and higher accuracy.</td>
  </tr>
  <tr>
    <td>XGBoost</td>
    <td>0.8912</td>
    <td>Effectively manages overlap with regularization and iterative improvements, which helps in refining the decision boundaries and handling complex data distributions, though not as well as Gradient Boosting or Random Forest.</td>
  </tr>
  <tr>
    <td>LightGBM</td>
    <td>0.8853</td>
    <td>Scalable and efficient, it manages dense overlaps effectively. However, slightly lower performance compared to XGBoost and Gradient Boosting may be due to its handling of complex data structures and distributions.</td>
  </tr>
  <tr>
    <td><strong>CatBoost</strong></td>
    <td><strong>0.9147</strong></td>
    <td><strong>Excels in managing dense overlapping data due to its robust handling of categorical features and effective algorithmic techniques. The highest accuracy reflects its superior capability to handle complex and overlapping data patterns making it the best one to use in this case.</strong></td>
  </tr>  
  <tr>
    <td>Support Vector Classifier</td>
    <td>0.7735</td>
    <td>Despite using a kernel to handle non-linear boundaries, it reached lower accuracy in this scenario. It was run with default parameters on unscaled features, and SVC is typically sensitive to both.</td>
  </tr>
  <tr>
    <td>Naive Bayes</td>
    <td>0.8029</td>
    <td>Assumes feature independence, which is often not valid in overlapping regions. This results in lower performance compared to models that can capture complex relationships between features, so it is less suitable for this case.</td>
  </tr>  
  
</table>

## Best Model

- **Best Model:** <strong>CatBoost</strong>
- **Accuracy:** <strong>0.9147</strong>

The CatBoost model achieved the highest accuracy of 0.9147, making it the best model for this classification task.

</div>


In [None]:
# Save the model
models["CatBoost"].save_model('catboost_model.cbm')

# **Predicting Liver Health Risk With Example Data**

In [None]:
from catboost import CatBoostClassifier

# Load the trained CatBoost model
model = CatBoostClassifier()
model.load_model('catboost_model.cbm')

# Sample data for prediction
sample_data = {
    'Age': [45],
    'Gender': [1],
    'BMI': [26.5],
    'AlcoholConsumption': [7],
    'Smoking': [0],
    'GeneticRisk': [1],
    'PhysicalActivity': [4],
    'Diabetes': [0],
    'Hypertension': [1],
    'LiverFunctionTest': [55]
}

# Convert sample data to DataFrame
df_sample = pd.DataFrame(sample_data)

# Predict using the loaded CatBoost model
predictions = model.predict(df_sample)
pred_proba = model.predict_proba(df_sample)

# Extract probabilities for the class
prob_disease = pred_proba[0][1] * 100

# Print the result
print(f"Prediction: {'Liver Disease' if predictions[0] == 1 else 'No Liver Disease'}")
print(f"Prediction Probability: {pred_proba[0]}")
print(f"Probability of Having Disease: {prob_disease:.2f}%")

**These models still need to be tested for overfitting and tuned; further testing and optimization are left as next steps.**
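
One simple overfitting check is to compare training accuracy with cross-validated accuracy: a large gap suggests the model is memorizing the training data. The following is a minimal sketch on synthetic data, using a decision tree with and without a depth limit; the dataset, the depth of 3, and the 5 folds are illustrative assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

results = {}
for depth in (None, 3):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv = cross_validate(
        model, X_toy, y_toy, cv=5, scoring="accuracy", return_train_score=True
    )
    train_acc = cv["train_score"].mean()
    valid_acc = cv["test_score"].mean()
    results[depth] = {"train": train_acc, "validation": valid_acc}
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"validation={valid_acc:.3f} gap={train_acc - valid_acc:.3f}")
```

The same loop can be run over the entries of the `models` dictionary with `X` and `y` in place of the synthetic arrays. Models with a large train/validation gap are the first candidates for regularization or hyperparameter tuning.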