# Predicting hepatitis C virus using machine learning
This notebook uses various Python-based machine learning and data science libraries
in an attempt to build a machine learning model capable of predicting whether or not someone has
hepatitis C based on their medical attributes.

We're going to take the following steps:

    1. Problem definition
    2. Data
    3. Evaluation
    4. Features
    5. Modelling
    6. Experimentation

# 1. Problem Definition

>Given clinical parameters about a patient, can we predict whether or not they have hepatitis C?

# 2. Data
The dataset contains laboratory values of blood donors and Hepatitis C patients and demographic values like age. 
 
Link: https://archive.ics.uci.edu/ml/datasets/HCV+data

A Kaggle version is available here:

Link: https://www.kaggle.com/datasets/fedesoriano/hepatitis-c-dataset


* Source: The original data came from the University of California Irvine (UCI) ML Repository.
* Creators: Ralf Lichtinghagen, Frank Klawonn, Georg Hoffmann
* Donor 1: Ralf Lichtinghagen; Institute of Clinical Chemistry; Medical University Hannover (MHH); Hannover, Germany; lichtinghagen.ralf '@' mh-hannover.de
* Donor 2: Frank Klawonn; Helmholtz Centre for Infection Research; Braunschweig, Germany; frank.klawonn '@' helmholtz-hzi.de
* Donor 3: Georg Hoffmann; Trillium GmbH; Grafrath, Germany; georg.hoffmann '@' trillium.de

# 3. Evaluation
> If we can reach 95% accuracy at predicting whether or not a patient has hepatitis C during the proof of concept, we'll pursue the project.
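
An accuracy target is easier to judge next to a baseline. The sketch below is not part of the original analysis: it uses made-up labels and a hypothetical helper, `majority_baseline_accuracy`, to show the accuracy a model would get by always predicting the most frequent class. If most rows in the real data are healthy blood donors, a similar check on the `Category` column would show how much of the 95% target is already covered by that trivial strategy.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical, made-up labels: 9 healthy donors (0) and 1 patient (1).
toy_labels = [0] * 9 + [1]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A model is only informative to the extent that it beats this number, which is why per-class metrics such as recall are worth reading alongside accuracy.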

# 4. Features
Content of the Data:

All attributes except Category and Sex are numerical. The laboratory data are the attributes 5-14.

* 1) X (Patient ID/No.)
* 2) Category (diagnosis) (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis',  '2=Fibrosis', '3=Cirrhosis')
* 3) Age (in years)
* 4) Sex (f,m): after the data manipulation below, the values are encoded as numbers (female=1, male=2), because the models need numeric inputs
* 5) ALB - Albumin: An albumin blood test measures the amount of albumin in your blood. Low albumin levels can be a sign of liver or kidney disease or another medical condition

* 6) ALP - Alkaline phosphatase is a protein found in all body tissues. Tissues with higher amounts of ALP include the liver, bile ducts, and bone. A blood test can be done to measure the level of ALP. A related test is the ALP isoenzyme test.

* 7) ALT - Alanine Transaminase. It is an enzyme found mostly in the liver. An ALT test measures the amount of ALT in the blood. When liver cells are damaged, they release ALT into the bloodstream.

* 8) AST - Aspartate Aminotransferase is an enzyme that is found mostly in the liver, but it's also in muscles and other organs in your body. When cells that contain AST are damaged, they release the AST into your blood

* 9) BIL - Bilirubin: a yellowish pigment formed when red blood cells break down. The liver processes bilirubin and excretes it in bile, so high blood levels can be a sign of liver disease or blocked bile ducts

* 10) CHE - Cholinesterase: an enzyme that hydrolyses choline esters such as acetylcholine. The form measured in blood (butyrylcholinesterase) is produced mainly in the liver, so low levels can be a sign of reduced liver function

* 11) CHOL - Cholesterol from food mostly ends up in the liver. If you are getting too much, this can increase your risk for fatty liver disease. High cholesterol also can turn fatty liver disease (steatosis) into a more serious and sometimes fatal condition known as nonalcoholic steatohepatitis (NASH).
* 12) CREA -  Creatinine: A compound that is excreted from the body in urine. Creatinine levels are measured to monitor kidney function
* 13) GGT - A gamma-glutamyl transferase (GGT) test measures the amount of GGT in the blood. GGT is an enzyme found throughout the body, but it is mostly found in the liver. When the liver is damaged, GGT may leak into the bloodstream. High levels of GGT in the blood may be a sign of liver disease or damage to the bile ducts.
* 14) PROT - Total protein: a total protein test measures the combined amount of albumin and globulins in the blood. Abnormal levels can be a sign of liver or kidney disease




## Preparing the tools


* matplotlib for plotting data
* numpy for numerical operations on the data
* pandas for data analysis and manipulation
* sklearn for data modelling and evaluation



In [None]:
# Import all the tools we need 

# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
#we want our plots to appear inside the notebook
%matplotlib inline 

# Models from Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score 
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2





In [None]:
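# Note: %%time is a cell magic, so it cannot be used inside an f-string.
# To time the imports, put %%time on the first line of the import cell above.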

## Data Preparation/Manipulation

In [None]:
hepatitis = pd.read_csv("data/HepatitisCdata.csv")
hepatitis

In [None]:
hepatitis.info()

In [None]:
hepatitis.isna().sum()

In [None]:
df = hepatitis.copy()
df[:5]

In [None]:
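# Recode only the Category column; assigning to df[mask] would overwrite every column of the matching rows
df.loc[df["Category"] == "0=Blood Donor", "Category"] = 0

In [None]:
df

In [None]:
df.loc[df["Category"] == "0s=suspect Blood Donor", "Category"] = 0

In [None]:
df.loc[df["Category"] == "1=Hepatitis", "Category"] = 1
df

In [None]:
df.loc[df["Category"] == "2=Fibrosis", "Category"] = 2
df.loc[df["Category"] == "3=Cirrhosis", "Category"] = 3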


In [None]:
df

In [None]:
df.isna().sum()

In [None]:
newdf = hepatitis.copy()

In [None]:
newdf.replace("0=Blood Donor", 0, inplace=True)
newdf.replace(["0s=suspect Blood Donor", 4],['1=Hepatitis', 1], inplace=True)
newdf["Category"].value_counts()

In [None]:
'0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis',  '2=Fibrosis', '3=Cirrhosis'


In [None]:
# Turning the category column to numeric values
hepatitis.replace("0=Blood Donor", 0, inplace=True)
hepatitis.replace("1=Hepatitis", 1, inplace=True)
hepatitis.replace("2=Fibrosis", 2, inplace=True)
hepatitis.replace("3=Cirrhosis", 3, inplace=True)
hepatitis.replace("0s=suspect Blood Donor", 4, inplace=True)
hepatitis
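
The same recoding can also be written as a single dictionary lookup. The sketch below is an alternative, not the notebook's method: it applies `Series.map` to a small made-up frame (`toy`), using the label-to-number mapping from the `replace()` calls above. Restricting the operation to the `Category` column avoids touching other columns that might happen to contain the same values.

```python
import pandas as pd

# Mapping taken from the replace() calls in this notebook.
category_codes = {
    "0=Blood Donor": 0,
    "1=Hepatitis": 1,
    "2=Fibrosis": 2,
    "3=Cirrhosis": 3,
    "0s=suspect Blood Donor": 4,
}

# Toy frame standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "Category": ["0=Blood Donor", "3=Cirrhosis", "0s=suspect Blood Donor"],
    "Age": [32, 58, 47],
})

toy["Category"] = toy["Category"].map(category_codes)

# map() turns labels missing from the dictionary into NaN,
# so a typo in the mapping is easy to spot.
assert toy["Category"].notna().all()
print(toy)
```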

In [None]:
# checking missing values
hepatitis.isnull().sum()

In [None]:
hepatitis.info()

In [None]:
# Find the columns which contain strings

for label, content in hepatitis.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# This will turn all of the string values into category values
for label, content in hepatitis.items():
    if pd.api.types.is_string_dtype(content):
        hepatitis[label] = content.astype("category").cat.as_ordered()

In [None]:
hepatitis.info()

In [None]:
data = hepatitis.copy()
data

In [None]:
data.isna().sum()

In [None]:
# checking numerical columns
for label, content in hepatitis.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
#checking which numerical column has missing values
for label, content in hepatitis.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Fill missing numerical columns with median 
for label, content in hepatitis.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # binary columns telling missing data
            hepatitis[label + "_is_missing"] = pd.isnull(content)
            # fill missing values with median
            hepatitis[label] = content.fillna(content.median())
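
The loop above computes each median over the whole dataset, before any train/test split. A common alternative, sketched here on made-up numbers rather than the real data, is scikit-learn's `SimpleImputer` with `strategy="median"` and `add_indicator=True`: it learns the medians from the training rows only and appends the "is missing" flags itself. The frames `train` and `test` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up); NaN marks a missing lab value.
train = pd.DataFrame({"ALB": [38.0, np.nan, 42.0, 40.0],
                      "CHOL": [5.0, 6.0, np.nan, 4.0]})
test = pd.DataFrame({"ALB": [np.nan], "CHOL": [5.5]})

imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(train)  # medians are learned from the training rows only

# Output columns: ALB, CHOL, then one missing-indicator per column
# that had missing values during fit.
test_filled = imputer.transform(test)
print(imputer.statistics_)
print(test_filled)
```

Whether this matters in practice depends on how many values are missing; with only a handful of gaps the two approaches will give very similar medians.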
        


In [None]:
hepatitis.isnull().sum()

In [None]:
hepatitis["ALB_is_missing"].value_counts()

In [None]:
# check columns that aren't numeric
for label, content in hepatitis.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)


In [None]:
# turn categorical variables into numeric 
for label,content in hepatitis.items():
    if not pd.api.types.is_numeric_dtype(content):
        # turning categories into number adding 1
        hepatitis[label] = pd.Categorical(content).codes +1
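
To see which number each label receives from `pd.Categorical(...).codes + 1`, it helps to run the same expression on a tiny example. The values below are made up; the point is only that categories are sorted before codes are assigned, and that the `+ 1` shifts the zero-based codes so that a missing value (code `-1`) would become `0`.

```python
import pandas as pd

sex = pd.Series(["m", "f", "m", "f"])  # toy values
cat = pd.Categorical(sex)

# Categories are sorted, so "f" comes before "m".
print(list(cat.categories))

# Codes start at 0 (missing values get -1), so +1 shifts them to 1, 2, ...
codes = cat.codes + 1

# Reconstruct the label -> number lookup for documentation purposes.
label_to_code = dict(zip(cat.categories, range(1, len(cat.categories) + 1)))
print(label_to_code)
```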

In [None]:
# check columns that aren't numeric
for label, content in hepatitis.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
hepatitis.info()

In [None]:
hepatitis

In [None]:
temp_hep  = hepatitis.copy()
temp_hep 

In [None]:

temp_hep

In [None]:
hepatitis

In [None]:
# deleting unwanted columns
del hepatitis["Unnamed: 0"]
hepatitis

## Exploratory Data Analysis


In [None]:
# Histogram of age
hepatitis["Age"].plot(kind = "hist", bins=10);

In [None]:
hepatitis["Age"].value_counts().sort_values(ascending=True)

In [None]:
over_30 = hepatitis[hepatitis["Age"]>30]
over_30

In [None]:
hepatitis["CHE"].min()

In [None]:
fig, ax= plt.subplots(figsize=(10, 6))
over_30.plot(kind ='scatter',
            x='Age',
            y='CHOL',
            c='Category',
            ax=ax);
ax.set_xlim((20, 80));

In [None]:
fig, ax= plt.subplots(figsize=(10, 6))
hepatitis.plot(kind ='scatter',
            x='Age',
            y='CHOL',
            c='Category',
            ax=ax);
ax.set_xlim((20, 80));

In [None]:
fig, (ax0, ax1) = plt.subplots(nrows=2,
                              ncols =1,
                              figsize = (10, 20), 
                              sharex = True)

# Add data to ax0
scatter = ax0.scatter(x = hepatitis["Age"],
                    y = hepatitis["CHOL"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax0
ax0.set(title= "Hepatitis and Cholesterol Levels",
      ylabel= "Cholesterol")
# change the x axis limits
ax0.set_xlim([20,80])
# add legend
ax0.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax0.axhline(hepatitis["CHOL"].mean(),
          linestyle = '--');

# Add data to ax1
scatter = ax1.scatter(x = hepatitis["Age"],
                    y = hepatitis["CHE"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax1
ax1.set(title= "Hepatitis and Cholinesterase(CHE)",
      xlabel= "Age",
      ylabel= "Cholinesterase")
# Change ax1 y axis limits

ax1.set_ylim([1, 20])
# add legend ax1
ax1.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax1.axhline(hepatitis["CHE"].mean(),
          linestyle = '--');




In [None]:
hepatitis

In [None]:
hepatitis["ALB"].value_counts().sum()

In [None]:
fig, (ax2, ax3) = plt.subplots(nrows=2,
                              ncols =1,
                              figsize = (10, 20), 
                              sharex = True)

# Add data to ax0
scatter = ax2.scatter(x = hepatitis["Age"],
                    y = hepatitis["ALB"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax0
ax2.set(title= "Hepatitis and Albumin(ALB)",
      ylabel= "Albumin(ALB)")
# change the x axis limits
ax2.set_xlim([20,80])
# add legend
ax2.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax2.axhline(hepatitis["ALB"].mean(),
          linestyle = '--');

# Add data to ax1
scatter = ax3.scatter(x = hepatitis["Age"],
                    y = hepatitis["ALP"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax1
ax3.set(title= "Hepatitis and Alkaline phosphatase (ALP)",
      xlabel= "Age",
      ylabel= "Alkaline phosphatase (ALP)")
# Change ax3 y axis limits

ax3.set_ylim([1,250])
# add legend ax1
ax3.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax3.axhline(hepatitis["ALP"].mean(),
          linestyle = '--');



In [None]:
fig, (ax4, ax5) = plt.subplots(nrows=2,
                              ncols =1,
                              figsize = (10, 20), 
                              sharex = True)

# Add data to ax0
scatter = ax4.scatter(x = hepatitis["Age"],
                    y = hepatitis["ALT"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax0
ax4.set(title= "Hepatitis and Alanine Transaminase(ALT)",
      ylabel= "Alanine Transaminase(ALT)")
# change the x axis limits
ax4.set_xlim([20,80])
ax4.set_ylim([0, 150])
# add legend
ax4.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax4.axhline(hepatitis["ALT"].mean(),
          linestyle = '--');

# Add data to ax1
scatter = ax5.scatter(x = hepatitis["Age"],
                    y = hepatitis["AST"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax1
ax5.set(title= "Hepatitis and Aspartate Aminotransferase(AST)",
      xlabel= "Age",
      ylabel= "Aspartate Aminotransferase(AST)")
# Change ax5 y axis limits

ax5.set_ylim([1,200])
# add legend ax1
ax5.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax5.axhline(hepatitis["AST"].mean(),
          linestyle = '--');



In [None]:
fig, (ax6, ax7) = plt.subplots(nrows=2,
                              ncols =1,
                              figsize = (10, 20), 
                              sharex = True)

# Add data to ax0
scatter = ax6.scatter(x = hepatitis["Age"],
                    y = hepatitis["BIL"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax0
ax6.set(title= "Hepatitis and BIL",
      ylabel= "BIL")
# change the x axis limits
ax6.set_xlim([20,80])
ax6.set_ylim([0,70])
# add legend
ax6.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax6.axhline(hepatitis["BIL"].mean(),
          linestyle = '--');

# Add data to ax1
scatter = ax7.scatter(x = hepatitis["Age"],
                    y = hepatitis["CREA"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax1
ax7.set(title= "Hepatitis and Creatinine(CREA)",
      xlabel= "Age",
      ylabel= "Creatinine(CREA)")
# Change ax7 y axis limits

ax7.set_ylim([0,200])
# add legend ax1
ax7.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax7.axhline(hepatitis["CREA"].mean(),
          linestyle = '--');



In [None]:
fig, (ax8, ax9) = plt.subplots(nrows=2,
                              ncols =1,
                              figsize = (10, 20), 
                              sharex = True)

# Add data to ax0
scatter = ax8.scatter(x = hepatitis["Age"],
                    y = hepatitis["GGT"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax0
ax8.set(title= "Hepatitis and Gamma-glutamyl Transferase (GGT)",
      ylabel= "Gamma-glutamyl Transferase (GGT)")
# change the x axis limits
ax8.set_xlim([20,80])
ax8.set_ylim([0,500])
# add legend
ax8.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax8.axhline(hepatitis["GGT"].mean(),
          linestyle = '--');

# Add data to ax1
scatter = ax9.scatter(x = hepatitis["Age"],
                    y = hepatitis["PROT"],
                    c = hepatitis["Category"],
                     cmap = 'tab10')
#customize ax1
ax9.set(title= "Hepatitis and PROT",
      xlabel= "Age",
      ylabel= "PROT")
# Change ax9 y axis limits

ax9.set_ylim([40,95])
# add legend ax1
ax9.legend(*scatter.legend_elements(), title= "Category")

# Add a horizontal line

ax9.axhline(hepatitis["PROT"].mean(),
          linestyle = '--');
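
The plotting cells above repeat the same pattern for each lab value. One possible way to cut the repetition, shown here as a sketch on made-up data, is a small helper. The name `scatter_by_category` and the frame `toy` are hypothetical and not part of the original notebook. Because the helper takes the column name once, the dashed mean line is always computed from the same column that is plotted.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_by_category(ax, df, y_col, title, ylim=None):
    """Scatter Age against one lab value, coloured by Category, with a dashed mean line."""
    scatter = ax.scatter(df["Age"], df[y_col], c=df["Category"], cmap="tab10")
    ax.set(title=title, ylabel=y_col)
    if ylim is not None:
        ax.set_ylim(ylim)
    ax.legend(*scatter.legend_elements(), title="Category")
    # Mean of the same column that is plotted on the y axis
    ax.axhline(df[y_col].mean(), linestyle="--")
    return scatter


# Toy frame (made-up values) standing in for the `hepatitis` DataFrame.
toy = pd.DataFrame({
    "Age": [30, 45, 60, 52],
    "Category": [0, 0, 1, 3],
    "BIL": [5.0, 7.5, 20.0, 40.0],
    "CREA": [70.0, 80.0, 75.0, 90.0],
})

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 12), sharex=True)
for ax, col in zip(axes, ["BIL", "CREA"]):
    scatter_by_category(ax, toy, col, title=f"Hepatitis and {col}")
axes[-1].set_xlabel("Age")
```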


In [None]:
hepatitis

In [None]:
hepatitis.corr()

In [None]:
corr_metrix = hepatitis.corr()
import seaborn as sns

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax = sns.heatmap(corr_metrix,
                annot =True,
                linewidths = 0.5,
                fmt= ".2f",
                cmap = "YlGnBu" );
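
A full heatmap can be hard to scan when the question is simply which features move with the target. The sketch below uses made-up numbers in a hypothetical `toy` frame, not the real data. It pulls out the `Category` column of the correlation matrix and sorts it by absolute value. Keep in mind that `Category` is a coded label rather than a true quantity, so these correlations are only a rough guide.

```python
import pandas as pd

# Toy frame (made-up numbers) standing in for the `hepatitis` DataFrame.
toy = pd.DataFrame({
    "Category": [0, 0, 1, 1, 3, 3],
    "AST": [20.0, 25.0, 60.0, 70.0, 110.0, 120.0],
    "ALB": [45.0, 44.0, 40.0, 41.0, 30.0, 32.0],
    "Age": [30, 50, 40, 60, 35, 55],
})

# Correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["Category"]
    .drop("Category")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```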

## Modelling

In [None]:
hepatitis_df = hepatitis.copy()
hepatitis_df

In [None]:
del hepatitis_df["ALB_is_missing"]
del hepatitis_df[ "ALP_is_missing"]
del hepatitis_df["ALT_is_missing"]
del hepatitis_df["CHOL_is_missing"]
del hepatitis_df["PROT_is_missing"]

In [None]:
hepatitis_df

In [None]:
# Splitting data into features (x) and label (y)

x = hepatitis_df.drop("Category", axis = 1)
y = hepatitis_df["Category"]

In [None]:
# Splitting data into train and test sets
np.random.seed(42)

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
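
When some classes are rare, a plain random split can leave very few of them in the test set. The sketch below is an optional variation, not what the notebook does: on made-up labels it passes `stratify` and `random_state` to `train_test_split`, so that both splits keep the class ratio and the split is reproducible without relying on the global NumPy seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up): 90 of class 0, 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both splits keep the 90/10 class ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```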

In [None]:
x_train

In [None]:
x_test

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape


Comparing 3 different machine learning models for our data:
1. Logistic Regression
2. K-Nearest Neighbours
3. Random Forest Classifier

In [None]:
# Put models in a dictionary

models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, x_train, x_test, y_train, y_test):
    """
    Fit and evaluate the given machine learning models.
    models : a dict of different sklearn ML models
    x_train : training data (no labels)
    x_test : testing data (no labels)
    y_train : training labels
    y_test : testing labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(x_train, y_train)
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(x_test, y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models = models,
                            x_train = x_train,
                            x_test = x_test,
                            y_train = y_train,
                            y_test = y_test)

model_scores

In [None]:
model_compare = pd.DataFrame(model_scores, index= ["accuracy"])
model_compare.T.plot.bar();
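
K-nearest neighbours and logistic regression are both sensitive to the scale of the features, and the lab values here are measured in different units. As a sketch of one way to account for that, the example below wraps `KNeighborsClassifier` in a pipeline with `StandardScaler` and scores it with 5-fold cross-validation. It uses synthetic stand-in data, not the hepatitis data. The names `X_demo`, `y_demo` and `scaled_knn` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would use x_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale, as with mixed lab units

# The scaler is re-fitted inside each cross-validation fold,
# so the held-out fold never influences the scaling.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)
print(scores.mean())
```

The same pipeline object could be dropped into the `models` dictionary in place of the bare estimator.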

## Hyperparameter tuning with RandomizedSearchCV

Tuning
* RandomForestClassifier()
* LogisticRegression()

using RandomizedSearchCV

In [None]:
# Create a hyperparameter grid for logisticRegression
log_reg_grid ={"C": np.logspace(-4, 4, 20),
              "solver" : ["liblinear"]}

# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
          "max_depth": [None, 3, 5, 10],
          "min_samples_split": np.arange(2, 20, 2),
          "min_samples_leaf": np.arange(1, 20, 2)}

In [None]:


# setup random hyperparameter search for a LogisticRegression

rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                               param_distributions = log_reg_grid,
                               cv = 5,
                               n_iter = 20,
                               verbose = True,
                               random_state= 42)

# fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(x_train, y_train)


In [None]:
rs_log_reg.best_params_

In [None]:

#  Setup random hyperparameter search for RandomForestClassifier()
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                          param_distributions = rf_grid,
                          cv = 5,
                          n_iter = 20,
                          verbose = True,
                          random_state=42)

rs_rf.fit(x_train, y_train)

In [None]:
rs_rf.best_params_

In [None]:
# Evaluating the randomizedsearched  model 
rs_log_reg.score(x_test, y_test)


In [None]:
rs_rf.score(x_test, y_test)
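
It can be useful to know how large each search space is compared with `n_iter = 20`. The sketch below rebuilds the two grids from this section and counts their combinations with scikit-learn's `ParameterGrid`. The variable names `n_log_reg` and `n_rf` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as defined earlier in this section.
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# Number of distinct hyperparameter combinations in each grid.
n_log_reg = len(ParameterGrid(log_reg_grid))
n_rf = len(ParameterGrid(rf_grid))
print(n_log_reg, n_rf)
```

The logistic regression grid has only as many combinations as `n_iter`, so the randomized search should already try all of them. The random forest grid is far larger, so 20 iterations sample only a small fraction of it.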

## Hyperparameter tuning with GridSearchCV


In [None]:
%%time
# Different Hyperparameters for our LogisticRegression model 

log_reg_gs={"C": np.logspace(-4, 4, 30),
              "solver" : ["liblinear"]}



rf_gs = {"n_estimators": np.arange(10, 1000, 50),
          "max_depth": [None, 3, 5, 10],
          "min_samples_split": np.arange(2, 20, 2),
          "min_samples_leaf": np.arange(1, 20, 2)}



In [None]:
# Set up a grid hyperparameter search for LogisticRegression
gs_log_model = GridSearchCV(LogisticRegression(),
                          param_grid = log_reg_gs,
                          cv = 5,
                          verbose = True)

gs_log_model.fit(x_train, y_train)

In [None]:
gs_log_model.best_params_

In [None]:
gs_log_model.score(x_test, y_test)
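
By default `GridSearchCV` selects the best candidate by accuracy for classifiers. If the classes are imbalanced, a different `scoring` value may be a better selection criterion. The sketch below is an optional variation on synthetic data, not the notebook's method, and uses `scoring="balanced_accuracy"`. The data and the small `C` grid are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (roughly 90% / 10%) as a stand-in.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": np.logspace(-2, 2, 5)},
    scoring="balanced_accuracy",  # mean of per-class recall
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```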

## Since the logistic regression model gave the best score, let's focus on it: refit it with the best parameters and save the model for prediction

In [None]:
# Instantiating a LogisticRegression classifier using the best hyperparameters from RandomizedSearchCV
# best_para = {'solver': 'liblinear', 'C': 1.3738237958832638}
clf_hepatitis_model = LogisticRegression(solver = 'liblinear', C =  1.3738237958832638)

# Fitting the new instance of LogisticRegression with the best hyperparameters on the training data
clf_hepatitis_model.fit(x_train, y_train)
clf_hepatitis_model.score(x_test, y_test)
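
The notebook does not show how the model is saved. One common option, sketched here under the assumption that `joblib` is installed (it normally comes with scikit-learn), is `joblib.dump` and `joblib.load`. The example trains a throwaway model on synthetic data and writes it to a temporary folder; the file name `hepatitis_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Throwaway model on synthetic data; in the notebook this would be clf_hepatitis_model.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "hepatitis_model.joblib")
joblib.dump(demo_model, model_path)

# Reload and confirm the restored model makes the same predictions.
loaded_model = joblib.load(model_path)
same_predictions = bool((loaded_model.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same_predictions)
```

A saved model should be loaded with the same scikit-learn version that created it.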

In [None]:
# Make predictions on test data 
y_preds = clf_hepatitis_model.predict(x_test)

In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_preds)

In [None]:
# Creating a more visual confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt


sns.heatmap(confusion_matrix(y_test, y_preds), annot=True, cmap='GnBu')
sns.set(font_scale=2)
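
With several classes it is easy to lose track of which row and column of the confusion matrix belong to which label. The sketch below uses made-up predictions and wraps the matrix in a labelled DataFrame. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, in the order given by `labels`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes.
y_true_demo = [0, 0, 0, 1, 1, 3]
y_pred_demo = [0, 0, 1, 1, 0, 3]
labels = [0, 1, 3]

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {label}" for label in labels],
    columns=[f"pred {label}" for label in labels],
)
print(cm_df)
```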

In [None]:
# Classification Report 
from sklearn.metrics import classification_report
class_report = classification_report(y_test, y_preds)
print(class_report)

In [None]:
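# ROC (receiver operating characteristic) curve & AUC (area under curve)
# plot_roc_curve was removed in scikit-learn 1.2 and only handled binary targets,
# so plot a one-vs-rest curve for the first class (the lowest label, blood donors).
from sklearn.metrics import RocCurveDisplay

donor_label = clf_hepatitis_model.classes_[0]
donor_proba = clf_hepatitis_model.predict_proba(x_test)[:, 0]
RocCurveDisplay.from_predictions((y_test == donor_label).astype(int), donor_proba);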
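
Because `Category` has more than two classes, a single AUC number needs an averaging scheme. As a sketch on synthetic three-class data (not the hepatitis data), `roc_auc_score` can compute a one-vs-rest AUC for each class from the predicted probabilities and then average them. All names below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one column per class, each row sums to 1

# One-vs-rest AUC per class, averaged with equal weight per class.
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(round(macro_auc, 3))
```

This requires every class to appear in the test labels, which is one more reason to consider a stratified split when some classes are rare.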