Download the dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Hepatitis

### Data Science Project

Question
Predict whether a patient with hepatitis will live or die, based on the recorded parameters, using machine learning.

### Workflow
- Data Preprocessing
- EDA
- Feature Selection
- Build Model
- Interpret Model
- Serialization
- Deploy to production with Streamlit

## Load packages

In [1]:
import pandas as pd 
import numpy as np

# visualization package
import matplotlib.pyplot as plt
import seaborn as sns 

### Data Attribute Information:

1. Class: DIE, LIVE
2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
3. SEX: male, female
4. STEROID: no, yes
5. ANTIVIRALS: no, yes
6. FATIGUE: no, yes
7. MALAISE: no, yes
8. ANOREXIA: no, yes
9. LIVER BIG: no, yes
10. LIVER FIRM: no, yes
11. SPLEEN PALPABLE: no, yes
12. SPIDERS: no, yes
13. ASCITES: no, yes
14. VARICES: no, yes
15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
17. SGOT: 13, 100, 200, 300, 400, 500,
18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
20. HISTOLOGY: no, yes

## Load Dataset

In [2]:
# raw string so that backslashes in the Windows path are not treated as escape sequences
hepatitis_df = pd.read_csv(r'G:\ML\hepatitis\train-model\data\hepatitis.data')

FileNotFoundError: [Errno 2] File b'G:\\ML\\hepatitis\train-model\\data\\hepatitis.data' does not exist: b'G:\\ML\\hepatitis\train-model\\data\\hepatitis.data'

In [None]:
# Preview the first five rows
hepatitis_df.head()

In [None]:
# add the columns names
col_names = ["Class","AGE","SEX","STEROID","ANTIVIRALS","FATIGUE","MALAISE","ANOREXIA","LIVER BIG","LIVER FIRM",
             "SPLEEN PALPABLE","SPIDERS","ASCITES","VARICES","BILIRUBIN","ALK PHOSPHATE","SGOT","ALBUMIN","PROTIME",
             "HISTOLOGY"]

In [None]:
# raw string: in a normal string the "\t" in "\train-model" would be read as a tab character
hepatitis_df = pd.read_csv(r'G:\ML\hepatitis\train-model\data\hepatitis.data', names=col_names)

In [None]:
# Preview the first five rows
hepatitis_df.head()

In [None]:
# convert column names to lowercase and replace spaces with underscores
hepatitis_df.columns = hepatitis_df.columns.str.lower().str.replace(' ', '_')

In [None]:
# Preview the first five rows
hepatitis_df.head()

In [None]:
# Check out Data Types, memory usage, no of rows
hepatitis_df.info()

In [None]:
# size of the data 
hepatitis_df.shape

In [None]:
# check for missing value
hepatitis_df.isna().sum()

In [None]:
# check the number of unique values
for cols in hepatitis_df.columns:
    print(cols.capitalize(), 'has', hepatitis_df[cols].nunique(), 'unique values: \n', hepatitis_df[cols].unique())

**As we can see, many columns contain question marks as values.**

**These stand for missing data, so we need to replace them with a usable value.**

In [None]:
# Replace each question mark with the most frequent value in its column
# (if '?' itself is the most frequent, fall back to the second most frequent)
for cols in hepatitis_df.columns:
    count = hepatitis_df[cols].value_counts().index
    if count[0] != '?':
        hepatitis_df[cols] = hepatitis_df[cols].replace('?', count[0])
    else:
        print(cols)
        hepatitis_df[cols] = hepatitis_df[cols].replace('?', count[1])
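An alternative to the loop above is to let pandas treat `?` as missing while reading the file, then fill each column with its mode. The sketch below uses a small made-up table rather than the real file; `raw`, `toy_df` and `toy_filled` are illustrative names only.

```python
import io
import pandas as pd

# Toy stand-in for the raw file: "?" marks a missing value (made-up rows).
raw = io.StringIO("2,30,1.0\n1,?,0.9\n2,50,?\n2,30,0.9\n")
toy_df = pd.read_csv(raw, names=["class", "age", "bilirubin"], na_values="?")

# "?" is parsed as NaN, so the columns are numeric from the start.
print(toy_df.isna().sum())

# Fill each column with its most frequent value (the mode).
toy_filled = toy_df.fillna(toy_df.mode().iloc[0])
print(toy_filled)
```

Because the columns are already numeric after reading, the later `astype` conversions would only be needed to turn float columns back into integers.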

In [None]:
# check if there is any question mark
for cols in hepatitis_df.columns:
    print(cols.capitalize(), 'has', hepatitis_df[cols].nunique(), 'unique values: \n', hepatitis_df[cols].unique())

In [None]:
# change these two columns' data type to float
hepatitis_df[['bilirubin','albumin']] = hepatitis_df[['bilirubin','albumin']].astype(float)

In [None]:
# change the data type to int
for cols in hepatitis_df.columns:
    if cols != 'bilirubin' and cols != 'albumin':
        hepatitis_df[cols] = hepatitis_df[cols].astype(int)

In [None]:
# check the data types
hepatitis_df.dtypes

In [None]:
### Count of non-null values in each column
hepatitis_df.count()

## Exploratory Data Analysis

In [None]:
# statistical summary of the dataset
hepatitis_df.describe()

In [None]:
# Value counts for target variable
# DIE: 1
# LIVE: 2

hepatitis_df['class'].value_counts()

In [None]:
# plot the count
hepatitis_df['class'].value_counts().plot(kind='bar')

In [None]:
### How many are males(1) and females(2)
hepatitis_df['sex'].unique()

In [None]:
# check the count
hepatitis_df['sex'].value_counts()

In [None]:
# outcome (class) counts compared by gender
sns.countplot(x='sex', hue='class', data=hepatitis_df)

In [None]:
# plot the count
hepatitis_df['sex'].value_counts().plot.bar()

- There are more males than females in our dataset

In [None]:
### check Age Range
hepatitis_df.groupby(['age', 'sex']).size()

## Frequency Distribution Table using the Age Range

In [None]:
hepatitis_df['age'].agg(['max', 'min'])

#### Bucket the age

In [None]:
labels = ["Less than 10","10-20","20-30","30-40","40-50","50-60","60-70","70 and more"]
bins= [0,10,20,30,40,50,60,70,80]

frequency =  hepatitis_df.groupby(pd.cut(hepatitis_df['age'], bins=bins, labels=labels)).size()

frequency
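One detail worth checking when bucketing: by default `pd.cut` uses right-closed intervals, so an age that sits exactly on a bin edge falls into the lower bucket. The sketch below illustrates this on a few made-up ages (not the dataset) and compares it with `right=False`.

```python
import pandas as pd

ages = pd.Series([10, 11, 30, 31, 78])  # made-up ages
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ["Less than 10", "10-20", "20-30", "30-40",
          "40-50", "50-60", "60-70", "70 and more"]

# Default right=True: each bin is (left, right], so 10 lands in the first bucket.
right_closed = pd.cut(ages, bins=bins, labels=labels)
# right=False: each bin is [left, right), so 10 lands in "10-20".
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages,
                    "right=True": right_closed,
                    "right=False": left_closed}))
```

Which convention fits better depends on how the labels are meant to be read; note that with `right=False` a value equal to the last edge (80 here) would fall outside every bin.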

In [None]:
frequency = frequency.reset_index(name='count')
frequency

In [None]:
frequency.plot(kind='bar')

In [None]:
frequency.plot.line()

- The highest prevalence of hepatitis is in the 30-40 age group, followed by 40-50
- The lowest is among individuals under 10 and the elderly above 70
-----------------------------------------------------------------------

#### Checking for Outliers
- Univariate Analysis
- Multivariate Analysis

#### Methods
- Boxplot (Uni)
- Scatterplot (Multi)
- Z-score
- IQR (Interquartile Range)
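The Z-score method from the list above is not demonstrated elsewhere in this notebook, so here is a minimal sketch on made-up values. The cut-off of 3 is a common convention, not something derived from this dataset.

```python
import numpy as np
import pandas as pd

# Made-up values with one obvious extreme point at the end.
values = pd.Series([1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1,
                    0.9, 1.0, 1.2, 1.0, 0.9, 1.1, 1.0, 8.0])

# z = (x - mean) / std; ddof=0 uses the population standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=0)

# Flag points whose absolute z-score exceeds the chosen threshold.
outliers = values[np.abs(z_scores) > 3]
print(outliers)
```

On very small samples a single extreme point inflates the standard deviation enough that it may never cross the threshold, which is one reason the IQR rule is often preferred there.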

In [None]:
# box plot
for cols in hepatitis_df.columns:
    sns.boxplot(x=hepatitis_df[cols])
    plt.show()

In [None]:
# Scatterplot
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=hepatitis_df['age'], y=hepatitis_df['albumin'])

In [None]:
# Scatter plot
sns.scatterplot(x='albumin',y='age',hue='sex', palette=['green','red'],data=hepatitis_df)

#### using Interquartile Range(IQR)
- Also called H-spread or mid-spread
- Measures the statistical dispersion/spread
- IQR = Quartile 3 (75th percentile) - Quartile 1 (25th percentile)

In [None]:
# first quartile (25th percentile) and third quartile (75th percentile)
q1 = hepatitis_df.quantile(.25)
q3 = hepatitis_df.quantile(.75)

IQR = q3 - q1

IQR

In [None]:
# flag values outside the 1.5 * IQR fences
(hepatitis_df < (q1 - 1.5 * IQR)) | (hepatitis_df > (q3 + 1.5 * IQR))

- The data points marked True are outliers

#### Solution
- Remove
- Change
- Ignore

In [None]:
# hepatitis_df = hepatitis_df[~((hepatitis_df < (q1 - 1.5 * IQR)) | (hepatitis_df > (q3 + 1.5 * IQR))).any(axis=1)]

In [None]:
# hepatitis_df.shape
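The commented-out cell above corresponds to the "Remove" option. As a sketch of the "Change" option, outliers can instead be capped at the IQR fences so that no rows are lost. The values below are made up, and capping is only one possible way of changing them.

```python
import pandas as pd

# Made-up lab values; 9.5 lies far above the rest.
toy = pd.DataFrame({"bilirubin": [0.7, 0.9, 1.0, 1.2, 1.5, 9.5]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the fences instead of dropping the rows.
clipped = toy.clip(lower=lower, upper=upper, axis=1)
print(clipped)
```

This keeps the row count unchanged, which matters on a small dataset, at the cost of distorting the tail of the distribution.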

### Feature Selection and Importance

- SelectKBest
   - Keeps the features with the strongest relation to the output/target
- Recursive Feature Elimination

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
# separate predictors and target

xfeatures = hepatitis_df[['age', 'sex', 'steroid', 'antivirals', 'fatigue', 'malaise',
       'anorexia', 'liver_big', 'liver_firm', 'spleen_palpable', 'spiders',
       'ascites', 'varices', 'bilirubin', 'alk_phosphate', 'sgot', 'albumin',
       'protime', 'histology']]
ylabels = hepatitis_df['class']

In [None]:
# feature selection: keep the 10 features with the highest chi-squared scores
chi_sq = SelectKBest(chi2, k=10)
best_features = chi_sq.fit(xfeatures, ylabels)

In [None]:
# 10 selected features
best_features = xfeatures.columns[best_features.get_support()]
best_features
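To see how strongly each feature scored rather than only which ones were kept, the fitted selector exposes a `scores_` array. The sketch below uses a made-up two-column table where one column tracks the label and the other does not; the names are illustrative only.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up non-negative features: "informative" tracks the label, "noise" does not.
X_toy = pd.DataFrame({
    "informative": [9, 8, 9, 7, 1, 2, 1, 3],
    "noise":       [5, 4, 5, 4, 5, 4, 5, 4],
})
y_toy = [1, 1, 1, 1, 2, 2, 2, 2]

selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# scores_ holds one chi-squared statistic per column; sort so the highest comes first.
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(scores)
```

Note that `chi2` requires non-negative feature values, which holds for the columns used in this notebook.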

#### Recursive Feature Elimination

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfe_selector = RFE(RandomForestClassifier(), n_features_to_select=10)
rfe_selector.fit(xfeatures, ylabels)

In [None]:
# selected features
rfe_features = xfeatures.columns[rfe_selector.support_]
rfe_features

### Checking for Feature Importance
- ExtraTreesClassifier
- Shows which features are important

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
# initialize and fit the data
et_clf = ExtraTreesClassifier()
et_clf.fit(xfeatures,ylabels)

In [None]:
# print the feature importances
print(et_clf.feature_importances_)

In [None]:
feature_imporance_df = pd.Series(et_clf.feature_importances_,index=xfeatures.columns)
feature_imporance_df

In [None]:
feature_imporance_df = feature_imporance_df.nlargest(n=10)
feature_imporance_df

In [None]:
# plot the importance
feature_imporance_df.plot(kind='barh')

In [None]:
# Correlation
hepatitis_df.corr()

In [None]:
# Heatmap for Correlation
plt.figure(figsize=(18,10))
sns.heatmap(xfeatures.corr(), annot=True)

## Model Building
- Feature & Labels
- Train/Test Split
- LogisticRegression, DT, RF, SVC
- Serialize

In [None]:
# load packages
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Metrics to evaluate model
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV


In [None]:
# we will use the top 10 features ranked by the Extra Trees classifier
hepatitis_df_final = hepatitis_df[feature_imporance_df.index]

hepatitis_df_final.head()

In [None]:
# Independent and Dependent variable

x = hepatitis_df_final
y = hepatitis_df['class']

In [None]:
# train test split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=7)
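When the two classes are unevenly represented, a plain random split can shift the class ratio between the training and test sets. One option is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels with an assumed 20/80 imbalance, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 20 of class 1 and 80 of class 2.
y_toy = np.array([1] * 20 + [2] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 20/80 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=7, stratify=y_toy
)

print("train share of class 1:", (y_tr == 1).mean())
print("test share of class 1:", (y_te == 1).mean())
```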

In [None]:
# candidate models to compare with cross-validation
lg = LogisticRegression()
rf = RandomForestClassifier(n_estimators=130)
dt = DecisionTreeClassifier()

models = [lg, rf, dt]

In [None]:
score = []
for i in models:
    cv = cross_val_score(estimator=i ,X=x_train, y=y_train, cv=5, scoring='accuracy', n_jobs=-1)
    score.append(cv)
    print(cv)
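The loop above prints one array of fold scores per model. For comparing models it can be easier to read the mean and spread of each array. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the hepatitis set); `candidates` and `summary` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the hepatitis set).
X_toy, y_toy = make_classification(n_samples=150, n_features=10, random_state=7)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=7),
}

summary = {}
for name, estimator in candidates.items():
    # One accuracy score per fold, then summarised as mean and standard deviation.
    fold_scores = cross_val_score(estimator, X_toy, y_toy, cv=5, scoring="accuracy")
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```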

## Building Models

### 1. Logistic Regression

In [None]:
logreg = LogisticRegression(max_iter=120)
logreg.fit(x_train,y_train)

# predict 
y_pedict_lr = logreg.predict(x_test)

In [None]:
print('The precision score  on Test is : ',round(precision_score(y_test, y_pedict_lr) * 100,2), '%')
print('The accuracy score  on Test is : ',round(accuracy_score(y_test, y_pedict_lr) * 100,2), '%')
print ('\n\n Confusion Matrix TEST:\n', confusion_matrix(y_test, y_pedict_lr))
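Because the target is coded 1 (DIE) and 2 (LIVE), `precision_score` with its default `pos_label=1` reports precision for the DIE class. The sketch below uses made-up labels to show how to request either class explicitly and how `classification_report` lists both.

```python
from sklearn.metrics import classification_report, precision_score

# Made-up labels using the notebook's coding: 1 = DIE, 2 = LIVE.
y_true = [1, 1, 2, 2, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 2, 1, 2]

# Precision with each class in turn treated as the "positive" one.
precision_die = precision_score(y_true, y_pred, pos_label=1)
precision_live = precision_score(y_true, y_pred, pos_label=2)
print(precision_die, precision_live)

# Per-class precision, recall and F1 in one table.
print(classification_report(y_true, y_pred, target_names=["DIE", "LIVE"]))
```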

### 2. Decision Tree

In [None]:
parms = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1,2,3,4,5],
    'min_samples_split': [3,4,5,6],
    'min_samples_leaf': [3,4,5,6],
    'max_features': [1,2,3,4],
    'random_state': [20,24,36]
}

gscv = GridSearchCV(DecisionTreeClassifier(), param_grid=parms, cv=5, scoring='accuracy')

gscv.fit(x_train,y_train)

gscv.best_estimator_
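Instead of copying the winning hyperparameters by hand, the fitted search object can be queried directly. The sketch below runs a deliberately small grid on synthetic data (a stand-in, not the hepatitis set); `small_grid`, `search` and `best_tree` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the hepatitis set).
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=7)

small_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [3, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=36),
                      param_grid=small_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)  # winning combination from the grid
print(search.best_score_)   # its mean cross-validated accuracy

# With the default refit=True, best_estimator_ is already refit on all data passed to fit().
best_tree = search.best_estimator_
print(best_tree.predict(X_toy[:5]))
```

Using `best_estimator_` avoids transcription mistakes when the grid or the data changes.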

In [None]:
# fit a decision tree with hyperparameters taken from the grid above
dtc = DecisionTreeClassifier(max_depth=5, max_features=1, min_samples_leaf=3, min_samples_split=3, random_state=36)

dtc.fit(x_train,y_train)

# predict 
y_pedict_dtc = dtc.predict(x_test)

In [None]:
print('The precision score  on Test is : ',round(precision_score(y_test, y_pedict_dtc) * 100,2), '%')
print('The accuracy score  on Test is : ',round(accuracy_score(y_test, y_pedict_dtc) * 100,2), '%')
print ('\n\n Confusion Matrix TEST:\n', confusion_matrix(y_test, y_pedict_dtc))

### 3. Random Forest

In [None]:
rfc = RandomForestClassifier(n_estimators=160, min_samples_leaf=4)

rfc.fit(x_train,y_train)

# predict 
y_pedict_rfc = rfc.predict(x_test)

In [None]:
print('The precision score  on Test is : ',round(precision_score(y_test, y_pedict_rfc) * 100,2), '%')
print('The accuracy score  on Test is : ',round(accuracy_score(y_test, y_pedict_rfc) * 100,2), '%')
print ('\n\n Confusion Matrix TEST:\n', confusion_matrix(y_test, y_pedict_rfc))

### 4. Support vector machine

In [None]:
sv = SVC(C=0.9, kernel='linear', probability=True)

sv.fit(x_train,y_train)

# predict 
y_pedict_sv = sv.predict(x_test)

In [None]:
print('The precision score  on Test is : ',round(precision_score(y_test, y_pedict_sv) * 100,2), '%')
print('The accuracy score  on Test is : ',round(accuracy_score(y_test, y_pedict_sv) * 100,2), '%')
print ('\n\n Confusion Matrix TEST:\n', confusion_matrix(y_test, y_pedict_sv))

### Save Our Model
- Serialization
- Pickle
- Joblib
- numpy/json/ray

In [None]:
# Using Joblib
import joblib

In [None]:
model_file = open("logistic_regression.pkl","wb")
joblib.dump(logreg,model_file)
model_file.close()

In [None]:
model_file_rfc = open("Random_forest_model.pkl","wb")
joblib.dump(rfc,model_file_rfc)
model_file_rfc.close()

In [None]:
model_file_dtc = open("decision_tree_clf_model.pkl","wb")
joblib.dump(dtc,model_file_dtc)
model_file_dtc.close()

In [None]:
model_file_sv = open("support_vector_model.pkl","wb")
joblib.dump(sv,model_file_sv)
model_file_sv.close()
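A quick way to check a saved model is to load it back and confirm that it reproduces the original predictions. The sketch below does this with a synthetic stand-in model written to a temporary directory; the file name `toy_model.pkl` is hypothetical. It also passes a path straight to `joblib.dump`, which avoids opening and closing the file by hand.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (not the notebook's trained model).
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=7)
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    joblib.dump(toy_model, path)   # joblib accepts a file path directly
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print("predictions match:", same)
```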

In [None]:
# Create Decision Tree Plot
from IPython.display import Image
from sklearn import tree
import pydotplus

In [None]:
feature_names_best = x.columns
target_names = ["Die","Live"]

In [None]:
# Create A Dot Plot
dot_data = tree.export_graphviz(dtc, out_file=None, feature_names=feature_names_best, class_names=target_names)

In [None]:
# Draw a graph
graph = pydotplus.graph_from_dot_data(dot_data)

In [None]:
Image(graph.create_png())

In [None]:
# Save the plot
graph.write_png("hep_decisition_tree_plot.png")

### Interpret Model & Evaluate
- Eli5
- Lime

### Using Lime

In [None]:
# Interpreting with Lime
import lime
import lime.lime_tabular

In [None]:
# Methods and Attributes
dir(lime)

#### Create Lime Explainer
- LimeTabularExplainer = Tables
- LimeTextExplainer = Text
- LimeImageExplainer = Images

In [None]:
feature_names_best

In [None]:
target_names

In [None]:
class_names = ["Die(1)","Live(2)"]

In [None]:
# Create Explainer
explainer = lime.lime_tabular.LimeTabularExplainer(x.values,
                                                   feature_names=feature_names_best,
                                                   class_names=class_names,
                                                   discretize_continuous=True)

In [None]:
x_test.iloc[1]

In [None]:
logreg.predict(np.array(x_test.iloc[1]).reshape(1,-1))

In [None]:
# pass the row as a 1-D NumPy array, which is what explain_instance expects
exp = explainer.explain_instance(x_test.iloc[1].values, logreg.predict_proba, num_features=10, top_labels=1)

In [None]:
exp.show_in_notebook(show_table=True,show_all=False)

In [None]:
# Explanation as list
exp.as_list()

### Using Eli5

In [None]:
import eli5

In [None]:
# Show how each feature contributes
eli5.show_weights(logreg,top=10)

In [None]:
feature_names_best = list(feature_names_best)

# Show how each feature contributes
eli5.show_weights(logreg, feature_names=feature_names_best, target_names=class_names)

In [None]:
# Explain a single prediction, showing each feature's value and contribution
eli5.show_prediction(logreg,x_test.iloc[1], show_feature_values=True)

In [None]:
# check the weights
eli5.explain_weights(logreg)

In [None]:
hepatitis_df.columns

In [None]:
hepatitis_df.to_csv('clean_hepatitis.csv', index=False)
frequency.to_csv('age_group_infection_rate.csv', index=False)

In [None]:
from sklearn.model_selection import GridSearchCV

In [3]:
pwd

'G:\\ML\\hepatitis\\train-model'