**DTSA 5509 Final Project**



**Project Topic and Goals**

The topic of this project is survival outcomes in the setting of hepatitis. This is fundamentally a supervised learning problem with the goal being classification. More specifically, the goal in this project is the use the available data to classify a patient into either the "Live" group or "Die" group. So, more plainly, the goal is to predict if a given patient with hepatitis will live or die based on the data available. In addition to acheiving the maximum possible prediction accuracy for classifying these patients, a secondary goal is to understand which factors (features) are the strongest contributors to whether a patient is classified as living or dying.    

**Data Source** 

The data used in this project was obtained from the popular [UCI Machine Learning Repository](https://archive.ics.uci.edu/) The UCI Repository has created a python package/library that can be used to access these datasets and that is what is installed below and used to access/gather the data. 

Citation: Hepatitis. (1988). UCI Machine Learning Repository. https://doi.org/10.24432/C5Q59J.

In [4]:
# Installing package to get data 

#%pip install ucimlrepo

# %pip is the magic command that will install into the current kernel (rather than into the instance of Python
# that launched the notebook)



Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Importing Dataset 

from ucimlrepo import fetch_ucirepo

# fetch dataset 

hepatitis = fetch_ucirepo(id=46)

In [46]:
# Importing necessary packages 


import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [14]:
hepatitis

{'data': {'ids': None,
  'features':      Age  Sex  Steroid  Antivirals  Fatigue  Malaise  Anorexia  Liver Big  \
  0     30    2      1.0           2      2.0      2.0       2.0        1.0   
  1     50    1      1.0           2      1.0      2.0       2.0        1.0   
  2     78    1      2.0           2      1.0      2.0       2.0        2.0   
  3     31    1      NaN           1      2.0      2.0       2.0        2.0   
  4     34    1      2.0           2      2.0      2.0       2.0        2.0   
  ..   ...  ...      ...         ...      ...      ...       ...        ...   
  150   46    1      2.0           2      1.0      1.0       1.0        2.0   
  151   44    1      2.0           2      1.0      2.0       2.0        2.0   
  152   61    1      1.0           2      1.0      1.0       2.0        1.0   
  153   53    2      1.0           2      1.0      2.0       2.0        2.0   
  154   43    1      2.0           2      1.0      2.0       2.0        2.0   
  
       Liver F

In [8]:
# Putting data into a Pandas dataframe and checking the shape of the dataframe 

df_hepatitis = hepatitis.data.original

df_hepatitis.shape 


(155, 20)

In [9]:
df_hepatitis.head()

Unnamed: 0,Class,Age,Sex,Steroid,Antivirals,Fatigue,Malaise,Anorexia,Liver Big,Liver Firm,Spleen Palpable,Spiders,Ascites,Varices,Bilirubin,Alk Phosphate,Sgot,Albumin,Protime,Histology
0,2,30,2,1.0,2,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,2,50,1,1.0,2,1.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,2,78,1,2.0,2,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,2,31,1,,1,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,2,34,1,2.0,2,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


In [23]:
print(hepatitis.metadata.additional_info.variable_info)

     1. Class: DIE, LIVE
     2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
     3. SEX: male, female
     4. STEROID: no, yes
     5. ANTIVIRALS: no, yes
     6. FATIGUE: no, yes
     7. MALAISE: no, yes
     8. ANOREXIA: no, yes
     9. LIVER BIG: no, yes
    10. LIVER FIRM: no, yes
    11. SPLEEN PALPABLE: no, yes
    12. SPIDERS: no, yes
    13. ASCITES: no, yes
    14. VARICES: no, yes
    15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
        -- see the note below
    16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
    17. SGOT: 13, 100, 200, 300, 400, 500, 
    18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
    20. HISTOLOGY: no, yes

The BILIRUBIN attribute appears to be continuously-valued.  I checked this with the donater, Bojan Cestnik, who replied:

 About the hepatitis database and BILIRUBIN problem I would like to say the following: BILIRUBIN is continuous attribute (= the number of it's "values" in the ASDO

In [24]:
print(hepatitis.variables)

               name     role         type demographic description units  \
0             Class   Target  Categorical        None        None  None   
1               Age  Feature      Integer        None        None  None   
2               Sex  Feature  Categorical        None        None  None   
3           Steroid  Feature  Categorical        None        None  None   
4        Antivirals  Feature  Categorical        None        None  None   
5           Fatigue  Feature  Categorical        None        None  None   
6           Malaise  Feature  Categorical        None        None  None   
7          Anorexia  Feature  Categorical        None        None  None   
8         Liver Big  Feature  Categorical        None        None  None   
9        Liver Firm  Feature  Categorical        None        None  None   
10  Spleen Palpable  Feature  Categorical        None        None  None   
11          Spiders  Feature  Categorical        None        None  None   
12          Ascites  Feat

In [36]:
df_variable_descriptions = hepatitis.variables 

num_categorical = (df_variable_descriptions['type'] == 'Categorical').sum()

num_numeric = ((df_variable_descriptions['type'] == 'Integer') | 
               (df_variable_descriptions['type'] == 'Continuous')).sum()

print(f'There are {num_categorical -1} categorical features and {num_numeric} numeric features (1 continuous and 6 Integer).')



There are 12 categorical features and 7 numeric features (1 continuous and 6 Integer).


**Data Description**

Data size: Including the target column, this dataset has 155 rows and 20 columns. 19 of the columns are feature columns. 

There are 12 categorical features and 7 numeric features (1 continuous and 6 Integer).


**Data Cleaning**

To Do:

- Columns with Missing Values are indicated above in the "missing_values" column of the variables dataframe. 
    - Decide how to handle missing values in each column with missing values 

In [37]:
# First, I want to see how many missing values are in each of the columns that are indicated to have 
# missing values 


df_hepatitis.isna().sum()

Class               0
Age                 0
Sex                 0
Steroid             1
Antivirals          0
Fatigue             1
Malaise             1
Anorexia            1
Liver Big          10
Liver Firm         11
Spleen Palpable     5
Spiders             5
Ascites             5
Varices             5
Bilirubin           6
Alk Phosphate      29
Sgot                4
Albumin            16
Protime            67
Histology           0
dtype: int64

In [45]:
# From the above, I see that there are 4 columns with only one missing value (Steroid, Fatigue, Malaise,
# and Anorexia. Given the number of rows I have plus the fact that these are cateogrical variables,
# I'm going to take the approach of just dropping the rows with missing values 

columns_with_one_missing_value = df_hepatitis.columns[df_hepatitis.isna().sum() == 1]

df_hepatitis_clean = df_hepatitis.dropna(subset = columns_with_one_missing_value)

df_hepatitis_clean.isna().sum()

Class               0
Age                 0
Sex                 0
Steroid             0
Antivirals          0
Fatigue             0
Malaise             0
Anorexia            0
Liver Big           9
Liver Firm         10
Spleen Palpable     4
Spiders             4
Ascites             4
Varices             4
Bilirubin           5
Alk Phosphate      28
Sgot                3
Albumin            15
Protime            66
Histology           0
dtype: int64

In [49]:
# I Want to check to see how many in each class I have left to make sure I don't lose too many of either class 

target_class_counts = df_hepatitis_clean['Class'].value_counts()

# Display the counts
print(target_class_counts)

# Calculate the percentage of observations for each class
target_class_percentage = df_hepatitis_clean['Class'].value_counts(normalize=True) * 100

# Display the percentage
print(target_class_percentage)

# 2 = Live 

#1 = Die 

df_hepatitis_clean.dtypes


2    121
1     32
Name: Class, dtype: int64
2    79.084967
1    20.915033
Name: Class, dtype: float64


Class                int64
Age                  int64
Sex                  int64
Steroid            float64
Antivirals           int64
Fatigue            float64
Malaise            float64
Anorexia           float64
Liver Big          float64
Liver Firm         float64
Spleen Palpable    float64
Spiders            float64
Ascites            float64
Varices            float64
Bilirubin          float64
Alk Phosphate      float64
Sgot               float64
Albumin            float64
Protime            float64
Histology            int64
dtype: object

In [55]:
# Filtering for categorical


# Identify categorical columns
categorical_columns = df_variable_descriptions[df_variable_descriptions['type'] == 'Categorical']['name'].tolist()

print(categorical_columns)

['Class', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia', 'Liver Big', 'Liver Firm', 'Spleen Palpable', 'Spiders', 'Ascites', 'Varices']


In [65]:
# Filtering for numeric columns 

# Identify numeric columns 

numeric_columns = df_variable_descriptions[df_variable_descriptions['type'].isin(['Integer', 'Continuous'])]['name'].tolist()

print(numeric_columns)



['Age', 'Bilirubin', 'Alk Phosphate', 'Sgot', 'Albumin', 'Protime', 'Histology']


In [69]:
# Imputing the mode for the remaining categorical variables with missing values 



# Create a SimpleImputer instance for categorical columns with 'most_frequent' strategy
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the imputer on the DataFrame
df_hepatitis_clean[categorical_columns] = imputer.fit_transform(df_hepatitis_clean[categorical_columns])

# Display the DataFrame after imputing the mode
df_hepatitis_clean

df_hepatitis_clean.isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hepatitis_clean[categorical_columns] = imputer.fit_transform(df_hepatitis_clean[categorical_columns])


Class              0
Age                0
Sex                0
Steroid            0
Antivirals         0
Fatigue            0
Malaise            0
Anorexia           0
Liver Big          0
Liver Firm         0
Spleen Palpable    0
Spiders            0
Ascites            0
Varices            0
Bilirubin          0
Alk Phosphate      0
Sgot               0
Albumin            0
Protime            0
Histology          0
dtype: int64

In [72]:
# Imputing the median for numeric/integer variables with missing values 


# Create a SimpleImputer instance for numeric columns with 'median' strategy
numeric_imputer = SimpleImputer(strategy='median')

# Fit and transform the imputer on the DataFrame for numeric columns
df_hepatitis_clean[numeric_columns] = numeric_imputer.fit_transform(df_hepatitis_clean[numeric_columns])

# Display the DataFrame after imputing missing values with median for numeric columns
# print(df_hepatitis_clean)

df_hepatitis_clean.isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hepatitis_clean[numeric_columns] = numeric_imputer.fit_transform(df_hepatitis_clean[numeric_columns])


Class              0
Age                0
Sex                0
Steroid            0
Antivirals         0
Fatigue            0
Malaise            0
Anorexia           0
Liver Big          0
Liver Firm         0
Spleen Palpable    0
Spiders            0
Ascites            0
Varices            0
Bilirubin          0
Alk Phosphate      0
Sgot               0
Albumin            0
Protime            0
Histology          0
dtype: int64

In [73]:
# Now that all of the missing values have been taken care of, I want to again check the balance of the 
# target variable 

target_class_counts = df_hepatitis_clean['Class'].value_counts()

# Display the counts
print(target_class_counts)

# Calculate the percentage of observations for each class
target_class_percentage = df_hepatitis_clean['Class'].value_counts(normalize=True) * 100

# Display the percentage
print(target_class_percentage)

2.0    121
1.0     32
Name: Class, dtype: int64
2.0    79.084967
1.0    20.915033
Name: Class, dtype: float64


**Data Cleaning Summary**

In summary, I decided to drop the observations in which there was only 1 missing value in some columns. Next, I decided to impute the missing values in the categorical columns with the mode and to impute the missing values in the numeric/integer columns with the median. 

One important thing to note here is that the number of missing values in the Protime columns was relatively large, however, I hypothesized that the in the current analysis dropping it completely would result in the loss of too much information. However, if time allowed or in a future version, it may be prudent to try the analysis and modeling both ways. 

**Exploratory Data Analysis**

**Exploratory Data Analysis Summary**

**Model Building**