# Python Project on Hepatitis Dataset

In [None]:
# For imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
import warnings
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency

# 1. INTRODUCTION

The Hepatitis dataset contains various patient attributes, and its Class attribute records whether each patient survived the disease. We use it to carry out several data analysis procedures: data inspection, exploration, standardizing and outlier handling. The results of these steps are then summarized.

The project is divided into three sections: Introduction, Analysis and Conclusion. The Introduction section deals mainly with introducing and preprocessing the data. We briefly examine how the data is tabulated, what features it has and what summary statistics it provides. Based on that, we preprocess and clean the dataset to get it into the best possible form for further examination.

The Analysis section contains graphs and Exploratory Data Analysis that look at the features in more depth. We visualize the dataset to understand its main characteristics, identify patterns and spot anomalies.

Finally, based on the preprocessing and analysis, we conclude our findings. 

### 1.1. Import Dataset

In [None]:
df = pd.read_csv("hepatitis.txt", header = None)
df.head()

### 1.2. Assigning column names

In [None]:
col_names = ['Class', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia', 'Liver_Big','Liver_Firm',
             'Spleen_Palpable', 'Spiders', 'Ascites', 'Varices', 'Bilirubin', 'Alk_Phosphate', 'Sgot','Albumin', 'Protime', 
             'Histology'] 

In [None]:
df.columns = col_names
df.head()

### 1.3. General Description of dataset and handling anomalies

#### Descriptive Statistics of dataset

In [None]:
df.describe()

The table above gives statistical description of the data with the number of observations in each numeric column, their mean, standard deviation, minimum and maximum values along with the quartiles.

#### Shape of dataset

In [None]:
df.shape

The dataset consists of 155 rows and 20 columns.

#### Handling missing or misleading values
The dataset contains '?', '.', '9999' or '99999' as values for some observations. First, we replace these with NaN and count the total missing values.

In [None]:
df1 = df.replace('?', np.nan)
df1 = df1.replace('.', np.nan)
df1 = df1.replace(9999, np.nan)
df1 = df1.replace(99999, np.nan)

nan_counts = df1.isna().sum()
missing_data = pd.DataFrame((df1.isnull().sum()) * 100 / df1.shape[0]).reset_index()
print(f"NaN Count\n{nan_counts}")

Displaying all the rows and columns of the dataset to inspect and verify missing and misleading values.

In [None]:
#df1

#### Overview of the datatype

In [None]:
df1.info()

In [None]:
#Getting categorical columns
categorical_cols = df1.select_dtypes(include=['object']).columns
print(categorical_cols)

#Getting numerical columns
numerical_cols = df1.select_dtypes(include=['int', 'float']).columns
print(numerical_cols)

As seen above, 16 columns have the object dtype and 4 columns are numeric. The object columns are not truly categorical; most likely the placeholder values forced them to be read as strings. To generalize, the dataset is copied to a new one and all columns are converted to a numeric type for ease of data cleaning and visualization.
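
A minimal sketch of what this conversion does, using a hypothetical three-value column rather than the real data. `pd.to_numeric` with `errors='coerce'` parses numeric strings and turns anything it cannot parse into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical object-dtype column, as produced when a placeholder forces string parsing
raw = pd.Series(["1", "2.5", "?"], dtype="object")

# Unparseable entries such as '?' become NaN
converted = pd.to_numeric(raw, errors="coerce")

print(raw.dtype)
print(converted.dtype)
print(converted.isna().sum())
```

Because unparseable values silently become NaN, it is worth comparing the NaN count before and after the conversion.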

In [None]:
df_converted = df1.copy()
df_converted[categorical_cols] = df_converted[categorical_cols].apply(pd.to_numeric, errors='coerce')

In [None]:
df_converted.info()

In [None]:
df_converted.describe().transpose()

It looks like some of the columns (Malaise, Spiders and Alk_Phosphate, at rows 24, 130 and 135) still contain 9999 or 99999. These values are presumably stored as strings, which the numeric replacements above do not match. We carry out more data cleaning.
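
The behaviour described above can be reproduced on a small example. The sketch below uses a hypothetical three-row sample (not the real file) to show why a numeric `replace(9999, ...)` misses the string `'9999'`, and how the `na_values` parameter of `pd.read_csv` could handle all placeholders at load time instead. This is an alternative under stated assumptions, not the approach used in this notebook.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout (not the real data)
raw = "2,30,?,85\n1,50,1.2,9999\n2,41,.,?\n"
cols = ["Class", "Age", "Bilirubin", "Alk_Phosphate"]

# Plain load: a column containing '?' is read as strings, so 9999 is the string '9999'
plain = pd.read_csv(io.StringIO(raw), header=None, names=cols)
after_numeric_replace = plain.replace(9999, np.nan)
still_present = (after_numeric_replace["Alk_Phosphate"] == "9999").any()
print("'9999' survives the numeric replace:", still_present)

# Load with na_values: every placeholder token becomes NaN and the columns stay numeric
clean = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=cols,
    na_values=["?", ".", "9999", "99999"],
)
print(clean.isna().sum())
print(clean.dtypes)
```

Handling the placeholders at load time would also make the later string-to-number conversion unnecessary, since no column is read as text.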

#### Handling more missing or misleading values

In [None]:
df1.replace(9999, np.nan, inplace = True)
df1.replace(99999, np.nan, inplace = True)
df1.replace('9999', np.nan, inplace = True)
df1.replace('99999', np.nan, inplace = True)
nan_counts = df1.isna().sum()
missing_data = pd.DataFrame((df1.isnull().sum())*100/df1.shape[0]).reset_index()
print(f"NaN Count\n{nan_counts}")

In [None]:
df1

In [None]:
#for column in df1.columns:
 #   has_question_mark = "?" in df1[column].values
  #  has_dot = "." in df1[column].values
   # has_9999 = 9999 in df1[column].values
    #has_99999 = 99999 in df1[column].values

    #print(f"Column: {column}")
    #print(f"Contains '?' : {has_question_mark}")
    #print(f"Contains '.' : {has_dot}")
    #print(f"Contains 9999 : {has_9999}")
    #print(f"Contains 99999 : {has_99999}")
    #print("\n")

In [None]:
warnings.simplefilter(action='ignore', category=FutureWarning)

table_rows = []

for column in df1.columns:
    contains_question_mark = "?" in df1[column].values
    contains_dot = "." in df1[column].values
    contains_9999 = 9999 in df1[column].values
    contains_99999 = 99999 in df1[column].values
    contains_9999_str = "9999" in df1[column].values
    contains_99999_str = "99999" in df1[column].values
    
    table_rows.append([column, contains_question_mark, contains_dot, contains_9999, contains_99999, contains_9999_str, contains_99999_str])

headers = ["Column", "Contains '?'", "Contains '.'", "Contains 9999", "Contains 99999", "Contains str_9999", "Contains str_99999"]

print(tabulate(table_rows, headers=headers, tablefmt="grid"))

In [None]:
print("NaN Counts:",df1.isnull().sum().sum())

Now the data has been cleaned of all the misleading values. Next, we fill the NaN values using the backfill method.

#### Replacing NaN values

In [None]:
df2 = df1.bfill()
df2

In [None]:
df2.describe().transpose()

In [None]:
print("NaN Counts:",df2.isnull().sum().sum())

All the NaN values have been replaced by the backfill method, and there are no '.', '?', '9999' or '99999' values left in the dataset to give us misleading information.
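
One general caveat with backfilling is that it copies the next valid value upwards, so a missing value in the last row of a column has nothing to copy from and stays NaN. The sketch below illustrates this on a hypothetical series; the forward fill used as a fallback is an assumption for illustration, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column whose last value is missing
s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Backfill copies the next valid value upwards; the trailing NaN has no successor
backfilled = s.bfill()
print(backfilled.tolist())

# Assumption: a forward fill afterwards is an acceptable fallback for trailing gaps
filled = s.bfill().ffill()
print(filled.tolist())
```

This is why checking the total NaN count after the fill, as done above, is a useful safeguard.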

# 2. DATA ANALYSIS

### 2.1. Changing categorical variables to numerical variables

In [None]:
df3 = df2.copy()

In [None]:
categorical_cols

In [None]:
# Display unique values in the columns before conversion
#for col in numerical_cols_to_convert:
 #   print(f"\nUnique values in '{col}' before conversion:")
#    print(df3[col].unique())

# Convert columns to regular numerical values in a loop
for col in categorical_cols:
    df3[col] = pd.to_numeric(df3[col], errors='coerce')

# Display unique values in the columns after conversion
#for col in numerical_cols_to_convert:
 #   print(f"\nUnique values in '{col}' after conversion:")
 #   print(df3[col].unique())


In [None]:
df3.info()

In [None]:
df3.describe().transpose()

### 2.2. Exploratory Data Analysis

#### Generating correlation matrix and correlation heatmap

In [None]:
correlation_matrix = df3.corr()
correlation_matrix

In [None]:
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

From the correlation heatmap, we can see that most of the columns have a relatively low correlation with each other. Some pairs have a positive correlation coefficient of up to 0.59, which is moderate and not a concern. Therefore, there is no very strong correlation between the different columns. Some of the highlights are:
Class:
Moderate negative correlation with 'Bilirubin' (-0.44) and 'Histology' (-0.34).
Moderate positive correlation with 'Ascites' (0.47), 'Spiders' (0.38), 'Varices' (0.36), and 'Albumin' (0.38).

Age:
Weak negative correlations with 'Class' (-0.25) and 'Albumin' (-0.25).
Weak positive correlation with 'Spiders' (0.31).

Fatigue:
Moderate positive correlations with 'Class' (0.31), 'Malaise' (0.33), 'Spiders' (0.38), 'Ascites' (0.47), 'Varices' (0.36), and 'Albumin' (0.38).
Weak negative correlation with 'Sgot' (-0.20).

Malaise:
Moderate positive correlations with 'Class' (0.33), 'Fatigue' (0.58), 'Spiders' (0.31), 'Ascites' (0.31), 'Varices' (0.16), and 'Albumin' (0.26).

Spiders:
Strong positive correlations with 'Class' (0.38), 'Fatigue' (0.38), 'Malaise' (0.31), 'Ascites' (0.29), 'Varices' (0.38), and 'Albumin' (0.31).

Ascites:
Strong positive correlations with 'Class' (0.47), 'Fatigue' (0.29), 'Malaise' (0.31), 'Spiders' (0.29), 'Varices' (0.37), and 'Albumin' (0.47).
Weak negative correlation with 'Sgot' (-0.27).

Varices:
Strong positive correlations with 'Class' (0.36), 'Fatigue' (0.18), 'Malaise' (0.16), 'Spiders' (0.38), 'Ascites' (0.37), and 'Albumin' (0.36).
Strong negative correlation with 'Bilirubin' (-0.37).

Bilirubin:
Strong negative correlations with 'Class' (-0.44), 'Albumin' (-0.40), and 'Sgot' (-0.40).
Moderate positive correlation with 'Histology' (0.28).

Sgot:
Weak negative correlation with 'Class' (-0.06).
Moderate positive correlation with 'Bilirubin' (0.40).
Weak negative correlation with 'Ascites' (-0.27).

Albumin:
Strong positive correlations with 'Class' (0.38), 'Fatigue' (0.29), 'Malaise' (0.26), 'Spiders' (0.31), 'Ascites' (0.47), and 'Varices' (0.36).
Strong negative correlation with 'Bilirubin' (-0.40).

Histology:
Moderate negative correlations with 'Class' (-0.34) and 'Protime' (-0.38).
Weak positive correlations with 'Bilirubin' (0.28), 'Sgot' (0.13) and 'Alk_Phosphate' (0.14).
Weak negative correlation with 'Sex' (-0.14).

We take "Class" as our response variable, since the goal is to predict whether a patient who has contracted Hepatitis is likely to survive or die. With that in mind, we look at some EDA.


#### Correlation matrix with "CLASS"

In [None]:
correlation_matrix = df3.corr()

correlation_with_class = correlation_matrix['Class']

print("\nCorrelation with the 'Class' column:")
print(correlation_with_class)

Above is the correlation of each parameter with Class. The parameters Age, Antivirals, Liver_Big, Bilirubin, Alk_Phosphate, Sgot and Histology are negatively correlated with Class. Since Class is coded 1 = DIE and 2 = LIVE, this means that as the values in these columns increase, the likelihood of being in the "Live" category may slightly decrease. The other columns are positively correlated with Class, implying that as their values increase, the likelihood of being in the "Live" category also increases.
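
Sorting the correlations makes the negative and positive groups easier to read. The sketch below uses synthetic stand-in data (the column names are borrowed from the dataset, but the values are generated for illustration only) and drops the trivial self-correlation of Class before sorting.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: Class coded 1 = DIE, 2 = LIVE as in the dataset
n = 200
cls = rng.integers(1, 3, size=n)
demo = pd.DataFrame({
    "Class": cls,
    "Albumin": 3.0 + 0.5 * (cls == 2) + rng.normal(0, 0.5, n),    # higher for LIVE
    "Bilirubin": 2.0 - 0.8 * (cls == 2) + rng.normal(0, 0.5, n),  # lower for LIVE
})

# Correlation of every column with Class, without Class itself, in ascending order
corr_with_class = demo.corr()["Class"].drop("Class").sort_values()
print(corr_with_class)
```

Negative entries appear first and positive entries last, so the two groups described above can be read off directly.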

#### Ratio of patients who die or live after contracting Hepatitis

In [None]:
df3['Class'].value_counts()

In [None]:
patients = df3.shape[0]
survivors = (np.sum(df3['Class'] == 2)/patients)*100
non_survivors = (np.sum(df3['Class'] == 1)/patients)*100
print(f"Survivors: {survivors:.2f}%")
print(f"Non_survivors: {non_survivors:.2f}%")

In [None]:
class_counts = df3['Class'].value_counts()

sns.set(style="whitegrid")

plt.figure(figsize=(10, 6))
sns.barplot(x=class_counts.index, y=class_counts.values, palette="Set2")

plt.xlabel("Class", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.title("Count of TARGET Variable per Class", fontsize=16)
    
plt.xticks([0, 1], ['DIE', 'LIVE'], rotation=0)

plt.show()

From the dataset documentation we know that 1 = DIE and 2 = LIVE. More patients live than die after contracting Hepatitis: 123 patients lived, which accounts for around 79.35%, and 32 died, which accounts for around 20.65% of the total patients. The bar graph shows this visually.
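
The same percentages can be obtained in one step with `value_counts(normalize=True)`. The sketch below rebuilds a Class column from the counts reported above (32 and 123) rather than from the file, so it only checks the arithmetic.

```python
import pandas as pd

# Class counts reported above: 32 patients coded 1 (DIE), 123 coded 2 (LIVE)
classes = pd.Series([1] * 32 + [2] * 123, name="Class")

# Share of each class as a percentage of all patients
percentages = classes.value_counts(normalize=True).mul(100).round(2)
print(percentages)
```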

#### Distribution plots

In [None]:
df3.info()

Since 'Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin' and 'Protime' are the only columns with non-binary values, we group them separately for analysis. The remaining columns are analyzed on their own.

In [None]:
num_vars = ['Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin', 'Protime']
nonnum_vars = ['Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia', 'Liver_Big',
           'Liver_Firm', 'Spleen_Palpable', 'Spiders', 'Ascites', 'Varices', 'Histology']

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(15, 20))
axes = axes.flatten()

for i, column in enumerate(num_vars[0:]):
    sns.histplot(data=df3, x=column, stat='probability', kde=True, ax=axes[i], hue='Class', common_norm=False, palette="Set2")

    axes[i].set_title(f'Histogram of {column}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    # Keep seaborn's automatic hue legend; hand-written labels can end up in a different order than the hue levels
    axes[i].get_legend().set_title('Class')
    axes[i].tick_params(axis='x', rotation=90)

fig.subplots_adjust(hspace=0.5)
plt.show()

#### Description for 'Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin', 'Protime'

In [None]:
df3[num_vars].describe(include = "all")

The above histograms show the distribution of 'Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin' and 'Protime'. The distribution of Age is almost normal, with a slight skew to the right. Bilirubin, Alk_Phosphate and Sgot are also skewed to the right. For these four columns the mean is greater than the median. For Albumin and Protime, the mean is smaller than the median, which points to a left skew that is also visible in their histograms.
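
The link between skew direction and the mean/median ordering can be checked numerically. The sketch below uses synthetic columns (not the real measurements) and compares the mean, the median and the sample skewness from `DataFrame.skew`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic columns, not the real measurements
demo = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.8, size=1000),
    "left_skewed": -rng.lognormal(mean=0.0, sigma=0.8, size=1000),
})

# Mean above median goes with positive skew, mean below median with negative skew
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
print(summary)
```

Applying the same three statistics to the numeric columns would give a quick numeric confirmation of what the histograms suggest.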

### 2.3. Outlier checking and removal

In [None]:
sns.set_palette("Set2")

fig, axes = plt.subplots(3, 2, figsize=(20, 20))
axes = axes.flatten()

for i, column in enumerate(num_vars[0:]):
    sns.boxplot(x='Class', y=column, data=df3, hue='Class', ax=axes[i]).set_title(f"{column} Outlier Detection")

    axes[i].set_xlabel('CLASS')
    axes[i].set_ylabel(column)
    axes[i].legend(title='CLASS')

fig.subplots_adjust(hspace=0.5)
plt.show()

As we can see from the above boxplots, "Age" does not have any outliers. "Bilirubin" has some outliers for both values of the "Class" parameter. "Alk_Phosphate" has one for the "Die" value of "Class" and some for the "Live" value. "Sgot" also has some outliers for both values of "Class". "Albumin" has some outliers beyond both the upper and lower whiskers for the "Live" value of "Class". And "Protime" has one below the lower bound for the "Live" value of "Class".

In [None]:
Q1 = df3[num_vars].quantile(0.25)
Q3 = df3[num_vars].quantile(0.75)
IQR = Q3 - Q1

def filter_outliers(column):
    lower_bound = Q1[column] - 1.5 * IQR[column]
    upper_bound = Q3[column] + 1.5 * IQR[column]
    return (df3[column] >= lower_bound) & (df3[column] <= upper_bound)

outlier_filter = pd.DataFrame()
for column in num_vars:
    outlier_filter[column] = filter_outliers(column)

#print("Outlier Filter DataFrame:")
#print(outlier_filter)

df_no_outliers = df3[outlier_filter.all(axis=1)]

print("Original DataFrame shape:", df3.shape)
print("DataFrame shape after removing outliers:", df_no_outliers.shape)

Therefore, the number of rows in the dataframe has been reduced from 155 to 117 after removing the outliers in 'Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin' and 'Protime'.
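
As a small worked example of the rule used above, the sketch below applies the same 1.5 × IQR bounds to a hypothetical two-column frame in vectorized form. A row is kept only if every column lies within its own bounds, so a single extreme value removes the whole row.

```python
import pandas as pd

# Hypothetical data: column "a" has one extreme value, column "b" has none
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 100],
    "b": [10, 11, 12, 13, 14, 15],
})

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons align the bounds with the columns; keep rows that pass in every column
within = ((demo >= lower) & (demo <= upper)).all(axis=1)
kept = demo[within]

print(lower["a"], upper["a"])
print(kept.shape)
```

For column "a" the quartiles are 2.25 and 4.75, so the bounds are -1.5 and 8.5 and only the row containing 100 is dropped.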

#### Diagrams for non-numeric variables 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia', 'Liver_Big', 'Liver_Firm', 'Spleen_Palpable', 'Spiders', 'Ascites', 'Varices', 'Histology'

In [None]:
num_subplots = len(nonnum_vars)
num_rows = 3
num_cols = 5

fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(18, 4 * num_rows))
axes = axes.flatten()

for i, column in enumerate(nonnum_vars):
    sns.countplot(data=df_no_outliers, x=column, hue='Class', palette='Set2', ax=axes[i])
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Count')
    axes[i].set_title(f'Count Plot of {column} with respect to Class')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].legend(title='Class', loc='upper right')

# Remove the unused axes left over in the 3 x 5 grid
for ax in axes[num_subplots:]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()

The count plots show the distribution of each of the mentioned variables with respect to "Class" (1 = Die, 2 = Live).
From the diagrams above, we can see the following patterns:

1. Sex (1 = Male, 2 = Female): The majority of individuals in the dataset are male. Some male patients have also died because of the disease.

2. Steroid (1 = No, 2 = Yes): Most individuals have a positive value for steroids. The class distribution appears balanced between those who have taken steroids and those who haven't.

3. Antivirals (1 = No, 2 = Yes): The majority of individuals have taken antiviral drugs. Some patients have died even after taking antiviral drugs, but the majority of them have lived.

4. Fatigue (1 = No, 2 = Yes): Fatigue seems prevalent in the dataset, with most individuals experiencing it. Both classes show a significant presence of fatigue, and some patients without fatigue have also died, which suggests it may not be a differentiator.

5. Malaise (1 = No, 2 = Yes): Malaise is also common in the dataset, with most individuals experiencing it. Similar to fatigue, malaise is present in both classes.

6. Anorexia (1 = No, 2 = Yes): Anorexia also seems common, with most individuals experiencing it, but it does not seem to have any effect on the outcome of Hepatitis.

7. Liver_Big (1 = No, 2 = Yes): The majority of individuals have an enlarged liver. Some patients with an enlarged liver have died due to Hepatitis.

8. Liver_Firm (1 = No, 2 = Yes): Firm liver is predominant in the dataset. But unlike Liver_Big, people without a firm liver also seem to die because of the disease.

9. Spleen_Palpable (1 = No, 2 = Yes): Most individuals have a palpable spleen. The distribution across classes is fairly consistent, suggesting spleen palpability may not be a strong indicator.

10. Spiders (1 = No, 2 = Yes): Spider veins are common, with the majority having them. With respect to Class, patients without spider veins have more deaths than those with them.

11. Ascites (1 = No, 2 = Yes): Ascites is more common, with the majority having it. The distribution across classes is somewhat balanced, with both classes having individuals with and without ascites.

12. Varices (1 = No, 2 = Yes): Varices are more common, with the majority having them. The distribution across classes is somewhat balanced, with both classes having individuals with and without varices.

13. Histology (1 = No, 2 = Yes): Most individuals have a histology value of 1.0, meaning the number of patients with histology is very small. As for the class distribution, patients with histology have a higher chance of being in the non-survivor class.
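
Count plots show absolute counts, which can be hard to compare when the categories differ in size. As a complement, `pd.crosstab` with `normalize='index'` gives the share of each Class within every category. The sketch below uses a handful of hypothetical records with the dataset's coding, not the real data.

```python
import pandas as pd

# Hypothetical records using the dataset's coding (attribute: 1 = No, 2 = Yes; Class: 1 = Die, 2 = Live)
demo = pd.DataFrame({
    "Ascites": [1, 1, 1, 2, 2, 2, 2, 2],
    "Class":   [1, 1, 2, 2, 2, 2, 2, 1],
})

# Each row sums to 1: the share of Die / Live within one Ascites category
rates = pd.crosstab(demo["Ascites"], demo["Class"], normalize="index")
print(rates)
```

Reading across a row gives the proportion of non-survivors and survivors for that category, which makes statements such as "more deaths than" easier to verify.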

### 2.4. Hypothesis tests for further confirmation

#### Two-sample t-tests for 'Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin', 'Protime' attributes

For each of these attributes, we compare its mean between the two values of the "Class" attribute using a two-sample t-test.

In [None]:
for var in num_vars:
    class_1 = df_no_outliers[df_no_outliers['Class'] == 1][var]
    class_2 = df_no_outliers[df_no_outliers['Class'] == 2][var]
    t_stat, p_value = ttest_ind(class_1, class_2)
    print(f'Test for {var}: {p_value:.4f}')

1. Age: p-value = 0.1030. We fail to reject the null hypothesis at a significance level of 0.05. There is not enough evidence to conclude that there is a significant difference in the mean age between the two values of Class.
2. Bilirubin: p-value = 0.0153. The p-value is less than the significance level of 0.05. Therefore, we reject the null hypothesis. This suggests that there is a significant difference in the mean bilirubin levels between the two values of Class.
3. Alk_Phosphate: p-value = 0.1971. The p-value is greater than the significance level of 0.05. Thus, we fail to reject the null hypothesis. There is not enough evidence to conclude that there is a significant difference in the mean alkaline phosphatase levels between the two values of Class.
4. Sgot: p-value = 0.5311. We fail to reject the null hypothesis at a significance level of 0.05. There is not enough evidence to suggest a significant difference in the mean serum glutamic oxaloacetic transaminase levels between the two values of Class.
5. Albumin: p-value < 0.0001 (printed as 0.0000). The p-value is less than 0.05, indicating strong evidence to reject the null hypothesis. There is a significant difference in the mean albumin levels between the two values of Class.
6. Protime: p-value = 0.1407. The p-value is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis. There is not enough evidence to conclude that there is a significant difference in the mean prothrombin time between the two values of Class.

In summary, based on a significance level of 0.05:
Bilirubin and Albumin levels show significant differences between Class 1 and Class 2 patients.
Age, Alk_Phosphate, Sgot, and Protime do not show significant differences between the classes.
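
`ttest_ind` assumes equal variances in the two groups by default. Since the two classes differ considerably in size, Welch's t-test (`equal_var=False`) is a common cross-check. The sketch below compares both variants on synthetic groups; the group sizes and distributions are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Synthetic groups with unequal sizes, loosely mimicking an imbalanced Class split
group_die = rng.normal(loc=3.0, scale=0.5, size=30)
group_live = rng.normal(loc=4.0, scale=0.5, size=90)

# Student's t-test (equal variances assumed) and Welch's t-test (no such assumption)
t_student, p_student = ttest_ind(group_die, group_live)
t_welch, p_welch = ttest_ind(group_die, group_live, equal_var=False)

print(f"Student p-value: {p_student:.4g}")
print(f"Welch p-value:   {p_welch:.4g}")
```

If both variants lead to the same decision at $\alpha = 0.05$, the conclusion does not depend on the equal-variance assumption.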

#### Chi-square tests on contingency tables for 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia', 'Liver_Big', 'Liver_Firm', 'Spleen_Palpable', 'Spiders', 'Ascites', 'Varices', 'Histology'

In [None]:
for var in nonnum_vars:
    contingency_table = pd.crosstab(df_no_outliers[var], df_no_outliers['Class'])
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
    print(f'Chi-Square Test for {var}: {p_value:.4f}')

1. Sex: p-value = 0.2463
2. Steroid: p-value = 0.5227
3. Antivirals: p-value = 0.3272
4. Anorexia: p-value = 0.6789
5. Liver_Big: p-value = 0.1635
6. Liver_Firm: p-value = 1.0000
7. Spleen_Palpable: p-value = 0.0707

Explanation for the parameters above: each p-value is greater than the significance level of 0.05 (for example, 0.0707 for Spleen_Palpable). Thus, we fail to reject the null hypothesis. There is not enough evidence to suggest a significant association between any of these parameters and Class.

8. Fatigue: p-value = 0.0170
9. Malaise: p-value = 0.0125
10. Spiders: p-value = 0.0001
11. Ascites: p-value < 0.0001 (printed as 0.0000)
12. Varices: p-value = 0.0001
13. Histology: p-value = 0.0001

Explanation for the parameters above: each p-value is less than 0.05, so we reject the null hypothesis. There is a significant association between each of these parameters (for example, Histology) and Class.
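
The chi-square approximation becomes less reliable when expected cell counts are small, which can happen in 2 x 2 tables with few non-survivors. The sketch below runs `chi2_contingency` on a hypothetical table of counts (not taken from the dataset), inspects the expected counts it returns, and uses Fisher's exact test as a cross-check. Note that `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2 x 2 table of counts: rows = attribute (1, 2), columns = Class (1, 2)
table = np.array([[12, 6],
                  [10, 89]])

# Chi-square test of independence (Yates' correction is applied because dof == 1)
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square p-value: {p_value:.4g}, dof: {dof}")
print("smallest expected count:", expected.min())

# Fisher's exact test as a cross-check when an expected count is small
odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher exact p-value: {p_exact:.4g}")
```

When the smallest expected count falls below about 5, the exact test is usually the safer reference.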

### 2.5. Kernel Density plot

In [None]:
num_vars = ['Age', 'Bilirubin', 'Alk_Phosphate', 'Sgot', 'Albumin', 'Protime']

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
fig.suptitle('KDE Plots of Numeric Variables by Class', fontsize=16)

axes = axes.flatten()

for i, var in enumerate(num_vars):
    sns.kdeplot(data=df_no_outliers, x=var, hue='Class', fill=True, common_norm=False, ax=axes[i], palette='Set2')
    axes[i].set_title(f'KDE Plot for {var}')
    axes[i].set_xlabel(var)
    axes[i].set_ylabel('Density')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()


# 3. CONCLUSION

The Hepatitis dataset, which includes various numerical and categorical attributes, was imported, and analysis, visualization, inspection and cleaning were carried out to investigate what information could be extracted and what kind of prediction could be made from it.

Initially, the columns were given descriptive attribute names, and using the features available in Python we learned more about the distribution, description and shape of the data. It was clear that there were missing and misleading values in the dataset, so several steps were taken to remove the misleading values and to fill in the missing values using backfill.

The major goal of the project was to determine whether a patient who contracts Hepatitis is likely to live or die, i.e. "Class" is our target variable and all the others are predictor variables. During data analysis, various plots and diagrams were used to visualize the distribution of the data, examine the correlation between attributes and detect outliers. The outliers were removed using inter-quartile range bounds. Since the dataset has numeric and categorical attributes, separate tests were done for each type of attribute. The count plots showed that most patients in the dataset are male and that some male patients did not survive, although the chi-square test found no significant association between Sex and Class. Similarly, all other attributes were compared to determine their significance with respect to our target variable. Finally, for further confirmation, simple hypothesis testing was done using p-values and $\alpha = 0.05$.

In summary, the processes carried out in this project are a compilation of what we learned in class. These learnings were used to determine the properties of various parameters with respect to the "Class" attribute of the dataset.