# Introduction

# 0. Importing Libraries and Other Code

In [None]:
# Import libraries

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [8]:
# Relevant variables

URL_DATASET = "../Datasets/"

Features = ['age','gender','tot_bilirubin','direct_bilirubin','alkphos','sgpt','sgot','tot_proteins','albumin','ag_ratio']
Target = 'is_patient'

# 1. Load Dataset and First Exploration

This section deals with data loading, the renaming of the dataset attributes and a first exploration of the dataset for missing values or incorrect data types.

The exploration found 4 missing values in the ``ag_ratio`` attribute. The values of the ``is_patient`` target were also recoded to the more readable labels 'Yes' and 'No'.

First, load the dataset and rename the columns to the attribute names given in [Technical Requirements](../TechnicalRequirements.pdf).

In [None]:
# Loading dataset

LiverDataset = pd.read_csv(URL_DATASET+'IndianLiverPatientDataset.csv',index_col=None,header=None)

# Renaming attributes/columns

attributes_names = dict(zip(range(11),Features+[Target]))
LiverDataset.rename(columns=attributes_names,inplace=True)

LiverDataset.head(5)

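| | age | gender | tot_bilirubin | direct_bilirubin | alkphos | sgpt | sgot | tot_proteins | albumin | ag_ratio | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |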
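
As an alternative to renaming after loading, the column names could be assigned while the file is read. The following is a minimal sketch, assuming the CSV has no header row (as in the loading cell above). The in-memory `sample_csv` is a stand-in for the real file and holds only the first two records shown above; `sample` is a name introduced here for illustration.

```python
import io
import pandas as pd

# Column names as defined in the "Relevant variables" cell
Features = ['age', 'gender', 'tot_bilirubin', 'direct_bilirubin', 'alkphos',
            'sgpt', 'sgot', 'tot_proteins', 'albumin', 'ag_ratio']
Target = 'is_patient'

# Stand-in for the headerless CSV file (assumption: same column order)
sample_csv = io.StringIO(
    "65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1\n"
    "62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1\n"
)

# `names=` labels the columns at read time, so no rename step is needed
sample = pd.read_csv(sample_csv, header=None, names=Features + [Target])

print(sample.columns.tolist())
```

Passing `names=` also makes a mismatch between the expected and actual number of columns easier to spot, since the labels are attached at the point where the file is parsed.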


The dataset has missing values only in ``ag_ratio`` (579 non-null values out of 583 entries), and the data types match the description of the dataset provided in [[1]](#references).

In [None]:
# Exploring missing values and data types

LiverDataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               583 non-null    int64  
 1   gender            583 non-null    object 
 2   tot_bilirubin     583 non-null    float64
 3   direct_bilirubin  583 non-null    float64
 4   alkphos           583 non-null    int64  
 5   sgpt              583 non-null    int64  
 6   sgot              583 non-null    int64  
 7   tot_proteins      583 non-null    float64
 8   albumin           583 non-null    float64
 9   ag_ratio          579 non-null    float64
 10  is_patient        583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB
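
The number of missing values can also be read off directly instead of being inferred from the non-null counts. The sketch below uses a small toy frame with made-up values (`toy` is hypothetical, not the liver dataset) to show the pattern. Applied to `LiverDataset`, the same calls would be expected to report 4 missing values for ``ag_ratio``.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and one missing ag_ratio entry
toy = pd.DataFrame({
    'albumin': [3.3, 3.2, 3.4],
    'ag_ratio': [0.9, np.nan, 1.0],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```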


The ``is_patient`` target does not use the usual 1/0 or 'Yes'/'No' encoding. Counting the records for each value and comparing the counts with the class sizes reported for the dataset shows which value is the positive class and which is the negative class. The values are then recoded to 'Yes' and 'No'.

In [25]:
# Inspecting the unusual target values

print('Values of `is_patient` :: ',LiverDataset.is_patient.unique())

LiverDataset.groupby(Target).size()

Values of `is_patient` ::  [1 2]


is_patient
1    416
2    167
dtype: int64
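
As a quick consistency check, the two class counts should add up to the 583 entries reported by `info()`. The sketch below copies the counts from the output above into a small Series (`counts` and `shares` are names introduced here) and also computes the share of each class.

```python
import pandas as pd

# Class counts copied from the groupby output above
counts = pd.Series({1: 416, 2: 167}, name='is_patient')

# Total number of records and the share of each class
total = counts.sum()
shares = (counts / total).round(3)

print(total)
print(shares)
```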

According to the dataset description [[1]](#references), the dataset contains records of 416 patients diagnosed with liver disease and 167 patients without liver disease. Therefore, the value 1 corresponds to 'Yes' and the value 2 corresponds to 'No'.

In [33]:
# Transforming target values

# Lookup table indexed by `is_patient % 2`:
# 1 % 2 = 1 -> 'Yes', 2 % 2 = 0 -> 'No'
target_values = np.array(['No','Yes'])

LiverDataset[Target] = target_values[LiverDataset[Target] % 2]

LiverDataset.groupby(Target).size()

is_patient
No     167
Yes    416
dtype: int64
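
The modulo-based lookup is compact but relies on the codes being exactly 1 and 2. A more explicit alternative is a dictionary mapping, sketched below on a toy Series with hypothetical values (`labels`, `mapped` and `looked_up` are names introduced here). Both approaches give the same labels.

```python
import numpy as np
import pandas as pd

# Toy target column with the same coding as the dataset (1 and 2)
labels = pd.Series([1, 2, 1, 1, 2], name='is_patient')

# Explicit mapping: each code is paired with its label
mapped = labels.map({1: 'Yes', 2: 'No'})

# Modulo-based lookup, as in the transformation cell above
target_values = np.array(['No', 'Yes'])
looked_up = target_values[(labels % 2).to_numpy()]

# Both approaches agree element by element
print((mapped.to_numpy() == looked_up).all())
print(mapped.value_counts())
```

With `map`, any code missing from the dictionary becomes `NaN`, which makes unexpected values visible. The modulo lookup would silently assign them to one of the two classes.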

In [None]:
# Checking data types

LiverDataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               583 non-null    int64  
 1   gender            583 non-null    object 
 2   tot_bilirubin     583 non-null    float64
 3   direct_bilirubin  583 non-null    float64
 4   alkphos           583 non-null    int64  
 5   sgpt              583 non-null    int64  
 6   sgot              583 non-null    int64  
 7   tot_proteins      583 non-null    float64
 8   albumin           583 non-null    float64
 9   ag_ratio          579 non-null    float64
 10  is_patient        583 non-null    object 
dtypes: float64(5), int64(4), object(2)
memory usage: 50.2+ KB


# References

* [1] B. Ramana and N. Venkateswarlu. "ILPD (Indian Liver Patient Dataset)," UCI Machine Learning Repository, 2022. [Online]. Available: https://doi.org/10.24432/C5D02C.