# Module 3: Exercise A

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In this exercise, we will work with variables in a cirrhosis data set to predict the disease’s stage. The target variable is __Stage__, which represents the histologic stage of the disease and can be 1, 2, 3, or 4.

Cirrhosis is a late-stage liver condition characterized by fibrosis (scarring), resulting from various liver diseases and conditions, such as hepatitis and chronic alcoholism. The data set includes information gathered from a trial on primary biliary cirrhosis (PBC) conducted at the Mayo Clinic between 1974 and 1984.

## Data Preprocessing

In [2]:
cirr = pd.read_csv('cirrhosis.csv')
cirr.head()

Unnamed: 0,ID,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,1,21464,F,Y,Y,Y,Y,14.5,261.0,2.6,156.0,1718.0,137.95,172.0,190.0,12.2,4.0
1,2,20617,F,N,Y,Y,N,1.1,302.0,4.14,54.0,7394.8,113.52,88.0,221.0,10.6,3.0
2,3,25594,M,N,N,N,S,1.4,176.0,3.48,210.0,516.0,96.1,55.0,151.0,12.0,4.0
3,4,19994,F,N,Y,Y,S,1.8,244.0,2.54,64.0,6121.8,60.63,92.0,183.0,10.3,4.0
4,5,13918,F,N,Y,Y,N,3.4,279.0,3.53,143.0,671.0,113.15,72.0,136.0,10.9,3.0


In [3]:
cirr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             418 non-null    int64  
 1   Age            418 non-null    int64  
 2   Sex            418 non-null    object 
 3   Ascites        312 non-null    object 
 4   Hepatomegaly   312 non-null    object 
 5   Spiders        312 non-null    object 
 6   Edema          418 non-null    object 
 7   Bilirubin      418 non-null    float64
 8   Cholesterol    284 non-null    float64
 9   Albumin        418 non-null    float64
 10  Copper         310 non-null    float64
 11  Alk_Phos       312 non-null    float64
 12  SGOT           312 non-null    float64
 13  Tryglicerides  282 non-null    float64
 14  Platelets      407 non-null    float64
 15  Prothrombin    416 non-null    float64
 16  Stage          412 non-null    float64
dtypes: float64(10), int64(2), object(5)
memory usage: 55.6

### Missing Values

>__Task 1__
>
>Count the number of NAs for each column

In [None]:
...

Several columns contain missing values. Let's deal with them.

>__Task 2__
>
>- Drop rows with missing values for string columns: __Sex__, __Ascites__, __Hepatomegaly__, __Spiders__, __Edema__ 
>- Fill missing values with mean for numerical columns: __Bilirubin__, __Cholesterol__, __Albumin__, __Copper__, __Alk_Phos__, __SGOT__, __Tryglicerides__, __Platelets__, __Prothrombin__
>- Drop any rows where the target variable __Stage__ is missing
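
As a rough illustration of the pandas calls involved, here is a minimal sketch on a small made-up frame. The column names `group`, `value`, and `target` and all values are hypothetical and are not part of the cirrhosis data. The sketch assumes the mean is computed after the rows have been dropped, which is one possible reading of the task.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns and values (not the cirrhosis data)
toy = pd.DataFrame({
    "group": ["a", None, "b", "a"],    # string column with one gap
    "value": [1.0, 2.0, np.nan, 3.0],  # numeric column with one gap
    "target": [1.0, 2.0, 1.0, np.nan]  # target column with one gap
})

# Count the missing entries in each column
na_counts = toy.isna().sum()
print(na_counts)

# Drop rows where the string column or the target is missing
cleaned = toy.dropna(subset=["group", "target"]).copy()

# Fill the remaining numeric gaps with the column mean
# (assumption: the mean is taken over the rows that survived the drop)
cleaned["value"] = cleaned["value"].fillna(cleaned["value"].mean())
print(cleaned)
```

Computing the mean before dropping rows would give a different fill value, so it is worth deciding on the order deliberately.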

In [None]:
...

### Categorical Data

The data set contains many categorical variables. Let's convert them.

>__Task 3__
>
>- Check the unique values and their counts in each of the columns: __Sex__, __Ascites__, __Hepatomegaly__, __Spiders__, __Edema__
>- Convert variables into numerical ones (Hint: create a dictionary and use `.replace` for binary categorical variables, and use `pd.get_dummies` for multi-categorical variables)
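
The two encoding approaches from the hint can be sketched on a toy frame. The columns `flag` and `level` and their values are hypothetical. For the binary column this sketch swaps in `.map` with a dictionary instead of the `.replace` call named in the hint, because `.map` returns an integer column directly. The multi-category column goes through `pd.get_dummies`.

```python
import pandas as pd

# Toy frame with hypothetical columns (not the cirrhosis data)
toy = pd.DataFrame({
    "flag": ["Y", "N", "Y"],          # binary categorical
    "level": ["low", "mid", "high"],  # more than two categories
})

# Inspect the distinct values and how often each occurs
print(toy["flag"].value_counts())
print(toy["level"].value_counts())

# Binary column: map each category to 0/1 with a dictionary
toy["flag"] = toy["flag"].map({"N": 0, "Y": 1})

# Multi-category column: one indicator column per category
encoded = pd.get_dummies(toy, columns=["level"])
print(encoded)
```

`pd.get_dummies` names the new columns `<column>_<category>`, which matches the `Edema_N`, `Edema_S`, `Edema_Y` pattern in the expected output of this exercise.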

In [None]:
...

You can double-check the results by printing the meta information again:

In [11]:
cirr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 312 entries, 0 to 311
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             312 non-null    int64  
 1   Age            312 non-null    int64  
 2   Sex            312 non-null    int64  
 3   Ascites        312 non-null    int64  
 4   Hepatomegaly   312 non-null    int64  
 5   Spiders        312 non-null    int64  
 6   Bilirubin      312 non-null    float64
 7   Cholesterol    312 non-null    float64
 8   Albumin        312 non-null    float64
 9   Copper         312 non-null    float64
 10  Alk_Phos       312 non-null    float64
 11  SGOT           312 non-null    float64
 12  Tryglicerides  312 non-null    float64
 13  Platelets      312 non-null    float64
 14  Prothrombin    312 non-null    float64
 15  Stage          312 non-null    float64
 16  Edema_N        312 non-null    uint8  
 17  Edema_S        312 non-null    uint8  
 18  Edema_Y   

### Visualization

>__Task 4__
>
> Create a scatter plot of the __Bilirubin__ and __Cholesterol__ variables labelled with __Stage__

In [None]:
...

### Data Scaling

The k-means algorithm computes the distance between data points and cluster centers. When certain features have a significantly larger scale than others, they can disproportionately influence the distance calculations, potentially leading to suboptimal clustering outcomes. Scaling the features helps alleviate this issue.
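
To make the effect concrete, the sketch below compares Euclidean distances before and after min-max scaling on four made-up points. The array `X_toy` and the helper `euclidean` are illustrative assumptions, not part of the exercise data. The second feature has a range of 1000 and the first a range of 10, so the second feature dominates the raw distances.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical points: feature 0 spans 0-10, feature 1 spans 1000-2000
X_toy = np.array([
    [0.0, 1000.0],   # A
    [10.0, 1050.0],  # B: far from A in feature 0, close in feature 1
    [1.0, 1200.0],   # C: close to A in feature 0, farther in feature 1
    [10.0, 2000.0],  # D: sets the upper end of both ranges
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(a - b))

# Raw distances: feature 1 dominates, so B looks closer to A than C does
d_ab_raw = euclidean(X_toy[0], X_toy[1])  # about 50.99
d_ac_raw = euclidean(X_toy[0], X_toy[2])  # about 200.00

# Rescale every feature to the [0, 1] range
X_toy_scaled = MinMaxScaler().fit_transform(X_toy)

# Scaled distances: both features count equally, so C is now closer to A
d_ab_scaled = euclidean(X_toy_scaled[0], X_toy_scaled[1])  # about 1.001
d_ac_scaled = euclidean(X_toy_scaled[0], X_toy_scaled[2])  # about 0.224

print(d_ab_raw, d_ac_raw)
print(d_ab_scaled, d_ac_scaled)
```

The nearest neighbour of point A changes from B to C once the features are on a common scale, which is the kind of shift that can change k-means assignments.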

>__Task 5__
>
>- Define `X` by dropping the __Stage__ and __ID__ columns
>- Define `y` as the __Stage__ column
>- Define `X_scaled` by scaling `X` using `MinMaxScaler`

In [None]:
...

## K-means Clustering

>__Task 6__
>
>- Apply k-means to `X` with 4 clusters and print coordinates of cluster centers
>- Print labels of each point
>- Create a scatter plot of __Bilirubin__ and __Cholesterol__ variables labelled with point labels
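
For reference, a minimal sketch of the scikit-learn k-means interface on synthetic data. The two Gaussian blobs, the variable names, and the choice of 2 clusters are assumptions made for illustration only. The exercise itself uses 4 clusters on the cirrhosis features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated 2-D blobs of 20 points each (hypothetical)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# Fit k-means; random_state fixes the initialisation for reproducibility
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

centers = km.cluster_centers_  # one row of coordinates per cluster
labels = km.labels_            # cluster index assigned to each point

print(centers)
print(labels)
```

The label values themselves are arbitrary (cluster 0 and cluster 1 may swap between runs with a different seed). Only the grouping carries meaning.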

In [None]:
...

>__Task 7__
>
>- Apply k-means to `X_scaled` with 4 clusters and print coordinates of cluster centers
>- Print labels of each point
>- Create a scatter plot of __Bilirubin__ and __Cholesterol__ variables labelled with point labels

In [None]:
...

The cluster assignments differ between the unscaled and scaled variables, because scaling changes how much each feature contributes to the distance calculation.

## Hierarchical Clustering

Let's take a sample of 25 from `X` and use hierarchical clustering to cluster them:

In [20]:
np.random.seed(156)
X_sample = X.sample(25, random_state=146)

>__Task 8__
>
>- Apply hierarchical clustering to `X_sample` with each of the linkage criteria: single, complete, average, centroid, and ward
>- Plot a dendrogram of each criterion and set `orientation` to `left`
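
One way to run the five criteria is with SciPy's `linkage` and `dendrogram` functions, sketched here on random toy points. The array `points` and the dictionary `linkages` are hypothetical names. The sketch passes `no_plot=True` so that it runs without a display. Dropping that argument draws the tree on the current matplotlib axes.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical sample: 8 observations with 3 features each
rng = np.random.default_rng(0)
points = rng.normal(size=(8, 3))

# Build one linkage matrix per criterion (Euclidean distance by default)
linkages = {}
for method in ["single", "complete", "average", "centroid", "ward"]:
    linkages[method] = linkage(points, method=method)

# Each linkage matrix has n - 1 rows: one per merge step
for method, Z in linkages.items():
    print(method, Z.shape)

# Compute the dendrogram layout with leaves on the left-oriented axis;
# no_plot=True returns the layout without drawing anything
tree = dendrogram(linkages["ward"], orientation="left", no_plot=True)
print(tree["ivl"])  # leaf labels in the order they appear in the tree
```

Each row of a linkage matrix records the two clusters merged, the distance at which they merged, and the size of the new cluster.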

In [None]:
# Single
...

In [None]:
# Complete
...

In [None]:
# Average
...

In [None]:
# Centroid
...

In [None]:
# Ward
...

### Agglomerative Clustering

Now, let's proceed with clustering the entire data set using agglomerative clustering from scikit-learn.

>__Task 9__
>
>- Apply agglomerative clustering with 4 clusters
>- Define `cluster_labels` as the result of each sample’s clustering assignment
>- Create a scatter plot of __SGOT__ and __Albumin__ variables labelled with `cluster_labels`

In [None]:
...

>__Task 10__
>
>- Apply agglomerative clustering with 4 clusters and `complete` linkage
>- Define `cluster_labels` as the result of each sample’s clustering assignment
>- Create a scatter plot of __SGOT__ and __Albumin__ variables labelled with `cluster_labels`
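
The scikit-learn interface for the two tasks above can be sketched on synthetic data. The blobs, the variable names, and the use of 2 clusters are illustrative assumptions. `AgglomerativeClustering` uses Ward linkage when no `linkage` argument is given, and `fit_predict` returns one cluster index per sample.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic data: two well-separated 2-D blobs of 20 points each (hypothetical)
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# Default linkage (ward)
ward_labels = AgglomerativeClustering(n_clusters=2).fit_predict(points)

# Complete linkage: cluster distance is the largest pairwise point distance
complete_labels = AgglomerativeClustering(
    n_clusters=2, linkage="complete"
).fit_predict(points)

print(ward_labels)
print(complete_labels)
```

On data this cleanly separated both criteria recover the same grouping. On real data the choice of linkage can change the assignments noticeably.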

In [None]:
...