# TEAM 44 - HCC Survival (U05) - kdd cyberattack (K09)

## Part 1: UCI dataset

### Install the required libraries

In [None]:
%pip install pip --upgrade
%pip install scikit-learn --upgrade
%pip install numpy --upgrade
%pip install matplotlib --upgrade
%pip install imbalanced-learn --upgrade
%pip install pandas --upgrade

### Introduction and Overview

For this part we will use a UCI dataset, [HCC Survival](https://archive.ics.uci.edu/ml/datasets/HCC+Survival).

The HCC dataset was obtained at a University Hospital in Portugal and contains demographic, risk-factor, laboratory and overall survival features of 165 real patients diagnosed with HCC (hepatocellular carcinoma). The dataset contains 49 features selected according to the EASL-EORTC (European Association for the Study of the Liver - European Organisation for Research and Treatment of Cancer) Clinical Practice Guidelines, which are the current state of the art in the management of HCC.

This is a heterogeneous dataset, with 23 quantitative variables and 26 qualitative variables. Overall, missing data represents 10.22% of the whole dataset, and only eight patients (4.85%) have complete information in all fields. The target variable is survival at 1 year, encoded as a binary variable: 0 (dies) and 1 (lives). A certain degree of class imbalance is also present (63 cases labeled as dies and 102 as lives).

In [None]:
import pandas as pd
import numpy as np

# read data from the hcc-data.txt file (no header row) and parse "?" as NaN
df = pd.read_csv("resources/hcc-data.txt", header=None, na_values="?")

# print basic info about dataframe
print(df.info())

The only transformation applied to the original dataset is the replacement of missing values (denoted by "?") with NaN, done through the na_values parameter of the pandas.read_csv function.
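
As a minimal sketch of how `na_values` behaves, the snippet below parses a small in-memory excerpt instead of the real file. The three rows in `raw` are made up for illustration and only mimic the comma-separated, header-less layout assumed for `hcc-data.txt`.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt (not real patient data), no header row
raw = "1,0,?,67\n0,?,1,62\n1,1,0,?\n"

# Every "?" token is parsed as NaN, so affected columns become float
toy = pd.read_csv(io.StringIO(raw), header=None, na_values="?")

print(toy)
print("Missing values per column:", toy.isnull().sum().tolist())
```

Columns that contain at least one "?" are loaded as floating point, because NaN has no integer representation in a default pandas integer column.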

There are 165 instances and 49 features in the dataset. The feature types are as follows.

* Gender: nominal
* Symptoms: nominal
* Alcohol: nominal
* Hepatitis B Surface Antigen: nominal
* Hepatitis B e Antigen: nominal
* Hepatitis B Core Antibody: nominal
* Hepatitis C Virus Antibody: nominal
* Cirrhosis: nominal
* Endemic Countries: nominal
* Smoking: nominal
* Diabetes: nominal
* Obesity: nominal
* Hemochromatosis: nominal
* Arterial Hypertension: nominal
* Chronic Renal Insufficiency: nominal
* Human Immunodeficiency Virus: nominal
* Nonalcoholic Steatohepatitis: nominal
* Esophageal Varices: nominal
* Splenomegaly: nominal
* Portal Hypertension: nominal
* Portal Vein Thrombosis: nominal
* Liver Metastasis: nominal
* Radiological Hallmark: nominal
* Age at diagnosis: integer
* Grams of Alcohol per day: continuous
* Packs of cigarettes per year: continuous
* Performance Status: ordinal
* Encephalopathy degree: ordinal
* Ascites degree: ordinal
* International Normalised Ratio: continuous
* Alpha-Fetoprotein (ng/mL): continuous
* Haemoglobin (g/dL): continuous
* Mean Corpuscular Volume (fl): continuous
* Leukocytes (G/L): continuous
* Platelets (G/L): continuous
* Albumin (mg/dL): continuous
* Total Bilirubin (mg/dL): continuous
* Alanine transaminase (U/L): continuous
* Aspartate transaminase (U/L): continuous
* Gamma glutamyl transferase (U/L): continuous
* Alkaline phosphatase (U/L): continuous
* Total Proteins (g/dL): continuous
* Creatinine (mg/dL): continuous
* Number of Nodules: integer
* Major dimension of nodule (cm): continuous
* Direct Bilirubin (mg/dL): continuous
* Iron (mcg/dL): continuous
* Oxygen Saturation (%): continuous
* Ferritin (ng/mL): continuous

The first 23 features are nominal and have no inherent order.

The file has no header row with feature names and no row index column.

The last (50th) column holds the class label: survival at 1 year, encoded as a binary variable: 0 (dies) and 1 (lives).

In [42]:
# slice the dataframe to split the features (columns 0-48) from the labels (column 49)
labels_df = df.iloc[:, [49]]
labels = labels_df.values.ravel()  # flatten to 1-D without hard-coding the row count
features_df = df.iloc[:, 0:49]
features = features_df.values

# missing-value statistics
n_rows_missing = features_df.isnull().values.any(axis=1).sum()
print("Number of instances with missing values:", n_rows_missing)
print(f'Percentage of instances with missing values: {np.format_float_positional(n_rows_missing/features_df.shape[0]*100, 2)}%')
print(f'Percentage of missing values to total number of values: {np.format_float_positional(features_df.isnull().values.sum()/features_df.size*100, 2)}%')

# class distribution
class_counts = np.bincount(labels)
print("Class frequencies:", class_counts)
print(f'Percentage of negative instances: {np.format_float_positional(class_counts[0]/labels.shape[0]*100, 2)}%')
print(f'Percentage of positive instances: {np.format_float_positional(class_counts[1]/labels.shape[0]*100, 2)}%')
print("Class frequency ratio:", np.format_float_positional(class_counts.max()/class_counts.min(), 2))

Number of instances with missing values: 157
Percentage of instances with missing values: 95.15%
Percentage of missing values to total number of values: 10.22%
Class frequencies: [ 63 102]
Percentage of negative instances: 38.18%
Percentage of positive instances: 61.82%
Class frequency ratio: 1.62


The dataset contains missing values: 157 instances (95.15% of all instances) have at least one missing value, and missing entries make up 10.22% of all values.

There are 63 cases labeled as dies and 102 as lives (38.18% and 61.82% respectively). We treat this as a class-imbalanced binary dataset: the split is more uneven than 60%-40%, which is equivalent to the class frequency ratio (1.62) exceeding 1.5.
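
Given this imbalance, one optional way to keep the class proportions similar in both splits is to pass `stratify` to `train_test_split`. The notebook's own split does not do this; the following is only a sketch on synthetic labels with the same class counts, and the names `X_toy` and `y_toy` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts reported above: 63 zeros, 102 ones
y_toy = np.array([0] * 63 + [1] * 102)
X_toy = np.arange(165).reshape(-1, 1)

# stratify=y_toy asks for (approximately) equal class shares in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("Train class shares:", np.bincount(y_tr) / len(y_tr))
print("Test class shares:", np.bincount(y_te) / len(y_te))
```

With only 50 test instances, stratification mainly guards against an unlucky split in which the minority class is noticeably under-represented.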

### Preparation

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)

In [50]:
from sklearn.impute import SimpleImputer
# imp1: mode imputation, imp2: mean imputation
imp1 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp2 = SimpleImputer(missing_values=np.nan, strategy='mean')

# mask1: nominal (0-22), integer (23, 43) and ordinal (26-28) columns
mask1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 26, 27, 28, 43]
# mask2: continuous columns
mask2 = [24, 25, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48]
# fit on the training split only, so no test-set statistics leak into the imputers
imp1.fit(X_train[:, mask1])
imp2.fit(X_train[:, mask2])

# apply the fitted imputers to both splits
X_train[:, mask1] = imp1.transform(X_train[:, mask1])
X_train[:, mask2] = imp2.transform(X_train[:, mask2])
X_test[:, mask1] = imp1.transform(X_test[:, mask1])
X_test[:, mask2] = imp2.transform(X_test[:, mask2])

For the missing values we use the SimpleImputer class from the sklearn.impute module with two strategies. The most frequent strategy replaces missing values with the most frequent value of each column, while the mean strategy replaces them with the mean value of each column.

We apply the most frequent strategy to the nominal, integer and ordinal features, where a mean value would not be meaningful, and the mean strategy to the continuous features. Both imputers are fitted on the training set only and then applied to the training and test sets.
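
The same per-column-group idea can also be expressed with a single `ColumnTransformer`. This is an alternative sketch rather than the approach used in this notebook, shown on a made-up two-column matrix so that the imputed values are easy to check by hand.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy matrix: column 0 is categorical (coded 0/1), column 1 is continuous
X_toy = np.array([
    [1.0, 10.0],
    [1.0, np.nan],
    [np.nan, 20.0],
    [0.0, 30.0],
])

# One imputer per column group; output columns follow the transformer order
imputer = ColumnTransformer([
    ("mode", SimpleImputer(strategy="most_frequent"), [0]),
    ("mean", SimpleImputer(strategy="mean"), [1]),
])

X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

Note that `ColumnTransformer` returns the columns grouped by transformer, so with interleaved column groups (as in the HCC data) the output column order differs from the input order.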

In [52]:
train_data = pd.DataFrame(X_train)
test_data = pd.DataFrame(X_test)

# concatenate the train and test data and create dummy features for the 23 nominal columns
# dtype=float keeps the resulting array numeric (newer pandas versions default to bool dummies)
data = pd.concat([train_data, test_data], axis=0)
data = pd.get_dummies(data, columns=list(range(23)), dtype=float)

# split back into the original train and test rows (70-30 ratio)
n_train = len(train_data)
X_train = data.iloc[:n_train, :].values
X_test = data.iloc[n_train:, :].values

For the nominal features we use the get_dummies function from the pandas module, which converts categorical variables into dummy/indicator variables.

We had to concatenate the train and test data so that both sets end up with the same dummy/indicator columns. We then split the data back into the train and test sets, preserving the initial 70-30 split.
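
An alternative that avoids concatenating the two sets is scikit-learn's `OneHotEncoder`, fitted on the training data only. The sketch below is not what this notebook does; it uses a made-up nominal column in which the test split contains a category never seen during training.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy nominal column: category 2 appears only in the test split
train_col = np.array([[0], [1], [1], [0]])
test_col = np.array([[1], [2]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
train_ohe = encoder.fit_transform(train_col).toarray()
test_ohe = encoder.transform(test_col).toarray()

print(train_ohe.shape, test_ohe.shape)
print(test_ohe)
```

Both outputs have one column per category learned from the training data, so the train and test matrices stay aligned without looking at the test set during fitting.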

### Classification

In [None]:
# import NaiveBayesClassifier
# from sklearn.naive_bayes import GaussianNB
# gnb = GaussianNB()
# gnb.fit(X_train, y_train)
# y_pred = gnb.predict(X_test)
# print(y_test)
# # import metrics
# from sklearn import metrics
# print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from imblearn.pipeline import Pipeline

# import the familiar preprocessing classes
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler # import StandardScaler as a transformer that has .transform, not the scale() function
from imblearn.over_sampling import RandomOverSampler
from sklearn.decomposition import PCA

# initialize the estimator (classifier) and the transformers without hyperparameters
selector = VarianceThreshold()
scaler = StandardScaler()
ros = RandomOverSampler()
pca = PCA()

## Part 2: Kaggle dataset

The Kaggle dataset is [kdd cyberattack](https://www.kaggle.com/datasets/slashtea/kdd-cyberattack).