## Tab2Onto: Unsupervised Semantification of Lymphography

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### a) Data Preprocessing:
In this step, we use the [Vectograph library](https://github.com/dice-group/vectograph) to convert the Lymphography data from tabular format into a knowledge graph (RDF triples).

* For installation and usage of Vectograph, please follow the instructions described in https://github.com/dice-group/vectograph 

In [None]:
python main.py --tabularpath "data/Lymphography/lympho.csv" --kg_name "lymphograph-KG.nt" --num_quantile=10 --min_unique_val_per_column=12

* A preprocessed file of the Lymphography data can be found in `data/Lymphography/preprocessed/lymphograph-triples.nt`
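
To make the tabular-to-triples idea concrete, here is a minimal pandas sketch. It is not Vectograph's implementation. The function `table_to_triples`, the `row_<id>` subject naming, and the reading of the two flags (numeric columns with at least `min_unique_val_per_column` distinct values are binned into `num_quantile` quantiles) are assumptions made for illustration only.

```python
import pandas as pd


def table_to_triples(df, num_quantile=10, min_unique_val_per_column=12):
    """Turn each table cell into a (subject, predicate, object) triple.

    Assumption: numeric columns with at least `min_unique_val_per_column`
    distinct values are discretized into `num_quantile` quantile bins first.
    """
    df = df.copy()
    for col in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if is_numeric and df[col].nunique() >= min_unique_val_per_column:
            # Replace raw values with the index of their quantile bin.
            df[col] = pd.qcut(df[col], q=num_quantile, labels=False, duplicates="drop")

    triples = []
    for row_id, row in df.iterrows():
        for col, value in row.items():
            # Subject = row, predicate = column name, object = column-qualified value.
            triples.append((f"row_{row_id}", str(col), f"{col}_{value}"))
    return triples


# Toy table (hypothetical data, not the Lymphography dataset).
toy = pd.DataFrame({"age": list(range(20)), "sex": ["m", "f"] * 10})
triples = table_to_triples(toy, num_quantile=4)
print(len(triples))
print(triples[:2])
```

Each row becomes a subject, each column a predicate, and each (possibly binned) cell value an object, which is the general shape of data a KG embedding model consumes.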

### b) Knowledge Graph Embedding: 
* We used the [DAIKIRI-Embedding library](https://github.com/dice-group/DAIKIRI-Embedding) to generate KG embeddings for the Lymphography dataset
* DAIKIRI-Embedding can be installed and used by following the instructions provided in https://github.com/dice-group/DAIKIRI-Embedding
* A preprocessed file of lymphography embeddings can be found in `data/Lymphography/preprocessed/QMult_entity_embeddings`

### c) Clustering
* We used the K-means clustering provided in our package [DAIKIRI-Clustering](https://github.com/dice-group/DAIKIRI-Clustering)
* For further details about the package installation, please check https://github.com/dice-group/DAIKIRI-Clustering
* A preprocessed file of the Lymphography clustering results can be found in `data/Lymphography/preprocessed/kmeans_Clusters`
 

In [None]:
lympho_df= pd.read_csv('./lymphograph-raw.csv', header=0, index_col=['patient'])
y_true=lympho_df['class'].tolist()

encoder = LabelEncoder()
y_all = encoder.fit_transform(y_true)
labels = encoder.classes_.tolist()

In [None]:
features_df= pd.read_csv('./data/Lymphograph/QMult_entity_embeddings.csv', header=0,index_col=0)

X_train, X_test, y_train, y_test = train_test_split(features_df, y_all, test_size=0.20, random_state=100)

In [None]:
kmeans = KMeans(n_clusters=3, random_state=103).fit(X_train)
Kmeans_clusters= kmeans.predict(X_test)

### d) Entity Typing: (Human-In-the-Loop)
* We developed [LabENT](https://github.com/dice-group/LabENT), a web application that incorporates a human in the loop to assign labels to the computed clusters.
* We recommend that users install and try LabENT. More details can be found at https://github.com/dice-group/LabENT
* The LabENT demo allows users to upload two input files, labeled_clusters and clustering_results, to generate ontologies.
* The generated ontology can be found in `data/Lymphography/preprocessed/DAIKIRI-Lympho.owl`
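
The sketch below illustrates how cluster-level labels turn into entity types: once a cluster has a label, every entity in that cluster inherits it. The entity names, the cluster-to-label dictionary, and the column names are hypothetical and do not reflect LabENT's actual file formats.

```python
import pandas as pd

# Hypothetical clustering output: one cluster id per entity.
clustering_results = pd.DataFrame(
    {"entity": ["p1", "p2", "p3", "p4"], "cluster": [0, 1, 0, 2]}
).set_index("entity")

# Hypothetical labels chosen by a human annotator for each cluster.
labeled_clusters = {0: "metastases", 1: "malign_lymph", 2: "fibrosis"}

# Every entity inherits the label assigned to its cluster.
entity_types = clustering_results["cluster"].map(labeled_clusters)
print(entity_types.to_dict())
```

The annotator only has to label as many items as there are clusters, rather than one label per entity.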

## Evaluation: 

### Evaluating Tab2Onto (Kmeans with ConEx)

In [None]:
## Evaluation of Embedding-based Clustering (Kmeans, with ConEx embeddings) ###

#----------- Evaluation based on Precision, Recall, Accuracy and F1-score: -------#
accuracy = accuracy_score(y_test, Kmeans_clusters)
print('Accuracy: %f' % accuracy)

precision = precision_score(y_test, Kmeans_clusters, average='weighted')
print('Precision: %f' % precision)

recall = recall_score(y_test, Kmeans_clusters, average='weighted')
print('Recall: %f' % recall)

f1 = f1_score(y_test, Kmeans_clusters, average='weighted')
print('F1 score: %f' % f1)

`Accuracy: 0.666667`

`Precision: 0.818182`

`Recall: 0.666667`

`F1 score: 0.728395`
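
K-means cluster indices are arbitrary, so scores computed directly against encoded class labels depend on how the indices happen to line up. The following minimal sketch shows one way to align them before scoring, by relabeling each cluster with its most frequent true class. The helper `align_clusters_to_classes` and the toy arrays are illustrative assumptions, not part of the Tab2Onto pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score


def align_clusters_to_classes(cluster_ids, y_true):
    """Relabel each cluster with the most frequent true class inside it."""
    df = pd.DataFrame({"cluster": cluster_ids, "y": y_true})
    majority = df.groupby("cluster")["y"].agg(lambda s: s.mode().iloc[0])
    return df["cluster"].map(majority).to_numpy()


# Toy data: the clustering is perfect, but its ids are a permutation of the class ids.
y_toy = np.array([0, 0, 1, 1, 2, 2])
clusters_toy = np.array([2, 2, 0, 0, 1, 1])

raw_acc = accuracy_score(y_toy, clusters_toy)
aligned = align_clusters_to_classes(clusters_toy, y_toy)
aligned_acc = accuracy_score(y_toy, aligned)
print(raw_acc, aligned_acc)
```

Because this mapping is derived from the true labels, it gives an optimistic score. In the pipeline described above, the cluster labels would come from the human annotator instead.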

### Evaluating Tab2Onto against a Supervised Baseline (Logistic Regression)

In [10]:
from sklearn.linear_model import LogisticRegression

logistic_clf = LogisticRegression(solver='liblinear',random_state=103).fit(X_train, y_train.ravel())
y_lr = logistic_clf.predict(X_test)

In [None]:
#----------- Evaluation based on Precision, Recall, Accuracy and F1-score: -------#
accuracy = accuracy_score(y_test, y_lr)
print('Accuracy: %f' % accuracy)

precision = precision_score(y_test, y_lr, average='weighted')
print('Precision: %f' % precision)

recall = recall_score(y_test, y_lr, average='weighted')
print('Recall: %f' % recall)

f1 = f1_score(y_test, y_lr, average='weighted')
print('F1 score: %f' % f1)

`Accuracy: 0.833333`

`Precision: 0.814992`

`Recall: 0.833333`

`F1 score: 0.818254`

### Evaluating Tab2Onto against Random Labeling w.r.t. Class Distribution

In [None]:
from collections import Counter

# Compute sampling weights for the four encoded classes (0-3)
# from their relative frequency in the test set:
y_counts = Counter(y_test)
weights = [y_counts[i] / y_test.shape[0] for i in range(4)]

In [28]:
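# Draw one random label per test sample according to the class distribution
# (no seed is set, so the result changes between runs):
y_random_bala = np.random.choice([0, 1, 2, 3], size=y_test.shape[0], p=weights)

# Majority voting per cluster: treat each random label as a cluster and
# replace it with the most frequent true class among its samples.
df_tmp = pd.DataFrame({'y_random': y_random_bala, 'y_test': y_test})
y_random_bala = df_tmp.groupby('y_random').transform(lambda x: x.mode().iloc[0]).to_numpy().reshape(-1)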

In [29]:
#----------- Evaluation based on Precision, Recall, Accuracy and F1-score: -------#
accuracy = accuracy_score(y_test, y_random_bala)
print('Accuracy: %f' % accuracy)

precision = precision_score(y_test, y_random_bala, average='weighted')
print('Precision: %f' % precision)

recall = recall_score(y_test, y_random_bala, average='weighted')
print('Recall: %f' % recall)

f1 = f1_score(y_test, y_random_bala, average='weighted')
print('F1 score: %f' % f1)


`Accuracy: 0.533333`

`Precision: 0.487164`

`Recall: 0.533333`

`F1 score: 0.485556`
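
A single random draw is noisy, especially on a small test set. The sketch below shows one way to stabilize such a baseline by averaging over many seeded draws. The function `random_baseline_f1`, the number of runs, and the toy labels are assumptions for illustration. Unlike the cell above, it scores the sampled labels directly, without the majority-voting step.

```python
import numpy as np
from sklearn.metrics import f1_score


def random_baseline_f1(y_true, n_runs=100, seed=0):
    """Mean and std of weighted F1 for labels sampled from the class distribution."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y_true, return_counts=True)
    weights = counts / counts.sum()
    scores = []
    for _ in range(n_runs):
        # Sample one label per instance with class-frequency weights.
        y_rand = rng.choice(classes, size=len(y_true), p=weights)
        scores.append(f1_score(y_true, y_rand, average="weighted", zero_division=0))
    return float(np.mean(scores)), float(np.std(scores))


# Toy labels (hypothetical, not the Lymphography test split).
y_toy = np.array([0] * 10 + [1] * 15 + [2] * 5)
mean_f1, std_f1 = random_baseline_f1(y_toy)
print(f"F1: {mean_f1:.3f} +/- {std_f1:.3f}")
```

Reporting the mean together with the standard deviation shows how much of the gap to the random baseline could be explained by sampling noise alone.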