# Unsupervised Semantification on FB15k-237

## The semantification process has four steps: Data Preprocessing, KG Embedding, Clustering, and Entity Typing

### Data Preprocessing:
We skip the data preprocessing step, since the FB15k-237 dataset is already in knowledge graph format (RDF triples). This step is only required if your input is tabular data.

## KG Embedding:

* In this step, we train a knowledge graph embedding model (e.g., TransE) to learn vector representations of entities and their relations.
* We use the GraphVite embedding library to train TransE on FB15k-237. For our experiments, we provide our pre-trained model in 'data/pre-trained'.
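To make the idea behind TransE concrete, here is a minimal sketch of its scoring rule: a triple (head, relation, tail) is plausible when `head + relation` lies close to `tail` in the embedding space. The function name `transe_score` and the toy 2-dimensional vectors are assumptions for illustration only; they are not part of GraphVite or of the pre-trained model. The sketch uses the L2 norm, while the L1 norm is also commonly used.

```python
import math

def transe_score(head, relation, tail):
    """Return the negative L2 distance ||head + relation - tail||.

    A higher (less negative) score means a more plausible triple.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Toy 2-dimensional vectors (hypothetical values, not from the pre-trained model).
paris = [1.0, 0.0]
capital_of = [0.0, 1.0]
france = [1.0, 1.0]
germany = [3.0, 1.0]

good = transe_score(paris, capital_of, france)   # head + relation lands exactly on tail
bad = transe_score(paris, capital_of, germany)   # head + relation is 2 units away from tail
print(good > bad)
```

Training adjusts the vectors so that observed triples score higher than corrupted ones, which is why entities that take part in similar relations tend to end up near each other.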

In [None]:
import pickle
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [None]:
PATH_TRANS_E = 'data/pre-trained/transE_fb15k_256dim.pkl'
BASE_PATH_TRUTH = 'data/FB15k-237'

# load the pre-trained TransE model for FB15k-237 (see PATH_TRANS_E).
with open(PATH_TRANS_E, "rb") as fin:
    model = pickle.load(fin)
    
entity2id = model.graph.entity2id
relation2id = model.graph.relation2id

entity_embeddings = model.solver.entity_embeddings
relation_embeddings = model.solver.relation_embeddings

entity_embeddings.shape

#extract ground-truth types:
fb_train=pd.read_csv(BASE_PATH_TRUTH + '/train.txt', sep='\t', header=None, index_col=0)
fb_valid=pd.read_csv(BASE_PATH_TRUTH + '/valid.txt', sep='\t', header=None, index_col=0)
fb_test=pd.read_csv(BASE_PATH_TRUTH + '/test.txt', sep='\t', header=None, index_col=0)

fb_df=pd.concat([fb_train, fb_valid, fb_test])
fb_df['type']= fb_df[1].apply(lambda x: x.split('/')[1])

#combine entities with their types:

ground_truth={}
for entity_id in entity2id.keys():
    if entity_id in fb_df.index:
        if isinstance(fb_df.loc[entity_id, 'type'], pd.Series):
            # several triples share this head entity: take the type of the first one
            ground_truth[entity_id]=fb_df.loc[entity_id, 'type'].iloc[0]
        else:
            ground_truth[entity_id]=fb_df.loc[entity_id, 'type']
    else:
        ground_truth[entity_id]='unknown' # for missed types

In [None]:
#keep only the most common types in the FB15k-237 dataset:
entity_embedding_filter=[]
y_true_filter=[]

top_types=['people', 'film', 'location', 'music', 'soccer', 'education']

for k, value in ground_truth.items():
    if value in top_types:        
        entity_embedding_filter.append(entity_embeddings[entity2id[k]])
        y_true_filter.append(value)
        
X_all = np.asarray(entity_embedding_filter)

#encode the type labels as integer class ids:
encoder = LabelEncoder()
y_all = encoder.fit_transform(y_true_filter)
labels = encoder.classes_.tolist()

## Clustering:
* In this step, we group entities with similar properties (i.e., similar embedding representations) into clusters. Entities in the same cluster are expected to share the same type.

* We employ density-based clustering (HDBSCAN) to detect clusters of entities based on their density in the embedding space.
* We use the hdbscan clustering library. For more information and installation instructions, see https://hdbscan.readthedocs.io/en/latest/index.html

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

HDBSCAN requires two main hyper-parameters: 1) epsilon, which specifies the radius of the neighbourhood around a point, and 2) min_samples, the minimum number of points that must fall within that neighbourhood for the point to be considered a core point. We use the elbow approach to find a good value for epsilon.
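The elbow approach can also be automated instead of read off a plot. The sketch below is one common heuristic, not necessarily the procedure used for this notebook: it takes the sorted k-nearest-neighbour distances and picks the point farthest from the straight line joining the first and last points of the curve. The helper `find_elbow` and the toy distances are hypothetical.

```python
def find_elbow(sorted_values):
    """Return the index of the point farthest from the chord that joins
    the first and last points of an ascending curve."""
    n = len(sorted_values)
    x0, y0 = 0, sorted_values[0]
    x1, y1 = n - 1, sorted_values[-1]
    dx, dy = x1 - x0, y1 - y0
    best_index, best_gap = 0, -1.0
    for i, y in enumerate(sorted_values):
        # Unnormalised perpendicular distance from (i, y) to the chord.
        gap = abs(dy * (i - x0) - dx * (y - y0))
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index

# Toy k-distance curve (hypothetical values): nearly flat, then a sharp rise.
k_distances = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.60, 1.20]
elbow = find_elbow(k_distances)
epsilon_candidate = k_distances[elbow]
print(elbow, epsilon_candidate)
```

The result depends on the relative scale of the two axes, so it is best treated as a starting point to be checked against the plot.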

In [None]:
from sklearn.neighbors import NearestNeighbors

# plot the sorted k-nearest-neighbour distances to locate the elbow for cluster epsilon
neigh = NearestNeighbors(n_neighbors=5)
nbrs = neigh.fit(X_all)
distances, indices = nbrs.kneighbors(X_all)

distances = np.sort(distances, axis=0)
distances = distances[:,-1]
plt.plot(distances)

In [None]:
%%time
import hdbscan
from sklearn.metrics.pairwise import pairwise_distances

# compute the distance between entities using cosine
X_all_double=X_all.astype(np.double)
distance_matrix = pairwise_distances(X_all_double, metric='cosine')


hdbscan_clusterer=hdbscan.HDBSCAN(algorithm='best', alpha=0.1, metric='precomputed', cluster_selection_method='leaf',
                                      min_samples=10, min_cluster_size=700, core_dist_n_jobs=-1,allow_single_cluster=True,
                                      cluster_selection_epsilon=0.9)



hdbscan_clusterer.fit(distance_matrix)

cluster_labels= hdbscan_clusterer.labels_
cluster_probabilities=hdbscan_clusterer.probabilities_

## Entity Typing:

#### Sampling Entities for Labeling:
In the following, we present our strategy for selecting entities based on their cluster membership:
* We compute the cluster membership probabilities for all entities (cluster_probabilities). For each cluster, we select entities with a probability >= 0.9 for labeling.
* We present the selected entities (with their RDF triples) to human experts for labeling.
* Finally, we propagate the most frequent type in each cluster to all entities in that cluster.
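The sampling and propagation steps above can be sketched in plain Python. The helper names `select_for_labeling` and `propagate_types`, the toy inputs, and the expert answers are hypothetical. Skipping noise points is also an assumption of this sketch (HDBSCAN assigns them the label -1).

```python
from collections import Counter, defaultdict

def select_for_labeling(cluster_labels, probabilities, threshold=0.9):
    """Group the indices of high-confidence entities by cluster, skipping noise (-1)."""
    selected = defaultdict(list)
    for index, (cluster, prob) in enumerate(zip(cluster_labels, probabilities)):
        if cluster != -1 and prob >= threshold:
            selected[cluster].append(index)
    return dict(selected)

def propagate_types(cluster_labels, expert_types):
    """Assign each cluster's most frequent expert label to all of its entities.

    expert_types maps an entity index to the type chosen by a human expert.
    Entities in clusters without any labeled member receive None.
    """
    votes = defaultdict(Counter)
    for index, entity_type in expert_types.items():
        votes[cluster_labels[index]][entity_type] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [majority.get(c) for c in cluster_labels]

# Toy clustering output (hypothetical values).
toy_labels = [0, 0, 0, 1, 1, 1, -1]
toy_probs = [0.95, 0.92, 0.40, 1.00, 0.91, 0.93, 0.00]
to_label = select_for_labeling(toy_labels, toy_probs)

# Hypothetical expert answers for the selected entities.
expert_types = {0: 'film', 1: 'film', 3: 'people', 4: 'people', 5: 'music'}
propagated = propagate_types(toy_labels, expert_types)
print(to_label, propagated)
```

Entity 2 was never shown to an expert but still receives the type `film`, because it belongs to the cluster whose labeled members were mostly typed as `film`.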

### t-SNE Visualization of Labeled Entities:

In [None]:
# propagate the most frequent type in the cluster to all entities. 
df_tmp = pd.DataFrame({'pred_hdbscan': cluster_labels, 'y_all': y_all})  # cluster_labels: HDBSCAN assignments computed above
pred_hdbscan = df_tmp.groupby('pred_hdbscan').transform(lambda x: x.mode().iloc[0]).to_numpy().reshape(-1)

plt.figure(figsize=(6, 5))
X_2d = TSNE(random_state=42).fit_transform(X_all)
label_ids = range(len(labels))
colors=['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown']

for i, c, label in zip(label_ids, colors, labels):    
    plt.scatter(X_2d[pred_hdbscan == i, 0], X_2d[pred_hdbscan == i, 1], c=c, label=label, s=1)

plt.legend()    
plt.savefig('/src/Figures/fb15k-transE-hdbscan.png', dpi=600, bbox_inches='tight',pad_inches=0)    
plt.show()

## Evaluation:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import os
os.chdir('../../')

from clustering_evaluation import ClusterPurity
evaluator=ClusterPurity()

In [None]:
accuracy = accuracy_score(y_all, pred_hdbscan)
print('Accuracy: %f' % accuracy)

precision = precision_score(y_all, pred_hdbscan, zero_division=0, average='weighted')
print('Precision: %f' % precision)

recall = recall_score(y_all, pred_hdbscan, average='weighted')
print('Recall: %f' % recall)

f1 = f1_score(y_all, pred_hdbscan, average='weighted')
print('F1 score: %f' % f1)

print('Purity:', evaluator.purity_score(y_true=y_all, y_pred=pred_hdbscan))
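For reference, cluster purity is usually defined as the fraction of entities that carry the majority true label of their cluster. The sketch below implements that standard definition. It is an assumption that `ClusterPurity.purity_score` computes the same quantity, since that module is not shown here. The function `purity` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def purity(y_true, y_pred):
    """Return the share of samples whose true label is the majority label of their cluster."""
    members = defaultdict(Counter)
    for true_label, cluster in zip(y_true, y_pred):
        members[cluster][true_label] += 1
    majority_total = sum(counts.most_common(1)[0][1] for counts in members.values())
    return majority_total / len(y_true)

# Toy labels (hypothetical): cluster 0 is pure, cluster 1 mixes three classes.
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_purity = purity(toy_true, toy_pred)
print(toy_purity)
```

Here cluster 0 contributes 2 correctly grouped entities and cluster 1 contributes 2 (its majority class), giving a purity of 4 out of 6.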