# Introduction

We aim to apply knowledge graph embedding/completion approaches to tabular data. To this end, we follow this plan:

1. Convert the tabular data into a knowledge graph.
2. Apply knowledge graph embedding approaches, including Pyke and DistMult.
3. Evaluate the quality of newly learned vector representations.

We assume that the reader thoroughly understands the papers [1,2].


# Preliminaries

### Input Data Definition

Let $A \in P^{a \times b}$ be a matrix representing the input data, where $P$ denotes a value domain that subsumes

1. $\mathbb{R}$,
2. empty values (NaN),
3. time (date-time),
4. sequence of characters.

Note that I do not know how to mathematically define NaN, time, or
a sequence of characters that may include numerical values.
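
One possible way to make the domain precise, offered here only as an assumption and not as a definition taken from [1,2], is to treat $P$ as a plain union of value sets:

$$P = \mathbb{R} \cup \{\mathrm{NaN}\} \cup \mathcal{T} \cup \Sigma^{*}, \qquad A \in P^{a \times b},$$

where NaN is a distinguished symbol for a missing value, $\mathcal{T}$ is a set of time stamps, $\Sigma$ is a finite alphabet, and $\Sigma^{*}$ is the set of all finite strings over $\Sigma$ (which includes strings of digits). The symbols $\mathcal{T}$ and $\Sigma$ are introduced for illustration only.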


### Input Data Indexing

In this work, $A[i][j]$ denotes the entry in the $i$-th row and $j$-th column,
while $A[i][:]$ denotes the $i$-th row and all columns, i.e.,
$|A[i][:]|=b$. Similarly, $A[:][j]$ denotes all rows and the $j$-th column, i.e., $|A[:][j]|=a$.
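
The indexing convention can be illustrated with a small sketch in plain Python. The helper names `row` and `column` are hypothetical, the toy values are made up, and Python counts indices from 0. Note that for a nested Python list the expression `A[:][j]` returns row `j` rather than column `j`, so the column is extracted explicitly.

```python
# Toy matrix A with a = 3 rows and b = 2 columns (values are illustrative).
A = [
    [1.5, "x"],
    [2.5, "y"],
    [3.5, "z"],
]


def row(matrix, i):
    """Return the i-th row, i.e. A[i][:] in the notation of the text."""
    return list(matrix[i])


def column(matrix, j):
    """Return the j-th column, i.e. A[:][j] in the notation of the text."""
    return [r[j] for r in matrix]


a, b = len(A), len(A[0])

print(A[1][0])            # a single cell
print(len(row(A, 1)))     # number of columns, b
print(len(column(A, 1)))  # number of rows, a
```
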

# Problem definition


We would like to convert $A$ into a knowledge graph $K$ as defined in [1,2]. To this end, we propose the function

$$T: P^{a \times b} \to D,$$

where $D$ denotes the set of all possible (hence infinitely many) RDF knowledge graphs. As expected, $T$ is a composition of several other functions:

$$T(A) = (\phi \circ G \circ F_{c,d} \circ E)(A).$$

We elucidate $T$ in the following steps:

1. $E$ only adds the **EVENT_** prefix to the entries of $A[:][1]$.


2. $F$ generates new columns by applying a quantile-based discretization function with **d** quantiles to all columns that
    
    1. are numeric,

    2. have at least **c** unique values, i.e., $|set(A[:][j])| \geq c$.
    
3. $(F_{c,d} \circ E)(A)$ transforms the input $A$ into $TransformedA: a \times e$, where $e \geq b$ (equality holds only if no column qualifies for discretization).
    
    
4. $G$ fills the NaN values in $A[:][j]$ with a placeholder carrying the **j_dummy** suffix.


5. Hence, we now have $a \times e$ triples, one per cell of the transformed table.

6. Finally, $\phi$ denotes the selected knowledge graph embedding/completion approach. We omit the details of $\phi$.
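
The steps $E$, $F_{c,d}$ and $G$ can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the vectograph implementation: the function names `quantile_labels` and `table_to_triples` are hypothetical, the subject of each triple is taken to be the prefixed row number, the binning is a simple rank-based equal-frequency scheme, and the toy table is made up.

```python
import math


def is_nan(value):
    """True only for float NaN, which marks a missing cell in this sketch."""
    return isinstance(value, float) and math.isnan(value)


def quantile_labels(values, d):
    """Rank-based equal-frequency binning into d bins; None for missing cells."""
    # sorted() is stable, so ties are broken by row position.
    order = sorted((i for i, v in enumerate(values) if not is_nan(v)),
                   key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * d // len(order)
    return labels


def table_to_triples(rows, c, d):
    """Apply E, F_{c,d} and G to a list of row dicts and emit one triple per cell."""
    columns = list(rows[0].keys())
    table = {col: [r[col] for r in rows] for col in columns}

    # F_{c,d}: add '<col>_range' for numeric columns with at least c unique values.
    for col in columns:
        present = [v for v in table[col] if not is_nan(v)]
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present)
        if numeric and len(set(present)) >= c:
            table[col + "_range"] = [
                float("nan") if k is None else f"{col}_quantile_{k}"
                for k in quantile_labels(table[col], d)]

    # G: replace every missing cell by a column-specific dummy value.
    for col, values in table.items():
        table[col] = [f"{col}_dummy" if is_nan(v) else v for v in values]

    # E: every row becomes a subject with the EVENT_ prefix.
    subjects = [f"EVENT_{i}" for i in range(len(rows))]
    return [(subjects[i], col, table[col][i])
            for i in range(len(rows)) for col in table]


rows = [
    {"weight": 1.0, "city": "Bonn"},
    {"weight": 2.0, "city": float("nan")},
    {"weight": 3.0, "city": "Kiel"},
    {"weight": 4.0, "city": "Bonn"},
]
triples = table_to_triples(rows, c=3, d=2)
print(len(triples))  # a * e = 4 rows * 3 columns
```

In this toy run only `weight` is numeric with enough unique values, so one `weight_range` column is added and the table grows from $b=2$ to $e=3$ columns.
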

# Evaluation of the embeddings


[1,2] show that several metrics can quantify the quality of learned vector representations:

1. Head, relation and tail link prediction [2].

2. Type prediction and cluster purity [1].
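
Link prediction results are commonly summarized by the mean reciprocal rank (MRR) and Hits@k of the correct entity among all candidates. The following minimal sketch assumes these two summaries; the function names and the toy scores are hypothetical and are not taken from the vectograph code.

```python
def rank_of_target(scores, target):
    """1-based rank of `target` when candidates are ordered by descending score."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy scores for one tail-prediction query; "b" is assumed to be the correct tail.
scores = {"a": 0.9, "b": 0.5, "c": 0.1}
print(rank_of_target(scores, "b"))

# Toy ranks collected over three queries.
ranks = [1, 2, 4]
print(mean_reciprocal_rank(ranks), hits_at_k(ranks, 3))
```
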

Note that I would claim that tail prediction (given a subject and a predicate, predict the object) indirectly subsumes type prediction. However, [1] states that subjects/entities can have several types and includes this observation in its evaluation, whereas [2] does not address it.
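
To make the claim concrete, type prediction can be phrased as tail prediction with a type predicate. The sketch below uses the DistMult score, i.e. the sum over dimensions of the element-wise product of head, relation and tail vectors. The embeddings are made-up toy values, and the names `predict_tails`, `rdf_type` and the classes `Truck` and `Ship` are hypothetical.

```python
def distmult_score(head, relation, tail):
    """DistMult score: sum_i head[i] * relation[i] * tail[i]."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def predict_tails(head, relation, candidates, top_k=1):
    """Return the names of the top_k candidate tails with the highest score."""
    ranked = sorted(candidates.items(),
                    key=lambda item: distmult_score(head, relation, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Toy 2-dimensional embeddings (illustrative values, not learned).
event = [0.9, 0.1]      # subject, e.g. one EVENT_ row
rdf_type = [1.0, 1.0]   # predicate playing the role of a type relation
classes = {"Truck": [1.0, 0.0], "Ship": [0.0, 1.0]}

print(predict_tails(event, rdf_type, classes))           # single predicted type
print(predict_tails(event, rdf_type, classes, top_k=2))  # several types per entity
```

Choosing `top_k` greater than 1 is one simple way to account for entities that have several types.
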

## Let's apply DistMult to tabular/relational data.

The following is presented for illustration purposes. Consequently, we focus on only the first $10^3$ rows of the given $A$.

In [None]:
# Standard library
import argparse
import time
import warnings

# Third-party
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# vectograph: helpers and scikit-learn style transformers
from vectograph.helper_funcs import apply_PYKE
from vectograph.transformers import RDFGraphCreator, KGCreator, ApplyKGE, TypePrediction, ClusterPurity
from vectograph.utils import ignore_columns, create_experiment_folder, create_logger

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

In [None]:
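# Path to the input CSV file (local to the author's machine).
tabularpath = '/home/demir/Desktop/ai4bd-smart-logistics/2020-06-26-ai4bd-smart-logistics/merged.csv'

num_of_quantiles = 40                  # d: number of quantile bins per discretized column
min_num_of_unique_values_per_col = 20  # c: minimum number of unique values required for discretization
consider_only_to_N_rows = 1_000        # only the first N rows of the table are used
# Hyperparameters of the knowledge graph embedding model.
params = {
    'model': 'Distmult',
    'embedding_dim': 50,
    'num_iterations': 200,
    'batch_size': 256,
    'learning_rate': 0.005,
    'input_dropout': 0.1}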


In [None]:

storage_path, _ = create_experiment_folder()
logger = create_logger(name='Vectograph', p=storage_path)

# Dask could be used here for larger files.
df = pd.read_csv(tabularpath, low_memory=False).head(consider_only_to_N_rows)
# E: every row becomes an event identified by its prefixed index.
df.index = 'Event_' + df.index.astype(str)

num_rows, num_cols = df.shape  # yields at most num_rows * num_cols triples
column_names = df.columns

logger.info('Original Tabular data: {0} by {1}'.format(num_rows, num_cols))
logger.info('Quantisation starts')

# F: add a quantile-based '<col>_range' column for every non-object column
# that has at least min_num_of_unique_values_per_col unique values.
for col in df.select_dtypes(exclude='object').columns:
    if len(df[col].unique()) >= min_num_of_unique_values_per_col:
        label_names = [col + '_quantile_' + str(i) for i in range(num_of_quantiles)]
        # Ranking with method='first' breaks ties, which avoids duplicate bin edges in qcut.
        df.loc[:, col + '_range'] = pd.qcut(df[col].rank(method='first'), num_of_quantiles, labels=label_names)

new_num_rows, new_num_cols = df.shape  # yields at most new_num_rows * new_num_cols triples

logger.info('Tabular data after conversion: {0} by {1}'.format(new_num_rows, new_num_cols))

params.update({'storage_path': storage_path,
               'logger': logger})

# Build the knowledge graph, learn embeddings, then evaluate them.
pipe = Pipeline([('createkg', KGCreator(path=storage_path, logger=logger)),
                 ('embeddings', ApplyKGE(params=params)),
                 ('typeprediction', TypePrediction()),
                 ('clusterpurity', ClusterPurity())
                 ])

pipe.fit_transform(df)


# References

+ [1] A Physical Embedding Model for Knowledge Graphs: https://arxiv.org/abs/2001.07418
+ [2] TuckER: Tensor Factorization for Knowledge Graph Completion: https://arxiv.org/pdf/1901.09590.pdf