# Data Preprocessing

This notebook performs and documents the data preprocessing. The final results are a training, a validation and a test dataset that are used to train and evaluate models in the next stage.

In [None]:
import os
import shutil
import pandas as pd
from data_preprocessing_utility import remove_entity_pairs, filter_properties, filter_entities, split_dataset, concat_files

In [None]:
# file paths to the initial unprocessed files
filepath_mw_bs = "data/exported_data/mutual_wikilinks_properties_both_sides.csv"
filepath_mw_os = "data/exported_data/mutual_wikilinks_properties_one_side.csv"
filepath_mw_no_props = "data/exported_data/mutual_wikilinks_no_properties.csv"
filepath_remaining_triples = "data/raw_data/mappingbased-objects_lang=en.ttl"
filepath_types = "data/raw_data/instance-types_inference=transitive_lang=en.ttl"

# file path of the file containing the filtered subset of property types
filepath_filtered_prop_types = "data/processed_data/filtered_property_types.csv"

# directory where the final processed train, validation and testing files are saved
results_dir_filepath = "data/processed_data/"
# exist_ok=True makes this a no-op if the directory already exists; missing parent directories are created as well
os.makedirs(results_dir_filepath, exist_ok=True)

# create temporary directory for saving intermediate results
# this directory, including the files with intermediate results, is deleted once the preprocessing is done
temp_dir_filepath = "data/temp/"
os.makedirs(temp_dir_filepath, exist_ok=True)

## Remove Entity Pairs that Appear in the Mutually Wikilinked Properties Dataset

Entity pairs that appear in the mutually wikilinked properties datasets should be separated from the remaining properties. The following code takes the dataset that contains all property definitions for entity pairs and removes every triple whose entity pair appears in one of the mutually wikilinked datasets. Wikilinks themselves are not part of this dataset and are therefore not included here.
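
The idea behind this step can be sketched with pandas. This is only an illustration and not the implementation of `remove_entity_pairs` in `data_preprocessing_utility`: the function name `drop_pairs`, the column names `subject`, `predicate` and `object`, and the choice to treat a pair as unordered are all assumptions made for the example.

```python
import pandas as pd


def drop_pairs(triples: pd.DataFrame, pairs: pd.DataFrame) -> pd.DataFrame:
    """Drop every triple whose entity pair occurs in `pairs` (hypothetical helper)."""
    # Assumption: a pair is unordered, so (a, b) and (b, a) are the same pair.
    blocked = {frozenset(p) for p in zip(pairs["subject"], pairs["object"])}
    keep = [
        frozenset(p) not in blocked
        for p in zip(triples["subject"], triples["object"])
    ]
    return triples[pd.Series(keep, index=triples.index, dtype=bool)]


# Toy data: A and B are mutually wikilinked, so both A-B triples are removed.
triples = pd.DataFrame(
    {
        "subject": ["A", "B", "A", "C"],
        "predicate": ["p1", "p2", "p3", "p4"],
        "object": ["B", "A", "C", "D"],
    }
)
pairs = pd.DataFrame({"subject": ["A"], "object": ["B"]})
remaining = drop_pairs(triples, pairs)
```

Because the second call in this notebook works on the output of the first, the two pair datasets are effectively removed one after the other.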

In [None]:
# remove entity pairs that appear in the dataset of mutual wikilinks with both-sided properties
remove_entity_pairs(
    dataset_filepath=filepath_remaining_triples,
    pairs_dataset_filepath=filepath_mw_bs,
    processed_dataset_filepath=temp_dir_filepath+"remaining_triples_no_mw_1.csv",
    dataset_filetype="ttl",
)

In [None]:
# remove entity pairs that appear in the dataset of mutual wikilinks with one-sided properties
remove_entity_pairs(
    dataset_filepath=temp_dir_filepath+"remaining_triples_no_mw_1.csv",
    pairs_dataset_filepath=filepath_mw_os,
    processed_dataset_filepath=temp_dir_filepath+"remaining_triples_no_mw_2.csv",
)

## Filter Properties

To reduce the number of properties that have to be considered during embedding creation and relation prediction, the set of property types is reduced. The dataset of filtered property types is not created in this notebook but in a separate one. The resulting file containing the filtered property definitions is used here to remove every property from the datasets that is not in this list. Three datasets are filtered: the two datasets containing properties that connect mutually wikilinked entity pairs (one-sided and both-sided properties) and the dataset containing the remaining triples (which by now excludes the connecting properties of mutually wikilinked pairs).
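
A minimal sketch of such a filter, assuming the triples are held in a DataFrame with a `predicate` column and the allowed property types are available as a set. The helper `keep_allowed_properties`, the column name and the example property names are hypothetical; the actual `filter_properties` function works on file paths and may differ.

```python
import pandas as pd


def keep_allowed_properties(triples: pd.DataFrame, allowed: set) -> pd.DataFrame:
    """Keep only triples whose predicate is in the allowed set (hypothetical helper)."""
    return triples[triples["predicate"].isin(allowed)].reset_index(drop=True)


# Toy data with made-up property names.
triples = pd.DataFrame(
    {
        "subject": ["A", "A", "B"],
        "predicate": ["spouse", "wikiPageID", "birthPlace"],
        "object": ["B", "123", "C"],
    }
)
allowed = {"spouse", "birthPlace"}
filtered = keep_allowed_properties(triples, allowed)
```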

In [None]:
# filter properties of the mutual wikilinks with both-sided properties dataset
filter_properties(
    properties_dataset_filepath=filepath_mw_bs,
    filtered_property_types_filepath=filepath_filtered_prop_types,
    processed_dataset_filepath=temp_dir_filepath+"filtered_props_mw_bs.csv",
)

In [None]:
# filter properties of the mutual wikilinks with one-sided properties dataset
filter_properties(
    properties_dataset_filepath=filepath_mw_os,
    filtered_property_types_filepath=filepath_filtered_prop_types,
    processed_dataset_filepath=temp_dir_filepath+"filtered_props_mw_os.csv",
)

In [None]:
# filter properties of the remaining triples dataset
filter_properties(
    properties_dataset_filepath=temp_dir_filepath+"remaining_triples_no_mw_2.csv",
    filtered_property_types_filepath=filepath_filtered_prop_types,
    processed_dataset_filepath=temp_dir_filepath+"filtered_props_remaining_triples.csv",
)

## Filter Entities

Not all entities that appear in the datasets containing properties that connect mutually wikilinked entity pairs are also available in the dataset containing the remaining triples. The goal of this master's thesis is to classify relations between mutually wikilinked entity pairs that are already integrated in the graph, not between unknown entities. This is also reflected in the models that are chosen to predict the relation / property types between a pair of entities: the models used to create the knowledge graph embeddings need information on other triples in which an entity appears (transductive link prediction). For that reason, any pair containing an entity that is not part of the remaining triples dataset is removed from the datasets containing connecting properties of mutually wikilinked entities. Furthermore, the dataset of mutually wikilinked entities without any connecting properties is filtered so that it only contains pairs where both entities appear in the remaining triples dataset.

The same filtering is applied to the entity types dataset, which reduces it to type information on entities that are part of the remaining triples dataset.
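
The filtering described above can be sketched as follows. The helper `keep_known_entities` and the column names are assumptions for illustration. The flag `filter_subject_only` mirrors the parameter name used in the cells below, but its behaviour here is inferred from the description, not taken from the implementation: for the types dataset the object is a class and not an entity, so only the subject would be checked.

```python
import pandas as pd


def keep_known_entities(
    dataset: pd.DataFrame,
    reference: pd.DataFrame,
    filter_subject_only: bool = False,
) -> pd.DataFrame:
    """Keep rows whose entities occur in the reference triples (hypothetical helper)."""
    # An entity counts as known if it occurs as subject or object in the reference.
    known = set(reference["subject"]) | set(reference["object"])
    mask = dataset["subject"].isin(known)
    if not filter_subject_only:
        mask = mask & dataset["object"].isin(known)
    return dataset[mask]


reference = pd.DataFrame({"subject": ["A", "B"], "object": ["B", "C"]})
pairs = pd.DataFrame({"subject": ["A", "A", "X"], "object": ["B", "X", "C"]})
types = pd.DataFrame({"subject": ["A", "X"], "object": ["Person", "Place"]})

known_pairs = keep_known_entities(pairs, reference)
known_types = keep_known_entities(types, reference, filter_subject_only=True)
```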

In [None]:
# filter entities of the mutual wikilinks with both-sided properties dataset
filter_entities(
    dataset_filepath=temp_dir_filepath+"filtered_props_mw_bs.csv",
    entities_dataset_filepath=temp_dir_filepath+"filtered_props_remaining_triples.csv",
    processed_dataset_filepath=temp_dir_filepath+"filtered_entities_mw_bs.csv",
    filter_subject_only=False,
)

In [None]:
# filter entities of the mutual wikilinks with one-sided properties dataset
filter_entities(
    dataset_filepath=temp_dir_filepath+"filtered_props_mw_os.csv",
    entities_dataset_filepath=temp_dir_filepath+"filtered_props_remaining_triples.csv",
    processed_dataset_filepath=temp_dir_filepath+"filtered_entities_mw_os.csv",
    filter_subject_only=False,
)

In [None]:
# filter entities of the mutual wikilinks without connecting properties dataset
filter_entities(
    dataset_filepath=filepath_mw_no_props,
    entities_dataset_filepath=temp_dir_filepath+"filtered_props_remaining_triples.csv",
    processed_dataset_filepath=results_dir_filepath+"mw_no_props.csv",
    filter_subject_only=False,
)

In [None]:
# filter entities of the types dataset
filter_entities(
    dataset_filepath=filepath_types,
    entities_dataset_filepath=temp_dir_filepath+"filtered_props_remaining_triples.csv",
    processed_dataset_filepath=temp_dir_filepath+"filtered_entities_types.csv",
    filter_subject_only=True,
    dataset_filetype="ttl",
)

## Create Training, Validation and Testing Datasets

To train the models, validate model settings and test the overall performance, the datasets containing connecting properties of mutually wikilinked entities are split into a training, a validation and a test set. The data is split so that each entity pair appears in only one of the splits. Individual entities can still appear in more than one split, which again reflects the setting of the master's thesis, in which entities are not completely unknown but the relations between mutually wikilinked pairs are not all classified.

After splitting the datasets containing connecting properties between mutually wikilinked entities, the respective splits with both- and one-sided properties are combined. Furthermore, the remaining triples and entity types datasets are added to the training split of connecting properties of mutually wikilinked entities to obtain the final training set.
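
A pair-wise split of this kind can be sketched as below. This is an assumption-laden illustration and not the `split_dataset` implementation: the helper `split_by_pair`, the column names, and the parameters `val_share` and `test_share` are hypothetical and do not claim to match the meaning of `val_test_fraction`. The key point is that the unique entity pairs are shuffled and assigned to a split, and every triple follows its pair.

```python
import random

import pandas as pd


def split_by_pair(triples, val_share=0.1, test_share=0.1, seed=42):
    """Split triples so that each unordered entity pair lands in exactly one split."""
    keys = [frozenset(p) for p in zip(triples["subject"], triples["object"])]
    # dict.fromkeys keeps first-occurrence order, so the shuffle is reproducible.
    unique_keys = list(dict.fromkeys(keys))
    random.Random(seed).shuffle(unique_keys)

    n_val = int(len(unique_keys) * val_share)
    n_test = int(len(unique_keys) * test_share)
    val_keys = set(unique_keys[:n_val])
    test_keys = set(unique_keys[n_val:n_val + n_test])

    labels = pd.Series(
        ["val" if k in val_keys else "test" if k in test_keys else "train" for k in keys],
        index=triples.index,
    )
    return (
        triples[labels == "train"],
        triples[labels == "val"],
        triples[labels == "test"],
    )


# Toy data: 10 pairs, each connected by one property in both directions.
rows = []
for i in range(10):
    rows.append((f"E{i}", "p", f"F{i}"))
    rows.append((f"F{i}", "q", f"E{i}"))
triples = pd.DataFrame(rows, columns=["subject", "predicate", "object"])
train, val, test = split_by_pair(triples, val_share=0.2, test_share=0.2)
```

Individual entities may still occur in several splits under this scheme; only the pairs are kept disjoint.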

In [None]:
# split dataset containing both-sided properties of mutual wikilinks into train, val and test set
split_dataset(
    dataset_filepath=temp_dir_filepath+"filtered_entities_mw_bs.csv",
    trainset_filepath=temp_dir_filepath+"train_mw_bs.csv",
    valset_filepath=temp_dir_filepath+"val_mw_bs.csv",
    testset_filepath=temp_dir_filepath+"test_mw_bs.csv",
    val_test_fraction=5,
    random_state=42
)

In [None]:
# split dataset containing one-sided properties of mutual wikilinks into train, val and test set
split_dataset(
    dataset_filepath=temp_dir_filepath+"filtered_entities_mw_os.csv",
    trainset_filepath=temp_dir_filepath+"train_mw_os.csv",
    valset_filepath=temp_dir_filepath+"val_mw_os.csv",
    testset_filepath=temp_dir_filepath+"test_mw_os.csv",
    val_test_fraction=5,
    random_state=42
)

In [None]:
# final training set
# concatenate the training splits of mutually wikilinked entities with both- and one-sided properties, the remaining triples and the entity types
concat_files(
    filepaths=[
        temp_dir_filepath+"filtered_props_remaining_triples.csv",
        # file name must match the output of the entity types filtering step above
        temp_dir_filepath+"filtered_entities_types.csv",
        temp_dir_filepath+"train_mw_bs.csv",
        temp_dir_filepath+"train_mw_os.csv"
    ],
    filetypes=["csv", "csv", "csv", "csv"],
    processed_dataset_filepath=results_dir_filepath+"train.tsv"
)

In [None]:
# final validation set
# concatenate the validation splits of mutually wikilinked entities with both- and one-sided properties
concat_files(
    filepaths=[
        temp_dir_filepath+"val_mw_bs.csv",
        temp_dir_filepath+"val_mw_os.csv"
    ],
    filetypes=["csv", "csv"],
    processed_dataset_filepath=results_dir_filepath+"val.tsv"
)

In [None]:
# final testing set
# concatenate the test splits of mutually wikilinked entities with both- and one-sided properties
concat_files(
    filepaths=[
        temp_dir_filepath+"test_mw_bs.csv",
        temp_dir_filepath+"test_mw_os.csv"
    ],
    filetypes=["csv", "csv"],
    processed_dataset_filepath=results_dir_filepath+"test.tsv"
)

In [None]:
# remove temporary directory and all temporary files inside
shutil.rmtree(temp_dir_filepath)