The Quotebank dataset comes with additional metadata about its speakers, which we may use to enrich our data. To prepare for this, we pre-processed the Wikidata dataset.

This notebook pre-processes the Wikidata entities so that we can enrich our data with them in Milestone 3.

## Mounting Google Drive

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Handling the data
Here is how we load the speaker attributes and create the `wikidata_labels` data frame.

We first build the path to the additional speaker metadata of the Quotebank dataset, then read it into a data frame.

In [None]:
PATH_1 = "drive/MyDrive/Applied Data Analysis (ADA)/ADA milestone 2/"
PATH_2 = PATH_1 + "Project datasets/speaker_attributes.parquet/"
df = pd.read_parquet(PATH_2 + "part-00001-0d587965-3d8f-41ce-9771-5b8c9024dce9-c000.snappy.parquet", engine='pyarrow')
df

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,,,,[Q6581097],985453603,,,,,,Q45441526,Cui Yan,,item,
1,,,[Q9903],[Q6581097],1008699604,,,,,,Q45441555,Guo Ziyi,,item,
2,,,[Q9903],[Q6581097],1008699709,,,,,,Q45441562,Wan Zikui,,item,
3,,,[Q9903],[Q6581097],1008699728,,,,,,Q45441563,Lin Pei,,item,
4,,,[Q9683],[Q6581097],985261661,,,,,,Q45441565,Guan Zhen,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660353,,,,,1388244101,,,,,,Q106134191,Leonhard Gans,,item,
660354,,,,,1388243838,,,,,,Q106134200,André de Arena,,item,
660355,,,,,1388243780,,,,,,Q106134203,Andreas Byssmann,,item,
660356,,,,,1388243759,,,,,,Q106134204,Johannes,,item,


In [None]:
wikidata_labels = pd.read_csv(PATH_1 + 'Project datasets/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')

In [None]:
wikidata_labels.loc['Q31', 'Label']

'Belgium'

## Wikidata Labels



In [None]:
wikidata_labels

QID,Label,Description
Q31,Belgium,country in western Europe
Q45,Portugal,country in southwestern Europe
Q75,Internet,global system of connected computer networks
Q148,People's Republic of China,sovereign state in East Asia
Q155,Brazil,country in South America
...,...,...
Q106302506,didgeridooist,musician who plays the didgeridoo
Q106341153,biochemistry teacher,teacher of biochemistry at any level
Q106368830,2018 Wigan Metropolitan Borough Council electi...,
Q106369692,2018 Wigan Metropolitan Borough Council electi...,


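To use the information in the `wikidata_labels` data frame, we implemented two functions: `Qids_to_tuple_words` and `Qids_to_words`. The first converts a collection of Wikidata QIDs into a tuple of their human-readable labels; the second applies this conversion to every non-missing entry of a data frame column.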

In [None]:
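def Qids_to_tuple_words(tuple_qids, wikidata_labels):
    """Map an iterable of QIDs to a tuple of their Wikidata labels."""
    words = []
    for qid in tuple_qids:
        # Raises KeyError if the QID is missing from wikidata_labels.
        words.append(wikidata_labels.loc[qid, 'Label'])
    return tuple(words)

def Qids_to_words(df, column_name, wikidata_labels):
    """Translate a column of QID arrays into tuples of labels, skipping missing rows."""
    # Keep only the rows that actually hold QIDs.
    df_c = df[column_name][df[column_name].notna()].copy()
    # Sort the QIDs so the same set always yields the same tuple.
    df_c = df_c.apply(lambda x: Qids_to_tuple_words(np.sort(x), wikidata_labels))
    return df_c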
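
Since `Qids_to_words` raises a `KeyError` as soon as one QID is absent from the labels table, a more forgiving variant may be useful on other columns. The following is only a minimal sketch on toy data: `qids_to_words_safe`, `toy_labels` and `toy_df` are hypothetical names that are not part of our pipeline, and the toy column holds plain lists where the real parquet file holds arrays.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values, for illustration only).
toy_labels = pd.DataFrame(
    {"Label": ["Belgium", "Portugal"]},
    index=pd.Index(["Q31", "Q45"], name="QID"),
)
toy_df = pd.DataFrame({"nationality": [["Q45", "Q31"], None, ["Q31", "Q999"]]})


def qids_to_words_safe(df, column_name, labels):
    """Variant of Qids_to_words that keeps unknown QIDs as-is instead of failing."""
    # A plain dict lookup avoids one .loc call per QID.
    label_map = labels["Label"].to_dict()
    col = df[column_name].dropna()
    return col.apply(
        lambda qids: tuple(label_map.get(str(q), str(q)) for q in sorted(qids))
    )


toy_words = qids_to_words_safe(toy_df, "nationality", toy_labels)
print(toy_words)
```

Unknown QIDs such as the made-up `Q999` stay visible in the output, which makes gaps in the labels table easy to spot.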

For example, let's translate the `nationality` column of the speaker attributes data frame using `wikidata_labels`:

In [None]:
words = Qids_to_words(df, 'nationality', wikidata_labels)

In [None]:
words

1                       (Ming dynasty,)
2                       (Ming dynasty,)
3                       (Ming dynasty,)
4                        (Tang Empire,)
5                        (Tang Empire,)
                      ...              
660332                         (Italy,)
660333              (Kingdom of Italy,)
660337                (Czech Republic,)
660342                      (Bulgaria,)
660343    (Kingdom of the Netherlands,)
Name: nationality, Length: 124148, dtype: object

In [None]:
words.unique()

array([('Ming dynasty',), ('Tang Empire',), ('Song dynasty',), ...,
       ('Vienna', 'Austria'), ('Kingdom of Württemberg', 'Germany'),
       ("Romanian People's Republic", 'Socialist Republic of Romania')],
      dtype=object)

In [None]:
words.unique().shape

(1441,)
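
Beyond the number of unique combinations, it can help to see how often each one occurs. The sketch below uses a small hypothetical series (`toy_words`) shaped like the output above; joining each tuple into a string is our own assumption to keep the counts easy to read and to index.

```python
import pandas as pd

# Hypothetical sample shaped like the output of Qids_to_words.
toy_words = pd.Series(
    [("Italy",), ("Italy",), ("Vienna", "Austria"), ("Bulgaria",)],
    name="nationality",
)

# Join each tuple into one readable string, then count occurrences.
counts = toy_words.map(" / ".join).value_counts()
print(counts)
```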

## Conclusion

We can now augment our dataset by replacing each Wikidata QID with its corresponding label.
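
One possible way to attach the translated labels back to the speaker attributes is to rely on index alignment, as in the minimal sketch below. The frames and the column name `nationality_label` are hypothetical and only illustrate the idea; rows without a nationality simply end up with a missing value.

```python
import pandas as pd

# Hypothetical speaker attributes; row 1 has no nationality.
toy_df = pd.DataFrame(
    {"id": ["Q1", "Q2", "Q3"], "nationality": [["Q31"], None, ["Q45"]]}
)

# Hypothetical translated labels, indexed like the rows that had QIDs.
toy_words = pd.Series([("Belgium",), ("Portugal",)], index=[0, 2], name="nationality")

# Assignment aligns on the index, so row 1 receives NaN.
toy_df["nationality_label"] = toy_words
print(toy_df)
```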