# Description
This script interprets the QIDs in the filtered politician entity dump using the small Quotebank QID catalogue.
The pipeline works as follows:
1. Load the filtered Wikidata JSON dump in chunks.
2. Go through each record's columns and replace the QIDs with the matching value of the `Label` field in the catalogue.
3. The interpretation is done in 3 ways:
    - single QID interpretation (gender)
    - list QID interpretation (candidacy_election, parties)
    - list-in-list QID interpretation (position_held)
4. Dump the interpreted data into a new json file
5. Dump the unrecognized QIDs into a log file
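
The three interpretation modes in step 3 can be illustrated on toy data. The following is a minimal sketch, not the notebook's own code: it uses a plain dict (`toy_catalogue`) with made-up QID/label pairs in place of the real catalogue, and the helper names `label_of`, `interpret_single`, `interpret_list`, and `interpret_nested` are hypothetical.

```python
# Toy stand-in for the QID catalogue; the QID/label pairs are made up.
toy_catalogue = {"Q1": "male", "Q2": "Party A", "Q3": "Position A"}
unknown_qids = []  # QIDs that have no label in the toy catalogue


def label_of(qid):
    """Return the label for a QID, or the QID itself if it is unknown."""
    if qid in toy_catalogue:
        return toy_catalogue[qid]
    if qid not in unknown_qids:
        unknown_qids.append(qid)
    return qid


def interpret_single(qid):
    # Single value, e.g. gender
    return label_of(qid)


def interpret_list(qids):
    # Flat list, e.g. parties or candidacy_election
    return [label_of(q) for q in qids]


def interpret_nested(entries):
    # List of lists, e.g. position_held: [[qid, start, end], ...]
    return [[label_of(e[0]), *e[1:]] for e in entries]


single_result = interpret_single("Q1")
list_result = interpret_list(["Q2", "Q9"])
nested_result = interpret_nested([["Q3", "2001", "2005"], ["Q9", "2005", "2009"]])
print(single_result)
print(list_result)
print(nested_result)
print(unknown_qids)
```

In this sketch an unknown QID (`Q9`) is passed through unchanged and recorded once, which mirrors steps 4 and 5 of the pipeline.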

# Required packages
- [pandas](https://pandas.pydata.org/docs/)
- [bz2](https://docs.python.org/3/library/bz2.html)
- [time](https://docs.python.org/3/library/time.html)

In [11]:
import pandas as pd
import bz2
import time

In [9]:
Q_catalogue = pd.read_csv('../../wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')
WIKI_DATA_FILTERED = '../../filtered_politician_v2.json.bz2' #path to the filtered Wikidata dump
WIKI_DATA_FILTERED_LABELED = '../../filtered_politician_labeled_v2.json.bz2' #path to the filtered Wikidata dump with interpretations
WIKI_DATA_FILTERED_MISSINGQ = '../../filtered_politician_missingqids_v2.json.bz2'  #path to the file of QIDs missing from the small catalogue

In [4]:
def single_interpret(x):
    """
    Interpret a single value (e.g. one QID) as its catalogue label.
    A QID missing from the catalogue is returned unchanged and recorded in Unrecorded_Q once.
    """
    try:
        return Q_catalogue.loc[x]['Label']
    except KeyError:
        if x not in Unrecorded_Q:
            Unrecorded_Q.append(x)
        return x
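
To make the lookup behaviour concrete, here is a minimal sketch on a two-row toy catalogue. The QIDs, labels, the `Description` column, and the names `toy_catalogue` and `lookup` are assumptions made for illustration. It shows that `.loc` raises `KeyError` for a QID that is absent from the index, which is the case the `except` branch handles.

```python
import pandas as pd

# Toy catalogue indexed by QID, mimicking the assumed shape of Q_catalogue (values are made up).
toy_catalogue = pd.DataFrame(
    {"Label": ["Label A", "Label B"], "Description": ["first", "second"]},
    index=pd.Index(["Q1", "Q2"], name="QID"),
)


def lookup(qid):
    """Return the label of a QID, falling back to the QID itself."""
    try:
        # Row and column are selected in one step
        return toy_catalogue.loc[qid, "Label"]
    except KeyError:
        return qid


known = lookup("Q1")
missing = lookup("Q404")
print(known, missing)
```

Selecting with `toy_catalogue.loc[qid, "Label"]` avoids building the intermediate row that `.loc[qid]['Label']` creates; whether that matters for the real catalogue has not been measured here.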

In [5]:
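def listlist_interpret(x):
    """
    Interpret a list of lists, e.g. [[qid1, time1-1, time1-2], [qid2, time2-1, time2-2]].
    Only the first element of each inner list is treated as a QID. A QID missing from the
    catalogue is kept unchanged and recorded in Unrecorded_Q once.
    """
    tmp = []
    for i in x:
        tmp1 = []
        try:
            tmp1.append(Q_catalogue.loc[i[0]]['Label'])
        except KeyError:
            # Keep the raw QID inside the inner list so it stays paired with its own fields
            tmp1.append(i[0])
            if i[0] not in Unrecorded_Q:
                Unrecorded_Q.append(i[0])
        # Note: i[1:2] keeps only the element at index 1, as a one-element list
        tmp1.append(i[1:2])
        tmp.append(tmp1)
    return tmp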

In [6]:
def list_interpret(x):
    """
    Interpret a flat list of QIDs, e.g. [qid1, qid2].
    A QID missing from the catalogue is kept unchanged and recorded in Unrecorded_Q once.
    """
    tmp = []
    for i in x:
        try:
            tmp.append(Q_catalogue.loc[i]['Label'])
        except KeyError:
            tmp.append(i)
            if i not in Unrecorded_Q:
                Unrecorded_Q.append(i)
    return tmp

In [12]:
CHUNKSIZE = 10000
Unrecorded_Q = []
with bz2.open(WIKI_DATA_FILTERED_LABELED, 'wb') as out_path_labeled:
    with bz2.open(WIKI_DATA_FILTERED_MISSINGQ, 'wb') as out_path_missingq:
        with pd.read_json(WIKI_DATA_FILTERED, lines=True, compression='bz2', chunksize=CHUNKSIZE) as df_reader:
            for chunk in df_reader:
                t1=time.time()
                chunk['gender'] = chunk['gender'].apply(single_interpret)
                chunk['parties'] = chunk['parties'].apply(list_interpret)
                chunk['candidacy_election'] = chunk['candidacy_election'].apply(list_interpret)
                #chunk['religion'] = chunk['religion'].apply(list_interpret)
                #The religion property in Wikidata is noisy (it includes instances such as churches and shrines), so it is better filtered another way.
                try:
                    chunk['positions held'] = chunk['positions held'].apply(listlist_interpret)
                except Exception:
                    # Fall back to flat-list interpretation if the nested-list interpretation raises
                    chunk['positions held'] = chunk['positions held'].apply(list_interpret)
                chunk.to_json(path_or_buf=out_path_labeled,orient='records',lines=True)
                t2=time.time()
                dt=t2-t1
                # Report the actual chunk length, since the last chunk may be shorter than CHUNKSIZE
                print("Interpreted {} records and found {} undocumented qids [records/s: {:.2f}]".format(len(chunk), len(Unrecorded_Q), len(chunk) / dt))
            pd.DataFrame(Unrecorded_Q).to_json(path_or_buf=out_path_missingq,orient='records',lines=True)

Interpreted 10000 records and found 3418 undocumented qids [records/s: 6184.60]
Interpreted 10000 records and found 5127 undocumented qids [records/s: 6918.81]
Interpreted 10000 records and found 15559 undocumented qids [records/s: 2556.90]
Interpreted 10000 records and found 29334 undocumented qids [records/s: 704.49]
Interpreted 10000 records and found 30301 undocumented qids [records/s: 5676.16]
Interpreted 10000 records and found 41719 undocumented qids [records/s: 462.67]
Interpreted 10000 records and found 50504 undocumented qids [records/s: 940.55]
Interpreted 10000 records and found 51566 undocumented qids [records/s: 3957.33]
Interpreted 10000 records and found 56791 undocumented qids [records/s: 538.32]
Interpreted 10000 records and found 66978 undocumented qids [records/s: 601.98]
Interpreted 10000 records and found 74167 undocumented qids [records/s: 249.36]
Interpreted 10000 records and found 75039 undocumented qids [records/s: 2581.66]
Interpreted 10000 records and found 
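
The throughput in the log above varies widely between chunks. One plausible contributor, not measured here, is that `x not in Unrecorded_Q` scans a list that grows to tens of thousands of entries. A minimal sketch of the same bookkeeping with a `set`, whose membership test takes roughly constant time on average, is shown below. The names `seen_unknown`, `unknown_in_order`, and `record_unknown` are hypothetical.

```python
seen_unknown = set()      # fast membership checks
unknown_in_order = []     # preserves first-seen order for the log file


def record_unknown(qid):
    """Remember a QID the first time it is seen; ignore repeats."""
    if qid not in seen_unknown:
        seen_unknown.add(qid)
        unknown_in_order.append(qid)


# Made-up stream of unknown QIDs with repeats
for qid in ["Q7", "Q8", "Q7", "Q9", "Q8"]:
    record_unknown(qid)

print(unknown_in_order)
```

Keeping the ordered list next to the set means the log file can still be written in first-seen order.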

In [14]:
print(f'Found {len(Unrecorded_Q)} undocumented qids in total, which need to be queried with full wikidata label catalogue. #NOTE: Religion field is not interpreted here.')

Found 123801 undocumented qids in total, which need to be queried with full wikidata label catalogue. #NOTE: Religion field is not interpreted here.
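
As a follow-up, the missing-QID log can be read back line by line with the standard library. The sketch below writes a tiny stand-in file to a temporary directory and reads it again. It assumes each line is a JSON object with the single key `"0"`, which is what `pd.DataFrame(some_list).to_json(orient='records', lines=True)` is expected to produce for an unnamed column. The file name and QIDs are made up.

```python
import bz2
import json
import os
import tempfile

# Write a tiny stand-in for the missing-QID log (made-up QIDs).
tmp_dir = tempfile.mkdtemp()
log_path = os.path.join(tmp_dir, "missing_qids_example.json.bz2")
with bz2.open(log_path, "wt", encoding="utf-8") as f:
    for qid in ["Q7", "Q8", "Q9"]:
        f.write(json.dumps({"0": qid}) + "\n")

# Read it back: one JSON record per line.
missing_qids = []
with bz2.open(log_path, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            missing_qids.append(json.loads(line)["0"])

print(missing_qids)
```

The resulting list is the input for a later lookup against the full Wikidata label catalogue.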
