# Preprocessing

The preprocessing pipeline outputs a list of speakers with their quotes. Each element of this list contains information about the speaker (full name, gender, date of birth, ...) obtained from Wikidata, and a string formed by joining multiple quotes up to a fixed length.

The pipeline is presented through a data analysis example: analysing the personalities of US politicians. In this example we take 100 politicians from each of the two major political parties, the Democratic Party and the Republican Party, selecting the politicians who have the most quotes in our database. We only consider quotes for which the probability of the speaker is at least 80% (referred to as significant quotes in this notebook).

In [2]:
import bz2
import json
import re
import random
import sys
import os
import time

### Counting significant quotes

<b>This step has been executed once and will not be needed in following analyses since the quote counts calculated are for all the speakers and the output can be simply reused.</b>

Define some methods for better reusability.

In [21]:
PATTERN_INPUT = "../quotebank/quotes-{}.json.bz2"

In [6]:
def write_json_to_file(name, obj):
    # Use current timestamp to make the name of the file unique
    millis = round(time.time() * 1000)
    name = f'{name}_{millis}.json'
    with open(name, 'wb') as f:
        output = json.dumps(obj)
        f.write(output.encode('utf-8'))
    return name

Methods used for counting significant quotes.

In [7]:
signi_count = 0
signi_quote_dict = {}

In [15]:
# The signature remains from an older version of the code, parameter out_file could be removed, but then has to be removed in other places in the code as well.
def initialize(out_file):
    global signi_count
    global signi_quote_dict
    signi_count = 0
    signi_quote_dict = {}

In [17]:
# The signature remains from an older version of the code, parameter out_file could be removed, but then has to be removed in other places in the code as well.
def count_significant_quotes(out_file, row):
    global signi_count
    global signi_quote_dict
    
    probas = row['probas']
    qids = row['qids']
    
    if (len(probas) == 0 or len(qids) == 0):
        return
    
    if (probas[0][0] == 'None'):
        return
    
    p = float(probas[0][1])
    if (p < 0.8):
        return
    
    qid = qids[0]
    
    signi_count = signi_count + 1
    signi_quote_dict[qid] = signi_quote_dict.get(qid, 0) + 1
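
To make the significance check concrete, the sketch below applies the same tests to two hand-written rows. The rows and the QID are hypothetical and only mimic the `probas` and `qids` fields used above; `top_speaker` is an illustrative helper, not part of this notebook.

```python
def top_speaker(row, threshold=0.8):
    """Return the QID of the most likely speaker, or None if the quote is not significant."""
    probas = row.get('probas', [])
    qids = row.get('qids', [])
    if not probas or not qids:
        return None
    name, proba = probas[0]
    # The most likely "speaker" can be the placeholder 'None'
    if name == 'None' or float(proba) < threshold:
        return None
    return qids[0]

# Hypothetical rows shaped like the fields read above
confident = {'probas': [['Jane Doe', '0.93'], ['None', '0.07']], 'qids': ['Q0001']}
uncertain = {'probas': [['Jane Doe', '0.55'], ['None', '0.45']], 'qids': ['Q0001']}

print(top_speaker(confident))  # Q0001
print(top_speaker(uncertain))  # None
```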

General methods used for processing the quotes files.

In [5]:
def proc(input, evaluate_quote, max_length=20):
    """Decompress one chunk of the input stream and process the quotes it contains."""
    # Ugly global variable usage :(
    global index
    global invalid_json_count
    global invalid_chunk_count
    global chunk_stitching
    global stitch_length
    global scrap_next
    global quote_is_open
    global quote_part
    global dat_part
    global euro_error
    global euro_count
    
    global totin
    global totout
    global prev
    global dec
    global start
    # Decompress the chunk; this may yield no data while waiting for a full bzip2 block
    dat = dec.decompress(input)
    got = len(dat)
    if got != 0:    # 0 is common -- waiting for a bzip2 block
        try:
            if (euro_error):
                # If the previous chunk ended unexpectedly and could not be decoded, try to combine it with this chunk
                s = (dat_part + dat).decode('utf-8')
                euro_error = False
            else:
                # Decode the current chunk
                s = dat.decode('utf-8')
                
            # List elements in the quote files are separated by new lines (\n)
            lines = s.split('\n')

            for line in lines:
                try:
                    if (scrap_next):
                        # If the object spans too many chunks we decide to scrap it, and keep scraping until JSON can parse the line (chunk)
                        ob = json.loads(line)
                        scrap_next = False
                        quote_is_open = False
                        chunk_stitching -= stitch_length
                    else:
                        if (quote_is_open):
                            # If previous chunk ended in the middle of a JSON object we merge that content with the current line
                            ob = json.loads(quote_part + line)
                            quote_is_open = False
                        else:
                            # Parse the current line
                            ob = json.loads(line)

                    # Parametrization - do work on a single quote JSON object
                    evaluate_quote({}, ob)
                except ValueError:
                    """
                    Error occurs when the line does not contain the whole JSON object, which happens for the last line in almost every chunk of input stream.
                    We solve this by remembering the partial object, and then merging it with the rest of the object when we load the next chunk.
                    JSON object might span more than 2 chunks, and in that case we keep merging until we reach max_length chunks, when we just throw away the object
                    and count it as invalid using invalid_json_count.
                    """
                    if (scrap_next):
                        pass
                    else:
                        if (quote_is_open):
                            chunk_stitching += 1
                            quote_part = quote_part + line
                            stitch_length += 1

                            if (stitch_length > max_length):
                                invalid_json_count += 1
                                scrap_next = True
                        else:
                            quote_is_open = True
                            quote_part = line
                            stitch_length = 0
        except UnicodeDecodeError as e:
            # Error occurs when input stream is split in the middle of a character which is encoded with multiple bytes, for example the euro symbol
            if (euro_error):
                dat_part = dat_part + dat
            else:
                euro_error = True
                dat_part = dat
            
            euro_count += 1
        
        index += 1
    return got
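
The multi-byte character problem handled by `euro_error` can also be delegated to the standard library. The sketch below is an alternative, not what this notebook does: it uses an incremental UTF-8 decoder from `codecs`, which buffers an incomplete character until the remaining bytes arrive. The sample text is made up for illustration.

```python
import codecs

# The euro sign takes three bytes in UTF-8; split it across two chunks
data = 'price: 5 €'.encode('utf-8')
chunks = [data[:-2], data[-2:]]

# The incremental decoder keeps the dangling first byte until the rest arrives
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
text = ''.join(pieces)

print(pieces)  # ['price: 5 ', '€']
```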

In [85]:
def run_through_quotes(init, evaluate_quote, year, target_dict_name, path_to_input, name='test', chunk_size=16384):
    global index
    global invalid_json_count
    global invalid_chunk_count
    global chunk_stitching
    global stitch_length
    global scrap_next
    global quote_is_open
    global quote_part
    global dat_part
    global euro_error
    global euro_count
    
    global totin
    global totout
    global prev
    global dec
    global start
    
    size = os.path.getsize(path_to_input)
    invalid_json_count = 0
    invalid_chunk_count = 0
    chunk_stitching = 0
    stitch_length = 0
    scrap_next = False
    quote_is_open = False
    quote_part = ''
    dat_part = b''
    euro_error = False
    euro_count = 0
    
    totin = 0
    totout = 0
    prev = -1
    dec = bz2.BZ2Decompressor()
    start = time.time()
    
    init({})
    
    target_dict = poli_quotes if target_dict_name == "poli_quotes" else signi_quote_dict
    index = 0
    with open(path_to_input, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            # feed chunk to decompressor
            got = proc(chunk, evaluate_quote)

            # handle case of concatenated bz2 streams
            if dec.eof:
                rem = dec.unused_data
                dec = bz2.BZ2Decompressor()
                got += proc(rem, evaluate_quote)

            # show progress
            totin += len(chunk)
            totout += got
            if got != 0:    # only if a bzip2 block emitted
                frac = round(1000 * totin / size)
                if frac != prev:
                    left = (size / totin - 1) * (time.time() - start)
                    print(f'\r{frac / 10:.1f}% (~{left:.1f}s left)\tyear: {year}\tnumber of speakers: {len(target_dict)}\tstitching: {chunk_stitching}\teuro count: {euro_count}\tinvalid json count: {invalid_json_count}\tinvalid chunk count: {invalid_chunk_count}', end='')
                    prev = frac

    # Show the resulting size.
    print(end='\r')
    print(totout, 'uncompressed bytes')

    output_name = write_json_to_file(f'{name}-{year}', target_dict)
    return output_name
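
For comparison, a much shorter reader is sketched below. It assumes, as the code above does, that each file holds one JSON object per line, and it lets `bz2.open` in text mode handle decompression and character decoding, so no manual stitching is needed. `iter_quotes` is a hypothetical name; this is not the approach used in this notebook and it reports no progress. The small sample file is created only to make the example runnable.

```python
import bz2
import json
import os
import tempfile

def iter_quotes(path):
    """Yield one parsed JSON object per line of a bz2-compressed file."""
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except ValueError:
                # Skip lines that are not valid JSON
                continue

# Build a tiny sample file (made-up content) and read it back
path = os.path.join(tempfile.mkdtemp(), 'sample.json.bz2')
with bz2.open(path, 'wt', encoding='utf-8') as f:
    f.write('{"quotation": "costs 5 €"}\nnot json\n{"quotation": "ok"}\n')

rows = list(iter_quotes(path))
print(len(rows))  # 2
```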

Create one file per year. Each file contains a dictionary where the key is the QID of the speaker and the value is the number of significant quotes.
<br><br>
<font color='red'>WARNING: LONG EXECUTION!</font>

In [None]:
years = [2015, 2016, 2017, 2018, 2019, 2020]
# years = [2020]
for year in years:
    path_to_input = PATTERN_INPUT.format(year)
    
    run_through_quotes(
        initialize, count_significant_quotes, year, "signi_quote_dict", path_to_input, name='signi-quote-count', chunk_size=1_048_576)
    print('')
    print(f'Finished counting quotes for the year {year}')

Now combine the quote counts into a single file.
<br>
The file names below are examples; they should be updated if the code is run again, because each run writes files with new timestamps.

In [26]:
signi_quotes_file_names = [
    "signi-quote-count-2015_1636244638891.json",
    "signi-quote-count-2016_1636246832187.json",
    "signi-quote-count-2017_1636249273913.json",
    "signi-quote-count-2018_1636250518608.json",
    "signi-quote-count-2019_1636251729971.json",
    "signi-quote-count-2020_1636237785105.json"
]

In [55]:
combined_signi_dict = {}

for file_name in signi_quotes_file_names:
    with open(file_name, 'r') as f:
        one_dict = json.load(f)
        for k in one_dict.keys():
            combined_signi_dict[k] = combined_signi_dict.get(k, 0) + one_dict[k]

Sort the dictionary so the speakers with the most quotes appear first.

In [56]:
sorted_combined_signi_dict = dict(sorted(combined_signi_dict.items(), key=lambda item: item[1], reverse=True))
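
The combine-and-sort steps can also be written with `collections.Counter`. The following is a minimal sketch on made-up per-year counts with hypothetical QIDs, not the code that produced the files in this notebook.

```python
from collections import Counter

# Hypothetical per-year counts keyed by QID
per_year = [{'Q1': 3, 'Q2': 1}, {'Q2': 5, 'Q3': 2}]

combined = Counter()
for counts in per_year:
    # update() adds the counts instead of overwriting them
    combined.update(counts)

# most_common() returns (key, count) pairs sorted by descending count
ranked = dict(combined.most_common())
print(ranked)  # {'Q2': 6, 'Q1': 3, 'Q3': 2}
```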

Finally, save the resulting dictionary to a file. This file can be reused in later analyses whenever we need to pick the most quoted individuals to represent a group of people.

In [None]:
write_json_to_file('signi-quote-count-combined', sorted_combined_signi_dict)

### Get the wikidata

We used the https://query.wikidata.org/ website to get the relevant Wikidata entries. The SPARQL query is in the following cell.
<br>
The same data can also be obtained from the provided Wikidata parquet file, which is what we did later on.

Merge duplicate objects representing a single speaker but with differing fields.
<br>
Example: Arnold Schwarzenegger has both Austrian and American nationalities, and would appear twice, once with Austrian, and once with American nationality.

In [51]:
with open("../quotebank/american_politicians_fixed.json", "r") as f:
    wiki_poli = json.load(f)

In [52]:
wiki_poli_merged = dict()

index = 0
for row in wiki_poli:
    # Extract the QID from the link (ex. http://www.wikidata.org/entity/Q203286 -> Q203286)
    qid_start = row['item'].rindex('/') + 1
    key = row['item'][qid_start:]
    # Replace the link with the QID
    row['item'] = key
    
    if key in wiki_poli_merged:
        merged_entry = wiki_poli_merged[key]
        columns = ['itemLabel', 'genderLabel', 'citizenshipLabel', 'religionLabel', 'ethnicLabel', 'degreeLabel', 'dateOfBirth', 'placeOfBirthLabel', 'memberOfParty', 'memberOfPartyLabel', 'languageLabel']
        """
        Merge the values for every column:
            - if the values are the same - do nothing
            - if the values are different - create a list and add them both
        """
        for col in columns:
            if row.get(col, None) is None:
                continue
                
            updated_entry = merged_entry.get(col, None)
            
            if updated_entry is None:
                updated_entry = row[col]
            elif isinstance(updated_entry, list):
                if row[col] not in updated_entry:
                    updated_entry.append(row[col])
            elif row[col] != updated_entry:
                updated_entry = [updated_entry, row[col]]
                
            merged_entry[col] = updated_entry
    else:
        wiki_poli_merged[key] = row
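
The merge rule can be checked on a small example. The sketch below isolates the per-field logic in a hypothetical helper, `merge_field`, and applies it to two made-up rows for the same speaker; the label values are illustrative and not copied from the query results.

```python
def merge_field(current, new):
    """Merge one field value: keep equal values, collect differing ones in a list."""
    if new is None:
        return current
    if current is None:
        return new
    if isinstance(current, list):
        return current if new in current else current + [new]
    return current if current == new else [current, new]

# Two illustrative rows describing the same speaker
first = {'itemLabel': 'Arnold Schwarzenegger', 'citizenshipLabel': 'Austria'}
second = {'itemLabel': 'Arnold Schwarzenegger', 'citizenshipLabel': 'United States of America'}

merged = {col: merge_field(first.get(col), second.get(col)) for col in first}
print(merged['citizenshipLabel'])  # ['Austria', 'United States of America']
```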

In [45]:
write_json_to_file('american_politicians_final', wiki_poli_merged)

'american_politicians_final_1636570708023.json'

### Get the 100 most quoted party members

Using the results of the previous two steps (the number of quotes for each speaker and the list of US politicians), we compile a list of the 100 most quoted members for each of the two major US political parties.

In [57]:
dem_list = []
rep_list = []

CAP_TARGET = 100
DEM_PARTY = "http://www.wikidata.org/entity/Q29552"
REP_PARTY = "http://www.wikidata.org/entity/Q29468"

for v in sorted_combined_signi_dict:
    row = wiki_poli_merged.get(v, None)
    
    # Could not find person in the politician dictionary
    if row is None:
        continue
    
    memberOfParty = row.get('memberOfParty', None)
    if memberOfParty is None:
        continue
    
    # Cast to one element list if not already a list
    if not isinstance(memberOfParty, list):
        memberOfParty = [memberOfParty]
    
    # Check membership
    if DEM_PARTY in memberOfParty:
        if REP_PARTY in memberOfParty:
            # member of both parties, just skip
            continue

        # Check if the list is already at full capacity
        if len(dem_list) < CAP_TARGET:
            dem_list.append(row)
    elif REP_PARTY in memberOfParty:
        # Check if the list is already at full capacity
        if len(rep_list) < CAP_TARGET:
            rep_list.append(row)
    
    # Check if both lists are at full capacity
    if len(dem_list) == CAP_TARGET and len(rep_list) == CAP_TARGET:
        break

### Get the politician quotes

For the politicians in the previously compiled lists, we now fetch the quotes from the quote files. We use the methods defined at the top of this notebook, which were written in a reusable way.

Define the initialization and visit methods.

In [90]:
poli_quotes = {}
poli_people = set()

In [91]:
def poli_initialize(out_file):
    global poli_quotes
    global poli_people
    global dem_list
    global rep_list
    
    poli_quotes = {}
    poli_people = set()
    
    for v in dem_list:
        poli_people.add(v['item'])
    for v in rep_list:
        poli_people.add(v['item'])

In [92]:
def save_politician_quotes(out_file, row):
    """
    Remember the quote only if it belongs to one of the politicians in the set poli_people and the probability is at least 80%.
    """
    global poli_quotes
    global poli_people
    
    probas = row['probas']
    qids = row['qids']
    
    # Check if the probability field exists
    if (len(probas) == 0 or len(qids) == 0):
        return
    
    if (probas[0][0] == 'None'):
        return
    
    # Check if the probability is over 80%
    p = float(probas[0][1])
    if (p < 0.8):
        return
    
    # Check if the speaker is one of the 100 party members
    qid = qids[0]
    if qid not in poli_people:
        return
    
    # Remember only the quote and the probability
    data = {}
    data['quotation'] = row['quotation']
    data['proba'] = row['probas'][0][1]
    
    # Append the quote
    arr = poli_quotes.get(qid, [])
    arr.append(data)
    poli_quotes[qid] = arr

Create one file per year. Each file contains a dictionary where the key is the QID of the speaker and the value is the list of significant quotes attributed to the speaker.
<br><br>
<font color='red'>WARNING: LONG EXECUTION!</font>

In [None]:
years = [2015, 2016, 2017, 2018, 2019, 2020]
# years = [2020]
for year in years:
    path_to_input = PATTERN_INPUT.format(year)
    
    run_through_quotes(
        poli_initialize, save_politician_quotes, year, "poli_quotes", path_to_input, name='politician-quotes', chunk_size=1_048_576)
    print('')
    print(f'Finished compiling quotes for the year {year}')

### Combine the quotes and the wikidata

Now we combine the politician quotes with their wikidata information. We use the 6 files of politician quotes created in the previous step, as well as the list of the party members. The result is a file which contains 200 entries, where each entry represents one politician, and contains their wikidata info as well as a list of quotes. The list of quotes can be quite long for some of the politicians.

In [96]:
poli_quote_files = [
    "../quotebank/politician-quotes-2015_1636331534906.json",
    "../quotebank/politician-quotes-2016_1636332058163.json",
    "../quotebank/politician-quotes-2017_1636333168732.json",
    "../quotebank/politician-quotes-2018_1636334221167.json",
    "../quotebank/politician-quotes-2019_1636335010497.json",
    "../quotebank/politician-quotes-2020_1636330658142.json"
]

poli_quotes_combined = {}

both_parties = dem_list + rep_list
for v in both_parties:
    copy = dict(v)
    copy['quotations'] = []
    
    poli_quotes_combined[v['item']] = copy

for poli_quote_file_name in poli_quote_files:
    with open(poli_quote_file_name, 'r') as f:
        quotes = json.load(f)
        
        for k in quotes.keys():
            poli_quotes_combined[k]['quotations'] += quotes[k]

write_json_to_file('politician-quotes-combined', poli_quotes_combined)

'politician-quotes-combined_1636575214462.json'

### Filter the quotes

Some of the quotes in the database do not represent actual quotes, but instead contain junk like html tags, source code, or text from the webpage where the source article was published.
<br>
We filter these quotes out so our dataset is not polluted by junk data. We have found a few filters which detect most of the junk data, while maintaining a low false positive rate:
<ul>
    <li>quotes which contain very long 'words' - more than 50 characters</li>
    <li>quotes which contain URLs - these usually contain other junk characters</li>
    <li>quotes which contain JSON-like key-value pairs</li>
    <li>quotes which contain a lot of special characters (more than 10% of total characters)</li>
</ul>

In [108]:
with open('../quotebank/politician-quotes-combined_1636336204264.json', 'r') as f:
    poli_quotes_filtered = json.load(f)

In [109]:
filtered_quotes = []

# Raw strings avoid invalid escape sequence warnings; the patterns themselves are unchanged
weird_pattern = r'[_@#+&;:\(\)\{\}\[\]\/`]'
json_pattern = r'''\{.*[a-zA-Z]+:\s['"`][a-zA-Z0-9]+['"`].*\}'''
url_pattern = r'https?'

for k in poli_quotes_filtered.keys():
    elem = poli_quotes_filtered[k]
    
    new_arr = []
    for entry in elem['quotations']:
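        text = entry['quotation']
        
        # Empty or whitespace-only quotes would break max() and the ratio below; treat them as junk
        if not text.strip():
            filtered_quotes.append(entry)
            continue
        
        longest = max(text.split(), key=len)
        if (len(longest) > 50):
            filtered_quotes.append(entry)
            continue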
        
        if re.search(url_pattern, text) is not None:
            filtered_quotes.append(entry)
            continue
        
        if re.search(json_pattern, text) is not None:
            filtered_quotes.append(entry)
            continue
            
        weird_num = len(re.findall(weird_pattern, text))
        total = len(text)
        weird_percent = weird_num / total
        if (weird_percent > 0.1):
            filtered_quotes.append(entry)
            continue
            
        new_arr.append(entry)
    elem['quotations'] = new_arr
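
The four filters can be tried on individual strings with a small predicate. The sketch below reuses the same patterns and thresholds; `is_junk` is a hypothetical helper, the sample strings are made up, and treating empty text as junk is an assumption of this sketch.

```python
import re

WEIRD = re.compile(r'[_@#+&;:\(\)\{\}\[\]\/`]')
JSON_LIKE = re.compile(r'''\{.*[a-zA-Z]+:\s['"`][a-zA-Z0-9]+['"`].*\}''')
URL = re.compile(r'https?')

def is_junk(text, max_word=50, max_weird=0.1):
    """Return True if the text trips any of the four filters."""
    words = text.split()
    if not words:
        return True  # assumption: empty text is treated as junk
    if max(len(w) for w in words) > max_word:
        return True
    if URL.search(text) or JSON_LIKE.search(text):
        return True
    # Share of special characters among all characters
    return len(WEIRD.findall(text)) / len(text) > max_weird

print(is_junk('We need to invest in our schools.'))      # False
print(is_junk('Read more at https://example.com/news'))  # True
print(is_junk('{ name: "value" }'))                      # True
```

Note that the URL pattern matches the substring `http` anywhere in the text, so it is deliberately broad.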

In [113]:
write_json_to_file('data/politician-quotes-combined-and-filtered', poli_quotes_filtered)

'data/politician-quotes-combined-and-filtered_1636577222699.json'

### Concatenate the quotes

Finally, we concatenate the quotes into a single fixed-length string. We do this because of the limitation of the CSV file format, which can contain at most ~32000 characters in a single field. This means that most of the quotes will not be used.
<br>
Alternatively, we could use multiple fields for the same speaker, but we think the number of characters that can fit in a single cell is enough for a decent analysis.
<br>
We sort the quotes by length and use the longest ones first. We do this because the longer quotes are a better representation of a person's speech.

In [116]:
with open('data/politician-quotes-combined-and-filtered_1636577222699.json', 'r') as f:
    poli_quotes_concat = json.load(f)

In [117]:
QUOTE_LENGTH = 5000

for k in poli_quotes_concat.keys():
    elem = poli_quotes_concat[k]
    
    # Sort the quotes by length
    elem['quotations'].sort(key = lambda x: len(x['quotation']), reverse = True)
    
    concat = ''
    for quote in elem['quotations']:
        # Concatenate the quotes
        concat += ' ' + quote['quotation']
        
        # Trim if we are over QUOTE_LENGTH
        if (len(concat) >= QUOTE_LENGTH):
            concat = concat[0:QUOTE_LENGTH]
            break
    
    elem['quotations'] = concat
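
The longest-first concatenation can be checked on a few short strings. The sketch below mirrors the loop above in a hypothetical helper, `concat_longest_first`, with a small limit and made-up quotes. As in the loop above, the result starts with a space and is shorter than the limit when a speaker does not have enough text.

```python
def concat_longest_first(quotes, limit):
    """Join quotes from longest to shortest and cut the result at `limit` characters."""
    result = ''
    for quote in sorted(quotes, key=len, reverse=True):
        result += ' ' + quote
        if len(result) >= limit:
            return result[:limit]
    # Not enough text to reach the limit
    return result

sample = ['short', 'a much longer quote', 'medium one']
print(repr(concat_longest_first(sample, 25)))  # ' a much longer quote medi'
```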

In [None]:
write_json_to_file('data/politician-quotes-concatenated', poli_quotes_concat)