# Data Preprocessing
https://colab.research.google.com/drive/1EU3MsQnyIjhK3Eoblj9gbqgsHdIDGAEV?usp=sharing

### Mounting the drive
(To access the datasets used here, add a shortcut to our project drive (ADA_proj) to your own drive.)

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


### Importing the needed libraries

In [None]:
import bz2
import json
import numpy as np 
import pandas as pd

### Loading wikidata
We load the provided Wikidata entities, keep only those whose occupations include 'politician', and index the dataframe on the QID (as it uniquely identifies the entities).

In [None]:
parquet_path = '/content/drive/MyDrive/Project datasets/speaker_attributes.parquet'

## Loading the parquet
speaker_attributes = pd.read_parquet(parquet_path)

In [None]:
## Filtering the wikidata entities based on whether 'politician' is among the occupations
politician = 'Q82955'
politician_attributes = speaker_attributes[speaker_attributes['occupation'].apply(lambda x: x is not None and politician in x)]

In [None]:
## Taking only the needed attributes and reindexing on id
useful_politician_attributes=politician_attributes[['nationality','occupation', 'party', 'academic_degree', 'id']].set_index(politician_attributes['id'])
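
The filter-and-index step can be checked on a small in-memory table before running it on the full parquet. The sketch below is only an illustration under stated assumptions: `filter_politicians` and `toy_attributes` are hypothetical names, every QID except `Q82955` is a made-up placeholder, and plain Python lists stand in for whatever sequence type the parquet actually yields. Unlike the cell above, it uses `set_index("id")`, which moves `id` into the index rather than keeping it as a column as well.

```python
import pandas as pd

POLITICIAN_QID = "Q82955"  # occupation QID used in this notebook for 'politician'


def filter_politicians(attributes: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose occupation list contains POLITICIAN_QID, indexed by id."""
    # Rows with no occupation (None) are treated as non-politicians.
    is_politician = attributes["occupation"].apply(
        lambda occ: occ is not None and POLITICIAN_QID in occ
    )
    kept = ["nationality", "occupation", "party", "academic_degree"]
    return attributes.loc[is_politician].set_index("id")[kept]


# Hypothetical toy rows: every value except Q82955 is a made-up placeholder.
toy_attributes = pd.DataFrame(
    {
        "id": ["Q_A", "Q_B", "Q_C"],
        "occupation": [["Q82955", "Q_LAWYER"], None, ["Q_ACTOR"]],
        "nationality": [["Q_COUNTRY"], ["Q_COUNTRY"], None],
        "party": [["Q_PARTY"], None, None],
        "academic_degree": [None, None, None],
    }
)

toy_politicians = filter_politicians(toy_attributes)
print(toy_politicians.index.tolist())
```

Only the row whose occupation list contains the politician QID should survive; rows with a missing occupation are dropped rather than raising an error.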

### Loading Quotebank
We load the quotes for each year separately.  
We only consider quotes whose speaker is known (the speaker field is not 'None'), as our analysis is based on the speaker's ideology.  
We only consider quotes whose speaker is a politician, that is, whose QID (we only consider the first QID from the list of QIDs) is in the previously filtered Wikidata dataframe.  
We add the kept Wikidata attributes: for each retained quote, we add the speaker's nationality, party, and academic degree.
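
The per-quote rules above can be sketched as a small function that is easy to test on in-memory records. This is a minimal sketch, not the notebook's actual cell: `enrich_quote`, `toy_politicians`, and all QIDs and quote contents are hypothetical, and the extra guard against an empty `qids` list is an added assumption.

```python
import pandas as pd


def enrich_quote(quote: dict, politicians: pd.DataFrame):
    """Return a copy of the quote with speaker attributes added, or None if it is filtered out."""
    # Rule 1: the speaker must be known (unknown speakers carry the string 'None').
    if quote["speaker"] == "None" or not quote["qids"]:
        return None
    # Rule 2: only the first QID is considered, and it must belong to a politician.
    qid = quote["qids"][0]
    if qid not in politicians.index:
        return None
    # Rule 3: attach the kept attributes, using an empty list for missing values.
    speaker = politicians.loc[qid]
    enriched = dict(quote)
    for field in ("party", "academic_degree", "nationality"):
        value = speaker[field]
        enriched[field] = [] if value is None else list(value)
    return enriched


# Hypothetical politician table with a single entry; all QIDs are placeholders.
toy_politicians = pd.DataFrame(
    {
        "party": [["Q_PARTY"]],
        "academic_degree": [None],
        "nationality": [["Q_COUNTRY"]],
    },
    index=["Q_A"],
)

unknown = enrich_quote({"speaker": "None", "qids": [], "quotation": "..."}, toy_politicians)
outsider = enrich_quote({"speaker": "B", "qids": ["Q_OTHER"], "quotation": "..."}, toy_politicians)
kept = enrich_quote({"speaker": "A", "qids": ["Q_A", "Q_Z"], "quotation": "..."}, toy_politicians)
```

Here `unknown` and `outsider` come back as `None`, while `kept` gains the three attribute lists. Returning a new dictionary instead of mutating the input keeps the function free of side effects.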

In [None]:
## paths to the input and output files
path_to_file = '/content/drive/MyDrive/Quotebank/quotes-2017.json.bz2'
## note: the output is written as bz2-compressed JSON lines, despite the .csv extension
path_to_out = '/content/drive/MyDrive/ADA_Proj/quotes-2017-clean.csv'
## number of quotes with a known speaker read so far
n_read = 0
## get the ids that are in the filtered wikidata, i.e. the politicians' ids
ids = useful_politician_attributes.index
## importing the quotes
with bz2.open(path_to_file, 'rb') as s_file:
    with bz2.open(path_to_out, 'wb') as d_file:
        for quote_dp in s_file:
            quote_dp = json.loads(quote_dp)
            ## delete unneeded attributes
            for key in ('quoteID', 'numOccurrences', 'probas', 'urls', 'phase'):
                del quote_dp[key]
            ## skip quotes with 'None' as speaker
            if quote_dp['speaker'] == 'None':
                continue
            ## just to track the loading progress
            n_read += 1
            if n_read % 100000 == 0:
                print('number of quotes read: {}'.format(n_read))
            ## check whether the first QID belongs to a politician
            qid = quote_dp['qids'][0]
            if qid in ids:
                speaker = useful_politician_attributes.loc[qid]
                ## add the wikidata attributes to the quote data
                quote_dp['party'] = [] if speaker['party'] is None else list(speaker['party'])
                quote_dp['academic_degree'] = [] if speaker['academic_degree'] is None else list(speaker['academic_degree'])
                quote_dp['nationality'] = [] if speaker['nationality'] is None else list(speaker['nationality'])
                ## write the resulting quote_dp to the drive
                d_file.write((json.dumps(quote_dp) + '\n').encode('utf-8'))

number of quotes read: 100000
number of quotes read: 200000
number of quotes read: 300000
number of quotes read: 400000
number of quotes read: 500000
number of quotes read: 600000
number of quotes read: 700000
number of quotes read: 800000
number of quotes read: 900000
number of quotes read: 1000000
number of quotes read: 1100000
number of quotes read: 1200000
number of quotes read: 1300000
number of quotes read: 1400000
number of quotes read: 1500000
number of quotes read: 1600000
number of quotes read: 1700000
number of quotes read: 1800000
number of quotes read: 1900000
number of quotes read: 2000000
number of quotes read: 2100000
number of quotes read: 2200000
number of quotes read: 2300000
number of quotes read: 2400000
number of quotes read: 2500000
number of quotes read: 2600000
number of quotes read: 2700000
number of quotes read: 2800000
number of quotes read: 2900000
nombre de citations lue