# Description
This notebook handles the data pre-processing pipeline that turns the three chosen datasets (Quotebank, Wikidata and the AllSides Media Bias Rating) into the final data file used for plotting and analysis.

This notebook extracts all quotes in the Quotebank data from 2015 to 2020 in which a US politician mentions one or more other politicians. The script restructures the quotes into a compact format with clear information about the mentioning politician, the mentioned politician, the content of the mention, its topics, its sentiment and its weight (balanced by the bias of the mention's sources).

The pre-processing pipeline works as follows:
1. Filter and interpret (from QID to label) all entities in the Wikidata dump that are living politicians, and dump their names (including aliases), gender, religion, parties and positions held (with time periods) into a smaller file.
2. With the interpreted politician catalogue, go through all Quotebank quotes from 2015 to 2020 and keep the quotations that were spoken by US politicians and contain the names or aliases of other politicians. Dump the quotes to files (one for mentioned US politicians and the other for mentioned non-US politicians).
3. (Check this **@Guoyuan**) Look up the domain of each mention's sources in the AllSides Media Bias Rating dataset and assign a new value to each mention indicating the bias of the source.
4. (Check this **@Guoyuan**) Perform Latent Dirichlet Allocation (LDA) topic clustering on the contents of the mentions and assign a new value to each mention indicating its topic.
5. (Check this **@Guoyuan**) Perform sentiment analysis (special method Noun?) on the contents of the mentions and assign a new value to each mention indicating its sentiment.

# Required packages
[qwikidata](https://qwikidata.readthedocs.io/en/stable/)\
[json](https://docs.python.org/3/library/json.html)\
[bz2](https://docs.python.org/3/library/bz2.html)\
[pandas](https://pandas.pydata.org/)

# Required data
[wikidata dump](https://dumps.wikimedia.org/wikidatawiki/entities/)\
[Quotebank data_dump](https://zenodo.org/record/4277311#.YaKLoWDMJm8)\
[Partisan Audience Bias Scores](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QAN5VX)


The resulting dataframe is dumped into two JSON files. The first contains compact information for each political quotation that appears in the Quotebank dataset, possibly with multiple values for the mentioned names (and mentioned QIDs) and for the source domains, plus a total count of occurrences of the quotation; this form is handier for analysis. The second is a flattened version of the first, in which a record with m mentions and n sources is exploded into m*n records; this form is handier for plotting.
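The m*n flattening described above can be sketched in plain Python. This is only an illustration under assumptions: the field names (`mentions`, `mentions_qids`, `urls`) follow the compact output shown later in this notebook, while `flatten_record` and the output keys `mention`, `mention_qid` and `url` are hypothetical names, not the ones used by `parse_mentions`.

```python
from itertools import product

# Fields that hold lists in the compact record (assumed from the compact output).
LIST_FIELDS = ("mentions", "mentions_qids", "urls")

def flatten_record(record):
    """Expand one compact record into one row per (mention, source) pair."""
    # Names and QIDs are paired element-wise, so they are zipped, not crossed.
    pairs = list(zip(record["mentions"], record["mentions_qids"]))
    shared = {k: v for k, v in record.items() if k not in LIST_FIELDS}
    rows = []
    for (name, qid), url in product(pairs, record["urls"]):
        row = dict(shared)
        row.update({"mention": name, "mention_qid": qid, "url": url})
        rows.append(row)
    return rows

# Made-up record with m = 2 mentions and n = 3 sources.
example = {
    "quoteID": "example-1",
    "mentions": ["Barack Obama", "Bill Clinton"],
    "mentions_qids": ["Q76", "Q1124"],
    "urls": ["a.example", "b.example", "c.example"],
}
flat_rows = flatten_record(example)
print(len(flat_rows))  # 2 mentions x 3 sources
```

With pandas, `DataFrame.explode` offers a comparable operation on list-valued columns.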

In [1]:
import os
import sys
import bz2

import numpy as np
import pandas as pd

# make the helper modules in ./src importable
sys.path.insert(0, os.path.join(os.getcwd(), 'src'))

## Step 1: Filtering and interpreting politician catalogue
In this part we go through the full Wikidata dump and keep the entities of living politicians. We dump the properties of interest for each politician: country of citizenship, gender, religion, parties (with start and end dates) and positions held (with start and end dates). If a politician is missing chosen properties (which is quite common), we regard the entity as corrupted and discard the record. For US politicians with a public image, we dump a compressed copy of the image as well. After the data acquisition, we interpret the data by looking up the labels of all QIDs in an offline catalogue (provided by the ADA staff). If a QID is missing from the offline catalogue, we query the Wikidata API for it and extend the offline catalogue with the query result.
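As a rough illustration of the entity filter, the sketch below checks a single entity given as a raw Wikidata JSON dictionary. It is not the code in `PoliticianFilter`. The property and item IDs are the standard Wikidata ones (P106 occupation, Q82955 politician, P570 date of death, P27 country of citizenship), but the helper names and the choice of which properties count as required are assumptions made for this example.

```python
POLITICIAN = "Q82955"  # Wikidata item for "politician"

def claim_ids(entity, prop):
    """Return the item IDs referenced by the statements of property `prop`."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids

def is_alive_politician(entity, required=("P27",)):
    """Keep entities whose occupation includes politician and that have no date of death.

    `required` lists properties that must be present; the default is a
    hypothetical choice for this sketch.
    """
    claims = entity.get("claims", {})
    if POLITICIAN not in claim_ids(entity, "P106"):  # P106: occupation
        return False
    if "P570" in claims:                             # P570: date of death
        return False
    return all(prop in claims for prop in required)

def item_statement(qid):
    """Build a minimal statement pointing at an item (for the toy data below)."""
    return {"mainsnak": {"datavalue": {"value": {"id": qid}}}}

toy_entity = {"id": "Q0", "claims": {"P106": [item_statement(POLITICIAN)],
                                     "P27": [item_statement("Q30")]}}
print(is_alive_politician(toy_entity))
```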

In [None]:
#import functions for dumping politicians from wikidata dump
from PoliticianFilter import dump_politicians,dump_figures
#import functions for QID interpretation, offline QID list extension
from QID_interpretation import interpret_qids, single_interpret, list_interpret, listlist_interpret

In [None]:
#Constants for step 1
#path to full wiki data dump (80G compressed file)
WIKI_DATA_FULL = '../../data/latest-all.json.bz2' 
#path to filtered wiki data dump with only alive politician entities
WIKI_DATA_FILTERED = '../../data/filtered_politician_v2.json.bz2' 
#path to downloaded politician images
WIKI_DATA_IMAGE = '../../data/img'
#path to local QID dictionary
Q_catalogue = '../../data/wikidata_labels_descriptions_quotebank.csv.bz2' 
#path to extended local QID dictionary
Q_catalogue_new = '../../data/wikidata_labels_descriptions_quotebank_expanded.csv.bz2'
#path to filtered wiki data dump with only alive politician entities and with interpreted QIDs
WIKI_DATA_FILTERED_LABELED = '../../data/filtered_politician_labeled_v3.json.bz2' 
#path to missing QIDs log (corrupted entities with invalid QIDs)
WIKI_DATA_FILTERED_MISSINGQ = '../../data/filtered_politician_missingqids_v3.json.bz2'  

In [18]:
# WikidataJsonDump is used below but was never imported
from qwikidata.json_dump import WikidataJsonDump
'''
dump_politicians
The function dump_politicians takes the full Wikidata JSON dump at WIKI_DATA_FULL as input
and dumps the politician entities to WIKI_DATA_FILTERED.
If verbose = True, it prints debug information (qid, missing value) whenever a corrupted entity is detected.
'''
dump_politicians(WikidataJsonDump(WIKI_DATA_FULL), WIKI_DATA_FILTERED, verbose=False)
politician_catalogue = pd.read_json(WIKI_DATA_FILTERED, lines=True, compression='bz2')
print(f'Filtered {len(politician_catalogue)} politician records in total')

Filtered 269337 politician records in total


In [4]:
# read the politician catalogue dumped in the last cell and keep US politicians (country of citizenship is USA, Q30)
US_politician_catalogue = politician_catalogue[politician_catalogue['nationality']=='Q30']
# get the qid list of US politicians to request images
US_qid = US_politician_catalogue['qid'].tolist()
'''
dump_figures
The function dump_figures takes the qid list of US politicians as input,
makes HTTP requests to the Wikidata API and downloads the first image of each politician into the folder WIKI_DATA_IMAGE.
The images are scaled to fit within 300x400 pixels to reduce disk usage.
'''
dump_figures(US_qid,WIKI_DATA_IMAGE,verbose=False)

In [7]:
'''
interpret_qids
The function interpret_qids takes the raw filtered politician entities "WIKI_DATA_FILTERED" as input
and looks up the QID meanings in the offline QID dictionary "Q_catalogue".
If the look-up fails, the function queries the Wikidata API online.
The query result is used to extend the offline QID dictionary (returned as "Q_catalogue_expanded").
If the query fails as well, the politician entity is considered corrupted and removed, and its QIDs are saved to the log file "WIKI_DATA_FILTERED_MISSINGQ".
If verbose is True, print all QIDs unknown to the offline dictionary together with their online interpretations at runtime.
'''
Q_catalogue_expanded = interpret_qids(WIKI_DATA_FILTERED, Q_catalogue, WIKI_DATA_FILTERED_LABELED, WIKI_DATA_FILTERED_MISSINGQ, single_interpret, list_interpret, listlist_interpret, verbose = False)
#update qid catalogue
with open(Q_catalogue_new, 'wb') as write_buf:
    Q_catalogue_expanded.to_csv(path_or_buf=write_buf, compression='bz2', encoding= 'utf-8')
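
The look-up order used for interpretation (offline dictionary first, online query second, log the QID if both fail) can be sketched as follows. This is a minimal sketch, not the implementation in `QID_interpretation`: `interpret_qid` and its arguments are hypothetical names, and the online query is replaced by a stand-in callable so that the example runs without network access.

```python
def interpret_qid(qid, catalogue, fetch_label, missing_log):
    """Resolve `qid` to a label, extending `catalogue` on an offline miss.

    catalogue   : dict mapping QID -> label (the offline dictionary)
    fetch_label : callable returning a label or None (stand-in for the online query)
    missing_log : list collecting QIDs that could not be resolved at all
    """
    if qid in catalogue:
        return catalogue[qid]
    label = fetch_label(qid)
    if label is None:
        missing_log.append(qid)
        return None
    catalogue[qid] = label  # extend the offline dictionary with the query result
    return label

# Toy data: one label known offline, one only "online", one unknown everywhere.
offline = {"Q30": "United States of America"}
fake_online = {"Q76": "Barack Obama"}
unresolved = []

labels = [interpret_qid(q, offline, fake_online.get, unresolved)
          for q in ("Q30", "Q76", "Q-unknown")]
print(labels, unresolved)
```

Passing the fetcher in as an argument keeps the look-up logic testable offline; a real run would pass a function that queries the Wikidata API.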

## Part 2: Political quotes filtering
With the interpreted politician catalogue, we go through all Quotebank data from 2015 to 2020 and keep the quotations that were spoken by US politicians and contain the names or aliases of other politicians. The quotes are dumped to files (one for mentioned US politicians and the other for mentioned non-US politicians).

This part covers the first half of that task: we go through all Quotebank data from 2015 to 2020 and extract all quotes spoken by US politicians.
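The speaker filter can be pictured as a membership test on each quote's speaker QIDs. The sketch below works on in-memory records and assumes that each record carries a `qids` list of candidate speaker QIDs; `filter_by_speaker` is a hypothetical helper, not the function in `QuotesFilterSpeaker`, and the sample quotes are made up.

```python
def filter_by_speaker(quotes, us_qids):
    """Yield the quotes whose speaker QIDs include at least one US politician."""
    for quote in quotes:
        if any(qid in us_qids for qid in quote.get("qids", [])):
            yield quote

# Toy records; Q22686 and Q11702 stand in for the US politician catalogue.
us_politicians = {"Q22686", "Q11702"}
sample_quotes = [
    {"quoteID": "a", "quotation": "first quote", "qids": ["Q22686"]},
    {"quoteID": "b", "quotation": "second quote", "qids": ["Q999999"]},
    {"quoteID": "c", "quotation": "third quote", "qids": []},
]
kept = list(filter_by_speaker(sample_quotes, us_politicians))
print([q["quoteID"] for q in kept])
```

Keeping the catalogue as a `set` makes each membership test constant time, which matters when streaming through several years of quotes.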

In [1]:
from QuotesFilterSpeaker import quotes_filter_speaker

In [32]:
#path to filtered wikidata with interpretations from Part 2
WIKI_DATA_FILTERED_LABELED = '../../data/filtered_politician_labeled_v3.json.bz2'
#path to output file of filtered quotes said by US politicians
QUOTES_BY_US = '../../data/quotes_by_USA.json.bz2'
#path to the folder where the quotebank compressed files are stored
quotebank_dir = '../../data/'

In [20]:
#the function takes as input the directory storing all Quotebank data,
#the output path to store the filtered quotes 
#and the catalogue for politician entities
quotes_filter_speaker(quotebank_dir, QUOTES_BY_US,WIKI_DATA_FILTERED)

## Part 4: Quotes Filtering based on Mentions

In this part we go through the Quotebank data filtered in the last part and keep all quotes containing the names or aliases of other politicians. We append columns to the quotes indicating who is mentioned (names and QIDs). The source URLs of each quote are simplified to the form subdomain.domain.top-level-domain. We provide two kinds of dataframes as outputs. The first includes compact information for each quote: for quotes with multiple mentions or sources, all values are kept in a list. The second is a flattened version of the first, where each record is expanded to $\prod_{i=1}^{n} E(i)$ records, where $i$ indexes the $n$ list-valued properties (mentions, sources) and $E(i)$ is the number of elements in the list of property $i$.
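Matching names and aliases inside a quotation can be sketched with a single alternation regex. This is an illustration only: `build_pattern` and `find_mentions` are hypothetical helpers, the alias table is a toy one, and the boundary lookarounds and longest-first ordering are choices made for this sketch rather than a description of `generate_patterns`.

```python
import re

def build_pattern(alias_to_qid):
    """Compile one regex that matches any alias as a whole word or phrase."""
    # Longest aliases first, so "George W. Bush" is tried before "Bush".
    aliases = sorted(alias_to_qid, key=len, reverse=True)
    body = "|".join(re.escape(alias) for alias in aliases)
    # Lookarounds prevent matches inside longer words (e.g. "Bushes").
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")

def find_mentions(quotation, pattern, alias_to_qid):
    """Return the sorted, de-duplicated QIDs of politicians named in the quotation."""
    return sorted({alias_to_qid[match] for match in pattern.findall(quotation)})

# Toy alias table (alias -> QID).
toy_aliases = {
    "George W. Bush": "Q207",
    "Bush": "Q207",
    "Paul Ryan": "Q203966",
}
toy_pattern = build_pattern(toy_aliases)
print(find_mentions("I think Paul Ryan would make a great speaker,", toy_pattern, toy_aliases))
```

Escaping each alias with `re.escape` matters because names such as "Bush Jr." contain regex metacharacters.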

Similar to the last task, we do this in a US version and a world version. The US version contains all quotes that are said by US politicians and mention other US politician(s); it is used for our US political network analysis. The world version contains the quotes that mention politicians from around the world; it is used for a world-wide analysis. In the world version, we only consider outgoing quotes (from US politicians about politicians of other countries) and do not consider incoming quotes. The reason is that the sources of Quotebank are biased towards US websites, so we would lack data if we tried to analyze incoming quotes.
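The simplification of source URLs to subdomain.domain.top-level-domain can be sketched with the standard library. `source_domain` and `unique_domains` are hypothetical helpers written for this example, and the URLs are made up; the actual parsing done in `QuotesFilterMentions` may differ.

```python
from urllib.parse import urlparse

def source_domain(url):
    """Reduce a full URL to its host name (subdomain.domain.tld), lower-cased."""
    return urlparse(url).hostname or ""

def unique_domains(urls):
    """Map URLs to host names, dropping empty results and duplicates but keeping order."""
    hosts = (source_domain(url) for url in urls)
    return list(dict.fromkeys(host for host in hosts if host))

toy_urls = [
    "http://feeds.foxnews.com/politics/story-one",
    "https://www.foxnews.com/politics/story-one",
    "https://www.foxnews.com/politics/story-two",
]
print(unique_domains(toy_urls))
```

The sub-domain is kept, so `feeds.foxnews.com` and `www.foxnews.com` stay distinct, which matches the `urls` column shown in the compact output of this notebook.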

In [2]:
WIKI_DATA_FILTERED_LABELED = '../../data/filtered_politician_labeled_v3.json.bz2' #path to filtered wikidata with interpretations from Part 2
QUOTES_BY_US = '../../data/quotes_by_USA.json.bz2'  #take as input output of Part 3 all quotes said by US politicians
QUOTES_MENTIONS_US = '../../data/quotes_mentions_USA.json.bz2' #output path for US mentions in quotes
QUOTES_MENTIONS_WORLD = '../../data/quotes_mentions_world.json.bz2' #output path for world mentions in quotes
QUOTES_MENTIONS_US_PARSED_COMPACT = '../../data/quotes_mentions_USA_compact.json.bz2'
QUOTES_MENTIONS_US_PARSED_FLAT = '../../data/quotes_mentions_USA_flattened.json.bz2'
QUOTES_MENTIONS_WORLD_PARSED_COMPACT = '../../data/quotes_mentions_world_compact.json.bz2'
QUOTES_MENTIONS_WORLD_PARSED_FLAT = '../../data/quotes_mentions_world_flattened.json.bz2'

In [6]:
from QuotesFilterMentions import generate_patterns, filter_mentions, parse_mentions

In [7]:
pattern_US, pattern_US_flat = generate_patterns(WIKI_DATA_FILTERED_LABELED,US=True)
pattern_world, pattern_world_flat = generate_patterns(WIKI_DATA_FILTERED_LABELED,US=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  names_us['aliases_all'] = [a+[b] for a,b in zip(names_us['aliases'],names_us['name'])]
  country_list = json_normalize(res["results"]["bindings"])


In [12]:
filter_mentions(QUOTES_BY_US,pattern_US,QUOTES_MENTIONS_US,verbose=True)

Dumped 406 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 244.14]
Dumped 372 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 252.31]
Dumped 402 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 255.54]
Dumped 422 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 259.24]
Dumped 397 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 253.34]
Dumped 419 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 254.87]
Dumped 415 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 256.40]
Dumped 424 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 249.56]
Dumped 418 political quotations containi

Dumped 420 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 255.52]
Dumped 443 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 260.34]
Dumped 401 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 254.47]
Dumped 417 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.93]
Dumped 376 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.38]
Dumped 397 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 252.39]
Dumped 418 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.93]
Dumped 411 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.30]
Dumped 405 political quotations containi

Dumped 829 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 186.41]
Dumped 886 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 184.96]
Dumped 845 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 187.27]
Dumped 808 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 187.08]
Dumped 857 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 187.54]
Dumped 834 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 187.07]
Dumped 888 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 187.99]
Dumped 824 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 185.84]
Dumped 811 political quotations containi

Dumped 783 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.17]
Dumped 825 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 230.54]
Dumped 671 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 255.24]
Dumped 638 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 243.16]
Dumped 623 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 131.96]
Dumped 661 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 138.48]
Dumped 650 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 168.70]
Dumped 603 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 167.70]
Dumped 650 political quotations containi

Dumped 666 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 204.29]
Dumped 619 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 199.28]
Dumped 634 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 209.04]
Dumped 685 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 182.15]
Dumped 663 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 176.31]
Dumped 658 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 175.54]
Dumped 659 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 207.01]
Dumped 668 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 151.53]
Dumped 631 political quotations containi

Dumped 643 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 190.19]
Dumped 623 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 218.32]
Dumped 682 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 156.04]
Dumped 618 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 164.12]
Dumped 693 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 198.46]
Dumped 675 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 165.67]
Dumped 645 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 199.12]
Dumped 637 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 201.88]
Dumped 716 political quotations containi

Dumped 657 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 251.33]
Dumped 659 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 247.56]
Dumped 636 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 259.24]
Dumped 600 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 256.30]
Dumped 614 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.76]
Dumped 663 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 255.37]
Dumped 603 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 259.83]
Dumped 623 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 256.28]
Dumped 654 political quotations containi

Dumped 607 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 251.40]
Dumped 658 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.95]
Dumped 664 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 263.65]
Dumped 611 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 260.00]
Dumped 643 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.18]
Dumped 635 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 255.49]
Dumped 610 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.06]
Dumped 634 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.64]
Dumped 620 political quotations containi

Dumped 650 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.04]
Dumped 613 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 260.44]
Dumped 679 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 262.44]
Dumped 678 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 263.42]
Dumped 659 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 265.09]
Dumped 643 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.95]
Dumped 648 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.75]
Dumped 650 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 264.62]
Dumped 705 political quotations containi

Dumped 692 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.34]
Dumped 671 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 257.63]
Dumped 679 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 260.80]
Dumped 646 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 260.06]
Dumped 647 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 259.79]
Dumped 624 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 258.90]
Dumped 645 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 258.69]
Dumped 686 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 261.02]
Dumped 626 political quotations containi

Dumped 656 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 265.87]
Dumped 671 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 263.61]
Dumped 694 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 265.73]
Dumped 695 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 267.83]
Dumped 676 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 267.56]
Dumped 650 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 267.67]
Dumped 649 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 259.79]
Dumped 703 political quotations containing mentions of other politicians out of 10000 quotations [quotations/s: 267.68]
Dumped 696 political quotations containi

In [107]:
# try:
#     pattern_list = list(filter(('An').__ne__, pattern_list))
#     pattern_list = list(filter(('m').__ne__, pattern_list))
#     pattern_list = list(filter(('1').__ne__, pattern_list))
#     pattern_list = list(filter(('Mi').__ne__, pattern_list))
#     pattern_list = list(filter(('Ye').__ne__, pattern_list))
#     pattern_list = list(filter(('Hou').__ne__, pattern_list))
#     pattern_list = list(filter(('De').__ne__, pattern_list))
#     pattern_list = list(filter(('Co').__ne__, pattern_list))
#     pattern_list = list(filter(('las').__ne__, pattern_list))
#     pattern_list = list(filter(('Yo').__ne__, pattern_list))
#     pattern_list = list(filter(('Cali').__ne__, pattern_list))
#     pattern_list = list(filter(('professor').__ne__, pattern_list))
#     pattern_list = list(filter(('saf').__ne__, pattern_list))
    
    
# except:
#     pass
# pattern = '|'.join(pattern_list)
# # pattern += '|(?<![\w\d])' + '(?![\w\d])|(?<![\w\d])'.join(pattern_list1) + '(?![\w\d])'
# pattern = pattern.replace("(", "").replace(")", "").replace("?", "")

In [4]:
filter_mentions(QUOTES_BY_US,pattern_world,QUOTES_MENTIONS_WORLD,verbose=False)

In [4]:
parse_mentions(QUOTES_MENTIONS_US, pattern_US, pattern_US_flat, QUOTES_MENTIONS_US_PARSED_COMPACT, QUOTES_MENTIONS_US_PARSED_FLAT, verbose = False)

Mapped 10000 US political mentionings to the mentioned politicians.  [quotations/s: 145.73]
Mapped 20000 US political mentionings to the mentioned politicians.  [quotations/s: 153.60]
Mapped 30000 US political mentionings to the mentioned politicians.  [quotations/s: 152.97]
Mapped 40000 US political mentionings to the mentioned politicians.  [quotations/s: 154.00]
Mapped 50000 US political mentionings to the mentioned politicians.  [quotations/s: 155.64]
Mapped 60000 US political mentionings to the mentioned politicians.  [quotations/s: 170.17]
Mapped 70000 US political mentionings to the mentioned politicians.  [quotations/s: 170.66]
Mapped 80000 US political mentionings to the mentioned politicians.  [quotations/s: 170.54]
Mapped 90000 US political mentionings to the mentioned politicians.  [quotations/s: 168.92]
Mapped 100000 US political mentionings to the mentioned politicians.  [quotations/s: 172.46]
Mapped 110000 US political mentionings to the mentioned politicians.  [quotatio

In [5]:
parse_mentions(QUOTES_MENTIONS_WORLD, pattern_world, pattern_world_flat, QUOTES_MENTIONS_WORLD_PARSED_COMPACT, QUOTES_MENTIONS_WORLD_PARSED_FLAT, verbose = True, US=False)

Mapped 10000 US political mentionings to the mentioned politicians.  [quotations/s: 1826.31]
Mapped 20000 US political mentionings to the mentioned politicians.  [quotations/s: 1753.82]
Mapped 30000 US political mentionings to the mentioned politicians.  [quotations/s: 4180.98]


## Part 5: Quote Bias Calibration

In [1]:
import pandas as pd

In [4]:
df = pd.read_json(QUOTES_MENTIONS_US_PARSED_COMPACT,lines=True,compression='bz2')

In [10]:
df = pd.read_json(WIKI_DATA_FILTERED_LABELED,lines=True,compression='bz2')

In [12]:
df['aliases_all'] = [a+[b] for a,b in zip(df['aliases'],df['name'])]

In [13]:
df[:10]

Unnamed: 0,qid,name,gender,nationality,aliases,parties,positions held,religion,us_congress_id,candidacy_election,aliases_all
0,Q207,George W. Bush,male,Q30,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[Republican Party],"[[Governor of Texas, [+1995-01-17T00:00:00Z]],...",Q329646,,"[2000 United States presidential election, 200...","[George Walker Bush, Bush Jr., Dubya, GWB, Bus..."
1,Q946,Donald Tusk,male,Q36,[Donald Franciszek Tusk],"[Civic Platform, European People's Party]","[[Prime Minister of Poland, [+2007-11-16T00:00...",Q9592,,[2005 Polish presidential election],"[Donald Franciszek Tusk, Donald Tusk]"
2,Q1058,Narendra Modi,male,Q668,"[Modi, Narendra Bhai, Narendra Damodardas Modi...",[Bharatiya Janata Party],"[[Chief Minister of Gujarat, [+2001-10-07T00:0...",Q9089,,[2014 Indian general election in Vadodara Lok ...,"[Modi, Narendra Bhai, Narendra Damodardas Modi..."
3,Q1253,Ban Ki-moon,male,Q884,"[Ban Kimoon, Ban Ki Moon]",[independent politician],"[[United Nations Secretary-General, [+2007-01-...",Q9581,,[],"[Ban Kimoon, Ban Ki Moon, Ban Ki-moon]"
4,Q3996,V. P. Kalairajan,male,Q668,[],[All India Anna Dravida Munnetra Kazhagam],[[Member of the Tamil Nadu Legislative Assembl...,,,[],[V. P. Kalairajan]
5,Q4496,Mitt Romney,male,Q30,"[Willard Mitt Romney, Pierre Delecto]",[Republican Party],"[[Governor of Massachusetts, [+2003-01-02T00:0...",Q42504,R000615,"[2012 Republican Party presidential primaries,...","[Willard Mitt Romney, Pierre Delecto, Mitt Rom..."
6,Q5335,Harm Wiersma,male,Q29999,[],[Pim Fortuyn List],[[member of the House of Representatives of th...,,,[2002 Dutch general election],[Harm Wiersma]
7,Q11124,Stephen Breyer,male,Q30,[Stephen Gerald Breyer],[Democratic Party],[[Associate Justice of the Supreme Court of th...,Q9268,,[],"[Stephen Gerald Breyer, Stephen Breyer]"
8,Q11509,Xanana Gusmão,male,Q574,"[Kay Rala Xanana Gusmão, José Alexandre Gusmão]",[National Congress for Timorese Reconstruction...,"[[Prime Minister of East Timor, [+2007-08-08T0...",,,[2002 East Timorese presidential election],"[Kay Rala Xanana Gusmão, José Alexandre Gusmão..."
9,Q11674,David Paterson,male,Q30,[David Alexander Paterson],[Democratic Party],"[[Governor of New York, [+2008-03-17T00:00:00Z...",,,[],"[David Alexander Paterson, David Paterson]"


In [5]:
df[:10]

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,phase,mentions,mentions_qids,urls
0,2015-10-25-000242,"' It is not now, nor has it ever been, the gol...",Bernie Sanders,Q359442,2015-10-25 14:12:35,1,E,[Bill Clinton],[Q1124],[examiner.com]
1,2015-08-07-005048,All we ╒ re asking for here is a discussion an...,John Boehner,Q11702,2015-08-07 12:52:52,1,E,[Barack Obama],[Q76],[liveblog.irishtimes.com]
2,2015-10-01-005722,An email included in the latest tranche of Cli...,Hillary Clinton,Q6294,2015-10-01 14:56:48,2,E,[Bill Clinton],[Q1124],"[feeds.foxnews.com, www.foxnews.com]"
3,2015-11-17-006368,"and in fact, Secretary of State Kerry was earl...",Phil Bryant,Q887898,2015-11-17 20:03:05,1,E,[John Kerry],[Q22316],[hottytoddy.com]
4,2015-02-14-014011,I have fought Obamacare from Day One and will ...,John Cornyn,Q719568,2015-02-14 21:01:51,2,E,[Barack Obama],[Q76],"[www.politico.com, politico.com]"
5,2015-09-18-007210,are just flat-out untrue. There is no video sh...,Carly Fiorina,Q256380,2015-09-18 20:55:00,2,E,[Carly Fiorina],[Q256380],"[feeds.people.com, www.people.com]"
6,2015-01-13-006143,as would Jeb Bush.,Susan Collins,Q22279,2015-01-13 10:49:04,1,E,[George W. Bush],[Q207],[nationaljournal.com]
7,2015-09-20-019882,I never called Jeb Bush and I never asked him ...,Donald Trump,Q22686,2015-09-20 14:01:28,2,E,[George W. Bush],[Q207],"[tpmdc.talkingpointsmemo.com, talkingpointsmem..."
8,2015-10-07-012099,Big Bush Park will undergo significant renovat...,Melinda Katz,Q13562433,2015-10-07 14:50:46,1,E,[George W. Bush],[Q207],[www.qgazette.com]
9,2015-10-21-045916,"I think Paul Ryan would make a great speaker,",John Boehner,Q11702,2015-10-21 14:57:37,6,E,[Paul Ryan],[Q203966],"[news.yahoo.com, dailycaller.com, www.post-gaz..."


In [9]:
pattern_US

'George Walker Bush|Bush Jr.|Dubya|Bush 43|President George W. Bush|George Bush|President Bush|Bush|Bush, George W.|George W. Bush|Willard Mitt Romney|Pierre Delecto|Mitt Romney|Stephen Gerald Breyer|Stephen Breyer|David Alexander Paterson|David Paterson|James Warren "Jim" DeMint|James Warren DeMint|Jim DeMint|Olympia J. Snowe|Olympia Bouchles|Olympia Jean Snowe|Olympia Snowe|James Earl Carter Jr.|James E. Carter|James Carter|James Earl Carter|39th President of the United States|James E. Carter Jr.|Jimmy Carter|Todd Christopher Young|Senator Todd Young|Senator Young|Sen. Todd Young|Sen. Young|Todd Young|Michael Gerard Grimm|Michael Grimm|Naphtali Bennett|Naftali Bennett|Clement Leroy Otter|Clement Otter|Clement Leroy "Butch" Otter|Governor Butch Otter|Butch Otter|Tammy Suzanne Green Baldwin|Tammy Baldwin|James Danforth Quayle|James Danforth "Dan" Quayle|J. Danforth Quayle|Dan Quayle|Susan Elizabeth Rice|Susan Rice|Elvin Nimrod|John Richard Kasich|John R. Kasich|John Richard Kasich Jr.|