# Setup

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
from database_creation.database import Database

# Annotation task pipeline

## Creation of the queries (create_queries)

### Parameters

In [3]:
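# Parameters of the database. Their values also appear in the names of the
# files saved later (..._size10k_shuffle_articles1_queries1_seed0...).
max_size = 10000
shuffle = True
min_articles = 1
min_queries = 1
random_seed = 0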

database = Database(max_size=max_size, shuffle=shuffle,
                    min_articles=min_articles, min_queries=min_queries,
                    random_seed=random_seed)
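
The files saved later in this notebook all carry the suffix `size10k_shuffle_articles1_queries1_seed0`, which mirrors the parameters above. The following is a minimal sketch of how such a suffix could be built from them. The helper `parameter_suffix` is hypothetical and not part of `database_creation`; the behaviour for `shuffle=False` or for sizes that are not multiples of 1000 is an assumption, since the logs only show one configuration.

```python
def parameter_suffix(max_size, shuffle, min_articles, min_queries, random_seed):
    """Hypothetical helper: rebuild the suffix seen in the saved file names."""
    # Assumption: round thousands are abbreviated with a trailing 'k' (10000 -> '10k').
    if max_size >= 1000 and max_size % 1000 == 0:
        size = f"{max_size // 1000}k"
    else:
        size = str(max_size)

    parts = [f"size{size}"]
    # Assumption: the 'shuffle' token is simply omitted when shuffle is False.
    if shuffle:
        parts.append("shuffle")
    parts.append(f"articles{min_articles}")
    parts.append(f"queries{min_queries}")
    parts.append(f"seed{random_seed}")
    return "_".join(parts)


suffix = parameter_suffix(max_size=10000, shuffle=True,
                          min_articles=1, min_queries=1, random_seed=0)
print(suffix)
```

With the values used in this notebook, the sketch reproduces the suffix that appears in the logged paths.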

### Preprocessing the database

In [4]:
database.preprocess_database(debug=True)

Preprocessing the database...
Computing the database' article...
Initial length of articles: 0
Debugging Written in results/2006-07/debug/articles.txt...
Final length of articles: 10000
Done (elapsed time: 1s).

Cleaning the database's articles...
Initial length of articles: 10000
Criterion: Check if an article's content file exists.
Final length of articles: 4031
Done (elapsed time: 14s).

Computing the articles' metadata...
   article 500/4031...
   article 1000/4031...
   article 1500/4031...
   article 2000/4031...
   article 2500/4031...
   article 3000/4031...
   article 3500/4031...
   article 4000/4031...
Debugging Written in results/2006-07/debug/metadata.txt...
Done (elapsed time: 10s).

Computing the database' entities...
Initial length of entities: 0
      European Union corresponds to both location and org, ignoring the later...
   article 500/4031...
      Several entities have the same name (New York City (location); Anna Sui (NYC Store) (org); Kuczynski, Alex (person); 

### Preprocessing the articles

In [5]:
database.process_articles(debug=True)

Preprocessing the articles...
Computing the articles' annotations...
   article 500/2745...
   article 1000/2745...
   article 1500/2745...
   article 2000/2745...
   article 2500/2745...
Debugging Written in results/2006-07/debug/annotations.txt...
Done (elapsed time: 297s).

Computing the contexts...
   tuple 1000/34053...
   tuple 2000/34053...
   tuple 3000/34053...
   tuple 4000/34053...
   tuple 5000/34053...
   tuple 6000/34053...
   tuple 7000/34053...
   tuple 8000/34053...
   tuple 9000/34053...
   tuple 10000/34053...
   tuple 11000/34053...
   tuple 12000/34053...
   tuple 13000/34053...
   tuple 14000/34053...
   tuple 15000/34053...
   tuple 16000/34053...
   tuple 17000/34053...
   tuple 18000/34053...
   tuple 19000/34053...
   tuple 20000/34053...
   tuple 21000/34053...
   tuple 22000/34053...
   tuple 23000/34053...
   tuple 24000/34053...
   tuple 25000/34053...
   tuple 26000/34053...
   tuple 27000/34053...
   tuple 28000/34053...
   tuple 29000/34053...
   tuple 

### Processing the Wikipedia information

In [6]:
database.process_wikipedia(load=True, debug=True)

Processing the wikipedia information...
Attribute wikipedia loaded from results/2006-07/wikipedia/wikipedia_size10k_shuffle_articles1_queries1_seed0.pkl.
Computing the Wikipedia information...
Initial found entries: 1245/not found: 331
   entity 100/1570...
   entity 200/1570...
   entity 300/1570...
   entity 400/1570...
   entity 500/1570...
   entity 600/1570...
   entity 700/1570...
   entity 800/1570...
   entity 900/1570...
   entity 1000/1570...
   entity 1100/1570...
   entity 1200/1570...
   entity 1300/1570...
   entity 1400/1570...
   entity 1500/1570...
Final found entries: 1245/not found: 331
Debugging Written in results/2006-07/debug/wikipedia.txt...
Done (elapsed time: 0s).

Attribute wikipedia saved at results/2006-07/wikipedia/wikipedia_size10k_shuffle_articles1_queries1_seed0.pkl.
Done (elapsed time: 0s).



### Processing the queries

In [7]:
database.process_queries(load=False, check_changes=False, debug=True)

Processing the aggregation queries...
Computing the Queries...
Initial length of queries: 0
   tuple 1000/1955...
Debugging Written in results/2006-07/debug/queries.txt...
Final length of queries: 6228
Done (elapsed time: 3s).

Attribute queries saved at results/2006-07/queries/queries_size10k_shuffle_articles1_queries1_seed0.pkl.
Attribute queries saved at results/2006-07/queries/queries_short_size10k_shuffle_articles1_queries1_seed0.csv
Done (elapsed time: 3s).



## Gathering the Wikipedia files together

In [8]:
database.combine_pkl()

Current wikipedia information: 1245 found/331 not_found...
Global file updated: 1245 found/331 not_found.

Object loaded from results/2006-07/wikipedia/wikipedia_global.pkl
File wikipedia_global: 1245 found/331 not_found...
Global file updated: 1245 found/331 not_found.

Object saved at results/2006-07/wikipedia/wikipedia_global.pkl.
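
The log above reports a count of found and not-found entries for both the current file and the global file. The following is a rough sketch of what such a merge could look like. It is not the implementation of `combine_pkl`: the dictionary layout (`found` as a dict keyed by entity name, `not_found` as a set of names) and the helper `merge_wikipedia` are assumptions made for illustration only.

```python
def merge_wikipedia(global_info, current_info):
    """Hypothetical merge of two {'found': dict, 'not_found': set} structures."""
    # Entries from the current run override the global ones on conflict.
    found = {**global_info["found"], **current_info["found"]}
    # An entity found in either source should no longer count as not found.
    not_found = (global_info["not_found"] | current_info["not_found"]) - set(found)
    return {"found": found, "not_found": not_found}


# Toy data with placeholder names, not taken from the real database.
global_info = {"found": {"A": "summary of A"}, "not_found": {"B"}}
current_info = {"found": {"B": "summary of B", "C": "summary of C"}, "not_found": {"D"}}

merged = merge_wikipedia(global_info, current_info)
print(len(merged["found"]), "found /", len(merged["not_found"]), "not_found")
```

Under these assumptions, merging a file with itself leaves the counts unchanged, which is consistent with the identical counts (1245 found, 331 not found) logged before and after the update.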


# Inspecting the results

In [9]:
database.process_queries(load=True, check_changes=True, debug=True)

Processing the aggregation queries...
Attribute queries loaded from results/2006-07/queries/queries_size10k_shuffle_articles1_queries1_seed0.pkl.
Done (elapsed time: 0s).



In [13]:
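for query_id, query in database.queries.items():
    # Alternative filter: list every query that mentions both Russia and Europe.
    # if "Russia" in query.entities_names and "Europe" in query.entities_names:
    #     print(query.entities_names)
    # Print the context of each query about exactly this entity tuple.
    if query.entities_names == 'Europe, Iran and Russia':
        print(query_id, query.context)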

1762937_13255_2_4 [...] The diplomats said the administration was also resisting the idea of protecting European companies from punishment by the United States for violating its sanctions if they did business with <div class="popup" onclick="pop(0)"><color1>Iran</color1><span class="popuptext" id="0">Iran</span></div>, as called for in the European proposal. The disagreements on these issues are clouding the possibility of a deal with Iran on its nuclear program, even as tensions have increased over Tehran 's refusal to change its behavior, the diplomats said. In addition, they said, <div class="popup" onclick="pop(1)"><color0>Europe</color0><span class="popuptext" id="1">Europe</span></div>, the United States and <div class="popup" onclick="pop(4)"><color2>Russia</color2><span class="popuptext" id="4">Russia</span></div> have not agreed on the need to impose sanctions on <div class="popup" onclick="pop(2)"><color1>Iran</color1><span class="popuptext" id="2">Iran</span></div> if <div c
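
The contexts printed by this kind of cell wrap each entity mention in popup markup: a `<colorN>` tag around the surface form, followed by a `popuptext` span holding the entity name. The following is a minimal sketch, assuming only the markup visible in these outputs, of how to recover the plain text and the (mention, entity) pairs for easier reading. `strip_markup` and `extract_mentions` are hypothetical helpers, not part of `database_creation`.

```python
import re

# Hover text of a popup, e.g. <span class="popuptext" id="1">New York City</span>
POPUP_SPAN = re.compile(r'<span class="popuptext"[^>]*>.*?</span>', re.DOTALL)
# Any remaining tag, e.g. <div ...>, </div>, <color0>, </color0>
ANY_TAG = re.compile(r'<[^>]+>')
# A mention followed by the entity it refers to
MENTION = re.compile(
    r'<color\d+>(.*?)</color\d+><span class="popuptext" id="\d+">(.*?)</span>',
    re.DOTALL,
)


def strip_markup(context):
    """Return the context as plain text, keeping only the surface forms."""
    without_popups = POPUP_SPAN.sub('', context)
    return ANY_TAG.sub('', without_popups)


def extract_mentions(context):
    """Return a list of (surface form, entity name) pairs."""
    return MENTION.findall(context)


sample = ('because <div class="popup" onclick="pop(1)"><color0>it</color0>'
          '<span class="popuptext" id="1">New York City</span></div> does not')

plain = strip_markup(sample)
mentions = extract_mentions(sample)
print(plain)
print(mentions)
```

Regular expressions are enough here because the markup is generated and regular; for arbitrary HTML a real parser would be the safer choice.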

In [14]:
ids = ['1762937_13255_4_4']
for id_ in ids:
    # Print every field of the query, one field per line.
    for title, item in database.queries[id_].to_dict().items():
        print(title, ': ', item)

id_ :  1762937_13255_4_4
entities :  <th><a href="https://en.wikipedia.org/wiki/Europe" target="_blank">Europe</a></th><th><a href="https://en.wikipedia.org/wiki/Iran" target="_blank">Iran</a></th><th><a href="https://en.wikipedia.org/wiki/Russia" target="_blank">Russia</a></th>
entities_names :  Europe, Iran and Russia
info :  <td>Europe is a continent located entirely in the Northern Hemisphere and mostly in the Eastern Hemisphere. It is bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west, Asia to the east, and the Mediterranean Sea to the south. It comprises the westernmost part of Eurasia.</td><td>Iran, also called Persia, and officially the Islamic Republic of Iran, is a country in Western Asia. With 82 million inhabitants, Iran is the world's 18th most populous country. Its territory spans 1,648,195 km2 , making it the second largest country in the Middle East and the 17th largest in the world. Iran is bordered to the northwest by Armenia and the Republic of

In [6]:
for tuple_ in database.tuples:
    print(str(tuple_))

Baghdad and Iraq
New York City and New York State
Iran and Iraq
Israel and Lebanon
Gaza Strip and Israel
Connecticut and New Jersey
Iraq and United States
China and Russia
Louisiana and New Orleans
Security Council and United Nations
New Jersey and New York City
Iran and Russia
I. Lewis Libby Jr. and Valerie Plame Wilson
Darfur and Sudan
Afghanistan and Iraq
Lebanon and Syria
Joseph I. Lieberman and Ned Lamont
Pittsburgh Steelers and Seattle Seahawks
Israel and West Bank
George W. Bush and Nuri Kamal Al- Maliki
France and Germany
Chicago Bears and Indianapolis Colts
Al Qaeda and Taliban
Mogadishu and Somalia
Mississippi and New Orleans
Louisiana and Mississippi
Iran and Middle East
Iran and Israel
George W. Bush and Saddam Hussein
George W. Bush and Henry M. Paulson Jr.
George E. Pataki and Michael R. Bloomberg
Fatah and Hamas
Egypt and Israel
China and North Korea
Madrid and Spain
Louisiana, Mississippi and New Orleans
Iraq and Jordan
George W. Bush and Valerie Plame Wilson
Europe and
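
In the listing above, each tuple is displayed with its entity names in what appears to be alphabetical order, separated by commas with `and` before the last name. The following is a small sketch that reproduces this display format. `join_names` is a hypothetical helper, and the alphabetical sort is inferred from the output rather than confirmed by the library.

```python
def join_names(names):
    """Format entity names as 'A and B' or 'A, B and C' (assumed display format)."""
    # Assumption: names are sorted alphabetically, as the listing suggests.
    names = sorted(names)
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


pair = join_names(["Iraq", "Baghdad"])
triple = join_names(["New Orleans", "Louisiana", "Mississippi"])
print(pair)
print(triple)
```

Such a string is also what the filtering cells compare against `query.entities_names`, so a helper of this kind avoids typing the expected name by hand.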

In [10]:
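for query_id, query in database.queries.items():
    # Alternative filter: list every query that mentions Moscow.
    # if "Moscow" in query.entities_names:
    #     print(query.entities_names)
    # Print the context of each query about exactly this entity tuple.
    if query.entities_names == 'New York City and New York State':
        print(query_id, query.context)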

1801660_2_20_20 [...] For <div class="popup" onclick="pop(1)"><color1>New York State</color1><span class="popuptext" id="1">New York State</span></div> to succeed and become vibrant again, education and economic development have to become top priorities,'' said Abraham M. Lackman, who has been both a top budget aide to the Republican-controlled State Senate and a <div class="popup" onclick="pop(0)"><color0>New York City</color0><span class="popuptext" id="0">New York City</span></div> budget director.'' [...]
1815087_2_16_16 [...] <div class="popup" onclick="pop(0)"><color0>New York City</color0><span class="popuptext" id="0">New York City</span></div>, meanwhile, will not be directly affected by the proposed rule change, because <div class="popup" onclick="pop(1)"><color0>it</color0><span class="popuptext" id="1">New York City</span></div> does not use the same method of tracking <div class="popup" onclick="pop(2)"><color0>its</color0><span class="popuptext" id="2">New York City</span

In [15]:
ids = ['1801660_2_20_20']
for id_ in ids:
    # Print every field of the query, one field per line.
    for title, item in database.queries[id_].to_dict().items():
        print(title, ': ', item)

id_ :  1801660_2_20_20
entities :  <th><a href="https://en.wikipedia.org/wiki/New_York_City" target="_blank">New York City</a></th><th><a href="https://en.wikipedia.org/wiki/New_York_State_Assembly" target="_blank">New York State</a></th>
entities_names :  New York City and New York State
info :  <td>The City of New York, usually called either New York City or simply New York , is the most populous city in the United States. With an estimated 2018 population of 8,398,748 distributed over a land area of about 302.6 square miles , New York is also the most densely populated major city in the United States.</td><td>The New York State Assembly is the lower house of the New York State Legislature, the New York State Senate being the upper house. There are 150 seats in the Assembly. Assemblymembers serve two-year terms without term limits.The Assembly convenes at the State Capitol in Albany. As of January 2019, 106 of the 150 seats in the Assembly were held by Democrats.</td>
title :  <th co
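
In the output above, the entity `New York State` is linked to the Wikipedia page `New_York_State_Assembly`, so the displayed name and the linked page differ. The following is a minimal sketch, assuming the `entities` field always has the `<th><a href=...>name</a></th>` shape shown here, of how to list such differences for manual review. `entity_links` and `mismatched_links` are hypothetical helpers, and a strict string comparison will also flag harmless differences such as redirects or disambiguation suffixes.

```python
import re
from urllib.parse import unquote

LINK = re.compile(r'<a href="([^"]+)"[^>]*>(.*?)</a>')


def entity_links(entities_html):
    """Return (displayed name, page title) pairs from the entities field."""
    pairs = []
    for url, name in LINK.findall(entities_html):
        # The page title is the last URL segment, with underscores as spaces.
        title = unquote(url.rsplit("/", 1)[-1]).replace("_", " ")
        pairs.append((name, title))
    return pairs


def mismatched_links(entities_html):
    """Keep only the pairs whose displayed name differs from the page title."""
    return [(name, title) for name, title in entity_links(entities_html)
            if name != title]


entities = (
    '<th><a href="https://en.wikipedia.org/wiki/New_York_City" '
    'target="_blank">New York City</a></th>'
    '<th><a href="https://en.wikipedia.org/wiki/New_York_State_Assembly" '
    'target="_blank">New York State</a></th>'
)

links = entity_links(entities)
suspicious = mismatched_links(entities)
print(suspicious)
```

Running a check like this over all queries would give a short list of entities whose Wikipedia match deserves a second look before annotation.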