# Setup

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
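import sys, os
# Make the parent directory importable so the database_creation package can be found.
sys.path.append(os.path.abspath('../'))
# Remove the helper modules from the notebook namespace once the path is set.
del sys, os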

In [3]:
from database_creation.database import Database
from database_creation.article import Article
from database_creation.sentence import Sentence
from database_creation.coreference import Coreference
from database_creation.np import Np
from database_creation.word import Word
from database_creation.utils import BaseClass

# Processing the database

## Preprocessing

### Re-initializing the display parameters

In [4]:
Database.set_parameters(to_print=[], print_attribute=True, random_print=False, limit_print=20)
Article.set_parameters(to_print=[], print_attribute=True)
Coreference.set_parameters(to_print=[], print_attribute=True)
Sentence.set_parameters(to_print=[], print_attribute=True)
Np.set_parameters(to_print=[], print_attribute=True)
Word.set_parameters(to_print=[], print_attribute=True)
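
The calls above set display options per class rather than per instance. A minimal sketch of how such class-level settings might drive `__str__` follows. The `Printable` and `Item` classes are hypothetical stand-ins, not the project's `BaseClass`, and the rule that an empty `to_print` shows every attribute is an assumption inferred from the printed output in this notebook.

```python
class Printable:
    """Hypothetical stand-in for a base class with class-level display settings."""

    to_print = []           # attribute names to show; an empty list means "show all"
    print_attribute = True  # whether to prefix each value with its attribute name

    @classmethod
    def set_parameters(cls, to_print=None, print_attribute=None):
        # Settings are stored on the class the method is called on,
        # so each subclass keeps its own display configuration.
        if to_print is not None:
            cls.to_print = list(to_print)
        if print_attribute is not None:
            cls.print_attribute = print_attribute

    def __str__(self):
        # Fall back to every instance attribute when no names are selected.
        names = self.to_print or list(vars(self))
        lines = []
        for name in names:
            value = getattr(self, name, None)
            if value is None:
                continue  # skip attributes that have not been computed yet
            lines.append(f"{name}: {value}" if self.print_attribute else str(value))
        return "\n".join(lines)


class Item(Printable):
    """Hypothetical class with two attributes, used only for illustration."""

    def __init__(self, title, year):
        self.title = title
        self.year = year


item = Item("Example", 2000)
Item.set_parameters(to_print=[], print_attribute=True)
print(item)  # shows both attributes with their names
Item.set_parameters(to_print=["title"], print_attribute=False)
print(item)  # shows only the title value
```

Because the sketch treats omitted arguments as "leave unchanged", a later call that passes only `to_print` keeps the earlier `print_attribute` choice.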

### Initializing the database

In [5]:
database = Database(max_size=10000, root='../databases/nyt_jingyun')

In [6]:
print(database)

max_size: 10000

root: ../databases/nyt_jingyun

year: 2000

size: 10000

articles: 

article 1165027: 
original_path: ../databases/nyt_jingyun/data/2000/01/01/1165027.xml
annotated_path: ../databases/nyt_jingyun/content_annotated/2000content_annotated/1165027.txt.xml

article 1165028: 
original_path: ../databases/nyt_jingyun/data/2000/01/01/1165028.xml
annotated_path: ../databases/nyt_jingyun/content_annotated/2000content_annotated/1165028.txt.xml

article 1165029: 
original_path: ../databases/nyt_jingyun/data/2000/01/01/1165029.xml
annotated_path: ../databases/nyt_jingyun/content_annotated/2000content_annotated/1165029.txt.xml

article 1165030: 
original_path: ../databases/nyt_jingyun/data/2000/01/01/1165030.xml
annotated_path: ../databases/nyt_jingyun/content_annotated/2000content_annotated/1165030.txt.xml

article 1165031: 
original_path: ../databases/nyt_jingyun/data/2000/01/01/1165031.xml
annotated_path: ../databases/nyt_jingyun/content_annotated/2000content_annotated/1165031.txt

In [7]:
Database.set_parameters(to_print=['articles'], print_attribute=False)

### Preprocessing the database

In [8]:
database.preprocess_tuples(limit=100, display=True)


Preprocessing the articles (most frequent tuples)...

Cleaning the database...
Initial size: 10000
Final size: 6514
Done (elapsed time: 0s).

File 1000/6514...
File 2000/6514...
File 3000/6514...
File 4000/6514...
File 5000/6514...
File 6000/6514...

Cleaning the database...
Initial size: 6514
Final size: 4353
Done (elapsed time: 0s).


Computing the most frequent entities...
File 1000/4353...
File 2000/4353...
File 3000/4353...
File 4000/4353...

Most frequent entity tuples:
George Bush | John Mccain (108)
Al Gore | Bill Bradley (84)
St Louis Rams | Tennessee Titans (53)
New York City | New York State (45)
Chechnya | Russia (44)
 (37)
George Bush | Steve Forbes (36)
Hillary Clinton | Rudolph Giuliani (33)
Israel | Syria (31)
Bill Bradley | John Mccain (31)
Al Gore | George Bush (30)
Al Gore | John Mccain (29)
Bill Bradley | George Bush (29)
America Online | Time Warner Inc (29)
John Mccain | Steve Forbes (27)
Al Gore | Bill Bradley | George Bush (27)
Al Gore | Bill Clinton (26)
Frank
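
The ranking above lists pairs and triples of entities with the number of articles in which they co-occur. A rough sketch of that kind of count is given below. The function name, the input shape (one list of entity names per article) and the maximum tuple size are assumptions for illustration, not the actual `preprocess_tuples` implementation.

```python
from collections import Counter
from itertools import combinations


def most_frequent_tuples(articles, max_tuple_size=3, limit=100):
    """Count, per article, every entity tuple of size 2..max_tuple_size.

    `articles` is assumed to be an iterable of lists of entity names.
    """
    counts = Counter()
    for entities in articles:
        # Deduplicate and sort so each tuple has one canonical ordering
        # and is counted at most once per article.
        names = sorted(set(entities))
        for size in range(2, max_tuple_size + 1):
            counts.update(combinations(names, size))
    return counts.most_common(limit)


# Toy data, not taken from the corpus.
articles = [
    ["Al Gore", "Bill Bradley", "George Bush"],
    ["Al Gore", "Bill Bradley"],
    ["George Bush", "John Mccain"],
]
for names, count in most_frequent_tuples(articles, limit=3):
    print(" | ".join(names), f"({count})")
```

Sorting the names before combining them gives the alphabetical order within each tuple that the printed ranking also shows.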

In [9]:
print(database)


original_path: ../databases/nyt_jingyun/data/2000/01/01/1165082.xml
annotated_path: ../databases/nyt_jingyun/content_annotated/2000content_annotated/1165082.txt.xml
title: Gore's Latest Attack on Bradley Tells Only Part of Story
entities: United States | John Broder | Bill Bradley | Al Gore | Bill Clinton
entities_locations: United States
entities_persons: Al Gore | Bill Bradley | Bill Clinton | John Broder
sentences: 
    parse: (ROOT (S (NP (NNP Vice) (NNP President) (NNP Gore)) (VP (VBZ has) (VP (VBN denounced) (NP (JJ former) (NNP Senator) (NNP Bill) (NNP Bradley)) (SBAR (IN for) (S (S (VP (VBG taking) (NP (NP (JJ large) (NNS amounts)) (PP (IN of) (NP (NN campaign) (NN cash)))) (PP (IN from) (NP (NN drug) (NNS companies))))) (CC and) (PRN (, ,) (S (NP (NNP Mr.) (NNP Gore)) (VP (VBZ says))) (, ,)) (S (VP (VBG doing) (NP (PRP$ their) (NN bidding)) (PP (IN during) (NP (NP (PRP$ his) (CD 18) (NNS years)) (PP (IN in) (NP (DT the) (NNP Senate))))))))))) (. .))) 
    text:  Vice Presiden

In [10]:
Article.set_parameters(to_print=['title', 'entities', 'coreferences', 'tuple_contexts'])
Coreference.set_parameters(to_print=['representative', 'entity'], print_attribute=True)
Sentence.set_parameters(to_print=['text'], print_attribute=False)
Np.set_parameters(to_print=['words'], print_attribute=False)
Word.set_parameters(to_print=['text'], print_attribute=False)

In [11]:
print(database)


title: Gore's Latest Attack on Bradley Tells Only Part of Story
entities: United States | John Broder | Bill Bradley | Al Gore | Bill Clinton
coreferences: 
    representative: Vice President Gore | 1 | 1 | 4
    entity: Al Gore 
    representative: former Senator Bill Bradley | 1 | 6 | 10
    entity: Bill Bradley 
    representative: drug companies | 1 | 18 | 20 
    representative: his 18 years in the Senate | 1 | 30 | 36 
    representative: the Senate | 1 | 34 | 36 
    representative: New Jersey | 2 | 34 | 36 
    representative: campaign literature | 2 | 8 | 10 
    representative: the vice president | 2 | 11 | 14 
    representative: the largest manufacturing-based industry in the state | 40 | 7 | 14 
    representative: Washington against predatory drug manufacturers | 2 | 52 | 57 
    representative: Clinton-Gore | 3 | 19 | 20 
    representative: the drug industry | 3 | 13 | 16 
    representative: the Clinton-Gore administration | 3 | 18 | 21 
    representative: pharmaceut

In [12]:
Article.set_parameters(to_print=['title', 'entities', 'tuple_contexts'])

## Processing

In [13]:
database.process_tuples()


Processing the articles (compute frequent entity tuples contexts)...

Cleaning the database...
Initial size: 665
Final size: 658
Done (elapsed time: 0s).

Done (elapsed time: 11s).



In [14]:
print(database)


title: Gore's Latest Attack on Bradley Tells Only Part of Story
entities: United States | John Broder | Bill Bradley | Al Gore | Bill Clinton
tuple_contexts: :
sentences: 
Al Gore Bill Bradley:
sentences: 1 | 2 | 4 | 10 | 14 | 16 | 18 | 22 | 24 | 32
sample_1:  Vice President Gore has denounced former Senator Bill Bradley for taking large amounts of campaign cash from drug companies and, Mr. Gore says, doing their bidding during his 18 years in the Senate.  In speeches, debates, advertisements and campaign literature, the vice president has accused Mr. Bradley of repeatedly siding with the pharmaceutical industry-- many of whose leading companies are based in New Jersey, his home state-- while Mr. Gore portrays himself as the consumer's advocate in Washington against predatory drug manufacturers.
sample_2:  Vice President Gore has denounced former Senator Bill Bradley for taking large amounts of campaign cash from drug companies and, Mr. Gore says, doing their bidding during his 18 yea

In [28]:
database.process_tuple(0)

Entity tuples: George Bush | John Mccain



0 samples out of 108 articles


In [29]:
database.process_tuple(1)

Entity tuples: Al Gore | Bill Bradley


sentences: 2 | 3
sample_2:  After several days of obfuscation, Al Gore has confessed, sort of.  When Bill Bradley cited Mr. Gore's record of supporting anti-choice legislation as a Congressman from Tennessee, Mr. Gore contended that he had always supported a woman's constitutional right to an abortion under Roe v. Wade.  Then, Mr. Bradley produced a 1987 letter to a constituent signed by Mr. Gore that said abortion was'' arguably the taking of a human life.''
sample_3:  When Bill Bradley cited Mr. Gore's record of supporting anti-choice legislation as a Congressman from Tennessee, Mr. Gore contended that he had always supported a woman's constitutional right to an abortion under Roe v. Wade.  Then, Mr. Bradley produced a 1987 letter to a constituent signed by Mr. Gore that said abortion was'' arguably the taking of a human life.''  Finally, the Vice President demurred, saying,'' I would not use that phrasing today.''


sentences: 4 | 19
sample_4:
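
In the samples above, `sample_2` spans sentences 1 to 3 and `sample_3` spans sentences 2 to 4, which suggests that each sample is the selected sentence plus one neighbour on each side. A minimal sketch of such a window follows. The function name, the `radius` parameter and the two-space separator are assumptions; how the sentence indices themselves are chosen is not shown in this notebook and is not modelled here.

```python
def context_windows(sentences, indices, radius=1):
    """Return {index: text} where each text joins the sentences around `index`.

    `sentences` is a list of strings and `indices` are 1-based positions,
    matching the numbering used in the printed samples.
    """
    samples = {}
    for idx in indices:
        # Clip the window at the start and end of the article.
        start = max(1, idx - radius)
        end = min(len(sentences), idx + radius)
        samples[idx] = "  ".join(sentences[start - 1:end])
    return samples


# Toy sentences, not taken from the corpus.
sentences = ["First.", "Second.", "Third.", "Fourth."]
for idx, text in context_windows(sentences, [1, 3]).items():
    print(f"sample_{idx}: {text}")
```

The clipping at the article boundary is consistent with `sample_1` in the earlier output, which contains only sentences 1 and 2.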

In [30]:
database.process_tuple(2)

Entity tuples: St Louis Rams | Tennessee Titans


sentences: 2 | 19 | 22
sample_2:  If the four teams in Sunday's National Football League conference championship games are strangers to a broad viewership, blame it on the networks' lack of clairvoyance.  The St. Louis Rams, Tampa Bay Buccaneers, Jacksonville Jaguars and Tennessee Titans were orphans, given little scheduling respect.  The Jaguars were the only team with a winning record last season and apparently the only one deserving of regard in 1999.''
sample_19:  St. Louis averaged only 20 percent nationwide coverage.  The Rams went as high as 48 percent for an October game against the Titans and as low as 2 percent against the Eagles.  Dallas was Fox's most popular team, sending its games to 53 percent of the nation on average.
sample_22:  The four final teams are an interesting mixture of league lineage.  The Rams and Titans are modern-day carpetbaggers-LRB- moving from Anaheim and Houston, respectively-RRB- and the Buccaneers an

In [31]:
database.process_tuple(3)

Entity tuples: New York City | New York State


sentences: 1 | 3 | 4
sample_1:  Even as New York State imposes rigorous new academic standards, it is not providing New York City's public schools with enough money to ensure that students meet the higher standards, a new study has found.  The study, by the Council of the Great City Schools, comes as the state is defending itself against a lawsuit that accuses it of shortchanging city schools.
sample_3:  The study, by the Council of the Great City Schools, comes as the state is defending itself against a lawsuit that accuses it of shortchanging city schools.  The council, a coalition of urban school districts based in Washington, found that New York City has a disproportionate share of the state's poor and immigrant students, and that its achievement levels are far lower, on average, than those in the state's other school districts.  In New York City, the Board of Education spent$ 8,171 per pupil in 1997-98, the study found, compared with

In [33]:
database.process_tuple(4)

Entity tuples: Chechnya | Russia


sentences: 7 | 16
sample_7:  After bombings last fall in Moscow and other cities for which Russian authorities blamed Chechen separatists, Russian security forces set out to hunt down suspected terrorists, a reasonable response.  The Kremlin at one point said it would work with moderate Chechen leaders to strengthen the uneasy peace that followed the 1994-1996 conflict between the rebels and Russia, a peace that gave Chechnya near-autonomy.  But it is clear that preventing Chechen terrorism was never the Kremlin's primary purpose, and these moderate approaches were soon overtaken by a war strategy.
sample_16:  Mr. Putin, for his part, has exploited public support for the war to make himself the leading contender in presidential elections next month.  Few Russian politicians have summoned the courage to question the carnage in Chechnya and the damage it has done to democratic values in Russia.  Secretary of State Madeleine Albright was right to challen

In [34]:
database.process_tuple(5)

Entity tuples: 


sentences: 



0 samples out of 37 articles
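
The tuple at rank 5 has no entity names and yields no samples for any of its 37 articles. One possible guard is sketched below. It assumes that blank entity names are the cause, which the output does not confirm. Both helper names are hypothetical and not part of `database_creation`.

```python
def clean_entity_names(names):
    """Strip whitespace, drop blank names and remove duplicates, keeping order."""
    cleaned = []
    for name in names:
        stripped = name.strip()
        if stripped and stripped not in cleaned:
            cleaned.append(stripped)
    return cleaned


def is_valid_tuple(names, min_size=2):
    """A tuple is kept only if it still has at least `min_size` real names."""
    return len(clean_entity_names(names)) >= min_size


# Toy candidates: the second one mimics the empty tuple seen above.
candidates = [["Chechnya", "Russia"], [""], ["Al Gore", " "]]
kept = [names for names in candidates if is_valid_tuple(names)]
print(kept)
```

Applying such a filter before ranking would keep an empty tuple from occupying a slot among the most frequent ones.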


In [35]:
database.process_tuple(6)

Entity tuples: George Bush | Steve Forbes


sentences: 56 | 63 | 70
sample_56:  Stan Fitzgerald, 44, a systems analyst with the Principal Financial Group, said he planned to vote for Mr. Gore at a caucus, even though he thought the vice president might well lose to Mr. Bush in the general election.  On the face of things, Mr. Bush seems to have little to fear from Mr. Forbes here, and his main rival in New Hampshire, Senator John McCain of Arizona, is skipping Iowa-- a tactic that has often backfired in the past.  Mr. Forbes trails badly in all the polls.
sample_63:  If someone aborts a child, it's called a choice.''  Mr. Forbes is relying heavily on television; he outspent Mr. Bush last year by three to two But neither he nor the other Republican candidates-- Senator Orrin G. Hatch of Utah, Alan Keyes and Gary L. Bauer-- have been able to deny Mr. Bush the support of some of the major leaders of the state's powerful anti-abortion forces.  Ione Dilley and Paul Carbone, two influential 

In [36]:
database.process_tuple(7)

Entity tuples: Hillary Clinton | Rudolph Giuliani


sentences: 1 | 2 | 3 | 4 | 15 | 19 | 29 | 31 | 32 | 42 | 44
sample_1:  In a crash course in New York racial politics, Hillary Rodham Clinton observed Martin Luther King's Birthday in a flurry of celebrations from Brooklyn to Harlem yesterday, tying Dr. King's legacy to her own political ambitions while denouncing what she and some ministers asserted was the racial divide that had marked Rudolph W. Giuliani's years as mayor.  Mr. Giuliani, the Republican mayor and Mrs. Clinton's probable opponent in this year's race for United States Senate, kept a more limited schedule, skipping annual celebrations that he had attended in the past in favor of events organized by black supporters.
sample_2:  In a crash course in New York racial politics, Hillary Rodham Clinton observed Martin Luther King's Birthday in a flurry of celebrations from Brooklyn to Harlem yesterday, tying Dr. King's legacy to her own political ambitions while denouncing what