# Setup

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import sys, os
sys.path.append(os.path.abspath('../'))
del sys, os

In [3]:
import time

In [4]:
import database_creation.nyt as nyt

# Extracting the data

## Parameters

In [5]:
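file_limit = 10000                      # maximum number of files to load
n_entities = 2                          # entities of each filtered type an article needs in order to be kept
filter_entity_types = ['locations']     # entity types used to filter the articles
compute_entity_types = ['locations']    # entity types whose entities are computed
display_count = 3333                    # print a progress line every display_count items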

### Initialize the articles and apply the entity filter

In [6]:
t0 = time.time()
file_paths = nyt.get_file_paths(file_limit)
articles = nyt.initialize_articles(file_paths, display_count)
articles = nyt.filter_articles(articles, filter_entity_types, n_entities, display_count)
print("Elapsed time: {}s".format(round(time.time()-t0)+1))

Computing paths of 10000 files

Initializing the articles...
File 3333/10000...
File 6666/10000...
File 9999/10000...
Articles initialized (10000 articles).

Filtering articles (2 locations)...
Article 3333/10000...
Article 6666/10000...
Article 9999/10000...
9156 articles filtered out.
Articles filtered (844 articles).

Elapsed time: 4s
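
The filtering step can be pictured with the minimal sketch below. It assumes an article is kept when it has at least `n_entities` entities of every filtered type. The dictionary layout and the `filter_by_entity_count` helper are hypothetical stand-ins, not the actual `nyt.filter_articles` implementation.

```python
def filter_by_entity_count(articles, entity_types, n_entities):
    """Keep the articles with at least n_entities entities of every given type."""
    kept = []
    for article in articles:
        # An article passes only if every requested type has enough entities.
        if all(len(article.get(t, [])) >= n_entities for t in entity_types):
            kept.append(article)
    return kept


# Toy articles (hypothetical layout: entity type -> list of entity names).
toy_articles = [
    {"id": 1, "locations": ["Pakistan", "India", "Afghanistan"]},
    {"id": 2, "locations": ["Brazil"]},
    {"id": 3, "locations": []},
    {"id": 4, "locations": ["Peru", "Lake Titicaca"]},
]

kept = filter_by_entity_count(toy_articles, ["locations"], 2)
kept_ids = [article["id"] for article in kept]
print("{} articles filtered out.".format(len(toy_articles) - len(kept)))
print("Articles filtered ({} articles).".format(len(kept)))
```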


### Compute the metadata of the articles

In [7]:
t0 = time.time()
articles = nyt.compute_entities(articles, compute_entity_types, display_count)
articles = nyt.compute_text_type(articles, 'content_annotated', display_count)
print("Elapsed time: {}s".format(round(time.time()-t0)+1))

Computing entities(locations)...
Entities computed (844 articles).

Computing articles' data (content annotated)...
27 requested data don't exist.
Articles' data computed (817 articles).

Elapsed time: 3s
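
Every cell repeats the same `t0 = time.time()` / `print(...)` pattern, and `round(...) + 1` reports a whole number of seconds one above the rounded measurement. A possible alternative, sketched here with a hypothetical `elapsed` helper that is not part of `database_creation.nyt`, is a small context manager built on `time.perf_counter`:

```python
import time
from contextlib import contextmanager


@contextmanager
def elapsed(label="Elapsed time"):
    """Time the enclosed block and print the duration in seconds."""
    timing = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # Store the duration so that the caller can reuse it after the block.
        timing["seconds"] = time.perf_counter() - start
        print("{}: {:.1f}s".format(label, timing["seconds"]))


with elapsed() as timing:
    total = sum(range(100000))  # placeholder for the nyt.compute_* calls
```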


### Compute the NPs of the articles

In [8]:
t0 = time.time()
articles = nyt.compute_nps(articles, 'content_annotated', display_count)
articles = nyt.compute_filtered_nps(articles, ['to_exclude', 'is_entity'], 'nps', 'nps', display_count)
print("Elapsed time: {}s".format(round(time.time()-t0)+1))

Computing NPs (from content annotated)...
NPs computed (817 articles).

Computing filtered NPs (filters: to_exclude, is_entity; from 'nps', to 'nps')...
Filtered NPs computed (817 articles treated: 817 filtered in, 0 filtered out).

Elapsed time: 59s


## Experiment 1

### Parameters

In [9]:
n_best = 100
similarity_threshold = 0.2
n_synsets = 2
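
The filter name `similar_wn_path_pluralnouns` suggests WordNet path similarity, which is defined as 1 / (1 + d), where d is the length of the shortest path between two synsets in the hypernym hierarchy. The sketch below applies that definition to a toy taxonomy to show what `similarity_threshold = 0.2` lets through. The taxonomy and the helper names are made up for illustration, and it is an assumption that the `nyt` filter uses this measure.

```python
def ancestors(node, parent):
    """Map each ancestor of node (node included) to its distance from node."""
    distances, distance = {}, 0
    while node is not None:
        distances[node] = distance
        node = parent.get(node)
        distance += 1
    return distances


def path_similarity(a, b, parent):
    """Return 1 / (1 + shortest path length), or None if a and b are unrelated."""
    dist_a, dist_b = ancestors(a, parent), ancestors(b, parent)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    shortest = min(dist_a[c] + dist_b[c] for c in common)
    return 1.0 / (1.0 + shortest)


# Toy hypernym hierarchy (child -> parent), not real WordNet data.
PARENT = {
    "entity": None,
    "location": "entity",
    "region": "location",
    "city": "region",
    "country": "region",
    "person": "entity",
    "expert": "person",
}

similarity_threshold = 0.2
for word in ["city", "country", "expert"]:
    similarity = path_similarity("city", word, PARENT)
    status = "kept" if similarity >= similarity_threshold else "filtered out"
    print("city / {}: {:.2f} ({})".format(word, similarity, status))
```

With these toy links, identical concepts score 1.0, siblings score 1/3, and concepts that only meet at the root fall below the threshold.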

### Experiment

In [10]:
t0 = time.time()
articles = nyt.compute_filtered_nps(articles, ['similar_wn_path_pluralnouns'], 'nps', 'similar_pluralnouns',
                                similarity_threshold, n_synsets)
articles = nyt.compute_filtered_nps(articles, ['is_numeric_parse', 'similar_wn_path_pluralnouns'], 'nps',
                                'similar_numeric_pluralnouns', similarity_threshold, n_synsets)
articles = nyt.compute_most_similar_nps(articles, 2, n_best, 'similar_pluralnouns', 'best_similar_pluralnouns')
articles = nyt.compute_most_similar_nps(articles, 2, n_best, 'similar_numeric_pluralnouns',
                                    'best_similar_numeric_pluralnouns')
print("Elapsed time: {}s".format(round(time.time()-t0)+1))

Computing filtered NPs (filters: similar_wn_path_pluralnouns; from 'nps', to 'similar_pluralnouns')...
Filtered NPs computed (817 articles treated: 530 filtered in, 287 filtered out).

Computing filtered NPs (filters: is_numeric_parse, similar_wn_path_pluralnouns; from 'nps', to 'similar_numeric_pluralnouns')...
Filtered NPs computed (817 articles treated: 183 filtered in, 634 filtered out).

Computing 100 most similar NPs (from similar_pluralnouns, in best_similar_pluralnouns)...
Most similar NPs computed (threshold: 1.0):
530 articles treated: 63 filtered in, 467 filtered out).
2664 NPs treated: 118 filtered in, 2546 filtered out).

Computing 100 most similar NPs (from similar_numeric_pluralnouns, in best_similar_numeric_pluralnouns)...
Most similar NPs computed (threshold: 0.33):
183 articles treated: 99 filtered in, 84 filtered out).
298 NPs treated: 150 filtered in, 148 filtered out).

Elapsed time: 111s
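
The log reports more NPs kept than `n_best` (118 kept for `n_best = 100`). That is consistent with keeping every NP whose value is at least the value of the `n_best`-th ranked NP, ties included. The sketch below illustrates this reading on a few rows copied from the displays in this notebook. The `keep_most_similar` helper is hypothetical, and interpreting the `2` passed to `nyt.compute_most_similar_nps` as the index of the ranked column is an assumption.

```python
def keep_most_similar(rows, rank_index, n_best):
    """Keep the rows whose ranked value reaches the n_best-th highest value."""
    if not rows:
        return None, []
    ranked = sorted((row[rank_index] for row in rows), reverse=True)
    # The threshold is the value at rank n_best (or the lowest value available).
    threshold = ranked[min(n_best, len(ranked)) - 1]
    kept = [row for row in rows if row[rank_index] >= threshold]
    return threshold, kept


# Columns: NP, POS tags, similarity, word, entity, score.
rows = [
    ("several cities", "JJ NNS", 1.0, "cities", "New York City", 0.57),
    ("the canals", "DT NNS", 1.0, "canals", "Delaware and Raritan Canal", 0.6),
    ("most countries", "JJS NNS", 0.33, "countries", "Pakistan", 0.31),
    ("12 buildings", "CD NNS", 0.33, "buildings", "Clinton Hill (NYC)", 0.22),
    ("children", "NNS", 0.2, "children", "Great Britain", 0.17),
]

# Rank by similarity (column 2): ties at the threshold are all kept.
threshold, kept = keep_most_similar(rows, 2, 3)
print("threshold: {}, {} NPs filtered in".format(threshold, len(kept)))

# Rank by score (column 5) instead.
score_threshold, score_kept = keep_most_similar(rows, 5, 2)
print("threshold: {}, {} NPs filtered in".format(score_threshold, len(score_kept)))
```

Because ties at the threshold are kept, the number of NPs filtered in can exceed `n_best` whenever many NPs share the same similarity value.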


In [11]:
articles = nyt.compute_score(articles, 'similar_pluralnouns', n_synsets)
articles = nyt.compute_score(articles, 'similar_numeric_pluralnouns', n_synsets)

In [12]:
nyt.display_articles(articles, ['entities', 'similar_pluralnouns'], limit=30)

Displaying the articles (keys: entities, similar_pluralnouns; limit: 30; random: False):


id: 1165043
entities: Great Britain; Massachusetts
similar_pluralnouns: 
scholars; NNS; 0.25; scholars; Great Britain; 0.21
Experts; NNS; 0.25; experts; Great Britain; 0.21
most experts; JJS NNS; 0.25; experts; Great Britain; 0.21
children; NNS; 0.2; children; Great Britain; 0.17
experts; NNS; 0.25; experts; Great Britain; 0.21
kids; NNS; 0.2; kids; Great Britain; 0.17
All children; DT NNS; 0.2; children; Great Britain; 0.17
talents; NNS; 0.2; talents; Great Britain; 0.17
all children; DT NNS; 0.2; children; Great Britain; 0.17
a few kids '; DT JJ NNS POS; 0.2; kids; Great Britain; 0.24

id: 1165103
entities: Pakistan; India; Afghanistan; Kandahar (Afghanistan); Afghanistan
similar_pluralnouns: 
garlands; NNS; 0.33; garlands; Kandahar (Afghanistan); 0.18
most countries; JJS NNS; 0.33; countries; Pakistan; 0.31

id: 1165105
entities: Seattle (Wash); New York City; Canada
similar_pluralnouns: 
seve

In [13]:
nyt.display_articles(articles, ['entities', 'best_similar_pluralnouns'], limit=30)

Displaying the articles (keys: entities, best_similar_pluralnouns; limit: 30; random: False):


id: 1165105
entities: Seattle (Wash); New York City; Canada
best_similar_pluralnouns: 
several cities; JJ NNS; 1.0; cities; New York City; 0.57

id: 1165413
entities: Delaware River; Delaware and Raritan Canal
best_similar_pluralnouns: 
the canals; DT NNS; 1.0; canals; Delaware and Raritan Canal; 0.6

id: 1165575
entities: New York City; Los Angeles (Calif)
best_similar_pluralnouns: 
other major cities; JJ JJ NNS; 1.0; cities; New York City; 0.54

id: 1165607
entities: Times Square and 42d Street (NYC); New York City
best_similar_pluralnouns: 
50 times; CD NNS; 1.0; times; Times Square and 42d Street (NYC); 0.54

id: 1166820
entities: New York City; New York City
best_similar_pluralnouns: 
the few other cities; DT JJ JJ NNS; 1.0; cities; New York City; 1.08

id: 1167149
entities: Columbus Circle (NYC); New York City
best_similar_pluralnouns: 
Traffic circles; NN NNS; 1.0; circles; Columbus C

In [14]:
nyt.display_articles(articles, ['entities', 'similar_numeric_pluralnouns'], limit=30)

Displaying the articles (keys: entities, similar_numeric_pluralnouns; limit: 30; random: False):


id: 1165136
entities: Europe; Far East, South and Southeast Asia and Pacific Areas
similar_numeric_pluralnouns: 
100 empty seats; CD JJ NNS; 0.33; seats; Far East, South and Southeast Asia and Pacific Areas; 0.39

id: 1165137
entities: Russia; Chechnya (Russia); Grozny (Chechnya)
similar_numeric_pluralnouns: 
a few hundred yards; DT JJ CD NNS; 0.2; yards; Chechnya (Russia); 0.21

id: 1165208
entities: Sao Paulo (Brazil); Brazil
similar_numeric_pluralnouns: 
30 world cities; CD NN NNS; 0.2; cities; Brazil; 0.15

id: 1165287
entities: Peru; Lake Titicaca; Andes Mountains; Puno (Peru)
similar_numeric_pluralnouns: 
74 indigenous groups; CD JJ NNS; 0.2; groups; Lake Titicaca; 0.11

id: 1165296
entities: New York City; Clinton Hill (NYC)
similar_numeric_pluralnouns: 
12 buildings; CD NNS; 0.33; buildings; Clinton Hill (NYC); 0.22
four landmarked buildings; CD JJ NNS; 0.33; buildings; Clinton Hi

In [15]:
nyt.display_articles(articles, ['entities', 'best_similar_numeric_pluralnouns'], limit=30)

Displaying the articles (keys: entities, best_similar_numeric_pluralnouns; limit: 30; random: False):


id: 1165136
entities: Europe; Far East, South and Southeast Asia and Pacific Areas
best_similar_numeric_pluralnouns: 
100 empty seats; CD JJ NNS; 0.33; seats; Far East, South and Southeast Asia and Pacific Areas; 0.39

id: 1165296
entities: New York City; Clinton Hill (NYC)
best_similar_numeric_pluralnouns: 
12 buildings; CD NNS; 0.33; buildings; Clinton Hill (NYC); 0.22
four landmarked buildings; CD JJ NNS; 0.33; buildings; Clinton Hill (NYC); 0.22

id: 1165363
entities: Union Square (NYC); New York City
best_similar_numeric_pluralnouns: 
three counts; CD NNS; 0.33; counts; Union Square (NYC); 0.21

id: 1165398
entities: Palm Beach (Fla); Bahama Islands
best_similar_numeric_pluralnouns: 
 more than 50 miles; QP JJR IN CD NNS; 0.33; miles; Palm Beach (Fla); 0.32

id: 1165434
entities: Shelter Island (NY); Sagaponack (NY)
best_similar_numeric_pluralnouns: 
3 main buildings; CD JJ NNS;

In [16]:
articles = nyt.compute_most_similar_nps(articles, 5, 20, 'best_similar_pluralnouns', 'best_score_similar_pluralnouns')
articles = nyt.compute_most_similar_nps(articles, 5, 20, 'best_similar_numeric_pluralnouns',
                                    'best_score_similar_numeric_pluralnouns')

Computing 20 most similar NPs (from best_similar_pluralnouns, in best_score_similar_pluralnouns)...
Most similar NPs computed (threshold: 0.78):
63 articles treated: 14 filtered in, 49 filtered out).
118 NPs treated: 21 filtered in, 97 filtered out).

Computing 20 most similar NPs (from best_similar_numeric_pluralnouns, in best_score_similar_numeric_pluralnouns)...
Most similar NPs computed (threshold: 0.56):
99 articles treated: 15 filtered in, 84 filtered out).
150 NPs treated: 21 filtered in, 129 filtered out).



In [18]:
nyt.display_articles(articles, ['entities', 'best_score_similar_pluralnouns'], limit=30)

Displaying the articles (keys: entities, best_score_similar_pluralnouns; limit: 30; random: False):


id: 1166820
entities: New York City; New York City
best_score_similar_pluralnouns: 
the few other cities; DT JJ JJ NNS; 1.0; cities; New York City; 1.08

id: 1167149
entities: Columbus Circle (NYC); New York City
best_score_similar_pluralnouns: 
American cities; JJ NNS; 1.0; cities; New York City; 0.8

id: 1167252
entities: New York City; Canal Street Park (Nyc)
best_score_similar_pluralnouns: 
city parks; NN NNS; 1.0; parks; Canal Street Park (Nyc); 1.17

id: 1168051
entities: Arizona; Grand Canyon-Parashant National Monument; California; Agua Fria (Arizona); Pinnacles National Monument (Calif)
best_score_similar_pluralnouns: 
national monuments; JJ NNS; 1.0; monuments; Grand Canyon-Parashant National Monument; 0.91

id: 1168569
entities: Suffolk County (NY); Nassau County (NY); Long Island (NY)
best_score_similar_pluralnouns: 
counties , schools , towns , taxpayers and utility custom

In [17]:
nyt.display_articles(articles, ['entities', 'best_score_similar_numeric_pluralnouns'], limit=30)

Displaying the articles (keys: entities, best_score_similar_numeric_pluralnouns; limit: 30; random: False):


id: 1168569
entities: Suffolk County (NY); Nassau County (NY); Long Island (NY)
best_score_similar_numeric_pluralnouns: 
the two counties; DT CD NNS; 1.0; counties; Suffolk County (NY); 0.72

id: 1170390
entities: United States; Florida
best_score_similar_numeric_pluralnouns: 
 some two dozen states; QP DT CD NN NNS; 1.0; states; United States; 0.75

id: 1170798
entities: New York State; Florida; New York State
best_score_similar_numeric_pluralnouns: 
35 other states; CD JJ NNS; 1.0; states; New York State; 0.78
FOURTEEN states; CD NNS; 1.0; states; New York State; 0.78
four states; CD NNS; 1.0; states; New York State; 0.78

id: 1171094
entities: Rome (Italy); New York City; Moscow (Russia)
best_score_similar_numeric_pluralnouns: 
nine major Italian cities; CD JJ JJ NNS; 1.0; cities; New York City; 0.7

id: 1171529
entities: New York City; New York State
best_score_similar_nume

## Experiment 2

### Parameters

In [13]:
similarities = [
    'wn_path', 'wn_lch', 'wn_wup', 'wn_res', 'wn_jcn', 'wn_lin',
    # 'we_spacy'
]
n_best = 100
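
The names in `similarities` match the WordNet measures that NLTK exposes on synsets (`path_similarity`, `lch_similarity`, `wup_similarity`, `res_similarity`, `jcn_similarity`, `lin_similarity`). It is an assumption that the `nyt` filters wrap them. The measures are not on the same scale, so their thresholds are not directly comparable. The sketch below compares the two simplest ones as functions of the shortest path length d: path similarity is 1 / (1 + d), and Leacock-Chodorow is -log((d + 1) / (2 D)), where D is the depth of the taxonomy. The value D = 19 is an assumption chosen for illustration.

```python
import math


def path_sim(distance):
    """Path similarity for a shortest path of `distance` edges."""
    return 1.0 / (1.0 + distance)


def lch_sim(distance, taxonomy_depth):
    """Leacock-Chodorow similarity for a shortest path of `distance` edges."""
    return -math.log((distance + 1.0) / (2.0 * taxonomy_depth))


TAXONOMY_DEPTH = 19  # assumed depth, for illustration only

for distance in range(5):
    print("d={}: path={:.2f}, lch={:.2f}".format(
        distance, path_sim(distance), lch_sim(distance, TAXONOMY_DEPTH)))
```

Both measures decrease as d grows, so they rank NPs in the same order. Under the assumed depth, a path of two edges gives 0.33 for path similarity and 2.54 for Leacock-Chodorow.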

### Experiment

In [15]:
t0 = time.time()
for similarity in similarities:
    np_filter = 'similar_' + similarity + '_pluralnouns'

    articles = nyt.compute_filtered_nps(articles, [np_filter], 'nps', np_filter)
    articles = nyt.compute_most_similar_nps(articles, n_best, np_filter, np_filter)
print("Elapsed time: {}s".format(round(time.time()-t0)+1))

Computing filtered NPs (filters: similar_wn_path_pluralnouns; from 'nps', to 'similar_wn_path_pluralnouns')...
Filtered NPs computed (112 articles treated: 47 filtered in, 65 filtered out).

Computing 100 most similar NPs (from similar_wn_path_pluralnouns, in similar_wn_path_pluralnouns)...
Most similar NPs computed (threshold: 0.33):
47 articles treated: 47 filtered in, 0 filtered out).
146 NPs treated: 146 filtered in, 0 filtered out).

Computing filtered NPs (filters: similar_wn_lch_pluralnouns; from 'nps', to 'similar_wn_lch_pluralnouns')...
Filtered NPs computed (112 articles treated: 111 filtered in, 1 filtered out).

Computing 100 most similar NPs (from similar_wn_lch_pluralnouns, in similar_wn_lch_pluralnouns)...
Most similar NPs computed (threshold: 2.54):
111 articles treated: 47 filtered in, 64 filtered out).
5540 NPs treated: 146 filtered in, 5394 filtered out).

Computing filtered NPs (filters: similar_wn_wup_pluralnouns; from 'nps', to 'similar_wn_wup_pluralnouns')...


KeyboardInterrupt: 