# Exercise Description

Take a sample text (a paragraph) from any source and extract the following:
1. Noun phrases (use textacy's `noun_chunks` extraction).
2. Main verbs.
3. Words with a `*mod` dependency relation, found via dependency parsing.
4. A ranking of all the words identified in the first three steps, using several text-ranking algorithms.

## Preparation

In [5]:
from __future__ import unicode_literals

import spacy
import textacy
import textacy.datasets


print 'textacy version:', textacy.__version__
print 'spacy version:', spacy.__version__


textacy version: 0.4.1
spacy version: 1.9.0


In [39]:
# Helper: consume an iterator and return a list of normalized strings
def iterator2list(iterator):
    """Return the normalized string of every item yielded by `iterator`."""
    list_iterable = [textacy.spacy_utils.normalized_str(x) for x in iterator]
    print 'Length of iterables: ', len(list_iterable)
    print 'Iterables are: ', list_iterable
    return list_iterable

## Load text

In [36]:
nlp=spacy.load('en')

doc = nlp("Ancient Greek art saw the veneration of the \
animal form and the development of equivalent skills to show musculature, \
poise, beauty and anatomically correct proportions. Ancient Roman art depicted gods \
as idealized humans, shown with characteristic distinguishing features (e.g. Jupiter's thunderbolt). \
In Byzantine and Gothic art of the Middle Ages, the dominance of the church insisted \
on the expression of biblical and not material truths. Eastern art has generally worked in a style \
akin to Western medieval art, namely a concentration on surface patterning and local colour (meaning \
the plain colour of an object, such as basic red for a red robe, rather than the modulations of that \
colour brought about by light, shade and reflection). A characteristic of this style is that the local \
colour is often defined by an outline (a contemporary equivalent is the cartoon). This is evident in, \
for example, the art of India, Tibet and Japan. Religious Islamic art forbids iconography, and expresses \
religious ideas through geometry instead. The physical and rational certainties depicted by the 19th-century \
Enlightenment were shattered not only by new discoveries of relativity by Einstein and of unseen psychology \
by Freud, but also by unprecedented technological development. Paradoxically the expressions of new technologies \
were greatly influenced by the ancient tribal arts of Africa and Oceania, through the works of Paul Gauguin and \
the Post-Impressionists,Pablo Picasso and the Cubists, as well as the Futurists and others.")


## Information Extraction

### Find noun chunks

In [40]:
nouns = textacy.extract.noun_chunks(doc, drop_determiners=True, min_freq=1)

list_noun = iterator2list(nouns)

Length of iterables:  61
Iterables are:  [u'ancient greek art', u'veneration', u'animal form', u'development', u'equivalent skill', u'musculature , poise , beauty and anatomically correct proportion', u'art', u'idealized human', u'characteristic distinguishing feature', u'byzantine and gothic art', u'Middle Ages', u'dominance', u'church', u'expression', u'biblical and not material truth', u'art', u'style', u'western medieval art', u'surface patterning', u'local colour', u'plain colour', u'object', u'red robe', u'modulation', u'colour', u'light', u'shade', u'reflection', u'characteristic', u'style', u'local colour', u'outline', u'contemporary equivalent', u'cartoon', u'example', u'India', u'Tibet', u'Japan', u'religious islamic art', u'iconography', u'religious idea', u'geometry', u'physical and rational certainty', u'19th - century Enlightenment', u'new discovery', u'relativity', u'Einstein', u'unseen psychology', u'Freud', u'unprecedented technological development', u'expression', u'n

unicode

### Find main verbs

In [32]:
verbs=textacy.spacy_utils.get_main_verbs_of_sent(doc)
print verbs

[saw, show, depicted, shown, insisted, worked, meaning, brought, is, defined, is, is, forbids, expresses, depicted, shattered, influenced]


### Find subjects and objects of a verb

In [11]:
verb_obj=[]
verb_sub=[]

for verb in verbs:
    verb_obj.append(textacy.spacy_utils.get_objects_of_verb(verb))
    verb_sub.append(textacy.spacy_utils.get_subjects_of_verb(verb))

print verb_obj
print verb_sub

[[veneration], [proportions], [], [], [], [], [colour], [], [], [], [cartoon], [], [iconography], [ideas], [], [], []]
[[art], [], [art], [], [dominance, church], [Eastern, art], [], [modulations], [characteristic], [colour], [equivalent], [This], [art], [], [], [certainties], [expressions]]
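
The two lists are aligned with `verbs` by position, so they can be zipped into (subject, verb, object) triples. The following is a minimal sketch that uses plain strings copied from the first three entries of the output above in place of spaCy tokens; `build_triples` is a hypothetical helper, not part of textacy.

```python
from itertools import product


def build_triples(verbs, subjects, objects):
    """Pair every subject with every object of the same verb.

    A missing subject or object is represented by None so that the verb
    still appears in the result.
    """
    triples = []
    for verb, subs, objs in zip(verbs, subjects, objects):
        for sub, obj in product(subs or [None], objs or [None]):
            triples.append((sub, verb, obj))
    return triples


# Stand-in data: the first three verbs with their subjects and objects as printed above.
sample_verbs = ['saw', 'show', 'depicted']
sample_subjects = [['art'], [], ['art']]
sample_objects = [['veneration'], ['proportions'], []]

triples = build_triples(sample_verbs, sample_subjects, sample_objects)
```

In the notebook, the same helper could presumably be called as `build_triples(verbs, verb_sub, verb_obj)`.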


### Extract named entities

In [13]:
# Returns a lazy generator; wrap the call in list() to see the entities themselves
textacy.extract.named_entities(doc, include_types=None, exclude_types=None, drop_determiners=True, min_freq=1)

<generator object named_entities at 0x14e4154b0>
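
The output above is a generator object because the extraction is lazy: nothing is computed until the result is iterated, and it can be iterated only once. Here is a small sketch of that behaviour with a stand-in generator (`fake_entities` is hypothetical and only mimics a lazy extractor; its values are not the actual NER output).

```python
def fake_entities():
    """Stand-in for a lazy extractor; yields a few names from the sample text."""
    for name in ['Jupiter', 'India', 'Tibet', 'Japan']:
        yield name


entities = fake_entities()
first_pass = list(entities)   # consumes the generator
second_pass = list(entities)  # already exhausted, so this is empty
```

In the notebook, the equivalent would be wrapping the call as `list(textacy.extract.named_entities(doc))` and keeping the resulting list for later reuse.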

### Dependency parsing

In [14]:
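# For each noun chunk: its text, its root token, the root's dependency label, and the root's head
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)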

(u'Ancient Greek art', u'art', u'nsubj', u'saw')
(u'the veneration', u'veneration', u'dobj', u'saw')
(u'the animal form', u'form', u'pobj', u'of')
(u'the development', u'development', u'conj', u'form')
(u'equivalent skills', u'skills', u'pobj', u'of')
(u'musculature, poise, beauty and anatomically correct proportions', u'proportions', u'dobj', u'show')
(u'art', u'art', u'nsubj', u'depicted')
(u'idealized humans', u'humans', u'pobj', u'as')
(u'characteristic distinguishing features', u'features', u'pobj', u'with')
(u'Byzantine and Gothic art', u'art', u'pobj', u'In')
(u'the Middle Ages', u'Ages', u'pobj', u'of')
(u'the dominance', u'dominance', u'nsubj', u'insisted')
(u'the church', u'church', u'nsubj', u'insisted')
(u'the expression', u'expression', u'pobj', u'on')
(u'biblical and not material truths', u'truths', u'pobj', u'of')
(u'art', u'art', u'nsubj', u'worked')
(u'a style', u'style', u'pobj', u'in')
(u'Western medieval art', u'art', u'pobj', u'to')
(u'surface patterning', u'patter
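
The exercise asks for words whose dependency relation matches `*mod` (for example `amod`, `advmod` or `nmod`). Below is a minimal sketch of that filter. It only assumes objects that expose `text` and `dep_` attributes, as spaCy tokens do, and it uses a hypothetical `FakeToken` stand-in with hand-assigned labels so that it runs without a language model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; the labels below are assigned by hand for illustration.
FakeToken = namedtuple('FakeToken', ['text', 'dep_'])


def modifiers(tokens):
    """Return (text, dep_) pairs for tokens whose dependency label ends in 'mod'."""
    return [(tok.text, tok.dep_) for tok in tokens if tok.dep_.endswith('mod')]


sample = [
    FakeToken('Ancient', 'amod'),
    FakeToken('Greek', 'amod'),
    FakeToken('art', 'nsubj'),
    FakeToken('saw', 'ROOT'),
    FakeToken('the', 'det'),
    FakeToken('veneration', 'dobj'),
]

mods = modifiers(sample)
```

With the parsed document, `modifiers(doc)` should apply the same filter to the real tokens, since iterating over `doc` yields tokens.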

In [34]:
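# Build a semantic network from the extracted terms.
# The verbs are spaCy tokens, so convert them to normalized strings to match the noun chunks.
terms = list_noun + [textacy.spacy_utils.normalized_str(verb) for verb in verbs]
textacy.network.terms_to_semantic_network(terms, normalize='lemma', window_width=10, edge_weighting='cooc_freq')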

<networkx.classes.graph.Graph at 0x1115a1450>
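
For intuition, a semantic network of this kind links terms that occur near each other and weights each edge by how often the pair co-occurs. The sketch below is a simplified stand-in in plain Python, not textacy's implementation; `cooccurrence_edges` and its windowing details are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(terms, window_width=2):
    """Count how often two distinct terms fall inside the same sliding window."""
    edges = Counter()
    for start in range(max(1, len(terms) - window_width + 1)):
        window = terms[start:start + window_width]
        for a, b in combinations(window, 2):
            if a != b:
                # Sort the pair so that (a, b) and (b, a) share one undirected edge.
                edges[tuple(sorted((a, b)))] += 1
    return edges


sample_terms = ['art', 'style', 'colour', 'style', 'art']
edges = cooccurrence_edges(sample_terms, window_width=2)
```

With a binary weighting, every edge that exists would simply get weight 1 instead of its count.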

## Ranking

In [13]:
# Extract key terms with PageRank over a binary-weighted co-occurrence network
keyterm = textacy.keyterms.key_terms_from_semantic_network(
    doc, normalize=u'lemma', window_width=2,
    edge_weighting=u'binary', ranking_algo=u'pagerank',
    join_key_words=False, n_keyterms=10)

print keyterm

[(u'art', 0.058296831753183724), (u'colour', 0.02445158388509562), (u'ancient', 0.01902886436016287), (u'equivalent', 0.016555927532284038), (u'new', 0.016473484332319183), (u'characteristic', 0.016360840169038025), (u'development', 0.016245045136444307), (u'expression', 0.016211642653313454), (u'religious', 0.016177026687414515), (u'style', 0.015231850751394854)]
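
PageRank scores a term by the scores of the terms linked to it, which is why a well-connected word such as `art` ends up on top. Here is a compact power-iteration sketch on a tiny undirected, unweighted graph (matching the `binary` edge weighting). The toy graph, the damping factor of 0.85 and the iteration count are illustrative assumptions, not values taken from textacy.

```python
def pagerank(neighbors, damping=0.85, iterations=50):
    """Power-iteration PageRank for an undirected graph given as {node: set of neighbors}.

    Assumes every node has at least one neighbor.
    """
    nodes = list(neighbors)
    n = len(nodes)
    scores = dict((node, 1.0 / n) for node in nodes)
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[other] / len(neighbors[other])
                           for other in neighbors[node])
            updated[node] = (1.0 - damping) / n + damping * incoming
        scores = updated
    return scores


# Toy graph: 'art' is linked to every other term.
toy_graph = {
    'art': {'ancient', 'greek', 'style'},
    'ancient': {'art'},
    'greek': {'art'},
    'style': {'art'},
}
scores = pagerank(toy_graph)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The hub term `art` receives the highest score, and the three terms linked only to it tie with each other.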


### TextRank

In [18]:
textacy.keyterms.textrank(doc, normalize=u'lemma', n_keyterms=20)

[(u'art', 0.058296831753183724),
 (u'colour', 0.02445158388509562),
 (u'ancient', 0.01902886436016287),
 (u'equivalent', 0.016555927532284038),
 (u'new', 0.016473484332319183),
 (u'characteristic', 0.016360840169038025),
 (u'development', 0.016245045136444307),
 (u'expression', 0.016211642653313454),
 (u'religious', 0.016177026687414515),
 (u'style', 0.015231850751394854),
 (u'red', 0.012967383068639017),
 (u'cubists', 0.012311564779367429),
 (u'local', 0.012041353203052158),
 (u'picasso', 0.011559073929023183),
 (u'pablo', 0.011128498094486215),
 (u'impressionists', 0.010873919562131388),
 (u'post', 0.01069949095203565),
 (u'gauguin', 0.010549575538327924),
 (u'paul', 0.01036546578578247),
 (u'unseen', 0.01027490013707197)]

### SingleRank

In [4]:
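# Requires `doc` from the "Load text" cell; the NameError below comes from running this cell before `doc` was defined
textacy.keyterms.singlerank(doc, normalize=u'lemma', n_keyterms=20)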

NameError: name 'doc' is not defined

### SGRank

In [20]:
textacy.keyterms.sgrank(doc, ngrams=(1, 2, 3, 4, 5, 6), normalize=u'lemma', window_width=1500, n_keyterms=20, idf=None)

[(u'ancient tribal art', 0.12700538499040714),
 (u'western medieval art', 0.11797023598028958),
 (u'religious islamic art', 0.1158523332318917),
 (u'ancient greek art', 0.10322531320900902),
 (u'ancient roman art', 0.10105708602201213),
 (u'local colour', 0.0265430195514136),
 (u'red robe', 0.0147519202596656),
 (u'plain colour', 0.014678230577967506),
 (u'basic red', 0.014127170757887447),
 (u'development', 0.013569826580984523),
 (u'middle ages', 0.012932281003413731),
 (u'e.g. jupiter', 0.012528473845217079),
 (u'style akin', 0.012519476891149916),
 (u'eastern art', 0.01251777766313121),
 (u'gothic art', 0.012272311479141271),
 (u'equivalent skill', 0.0122039410232492),
 (u'surface patterning', 0.011914668218461025),
 (u'distinguishing feature', 0.011783632189874396),
 (u'material truth', 0.011775677088686886),
 (u'correct proportion', 0.01152077873580935)]
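
Step 4 of the exercise asks for a ranking of the words found in steps 1 to 3 rather than of every word in the document. One way to get there, sketched here under assumptions, is to keep only the ranked terms that also appear among the extracted candidates. `rank_candidates` is a hypothetical helper; the scores are the first five TextRank values printed earlier (rounded), and the candidates are a small hand-picked subset of the noun chunks and main verbs.

```python
def rank_candidates(scored_terms, candidates):
    """Keep (term, score) pairs whose term occurs as a word in some candidate, best first."""
    candidate_words = set()
    for candidate in candidates:
        candidate_words.update(candidate.lower().split())
    kept = [(term, score) for term, score in scored_terms if term in candidate_words]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# First five TextRank results from the output earlier in this notebook (scores rounded).
textrank_scores = [
    ('art', 0.0583),
    ('colour', 0.0245),
    ('ancient', 0.0190),
    ('equivalent', 0.0166),
    ('new', 0.0165),
]
# Hand-picked subset of the extracted noun chunks and main verbs.
candidates = ['ancient greek art', 'local colour', 'depicted']

ranked = rank_candidates(textrank_scores, candidates)
```

The same filter could be applied to the output of each ranking function in turn, which makes the rankings comparable over one shared set of candidate words.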