In [1]:
%%html
<style> 
table {display: block;} 
td {
  font-size: 20px
}
.rendered_html { font-size: 20px; }
*{ line-height: 200%; }
</style>
<style type="text/css" media="print"> body { -webkit-print-color-adjust: exact; } </style>

# Natural Language Processing and the Web WS 2022/23 - Practice Class - <span style="color:blue">Tutorial 3</span>

We have seen in the previous practice classes how to access text data and how to handle tokenization issues. In this tutorial, we cover the following topics:
## Contents
* <span style="color:blue"> Revision </span> - Lemmatization and POS tagging
* <span style="color:blue">Parsing and Chunking</span> text documents 
* Building a small ontology using <span style="color:blue">Hearst patterns</span> (<span style="color:red">the assignment can be done in groups!</span>)

## Lemmatization
A lemma is the canonical, <span style="color:blue">uninflected</span> or <span style="color:blue">dictionary form</span> of a word. For example, the lemma of <span style="color:blue">small</span><span style="color:red">est</span> is <span style="color:blue">small</span>, and the lemma of <span style="color:blue">eat</span><span style="color:red">ing</span> is <span style="color:blue">eat</span>. In many languages, the lemma for nouns is the <span style="color:blue">nominative singular</span> form, the lemma for adjectives is the <span style="color:blue">nominative singular</span> positive form, and the lemma for verbs is the <span style="color:blue">infinitive</span>. Given an inflected form, however, finding the lemma (a process called <span style="color:blue">lemmatization</span>) is not always easy. Words often undergo regular spelling changes when inflected. For example, in English, verbs and adjectives ending in -e often drop this letter when inflecting: <span style="color:blue">bake → baking</span>. Sometimes final consonants are doubled, as in (British) English <span style="color:blue">travel → travelling</span>.

An accurate algorithm for lemmatization must be aware of these sorts of inflectional rules and know how to undo them to arrive at the <span style="color:blue">base form</span> of the word. It must also know about completely irregular cases, such as <span style="color:blue">go → went</span>, <span style="color:blue">mouse → mice</span>, and <span style="color:blue">good → better</span>. Lemmatization is a difficult task for computers, and requires some basic understanding of the grammatical context and properties of the word. For example, the lemma of <span style="color:blue">dove</span> depends on whether the word is being used as a noun (as in the <span style="color:blue">bird</span>) or a verb (as in the past tense of <span style="color:blue">dive</span>).
![dove.jpg](dove.jpg) ![dive.png](dive.png)

However, lemmatization is an important task because, as with part-of-speech tagging, many NLP applications rely on lemmatized text.
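
To make the idea of "undoing" inflectional rules concrete, here is a minimal sketch of a toy lemmatizer. The exception table, the tiny vocabulary and the function name `toy_lemmatize` are assumptions made for illustration only; this is not how WordNet or spaCy work internally, and the sketch ignores grammatical context entirely (so it cannot resolve cases like *dove*).

```python
# Toy lemmatizer: a lookup table for irregular forms plus a few suffix rules.
# All names and rules here are illustrative assumptions, not a library API.

IRREGULAR = {"went": "go", "mice": "mouse", "better": "good"}

# A small vocabulary is needed to decide whether a candidate is a real word.
VOCAB = {"bake", "travel", "eat", "small", "go", "mouse", "good"}

def toy_lemmatize(word):
    word = word.lower()
    if word in IRREGULAR:                      # completely irregular cases
        return IRREGULAR[word]
    for suffix in ("ing", "est"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            candidates = [stem, stem + "e"]    # eating -> eat, baking -> bake
            if len(stem) > 1 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])   # travelling -> travel
            for candidate in candidates:
                if candidate in VOCAB:
                    return candidate
    return word                                # unknown words stay unchanged

for w in ["baking", "travelling", "eating", "smallest", "went", "mice", "better"]:
    print(w, "->", toy_lemmatize(w))
```

Even this toy version needs a vocabulary to choose between candidate stems, which hints at why real lemmatizers rely on dictionaries and POS information.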

Examples of lemmatization:
>```rocks : rock
corpora : corpus
better : good```

## NLTK Lemmatizer

#### optional  --> coloring outputs
```
COLOR = {
    'blue': '\033[94m',
    'default': '\033[39m',
    'grey': '\033[90m',
    'yellow': '\033[93m',
    'black': '\033[30m',
    'cyan': '\033[96m',
    'green': '\033[92m',
    'magenta': '\033[95m',
    'white': '\033[97m',
    'red': '\033[91m'
}
```

In [2]:
HR = '\033[91m'  # highlight in red
HD = '\x1b[0m'   # reset to the default style

In [3]:
import nltk
# Lemmatize using WordNet's built-in morphy function
# Returns the input unchanged if it cannot be found in WordNet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("rocks :"+HR, lemmatizer.lemmatize("rocks") + HD)
print("corpora :"+HR, lemmatizer.lemmatize("corpora") +HD)
# Pass the POS tag as context to the lemmatizer; pos="a" denotes an adjective
print("better :"+HR, lemmatizer.lemmatize("better", pos ="a") +HD)
# The same surface form can have different lemmas depending on its POS
print("drove as verb :"+HR, lemmatizer.lemmatize("drove", pos ="v") +HD)
print("drove as noun :"+HR, lemmatizer.lemmatize("drove", pos ="n") +HD)
# Lemmatizing a sentence
sentence = "The striped bats are hanging on their feet for best"
word_list = nltk.word_tokenize(sentence)
print("words:",word_list)
# Lemmatize the list of words and join; without a POS tag every word is treated as a noun
lemmatized_output = ',   '.join([lemmatizer.lemmatize(w) for w in word_list])
print("lemma: "+HR,lemmatized_output + HD)

rocks :[91m rock[0m
corpora :[91m corpus[0m
better :[91m good[0m
drove as verb :[91m drive[0m
drove as noun :[91m drove[0m
words: ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
lemma: [91m The,   striped,   bat,   are,   hanging,   on,   their,   foot,   for,   best[0m


## spaCy Lemmatizer

In [4]:
import spacy
# Load spaCy's 'en_core_web_sm' model; the parser and NER are not needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their, feet for best"
# Punctuation characters to strip before lemmatizing (raw string, so the backslash is literal)
punct = r"‘!”#$%&'()*+,-./:;?@[\]^_`{|}~’"
print("," not in punct)  # False: the comma is part of the punctuation set
# Remove punctuation character by character
finalSentence = "".join(char for char in sentence if char not in punct)
print(finalSentence)
# Run the pipeline on the cleaned sentence
doc = nlp(finalSentence)
# Extract the lemma for each token and join
print(" ".join(token.lemma_ for token in doc))

False
The striped bats are hanging on their feet for best
the stripe bat be hang on their foot for good


## TextBlob Lemmatizer

In [5]:
from textblob import TextBlob, Word
# Lemmatize a word, use the WordNet's morphy function
word = 'stripes'
w = Word(word)
print(word +" " + HR+w.lemmatize())

stripes [91mstripe


In [6]:
# Lemmatize a sentence
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
print(" ". join(["["+ w+"-->"+HR+w.lemmatize()+"]"+HD for w in sent.words]))

[The-->[91mThe][0m [striped-->[91mstriped][0m [bats-->[91mbat][0m [are-->[91mare][0m [hanging-->[91mhanging][0m [on-->[91mon][0m [their-->[91mtheir][0m [feet-->[91mfoot][0m [for-->[91mfor][0m [best-->[91mbest][0m


## Part-of-speech tagging with NLTK
Part-of-speech tagging (POS tagging) is the process of marking up the words in a text with their corresponding part of speech (e.g., <span style="color:blue">noun, verb, adjective</span>). For example, take the following sentence:


 <span style="color:blue">A dog had seen the cutest ferrets.</span> 
 
A tokenizer would split it into the following tokens:
![token.png](token.png)
A part-of-speech tagger could then assign labels, or tags, to the tokens according to their respective
parts of speech:
![pos.png](pos.png)
The <span style="color:blue">Penn Treebank tags</span> used here are as follows:
* <span style="color:blue">DT</span>: determiner
* <span style="color:blue">NN</span>: noun, singular or mass
* <span style="color:blue">NNS</span>: noun, plural
* <span style="color:blue">JJS</span>: adjective, superlative
* <span style="color:blue">VBD</span>: verb, past tense
* <span style="color:blue">VBN</span>: verb, past participle

The inventory from which these POS tags are drawn varies from <span style="color:blue">language to language</span>, and from <span style="color:blue">application to application</span>.

NLTK includes a part-of-speech tagger, which assigns a <span style="color:blue">tag</span> (also called <span style="color:blue">word class</span> or <span style="color:blue">lexical category</span>) to each token in a text. The default POS tagset for English is based on the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

NLTK also supports the [Universal POS tagset](https://universaldependencies.org/u/pos/).
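
The relation between the two tagsets can be pictured as a many-to-one lookup: several fine-grained Penn Treebank tags collapse into one coarse universal tag. The sketch below is a partial, hand-written table covering only tags that appear in this tutorial; `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical names, and in practice NLTK performs this mapping for you when you pass `tagset='universal'` to `pos_tag`.

```python
# Partial Penn Treebank -> Universal mapping, restricted to tags used in this
# tutorial. The table and function names are assumptions for illustration.
PTB_TO_UNIVERSAL = {
    "PRP": "PRON", "DT": "DET", "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJS": "ADJ", "RB": "ADV", "IN": "ADP", ".": ".",
}

def to_universal(tagged_tokens):
    """Map (token, penn_tag) pairs to (token, universal_tag) pairs.

    Tags missing from the partial table are mapped to 'X' (other).
    """
    return [(tok, PTB_TO_UNIVERSAL.get(tag, "X")) for tok, tag in tagged_tokens]

tagged = [("A", "DT"), ("dog", "NN"), ("had", "VBD"), ("seen", "VBN"),
          ("the", "DT"), ("cutest", "JJS"), ("ferrets", "NNS"), (".", ".")]
print(to_universal(tagged))
```

Note how `VBD` and `VBN` both become `VERB`: the universal tagset trades detail for comparability across languages and applications.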

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import pos_tag
import nltk
nltk.download('universal_tagset')
text = "I saw a man sawing the tree with a saw. He can't finish it ontime."
sentences = sent_tokenize(text)
for sentence in sentences:
    for token, pos in pos_tag(word_tokenize(sentence)):
        print(token +" " + HR + pos + HD)
# to get information about a given tag
print("=====")
nltk.help.upenn_tagset("VB")

I [91mPRP[0m
saw [91mVBD[0m
a [91mDT[0m
man [91mNN[0m
sawing [91mVBG[0m
the [91mDT[0m
tree [91mNN[0m
with [91mIN[0m
a [91mDT[0m
saw [91mNN[0m
. [91m.[0m
He [91mPRP[0m
ca [91mMD[0m
n't [91mRB[0m
finish [91mVB[0m
it [91mPRP[0m
ontime [91mRB[0m
. [91m.[0m
=====
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...




In [8]:
# you can also decide to use the Universal POS tagset
for sentence in sentences:
    for token, pos in pos_tag(word_tokenize(sentence), tagset='universal'):
        print(token +" "+ HR + pos + HD)  

I [91mPRON[0m
saw [91mVERB[0m
a [91mDET[0m
man [91mNOUN[0m
sawing [91mVERB[0m
the [91mDET[0m
tree [91mNOUN[0m
with [91mADP[0m
a [91mDET[0m
saw [91mNOUN[0m
. [91m.[0m
He [91mPRON[0m
ca [91mVERB[0m
n't [91mADV[0m
finish [91mVERB[0m
it [91mPRON[0m
ontime [91mADV[0m
. [91m.[0m


## Part-of-speech tagging with spaCy

In [9]:
import spacy 
import pprint
# Load English tokenizer, tagger,  
# parser, NER and word vectors 
nlp = spacy.load("en_core_web_sm") 
text = ("I saw a man sawing the tree with a saw. He can't finish it ontime!") 
doc = nlp(text) 
# Print token and Tag 
for token in doc: 
    print(str(token)+" "+HR+  str(token.pos_) + HD) 
# Example list of Verb tokens 
print("Verbs:", [token.text for token in doc if token.pos_ == "VERB"]) 


I [91mPRON[0m
saw [91mVERB[0m
a [91mDET[0m
man [91mNOUN[0m
sawing [91mVERB[0m
the [91mDET[0m
tree [91mNOUN[0m
with [91mADP[0m
a [91mDET[0m
saw [91mNOUN[0m
. [91mPUNCT[0m
He [91mPRON[0m
ca [91mAUX[0m
n't [91mPART[0m
finish [91mVERB[0m
it [91mPRON[0m
ontime [91mADV[0m
! [91mPUNCT[0m
Verbs: ['saw', 'sawing', 'finish']


## Part-of-speech tagging with TextBlob

In [10]:
from textblob import TextBlob 
text = ("I saw a man sawing the tree with a saw. He can't finish it ontime!")   
# create a textblob object 
blob_object = TextBlob(text) 
# print word with pos tag. 
for word, pos in blob_object.tags:
    print(word +" " +HR + pos + HD)

I [91mPRP[0m
saw [91mVBD[0m
a [91mDT[0m
man [91mNN[0m
sawing [91mVBG[0m
the [91mDT[0m
tree [91mNN[0m
with [91mIN[0m
a [91mDT[0m
saw [91mNN[0m
He [91mPRP[0m
ca [91mMD[0m
n't [91mRB[0m
finish [91mVB[0m
it [91mPRP[0m
ontime [91mRB[0m


---
---

## Parsing vs. chunking 
Parsing is the process of analyzing a text to determine its <span style="color:blue">grammatical structure</span>. It goes beyond part-of-speech tagging (though that is often a first step) by grouping words within sentences into <span style="color:blue">hierarchical grammatical structures</span>. Here is a possible parse tree for the example sentence "<span style="color:blue">A dog had seen the cutest ferrets.</span>"
![parse.png](attachment:parse.png)

Proper parsing is a hard problem in computational linguistics. While identifying some sort of sentence structure is important for many NLP applications, not all of them require something as detailed and complicated as a parse tree. <span style="color:blue">Chunking</span>, also known as <span style="color:blue">shallow parsing</span>, is a simplified form of sentence analysis which identifies basic constituents (noun groups, verb groups, etc.) but does not specify their internal structure.
For the POS-tagged sentence example above, a chunker might identify noun chunks (<span style="color:blue">NC</span>) and verb complexes (<span style="color:blue">VC</span>) as follows:
![chunk.png](attachment:chunk.png)
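
As a rough illustration of how shallow chunking differs from full parsing, the following sketch groups consecutive tokens by tag class only, without building any internal structure. The tag sets `NC_TAGS` and `VC_TAGS` and the function names are assumptions chosen for this example sentence; they are not a standard chunker definition.

```python
from itertools import groupby

# Minimal shallow chunker over (token, tag) pairs: runs of noun-phrase tags
# form a noun chunk (NC), runs of verb tags form a verb complex (VC).
# The tag sets below are assumptions made for this sketch.
NC_TAGS = {"DT", "JJ", "JJS", "NN", "NNS"}
VC_TAGS = {"VB", "VBD", "VBN", "VBG", "MD"}

def chunk_label(tag):
    if tag in NC_TAGS:
        return "NC"
    if tag in VC_TAGS:
        return "VC"
    return "O"  # outside any chunk

def shallow_chunk(tagged_tokens):
    """Return (label, text) pairs for each run of NC or VC tokens."""
    chunks = []
    for label, group in groupby(tagged_tokens, key=lambda pair: chunk_label(pair[1])):
        words = [tok for tok, _ in group]
        if label != "O":
            chunks.append((label, " ".join(words)))
    return chunks

tagged = [("A", "DT"), ("dog", "NN"), ("had", "VBD"), ("seen", "VBN"),
          ("the", "DT"), ("cutest", "JJS"), ("ferrets", "NNS"), (".", ".")]
print(shallow_chunk(tagged))
# [('NC', 'A dog'), ('VC', 'had seen'), ('NC', 'the cutest ferrets')]
```

A limitation of grouping by tag class alone is that two adjacent noun phrases would be merged into one chunk, which is why practical chunkers use ordered patterns such as the regular-expression grammars shown later.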

## Chunking with NLTK
Chunking works on top of POS tagging: it takes POS tags as input and produces chunks as output.

We can write <span style="color:blue">rules</span> that define a <span style="color:blue">noun phrase</span>. For example, we can define a noun phrase chunk as an optional determiner (<span style="color:blue">DT</span>) followed by any number of adjectives (<span style="color:blue">JJ</span>) and then a noun (<span style="color:blue">NN</span>).

In [11]:
import nltk
sentence = "the little yellow dog barked at the cat."
#Define your grammar using regular expressions
grammar = ("Noun-Chunk: {<DT>?<JJ>*<NN>} # NP")
chunkParser = nltk.RegexpParser(grammar)
postags = nltk.pos_tag(nltk.word_tokenize(sentence))
for word, pos in postags:
    print(word +" " + HR+ pos + HD)
tree = chunkParser.parse(postags)
for subtree in tree.subtrees():
    print(subtree)
tree.draw()

the [91mDT[0m
little [91mJJ[0m
yellow [91mJJ[0m
dog [91mNN[0m
barked [91mVBD[0m
at [91mIN[0m
the [91mDT[0m
cat [91mNN[0m
. [91m.[0m
(S
  (Noun-Chunk the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (Noun-Chunk the/DT cat/NN)
  ./.)
(Noun-Chunk the/DT little/JJ yellow/JJ dog/NN)
(Noun-Chunk the/DT cat/NN)


The code above draws the chunk tree of the sentence (with chunk labels). It should look like the following:
![image.png](attachment:image.png)

In [12]:
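# Another noun-chunk grammar with two rules:
# 1) an optional DT or PP$, followed by any number of JJ, ending in NN, or
# 2) a sequence of one or more proper nouns (NNP+)
# Note: PP$ is the Brown-corpus possessive tag; nltk.pos_tag emits PRP$ instead,
# so "her" stays outside the chunk in the output below.
grammar = r"""
  Noun-Chunk: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""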
cp = nltk.RegexpParser(grammar)
sentence = "Rapunzel let down her long golden hair."
postags = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = cp.parse(postags)
print(tree)
tree.draw()

(S
  (Noun-Chunk Rapunzel/NNP)
  let/VBD
  down/RP
  her/PRP$
  (Noun-Chunk long/JJ golden/JJ hair/NN)
  ./.)


![image.png](attachment:image.png)

In [13]:
# List verb chunks from the brown corpus
cp = nltk.RegexpParser('Verb-chunk: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
verbchunks = [] 
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'Verb-chunk': 
            verbchunks.append(subtree)
# print the first ten chunks 
print(verbchunks[:10])
# draw the first Verb-chunk
verbchunks[0].draw()

[Tree('Verb-chunk', [('combined', 'VBN'), ('to', 'TO'), ('achieve', 'VB')]), Tree('Verb-chunk', [('continue', 'VB'), ('to', 'TO'), ('place', 'VB')]), Tree('Verb-chunk', [('serve', 'VB'), ('to', 'TO'), ('protect', 'VB')]), Tree('Verb-chunk', [('wanted', 'VBD'), ('to', 'TO'), ('wait', 'VB')]), Tree('Verb-chunk', [('allowed', 'VBN'), ('to', 'TO'), ('place', 'VB')]), Tree('Verb-chunk', [('expected', 'VBN'), ('to', 'TO'), ('become', 'VB')]), Tree('Verb-chunk', [('expected', 'VBN'), ('to', 'TO'), ('approve', 'VB')]), Tree('Verb-chunk', [('expected', 'VBN'), ('to', 'TO'), ('make', 'VB')]), Tree('Verb-chunk', [('intends', 'VBZ'), ('to', 'TO'), ('make', 'VB')]), Tree('Verb-chunk', [('seek', 'VB'), ('to', 'TO'), ('set', 'VB')])]


![image-2.png](attachment:image-2.png)

## Parsing and Chunking with spaCy
In <span style="color:blue">dependency parsing</span>, the syntactic structure of a sentence is described solely in terms of directed <span style="color:blue">binary grammatical relations between the words</span>. Relations among the words are illustrated above the sentence with <span style="color:blue">directed, labeled arcs</span> from <span style="color:blue">heads</span> to <span style="color:blue">dependents</span>. We call this a **typed dependency structure** because the labels are drawn from a fixed inventory of grammatical relations. A <span style="color:blue">root</span> node explicitly marks the root of the tree, the head of the entire structure. Read more [here](https://web.stanford.edu/~jurafsky/slp3/14.pdf).
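
A typed dependency structure can be stored very compactly: one head index and one relation label per token. The sketch below uses a short, hand-annotated sentence (not parser output) and follows the convention, also used by spaCy, that the root token is its own head. The function names `arcs` and `root` are hypothetical.

```python
# A typed dependency structure as parallel lists: heads[i] is the index of the
# head of tokens[i], labels[i] is the relation. Hand-written for illustration.
tokens = ["The", "cat", "sleeps"]
heads = [1, 2, 2]                  # the root points to itself
labels = ["det", "nsubj", "ROOT"]

def arcs(tokens, heads, labels):
    """Return (head, relation, dependent) triples, skipping the root."""
    return [(tokens[h], rel, tokens[i])
            for i, (h, rel) in enumerate(zip(heads, labels)) if h != i]

def root(tokens, heads):
    """Return the token that is its own head."""
    return next(tok for i, tok in enumerate(tokens) if heads[i] == i)

print(arcs(tokens, heads, labels))  # [('cat', 'det', 'The'), ('sleeps', 'nsubj', 'cat')]
print(root(tokens, heads))          # sleeps
```

Each arc is a binary relation between exactly two words, which is what distinguishes this representation from the nested constituents of a parse tree.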

In [14]:
# Dependency parsing with spaCy
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("The cat and the dog sleep in the basket near the door.")
# In a notebook, this renders the dependency arcs inline
displacy.render(doc, style='dep')

## Noun chunks in spaCy

In [15]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The cat and the dog sleep in the basket near the door.')
for np in doc.noun_chunks:
    print(np.text)

The cat
the dog
the basket
the door


## Chunking with TextBlob
TextBlob currently has two noun phrase chunker implementations: textblob.np_extractors.<span style="color:blue">FastNPExtractor</span> (the default, based on [Shlomi Babluki’s implementation](https://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/)) and textblob.np_extractors.<span style="color:blue">ConllExtractor</span>, which uses the CoNLL 2000 corpus to train a tagger.

In [16]:
from textblob import TextBlob
#from textblob.np_extractors import FastNPExtractor
from textblob.np_extractors import ConllExtractor
extractor = ConllExtractor()
sentence ="Swayy is a beautiful new dashboard for discovering and curating online content."
parse = TextBlob(sentence, np_extractor=extractor)
print(parse.noun_phrases)

['swayy', 'beautiful new dashboard', 'online content']


 # <span style="color:red"> Exercise  (15 pts)</span>
 ## Building a small ontology using Hearst patterns
In this problem, you will employ the POS, lemma and chunking information to discover <span style="color:blue">lexical relationships</span> in a corpus.

<span style="color:blue">Hearst patterns</span> are lexico-syntactic patterns first used by [Marti Hearst](http://people.ischool.berkeley.edu/~hearst/papers/coling92.pdf) to discover <span style="color:blue">hyponyms</span> in large text corpora. (A **hyponym** is a term that denotes a more specific subclass of another term, called its <span style="color:blue">hypernym</span>. For example, <span style="color:blue">tiger</span> is a hyponym of <span style="color:blue">mammal</span>, which is in turn a hyponym of <span style="color:blue">animal</span>. Therefore animal is a hypernym of mammal, and mammal is a hypernym of tiger.)

Hearst observed that certain linguistic constructions can be used to infer hyponymy relationships. For example, in the phrase “works by such <span style="color:blue">authors</span> as <span style="color:blue">Herrick</span>, <span style="color:blue">Goldsmith</span>, and <span style="color:blue">Shakespeare</span>”, it is obvious that Herrick, Goldsmith, and Shakespeare are all hyponyms of author. In general, any phrase of the pattern “<span style="color:blue">such NP0 as NP1, . . . , and NPn</span>” implies that the noun phrases NP1 through NPn are hyponyms of NP0. The following table shows some patterns originally proposed by Hearst, along with examples. 
![hearst.png](attachment:hearst.png)
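
The "such NP0 as NP1, ..., and NPn" pattern can be tried out with a plain regular expression, under the strong simplifying assumption that every noun phrase is a single word. The regex and the function name `such_as_hyponyms` below are illustrative only; real noun phrases span several words, which is why the exercise combines the patterns with chunking and lemmatization.

```python
import re

# "such NP0 as NP1, ..., and NPn", assuming each NP is a single word.
# The longer separator ", and " must be tried before ", " so that "and"
# is not mistaken for a noun phrase.
SUCH_AS = re.compile(r"such (\w+) as ((?:\w+(?:, and | and |, )?)+)")

def such_as_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by the simplified pattern."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = m.group(1)
        hyponyms = re.split(r",? and |, ", m.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs

print(such_as_hyponyms("works by such authors as Herrick, Goldsmith, and Shakespeare"))
# [('Herrick', 'authors'), ('Goldsmith', 'authors'), ('Shakespeare', 'authors')]
```

The hypernym comes out as the plural *authors*; lemmatizing the text first would normalize it to *author*, so that counts for the same relation are not split across inflected forms.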

---
---

Write a Python program which looks for hyponyms by finding Hearst patterns in a collection of documents.

 1. Write a program that reads a file or a list of files, iterates over each sentence and extracts possible <span style="color:blue">hyponym/hypernym</span> relations.  <span style="color:red"> ( 10 pts)</span> 
 2. Once the relations are extracted, report them as follows  <span style="color:red"> ( 5 pts)</span> :
* Print out the most commonly found <span style="color:blue">hyponym-hypernym relations</span>
 
Example output: 

|  count | Hyponym   | Hypernym |
|---|:----------|:---|
| 45  | house | building |
| 32   | Herrick  | author|
| 11  | France  |  country |

* Print the top five most <span style="color:blue">productive Hearst patterns</span>

Example output:

|  count | Hearst pattern   | 
|---|:----------|
| 1302  | NP such as NP | 
| 800   | such NP as NP  | 
| 452  | NP, including NP  |  
| 121 | NP, especially NP |
| 32 | NP and/or other NP |

 In this exercise, you can use either <span style="color:blue">NLTK</span>, <span style="color:blue">TextBlob</span>, or <span style="color:blue">spaCy</span> chunkers, or a combination of them, to implement the Hearst patterns. We will run your script on test sentences to determine how many patterns your implementation covers.
 
 You can use the corpus `wiki-1000.txt` in the folder `HearstPaternData`. You can compare your output to some of the files there such as `pattern_out_0.txt`.
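
For the reporting step, one possible approach is to collect the extracted relations and count them with `collections.Counter`. The sketch below uses a few made-up triples in place of real extraction results (the variable `extracted` and its contents are hypothetical) and prints both tables in the markdown layout shown above.

```python
from collections import Counter

# Hypothetical extraction results: (hyponym, hypernym, pattern name) triples.
extracted = [
    ("house", "building", "NP such as NP"),
    ("Herrick", "author", "such NP as NP"),
    ("house", "building", "NP, including NP"),
    ("France", "country", "NP such as NP"),
]

pair_counts = Counter((hypo, hyper) for hypo, hyper, _ in extracted)
pattern_counts = Counter(pattern for _, _, pattern in extracted)

print("| count | Hyponym | Hypernym |")
print("|---|:---|:---|")
for (hypo, hyper), n in pair_counts.most_common(5):
    print(f"| {n} | {hypo} | {hyper} |")

print()
print("| count | Hearst pattern |")
print("|---|:---|")
for pattern, n in pattern_counts.most_common(5):
    print(f"| {n} | {pattern} |")
```

`most_common(5)` simply returns fewer rows when fewer than five distinct items exist, so no bounds check is needed.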

## Resources
* [Learning POS Tagging & Chunking in NLP](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb)
* [TextBlob Chunking](https://textblob.readthedocs.io/en/dev/advanced_usage.html#noun-phrase-chunkers)
* [Chunking in NLTK](https://www.nltk.org/book/ch07.html)
* [Hearst Pattern](file:///Users/seidmuhieyimam/Downloads/KEOD_2018_58_CR.pdf)
* [Dependency Parsing](https://web.stanford.edu/~jurafsky/slp3/14.pdf)

In [1]:
hearst_patterns = [
            (
                '(NP_\\w+ (, )?such as (NP_\\w+ ?(, )?(and |or )?)+)',
                'first',
                'NP such as NP'
            ),
            (
                '(such NP_\\w+ (, )?as (NP_\\w+ ?(, )?(and |or )?)+)',
                'first',
                'such NP as NP'
            ),
            (
                '((NP_\\w+ ?(, )?)+(and |or )?other NP_\\w+)',
                'last',
                'NP and/or other NP'
            ),
            (
                '(NP_\\w+ (, )?include (NP_\\w+ ?(, )?(and |or )?)+)',
                'first',
                'NP, including NP'
            ),
            (
                '(NP_\\w+ (, )?especially (NP_\\w+ ?(, )?(and |or )?)+)',
                'first',
                'NP, especially NP'
            ),
            (
                    '((NP_\\w+ ?(, )?)+(and |or )?any other NP_\\w+)',
                    'last',
                    'NP, and/or any other NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?some other NP_\\w+)',
                    'last',
                    'NP, and/or some other NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?be a NP_\\w+)',
                    'last',
                    'NP, and/or be a NP'
                ),
                (
                    '(NP_\\w+ (, )?like (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP, like NP'
                ),
                (
                    'such (NP_\\w+ (, )?as (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'such NP as NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?like other NP_\\w+)',
                    'last',
                    'NP like other NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?one of the NP_\\w+)',
                    'last',
                    'NP one of the NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?one of these NP_\\w+)',
                    'last',
                    'NP one of these NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?one of those NP_\\w+)',
                    'last',
                    'NP one of those NP'
                ),
                (
                    'example of (NP_\\w+ (, )?be (NP_\\w+ ? '
                    '(, )?(and |or )?)+)',
                    'first',
                    'example of NP be NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?be example of NP_\\w+)',
                    'last',
                    'NP be example of NP'
                ),
                (
                    '(NP_\\w+ (, )?for example (, )?'
                    '(NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP, for example, NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?which be call NP_\\w+)',
                    'last',
                    'NP which be call NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?which be name NP_\\w+)',
                    'last',
                    'NP which be name NP'
                ),
                (
                    '(NP_\\w+ (, )?mainly (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP mainly NP'
                ),
                (
                    '(NP_\\w+ (, )?mostly (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP mostly NP'
                ),
                (
                    '(NP_\\w+ (, )?notably (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP notably NP'
                ),
                (
                    '(NP_\\w+ (, )?particularly (NP_\\w+ ? '
                    '(, )?(and |or )?)+)',
                    'first',
                    'NP particularly NP'
                ),
                (
                    '(NP_\\w+ (, )?principally (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP principally NP'
                ),
                (
                    '(NP_\\w+ (, )?in particular (NP_\\w+ ? '
                    '(, )?(and |or )?)+)',
                    'first',
                    'NP in particular NP'
                ),
                (
                    '(NP_\\w+ (, )?except (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP except NP'
                ),
                (
                    '(NP_\\w+ (, )?other than (NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP other than NP'
                ),
                (
                    '(NP_\\w+ (, )?e.g. (, )?(NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP e.g NP'
                ),
                (
                    '(NP_\\w+ \\( (e.g.|i.e.) (, )?(NP_\\w+ ?(, )?(and |or )?)+'
                    '(\\. )?\\))',
                    'first',
                    'NP e.g/i.e. NP'
                ),
                (
                    '(NP_\\w+ (, )?i.e. (, )?(NP_\\w+ ?(, )?(and |or )?)+)',
                    'first',
                    'NP i.e. NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and|or)? a kind of NP_\\w+)',
                    'last',
                    'NP a kind of NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and|or)? kind of NP_\\w+)',
                    'last',
                    'NP kind of NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and|or)? form of NP_\\w+)',
                    'last',
                    'NP form of NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?which look like NP_\\w+)',
                    'last',
                    'NP which look like NP'
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and |or )?which sound like NP_\\w+)',
                    'last',
                    'NP which sound like NP'
                ),
                (
                    '(NP_\\w+ (, )?which be similar to (NP_\\w+ ? '
                    '(, )?(and |or )?)+)',
                    'first',
                    "NP which be similar to NP"
                ),
                (
                    '(NP_\\w+ (, )?example of this be (NP_\\w+ ? '
                    '(, )?(and |or )?)+)',
                    'first',
                    "NP example of this be NP"
                ),
                (
                    '((NP_\\w+ ?(, )?)+(and|or)? sort of NP_\\w+)',
                    'last',
                    "NP sort of NP"
                ),
                (
                    '(NP_\\w+ (, )?which may include (NP_\\w+ '
                    '?(, )?(and |or )?)+)',
                    'first',
                    "NP which may include NP"
                )
        ]

In [2]:
import re
import spacy

# Load the model once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

# Lemmas that belong to the Hearst patterns themselves and must stay outside NP_ tokens
banned_words = ["or", "other", "and", "such", "as", "include", "especially", "any", "be", "a", "."]

def format_text(text):
    """Lemmatize `text` and replace every noun chunk by a single NP_<lemmas> token.

    Returns a list with one processed string per sentence.
    """
    text = text.replace('"', '')
    doc = nlp(text)
    chunks = []
    for sentence in doc.sents:
        sentence_text = sentence.lemma_
        for chunk in sentence.noun_chunks:
            chunk_arr = []    # lemmas as they appear in sentence_text
            replace_arr = []  # lemmas used to build the NP_ token
            for token in chunk:
                lemma = token.lemma_
                if lemma in banned_words:
                    continue
                chunk_arr.append(lemma)
                # Drop a trailing period (e.g. abbreviations) so that NP_\w+ can match
                replace_arr.append(lemma[:-1] if lemma.endswith(".") else lemma)
            if not chunk_arr:
                # Every token was banned: fall back to the last token of the chunk
                chunk_arr.append(chunk[-1].lemma_)
            chunk_lemma = " ".join(chunk_arr)
            replacement_value = "NP_" + "_".join(replace_arr)
            if chunk_lemma:
                # A function replacement keeps backslashes in the NP_ token literal
                sentence_text = re.sub(r'\b%s\b' % re.escape(chunk_lemma),
                                       lambda _match: replacement_value,
                                       sentence_text)
        chunks.append(sentence_text)
    return chunks
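
The core idea of `format_text`, collapsing each noun phrase into one `NP_` token so that a Hearst pattern becomes an ordinary regular expression, can be tried without spaCy. In the sketch below the noun phrases are given by hand instead of coming from a chunker; `tag_noun_phrases` is a hypothetical helper, and the pattern is the "NP such as NP" entry from `hearst_patterns`.

```python
import re

def tag_noun_phrases(sentence, noun_phrases):
    """Replace each given noun phrase in `sentence` by a single NP_ token."""
    # Longest phrases first, so multi-word phrases are tagged before their parts.
    for phrase in sorted(noun_phrases, key=len, reverse=True):
        token = "NP_" + "_".join(phrase.split())
        # \b does not match inside an existing NP_ token, because "_" is a word character
        sentence = re.sub(r"\b%s\b" % re.escape(phrase), lambda _m: token, sentence)
    return sentence

# The noun phrases are supplied by hand here (an assumption); a chunker would find them.
tagged = tag_noun_phrases("fruit such as apple and fresh orange",
                          ["fruit", "apple", "fresh orange"])
print(tagged)  # NP_fruit such as NP_apple and NP_fresh_orange

SUCH_AS = r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)"
match = re.search(SUCH_AS, tagged)
print(match.group(0))
```

Because every noun phrase is now a single whitespace-free token, the pattern no longer needs to know how long a noun phrase is, which keeps the regular expressions short.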


In [19]:
pattern = "(NP_\\w+ (, )?like (NP_\\w+ ?(, )?(and |or )?)+)"
s = "be find guilty of NP_crime against NP_humanity and other more specific NP_crime like NP_murder"
re.search(pattern, s)

<re.Match object; span=(71, 94), match='NP_crime like NP_murder'>

In [3]:
from collections import Counter

# Read the corpus; the context manager closes the file again
with open("./HearstPaternData/wiki-1000.txt") as corpus_file:
    f = corpus_file.read()
# Sample sentence for quick experiments, e.g. format_text(test)
test = 'After early victories for the activists, newspapers including the Ohio State University "Lantern" and "The Akron Beacon Journal" suggested that trademark protection for Chief Wahoo might be in jeopardy.'
text = format_text(f)

def clean_hyponym_term(term):
    """Turn an NP_ token such as 'NP_online_content' back into plain text."""
    return term.replace("NP_", "").replace("_", " ").replace(',', "").replace(".", "")

freq = Counter()   # extracted relations per Hearst pattern name
pairs = Counter()  # occurrences per (hyponym, hypernym) pair
count = 0          # total number of extracted relations
for sentence in text:
    for (pattern, parser, name) in hearst_patterns:
        # re.search reports only the first match of each pattern per sentence
        matches = re.search(pattern, sentence)
        if matches:
            match_str = matches.group(0)
            nps = [a for a in match_str.split() if a.startswith("NP_")]
            # 'first': the hypernym precedes its hyponyms; 'last': it follows them
            if parser == "first":
                general = nps[0]
                specific = nps[1:]
            else:
                general = nps[-1]
                specific = nps[:-1]
            for hyponym in specific:
                freq[name] += 1
                count += 1
                pair = (clean_hyponym_term(hyponym), clean_hyponym_term(general))
                pairs[pair] += 1
                print(pair)


In [28]:
import operator
# Sort the (hyponym, hypernym) pairs by frequency, most frequent first
pairs_Sorted = sorted(pairs.items(), key=operator.itemgetter(1), reverse=True)
# Slicing avoids an IndexError when fewer than five pairs were found
for (hyponym, hypernym), n in pairs_Sorted[:5]:
    print("Count: ", str(n), " Hyponym: ", hyponym, " Hypernym: ", hypernym)

Count:  2  Hyponym:  Alexander Alyabyev  Hypernym:  composer
Count:  2  Hyponym:  pea  Hypernym:  vegetable
Count:  2  Hyponym:  corn  Hypernym:  vegetable
Count:  2  Hyponym:  party  Hypernym:  thing
Count:  2  Hyponym:  workshop  Hypernym:  thing


In [31]:
# Sort the patterns by the number of relations they produced
# (reuses `operator` imported in the previous cell)
freq_Sorted = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)
for name, n in freq_Sorted[:5]:
    print("Count: ", str(n), "Pattern: ", name)

Count:  27 Pattern:  NP, including NP
Count:  25 Pattern:  NP such as NP
Count:  24 Pattern:  NP and/or other NP
Count:  19 Pattern:  NP, and/or be a NP
Count:  12 Pattern:  such NP as NP
