## 1. Syntactic Patterns for Technical Terms ##

In [106]:
import nltk
from nltk.corpus import brown
from bs4 import BeautifulSoup
from urllib.request import urlopen
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
import string
from nltk.corpus import stopwords

As seen in the Manning and Schuetze chapter, there is a well-known part-of-speech 
based pattern defined by Justeson and Katz for identifying simple noun phrases 
that often works well for pulling out keyphrases.

 Technical Term  T = (A | N)+ (N | C)  | N

Below, write a function to define a chunker using the RegexpParser as illustrated in the NLTK book Chapter 7 section 2.3 *Chunking with Regular Expressions*. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by *N* here. Also, *C* refers to cardinal number, which is CD in the Brown corpus.
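
The pattern can be read as a regular expression over tag classes. The following is a minimal pure-Python sketch, not the RegexpParser solution the exercise asks for. It assumes Brown-style tags and treats any tag starting with `JJ` as *A*, any tag starting with `NN` as *N*, and any tag starting with `CD` as *C*. The helper names `tag_class` and `technical_terms` and the sample sentence are made up for illustration.

```python
import re

# T = (A | N)+ (N | C) | N, written over one letter per token
TERM = re.compile(r"[AN]+[NC]|N")

def tag_class(tag):
    """Map a Brown-style tag to A, N, C, or x (anything else)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("CD"):
        return "C"
    return "x"

def technical_terms(tagged_sent):
    """Return the word sequences whose tags match the pattern."""
    letters = "".join(tag_class(tag) for _, tag in tagged_sent)
    words = [word for word, _ in tagged_sent]
    # One letter per token, so match offsets index straight into `words`
    return [" ".join(words[m.start():m.end()]) for m in TERM.finditer(letters)]

sent = [("the", "AT"), ("current", "JJ"), ("fiscal", "JJ"), ("year", "NN"),
        ("ends", "VBZ"), ("in", "IN"), ("two", "CD"), ("weeks", "NNS")]
print(technical_terms(sent))
```

Because the pattern must start with an adjective or a noun, a leading cardinal number such as *two* is left out and only *weeks* is kept from the final pair.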



In [268]:
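# One or more adjectives/nouns, then one or more nouns or cardinal numbers,
# optionally followed by a further noun.
grammar = r"""
  NP: {<JJ|NN.*>+<(NN.*|CD)>+<NN.*>?}
"""

def chunker(sent):
    """Chunk one POS-tagged sentence and print the resulting tree."""
    cp = nltk.RegexpParser(grammar)
    print(cp.parse(sent))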
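
The grammar above covers the `(A | N)+ (N | C)` half of the pattern but never chunks a lone noun, which the `| N` alternative allows. One possible way to add it, sketched here on a hand-tagged sentence (the sentence and the name `grammar_with_single_nouns` are invented for illustration), is a second rule in the same stage that picks up nouns the first rule left unchunked.

```python
import nltk

# First rule: (A | N)+ (N | C). Second rule: any noun still outside a chunk.
grammar_with_single_nouns = r"""
  NP: {<JJ|NN.*>+<NN.*|CD>}
      {<NN.*>}
"""

sent = [("bankers", "NNS"), ("opposed", "VBD"), ("the", "AT"),
        ("escheat", "NN"), ("law", "NN")]

tree = nltk.RegexpParser(grammar_with_single_nouns).parse(sent)
# Collect the words of every NP chunk, in sentence order
phrases = [" ".join(word for word, tag in st.leaves())
           for st in tree.subtrees() if st.label() == "NP"]
print(phrases)
```

Rules within one stage are applied in order, so the multi-word rule gets the first chance and the single-noun rule only sees tokens that are still outside a chunk.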

Below, write a function to call the chunker, run it on some sentences, and then print out the results for  those sentences.

For uniformity, please run it on sentences 100 through 104 from the full tagged brown corpus.

 

In [169]:
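brown = nltk.corpus.brown
# Slice ends are exclusive: [100:104] covers sentences 100-103 (four sentences);
# use [100:105] to include sentence 104 as well.
brown_subset = brown.tagged_sents()[100:104]

def callChunk(tagged_sents):
    """Run the chunker on each tagged sentence in turn."""
    for sent in tagged_sents:
        chunker(sent)

callChunk(brown_subset)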

(S
  Daniel/NP
  personally/RB
  led/VBD
  the/AT
  fight/NN
  for/IN
  the/AT
  measure/NN
  ,/,
  which/WDT
  he/PPS
  had/HVD
  watered/VBN
  down/RP
  considerably/RB
  since/IN
  its/PP$
  rejection/NN
  by/IN
  two/CD
  (NP previous/JJ Legislatures/NNS-TL)
  ,/,
  in/IN
  a/AT
  (NP public/JJ hearing/NN)
  before/IN
  the/AT
  (NP House/NN-TL Committee/NN-TL)
  on/IN-TL
  Revenue/NN-TL
  and/CC-TL
  Taxation/NN-TL
  ./.)
(S
  Under/IN
  (NP committee/NN rules/NNS)
  ,/,
  it/PPS
  went/VBD
  automatically/RB
  to/IN
  a/AT
  subcommittee/NN
  for/IN
  one/CD
  week/NN
  ./.)
(S
  But/CC
  questions/NNS
  with/IN
  which/WDT
  (NP committee/NN members/NNS)
  taunted/VBD
  bankers/NNS
  appearing/VBG
  as/CS
  witnesses/NNS
  left/VBD
  little/AP
  doubt/NN
  that/CS
  they/PPSS
  will/MD
  recommend/VB
  passage/NN
  of/IN
  it/PPO
  ./.)
(S
  Daniel/NP
  termed/VBD
  ``/``
  extremely/RB
  conservative/JJ
  ''/''
  his/PP$
  estimate/NN
  that/CS
  it/PPS
  would/MD
  produce/VB



Then extract the phrases themselves from sentences 100 through 160 using the subtree extraction technique shown in the 
*Exploring Text Corpora* category.  
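
As a minimal sketch of that subtree technique, assuming a hand-tagged sentence in place of the Brown data (the sentence is invented for illustration), the chunks can be pulled out as plain strings by filtering `tree.subtrees()` on the chunk label and joining the words of each subtree's leaves.

```python
import nltk

cp = nltk.RegexpParser('CHUNK: {<JJ|NN.*>+<(NN.*|CD)>+<NN.*>?}')

sent = [("the", "AT"), ("county-wide", "JJ"), ("day", "NN"), ("schools", "NNS"),
        ("serve", "VB"), ("deaf", "JJ"), ("students", "NNS")]
tree = cp.parse(sent)

# subtrees() accepts a filter; leaves() yields the (word, tag) pairs of a chunk
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'CHUNK')]
print(chunks)
```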

In [454]:
brown_subset2 = brown.tagged_sents()[100:160]
cp = nltk.RegexpParser('CHUNK: {<JJ|NN.*>+<(NN.*|CD)>+<NN.*>?}')
for sent in brown_subset2:
    tree = cp.parse(sent)
    # Walk every subtree and print only the ones labelled CHUNK
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)

(CHUNK previous/JJ Legislatures/NNS-TL)
(CHUNK public/JJ hearing/NN)
(CHUNK House/NN-TL Committee/NN-TL)
(CHUNK committee/NN rules/NNS)
(CHUNK committee/NN members/NNS)
(CHUNK current/JJ fiscal/JJ year/NN)
(CHUNK escheat/NN law/NN)
(CHUNK bank/NN accounts/NNS)
(CHUNK personal/JJ property/NN)
(CHUNK insurance/NN firms/NNS)
(CHUNK pipeline/NN companies/NNS)
(CHUNK such/JJ property/NN)
(CHUNK state/NN treasurer/NN)
(CHUNK escheat/NN law/NN)
(CHUNK such/JJ property/NN)
(CHUNK Bankers/NNS-TL Association/NN-TL)
(CHUNK opposition/NN keynote/NN)
(CHUNK contractual/JJ obligations/NNS)
(CHUNK bank/NN customers/NNS)
(CHUNK taxpayers'/NNS$ pockets/NNS)
(CHUNK pipeline/NN companies/NNS)
(CHUNK day/NN schools/NNS)
(CHUNK special/JJ schooling/NN)
(CHUNK deaf/JJ students/NNS)
(CHUNK scholastic/JJ age/NN)
(CHUNK Education/NN-TL Agency/NN-TL)
(CHUNK county-wide/JJ day/NN schools/NNS)
(CHUNK deaf/JJ children/NNS)
(CHUNK day/NN schools/NNS)
(CHUNK day/NN schools/NNS)
(CHUNK year's/NN$ capital/NN outlay/NN

## 2. Identify Proper Nouns ##
For this next task, write a new version of the chunker, but this time change it in two ways:
 1. Make it recognize proper nouns
 2. Make it work on your personal text collection which means that you need to run a tagger over your personal text collection.

Note that the second requirement means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank).



**Tagger:** Your code for optionally training a tagger, and for definitely running the tagger on your personal collection, goes here:

In [148]:
url = "https://archive.org/stream/SteveJobs/SteveJobs_djvu.txt"  # page holding the book text
html = urlopen(url).read()  # download the raw HTML
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
soupText = soup.get_text()  # use Beautiful Soup to strip the HTML markup
start = soupText.find("The Adoption")  # beginning of the first chapter
end = soupText.rfind("ACKNOWLEDGMENTS")  # end of the book
# The original run sliced at the hard-coded offsets 26001 and 1235482,
# presumably the values the two calls above returned at the time.
text = soupText[start:end]  # first chapter through the end of the book
sents = sent_tokenizer.tokenize(text)  # divide it up into sentences

In [109]:
def tokenize_text(corpus):
    """Split raw text into sentences, then each sentence into word tokens."""
    sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sents = sent_tokenizer.tokenize(corpus)  # split text into sentences
    return [nltk.word_tokenize(sent) for sent in raw_sents]

In [246]:
# Note: this tokenizes the full page text (soupText), not the trimmed `text` slice.
tokenized_Steve = tokenize_text(soupText)

In [244]:
def create_data_sets(sentences):
    """Split tagged sentences into 90% training and 10% test data."""
    size = int(len(sentences) * 0.9)
    train_sents = sentences[:size]
    test_sents = sentences[size:]
    return train_sents, test_sents

def build_backoff_tagger(train_sents):
    """Bigram tagger backing off to unigram, then to a default 'NN' tag."""
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2

brown_tagged_sents = brown.tagged_sents(categories=['belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor','mystery', 'religion', 'reviews', 'romance','science_fiction'])

train_sents, test_sents = create_data_sets(brown_tagged_sents)

ngram_tagger = build_backoff_tagger(train_sents)
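
The split above sets aside `test_sents`, but none of the cells shown here use it. One way to put it to work, sketched below on a tiny hand-made corpus so that it runs without downloading Brown (the toy sentences and the helper `tagging_accuracy` are illustrative assumptions), is to compare the tagger's output with the held-out gold tags.

```python
import nltk

def build_backoff_tagger(train_sents):
    # Same backoff chain as above: bigram -> unigram -> default 'NN'
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    return nltk.BigramTagger(train_sents, backoff=t1)

def tagging_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        for (_, predicted), (_, expected) in zip(tagger.tag(words), gold):
            correct += predicted == expected
            total += 1
    return correct / total

toy_train = [[("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
             [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")]]
toy_test = [[("the", "AT"), ("dog", "NN"), ("sleeps", "VBZ")],
            [("Apple", "NP")]]

toy_tagger = build_backoff_tagger(toy_train)
print(tagging_accuracy(toy_tagger, toy_test))
```

Words the tagger has never seen, such as `Apple` here, fall through to the default `NN` tag, which is worth keeping in mind when the goal is to find proper nouns.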

In [368]:
steve_tagged = [ngram_tagger.tag(sent) for sent in tokenized_Steve]

**Chunker:** Code for the proper noun chunker goes here:

In [538]:
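def chunk(text):
    """Collect every CHUNK subtree from a list of tagged sentences as a string."""
    chunklist = []
    # The proper nouns in my text are being tagged as NN-TL etc., so match
    # runs of NN-* tags, optionally followed by a preposition.
    cp = nltk.RegexpParser('CHUNK: {<NN-.*>+<IN>?}')
    for sub in text:
        tree = cp.parse(sub)
        for subtree in tree.subtrees():
            if subtree.label() == 'CHUNK':
                chunklist.append(str(subtree))
    return chunklist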

**Test the Chunker:** Test your proper noun recognizer on a lot of sentences to see how well it is working.  You might want to add prepositions in order to improve your results.  
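
For the preposition idea, one option is to allow a preposition only when it sits between two proper-noun runs, so that a chunk can never end on a dangling *of*. The sketch below assumes Brown-style tags, where `NP` marks proper nouns and the `-TL` suffix marks words inside titles. The grammar, the label `PROPER`, and the hand-tagged sentence are illustrative, not the notebook's own.

```python
import nltk

# Proper-noun run, optionally extended by "preposition + proper-noun run"
proper_grammar = r'PROPER: {<NP.*|NN.*-TL>+(<IN.*><NP.*|NN.*-TL>+)?}'
proper_cp = nltk.RegexpParser(proper_grammar)

sent = [("He", "PPS"), ("attended", "VBD"), ("the", "AT"),
        ("University", "NN-TL"), ("of", "IN-TL"), ("California", "NP-TL"),
        ("with", "IN"), ("Steve", "NP"), ("Jobs", "NP")]

tree = proper_cp.parse(sent)
# Join the words of each PROPER chunk into a plain string
proper_nouns = [" ".join(word for word, tag in st.leaves())
                for st in tree.subtrees() if st.label() == "PROPER"]
print(proper_nouns)
```

The optional group can be used only once per chunk, so *with* stays outside and the two names remain separate chunks.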


In [543]:
chunklist = chunk(steve_tagged)

**FreqDist Results:** After you have your proper noun recognizer working to your satisfaction, below  run it over your entire collection, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency.  That code goes here, along with the output:


In [544]:
fdist = nltk.FreqDist(chunklist)

In [545]:
fdist.most_common(23)

[('(CHUNK Time/NN-TL)', 61),
 ('(CHUNK Bill/NN-TL)', 52),
 ('(CHUNK Story/NN-TL)', 44),
 ('(CHUNK Valley/NN-TL)', 31),
 ('(CHUNK Rock/NN-TL)', 30),
 ('(CHUNK Wall/NN-TL Street/NN-TL Journal/NN-TL)', 28),
 ('(CHUNK Art/NN-TL)', 26),
 ('(CHUNK Life/NN-TL)', 24),
 ('(CHUNK Box/NN-TL)', 22),
 ('(CHUNK Grove/NN-TL)', 21),
 ('(CHUNK Horn/NN-TL)', 20),
 ('(CHUNK News/NN-TL)', 19),
 ('(CHUNK Park/NN-TL)', 18),
 ('(CHUNK King/NN-TL)', 17),
 ('(CHUNK Stone/NN-TL)', 16),
 ('(CHUNK Man/NN-TL)', 16),
 ('(CHUNK University/NN-TL)', 15),
 ('(CHUNK President/NN-TL)', 15),
 ('(CHUNK Homestead/NN-TL)', 14),
 ('(CHUNK Thing/NN-TL)', 14),
 ('(CHUNK Music/NN-TL)', 14),
 ('(CHUNK Center/NN-TL)', 14),
 ('(CHUNK College/NN-TL)', 14)]
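
The keys in the output above are printed tree strings, tags included. As a possible follow-up, the sketch below strips each string down to its words before counting. It uses `collections.Counter`, which `nltk.FreqDist` subclasses, so `most_common` behaves the same way. The helper `phrase_text` and the three sample strings are assumptions made for illustration.

```python
from collections import Counter

def phrase_text(chunk_str):
    """Turn '(CHUNK Wall/NN-TL Street/NN-TL)' into 'Wall Street'."""
    tokens = chunk_str.strip("()").split()[1:]  # drop the CHUNK label
    # Each token is word/TAG; split on the last slash to keep the word
    return " ".join(tok.rsplit("/", 1)[0] for tok in tokens)

sample_chunks = ['(CHUNK Time/NN-TL)',
                 '(CHUNK Wall/NN-TL Street/NN-TL Journal/NN-TL)',
                 '(CHUNK Time/NN-TL)']

counts = Counter(phrase_text(c) for c in sample_chunks)
print(counts.most_common(2))
```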