# Week One
For Chapter 1, you may need to install the NumPy, pandas, spaCy,
and NLTK Python packages (if you don't already have them). They'll
be used extensively in the course.

The easiest way to install these packages is with pip:

* `pip install numpy`
* `pip install nltk`
* `pip install pandas`
* `pip install spacy`
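
Before moving on, it can help to confirm that all four packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the course material. Note that the spaCy model loaded later, `en_core_web_sm`, is installed separately, typically with `python -m spacy download en_core_web_sm`.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ['numpy', 'nltk', 'pandas', 'spacy']
missing = missing_packages(required)
if missing:
    print('Still need to install:', ', '.join(missing))
else:
    print('All required packages are installed.')
```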

## Page 12

In [2]:
import nltk
import spacy
import numpy as np
import pandas as pd

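# The parse/tag/entity keyword arguments are not accepted by spaCy v3;
# the default pipeline already includes the tagger, parser, and NER.
nlp = spacy.load('en_core_web_sm')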

## Page 13

In [3]:
sentence = 'The brown fox is quick and he is jumping over the lazy dog'
print(sentence)

words = sentence.split()
np.random.shuffle(words)
print(words, '\n')

The brown fox is quick and he is jumping over the lazy dog
['is', 'quick', 'jumping', 'lazy', 'brown', 'he', 'the', 'fox', 'over', 'is', 'The', 'dog', 'and'] 
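
Because `np.random.shuffle` draws from NumPy's global random state, the shuffled order above changes on every run. If a repeatable order is preferred, one option (an assumption on my part, not something the book does) is a seeded generator; the seed value 42 is arbitrary.

```python
import numpy as np

sentence = 'The brown fox is quick and he is jumping over the lazy dog'
words = sentence.split()

# A seeded Generator produces the same permutation on every run.
rng = np.random.default_rng(42)
shuffled = rng.permutation(words).tolist()  # tolist() gives plain Python strings
print(shuffled)
```

Unlike `np.random.shuffle`, `permutation` returns a shuffled copy and leaves `words` in its original order.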



## Page 16

In [4]:
pos_tags = nltk.pos_tag(sentence.split())
print(pd.DataFrame(pos_tags).T, '\n')
spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in nlp(sentence)]
print(pd.DataFrame(spacy_pos_tagged).T)

    0      1    2    3      4    5    6    7        8     9    10    11   12
0  The  brown  fox   is  quick  and   he   is  jumping  over  the  lazy  dog
1   DT     JJ   NN  VBZ     JJ   CC  PRP  VBZ      VBG    IN   DT    JJ   NN 

    0      1     2    3      4      5     6    7        8     9    10    11  \
0  The  brown   fox   is  quick    and    he   is  jumping  over  the  lazy   
1   DT     JJ    NN  VBZ     JJ     CC   PRP  VBZ      VBG    IN   DT    JJ   
2  DET    ADJ  NOUN  AUX    ADJ  CCONJ  PRON  AUX     VERB   ADP  DET   ADJ   

     12  
0   dog  
1    NN  
2  NOUN  
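
The transposed tables above label their rows 0, 1, and 2, which makes it easy to forget which row is which. As a small sketch, the rows can be named when the DataFrame is built. The column names used here ('Word', 'Tag', 'POS') are my own labels, and the tuples are copied from the first five columns of the spaCy output above so that the example runs without a model.

```python
import pandas as pd

# (word, fine-grained tag, coarse-grained POS), copied from the output above
spacy_pos_tagged = [('The', 'DT', 'DET'), ('brown', 'JJ', 'ADJ'),
                    ('fox', 'NN', 'NOUN'), ('is', 'VBZ', 'AUX'),
                    ('quick', 'JJ', 'ADJ')]

# Name the columns first; after transposing they become the row labels.
table = pd.DataFrame(spacy_pos_tagged, columns=['Word', 'Tag', 'POS']).T
print(table)
```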


## Pages 19-20

In [8]:
grammar = '''
            NP: {<DT>?<JJ>?<NN.*>}
            ADJP: {<JJ>}
            ADVP: {<RB.*>}
            PP: {<IN>}
            VP: {<MD>?<VB.*>+}
          '''
pos_tagged_sent = nltk.pos_tag(sentence.split())
rp = nltk.RegexpParser(grammar)
shallow_parsed_sent = rp.parse(pos_tagged_sent)
print(shallow_parsed_sent, '\n')
# This line will cause another window to appear;
# It will take a few moments for that to happen.
shallow_parsed_sent.draw()

(S
  (NP The/DT brown/JJ fox/NN)
  (VP is/VBZ)
  (ADJP quick/JJ)
  and/CC
  he/PRP
  (VP is/VBZ jumping/VBG)
  (PP over/IN)
  (NP the/DT lazy/JJ dog/NN)) 
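
Once a sentence is chunked, the resulting tree can be walked to pull out just the phrases of interest. The sketch below collects the noun phrases. To avoid depending on a tagger download, it starts from the tagged tokens in the page 16 output, and it uses only the `NP` rule from the grammar above.

```python
import nltk

# Tagged tokens copied from the nltk.pos_tag output on page 16
pos_tagged_sent = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
                   ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
                   ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
                   ('lazy', 'JJ'), ('dog', 'NN')]

rp = nltk.RegexpParser('NP: {<DT>?<JJ>?<NN.*>}')
tree = rp.parse(pos_tagged_sent)

# Keep only the NP subtrees and join the words in each one
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(noun_phrases)  # ['The brown fox', 'the lazy dog']
```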



## Page 26

In [6]:
from spacy import displacy
displacy.render(nlp(sentence),
                style='dep',
                options={'distance': 100,
                         'arrow_stroke': 1.5,
                         'arrow_width': 8})

## Pages 32-33
These two pages, and many others, use the Stanford Parser, which is now
deprecated in NLTK. For simplicity, I'm going to try to avoid this way of
parsing altogether because even the replacement (nltk.parse.CoreNLPParser) requires
a running web server based on the Stanford parser.

In [10]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

tokens = nltk.word_tokenize(sentence)
print(tokens)

tagged = nltk.pos_tag(tokens)
print(tagged)

entities = nltk.chunk.ne_chunk(tagged)
print(entities)

# This example uses an NLTK corpus instead of the
# sentence (it requires nltk.download('treebank')).
# We'll get to corpora later.
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\neugg\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\neugg\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
(S
  The/DT
  brown/JJ
  fox/NN
  is/VBZ
  quick/JJ
  and/CC
  he/PRP
  is/VBZ
  jumping/VBG
  over/IN
  the/DT
  lazy/JJ
  dog/NN)
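
If a full constituency tree is wanted without the Stanford tools, one workaround for small examples is a hand-written context-free grammar and NLTK's chart parser. This is only a toy sketch: the grammar below is my own, covers just one short sentence, and is no substitute for a trained parser.

```python
import nltk

# A hand-written toy grammar (an assumption for illustration only)
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT JJ NN
VP -> VBZ JJ
DT -> 'the'
JJ -> 'brown' | 'quick'
NN -> 'fox'
VBZ -> 'is'
""")

parser = nltk.ChartParser(toy_grammar)
tokens = 'the brown fox is quick'.split()

# parse() yields every tree the grammar allows for these tokens
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

Any word that is missing from the grammar makes `parse` raise a `ValueError`, which is why this approach only suits tiny, controlled examples.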


## Corpora Demo - Pages 55-56

In [12]:
import nltk
# The following downloads from NLTK only need to happen once.
nltk.download('brown')
nltk.download('reuters')
nltk.download('wordnet')

from nltk.corpus import brown
print('Total Categories:', len(brown.categories()))
print(brown.categories(), '\n')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\neugg\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\neugg\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\neugg\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Total Categories: 15
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] 



In [13]:
# tokenized sentences
brown.sents(categories='mystery')
# POS tagged sentences
brown.tagged_sents(categories='mystery')
# get sentences in natural form
sentences = brown.sents(categories='mystery')
# get tagged words
tagged_words = brown.tagged_words(categories='mystery')
# get nouns from tagged words
nouns = [(word, tag) for word, tag in tagged_words if any(noun_tag in tag for noun_tag in ['NP', 'NN'])]
print(nouns[0:10]) # prints the first 10 nouns
# build frequency distribution for nouns
nouns_freq = nltk.FreqDist([word for word, tag in nouns])
# print the top 10 occurring nouns
print(nouns_freq.most_common(10), '\n')

[('patients', 'NNS'), ('bus', 'NN'), ('morning', 'NN'), ('Hanover', 'NP'), ('interne', 'NN'), ('nurse', 'NN'), ('attendants', 'NNS'), ('charge', 'NN'), ('bus', 'NN'), ('window', 'NN')]
[('man', 106), ('time', 82), ('door', 80), ('car', 69), ('room', 65), ('Mr.', 63), ('way', 61), ('office', 50), ('eyes', 48), ('hand', 46)] 
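
The noun filter above relies on substring matching: a tag is kept if it contains 'NP' or 'NN', which also catches variants such as 'NNS'. The sketch below applies the same filter and frequency count to a handful of hand-picked tagged words (most are from the output above, and the verb is added for contrast), so it runs without downloading the corpus.

```python
import nltk

tagged_words = [('patients', 'NNS'), ('bus', 'NN'), ('left', 'VBD'),
                ('Hanover', 'NP'), ('bus', 'NN'), ('window', 'NN')]

# Keep a word if its tag contains either noun marker
nouns = [(word, tag) for word, tag in tagged_words
         if any(noun_tag in tag for noun_tag in ['NP', 'NN'])]

# FreqDist counts how often each word occurs
nouns_freq = nltk.FreqDist(word for word, tag in nouns)
print(nouns_freq.most_common(1))  # [('bus', 2)]
```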



In [14]:
# REUTERS CORPUS DEMO
from nltk.corpus import reuters
print('Total Categories:', len(reuters.categories()))
print(reuters.categories(), '\n')

Total Categories: 90
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc'] 



In [15]:
# get sentences in housing and income categories
sentences = reuters.sents(categories=['housing', 'income'])
sentences = [' '.join(sentence_tokens) for sentence_tokens in sentences]
print(sentences[0:5])  # prints the first 5 sentences
# fileid based access
print(reuters.fileids(categories=['housing', 'income']))
print(reuters.sents(fileids=[u'test/16118', u'test/18534']), '\n')

["YUGOSLAV ECONOMY WORSENED IN 1986 , BANK DATA SHOWS National Bank economic data for 1986 shows that Yugoslavia ' s trade deficit grew , the inflation rate rose , wages were sharply higher , the money supply expanded and the value of the dinar fell .", 'The trade deficit for 1986 was 2 . 012 billion dlrs , 25 . 7 pct higher than in 1985 .', 'The trend continued in the first three months of this year as exports dropped by 17 . 8 pct , in hard currency terms , to 2 . 124 billion dlrs .', 'Yugoslavia this year started quoting trade figures in dinars based on current exchange rates , instead of dollars based on a fixed exchange rate of 264 . 53 dinars per dollar .', "Yugoslavia ' s balance of payments surplus with the convertible currency area fell to 245 mln dlrs in 1986 from 344 mln in 1985 ."]
['test/16118', 'test/18534', 'test/18540', 'test/18664', 'test/18665', 'test/18672', 'test/18911', 'test/19875', 'test/20106', 'test/20116', 'training/1035', 'training/1036', 'training/10602', 't

In [16]:
# WORDNET CORPUS DEMO
from nltk.corpus import wordnet as wn
word = 'hike' # taking hike as our word of interest
# get word synsets
word_synsets = wn.synsets(word)
print(word_synsets)
# get details for each synset
for synset in word_synsets:
    print('Synset Name:', synset.name())
    print('POS Tag:', synset.pos())
    print('Definition:', synset.definition())
    print('Examples:', synset.examples())
    print('\n')

[Synset('hike.n.01'), Synset('rise.n.09'), Synset('raise.n.01'), Synset('hike.v.01'), Synset('hike.v.02')]
Synset Name: hike.n.01
POS Tag: n
Definition: a long walk usually for exercise or pleasure
Examples: ['she enjoys a hike in her spare time']


Synset Name: rise.n.09
POS Tag: n
Definition: an increase in cost
Examples: ['they asked for a 10% rise in rates']


Synset Name: raise.n.01
POS Tag: n
Definition: the amount a salary is increased
Examples: ['he got a 3% raise', 'he got a wage hike']


Synset Name: hike.v.01
POS Tag: v
Definition: increase
Examples: ['The landlord hiked up the rents']


Synset Name: hike.v.02
POS Tag: v
Definition: walk a long way, as for pleasure or physical exercise
Examples: ['We were hiking in Colorado', 'hike the Rockies']


