# Intro to NLTK - part 2

To download the resources required for this part, run the following in your Python console:<br>
    `>>> import nltk`<br>
    `>>> nltk.download('wordnet_ic')`<br>
    `>>> nltk.download('movie_reviews')`<br>
    `>>> nltk.download('sentiwordnet')`<br>

## Example of lemmatization:

In [28]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
# Build a new list instead of removing items from the list being iterated,
# which can silently skip the token that follows each removed one.
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

Word                Lemma               
He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun                 

When no `pos` argument is given, the lemmatizer treats every word as a noun, which is why `was` and `has` are wrongly reduced to `wa` and `ha`. Passing `pos="v"` lemmatizes each word as a verb instead:


In [6]:
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos="v")))

He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
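
Neither run above is right for every word: the first one mangles the verbs, and the second one leaves `hours` unreduced. A common workaround is to tag the sentence first and pass each word's own part of speech to the lemmatizer. The helper below is a minimal sketch of the tag conversion, assuming Penn Treebank tags such as those returned by `pos_tag` in the next section. The name `penn_to_wordnet`, the noun fallback and the hand-written tags are illustrative choices, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    # Assumption: everything else falls back to noun, the lemmatizer's default
    return "n"


# Hand-written (word, tag) pairs for illustration, not actual tagger output
tagged = [("was", "VBD"), ("running", "VBG"), ("hours", "NNS"), ("bad", "JJ")]
for word, tag in tagged:
    print("{0:10}{1:6}{2}".format(word, tag, penn_to_wordnet(tag)))
```

With NLTK loaded, each pair could then be lemmatized with `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.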


## Example of POS tagging:

In [17]:
from nltk import word_tokenize
from nltk import pos_tag
text = word_tokenize("And now for something completely different")
pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

# WordNet

**WordNet reader** (http://www.nltk.org/howto/wordnet.html) provides an interface to access WordNet data, such as:
- synsets of a given lemma+PoS pair,
- lemmas of a given synset,
- hypernyms and hyponyms of a given synset,
- synonyms and antonyms of a given lemma in a synset,
- least common subsumers of a pair of synsets,
- different measures of synset similarity,
- ...

In [29]:
from nltk.corpus import wordnet as wn

### synsets

In [30]:
wn.synsets('age', 'n')

[Synset('age.n.01'),
 Synset('historic_period.n.01'),
 Synset('age.n.03'),
 Synset('long_time.n.01'),
 Synset('old_age.n.01')]

In [31]:
age = wn.synset('age.n.1')
age

Synset('age.n.01')

### definitions, examples and lemmas

In [16]:
age.definition()

'how long something has existed'

In [12]:
age.examples()

['it was replaced because of its age']

In [9]:
ls = wn.synsets('age', 'n')
ll = ls[1].lemmas()
[lemma.name() for lemma in ll]

['historic_period', 'age']

### antonyms

In [21]:
good = wn.synset('good.a.01')
good.lemmas()[0].antonyms()

[Lemma('bad.a.01.bad')]

### hyponyms and hypernyms

In [22]:
age.hyponyms()

[Synset('bone_age.n.01'),
 Synset('chronological_age.n.01'),
 Synset('developmental_age.n.01'),
 Synset('fetal_age.n.01'),
 Synset('mental_age.n.01'),
 Synset('newness.n.01'),
 Synset('oldness.n.01'),
 Synset('oldness.n.02'),
 Synset('youngness.n.01')]

In [23]:
age.hypernyms()

[Synset('property.n.02')]

In [24]:
age.root_hypernyms()

[Synset('entity.n.01')]

In [14]:
hyper = lambda s: s.hypernyms()
list(age.closure(hyper))

[Synset('property.n.02'),
 Synset('attribute.n.02'),
 Synset('abstraction.n.06'),
 Synset('entity.n.01')]

In [15]:
age.tree(hyper)

[Synset('age.n.01'),
 [Synset('property.n.02'),
  [Synset('attribute.n.02'),
   [Synset('abstraction.n.06'), [Synset('entity.n.01')]]]]]

### similarities

In [25]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
dog.path_similarity(cat)

0.2

In [26]:
dog.lch_similarity(cat)

2.0281482472922856

In [27]:
dog.wup_similarity(cat)

0.8571428571428571

In [19]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
dog.lin_similarity(cat,brown_ic)

0.8768009843733973

In [20]:
dog.lowest_common_hypernyms(cat)

[Synset('carnivore.n.01')]
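
To see what the path-based measures compute, the sketch below re-implements path similarity and Wu-Palmer similarity on a tiny hand-made hypernym tree. The tree is a toy simplification (an assumption made here, not WordNet's real graph) and all function names are hypothetical. Path similarity is taken as 1 / (1 + number of edges on the shortest path), and Wu-Palmer as 2 * depth(LCS) / (depth(a) + depth(b)).

```python
# Toy hypernym links (child -> parent); a simplification, not WordNet's real graph
PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_subsumer(a, b):
    """First ancestor of a (a itself included) that is also an ancestor of b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    return None


def depth(node):
    """Number of nodes from the root down to node (the root has depth 1)."""
    return len(path_to_root(node))


def path_similarity(a, b):
    """1 / (1 + edges on the shortest path), which in a tree passes through the LCS."""
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1 / (1 + edges)


def wup_similarity(a, b):
    """Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(lowest_common_subsumer("dog", "cat"))
print(path_similarity("dog", "cat"))
print(wup_similarity("dog", "cat"))
```

In this toy tree the dog-cat path has four edges, so path similarity comes out at 0.2, the same value as above. The Wu-Palmer value differs from the WordNet one because it depends on node depths, and the toy tree is much shallower than the real hierarchy.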

### Exercise 1:
Given the following (lemma, category) pairs:

('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'), ('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'), ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')

For each pair, when possible, print the most frequent WordNet synset, the corresponding least common subsumer (LCS), and the similarity value, using the following functions:
- Path Similarity
- Leacock-Chodorow Similarity
- Wu-Palmer Similarity
- Lin Similarity

Normalize the similarity values when necessary. Which similarity measure seems better?
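
As a hint on normalization: path, Wu-Palmer and Lin similarities already lie between 0 and 1, whereas Leacock-Chodorow is unbounded in general. Its score is -log(p / (2D)), where p is the shortest path length counted in nodes and D is the taxonomy depth, so the maximum is reached when a synset is compared with itself (p = 1). The sketch below divides by that maximum. The function names are hypothetical, and p = 5 with D = 19 are values inferred here from the dog/cat score shown earlier rather than stated in the source.

```python
import math


def lch(path_nodes, taxonomy_depth):
    """Leacock-Chodorow score: -log(p / (2 * D))."""
    return -math.log(path_nodes / (2.0 * taxonomy_depth))


def normalized_lch(path_nodes, taxonomy_depth):
    """Divide by the maximum score, reached when p == 1 (a synset against itself)."""
    return lch(path_nodes, taxonomy_depth) / lch(1, taxonomy_depth)


# Assumed values: p = 5 and D = 19 are inferred from the dog/cat score, not given in the source
raw = lch(5, 19)
scaled = normalized_lch(5, 19)
print(round(raw, 4), round(scaled, 4))
```

With NLTK, the same scaling can be obtained by dividing `a.lch_similarity(b)` by `a.lch_similarity(a)`.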

# Sentiment Analysis

The `movie_reviews` polarity corpus contains:
- 1000 positive examples
- 1000 negative examples

In [21]:
from nltk.corpus import movie_reviews as mr

mr.fileids('pos')[:5]

['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt']

In [22]:
mr.words('pos/cv000_29590.txt')[:7]

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had']

### SentiWordnet

In [23]:
from nltk.corpus import sentiwordnet as swn

# getting the wordnet synset
synset = wn.synset('good.a.1')
# getting the sentiwordnet synset
sentiSynset = swn.senti_synset(synset.name())
# getting the scores: positivity, negativity and objectivity
sentiSynset.pos_score(), sentiSynset.neg_score(), sentiSynset.obj_score()

(0.75, 0.0, 0.25)

### Exercise 2: unsupervised polarity system
1. For each word, take its first (most frequent) synset, restricted to one of the following sets of categories:
    - nouns, verbs, adjectives and adverbs
    - nouns, adjectives and adverbs
    - only adjectives
2. Sum the positive scores and the negative scores of those synsets to obtain the polarity of each review
3. Apply the system to the movie reviews corpus and report its accuracy
4. Draw some conclusions from the results

Note: instead of taking the first sense, we can assign the proper sense using a Word Sense Disambiguation tagger. We will cover these tomorrow.
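
Steps 2 and 3 can be sketched independently of how the synsets are selected. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, the score pairs are made up rather than read from SentiWordNet, and ties are labelled positive because the exercise does not say how to break them.

```python
def review_polarity(scores):
    """Classify a review from (positive, negative) score pairs, one per selected synset."""
    total_pos = sum(pos for pos, _ in scores)
    total_neg = sum(neg for _, neg in scores)
    # Assumption: ties are labelled 'pos'; the exercise leaves this open
    return "pos" if total_pos >= total_neg else "neg"


def accuracy(predicted, gold):
    """Fraction of reviews whose predicted label equals the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up scores for two tiny "reviews"; real ones would come from SentiWordNet
reviews = [
    [(0.75, 0.0), (0.125, 0.25)],
    [(0.0, 0.625), (0.25, 0.5)],
]
gold = ["pos", "neg"]
predicted = [review_polarity(review) for review in reviews]
print(predicted, accuracy(predicted, gold))
```

In the actual exercise, each score pair would come from `pos_score()` and `neg_score()` of the chosen senti-synset, and the gold labels from the `pos` and `neg` categories of `mr.fileids(...)`.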