In [1]:
t = '''Cyprus, officially the Republic of Cyprus, is an island country in the Eastern Mediterranean and the 
third largest and third most populous island in the Mediterranean. Cyprus is located south of Turkey, west of 
Syria and Lebanon, northwest of Israel, north of Egypt, and southeast of Greece. Cyprus is a major tourist 
destination in the Mediterranean. With an advanced, high-income economy and a very high Human Development Index,
the Republic of Cyprus has been a member of the Commonwealth since 1961 and was a founding member of the
Non-Aligned Movement until it joined the European Union on 1 May 2004. On 1 January 2008, the Republic of 
Cyprus joined the eurozone.'''

# Tokenization

In [2]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [3]:
sents=sent_tokenize(t.lower())
sents

['cyprus, officially the republic of cyprus, is an island country in the eastern mediterranean and the \nthird largest and third most populous island in the mediterranean.',
 'cyprus is located south of turkey, west of \nsyria and lebanon, northwest of israel, north of egypt, and southeast of greece.',
 'cyprus is a major tourist \ndestination in the mediterranean.',
 'with an advanced, high-income economy and a very high human development index,\nthe republic of cyprus has been a member of the commonwealth since 1961 and was a founding member of the\nnon-aligned movement until it joined the european union on 1 may 2004. on 1 january 2008, the republic of \ncyprus joined the eurozone.']

In [5]:
words=word_tokenize(sents[2])
words

['cyprus',
 'is',
 'a',
 'major',
 'tourist',
 'destination',
 'in',
 'the',
 'mediterranean',
 '.']

# POS tagging

In [51]:
from nltk import pos_tag
tags=pos_tag(words)
tags

[('cyprus', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('major', 'JJ'),
 ('tourist', 'NN'),
 ('destination', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mediterranean', 'NN'),
 ('.', '.')]
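
Since the tagger returns a plain list of (token, tag) tuples, the result can be filtered with ordinary Python. The following is a minimal sketch, with the tagged output above copied in by hand so that it runs without any NLTK models; the names `tagged` and `nouns` are illustrative and not part of the original notebook.

```python
# Tagged output copied from the cell above (hard-coded, so no tagger model is needed).
tagged = [('cyprus', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('major', 'JJ'),
          ('tourist', 'NN'), ('destination', 'NN'), ('in', 'IN'),
          ('the', 'DT'), ('mediterranean', 'NN'), ('.', '.')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in tagged if tag.startswith('NN')]
print(nouns)
```

The same pattern works for any tag prefix, for example 'VB' for verbs or 'JJ' for adjectives.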

To look up what a tag means, NLTK ships documentation for the Penn Treebank tagset:

In [8]:
import nltk.help
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


In [9]:
nltk.help.upenn_tagset('VB')

VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...


# Wordnet

**WordNet is a lexical database for the English language, organized as a semantic graph.**

WordNet groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. NLTK provides an interface to WordNet.

In [11]:
from nltk.corpus import wordnet as wn

In [12]:
wn.synsets('human')

[Synset('homo.n.02'),
 Synset('human.a.01'),
 Synset('human.a.02'),
 Synset('human.a.03')]

In [14]:
wn.synsets('human')[0].definition()

'any living or extinct member of the family Hominidae characterized by superior intelligence, articulate speech, and erect carriage'

In [15]:
wn.synsets('human')[1].definition()

'characteristic of humanity'

In [16]:
wn.synsets('human')[2].definition()

'relating to a person'

In [17]:
wn.synsets('human')[3].definition()

'having human form or attributes as opposed to those of animals or divine beings'

In [22]:
human=wn.synsets('human')[0]
human

Synset('homo.n.02')

In [23]:
human.hypernyms() # A hypernym is a word with a broad meaning that forms a category into which words with
                  # more specific meanings fall (a superordinate). For example, colour is a hypernym of red.

[Synset('hominid.n.01')]

In [24]:
human.hyponyms() # Hyponyms are the reverse relation: the more specific synsets that fall under this one.

[Synset('homo_erectus.n.01'),
 Synset('homo_habilis.n.01'),
 Synset('homo_sapiens.n.01'),
 Synset('homo_soloensis.n.01'),
 Synset('neandertal_man.n.01'),
 Synset('rhodesian_man.n.01'),
 Synset('world.n.08')]

In [28]:
#we can get all the lemmas for given synset
wn.synset('homo.n.02').lemmas()

[Lemma('homo.n.02.homo'),
 Lemma('homo.n.02.man'),
 Lemma('homo.n.02.human_being'),
 Lemma('homo.n.02.human')]

In [30]:
#we can look up a particular lemma
wn.lemma('homo.n.02.homo')

Lemma('homo.n.02.homo')

In [31]:
#get synset corresponding to lemma
wn.lemma('homo.n.02.homo').synset()

Synset('homo.n.02')

In [32]:
#get the name of a lemma
wn.lemma('homo.n.02.homo').name()

'homo'

In [33]:
wn.synsets('bicycle')

[Synset('bicycle.n.01'), Synset('bicycle.v.01')]

In [35]:
wn.synset('bicycle.n.01').definition()

'a wheeled vehicle that has two wheels and is moved by foot pedals'

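The Wu-Palmer metric (WUP) measures the similarity of two synsets from their depths in the hypernym graph and the depth of their most specific common ancestor. Scores are at most 1, which is reached when the two synsets are identical. NLTK exposes other metrics too, such as path_similarity and lch_similarity.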
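
To make the idea concrete, here is a minimal sketch of the Wu-Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), on a tiny hand-made taxonomy. The `parent` dictionary and the helper names are hypothetical and are not WordNet data or NLTK code. The sketch also assumes a simple tree with one parent per node, whereas WordNet synsets can have several hypernym paths.

```python
# Toy taxonomy (hypothetical): each node maps to its single parent; the root maps to None.
parent = {
    'entity': None,
    'animal': 'entity',
    'dog': 'animal',
    'cat': 'animal',
    'vehicle': 'entity',
    'bicycle': 'vehicle',
}

def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (node first, root last)."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path

def depth(node, parent):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node, parent))

def wup(a, b, parent):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b, parent))
    # Walking up from a, the first node that is also an ancestor of b
    # is the deepest common ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    return 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))

print(wup('dog', 'cat', parent))      # siblings under 'animal'
print(wup('dog', 'bicycle', parent))  # only the root in common
print(wup('dog', 'dog', parent))      # identical nodes
```

Nodes that share a deep common ancestor score higher than nodes that only meet at the root, which is the behaviour the metric is designed to capture.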

In [39]:
girl=wn.synsets('girl')[1]
girl

Synset('female_child.n.01')

In [40]:
human=wn.synsets('human')[1]
human

Synset('human.a.01')

In [46]:
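girl.wup_similarity(human) # No output: the call returned None here. human.a.01 is an adjective synset, and
                           # adjectives have no hypernym path to a noun synset; homo.n.02 is the noun sense.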

In [47]:
#synonyms
synonyms=[]
for syn in wn.synsets('girl'):
    print('syn: ',syn)
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

synonyms

syn:  Synset('girl.n.01')
syn:  Synset('female_child.n.01')
syn:  Synset('daughter.n.01')
syn:  Synset('girlfriend.n.02')
syn:  Synset('girl.n.05')


['girl',
 'miss',
 'missy',
 'young_lady',
 'young_woman',
 'fille',
 'female_child',
 'girl',
 'little_girl',
 'daughter',
 'girl',
 'girlfriend',
 'girl',
 'lady_friend',
 'girl']
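
The list above contains repeats because the same lemma name appears in several synsets. A minimal sketch of one way to clean it up, using the output above copied in by hand; the names `unique_synonyms` and `readable` are illustrative and not from the original notebook.

```python
# Synonym list copied from the output above.
synonyms = ['girl', 'miss', 'missy', 'young_lady', 'young_woman', 'fille',
            'female_child', 'girl', 'little_girl', 'daughter', 'girl',
            'girlfriend', 'girl', 'lady_friend', 'girl']

# dict.fromkeys keeps the first occurrence of each name and preserves order.
unique_synonyms = list(dict.fromkeys(synonyms))

# WordNet joins multi-word lemmas with underscores; replace them for display.
readable = [name.replace('_', ' ') for name in unique_synonyms]
print(readable)
```

A set would also remove the duplicates, but it would lose the order in which the synsets were visited.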

In [49]:
#antonyms
antonyms=[]
for syn in wn.synsets('girl'):
    print('syn: ',syn)
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

antonyms

syn:  Synset('girl.n.01')
syn:  Synset('female_child.n.01')
syn:  Synset('daughter.n.01')
syn:  Synset('girlfriend.n.02')
syn:  Synset('girl.n.05')


['male_child', 'boy', 'son', 'boy']

# Chunking and entity recognition

The goal of chunking is to divide a sentence into chunks. Usually each chunk contains a **head** and, optionally, additional words and modifiers. Examples of chunks include noun groups and verb groups.

## Chunking

In order to create a chunker, we need to first define a chunk grammar, consisting of rules that indicate how sentences should be chunked.


We can define a simple grammar for a noun phrase (NP) chunker with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).

Note that tokens which do not form part of a noun phrase are left unchunked, which is expected:


In [50]:
from nltk.chunk import RegexpParser

In [53]:
grammar="NP:{<DT>?<JJ>*<NN>}"
chunker=RegexpParser(grammar)
result=chunker.parse(tags)
result

Tree('S', [Tree('NP', [('cyprus', 'NN')]), ('is', 'VBZ'), Tree('NP', [('a', 'DT'), ('major', 'JJ'), ('tourist', 'NN')]), Tree('NP', [('destination', 'NN')]), ('in', 'IN'), Tree('NP', [('the', 'DT'), ('mediterranean', 'NN')]), ('.', '.')])
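
With the single-noun rule, "a major tourist" and "destination" end up in separate chunks. As a hedged variation, the sketch below changes the rule to end in `<NN>+` so that one chunk can absorb consecutive nouns, and then collects the chunk texts. It assumes NLTK is installed; the tags are copied in by hand from the tagging step, and the names `tree` and `noun_phrases` are illustrative only.

```python
from nltk.chunk import RegexpParser

# Tagged sentence copied from the POS tagging output (hard-coded, so no tagger model is needed).
tags = [('cyprus', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('major', 'JJ'),
        ('tourist', 'NN'), ('destination', 'NN'), ('in', 'IN'),
        ('the', 'DT'), ('mediterranean', 'NN'), ('.', '.')]

# <NN>+ matches one or more nouns, so noun-noun compounds stay in one chunk.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>+}")
tree = chunker.parse(tags)

# print() gives a plain-text bracketed rendering that needs no drawing tools.
print(tree)

# Collect the words of every NP subtree as a plain string.
noun_phrases = [' '.join(token for token, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(noun_phrases)
```

Whether compounds should be merged depends on the task; the original one-noun rule is equally valid when each noun should stand alone.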

## Entity recognition

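The goal of entity recognition is to detect named entities such as persons, locations, and times. In the output below no entities are found, probably because the text was lowercased earlier and capitalization is an important cue for the recognizer.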


In [54]:
from nltk.chunk import ne_chunk
ne_chunk(tags) #ne=named entity

Tree('S', [('cyprus', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('major', 'JJ'), ('tourist', 'NN'), ('destination', 'NN'), ('in', 'IN'), ('the', 'DT'), ('mediterranean', 'NN'), ('.', '.')])