## Accessing Text Corpora and Lexical Resources

In [26]:
#Load library and texts used in the book
import nltk
from nltk.book import *
from nltk.corpus import gutenberg
from nltk.corpus import inaugural

In [13]:
gutenberg.fileids()  # file identifiers of the texts in the Gutenberg corpus

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [19]:
emma = gutenberg.words('austen-emma.txt')
print(len(emma))


192427


In [22]:
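# For each text, compute:
#   average word length (characters per token; the raw count includes whitespace)
#   average sentence length (tokens per sentence)
#   lexical diversity (average number of times each vocabulary item appears)
# Note: w.lower() must be called. With a bare w.lower the set holds method
# objects instead of lowercased words, and the third column comes out near 1.2
# (as in the output below) rather than the true diversity score.

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sent = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(num_chars/num_words, num_words/num_sent, num_words/num_vocab, fileid)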


4.609909212324673 24.822884416924666 1.203496153605604 austen-emma.txt
4.749793727271801 26.19989324793168 1.1981570757307622 austen-persuasion.txt
4.753785952421314 28.32086417283457 1.1936563609230484 austen-sense.txt
4.286881563819072 33.57319868451649 1.2438588826052441 bible-kjv.txt
4.567033756284415 19.073059360730593 1.25661853188929 blake-poems.txt
4.489300433741879 19.40726510653161 1.211895829698133 bryant-stories.txt
4.464641670621737 17.99146110056926 1.203083365055196 burgess-busterbrown.txt
4.233216065669891 20.029359953024077 1.2610447706015009 carroll-alice.txt
4.716173862839705 20.296296296296298 1.1959901850778658 chesterton-ball.txt
4.724783007796614 22.61245401996847 1.2120354331263115 chesterton-brown.txt
4.63099417739442 18.496258685195084 1.2081798662872902 chesterton-thursday.txt
4.4391184023772565 20.59266862170088 1.2642257882544978 edgeworth-parents.txt
4.76571875515204 25.928919375683467 1.2248186151353635 melville-moby_dick.txt
4.835734572682675 52.30956239
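
The three statistics can be checked by hand on a tiny text. The sketch below is a minimal illustration using made-up sentences; the function name `text_stats` and the sample text are assumptions for this example, not part of NLTK. Note that the vocabulary has to be built with `w.lower()`, parentheses included, for case folding to take effect.

```python
def text_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity)."""
    vocab = {w.lower() for w in words}       # case-folded vocabulary
    avg_word_len = len(raw) / len(words)     # raw length includes spaces
    avg_sent_len = len(words) / len(sents)   # tokens per sentence
    diversity = len(words) / len(vocab)      # average uses per vocabulary item
    return avg_word_len, avg_sent_len, diversity


# Hypothetical toy data standing in for gutenberg.raw/words/sents.
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for sent in sents for w in sent]

avg_word_len, avg_sent_len, diversity = text_stats(raw, words, sents)
print(avg_sent_len, diversity)  # 4.5 1.5
```

Here 9 tokens map to 6 distinct lowercased items, so each item is used 1.5 times on average.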

In [36]:
nltk.corpus.brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## WordNet

WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure.


In [56]:
# Accessing synonyms in wordnet
from nltk.corpus import wordnet as wn

wn.synsets("motorcar")

[Synset('car.n.01')]

*Motorcar* has just one possible meaning, identified in WordNet as *car.n.01*, the first noun sense of *car*. The entity *car.n.01* is a synset (synonym set): a collection of synonymous words, also called lemmas.

In [67]:
# get the names of all lemmas in the synset
print("all lemma names:", wn.synset("car.n.01").lemma_names())
# look up a particular lemma
print("lemma:", wn.lemma("car.n.01.auto"))
# get the synset a given lemma belongs to
print("synset:", wn.lemma("car.n.01.auto").synset())
# get the name of a lemma
print("lemma name:", wn.lemma("car.n.01.auto").name())

all lemma names: ['car', 'auto', 'automobile', 'machine', 'motorcar']
lemma: Lemma('car.n.01.auto')
synset: Synset('car.n.01')
lemma name: auto


In [80]:
print(wn.synsets("artefact"))
wn.synset('artifact.n.01').definition()

[Synset('artifact.n.01')]


'a man-made object taken as a whole'

Hyponyms (more specific terms) and hypernyms (more general terms) are called lexical relations because they relate one synset to another.
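
The two directions can be illustrated with a toy example. The dictionary below is a hand-written, simplified fragment of the noun hierarchy, typed in for this sketch rather than loaded from WordNet, and the helper names are hypothetical. In NLTK itself, the corresponding calls are the `hypernyms()` and `hyponyms()` methods of a synset.

```python
# Hand-written toy fragment: each synset name maps to its direct hypernym.
# This is an assumption for illustration, not data read from WordNet.
HYPERNYM_OF = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
    "motor_vehicle.n.01": "self-propelled_vehicle.n.01",
    "self-propelled_vehicle.n.01": "wheeled_vehicle.n.01",
    "wheeled_vehicle.n.01": "vehicle.n.01",
}


def hypernym_chain(name):
    """Walk upward from a synset to the most general term in the toy map."""
    chain = []
    while name in HYPERNYM_OF:
        name = HYPERNYM_OF[name]
        chain.append(name)
    return chain


def hyponyms(name):
    """Return the direct hyponyms of a synset by inverting the hypernym map."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == name)


print(hypernym_chain("car.n.01"))      # increasingly general terms
print(hyponyms("motor_vehicle.n.01"))  # ['car.n.01', 'truck.n.01']
```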

## Semantic Similarity

Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term such as *car* also matches documents containing more specific terms such as *BMW* or *Mercedes*.

In [97]:
wn.synsets("lady")
lady = wn.synset('lady.n.01')
# The similarity of a synset with itself is 1; the further apart two
# synsets are in the hierarchy, the closer the score gets to 0.
lady.path_similarity(lady)


1.0

In [123]:
from nltk.corpus import wordnet as wn
# find the semantic similarity between car and vehicle
car = wn.synsets("car")[0]  # first synset for "car"
vehicle = wn.synsets("vehicle")[0]  # first synset for "vehicle"
semantic_similarity = car.path_similarity(vehicle)
print("Similarity between car and vehicle is", round(semantic_similarity, 2))

Similarity between car and vehicle is 0.2
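
The score of 0.2 can be understood from how path similarity is defined: 1 divided by (1 + the length of the shortest path between the two synsets in the hypernym hierarchy). The sketch below reproduces that arithmetic on a hand-written toy fragment of the hierarchy. The dictionary and the function are assumptions for illustration, not NLTK's implementation, and the fragment is simplified rather than loaded from WordNet.

```python
from collections import deque

# Hand-written toy fragment: each synset name maps to its direct hypernym.
# This is an assumption for illustration, not data read from WordNet.
HYPERNYM_OF = {
    "car.n.01": "motor_vehicle.n.01",
    "truck.n.01": "motor_vehicle.n.01",
    "motor_vehicle.n.01": "self-propelled_vehicle.n.01",
    "self-propelled_vehicle.n.01": "wheeled_vehicle.n.01",
    "wheeled_vehicle.n.01": "vehicle.n.01",
}


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None if no path exists."""
    # Treat hypernym links as undirected edges.
    neighbours = {}
    for child, parent in HYPERNYM_OF.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)

    # Breadth-first search finds the shortest path in an unweighted graph.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(path_similarity("car.n.01", "car.n.01"))      # 1.0
print(path_similarity("car.n.01", "vehicle.n.01"))  # 0.2
```

In this toy fragment *car* sits four links below *vehicle*, giving 1 / (1 + 4) = 0.2.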
