# Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

http://nltk.org/book


In [1]:
import nltk
# The full download can take several minutes.

# Step 1: run Jupyter as administrator.
# Step 2: uncomment the next line the first time you run this.
#nltk.download('all')


In [2]:

text = 'The Athens University of Economics and Business (AUEB) was originally founded in \
1920 under the name of Athens School of Commercial Studies. It was renamed in 1926 as the \
Athens School of Economics and Business, \
a name that was retained until 1989 when it assumed its present name, \
the Athens University of Economics and Business.It is the third oldest \
university in Greece and the oldest in the fields of economics and business. \
Up to 1955 the school offered only one degree in the general area of economics and \
commerce. In 1955 it started two separate programs leading to two separate degrees: \
one in economics and the other in business administration. In 1984 the school was \
divided into three departments, namely the Department of Economics, the Department of \
Business Administration and the Department of Statistics and Informatics.In 1989, the \
university expanded to six departments. From 1999 onwards, the university developed \
even further and nowadays it includes eight academic departments, offering eight \
undergraduate degrees, 28 master\'s degrees and an equivalent number of doctoral programs.'



In [3]:

'''
Sentence tokenization
'''
#nltk.download('punkt')
from nltk import sent_tokenize

sentences = sent_tokenize(text)
print(sentences)


['The Athens University of Economics and Business (AUEB) was originally founded in 1920 under the name of Athens School of Commercial Studies.', 'It was renamed in 1926 as the Athens School of Economics and Business, a name that was retained until 1989 when it assumed its present name, the Athens University of Economics and Business.It is the third oldest university in Greece and the oldest in the fields of economics and business.', 'Up to 1955 the school offered only one degree in the general area of economics and commerce.', 'In 1955 it started two separate programs leading to two separate degrees: one in economics and the other in business administration.', 'In 1984 the school was divided into three departments, namely the Department of Economics, the Department of Business Administration and the Department of Statistics and Informatics.In 1989, the university expanded to six departments.', "From 1999 onwards, the university developed even further and nowadays it includes eight acad
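The output above shows that two sentences joined without a space after the period (`Business.It`, `Informatics.In`) are not split. One possible workaround, sketched here with the standard library only, is to normalize the text before calling `sent_tokenize`. The function name `add_missing_sentence_spaces` is hypothetical, and the rule is a heuristic that assumes a lowercase letter, a period, and an uppercase letter in a row always mark a sentence boundary.

```python
import re

def add_missing_sentence_spaces(raw_text):
    """Insert a space after a period that directly joins a lowercase
    letter to an uppercase letter (heuristic, may over-split)."""
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", raw_text)

sample = "the Athens University of Economics and Business.It is the third oldest university."
print(add_missing_sentence_spaces(sample))
```

The cleaned string can then be passed to `sent_tokenize` in place of the raw `text`.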

In [4]:

'''
word tokenization
'''

from nltk import word_tokenize

tokens = word_tokenize(text)
print(tokens)


['The', 'Athens', 'University', 'of', 'Economics', 'and', 'Business', '(', 'AUEB', ')', 'was', 'originally', 'founded', 'in', '1920', 'under', 'the', 'name', 'of', 'Athens', 'School', 'of', 'Commercial', 'Studies', '.', 'It', 'was', 'renamed', 'in', '1926', 'as', 'the', 'Athens', 'School', 'of', 'Economics', 'and', 'Business', ',', 'a', 'name', 'that', 'was', 'retained', 'until', '1989', 'when', 'it', 'assumed', 'its', 'present', 'name', ',', 'the', 'Athens', 'University', 'of', 'Economics', 'and', 'Business.It', 'is', 'the', 'third', 'oldest', 'university', 'in', 'Greece', 'and', 'the', 'oldest', 'in', 'the', 'fields', 'of', 'economics', 'and', 'business', '.', 'Up', 'to', '1955', 'the', 'school', 'offered', 'only', 'one', 'degree', 'in', 'the', 'general', 'area', 'of', 'economics', 'and', 'commerce', '.', 'In', '1955', 'it', 'started', 'two', 'separate', 'programs', 'leading', 'to', 'two', 'separate', 'degrees', ':', 'one', 'in', 'economics', 'and', 'the', 'other', 'in', 'business', 

In [6]:

'''
Counting words
'''

from collections import Counter
count = Counter(tokens)
print(count.most_common(10))

# Alternative: nltk.FreqDist offers the same most_common() interface.
# count = nltk.FreqDist(tokens)


[('the', 15), ('of', 11), ('and', 11), (',', 8), ('in', 7), ('.', 6), ('Athens', 4), ('Economics', 4), ('was', 4), ('Business', 3)]

In [None]:

'''
Removing stopwords
'''
#nltk.download(u'stopwords')
from nltk.corpus import stopwords
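# Build the stopword set once; compare in lowercase so that 'The' is removed too.
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]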
count = Counter(filtered)
print(count.most_common(10))
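The counts printed earlier are dominated by punctuation tokens such as `,` and `.`, which a stopword list alone does not necessarily remove. A minimal sketch of a combined filter follows. `TOY_STOPWORDS` is a small hand-written stand-in used only for illustration, not NLTK's English list, and `content_words` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical, hand-written stand-in for stopwords.words('english').
TOY_STOPWORDS = {"the", "of", "and", "in", "was", "it", "a"}

def content_words(tokens, stop_words):
    """Keep purely alphabetic tokens whose lowercase form is not a stopword.
    Note: str.isalpha() also drops numbers such as '1920'."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

sample_tokens = ["The", "Athens", "University", "of", "Economics", "and",
                 "Business", "(", "AUEB", ")", "was", "founded", "in", "1920", "."]
kept = content_words(sample_tokens, TOY_STOPWORDS)
print(Counter(kept).most_common(3))
```

Whether numbers should be dropped depends on the task; replacing `t.isalpha()` with `t.isalnum()` would keep them.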


In [None]:

'''
Creating ngrams
'''
from nltk.util import ngrams
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print(trigrams)
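To make the idea concrete, an n-gram is simply a window of n consecutive tokens slid across the sequence one position at a time. The block below is a plain-Python sketch of that idea, not NLTK's implementation; `sliding_ngrams` is a hypothetical name.

```python
def sliding_ngrams(sequence, n):
    """Return every run of n consecutive items as a tuple.
    A sequence shorter than n yields an empty list."""
    items = list(sequence)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sample_tokens = ["the", "oldest", "in", "the", "fields", "of", "economics"]
print(sliding_ngrams(sample_tokens, 2))
print(sliding_ngrams(sample_tokens, 3))
```

A sequence of k tokens produces k - n + 1 n-grams, so the seven tokens above give six bigrams and five trigrams.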


# A Universal Part-of-Speech Tagset

| Tag | Meaning | English Examples |
|---|---|---|
| **ADJ** | adjective | new, good, high, special, big, local |
| **ADP** | adposition | on, of, at, with, by, into, under |
| **ADV** | adverb | really, already, still, early, now |
| **CONJ** | conjunction | and, or, but, if, while, although |
| **DET** | determiner, article | the, a, some, most, every, no, which |
| **NOUN** | noun | year, home, costs, time, Africa |
| **NUM** | numeral | twenty-four, fourth, 1991, 14:24 |
| **PRT** | particle | at, on, out, over per, that, up, with |
| **PRON** | pronoun | he, their, her, its, my, I, us |
| **VERB** | verb | is, say, told, given, playing, would |
| **.** | punctuation marks | . , ; ! |
| **X** | other | ersatz, esprit, dunno, gr8, univeristy |

The Penn Treebank tags returned by `nltk.pos_tag`:

| Tag | Meaning | Examples |
|---|---|---|
| **CC** | coordinating conjunction | |
| **CD** | cardinal number | |
| **DT** | determiner | |
| **EX** | existential there | “there is” (think of it as “there exists”) |
| **FW** | foreign word | |
| **IN** | preposition/subordinating conjunction | |
| **JJ** | adjective | big |
| **JJR** | adjective, comparative | bigger |
| **JJS** | adjective, superlative | biggest |
| **LS** | list marker | 1) |
| **MD** | modal | could, will |
| **NN** | noun, singular | desk |
| **NNS** | noun, plural | desks |
| **NNP** | proper noun, singular | Harrison |
| **NNPS** | proper noun, plural | Americans |
| **PDT** | predeterminer | all the kids |
| **POS** | possessive ending | parent’s |
| **PRP** | personal pronoun | I, he, she |
| **PRP\$** | possessive pronoun | my, his, her |
| **RB** | adverb | very, silently |
| **RBR** | adverb, comparative | better |
| **RBS** | adverb, superlative | best |
| **RP** | particle | give up |
| **TO** | to | go ‘to’ the store |
| **UH** | interjection | errrrrrrrm |
| **VB** | verb, base form | take |
| **VBD** | verb, past tense | took |
| **VBG** | verb, gerund/present participle | taking |
| **VBN** | verb, past participle | taken |
| **VBP** | verb, non-3rd person singular present | take |
| **VBZ** | verb, 3rd person singular present | takes |
| **WDT** | wh-determiner | which |
| **WP** | wh-pronoun | who, what |
| **WP\$** | possessive wh-pronoun | whose |
| **WRB** | wh-adverb | where, when |


In [None]:

'''
POS tagging
'''
#nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
pos_tags = pos_tag(tokens)
print(pos_tags)

# nltk.help.upenn_tagset()
# nltk.help.upenn_tagset('CC')
# nltk.pos_tag_sents([['this', 'is', 'batch', 'tag', 'test'], ['nltk', 'is', 'text', 'analysis', 'tool']])  # replaces the removed nltk.batch_pos_tag
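The tagger returns fine-grained Penn Treebank tags, whereas the universal tagset listed earlier is much coarser. As a rough sketch of how the two relate, the block below collapses some Penn tags into universal ones. `PENN_TO_UNIVERSAL` is a partial, hand-written mapping for illustration only, and the tagged pairs are written by hand rather than produced by the tagger.

```python
# Partial, illustrative mapping; tags not listed (including punctuation)
# fall back to 'X' in this sketch.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "IN": "ADP", "DT": "DET", "CC": "CONJ", "CD": "NUM",
    "PRP": "PRON", "PRP$": "PRON", "RP": "PRT", "TO": "PRT",
}

def to_universal(tagged_tokens):
    """Convert (word, penn_tag) pairs to (word, universal_tag) pairs."""
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]

hand_tagged = [("The", "DT"), ("school", "NN"), ("offered", "VBD"),
               ("two", "CD"), ("degrees", "NNS")]
print(to_universal(hand_tagged))
```

NLTK can perform this conversion itself through `pos_tag(tokens, tagset='universal')`, which requires the `universal_tagset` resource to be downloaded.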


## Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is $\Rightarrow$ be 

car, cars, car's, cars' $\Rightarrow$ car

The result of this mapping of text will be something like:

the boy's cars are different colors $\Rightarrow$ the boy car be differ color

However, the two words differ in their flavor. *Stemming* usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. *Lemmatization* usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token *saw*, stemming might return just *s*, whereas lemmatization would attempt to return either *see* or *saw* depending on whether the token was used as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.

For instance:

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word "walk" is the base form of the word "walking", and hence this is matched by both stemming and lemmatization.

The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization can in principle select the appropriate lemma depending on the context.
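The contrast described above can be illustrated with a deliberately tiny sketch: a suffix-chopping rule on one side and a dictionary lookup on the other. Both functions and the `TOY_LEMMAS` table are hypothetical toys written for this illustration. They are far cruder than the Porter stemmer or the WordNet lemmatizer, and they ignore context entirely.

```python
# Toy dictionary standing in for a real vocabulary (illustration only).
TOY_LEMMAS = {"better": "good", "am": "be", "are": "be", "is": "be", "cars": "car"}

def toy_stem(word):
    """Chop a known suffix off the end if at least 3 characters remain."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    return TOY_LEMMAS.get(word, word)

for w in ["walking", "organizes", "organizing", "better", "are"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemma(w))
```

The stemmer maps "organizes" and "organizing" to the same non-word "organiz", while only the dictionary lookup connects "better" with "good".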



In [None]:

'''
Stemming
'''

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Alternative stemmers to try:
#stemmer = LancasterStemmer()
#stemmer = SnowballStemmer('english')
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print(stems)



In [None]:

'''
Lemmatization
'''
#nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('are'))
print(lemmatizer.lemmatize('is'))
print(lemmatizer.lemmatize("bats"))
print(lemmatizer.lemmatize("feet"))
print(lemmatizer.lemmatize('is', pos='n'))
print(lemmatizer.lemmatize('is', pos='v'))



In [None]:
# Sometimes the same word can have multiple lemmas depending on its meaning / context.
print(lemmatizer.lemmatize("stripes", 'v'))  
print(lemmatizer.lemmatize("stripes", 'n'))  
 

In [None]:
# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

Notice that it did not do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’ as expected. This can be corrected by providing the correct part-of-speech (POS) tag as the second argument to lemmatize().

In [None]:
from nltk.corpus import wordnet
# Lemmatize with POS Tag
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
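`get_wordnet_pos` tags each word in isolation, so the tagger never sees the surrounding sentence. A possible alternative, sketched below, is to tag the whole token list once and convert each Penn Treebank tag to the single-letter POS codes that `lemmatize()` accepts (`'a'`, `'n'`, `'v'`, `'r'`, the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`). The helper names are hypothetical. The lemmatizer is passed in as a plain callable so the sketch runs without NLTK data, and `fake_lemmatize` is a stand-in used only for the demonstration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[:1], "n")

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Apply lemmatize(word, pos) to every (word, penn_tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Stand-in lemmatizer for the demonstration only (not WordNet).
def fake_lemmatize(word, pos):
    toy = {("are", "v"): "be", ("hanging", "v"): "hang", ("bats", "n"): "bat"}
    return toy.get((word, pos), word)

# Hand-written tags, not tagger output.
hand_tagged = [("bats", "NNS"), ("are", "VBP"), ("hanging", "VBG")]
print(lemmatize_tagged(hand_tagged, fake_lemmatize))
```

With NLTK and its data installed, the same helper could be called as `lemmatize_tagged(nltk.pos_tag(nltk.word_tokenize(sentence)), lemmatizer.lemmatize)`.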
