# Getting started with NLP (natural language processing) in Python

In this notebook I use the NLTK library to perform some basic NLP tasks.
**NLTK (Natural Language Toolkit)** is a Python library with built-in functionality for tasks such as:

1. Tokenization
2. Stopword removal
3. N-Grams
4. Stemming
5. Part-of-speech tagging
6. Word sense disambiguation

I'll give a short explanation of each task and an example of its use, along with the NLTK modules needed to perform it.

Let's get started!




## Tokenizing text

Tokenization is the process of breaking text down into smaller pieces: words or sentences.

In [1]:
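import nltk

# Download the data packages used throughout this notebook
nltk.download('punkt')  # sentence tokenizer models
nltk.download('stopwords')  # stopword lists
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger model
nltk.download('wordnet')  # WordNet lexical database
nltk.download('omw-1.4')  # Open Multilingual WordNet data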

[nltk_data] Downloading package punkt to
[nltk_data]     /home/c75d4454-e391-4890-b9d7-5d2bf0a84bec/nltk_data..
[nltk_data]     .
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/c75d4454-e391-4890-b9d7-5d2bf0a84bec/nltk_data..
[nltk_data]     .
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/c75d4454-e391-4890-b9d7-5d2bf0a84bec/nltk_data..
[nltk_data]     .
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/c75d4454-e391-4890-b9d7-5d2bf0a84bec/nltk_data..
[nltk_data]     .
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/c75d4454-e391-4890-b9d7-5d2bf0a84bec/nltk_data..
[nltk_data]     .
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
text="Mary had a little lamb. Her fleece was white as snow"

# Import functions to break text into words/sentences from nltk.tokenize module

from nltk.tokenize import word_tokenize, sent_tokenize

sents=sent_tokenize(text)
print(sents)


['Mary had a little lamb.', 'Her fleece was white as snow']


In [3]:
words = [word_tokenize(sent) for sent in sents]
print(words)

# Punctuation is treated as an individual token

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'fleece', 'was', 'white', 'as', 'snow']]
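
To make the idea concrete without relying on NLTK's models, here is a rough sketch that uses only regular expressions. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are my own, not NLTK functions, and the rules are far cruder than NLTK's (for instance, an abbreviation such as "Dr." would wrongly end a sentence).

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Mary had a little lamb. Her fleece was white as snow"
sentences = simple_sent_tokenize(text)
tokens = [simple_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

For this particular text the sketch gives the same sentences and tokens as the NLTK cells above, but for real data I would stick with `sent_tokenize` and `word_tokenize`.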


## Stopword removal

The process of removing very common words that carry little meaning on their own (such as "a", "was", "her").

In [4]:
from nltk.corpus import stopwords
from string import punctuation

# creating a custom set of stopwords (a set gives fast membership tests, and the order doesn't matter)

customStopWords=set(stopwords.words('english')+list(punctuation))


In [5]:
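# The membership test is case-sensitive and NLTK's stopword list is lowercase,
# so the capitalized 'Her' is kept in the output
wordsWoStopwords=[word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWoStopwords)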

['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']
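
One way to also catch capitalized stopwords is to lowercase each token just for the membership test. The sketch below is self-contained: `demo_stopwords` is a small hand-picked subset I made up for illustration, not NLTK's actual list, and the token list is typed in by hand.

```python
from string import punctuation

# Hypothetical, hand-picked stopword subset for illustration only;
# NLTK's stopwords.words('english') is much longer.
demo_stopwords = {"a", "had", "her", "was", "as"}
ignored = demo_stopwords | set(punctuation)

tokens = ["Mary", "had", "a", "little", "lamb", ".",
          "Her", "fleece", "was", "white", "as", "snow"]

# Lowercase each token only for the comparison, so the original casing is kept
filtered = [tok for tok in tokens if tok.lower() not in ignored]
print(filtered)
```

With the real NLTK list the same pattern would be `word.lower() not in customStopWords`.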


## Identifying N-grams

N-grams are sequences of n words that occur consecutively. The most common case is bigrams (pairs of adjacent words, e.g. "New York").

In [6]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# association measures for scoring bigrams (not used further in this example)
bigram_measures = BigramAssocMeasures()

# constructs bigrams from the list of words, BigramCollocationFinder is a built-in class

finder = BigramCollocationFinder.from_words(wordsWoStopwords)

# displays bigrams and their frequencies, sorted alphabetically by bigram (uppercase first), not by frequency

sorted(finder.ngram_fd.items())

[(('Her', 'fleece'), 1),
 (('Mary', 'little'), 1),
 (('fleece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
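
To show what "consecutive words" means for any n, here is a minimal sketch in plain Python. The `ngrams` helper is my own (NLTK ships its own utilities for this), and the word list is copied by hand from the stopword-removal output above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["Mary", "little", "lamb", "Her", "fleece", "white", "snow"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

# Count how often each bigram occurs and list the most frequent first
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(3))
print(trigrams[0])
```

`Counter.most_common` is what gives a frequency-ordered view. In a text this short every bigram occurs exactly once, so the ordering is not very informative.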

## Stemming

The process of reducing a word to its root so that variants with different endings are treated as the same word (e.g. "close", "closer", "closing", "closed").


In [7]:
text2="Mary closed on closing night when she was in the mood to close."

from nltk.stem.lancaster import LancasterStemmer

st=LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']
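
To give a feel for what a stemmer does, here is a deliberately naive suffix-stripping sketch. It is a toy of my own and not the Lancaster algorithm, which applies a much larger table of rules. The suffix list and the minimum stem length of 3 are arbitrary assumptions.

```python
def toy_stem(word, suffixes=("ing", "ed", "er", "e", "s")):
    """Lowercase the word and strip the first matching suffix,
    as long as at least 3 letters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["close", "closer", "closing", "closed"]
stems = [toy_stem(w) for w in variants]
print(stems)
```

All four variants collapse to the same stem, which is the whole point of stemming. A rule set this small would mangle many other words, so a real stemmer such as `LancasterStemmer` is the better choice in practice.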


## Part-of-speech tagging

The process of identifying whether a word is a noun, adjective, verb, etc.

In [8]:
# pos_tag is a built-in function that tags each token with its part of speech

nltk.pos_tag(word_tokenize(text2))

[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB'),
 ('.', '.')]
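
The tags follow the Penn Treebank convention, where noun tags start with `NN` and verb tags start with `VB`. As a small sketch of how the result can be used, the snippet below picks out nouns and verbs and groups words by tag. The `tagged` list is copied by hand from the output above so that the example runs on its own.

```python
from collections import defaultdict

# Tagged tokens copied from the pos_tag output above
tagged = [("Mary", "NNP"), ("closed", "VBD"), ("on", "IN"), ("closing", "NN"),
          ("night", "NN"), ("when", "WRB"), ("she", "PRP"), ("was", "VBD"),
          ("in", "IN"), ("the", "DT"), ("mood", "NN"), ("to", "TO"),
          ("close", "VB"), (".", ".")]

# Noun tags start with "NN", verb tags start with "VB"
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

# Group all words by their exact tag
by_tag = defaultdict(list)
for word, tag in tagged:
    by_tag[tag].append(word)

print(nouns)
print(verbs)
```

Note how "closing" lands among the nouns and "closed" and "close" among the verbs, even though the stemmer treats all three as the same word.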

## Word sense disambiguation

The process of identifying a word's meaning based on its context.


In [9]:
# WordNet is a lexical database (a little like a thesaurus)

# a synset represents a single sense (definition) of a word

from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


In [10]:
# lesk is an algorithm for word sense disambiguation

from nltk.wsd import lesk
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"), "bass")
print(sense1, sense1.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments


In [11]:
sense2 = lesk(word_tokenize("This sea bass was really hard to catch"), "bass")
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
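
The core idea behind Lesk is to pick the sense whose definition shares the most words with the context. Below is a simplified sketch of that idea in plain Python. It is not NLTK's implementation: the `simplified_lesk` and `tokenize` helpers are my own, only three of the glosses listed above are included, and the second test sentence is one I made up.

```python
import re

# Glosses copied from the wn.synsets('bass') output above (three senses only)
GLOSSES = {
    "bass.n.07": "the member with the lowest range of a family of musical instruments",
    "sea_bass.n.01": "the lean flesh of a saltwater fish of the family Serranidae",
    "bass.n.03": "an adult male singer with the lowest voice",
}

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))

def simplified_lesk(context_sentence, glosses):
    """Return the sense whose gloss shares the most words with the context.
    On a tie, the first sense in the dictionary wins."""
    context = tokenize(context_sentence)
    return max(glosses, key=lambda sense: len(context & tokenize(glosses[sense])))

music = simplified_lesk("Sing in a lower tone, along with the bass", GLOSSES)
# Hypothetical sentence, chosen so that it overlaps with the fish gloss
fish = simplified_lesk("The bass is a saltwater fish with lean flesh", GLOSSES)
print(music, fish)
```

Because the choice rests purely on word overlap, short sentences often produce small or tied scores, so the selected sense is worth checking by eye.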
