# NLTK

## Common Tasks in Natural Language Processing

#### Breaking down text into words and sentences - TOKENIZATION
#### Identifying the type of word [noun, verb, etc.] - PART-OF-SPEECH TAGGING
#### Identifying commonly occurring words / groups of words - FREQUENCY, N-grams
#### Filtering common words such as 'The', 'A', 'AN' - STOPWORD REMOVAL
#### Understanding the context in which a word occurs - WORD SENSE DISAMBIGUATION
#### Reduce a word to its base form - STEMMING


#### A large body of text, such as a collection of news articles - CORPUS
#### A dictionary / collection of dictionaries - LEXICAL RESOURCE



### UNDERSTANDING THE CONTEXT OF A TEXT
#### concordance() - Displays all occurrences of a word along with its surrounding context.
#### similar() - Displays words that appear in similar contexts; these are often, but not always, near-synonyms
#### common_contexts() - Displays the contexts shared by 2 or more words
#### dispersion_plot() - Draws a plot showing where each occurrence of a word falls, measured as an offset from the beginning of the text
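
To make the idea behind concordance() concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation: the helper name `concordance_lines`, the `width` parameter, and the sample sentence are all made up for illustration, and the window is counted in tokens rather than characters.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Case-insensitive match (a choice made for this sketch)
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, split on whitespace for simplicity
tokens = "the monstrous whale swam past a most monstrous ship".split()
for left, word, right in concordance_lines(tokens, "monstrous", width=2):
    print(f"{left:>12} [{word}] {right}")
```

Each printed line shows one occurrence with its neighbours, which is the same kind of view concordance() gives over a full text.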

### Processing Text

#### sent_tokenize(), word_tokenize() - TOKENIZE TEXT INTO LISTS OF SENTENCES (OR) LISTS OF WORDS
#### stopwords.words() - GET A LIST OF STOPWORDS - COMMONLY OCCURRING WORDS
#### bigrams(), ngrams() - GENERATE BIGRAMS [PAIRS OF WORDS] OR N-GRAMS [GROUPS OF N-WORDS] FOR A SENTENCE OR A TEXT
#### collocations() - FIND BIGRAMS THAT OCCUR TOGETHER UNUSUALLY OFTEN
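
As a rough illustration of what bigrams and n-grams are, the sketch below builds them in plain Python and counts the pairs. The helper `make_ngrams` and the sample sentence are assumptions made for this example and are not part of NLTK. Note that collocations() filters and scores candidate pairs rather than only counting them, so the raw count here is just a first approximation of that idea.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the token list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text, split on whitespace for simplicity
tokens = "the cat sat on the mat and the cat slept".split()

bigram_list = make_ngrams(tokens, 2)   # pairs of adjacent words
trigram_list = make_ngrams(tokens, 3)  # groups of three adjacent words

# Counting bigrams is a crude stand-in for finding common word pairs
bigram_counts = Counter(bigram_list)
print(bigram_counts.most_common(1))
```

In this sample the pair ("the", "cat") is the only bigram that occurs twice, so it comes out on top.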


In [1]:
# Install the nltk package (only needed once per environment)
!pip install nltk



In [2]:
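# Import nltk first, then download the datasets.
# nltk.download() opens the interactive downloader; the "book" collection
# contains the texts loaded by nltk.book below.
import nltk
nltk.download()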

In [None]:
from nltk.book import *

In [None]:
text1.collocations()

In [None]:
text1.concordance("monstrous")

In [None]:
text2.concordance("monstrous")

In [None]:
text2.common_contexts(["monstrous", "very"])
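
The idea behind common_contexts() can be sketched in plain Python as follows. This is a simplified illustration, not NLTK's code: the helper `contexts` and the sample sentence are hypothetical, a "context" is taken to be the pair (previous word, next word), and occurrences at the very start or end of the text are skipped.

```python
def contexts(tokens, word):
    """Collect (previous, next) token pairs around each occurrence of `word`."""
    found = set()
    # Skip the first and last token because they lack a neighbour on one side
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


# Hypothetical sample text, split on whitespace for simplicity
tokens = "a very good day and a monstrous good storm was very bad".split()

# Contexts in which both words occur
shared = contexts(tokens, "very") & contexts(tokens, "monstrous")
print(shared)
```

Both words appear between "a" and "good", so that pair is the one shared context in this sample.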

In [None]:
print(" ".join(text4[:100]))

In [None]:
# Let's see how presidents' use of certain words in their inaugural addresses has changed over the years.

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

In [None]:
# Let's compare the emotions expressed in Herman Melville's Moby Dick (text1)
# with those in Jane Austen's Sense and Sensibility (text2)
text1.dispersion_plot(["happy", "sad"])

In [None]:
text2.dispersion_plot(["happy", "sad"])
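
A dispersion plot is drawn from word offsets, that is, the positions at which each word occurs in the token sequence. The plain-Python sketch below computes those offsets for a toy sentence. The helper `word_offsets` and the sample text are assumptions for illustration only, and the plotting step is left out.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Hypothetical sample text, split on whitespace for simplicity
tokens = "happy days were sad days but happy again".split()
print(word_offsets(tokens, ["happy", "sad"]))
```

Plotting each list of positions on its own row, with the x-axis running from the first token to the last, gives the kind of picture dispersion_plot() produces.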

In [None]:
# tokenization
# splitting a piece of text into sentences and words

from nltk.tokenize import word_tokenize, sent_tokenize
text = "Mary had a little lamb. Her fleece was white as snow."
sents = sent_tokenize(text)
print(text)
print(sents)

words = [word_tokenize(sent) for sent in sents]
print(words)

In [None]:
# filter out the stopwords and punctuation
from nltk.corpus import stopwords
from string import punctuation

customStopWords = set(stopwords.words('english') + list(punctuation))
# The stopword list is lowercase, so compare lowercased tokens (otherwise "Her" slips through)
wordsWOStopWords = [word for word in word_tokenize(text) if word.lower() not in customStopWords]
wordsWOStopWords

In [None]:
# stemming - the process of removing morphological affixes from words,
# leaving only the word stem.

from nltk.stem.lancaster import LancasterStemmer

# Note: this rebinds text2, which until now referred to the text loaded from nltk.book
text2 = "Mary closed on closing night when she was in the mood to close."
stemmer = LancasterStemmer()
words = word_tokenize(text2)
print(words)
stemmedWords = [stemmer.stem(x) for x in words]
print(stemmedWords)

In [None]:
# NLTK can automatically tag words with their part of speech (noun, verb, conjunction, etc.)
from nltk import pos_tag

# Tag the whole sentence at once so the tagger can use each word's context
pt = pos_tag(word_tokenize(text2))

# Then drop stopwords and punctuation from the tagged result
wordParts = [(word, tag) for word, tag in pt if word not in customStopWords]
wordParts