In [1]:
import nltk

# 1. Tokenization

#### One common task in NLP (Natural Language Processing) is tokenization. "Tokens" are usually individual words (at least in languages like English), and "tokenization" means taking a text or set of texts and breaking it up into its individual words. These tokens are then used as the input for other types of analysis or tasks, like parsing (automatically tagging the syntactic relationship between words).

<img src = 'NLP.png'>

## 1a. New libraries used in this section

In [2]:
from nltk import word_tokenize
from nltk import sent_tokenize

In [3]:
example_line = 'This is super awesome!'
print(word_tokenize(example_line))

['This', 'is', 'super', 'awesome', '!']


In [4]:
example_paragraph = 'This is super awesome! Even though I have no idea what I am doing! But still!'
print(sent_tokenize(example_paragraph))

['This is super awesome!', 'Even though I have no idea what I am doing!', 'But still!']
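#### To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's re module. The function names simple_word_tokenize and simple_sent_tokenize are made up for this illustration; this is not how NLTK implements word_tokenize or sent_tokenize, only a rough approximation that happens to agree with them on the two examples above.

```python
import re


def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]


line = 'This is super awesome!'
paragraph = 'This is super awesome! Even though I have no idea what I am doing! But still!'

print(simple_word_tokenize(line))
print(simple_sent_tokenize(paragraph))
```

#### Note that this naive sentence splitter also breaks after abbreviations such as 'Mr.', which is one reason to prefer a trained tokenizer like sent_tokenize on real text.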


# 2. Stopwords

#### While removing stopwords reduces noise in a sentence, note that it can completely change the sentence's meaning. Take, for example, the word 'not', which is in NLTK's default English stopword list; you might want to take it out of the list. To do so, locate the NLTK data directory: nltk.download() shows the download directory associated with the NLTK package (for example, 'C:\Users\esnxwng\AppData\Roaming\nltk_data'). From there, go to nltk_data -> corpora -> stopwords -> english and open the file in a text editor, where you can add or remove stop words.

## 2a. New libraries used in this section

In [5]:
from nltk.corpus import stopwords

In [6]:
stop_words = set(stopwords.words('english'))

example_line = 'I am not a big fan of Twitter Analysis!'

words = word_tokenize(example_line)

for word in words:
    
    if word not in stop_words:
        
        print(word)

I
not
big
fan
Twitter
Analysis
!
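#### As an alternative to editing the stopword file on disk, the set can be adjusted in code. The sketch below uses a tiny hand-written set (sample_stop_words, purely illustrative) so that it runs without any downloads; in the notebook you would start from set(stopwords.words('english')) instead. It also compares tokens in lowercase, which is why 'I' is removed here but kept in the output above: the NLTK list is lowercase.

```python
# A tiny, hand-written stop word set for illustration only (an assumption);
# the real list would come from set(stopwords.words('english')).
sample_stop_words = {'i', 'am', 'not', 'a', 'of'}

# Keep negations by removing them from the set instead of editing the corpus file.
custom_stop_words = sample_stop_words - {'not'}


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ['I', 'am', 'not', 'a', 'big', 'fan', 'of', 'Twitter', 'Analysis', '!']
filtered = remove_stop_words(tokens, custom_stop_words)
print(filtered)
```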


# 3. Part of Speech Tagging

#### In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

In [7]:
example_paragraph = 'Thank you very much. Mr. Speaker, Mr. President, distinguished members of Congress, honored guests and fellow citizens. May I congratulate all of you who are members of this historic 100th Congress of the United States of America. In this 200th anniversary year of our Constitution, you and I stand on the shoulders of giants–men whose words and deeds put wind in the sails of freedom.'

In [8]:
tagged = []
words = []

# Step 1: sent_tokenize breaks the paragraph into sentences
for lines in sent_tokenize(example_paragraph):

    # Step 2: word_tokenize breaks each sentence into words
    for word in word_tokenize(lines):

        # Each word is appended to the list called words
        words.append(word)

# Part-of-speech tagging then labels each word with its grammatical category
tagged.append(nltk.pos_tag(words))
print(tagged)

[[('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Mr.', 'NNP'), ('President', 'NNP'), (',', ','), ('distinguished', 'VBD'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('honored', 'VBD'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('.', '.'), ('May', 'NNP'), ('I', 'PRP'), ('congratulate', 'VBP'), ('all', 'DT'), ('of', 'IN'), ('you', 'PRP'), ('who', 'WP'), ('are', 'VBP'), ('members', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('historic', 'JJ'), ('100th', 'JJ'), ('Congress', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('of', 'IN'), ('America', 'NNP'), ('.', '.'), ('In', 'IN'), ('this', 'DT'), ('200th', 'CD'), ('anniversary', 'JJ'), ('year', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('Constitution', 'NNP'), (',', ','), ('you', 'PRP'), ('and', 'CC'), ('I', 'PRP'), ('stand', 'VBP'), ('on', 'IN'), ('the', 'DT'), ('shoulders', 'NNS'), ('o
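#### The tagger returns (word, tag) tuples, where the tags follow the Penn Treebank convention (noun tags start with 'NN', adverb tags with 'RB', and so on). As a small sketch of how such output might be used, the snippet below filters a handful of pairs copied from the output above; the helper name words_with_tag_prefix is hypothetical.

```python
# A few (word, tag) pairs copied from the tagger output above.
tagged_sample = [
    ('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'),
    ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'),
    ('fellow', 'JJ'), ('citizens', 'NNS'),
]


def words_with_tag_prefix(tagged_words, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged_sample, 'NN')
adverbs = words_with_tag_prefix(tagged_sample, 'RB')
print(nouns)
print(adverbs)
```

#### 'Thank' shows up among the nouns because the tagger labelled it 'NNP' (proper noun) in the output above, a reminder that automatic tags are not always correct.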

# 4. Stemming & Lemmatization

#### For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

#### Similarities between stemming and lemmatization:- 

##### The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

#### Differences between stemming and lemmatization:- 

##### A. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

##### B. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

##### If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma.
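##### The difference described above can be sketched with two toy functions. Both are deliberately simplistic and hypothetical: toy_stem is not the Porter algorithm, and TOY_LEMMAS is a four-entry stand-in for a real vocabulary such as WordNet. The point is only that a stemmer chops endings by rule, while a lemmatizer looks the word up together with its part of speech.

```python
def toy_stem(word):
    """Crude suffix stripping: chop the first matching ending, no dictionary."""
    for suffix in ('izing', 'izes', 'ize', 'ing', 'es', 's'):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical mini-vocabulary keyed by (token, part of speech).
TOY_LEMMAS = {
    ('saw', 'verb'): 'see',
    ('saw', 'noun'): 'saw',
    ('organizes', 'verb'): 'organize',
    ('organizing', 'verb'): 'organize',
}


def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


stems = [toy_stem(w) for w in ('organize', 'organizes', 'organizing', 'this')]
print(stems)
print(toy_lemmatize('saw', 'verb'), toy_lemmatize('saw', 'noun'))
```

##### The rule-based stemmer collapses the three forms of organize to the same stem, but it also mangles 'this' into 'thi', while the lookup returns a real dictionary form that depends on the part of speech.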

## 4a. New libraries used in this section

In [14]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

## 4b. Stemming

In [10]:
ps = PorterStemmer()

stem_words = []

for word in words:
    
    stem_words.append(ps.stem(word))

In [11]:
print(stem_words)

['thank', 'you', 'veri', 'much', '.', 'mr.', 'speaker', ',', 'mr.', 'presid', ',', 'distinguish', 'member', 'of', 'congress', ',', 'honor', 'guest', 'and', 'fellow', 'citizen', '.', 'may', 'I', 'congratul', 'all', 'of', 'you', 'who', 'are', 'member', 'of', 'thi', 'histor', '100th', 'congress', 'of', 'the', 'unit', 'state', 'of', 'america', '.', 'In', 'thi', '200th', 'anniversari', 'year', 'of', 'our', 'constitut', ',', 'you', 'and', 'I', 'stand', 'on', 'the', 'shoulder', 'of', 'giants–men', 'whose', 'word', 'and', 'deed', 'put', 'wind', 'in', 'the', 'sail', 'of', 'freedom', '.']


## 4c. Lemmatization

In [15]:
lm = WordNetLemmatizer()

lemma_words = []

for word in words:
    
    lemma_words.append(lm.lemmatize(word))
    
print(lemma_words)

['Thank', 'you', 'very', 'much', '.', 'Mr.', 'Speaker', ',', 'Mr.', 'President', ',', 'distinguished', 'member', 'of', 'Congress', ',', 'honored', 'guest', 'and', 'fellow', 'citizen', '.', 'May', 'I', 'congratulate', 'all', 'of', 'you', 'who', 'are', 'member', 'of', 'this', 'historic', '100th', 'Congress', 'of', 'the', 'United', 'States', 'of', 'America', '.', 'In', 'this', '200th', 'anniversary', 'year', 'of', 'our', 'Constitution', ',', 'you', 'and', 'I', 'stand', 'on', 'the', 'shoulder', 'of', 'giants–men', 'whose', 'word', 'and', 'deed', 'put', 'wind', 'in', 'the', 'sail', 'of', 'freedom', '.']


# 5. Frequency Distribution

#### Sometimes we want to know how often a word occurs in an article, or which words are the most common (for example, the top 5). For this we use FreqDist() from NLTK.

In [12]:
words_dist = (nltk.FreqDist(stem_words))

In [13]:
print(words_dist.most_common(5))

[('of', 8), ('.', 4), (',', 4), ('you', 3), ('and', 3)]
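#### The top entries above are dominated by punctuation and function words. One possible refinement, sketched here with collections.Counter (nltk.FreqDist is a subclass of Counter, so most_common behaves the same way), is to filter those tokens out before counting. The stems list is a short hand-picked sample from the stemmer output, and small_stop_words is an illustrative assumption rather than the NLTK list.

```python
from collections import Counter
import string

# A short, hand-picked sample of stems from the stemmer output above.
stems = ['thank', 'you', 'veri', 'much', '.', 'member', 'of', 'congress', ',',
         'member', 'of', 'the', 'unit', 'state', 'of', 'america', '.']

# Illustrative stop word set (an assumption, not the NLTK list).
small_stop_words = {'of', 'the', 'you'}


def is_punctuation(token):
    """True if every character of the token is a punctuation mark."""
    return all(ch in string.punctuation for ch in token)


content_stems = [s for s in stems
                 if not is_punctuation(s) and s not in small_stop_words]

raw_counts = Counter(stems)
content_counts = Counter(content_stems)
print(raw_counts.most_common(1))
print(content_counts.most_common(1))
```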


# 6. WordNet

#### WordNet is a large lexical database of English. It groups words into synsets (sets of synonyms) and provides definitions, examples, synonyms, antonyms, and more.

## 6a. New libraries used in this section

In [16]:
from nltk.corpus import wordnet

In [17]:
words = wordnet.synsets('big')

In [20]:
print(words[0].definition())

above average in size or number or quantity or magnitude or extent


In [21]:
print(words[0].examples())

['a large city', 'set out for the big city', 'a large sum', 'a big (or large) barn', 'a large family', 'big businesses', 'a big expenditure', 'a large number of newspapers', 'a big group of scientists', 'large areas of the world']


#### A synonym is a word or phrase that means exactly or nearly the same as another lexeme in the same language. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy.

#### An antonym is a word of opposite meaning.

In [23]:
synonyms = []
antonyms = []

for synset in wordnet.synsets('big'):

    for lemma in synset.lemmas():

        synonyms.append(lemma.name())

        if lemma.antonyms():

            antonyms.append(lemma.antonyms()[0].name())
            
print('Synonyms: {}'.format(set(synonyms)))
print()
print('Antonyms: {}'.format(set(antonyms)))

Synonyms: {'adult', 'big', 'magnanimous', 'crowing', 'bounteous', 'grownup', 'prominent', 'braggy', 'bragging', 'self-aggrandizing', 'vauntingly', 'boastfully', 'fully_grown', 'liberal', 'cock-a-hoop', 'great', 'vainglorious', 'heavy', 'grown', 'swelled', 'handsome', 'with_child', 'openhanded', 'bighearted', 'enceinte', 'self-aggrandising', 'bountiful', 'boastful', 'bad', 'expectant', 'large', 'braggart', 'freehanded', 'gravid', 'full-grown', 'giving'}

Antonyms: {'little', 'small'}


#### The synsets we examined were for the word 'big', so the antonyms returned are 'little' and 'small'.
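#### In the synonym output above, multi-word expressions are joined with underscores ('fully_grown', 'with_child'), and wrapping the list in set() discards the order in which the names were found. A small post-processing sketch, using a hand-copied sample of names and a hypothetical helper clean_lemma_names, might look like this:

```python
# A hand-copied sample of lemma names, including duplicates, as they would
# accumulate in the synonyms list before set() is applied.
lemma_names = ['big', 'large', 'fully_grown', 'with_child', 'big', 'large', 'full-grown']


def clean_lemma_names(names):
    """Replace underscores with spaces and drop duplicates, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys de-duplicates without reordering.
    return list(dict.fromkeys(name.replace('_', ' ') for name in names))


cleaned = clean_lemma_names(lemma_names)
print(cleaned)
```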