NLTK (www.nltk.org/) is one of the most widely used Python packages for natural language processing. It provides corpora along with tools for tokenizing and categorizing text, analyzing linguistic structure, and more.

In [None]:
import nltk
nltk.download('punkt')
# Tokenization
sent_ = "I am almost dead this time"
tokens_ = nltk.word_tokenize(sent_)
tokens_

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['I', 'am', 'almost', 'dead', 'this', 'time']
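To make the idea of tokenization concrete, here is a minimal sketch that uses only the standard library. It is not NLTK's algorithm (`word_tokenize` handles contractions, abbreviations, and many other cases); the helper name `simple_tokenize` is made up for this illustration, and it only splits words from single punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sent_ = "I am almost dead this time"
print(simple_tokenize(sent_))
print(simple_tokenize("Wait, what?"))
```

For the plain sentence above this gives the same six tokens as `nltk.word_tokenize`; the difference between the two shows up on harder input such as contractions.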

In [None]:
# Make sure the WordNet corpus is downloaded, if not done already
# import nltk
nltk.download('wordnet')
# Look up the synsets (sets of synonyms) that contain a word
from nltk.corpus import wordnet
word_ = wordnet.synsets("spectacular")
print(word_)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[Synset('spectacular.n.01'), Synset('dramatic.s.02'), Synset('spectacular.s.02'), Synset('outstanding.s.02')]


In [6]:
# Print the definition of each synset
for synset in word_:
    print(synset.definition())

a lavishly produced performance
sensational in appearance or thrilling in effect
characteristic of spectacles or drama
having a quality that thrusts itself into attention


**Lemmatization**

In linguistics, lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

In [7]:
# Lemmatization (uses the 'wordnet' resource)
from nltk.stem import WordNetLemmatizer
# Create the lemmatizer object
lemmatizer = WordNetLemmatizer()
# With no pos argument, the word is treated as a noun
print(lemmatizer.lemmatize("decreases"))

decrease
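The "grouping inflected forms" part of the definition can be sketched with a toy lookup table. Everything here is an assumption made for illustration: `LEMMA_TABLE`, `toy_lemmatize`, and `group_by_lemma` are hypothetical names, and the table is hand-written, whereas a real lemmatizer such as `WordNetLemmatizer` consults a full dictionary.

```python
from collections import defaultdict

# Hypothetical, hand-written lookup table for illustration only.
LEMMA_TABLE = {
    "decreases": "decrease",
    "decreased": "decrease",
    "decreasing": "decrease",
    "walks": "walk",
    "walked": "walk",
}

def toy_lemmatize(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word.lower())

def group_by_lemma(words):
    """Group inflected forms under their shared lemma."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_lemmatize(word)].append(word)
    return dict(groups)

print(group_by_lemma(["decreases", "walked", "decreasing", "walks", "beach"]))
```

Words missing from the table fall through unchanged, which is also what happens with a dictionary-based lemmatizer when it does not know a word.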


**POS Tagging**

In corpus linguistics, part-of-speech tagging (also called grammatical tagging) is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context.

In [8]:
from textblob import TextBlob

# Define a sample text
text = '''How about you and I go together on a walk far away from this place, discussing the things we have never discussed on Deep Learning and Natural Language Processing.'''
# Wrap it in a TextBlob object
blob_ = TextBlob(text)
blob_

TextBlob("How about you and I go together on a walk far away from this place, discussing the things we have never discussed on Deep Learning and Natural Language Processing.")

In [10]:
# Tagging relies on NLTK's 'punkt' and 'averaged_perceptron_tagger' resources;
# download them before running this cell if they are missing:
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# Alternatively, run this separately: python3.6 -m textblob.download_corpora
blob_.tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('How', 'WRB'),
 ('about', 'IN'),
 ('you', 'PRP'),
 ('and', 'CC'),
 ('I', 'PRP'),
 ('go', 'VBP'),
 ('together', 'RB'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('walk', 'NN'),
 ('far', 'RB'),
 ('away', 'RB'),
 ('from', 'IN'),
 ('this', 'DT'),
 ('place', 'NN'),
 ('discussing', 'VBG'),
 ('the', 'DT'),
 ('things', 'NNS'),
 ('we', 'PRP'),
 ('have', 'VBP'),
 ('never', 'RB'),
 ('discussed', 'VBN'),
 ('on', 'IN'),
 ('Deep', 'NNP'),
 ('Learning', 'NNP'),
 ('and', 'CC'),
 ('Natural', 'NNP'),
 ('Language', 'NNP'),
 ('Processing', 'NNP')]
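The tags above follow the Penn Treebank convention, in which every noun tag starts with `NN` and every verb tag starts with `VB`. A short sketch of how that can be used to filter a tagged list follows; the helper name `words_with_tag_prefix` is made up for this example, and the input is a handful of pairs copied from the output above rather than a live call to the tagger.

```python
from collections import Counter

# A few (word, tag) pairs copied from the tagger output above.
tags = [("you", "PRP"), ("go", "VBP"), ("walk", "NN"), ("place", "NN"),
        ("things", "NNS"), ("discussed", "VBN"), ("Deep", "NNP"), ("Learning", "NNP")]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tags, "NN")   # NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tags, "VB")   # VB, VBP, VBN, VBG, ...
tag_counts = Counter(tag for _, tag in tags)

print(nouns)
print(verbs)
print(tag_counts.most_common(2))
```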

**Named Entity Recognition**

Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. 

In [11]:
import spacy
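# If spaCy or the model is missing, install them first:
#!pip install spacy
#!python -m spacy download en_core_web_sm
# spaCy 3 removed the "en" shortcut, so load the model by its full name
nlp = spacy.load("en_core_web_sm")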
william_wikidef = """William was the son of King William II and Anna Pavlovna of Russia. On the abdication of his grandfatherWilliam I in 1840, he became the Prince of Orange. On the death of his father in 1849, he succeeded as king of the Netherlands. William married his cousin Sophie of Württemberg in 1839 and they had three sons, William, Maurice, and Alexander, all of whom predeceased him. """
nlp_william = nlp(william_wikidef)
# label_ is the entity type as a string; label is its integer ID
print([(ent, ent.label_, ent.label) for ent in nlp_william.ents])

[(William, 'PERSON', 380), (King William II, 'PERSON', 380), (Anna Pavlovna, 'PERSON', 380), (Russia, 'GPE', 384), (1840, 'DATE', 391), (the Prince of Orange, 'LOC', 385), (1849, 'DATE', 391), (Netherlands, 'GPE', 384), (William, 'PERSON', 380), (Sophie, 'PERSON', 380), (Württemberg, 'GPE', 384), (1839, 'DATE', 391), (three, 'CARDINAL', 397), (William, 'PERSON', 380), (Maurice, 'PERSON', 380), (Alexander, 'PERSON', 380)]
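A flat list of entities is often easier to read once it is grouped by category. The sketch below does that with plain Python on a few (text, label) pairs copied from the output above, so it runs without spaCy; `group_entities` is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above.
entities = [("William", "PERSON"), ("King William II", "PERSON"),
            ("Anna Pavlovna", "PERSON"), ("Russia", "GPE"), ("1840", "DATE"),
            ("1849", "DATE"), ("Netherlands", "GPE"), ("William", "PERSON")]

def group_entities(pairs):
    """Map each label to its entity texts, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

for label, texts in group_entities(entities).items():
    print(label, texts)
```

With a real `Doc`, the same function could be fed `[(ent.text, ent.label_) for ent in doc.ents]`.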


**Co-reference resolution**

In linguistics, coreference (sometimes written co-reference) occurs when two or more expressions in a text refer to the same person or thing, that is, they have the same referent. For example, in "Bill said he would come", the proper noun "Bill" and the pronoun "he" refer to the same person, namely Bill.

In [12]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the displaCy library
from spacy import displacy

In [13]:
# Create a simple Doc object
doc = nlp(u"Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary.")

In [14]:
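# Note: en_core_web_sm has no coreference component, so this cell does not
# resolve the pronouns ("He", "he", "his"); it only renders the dependency
# parse inside Jupyter: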
displacy.render(doc, style='dep', jupyter=True, options={'distance': 80})

**Parsing**

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars, meaning part.

In [16]:
# Reusing the PoS tagging approach shown earlier:
import nltk
from nltk import word_tokenize

sentence = "A very beautiful young lady is walking on the beach"

# Tokenize the sentence into words
tokenized_words = word_tokenize(sentence)

# pos_tag takes the whole token list, so one call is enough (no loop needed)
tagged_words = nltk.pos_tag(tokenized_words)

tagged_words

[('A', 'DT'),
 ('very', 'RB'),
 ('beautiful', 'JJ'),
 ('young', 'JJ'),
 ('lady', 'NN'),
 ('is', 'VBZ'),
 ('walking', 'VBG'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('beach', 'NN')]

In [17]:
# Extracting noun phrases from the text (chunking):

# ? - optional (0 or 1 occurrence)
# * - 0 or more repetitions
# NP = optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
# Create the parser
parser = nltk.RegexpParser(grammar)

# Parse the tagged words
output = parser.parse(tagged_words)
print(output)

# To visualize:
#output.draw()

(S
  A/DT
  very/RB
  (NP beautiful/JJ young/JJ lady/NN)
  is/VBZ
  walking/VBG
  on/IN
  (NP the/DT beach/NN))
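One way to picture what the tag pattern `<DT>?<JJ>*<NN>` does is to write the tags out as a single string and run an ordinary regular expression over it. The sketch below is a simplified standard-library illustration under that assumption, not a replacement for `nltk.RegexpParser`; `NP_PATTERN` and `chunk_noun_phrases` are names invented here.

```python
import re

tagged_words = [("A", "DT"), ("very", "RB"), ("beautiful", "JJ"), ("young", "JJ"),
                ("lady", "NN"), ("is", "VBZ"), ("walking", "VBG"), ("on", "IN"),
                ("the", "DT"), ("beach", "NN")]

# The tag pattern <DT>?<JJ>*<NN> written as a regex over a string of bracketed tags.
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*<NN>")

def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks as lists of words."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag contributes exactly one '<', so counting '<' characters
        # before an offset converts a character position into a token index.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

print(chunk_noun_phrases(tagged_words))
```

The determiner "A" is left out of the first chunk for the same reason as in the NLTK output: the adverb "very" sits between it and the adjectives, and the pattern has no slot for an `RB` tag.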


In [18]:
# Chinking example:
# * - 0 or more repetitions
# + - 1 or more repetitions

# Here we first chunk the whole sentence ({<.*>+}) and then
# exclude (chink) runs of adjectives (}<JJ>+{) from that chunk.

grammar = r""" NP: {<.*>+} 
               }<JJ>+{"""

#Creating parser :
parser = nltk.RegexpParser(grammar)

#parsing string :
output = parser.parse(tagged_words)
print(output)

#To visualize :
#output.draw()

(S
  (NP A/DT very/RB)
  beautiful/JJ
  young/JJ
  (NP lady/NN is/VBZ walking/VBG on/IN the/DT beach/NN))
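Chinking can be read as "start with one big chunk, then cut out the parts you do not want". The sketch below reproduces that idea for this sentence with `itertools.groupby`; it is a simplified illustration rather than NLTK's implementation, and `chink_adjectives` is a hypothetical helper that handles only a single tag.

```python
from itertools import groupby

tagged_words = [("A", "DT"), ("very", "RB"), ("beautiful", "JJ"), ("young", "JJ"),
                ("lady", "NN"), ("is", "VBZ"), ("walking", "VBG"), ("on", "IN"),
                ("the", "DT"), ("beach", "NN")]

def chink_adjectives(tagged, chink_tag="JJ"):
    """Treat the sentence as one chunk, then cut out runs of `chink_tag` tokens."""
    result = []
    for is_chink, group in groupby(tagged, key=lambda pair: pair[1] == chink_tag):
        words = [word for word, _ in group]
        if is_chink:
            result.extend(words)   # chinked tokens stay outside any chunk
        else:
            result.append(words)   # each remaining run becomes its own chunk
    return result

print(chink_adjectives(tagged_words))
```

Chunks appear as lists and chinked words as bare strings, mirroring the `(NP ...)` groups and the loose `beautiful/JJ` and `young/JJ` tokens in the tree above.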
