Ultimate Guide to Understand & Implement Natural Language Processing (with codes in Python)

Table of Contents

    1. Introduction to NLP
    2. Text Preprocessing
        2.1 Noise Removal
        2.2 Lexicon Normalization
            2.2.1 Lemmatization
            2.2.2 Stemming
        2.3 Object Standardization
    3. Text to Features (Feature Engineering on text data)
        3.1 Syntactical Parsing
            3.1.1 Dependency Grammar
            3.1.2 Part of Speech Tagging
        3.2 Entity Parsing
            3.2.1 Phrase Detection
            3.2.2 Named Entity Recognition
            3.2.3 Topic Modelling
            3.2.4 N-Grams
        3.3 Statistical features
            3.3.1 TF – IDF
            3.3.2 Frequency / Density Features
            3.3.3 Readability Features
        3.4 Word Embeddings
    4. Important tasks of NLP
        4.1 Text Classification
        4.2 Text Matching
            4.2.1 Levenshtein Distance
            4.2.2 Phonetic Matching
            4.2.3 Flexible String Matching
        4.3 Coreference Resolution
        4.4 Other Problems
    5. Important NLP libraries


            1. Introduction to NLP
    
    Applications of NLP

    automatic summarization
    machine translation
    named entity recognition
    relationship extraction
    sentiment analysis
    speech recognition
    topic segmentation

            2. Text Preprocessing

        2.1 Noise Removal
    Step 1: Prepare a dictionary of noisy entities.
    Step 2: Iterate over the text object token by token (or word by word).
    Step 3: Eliminate the tokens that are present in the noise dictionary.

In [1]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

_remove_noise("this is a sample text")


'sample text'

In [2]:

# Sample code to remove a regex pattern 
import re 

def _remove_regex(input_text, regex_pattern):
    # re.sub removes every match of the pattern in a single pass
    return re.sub(regex_pattern, '', input_text)

regex_pattern = r"#[\w]*"  # raw string, so \w reaches the regex engine unchanged

_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)


'remove this  from analytics vidhya'

        2.2 Lexicon Normalization
        
       2.2.1 Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
       2.2.2 Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (a dictionary of known words) and morphological analysis (word structure and grammar relations).


In [3]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stem = PorterStemmer()
lem = WordNetLemmatizer()

word ="multiplying"

print(stem.stem(word))
print(lem.lemmatize(word,"v"))


multipli
multiply
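
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm used above: the suffix list, the minimum stem length and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix stripper: illustrates rule-based stemming, NOT the Porter algorithm.
# The suffix list and the minimum stem length are assumptions for this example.
SUFFIXES = ["ing", "ly", "es", "s"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["playing", "quickly", "boxes", "cats", "is", "string"]:
    print(w, "->", toy_stem(w))
```

Because such rules know nothing about vocabulary, "string" is cut down to "str" here. Lemmatization avoids this kind of error by consulting a dictionary.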


        2.3 Object Standardization

In [4]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        
        new_words.append(word) 
        print(new_words) # for understanding the process
    # join once after the loop, so new_text is defined even for empty input
    new_text = " ".join(new_words) 
    return new_text

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")


['Retweet']
['Retweet', 'this']
['Retweet', 'this', 'is']
['Retweet', 'this', 'is', 'a']
['Retweet', 'this', 'is', 'a', 'retweeted']
['Retweet', 'this', 'is', 'a', 'retweeted', 'tweet']
['Retweet', 'this', 'is', 'a', 'retweeted', 'tweet', 'by']
['Retweet', 'this', 'is', 'a', 'retweeted', 'tweet', 'by', 'Shivam']
['Retweet', 'this', 'is', 'a', 'retweeted', 'tweet', 'by', 'Shivam', 'Bansal']


'Retweet this is a retweeted tweet by Shivam Bansal'

        3.Text to Features (Feature Engineering on text data)
            3.1 Syntactic Parsing
                3.1.1 Dependency Trees
                3.1.2 Part of Speech Tagging (POS Tagging)
               

    3.1.2 Part of Speech Tagging (POS Tagging)

In [5]:
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(tokens,"\n")
tokens2 = re.split(r'\s',text)
print(tokens2,"\n")
print (pos_tag(tokens))
print (pos_tag(tokens2))
# splitting on whitespace with a regular expression fails in some places, for example: John's, isn't

['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'on', 'Analytics', 'Vidhya'] 

['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'on', 'Analytics', 'Vidhya'] 

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]
[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


In [6]:
import nltk
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/ashu6811/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [7]:

print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))

print(pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal'))

[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'), ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]


In [8]:
# A. Word sense disambiguation
# "book" is a verb in the first sentence and a noun in the second, but pos_tag tags it as a noun in both
# The Lesk algorithm comes to our rescue
print(pos_tag(word_tokenize("Please book my flight for Delhi"), tagset='universal'))
print(pos_tag(word_tokenize("I am going to read this book in the flight"),  tagset='universal'))

[('Please', 'NOUN'), ('book', 'NOUN'), ('my', 'PRON'), ('flight', 'NOUN'), ('for', 'ADP'), ('Delhi', 'NOUN')]
[('I', 'PRON'), ('am', 'VERB'), ('going', 'VERB'), ('to', 'PRT'), ('read', 'VERB'), ('this', 'DET'), ('book', 'NOUN'), ('in', 'ADP'), ('the', 'DET'), ('flight', 'NOUN')]
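
The Lesk idea mentioned above can be sketched in a few lines: choose the sense whose dictionary gloss shares the most words with the context around the target word. The sketch below is a simplified variant under stated assumptions. The two-sense inventory `TOY_SENSES`, the stopword list and the two-token context window are hand-written for this example and do not come from any library.

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with nearby words.
# TOY_SENSES is a hand-written toy inventory (an assumption); a real system
# would read glosses from a lexical resource such as WordNet.
TOY_SENSES = {
    "book": {
        "verb": "arrange or reserve a ticket seat room or flight in advance",
        "noun": "a written work of printed pages that people read",
    }
}
STOPWORDS = {"a", "an", "the", "of", "or", "in", "to", "my", "for", "i", "am", "this", "that"}

def simplified_lesk(word, sentence, window=2):
    tokens = sentence.lower().split()
    position = tokens.index(word)
    # context = up to `window` tokens on each side of the target word
    nearby = tokens[max(0, position - window):position] + tokens[position + 1:position + 1 + window]
    context = {t for t in nearby if t not in STOPWORDS}
    best_sense, best_overlap = None, -1
    for sense, gloss in TOY_SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("book", "Please book my flight for Delhi"))
print(simplified_lesk("book", "I am going to read this book in the flight"))
```

With these toy glosses the first sentence resolves to the verb sense (through "flight") and the second to the noun sense (through "read"). The result depends on the glosses and the window size chosen.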


        3.2 Entity Extraction (Entities as features)
            Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging and dependency parsing. Entity detection is applied in automated chatbots, content analyzers and consumer insights.
            3.2.1 Named Entity Recognition (NER)
            3.2.2 Topic Modeling

In [10]:
# TOPIC MODELING
# a separate notebook uploaded on GitHub explains topic modeling independently
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)
print(dictionary)
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
print("\nmatrix:",doc_term_matrix)
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print("\nldamodel:",ldamodel.print_topics())

Dictionary(34 unique tokens: ['not', 'my', 'My', 'suggest', 'of']...)

matrix: [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1)], [(1, 1), (5, 1), (8, 1), (12, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)], [(15, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)]]

ldamodel: [(0, '0.063*"to" + 0.036*"Sugar" + 0.036*"not" + 0.036*"but" + 0.036*"bad" + 0.036*"is" + 0.036*"sugar," + 0.036*"likes" + 0.036*"have" + 0.036*"consume."'), (1, '0.029*"driving" + 0.029*"sister" + 0.029*"My" + 0.029*"my" + 0.029*"to" + 0.029*"cause" + 0.029*"blood" + 0.029*"may" + 0.029*"suggest" + 0.029*"increased"'), (2, '0.053*"driving" + 0.053*"My" + 0.053*"sister" + 0.053*"my" + 0.053*"to" + 0.053*"practice." + 0.053*"lot" + 0.053*"of" + 0.053*"father" + 0.053*"dance"')]


        3.2.3 N-Grams as Features

In [16]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
generate_ngrams(doc1, 2)

[['Sugar', 'is'],
 ['is', 'bad'],
 ['bad', 'to'],
 ['to', 'consume.'],
 ['consume.', 'My'],
 ['My', 'sister'],
 ['sister', 'likes'],
 ['likes', 'to'],
 ['to', 'have'],
 ['have', 'sugar,'],
 ['sugar,', 'but'],
 ['but', 'not'],
 ['not', 'my'],
 ['my', 'father.']]

    3.3 Statistical Features
        3.3.1 Term Frequency – Inverse Document Frequency (TF – IDF)
        3.3.2 Count / Density / Readability Features
        

    3.3.1 Term Frequency – Inverse Document Frequency (TF – IDF)

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print (X)

  (0, 7)	0.58448290102
  (0, 2)	0.58448290102
  (0, 4)	0.444514311537
  (0, 1)	0.345205016865
  (1, 1)	0.385371627466
  (1, 0)	0.652490884513
  (1, 3)	0.652490884513
  (2, 4)	0.444514311537
  (2, 1)	0.345205016865
  (2, 6)	0.58448290102
  (2, 5)	0.58448290102
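
To show where these numbers come from, here is a minimal pure-Python sketch of the same computation. It assumes the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names `tokenize` and `tfidf` are chosen for this example.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

def tokenize(text):
    # lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]
n_docs = len(docs)
# document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print({t: round(w, 4) for t, w in sorted(tfidf(doc).items())})
```

"document" occurs in all three texts, so it gets the lowest weight in each row. Terms that occur in only one text get the highest.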


    3.3.2 Count / Density / Readability Features
    Some of these features are:
    word count, sentence count, punctuation counts
    and industry-specific word counts.
    Other measures include readability measures such as syllable counts,
    SMOG index and Flesch reading ease.
    Refer to the Textstat library to create such features.
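
The simpler count features can be computed with the standard library alone. The sketch below is one possible way to do it. The function name `count_features` and the naive sentence-splitting rule are assumptions for illustration; readability scores are left to a dedicated library.

```python
import re
import string

def count_features(text):
    words = text.split()
    # naive sentence split: ., ! or ? followed by whitespace or the end of the text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "punctuation_count": punctuation,
    }

print(count_features("Sugar is bad to consume. My sister likes to have sugar, but not my father."))
```

The sentence rule will miscount abbreviations such as "Dr." and is only meant as a starting point.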

    3.4 Word Embedding (text vectors)

In [30]:
# a separate word embedding notebook is available on GitHub

from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

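# word vectors are accessed through model.wv; the older model.similarity(...)
# and model[...] forms are not available in gensim 4
print(model.wv.similarity('data', 'science'))

print(model.wv['learning'])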

-0.0332806159459
[ -3.74489790e-03   4.30184737e-05   4.85881791e-03   3.99604253e-03
  -3.05236247e-03  -4.72965511e-03   3.89103661e-04  -1.63321255e-03
  -2.33896566e-03  -4.74427873e-03   4.84594749e-03  -4.05107532e-03
   2.88611627e-04  -3.40197003e-04  -2.34870962e-03  -2.86212831e-04
   7.35686393e-04   1.17057447e-04  -2.19545418e-05  -1.20955636e-03
  -2.74638343e-03  -7.71940453e-04  -2.49645353e-04   3.23865097e-04
  -6.57792029e-04  -1.18192553e-03   1.46145860e-04   1.26875460e-03
  -2.67122436e-04  -4.41335654e-03  -3.23761581e-03   4.66149766e-03
   2.14769319e-03  -3.45350895e-03   8.87754024e-04  -3.29132844e-03
   1.50818343e-03  -4.91187035e-04   4.13787132e-03   2.25335360e-03
   1.38052274e-03  -4.81582945e-03   2.37386371e-03  -4.48366132e-04
  -4.77828924e-03   4.47241729e-03   1.20649545e-03   1.93272578e-03
  -4.95557813e-03   1.72483316e-03  -5.15283667e-04   6.43836276e-04
   1.86065445e-03  -2.02088244e-03  -2.03146413e-03  -8.80531152e-04
  -3.14491056e-03

    4. Important tasks of NLP
        4.1 Text Classification
        4.2 Text Matching / Similarity
        4.3 Coreference Resolution
        4.4 Other NLP problems / tasks
            4.4.1 Text Summarization
            4.4.2 Machine Translation
            4.4.3 Natural Language Generation and Understanding
            4.4.4 Optical Character Recognition
            4.4.5 Document to Information
            
    5. Important Libraries for NLP (python)
        5.1 Scikit-learn: Machine learning in Python
        5.2 Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
        5.3 Pattern: A web mining module with tools for NLP and machine learning.
        5.4 TextBlob: Easy-to-use NLP tools API, built on top of NLTK and Pattern.
        5.5 spaCy: Industrial-strength NLP with Python and Cython.
        5.6 Gensim: Topic Modelling for Humans
        5.7 Stanford Core NLP: NLP services and packages by Stanford NLP Group.


In [5]:
# 4.1 Text Classification using naive bayes 
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus) 
print(model.classify("Their codes are amazing."))
 
print(model.classify("I don't like their computer."))

print(model.accuracy(test_corpus))


Class_A
Class_B
0.8333333333333334


In [6]:
# 4.1 Text Classification using an SVM on TF-IDF features
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn import svm 

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

# Create the TF-IDF vectorizer 
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Fit the vectorizer on the training data and build the training vectors
train_vectors = vectorizer.fit_transform(train_data)
# Transform the test data with the already fitted vectorizer 
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)


print (classification_report(test_labels, prediction))

             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6



        4.2.1 Levenshtein Distance
        The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

In [3]:
def levenshtein(s1,s2): 
    if len(s1) > len(s2):
        s1,s2 = s2,s1 
    distances = range(len(s1) + 1) 
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1]) 
            else:
                newDistances.append(1 + min((distances[index1], distances[index1+1], newDistances[-1]))) 
        distances = newDistances 
    return distances[-1]

print(levenshtein("analyze","analyse"))

1


        4.2.2 Phonetic Matching
    A phonetic matching algorithm takes a keyword as input (a person’s name, a location name, etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar.

In [7]:
# Soundex comes from the separate "fuzzy" package, not from scikit-learn
import fuzzy 
soundex = fuzzy.Soundex(4) 
print (soundex('ankit'))
print (soundex('aunkit'))
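
Where no Soundex package is installed, the idea can still be sketched in pure Python. The function below follows the common American Soundex rules as a rough illustration: keep the first letter, map consonants to digits, collapse repeated codes, let "h" and "w" pass without breaking a run, and pad to a fixed length. The name `soundex_code` is an assumption for this example, and its output formatting may differ from that of any particular library.

```python
# Pure-Python Soundex sketch (illustrative; not tied to any library's output format).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex_code(name, length=4):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # vowels reset prev, so a repeated code after a vowel counts again
    # pad with zeros and cut to the requested length
    return (result + "0" * length)[:length]

print(soundex_code("ankit"))
print(soundex_code("aunkit"))
```

Both spellings map to the same code, which is what makes the technique useful for matching names that sound alike but are spelled differently.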

        4.2.3 Flexible String Matching
        4.2.4 Cosine Similarity

In [6]:
import math
from collections import Counter
def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()]) 
    sum2 = sum([vec2[x]**2 for x in vec2.keys()]) 
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
   
    if not denominator:
        return 0.0 
    else:
        return float(numerator) / denominator

def text_to_vector(text): 
    words = text.split() 
    return Counter(words)

text1 = 'This is an article on analytics vidhya' 
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1) 
vector2 = text_to_vector(text2) 
cosine = get_cosine(vector1, vector2)
print(vector1)
print(vector2)
print(cosine)


Counter({'analytics': 1, 'on': 1, 'article': 1, 'an': 1, 'is': 1, 'This': 1, 'vidhya': 1})
Counter({'analytics': 1, 'on': 1, 'language': 1, 'article': 1, 'natural': 1, 'is': 1, 'about': 1, 'processing': 1, 'vidhya': 1})
0.629940788348712
