## **What is NLP (Natural Language Processing)?**

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data efficiently. By using NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Nowadays, most of us have smartphones with speech recognition. These smartphones use NLP to understand what is said. Also, many people use laptops whose operating system has built-in speech recognition.

Example:
**Cortana**

<img src="https://miro.medium.com/max/700/1*TXj0kr4jVrtLtmvxZFu8Lw.png" height="200" width="500">

Microsoft's operating system has a virtual assistant called Cortana that can recognize natural speech. You can use it to set up reminders, open apps, send emails, play games, track flights and packages, check the weather, and so on.


**Siri**

<img src="https://miro.medium.com/max/700/1*-AuKCZbXIVOhI-AgX4J8PQ.jpeg">

Siri is the virtual assistant in Apple Inc.’s iOS, watchOS, macOS, HomePod, and tvOS operating systems. Again, you can do many things with voice commands: start a call, text someone, send an email, set a timer, take a picture, open an app, set an alarm, use navigation, and so on.

### **Applications of NLP:**
**Machine Translation:**
    It is the process by which computer software is used to translate a text from one natural language (such as English) to another (such as Spanish).


**Speech Recognition:**
    Speech recognition is the process by which a computer (or other type of machine) identifies spoken words. Basically, it means talking to your computer and having it correctly recognize what you are saying.
    
**Sentiment Analysis:**
    Sentiment analysis is the process of detecting positive or negative sentiment in text. It’s often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.
    
**Question Answering:**
    Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.
    
**Text Summarization:**
    Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).
  
**Chatbot:**
    A chatbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent.
    
**Text Classification:**
    Text classification is the process of sorting text into categories. By using NLP, a text classifier can automatically analyze text and then assign a set of predefined tags or categories based on its content.
    
**Optical Character Recognition:**
    Optical Character Recognition (OCR) is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text.
    
**Spell Checking:**
    Spell checking can be framed as a sequence-to-sequence mapping problem. Given an input sequence that may contain a number of errors, a checker such as ContextSpellChecker ranks candidate correction sequences according to three criteria, the first of which is the set of different correction candidates for each word (the word level).
  
**Spam Detection:**
    Spam detection identifies unsolicited, unwanted, and virus-infested email (called spam) and stops it from reaching email inboxes.

**Named Entity Recognition:**
    Named entity recognition (NER) — sometimes referred to as entity chunking, extraction, or identification — is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word “Bdec” in a text and classify it as a “Company”.
    
    
    
### **Understanding Natural Language Processing (NLP)**

   ![title](https://miro.medium.com/max/581/0*YovzfkM8Ld1LO-87.png)


As humans, we perform natural language processing (NLP) considerably well, but even then, we are not perfect. We often mistake one thing for another, and we often interpret the same sentences or words differently.

Consider the following sentence:

`I saw a man on a hill with a telescope.`

These are some interpretations of the sentence shown above.
   - There is a man on the hill, and I watched him with my telescope.
   - There is a man on the hill, and he has a telescope.
   - I’m on a hill, and I saw a man using my telescope.
   - I’m on a hill, and I saw a man who has a telescope.
   - There is a man on a hill, and I saw him something with my telescope.
   
From the examples above, we can see that language processing is not “deterministic”: the same sentence does not always have the same interpretation, and an interpretation that suits one person might not suit another.

Therefore, Natural Language Processing (NLP) takes a non-deterministic approach. In other words, NLP can be used to build intelligent systems that model how humans understand and interpret language in different situations.

### **Components of Natural Language Processing**

<img src="https://miro.medium.com/max/455/0*9aT_MdjuT9xXGUdU.png">

**Lexical Analysis:**
With lexical analysis, we divide a whole chunk of text into paragraphs, sentences, and words. It involves identifying and analyzing the structure of words.

**Syntactic Analysis:**
Syntactic analysis involves the analysis of words in a sentence for grammar and arranging words in a manner that shows the relationship among the words. For instance, the sentence “The shop goes to the house” does not pass.

**Semantic Analysis:**
Semantic analysis determines the meaning of the words and checks whether the text is meaningful. Phrases such as “hot ice-cream” do not pass.

**Discourse Integration:**
Discourse integration takes into account the context of the text: the meaning of a sentence can depend on the sentences that come before it. For example, in “He works at Google.”, the pronoun “he” must refer to someone introduced in an earlier sentence.

**Pragmatic Analysis:**
Pragmatic analysis deals with overall communication and interpretation of language. It deals with deriving meaningful use of language in various situations.


For instance, banks are using natural language processing (NLP) to automate certain document processing, analysis, and customer service activities. Three applications include:

- **Intelligent document search:** finding relevant information in large volumes of scanned documents.
- **Investment analysis:** automating routine analysis of earnings reports and news so that analysts can focus on alpha generation.
- **Customer service & insights:** deploying chatbots to answer customer queries and understand customer needs.

![title](https://miro.medium.com/max/2560/1*BqX1wu57y5ApVE-5G-EC4w.png)

### **NLTK**
NLTK (Natural Language Toolkit) is a suite of libraries and programs for statistical language processing. It is one of the most widely used NLP libraries and contains packages that help machines process human language and reply with an appropriate response.

- If the nltk library is not installed, use pip to install it.

**!pip install nltk**

After installation, use nltk.download() to install the additional NLTK data packages.


In [1]:
import nltk

In [None]:
#nltk.download()

## **Text Preprocessing**

Since text is the most unstructured form of all the available data, various types of noise are present in it and the data is not readily analyzable without any pre-processing. The entire process of cleaning and standardization of text, making it noise-free and ready for analysis, is known as text preprocessing.


![The text data preprocessing framework.](https://www.kdnuggets.com/wp-content/uploads/text-preprocessing-framework-2.png)

### **Basic Pre-processing of Text Data**
- Case Conversion
- Punctuation removal
- Stopwords removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization
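
As a rough orientation before going through the steps one by one, the first few of them can be chained in a few lines of plain Python. This is only a minimal sketch: the function name `preprocess` and the tiny `STOP_WORDS` set are made up for illustration (NLTK's English stop word list is much longer), and spelling correction, stemming, and lemmatization are left out.

```python
import re
import string

# Hypothetical mini stop list for illustration; a real list is much longer.
STOP_WORDS = {"the", "is", "a", "and", "of", "between"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()                                                # case conversion
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)   # punctuation removal
    tokens = text.split()                                              # tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]            # stopword removal

tokens = preprocess("Natural language processing (NLP), describes the interaction "
                    "between human language and computers.")
print(tokens)
```

The order of the steps matters: lowercasing first means that the stop word lookup also catches capitalized forms such as "The".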

### **Case Conversion**

If all text is in the same case, it is easier for a machine to interpret the words, because lower-case and upper-case characters are treated as different. For example, Ball and ball are treated as two different words. So we convert the text to a single case, and lower case is usually preferred.
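
To make the effect concrete, the small sketch below counts words before and after lowercasing. The word list is made up for illustration.

```python
from collections import Counter

words = "Ball ball BALL bat Bat".split()

# Without case conversion, each spelling variant is counted separately.
raw_counts = Counter(words)

# After lowercasing, the variants collapse into a single vocabulary entry.
lower_counts = Counter(word.lower() for word in words)

print(len(raw_counts), dict(raw_counts))
print(len(lower_counts), dict(lower_counts))
```

Five distinct entries shrink to two, which is the vocabulary reduction that case conversion is meant to achieve.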

In [2]:
text='Natural language processing (NLP), describes the interaction between human language and computers.'
text

'Natural language processing (NLP), describes the interaction between human language and computers.'

In [3]:
#conversion of text into lower case.
text.lower()

'natural language processing (nlp), describes the interaction between human language and computers.'

In [4]:
#conversion of text into upper case.
text.upper()

'NATURAL LANGUAGE PROCESSING (NLP), DESCRIBES THE INTERACTION BETWEEN HUMAN LANGUAGE AND COMPUTERS.'

In [6]:
# Load the imdb review dataset
import pandas as pd
import re
imdb=pd.read_csv('imdb_sentiment.csv')

In [7]:
# Convert each review to lower case so that case variants of a word are not treated as different words.
imdb['review']=imdb['review'].apply(lambda x :x.lower())
imdb['review']

0      a very, very, very slow-moving, aimless movie ...
1      not sure who was more lost - the flat characte...
2      attempting artiness with black & white and cle...
3           very little music or anything to speak of.  
4      the best scene in the movie was when gerardo i...
                             ...                        
743    i just got bored watching jessice lange take h...
744    unfortunately, any virtue in this film's produ...
745                     in a word, it is embarrassing.  
746                                 exceptionally bad!  
747    all in all its an insult to one's intelligence...
Name: review, Length: 748, dtype: object

### **Punctuation Removal**
Another text-processing technique is removing punctuation. Python's string.punctuation constant contains 32 ASCII punctuation characters that need to be handled. We can use the string module together with a regular expression to replace any punctuation in the text with an empty string.

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*cBCVyPufn4l8lZXjy93s8Q.jpeg">

**Regex:**

A regular expression (regex) is a sequence of characters that defines a search pattern. Here's a guide to writing regular expressions:

- **Learn the Special Characters:** Familiarize yourself with special characters used in regex, such as ".", "*", "+", "?", and others.

- **Select a Programming Language or Tool:** Choose a language or tool that supports regex, such as Python, Perl, or grep.

- **Construct the Pattern:** Combine special characters with literal characters to form your regex pattern.

- **Search for the Pattern:** Use the relevant function or method in your chosen language or tool to search for the pattern in a string.
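
The steps above can be tried out with Python's built-in `re` module. The sample string and patterns below are made up purely to show how the four special characters behave.

```python
import re

sample = "color colour colouur clr"

# "?" makes the preceding character optional (zero or one occurrence).
optional_u = re.findall(r"colou?r", sample)
# "*" allows zero or more occurrences of the preceding character.
any_u = re.findall(r"colou*r", sample)
# "+" requires one or more occurrences of the preceding character.
some_u = re.findall(r"colou+r", sample)
# "." matches any single character except a newline.
dot = re.findall(r"c.l", sample)

print(optional_u)
print(any_u)
print(some_u)
print(dot)
```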

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/0*yEz4_OZ7HhuYNebv.png">

In [8]:
#removal of punctuation using regex.
text

'Natural language processing (NLP), describes the interaction between human language and computers.'

In [9]:
text=re.sub(r'[^\w\s]','',text) # remove everything except word characters and whitespace
text

'Natural language processing NLP describes the interaction between human language and computers'

In [10]:
imdb['clean']=imdb['review'].apply(lambda x : re.sub(r'[^\w\s]',' ',x))

In [11]:
imdb['clean']

0      a very  very  very slow moving  aimless movie ...
1      not sure who was more lost   the flat characte...
2      attempting artiness with black   white and cle...
3           very little music or anything to speak of   
4      the best scene in the movie was when gerardo i...
                             ...                        
743    i just got bored watching jessice lange take h...
744    unfortunately  any virtue in this film s produ...
745                     in a word  it is embarrassing   
746                                 exceptionally bad   
747    all in all its an insult to one s intelligence...
Name: clean, Length: 748, dtype: object

In [12]:
#remove punctuation using string module
import string
imdb['clean1']=imdb['review'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '' , x))

In [13]:
imdb['clean1']

0      a very very very slowmoving aimless movie abou...
1      not sure who was more lost  the flat character...
2      attempting artiness with black  white and clev...
3            very little music or anything to speak of  
4      the best scene in the movie was when gerardo i...
                             ...                        
743    i just got bored watching jessice lange take h...
744    unfortunately any virtue in this films product...
745                       in a word it is embarrassing  
746                                  exceptionally bad  
747    all in all its an insult to ones intelligence ...
Name: clean1, Length: 748, dtype: object
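
A regex is not strictly required for this step. As an alternative sketch, `str.translate` with a table built by `str.maketrans` deletes the same ASCII punctuation characters. The sample review string is copied from the first row shown above.

```python
import string

# Translation table whose third argument lists the characters to delete.
remove_punct = str.maketrans("", "", string.punctuation)

review = "a very, very, very slow-moving, aimless movie"
cleaned = review.translate(remove_punct)

print(cleaned)
print(len(string.punctuation))
```

Like the `clean1` column, this joins "slow-moving" into "slowmoving", because the hyphen is deleted rather than replaced by a space.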

### **Stopword Removal**

### **What are stop words?**

Stopwords are words in a language that do not add much meaning to a sentence. They can often be ignored without sacrificing the meaning of the sentence. For search engines, these are typically the most common, short function words, such as the, is, at, which, and on. However, removing stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who” or “Take That”.

<img src="https://repository-images.githubusercontent.com/181882059/cfeb9180-6dde-11e9-85b6-e79357766310" height="400" width="600">


### **When to remove stop words?**

If the task is text classification or sentiment analysis, we usually remove stop words because they provide little information to the model; in other words, we keep unwanted words out of the corpus. If the task is language translation, stop words are useful, because they have to be translated along with the other words.

There is no hard and fast rule on when to remove stop words. But I would suggest removing stop words if the task is language classification, spam filtering, caption generation, auto-tag generation, sentiment analysis, or something else related to text classification.

On the other hand, if our task is one of Machine Translation, Question-Answering problems, Text Summarization, Language Modeling, it’s better not to remove the stop words as they are a crucial part of these applications.
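
One caveat is worth checking before applying a stop word list blindly, even for sentiment analysis: NLTK's English list contains negations such as "not". The sketch below uses a small made-up stop list (not NLTK's) and a hypothetical helper `remove_stop_words` to show how dropping "not" can flip the apparent sentiment of a review.

```python
# Hypothetical mini stop list; NLTK's English list also contains "not".
stop_words = {"the", "is", "was", "a", "not"}
# Assumed variant for sentiment work: keep negations.
stop_words_keep_negation = stop_words - {"not"}

def remove_stop_words(text, stop):
    """Lowercase the text and drop every word found in the stop set."""
    return " ".join(word for word in text.lower().split() if word not in stop)

review = "The movie was not good"
without_negation = remove_stop_words(review, stop_words)
with_negation = remove_stop_words(review, stop_words_keep_negation)

print(without_negation)
print(with_negation)
```

Whether to keep negations is a judgment call that depends on the model and the data; the point is only that the stop list is worth inspecting.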



Let's remove stopwords

In [14]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
print('These are the stopwords',stop)

These are the stopwords ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [15]:
text

'Natural language processing NLP describes the interaction between human language and computers'

In [16]:
text=' '.join([x for x in text.split() if x not in stop]) # split the text, drop the stop words from the list,
# and then join the list back into a string.
text

'Natural language processing NLP describes interaction human language computers'

In [17]:
# Remove stop words from the reviews.
imdb['clean']=imdb['clean'].apply(lambda review: " ".join(word for word in review.split() if word not in stop))
imdb['clean']

0      slow moving aimless movie distressed drifting ...
1      sure lost flat characters audience nearly half...
2      attempting artiness black white clever camera ...
3                            little music anything speak
4      best scene movie gerardo trying find song keep...
                             ...                        
743        got bored watching jessice lange take clothes
744    unfortunately virtue film production work lost...
745                                    word embarrassing
746                                    exceptionally bad
747             insult one intelligence huge waste money
Name: clean, Length: 748, dtype: object

- Now we can see that stop words like `"very"` and `"a"` are removed from the reviews.

### **Spell checks**

Spelling mistakes are common and most of us are used to software indicating if a mistake was made or not. From autocorrect on our phones to red underlining in text editors, spell checking is an essential feature for many different products.

<img src="https://www.gingersoftware.com/statics/2.3.16/images/uploads/Group-541.png">

In [18]:
# Authors often misspell words, either by mistake or through typing errors.
# Misspellings inflate the vocabulary of our corpus, so we correct them.
# We will use the TextBlob module.
# If TextBlob is not installed, use pip to install it.
#!pip install textblob
from textblob import TextBlob

In [19]:
text_='hostipal is far' # here the spelling of hospital is wrong.

- The most straightforward way to correct input text is to use TextBlob's correct() method.

In [20]:
text_=TextBlob(text_).correct() # use TextBlob's correct() to fix the spelling.
text_

TextBlob("hospital is far")

In [21]:
# same way we can use for reviews
imdb['clean'][:10].apply(lambda x: str(TextBlob(x).correct()))

0    slow moving aimless movie distressed drifting ...
1    sure lost flat characters audience nearly half...
2    attempting artless black white clever camera a...
3                          little music anything speak
4    best scene movie gerard trying find song keeps...
5    rest movie lacks art charm meaning emptiness w...
6                                     wasted two hours
7    saw movie today thought good effort good messa...
8                                      bit predictable
9          loved casting jimmy buffets science teacher
Name: clean, dtype: object
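
To give a feel for what a spelling corrector does, here is a minimal sketch that picks the vocabulary word with the smallest edit distance to the input. It is not TextBlob's implementation: the functions `levenshtein` and `correct` and the toy vocabulary are assumptions made for illustration, and a real corrector would also weigh how frequent each candidate word is.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def correct(word, vocabulary):
    """Return the vocabulary word closest to `word` by edit distance."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))

vocabulary = ["hospital", "is", "far", "hostel"]  # hypothetical toy vocabulary
fixed = correct("hostipal", vocabulary)
print(fixed)
```

A corrector can only propose words it knows, which is one reason names in the output above are altered ("gerardo" becomes "gerard") and why corrected text should be spot-checked.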

## **Tokenization**

Tokenization is the splitting of a large chunk of text (a phrase, sentence, or document) into smaller units (single words or combinations of words). These smaller units are known as tokens.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*PZYP2nL6Zc_jpkaHLRxLQQ.png" height="300" width="500">


**Why is Tokenization required in NLP?**

Before processing a natural language, we need to identify the words that constitute a string of characters. That’s why tokenization is the most basic step to proceed with NLP (text data). This is important because the meaning of the text could easily be interpreted by analyzing the words present in the text.

**Tokenization using split()**

Let’s start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() breaks a string at any run of whitespace (spaces, tabs, newlines). We can change the separator to anything.
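
The difference between the default behaviour and an explicit separator is easy to miss, so here is a small illustration. The sample string is made up and contains a double space and a newline on purpose.

```python
line = "Rey  the penguin\nswims"   # two spaces after "Rey", then a newline

default_split = line.split()        # any run of whitespace separates tokens
space_split = line.split(" ")       # only a single space separates tokens
comma_split = "a,b,c".split(",")    # custom separator

print(default_split)
print(space_split)
print(comma_split)
```

With `split(" ")` the double space produces an empty string and the newline stays inside a token, so the no-argument form is usually the safer choice for tokenizing.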

In [22]:
# split() is the most basic tokenizing technique. Here we split on whitespace; it returns a list of all the words.
Text="""Because of problems with her eyesight, rey the African penguin had issues with swimming. That’s unusual for a penguin,
and presented a big challenge for our aviculture team to help Rey overcome her hesitancy.
Slowly and steadily, we trained her to be comfortable feeding in the water like the rest of the penguin colony.
The aviculturists also trained Rey to accept daily eye drops from them as part of her special health care.
Rey already had good relationships with some staff, and was comfortable with them handling her.
Senior Aviculturist Kim Fukuda says the team built on those bonds to get Rey used to receiving the eye drops.
"She knows the routine," Kim says. "I usually give her the eye drops in one area of the exhibit after all the penguins get
their vitamins. When that happens, she runs over there and waits for me." Rosa, our oldest sea otter, has very limited eyesight,
among other health issues. The sea otter team had already trained Rosa so they could examine her eyes,
and built on that trust to include administering the eye drops she needs."""
Text.split()

['Because',
 'of',
 'problems',
 'with',
 'her',
 'eyesight,',
 'rey',
 'the',
 'African',
 'penguin',
 'had',
 'issues',
 'with',
 'swimming.',
 'That’s',
 'unusual',
 'for',
 'a',
 'penguin,',
 'and',
 'presented',
 'a',
 'big',
 'challenge',
 'for',
 'our',
 'aviculture',
 'team',
 'to',
 'help',
 'Rey',
 'overcome',
 'her',
 'hesitancy.',
 'Slowly',
 'and',
 'steadily,',
 'we',
 'trained',
 'her',
 'to',
 'be',
 'comfortable',
 'feeding',
 'in',
 'the',
 'water',
 'like',
 'the',
 'rest',
 'of',
 'the',
 'penguin',
 'colony.',
 'The',
 'aviculturists',
 'also',
 'trained',
 'Rey',
 'to',
 'accept',
 'daily',
 'eye',
 'drops',
 'from',
 'them',
 'as',
 'part',
 'of',
 'her',
 'special',
 'health',
 'care.',
 'Rey',
 'already',
 'had',
 'good',
 'relationships',
 'with',
 'some',
 'staff,',
 'and',
 'was',
 'comfortable',
 'with',
 'them',
 'handling',
 'her.',
 'Senior',
 'Aviculturist',
 'Kim',
 'Fukuda',
 'says',
 'the',
 'team',
 'built',
 'on',
 'those',
 'bonds',
 'to',
 'get',

### **Tokenization using regex.**

The re.findall() function finds all the substrings that match the pattern passed to it and returns them in a list.
The “\w” represents “any word character”, which usually means alphanumeric characters (letters, numbers) and the underscore (_). ‘+’ means one or more times. So [\w']+ matches each run of word characters and apostrophes, stopping when any other character is encountered.
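
One detail worth checking on real text is which apostrophe character it uses. The character class `[\w']` contains the straight apostrophe only, so a typographic apostrophe (’) still splits a contraction. The short strings below are made up to show the difference, and the third pattern is just one possible adjustment.

```python
import re

straight = "That's unusual"
curly = "That’s unusual"   # typographic apostrophe (U+2019)

straight_tokens = re.findall(r"[\w']+", straight)
curly_tokens = re.findall(r"[\w']+", curly)
curly_fixed = re.findall(r"[\w'’]+", curly)   # class extended with U+2019

print(straight_tokens)
print(curly_tokens)
print(curly_fixed)
```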

In [23]:
# We will use Python's re library to work with regular expressions.
# A raw string (r"...") keeps the backslash in \w from being treated as a string escape.
tokens = re.findall(r"[\w']+", Text)
tokens

['Because',
 'of',
 'problems',
 'with',
 'her',
 'eyesight',
 'rey',
 'the',
 'African',
 'penguin',
 'had',
 'issues',
 'with',
 'swimming',
 'That',
 's',
 'unusual',
 'for',
 'a',
 'penguin',
 'and',
 'presented',
 'a',
 'big',
 'challenge',
 'for',
 'our',
 'aviculture',
 'team',
 'to',
 'help',
 'Rey',
 'overcome',
 'her',
 'hesitancy',
 'Slowly',
 'and',
 'steadily',
 'we',
 'trained',
 'her',
 'to',
 'be',
 'comfortable',
 'feeding',
 'in',
 'the',
 'water',
 'like',
 'the',
 'rest',
 'of',
 'the',
 'penguin',
 'colony',
 'The',
 'aviculturists',
 'also',
 'trained',
 'Rey',
 'to',
 'accept',
 'daily',
 'eye',
 'drops',
 'from',
 'them',
 'as',
 'part',
 'of',
 'her',
 'special',
 'health',
 'care',
 'Rey',
 'already',
 'had',
 'good',
 'relationships',
 'with',
 'some',
 'staff',
 'and',
 'was',
 'comfortable',
 'with',
 'them',
 'handling',
 'her',
 'Senior',
 'Aviculturist',
 'Kim',
 'Fukuda',
 'says',
 'the',
 'team',
 'built',
 'on',
 'those',
 'bonds',
 'to',
 'get',
 'Re


### **Tokenization using NLTK**

NLTK contains a package called tokenize, which provides two main functions:

- Word tokenize: We use the word_tokenize() method to split a sentence into tokens or words.
- Sentence tokenize: We use the sent_tokenize() method to split a document or paragraph into sentences.
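
To illustrate the two levels without any downloads, the sketch below splits a paragraph into sentences with a simple regex and then splits each sentence into words. It is a rough stand-in and not how NLTK's sent_tokenize() works internally; the helper name `naive_sent_split` and the sample paragraph are made up.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

paragraph = "Rey is a penguin. She had issues with swimming! Did the team help her?"
sentences = naive_sent_split(paragraph)            # sentence level
words = [sentence.split() for sentence in sentences]  # word level, per sentence

print(sentences)
print(words[0])
```

The heuristic breaks on abbreviations (for example, "Dr. Smith left." is cut after "Dr."), which is the kind of case a trained sentence tokenizer is meant to handle.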

In [25]:
# The tokenize package provides word_tokenize and sent_tokenize.
# Note: recent NLTK releases may require nltk.download('punkt_tab') instead.
nltk.download('punkt')
from nltk.tokenize import word_tokenize
token=word_tokenize(Text)
token

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Because',
 'of',
 'problems',
 'with',
 'her',
 'eyesight',
 ',',
 'rey',
 'the',
 'African',
 'penguin',
 'had',
 'issues',
 'with',
 'swimming',
 '.',
 'That',
 '’',
 's',
 'unusual',
 'for',
 'a',
 'penguin',
 ',',
 'and',
 'presented',
 'a',
 'big',
 'challenge',
 'for',
 'our',
 'aviculture',
 'team',
 'to',
 'help',
 'Rey',
 'overcome',
 'her',
 'hesitancy',
 '.',
 'Slowly',
 'and',
 'steadily',
 ',',
 'we',
 'trained',
 'her',
 'to',
 'be',
 'comfortable',
 'feeding',
 'in',
 'the',
 'water',
 'like',
 'the',
 'rest',
 'of',
 'the',
 'penguin',
 'colony',
 '.',
 'The',
 'aviculturists',
 'also',
 'trained',
 'Rey',
 'to',
 'accept',
 'daily',
 'eye',
 'drops',
 'from',
 'them',
 'as',
 'part',
 'of',
 'her',
 'special',
 'health',
 'care',
 '.',
 'Rey',
 'already',
 'had',
 'good',
 'relationships',
 'with',
 'some',
 'staff',
 ',',
 'and',
 'was',
 'comfortable',
 'with',
 'them',
 'handling',
 'her',
 '.',
 'Senior',
 'Aviculturist',
 'Kim',
 'Fukuda',
 'says',
 'the',
 'tea

- NLTK treats punctuation marks as tokens, so we may want to remove the punctuation before further processing.
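
One simple way to drop such tokens, sketched here on a hand-copied sample of the output above, is to keep only tokens that contain at least one letter or digit. Unlike a check against `string.punctuation`, this also removes the typographic apostrophe token.

```python
tokens = ["Because", "of", "problems", "with", "her", "eyesight", ",", "rey",
          "swimming", ".", "That", "’", "s"]

# Keep a token only if it contains at least one alphanumeric character.
word_tokens = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
print(word_tokens)
```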

### **Tokenization using the spaCy library**

spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports more than 49 languages and is designed for fast processing.
- spaCy is often faster than comparable libraries.

Installation of spaCy

**pip install -U pip setuptools wheel**

**pip install -U spacy**

**python -m spacy download en_core_web_sm**

In [26]:
# if spacy is not installed, use pip method to install it
# import library.
from spacy.lang.en import English
# Load English tokenizer
nlp = English()

In [27]:
my_doc = nlp(Text)
# Create list of word tokens
token_list = [token.text for token in my_doc]
token_list

['Because',
 'of',
 'problems',
 'with',
 'her',
 'eyesight',
 ',',
 'rey',
 'the',
 'African',
 'penguin',
 'had',
 'issues',
 'with',
 'swimming',
 '.',
 'That',
 '’s',
 'unusual',
 'for',
 'a',
 'penguin',
 ',',
 '\n',
 'and',
 'presented',
 'a',
 'big',
 'challenge',
 'for',
 'our',
 'aviculture',
 'team',
 'to',
 'help',
 'Rey',
 'overcome',
 'her',
 'hesitancy',
 '.',
 '\n',
 'Slowly',
 'and',
 'steadily',
 ',',
 'we',
 'trained',
 'her',
 'to',
 'be',
 'comfortable',
 'feeding',
 'in',
 'the',
 'water',
 'like',
 'the',
 'rest',
 'of',
 'the',
 'penguin',
 'colony',
 '.',
 '\n',
 'The',
 'aviculturists',
 'also',
 'trained',
 'Rey',
 'to',
 'accept',
 'daily',
 'eye',
 'drops',
 'from',
 'them',
 'as',
 'part',
 'of',
 'her',
 'special',
 'health',
 'care',
 '.',
 '\n',
 'Rey',
 'already',
 'had',
 'good',
 'relationships',
 'with',
 'some',
 'staff',
 ',',
 'and',
 'was',
 'comfortable',
 'with',
 'them',
 'handling',
 'her',
 '.',
 '\n',
 'Senior',
 'Aviculturist',
 'Kim',
 'F

In [28]:
# Tokenize the reviews of the IMDb dataset.
review=' '.join(imdb['review'])
my_doc = nlp(review)
# Create list of word tokens
token_list = [token.text for token in my_doc]
token_list

['a',
 'very',
 ',',
 'very',
 ',',
 'very',
 'slow',
 '-',
 'moving',
 ',',
 'aimless',
 'movie',
 'about',
 'a',
 'distressed',
 ',',
 'drifting',
 'young',
 'man',
 '.',
 '  ',
 'not',
 'sure',
 'who',
 'was',
 'more',
 'lost',
 '-',
 'the',
 'flat',
 'characters',
 'or',
 'the',
 'audience',
 ',',
 'nearly',
 'half',
 'of',
 'whom',
 'walked',
 'out',
 '.',
 '  ',
 'attempting',
 'artiness',
 'with',
 'black',
 '&',
 'white',
 'and',
 'clever',
 'camera',
 'angles',
 ',',
 'the',
 'movie',
 'disappointed',
 '-',
 'became',
 'even',
 'more',
 'ridiculous',
 '-',
 'as',
 'the',
 'acting',
 'was',
 'poor',
 'and',
 'the',
 'plot',
 'and',
 'lines',
 'almost',
 'non',
 '-',
 'existent',
 '.',
 '  ',
 'very',
 'little',
 'music',
 'or',
 'anything',
 'to',
 'speak',
 'of',
 '.',
 '  ',
 'the',
 'best',
 'scene',
 'in',
 'the',
 'movie',
 'was',
 'when',
 'gerardo',
 'is',
 'trying',
 'to',
 'find',
 'a',
 'song',
 'that',
 'keeps',
 'running',
 'through',
 'his',
 'head',
 '.',
 '  ',
 

### **What is Stemming?**

Stemming is an elementary rule-based process for removing inflectional forms from a token; the output is the stem of the word.

For example, "Consult", "Consultant", "Consulting", "Consultative" and "Consultants" will all become "Consult", which is their stem, because their inflectional endings are removed.


<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/0*-yUy-dAKeTbPRQuk.png" height="400" width="600">

Stemming is a crude normalization process because it can produce words that are not in the dictionary. For example, consider the sentence “His teams are not winning”.

A simple suffix-stripping stemmer might produce the tokens “hi”, “team”, “are”, “not”, “winn”.

Notice that “winn” is not a regular word, and “hi” changes the meaning of the entire sentence.
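
The effect can be reproduced with a deliberately naive suffix stripper. The function `naive_stem` below is a toy written for this illustration; it is not the Porter or Snowball algorithm, both of which apply many more rules and conditions.

```python
def naive_stem(word):
    """Toy suffix stripper: removes 'ing' or a trailing 's' with no further checks."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and len(word) > 2:
        return word[:-1]
    return word

stems = [naive_stem(w) for w in "His teams are not winning".split()]
print(stems)
```

Real stemmers add conditions to avoid some of these mistakes, but they remain rule-based and can still return stems that are not dictionary words.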

- **2 types of stemmers:**
  
    1. **Porter Stemmer:**
    It is one of the most popular stemming methods, proposed in 1980. It is based on the idea that suffixes in the English language are made up of combinations of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. Its main applications include data mining and information retrieval. However, it is limited to English words. Also, a group of related words is mapped onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers.
        
    2. **Snowball stemmer:**
    Compared to the Porter Stemmer, the Snowball Stemmer can also handle non-English words. Since it supports other languages, it can be called a multilingual stemmer. The Snowball stemmers are also imported from the nltk package. This stemmer is based on a string-processing language called ‘Snowball’ and is widely used. The English Snowball stemmer is also referred to as the Porter2 Stemmer and revises several rules of the original Porter Stemmer (in the outputs below, for example, Porter turns “was” into “wa” while Snowball leaves it unchanged). Because of these improvements, the Snowball stemmer is generally described as somewhat faster as well.


In [29]:
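#NLTK library used for stemming.
from nltk.stem import PorterStemmer, SnowballStemmer
#PorterStemmer
port=PorterStemmer()
words=[]
# Note: split(' ') splits on single spaces only, so punctuation and newlines stay
# attached to tokens (e.g. 'swimming.'), and such tokens are not stemmed cleanly.
for word in Text.split(' '):
    words.append(port.stem(word))
Text_=' '.join(words)
Text_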

'becaus of problem with her eyesight, rey the african penguin had issu with swimming. that’ unusu for a penguin,\nand present a big challeng for our avicultur team to help rey overcom her hesitancy.\nslowli and steadily, we train her to be comfort feed in the water like the rest of the penguin colony.\nth aviculturist also train rey to accept daili eye drop from them as part of her special health care.\nrey alreadi had good relationship with some staff, and wa comfort with them handl her.\nsenior aviculturist kim fukuda say the team built on those bond to get rey use to receiv the eye drops.\n"sh know the routine," kim says. "i usual give her the eye drop in one area of the exhibit after all the penguin get\ntheir vitamins. when that happens, she run over there and wait for me." rosa, our oldest sea otter, ha veri limit eyesight,\namong other health issues. the sea otter team had alreadi train rosa so they could examin her eyes,\nand built on that trust to includ administ the eye drop 

In [30]:
#Using SnowballStemmer
snow=SnowballStemmer('english')
words=[]
for word in Text.split(' '):
    words.append(snow.stem(word))
Text_=' '.join(words)
Text_

'becaus of problem with her eyesight, rey the african penguin had issu with swimming. that unusu for a penguin,\nand present a big challeng for our avicultur team to help rey overcom her hesitancy.\nslowli and steadily, we train her to be comfort feed in the water like the rest of the penguin colony.\nth aviculturist also train rey to accept daili eye drop from them as part of her special health care.\nrey alreadi had good relationship with some staff, and was comfort with them handl her.\nsenior aviculturist kim fukuda say the team built on those bond to get rey use to receiv the eye drops.\n"sh know the routine," kim says. "i usual give her the eye drop in one area of the exhibit after all the penguin get\ntheir vitamins. when that happens, she run over there and wait for me." rosa, our oldest sea otter, has veri limit eyesight,\namong other health issues. the sea otter team had alreadi train rosa so they could examin her eyes,\nand built on that trust to includ administ the eye drop
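
In the two outputs above, words followed directly by punctuation (for example `swimming.`) come through unstemmed, because splitting on spaces leaves the punctuation attached to the word. One possible workaround, sketched here with a simple regular expression (an assumption for illustration, not the tokenizer used elsewhere in this notebook), is to separate words from punctuation before stemming. The sample sentence is shortened from the text above.

```python
import re
from nltk.stem import SnowballStemmer

snow = SnowballStemmer("english")
sample = "Rey the African penguin had issues with swimming."

# Splitting on spaces keeps "swimming." as a single token, so no suffix is stripped.
naive = [snow.stem(tok) for tok in sample.split(" ")]

# \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sample)
stemmed = [snow.stem(tok) for tok in tokens]

print(naive)
print(stemmed)
```

A dedicated tokenizer such as NLTK's `word_tokenize` would handle contractions and quotes more carefully; the regular expression is only meant to show the effect.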

In [31]:
# Stem every word of each review in the IMDB dataset.
imdb['clean']=imdb['clean'].apply(lambda x: " ".join([snow.stem(word) for word in x.split()]))

In [32]:
imdb['clean']

0        slow move aimless movi distress drift young man
1          sure lost flat charact audienc near half walk
2      attempt arti black white clever camera angl mo...
3                                littl music anyth speak
4      best scene movi gerardo tri find song keep run...
                             ...                        
743                got bore watch jessic lang take cloth
744    unfortun virtu film product work lost regrett ...
745                                       word embarrass
746                                           except bad
747                  insult one intellig huge wast money
Name: clean, Length: 748, dtype: object

### **What is Lemmatization?**
Lemmatization, on the other hand, is a systematic, step-by-step process for removing the inflectional forms of a word. It makes use of vocabulary, word structure, part-of-speech tags, and grammatical relations.

The output of lemmatization is the root word, called a lemma. For example,

Am, Are, Is >> Be

Running, Ran, Run >> Run

Also, because lemmatization is a systematic process, one can specify the part-of-speech tag for the desired term, and lemmatization is performed only if the given word has the proper part-of-speech tag. For example, if we lemmatize the word running as a verb, it is converted to run. But if we lemmatize the same word running as a noun, it is not converted.

![title](https://cdn.analyticsvidhya.com/wp-content/uploads/2021/02/Screenshot-from-2021-02-23-15-07-22.png)

In [34]:
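# The lemmatizer needs the WordNet data.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# lemmatize() treats every word as a noun unless a pos argument is given.
lemma=WordNetLemmatizer()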
words=[]
for word in Text.split(' '):
    words.append(lemma.lemmatize(word))
Text_=' '.join(words)
Text_

[nltk_data] Downloading package wordnet to /root/nltk_data...


'Because of problem with her eyesight, rey the African penguin had issue with swimming. That’s unusual for a penguin,\nand presented a big challenge for our aviculture team to help Rey overcome her hesitancy.\nSlowly and steadily, we trained her to be comfortable feeding in the water like the rest of the penguin colony.\nThe aviculturists also trained Rey to accept daily eye drop from them a part of her special health care.\nRey already had good relationship with some staff, and wa comfortable with them handling her.\nSenior Aviculturist Kim Fukuda say the team built on those bond to get Rey used to receiving the eye drops.\n"She know the routine," Kim says. "I usually give her the eye drop in one area of the exhibit after all the penguin get\ntheir vitamins. When that happens, she run over there and wait for me." Rosa, our oldest sea otter, ha very limited eyesight,\namong other health issues. The sea otter team had already trained Rosa so they could examine her eyes,\nand built on th

### **Parts of Speech Tagging**


<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*6Ps5SSxIwH28b_RLYQ2Sqw.jpeg" height="400" width="500">

For any language, syntax and structure usually go hand in hand: a set of specific rules, conventions, and principles governs the way words are combined into phrases, phrases are combined into clauses, and clauses are combined into sentences.

Knowledge about the structure and syntax of language is helpful in many areas like text processing, annotation, and parsing for further operations such as text classification or summarization.

__Parts of speech (POS)__ are specific lexical categories to which words are assigned, based on their syntactic context and role. Usually, words can fall into one of the following major categories.

+ __N(oun)__: This usually denotes words that depict some object or entity, which may be living or nonliving. Some examples would be fox, dog, book, and so on. The POS tag symbol for nouns is N.

+ __V(erb)__: Verbs are words that are used to describe certain actions, states, or occurrences. There are a wide variety of further subcategories, such as auxiliary, reflexive, and transitive verbs (and many more). Some typical examples of verbs would be running, jumping, read, and write. The POS tag symbol for verbs is V.

+ __Adj(ective)__: Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The phrase beautiful flower has the noun (N) flower, which is described or qualified using the adjective (ADJ) beautiful. The POS tag symbol for adjectives is ADJ.

+ __Adv(erb)__: Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs. The phrase very beautiful flower has the adverb (ADV) very, which modifies the adjective (ADJ) beautiful, indicating the degree to which the flower is beautiful. The POS tag symbol for adverbs is ADV.

Besides these four major categories of parts of speech, there are other categories that occur frequently in the English language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others. Furthermore, each POS tag like the noun (N) can be further subdivided into categories like __singular nouns (NN)__, __singular proper nouns (NNP)__, and __plural nouns (NNS)__.

The process of classifying and labeling words with POS tags is called parts of speech tagging, or POS tagging.

### **Guide to POS Tags**

The most common part of speech (POS) tag schemes are those developed for the Penn Treebank.

| POS Tag | Description | Example |
|---------|---------------------------------------|-----------------------------------------|
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, three |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d’oeuvre |
| IN | preposition/subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend’s |
| PRP | personal pronoun | I, he, it |
| PRP\$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, well |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, non-3rd person sing. present | take |
| VBZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP\$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
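
Since the fine-grained Penn Treebank tags refine the major categories described earlier, a tag can be collapsed back to its coarse category by looking at its prefix. The helper below is a hypothetical sketch (the function name `penn_to_wordnet` and the fallback to noun are assumptions); it returns the single-letter codes that WordNet-based tools such as NLTK's `WordNetLemmatizer` accept for `pos`.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code: 'n', 'v', 'a' or 'r'."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS (but not RP, the particle tag)
        return "r"
    return "n"                 # nouns, plus an assumed fallback for all other tags


for tag in ["NNP", "VBD", "JJR", "RBS", "RP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Matching on the two-letter prefix rather than the first letter keeps `RP` (particle) from being mistaken for an adverb.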



### **POS Using NLTK**

In [35]:
sentence = 'Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin'
sentence

'Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin'

In [36]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [37]:
# Tokenize the sentence with word_tokenize, then find the POS tag for each token.
from nltk.tokenize import word_tokenize
nltk_pos_tagged = nltk.pos_tag(word_tokenize(sentence))
nltk_pos_tagged

[('Mr.', 'NNP'),
 ('Trump', 'NNP'),
 ('became', 'VBD'),
 ('president', 'NN'),
 ('after', 'IN'),
 ('winning', 'VBG'),
 ('the', 'DT'),
 ('political', 'JJ'),
 ('election', 'NN'),
 ('.', '.'),
 ('Though', 'IN'),
 ('he', 'PRP'),
 ('lost', 'VBD'),
 ('the', 'DT'),
 ('support', 'NN'),
 ('of', 'IN'),
 ('some', 'DT'),
 ('republican', 'JJ'),
 ('friends', 'NNS'),
 (',', ','),
 ('Trump', 'NNP'),
 ('is', 'VBZ'),
 ('friends', 'NNS'),
 ('with', 'IN'),
 ('President', 'NNP'),
 ('Putin', 'NNP')]
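
Once a sentence is tagged, the (word, tag) pairs can be processed with ordinary Python. As a rough sketch, the snippet below counts how often each tag occurs and joins runs of consecutive NNP tokens into candidate names. The tagged list is copied from the output above; grouping adjacent proper nouns is a simple heuristic assumed for illustration, not a substitute for named entity recognition.

```python
from collections import Counter
from itertools import groupby

# (word, tag) pairs copied from the NLTK output above.
tagged = [
    ('Mr.', 'NNP'), ('Trump', 'NNP'), ('became', 'VBD'), ('president', 'NN'),
    ('after', 'IN'), ('winning', 'VBG'), ('the', 'DT'), ('political', 'JJ'),
    ('election', 'NN'), ('.', '.'), ('Though', 'IN'), ('he', 'PRP'),
    ('lost', 'VBD'), ('the', 'DT'), ('support', 'NN'), ('of', 'IN'),
    ('some', 'DT'), ('republican', 'JJ'), ('friends', 'NNS'), (',', ','),
    ('Trump', 'NNP'), ('is', 'VBZ'), ('friends', 'NNS'), ('with', 'IN'),
    ('President', 'NNP'), ('Putin', 'NNP'),
]

# Frequency of each tag in the sentence.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))

# Join each run of adjacent NNP tokens into one string.
names = [
    " ".join(word for word, _ in group)
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1] == 'NNP')
    if is_proper
]
print(names)
```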

In [38]:
# Create a DataFrame of each word and its tag, and show the first 10 rows.
POS_df=pd.DataFrame(nltk_pos_tagged,
             columns=['Word', 'POS tag'])
POS_df.head(10)

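|   | Word | POS tag |
|---|-----------|-----|
| 0 | Mr. | NNP |
| 1 | Trump | NNP |
| 2 | became | VBD |
| 3 | president | NN |
| 4 | after | IN |
| 5 | winning | VBG |
| 6 | the | DT |
| 7 | political | JJ |
| 8 | election | NN |
| 9 | . | . |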


### **POS Using spaCy**

In [39]:
import spacy

# Load the small English pipeline.
nlp = spacy.load('en_core_web_sm')

sentence_nlp = nlp(sentence)
spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in sentence_nlp]
spacy_pos_tagged

[(Mr., 'NNP', 'PROPN'),
 (Trump, 'NNP', 'PROPN'),
 (became, 'VBD', 'VERB'),
 (president, 'NN', 'NOUN'),
 (after, 'IN', 'ADP'),
 (winning, 'VBG', 'VERB'),
 (the, 'DT', 'DET'),
 (political, 'JJ', 'ADJ'),
 (election, 'NN', 'NOUN'),
 (., '.', 'PUNCT'),
 (Though, 'IN', 'SCONJ'),
 (he, 'PRP', 'PRON'),
 (lost, 'VBD', 'VERB'),
 (the, 'DT', 'DET'),
 (support, 'NN', 'NOUN'),
 (of, 'IN', 'ADP'),
 (some, 'DT', 'DET'),
 (republican, 'JJ', 'ADJ'),
 (friends, 'NNS', 'NOUN'),
 (,, ',', 'PUNCT'),
 (Trump, 'NNP', 'PROPN'),
 (is, 'VBZ', 'AUX'),
 (friends, 'NNS', 'NOUN'),
 (with, 'IN', 'ADP'),
 (President, 'NNP', 'PROPN'),
 (Putin, 'NNP', 'PROPN')]
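
spaCy reports two labels per token: the fine-grained Penn-style tag (`tag_`) and a coarse universal category (`pos_`). If a label is unfamiliar, `spacy.explain` returns a short description from spaCy's built-in glossary. A small sketch, using labels taken from the output above:

```python
import spacy

# Labels copied from the tagged output; the glossary lookup does not need a loaded model.
labels = ["NNP", "PROPN", "IN", "ADP", "SCONJ", "VBZ", "AUX"]

descriptions = {label: spacy.explain(label) for label in labels}
for label, text in descriptions.items():
    print(f"{label:6} {text}")
```

This also helps explain why `after` and `Though` share the tag `IN` but differ in `pos_` (ADP versus SCONJ): the Penn tag IN covers both prepositions and subordinating conjunctions, while the universal scheme separates them.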

In [40]:
spacy_POS_DF=pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS Tag', 'Tag Type'])
spacy_POS_DF.head(10)

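|   | Word | POS Tag | Tag Type |
|---|-----------|-----|-------|
| 0 | Mr. | NNP | PROPN |
| 1 | Trump | NNP | PROPN |
| 2 | became | VBD | VERB |
| 3 | president | NN | NOUN |
| 4 | after | IN | ADP |
| 5 | winning | VBG | VERB |
| 6 | the | DT | DET |
| 7 | political | JJ | ADJ |
| 8 | election | NN | NOUN |
| 9 | . | . | PUNCT |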


**End of Notebook**