<a href="https://colab.research.google.com/github/franckbizimana/Wamungu/blob/main/NLP_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using Natural Language Processing

## Table of contents:

### 1. Data Acquisition
### 2. Data pre-processing and data cleaning
### 3. Sentiment Analysis
### 4. Text Classification

In this notebook I demonstrate how to pre-process and clean text data for Natural Language Processing applications, and how to perform Sentiment Analysis and Text Classification on the data.

Pre-processing and data cleaning are crucial steps in any NLP (Natural Language Processing) project. These steps help to ensure that the data used in the project is of high quality, and ready to be used for further analysis and modeling. The following are the common pre-processing and data cleaning steps in an NLP project:

1. **Data collection:** The first step is to collect the data that will be used in the project. This can be done from various sources such as text files, web pages, databases, etc.<br><br>

2. **Data cleaning:** Remove any irrelevant or redundant information from the data such as special characters, punctuation marks, numbers, etc. This step also involves correcting any spelling mistakes or typos in the data ('recieve' > 'receive', 'brocoli' > 'broccoli').<br><br>

3. **Text normalization:** Convert all the text data into a uniform format. This involves converting the text to lowercase or uppercase, expanding slang words and acronyms (e.g. lol, gn), expanding contractions (can't to cannot), removing stop words (a, the, and, but), and stemming or lemmatizing words to reduce them to their base or root form.<br><br>

  - **Stemming:** Stemming is the process of reducing words to a base form by stripping suffixes such as 'ing', 'ed', and 's'. For example, 'writing' and 'writes' both reduce to 'write'. Stemming algorithms are fast and efficient, but they can produce non-words or words whose meaning differs from the original, and irregular forms such as 'written' typically need lemmatization instead.<br><br>

  - **Lemmatization:** Lemmatization, on the other hand, reduces words to their base form using morphological analysis. Take the word 'better': a stemming algorithm either leaves it unchanged or truncates it to something like 'bett', which is not a meaningful word and does not represent the original. Lemmatization instead maps 'better' to its lemma, 'good'.

    The choice between stemming and lemmatization depends on the specific requirements of the NLP project and the trade-off between speed and accuracy.<br><br>

4. **Tokenization:** Tokenization is the process of splitting the text data into smaller chunks or tokens, such as words, phrases, or sentences. This step helps in preparing the text data for further analysis and modeling.<br><br>

5. **Text vectorization:** Text vectorization is the process of transforming data into a numerical format that can be used for analysis and modeling. This involves converting the text data into numerical vectors using techniques such as bag of words, TF-IDF, or word embeddings.
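
The vectorization step above can be sketched with the standard library alone. This is a minimal illustration on a made-up toy corpus, using the plain textbook definition `tf * log(N / df)`; the names `docs`, `vocab`, `bag_of_words`, and `tf_idf` are assumptions introduced here, and library implementations often use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made-up data for illustration).
docs = [
    ["nlp", "is", "fun"],
    ["nlp", "is", "hard"],
    ["cats", "are", "fun"],
]

# Vocabulary: every distinct token, in a fixed (sorted) order.
vocab = sorted({token for doc in docs for token in doc})

def bag_of_words(doc):
    """Return raw term counts for `doc`, aligned with `vocab`."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tf_idf(doc):
    """Return tf * idf weights, with idf = log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(docs)
    weights = []
    for term in vocab:
        tf = counts[term] / len(doc)
        # Every vocabulary term occurs in at least one document, so df >= 1.
        df = sum(1 for d in docs if term in d)
        weights.append(tf * math.log(n_docs / df))
    return weights

print(vocab)
print(bag_of_words(docs[0]))
print([round(w, 3) for w in tf_idf(docs[0])])
```

Each document becomes a vector with one position per vocabulary term. Bag of words stores raw counts, while TF-IDF down-weights terms that appear in many documents.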

In [1]:
!pip install nltk



In [7]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


## Different ways of importing text data

1. Web scraping
2. PDF/Word files

The `requests.get(url)` function is used to fetch the HTML content of the **url** specified. The HTML content is then passed to the **BeautifulSoup constructor**, which returns a **BeautifulSoup object** that can be used to parse the HTML.

The `soup.find_all("p")[:3]` expression finds all the `<p>` elements in the HTML and selects only the first three. The resulting list of elements is stored in the **paragraphs** variable.

In [8]:
import requests
from bs4 import BeautifulSoup

url = "https://www.ibm.com/topics/natural-language-processing#:~:text=the%20next%20step-,What%20is%20natural%20language%20processing%3F,same%20way%20human%20beings%20can."

# this helps you to go to the website to fetch the content
response = requests.get(url)

# BeautifulSoup constructor takes the text as input and returns a BeautifulSoup object.
# BeautifulSoup object makes it easier to parse and extract information from the HTML content.
soup = BeautifulSoup(response.text, "html.parser")

# Find all <p> elements in the HTML and keep the first three
paragraphs = soup.find_all("p")[:3]

# Extract the text from the selected paragraphs and concatenate them
combined_paragraph = ' '.join([p.text for p in paragraphs])

print(combined_paragraph)


Editorial Lead, AI Models Writer Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language.


In [9]:
# We will work with a PDF document this time.
# Import the library required to read PDF documents.
import PyPDF2

Here I am using a sample pdf file that has a small explanation about NLP.

In [10]:
# Open the PDF file
with open("NLP.pdf", "rb") as file:
    # Create a PDF object
    pdf = PyPDF2.PdfReader(file)

    # Initialize a variable to store the extracted text
    corpus = ""

    # Extract the text from each page of the PDF. (We have only one page)
    for page in pdf.pages:
        corpus += page.extract_text()

    # Print the extracted text
    print(corpus)


Natural language processing (NLP) is a field of study focused on making it # possible for  
computers to read, understand, and generate human language. NLP is an interdisciplinary  
field that combines computer science, AI, and linguistics.   
  
The process of NLP includes several steps, such as tokenization, stop word removal,  
stemming and lemmat ization, and more.  {}  
  
In tokenization, we break down the text into individual words, phrases, symbols, or other  
elements.   
  
Stop word removal involves removing commonly used words such as and, the, a, etc. that  
do not contribute much to the meaning of the text.   
  
Stemming and lemmatization, on the other hand, are techniques to reduce words to their  
root form. After the text has been cleaned and pre -processed, it can be used for various 
NLP tasks such as sentiment analysis, text classification, la nguage translation, and more !  
  
The effectiveness of these tasks greatly depends on the quality of the pre -processing 

### Convert all text to lower case

In [18]:
# Convert the text to lowercase
corpus = corpus.lower()
print(corpus)

natural language processing (nlp) is a field of study focused on making it # possible for  
computers to read, understand, and generate human language. nlp is an interdisciplinary  
field that combines computer science, ai, and linguistics.   
  
the process of nlp includes several steps, such as tokenization, stop word removal,  
stemming and lemmat ization, and more.  {}  
  
in tokenization, we break down the text into individual words, phrases, symbols, or other  
elements.   
  
stop word removal involves removing commonly used words such as and, the, a, etc. that  
do not contribute much to the meaning of the text.   
  
stemming and lemmatization, on the other hand, are techniques to reduce words to their  
root form. after the text has been cleaned and pre -processed, it can be used for various 
nlp tasks such as sentiment analysis, text classification, la nguage translation, and more !  
  
the effectiveness of these tasks greatly depends on the quality of the pre -processing 

### Tokenization
To tokenize the text, you can use the `word_tokenize` function from the `nltk` library.

The **Punkt** tokenizer is a pre-trained model that is trained to detect sentence boundaries in text. It can be used to tokenize a given text into a list of sentences.
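
To make the idea of sentence and word tokenization concrete, here is a small regex-based sketch. It is not the Punkt algorithm: `simple_sentence_split` and `simple_word_tokenize` are hypothetical helpers written for this illustration, and the naive sentence rule deliberately shows the kind of mistake (splitting after an abbreviation) that a trained tokenizer is designed to avoid.

```python
import re

def simple_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def simple_word_tokenize(text):
    """Return runs of word characters, plus each other non-space character on its own."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "NLP is fun. Tokenization splits text, e.g. into words!"

# The naive rule wrongly treats the period in "e.g." as a sentence boundary.
print(simple_sentence_split(sample))
print(simple_word_tokenize("NLP (natural language processing) is fun."))
```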

In [42]:
import nltk
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [43]:
# Tokenize the text by words
tokens = nltk.word_tokenize(corpus)

# Print the tokens
print(tokens)

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'field', 'of', 'study', 'focused', 'on', 'making', 'it', '#', 'possible', 'for', 'computers', 'to', 'read', ',', 'understand', ',', 'and', 'generate', 'human', 'language', '.', 'nlp', 'is', 'an', 'interdisciplinary', 'field', 'that', 'combines', 'computer', 'science', ',', 'ai', ',', 'and', 'linguistics', '.', 'the', 'process', 'of', 'nlp', 'includes', 'several', 'steps', ',', 'such', 'as', 'tokenization', ',', 'stop', 'word', 'removal', ',', 'stemming', 'and', 'lemmat', 'ization', ',', 'and', 'more', '.', '{', '}', 'in', 'tokenization', ',', 'we', 'break', 'down', 'the', 'text', 'into', 'individual', 'words', ',', 'phrases', ',', 'symbols', ',', 'or', 'other', 'elements', '.', 'stop', 'word', 'removal', 'involves', 'removing', 'commonly', 'used', 'words', 'such', 'as', 'and', ',', 'the', ',', 'a', ',', 'etc', '.', 'that', 'do', 'not', 'contribute', 'much', 'to', 'the', 'meaning', 'of', 'the', 'text', '.', 'stemming', 'a

### Removing Punctuation
To remove punctuation from the tokenized corpus, you can use the `string` module in Python.

In [44]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

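The code below uses a list comprehension to iterate over the tokens and keep each token that is not found in `string.punctuation`. Because `string.punctuation` is a string rather than a list, `token not in string.punctuation` is a substring test: single punctuation tokens such as `'('` or `','` are dropped, while tokens that merely contain punctuation, such as `'-processed'`, are kept. The surviving tokens are stored in a new list called `tokens_without_punctuation`.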

In [45]:
# Remove punctuations from the tokens
tokens_without_punctuation = [token for token in tokens if token not in string.punctuation]

# Print the tokens without punctuation
print(tokens_without_punctuation)


['natural', 'language', 'processing', 'nlp', 'is', 'a', 'field', 'of', 'study', 'focused', 'on', 'making', 'it', 'possible', 'for', 'computers', 'to', 'read', 'understand', 'and', 'generate', 'human', 'language', 'nlp', 'is', 'an', 'interdisciplinary', 'field', 'that', 'combines', 'computer', 'science', 'ai', 'and', 'linguistics', 'the', 'process', 'of', 'nlp', 'includes', 'several', 'steps', 'such', 'as', 'tokenization', 'stop', 'word', 'removal', 'stemming', 'and', 'lemmat', 'ization', 'and', 'more', 'in', 'tokenization', 'we', 'break', 'down', 'the', 'text', 'into', 'individual', 'words', 'phrases', 'symbols', 'or', 'other', 'elements', 'stop', 'word', 'removal', 'involves', 'removing', 'commonly', 'used', 'words', 'such', 'as', 'and', 'the', 'a', 'etc', 'that', 'do', 'not', 'contribute', 'much', 'to', 'the', 'meaning', 'of', 'the', 'text', 'stemming', 'and', 'lemmatization', 'on', 'the', 'other', 'hand', 'are', 'techniques', 'to', 'reduce', 'words', 'to', 'their', 'root', 'form', '
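
The filter above only drops tokens that are themselves punctuation, so a token such as `'-processed'` keeps its leading hyphen. One possible refinement, sketched here with a hypothetical helper `strip_punctuation` and made-up sample tokens, is to strip punctuation from both ends of every token and discard whatever becomes empty.

```python
import string

def strip_punctuation(tokens):
    """Strip leading/trailing punctuation from each token and drop tokens left empty."""
    cleaned = (token.strip(string.punctuation) for token in tokens)
    return [token for token in cleaned if token]

# Made-up sample that mixes pure punctuation with a token carrying a hyphen.
sample_tokens = ["pre", "-processed", "(", "nlp", ")", "more", "!", "..."]
print(strip_punctuation(sample_tokens))
```

Punctuation inside a token (for example the apostrophe in "can't") is left alone, because `str.strip` only removes characters from the two ends.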

### Remove stop words
To remove stop words from the tokenized corpus, you can use the `stopwords` corpus from the `nltk` library

In [46]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Get a list of stop words in English
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [47]:
print(stop_words)

{"needn't", 'doesn', "shan't", 'ours', 'any', "it's", 'have', "we're", "mightn't", 'will', 'hasn', 'on', 'about', 'yourselves', 'shan', 'while', 'here', 'we', 'further', 'll', 'no', 'they', "i've", 'was', 'couldn', "it'll", 'out', 're', "doesn't", "they're", 'my', 'these', "we'd", 'from', 'being', 'theirs', 'but', "couldn't", "it'd", 'it', 'under', 'a', 'each', 'most', 'or', 'through', 'against', "i'll", 'had', "hasn't", 'i', 'has', "she's", 'that', "won't", 'in', 't', 's', 'm', 'this', 'where', 'himself', 'nor', 'itself', 'between', 'him', 'all', 'because', 'am', 'not', 'above', 'me', 'other', 'why', 'what', "she'll", "we've", 'to', 'myself', "wasn't", 'does', 'hadn', 'off', 'd', "haven't", 'same', 'them', 'ma', 'whom', "wouldn't", 'yourself', 'were', 'can', 'after', 'until', 'down', "he'd", 'when', 'own', 'weren', "weren't", "isn't", 'again', 'wasn', "hadn't", 'into', 'up', 'before', 'for', "you're", 'now', 'how', 'their', "he'll", "they'd", 'once', 'themselves', 'very', "she'd", 'be

The code above uses the `stopwords.words` function to get the list of English stop words and stores it in a set called `stop_words`. The code below then uses a list comprehension to iterate over `tokens_without_punctuation` and keep each token that is not in `stop_words`; the result is a new list called `tokens_without_stop_words`.

In [48]:
# Remove stop words from the tokens
tokens_without_stop_words = [token for token in tokens_without_punctuation if token not in stop_words]

# Print the tokens without stop words
print(tokens_without_stop_words)


['natural', 'language', 'processing', 'nlp', 'field', 'study', 'focused', 'making', 'possible', 'computers', 'read', 'understand', 'generate', 'human', 'language', 'nlp', 'interdisciplinary', 'field', 'combines', 'computer', 'science', 'ai', 'linguistics', 'process', 'nlp', 'includes', 'several', 'steps', 'tokenization', 'stop', 'word', 'removal', 'stemming', 'lemmat', 'ization', 'tokenization', 'break', 'text', 'individual', 'words', 'phrases', 'symbols', 'elements', 'stop', 'word', 'removal', 'involves', 'removing', 'commonly', 'used', 'words', 'etc', 'contribute', 'much', 'meaning', 'text', 'stemming', 'lemmatization', 'hand', 'techniques', 'reduce', 'words', 'root', 'form', 'text', 'cleaned', 'pre', '-processed', 'used', 'various', 'nlp', 'tasks', 'sentiment', 'analysis', 'text', 'classification', 'la', 'nguage', 'translation', 'effectiveness', 'tasks', 'greatly', 'depends', 'quality', 'pre', '-processing', 'step', 'making', 'crucial', 'step', 'nlp']


In [49]:
print('Total words before removing stop words -',len(tokens_without_punctuation))
print('Total words left after stop words removal -',len(tokens_without_stop_words))

Total words before removing stop words - 160
Total words left after stop words removal - 91
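
One caveat for sentiment work: the stop-word set printed above contains negations such as 'no' and 'not', and removing them can flip the meaning of a sentence. The sketch below illustrates one way around this with a tiny hand-written stop-word set (an assumption for the example, not the NLTK list) and a hypothetical helper `remove_stop_words`.

```python
# A tiny, hand-written stop-word set used only for this illustration;
# it is NOT the NLTK list.
toy_stop_words = {"the", "is", "a", "this", "not", "no", "was"}

# Negation words we want to keep because they can flip the sentiment.
negations = {"not", "no"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in `stop_words`."""
    return [token for token in tokens if token not in stop_words]

tokens = ["this", "movie", "was", "not", "good"]

# With the full set, the negation disappears along with the other stop words.
print(remove_stop_words(tokens, toy_stop_words))

# Subtracting the negations from the set keeps "not" in the output.
print(remove_stop_words(tokens, toy_stop_words - negations))
```

The same set subtraction could be applied to the `stop_words` set built earlier in the notebook.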


### Stemming

In [50]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens_without_stop_words]
print(stemmed_tokens)


['natur', 'languag', 'process', 'nlp', 'field', 'studi', 'focus', 'make', 'possibl', 'comput', 'read', 'understand', 'gener', 'human', 'languag', 'nlp', 'interdisciplinari', 'field', 'combin', 'comput', 'scienc', 'ai', 'linguist', 'process', 'nlp', 'includ', 'sever', 'step', 'token', 'stop', 'word', 'remov', 'stem', 'lemmat', 'izat', 'token', 'break', 'text', 'individu', 'word', 'phrase', 'symbol', 'element', 'stop', 'word', 'remov', 'involv', 'remov', 'commonli', 'use', 'word', 'etc', 'contribut', 'much', 'mean', 'text', 'stem', 'lemmat', 'hand', 'techniqu', 'reduc', 'word', 'root', 'form', 'text', 'clean', 'pre', '-process', 'use', 'variou', 'nlp', 'task', 'sentiment', 'analysi', 'text', 'classif', 'la', 'nguag', 'translat', 'effect', 'task', 'greatli', 'depend', 'qualiti', 'pre', '-process', 'step', 'make', 'crucial', 'step', 'nlp']


Observe how stemming produces non-English words such as 'natur', 'languag', and 'studi'.

### Lemmatization
The WordNet corpus is a lexical database of English words, developed by Princeton University. It groups words into sets of synonyms and provides short definitions and example sentences for each word.

In lemmatization, the WordNet corpus is used to determine the base form of a word. For example, the lemma of "better" is "good" when the word is treated as an adjective. The lemmatizer looks each word up under a given part of speech, so the result depends on that tag rather than on the surrounding sentence. This is still more sophisticated than stemming, which simply strips suffixes without considering the meaning of the word.



In the code below, the `nltk.download('omw-1.4')` call downloads version 1.4 of the Open Multilingual Wordnet (OMW) data package for the Natural Language Toolkit (NLTK).

The Open Multilingual Wordnet is a collection of wordnets for many languages, linked to the Princeton WordNet. Its hierarchical structure of words and concepts supports tasks such as word sense disambiguation and semantic similarity calculation.

By downloading the OMW data package, you can use it within your NLTK-based project to perform various natural language processing tasks, such as lemmatization, word sense disambiguation, and word similarity computation.

In [51]:
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens_without_stop_words]
print(lemmatized_tokens)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


['natural', 'language', 'processing', 'nlp', 'field', 'study', 'focused', 'making', 'possible', 'computer', 'read', 'understand', 'generate', 'human', 'language', 'nlp', 'interdisciplinary', 'field', 'combine', 'computer', 'science', 'ai', 'linguistics', 'process', 'nlp', 'includes', 'several', 'step', 'tokenization', 'stop', 'word', 'removal', 'stemming', 'lemmat', 'ization', 'tokenization', 'break', 'text', 'individual', 'word', 'phrase', 'symbol', 'element', 'stop', 'word', 'removal', 'involves', 'removing', 'commonly', 'used', 'word', 'etc', 'contribute', 'much', 'meaning', 'text', 'stemming', 'lemmatization', 'hand', 'technique', 'reduce', 'word', 'root', 'form', 'text', 'cleaned', 'pre', '-processed', 'used', 'various', 'nlp', 'task', 'sentiment', 'analysis', 'text', 'classification', 'la', 'nguage', 'translation', 'effectiveness', 'task', 'greatly', 'depends', 'quality', 'pre', '-processing', 'step', 'making', 'crucial', 'step', 'nlp']


`WordNetLemmatizer` by default lemmatizes words as nouns. Therefore, it doesn’t convert `"making"` to `"make"` because "making" as a noun (e.g., "the making of a movie") is already in its base form.

To correctly lemmatize verbs like `"making"` to `"make"`, you need to specify the part of speech (POS) as a verb ('v').

In [58]:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens_without_stop_words]
print(lemmatized_tokens)

['natural', 'language', 'process', 'nlp', 'field', 'study', 'focus', 'make', 'possible', 'computers', 'read', 'understand', 'generate', 'human', 'language', 'nlp', 'interdisciplinary', 'field', 'combine', 'computer', 'science', 'ai', 'linguistics', 'process', 'nlp', 'include', 'several', 'step', 'tokenization', 'stop', 'word', 'removal', 'stem', 'lemmat', 'ization', 'tokenization', 'break', 'text', 'individual', 'word', 'phrase', 'symbols', 'elements', 'stop', 'word', 'removal', 'involve', 'remove', 'commonly', 'use', 'word', 'etc', 'contribute', 'much', 'mean', 'text', 'stem', 'lemmatization', 'hand', 'techniques', 'reduce', 'word', 'root', 'form', 'text', 'clean', 'pre', '-processed', 'use', 'various', 'nlp', 'task', 'sentiment', 'analysis', 'text', 'classification', 'la', 'nguage', 'translation', 'effectiveness', 'task', 'greatly', 'depend', 'quality', 'pre', '-processing', 'step', 'make', 'crucial', 'step', 'nlp']


### Handling acronyms and slang words

Create a dictionary that maps acronyms or slang words to their expanded form, then use this dictionary to replace the words in your tokens.

In [60]:
expanded_terms = {
    'nlp': 'natural language processing',
    'ai': 'artificial intelligence'
}

expanded_tokens = [expanded_terms.get(token, token) for token in lemmatized_tokens]

print(expanded_tokens)


['natural', 'language', 'process', 'natural language processing', 'field', 'study', 'focus', 'make', 'possible', 'computers', 'read', 'understand', 'generate', 'human', 'language', 'natural language processing', 'interdisciplinary', 'field', 'combine', 'computer', 'science', 'artificial intelligence', 'linguistics', 'process', 'natural language processing', 'include', 'several', 'step', 'tokenization', 'stop', 'word', 'removal', 'stem', 'lemmat', 'ization', 'tokenization', 'break', 'text', 'individual', 'word', 'phrase', 'symbols', 'elements', 'stop', 'word', 'removal', 'involve', 'remove', 'commonly', 'use', 'word', 'etc', 'contribute', 'much', 'mean', 'text', 'stem', 'lemmatization', 'hand', 'techniques', 'reduce', 'word', 'root', 'form', 'text', 'clean', 'pre', '-processed', 'use', 'various', 'natural language processing', 'task', 'sentiment', 'analysis', 'text', 'classification', 'la', 'nguage', 'translation', 'effectiveness', 'task', 'greatly', 'depend', 'quality', 'pre', '-proces

The expanded_tokens list is created by iterating through lemmatized_tokens and using the `.get()` method to retrieve the expanded form of each token from the expanded_terms dictionary. If the expanded form of a token is not found in the dictionary, the original token is used.

1. The expression before the **for** keyword is evaluated for each item in the iterable named after **in**.
2. In this case, the iterable is **lemmatized_tokens**, which is a list of words.
3. `.get()` is a dictionary method that returns the value associated with a key. If the key is missing, it returns the default passed as the second argument (or None when no default is given) instead of raising a KeyError. Here the default is the token itself, so tokens without an expansion pass through unchanged.
4. The result of each evaluation is added to a new list, which is assigned to the **expanded_tokens** variable.
5. The end result is a list of words where the acronyms have been expanded to their full form.


In other words, the list comprehension goes through each word in **lemmatized_tokens**, checks whether it is a key in **expanded_terms**, and if so replaces it with its full form.
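
Note that each expansion in the output above is a single multi-word token (for example `'natural language processing'`). If later steps expect one word per token, the expansions can be split again. The sketch below uses a hypothetical helper `expand_and_flatten` and repeats the dictionary so that it runs on its own.

```python
expanded_terms = {
    "nlp": "natural language processing",
    "ai": "artificial intelligence",
}

def expand_and_flatten(tokens, mapping):
    """Replace each known acronym by the individual words of its expansion."""
    flat = []
    for token in tokens:
        # split() turns a multi-word expansion into separate tokens;
        # tokens without an expansion pass through unchanged.
        flat.extend(mapping.get(token, token).split())
    return flat

print(expand_and_flatten(["nlp", "combine", "ai", "linguistics"], expanded_terms))
```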

### Fixing Typos
To fix typos in our list of tokens, we can use spelling correction tools such as the `pyspellchecker` library or the `autocorrect` library. These tools work by comparing the words in your list of tokens to a dictionary of correctly spelled words, and suggesting corrected spellings based on the closest match.

The two can be run one after the other to catch more errors. Spelling correction is not always accurate, so the suggestions should be reviewed manually: in this notebook, for example, the PDF fragments 'lemmat' and 'ization' end up as 'lemma station' in the joined text further down.

In [63]:
!pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.6.1-py3-none-any.whl size=622364 sha256=88dacde63c7cf7a47f5f3d7f30e24b480f6a0598575c95a0eb2cbb8ac3118188
  Stored in directory: /root/.cache/pip/wheels/5e/90/99/807a5ad861ce5d22c3c299a11df8cba9f31524f23ae6e645cb
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.6.1


In [65]:
import autocorrect

spell = autocorrect.Speller(lang='en')

#tokens = ['This', 'is', 'speling', 'misstake']
corrected_tokens = [spell.autocorrect_word(token) for token in expanded_tokens]
print('Original tokens:', expanded_tokens)
print('Corrected tokens:', corrected_tokens)

Original tokens: ['natural', 'language', 'process', 'natural language processing', 'field', 'study', 'focus', 'make', 'possible', 'computers', 'read', 'understand', 'generate', 'human', 'language', 'natural language processing', 'interdisciplinary', 'field', 'combine', 'computer', 'science', 'artificial intelligence', 'linguistics', 'process', 'natural language processing', 'include', 'several', 'step', 'tokenization', 'stop', 'word', 'removal', 'stem', 'lemmat', 'ization', 'tokenization', 'break', 'text', 'individual', 'word', 'phrase', 'symbols', 'elements', 'stop', 'word', 'removal', 'involve', 'remove', 'commonly', 'use', 'word', 'etc', 'contribute', 'much', 'mean', 'text', 'stem', 'lemmatization', 'hand', 'techniques', 'reduce', 'word', 'root', 'form', 'text', 'clean', 'pre', '-processed', 'use', 'various', 'natural language processing', 'task', 'sentiment', 'analysis', 'text', 'classification', 'la', 'nguage', 'translation', 'effectiveness', 'task', 'greatly', 'depend', 'quality'
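
Automatic correction can also change tokens that should stay as they are, such as domain terms or fragments you want to inspect by hand. A minimal sketch of one safeguard is a wrapper that skips a set of protected tokens. The helper `correct_tokens`, the `protected` parameter, and the lookup-table corrector `toy_correct` are all assumptions made for this illustration; in the notebook, `spell.autocorrect_word` could be passed as the `correct` argument instead.

```python
def correct_tokens(tokens, correct, protected=frozenset()):
    """Apply `correct` to each token, leaving tokens in `protected` untouched."""
    return [token if token in protected else correct(token) for token in tokens]

# Hypothetical stand-in corrector backed by a small lookup table; a real
# pipeline would pass a spell checker's word-level function here instead.
toy_fixes = {"speling": "spelling", "nguage": "language", "lemmat": "lemma"}

def toy_correct(word):
    """Return the known fix for `word`, or `word` itself if none is listed."""
    return toy_fixes.get(word, word)

tokens = ["speling", "nguage", "lemmat"]

print(correct_tokens(tokens, toy_correct))
print(correct_tokens(tokens, toy_correct, protected={"lemmat"}))
```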

# Sentiment Analysis

To perform sentiment analysis on the list of corrected tokens, we can use various libraries in Python such as:

1. **nltk:** The Natural Language Toolkit (nltk) provides a SentimentIntensityAnalyzer class that can be used to calculate the sentiment of a piece of text. We can use the `polarity_scores()` method to obtain the sentiment scores.<br><br>

2. **TextBlob:** TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as sentiment analysis.

### Using TextBlob

In [66]:
!pip install textblob



TextBlob reports a sentiment polarity score, where:

* score closer to 1 indicates a positive sentiment,
* score closer to -1 indicates a negative sentiment, and
* score closer to 0 indicates a neutral sentiment.

In [67]:
# we need to now join the tokens back to make it as a sentence.

from textblob import TextBlob

text = " ".join(corrected_tokens)
print(text)

natural language process natural language processing field study focus make possible computers read understand generate human language natural language processing interdisciplinary field combine computer science artificial intelligence linguistics process natural language processing include several step tokenization stop word removal stem lemma station tokenization break text individual word phrase symbols elements stop word removal involve remove commonly use word etc contribute much mean text stem lemmatization hand techniques reduce word root form text clean pre processed use various natural language processing task sentiment analysis text classification la language translation effectiveness task greatly depend quality pre processing step make crucial step natural language processing


In [68]:
# get the polarity score
analysis = TextBlob(text)

print(analysis.sentiment)

Sentiment(polarity=0.032598039215686284, subjectivity=0.5316176470588236)


**Polarity**: This is a float value that represents the sentiment polarity of the text. It ranges from -1 to 1, where -1 indicates a highly negative sentiment, 0 indicates a neutral sentiment, and 1 indicates a highly positive sentiment.

**Subjectivity**: This is a float value that represents the degree of subjectivity or opinion present in the text. It ranges from 0 to 1, where 0 indicates a very objective text with no opinion or sentiment, and 1 indicates a highly subjective text with strong opinions or emotions expressed.

Here the polarity of about 0.03 is close to 0, so the text reads as neutral. The subjectivity of about 0.53 sits in the middle of the range: the text is neither strongly opinionated nor entirely objective.

### Using NLTK

The "VADER (Valence Aware Dictionary and sEntiment Reasoner)" lexicon, is a pre-trained lexicon and rule-based sentiment analysis tool. It is part of the Natural Language Toolkit (nltk) library in Python, and is used to determine the sentiment of text data by analyzing the words used in the text and the context in which they are used.

The VADER lexicon contains a list of words and their associated sentiment scores, which are based on the words' connotations, intensities, and tendencies to appear in positive, neutral, or negative contexts. When analyzing text, the sentiment analysis tool looks at each word in the text and uses the sentiment scores in the VADER lexicon to determine the overall sentiment of the text.
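
The idea of combining a lexicon with rules can be sketched in a few lines. This is a toy model, not VADER's actual algorithm: the valence scores in `toy_lexicon` are made up, the only rule is a sign flip after a negation word, and the result is not normalized to the range -1 to 1 the way VADER's compound score is.

```python
# Made-up valence scores for illustration; the real VADER lexicon is much
# larger and its values differ.
toy_lexicon = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}
negations = {"not", "never", "no"}

def toy_sentiment(tokens):
    """Sum word valences, flipping the sign of a word that follows a negation."""
    total = 0.0
    for i, token in enumerate(tokens):
        # Words missing from the lexicon contribute nothing.
        valence = toy_lexicon.get(token, 0.0)
        if i > 0 and tokens[i - 1] in negations:
            valence = -valence
        total += valence
    return total

print(toy_sentiment(["the", "movie", "was", "good"]))
print(toy_sentiment(["the", "movie", "was", "not", "good"]))
print(toy_sentiment(["good", "but", "not", "great"]))
```

Even this simple rule depends on the neighbouring word, which is why rule-based scoring generally works best on whole sentences rather than isolated tokens.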

In [69]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')


# Create a SentimentIntensityAnalyzer object
sia = SentimentIntensityAnalyzer()

# Calculate the sentiment score for each token
scores = [sia.polarity_scores(token) for token in corrected_tokens]

# Calculate the average sentiment score for the text
sentiment_score = sum(score['compound'] for score in scores) / len(scores)

# Check the sentiment of the text
if sentiment_score >= 0.05:
    sentiment = "positive"
elif sentiment_score <= -0.05:
    sentiment = "negative"
else:
    sentiment = "neutral"

# Print the sentiment of the text
print("Sentiment:", sentiment)


Sentiment: neutral


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
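
The cell above scores each token separately and averages the compound scores. Tokens that are not in the lexicon score 0, so the average is pulled toward neutral as the text gets longer. The sketch below illustrates this dilution with made-up per-token scores; `mean` and `label` are hypothetical helpers, and the 0.05 threshold is the one used in the cell above. An alternative worth trying is to score the joined string once with `sia.polarity_scores(text)`.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def label(score, threshold=0.05):
    """Map a compound-style score to a label using the thresholds from the cell above."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Made-up per-token scores: two clearly positive words among neutral ones.
few_tokens = [0.6, 0.5, 0.0, 0.0]
many_tokens = [0.6, 0.5] + [0.0] * 38

# The same two positive words give different labels once more neutral
# tokens are averaged in.
print(mean(few_tokens), label(mean(few_tokens)))
print(mean(many_tokens), label(mean(many_tokens)))
```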


#### Both libraries produce a neutral sentiment for this text.