#### **Sentiment Analysis**

Sentiment analysis is a technique used to determine the emotional tone or sentiment expressed in a text. It involves analyzing the words and phrases used in the text to identify the underlying sentiment, whether it is positive, negative, or neutral.

***Lexicon-based analysis***

This type of analysis, used by tools such as NLTK's VADER sentiment analyzer, applies a set of predefined rules and heuristics to determine the sentiment of a piece of text. These rules are typically based on lexical and syntactic features of the text, such as the presence of positive or negative words and phrases.

While **lexicon-based analysis** can be relatively simple to implement and interpret, it may not be as accurate as ML-based or transformer-based approaches, especially when dealing with complex or ambiguous text data.
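
To make the idea concrete, here is a minimal sketch of a lexicon-based scorer. The word list, the weights, and the single negation rule are all made up for illustration; VADER's actual lexicon and heuristics are far richer, so treat this as a picture of the approach rather than of VADER itself.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (not VADER's real values)
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0, "fun": 1.0,
    "bad": -1.0, "stupid": -2.0, "boring": -1.5, "froze": -1.0,
}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum word polarities, flipping the sign of a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        weight = TOY_LEXICON.get(token, 0.0)  # unknown words contribute nothing
        if i > 0 and tokens[i - 1] in NEGATIONS:
            weight = -weight
        score += weight
    return score


def lexicon_label(text):
    """Map the summed score to a coarse sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_label("This game is great fun"))         # positive
print(lexicon_label("The app is not good, it froze"))  # negative
print(lexicon_label("I opened the app"))               # neutral
```

The third sentence shows the main weakness of the approach: any text whose words are missing from the lexicon gets no signal at all.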

***Machine learning (ML)***

This approach involves training a model to identify the sentiment of a piece of text based on a set of labeled training data. These models can be trained using a wide range of ML algorithms, including decision trees, support vector machines (SVMs), and neural networks.

ML-based approaches can be more accurate than lexicon-based analysis, especially when dealing with complex text data, but they require labeled training data, which a lexicon-based analyzer does not need, and may be more computationally expensive.
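
As a rough illustration of the supervised workflow, the sketch below trains a bag-of-words classifier on six hand-written reviews. Two assumptions to flag: the training sentences are invented for this example, and Naive Bayes is used only because it is compact; any of the algorithms named above (decision trees, SVMs, neural networks) could take its place in the pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up labeled dataset (1 = positive, 0 = negative)
train_texts = [
    "love this app great fun",
    "awesome game works great",
    "terrific fun game",
    "stupid app keeps freezing",
    "terrible waste of money",
    "boring frustrating game",
]
train_labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each text into word counts;
# MultinomialNB learns which words are more likely in each class
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_reviews = ["great fun app", "terrible boring waste"]
predictions = model.predict(new_reviews).tolist()
print(predictions)  # [1, 0]
```

With only six training sentences the model simply memorizes which words belong to which class; a usable classifier would need thousands of labeled reviews and a held-out test set.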

***Pre-trained transformer-based deep learning***

A deep learning-based approach, as seen with BERT and GPT-4, involves using models pre-trained on massive amounts of text data. These models use complex neural networks to encode the context and meaning of the text, allowing them to achieve state-of-the-art accuracy on a wide range of NLP tasks, including sentiment analysis. However, these models require significant computational resources and may not be practical for all use cases.



In summary:

*   Lexicon-based analysis is a straightforward approach to sentiment analysis, but it may not be as accurate as more complex methods.
*   Machine learning-based approaches can be more accurate, but they require labeled training data and may be more computationally expensive.
*   Pre-trained transformer-based deep learning approaches can achieve state-of-the-art accuracy but require significant computational resources and may not be practical for all use cases.

In [9]:
# Importing required libraries

import re
import warnings

import numpy as np
import pandas as pd
import nltk
# nltk.download('all')  # run once to fetch the tokenizer, stopword, WordNet, and VADER data
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.metrics import confusion_matrix, classification_report

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")




In [3]:
#Reading the dataset

df = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv')
df.head()

Unnamed: 0,reviewText,Positive
0,This is a one of the best apps acording to a b...,1
1,This is a pretty good version of the game for ...,1
2,this is a really cool game. there are a bunch ...,1
3,"This is a silly game and can be frustrating, b...",1
4,This is a terrific game on any pad. Hrs of fun...,1


### Preprocess Text

Text preprocessing is a crucial step in performing sentiment analysis, as it helps to clean and normalize the text data, making it easier to analyze. The preprocessing step involves a series of techniques that help transform raw text data into a form you can use for analysis. Some common text preprocessing techniques include tokenization, stop word removal, stemming, and lemmatization.

Step 1: Convert all the text to lowercase.

Step 2: Remove punctuation.

Step 3: Tokenize the text into words.

Step 4: Remove the stopwords.

Step 5: Lemmatize the remaining tokens.


In [11]:
# Preprocessing the text

# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocesses the input text by converting it to lowercase, removing punctuation,
    tokenizing, removing stop words, and lemmatizing the remaining words.

    Args:
        text: The input text string.

    Returns:
        A string containing the preprocessed text.
    """
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

# Apply the preprocessing to every review; the cells below read this column
df['cleaned_reviews'] = df['reviewText'].apply(preprocess_text)


Getting the sentiment using NLTK's VADER sentiment analyzer.

In [13]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    """Return 1 if VADER assigns the text any positive score, otherwise 0."""
    # polarity_scores returns a dict with 'neg', 'neu', 'pos', and 'compound' keys
    scores = analyzer.polarity_scores(text)
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

# Applying get_sentiment on cleaned_reviews

df['sentiment'] = df['cleaned_reviews'].apply(get_sentiment)
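
Note that `scores['pos'] > 0` marks a review as positive whenever any part of it scores as positive, even if the negative portion is larger. A common alternative, sketched below, is to threshold VADER's `compound` score instead; the cutoff of 0.05 follows the convention suggested in VADER's own documentation. This is not what the notebook does, and the results shown here were produced with the `pos > 0` rule. The helper name and the example score dictionary are hypothetical.

```python
def label_from_compound(scores, threshold=0.05):
    """Return 1 (positive) when the compound score reaches the threshold, else 0."""
    return 1 if scores["compound"] >= threshold else 0


# Hypothetical score dictionary, in the shape returned by polarity_scores(),
# for a review that is mostly negative but contains a few positive words
mostly_negative = {"neg": 0.40, "neu": 0.45, "pos": 0.15, "compound": -0.62}

pos_rule_label = 1 if mostly_negative["pos"] > 0 else 0
compound_rule_label = label_from_compound(mostly_negative)

print(pos_rule_label)       # 1: the pos > 0 rule calls this review positive
print(compound_rule_label)  # 0: the compound rule calls it negative

# With the analyzer defined above, the rule could be applied as:
# df['cleaned_reviews'].apply(lambda t: label_from_compound(analyzer.polarity_scores(t)))
```

Whether the compound rule actually improves accuracy on this dataset would need to be checked by rerunning the evaluation.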


In [14]:
df

Unnamed: 0,reviewText,Positive,cleaned_reviews,sentiment
0,This is a one of the best apps acording to a b...,1,one best apps acording bunch people agree bomb...,1
1,This is a pretty good version of the game for ...,1,pretty good version game free lot different le...,1
2,this is a really cool game. there are a bunch ...,1,really cool game bunch level find golden egg s...,1
3,"This is a silly game and can be frustrating, b...",1,silly game frustrating lot fun definitely reco...,1
4,This is a terrific game on any pad. Hrs of fun...,1,terrific game pad hr fun grandkids love great ...,1
...,...,...,...,...
19995,this app is fricken stupid.it froze on the kin...,0,app fricken froze kindle wont allow place itea...,0
19996,Please add me!!!!! I need neighbors! Ginger101...,1,please add need neighbor thanks bunch awesome ...,1
19997,love it! this game. is awesome. wish it had m...,1,love game awesome wish free stuff house cost m...,1
19998,I love love love this app on my side of fashio...,1,love love love app side fashion story fight wo...,1


In [15]:
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are VADER predictions
confusion_matrix(df['Positive'], df['sentiment'])

array([[ 1088,  3679],
       [  559, 14674]])

In [17]:
print(f"Below is the classification report \n\n {classification_report(df['Positive'], df['sentiment'])}")

Below is the classification report 

               precision    recall  f1-score   support

           0       0.66      0.23      0.34      4767
           1       0.80      0.96      0.87     15233

    accuracy                           0.79     20000
   macro avg       0.73      0.60      0.61     20000
weighted avg       0.77      0.79      0.75     20000
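
The per-class metrics can be recomputed by hand from the four counts in the confusion matrix: 1088 reviews where label and prediction are both 0, 14674 where both are 1, 3679 negative reviews predicted positive, and 559 positive reviews predicted negative. The sketch below arranges them with rows as true labels and columns as predictions, which is the orientation scikit-learn produces for `confusion_matrix(y_true, y_pred)`. Precision and recall trade places if the true labels and predictions are swapped in the call, so the argument order matters.

```python
import numpy as np

# Counts from the confusion matrix above:
# rows = true label (Positive), columns = VADER prediction (sentiment)
cm = np.array([[1088, 3679],
               [559, 14674]])

true_totals = cm.sum(axis=1)       # reviews per true class
predicted_totals = cm.sum(axis=0)  # reviews per predicted class
correct = np.diag(cm)              # correctly classified reviews per class

precision = correct / predicted_totals
recall = correct / true_totals
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The low recall for class 0 is the number to watch: only 1088 of the 4767 truly negative reviews are caught, so the 0.79 accuracy mostly reflects the majority positive class.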

