### Sentiment analysis

Sentiment analysis uses natural language processing and machine learning to identify and extract subjective information from text, such as opinions, emotions, and attitudes. In its most common form, it classifies a piece of text as expressing positive, negative, or neutral sentiment.

Sentiment analysis is applied in marketing, customer service, politics, and finance. For example, businesses analyze customer feedback and reviews to improve their products and services, while political campaigns track public opinion and adjust their messaging accordingly.

There are three broad families of techniques: rule-based systems, machine learning, and deep learning. Rule-based systems apply predefined rules and lexicons (word lists annotated with sentiment scores) to the text, while machine learning techniques learn a sentiment classifier from labeled examples. Deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are used for more complex text such as social media posts and reviews.
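
To make the rule-based idea concrete, here is a minimal sketch of a lexicon scorer. The tiny `LEXICON`, the `NEGATIONS` set, and the functions `lexicon_score` and `lexicon_label` are illustrative assumptions, not part of any library; real lexicons contain thousands of scored entries and many more rules.

```python
import re

# Hypothetical toy lexicon: word -> valence (positive > 0, negative < 0).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "worst": -3.0}
# Hypothetical negation words that flip the sign of the next scored word.
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum the valences of known words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        valence = LEXICON.get(token, 0.0)
        if valence:
            total += -valence if negate else valence
            negate = False  # a negation applies to one scored word only
    return total


def lexicon_label(text):
    """Map the summed valence to 'pos', 'neg', or 'neu'."""
    score = lexicon_score(text)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neu"


print(lexicon_label("This is a good movie"))      # pos
print(lexicon_label("This is not a good movie"))  # neg
print(lexicon_label("This is a movie"))           # neu
```

A scorer like this needs no training data, but it only knows the words and rules it was given, which is why sarcasm and domain-specific vocabulary tend to trip it up.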

Sentiment analysis has limitations: sarcasm, irony, and cultural nuances in language are hard to detect. However, with the growing availability of large text datasets and advances in machine learning and deep learning, it is becoming more accurate and more useful across applications.

### NLTK

NLTK (Natural Language Toolkit) is an open-source Python library that provides tools and resources for working with human language data. It is widely used in academia and industry for natural language processing (NLP) tasks such as text classification, sentiment analysis, and information extraction.

NLTK provides a wide range of modules and tools for processing text data, including tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and named entity recognition. It also includes corpora and lexicons for various languages and tasks, such as the Brown Corpus and WordNet.

One of the main advantages of NLTK is its ease of use and flexibility. It provides a user-friendly interface for performing NLP tasks and allows users to easily customize and extend its functionality. It also includes a wide range of documentation, tutorials, and examples, making it accessible to users with varying levels of experience in NLP and programming.

Overall, NLTK is a practical choice for working with human language data in Python. The cells below use its VADER sentiment analyzer to score two short example sentences and then a set of Amazon reviews.

In [1]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [5]:
a = 'this is a good movie ever seen'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.633, 'pos': 0.367, 'compound': 0.4404}

In [6]:
b = 'this is a worst movie ever seen'
sid.polarity_scores(b)

{'neg': 0.451, 'neu': 0.549, 'pos': 0.0, 'compound': -0.6249}
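
The `compound` value is a normalized version of the summed word valences. As a rough sketch, assuming VADER normalizes a valence sum `x` as `x / sqrt(x**2 + alpha)` with `alpha = 15`, and assuming the lexicon assigns `good` a valence of 1.9 and `worst` a valence of -3.1 (these figures are not shown in this notebook and are stated here as assumptions), the two compound scores above can be reproduced by hand:

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# Assumed lexicon valences: 'good' = 1.9, 'worst' = -3.1.
good_compound = round(normalize(1.9), 4)
worst_compound = round(normalize(-3.1), 4)

print(good_compound)   # 0.4404
print(worst_compound)  # -0.6249
```

Each example sentence contains a single sentiment-bearing word, so the sum is just that word's valence. Longer texts accumulate more valence, which is why full reviews often end up with compound scores close to 1 or -1.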

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [14]:
df = pd.read_csv("data/amazonreviews.tsv", delimiter = '\t')

In [15]:
df.head()

|   | label | review |
|---|-------|--------|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |


In [18]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [20]:
df.isnull().sum()

label     0
review    0
dtype: int64

In [21]:
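# Remove rows with missing (NaN) values
df.dropna(inplace=True)

# Collect the index of every review that contains only whitespace
blanks = []
for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

# Drop the whitespace-only reviews
df.drop(blanks, inplace=True)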
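
One subtlety of a whitespace check based on `str.isspace()` is that it returns `False` for an empty string, so a truly empty review would not be flagged by it. With `pd.read_csv` defaults, empty fields are normally parsed as NaN and removed by `dropna`, so this is unlikely to matter here. The sketch below shows the difference using a hypothetical helper, `is_blank`, which is not part of the notebook:

```python
def is_blank(value):
    """Return True for strings that are empty or contain only whitespace."""
    return isinstance(value, str) and not value.strip()


print("   ".isspace())    # True: whitespace-only string
print("".isspace())       # False: the empty string has no characters to test
print(is_blank(""))       # True
print(is_blank(" \t\n"))  # True
print(is_blank("ok"))     # False
print(is_blank(None))     # False: non-strings are never counted as blank
```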

In [22]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [24]:
# Sanity check: score the first review
sid.polarity_scores(df.loc[0, 'review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [26]:
# Score every review; each result is a dict with 'neg', 'neu', 'pos' and 'compound'
df['scores'] = df['review'].apply(sid.polarity_scores)
df.head()

|   | label | review | scores |
|---|-------|--------|--------|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... |


In [28]:
# Pull the overall 'compound' score out of each scores dict
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df.head()

|   | label | review | scores | compound |
|---|-------|--------|--------|----------|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 |


In [29]:
# Map the compound score to a binary label; a score of exactly 0 counts as 'pos'
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')
df.head()

|   | label | review | scores | compound | comp_score |
|---|-------|--------|--------|----------|------------|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 | pos |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 | pos |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 | pos |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 | pos |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 | pos |
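
The `c >= 0` threshold forces every review into `pos` or `neg`, which suits this two-class dataset. When a neutral class is wanted, a commonly cited convention for VADER is to treat compound scores strictly between -0.05 and 0.05 as neutral. A minimal sketch, where the function name `label_from_compound` and its `band` parameter are illustrative assumptions rather than part of NLTK:

```python
def label_from_compound(compound, band=0.05):
    """Three-way label: 'pos', 'neg', or 'neu' inside the neutral band."""
    if compound >= band:
        return "pos"
    if compound <= -band:
        return "neg"
    return "neu"


print(label_from_compound(0.9454))   # pos
print(label_from_compound(-0.6249))  # neg
print(label_from_compound(0.0))      # neu
```

With this labeling, a review whose compound score is exactly 0 (no recognized sentiment words) is reported as neutral instead of being counted as positive.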


In [32]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [33]:
# Print the classification report (true labels first, predicted labels second)
print(classification_report(df['label'], df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.52      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [34]:
# Rows are true labels, columns are predicted labels, both in the order neg, pos
print(confusion_matrix(df['label'], df['comp_score']))

[[2629 2468]
 [ 435 4468]]


In [36]:
accuracy_score(df['label'], df['comp_score'])

0.7097
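
The report, the confusion matrix, and the accuracy are three views of the same counts. The sketch below recomputes the per-class metrics from the printed matrix using plain Python, assuming the scikit-learn layout in which rows are true labels and columns are predicted labels, both ordered `neg`, `pos`. The variable names are illustrative.

```python
labels = ["neg", "pos"]
# Counts copied from the confusion matrix output above.
matrix = [[2629, 2468],
          [435, 4468]]

total = sum(sum(row) for row in matrix)
# Accuracy: correctly classified reviews (the diagonal) over all reviews.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, name in enumerate(labels):
    predicted = sum(row[i] for row in matrix)  # column sum: predicted as this class
    actual = sum(matrix[i])                    # row sum: truly this class (support)
    precision = matrix[i][i] / predicted
    recall = matrix[i][i] / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(accuracy)  # 0.7097
print(report)    # {'neg': (0.86, 0.52, 0.64), 'pos': (0.64, 0.91, 0.75)}
```

The numbers show where the errors concentrate: 2468 of the 5097 negative reviews are labeled `pos`, which explains the low recall for `neg` and the low precision for `pos`.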