**AI/ML with Python: Web Scraping & Sentiment Analysis**

**Sentiment Analysis Tool**




**Introduction to VADER:**

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is available in the Natural Language Toolkit (NLTK) library for Python. The tool looks up the sentiment of each lexical feature it encounters, adjusts those scores using rules that account for syntax and grammatical conventions, and produces an overall sentiment score.

VADER reports three component scores (positive, negative, and neutral), which give the proportion of the text that falls into each category, plus a compound score, which is a normalized, weighted composite in the range -1 (most negative) to +1 (most positive). The compound score is often used as the single measure of sentiment for a given text.
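
To make "normalized" concrete: in the reference VADER implementation, the summed valence of a text is squashed into the range [-1, 1] with x / sqrt(x² + alpha), where alpha defaults to 15. The sketch below reproduces that formula in isolation. The function name `normalize_compound` and the sample inputs are illustrative assumptions, not part of NLTK's public API.

```python
import math

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded summed valence into the range [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clip defensively, mirroring the bounds of the compound score
    return max(-1.0, min(1.0, norm))

# Illustrative raw sums (not taken from any real text)
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"raw={raw:+.1f} -> compound={normalize_compound(raw):+.4f}")
```

Large raw sums approach -1 or +1 without ever exceeding them, which is why compound scores of long, strongly worded texts tend to cluster near the extremes.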

We start by importing **nltk** (Natural Language Toolkit), which lets us use its **SentimentIntensityAnalyzer** class to obtain polarity scores (negative, neutral, positive, and compound). First, make sure **nltk** is installed on your local machine. If it isn't, open your terminal and run **pip install nltk**, as shown below.

After importing **nltk**, make sure the **vader_lexicon** resource is downloaded. Then we import the **SentimentIntensityAnalyzer** class from **nltk.sentiment**.

In [2]:
!pip3 install nltk
import nltk

# Download vader_lexicon (needed once; nltk skips the download if the package is already up to date)
nltk.download('vader_lexicon')

from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


With the setup complete, we can now analyze the sentiment of a few sentences. We use the **polarity_scores** method, which returns a dictionary with the **neg**, **neu**, **pos**, and **compound** scores for a given text. Run the following code to see the breakdown for each sample sentence.

In [3]:
# Sample texts that we will be using for sentiment analysis
texts = [
    "I love this product, it's absolutely amazing!",
    "This is the worst movie I have ever seen.",
    "I'm not sure how I feel about this new policy.",
    "Meh, it was okay, nothing special.",
    "Wow, this new update is fantastic! 😊",
    "I had an excellent day, but the weather was horrible."
]

for text in texts:
    scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"Scores: {scores}")
    print()

Text: I love this product, it's absolutely amazing!
Scores: {'neg': 0.0, 'neu': 0.318, 'pos': 0.682, 'compound': 0.862}

Text: This is the worst movie I have ever seen.
Scores: {'neg': 0.369, 'neu': 0.631, 'pos': 0.0, 'compound': -0.6249}

Text: I'm not sure how I feel about this new policy.
Scores: {'neg': 0.197, 'neu': 0.803, 'pos': 0.0, 'compound': -0.2411}

Text: Meh, it was okay, nothing special.
Scores: {'neg': 0.421, 'neu': 0.355, 'pos': 0.225, 'compound': -0.1675}

Text: Wow, this new update is fantastic! 😊
Scores: {'neg': 0.0, 'neu': 0.342, 'pos': 0.658, 'compound': 0.8268}

Text: I had an excellent day, but the weather was horrible.
Scores: {'neg': 0.337, 'neu': 0.496, 'pos': 0.167, 'compound': -0.5267}
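
The compound scores above can be turned into discrete labels. A minimal sketch, assuming the commonly used cut-offs of +0.05 and -0.05 (a convention, not something enforced by NLTK); the helper name `label_from_compound` is hypothetical, and the values are copied from the output above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the six sample sentences above
compounds = [0.862, -0.6249, -0.2411, -0.1675, 0.8268, -0.5267]
for compound in compounds:
    print(f"{compound:+.4f} -> {label_from_compound(compound)}")
```

Note that the last sentence, which mixes "excellent" and "horrible", comes out negative. This is consistent with VADER's rule that gives more weight to the clause following "but".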



In [4]:
!pip3 install afinn

from afinn import Afinn

# We initialize Afinn sentiment analyzer
afinn = Afinn()

Collecting afinn
  Downloading afinn-0.1.tar.gz (52 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: afinn
  Building wheel for afinn (setup.py) ... done
  Created wheel for afinn: filename=afinn-0.1-py3-none-any.whl size=53430 sha256=aaa4195997b171c787dfa21e1dde97775f4c5d6ad885436042818fcf5c2b9a19
  Stored in directory: /root/.cache/pip/wheels/b0/05/90/43f79196199a138fb486902fceca30a2d1b5228e6d2db8eb90
Successfully built afinn
Installing collected packages: afinn
Successfully installed afinn-0.1


In [5]:
# List of sentences to analyze
texts = [
    "I really love the new design of your website!",
    "I hate waiting in long queues.",
    "This is utterly fantastic!",
    "It's raining again. This weather is depressing.",
    "I'm not sure how I feel about the new policy.",
    "I love the absolutely wonderful performance, it was simply perfect and made me incredibly happy!"
]

# Analyze the sentiment of each sentence
for text in texts:
    score = afinn.score(text)
    print(f"Text: {text}\nScore: {score}\n")

Text: I really love the new design of your website!
Score: 3.0

Text: I hate waiting in long queues.
Score: -3.0

Text: This is utterly fantastic!
Score: 4.0

Text: It's raining again. This weather is depressing.
Score: -2.0

Text: I'm not sure how I feel about the new policy.
Score: 0.0

Text: I love the absolutely wonderful performance, it was simply perfect and made me incredibly happy!
Score: 13.0
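
AFINN works differently from VADER: each word in its list carries an integer valence between -5 and +5, and the score of a text is the sum of the valences of the words it matches. The score is therefore unbounded and grows with the number of sentiment words, which is why the last sentence above reaches 13.0. The sketch below imitates this with a tiny hand-made lexicon. `TOY_LEXICON`, `toy_score`, and `toy_score_per_word` are hypothetical names, and the word list is for illustration only, not the real AFINN list.

```python
import re

# Hypothetical mini-lexicon for illustration; the real AFINN list is far larger
TOY_LEXICON = {"love": 3, "hate": -3, "fantastic": 4, "depressing": -2}

def tokenize(text):
    """Lowercase the text and keep alphabetic words (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

def toy_score(text, lexicon=TOY_LEXICON):
    """Sum the valences of all words found in the lexicon."""
    return float(sum(lexicon.get(word, 0) for word in tokenize(text)))

def toy_score_per_word(text, lexicon=TOY_LEXICON):
    """Divide the summed score by the word count to compare texts of different length."""
    words = tokenize(text)
    return toy_score(text, lexicon) / len(words) if words else 0.0

for text in ["I hate waiting in long queues.", "This is utterly fantastic!"]:
    print(f"{text} -> sum={toy_score(text)}, per word={toy_score_per_word(text):.2f}")
```

Dividing by the word count is one simple way to compare short and long texts on the same scale; whether that is appropriate depends on the application.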



# Sentiment Analysis with Naive Bayes Classifier

In [6]:
!pip3 install scikit-learn


import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import re
import random



Download the **movie_reviews** dataset from the nltk library, which is a collection of movie reviews that have been categorized as either positive or negative.

It contains 2,000 movie reviews, split evenly between positive and negative. This balance makes the dataset well suited to training and testing sentiment classifiers, here a Naive Bayes classifier that predicts whether a new movie review is positive or negative.

In [7]:
# Download the movie review dataset from nltk
nltk.download('movie_reviews')

# Load the reviews as (text, category) pairs for modeling
documents = [(" ".join(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Split the dataset into the text and labels
texts, labels = zip(*documents)

# The handle_negation function is designed to preprocess text to better handle negations when performing sentiment analysis.
# Negation words like "not," "no," "never," and "cannot" can completely change the sentiment of the phrase that follows them.
def handle_negation(text):
    # A simple way to handle negation: attach "not_" to words following a negation word
    negation_re = re.compile(r"\b(not|no|never|cannot)\b[\s]+([a-z]+)", re.IGNORECASE)
    return negation_re.sub(lambda match: f"{match.group(1)}_{match.group(2)}", text)

# Apply the negation handling to your texts
texts = [handle_negation(text) for text in texts]

# This flattens a list of lists (or a mix of lists and strings) into a list of strings, needed for text processing tasks such as vectorization
# texts = [' '.join(text) if isinstance(text, list) else text for text in texts]

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
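
To see what the negation step does, the same function can be run on a single sentence. The function body is copied from the cell above; the sample sentence is made up for illustration.

```python
import re

def handle_negation(text):
    # Join a negation word to the word that follows it, e.g. "not good" -> "not_good"
    negation_re = re.compile(r"\b(not|no|never|cannot)\b[\s]+([a-z]+)", re.IGNORECASE)
    return negation_re.sub(lambda match: f"{match.group(1)}_{match.group(2)}", text)

print(handle_negation("the plot was not good and i would never recommend it"))
# prints: the plot was not_good and i would never_recommend it
```

Because the joined token contains only word characters, the vectorizer later treats "not_good" as a single feature distinct from "good". The rule is deliberately simple: it only joins the single next word and does not cover contractions such as "don't" or "wasn't".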


In [8]:
# Split data into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Initialize a CountVectorizer for text vectorization
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the training data
train_vectors = vectorizer.fit_transform(train_texts)

# Transform the test data
test_vectors = vectorizer.transform(test_texts)
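
The setting `ngram_range=(1, 2)` means that both single words and pairs of adjacent words become features. The sketch below counts unigrams and bigrams by hand to show what that feature space looks like. It is a simplified stand-in, not CountVectorizer's implementation; `ngram_counts` is a hypothetical helper, and the token pattern (two or more word characters) is chosen to mirror the vectorizer's default.

```python
import re
from collections import Counter

def ngram_counts(text, n_min=1, n_max=2):
    """Count all word n-grams of length n_min..n_max in a text."""
    # Keep tokens of at least two word characters, dropping punctuation
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("the movie was not_good , not_good at all")
for ngram, count in sorted(counts.items()):
    print(f"{count}  {ngram}")
```

Bigrams such as "was not_good" let the classifier pick up short phrases, at the cost of a much larger vocabulary.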

In [9]:
# Initialize the Multinomial Naive Bayes classifier
classifier = MultinomialNB()

# Train the classifier
classifier.fit(train_vectors, train_labels)

In [10]:
# Predict sentiments for test data
predictions = classifier.predict(test_vectors)

# Calculate accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.8180
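
A single accuracy figure does not show whether the errors fall mostly on positive or on negative reviews. A minimal sketch of a per-class breakdown in plain Python follows. The label lists here are hypothetical placeholders; with the notebook's variables you would pass `test_labels` and `predictions` instead.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

def recall_per_class(true_labels, predicted_labels):
    """Fraction of each true class that the model labeled correctly."""
    counts = confusion_counts(true_labels, predicted_labels)
    totals = Counter(true_labels)
    return {label: counts[(label, label)] / totals[label] for label in totals}

# Hypothetical labels for illustration only
y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

print(confusion_counts(y_true, y_pred))
print(recall_per_class(y_true, y_pred))
```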


In [11]:
# Function to predict sentiment of a new review
def predict_sentiment(new_text):
    new_vector = vectorizer.transform([new_text])
    pred = classifier.predict(new_vector)
    return pred[0]

# Test the function
sample_review = "I absolutely loved this movie, the storyline was engaging from start to finish!"
print(f"The sentiment predicted by the model is: {predict_sentiment(sample_review)}")

The sentiment predicted by the model is: pos


# Sentiment Analysis with Logistic Regression


In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Use TF-IDF Vectorization instead of simple counts
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_df=0.8)
train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

# Initialize the Logistic Regression classifier
logistic_classifier = LogisticRegression()

# Train the classifier on the training data and labels
logistic_classifier.fit(train_vectors, train_labels)

# Predict sentiments for test data using the trained classifier
logistic_predictions = logistic_classifier.predict(test_vectors)

# Calculate accuracy of the classifier on the test data
logistic_accuracy = accuracy_score(test_labels, logistic_predictions)

print(f"Logistic Regression Accuracy: {logistic_accuracy:.4f}")

Logistic Regression Accuracy: 0.8320
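
TF-IDF replaces raw counts with weights that shrink for terms appearing in many documents. The sketch below computes such weights by hand, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches scikit-learn's default settings as far as I am aware. The function `tfidf` and the three toy documents are hypothetical, and tokenization is reduced to whitespace splitting.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (idf weights, L2-normalized TF-IDF vectors) for a list of texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors

docs = ["great film great cast", "dull film", "great fun"]
idf, vectors = tfidf(docs)
for term in ("great", "film", "dull"):
    print(f"idf({term}) = {idf[term]:.4f}")
print(vectors[1])
```

In the cell above, `min_df=5` and `max_df=0.8` additionally drop terms that occur in fewer than 5 documents or in more than 80% of them, which keeps the trigram vocabulary manageable.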


In [13]:
# Function to predict sentiment of a new review
def logistic_predict_sentiment(new_text):
    new_vector = vectorizer.transform([new_text])
    pred = logistic_classifier.predict(new_vector)
    return pred[0]

# Test the function
sample_review = "The sun is shining and I'm so absolutely happy today!"
print(f"The sentiment predicted by the model is: {logistic_predict_sentiment(sample_review)}")

The sentiment predicted by the model is: pos


In [14]:
# Check the prediction probability for a sample text
sample_prob = logistic_classifier.predict_proba(vectorizer.transform([sample_review]))
print("Prediction probability for the sample text: ", sample_prob)

Prediction probability for the sample text:  [[0.48429751 0.51570249]]
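
The two numbers are the probabilities for each class, in the order given by `logistic_classifier.classes_`. The sketch below pairs them with labels, assuming that order is `['neg', 'pos']` (scikit-learn sorts class labels, so this should hold here, but it is worth checking in the notebook). The probabilities are copied from the output above; `describe_prediction` is a hypothetical helper.

```python
# Assumed order of logistic_classifier.classes_ (verify in the notebook)
classes = ["neg", "pos"]
# Probabilities copied from the output above
sample_prob = [[0.48429751, 0.51570249]]

def describe_prediction(classes, probabilities):
    """Return the most likely label and its lead over the runner-up."""
    paired = dict(zip(classes, probabilities))
    ranked = sorted(paired, key=paired.get, reverse=True)
    margin = paired[ranked[0]] - paired[ranked[1]]
    return ranked[0], margin

label, margin = describe_prediction(classes, sample_prob[0])
print(f"Predicted: {label} (margin over the other class: {margin:.4f})")
```

A margin this small means the model is close to undecided. That is plausible here, since the sample sentence is about the weather while the model was trained on movie reviews.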


# Step-by-step instructions to try out the above concepts on the twitter_samples dataset

In [16]:
# Download the twitter_samples dataset
nltk.download('twitter_samples')

# Import twitter_samples dataset
from nltk.corpus import twitter_samples

# Load positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# Creating labelled data
documents = []

# Adding positive tweets
for tweet in positive_tweets:
    documents.append((tweet, "positive"))

# Adding negative tweets
for tweet in negative_tweets:
    documents.append((tweet, "negative"))

# Split the dataset into the text and labels
texts, labels = zip(*documents)

# Split data into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Vectorize the text with TF-IDF: fit on the training data, then transform the test data
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_df=0.8)
train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

# Initialize the Logistic Regression classifier
logistic_classifier = LogisticRegression()

# Train the classifier
logistic_classifier.fit(train_vectors, train_labels)


# Predict sentiments for test data using the trained classifier
logistic_predictions = logistic_classifier.predict(test_vectors)

# Test your results with the sample tweets below
sample_tweets = [
    "Absolutely loving the new update! Everything runs so smoothly and efficiently now. Great job! 👍",
    "Had an amazing time at the beach today with friends. The weather was perfect! ☀️ #blessed",
    "Extremely disappointed with the service at the restaurant tonight. Waited over an hour and still got the order wrong. 😡",
    "Feeling really let down by the season finale. It was so rushed and left too many unanswered questions. 😞 #TVShow",
    "My phone keeps crashing after the latest update. So frustrating dealing with these glitches! 😠",
]

# Predict the sentiment of each sample tweet
for sentence in sample_tweets:
    # Transform the sample tweet using the vectorizer
    sample_vector = vectorizer.transform([sentence])
    # Predict the sentiment of the sample tweet
    sentiment = logistic_classifier.predict(sample_vector)
    print(f"The sentiment predicted by the model is: {sentiment[0]}")



[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


The sentiment predicted by the model is: positive
The sentiment predicted by the model is: positive
The sentiment predicted by the model is: negative
The sentiment predicted by the model is: negative
The sentiment predicted by the model is: negative
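
Tweets often contain URLs, @mentions, and hashtags that add noise to the vocabulary. The pipeline above feeds raw tweets straight into the vectorizer; as an optional extra step, the sketch below shows one possible normalization. It is an assumption-laden illustration, not part of the original workflow: `normalize_tweet` and its regular expressions are hypothetical, and whether such cleaning improves accuracy has to be checked empirically.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def normalize_tweet(tweet):
    """Strip URLs and @mentions, keep hashtag words, collapse whitespace, lowercase."""
    tweet = URL_RE.sub("", tweet)
    tweet = MENTION_RE.sub("", tweet)
    tweet = tweet.replace("#", "")
    return " ".join(tweet.split()).lower()

print(normalize_tweet("@shop Loving the new update! https://example.com #blessed"))
# prints: loving the new update! blessed
```

If used, the same function should be applied to the training tweets, the test tweets, and any new tweet before calling `vectorizer.transform`.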
