# 6.2 Sentiment Analysis

## 6.2.2 Sentiment analysis with ChatGPT API

### Let’s ask ChatGPT 6.2

Query: How to perform sentiment analysis of customer reviews using your API?

Output: ChatGPT produced working code that generates a text completion and extracts the sentiment from the output. The code was manually adapted to load custom data and to store the output in a more convenient way. The input file is assumed to be stored in the same folder as this notebook. For the purpose of our analysis, only the first 500 reviews with a non-empty comment from the example DataFrame were analysed. The first code snippet uses the "text-davinci-003" engine and the second uses the "text-davinci-002" engine. The results of both engines are compared below with the results of the sentiment analysis based on the basic keyword search.

In [2]:
# List of reviews to analyze (adapted manually)
import pandas as pd
df = pd.read_csv('olist_order_reviews_dataset.csv')
df = df.dropna(subset = ['review_comment_message'])[0:500]
reviews = list(df["review_comment_message"])

In [None]:
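# Code snippet that utilizes the text-davinci-003 engine
# (openai.Completion is the legacy interface of the openai package, available in versions before 1.0)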

import openai

# Replace 'your_openai_api_key' with your actual OpenAI API key
openai.api_key = 'your_openai_api_key'

def get_sentiment(review):
    prompt = f"The sentiment of this review is: {review} -> "
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=10,
        n=1,
        stop=None,
        temperature=0.5,
    )
    # Lower-case the completion so that "Positive" and "positive" are treated alike
    completion = response.choices[0].text.strip().lower()
    if "positive" in completion:
        return "positive"
    elif "neutral" in completion:
        return "neutral"
    elif "negative" in completion:
        return "negative"
    else:
        return "unknown"

# Analyze the reviews and store the output (manually adapted)
sentiments = []
for review in reviews:
    sentiments.append(get_sentiment(review))

df["GPT_003_sentiment"] = sentiments
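
The label extraction inside `get_sentiment` relies on plain substring checks. A minimal sketch of a slightly more defensive variant is shown below; it is not part of ChatGPT's answer. The function name `parse_sentiment` and the Portuguese label spellings are assumptions added for illustration, on the premise that the model may answer with different capitalisation or in the language of the review.

```python
# Hypothetical helper (not part of the original code): maps a raw completion to a label.
# The Portuguese spellings are an assumption about what the model might return.
LABEL_KEYWORDS = {
    "positive": ("positive", "positivo", "positiva"),
    "neutral": ("neutral", "neutro", "neutra"),
    "negative": ("negative", "negativo", "negativa"),
}

def parse_sentiment(completion):
    """Return 'positive', 'neutral', 'negative' or 'unknown' for a raw completion string."""
    text = completion.strip().lower()
    # Labels are checked in the same order as in get_sentiment
    for label, label_keywords in LABEL_KEYWORDS.items():
        if any(keyword in text for keyword in label_keywords):
            return label
    return "unknown"

print(parse_sentiment(" Positive\n"))
print(parse_sentiment("Negativo"))
print(parse_sentiment("mixed feelings"))
```

Keeping the parsing separate from the API call also makes it possible to check the label logic without spending API credits.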

In [None]:
# Code snippet that utilizes the text-davinci-002 engine

import openai

# Replace 'your_openai_api_key' with your actual OpenAI API key
openai.api_key = 'your_openai_api_key'

def get_sentiment(review):
    prompt = f"The sentiment of this review is: {review} -> "
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=10,
        n=1,
        stop=None,
        temperature=0.5,
    )
    # Lower-case the completion so that "Positive" and "positive" are treated alike
    completion = response.choices[0].text.strip().lower()
    if "positive" in completion:
        return "positive"
    elif "neutral" in completion:
        return "neutral"
    elif "negative" in completion:
        return "negative"
    else:
        return "unknown"

# Analyze the reviews and store the output (manually adapted)
sentiments = []
for review in reviews:
    sentiments.append(get_sentiment(review))

df["GPT_002_sentiment"] = sentiments
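
Both loops call the API once per review, so a single failed request stops the run partway through the 500 reviews. The sketch below shows one possible way to retry a failing call with exponential backoff. The helper name `call_with_retries`, its parameters and the simulated failure are assumptions made for illustration; which exceptions the API client raises is deliberately not assumed, so the helper retries on any exception.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for get_sentiment that fails twice before succeeding (demonstration only)
calls = {"count": 0}

def flaky_sentiment(review):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient error")
    return "positive"

result = call_with_retries(flaky_sentiment, "Ótimo produto", base_delay=0.01)
print(result, calls["count"])
```

In the loops above, the corresponding call would be `sentiments.append(call_with_retries(get_sentiment, review))`.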

In [None]:
# Simple keyword analysis from section 5.6.2, run on the first 500 reviews.
keywords = [
    "excelente", "ótimo", "maravilhoso", "incrível", "fantástico",
    "perfeito", "bom", "eficiente", "durável", "confiável",
    "rápido", "custo-benefício", "recomendo", "satisfeito",
    "surpreendente", "confortável", "fácil de usar", "funcional",
    "melhor", "vale a pena"
]

# Second version of the keyword search function proposed by ChatGPT that copes with NaNs in the input.
def is_positive(review, keywords):
    if not isinstance(review, str):
        return False

    for keyword in keywords:
        if keyword.lower() in review.lower():
            return True
    return False

# Applying the function to the test DataFrame (adapted).
df['keyword_sentiment'] = df['review_comment_message'].apply(lambda x: is_positive(x, keywords))
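
Because `is_positive` performs a plain substring match, it cannot distinguish a keyword from its negation or from a longer word that merely contains it. The short sketch below restates the function in a compact form so that it runs on its own; the shortened keyword list and the example sentences are made up for illustration and do not come from the dataset.

```python
def is_positive(review, keywords):
    # Non-string input (e.g. NaN for a missing review) is never positive
    if not isinstance(review, str):
        return False
    return any(keyword.lower() in review.lower() for keyword in keywords)

keywords = ["recomendo", "bom"]

examples = [
    "Recomendo a todos",        # genuine positive
    "Não recomendo esta loja",  # negated keyword still matches
    "Comprei um bombom",        # 'bom' matches inside 'bombom'
    float("nan"),               # missing review
]
results = [is_positive(text, keywords) for text in examples]
print(results)
```

The second and third sentences are flagged as positive, which is one reason to expect false positives from the keyword approach.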

In [None]:
###
# Assessing quality of the sentiment analysis based on keywords.

# Extract records with positive reviews assessed by sentiment analysis and by review scores.
posrev_senti = df[df['keyword_sentiment']==True]
posrev_score = df[(df['review_score']==5)|(df['review_score']==4)]

# Perform set operations to determine true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
TP = pd.merge(posrev_senti, posrev_score)
FP = posrev_senti[posrev_senti["review_id"].isin(posrev_score["review_id"]) == False]
FN = posrev_score[posrev_score["review_id"].isin(posrev_senti["review_id"]) == False]
TN = df[(df["review_id"].isin(posrev_senti["review_id"]) == False) & (df["review_id"].isin(posrev_score["review_id"]) == False)]

# Calculate sensitivity and specificity
print("Quality for basic keyword search:")
print("Sensitivity: ", round(len(TP) / (len(TP) + len(FN)),2))
print("Specificity: ", round(len(TN) / (len(TN) + len(FP)),2))

###
# Assessing quality of the sentiment analysis based on ChatGPT language model text-davinci-003.

# Extract records with positive reviews assessed by sentiment analysis and by review scores.
posrev_senti = df[df['GPT_003_sentiment']=='positive']
posrev_score = df[(df['review_score']==5)|(df['review_score']==4)]

# Perform set operations to determine true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
TP = pd.merge(posrev_senti, posrev_score)
FP = posrev_senti[posrev_senti["review_id"].isin(posrev_score["review_id"]) == False]
FN = posrev_score[posrev_score["review_id"].isin(posrev_senti["review_id"]) == False]
TN = df[(df["review_id"].isin(posrev_senti["review_id"]) == False) & (df["review_id"].isin(posrev_score["review_id"]) == False)]

# Calculate sensitivity and specificity
print("Quality for GPT direct sentiment analysis with text-davinci-003:")
print("Sensitivity: ", round(len(TP) / (len(TP) + len(FN)),2))
print("Specificity: ", round(len(TN) / (len(TN) + len(FP)),2))

###
# Assessing quality of the sentiment analysis based on ChatGPT language model text-davinci-002.

# Extract records with positive reviews assessed by sentiment analysis and by review scores.
posrev_senti = df[df['GPT_002_sentiment']=='positive']
posrev_score = df[(df['review_score']==5)|(df['review_score']==4)]

# Perform set operations to determine true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
TP = pd.merge(posrev_senti, posrev_score)
FP = posrev_senti[posrev_senti["review_id"].isin(posrev_score["review_id"]) == False]
FN = posrev_score[posrev_score["review_id"].isin(posrev_senti["review_id"]) == False]
TN = df[(df["review_id"].isin(posrev_senti["review_id"]) == False) & (df["review_id"].isin(posrev_score["review_id"]) == False)]

# Calculate sensitivity and specificity
print("Quality for GPT direct sentiment analysis with text-davinci-002:")
print("Sensitivity: ", round(len(TP) / (len(TP) + len(FN)),2))
print("Specificity: ", round(len(TN) / (len(TN) + len(FP)),2))
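
The three evaluation blocks above repeat the same steps. A minimal sketch of a reusable helper is given below, assuming the same definition of a truly positive review (a `review_score` of 4 or 5). The function name `sensitivity_specificity` and the toy DataFrame are illustrative assumptions. Boolean masks replace the merge and `isin` operations and should count the same four groups as long as every row is unique.

```python
import pandas as pd

def sensitivity_specificity(df, predicted_positive, score_col="review_score"):
    """Return (sensitivity, specificity) of a boolean prediction against review scores.

    predicted_positive: boolean Series aligned with df, True where the method says 'positive'.
    A review is treated as actually positive when its score is 4 or 5.
    """
    actual_positive = df[score_col].isin([4, 5])
    tp = int((predicted_positive & actual_positive).sum())
    fp = int((predicted_positive & ~actual_positive).sum())
    fn = int((~predicted_positive & actual_positive).sum())
    tn = int((~predicted_positive & ~actual_positive).sum())
    return round(tp / (tp + fn), 2), round(tn / (tn + fp), 2)

# Toy data for illustration only (not taken from the Olist dataset)
toy = pd.DataFrame({
    "review_score": [5, 4, 5, 1, 2, 3],
    "GPT_003_sentiment": ["positive", "positive", "negative",
                          "negative", "positive", "neutral"],
})
sens, spec = sensitivity_specificity(toy, toy["GPT_003_sentiment"] == "positive")
print("Sensitivity:", sens)
print("Specificity:", spec)
```

With the notebook's DataFrame, the call for the keyword method would be `sensitivity_specificity(df, df['keyword_sentiment'])`. The helper raises a `ZeroDivisionError` if the data contains no actually positive or no actually negative reviews.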

In [None]:
# Printing out the number of positive, negative and unknown/neutral annotations
print("\nReview score:")
print(df["review_score"].value_counts())
print("\nKeyword sentiment analysis:")
print(df["keyword_sentiment"].value_counts())
print("\nGPT_003 sentiment analysis:")
print(df["GPT_003_sentiment"].value_counts())
print("\nGPT_002 sentiment analysis:")
print(df["GPT_002_sentiment"].value_counts())

## 6.2.3 Sentiment analysis with a pretrained model
### Let’s ask ChatGPT 6.3

Query: How to perform sentiment analysis of Portuguese customer reviews in Python?

Output: The code proposed by ChatGPT required minor debugging, but after the error message was pasted into ChatGPT, it came up with a correct solution. The code works well.

In [None]:
import spacy
import unidecode
import re
import nltk
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
from translate import Translator

nltk.download('movie_reviews')
nltk.download('punkt')

nlp = spacy.load("pt_core_news_sm")

def preprocess(text):
    text = unidecode.unidecode(text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    doc = nlp(text)
    tokens = [token.text for token in doc]
    return tokens

class PortugueseTextBlob(TextBlob):
    def __init__(self, *args, **kwargs):
        super(PortugueseTextBlob, self).__init__(*args, **kwargs)
        # State the source language explicitly instead of relying on the library default
        self.translator = Translator(from_lang="pt", to_lang="en")

    def translate_to_english(self, text):
        try:
            return self.translator.translate(text)
        except Exception as e:
            print(f"Translation error: {e}")
            return text

    def sentiment(self, translated_text):
        tb = TextBlob(translated_text, analyzer=NaiveBayesAnalyzer())
        return tb.sentiment
    
review = "Eu realmente gostei deste produto!"
tokens = preprocess(review)
review_clean = ' '.join(tokens)
pt_blob = PortugueseTextBlob(review_clean)
translated_text = pt_blob.translate_to_english(review_clean)
sentiment = pt_blob.sentiment(translated_text)

print("Sentiment:", sentiment)

### Let’s ask ChatGPT 6.4
Query: Adapt this code to analyze reviews stored in a pandas DataFrame column. The output should be stored as a new DataFrame column with values: "positive", "negative" or "neutral".

Output: The code works well.

In [None]:
import pandas as pd

def analyze_sentiment(review):
    tokens = preprocess(review)
    review_clean = ' '.join(tokens)
    pt_blob = PortugueseTextBlob(review_clean)
    translated_text = pt_blob.translate_to_english(review_clean)
    sentiment = pt_blob.sentiment(translated_text)

    if sentiment.classification == 'pos' and sentiment.p_pos >= 0.6:
        return 'positive'
    elif sentiment.classification == 'neg' and sentiment.p_neg >= 0.6:
        return 'negative'
    else:
        return 'neutral'

# Loading testing data - added manually
df = pd.read_csv('olist_order_reviews_dataset.csv')
df = df.dropna(subset = ['review_comment_message'])[0:500]

# Running the analysis - adapted manually
df['pretrained_sentiment'] = df['review_comment_message'].apply(analyze_sentiment)
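
The mapping from the analyzer output to the three labels can be checked in isolation. The sketch below assumes, as the code above does, that the sentiment object exposes `classification`, `p_pos` and `p_neg`. The `Sentiment` namedtuple is a stand-in defined here so that the example runs without TextBlob, and the function name `to_label` and its `threshold` parameter are hypothetical.

```python
from collections import namedtuple

# Stand-in for the analyzer result; field names follow the attributes used in analyze_sentiment
Sentiment = namedtuple("Sentiment", ["classification", "p_pos", "p_neg"])

def to_label(sentiment, threshold=0.6):
    """Map a classification with class probabilities to 'positive', 'negative' or 'neutral'."""
    if sentiment.classification == "pos" and sentiment.p_pos >= threshold:
        return "positive"
    if sentiment.classification == "neg" and sentiment.p_neg >= threshold:
        return "negative"
    # Low-confidence predictions of either class fall through to 'neutral'
    return "neutral"

labels = [
    to_label(Sentiment("pos", 0.9, 0.1)),
    to_label(Sentiment("pos", 0.55, 0.45)),
    to_label(Sentiment("neg", 0.3, 0.7)),
]
print(labels)
```

Note that "neutral" here means the classifier was not confident in either class, not that neutral wording was detected in the review.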

In [None]:
###
# Assessing quality of the sentiment analysis done with a pretrained language model.

# Extract records with positive reviews assessed by sentiment analysis and by review scores.
posrev_senti = df[df['pretrained_sentiment']=='positive']
posrev_score = df[(df['review_score']==5)|(df['review_score']==4)]

# Perform set operations to determine true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
TP = pd.merge(posrev_senti, posrev_score)
FP = posrev_senti[posrev_senti["review_id"].isin(posrev_score["review_id"]) == False]
FN = posrev_score[posrev_score["review_id"].isin(posrev_senti["review_id"]) == False]
TN = df[(df["review_id"].isin(posrev_senti["review_id"]) == False) & (df["review_id"].isin(posrev_score["review_id"]) == False)]

# Calculate sensitivity and specificity
print("Quality for the sentiment analysis done with a pretrained model:")
print("Sensitivity: ", round(len(TP) / (len(TP) + len(FN)),2))
print("Specificity: ", round(len(TN) / (len(TN) + len(FP)),2))

# 6.3 Text summarization
## 6.3.2 Summarizing text with dedicated libraries
### Let’s ask ChatGPT 6.7

Query: Provide Python code to generate summaries of very short customer reviews in Portuguese. Use frequency-based approach.

Output: The code works well. The input was adapted manually to remove empty records.

In [None]:
import pandas as pd
import nltk
import string
from collections import Counter

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Sample data (manually adapted to remove empty records)
df = pd.read_csv('olist_order_reviews_dataset.csv')
df = df.dropna(subset = ['review_comment_message'])

# Portuguese stopwords, loaded once as a set for fast membership tests
PT_STOPWORDS = set(nltk.corpus.stopwords.words('portuguese'))

# Function to tokenize and remove stopwords and punctuation
def preprocess(text):
    tokens = nltk.word_tokenize(text.lower(), language='portuguese')
    tokens = [token for token in tokens if token not in string.punctuation and token not in PT_STOPWORDS]
    return tokens

# Function to create word frequency distribution
def word_frequency(tokens):
    frequency = Counter(tokens)
    return frequency

# Function to summarize short reviews
def summarize_reviews(text, num_keywords=3):
    tokens = preprocess(text)
    frequency = word_frequency(tokens)
    important_words = [word for word, count in frequency.most_common(num_keywords)]
    summary = ' '.join(important_words)
    return summary

# Apply the function to the DataFrame
df['summary'] = df['review_comment_message'].apply(summarize_reviews)

# Display the results (manually adapted to print the summary of the longest message)
longest_idx = df['review_comment_message'].str.len().idxmax()
print("Longest review:", df.loc[longest_idx]["review_comment_message"])
print("Summary:", df.loc[longest_idx]["summary"])
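
For very short reviews most words occur only once, so the frequency ranking is decided largely by word order. A self-contained sketch of the same idea, without NLTK, makes this visible. The tiny stopword set and the whitespace tokenizer are simplifying assumptions for illustration, not the NLTK resources used above, and the example sentences are made up.

```python
from collections import Counter
import string

# Simplified stand-in for the NLTK Portuguese stopword list (assumption for illustration)
STOPWORDS = {"o", "a", "e", "de", "do", "da", "que", "muito", "mas", "é"}

def summarize(text, num_keywords=3):
    # Lower-case, split on whitespace and strip surrounding punctuation
    tokens = [word.strip(string.punctuation) for word in text.lower().split()]
    tokens = [token for token in tokens if token and token not in STOPWORDS]
    # most_common keeps first-seen order among words with equal counts
    return " ".join(word for word, _ in Counter(tokens).most_common(num_keywords))

summary_repeated = summarize("O produto chegou rápido, mas o produto é muito frágil.")
summary_unique = summarize("Entrega rápida e produto excelente!")
print(summary_repeated)
print(summary_unique)
```

In the second sentence every word occurs once, so the summary is simply the first three non-stopwords and the arguably most informative word, "excelente", is dropped.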

## 6.3.3 Topic modeling
### Let’s ask ChatGPT 6.8

Query: I have a set of short negative customer reviews in Portuguese stored in a pandas dataframe column. I want to know what are the main concerns raised by customers. How to extract this information from reviews using text summarization?

Output: The code works well. It was manually adapted to our input data: only negative reviews, with a review_score of 1 or 2, were used in the analysis. The output was manually adapted to display 7 words for each topic.

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import re

# Load your DataFrame (assuming your reviews are in the 'review_comment_message' column)
# Only negative reviews were chosen for the analysis (adapted manually).
df = pd.read_csv('olist_order_reviews_dataset.csv')
df = df.dropna(subset = ['review_comment_message'])
df = df[(df["review_score"]==1) | (df["review_score"]==2)]

# Preprocess the text
def preprocess_text(text, language='portuguese'):
    # Remove special characters, convert to lowercase
    cleaned_text = re.sub(r'[^\w\s]', '', text.lower())

    # Tokenize words
    words = word_tokenize(cleaned_text, language=language)

    # Remove stopwords
    stop_words = set(stopwords.words(language))
    words = [word for word in words if word not in stop_words]

    # Apply stemming
    stemmer = SnowballStemmer(language)
    words = [stemmer.stem(word) for word in words]

    return words

df['preprocessed_reviews'] = df['review_comment_message'].apply(preprocess_text)

# Import the gensim classes needed for LDA topic modeling.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Create a dictionary and corpus for LDA
dictionary = Dictionary(df['preprocessed_reviews'])
corpus = [dictionary.doc2bow(text) for text in df['preprocessed_reviews']]

# Train an LDA model
num_topics = 5  # Adjust this value according to the desired number of topics
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, random_state=42)

# Displaying results (manually adapted to display 7 words).
def display_topics(model, num_topics, num_words=7):
    for idx, topic in model.print_topics(num_topics, num_words):
        print(f"Topic {idx + 1}: {topic}\n")

display_topics(lda_model, num_topics)
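
The LDA model does not see the token lists directly: `doc2bow` first turns each review into a bag of words, a list of `(token_id, count)` pairs. A pure-Python sketch of that representation follows. It illustrates the idea only and is not gensim's implementation; the helper names are hypothetical, and the id assignment in first-seen order is an assumption that may differ from gensim's.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

# Toy stemmed reviews for illustration only
docs = [["produt", "entreg", "atras"], ["produt", "defeit", "produt"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation, which is why the topics are reported as weighted word lists rather than as phrases.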