<a href="https://colab.research.google.com/github/fadodo/Books_reviewers_review_Analysis/blob/main/Books_reviews_ratings_Analysis%C2%A0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SUMMARY AND OVERVIEW

- Overall, the discovery of a book or author happens through both direct and indirect recommendations.


- Among these recommendations are the opinions and ratings of other readers, which can influence our decision.


- With this in mind, we sought to understand what motivates the high or low ratings of these reviews.


- To do this, we examined how the review ratings relate to the length of the reviews, the sentiment of the reviews as calculated by a natural language processing algorithm, the reviewer's total number of reviews, the number of followers, and the number of likes on the reviews.


**The correlation heatmap shows no notable relationship between the review ratings and the other variables** (a correlation analysis cannot by itself establish or rule out causation). These ratings might therefore depend more on individual writing habits than on review length.



- We then sought to identify the words or terms that determine these ratings.

**We observed that the words in 5-star reviews indicate *joy, excitement, admiration, and emotion*, while the words in 1-star reviews indicate *frustration, disappointment, disgust, boredom, and confusion*.**

## Can **review content** affect a reviewer's influence on a book's rating?

Review ratings can significantly influence a book's average rating and sales, though the impact depends on factors like volume, credibility, and context.
- A surge of **5-star** reviews will boost the average rating, while
- **1-2 star** reviews (negative reviews) can drag it down, especially if the total review count is low.
- Many **5-star** and **1-star** reviews together may signal controversy, which can intrigue some readers but deter others.
- A balanced spread around **4 stars** often indicates broad appeal.

On the other hand, a single reviewer's rating can influence a book, but the extent of its impact depends on several factors, including the reviewer's credibility, the book's existing review pool, the review timing, and **the review content**.

# Exploratory Analysis of Reader Engagement and Interaction on Goodreads



## Connecting to Drive and Loading Datasets



In [26]:
## Connection to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
## Import of necessary libraries
import pandas as pd
import plotly.express as px
import math

In [28]:
df = pd.read_csv("/content/drive/MyDrive/projet_analyse_sentiment_books/clean data/Book_reviews_clean.csv")

## Computing review sentiment with NLTK tools to check whether it agrees with the review ratings

The goal is to categorize each review's content as positive (1) or negative (0).

In [36]:
!pip install emoji
!pip install contractions

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-

In [37]:
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import spacy
from bs4 import BeautifulSoup
import emoji
import contractions

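# NOTE: 'all' downloads every NLTK package and is slow (the run below was interrupted);
# the packages actually needed are downloaded individually later in the notebook.
nltk.download('all')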

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

KeyboardInterrupt: 

In [None]:
## Keep only the review text (copy to avoid SettingWithCopyWarning on later assignments)
df1 = df[["review_content"]].copy()

### Preprocess text

In [None]:
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower() # lower case
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove URLs
    text = BeautifulSoup(text, "html.parser").get_text() # remove all html tags
    text = emoji.demojize(text) # replace emojis with their text names
    text = contractions.fix(text) # expand contractions, see https://github.com/kootenpv/contractions
    text = re.sub(r"[^a-zA-Z\s]", "", text) # keep only text
    doc = nlp(text)
    text = " ".join([token.lemma_ for token in doc if token.text not in stop_words])

    return text.strip()

In [None]:
## Apply the clean_text function
df1['review_content_clean'] = df1['review_content'].apply(clean_text)
df1

In [None]:
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stop words (uses the stop_words set built above instead of reloading the list for every token)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text


In [None]:
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    # 1 (positive) if the VADER 'pos' proportion exceeds 0.5, else 0 (negative)
    sentiment = 1 if scores['pos'] > 0.5 else 0
    return sentiment

# apply get_sentiment function
df1.loc[:, 'sentiment'] = df1['review_content_clean'].apply(get_sentiment)
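
The cell above thresholds the VADER `pos` proportion at 0.5. As a point of comparison, a commonly used alternative is the normalized `compound` score, for which VADER's documentation suggests a cutoff of 0.05 for positive text. The sketch below is only an illustration: `label_sentiment` is a hypothetical helper, and the score dictionaries are hand-written examples shaped like the output of `polarity_scores`, not results from this dataset.

```python
def label_sentiment(scores, threshold=0.05):
    """Return 1 (positive) or 0 (not positive) from a VADER-style score dict."""
    return 1 if scores["compound"] >= threshold else 0

# Hand-written dictionaries shaped like the output of polarity_scores()
example_pos = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}
example_neg = {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6}

print(label_sentiment(example_pos), label_sentiment(example_neg))
```

Which rule fits best depends on how neutral reviews should be treated; with a binary label, both rules fold neutral text into the 0 class.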

In [None]:
df["sentiment"] = df1["sentiment"]
df["review_content_clean"] = df1["review_content_clean"]

In [None]:
df.to_csv("/content/drive/MyDrive/batch_1939/Book_reviews_clean_with_sentiment.csv")

## Loading the Book_reviews_clean_with_sentiment dataset

In [38]:
### Load the table with sentiment encoded as 0 (negative) or 1 (positive)

df_reviews=pd.read_csv('/content/drive/MyDrive/projet_analyse_sentiment_books/clean data/Book_reviews_clean_with_sentiment.csv', parse_dates=["review_date"])
## drop the leftover unnamed index columns
df_reviews=df_reviews.drop(columns=['Unnamed: 0','Unnamed: 0.1'])
df_reviews.head()

Unnamed: 0,book_id,reviewer_id,likes_on_review,review_content,reviewer_followers,reviewer_total_reviews,review_date,review_rating,sentiment
0,57094644,114413220,582,Just when you thought he was done writing book...,7961,234,2021-02-24,0,0
1,57094644,48328025,329,Would you be shocked if I told you this was th...,12100,1802,2024-03-17,5,1
2,57094644,6728955,232,So you're telling me Anaisn'ta Daughter of Pos...,490,1263,2022-09-05,3,0
3,57094644,101179363,218,"*inserts vine ""anything for you, beyoncé""*upda...",2709,458,2021-06-03,0,0
4,2948832,48727754,174,i was excited about this one since it was so w...,55100,1139,2021-06-09,2,1


## Review length can correlate with rating, but does it necessarily cause a high or low rating?

To test whether review length affects rating, we proceed by:
- Computing the correlation between review length and rating.
- Comparing the average review length across rating groups.
- Using regression analysis to see whether length predicts rating after controlling for sentiment.

In [39]:
# Add a column with the number of words in each review
df_reviews['review_length']=df_reviews['review_content'].apply(lambda review: len(review.split()))
df_reviews.head()

Unnamed: 0,book_id,reviewer_id,likes_on_review,review_content,reviewer_followers,reviewer_total_reviews,review_date,review_rating,sentiment,review_length
0,57094644,114413220,582,Just when you thought he was done writing book...,7961,234,2021-02-24,0,0,12
1,57094644,48328025,329,Would you be shocked if I told you this was th...,12100,1802,2024-03-17,5,1,453
2,57094644,6728955,232,So you're telling me Anaisn'ta Daughter of Pos...,490,1263,2022-09-05,3,0,9
3,57094644,101179363,218,"*inserts vine ""anything for you, beyoncé""*upda...",2709,458,2021-06-03,0,0,11
4,2948832,48727754,174,i was excited about this one since it was so w...,55100,1139,2021-06-09,2,1,86


#### Grouping the reviews dataset by reviewer_id for this study

In [40]:
### Grouping by reviewers
df_reviews_reviewers=df_reviews.groupby(by=['reviewer_id']).agg({'likes_on_review':'sum',
                                                                 'reviewer_followers':'sum',
                                                                 'reviewer_total_reviews':'sum',
                                                                 'review_rating': lambda x: math.ceil(x.mean()),  # Apply ceiling to the mean of review_rating
                                                                 'sentiment': lambda x: math.ceil(x.mean()),  # Apply ceiling to the mean of sentiment
                                                                 'review_length':'mean'})
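
A side effect of aggregating with `math.ceil` is worth keeping in mind: with 0/1 sentiment labels, the ceiling of the mean is 1 as soon as a single review is positive. The sketch below uses made-up labels to show the difference from rounding; whether the ceiling or rounding is preferable depends on the intent of the analysis.

```python
import math
from statistics import mean

# Made-up labels for one reviewer: three negative reviews and one positive
labels = [0, 0, 0, 1]

ceil_label = math.ceil(mean(labels))   # ceiling of 0.25
round_label = round(mean(labels))      # 0.25 rounded to the nearest integer

print(ceil_label, round_label)  # 1 0
```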

In [41]:
print(f'Aggregating by the reviewers, we have {df_reviews_reviewers.shape[0]} rows to manage')

Aggregating by the reviewers, we have 22344 rows to manage


#### Computing the correlation between the review rating and all other variables

In [42]:
fig = px.imshow(df_reviews_reviewers.corr().round(2),
                text_auto=True,
                aspect="auto",
                width=1000, # Adjust the width of the figure
                height=600, # Adjust the height of the figure
                x=['likes on review', 'reviewer followers','reviewer total reviews', 'review rating', 'sentiment', 'review length'],
                y=['likes on review', 'reviewer followers','reviewer total reviews', 'review rating', 'sentiment', 'review length'],
                color_continuous_scale="Mint",
                #origin='upper',
    )
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)") # Set paper background color to transparent
fig.update_layout(coloraxis_colorbar=dict(title=dict(text="<b>Correlation</b>", font=dict(size=16, color='black'))))
# Set colorbar tick label color and size
fig.update_coloraxes(colorbar_tickfont=dict(color='black', size=16))  # Set color to black, size to 16
fig.update_xaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black') # Remove grid and zeroline from x-axis
fig.update_yaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black') # Remove grid and zeroline from y-axis
fig.show()

- **This first exploratory analysis shows that the more reviews a reviewer writes, the more followers they have. Furthermore, the more followers they have, the greater the chance of their reviews receiving likes.**

- **Nevertheless, no notable correlation was observed between the review rating and the other variables.**

- **Review rating might depend more on individual writing habits than on review length or sentiment.**

#### Comparing average review length for different rating groups.


In [45]:
### Average review length per rating category
fig=px.histogram(df_reviews,
                 x='review_rating',
                 y='review_length',
                 color='review_rating',
                 title='Average Review length for different rating groups',
                 #barmode='group',
                 text_auto=True,
                 histfunc='avg',
                 labels={'review_rating':'Review Rating', 'review_length':'Average Review Length'},
                 opacity=0.8,
                 width=1000, # Adjust the width of the figure
                 height=600, # Adjust the height of the figure
                 #color_discrete_sequence=['indianred'] # color of histogram bars
                 category_orders={"review_rating": [5, 4, 3, 2, 1, 0]} # Reorder legend values
    )
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)") # Set paper background color to transparent
# Update legend color and text size
fig.update_layout(legend=dict(font=dict(color='black', size=16))) # Set legend text color and size
# Update title color and font
fig.update_layout(title_font=dict(color='black', family='Arima', size=20)) # Set title color and font
# Set legend label color and size
fig.update_xaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black', title_font=dict(size=18, family='Arial', color='black')) # Remove grid and zeroline from x-axis
fig.update_yaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black', title_font=dict(size=18, family='Arial', color='black')) # Remove grid and zeroline from y-axis
fig.show()

#### Using regression analysis to see if length predicts rating after controlling for sentiment

In [48]:
fig = px.histogram(df_reviews_reviewers,
             x='sentiment',
             y='review_length',
             color='sentiment',
             title="Review length for different sentiment groups",
             #barmode='group',
             text_auto=True,
             histfunc='avg',
             labels={'sentiment':'Review Sentiment', 'review_length':'Review Length'},
             opacity=0.8,
             width=1000, # Adjust the width of the figure
             height=600, # Adjust the height of the figure
             #color_discrete_sequence=['indianred'] # color of histogram bars
             )
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)") # Set paper background color to transparent
# Update legend color and text size
fig.update_layout(legend=dict(font=dict(color='black', size=16))) # Set legend text color and size
# Update title color and font
fig.update_layout(title_font=dict(color='black', family='Arima', size=20)) # Set title color and font
# Set legend label color and size
fig.update_xaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black', title_font=dict(size=18, family='Arial', color='black')) # Remove grid and zeroline from x-axis
fig.update_yaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black', title_font=dict(size=18, family='Arial', color='black')) # Remove grid and zeroline from y-axis
fig.show()

In [50]:
fig = px.histogram(df_reviews_reviewers,
             x='sentiment',
             y='review_rating',
             color='review_rating',
             title="Review rating for different sentiment",
             barmode='group',
             histfunc='count',
             text_auto=True,
             labels={'sentiment':'Review Sentiment', 'review_rating':'Review rating'},
             opacity=0.8,
             width=1000, # Adjust the width of the figure
             height=600, # Adjust the height of the figure
             #color_discrete_sequence=['indianred'], # color of histogram bars
             category_orders={"review_rating": [5, 4, 3, 2, 1, 0]} # Reorder legend values
    )
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)") # Set paper background color to transparent
# Update legend color and text size
fig.update_layout(legend=dict(font=dict(color='black', size=16))) # Set legend text color and size
# Update title color and font
fig.update_layout(title_font=dict(color='black', family='Arima', size=20)) # Set title color and font
# Set legend label color and size
fig.update_xaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black', title_font=dict(size=18, family='Arial', color='black')) # Remove grid and zeroline from x-axis
fig.update_yaxes(showgrid=False, zeroline=False, tickfont=dict(size=16), color='black', title_font=dict(size=18, family='Arial', color='black')) # Remove grid and zeroline from y-axis
fig.show()

## Since review rating might depend more on individual writing habits than on sentiment, what drives a good or bad rating?

### To identify words that signal personal enjoyment and emotional impact in high- and low-rated reviews, we analyze the most frequent words in the 5-star and 1-star categories.



#### Loading all the necessary libraries

In [52]:
#### Packages to install
# !pip install contractions
# !pip install emoji
# !pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=e7e9c228e1efb8466d815d95c153624341fb3376f688dd84d14b11236f05274b
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [53]:
# Start with loading all necessary libraries
import contractions
import difflib
import emoji
import pandas as pd
import nltk
import re
import spacy
import unicodedata
from langdetect import detect, DetectorFactory
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS # Importing STOPWORDS from wordcloud

In [54]:
### Downloading only the necessary modules
nltk.download('stopwords')
nltk.download('universal_tagset')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('words')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

### Review content cleaning function to remove special characters

In [55]:
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

## Remove invisible Unicode formatting characters (category 'Cf') from the text
def remove_formatting_chars(text):
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

### Cleaning function that removes all characters not useful for the analysis
def clean_text(text):
    # Normalize text
    text = unicodedata.normalize('NFKC', text)
    # Remove invisible or formatting characters
    text = remove_formatting_chars(text)
    # Convert to lowercase
    text = text.lower()
    # Remove numbers by replacing sequences of digits with a space
    text = re.sub(r'\d+', ' ', text)
    # Expand contractions
    text = contractions.fix(text)
    # Replace emojis with their text names (e.g. :red_heart:)
    text = emoji.demojize(text)
    # Remove email addresses and URLs
    text = re.sub(r'\b\S+@\S+\b', ' ', text)  # remove emails
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  # remove urls
    # Replace punctuation (including periods) with a space
    text = re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@[\\\]^_`{|}~]', ' ', text)
    # Remove multiple spaces and trim the text
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# Set seed for reproducibility
DetectorFactory.seed = 0

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return None

### Applying the cleaning function on the review content

In [56]:
%%time
df_reviews.loc[:, "review_content_clean"] = df_reviews["review_content"].apply(clean_text)

CPU times: user 3min 4s, sys: 539 ms, total: 3min 5s
Wall time: 3min 8s


### Applying the language detecting function

In [57]:
%%time
df_reviews.loc[:, 'lang'] = df_reviews['review_content'].apply(detect_language)

CPU times: user 7min 39s, sys: 2.03 s, total: 7min 41s
Wall time: 7min 44s


### Text tagging and lemmatization function to keep only the root form of words

List of POS tag codes (universal tagset):
- `ADJ`: adjective (new, good, high, special, big, local)
- `ADP`: adposition (on, of, at, with, by, into, under)
- `ADV`: adverb (really, already, still, early, now)
- `CONJ`: conjunction (and, or, but, if, while, although)
- `DET`: determiner, article (the, a, some, most, every, no, which)
- `NOUN`: noun (year, home, costs, time, Africa)
- `NUM`: numeral (twenty-four, fourth, 1991, 14:24)
- `PRT`: particle (at, on, out, over, per, that, up, with)
- `PRON`: pronoun (he, their, her, its, my, I, us)
- `VERB`: verb (is, say, told, given, playing, would)
- `.`: punctuation marks (. , ; !)
- `X`: other

In [74]:
lmtzr = WordNetLemmatizer()
### Keep only the words whose POS tag is in `tags`, drop stop words, and lemmatize the rest
def keep_tags(sentence, tags=("ADJ",)): # e.g. tags=("ADJ", "NOUN", "VERB")
    # stop_words is the set built in the cleaning cell; a set lookup avoids reloading the list for every token
    tagged = pos_tag(word_tokenize(sentence.lower()), tagset='universal')
    return " ".join(lmtzr.lemmatize(w) for w, t in tagged if t in tags and w not in stop_words)
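
The tag-filtering step can be illustrated without NLTK on tokens that are already tagged. This is a minimal sketch: `filter_by_tag` is a hypothetical helper and the tagged tokens are written by hand, not produced by `pos_tag`.

```python
def filter_by_tag(tagged_tokens, tags=("ADJ",)):
    """Keep the words whose universal POS tag is listed in `tags`."""
    return [word for word, tag in tagged_tokens if tag in tags]

# Hand-tagged tokens, for illustration only
tagged = [("the", "DET"), ("boring", "ADJ"), ("plot", "NOUN"),
          ("was", "VERB"), ("slow", "ADJ")]

adjectives = filter_by_tag(tagged)
adjectives_and_nouns = filter_by_tag(tagged, tags=("ADJ", "NOUN"))
print(adjectives)            # ['boring', 'slow']
print(adjectives_and_nouns)  # ['boring', 'plot', 'slow']
```

Note that a one-element tuple needs a trailing comma: `("ADJ")` is a plain string, so `in` would perform substring matching rather than membership testing.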

### Extraction of all 5 stars and 1 star reviews

In [75]:
df1 = df_reviews[df_reviews["review_rating"] == 1]["review_content_clean"] # Reviews rated 1
df5 = df_reviews[df_reviews["review_rating"] == 5]["review_content_clean"] # Reviews rated 5

### Applying the keep_tags function to the dataset

In [76]:
%%time
df1 = df_reviews[df_reviews["review_rating"] == 1][['review_content_clean']].copy() # Selecting the column as a DataFrame
df5 = df_reviews[df_reviews["review_rating"] == 5][['review_content_clean']].copy() # Selecting the column as a DataFrame

df1.loc[:, "words"] = df1["review_content_clean"].apply(keep_tags)
df5.loc[:, "words"] = df5["review_content_clean"].apply(keep_tags)

CPU times: user 10min 56s, sys: 18.9 s, total: 11min 15s
Wall time: 11min 26s


In [78]:
## Converting the dataframe columns to lists
l1, l5 = df1["words"].to_list(), df5["words"].to_list()

In [79]:
## Flattening the list of 5-star reviews into a list of individual words
w5 = [ w for l in l5 for w in l.split()]

## Optionally keep only the words with more than 3 letters
# w5 = [w for w in w5 if len(w) > 3]
print (f'There are {len(w5)} words in the combination of all review rated 5')

There are 849375 words in the combination of all review rated 5


In [80]:
## Flattening the list of 1-star reviews into a list of individual words
w1 = [ w for l in l1 for w in l.split()]

# w1 = [w for w in w1 if len(w) > 3]
print (f'There are {len(w1)} words in the combination of all review rated 1')

There are 184975 words in the combination of all review rated 1


In [81]:
from collections import Counter

### This stage consists of counting the words and keeping the 500 most common ones.

w1_counts = Counter(w1)
w1_top_500 = w1_counts.most_common(500)
w1_ = [word for word, count in w1_top_500 if len(word) > 1]

w5_counts = Counter(w5)
w5_top_500 = w5_counts.most_common(500)
w5_ = [word for word, count in w5_top_500 if len(word) > 1]

### Identifying common words in both texts to retain only the words that differentiate them

In [82]:
## Words common to both texts
c15 = set(w1_).intersection(set(w5_))

In [83]:
## Subtracting the common words from the 1-star text
s1 = set(w1_)
s1 = s1.difference(c15)

In [84]:
## Subtracting the common words from the 5-star text
s5 = set(w5_)
s5 = s5.difference(c15)

In [85]:
### Checking that no common words remain
s1.intersection(s5)

set()

In [86]:
### Number of words passed to each word cloud
len(s1), len(s5)

(114, 115)
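
The counting and set-difference steps above can be wrapped into one helper. The sketch below is illustrative only: `distinctive_words` is a hypothetical name, the word lists are made up, and unlike the cells above it does not drop one-letter words.

```python
from collections import Counter

def distinctive_words(words_a, words_b, top_n=500):
    """Return the top-N words unique to each list, excluding words frequent in both."""
    top_a = {word for word, _ in Counter(words_a).most_common(top_n)}
    top_b = {word for word, _ in Counter(words_b).most_common(top_n)}
    return top_a - top_b, top_b - top_a

# Made-up word lists, for illustration only
low = ["boring", "slow", "good", "boring"]
high = ["amazing", "good", "beautiful", "amazing"]

only_low, only_high = distinctive_words(low, high)
print(sorted(only_low), sorted(only_high))
```

Because the two returned sets are built by removing each top-N set from the other, they cannot share a word, which is what the empty intersection check above confirms.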

### Generating word clouds for 5 stars reviews

In [87]:
import plotly.express as px

In [89]:
# Create the word cloud stopword list (named wc_stopwords to avoid shadowing nltk's stopwords module):
wc_stopwords = set(STOPWORDS)

# Limit max_font_size and the maximum number of words, and use a white background:

wordcloud = WordCloud(stopwords=wc_stopwords,max_font_size=50, max_words=110, background_color='white').generate(" ".join(s5))

# Display the generated image:
fig=px.imshow(wordcloud,
              width=1000,
              height=600)
# Update layout for transparent background
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)") # Set paper background color to transparent
fig.update_xaxes(showgrid=False, zeroline=False) # Remove grid and zeroline from x-axis
fig.update_yaxes(showgrid=False, zeroline=False) # Remove grid and zeroline from y-axis
fig.show()

### Generating word clouds for 1 star reviews

In [91]:
# Create the word cloud stopword list (named wc_stopwords to avoid shadowing nltk's stopwords module):
wc_stopwords = set(STOPWORDS)

# Limit max_font_size and the maximum number of words, and use a white background:

wordcloud = WordCloud(stopwords=wc_stopwords,max_font_size=50, max_words=110, background_color='white').generate(" ".join(s1))

# Display the generated image:
fig=px.imshow(wordcloud,
              width=1000, # Adjust the width of the figure
              height=600, # Adjust the height of the figure
              )
# Update layout for transparent background
fig.update_layout(paper_bgcolor="rgba(0,0,0,0)") # Set paper background color to transparent
fig.update_xaxes(showgrid=False, zeroline=False) # Remove grid and zeroline from x-axis
fig.update_yaxes(showgrid=False, zeroline=False) # Remove grid and zeroline from y-axis
fig.show()

The results show that words in 5-star reviews often express **joy, excitement, admiration, and deep emotional connection**, while words in 1-star reviews often indicate **disappointment, frustration, boredom, or confusion**.