# Review Sentiment Analysis

## Introduction

To uncover what makes a game good, I perform sentiment analysis on Steam reviews to explore which topics come up most often and whether those topics tend to appear in positive or negative reviews. The code first prepares the data by aggregating each game's reviews and reducing them to their significant words. I then run a sentiment intensity analysis to measure the sentiment of the reviews that mention the generated topics.

## Data Cleaning and Preparation

First, I cleaned and prepared the data the same way as in the previous section, then aggregated each game's reviews into both a single string and a list. Aggregating the reviews lets me run the vectorization and the sentiment analysis more efficiently.

In [None]:
# Importing the necessary libraries and downloading needed packages
import pandas as pd
import altair as alt
import re
from sklearn.preprocessing import MinMaxScaler
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import save_npz, load_npz
import numpy as np
from tqdm import tqdm
import ast
from collections import defaultdict

nltk.download(['wordnet', 'stopwords', 'punkt'])
nltk.download('punkt_tab')
nltk.download('vader_lexicon')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/davidlee/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/davidlee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/davidlee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/davidlee/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/davidlee/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
# Same data processing as in the main analysis
game_details = pd.read_csv("steam_game_details.csv")
game_stats = pd.read_csv("steam_game_stats.csv")
game_reviews = pd.read_csv("steam_review_data.csv")

game_reviews = game_reviews.dropna()

combined_details = game_details.merge(game_stats, on="app_id")
combined_details['owners_upper'] = combined_details['owners'].apply(
    lambda x: int(re.search(r"\.\. ([\d,]+)", x).group(1).replace(',', '')) if pd.notnull(x) else None
)

combined_details['total'] = combined_details['positive'] + combined_details['negative']

combined_details['positive_ratio'] = combined_details['positive'] / combined_details['total']

combined_details['ranked_positive'] = combined_details['positive_ratio'].rank(pct=True)
combined_details['ranked_owners'] = combined_details['owners_upper'].rank(pct=True)

combined_details['popularity_score_ranked'] = (
    0.3 * combined_details['ranked_positive'] +
    0.7 * combined_details['ranked_owners']
)

combined_details['popularity_rating'] = (combined_details['popularity_score_ranked'] * 5).round(1)
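
As a sanity check on the scoring above, the sketch below re-derives the `popularity_rating` of the first two rows in the output table shown later in this notebook (Counter-Strike and Rag Doll Kung Fu) using only the standard library. The helper names `parse_owners_upper` and `popularity_rating` are hypothetical; the regex and the 0.3/0.7 weights are the ones used in the cell above, and the percentile ranks are copied from that table.

```python
import re

def parse_owners_upper(owners):
    """Return the upper bound of an owners range such as '10,000 .. 20,000'."""
    match = re.search(r"\.\. ([\d,]+)", owners)
    return int(match.group(1).replace(",", "")) if match else None

def popularity_rating(ranked_positive, ranked_owners):
    """Weighted percentile score (30% reviews, 70% owners) scaled to 0-5."""
    score = 0.3 * ranked_positive + 0.7 * ranked_owners
    return round(score * 5, 1)

print(parse_owners_upper("10,000,000 .. 20,000,000"))  # 20000000
print(popularity_rating(0.994663, 0.991667))           # 5.0 (Counter-Strike)
print(popularity_rating(0.406271, 0.146667))           # 1.1 (Rag Doll Kung Fu)
```

Because owners carry 70% of the weight, a well-reviewed game with few owners still ends up with a low rating, which is worth keeping in mind when reading the ratings.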

In [None]:
# Aggregating reviews and putting each review in a list format for sentiment analysis
agg_reviews = game_reviews.groupby(by="app_id")['review_text'].apply(lambda x: ' '.join(x)).reset_index()
agg_reviews.columns = ["app_id", "agg_reviews"]

review_lists = game_reviews.groupby(by="app_id")['review_text'].apply(list).reset_index()
review_lists.columns = ["app_id", "review_list"]

In [None]:
# Merging the aggregated reviews and review list with the main df
with_reviews = combined_details.merge(agg_reviews,on='app_id')
with_reviews = with_reviews.merge(review_lists,on='app_id')

## Sentiment Analysis

To perform sentiment analysis, I first cleaned the review text further by removing stopwords and other uninformative words such as "game" and "play". To do this, I wrote one function that tokenizes, lemmatizes, and filters each review, and another that extracts the top terms from a row of TF-IDF scores.

Then, using a TF-IDF vectorizer, I found the top topics for each game. TF-IDF weights a word by how often it appears in one game's reviews relative to how common it is across the entire collection, so it surfaces the words that are most distinctive for each game rather than words that are frequent everywhere.
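
To make the weighting concrete, here is a minimal standard-library sketch on three made-up token lists. It is not the notebook's pipeline: `top_terms` is a hypothetical helper, and the score is raw term count times the smoothed IDF `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, without the final L2 normalization (which does not change the ranking within a document).

```python
import math
from collections import Counter

def top_terms(docs, top_n=2):
    """Return the top_n highest-scoring TF-IDF terms for each tokenized doc."""
    n_docs = len(docs)
    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # Sort by score (descending), then alphabetically to break ties
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results

# Hypothetical, already-cleaned reviews for three games
docs = [
    ["zombie", "map", "fun", "zombie", "custom"],
    ["story", "fun", "puzzle", "story"],
    ["server", "lag", "fun", "server"],
]
print(top_terms(docs))
# [['zombie', 'custom'], ['story', 'puzzle'], ['server', 'lag']]
```

The word "fun" appears in every document, so its IDF is the minimum and it never reaches the top two, even though it is as frequent as "map" or "lag" within a document.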

Finally, I iterated over each game's reviews and calculated the sentiment of every review that contained at least one of that game's top topics from the TF-IDF step. From these scores, I computed the average sentiment for each topic to see what users like or dislike about their games. Visualizing the results, I found that topics like adventure, friend, and story had high average sentiments, which could imply that games with strong storytelling and the ability to play with friends are received best. Genre terms such as mmo, td, rpg, and puzzle also had high average sentiments, suggesting that those genres tend to be well received. In contrast, topics like control, enemy, and server had lower average sentiments, which could indicate that bad controls, poorly balanced enemies, or server issues lead to worse reviews.
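
One caveat about the matching step: the code below tests `kw in review_lower`, which is a substring check, so a very short term such as `c` matches any review that contains that letter. The following is a minimal sketch of a whole-word alternative, not the method used in this notebook; `match_keywords` is a hypothetical helper and the review text is made up.

```python
import re

def match_keywords(review, keywords):
    """Return the keywords that occur in the review as whole words."""
    words = set(re.findall(r"[a-z]+", review.lower()))
    return sorted(kw for kw in keywords if kw in words)

review = "Such a classic, best shooter of its time."
keywords = {"best", "c", "play"}

# Substring test: "c" matches because of "such" and "classic"
print([kw for kw in sorted(keywords) if kw in review.lower()])  # ['best', 'c']

# Whole-word test: only "best" is an actual word in the review
print(match_keywords(review, keywords))  # ['best']
```

The trade-off is that whole-word matching compares raw words, so a lemmatized topic like "zombie" would no longer match "zombies" unless the review words are lemmatized as well.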

In [None]:
# Initialize lemmatizer and set stopwords
lmr = WordNetLemmatizer()
sw = set(stopwords.words('english')).union({'wa', 'ha', 'one', 'ever', 'would', 'like', 'still','game', 'play'})

# Function to tokenize, lemmatize, and remove stopwords from a review
def findTopics(text):
    tokens = word_tokenize(text.lower())
    lemmatized = []
    for t in tokens:
        if t.isalpha() and t not in sw:
            t_lem = lmr.lemmatize(t)
            if t_lem != "game":
                lemmatized.append(t_lem)
    return lemmatized
    

# Function to find the top terms in the cleaned review
def findTopTerms(row, feature_names, top_n=4):
    indices = row.nonzero()[1]
    tfidf_scores = zip(indices, [row[0, x] for x in indices])
    sorted_terms = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)[:top_n]
    return [feature_names[i] for i, _ in sorted_terms]


NOTE: Code below took 16m 32s to run on VS Code; may take longer/shorter on Google Colab

In [None]:
# NOTE: Uncomment the commented code to run from scratch. Otherwise, use the previously downloaded files

# Initializing the TFIDF Vectorizer, loading/creating the matrix and feature names
vectorizer = TfidfVectorizer(tokenizer=findTopics, stop_words=None)
print("Vectorizing reviews...")
# Assumption: the input is the aggregated review text, one document per game
# tfidf_matrix = vectorizer.fit_transform(with_reviews['agg_reviews'])
tfidf_matrix = load_npz("tfidf_matrix.npz")
print("Vectorization complete.")
# feature_names = vectorizer.get_feature_names_out()
feature_names = np.load("feature_names.npy", allow_pickle=True)

# Extracting the top terms for each game from the TF-IDF matrix
print("Extracting top terms...")
# top_terms_list = [findTopTerms(tfidf_matrix[i], feature_names) for i in range(tfidf_matrix.shape[0])]
# with_reviews['top_terms'] = top_terms_list
with_reviews = pd.read_csv("with_reviews.csv")

# save_npz("tfidf_matrix.npz", tfidf_matrix)
# np.save("feature_names.npy", feature_names)
# with_reviews.to_csv("with_reviews.csv", index=False)

Vectorizing reviews...
Vectorization complete.
Extracting top terms...


In [None]:
# Created row number column for debugging and keeping track of progress
with_reviews['row_number'] = range(len(with_reviews))
with_reviews.iloc[0]['row_number']

0

Took 19m to run

In [None]:
# Used to keep track of progress since it took a long time to run
tqdm.pandas()

# Initializing sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to see if a review has the generated topics, and if so, to calculate the sentiment of that review
def analyzeSentiment(row):
    review_list = row['review_list']
    keywords = set(row['top_terms'])
    res = []
    for review in review_list:
        review_lower = review.lower()
        # Keep only the topics that appear in this review (substring match)
        matched_keywords = [kw for kw in keywords if kw in review_lower]
        if matched_keywords:
            # Score the review only when it mentions at least one topic
            sentiment = sia.polarity_scores(review_lower)
            res.append((matched_keywords, sentiment))
    return res

# Columns loaded from a CSV arrive as strings, so parse them back into lists
if not isinstance(with_reviews['review_list'].iloc[0], list):
    with_reviews['review_list'] = with_reviews['review_list'].progress_apply(ast.literal_eval)
if not isinstance(with_reviews['top_terms'].iloc[0], list):
    with_reviews['top_terms'] = with_reviews['top_terms'].progress_apply(ast.literal_eval)

# Create a topic sentiment column
with_reviews['topic_sentiment'] = with_reviews.progress_apply(analyzeSentiment, axis=1)
with_reviews

100%|██████████| 1438/1438 [19:01<00:00,  1.26it/s] 


Unnamed: 0,app_id,title,genre,developer,publisher,franchise,release_date,positive,negative,owners,...,positive_ratio,ranked_positive,ranked_owners,popularity_score_ranked,popularity_rating,agg_reviews,review_list,top_terms,row_number,topic_sentiment
0,10,Counter-Strike,Action,Valve,Valve,,"Nov 1, 2000",242768,6388,"10,000,000 .. 20,000,000",...,0.974361,0.994663,0.991667,0.992566,5.0,Ruined my life. This will be more of a ''my ex...,"[Ruined my life., This will be more of a ''my ...","[best, c, holefire, play]",0,"[([c, play, best], {'neg': 0.051, 'neu': 0.765..."
1,1002,Rag Doll Kung Fu,Indie,Mark Healey,Mark Healey,,"Oct 12, 2005",89,30,"20,000 .. 50,000",...,0.747899,0.406271,0.146667,0.224548,1.1,i joined steam 10 years ago today to play rag ...,[i joined steam 10 years ago today to play rag...,"[kung, healey, rdkf, multiplayer]",1,"[([healey, kung], {'neg': 0.149, 'neu': 0.642,..."
2,100400,Silo 2,Animation & Modeling,Nevercenter Ltd. Co.,Nevercenter Ltd. Co.,,"Dec 19, 2012",61,24,"0 .. 20,000",...,0.717647,0.346898,0.042667,0.133936,0.7,"If it were free, I would recommend it, but it ...","[If it were free, I would recommend it, but it...","[silo, modeling, modo, software]",2,"[([modo, modeling, software], {'neg': 0.018, '..."
3,10090,Call of Duty: World at War,Action,Treyarch,Activision,Call of Duty,"Nov 18, 2008",46341,3870,"2,000,000 .. 5,000,000",...,0.922925,0.875917,0.938333,0.919609,4.6,"Once upon a time, Call of Duty was a legit, ep...","[Once upon a time, Call of Duty was a legit, e...","[zombie, custom, cod, duty]",3,"[([duty], {'neg': 0.084, 'neu': 0.882, 'pos': ..."
4,10100,King's Quest™ Collection,Adventure,Sierra,Activision,,"Sep 1, 2006",312,47,"100,000 .. 200,000",...,0.869081,0.703135,0.409000,0.497241,2.5,How to Get KQ Collection Working on Windows 7...,[ How to Get KQ Collection Working on Windows ...,"[quest, king, graham, collection]",4,"[([collection, king, quest], {'neg': 0.0, 'neu..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1433,259490,Beast Boxing Turbo,"Action, Indie, Sports",,,,,263,31,"0 .. 20,000",...,0.894558,0.795864,0.042667,0.268626,1.3,BBT reminds me of the time when indie games we...,[BBT reminds me of the time when indie games w...,"[boxing, punch, fun, opponent]",1433,"[([fun, punch], {'neg': 0.012, 'neu': 0.741, '..."
1434,259510,Shufflepuck Cantina Deluxe,"Action, Indie, Sports",Agharta Studio,Agharta Studio,,"Dec 6, 2013",451,82,"100,000 .. 200,000",...,0.846154,0.633089,0.409000,0.476227,2.4,This is a unique air hockey inspired game. The...,[This is a unique air hockey inspired game. Th...,"[shufflepuck, hockey, cantina, puck]",1434,"[([puck, hockey], {'neg': 0.07, 'neu': 0.781, ..."
1435,259530,Savant - Ascent,"Action, Indie",D-Pad Studio,,Savant - Ascent,"Dec 4, 2013",2583,219,"200,000 .. 500,000",...,0.921842,0.869246,0.595000,0.677274,3.4,A VERY highly recommended game for music lover...,[A VERY highly recommended game for music love...,"[savant, music, ascent, fun]",1435,"[([music, savant], {'neg': 0.1, 'neu': 0.653, ..."
1436,259550,Hero of the Kingdom,"Adventure, Casual, Indie, RPG",Lonely Troops,Lonely Troops,Hero of the Kingdom,"Dec 20, 2012",4376,296,"200,000 .. 500,000",...,0.936644,0.917945,0.595000,0.691884,3.5,I'm so glad I played this. After becoming a CS...,[I'm so glad I played this. After becoming a C...,"[story, click, kingdom, hour]",1436,"[([story], {'neg': 0.05, 'neu': 0.711, 'pos': ..."


In [None]:
# Only including relevant columns
sentiment_df = with_reviews.copy()
sentiment_df = sentiment_df[['app_id', 'title', 'popularity_rating', 'agg_reviews','review_list','top_terms','topic_sentiment','row_number']]
sentiment_df.to_csv("sentiment_df.csv", index=False)

In [None]:
# Function to average the compound sentiment of each topic within a single game
def avg_sentiments(row):
    topic_sentiments = row['topic_sentiment']
    topic_scores = defaultdict(list)

    for keywords, sentiment_dict in topic_sentiments:
        for keyword in keywords:
            topic_scores[keyword].append(sentiment_dict['compound'])

    avg_compound_per_topic = {k: sum(v)/len(v) for k, v in topic_scores.items()}
    return avg_compound_per_topic

sentiment_df['avg_sentiment'] = sentiment_df.progress_apply(avg_sentiments, axis=1)

100%|██████████| 1438/1438 [00:00<00:00, 1804.06it/s]


In [None]:
# Calculating the average sentiment by keeping track of each topic's sentiment sum and number of occurrences
topic_sum = defaultdict(float)
topic_count = defaultdict(int)

# For each game's average-sentiment dict, add each topic's sentiment to its running sum and count
for topic_dict in sentiment_df['avg_sentiment']:
    if isinstance(topic_dict, dict):
        for topic, sentiment in topic_dict.items():
            topic_sum[topic] += sentiment
            topic_count[topic] += 1

# Calculating the average sentiment
avg_sentiment = {
    topic: topic_sum[topic] / topic_count[topic]
    for topic in topic_sum
}

# Getting the top 50 most common topics and putting that into a df
top_50_topics = Counter(topic_count).most_common(50)
top_50_df = [
    {
        'topic': topic,
        'count': count,
        'avg_sentiment': avg_sentiment[topic]
    }
    for topic, count in top_50_topics
]
top_50_df = pd.DataFrame(top_50_df)
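
Note that the aggregation above is an average of per-game averages: `avg_sentiments` first averages a topic within each game, and `topic_sum` / `topic_count` then average those values across games, so `count` is the number of games in which a topic appears, not the number of reviews. The sketch below uses made-up compound scores to show how this can differ from pooling every review together.

```python
# Hypothetical compound scores for the topic "story" in two games
per_game_scores = {
    "game_a": [0.8, 0.6, 0.7, 0.9],  # four matching reviews
    "game_b": [-0.5],                # one matching review
}

# Step 1: per-game average (what avg_sentiments computes for one topic)
per_game_avg = {g: sum(s) / len(s) for g, s in per_game_scores.items()}

# Step 2: average of the per-game averages (each game counts once)
mean_of_means = sum(per_game_avg.values()) / len(per_game_avg)

# Alternative: pool every review together (each review counts once)
all_scores = [x for s in per_game_scores.values() for x in s]
pooled_mean = sum(all_scores) / len(all_scores)

print(round(mean_of_means, 3), round(pooled_mean, 3))  # 0.125 0.5
```

Weighting each game equally keeps heavily reviewed games from dominating a topic, at the cost of giving a game with a single matching review as much influence as one with hundreds.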

In [None]:
# Visualizing the top 50 topics with their average sentiments
alt.Chart(top_50_df).mark_point().encode(
    x = "count",
    y = "avg_sentiment",
    tooltip= "topic"
)



## Conclusion

All in all, my data and sentiment analysis suggest that certain genres tend to be better received by the public. However, I encourage further analysis of the game market, as this study covered only a limited sample of roughly 1,500 games and the model did not produce statistically significant findings. For future studies or iterations, I would use more games and explore different models to understand what makes a game good and popular. If possible, factoring in things like marketing, social media attention, and game quality could result in more accurate findings. Additionally, qualitative analysis of artistic and subjective opinions can be difficult, so exploring more precise factors like playtime, retention, and in-game purchases might reveal new, more meaningful findings.