## Part 1: Data Preparation

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", None, "display.max_columns", None)

# I want to look at the games sold in 2019 and match them with their Steam reviews
# to see whether the top-selling games are in fact positively received by gamers.
# I will also run sentiment analysis on the review texts to see how strongly gamers praised these games
# and what they say about the worst ones.


game_sales= pd.read_csv('vgchartz-7_7_2020.csv')
#print(game_sales.head())
# Drop the sales, score and metadata columns that are not used below.
game_sales= game_sales.drop(columns=['Unnamed: 0', 'vg_score','img','na_sales','jp_sales','pal_sales','other_sales','last_update',
                                     'publisher','developer','user_score','critic_score','total_shipped','total_sales','genre'])
game_sales= game_sales.rename(columns={'release_date':'release date'}).fillna('Data not available')
game_sales_pc= game_sales[game_sales['console']== 'PC'].reset_index(drop=True)
# Normalize titles: lowercase, then remove the copyright/trademark signs and every other non-letter character.
game_sales_pc['title']= game_sales_pc['title'].str.lower().replace(u'(\N{COPYRIGHT SIGN}|\N{TRADE MARK SIGN}|\N{REGISTERED SIGN})','', regex=True)
game_sales_pc['title']= game_sales_pc['title'].replace(r"[^A-Za-z\s]+","", regex=True)

reviews_2019= pd.read_csv('steam_reviews.csv')
reviews_2019['title']= reviews_2019['title'].str.lower().replace(u'(\N{COPYRIGHT SIGN}|\N{TRADE MARK SIGN}|\N{REGISTERED SIGN})','', regex=True)
reviews_2019['title']= reviews_2019['title'].replace(r"[^A-Za-z\s]+","", regex=True)
reviews_2019.head()


| Unnamed: 0 | date_posted | funny | helpful | hour_played | is_early_access_review | recommendation | review | title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-02-10 | 2 | 4 | 578 | False | Recommended | &gt Played as German Reich&gt Declare war on B... | expansion hearts of iron iv man the guns |
| 1 | 2019-02-10 | 0 | 0 | 184 | False | Recommended | yes. | expansion hearts of iron iv man the guns |
| 2 | 2019-02-07 | 0 | 0 | 892 | False | Recommended | Very good game although a bit overpriced in my... | expansion hearts of iron iv man the guns |
| 3 | 2018-06-14 | 126 | 1086 | 676 | False | Recommended | Out of all the reviews I wrote This one is pro... | dead by daylight |
| 4 | 2017-06-20 | 85 | 2139 | 612 | False | Recommended | Disclaimer I survivor main. I play games for f... | dead by daylight |
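
Both tables get the same title clean-up so that they can later be matched on `title`. A minimal sketch of that clean-up as one reusable function is shown here; `normalize_titles` is a hypothetical helper, not part of the original notebook, and the whitespace-collapsing step is an extra assumption that the notebook does not perform.

```python
import pandas as pd

# Hypothetical helper (not in the original notebook): apply the same
# clean-up to any title column so both tables end up with matching keys.
def normalize_titles(titles: pd.Series) -> pd.Series:
    cleaned = titles.str.lower()
    # Drop everything except ASCII letters and whitespace
    # (this also removes the copyright/trademark signs, digits and punctuation).
    cleaned = cleaned.str.replace(r"[^a-z\s]+", "", regex=True)
    # Assumed extra step: collapse the runs of whitespace left behind.
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    return cleaned

# Illustrative titles only.
sample = pd.Series(["Grand Theft Auto V", "RESIDENT EVIL 7: biohazard", "ASTRONEER\u2122"])
print(normalize_titles(sample).tolist())
```

Because digits are stripped, numbered sequels of the same series can collapse to one key, which is worth keeping in mind when joining the two tables.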


## Part 2:
In this part, I aggregate the "Recommended" and "Not Recommended" labels for each game to see which games
have the most recommendations from Steam reviewers.
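
As a rough illustration of this aggregation, the per-game counts can also be produced in one step with `pd.crosstab`. The sketch below runs on a made-up `reviews` frame standing in for `reviews_2019`; the column names `total` and `favorable_pct` are illustrative choices, not names from the notebook.

```python
import pandas as pd

# Toy stand-in for reviews_2019; the rows are made up for illustration.
reviews = pd.DataFrame({
    "title": ["game a", "game a", "game a", "game b", "game b", "game c"],
    "recommendation": ["Recommended", "Recommended", "Not Recommended",
                       "Recommended", "Recommended", "Not Recommended"],
})

# One row per title, one column per recommendation label; label/title
# combinations that never occur are filled with 0 instead of NaN.
counts = pd.crosstab(reviews["title"], reviews["recommendation"])
counts["total"] = counts.sum(axis=1)
counts["favorable_pct"] = (counts["Recommended"] / counts["total"] * 100).round()

print(counts)
```

One property of this form is that a game with only negative reviews still gets a row, with a `Recommended` count of 0.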

In [2]:
recommended= reviews_2019[reviews_2019['recommendation'] == 'Recommended'].groupby('title').size().reset_index().rename(columns={0:'Recommend'})
total_recommendations= reviews_2019.groupby('title')['recommendation'].count().reset_index().rename(columns={'recommendation':'total recommendations'})
not_recommended= reviews_2019[reviews_2019['recommendation'] == 'Not Recommended'].groupby('title').size().reset_index().rename(columns={0:'not recommended'})
recommended= recommended.merge(total_recommendations, how='left', on='title')
final_recommendation= recommended.merge(not_recommended, how='left', on='title')
final_recommendation= final_recommendation[['title','Recommend','not recommended','total recommendations']]
final_recommendation['Favorable %']= (final_recommendation['Recommend']/final_recommendation['total recommendations']*100).round()
final_recommendation['Unfavorable %']= (final_recommendation['not recommended']/final_recommendation['total recommendations']*100).round().fillna('Does not have a negative review')

# Many games have no negative reviews, or too few reviews overall to support
# any conclusion, so I only look at the 10 games with the most reviews
# to avoid drawing conclusions from small samples.
# I checked on Steam and found that Resident Evil 7: Biohazard and Astroneer received an overwhelming number
# of positive reviews from players, so these games may be outliers in the dataset.

most_reviews= final_recommendation.sort_values(by='total recommendations', ascending=False).head(10).reset_index(drop=True)
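
The idea of restricting the analysis to games with enough reviews can be sketched on toy data as follows. The `summary` frame and the `MIN_REVIEWS` cutoff are assumptions made for this example; the notebook itself simply takes the 10 most-reviewed games.

```python
import pandas as pd

# Toy summary table; the numbers are made up for illustration.
summary = pd.DataFrame({
    "title": ["game a", "game b", "game c", "game d"],
    "total recommendations": [1200, 3, 450, 80],
    "Favorable %": [71.0, 100.0, 64.0, 90.0],
})

# Assumed cutoff, chosen arbitrarily for this example.
MIN_REVIEWS = 50

# Drop games whose percentages rest on a handful of reviews,
# then keep the most-reviewed ones.
enough_reviews = summary[summary["total recommendations"] >= MIN_REVIEWS]
top = enough_reviews.nlargest(2, "total recommendations").reset_index(drop=True)
print(top)
```

Here "game b" shows 100% favorable, but with only 3 reviews it is filtered out before the ranking.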

In [3]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer= SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\gauph\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [None]:
# Strip everything except letters, digits and whitespace; astype(str) also turns missing reviews into the string 'nan'.
reviews_2019['review']= reviews_2019['review'].replace(r'[^A-Za-z0-9\s]+','', regex=True).astype(str)
reviews_2019['sentiment score']= reviews_2019['review'].apply(analyzer.polarity_scores)
reviews_2019['compound score']= reviews_2019['sentiment score'].apply(lambda scores: scores['compound'])
reviews_2019= reviews_2019.drop(columns=['date_posted', 'funny','helpful','hour_played','is_early_access_review'])
# Average the compound score over every review of a title.
steam_ranking= reviews_2019.groupby('title')['compound score'].mean().reset_index()
most_reviews= most_reviews.merge(steam_ranking, how='left', on='title')
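
To make the per-title averaging concrete, here is a small sketch on made-up scores. The frame `scored` and the output names `mean_compound` and `n_reviews` are illustrative assumptions; in the notebook the scores come from VADER's `polarity_scores`, whose `compound` value lies between -1 and 1.

```python
import pandas as pd

# Toy per-review scores, made up for illustration.
scored = pd.DataFrame({
    "title": ["game a", "game a", "game a", "game b", "game b"],
    "compound score": [0.8, 0.8, -0.4, 0.0, 0.5],
})

# Mean over every review row, so a score shared by several reviews
# counts once per review; the review count is kept alongside it.
per_title = (
    scored.groupby("title")["compound score"]
    .agg(mean_compound="mean", n_reviews="size")
    .reset_index()
)
print(per_title)
```

Keeping the review count next to the mean makes it easier to judge how much weight a given average deserves.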

In [None]:
most_reviews.head()

## Part 3:
Looking at the GTA 5, PUBG, and RUST data, I see that the compound score is very low even though each game is "Favorable" according
to the number of positive reviews. Therefore, I will investigate further and see whether the negative
reviews for GTA 5, PUBG, and RUST were "extreme" (i.e. contain curse words or words that intensify the negative effect).
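
One possible way to check for "extreme" wording is to count how often strongly negative or intensifying words occur in each review. The sketch below is only an illustration: `STRONG_WORDS` is a tiny hypothetical word list, `strong_word_counts` is a made-up helper, and the sample reviews are invented rather than taken from the dataset.

```python
import re
from collections import Counter

# Hypothetical word list for illustration; a real analysis would need
# a much fuller lexicon of intensifiers and profanity.
STRONG_WORDS = {"awful", "terrible", "worst", "garbage", "very", "extremely"}

def strong_word_counts(review: str) -> Counter:
    """Count occurrences of STRONG_WORDS in one review (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return Counter(token for token in tokens if token in STRONG_WORDS)

# Made-up reviews, not taken from the dataset.
sample_reviews = [
    "Fun story but terrible, terrible servers.",
    "Its good",
    "Worst patch ever, extremely greedy.",
]

for text in sample_reviews:
    counts = strong_word_counts(text)
    # Share of the review's words that come from the strong-word list.
    share = sum(counts.values()) / max(len(text.split()), 1)
    print(dict(counts), round(share, 2))
```

Comparing this share between recommended and not-recommended reviews of the same game would be one rough way to test whether the negative reviews use stronger language.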

In [8]:
gta_5= reviews_2019[reviews_2019['title']== 'grand theft auto v'].reset_index().drop(columns=['sentiment score','compound score','title', 'index'])

In [23]:
# word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded beforehand.
test= gta_5['review']
test= test.apply(nltk.word_tokenize)


In [24]:
test.head(10)

0     [I, love, the, part, in, Online, where, you, g...
1     [best, Rockstar, optimization, so, farI, could...
2     [After, having, bought, this, game, on, PS3, P...
3       [Pros, SP, ModdingCons, Take, Two, Interactive]
4     [Everything, in, the, game, now, either, flies...
5     [The, single, player, story, was, fun, GTA, on...
6     [Single, player, is, fun, but, online, has, be...
7     [Hey, forget, about, all, those, cool, mods, b...
8     [Great, Campaign, Awful, Customer, Service, an...
9                                           [siki, yok]
10    [gg, rockstar, you, made, a, good, game, into,...
11                                          [Its, good]
12                              [No, Mods, No, Freedom]
13    [Still, negative, because, this, game, is, jus...
14    [Shame, on, the, Developers, for, making, modd...
15    [Waited, a, year, and, a, half, for, the, PC, ...
16    [Take, Two, are, heartless, cash, grabbing, pi...
17      [Making, mods, illegal, What, a, idiotic

In [30]:
fdist= test.apply(lambda x: nltk.probability.FreqDist(x))
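
Per-review frequency tables are hard to compare across thousands of reviews, so it can help to sum them into a single corpus-wide count. The sketch below uses `collections.Counter` on toy token lists; `nltk.FreqDist` is a `Counter` subclass, so the same `update` and `most_common` calls apply to it. The `STOPWORDS` set is a tiny hypothetical stop list, not the one NLTK ships.

```python
from collections import Counter

# Toy token lists for illustration.
tokenized = [
    ["No", "Mods", "No", "Freedom"],
    ["Its", "good"],
    ["no", "mods", "no", "fun"],
]

# Tiny hypothetical stop list; a real one would be much longer.
STOPWORDS = {"the", "a", "is", "its"}

corpus_counts = Counter()
for tokens in tokenized:
    # Lowercase so "No" and "no" are counted together, and skip stop words.
    corpus_counts.update(t.lower() for t in tokens if t.lower() not in STOPWORDS)

print(corpus_counts.most_common(3))
```

Lowercasing before counting matters here because the tokenized reviews keep their original capitalization.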

In [33]:
fdist.head()

0        {'I': 1, 'love': 1, 'the': 1, 'part': 1, 'in':...
1        {'best': 1, 'Rockstar': 1, 'optimization': 1, ...
2        {'After': 1, 'having': 1, 'bought': 1, 'this':...
3        {'Pros': 1, 'SP': 1, 'ModdingCons': 1, 'Take':...
4        {'Everything': 1, 'in': 1, 'the': 3, 'game': 2...
5        {'The': 1, 'single': 1, 'player': 1, 'story': ...
6        {'Single': 1, 'player': 1, 'is': 3, 'fun': 1, ...
7        {'Hey': 1, 'forget': 1, 'about': 1, 'all': 1, ...
8        {'Great': 1, 'Campaign': 1, 'Awful': 1, 'Custo...
9                                    {'siki': 1, 'yok': 1}
10       {'gg': 1, 'rockstar': 1, 'you': 1, 'made': 1, ...
11                                   {'Its': 1, 'good': 1}
12                      {'No': 2, 'Mods': 1, 'Freedom': 1}
13       {'Still': 1, 'negative': 1, 'because': 1, 'thi...
14       {'Shame': 1, 'on': 1, 'the': 1, 'Developers': ...
15       {'Waited': 1, 'a': 3, 'year': 1, 'and': 1, 'ha...
16       {'Take': 1, 'Two': 1, 'are': 1, 'heartless': 1.