# Data Wrangling
Video game reviews, along with other pertinent features, were scraped from www.metacritic.com for role-playing, shooter and sports games on three gaming consoles (Xbox One, PS4, Nintendo Switch). For each game, the 15 most recent user reviews were scraped along with their individual user scores and sentiments. Because the focus is on the sentiment of the common user, critic reviews were not collected individually; only the average critic score and the overall critic sentiment were kept.

__Note:__ Not all gamers who submitted a review score left an actual review.

In [1]:
# Import libraries for data cleaning, text-preprocessing and EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Load data
gameReviews = pd.read_csv('MetacriticGameReviews.csv', index_col=0)
gameReviews.head()

Unnamed: 0,title,platform,metascore,metasentiment,average_userscore,average_usersentiment,developer,genre,number_of_players,esrb_rating,release_date,username,userscore,usersentiment,review,review_date
0,Red Dead Redemption 2,Xbox One,97.0,positive,7.9,positive,Rockstar Games,Action adventure,Up to 32,M,"Oct 26, 2018",gnadenlos,7,mixed,"The main problem is, that it's not a real open...","Nov 1, 2018"
1,Red Dead Redemption 2,Xbox One,97.0,positive,7.9,positive,Rockstar Games,Action adventure,Up to 32,M,"Oct 26, 2018",Feriatus,7,mixed,It's not a bad game but the gameplay is an out...,"Oct 29, 2018"
2,Red Dead Redemption 2,Xbox One,97.0,positive,7.9,positive,Rockstar Games,Action adventure,Up to 32,M,"Oct 26, 2018",ponux,7,mixed,"Visually superb (except cutscenes), good (not ...","Nov 5, 2018"
3,Red Dead Redemption 2,Xbox One,97.0,positive,7.9,positive,Rockstar Games,Action adventure,Up to 32,M,"Oct 26, 2018",Picklock,5,mixed,"Great looking game backed up by clumsy, overly...","Nov 4, 2018"
4,Red Dead Redemption 2,Xbox One,97.0,positive,7.9,positive,Rockstar Games,Action adventure,Up to 32,M,"Oct 26, 2018",Saints,6,mixed,Red Dead Redemption 2 is an amazing game that ...,"Oct 30, 2018"


In [3]:
gameReviews.review[0]

"The main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.\r\n\r\nThe separate and realThe main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.The separate and real open world part is done quite well, but it doesn't have enough interesting and coherent/interlocking elements to keep you motivated very long. If you remove the linear story and high production value, it's the usual open w

### Feature definition
***
1. __title:__ Title of the game <br>

2. __platform:__ The console the reviewer played the game on <br>

3. __metascore:__ The average score given to the game by various game critics (float range of 0-100) <br>

4. __metasentiment:__ The overall critic sentiment classification based on critic ratings/metascore (positive, mixed, negative) <br>

5. __average_userscore:__ The average score given to the game by users (float range of 0-10) <br>

6. __average_usersentiment:__ The overall user sentiment classification based on average user score (positive, mixed, negative) <br>

7. __developer:__ Developer of game <br>

8. __genre:__ Genre of game <br>

9. __number_of_players:__ Number of players that can play the game <br>

10. __esrb_rating:__ Entertainment Software Rating Board (ESRB) rating <br>

11. __release_date:__ Release date of game <br>

12. __username:__ The Metacritic username of the game reviewer <br>

13. __userscore:__ Individual user rating (integer range of 0-10) <br>

14. __usersentiment:__ Individual user sentiment classification based on their user score (positive, mixed, negative) <br>

15. __review:__ Text review left by user

16. __review_date:__ Date review was left by user

### Initial data exploration

In [4]:
# Save shape of dataframe to compare after data is cleaned
init_shape = gameReviews.shape

# Check dtypes of each feature
gameReviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20910 entries, 0 to 20909
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   title                  20910 non-null  object 
 1   platform               20910 non-null  object 
 2   metascore              20895 non-null  float64
 3   metasentiment          20895 non-null  object 
 4   average_userscore      20895 non-null  float64
 5   average_usersentiment  20895 non-null  object 
 6   developer              20895 non-null  object 
 7   genre                  20910 non-null  object 
 8   number_of_players      17370 non-null  object 
 9   esrb_rating            20445 non-null  object 
 10  release_date           20910 non-null  object 
 11  username               20910 non-null  object 
 12  userscore              20910 non-null  int64  
 13  usersentiment          20910 non-null  object 
 14  review                 20908 non-null  object 
 15  re

Many of the features are of the object type, including the release and review dates, which may be of better use as a datetime type. There is a substantial number of null values in the "number_of_players" column (about 17% of rows) and a small number of nulls in a few other columns.
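
As a minimal sketch of the date conversion mentioned above, the snippet below parses strings in the same "Oct 26, 2018" style into datetimes. The two-row `sample` frame and the `days_since_release` column are hypothetical and only for illustration; the format string assumes every date in the scraped data follows this pattern.

```python
import pandas as pd

# Hypothetical two-row sample using the same date style as the scraped data
sample = pd.DataFrame({
    "release_date": ["Oct 26, 2018", "Oct 26, 2018"],
    "review_date": ["Nov 1, 2018", "Oct 29, 2018"],
})

for col in ["release_date", "review_date"]:
    # errors="coerce" turns strings that do not match the format into NaT
    sample[col] = pd.to_datetime(sample[col], format="%b %d, %Y", errors="coerce")

# With datetime columns, date arithmetic becomes possible
sample["days_since_release"] = (sample["review_date"] - sample["release_date"]).dt.days

print(sample.dtypes)
print(sample["days_since_release"].tolist())
```

Using `errors="coerce"` is a design choice: malformed dates surface as missing values that can be counted afterwards, rather than stopping the conversion.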

In [5]:
# Checkout basic statistical qualities for numerical features
gameReviews.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
metascore,20895.0,75.819813,10.997743,17.0,71.0,78.0,83.0,97.0
average_userscore,20895.0,6.771716,1.608292,0.2,6.1,7.2,7.9,9.6
userscore,20910.0,6.776471,3.418414,0.0,5.0,8.0,10.0,10.0


The average critic score is around 76, on a scale that appears to be ten times that of the user scores. The mean of the per-game average user scores and the mean of the individual user scores agree at roughly 6.8.
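
To compare critic and user scores directly, the critic score can be divided by 10. This is a small sketch on a hypothetical three-row frame; the `metascore_10` and `critic_minus_user` column names are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical rows with the two score columns on their original scales
scores = pd.DataFrame({
    "metascore": [97.0, 78.0, 71.0],
    "average_userscore": [7.9, 7.2, 6.1],
})

# Bring the critic score onto the same scale as the user score
scores["metascore_10"] = scores["metascore"] / 10

# Positive values mean critics rated the game higher than users did
scores["critic_minus_user"] = (scores["metascore_10"] - scores["average_userscore"]).round(2)

print(scores)
```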

### Dealing with nulls and duplicate rows

In [6]:
# There appears to be some missing values, check for percentage of missing values in each column
def missingData(df): # Function retrieved from github.com/ithisted/PetAdoptionPrediction/blob/master/PetAdoptionPrediction.ipynb
    missing = False
    
    for item in df.isnull().sum().items():  # Series.iteritems() was removed in pandas 2.0
        if item[1] > 0:
            print('Missing Data percentage for '+item[0]+' is {:2.2%}'.format((item[1]/df.shape[0])) )
            missing = True
    if not missing:
        print('Found no missing values.')

missingData(gameReviews)

Missing Data percentage for metascore is 0.07%
Missing Data percentage for metasentiment is 0.07%
Missing Data percentage for average_userscore is 0.07%
Missing Data percentage for average_usersentiment is 0.07%
Missing Data percentage for developer is 0.07%
Missing Data percentage for number_of_players is 16.93%
Missing Data percentage for esrb_rating is 2.22%
Missing Data percentage for review is 0.01%
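
The same per-column percentages can also be obtained without an explicit loop, since the mean of a boolean null mask is the share of missing values. A minimal sketch on a hypothetical four-row frame (`df` and its columns are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has one missing value out of four
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})

# Mean of the boolean mask gives the fraction of nulls per column
missing_share = df.isnull().mean()

# Keep only columns that actually have missing values
missing_share = missing_share[missing_share > 0]

print(missing_share.map("{:.2%}".format))
```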


In [7]:
# Drop rows with missing values in columns where only about 2% or less of the values are null
def dropMissing(df, drop_list):
    df = df.dropna(axis=0, subset=drop_list)
    return df

toDrop = ['metascore', 'metasentiment', 'average_userscore', 'average_usersentiment', 'developer', 'esrb_rating','review']

gameReviews = dropMissing(gameReviews, toDrop)
missingData(gameReviews)

Missing Data percentage for number_of_players is 16.39%


In [8]:
# Explore different values of 'number_of_players column'
gameReviews.number_of_players.value_counts()

No Online Multiplayer    7905
Up to 4                  1875
2                        1544
Up to 8                  1140
Up to 10                  660
Up to 6                   645
Up to 12                  555
Up to 16                  434
Up to 22                  375
Online Multiplayer        330
Up to 5                   225
Up to 18                  195
Up to 24                  180
Up to more than 64        165
Up to 64                  165
Massively Multiplayer     150
Up to 3                   150
Up to 20                  105
Up to 32                   75
Up to 40                   75
Up to 60                   45
Up to 30                   45
1 Player                   30
Name: number_of_players, dtype: int64

In [8]:
# Explore titles of 'No Online Multiplayer' games
#pd.unique(gameReviews[gameReviews.number_of_players == 'No Online Multiplayer']['title'])

In [9]:
# Retrieve all unique values in the 'number_of_players' column and separate them into single-player and multiplayer groups
num_players_values = pd.unique(gameReviews.number_of_players)
single = [val for val in num_players_values if val in ['No Online Multiplayer', '1 Player']]
# Note: NaN is not in the single-player list, so missing values also end up in `multi`
multi = [val for val in num_players_values if val not in ['No Online Multiplayer', '1 Player']]

# Replace corresponding values to get a binary column
gameReviews['number_of_players'] = gameReviews.number_of_players.replace(single, 'singleplayer')
gameReviews['number_of_players'] = gameReviews.number_of_players.replace(multi, 'multiplayer')

# Double check that only two unique values exist: singleplayer and multiplayer
gameReviews.number_of_players.value_counts()

multiplayer     12478
singleplayer     7935
Name: number_of_players, dtype: int64

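To reduce the number of distinct, unnecessary values in the 'number_of_players' column, the titles of the 'No Online Multiplayer' games were explored. Most, if not all, were single-player games. For simplicity, the 'number_of_players' column is converted to a binary column where a game is either a single-player game (values of 'No Online Multiplayer' and '1 Player') or a multiplayer game (all other values). Note that, as the code is written, missing values are not in the single-player list, so the roughly 16% of rows with a null 'number_of_players' are also labelled multiplayer; this is why the multiplayer count is higher than the sum of the original multiplayer categories.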
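
A minimal sketch, on a hypothetical five-row series, of a variant of this binary mapping that keeps missing values missing rather than assigning them to either group. The names `players`, `single_labels` and `mode` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry
players = pd.Series(
    ["No Online Multiplayer", "Up to 4", "1 Player", np.nan, "Massively Multiplayer"]
)
single_labels = ["No Online Multiplayer", "1 Player"]

# Start from an all-missing column, then label only rows that have a value
mode = pd.Series(np.nan, index=players.index, dtype="object")
is_single = players.isin(single_labels)
mode[is_single] = "singleplayer"
mode[players.notna() & ~is_single] = "multiplayer"

print(mode.value_counts(dropna=False))
```

Whether missing values should stay missing, be dropped, or be treated as their own category depends on how the column is used later; the sketch only makes that choice explicit.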

In [10]:
# Check for duplicate rows and drop if any
def dropDuplicates(df):
    num_dups = len(df) - len(df.drop_duplicates())
    print(num_dups, "duplicate rows were dropped.")
    return df.drop_duplicates()

gameReviews = dropDuplicates(gameReviews)
print()

30 duplicate rows were dropped.



In [11]:
# Compare shape of clean dataframe to the initial shape
clean_shape = gameReviews.shape

print('Initial shape:', init_shape)
print('Current shape:', clean_shape)

Initial shape: (20910, 16)
Current shape: (20383, 16)


### Text pre-processing

In [None]:
# TO DO:
# Include contractions
# Order functions and add doc string with " "
# Order: lower, contractions, digits, punctuation, whitespace, lemmatize, stopwords
# Potentially see all languages included in corpus
# Compare a couple of the reviews to the reviews on metacritic, may potentially need to fix scraper script

In [30]:
from langdetect import detect_langs
from contractions import CONTRACTION_MAP
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import string
import re

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\filia\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\filia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\filia\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [14]:
def make_lower(docs):
    lowered = [doc.lower() for doc in docs]
    return lowered

In [15]:
def remove_digits(docs):
    pattern = r'\d+'  # one or more digits; '\d*' would also match empty strings
    digitless = [re.sub(pattern, '', doc) for doc in docs]
    return digitless

In [47]:
# Function retrieved from https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
def expand_contractions(docs, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
    
    expanded_texts = []
    for doc in docs:
        expanded_text = contractions_pattern.sub(expand_match, doc)
        expanded_text = re.sub("'", "", expanded_text)
        expanded_texts.append(expanded_text)
        
    return expanded_texts
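
Because `CONTRACTION_MAP` comes from a separate module, the following self-contained sketch shows the same idea on a deliberately tiny, hypothetical mapping (`SMALL_MAP`). It is not the function above; it additionally escapes the keys and sorts them longest-first, which is an assumption about what is safest for regex alternation.

```python
import re

# Hypothetical, tiny mapping; the notebook imports a much larger CONTRACTION_MAP
SMALL_MAP = {
    "it's": "it is",
    "aren't": "are not",
    "doesn't": "does not",
    "we've": "we have",
}

# Longest keys first so longer contractions are tried before shorter ones
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(SMALL_MAP, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand(text):
    def repl(match):
        found = match.group(0)
        expanded = SMALL_MAP[found.lower()]
        # Keep the original first character so capitalisation is preserved
        return found[0] + expanded[1:]
    return pattern.sub(repl, text)

print(expand("It's fine, but controls aren't good."))
```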

In [17]:
def remove_punctuation(docs):
    punc_filtered = []
    
    # Separate punctuation marks that occur between words ('/' and '-') from all other punctuation marks
    punctuation = ''.join([p for p in string.punctuation if p not in ['/','-']])
    between_words = r'[/\-]'
    
    # Replace between-word marks with a space and delete all remaining punctuation marks
    table = str.maketrans("","",punctuation)
    for doc in docs:
        no_puncs = re.sub(between_words, ' ', doc)
        no_puncs = no_puncs.translate(table)
        punc_filtered.append(no_puncs)
    
    #no_puncs = [doc.translate(table) for doc in docs]
    return punc_filtered

In [18]:
def remove_whitespace(docs):
    no_ws = [' '.join(doc.split()) for doc in docs]
    return no_ws

In [19]:
def remove_stopwords(docs):
    ''' Remove English stop words plus the domain-specific words 'video' and 'game'; the negations 'no' and 'not' are kept '''
    stopword_filtered = []

    # Potentially download sw from other languages
    stop_words = stopwords.words('english')
    # Add domain specific words to the list of stop words
    stop_words.extend(['video','game'])
    # Remove negation words to extract correct sentiment
    stop_words.remove('no')
    stop_words.remove('not')
    #negation_words = ['no','nor','not']
    #stop_words = [w for w in stop_words if w not in negation_words]

    for doc in docs:
        tokens = word_tokenize(doc)
        output = [t for t in tokens if t not in stop_words]
        stopword_filtered.append(' '.join(output))
  
    return stopword_filtered

In [20]:
def lemmatize(docs):
    # Initialize empty list to store lemmatized text
    lemmatized = []
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Loop over all documents, tokenize each one, lemmatize every token and join the lemmas back into a string
    for doc in docs:
        tokens = word_tokenize(doc)
        output = [lemmatizer.lemmatize(t) for t in tokens]
        lemmatized.append(' '.join(output))
    
    return lemmatized

In [25]:
def clean_text(df, col, lower=True, contraction=True, digit=True, punctuation=True, whitespace=True, stopwords=True, lemma=True):
    clean_corpus = list(df[col])
    
    if lower:
        clean_corpus = make_lower(clean_corpus)
        
    if contraction:
        clean_corpus = expand_contractions(clean_corpus)
        
    if digit:
        clean_corpus = remove_digits(clean_corpus)
        
    if punctuation:
        clean_corpus = remove_punctuation(clean_corpus)
        
    if whitespace:
        clean_corpus = remove_whitespace(clean_corpus)
        
    if stopwords:
        clean_corpus = remove_stopwords(clean_corpus)
        
    if lemma:
        clean_corpus = lemmatize(clean_corpus)
        
    df['clean_text'] = clean_corpus

In [36]:
def clean_text2(docs, lower=True, contraction=True, digit=True, punctuation=True, whitespace=True, stopwords=True, lemma=True):
    clean_corpus = list(docs)
    
    if lower:
        clean_corpus = make_lower(clean_corpus)
        
    if contraction:
        clean_corpus = expand_contractions(clean_corpus)
        
    if digit:
        clean_corpus = remove_digits(clean_corpus)
        
    if punctuation:
        clean_corpus = remove_punctuation(clean_corpus)
        
    if whitespace:
        clean_corpus = remove_whitespace(clean_corpus)
        
    if stopwords:
        clean_corpus = remove_stopwords(clean_corpus)
        
    if lemma:
        clean_corpus = lemmatize(clean_corpus)
        
    #df['clean_text'] = clean_corpus
    return clean_corpus
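
`clean_text` and `clean_text2` repeat the same chain of `if` blocks. One possible alternative, sketched here under the assumption that each step takes and returns a list of strings, is to keep the steps in an ordered list and loop over it. The helper names (`to_lower`, `drop_digits`, `drop_punctuation`, `squeeze_whitespace`, `run_pipeline`) are hypothetical, and only standard-library steps are included to keep the example self-contained.

```python
import re
import string

def to_lower(docs):
    return [d.lower() for d in docs]

def drop_digits(docs):
    return [re.sub(r"\d+", "", d) for d in docs]

def drop_punctuation(docs):
    # '/' and '-' become spaces so joined words are split; other marks are deleted
    table = str.maketrans("", "", string.punctuation)
    return [re.sub(r"[/\-]", " ", d).translate(table) for d in docs]

def squeeze_whitespace(docs):
    return [" ".join(d.split()) for d in docs]

# The order of this list is the order in which the steps run
STEPS = [
    ("lower", to_lower),
    ("digit", drop_digits),
    ("punctuation", drop_punctuation),
    ("whitespace", squeeze_whitespace),
]

def run_pipeline(docs, skip=()):
    cleaned = list(docs)
    for name, step in STEPS:
        if name not in skip:
            cleaned = step(cleaned)
    return cleaned

sample = ["Maybe 15-20 hours   of FUN!"]
print(run_pipeline(sample))
print(run_pipeline(sample, skip=("digit",)))
```

With this layout, reordering steps (for example lemmatizing before removing stop words) means reordering one list rather than editing two functions.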

In [38]:
first2 = ["The main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.The separate and real open world part is done quite well, but it doesn't have enough interesting and coherent/interlocking elements to keep you motivated very long. If you remove the linear story and high production value, it's the usual open world stuff, we've known for years, mixed with light survival elements.So what you will get is an average to good open world sandbox for maybe 15-20 hours of random fun and a separate, very scripted and cinematic game with high production value. Both parts have average gameplay and problematic controls. The only highlights and reasons to play the game are its great graphics, atmosphere and story", "Red Dead Redemption 2 is an amazing game that is plagued with an outdated control scheme and a bounty system that needs some understanding."]


In [50]:
clean_first2 = clean_text2(first2) 
print(first2[0])
print(' ')
print(clean_first2[0])

The main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.The separate and real open world part is done quite well, but it doesn't have enough interesting and coherent/interlocking elements to keep you motivated very long. If you remove the linear story and high production value, it's the usual open world stuff, we've known for years, mixed with light survival elements.So what you will get is an average to good open world sandbox for maybe 15-20 hours of random fun and a separate, very scripted and cinematic game with high production value. Both parts have average gameplay and problematic controls. The only highlights and reasons to play the game are its great graphics, atmosphere and s

***

In [50]:
pattern = r"["+string.punctuation+"]"
print(pattern)
input_str = "The main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience." # Sample string
new_str = re.sub(pattern, ' ', input_str)
new_str = new_str.strip(' ')
#new_str = re.sub(r'\ *', ' ', new_str)
print(input_str)
print(' ')
print(new_str)

[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
The main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.
 
The main problem is  that it s not a real open world game  If you focus on the main story  like many reviewers and some users do  you will experience a linear and scripted game with almost no freedom  Every time you try something different the missions will fail Controls aren t very good  so it s also hard to recommend the game for that linear story experience


In [53]:
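# Stripping punctuation per word with str.translate works, but words joined by a missing space (e.g. 'fail.Controls') get fused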
words = input_str.split()
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped)

['The', 'main', 'problem', 'is', 'that', 'its', 'not', 'a', 'real', 'open', 'world', 'game', 'If', 'you', 'focus', 'on', 'the', 'main', 'story', 'like', 'many', 'reviewers', 'and', 'some', 'users', 'do', 'you', 'will', 'experience', 'a', 'linear', 'and', 'scripted', 'game', 'with', 'almost', 'no', 'freedom', 'Every', 'time', 'you', 'try', 'something', 'different', 'the', 'missions', 'will', 'failControls', 'arent', 'very', 'good', 'so', 'its', 'also', 'hard', 'to', 'recommend', 'the', 'game', 'for', 'that', 'linear', 'story', 'experience']


In [None]:
pattern = "[" + re.escape(string.punctuation) + "]"  # escape ']', '\\', '^' and '-' so the character class is unambiguous

input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
new_str = re.sub(pattern, '', input_str)
print(input_str)
print(new_str)

In [32]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\filia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\filia\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [33]:
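# ENGLISH_STOP_WORDS is needed by remove_stopwords below
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def make_lower(docs):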
    low = [doc.lower() for doc in docs]
    return low

def remove_punctuation(docs):
    pattern = "[" + re.escape(string.punctuation) + "]"  # escaped so special characters are matched literally
    no_puncs = [re.sub(pattern, '', doc) for doc in docs]
    return no_puncs

def remove_whitespace(docs):
    no_whitespace = [doc.strip() for doc in docs]
    return no_whitespace

def remove_stopwords(docs):
    stopword_filtered = []
    for doc in docs:
        tokens = word_tokenize(doc)
        output = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
        stopword_filtered.append(' '.join(output))
    return stopword_filtered

def lemmatize(docs):
    lemmatized = []
    lemmatizer = WordNetLemmatizer()
    for doc in docs:
        tokens = word_tokenize(doc)
        output = [lemmatizer.lemmatize(t) for t in tokens]
        lemmatized.append(' '.join(output))
    return lemmatized

def txt_process(df, col, lower=True, punctuation=True, whitespace=True, stopwords=True, lemma=True):
    clean_corpus = list(df[col])
    
    if lower:
        clean_corpus = make_lower(clean_corpus)
        
    if punctuation:
        clean_corpus = remove_punctuation(clean_corpus)
        
    if whitespace:
        clean_corpus = remove_whitespace(clean_corpus)
        
    if stopwords:
        clean_corpus = remove_stopwords(clean_corpus)
        
    if lemma:
        clean_corpus = lemmatize(clean_corpus)
        
    df['clean_text'] = clean_corpus

In [35]:
test = gameReviews.iloc[0:10,:].copy()

txt_process(test, 'review')

print(test.review[0])
print(test.clean_text[0])

The main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.

The separate and realThe main problem is, that it's not a real open world game. If you focus on the main story, like many reviewers and some users do, you will experience a linear and scripted game with almost no freedom. Every time you try something different the missions will fail.Controls aren't very good, so it's also hard to recommend the game for that linear story experience.The separate and real open world part is done quite well, but it doesn't have enough interesting and coherent/interlocking elements to keep you motivated very long. If you remove the linear story and high production value, it's the usual open world 

In [None]:
#def identify_lang(docs):

### EDA

In [None]:
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS