# Exploration
This notebook performs a basic exploration of the Steam reviews data. It answers the following questions:
* What languages are in the dataset?
* Is the dataset unbalanced?

In [3]:
import pandas as pd
import numpy as np
from langdetect import detect

In [8]:
filename = "steam_sample.csv"

In [10]:
# read in data
df = pd.read_csv(filename,
                 header=None,
                 names=["GameID", "Review", "Recommend", "FoundHelpful"])

# set positive sentiment to 1 and negative sentiment to 0
df.loc[df['Recommend'] == 1, 'Sentiment'] = 1.0
df.loc[df['Recommend'] == -1, 'Sentiment'] = 0.0
df.drop(columns=["GameID", "Recommend", "FoundHelpful"], inplace=True)

# show a few examples
df.head()

Unnamed: 0,Review,Sentiment
0,Ruined my life.,1.0
1,This will be more of a ''my experience with th...,1.0
2,This game saved my virginity.,1.0
3,• Do you like original games? • Do you like ga...,1.0
4,"Easy to learn, hard to master.",1.0


In [11]:
# replace missing (null) reviews with empty strings
df["Review"] = df["Review"].fillna("")

## Languages
We use the Python `langdetect` package to check whether the dataset contains reviews written in languages other than English.

In [13]:
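def detect_text(text):
    """Return the detected language code, or "unknown" if detection fails."""
    try:
        language = detect(text)
    except Exception:
        # detect() raises when the text has no usable features (e.g. an empty string)
        language = "unknown"

    return language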

detect_corpus = np.vectorize(detect_text)

# detect the language of every review
df.loc[:, 'Language'] = detect_corpus(df["Review"])

df.head()

Unnamed: 0,Review,Sentiment,Language
0,Ruined my life.,1.0,en
1,This will be more of a ''my experience with th...,1.0,en
2,This game saved my virginity.,1.0,en
3,• Do you like original games? • Do you like ga...,1.0,en
4,"Easy to learn, hard to master.",1.0,en


In [29]:
# what are the most commonly occurring other languages?
df.groupby("Language")["Review"].count().reset_index(name='count').sort_values(['count'], ascending=False).head(10)

Unnamed: 0,Language,count
6,en,96995
35,unknown,2156
0,af,118
5,de,77
3,cy,74
20,no,62
19,nl,62
4,da,61
32,tl,48
14,it,39


The counts above show that the vast majority of reviews are detected as English, with only a small share assigned to other languages or left unclassified.

However, the reviews labeled as Afrikaans and German below still appear to be written in English, which suggests that `langdetect` is misclassifying them. It is therefore likely that almost all of the reviews are in English, and that the number of reviews in other languages is too small to affect our preprocessing or model.
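To put rough numbers on this, the shares can be computed directly from the counts printed above. This is a minimal sketch: the total of 100,000 rows is an assumption taken from the sentiment counts in the Dataset Balance section (13,566 + 86,434), and the variable names are illustrative.

```python
# Counts copied from the language table above.
language_counts = {"en": 96995, "unknown": 2156}

# Assumption: total row count, taken from the sentiment table (13566 + 86434).
total_reviews = 13566 + 86434

# Everything that was labeled with a language other than English.
other_count = total_reviews - sum(language_counts.values())

shares = {lang: 100 * n / total_reviews for lang, n in language_counts.items()}
shares["other"] = 100 * other_count / total_reviews

for lang, pct in shares.items():
    print(f"{lang}: {pct:.2f}%")
```

Under that assumption, fewer than 1% of the reviews carry a non-English language label.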

In [35]:
# look at some "afrikaans" entries
df[df["Language"]=="af"].head(10)

Unnamed: 0,Review,Sentiment,Language
452,it's awesome!!!,1.0,af
738,very good game,1.0,af
2613,very very nice game :D goooood job,1.0,af
2816,Very good game :],1.0,af
3306,best game in counter strike series,1.0,af
3548,very good game,1.0,af
4036,Just awesome.,1.0,af
4591,this game is like 10 dollars get it,1.0,af
5060,Good game Sanks Aftor!!!,1.0,af
5787,very good game,1.0,af


In [34]:
# look at some "german" entries
df[df["Language"]=="de"].head(10)

Unnamed: 0,Review,Sentiment,Language
749,The Best FPS :D,1.0,de
1950,THE BEST AS ALWAYS! :D,1.0,de
1951,The Best Game Ever !,1.0,de
3123,WOW THIS GAME JUST NEVER GETS OLD :) ONE OF TH...,1.0,de
4992,THE OLD COUNTER IS STUPID THERE UNBOXINGS ITS ...,0.0,de
6298,BEST GAME! :),1.0,de
6357,Wins Counter-Strike Source and Counter-Strike ...,1.0,de
8057,hmpf...,1.0,de
8982,Very Fun and Simple Killing,1.0,de
9220,Better then CS: GO,1.0,de
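A plausible explanation, offered here as a hypothesis rather than something this notebook verifies, is that these reviews are too short for a statistical language detector to classify reliably. A quick sanity check is to count the words in the ten "af" samples shown above. The list is copied from that output, and `word_count` is an illustrative helper.

```python
from statistics import median

# The ten reviews labeled "af" in the output above.
af_samples = [
    "it's awesome!!!",
    "very good game",
    "very very nice game :D goooood job",
    "Very good game :]",
    "best game in counter strike series",
    "very good game",
    "Just awesome.",
    "this game is like 10 dollars get it",
    "Good game Sanks Aftor!!!",
    "very good game",
]


def word_count(text):
    """Count whitespace-separated tokens in a review."""
    return len(text.split())


counts = [word_count(review) for review in af_samples]
print("word counts:", counts)
print("median:", median(counts), "max:", max(counts))
```

If short reviews turn out to account for most of the non-English labels in the full dataset as well, restricting language detection to reviews above some minimum length would be one option to explore.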


## Dataset Balance
Next, we look at what percentage of the dataset falls into each sentiment class.

As the counts below show, about 86% of the reviews have positive sentiment (1.0), while the remaining 14% are negative (0.0). The dataset is therefore fairly unbalanced.

In [36]:
# look at how unbalanced dataset is
df.groupby("Sentiment")["Language"].count().reset_index(name="count")

Unnamed: 0,Sentiment,count
0,0.0,13566
1,1.0,86434
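
The percentages quoted above follow directly from these counts. The sketch below also derives the accuracy of a classifier that always predicts the majority class, which is a useful baseline to keep in mind for unbalanced data. The baseline is an illustration added here, not something the notebook computes, and the variable names are assumptions.

```python
# Counts copied from the sentiment table above.
sentiment_counts = {0.0: 13566, 1.0: 86434}

total = sum(sentiment_counts.values())
shares = {label: n / total for label, n in sentiment_counts.items()}

# A model that always predicts the most frequent class is right this often.
majority_label = max(sentiment_counts, key=sentiment_counts.get)
baseline_accuracy = shares[majority_label]

# Number of positive reviews per negative review.
imbalance_ratio = sentiment_counts[1.0] / sentiment_counts[0.0]

print(f"positive share: {shares[1.0]:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
print(f"positive-to-negative ratio: {imbalance_ratio:.2f}")
```

Any model trained on this data should be judged against that baseline rather than against 50%, since plain accuracy can look high while negative reviews are still handled poorly.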
