# IGN Reviews - 10/10 Statistical Analysis

#### Sometimes during the weekends I like to do fun little projects. Being a huge video games fan, I think this is going to be a very interesting one. So this is a statistical analysis of over 18 thousand IGN reviews.

In [None]:
# Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
sns.set_style('darkgrid')
sns.set_palette(sns.color_palette("hls", 4))

import warnings
warnings.filterwarnings('ignore')

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
df = pd.read_csv('../input/ign.csv')

In [None]:
df.columns

It looks like the `Unnamed: 0` column can be safely dropped. Let's also drop the `url` column, since it doesn't add much to our analysis.

In [None]:
df.drop(['Unnamed: 0', 'url'], axis=1, inplace=True)
df.dropna(inplace=True)
df.head()

Now let's check which values are present in the `score_phrase` column, and start preparing it for plotting.

In [None]:
df['score_phrase'].value_counts()

Hmm, interesting, there seem to be way more positive reviews than negative ones. Let's investigate this further. We can make a list with all these values, ordered from best to worst, to use in our graph.

In [None]:
scores = ['Masterpiece', 'Amazing', 'Great', 'Good', 'Okay', 'Mediocre', 'Bad', 'Awful',
         'Painful', 'Unbearable', 'Disaster']
scores_values = list(reversed(range(0, 11)))

# Reviews Distribution

In [None]:
fig, (axis1, axis2) = plt.subplots(2, 1, figsize=(15, 10))
# Pass the data as x= so each category gets its own bar on the given axes
sns.countplot(x=df['score_phrase'], order=scores, ax=axis1)
# Reuse scores_values (10 down to 0) for the numeric scores
sns.countplot(x=df['score'], order=scores_values, ax=axis2)

As we can see in the graphs above, the review distribution is definitely skewed towards the positive side, with a big concentration of reviews around 'Great' and 'Good', which correspond to scores of 8 and 7.
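
To put a number on that skew, here is a minimal sketch that summarizes a score column. The toy series stands in for `df['score']`, and both the helper name `summarize_scores` and the cutoff of 7 for a "positive" review are assumptions made for illustration, not something taken from IGN.

```python
import pandas as pd


def summarize_scores(scores: pd.Series, positive_threshold: float = 7.0) -> dict:
    """Return a few statistics describing how a score column is distributed."""
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        # Negative skewness means the long tail points toward the low scores
        "skewness": scores.skew(),
        # Fraction of reviews at or above the (assumed) positive cutoff
        "share_positive": (scores >= positive_threshold).mean(),
    }


# Hypothetical toy data; in the notebook you would pass df['score'] instead
toy_scores = pd.Series([9.0, 8.5, 8.0, 8.0, 7.5, 7.0, 7.0, 6.0, 4.5, 2.0])
summary = summarize_scores(toy_scores)
print(summary)
```

A mean below the median together with a negative skewness is what a pile-up of high scores with a tail of low ones looks like numerically.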

# Cleaning Some Data

Let's take a look at all the values present in the 'genre' column.

In [None]:
df['genre'].unique()

Well... that looks bad... A lot of genres seem to have 2 keywords, like 'Puzzle, Platformer', even though both genres already exist separately. We can fix this with a couple of lambda functions that keep only the first keyword of each genre. Also, for some reason, 'Baseball' is a separate genre from 'Sports', so let's fix that by putting Baseball into the Sports category.

In [None]:
df['genre'] = df['genre'].apply(lambda x: str(x).split(',')[0])
df['genre'] = df['genre'].apply(lambda x: 'Sports' if x == 'Baseball' else x)
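
The same cleanup can also be written with pandas' vectorized string methods instead of lambdas. This is only a sketch on a made-up series (`toy_genres` is hypothetical). It additionally strips stray whitespace, and unlike the `str(x)` version above it leaves missing values as missing rather than turning them into the string 'nan'.

```python
import pandas as pd

# Hypothetical toy data standing in for df['genre']
toy_genres = pd.Series(
    ["Puzzle, Platformer", "Action", "Baseball", "Sports, Action", "Racing"]
)

cleaned = (
    toy_genres
    .str.split(",").str[0]            # keep only the first keyword
    .str.strip()                      # drop surrounding whitespace, if any
    .replace({"Baseball": "Sports"})  # fold Baseball into Sports
)
print(cleaned.tolist())
```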

Great! Now let's plot and see if we find out anything interesting.

In [None]:
plt.figure(figsize=(15, 7))
sns.countplot(y=df['genre'])

WOW! There seem to be A LOT more games in the 'Action' category (more than double) than in any other one. The second most popular category is Sports, followed by Shooter, which I'd argue could be included in the Action group. The next most popular genre is Racing, which again I think could be included in the Sports category, and after that comes Adventure. I think it's interesting that there are more Racing than Adventure games released.

## Titles WordCloud

Let's take a look at a word cloud of the most common words used in the titles of the video games in this dataset.

In [None]:
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white',
                      width=1280,height=720, max_words=60,
                      prefer_horizontal=0.85, colormap='tab10').generate(" ".join(df['title']))

plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

Interesting to see that there are more Star Wars titles than Call of Duty ones. I'd never have guessed that. Also, it's funny how some TV shows, like Walking Dead, and movies, like Harry Potter, show up in the cloud. Another interesting thing is that Tiger Woods also makes an appearance.

# Most Popular Platforms

Now let's take a look at the most popular platforms by titles released.

In [None]:
df['platform'].value_counts()

In [None]:
platforms = ['PC', 'PlayStation 2', 'Xbox 360', 'Wii', 'PlayStation 3', 'Nintendo DS',
             'PlayStation', 'Wireless', 'iPhone', 'Xbox', 'PlayStation Portable',
            'Game Boy Advance', 'GameCube', 'Game Boy Color', 'Nintendo 64', 'Dreamcast', 
            'PlayStation 4', 'Nintendo DSi', 'Nintendo 3DS', 'Xbox One', 'PlayStation Vita',
            'Wii U']

In [None]:
df['platform'] = df['platform'].apply(lambda x: 'Others' if x not in platforms else x)
plt.figure(figsize=(15, 7))
sns.countplot(y=df['platform'])

We can see that the platform with the most releases is PC. That makes sense, since computers don't really have a lifespan: they get upgraded, but stay categorized as PC. The second and third platforms with the most releases are the PlayStation 2 and Xbox 360, followed by the Wii and PlayStation 3.
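
The 'Others' bucket can also be built from the counts themselves rather than from a hand-typed list. Below is a minimal sketch on toy data. The helper `bucket_rare` and its `top_n` parameter are hypothetical names, and whether the top 22 platforms by count match the hand-written list exactly is not something verified here.

```python
import pandas as pd


def bucket_rare(values: pd.Series, top_n: int, other_label: str = "Others") -> pd.Series:
    """Keep the top_n most frequent values and relabel everything else."""
    keep = values.value_counts().nlargest(top_n).index
    # where() keeps entries that satisfy the condition and replaces the rest
    return values.where(values.isin(keep), other_label)


# Hypothetical toy data standing in for df['platform']
toy_platforms = pd.Series(
    ["PC", "PC", "PC", "Wii", "Wii", "Xbox 360", "Xbox 360", "Ouya", "Zeebo"]
)
bucketed = bucket_rare(toy_platforms, top_n=3)
print(bucketed.value_counts())
```

In the notebook this would be called as `bucket_rare(df['platform'], top_n=22)` to mirror the size of the list above.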

# Game Releases by Date

Let's take a look at the graphs for games released by day, month and year. From my experience as a gamer, I feel that the hottest months for releases are around September/October. Let's see if that's how it goes and if there is some other interesting information.

In [None]:
fig, (axis1, axis2, axis3) = plt.subplots(3, 1, figsize=(15, 15))
df.groupby(['release_day']).size().plot(ax=axis1, c='b')
df.groupby(['release_month']).size().plot(ax=axis2, c='b')
df.groupby(['release_year']).size().plot(ax=axis3, c='b')

Interesting! Apparently the hottest month for game releases is November! I guess it makes sense: since it's the month before the holidays, it's a pretty obvious business decision to release games in November, anticipating Christmas. Also, the distribution of releases across days of the month looks pretty even, except for the last few days, where it drops off abruptly. At least part of that drop is probably just the calendar, since only seven months have a 31st and February usually stops at 28.

From the graph of releases by year we can see that the video game industry really took off at the end of the 90s, probably boosted by the releases of the Nintendo 64 and PlayStation in the mid 90s.
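
To read the busiest month off the data instead of eyeballing the line plot, something like the sketch below could be used. The toy series stands in for `df['release_month']`, and `releases_per_month` is a hypothetical helper name.

```python
import pandas as pd


def releases_per_month(months: pd.Series) -> pd.Series:
    """Count releases per calendar month, keeping months with zero releases."""
    counts = months.value_counts().reindex(range(1, 13), fill_value=0)
    counts.index.name = "release_month"
    return counts


# Hypothetical toy data; in the notebook you would pass df['release_month']
toy_months = pd.Series([11, 11, 11, 10, 10, 9, 3, 6, 12])
monthly = releases_per_month(toy_months)
print(monthly)
print("Busiest month:", monthly.idxmax())
```

Reindexing over 1 to 12 keeps empty months on the axis, so a gap in the data shows up as a zero instead of silently disappearing.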

# Masterpieces

Let's take a look at the games IGN has rated as masterpieces.

In [None]:
# Filter first, then drop the now-constant column on the resulting copy
# (avoids modifying a slice of df in place)
masterpieces = df[df['score_phrase'] == 'Masterpiece'].drop('score_phrase', axis=1)
masterpieces

## Masterpieces by Platform

In [None]:
masterpieces['platform'].value_counts()

In [None]:
plt.figure(figsize=(13, 5))
sns.countplot(y='platform', data=masterpieces)

We can see that the platform with the most masterpieces is the Game Boy Color, followed by the PlayStation 3 and Xbox 360. It's remarkable that the Game Boy Color has the most masterpieces, since it's not even in the top 10 for release count.
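
Since platforms differ so much in release count, a rate can be more telling than a raw count. Here is a minimal sketch of a masterpieces-per-review rate by platform. The toy frame and the helper name `masterpiece_rate` are hypothetical, and the function assumes it is given a frame that still has the `score_phrase` column.

```python
import pandas as pd


def masterpiece_rate(frame: pd.DataFrame) -> pd.Series:
    """Share of each platform's reviews that carry the 'Masterpiece' phrase."""
    is_masterpiece = frame["score_phrase"].eq("Masterpiece")
    # The mean of a boolean column is the fraction of True values
    rates = is_masterpiece.groupby(frame["platform"]).mean()
    return rates.sort_values(ascending=False)


# Hypothetical toy data; in the notebook you would pass df
toy = pd.DataFrame({
    "platform": ["PC", "PC", "PC", "PC", "Game Boy Color", "Game Boy Color", "Wii"],
    "score_phrase": ["Great", "Good", "Masterpiece", "Okay", "Masterpiece", "Good", "Bad"],
})
rates = masterpiece_rate(toy)
print(rates)
```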

## Masterpieces by Year

Let's take a look at the distribution of masterpieces by year.

In [None]:
plt.figure(figsize=(13, 8))
sns.countplot(x='release_year', data=masterpieces)

# Worst Titles

Since there are only 3 disasters, let's take a look at the bottom two scores, disasters and unbearables.

In [None]:
disasters = df[df['score_phrase']=='Disaster']
unbearables = df[df['score_phrase']=='Unbearable']
worst = pd.concat([disasters, unbearables])
worst

## Worst Titles by Platform

In [None]:
worst['platform'].value_counts()

In [None]:
plt.figure(figsize=(13, 5))
sns.countplot(y='platform', data=worst)

Well... things are looking pretty bad for the Wii. Having so many bad titles on the PC is expected, since it's the platform with the most releases, but the Wii is only 4th in the ranking of titles released.

#### Thank you all for taking a look at my notebook. This is for fun/learning purposes only. I'm a huge video games fan, so it's a really interesting project for me. If you have any suggestions, please leave them in the comments section on Kaggle. Thanks!