# Game Sales (Integrated Project 1)

## Introduction

Ice, an online store, wishes to plan its advertising campaigns as effectively as possible. Historical game data has been provided to analyze the industry and identify patterns that determine whether a game will succeed. User and expert reviews, genres, platforms and game sales data have all been provided.

The process consists of six steps:
1. Open and study general information
2. Prepare the data
3. Analyze the data
4. Create user profiles for regions
5. Test hypotheses
6. Write a conclusion

### Data Description

Dataset: `games.csv`

`name:` Name of the video game

`platform:` Platform the video game is released on

`year_of_release:` Release year

`genre:` Genre of the video game

`na_sales:` North American sales

`eu_sales:` Europe sales

`jp_sales:` Japan sales

`other_sales:` Sales made elsewhere

`critic_score:` Score provided by game critic

`user_score:` Score made by game user

`rating:` Rating provided by the Entertainment Software Rating Board (ESRB)

## Step 1. General Information

### 1.1 Import appropriate libraries

In [1]:
import pandas as pd # for data manipulation
import numpy as np # for linear algebra between arrays
import plotly.express as px # for visualization
import plotly.graph_objects as go # for visualization
import scipy.stats as stats # statistical library

### 1.2 Import data

In [2]:
# Read in data
df = pd.read_csv('data/games.csv')

### 1.3 Study general information

In [3]:
# Show head
df.head(5)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,


In [4]:
# Show info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


In [5]:
# Show missing values
mis_values = df.isnull().sum().to_frame('missing_values')
mis_values['%'] = round(df.isnull().sum()*100/len(df),2)
mis_values.sort_values(by='%', ascending=False)

Unnamed: 0,missing_values,%
Critic_Score,8578,51.32
Rating,6766,40.48
User_Score,6701,40.09
Year_of_Release,269,1.61
Name,2,0.01
Genre,2,0.01
Platform,0,0.0
NA_sales,0,0.0
EU_sales,0,0.0
JP_sales,0,0.0
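
The same summary can be produced a little more compactly, since the mean of a boolean mask is the fraction of missing values. The following is a minimal sketch on a toy frame; the `toy` data and the `missing_summary` helper name are illustrative and not part of the project.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    mask = frame.isna()
    summary = pd.DataFrame({
        'missing_values': mask.sum(),
        '%': (mask.mean() * 100).round(2),
    })
    return summary.sort_values(by='%', ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B', None, 'D'],
    'critic_score': [80.0, np.nan, np.nan, 70.0],
    'na_sales': [1.0, 0.5, 0.2, 0.0],
})
summary = missing_summary(toy)
print(summary)
```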


In [6]:
# Show summary statistics
df.describe()

Unnamed: 0,Year_of_Release,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score
count,16446.0,16715.0,16715.0,16715.0,16715.0,8137.0
mean,2006.484616,0.263377,0.14506,0.077617,0.047342,68.967679
std,5.87705,0.813604,0.503339,0.308853,0.186731,13.938165
min,1980.0,0.0,0.0,0.0,0.0,13.0
25%,2003.0,0.0,0.0,0.0,0.0,60.0
50%,2007.0,0.08,0.02,0.0,0.01,71.0
75%,2010.0,0.24,0.11,0.04,0.03,79.0
max,2016.0,41.36,28.96,10.22,10.57,98.0


### 1.4 Initial Findings

#### Missing Values:

Exist in the following columns:
- name (0.01%)
- genre (0.01%)
- year_of_release (1.61%)
- critic_score (51.32%)
- user_score (40.09%)
- rating (40.48%)

#### Column types:

Changes will need to be made to:

- year_of_release (float to integer)
- critic_score (float to integer, if applicable)
- user_score (object to float)

## Step 2. Prepare the data

### 2.1 Replace the column names (snake_case).

In [7]:
# Make column names lowercase
df.columns = df.columns.str.lower()

### 2.2 Convert the data to the required types.

#### 2.2.1 Year of Release

Year of release should be an integer, as each game is released in a single calendar year.

In [8]:
# change year_of_release to int
df['year_of_release'] = df['year_of_release'].astype('Int64')

#### 2.2.2 User Score

User scores should be converted to a type that allows arithmetic. As scores are given to one decimal place, they will be converted from objects to floats.

However, not all entries in `user_score` can be converted to a float. Find the values that cannot be converted, then change them to null.

##### 2.2.2.1 Find unique values that cannot be converted to float

In [9]:
# Create empty list to store values
unique_string_values=[]

# Try to change objects to floats with loop, return where it fails
for i in range(len(df['user_score'])):
    try:
        df.iloc[i, df.columns.get_loc('user_score')] = float(df.iloc[i, df.columns.get_loc('user_score')])
    except ValueError:
        if df.iloc[i, df.columns.get_loc('user_score')] not in unique_string_values:
            unique_string_values.append(df.iloc[i, df.columns.get_loc('user_score')])
        

# print unique values
print(unique_string_values)

['tbd']
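
The same check can also be expressed without row-by-row assignment. As a sketch, assuming a toy column in place of `user_score`, `pd.to_numeric` with `errors='coerce'` turns anything unparseable into NaN, so the values that were present but failed to convert are the non-numeric strings.

```python
import pandas as pd

# Toy stand-in for the user_score column (illustrative values)
scores = pd.Series(['8.0', 'tbd', None, '7.5', 'tbd', '9'], dtype=object)

# Coerce unparseable entries to NaN instead of raising an error
converted = pd.to_numeric(scores, errors='coerce')

# Present in the original but NaN after conversion => non-numeric string
non_numeric = scores[scores.notna() & converted.isna()].unique().tolist()
print(non_numeric)
```

The `converted` series is already a float column with NaN in place of the unparseable entries, so the detection and the conversion happen in one step.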


##### 2.2.2.2 Change tbd values to null values

As TBD stands for "to be determined", these scores have not been given yet. Change them to null values and revisit them in the next section.

In [10]:
# Change tbd to NaN
df['user_score'] = df['user_score'].replace('tbd', np.nan)

# Convert user_score to float
df['user_score'] = df['user_score'].astype('float')

#### 2.2.3 Critic Score

From first inspection, it seems like critic scores are given as a whole number out of 100. Converting this column to an integer will make it easier to read. Ensure all values can be converted to integer before making the conversion. 

In [11]:
# Check critic score values to see if all int values
print(df['critic_score'].unique())

[76. nan 82. 80. 89. 58. 87. 91. 61. 97. 95. 77. 88. 83. 94. 93. 85. 86.
 98. 96. 90. 84. 73. 74. 78. 92. 71. 72. 68. 62. 49. 67. 81. 66. 56. 79.
 70. 59. 64. 75. 60. 63. 69. 50. 25. 42. 44. 55. 48. 57. 29. 47. 65. 54.
 20. 53. 37. 38. 33. 52. 30. 32. 43. 45. 51. 40. 46. 39. 34. 35. 41. 36.
 28. 31. 27. 26. 19. 23. 24. 21. 17. 22. 13.]


In [12]:
# Change critic_score to int
df['critic_score'] = df['critic_score'].astype('Int64')
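
Inspecting the unique values by eye works here, but the check can also be made explicit. The following is a small sketch on toy series; the helper name `is_whole_number_column` is hypothetical.

```python
import numpy as np
import pandas as pd

def is_whole_number_column(series: pd.Series) -> bool:
    """True if every non-missing value has no fractional part."""
    values = series.dropna()
    return bool((values % 1 == 0).all())

critic = pd.Series([76.0, np.nan, 82.0, 13.0])  # toy critic scores
user = pd.Series([8.0, 8.3, np.nan])            # toy user scores

print(is_whole_number_column(critic))  # whole numbers only
print(is_whole_number_column(user))    # contains decimals

# The nullable Int64 dtype keeps missing values as <NA> instead of failing
critic_int = critic.astype('Int64')
```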

### 2.3 Missing Values

#### 2.3.1 Name

In [13]:
# Find name missing values
df[df['name'].isna()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,,GEN,1993,,1.78,0.53,0.0,0.08,,,
14244,,GEN,1993,,0.0,0.0,0.03,0.0,,,


Both rows with a missing name were released in 1993 on the Genesis platform. Search for games released in 1993 on the Genesis platform to see if a potential duplicate was created.

In [14]:
# Find rows where platform is GEN and year_of_release is 1993
df[(df['platform'] == 'GEN') & (df['year_of_release'] == 1993)]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,,GEN,1993,,1.78,0.53,0.0,0.08,,,
7885,Shining Force II,GEN,1993,Strategy,0.0,0.0,0.19,0.0,,,
8893,Super Street Fighter II,GEN,1993,Fighting,0.0,0.0,0.15,0.0,,,
11986,Ecco: The Tides of Time,GEN,1993,Adventure,0.0,0.0,0.07,0.0,,,
12098,Street Fighter II': Special Champion Edition (...,GEN,1993,Action,0.0,0.0,0.07,0.0,,,
12264,Streets of Rage 3,GEN,1993,Action,0.0,0.0,0.07,0.0,,,
12984,Dynamite Headdy,GEN,1993,Platform,0.0,0.0,0.05,0.0,,,
13343,Beyond Oasis,GEN,1993,Role-Playing,0.0,0.0,0.05,0.0,,,
14244,,GEN,1993,,0.0,0.0,0.03,0.0,,,


As no other game titles with the same sales exist, no duplicate can be found. The rows with missing names will be dropped.

**Note:** This also eliminates all missing values within the genre column

In [15]:
# As no other games match the sales description, drop the two missing name rows
df= df.dropna(subset=['name'])

#### 2.3.2 Year of Release

In [16]:
# Print length of missing year_of_release values
print('Missing year_of_release total: ',len(df[df['year_of_release'].isna()]))

# Find missing year_of_release values
df[df['year_of_release'].isna()]

Missing year_of_release total:  269


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
183,Madden NFL 2004,PS2,,Sports,4.26,0.26,0.01,0.71,94,8.5,E
377,FIFA Soccer 2004,PS2,,Sports,0.59,2.36,0.04,0.51,84,6.4,E
456,LEGO Batman: The Videogame,Wii,,Action,1.80,0.97,0.00,0.29,74,7.9,E10+
475,wwe Smackdown vs. Raw 2006,PS2,,Fighting,1.57,1.02,0.00,0.41,,,
609,Space Invaders,2600,,Shooter,2.36,0.14,0.00,0.03,,,
...,...,...,...,...,...,...,...,...,...,...,...
16373,PDC World Championship Darts 2008,PSP,,Sports,0.01,0.00,0.00,0.00,43,,E10+
16405,Freaky Flyers,GC,,Racing,0.01,0.00,0.00,0.00,69,6.5,T
16448,Inversion,PC,,Shooter,0.01,0.00,0.00,0.00,59,6.7,M
16458,Hakuouki: Shinsengumi Kitan,PS3,,Adventure,0.01,0.00,0.00,0.00,,,


Games that are released on different platforms may give hints as to the year of release. Searching for Madden NFL 2004 confirms this:

In [17]:
# Find Madden NFL 2004 entries
df[df['name'] == 'Madden NFL 2004']

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
183,Madden NFL 2004,PS2,,Sports,4.26,0.26,0.01,0.71,94,8.5,E
1881,Madden NFL 2004,XB,2003.0,Sports,1.02,0.02,0.0,0.05,92,8.3,E
3889,Madden NFL 2004,GC,2003.0,Sports,0.4,0.1,0.0,0.01,94,7.7,E
5708,Madden NFL 2004,GBA,2003.0,Sports,0.22,0.08,0.0,0.01,70,6.6,E


It can be assumed that where several entries share the same game name, the year of release is the same, because games tend to be released on all platforms at around the same time. A function will be created that replaces a missing value for a game with the mode of that value among the entries with the same name.

However, some titles with the same name have been remade for newer consoles. To guard against this, only entries whose year of release falls within the range of years recorded for the target platform are considered. If none qualify, the value is left missing.



In [18]:
df[(df['name'] == df.iloc[1, df.columns.get_loc('name')])]['year_of_release']

# Save range of year_of_release values for each platform in a dictionary
platform_year_range = {}
for i in df['platform'].unique():
    platform_year_range[i] = [df[df['platform'] == i]['year_of_release'].min(), df[df['platform'] == i]['year_of_release'].max()]
    
# Create a function that fills missing values with the mode, when the mode is within the platform's year_of_release range
def fill_missing_with_mode(x): # x is the column name
    col = df.columns.get_loc(x)
    name_col = df.columns.get_loc('name')
    platform_col = df.columns.get_loc('platform')
    for i in range(len(df)): # Loop through the rows
        if pd.isnull(df.iloc[i, col]): # When the value is missing
            first_year, last_year = platform_year_range[df.iloc[i, platform_col]]
            # Gather rows with a matching name whose year is within the platform's year range
            candidates = df[(df['name'] == df.iloc[i, name_col])
                            & (df['year_of_release'] >= first_year)
                            & (df['year_of_release'] <= last_year)][x]
            mode = candidates.mode()
            # Replace the missing value with the mode when one exists
            if not mode.empty:
                df.iloc[i, col] = mode[0]
    

In [19]:
# Apply fill_missing_with_mode function to year_of_release
fill_missing_with_mode('year_of_release')

In [20]:
# Print length of missing year_of_release values
print('Missing year_of_release total: ',len(df[df['year_of_release'].isna()]))

Missing year_of_release total:  151
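
For comparison, the core idea of borrowing the most common year from other rows with the same name can be sketched with a `groupby`. This simplified version deliberately omits the platform year-range check used above, so it is not a drop-in replacement. The toy data and the `first_mode` helper are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Madden NFL 2004', 'Madden NFL 2004', 'Madden NFL 2004', 'Solo Game'],
    'platform': ['PS2', 'XB', 'GC', 'PC'],
    'year_of_release': pd.array([pd.NA, 2003, 2003, pd.NA], dtype='Int64'),
})

def first_mode(values: pd.Series):
    """Most frequent non-missing value, or <NA> if there is none."""
    modes = values.mode()  # mode() ignores missing values by default
    return modes.iloc[0] if not modes.empty else pd.NA

# Most common year per title, mapped back onto each row by name
year_by_name = toy.groupby('name')['year_of_release'].agg(first_mode)
candidate = toy['name'].map(year_by_name).astype('Int64')

# Only rows where the year is missing are filled
toy['year_of_release'] = toy['year_of_release'].fillna(candidate)
print(toy)
```

A title that appears only once, or whose other entries are also missing a year, stays missing, which matches the behaviour of the loop-based function.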


Entries with missing years still exist. However, their number can be reduced further by looking at titles with a year in their name. Such games, mostly annual sports titles, are usually released in the year before the one in their title. A quick search confirms this for WWE Smackdown vs. Raw 2006.

Find these titles and fill in the appropriate year of release.

In [21]:
# Save year_titles when year_of_release is missing and contains a year in the name
year_titles = df[(df['year_of_release'].isna()) & (df['name'].str.contains(r'\d{4}'))]

# Show year_titles
year_titles


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
475,wwe Smackdown vs. Raw 2006,PS2,,Fighting,1.57,1.02,0.0,0.41,,,
4775,NFL GameDay 2003,PS2,,Sports,0.2,0.15,0.0,0.05,60.0,,E
5655,All-Star Baseball 2005,PS2,,Sports,0.16,0.12,0.0,0.04,72.0,8.6,E
8918,All-Star Baseball 2005,XB,,Sports,0.11,0.03,0.0,0.01,75.0,8.8,E
13195,Tour de France 2011,X360,,Racing,0.0,0.04,0.0,0.01,46.0,7.6,
13929,Sega Rally 2006,PS2,,Racing,0.0,0.0,0.04,0.0,,,
16079,Football Manager 2007,X360,,Sports,0.0,0.01,0.0,0.0,,,


In [22]:
# Extract each year from 'name' column and fill missing 'year_of_release' values
extracted_years = df['name'].str.extract(r'(\d{4})', expand=False)

# Subtract each non-null value by 1
if extracted_years.notnull().any():
    extracted_years[extracted_years.notnull()] = extracted_years[extracted_years.notnull()].astype(int) - 1

# Fill missing values with extracted years
df['year_of_release'] = df['year_of_release'].fillna(extracted_years)

In [23]:
# Print length of missing year_of_release values
print('New missing year_of_release total:',len(df[df['year_of_release'].isna()]))

New missing year_of_release total: 144
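
One caveat with the extraction step is that `\d{4}` matches any four-digit number anywhere in a title, not only a season year. A more defensive sketch follows, assuming that season years appear at the end of the title and fall in a plausible range. Both are assumptions, as are the toy data and the bounds.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['wwe Smackdown vs. Raw 2006', 'Anno 2070', 'Space Invaders', 'FIFA Soccer 2004'],
    'year_of_release': pd.array([pd.NA, pd.NA, pd.NA, 2003], dtype='Int64'),
})

# Capture a trailing four-digit number; titles without one become NaN
title_year = pd.to_numeric(
    toy['name'].str.extract(r'(\d{4})\s*$', expand=False), errors='coerce'
)

# Keep only plausible season years (assumed bounds), then shift back one year
plausible = title_year.between(1980, 2017)
release_guess = (title_year.where(plausible) - 1).astype('Int64')

# Fill only rows whose year is still missing
toy['year_of_release'] = toy['year_of_release'].fillna(release_guess)
print(toy)
```

Here "Anno 2070" is left missing because 2070 falls outside the assumed bounds, and the existing 2003 value is not overwritten.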


Looking at the yearly data, some release years do not make sense. For instance, the Nintendo DS came out in 2004, so no DS game should have a release year before 2004 (see the boxplots below). A dictionary of platform launch years will be created, and titles recorded as released before their platform launched will have their year of release set to null.

In [24]:
# Create rainbow color sequence
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, len(df['platform'].unique()))]

# Create plotly box plots using go
fig = go.Figure()

# Iterate over unique platforms and create a box plot for each
for i, platform in enumerate(df['platform'].unique()):
    platform_data = df[df['platform'] == platform]['year_of_release']
    fig.add_trace(go.Box(y=platform_data, name=platform, marker_color=c[i]))

# Update figure layout
fig.update_layout(
    title='Year of Release by Platform',
    xaxis_title='Platform',
    yaxis_title='Year of Release',
    xaxis=dict(tickmode='linear', tick0=0, dtick=1),
    showlegend=False
)

# Hide legend
fig.update_layout(showlegend=False)

# Show figure
fig.show()

In [25]:
# Set platform release dates (gathered from Wikipedia)
platform_release_dates = {'2600': 1977,
 '3DO': 1993,
 '3DS': 2011,
 'DC': 1998,
 'DS': 2004,
 'GB': 1988,
 'GBA': 2000,
 'GC': 2001,
 'GEN': 1988,
 'GG': 1990,
 'N64': 1996,
 'NES': 1983,
 'NG': 1993,
 'PC': 1985,
 'PCFX': 1996,
 'PS': 1994,
 'PS2': 2000,
 'PS3': 2006,
 'PS4': 2013,
 'PSP': 2004,
 'PSV': 2011,
 'SAT': 1994,
 'SCD': 1991,
 'SNES': 1990,
 'TG16': 1989,
 'WS': 1999,
 'Wii': 2006,
 'WiiU': 2012,
 'X360': 2005,
 'XB': 2000,
 'XOne': 2013}


Set dates to null when year_of_release is less than platform release date. Do this by creating a mask to identify rows of interest and set year of release to null when true. 

In [26]:

# Create a mask to identify rows where 'year_of_release' is earlier than the platform release date
# (rows with a missing year compare as <NA> and are treated as False)
launch_year = df['platform'].map(platform_release_dates)
mask = (df['year_of_release'].astype('Int64') < launch_year).fillna(False)

# Set 'year_of_release' values to NaN where the mask is True
df.loc[mask, 'year_of_release'] = np.nan

In [27]:

# Print length of missing year_of_release values
print('New missing year_of_release total:',len(df[df['year_of_release'].isna()]))


New missing year_of_release total: 145


In [28]:
# Change year_of_release to integer form
df['year_of_release'] = df['year_of_release'].astype('Int64')

**Conclusion:** Unfortunately, some release years are still missing even though every game must have one. However, these rows will be kept, as they still provide useful ratings and sales data.

#### 2.3.3 Critic Score


In [29]:
# Print length of missing critic_score values
print('Missing critic_score total: ',len(df[df['critic_score'].isna()]))

# Show missing critic_score values
df[df['critic_score'].isna()]

Missing critic_score total:  8576


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,,,
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.00,,,
5,Tetris,GB,1989,Puzzle,23.20,2.26,4.22,0.58,,,
9,Duck Hunt,NES,1984,Shooter,26.93,0.63,0.28,0.47,,,
10,Nintendogs,DS,2005,Simulation,9.05,10.95,1.93,2.74,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003,Platform,0.01,0.00,0.00,0.00,,,


In [30]:
# Similarly to year_of_release, fill missing critic_score values with the mode of critic_score for each game when appropriate
fill_missing_with_mode('critic_score')

# Print new length of missing critic_score values
print('New missing critic_score total: ',len(df[df['critic_score'].isna()]))

New missing critic_score total:  7711


**Conclusion:** Whilst many values are still missing, the critic score data is not pivotal to the analysis and will be kept. It is quite possible that no critic data was given for these titles.

#### 2.3.4 User Score

In [31]:
# Print length of missing user_score values
print('Missing user_score total: ',len(df[df['user_score'].isna()]))

# Find missing user_score values
df[df['user_score'].isna()]

Missing user_score total:  9123


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,,,
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.00,,,
5,Tetris,GB,1989,Puzzle,23.20,2.26,4.22,0.58,,,
9,Duck Hunt,NES,1984,Shooter,26.93,0.63,0.28,0.47,,,
10,Nintendogs,DS,2005,Simulation,9.05,10.95,1.93,2.74,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003,Platform,0.01,0.00,0.00,0.00,,,


In [32]:
# Similarly to year_of_release, fill missing user_score values with the mode of user_score for each game
fill_missing_with_mode('user_score')

In [33]:
# Print new length of missing user_score values
print('New missing user_score total: ',len(df[df['user_score'].isna()]))

New missing user_score total:  8120


**Conclusion:** Similar to the critic score data, it is quite possible that no user data was given. This information is still important and these rows will be kept.

#### 2.3.5 Rating

In [34]:
# Print length of missing rating values
print('Missing rating total: ',len(df[df['rating'].isna()]))

# Find missing rating values
df[df['rating'].isna()]

Missing rating total:  6764


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,,,
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.00,,,
5,Tetris,GB,1989,Puzzle,23.20,2.26,4.22,0.58,,,
9,Duck Hunt,NES,1984,Shooter,26.93,0.63,0.28,0.47,,,
10,Nintendogs,DS,2005,Simulation,9.05,10.95,1.93,2.74,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003,Platform,0.01,0.00,0.00,0.00,,,


In [35]:
# Similarly to year_of_release, fill missing rating values with the mode of rating for each game
fill_missing_with_mode('rating')

In [36]:
# Print new length of missing rating values
print('New missing rating total: ',len(df[df['rating'].isna()]))

New missing rating total:  6424


**Conclusion:** Although ratings are not available for all games, the data is still useful for analysis, so the missing values will be left as they are. Some games may not have been rated because they fall outside the ESRB's jurisdiction, for example games made, published and sold outside North America.

### 2.4 Create Total Sales Column

Sum the columns with North America, Europe, Japan and Other sales data.

In [37]:
# Sum sales columns
df['total_sales']=df[['na_sales','eu_sales','jp_sales','other_sales']].sum(axis=1)

## Step 3. Analyze the data

### 3.1 Number of Games Released Each Year

In [38]:
# Create plotly histogram of games released per year, make the graph hot pink
fig = px.histogram(df.dropna(subset=['year_of_release']), 
                   x='year_of_release', 
                   color_discrete_sequence=['hotpink'])

# Update layout: center the title and set the axis titles
fig.update_layout(title={'text': 'Number of Games Released Each Year',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Year',
                    yaxis_title='Number of Games')

# Show figure
fig.show()

Very few games were released before 1994. The number of games released peaked in 2008 at 1437. In 2011, a dramatic drop in game releases occurred. From 2013 to 2016, around 500 games were released each year.
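
The figures quoted above can be read off the data directly rather than from the chart. The following is a sketch with toy years; the values are illustrative and are not the project's counts.

```python
import pandas as pd

# Toy release years; <NA> entries are ignored by value_counts
years = pd.Series([2007, 2008, 2008, 2008, 2009, 2009, pd.NA], dtype='Int64')

# Number of releases per year, in chronological order
releases_per_year = years.value_counts().sort_index()

peak_year = releases_per_year.idxmax()
peak_count = releases_per_year.max()
print(f'Peak year: {peak_year} with {peak_count} releases')
```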

### 3.2 Top 5 Best Selling Platforms of All Time

First find the top-selling platforms overall. Then break sales down by platform and year, filter for the top 5 platforms, and show their yearly sales as a line graph.

In [39]:
# Group platforms by total sales
platform_sales=df.groupby('platform')['total_sales'].sum().sort_values(ascending=False).reset_index()

platform_sales.head(5)

Unnamed: 0,platform,total_sales
0,PS2,1255.77
1,X360,971.42
2,PS3,939.65
3,Wii,907.51
4,DS,806.12


In [40]:
# Group df by platform and year_of_release and sum total_sales
platform_year_sales = df.groupby(['platform', 'year_of_release'])['total_sales'].sum().sort_values(ascending=False).reset_index()

# Filter platform_year_sales for the top 5 platforms
top_platform_year_sales = platform_year_sales[platform_year_sales['platform'].isin(platform_sales['platform'].head(5))]

# Group top_platform_year_sales by platform and year_of_release and sum total_sales
top_platform_year_sales = top_platform_year_sales.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()

In [41]:
# Create plotly linegraph of top_platform_year_sales
fig = px.line(top_platform_year_sales, 
              x='year_of_release', 
              y='total_sales', 
              color='platform', 
              color_discrete_sequence=['hotpink', 'blue', 'green', 'purple', 'orange'],
              hover_data=['platform', 'total_sales'],
              title = 'Top 5 Selling Platforms Over Time')

# Update layout: center the title and set the axis titles
fig.update_layout(title={'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Year',
                    yaxis_title='Total Sales')

# Show figure
fig.show()

**Conclusion:** The lifespan of a top platform is around 9 to 11 years. New platforms can rise quickly, peaking within 3 to 5 years. All seem to fade slowly over their remaining lifespan, except for the Wii, which dropped quickly after 2009.

Only a small subset of the top selling platforms are still making sales.

Based on this, only data from 2012 onwards will be used to assess the market for 2017. Earlier data is not relevant to that assessment.
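
The lifespan estimate can be made concrete by measuring, for each platform, the span between its first and last year with recorded sales. The following is a sketch on toy data; the platform names and numbers are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'platform': ['A', 'A', 'A', 'B', 'B'],
    'year_of_release': [2000, 2005, 2010, 2006, 2009],
    'total_sales': [1.0, 5.0, 0.5, 2.0, 3.0],
})

# First and last year with recorded sales per platform
span = toy.groupby('platform')['year_of_release'].agg(first_year='min', last_year='max')

# Inclusive lifespan in years
span['lifespan'] = span['last_year'] - span['first_year'] + 1

# Year in which each platform's sales peaked
yearly = toy.groupby(['platform', 'year_of_release'])['total_sales'].sum()
span['peak_year'] = yearly.groupby(level='platform').idxmax().map(lambda key: key[1])
print(span)
```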

### 3.3 Current Platform Leaders

Filter so that only data from 2012 onwards is selected. Group by platform and find total sales. Use a line graph to find platforms that have shown recent success and potential.

In [42]:
# Cut the data to only include data from 2012 and later
df = df[df['year_of_release'] >= 2012]

# Group by platform and sum total_sales
platform_sales = df.groupby(['platform'])['total_sales'].sum()

In [43]:
# Show platform_sales sorted in descending order
platform_sales.sort_values(ascending=False)

platform
PS4     314.14
PS3     289.71
X360    237.52
3DS     195.01
XOne    159.32
WiiU     82.19
PC       63.51
PSV      49.18
Wii      36.60
DS       13.21
PSP      11.69
Name: total_sales, dtype: float64

As the PS3, X360, Wii, DS and PSP are all at the end of their life cycles, they will not be included in the analysis.

In [44]:
# Drop the PS3, X360, Wii, DS, and PSP Platforms
df = df[~df['platform'].isin(['PS3', 'X360', 'Wii', 'DS', 'PSP'])]

# Group by platform and year_of_release and sum total_sales
platform_year_sales = df.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()

In [45]:
# Create plotly linegraph of platform_years_sales
fig = px.line(platform_year_sales, 
              x='year_of_release', 
              y='total_sales', 
              color='platform', 
              color_discrete_sequence=['hotpink', 'blue', 'green', 'purple', 'orange', 'red', 'brown', 'pink', 'yellow', 'black'],
              hover_data=['platform', 'total_sales'],
              title='Total Sales by Top Platforms in the last 5 Years')

# Update layout: center the title, set the axis titles, and treat the x-axis as categorical
fig.update_layout(title={'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Year',
                    yaxis_title='Total Sales ($Millions)',
                    xaxis={'type': 'category'})

# Show figure
fig.show()

**Analysis:** All platforms appear to be on a downward trend. The PS4 and the XOne are the only platforms to have had major success in the last 3 years, but both are still heading downwards.

### 3.4 Game Sales on High Performing Platforms

Compare sales of individual games within their platforms. This will be done by creating boxplots and looking at summary statistics.

In [46]:
# Groupby by platform and find total_sales
platform_sales = df.groupby(['platform'])['total_sales'].sum().sort_values(ascending=False).reset_index()

# Create rainbow colors for boxplot
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, len(platform_sales['platform']))]

# Create box plot of total_sales by platform
fig = go.Figure()

# Add trace for each platform
for i in range(len(platform_sales['platform'])):
    fig.add_trace(go.Box(y=df[df['platform'] == platform_sales.iloc[i, platform_sales.columns.get_loc('platform')]]['total_sales'],
                         name=platform_sales.iloc[i, platform_sales.columns.get_loc('platform')],
                         marker_color=c[i]))
    
# Update layout: center the title and set the axis titles
fig.update_layout(title_text='Boxplots of Sales by Platform',
                  title={'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                    xaxis_title='Platform',
                    yaxis_title='Total Sales ($Millions)')

# Show figure
fig.show()

In [47]:
# Create summary statistics table of total_sales by platform
df.groupby(['platform'])['total_sales'].describe()


Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
platform,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3DS,397.0,0.491209,1.385416,0.01,0.04,0.11,0.32,14.6
PC,255.0,0.249059,0.490149,0.01,0.03,0.08,0.24,5.14
PS4,392.0,0.801378,1.609456,0.01,0.06,0.2,0.73,14.63
PSV,411.0,0.119659,0.203011,0.01,0.02,0.05,0.12,1.96
WiiU,147.0,0.559116,1.058836,0.01,0.08,0.22,0.525,7.09
XOne,247.0,0.64502,1.036139,0.01,0.06,0.22,0.685,7.39


**Analysis:** The differences in sales between platforms are substantial. The Nintendo 3DS and the PS4 hold the top 11 best-selling games. However, the 3DS has a much lower median, at 110,000 per game, than the PS4, XOne and WiiU, which are all at 200,000 or above. The PS4 leads on mean sales at about 800,000 per game, followed by the XOne at 645,000. PC is weaker across the board. The PSV has the most titles but the lowest sales per game of any platform. Overall, the PS4 is the clear leader.
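
Claims about which platforms hold the best-selling titles can be checked with `nlargest`. The following is a sketch on toy data; the titles and figures are invented, and `n=3` stands in for the top 11 used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['G1', 'G2', 'G3', 'G4', 'G5'],
    'platform': ['PS4', '3DS', 'PS4', 'PC', 'XOne'],
    'total_sales': [14.6, 14.0, 12.0, 5.1, 7.4],
})

# Top-n games by total sales, then count how many each platform holds
top_n = toy.nlargest(3, 'total_sales')
platform_share = top_n['platform'].value_counts()
print(platform_share)

# Median and mean sales per game by platform, for comparison
per_game = toy.groupby('platform')['total_sales'].agg(['median', 'mean'])
print(per_game)
```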

### 3.5 Reviews and Sales on the PS4

Build a scatterplot of critic and user scores vs total sales on the PS4. Determine correlation to see how ratings may influence sales.

In [48]:
# Create scaled_user_ratings column
df['scaled_user_ratings'] = df['user_score'] * 10

# Create PS4 dataframe
ps4 = df[df['platform'] == 'PS4']

# Drop rows with a missing critic score, user score, or total sales
ps4 = ps4.dropna(subset=['critic_score', 'scaled_user_ratings', 'total_sales'])

# Convert critic_score from nullable Int64 to float for plotting
ps4['critic_score'] = ps4['critic_score'].astype(float)

In [49]:
# Create scatterplot
fig = px.scatter(ps4, 
                 x=['critic_score', 'scaled_user_ratings'],
                 y='total_sales', 
                 title='Critic and User Scores vs Total Sales on the PS4',
                 color_discrete_sequence=['hotpink','darkblue'],
                 hover_data=['name'],
                 trendline="ols")

# Update layout: center the title and set the axis and legend titles
fig.update_layout(title={'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Score (out of 100)',
                    yaxis_title='Total Sales ($Millions)',
                        legend_title_text='Score Type'
                        )

# Show figure
fig.show()

In [50]:
# Calculate correlation coefficients
print('Correlation Coefficient (critic_score - total_sales):', round(ps4['critic_score'].corr(ps4['total_sales']),3))
print('Correlation Coefficient (scaled_user_ratings - total_sales):', round(ps4['scaled_user_ratings'].corr(ps4['total_sales']),3))


Correlation Coefficient (critic_score - total_sales): 0.405
Correlation Coefficient (scaled_user_ratings - total_sales): -0.013


**Analysis:** The correlation coefficient between critic score and total sales on the PS4 is 0.405, which indicates a weak-to-moderate positive correlation. The coefficient between user score and total sales is -0.013, which is effectively no linear relationship. Critic score is therefore a better indicator of sales than user score on the PS4.
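
For reference, the coefficient that `Series.corr` reports by default is Pearson's r: the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on toy numbers and compares it with the pandas result. The data are illustrative only.

```python
import numpy as np
import pandas as pd

score = pd.Series([60.0, 70.0, 80.0, 90.0, 95.0])  # toy critic scores
sales = pd.Series([0.2, 0.1, 0.9, 1.5, 3.0])       # toy sales

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual = pearson_r(score, sales)
builtin = score.corr(sales)
print(round(manual, 3), round(builtin, 3))
```

As a rule of thumb, squaring r gives the share of variance explained by a straight-line fit, so an r of 0.405 corresponds to roughly 16%.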

### 3.6 Sales of Same Titles on Other Platforms

How do the sales of the same games fare on different platforms?

In [51]:
# Find games with same names across platforms
same_name = df[df['name'].isin(df[df['platform'] == 'PS4']['name'])]

# Group same_name by name and platform and sum total_sales
same_name = same_name.groupby(['name', 'platform'])['total_sales'].sum().reset_index()

# Sort same_name by total_sales in descending order
same_name = same_name.sort_values(by='total_sales', ascending=False)

# Create a mask to filter same_name for the top 10 games
mask = same_name['name'].isin(same_name['name'].head(10))

# Filter same_name for the top 10 games
same_name = same_name[mask]

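# Create rainbow color sequence with one distinct hue per platform
# (endpoint=False avoids 0 and 360 degrees, which are the same hue)
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, same_name['platform'].nunique(), endpoint=False)]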

# Show these games on a bar chart 
fig = px.bar(same_name,
                x='name',
                y='total_sales',
                color='platform',
                color_discrete_sequence=c,
                title='Total Sales of Games with the Same Name on Different Platforms',
                hover_data=['name', 'platform'])

# Update figure: center the title and label the axes and legend
fig.update_layout(title={'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Game Name',
                    yaxis_title='Total Sales ($Millions)',
                        legend_title_text='Platform'
                        )


# Show figure
fig.show()

**Analysis:** PS4 still dominates in sales of the same games when compared to other platforms. XOne comes next, followed by PC. 
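
One way to check this comparison title by title is to reshape the data so that each platform becomes a column. The sketch below uses a small hypothetical frame (`toy`) with the same column names as the notebook; the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical sales rows (title, platform, sales in millions); not taken from games.csv
toy = pd.DataFrame({
    'name': ['Game A', 'Game A', 'Game A', 'Game B', 'Game B', 'Game C'],
    'platform': ['PS4', 'XOne', 'PC', 'PS4', 'XOne', 'PS4'],
    'total_sales': [7.0, 4.0, 1.0, 3.0, 3.5, 2.0],
})

# One row per title, one column per platform; a title absent from a platform gets NaN
wide = toy.pivot_table(index='name', columns='platform', values='total_sales', aggfunc='sum')

# Keep only titles released on more than one platform
multi = wide[wide.notna().sum(axis=1) > 1]

# For each multi-platform title, find the platform with the highest sales
best_platform = multi.idxmax(axis=1)
print(best_platform.value_counts())
```

Counting how often each platform comes out on top gives a per-title view that complements the stacked bar chart.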

### 3.7 Distribution of Sales by Genre
Take a look at the general distribution of games by genre. What can we say about the most profitable genres? Can you generalize about genres with high and low sales?

In [52]:
# Group by Genre
genres = df.groupby(['genre'])['total_sales'].sum().sort_values(ascending=False).reset_index()

# Create rainbow color sequence
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, len(genres['genre']))]

# Create bar chart of genres with go.figure
fig = go.Figure(data=[go.Bar(
    x=genres['genre'],
    y=genres['total_sales'],
    marker_color=c
)])

# Update figure: set and center the title, label the axes
fig.update_layout(title_text='Total Sales by Genre',
                  title={'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                    xaxis_title='Genre',
                    yaxis_title='Total Sales ($Millions)')

# Show figure
fig.show()


In [53]:
# Group by Genre
genres = df.groupby(['genre'])['total_sales'].sum().sort_values(ascending=False).reset_index()

# Create rainbow colors for bar chart
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, len(genres['genre']))]

# Create box plot of total_sales by genre
fig = go.Figure()

# Add trace for each genre
for i in range(len(genres['genre'])):
    fig.add_trace(go.Box(x=df[df['genre'] == genres.iloc[i, genres.columns.get_loc('genre')]]['total_sales'],
                         name=genres.iloc[i, genres.columns.get_loc('genre')],
                         marker_color=c[i]))
    
# Show mean in box plot
fig.update_traces(boxmean=True)

fig.update_layout(title_text='Boxplots of Sales by Genre',
                  title={'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                    xaxis_title='Total Sales ($Millions)',
                    yaxis_title='Genre',
                    height=800,
                    width=1200)

# Find mean of each genre
genre_means = df.groupby(['genre'])['total_sales'].mean().sort_values(ascending=False).reset_index().round(3)
# Show figure
fig.show()

In [54]:
df.groupby(['genre'])['total_sales'].agg(['count', 'mean', 'median', 'std', 'min', 'max']).sort_values('mean', ascending=False)

genre,count,mean,median,std,min,max
Shooter,137,1.25292,0.45,2.004055,0.01,14.63
Platform,64,0.807188,0.225,1.535765,0.01,9.9
Sports,142,0.677535,0.22,1.254101,0.01,8.58
Role-Playing,262,0.56271,0.15,1.453253,0.01,14.6
Simulation,57,0.549474,0.11,1.46098,0.01,9.17
Fighting,56,0.502857,0.12,1.22827,0.02,7.55
Racing,72,0.465556,0.11,0.983376,0.01,7.09
Action,665,0.343895,0.11,0.813269,0.01,12.62
Misc,129,0.330078,0.12,0.65489,0.01,4.42
Puzzle,22,0.185909,0.055,0.321598,0.01,1.19


**Analysis:** By total sales, the most profitable genres are Action, Shooter, and Role-Playing. The least profitable genres are Adventure, Strategy, and Puzzle.

The ranking changes when looking at per-game averages. Shooter games still lead, but platform games come second with a mean of 0.807 and a median of 0.225 (in millions). Action games drop down the ranking, sitting above only the miscellaneous, puzzle, strategy and adventure genres.

However, the game with the third highest revenue is an action game. The other top five games include a platform, a shooter and two role-playing games.
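
As a quick cross-check of the two rankings, totals can be recovered from the summary table as `count * mean`. The sketch below copies those two columns for the ten genres shown in the table (genres not shown there are omitted) and compares the ordering by total with the ordering by per-title mean. The dictionary name `genre_stats` is only illustrative.

```python
# (count, mean sales per title) copied from the summary table above
genre_stats = {
    'Shooter': (137, 1.25292),
    'Platform': (64, 0.807188),
    'Sports': (142, 0.677535),
    'Role-Playing': (262, 0.56271),
    'Simulation': (57, 0.549474),
    'Fighting': (56, 0.502857),
    'Racing': (72, 0.465556),
    'Action': (665, 0.343895),
    'Misc': (129, 0.330078),
    'Puzzle': (22, 0.185909),
}

# Total sales per genre = number of titles * mean sales per title
totals = {genre: count * mean for genre, (count, mean) in genre_stats.items()}

by_total = sorted(totals, key=totals.get, reverse=True)
by_mean = sorted(genre_stats, key=lambda g: genre_stats[g][1], reverse=True)

print(by_total[:3])  # ['Action', 'Shooter', 'Role-Playing']
print(by_mean[:3])   # ['Shooter', 'Platform', 'Sports']
```

Action leads on totals largely because it has by far the most titles (665), not because a typical action game sells more.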

## 4. Create a User Profile for Each Region

### 4.1 Top Five Platforms per Region

In [55]:
# Sum each region's sales by platform, sorted from highest to lowest
na_sales_platform = df.groupby(['platform'])['na_sales'].sum().sort_values(ascending=False).reset_index()
eu_sales_platform = df.groupby(['platform'])['eu_sales'].sum().sort_values(ascending=False).reset_index()
jp_sales_platform = df.groupby(['platform'])['jp_sales'].sum().sort_values(ascending=False).reset_index()

# Add a column to each dataframe giving each platform's percentage of the region's total sales
na_sales_platform['percentage'] = round(na_sales_platform['na_sales'] / na_sales_platform['na_sales'].sum() * 100, 2)
eu_sales_platform['percentage'] = round(eu_sales_platform['eu_sales'] / eu_sales_platform['eu_sales'].sum() * 100, 2)
jp_sales_platform['percentage'] = round(jp_sales_platform['jp_sales'] / jp_sales_platform['jp_sales'].sum() * 100, 2)

In [56]:
# Create color
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 180, 3)]

# Create bar chart for North America
fig = go.Figure(go.Bar(x=na_sales_platform['platform'],
                     y=na_sales_platform['percentage'],
                     name='North America',
                     marker_color=c[0]))

# Add bar chart for Europe and Japan
fig.add_trace(go.Bar(x=eu_sales_platform['platform'],
                        y=eu_sales_platform['percentage'],
                        name='Europe',
                        marker_color=c[1]))

fig.add_trace(go.Bar(x=jp_sales_platform['platform'],
                        y=jp_sales_platform['percentage'],
                        name='Japan',
                        marker_color=c[2]))

# Update figure: set and center the title, label the axes
fig.update_layout(title={'text': 'Percentage of Total Sales by Platform in Each Region',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Platform',
                    yaxis_title='Percentage of Total Sales')

# Show figure
fig.show()

**Analysis:** PS4 and XOne are the top two platforms in North America and Europe, together taking over 60% of the market. PC gamers also have a place in these markets, albeit a smaller one.

In Japan, the 3DS holds 63.55% of the market, more than the combined share of the top two platforms in North America and Europe. With the PSV in second place, Japanese gamers may prefer more immersive experiences such as the 3D features of the 3DS. Note that the PSV (PlayStation Vita) is a handheld console rather than a virtual-reality device, so the ranking may equally reflect a preference for portable play. The XOne is almost non-existent in Japan.
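
Since this section is about the top five platforms, a small helper can restrict each region to its largest groups. This is a sketch only: `top_share` is a hypothetical function name, and the `toy` frame holds invented numbers rather than rows from `games.csv`.

```python
import pandas as pd

def top_share(data, group_col, sales_col, n=5):
    """Return the n largest groups by summed sales, as a percentage of the regional total."""
    totals = data.groupby(group_col)[sales_col].sum().sort_values(ascending=False)
    share = (totals / totals.sum() * 100).round(2)
    return share.head(n)

# Hypothetical regional sales in millions
toy = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'XOne', '3DS', 'PC', '3DS'],
    'na_sales': [3.0, 1.0, 3.0, 0.5, 0.5, 0.0],
    'jp_sales': [0.5, 0.5, 0.0, 2.0, 0.0, 1.0],
})

na_top = top_share(toy, 'platform', 'na_sales', n=2)
jp_top = top_share(toy, 'platform', 'jp_sales', n=2)
print(na_top)
print(jp_top)
```

The same helper would work for genres or ratings by changing `group_col`, which avoids repeating the group, sort and percentage steps for every region.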

### 4.2 Top Five Genres per Region

For each region (NA, EU, JP), determine the top five genres. Section 4.3 then asks whether ESRB ratings affect sales in individual regions.

In [57]:
# group each region by genre
na_sales_genre = df.groupby(['genre'])['na_sales'].sum().sort_values(ascending=False).reset_index()
eu_sales_genre = df.groupby(['genre'])['eu_sales'].sum().sort_values(ascending=False).reset_index()
jp_sales_genre = df.groupby(['genre'])['jp_sales'].sum().sort_values(ascending=False).reset_index()

# Add a column to each dataframe giving each genre's percentage of the region's total sales
na_sales_genre['percentage'] = round(na_sales_genre['na_sales'] / na_sales_genre['na_sales'].sum() * 100, 2)
eu_sales_genre['percentage'] = round(eu_sales_genre['eu_sales'] / eu_sales_genre['eu_sales'].sum() * 100, 2)
jp_sales_genre['percentage'] = round(jp_sales_genre['jp_sales'] / jp_sales_genre['jp_sales'].sum() * 100, 2)

In [60]:
# Add trace for each region
fig = go.Figure(go.Bar(x=na_sales_genre['genre'],
                     y=na_sales_genre['percentage'],
                     name='North America',
                     marker_color=c[0]))

fig.add_trace(go.Bar(x=eu_sales_genre['genre'],
                        y=eu_sales_genre['percentage'],
                        name='Europe',
                        marker_color=c[1]))

fig.add_trace(go.Bar(x=jp_sales_genre['genre'],
                        y=jp_sales_genre['percentage'],
                        name='Japan',
                        marker_color=c[2]))

# Update figure: set and center the title, label the axes
fig.update_layout(title={'text': 'Percentage of Total Sales by Genre in Each Region',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Genre',
                    yaxis_title='Percentage of Total Sales')

# Show figure
fig.show()

**Analysis:** North America and Europe share the same top five genres, with similar market shares: action, shooter, role-playing, sports and platform games. The first two account for almost 50% of the market.

Japan differs significantly: role-playing games take up a much larger market share than the top genre in the other regions. Action games follow, with a similar market share across all regions. Miscellaneous, simulation and platform games each take less than 8%.

This is consistent with the previous section's suggestion that Japanese gamers prefer more immersive experiences, while gamers in North America and Europe prefer action and shooter games.

### 4.3 ESRB Ratings

The ESRB rating system designates a recommended age for gamers based on elements such as blood, violence, gore, sexual content and language. The categories for these ratings are:

`E` - Everyone

`E10+` - Everyone, ages 10 and over

`T` - Teen, ages 13 and over

`M` - Mature, ages 17 and over

In [62]:
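# Fill null values in the 'rating' column with the placeholder 'Not Evaluated'
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
df['rating'] = df['rating'].fillna('Not Evaluated')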

# Sum each region's sales by ESRB rating; unrated games appear under 'Not Evaluated'
na_sales_rating = df.groupby(['rating'])['na_sales'].sum().sort_values(ascending=False).reset_index()
eu_sales_rating = df.groupby(['rating'])['eu_sales'].sum().sort_values(ascending=False).reset_index()
jp_sales_rating = df.groupby(['rating'])['jp_sales'].sum().sort_values(ascending=False).reset_index()


# Add a column to each dataframe giving each rating's percentage of the region's total sales ('Not Evaluated' included)
na_sales_rating['percentage'] = round(na_sales_rating['na_sales'] / na_sales_rating['na_sales'].sum() * 100, 2)
eu_sales_rating['percentage'] = round(eu_sales_rating['eu_sales'] / eu_sales_rating['eu_sales'].sum() * 100, 2)
jp_sales_rating['percentage'] = round(jp_sales_rating['jp_sales'] / jp_sales_rating['jp_sales'].sum() * 100, 2)

In [63]:
# Create x-axis order
rating_order = [ 'M', 'T', 'E10+','E', 'Not Evaluated']

# Create plotly bar chart for North American data
fig = go.Figure(go.Bar(x=na_sales_rating['rating'],
                     y=na_sales_rating['percentage'],
                     name='North America',
                     marker_color=c[0]))

# Add trace for other regions
fig.add_trace(go.Bar(x=eu_sales_rating['rating'],
                        y=eu_sales_rating['percentage'],
                        name='Europe',
                        marker_color=c[1]))

fig.add_trace(go.Bar(x=jp_sales_rating['rating'],
                        y=jp_sales_rating['percentage'],
                        name='Japan',
                        marker_color=c[2]))

# Update figure: set and center the title, label the axes, order the x-axis by rating_order
fig.update_layout(title={'text': 'Percentage of Total Sales by ESRB Rating in Each Region',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='ESRB Rating',
                    yaxis_title='Percentage of Total Sales',
                    xaxis={'categoryorder': 'array', 'categoryarray': rating_order})

# Show figure
fig.show()

**Analysis:** North America and Europe once again share similar results, with around 35% of sales in the mature category. This is not surprising, as action and shooter games make up the larger portion of their markets.

In contrast, the mature category is one of Japan's smallest. However, 58% of Japanese sales have no ESRB rating attached. The ESRB is a North American body (Japan has its own rating organization, CERO), so titles sold mainly in Japan are less likely to carry an ESRB rating. Because the true distribution is unknown, the Japanese rating shares should be treated with caution.
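
The share of sales without a rating can be measured directly for each region. The sketch below assumes it runs on data where missing ratings are still null (that is, before the placeholder is filled in), and it uses a small hypothetical frame rather than `games.csv`.

```python
import pandas as pd

# Hypothetical rows; None marks a game with no ESRB rating
toy = pd.DataFrame({
    'rating': ['M', None, 'E', None, 'T'],
    'na_sales': [2.0, 0.5, 1.0, 0.5, 1.0],
    'jp_sales': [0.2, 1.0, 0.3, 0.4, 0.1],
})

regions = ['na_sales', 'jp_sales']
unrated = toy['rating'].isna()

# Percentage of each region's sales that comes from unrated games
unrated_share = (toy.loc[unrated, regions].sum() / toy[regions].sum() * 100).round(1)
print(unrated_share)
```

A large gap between regions, as in this toy example, is a sign that rating-based comparisons are less reliable for the region with more missing data.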

## 5. Hypotheses Tests

Two hypotheses will be tested.

### 5.1 - Xbox One and PC User Ratings

**Null Hypothesis**: The average user ratings of the Xbox One and PC platforms are the same.

**Alternative Hypothesis**: The average user ratings of the Xbox One and PC platforms are not the same. 

**Population**: User ratings for Xbox One and PC for all years

**Sample:** User ratings for Xbox One and PC from 2012-2016

**Statistical Test**: Two-sided hypothesis test on the equality of two population means. Apply `stats.ttest_ind` to the Xbox One and PC user ratings as the two samples, using Levene's test to decide whether the variances are similar enough to set `equal_var=True`.

**Alpha**: Reject the null hypothesis if p-value < **0.05**. This is a commonly used significance level.

In [None]:
# Save alpha value
alpha = 0.05

# Create dataframe for Xbox One and PC, drop null values from user_score
xbox_one = df[df['platform'] == 'XOne'].dropna(subset=['user_score'])
pc = df[df['platform'] == 'PC'].dropna(subset=['user_score'])

# Perform Levene's test
levene_test = stats.levene(xbox_one['user_score'], pc['user_score'])

# Determine equal_var based on Levene's test result
if levene_test.pvalue < alpha:
    equal_var = False
    print("The variances are not equal (according to Levene's test)")
else:
    equal_var = True
    print("The variances are equal (according to Levene's test)")

The variances are not equal (according to Levene's test)


In [None]:
# Use ttest_ind to compare the two samples; equal_var comes from Levene's test above
results = stats.ttest_ind(xbox_one['user_score'], pc['user_score'], equal_var=equal_var)

# Print p-value
print('p-value:', results.pvalue)

# Print condition depending on pvalue compared to alpha
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

p-value: 0.4265585178936072
We can't reject the null hypothesis


**Conclusion:** We cannot reject the null hypothesis: the data show no statistically significant difference between the average user ratings of the Xbox One and PC platforms.
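
With `equal_var=False`, `ttest_ind` performs Welch's t-test, which does not pool the two variances. The sketch below shows, under that assumption, how the test statistic and the Welch-Satterthwaite degrees of freedom are formed. The helper `welch_t` and the sample scores are hypothetical and are not taken from `games.csv`.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance divides by n - 1)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof

# Hypothetical user scores for two platforms
scores_a = [6.5, 7.0, 7.2, 6.8, 7.5]
scores_b = [5.0, 8.5, 6.0, 9.0, 4.5, 7.0]

t_stat, dof = welch_t(scores_a, scores_b)
print(round(t_stat, 3), round(dof, 2))
```

The p-value is then read from a t distribution with `dof` degrees of freedom. When the two samples have very different spreads, `dof` falls below the pooled value of `n_a + n_b - 2`, which makes the test more conservative.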

### 5.2 - Action and Sports Genres User Ratings 

**Null Hypothesis**: The average user ratings for action and sports genre games are the same. The null hypothesis proposes a theory that no statistically significant differences exist.

**Alternative Hypothesis**: The average user ratings for action and sports genre games are different.

**Population**: User ratings for action and sports games for all years

**Sample:** User ratings for action and sports games from 2012-2016

**Statistical Test**: Two-sided hypothesis test on the equality of two population means. Apply `stats.ttest_ind` to the action and sports user ratings as the two samples, using Levene's test to decide whether the variances are similar enough to set `equal_var=True`.

**Alpha**: Reject the null hypothesis if p-value < **0.05**. This is a commonly used significance level. Note that the significance level itself is unchanged here: the claim that the ratings differ is stated as the alternative hypothesis, while the null hypothesis remains that they are the same.

In [None]:
# Create dataframe for Action and Sports, drop null values from user_score
action = df[df['genre'] == 'Action'].dropna(subset=['user_score'])
sports = df[df['genre'] == 'Sports'].dropna(subset=['user_score'])

# Perform Levene's test
levene_test = stats.levene(action['user_score'], sports['user_score'])

# Determine equal_var based on Levene's test result
if levene_test.pvalue < alpha:
    equal_var = False
    print("The variances are not equal (according to Levene's test)")
    
else:
    equal_var = True
    print("The variances are equal (according to Levene's test)")

The variances are not equal (according to Levene's test)


In [None]:
# Use ttest_ind to compare the two samples; equal_var comes from Levene's test above
results = stats.ttest_ind(action['user_score'], sports['user_score'], equal_var=equal_var)

# Print p-value
print('p-value:', results.pvalue)

# Print condition depending on pvalue compared to alpha
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

p-value: 8.072169369542124e-12
We reject the null hypothesis


**Conclusion:** We reject the null hypothesis: the average user ratings for the action and sports genres differ significantly.

## 6. Conclusion

### 6.1 General Analysis

**General Statistics**

To summarize, the total number of games sold peaked in 2008 and has since dropped to almost one third of what it used to be. Platforms have a general lifespan of 9-11 years. 

**Current Environment**

Most current platforms are headed towards the end of their life cycle, except the PS4 and Xbox One, which may have upwards of 6 years left. The PS4 and Nintendo 3DS have dominated top game sales, holding the top 11 most profitable games.

**Top Platforms**

User scores are not correlated with PS4 sales, whereas critic scores have a weak positive correlation. The same games sell far less on other platforms than on the PS4, with the Xbox One lagging well behind in second place.

**Genres**

The most profitable genres are action, shooter and role-playing whilst the least are puzzle, strategy and adventure.

### 6.2 User Profiles by Region

**Platforms**

The two highest-selling platforms in Europe and North America are the PS4 and Xbox One, which together make up over 60% of the market. This differs from Japan, where the Nintendo 3DS holds more than 63%.

**Genres**

North America and Europe share the same top five genres, with similar market shares. The top two, action and shooter games, take up almost 50% of the market. Japan has a larger following for role-playing games.

**Preferences**

Looking at this data, Japanese gamers may prefer more immersive experiences. Gamers in North America and Europe prefer action and shooter games.

**Ratings**

European and North American gamers do not appear to be deterred by ratings: mature-rated games take the largest share of the market (around 35%). In Japan, the Everyone category is the highest among rated games; however, a large share of rating data (58% of sales) is missing.

### 6.3 Hypotheses Tests

**Xbox One and PC Ratings**

The null hypothesis that the average user ratings of the Xbox One and PC platforms are the same could not be rejected.

**Action and Sports Ratings**

The null hypothesis that the average user ratings for the action and sports genres are the same was rejected.