# Game Development

In [47]:
# libraries
import pandas as pd
import numpy as np
import altair as alt

Group Members:

#### Author contributions

Author 1 contributed ...

Author 2 contributed ...

#### Abstract

Prepare an abstract *after* you've written the entire report. The abstract should be 4-6 sentences summarizing the report contents. Typically:
* the first 1-2 sentences introduce and motivate the topic;
* the next 1-2 sentences state the aims;
* the next 1-2 sentences state the findings.

---
## Background: Sales of videogames 

Videogames have become an increasingly popular form of entertainment since the crash of 1983, pushing advances in computational hardware, content creation, and competitive play, and opening many commercial opportunities along the way. Games are played on a range of hardware, most notably dedicated consoles such as Sony's PlayStation series, Microsoft's Xbox consoles, and Nintendo's most recent console, the Switch. Custom-built computers are another popular hardware option.
<br>


<center><img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS3Ecto-QEVl3cg5U7-xtEhNdgRFqLHoa6_nQ&usqp=CAU" width="300" height="300"/></center>

In [48]:
vg_sales = pd.read_csv('vgsales.csv')
vg_sales.head()

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |


<blockquote> 
<p>
    <strong>Figure 1</strong>: Total number of videogames sold globally (in millions of copies) by year of release, 1980 to 2016.
</p>
</blockquote>

In [14]:
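# Drop rows with any missing value (the axis must be passed by keyword in pandas 2.x)
vg_sales_mod = vg_sales.dropna(axis=0)
# Total global sales (in millions) per release year; only this column is plotted
vg_sales_mod2 = vg_sales_mod.groupby('Year', as_index=False)['Global_Sales'].sum()
alt.data_transformers.disable_max_rows()
alt.Chart(vg_sales_mod2).mark_bar().encode(
    x=alt.X('Year:O', title='Year'),
    y=alt.Y('Global_Sales:Q', title='Global Sales (millions)')
)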

--- 
## 1. Data Description

The data record the sales of videogames released from 1980 through 2020 that sold more than 100,000 copies.

### Basic Information About Collection of Data

The data were obtained from Kaggle at https://www.kaggle.com/arslanali4343/sales-of-video-games and were collected for general-purpose use. According to the source, the values were web scraped from https://www.vgchartz.com/ using the Python library Beautiful Soup, which extracts data from HTML and XML documents. The dataset covers popular games released since 1980, and the sampling mechanism restricted it to games that sold more than 100,000 copies. The scope of inference therefore extends mainly to console games that are currently for sale; the data are limited in that they do not fully cover Steam and other platforms.
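
Because sales are recorded in millions of copies, the 100,000-copy rule corresponds to a `Global_Sales` value of 0.1. A minimal sketch of how that rule could be checked is shown here; the rows and the names `sample` and `THRESHOLD_MILLIONS` are hypothetical and used for illustration only, and the same filter could be applied to `vg_sales`.

```python
import pandas as pd

# Hypothetical rows for illustration only; sales are in millions of copies.
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Global_Sales": [82.74, 0.25, 0.05],
})

THRESHOLD_MILLIONS = 0.1  # 100,000 copies expressed in millions

# Rows that would contradict the stated sampling rule (more than 100,000 copies).
below_threshold = sample[sample["Global_Sales"] <= THRESHOLD_MILLIONS]
print(f"{len(below_threshold)} of {len(sample)} rows are at or below the threshold")
```

If any rows are returned for the real data, the stated sampling mechanism would need to be revisited.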

### Data Structure

For this study, the observational units are popular games published between 1980 and 2017. The variables are Rank, Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, and Global_Sales. Additional details regarding the variables can be found below.

<blockquote> 
<p>
    <strong>Table 1</strong>: The variable description, data type, and the units of measurement for each variable are provided in the table below.
</p>
</blockquote>

Name | Variable description | Type | Units of measurement
---|---|---|---
Rank | Ranking of Overall Sales | Ordinal | Rank Numbers
Name | Name of The Game | Nominal | Game Names
Platform | Platform of Game Release | Nominal | Game Platforms
Year | Year of Game Release | Ordinal | Calendar Year
Genre | Genre of The Game | Nominal | Game Genres
Publisher | Publisher of The Game | Nominal | Game Companies
NA_Sales | Total Sales in North America | Numeric | in Millions
EU_Sales | Total Sales in Europe | Numeric | in Millions
JP_Sales | Total Sales in Japan | Numeric | in Millions
Other_Sales | Total Sales in Rest of the World | Numeric | in Millions
Global_Sales | Total Sales Worldwide | Numeric | in Millions
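
One way to sanity-check this structure is to confirm that `Global_Sales` is approximately the sum of the four regional columns. The sketch below does this for the five rows displayed by `vg_sales.head()` earlier in the report; the tolerance of 0.02 is an assumption chosen to absorb two-decimal rounding in each column, not a property documented by the source.

```python
import numpy as np
import pandas as pd

# The five rows shown by vg_sales.head() earlier in the report.
head_rows = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
             "Wii Sports Resort", "Pokemon Red/Pokemon Blue"],
    "NA_Sales": [41.49, 29.08, 15.85, 15.75, 11.27],
    "EU_Sales": [29.02, 3.58, 12.88, 11.01, 8.89],
    "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    "Other_Sales": [8.46, 0.77, 3.31, 2.96, 1.0],
    "Global_Sales": [82.74, 40.24, 35.82, 33.0, 31.37],
})

regional_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
regional_total = head_rows[regional_cols].sum(axis=1)

# Assumed tolerance (in millions) to allow for rounding in each column.
TOLERANCE = 0.02
consistent = np.isclose(regional_total, head_rows["Global_Sales"],
                        rtol=0, atol=TOLERANCE)
print(head_rows.assign(Regional_Total=regional_total.round(2),
                       Consistent=consistent))
```

The same comparison could be run on the full `vg_sales` frame to flag rows where the regional figures and the global total disagree.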

### Preprocessing of Data

Data exploration showed that a substantial number of observations had missing values. The exact number of missing values for each variable can be found below.

<blockquote> 
<p>
    <strong>Table 2</strong>: Number of missing values for each variable.
</p>
</blockquote>

Name | Number of Missing Values
---|---
Rank | 0 
Name | 0
Platform | 0
Year | 271 
Genre | 0 
Publisher | 58 
NA_Sales | 0
EU_Sales | 0
JP_Sales | 0
Other_Sales | 0
Global_Sales | 2
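
Counts like those in Table 2 can be produced with `DataFrame.isna().sum()`. The sketch below uses a small hypothetical frame (`sample`) for illustration only; it is not the report's data, and the report's own counts come from applying the same call to `vg_sales`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2006.0, np.nan, 2010.0],
    "Publisher": ["Nintendo", None, None],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```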

In [55]:
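# Fill missing values for top-1000 titles with values looked up online.
# Row positions are 0-based; column positions are resolved by name for clarity.
year_col = vg_sales.columns.get_loc('Year')
publisher_col = vg_sales.columns.get_loc('Publisher')
year_fixes = {179: 2004, 377: 2004, 431: 2008, 470: 2006, 607: 1980,
              624: 2007, 649: 2001, 652: 2008, 711: 2006, 782: 2008}
for row, year in year_fixes.items():
    vg_sales.iloc[row, year_col] = year
vg_sales.iloc[470, publisher_col] = 'THQ'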

Upon exploration of the data, it was decided that simply removing the observations with missing values would be too risky, as they might prove crucial for the subsequent analysis. Observations ranked between 1 and 1000 had the greatest sales and are therefore of greater importance, so their missing values were replaced with the actual values found by searching online.
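
The rows that needed a manual lookup can be listed by combining a rank cutoff with a missing-value mask. This is a minimal sketch on hypothetical rows; `sample`, `RANK_CUTOFF`, and `to_look_up` are illustrative names, and it is not necessarily how the rows above were originally identified.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "Rank": [1, 180, 471, 1500],
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, np.nan, np.nan, np.nan],
    "Publisher": ["Nintendo", "EA", None, "Sega"],
})

RANK_CUTOFF = 1000  # only the top-ranked titles are filled in by hand

# True for rows with at least one missing value in any column.
has_missing = sample.isna().any(axis=1)
to_look_up = sample[(sample["Rank"] <= RANK_CUTOFF) & has_missing]
print(to_look_up)
```

Rows beyond the cutoff, such as the hypothetical "Game D", keep their missing values under this approach.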

---
## 2. Methods

---
## 3. Results

In [87]:
# data_plot1 = vg_sales_mod.groupby(['Platform', 'Year']).sum().reset_index().drop('Rank', axis = 1)
# alt.Chart(data_plot1).mark_line(point = True).encode(
#     y = alt.Y('Global_Sales:Q', title = 'Total Global Sales'),
#     x = alt.X('Year:Q', title = 'Year'),
#     color = alt.Color('Platform', title = 'Console')
# ).properties(
#     width = 500,
#     height = 500
# ).mark_area(opacity = 0.5) #let's smooth this, also maybe do some density work, we could do mean estimates
#                                                  # to do more comparisons

In [17]:
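# Sum the numeric columns per genre and keep the five genres with the highest global sales.
# (Passing the builtin sum to aggregate is deprecated in recent pandas versions.)
numeric_cols = ['Rank', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
vg_sales.groupby('Genre')[numeric_cols].sum().sort_values(by='Global_Sales', ascending=False).head(5)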

Genre | Rank | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales
---|---|---|---|---|---|---
Action | 26441383 | 875.45 | 521.68 | 159.95 | 187.38 | 1751.18
Sports | 17419112 | 680.92 | 376.41 | 135.6 | 134.97 | 1330.93
Shooter | 9653872 | 581.65 | 312.12 | 38.28 | 102.69 | 1037.37
Role-Playing | 12032228 | 326.41 | 187.18 | 352.31 | 59.61 | 927.37
Platform | 6137545 | 445.59 | 200.37 | 130.77 | 51.59 | 831.37


### Sales of Video Games by Genre Over Time (Top 5)

<blockquote> 
<p>
    <strong>Figure 2</strong>: Changes in global sales for the top 5 genres between 1980 and 2017.
</p>
</blockquote>

In [18]:
# Keep the five best-selling genres and total their global sales per year
top5_genres = ['Action', 'Sports', 'Shooter', 'Role-Playing', 'Platform']
data_plot2 = (
    vg_sales[vg_sales['Genre'].isin(top5_genres)]
    .groupby(['Genre', 'Year'], as_index=False)['Global_Sales']
    .sum()
)
# One area chart per genre; year labels are hidden to keep the small panels readable
alt.Chart(data_plot2).mark_area(opacity=0.5).encode(
    x=alt.X('Year', axis=alt.Axis(labels=False, ticks=False)),
    y=alt.Y('Global_Sales'),
    color=alt.Color('Genre')
).properties(width=150).facet('Genre')

---
## 4. Conclusion