# Game Sales Project
## Name: Agus Rusmawan


# Context
The gaming industry is an interesting field to explore. It would be fun to find out which publishers and developers are the most popular and which games are the most popular.

# Questions to Be Answered
 - Which game is the oldest, and which is the newest, in the dataset?
 - Which publisher published the most games?
 - Which developer developed the most games?
 - Which series has the most sales?
 - Which series has the most games?


# Project Work Steps
 - Gathering data
 - Importing the library
 - Loading the table from github.com
 - Assessing data (check data types, missing values, duplicates, and summary statistics of the numeric columns)
 - Cleaning data (remove duplicates, handle missing values, handle inaccurate values)
 - Exploratory data analysis (EDA)

In [60]:
#Gathering Data
#Import Library
import pandas as pd



In [61]:
#Load file/table from github.com
df = pd.read_csv("https://raw.githubusercontent.com/agusrusmawan/GameSales/main/games_sales.csv")

In [62]:
#Assessing Data
df.head()

                            Name  Sales     Series    Release                Genre               Developer               Publisher
0  PlayerUnknown's Battlegrounds   42.0        NaN  12/1/2017        Battle royale            PUBG Studios                 Krafton
1                      Minecraft   33.0  Minecraft  11/1/2011    Sandbox, survival          Mojang Studios          Mojang Studios
2                     Diablo III   20.0     Diablo   5/1/2012  Action role-playing  Blizzard Entertainment  Blizzard Entertainment
3                    Garry's Mod   20.0        NaN  11/1/2006              Sandbox       Facepunch Studios                   Valve
4                       Terraria   17.2        NaN   5/1/2011     Action-adventure                Re-Logic                Re-Logic


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       177 non-null    object 
 1   Sales      177 non-null    float64
 2   Series     141 non-null    object 
 3   Release    177 non-null    object 
 4   Genre      177 non-null    object 
 5   Developer  177 non-null    object 
 6   Publisher  177 non-null    object 
dtypes: float64(1), object(6)
memory usage: 9.8+ KB


In [64]:
# The output above shows that the Series column has missing values (141 non-null out of 177 rows) and that the Release column is stored as object when it should be datetime. Both issues are fixed during cleaning.
# Confirm the number of missing values per column
df.isna().sum()

Name          0
Sales         0
Series       36
Release       0
Genre         0
Developer     0
Publisher     0
dtype: int64

In [65]:
#Checking data duplication
print("Number of duplications: ", df.duplicated().sum())

Number of duplications:  2
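
Before removing duplicates, it can help to look at the rows themselves. The following is a minimal sketch on a small made-up table (the `toy` frame and its values are hypothetical, not taken from the dataset). It uses `duplicated(keep=False)` to show every member of a duplicated group rather than only the later copies.

```python
import pandas as pd

# Hypothetical toy data standing in for the real table
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game A", "Game C"],
    "Sales": [2.0, 1.5, 2.0, 1.0],
})

# By default, duplicated() flags only the later copies of a repeated row
n_extra_copies = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicated group
dup_rows = toy[toy.duplicated(keep=False)]

print("Extra copies:", n_extra_copies)
print(dup_rows)
```

On the real data, the same two calls could be applied to `df` to see which games appear twice before `drop_duplicates()` is run.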


In [66]:
# The check above found 2 duplicated rows in the table
df.describe()

          Sales
count     177.0
mean   3.116949
std    4.937466
min         1.0
25%         1.0
50%         1.5
75%         3.0
max        42.0


In [67]:
# The summary shows no impossible values (the minimum is 1.0, so there are no zero or negative sales), although Sales is strongly right-skewed: the median is 1.5 while the maximum is 42.0

In [68]:
# Cleaning Data
# The assessment showed that the Release column is stored as object; it should be datetime, so we convert it here
datetime_columns = ["Release"]

for column in datetime_columns:
  df[column] = pd.to_datetime(df[column])
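
Dates such as `12/1/2017` are ambiguous unless the field order is known. The sketch below parses the three sample values shown in the `df.head()` output with an explicit format, assuming the strings are month/day/year. That assumption is mine, not something the dataset states; the `raw` and `parsed` names are illustrative only.

```python
import pandas as pd

# Sample Release strings copied from the df.head() output
raw = pd.Series(["12/1/2017", "11/1/2011", "5/1/2012"])

# Assumption: month/day/year. An explicit format makes pandas raise an error
# on strings that do not match instead of silently guessing the field order.
parsed = pd.to_datetime(raw, format="%m/%d/%Y")

print(parsed)
```

If the assumption holds for the whole column, the same `format` argument could be passed in the conversion loop.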

In [69]:
#The code above will change the data type in the Release column to datetime
#To make sure this is working as expected, check the data type again using the info() method.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Name       177 non-null    object        
 1   Sales      177 non-null    float64       
 2   Series     141 non-null    object        
 3   Release    177 non-null    datetime64[ns]
 4   Genre      177 non-null    object        
 5   Developer  177 non-null    object        
 6   Publisher  177 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 9.8+ KB


In [70]:
# The output above confirms that the Release column is now datetime64[ns]
# Handle missing values in the Series column
df.Series.value_counts()

Series
StarCraft                6
Command & Conquer        5
Civilization             4
Age of Empires           4
Warcraft                 3
                        ..
Alice                    1
Ark: Survival Evolved    1
BioShock                 1
Commandos                1
Zoo Tycoon               1
Name: count, Length: 91, dtype: int64

In [71]:
# The output above shows that StarCraft is the most frequent value (the mode) in the Series column.
# We use it to replace the missing values with fillna(), assigning the result back to the column.
# Caveat: this labels all 36 games that have no series as StarCraft, which inflates StarCraft's game count and total sales in questions 4 and 5.
df["Series"] = df["Series"].fillna("StarCraft")

In [72]:
# To confirm that the fill worked, count the missing values again
df.isna().sum()

Name         0
Sales        0
Series       0
Release      0
Genre        0
Developer    0
Publisher    0
dtype: int64
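
Filling with the most frequent value attributes every game without a series to StarCraft. One possible alternative, sketched here on made-up data, is a neutral placeholder so that those games stay separate from real series. The `toy` frame and the label `"No series"` are hypothetical choices for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data: two games belong to a series, two do not
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Series": ["Alpha", None, "Alpha", None],
})

# Replace missing values with a placeholder and assign the result back
toy["Series"] = toy["Series"].fillna("No series")

counts = toy["Series"].value_counts()
print(counts)
```

With this approach the count for a real series reflects only games that actually belong to it, and the placeholder can be filtered out before ranking.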

In [73]:
#Eliminate duplicate data
df.drop_duplicates(inplace=True)

In [74]:
#Check again whether there are still any duplications in the data
print("Number of duplications: ", df.duplicated().sum())

Number of duplications:  0


In [75]:
#There is no longer any duplicate data

In [76]:
# Answering the questions
#1a. Which game is the oldest in the dataset?
oldest_game = df.loc[df['Release'].idxmin()]

print("The oldest game in the dataset is:", oldest_game['Name'])

The oldest game in the dataset is: Hydlide


In [77]:
#1b. Which game is newest in that dataset?
newest_game = df.loc[df['Release'].idxmax()]

print("The newest game in the dataset is:", newest_game['Name'])

The newest game in the dataset is: Valheim
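
`idxmin()` and `idxmax()` return only the first matching row, so games that share the earliest or latest release date would be missed. A minimal sketch of a tie-aware lookup follows, using a hypothetical `toy` frame with invented dates.

```python
import pandas as pd

# Hypothetical toy data: two games share the earliest release date
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Release": pd.to_datetime(["2000-01-01", "2000-01-01", "2010-06-01"]),
})

# Select every row whose date equals the minimum (or maximum) date
oldest = toy.loc[toy["Release"] == toy["Release"].min(), "Name"].tolist()
newest = toy.loc[toy["Release"] == toy["Release"].max(), "Name"].tolist()

print("Oldest:", oldest)
print("Newest:", newest)
```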


In [78]:
#2. Which publisher published the most games?
# Count the number of games published by each publisher
publisher_counts = df['Publisher'].value_counts()

# Get the publisher with the largest number of games
most_common_publisher = publisher_counts.idxmax()
most_common_publisher_count = publisher_counts.max()

print("The publisher that published most of the games is:", most_common_publisher)
print("Number of games published by this publisher:", most_common_publisher_count)

The publisher that published most of the games is: Electronic Arts
Number of games published by this publisher: 19


In [79]:
#3. Which developer developed the most games?
# Counting the number of games developed by each developer
developer_counts = df['Developer'].value_counts()

# Get the developer with the largest number of games
most_common_developer = developer_counts.idxmax()
most_common_developer_count = developer_counts.max()

print("The developer that developed most of the games is:", most_common_developer)
print("Number of games developed by this developer:", most_common_developer_count)

The developer that developed most of the games is: Blizzard Entertainment
Number of games developed by this developer: 8
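
The publisher and developer questions follow the same pattern: count the values, then take the largest. That pattern could be wrapped in a small helper, as in the sketch below. The function name `most_frequent` and the `toy` data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def most_frequent(frame, column):
    """Return (value, count) for the most frequent value in a column."""
    counts = frame[column].value_counts()
    return counts.idxmax(), int(counts.max())

# Hypothetical toy data
toy = pd.DataFrame({
    "Publisher": ["Pub X", "Pub Y", "Pub X", "Pub X"],
    "Developer": ["Dev 1", "Dev 2", "Dev 2", "Dev 3"],
})

top_publisher, publisher_count = most_frequent(toy, "Publisher")
top_developer, developer_count = most_frequent(toy, "Developer")

print(top_publisher, publisher_count)
print(top_developer, developer_count)
```

Like the cells above, the helper reports a single winner; if two values tie for first place, only one of them is returned.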


In [80]:
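#4. Which series has the most sales?
# Calculate total sales for each series
# (the StarCraft total includes the games whose missing Series value was filled with StarCraft during cleaning)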
series_sales = df.groupby('Series')['Sales'].sum()

# Get the series with the most sales
most_sold_series = series_sales.idxmax()
most_sold_series_sales = series_sales.max()

print("The series with the most sales is:", most_sold_series)
print("Total sales of this series:", most_sold_series_sales)

The series with the most sales is: StarCraft
Total sales of this series: 175.1
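
Because games without a series were labeled StarCraft during cleaning, their sales are part of the total above. As a point of comparison, the sketch below shows how `groupby` behaves when missing values are left in place: rows with a missing key are dropped from the grouping by default, so they can be summed separately. The `toy` frame and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: two standalone games have no series
toy = pd.DataFrame({
    "Series": ["Alpha", None, "Beta", "Alpha", None],
    "Sales": [2.0, 40.0, 5.0, 1.5, 20.0],
})

# groupby() skips rows whose key is missing (dropna=True is the default)
series_sales = toy.groupby("Series")["Sales"].sum()

# Sales of games that do not belong to any series, kept separate
standalone_sales = toy.loc[toy["Series"].isna(), "Sales"].sum()

print(series_sales)
print("Top series:", series_sales.idxmax())
print("Standalone sales:", standalone_sales)
```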


In [81]:
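#5. Which series has the most games?
# Count the number of games for each series
# (the StarCraft count includes the games whose missing Series value was filled with StarCraft during cleaning)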
series_games = df['Series'].value_counts()

# Get the series with the most number of games
most_game_series = series_games.idxmax()
most_game_series_count = series_games.max()

print("The series with the most games is:", most_game_series)
print("Number of games in this series:", most_game_series_count)

The series with the most games is: StarCraft
Number of games in this series: 40
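
Questions 4 and 5 can also be answered from a single summary table. The sketch below uses pandas named aggregation on a hypothetical `toy` frame; the output column names `games` and `total_sales` are my own choices.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Series": ["Alpha", "Alpha", "Beta"],
    "Sales": [1.0, 2.0, 5.0],
})

# One row per series with the number of games and the summed sales
summary = (
    toy.groupby("Series")
    .agg(games=("Name", "count"), total_sales=("Sales", "sum"))
    .sort_values("total_sales", ascending=False)
)

print(summary)
```

Sorting by `games` instead of `total_sales` gives the ranking for question 5 from the same table.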
