# Integrated Project 1

In this project, I use historical game sales data from 2016 to build a forecast for the following sales year by:
- Exploring game releases and sales across different years, and examining platform popularity and how long each platform stays relevant.
- Identifying leading platforms, tracking sales trends, and pinpointing potentially profitable options for the company.
- Analyzing the impact of reviews on sales for a chosen platform and comparing game sales across platforms.
- Investigating genre distribution to track trends in profitability.
<br>

### Data description
The dataset contains the following details:
- Name 

- Platform 

- Year_of_Release 

- Genre 

- NA_sales (North American sales in USD million) 

- EU_sales (sales in Europe in USD million) 

- JP_sales (sales in Japan in USD million) 

- Other_sales (sales in other countries in USD million) 

- Critic_Score (maximum of 100) 

- User_Score (maximum of 10) 

- Rating (ESRB)

This data may be incomplete and will be processed before analysis.

In [None]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Read in the data and view for any abnormalities
games = pd.read_csv('/Users/angeneris/Desktop/integrated_project_1/games.csv')
games.head(20)

In [None]:
# Data cleaning: checking column types and non-null counts for any issues
games.info()

### Issues found: 

- Several columns contain null values, and the number of nulls differs from column to column
- Year of release should be a datetime object, but is currently float
- Critic score should be int, but is currently float
- User score should be float, but is currently object
<br>
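
The null counts behind the first issue can be made explicit. The sketch below is only an illustration: `sample` is a small hypothetical stand-in for `games` (with the original, not yet lowercased, column names), and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `games` would be used instead.
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", None],
    "Year_of_Release": [2006.0, None, 2010.0],
    "Critic_Score": [76.0, None, None],
    "User_Score": ["8", "7.5", None],  # stored as strings, so dtype is object
})

# Number of missing values per column, largest first.
null_counts = sample.isna().sum().sort_values(ascending=False)
print(null_counts)

# Percentage of rows that are missing in each column.
null_share = (sample.isna().mean() * 100).round(1)
print(null_share)
```

Running the same two expressions on `games` should show which columns are affected most and by how much.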



## Data Cleaning

In [None]:
# Changing columns to lowercase and saving back to games 
games.columns = games.columns.str.lower()

games.head()

I'm dropping all null values from 'year_of_release' first because this column is essential to the analysis in this project, and there is no meaningful way to represent a missing release year if we keep those rows.
<br>

- Why are the values missing? <br>
The release year may be missing because a game is so old or so new that its record was not processed correctly. We can find out more by reviewing the other columns of the rows where year_of_release is NaN.
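
One possible way to review those rows, sketched on a hypothetical `sample` frame rather than the real `games` data (the names and values here are invented for illustration), is to filter on the missing year and see how the affected rows are spread across platforms:

```python
import pandas as pd

# Hypothetical stand-in rows with lowercased column names.
sample = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "platform": ["PS2", "Wii", "PS2", "X360"],
    "year_of_release": [2004.0, None, None, 2010.0],
})

# Rows where the release year is missing.
missing_year = sample[sample["year_of_release"].isna()]
print(missing_year)

# How the rows with a missing year are distributed across platforms.
missing_by_platform = missing_year["platform"].value_counts()
print(missing_by_platform)
```

This check only makes sense before the rows with a missing year are dropped.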

<br> 
We will also drop the remaining null values in most of the other columns so that they match the number of non-null values in year_of_release. Three columns will keep their null values for now, to be changed to a placeholder later: 'critic_score', 'user_score' and 'rating'. These columns offer a bit more flexibility for carrying missing values forward.

In [None]:
# List of columns to clean by dropping null values
columns_to_clean = ['year_of_release', 'name', 'platform', 'genre', 'eu_sales', 'jp_sales', 'other_sales']

# Dropping null values from specified columns
games = games.dropna(subset=columns_to_clean)

games.info()

Looks good, all columns except for 'critic_score', 'user_score' and 'rating' now have 16444 non-null values. 

<br> 
Next, we'll clean the 'critic_score' column, then 'user_score' and finally 'rating'.

<br> 

- Why are there missing values?<br>
In these columns, the missing values have a more obvious explanation: the game simply has no rating or score. This could be because the rating system was introduced after the game was tracked, because the rating has yet to be processed, or because it was never completed. We can learn more about these differences by reviewing these columns alongside the year_of_release column to see if there are any patterns.
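
A rough way to look for such patterns is to compute, per release year, the share of rows with a missing score or rating. The sketch below uses a hypothetical `sample` frame with made-up values and is not the notebook's own analysis:

```python
import pandas as pd

# Hypothetical stand-in rows; None marks a missing score or rating.
sample = pd.DataFrame({
    "year_of_release": [1990, 1990, 2005, 2005, 2016, 2016],
    "critic_score": [None, None, 80.0, 75.0, None, 90.0],
    "rating": [None, None, "E", "T", None, "M"],
})

# Fraction of missing values per release year for each column.
missing_share_by_year = (
    sample[["critic_score", "rating"]]
    .isna()
    .groupby(sample["year_of_release"])
    .mean()
)
print(missing_share_by_year)
```

A share close to 1 in early years would be consistent with ratings that did not exist yet, while gaps in the most recent year would point to scores that have not been processed.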

In [None]:
# Converting 'critic_score' to integer and handling missing values
games['critic_score'] = pd.to_numeric(games['critic_score'], errors='coerce').astype('Int64')

# Converting 'user_score' to float; non-numeric entries are coerced to NaN
games['user_score'] = pd.to_numeric(games['user_score'], errors='coerce')

# Converting 'year_of_release' to int, since datetime is unnecessary when only the year is given
# (rows with a null year were already dropped, so the cast is safe)
games['year_of_release'] = games['year_of_release'].astype(int)

# Filling missing values with placeholders for ease of filtering later:
# -1 for the two score columns and 'no rating' for 'rating'
games = games.fillna(value={'user_score': -1, 'critic_score': -1, 'rating': 'no rating'})

# Checking 
games.info()

In [None]:
# Checking that there are no outliers in the data for 'critic_score' and 'user_score'
# critic_score should have a max of 100 and user_score max of 10
games[['critic_score','user_score']].max()
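
Because the missing scores were filled with -1, a maximum alone does not show whether the lower bound holds. A minimal sketch of a two-sided check, using a hypothetical `sample` frame with invented values and assuming -1 is the only placeholder, could look like this:

```python
import pandas as pd

# Hypothetical stand-in rows; -1 marks a missing score.
sample = pd.DataFrame({
    "critic_score": [76, -1, 98, 13],
    "user_score": [8.0, -1.0, 9.7, 0.5],
})

# Turn the -1 placeholder back into NaN so it does not distort the minimum.
scores = sample.where(sample != -1)

# Minimum and maximum of the real scores in each column.
score_range = scores.agg(["min", "max"])
print(score_range)

# True only if every real score falls inside its expected range.
within_bounds = bool(
    scores["critic_score"].dropna().between(0, 100).all()
    and scores["user_score"].dropna().between(0, 10).all()
)
print(within_bounds)
```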

In [None]:
# Creating a new column for total_sales. This will combine sales from all regions and add a new column to the dataframe
games['total_sales'] = games[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)

games

In [None]:
# Checking 
games.head()

The data has now been processed and is ready for analysis. 
<br>

## Data Analysis 

In [None]:
# Counting the number of games released each year
games['year_of_release'].value_counts().sort_index()