# Ice Store – Video Game Sales Analysis

## Business Context
Ice is an online video game store operating worldwide.
This analysis identifies the patterns that determine whether a game becomes successful, in order to support marketing campaign planning for 2017.

The dataset contains historical sales data up to 2016, including:
- Platforms
- Genres
- Regional sales
- Critic and user scores
- ESRB ratings


## Analytical Objectives

- Prepare and clean the dataset
- Identify sales trends across platforms and years
- Analyze the impact of reviews on sales
- Build regional user profiles (NA, EU, JP)
- Test statistical hypotheses regarding user ratings


## 1. Data Preparation

In this section, we clean and prepare the dataset for analysis.
This includes standardizing column names, converting data types, handling missing values, and creating additional relevant features.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

games = pd.read_csv('../datasets/games.csv')



### 1.1 Standardizing Column Names

Column names are converted to lowercase to avoid referencing issues during analysis.

In [None]:
games.columns = games.columns.str.lower()
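
To illustrate the effect, here is a minimal sketch on a toy frame. The mixed-case headers below are assumed for illustration and may not match the raw file exactly.

```python
import pandas as pd

# Toy frame with mixed-case headers (hypothetical names, for illustration only).
sample = pd.DataFrame(
    {"Name": ["Game A"], "Year_of_Release": [2010.0], "NA_sales": [1.2]}
)

# Same operation as above: lowercase every column label.
sample.columns = sample.columns.str.lower()

print(sample.columns.tolist())
```

After this step every column can be referenced in lowercase, regardless of how the source file capitalized it.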


### 1.2 Data Type Conversion

We converted columns to appropriate data types to allow correct numerical analysis.

- **year_of_release**: Converted from `float64` to `Int64`.  
  The column contained missing values, so we used the pandas nullable integer type to preserve them while keeping the year as an integer.

- **user_score**: Converted from `object` to `float64`.  
  The column included non-numeric values (such as "TBD"), so we used `pd.to_numeric()` with `errors='coerce'` to convert valid values and replace invalid ones with `NaN`.


In [None]:
games['year_of_release'] = pd.to_numeric(games['year_of_release'], errors='coerce').astype('Int64')
games['user_score'] = pd.to_numeric(games['user_score'], errors='coerce')
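
The behavior of both conversions can be checked on a small example. This is a sketch with made-up values, not rows from the dataset.

```python
import pandas as pd

# Made-up values: one missing year, one "TBD" score, one missing score.
sample = pd.DataFrame(
    {
        "year_of_release": [2006.0, None, 2015.0],
        "user_score": ["8.0", "TBD", None],
    }
)

# Nullable integer type keeps the missing year as <NA> instead of forcing floats.
sample["year_of_release"] = pd.to_numeric(
    sample["year_of_release"], errors="coerce"
).astype("Int64")

# Non-numeric strings such as "TBD" become NaN.
sample["user_score"] = pd.to_numeric(sample["user_score"], errors="coerce")

print(sample.dtypes)
print(sample)
```

Note that `errors='coerce'` turns every unparseable string into `NaN`, not only "TBD", so it can be worth inspecting the distinct non-numeric values before converting.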

### 1.3 Handling Missing Values

Missing values were identified in the following columns: `name`, `genre`, `year_of_release`, `critic_score`, `user_score`, and `rating`.

Rows with missing values in `name`, `genre`, and `year_of_release` were removed, as these fields are essential for identification and temporal trend analysis.

The value "TBD" in `user_score` was converted to `NaN`, as it indicates that a rating had not yet been determined.

Possible reasons for missing values:

- `name`: Data entry or extraction errors during dataset compilation.
- `genre`: Incomplete metadata or classification issues in the original source.
- `year_of_release`: Missing historical records or incomplete archival data.
- `critic_score`: Some games may not have received professional reviews.
- `user_score`: Games with insufficient user ratings or pending evaluations ("TBD").
- `rating`: ESRB classification may be absent for older titles, games released outside North America, or due to incomplete records.

Missing values in `critic_score`, `user_score`, and `rating` were preserved, since imputing them could introduce bias into correlation analysis, regional comparisons, and hypothesis testing.

This approach drops only the rows that lack essential fields and leaves every other valid observation available for analysis.


In [None]:
# Drop rows missing any of the fields required for identification and trend analysis
games = games.dropna(subset=['name', 'genre', 'year_of_release'])
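
A quick way to confirm the policy described above is to count missing values per column before dropping, then verify that only the essential columns drive row removal. The sketch below uses made-up rows; the variable names `essential`, `missing_before`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 lacks name and genre, row 2 lacks the release year.
sample = pd.DataFrame(
    {
        "name": ["Game A", None, "Game C", "Game D"],
        "genre": ["Action", None, "Sports", "Racing"],
        "year_of_release": pd.array([2010, 2011, None, 2014], dtype="Int64"),
        "critic_score": [80.0, np.nan, 65.0, np.nan],
    }
)

# Missing values per column before any rows are removed.
missing_before = sample.isna().sum()
print(missing_before)

# Drop rows only when an essential field is missing.
essential = ["name", "genre", "year_of_release"]
cleaned = sample.dropna(subset=essential)

# Rows with a missing critic_score are kept as long as the essentials are present.
print(cleaned)
```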


### 1.4 Calculating Total Global Sales

We create a new column representing total global sales by summing regional sales.


In [None]:
games['total_sales'] = (games['na_sales'] + games['eu_sales'] + games['jp_sales'] + games['other_sales'])
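
As an alternative sketch, the same total can be computed from a list of column names. One difference to be aware of: `sum(axis=1)` skips `NaN` by default, whereas the `+` operator propagates it, so the two only agree when the regional columns have no missing values. The figures below are made up.

```python
import pandas as pd

sales_cols = ["na_sales", "eu_sales", "jp_sales", "other_sales"]

# Made-up regional sales figures.
sample = pd.DataFrame(
    {
        "na_sales": [1.0, 0.5],
        "eu_sales": [0.5, 0.25],
        "jp_sales": [0.25, 0.0],
        "other_sales": [0.25, 0.25],
    }
)

# Confirm there are no gaps in the regional columns before summing.
n_missing = int(sample[sales_cols].isna().sum().sum())

# Row-wise sum across the four regional columns.
sample["total_sales"] = sample[sales_cols].sum(axis=1)

print(n_missing)
print(sample)
```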

### 1.5 ESRB Rating Standardization

The ESRB rating categories were reviewed for consistency. 

- "K-A" was replaced with "E", as it represents an older classification equivalent to "Everyone".
- "RP" (Rating Pending) was converted to NaN, since it does not represent a finalized rating.

Other ESRB categories (E, E10+, T, M, AO, EC) were preserved as distinct classifications.


In [None]:
games['rating'] = games['rating'].replace('K-A', 'E')
games['rating'] = games['rating'].replace('RP', np.nan)
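
The effect of the two replacements can be verified on a toy column. The values below are made up; `value_counts(dropna=False)` is used so that missing ratings appear in the tally.

```python
import numpy as np
import pandas as pd

# Made-up ratings, including the legacy "K-A" label and a pending "RP".
sample = pd.DataFrame({"rating": ["E", "K-A", "RP", "T", np.nan]})

# Same two replacements as above, chained.
sample["rating"] = sample["rating"].replace("K-A", "E").replace("RP", np.nan)

# Tally the categories, keeping missing values visible.
print(sample["rating"].value_counts(dropna=False))
```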