
# Exploratory Data Analysis

In this notebook, we perform an exploratory data analysis (EDA) on a video game sales dataset. The aim is to clean, prepare, and analyze the data to uncover patterns and insights.

## Data Cleaning and Preparation


In [None]:

import pandas as pd

# Load the dataset
file_path = '/mnt/data/vgsales.csv'
data = pd.read_csv(file_path)

# Display the first few rows
data.head()



### Handling Missing Values

We will check for missing values in the dataset and address them appropriately.


In [None]:

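# Check for missing values (printed explicitly, since only a cell's last expression is displayed)
missing_values = data.isnull().sum()
print(missing_values)

# Fill missing 'Year' values with the median year
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in pandas 2.x)
data['Year'] = data['Year'].fillna(data['Year'].median())

# Fill missing 'Publisher' values with 'Unknown'
data['Publisher'] = data['Publisher'].fillna('Unknown')

# Verify that no missing values remain
data.isnull().sum()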
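
The imputation step above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical toy frame (`toy`) in place of the real dataset and a hypothetical helper `summarize_missing`; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006.0, np.nan, 2010.0, 2008.0],
    "Publisher": ["Nintendo", None, "Sega", "Nintendo"],
})

def summarize_missing(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({"missing": counts, "percent": 100 * counts / len(df)})

report_before = summarize_missing(toy)

# Same strategy as above: median for 'Year', a placeholder label for 'Publisher'.
filled = toy.assign(
    Year=toy["Year"].fillna(toy["Year"].median()),
    Publisher=toy["Publisher"].fillna("Unknown"),
)
report_after = summarize_missing(filled)
```

Reporting the percentage alongside the count helps judge whether median imputation is reasonable: it is usually acceptable when only a small share of a column is missing.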



### Outlier Removal

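We define a function that removes outliers using the Interquartile Range (IQR) method and apply it to the sales columns. A row is dropped when a value falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR for any of those columns.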


In [None]:

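def remove_outliers(df, column_list):
    """Drop rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each listed column."""
    # Columns are filtered one after another, so the quartiles for each
    # column are computed on the rows that survived the previous columns.
    for column in column_list:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df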

# Columns to check for outliers
sales_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']

# Remove outliers from the dataset
cleaned_data = remove_outliers(data, sales_columns)

# Display the shape of the dataset before and after outlier removal
original_shape, cleaned_shape = data.shape, cleaned_data.shape
(original_shape, cleaned_shape)
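
To make the IQR rule concrete, here is a minimal sketch on hypothetical toy data. The helpers `iqr_bounds` and `iqr_mask` are illustrative names, not part of the original notebook. Unlike `remove_outliers`, this variant computes every column's bounds from the unfiltered data, so the result does not depend on column order; it is shown as an alternative, not as the method used above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds of the IQR rule for one column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_mask(df, columns, k=1.5):
    """True for rows that lie within the bounds of every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lower, upper = iqr_bounds(df[col], k)
        mask &= df[col].between(lower, upper)
    return mask

# Hypothetical toy data: the last NA_Sales value is an obvious outlier.
toy = pd.DataFrame({
    "NA_Sales": [0.1, 0.2, 0.3, 0.4, 10.0],
    "EU_Sales": [0.1, 0.1, 0.2, 0.2, 0.3],
})

lower, upper = iqr_bounds(toy["NA_Sales"])  # roughly (-0.1, 0.7)
mask = iqr_mask(toy, ["NA_Sales", "EU_Sales"])
kept = toy[mask]  # the row with NA_Sales == 10.0 is dropped
```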



## Descriptive Analysis

We compute summary statistics and plot histograms and box plots to understand the distribution and central tendency of the sales data.


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

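# Convert the 'Year' column to integer. cleaned_data is a filtered slice of data,
# so work on an explicit copy to avoid pandas' SettingWithCopyWarning.
cleaned_data = cleaned_data.copy()
cleaned_data['Year'] = cleaned_data['Year'].astype(int)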

# Compute summary statistics for the sales columns
summary_stats = cleaned_data[sales_columns].describe()

# Visualizations: Histograms and Box Plots for Sales Data
sns.set_style("whitegrid")

# Histograms for sales data
plt.figure(figsize=(15, 10))
for i, column in enumerate(sales_columns, start=1):
    plt.subplot(2, 3, i)
    sns.histplot(cleaned_data[column], kde=True, bins=30)
    plt.title(f'Distribution of {column}')
plt.tight_layout()

# Box plots for sales data
plt.figure(figsize=(10, 6))
sns.boxplot(data=cleaned_data[sales_columns])
plt.title('Box Plots of Sales Data')
plt.xticks(ticks=[0, 1, 2, 3, 4], labels=['NA Sales', 'EU Sales', 'JP Sales', 'Other Sales', 'Global Sales'])
plt.ylabel('Sales in Millions')

summary_stats
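
Alongside `describe()`, comparing the mean with the median and checking skewness gives a quick numeric read on the shape of each distribution. The following is a minimal sketch on hypothetical toy values (not taken from the dataset); `shape_stats` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy sales figures (in millions), with one large value per column.
toy = pd.DataFrame({
    "NA_Sales": [0.1, 0.1, 0.2, 0.3, 2.0],
    "EU_Sales": [0.0, 0.1, 0.1, 0.2, 1.5],
})

# One row per column: mean, median, and sample skewness.
shape_stats = toy.agg(["mean", "median", "skew"]).T

# A mean well above the median, together with positive skewness,
# indicates a right-skewed distribution.
right_skewed = shape_stats["mean"] > shape_stats["median"]
```

When a column is strongly right-skewed, the median is usually the more representative measure of a typical value.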
