<a href="https://colab.research.google.com/github/d-tomas/transform4europe/blob/main/notebooks/descriptive_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Descriptive statistics

In this *notebook* we will review some of the concepts of descriptive statistics: central tendency and dispersion. We will use Python's Pandas library to easily manipulate data tables and obtain these statistics.

## Initial setup

In [None]:
# Import the Python libraries used in this notebook

import pandas as pd  # Pandas makes it easy to manipulate large tables

# Round the output to two decimal places
%precision %.2f

We are going to work with a dataset in CSV (comma-separated values) format containing statistics on historical video game sales. Each row contains the following information:

* `Rank`: total sales ranking
* `Name`: name of the video game
* `Platform`: platform where the game was published (e.g. PC, PS4, ...)
* `Year`: year of publication
* `Genre`: genre of the game (e.g. action, shooter, sports, ...)
* `Publisher`: publishing company
* `NA_Sales`: sales in North America (in millions of copies)
* `EU_Sales`: sales in Europe (in millions of copies)
* `JP_Sales`: sales in Japan (in millions of copies)
* `Other_Sales`: sales in the rest of the world (in millions of copies)
* `Global_Sales`: total sales worldwide (in millions of copies)

In [None]:
# Load the data in a Pandas DataFrame
# The file is in CSV format

data = pd.read_csv('https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/video_game_sales.csv')
data  # Entering the name of the variable displays its content on the screen

## Central tendency

In [None]:
# Number of games for each platform

data['Platform'].value_counts()
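To make the output of `value_counts()` easier to read, here is a minimal sketch on a small hand-made Series. The name `toy_platforms` and its values are invented for illustration and are not part of the dataset. The `normalize=True` option returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy data, not taken from the video game dataset
toy_platforms = pd.Series(['PS4', 'PC', 'PS4', 'Wii', 'PS4', 'PC'])

# Number of occurrences of each value, most frequent first
counts = toy_platforms.value_counts()

# Same information expressed as a fraction of the total
shares = toy_platforms.value_counts(normalize=True)

print(counts)
print(shares)
```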

In [None]:
# Average number of copies sold worldwide per video game

data['Global_Sales'].mean() * 1000000  # Sales are stored in millions, so multiply by 1 million

In [None]:
# Median number of video game copies sold worldwide
# If we sort the list of values, the 'median' is the value right in the middle

data['Global_Sales'].median() * 1000000
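The mean and the median can differ a lot when a few values are much larger than the rest, which is common with sales figures. The sketch below uses an invented Series (`toy_sales`, not real data) with a single blockbuster to illustrate the effect.

```python
import pandas as pd

# Hypothetical toy data: sales in millions of copies, with one blockbuster
toy_sales = pd.Series([0.1, 0.2, 0.2, 0.3, 40.0])

toy_mean = toy_sales.mean()      # Pulled upwards by the single large value
toy_median = toy_sales.median()  # Middle value of the sorted list

print(toy_mean, toy_median)
```

In this toy example the mean is far larger than the median, because the median only depends on the value in the middle of the sorted list.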

In [None]:
# Mode of video game copies sold worldwide
# The 'mode' is the most frequent value; 'mode()' returns a Series because several values can be tied

data['Global_Sales'].mode() * 1000000
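Unlike `mean()` and `median()`, `mode()` returns a Series, since more than one value can be the most frequent. A minimal sketch with invented values (`toy_sales` is a hypothetical name) shows a tie between two values:

```python
import pandas as pd

# Hypothetical toy data with two values that appear twice each
toy_sales = pd.Series([0.02, 0.02, 0.05, 0.05, 0.10])

# All tied modes are returned, sorted in ascending order
toy_modes = toy_sales.mode()

# Pick the first mode if a single number is needed
first_mode = toy_modes.iloc[0]

print(toy_modes.tolist(), first_mode)
```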

## Dispersion

In [None]:
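# Minimum value of each numeric column
# 'numeric_only=True' skips text columns, which may fail to compare if they contain missing values

data.min(numeric_only=True)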

In [None]:
# Game with the lowest global sales
# With 'argmin' we can get the position of the minimum value of a column

data.iloc[data['Global_Sales'].argmin()]  # 'iloc' selects a row by its integer position
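`argmin` returns an integer position, which is why it is paired with `iloc`. When the index labels do not match the positions, `idxmin` together with `loc` is an alternative that works with labels. The sketch below uses an invented DataFrame (`toy`) whose index labels are deliberately different from the row positions.

```python
import pandas as pd

# Hypothetical toy table; index labels (10, 20, 30) differ from positions (0, 1, 2)
toy = pd.DataFrame(
    {'Name': ['A', 'B', 'C'], 'Global_Sales': [1.5, 0.3, 2.0]},
    index=[10, 20, 30],
)

position = toy['Global_Sales'].argmin()  # Integer position of the minimum
label = toy['Global_Sales'].idxmin()     # Index label of the minimum

row_by_position = toy.iloc[position]  # 'iloc' expects a position
row_by_label = toy.loc[label]         # 'loc' expects a label

print(position, label)
```

Both approaches select the same row; which one reads better depends on whether the DataFrame keeps its default index.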

In [None]:
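# Maximum value of each numeric column
# 'numeric_only=True' skips text columns, which may fail to compare if they contain missing values

data.max(numeric_only=True)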

In [None]:
# Best-selling game in Europe
# With 'argmax' we can get the position of the maximum value of a column

data.iloc[data['EU_Sales'].argmax()]

In [None]:
# Interquartile range (IQR) of worldwide sales
# Measures the spread of the central 50% of the samples: IQR = Q3 - Q1
# It ignores the 25% of games with the lowest sales (below Q1) and the 25% with the highest sales (above Q3)

(data['Global_Sales'].quantile(0.75) - data['Global_Sales'].quantile(0.25)) * 1000000
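As a worked example, the interquartile range is the third quartile (the 0.75 quantile) minus the first quartile (the 0.25 quantile). The sketch below computes it on an invented Series (`toy_sales`) small enough to check by hand. It relies on the default linear interpolation of `quantile`.

```python
import pandas as pd

# Hypothetical toy data: nine evenly spaced values
toy_sales = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = toy_sales.quantile(0.25)  # First quartile: 25% of the values lie below it
q3 = toy_sales.quantile(0.75)  # Third quartile: 75% of the values lie below it
iqr = q3 - q1                  # Width of the central 50% of the data

print(q1, q3, iqr)
```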

In [None]:
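# Standard deviation of all numeric columns
# 'numeric_only=True' skips text columns such as 'Name', which have no standard deviation

data.std(numeric_only=True)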
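By default, Pandas computes the sample standard deviation, dividing by `n - 1` (`ddof=1`). Passing `ddof=0` gives the population standard deviation instead. The following sketch compares both on an invented Series (`toy_sales`); the values were chosen so that the population result is a round number.

```python
import pandas as pd

# Hypothetical toy data with mean 5 and squared deviations summing to 32
toy_sales = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = toy_sales.std()            # Default ddof=1: divides by n - 1
population_std = toy_sales.std(ddof=0)  # ddof=0: divides by n

print(sample_std, population_std)
```

With thousands of rows, as in the video game dataset, the difference between the two is usually negligible.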

## References

* [Video Game Sales](https://www.kaggle.com/gregorut/videogamesales)