# Descriptive statistics

## Setup

In [1]:
import pandas as pd

## Creation

Create an example DataFrame from a dictionary of dictionaries (the outer keys become the columns, the inner keys become the row index):

In [2]:
data = {
    "Capital": {
        "Spain": "Madrid",
        "Belgium": "Brussels",
        "France": "Paris",
        "Italy": "Roma",
        "Germany": "Berlin",
        "Portugal": "Lisbon",
        "Norway": "Oslo",
        "Greece": "Athens",
    },
    "Population": {
        "Spain": 46733038,
        "Belgium": 11449656,
        "France": 67076000,
        "Italy": 60390560,
        "Germany": 83122889,
        "Portugal": 10295909,
        "Norway": 5391369,
        "Greece": 10718565,
    },
    "Monarch": {
        "Spain": "Felipe VI",
        "Belgium": "Philippe",
        "Norway": "Harald V",
    },
    "Area": {
        "Spain": 505990,
        "Belgium": 30688,
        "France": 640679,
        "Italy": 301340,
        "Germany": 357022,
        "Portugal": 92212,
        "Norway": None,
        "Greece": 131957,
    },
}

In [3]:
# Preparation steps (building the DataFrame and fixing dtypes); the details can be ignored for now:
df = pd.DataFrame(data)
df["Capital"] = df["Capital"].astype("string")
df["Monarch"] = df["Monarch"].astype("string")
df["Area"] = df["Area"].astype("Int64")
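
The cast to `Int64` (capital "I") is what lets the "Area" column keep whole numbers even though Norway's value is missing. A minimal sketch of that idea, using a hypothetical two-country Series rather than the full DataFrame:

```python
import pandas as pd

# Hypothetical sample: one known area and one missing value.
area = pd.Series({"Belgium": 30688, "Norway": None})

# Without an explicit cast, the missing value prevents a plain integer dtype.
print(area.dtype)

# The nullable integer dtype stores whole numbers and marks the gap as missing.
area_int = area.astype("Int64")
print(area_int.dtype)
print(area_int)
```

The same reasoning applies to the `"string"` dtype used for the text columns: it is a dedicated text dtype with its own missing-value marker.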

Apple stock data, taken from the [`matplotlib` sample datasets](https://github.com/matplotlib/sample_data/blob/master/aapl.csv):

In [4]:
# Preparation steps (loading the file, parsing dates, indexing by date); the details can be ignored for now:
apple = pd.read_csv("AAPL.csv")
apple["Date"] = apple["Date"].astype("datetime64[ns]")
apple = apple.set_index("Date")
apple = apple.sort_index()
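
The loading steps above can also be folded into the `read_csv` call itself. The sketch below is one possible variant, not the notebook's own method. It assumes the file has a `Date` column followed by the price and volume columns, and it reads from an in-memory string (`csv_text`, a hypothetical stand-in for `AAPL.csv`) so that it runs without the file:

```python
import io

import pandas as pd

# Hypothetical stand-in for AAPL.csv: three rows, deliberately out of order.
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
1984-09-10,26.5,26.62,25.87,26.37,2346400,3.01
1984-09-07,26.5,26.87,26.25,26.5,2981600,3.02
1984-09-11,26.62,27.37,26.62,26.87,5444000,3.07
"""

apple_sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],  # convert the "Date" column to datetimes while reading
    index_col="Date",      # use it as the row index
).sort_index()             # put the rows in chronological order

print(apple_sample.index)
```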

## Demo 1: Descriptive statistics

In [5]:
df

| | Capital | Population | Monarch | Area |
|---|---|---|---|---|
| Spain | Madrid | 46733038 | Felipe VI | 505990 |
| Belgium | Brussels | 11449656 | Philippe | 30688 |
| France | Paris | 67076000 | | 640679 |
| Italy | Roma | 60390560 | | 301340 |
| Germany | Berlin | 83122889 | | 357022 |
| Portugal | Lisbon | 10295909 | | 92212 |
| Norway | Oslo | 5391369 | Harald V | |
| Greece | Athens | 10718565 | | 131957 |


Get the sum of a column:

In [6]:
df["Population"].sum()

295177986

Get the mean of a column:

In [7]:
df["Population"].mean()

36897248.25

Get the minimum value of a column:

In [8]:
df["Area"].min()

30688

Get the maximum value of a column:

In [9]:
df["Area"].max()

640679
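
`min()` and `max()` return the values only. To find out which row they belong to, `idxmin()` and `idxmax()` return the matching index label instead. A small sketch on a subset of the "Area" data (the `area` Series here is rebuilt by hand for illustration):

```python
import pandas as pd

# Subset of the "Area" column, including Norway's missing value.
area = pd.Series(
    {"Spain": 505990, "Belgium": 30688, "France": 640679, "Norway": None},
    dtype="Int64",
)

# Missing values are skipped by default.
smallest = area.idxmin()  # index label of the smallest area
largest = area.idxmax()   # index label of the largest area

print(smallest, largest)
```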

Flag the missing values (`True` marks a missing entry):

In [10]:
df.isnull()

| | Capital | Population | Monarch | Area |
|---|---|---|---|---|
| Spain | False | False | False | False |
| Belgium | False | False | False | False |
| France | False | False | True | False |
| Italy | False | False | True | False |
| Germany | False | False | True | False |
| Portugal | False | False | True | False |
| Norway | False | False | False | True |
| Greece | False | False | True | False |


Calculate the sum of each column (here, the number of missing values per column):

In [11]:
df.isnull().sum()

Capital       0
Population    0
Monarch       5
Area          1
dtype: int64
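
Summing a boolean mask counts the `True` entries, because `True` is treated as 1 and `False` as 0. By the same logic, the mean of the mask gives the share of missing values. A minimal sketch on a hand-built subset of the "Monarch" column:

```python
import pandas as pd

# Subset of the "Monarch" column: two known values, two missing.
monarch = pd.Series(
    {"Spain": "Felipe VI", "France": None, "Norway": "Harald V", "Italy": None},
    dtype="string",
)

missing = monarch.isnull()      # boolean mask, True where the value is missing
n_missing = missing.sum()       # number of missing values
share_missing = missing.mean()  # fraction of missing values

print(n_missing, share_missing)
```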

Calculate the sum of each row (here, the number of missing values per country):

In [12]:
df.isnull().sum(axis="columns")

Spain       0
Belgium     0
France      1
Italy       1
Germany     1
Portugal    1
Norway      1
Greece      1
dtype: int64
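
The `axis` argument names the dimension that is collapsed. With the default `axis="index"` the rows are summed away, leaving one value per column. With `axis="columns"` the columns are summed away, leaving one value per row. A sketch on a small hypothetical DataFrame (`scores` is not part of the notebook data):

```python
import pandas as pd

# Hypothetical 3x2 DataFrame, small enough to check by hand.
scores = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30]},
    index=["x", "y", "z"],
)

col_totals = scores.sum(axis="index")    # one total per column
row_totals = scores.sum(axis="columns")  # one total per row

print(col_totals)
print(row_totals)
```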

Calculate the sum over both dimensions (the total number of missing values):

In [13]:
df.isnull().sum().sum()

6

Get descriptive statistics about numerical columns:

In [14]:
df.describe()

| | Population | Area |
|---|---|---|
| count | 8.0 | 7.0 |
| mean | 36897250.0 | 294269.714286 |
| std | 31005530.0 | 225632.728175 |
| min | 5391369.0 | 30688.0 |
| 25% | 10612900.0 | 112084.5 |
| 50% | 29091350.0 | 301340.0 |
| 75% | 62061920.0 | 431506.0 |
| max | 83122890.0 | 640679.0 |
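
Each row of the `describe()` output corresponds to a method that can also be called on its own. The sketch below rebuilds the "Population" column by hand and compares a few entries of `describe()` with the individual calls:

```python
import pandas as pd

# The "Population" column from the example DataFrame.
population = pd.Series(
    {
        "Spain": 46733038,
        "Belgium": 11449656,
        "France": 67076000,
        "Italy": 60390560,
        "Germany": 83122889,
        "Portugal": 10295909,
        "Norway": 5391369,
        "Greece": 10718565,
    }
)

summary = population.describe()

# Compare entries of the summary with the stand-alone methods.
print(summary["count"], population.count())
print(summary["mean"], population.mean())
print(summary["std"], population.std())
print(summary["25%"], population.quantile(0.25))
print(summary["50%"], population.median())
```

The "50%" row is the median, and "25%" and "75%" are the first and third quartiles.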


## Exercise 1

In [15]:
apple.head()

| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1984-09-07 | 26.5 | 26.87 | 26.25 | 26.5 | 2981600 | 3.02 |
| 1984-09-10 | 26.5 | 26.62 | 25.87 | 26.37 | 2346400 | 3.01 |
| 1984-09-11 | 26.62 | 27.37 | 26.62 | 26.87 | 5444000 | 3.07 |
| 1984-09-12 | 26.87 | 27.0 | 26.12 | 26.12 | 4773600 | 2.98 |
| 1984-09-13 | 27.5 | 27.62 | 27.5 | 27.5 | 7429600 | 3.14 |


Get the sum of the "Volume" column:

In [16]:
apple.Volume.sum()

82944013400

Get the mean of the "Open" column:

In [17]:
apple.Open.mean()

46.82351093570125

Get the median of the "Close" column:

In [18]:
apple.Close.median()

38.13

Get the minimum value of the "Low" column:

In [19]:
apple.Low.min()

12.72

Get the maximum value of the "High" column:

In [20]:
apple.High.max()

202.96

Calculate the number of null values in the entire DataFrame:

In [21]:
apple.isnull().sum().sum()

0

Get descriptive statistics about numerical columns:

In [22]:
apple.describe()

| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| count | 6081.0 | 6081.0 | 6081.0 | 6081.0 | 6081.0 | 6081.0 |
| mean | 46.823511 | 47.681506 | 45.913595 | 46.798619 | 13639860.0 | 23.529794 |
| std | 33.993517 | 34.578077 | 33.273106 | 33.947235 | 13521070.0 | 37.375601 |
| min | 12.88 | 13.19 | 12.72 | 12.94 | 88800.0 | 1.65 |
| 25% | 24.73 | 25.01 | 24.2 | 24.69 | 5530000.0 | 7.38 |
| 50% | 38.25 | 38.88 | 37.46 | 38.13 | 8976400.0 | 9.91 |
| 75% | 53.5 | 54.55 | 52.5 | 53.61 | 16319200.0 | 14.36 |
| max | 200.59 | 202.96 | 197.8 | 199.83 | 265069000.0 | 199.83 |
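
When only a few statistics are needed, `agg()` with a list of method names is one way to compute just those, as an alternative to the full `describe()` table. The sketch below uses the first five rows of the "Open" and "Close" columns, typed in by hand so that it runs without the CSV file:

```python
import pandas as pd

# First five rows of "Open" and "Close", copied from the head of the data.
sample = pd.DataFrame(
    {
        "Open": [26.5, 26.5, 26.62, 26.87, 27.5],
        "Close": [26.5, 26.37, 26.87, 26.12, 27.5],
    },
    index=pd.to_datetime(
        ["1984-09-07", "1984-09-10", "1984-09-11", "1984-09-12", "1984-09-13"]
    ),
)

# One row per requested statistic, one column per original column.
stats = sample.agg(["min", "max", "median"])

print(stats)
```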
