# Summary Statistics

pandas provides several methods for calculating summary statistics on DataFrame columns or rows. Summary statistics give a quick overview of the data and its characteristics. Here are some common methods for computing summary statistics in pandas:

In [3]:
import pandas as pd

In [4]:
path = r"C:\Users\Alysson\Downloads\database.xlsx"

data = pd.read_excel(path)
data.head()

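|  | name | breed | color | height_cm | weight_kg | age |
|---|---|---|---|---|---|---|
| 0 | Paçoca | Labrador | Brown | 56 | 25 | 5 |
| 1 | Ivo | Poodle | Black | 43 | 22 | 4 |
| 2 | Lola | Schnauzer | Gray | 49 | 23 | 4 |
| 3 | Maracatu | King Cavalier | Brown | 43 | 21 | 3 |
| 4 | Chantal | Labrador | Black | 59 | 29 | 6 |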
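
For readers without access to the Excel file, the frame can be rebuilt by hand. The sketch below assumes the file holds only the five rows shown by data.head(), which is consistent with the outputs that follow (for instance, a mean weight of 24.0). It is a stand-in for following along, not the original data source.

```python
import pandas as pd

# Hypothetical stand-in for database.xlsx: the five rows shown in the output above.
data = pd.DataFrame(
    {
        "name": ["Paçoca", "Ivo", "Lola", "Maracatu", "Chantal"],
        "breed": ["Labrador", "Poodle", "Schnauzer", "King Cavalier", "Labrador"],
        "color": ["Brown", "Black", "Gray", "Brown", "Black"],
        "height_cm": [56, 43, 49, 43, 59],
        "weight_kg": [25, 22, 23, 21, 29],
        "age": [5, 4, 4, 3, 6],
    }
)

print(data.shape)  # (5, 6)
```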


In [5]:
data['weight_kg'].mean()

24.0

In [6]:
data[['weight_kg']].mode()

|  | weight_kg |
|---|---|
| 0 | 21 |
| 1 | 22 |
| 2 | 23 |
| 3 | 25 |
| 4 | 29 |


In [7]:
data['weight_kg'].min()

21

In [8]:
data['weight_kg'].max()

29

In [9]:
data['weight_kg'].var()

10.0

In [10]:
data['weight_kg'].std()

3.1622776601683795
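
The variance and standard deviation above are sample statistics: pandas divides by n - 1 by default (ddof=1). A minimal sketch of the difference, using a hand-built Series with the same five weights in place of the Excel column:

```python
import pandas as pd

# Same five weights as the weight_kg column, rebuilt by hand for illustration.
weights = pd.Series([25, 22, 23, 21, 29], name="weight_kg")

sample_var = weights.var()            # default ddof=1: divides by n - 1
population_var = weights.var(ddof=0)  # ddof=0: divides by n

print(sample_var)      # 10.0
print(population_var)  # 8.0
```

Pass ddof=0 to std() in the same way when the data is the whole population rather than a sample.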

In [11]:
def pct30(column):
    return column.quantile(0.3)

data['weight_kg'].agg(pct30)

22.2

In [12]:
data[['weight_kg','height_cm']].agg(pct30)

weight_kg    22.2
height_cm    44.2
dtype: float64

In [13]:
def pct40(column):
    return column.quantile(0.4)

data['weight_kg'].agg([pct30,pct40])

pct30    22.2
pct40    22.6
Name: weight_kg, dtype: float64
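
When the only goal is several percentiles, a custom function per percentile is not strictly needed: quantile() also accepts a list of probabilities. A small sketch, again on a hand-built Series rather than the Excel column:

```python
import pandas as pd

# Same five weights as the weight_kg column, rebuilt by hand for illustration.
weights = pd.Series([25, 22, 23, 21, 29], name="weight_kg")

# A list of probabilities returns a Series indexed by those probabilities.
quantiles = weights.quantile([0.3, 0.4])
print(quantiles)
```

The custom-function route with agg() remains useful when the result needs a readable label such as pct30, or when the function does more than one built-in call.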

## Cumulative Statistics

Cumulative methods such as cumsum(), cummax(), cummin(), and cumprod() return a Series or DataFrame of the same length as the input, holding the running result up to and including each row.

In [15]:
data['weight_kg'].cumsum()

0     25
1     47
2     70
3     91
4    120
Name: weight_kg, dtype: int64

In [16]:
data['weight_kg'].cummax()

0    25
1    25
2    25
3    25
4    29
Name: weight_kg, dtype: int64

In [17]:
data['weight_kg'].cummin()

0    25
1    22
2    22
3    21
4    21
Name: weight_kg, dtype: int64

In [18]:
data['weight_kg'].cumprod()

0         25
1        550
2      12650
3     265650
4    7703850
Name: weight_kg, dtype: int64
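
One way to read these outputs: the last value of a cumulative result equals the corresponding plain aggregate. The sketch below checks this for cumsum() and derives a running share of the total; the variable names are illustrative only.

```python
import pandas as pd

# Same five weights as the weight_kg column, rebuilt by hand for illustration.
weights = pd.Series([25, 22, 23, 21, 29], name="weight_kg")

running_total = weights.cumsum()

# The final running total is the overall sum.
print(running_total.iloc[-1] == weights.sum())  # True

# Dividing by the overall sum gives the cumulative share at each row.
running_share = running_total / weights.sum()
print(running_share)
```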

## Counting

In [21]:
data["breed"].value_counts()

Labrador         2
Poodle           1
Schnauzer        1
King Cavalier    1
Name: breed, dtype: int64

In [22]:
data["breed"].value_counts(sort=True)

Labrador         2
Poodle           1
Schnauzer        1
King Cavalier    1
Name: breed, dtype: int64

In [20]:
data.drop_duplicates(subset="breed")

|  | name | breed | color | height_cm | weight_kg | age |
|---|---|---|---|---|---|---|
| 0 | Paçoca | Labrador | Brown | 56 | 25 | 5 |
| 1 | Ivo | Poodle | Black | 43 | 22 | 4 |
| 2 | Lola | Schnauzer | Gray | 49 | 23 | 4 |
| 3 | Maracatu | King Cavalier | Brown | 43 | 21 | 3 |


In [23]:
data["breed"].value_counts(normalize=True)

Labrador         0.4
Poodle           0.2
Schnauzer        0.2
King Cavalier    0.2
Name: breed, dtype: float64
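
The three counting calls are related: value_counts() counts rows per category, normalize=True divides those counts by the number of rows, and nunique() reports how many distinct categories exist. A short sketch on a hand-built Series with the same five breeds:

```python
import pandas as pd

# Same five breeds as the breed column, rebuilt by hand for illustration.
breeds = pd.Series(
    ["Labrador", "Poodle", "Schnauzer", "King Cavalier", "Labrador"],
    name="breed",
)

n_breeds = breeds.nunique()                    # number of distinct breeds
counts = breeds.value_counts()                 # rows per breed, most frequent first
shares = breeds.value_counts(normalize=True)   # counts divided by the row count

print(n_breeds)            # 4
print(counts["Labrador"])  # 2
```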

## Group Summary Statistics

The groupby() function groups rows by the values of one or more columns; aggregation functions applied to the result then compute summary statistics for each group.

In [24]:
data[data["color"] == "Black"]['weight_kg'].mean()

25.5

In [25]:
# the same statistic for every color at once, using groupby

data.groupby("color")['weight_kg'].mean()

color
Black    25.5
Brown    23.0
Gray     23.0
Name: weight_kg, dtype: float64

In [26]:
data.groupby("color")['weight_kg'].agg(["min", "max", "sum"])

| color | min | max | sum |
|---|---|---|---|
| Black | 22 | 29 | 51 |
| Brown | 21 | 25 | 46 |
| Gray | 23 | 23 | 23 |


In [28]:
data.groupby(["color","breed"])[['weight_kg','height_cm']].mean()

| color | breed | weight_kg | height_cm |
|---|---|---|---|
| Black | Labrador | 29.0 | 59.0 |
| Black | Poodle | 22.0 | 43.0 |
| Brown | King Cavalier | 21.0 | 43.0 |
| Brown | Labrador | 25.0 | 56.0 |
| Gray | Schnauzer | 23.0 | 49.0 |
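
When different columns need different statistics, pandas also supports named aggregation, where each output column is given as name=(source column, function). The sketch below is one possible use on a hand-built frame with the same values; the output column names (min_weight, max_weight, mean_height) are made up for illustration.

```python
import pandas as pd

# Hand-built subset of the columns used above, for illustration only.
data = pd.DataFrame(
    {
        "color": ["Brown", "Black", "Gray", "Brown", "Black"],
        "weight_kg": [25, 22, 23, 21, 29],
        "height_cm": [56, 43, 49, 43, 59],
    }
)

# Each keyword becomes an output column: name=(source column, aggregation).
summary = data.groupby("color").agg(
    min_weight=("weight_kg", "min"),
    max_weight=("weight_kg", "max"),
    mean_height=("height_cm", "mean"),
)
print(summary)
```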


## Pivot tables

Pivot tables are a powerful tool for reshaping and summarizing data. They transform a DataFrame by specifying which columns become the new index, which become the new columns, and which values are aggregated.

In [29]:
data.groupby("color")['weight_kg'].mean()

color
Black    25.5
Brown    23.0
Gray     23.0
Name: weight_kg, dtype: float64

In [41]:
# the same result with a pivot table
# aggfunc defaults to the mean

data.pivot_table(values="weight_kg",index="color")

| color | weight_kg |
|---|---|
| Black | 25.5 |
| Brown | 23.0 |
| Gray | 23.0 |


In [34]:
import numpy as np

# computing different statistics

data.pivot_table(values="weight_kg",index="color", aggfunc=[np.mean,np.median,np.sum])

| color | mean (weight_kg) | median (weight_kg) | sum (weight_kg) |
|---|---|---|---|
| Black | 25.5 | 25.5 | 51 |
| Brown | 23.0 | 23.0 | 46 |
| Gray | 23.0 | 23.0 | 23 |


In [38]:
# adding a second grouping variable as columns

data.pivot_table(values="weight_kg",index="color",columns="breed", aggfunc=np.sum)

| color / breed | King Cavalier | Labrador | Poodle | Schnauzer |
|---|---|---|---|---|
| Black | NaN | 29.0 | 22.0 | NaN |
| Brown | 21.0 | 25.0 | NaN | NaN |
| Gray | NaN | NaN | NaN | 23.0 |


In [39]:
# filling NaN values with 0

data.pivot_table(values="weight_kg",index="color",columns="breed",fill_value=0, aggfunc=np.sum)

| color / breed | King Cavalier | Labrador | Poodle | Schnauzer |
|---|---|---|---|---|
| Black | 0 | 29 | 22 | 0 |
| Brown | 21 | 25 | 0 | 0 |
| Gray | 0 | 0 | 0 | 23 |


In [40]:
# adding margins (row and column totals)

data.pivot_table(values="weight_kg",index="color",columns="breed",fill_value=0, margins=True,  aggfunc=np.sum)

| color / breed | King Cavalier | Labrador | Poodle | Schnauzer | All |
|---|---|---|---|---|---|
| Black | 0 | 29 | 22 | 0 | 51 |
| Brown | 21 | 25 | 0 | 0 | 46 |
| Gray | 0 | 0 | 0 | 23 | 23 |
| All | 21 | 54 | 22 | 23 | 120 |
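
A two-way pivot table like the ones above can also be seen as a groupby on both columns followed by unstack(), which moves the inner index level into the columns. The sketch below compares the two routes on a hand-built frame with the same values; it passes the aggregation as the string "sum" rather than np.sum, which is an equivalent spelling chosen here for brevity.

```python
import pandas as pd

# Hand-built subset of the columns used above, for illustration only.
data = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Schnauzer", "King Cavalier", "Labrador"],
        "color": ["Brown", "Black", "Gray", "Brown", "Black"],
        "weight_kg": [25, 22, 23, 21, 29],
    }
)

pivoted = data.pivot_table(
    values="weight_kg", index="color", columns="breed",
    aggfunc="sum", fill_value=0,
)

# Group by both keys, then move breed from the index into the columns.
grouped = data.groupby(["color", "breed"])["weight_kg"].sum().unstack(fill_value=0)

# Both routes produce the same labels and the same numbers.
same_values = (pivoted.to_numpy() == grouped.to_numpy()).all()
print(same_values)
```

The pivot_table form tends to be easier to read when margins or several aggregation functions are needed; the groupby form composes more naturally with further method chaining.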
