# Aggregation and Grouping Data

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.
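
As a quick illustration of these aggregations, the sketch below applies each of them to a small pandas Series. The values and the names `scores` and `summary` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
scores = pd.Series([89, 87, 67, 55, 47, 72])

# Each aggregation collapses the whole Series into a single number
summary = {
    "sum": scores.sum(),
    "mean": scores.mean(),
    "median": scores.median(),
    "min": scores.min(),
    "max": scores.max(),
}
print(summary)
```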

# Descriptive or Summary Statistics in pandas: describe()

For numeric columns, describe() reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. The interquartile range (IQR) is the difference between the 75% and 25% values.

By default, describe() excludes character (object) columns and summarizes only the numeric columns.
To get summary statistics for both numeric and character columns, pass the argument include='all'.


In [1]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Alisa','Bobby','Cathrine','Madonna','Rocky','Sebastian','Jaqluine',
   'Rahul','David','Andrew','Ajay','Teresa']),
   'Age':pd.Series([26,27,25,24,31,27,25,33,42,32,51,47]),
   'Score':pd.Series([89,87,67,55,47,72,76,79,44,92,99,69])}
 
#Create a DataFrame
df = pd.DataFrame(d)

In [2]:
df

Unnamed: 0,Age,Name,Score
0,26,Alisa,89
1,27,Bobby,87
2,25,Cathrine,67
3,24,Madonna,55
4,31,Rocky,47
5,27,Sebastian,72
6,25,Jaqluine,76
7,33,Rahul,79
8,42,David,44
9,32,Andrew,92


In [3]:
#Descriptive or Summary Statistic 

df.describe()

Unnamed: 0,Age,Score
count,12.0,12.0
mean,32.5,73.0
std,9.209679,17.653225
min,24.0,44.0
25%,25.75,64.0
50%,29.0,74.0
75%,35.25,87.5
max,51.0,99.0


In [4]:
#Descriptive or Summary Statistic of the character columns:

df.describe(include=['object'])

Unnamed: 0,Name
count,12
unique,12
top,David
freq,1
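
The include='all' option mentioned at the start of this section can be sketched as follows. The small frame `sample` is a hypothetical stand-in for the full dataset, kept short so that the block runs on its own.

```python
import pandas as pd

# Small hypothetical frame with one text column and one numeric column
sample = pd.DataFrame({
    "Name": ["Alisa", "Bobby", "Cathrine"],
    "Age": [26, 27, 25],
})

# include="all" reports every column in one table; statistics that do not
# apply to a column's type (e.g. the mean of a text column) show up as NaN
full_summary = sample.describe(include="all")
print(full_summary)
```

The result combines the rows seen in the two earlier outputs (count, unique, top, freq, mean, std, and the quartiles) into a single table.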


# Mean Function in Python pandas

mean() calculates the arithmetic mean of a set of numbers. In pandas it can be applied to a whole DataFrame, to a single column, or across rows; let's see an example of each. For a plain Python list, the standard-library statistics module provides statistics.mean(); pandas itself does not depend on it.

In [8]:
import statistics
 
print(statistics.mean([1,9,5,6,6,7]))
print(statistics.mean([4,-11,-5,16,5,7]))

5.666666666666667
2.6666666666666665


In [9]:
d = {
    'Name':['Alisa','Bobby','Cathrine','Madonna','Rocky','Sebastian','Jaqluine',
   'Rahul','David','Andrew','Ajay','Teresa'],
   'Score1':[62,47,55,74,31,77,85,63,42,32,71,57],
   'Score2':[89,87,67,55,47,72,76,79,44,92,99,69]}
 
 
df = pd.DataFrame(d)

In [10]:
df

Unnamed: 0,Name,Score1,Score2
0,Alisa,62,89
1,Bobby,47,87
2,Cathrine,55,67
3,Madonna,74,55
4,Rocky,31,47
5,Sebastian,77,72
6,Jaqluine,85,76
7,Rahul,63,79
8,David,42,44
9,Andrew,32,92


In [11]:
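# numeric_only=True skips the text column Name; pandas 2.0+ raises a TypeError without it

df.mean(numeric_only=True)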

Score1    58.0
Score2    73.0
dtype: float64

In [12]:
# Column mean of the DataFrame
# axis=0 computes the mean of each column; numeric_only=True skips the text column Name

df.mean(axis=0, numeric_only=True)


Score1    58.0
Score2    73.0
dtype: float64

In [13]:
# Row mean of the DataFrame
# axis=1 computes the mean across each row; numeric_only=True skips the text column Name

df.mean(axis=1, numeric_only=True)

0     75.5
1     67.0
2     61.0
3     64.5
4     39.0
5     74.5
6     80.5
7     71.0
8     43.0
9     62.0
10    85.0
11    63.0
dtype: float64

In [14]:
# mean of the specific column

df.loc[:,"Score1"].mean()

58.0
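
Because the frame above contains a text column (Name), the aggregation has to be limited to the numeric columns; in pandas 2.0 and later, numeric_only defaults to False and a mixed frame raises a TypeError. Below is a minimal sketch of two equivalent ways to do this, using a shortened hypothetical copy of the data; the names `means_a` and `means_b` are illustrative.

```python
import pandas as pd

# Shortened, hypothetical copy of the data used above
df = pd.DataFrame({
    "Name": ["Alisa", "Bobby", "Cathrine"],
    "Score1": [62, 47, 55],
    "Score2": [89, 87, 67],
})

# Option 1: ask the aggregation itself to skip non-numeric columns
means_a = df.mean(numeric_only=True)

# Option 2: select the numeric columns first, then aggregate
means_b = df.select_dtypes(include="number").mean()

print(means_a.equals(means_b))
```

Option 2 can be convenient when several aggregations are applied to the same numeric subset, since the selection is done only once.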

# Median Function in Python pandas

median() calculates the median, or middle value, of a set of numbers. In pandas it can be applied to a whole DataFrame, to a single column, or across rows; let's see an example of each.

In [15]:
import statistics
 
print(statistics.median([1,9,5,6,8,7]))
print(statistics.median([4,-11,-5,16,5,7,9]))

6.5
5


In [16]:
# Create a DataFrame
d = {
    'Name':['Alisa','Bobby','Cathrine','Madonna','Rocky','Sebastian','Jaqluine',
   'Rahul','David','Andrew','Ajay','Teresa'],
   'Score1':[62,47,55,74,31,77,85,63,42,32,71,57],
   'Score2':[89,87,67,55,47,72,76,79,44,92,99,69],
   'Score3':[56,86,77,45,73,62,74,89,71,67,97,68]}
 
df = pd.DataFrame(d)
df

Unnamed: 0,Name,Score1,Score2,Score3
0,Alisa,62,89,56
1,Bobby,47,87,86
2,Cathrine,55,67,77
3,Madonna,74,55,45
4,Rocky,31,47,73
5,Sebastian,77,72,62
6,Jaqluine,85,76,74
7,Rahul,63,79,89
8,David,42,44,71
9,Andrew,32,92,67


In [17]:
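# numeric_only=True skips the text column Name; pandas 2.0+ raises a TypeError without it

df.median(numeric_only=True)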

Score1    59.5
Score2    74.0
Score3    72.0
dtype: float64

In [18]:
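# Column median of the DataFrame
# axis=0 computes the median of each column; numeric_only=True skips the text column Name

df.median(axis=0, numeric_only=True)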

Score1    59.5
Score2    74.0
Score3    72.0
dtype: float64
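
The row-wise median and the median of a single column follow the same pattern as the mean examples earlier. This is a minimal sketch on a shortened hypothetical copy of the data; `row_median` and `col_median` are illustrative names.

```python
import pandas as pd

# Shortened, hypothetical copy of the data used above
df = pd.DataFrame({
    "Name": ["Alisa", "Bobby", "Cathrine"],
    "Score1": [62, 47, 55],
    "Score2": [89, 87, 67],
    "Score3": [56, 86, 77],
})

# Row median: axis=1 takes the middle value across the score columns of each row
row_median = df.median(axis=1, numeric_only=True)

# Median of one specific column
col_median = df.loc[:, "Score1"].median()

print(row_median)
print(col_median)
```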

# Mode Function in pandas

mode() calculates the mode, the most frequently occurring value, of a set of values. In pandas it can be applied to a whole DataFrame, to a single column, or across rows; let's see an example of each. For a plain Python list, the standard-library statistics module provides statistics.mode(); pandas itself does not depend on it.

In [19]:
print(statistics.mode([1,5,5,7,5,6,8,7]))
print(statistics.mode(['lion', 'cat', 'cat','dog','tiger']))

5
cat


In [20]:
d = {
    'Name':['Alisa','Bobby','Cathrine','Madonna','Rocky','Sebastian','Jaqluine',
   'Rahul','David','Andrew','Ajay','Teresa'],
   'Score1':[62,47,55,74,47,77,85,63,42,32,71,57],
   'Score2':[89,87,67,55,47,72,76,79,44,67,99,69],
   'Score3':[56,86,77,45,73,62,74,89,71,67,97,68]}
 
 
 
df = pd.DataFrame(d)
df

Unnamed: 0,Name,Score1,Score2,Score3
0,Alisa,62,89,56
1,Bobby,47,87,86
2,Cathrine,55,67,77
3,Madonna,74,55,45
4,Rocky,47,47,73
5,Sebastian,77,72,62
6,Jaqluine,85,76,74
7,Rahul,63,79,89
8,David,42,44,71
9,Andrew,32,67,67


In [21]:
# mode of the dataframe

df.mode()

Unnamed: 0,Name,Score1,Score2,Score3
0,Ajay,47.0,67.0,45
1,Alisa,,,56
2,Andrew,,,62
3,Bobby,,,67
4,Cathrine,,,68
5,David,,,71
6,Jaqluine,,,73
7,Madonna,,,74
8,Rahul,,,77
9,Rocky,,,86


In [22]:
# mode of the specific column

df.loc[:,"Score1"].mode()

0    47
dtype: int64
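
The row-wise mode can be sketched in the same way with axis=1. The data below is hypothetical and chosen so that every row has exactly one most frequent score; when a row has several modes, the result gains extra columns, padded with NaN where a row has fewer modes.

```python
import pandas as pd

# Hypothetical data in which each row repeats one score exactly twice
df = pd.DataFrame({
    "Name": ["Alisa", "Bobby", "Cathrine"],
    "Score1": [47, 62, 55],
    "Score2": [47, 89, 67],
    "Score3": [73, 62, 67],
})

# axis=1 finds the most frequent value within each row;
# numeric_only=True leaves the text column Name out of the comparison
row_mode = df.mode(axis=1, numeric_only=True)

print(row_mode)
```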