# Data Grouping, Aggregation, and Merging with Pandas
Often, you need to group data based on the values in a particular column.
Take the Titanic dataset that you have seen in
previous chapters. What if you want to find the maximum fare paid by
male and female passengers? Or which town of embarkation
had the oldest passengers? In such cases, you need to
create groups of records based on the column values. You can then
find the maximum, minimum, mean, or other summary statistics within each
group.
Furthermore, you might want to merge or concatenate dataframes if
your dataset is distributed among multiple files. Finally, you might
need to change the orientation of your dataframe and discretize
certain columns.
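
As a quick preview, the two questions above can be sketched on a tiny, made-up table. The `toy` dataframe and its values are hypothetical stand-ins for the Titanic columns, chosen so the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns (not the real dataset)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Cherbourg"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

# Maximum fare paid within each 'sex' group
max_fare_by_sex = toy.groupby("sex")["fare"].max()

# Age of the oldest passenger within each 'embark_town' group
oldest_by_town = toy.groupby("embark_town")["age"].max()

print(max_fare_by_sex)
print(oldest_by_town)
```

The same two calls should work on the real dataset once it is loaded, as shown in the next section.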

## 1. Grouping Data with GroupBy
You will be using the Titanic dataset for various GroupBy functions in
this section.

In [1]:
# Import necessary libraries for data analysis and visualization
import matplotlib.pyplot as plt       # For creating visualizations
import seaborn as sns                 # For statistical data visualization and built-in datasets
import pandas as pd                   # For data manipulation and analysis

# Set the aesthetic style of the plots to 'darkgrid'
sns.set_style("darkgrid")

# Load the Titanic dataset from seaborn's built-in dataset library
titanic_data = sns.load_dataset('titanic')

# Display the first 5 rows of the Titanic dataset
titanic_data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
# Group the Titanic dataset by the values in the "class" column.
# This creates a DataFrameGroupBy object, which can be used to perform
# operations (like aggregation) on each group separately.
# Explicitly set observed=False to keep current behavior and silence the warning
titanic_gbclass = titanic_data.groupby("class", observed=False)

# Check the type of the resulting grouped object.
# This will return: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
type(titanic_gbclass)

pandas.core.groupby.generic.DataFrameGroupBy

In [4]:
# Get the number of groups in the GroupBy object 'titanic_gbclass'
# This is useful when you've grouped a DataFrame and want to know how many distinct groups were formed
titanic_gbclass.ngroups

3

In [5]:
# 'titanic_gbclass' is the GroupBy object created above by grouping on the "class" column.
# The .size() method returns the number of rows in each group.

titanic_gbclass.size()

class
First     216
Second    184
Third     491
dtype: int64

In [6]:
# Access the group of passengers in the "First" class from a GroupBy object named 'titanic_gbclass'
# 'titanic_gbclass.groups' is a dictionary where keys are the group names (e.g., "First", "Second", "Third")
# and values are lists or Index objects containing the row indices of each group.
titanic_gbclass.groups["First"]

Index([  1,   3,   6,  11,  23,  27,  30,  31,  34,  35,
       ...
       853, 856, 857, 862, 867, 871, 872, 879, 887, 889],
      dtype='int64', length=216)

In [7]:
# 'titanic_gbclass' is the GroupBy object grouped by the "class" column.
# The last() method returns, for each group, the last non-null value found in each column.
# Because nulls are skipped, the values in one output row can come from different original rows.

titanic_gbclass.last()

class,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,adult_male,deck,embark_town,alive,alone
First,1,1,male,26.0,0,0,30.0,C,man,True,C,Cherbourg,yes,True
Second,0,2,male,27.0,0,0,13.0,S,man,True,E,Southampton,no,True
Third,0,3,male,32.0,0,0,7.75,Q,man,True,E,Queenstown,no,True
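
Note that `last()` skips missing values column by column, so it does not necessarily return an intact row. The following minimal sketch contrasts it with `tail(1)`, which returns the actual final row of each group. The `toy` dataframe and the column name `cls` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the final row of each group has a missing 'deck'
toy = pd.DataFrame({
    "cls": ["A", "A", "B", "B"],
    "deck": ["C", np.nan, "E", np.nan],
    "fare": [30.0, 10.0, 13.0, 7.75],
})
grouped = toy.groupby("cls")

# last(): last non-null value per column, so 'deck' comes from an earlier row
last_non_null = grouped.last()

# tail(1): the actual last row of each group, missing values included
last_rows = grouped.tail(1)

print(last_non_null)
print(last_rows)
```

Use `tail(1)` when you need the final record exactly as it appears in the data.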


In [8]:
# Retrieve the rows from the 'titanic_gbclass' GroupBy object where the class is "Second"
titanic_second_class = titanic_gbclass.get_group("Second")

# Display the first 5 rows of the 'Second' class passengers
titanic_second_class.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
15,1,2,female,55.0,0,0,16.0,S,Second,woman,False,,Southampton,yes,True
17,1,2,male,,0,0,13.0,S,Second,man,True,,Southampton,yes,True
20,0,2,male,35.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
21,1,2,male,34.0,0,0,13.0,S,Second,man,True,D,Southampton,yes,True


In [9]:
# Get the maximum value of the 'age' column within each group of the 'titanic_gbclass' GroupBy object
titanic_gbclass.age.max()

class
First     80.0
Second    70.0
Third     74.0
Name: age, dtype: float64

In [10]:
# Apply aggregation functions to the 'fare' column of the 'titanic_gbclass' DataFrameGroupBy object.
# The aggregations performed are: maximum, minimum, count, median, and mean.
titanic_gbclass.fare.agg([
    'max',    # Maximum fare in each group
    'min',    # Minimum fare in each group
    'count',  # Number of non-null fare entries in each group
    'median', # Median fare value in each group
    'mean'    # Average fare in each group
])

class,max,min,count,median,mean
First,512.3292,0.0,216,60.2875,84.154687
Second,73.5,0.0,184,14.25,20.662183
Third,69.55,0.0,491,8.05,13.67555
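
When you want to aggregate several columns at once and control the output column names, pandas also supports named aggregation. The sketch below uses a hypothetical `toy` dataframe; the output names `max_fare`, `mean_fare`, and `known_ages` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    "cls": ["First", "First", "Third", "Third"],
    "fare": [80.0, 60.0, 8.0, 7.0],
    "age": [40.0, 30.0, 20.0, np.nan],
})

# Named aggregation: output_name=(input_column, aggregation_function)
summary = toy.groupby("cls").agg(
    max_fare=("fare", "max"),
    mean_fare=("fare", "mean"),
    known_ages=("age", "count"),  # 'count' ignores missing values
)

print(summary)
```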


## 2. Concatenating and Merging Data

### 2.1. Concatenating Data

In [12]:
# Import matplotlib and seaborn (seaborn provides the built-in Titanic dataset)
import matplotlib.pyplot as plt
import seaborn as sns

# Import pandas, which provides pd.concat() used in the cells below
import pandas as pd

# Load the Titanic dataset from seaborn's built-in datasets
titanic_data = sns.load_dataset('titanic')

In [13]:
# Filter the rows where the "class" column has the value "First"
# This gives us all passengers in the First class
titanic_pclass1_data = titanic_data[titanic_data["class"] == "First"]

# Print the shape (number of rows and columns) of the First class dataset
print(titanic_pclass1_data.shape)

# Filter the rows where the "class" column has the value "Second"
# This gives us all passengers in the Second class
titanic_pclass2_data = titanic_data[titanic_data["class"] == "Second"]

# Print the shape (number of rows and columns) of the Second class dataset
print(titanic_pclass2_data.shape)

(216, 15)
(184, 15)


In [15]:
# Combine the data from first-class and second-class Titanic passengers into one DataFrame
# `ignore_index=True` resets the index in the resulting DataFrame
final_data = pd.concat([titanic_pclass1_data, titanic_pclass2_data], ignore_index=True)

# Print the shape (number of rows and columns) of the combined DataFrame
print(final_data.shape)

(400, 15)


In [17]:
# Select the first 200 rows from final_data and store it in df1
df1 = final_data[:200]
# Print the shape (rows, columns) of df1
print(df1.shape)

# Select all rows from index 200 onwards and store it in df2
df2 = final_data[200:]
# Print the shape (rows, columns) of df2
print(df2.shape)

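# Concatenate df1 and df2 column-wise (axis=1); ignore_index=True relabels the columns 0..29.
# Rows are aligned on index labels: df1 holds labels 0-199 and df2 holds labels 200-399, so
# nothing overlaps and the result has 400 rows, with NaN wherever a frame has no row for a label.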
final_data2 = pd.concat([df1, df2], axis=1, ignore_index=True)
# Print the shape of the newly combined DataFrame
print(final_data2.shape)

(200, 15)
(200, 15)
(400, 30)
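
The 400-row result follows from index alignment: with `axis=1`, `pd.concat` matches rows by index label rather than by position. The sketch below, on a hypothetical four-row `toy` dataframe, shows one way to place the two halves side by side by resetting their indexes first.

```python
import pandas as pd

# Hypothetical toy data split into a top half (labels 0-1) and a bottom half (labels 2-3)
toy = pd.DataFrame({"a": range(4), "b": list("wxyz")})
top = toy[:2]
bottom = toy[2:]

# Index labels do not overlap, so every row is kept and padded with NaN
aligned = pd.concat([top, bottom], axis=1)
print(aligned.shape)

# Resetting both indexes makes the rows line up by position instead
side_by_side = pd.concat(
    [top.reset_index(drop=True), bottom.reset_index(drop=True)],
    axis=1,
)
print(side_by_side.shape)
```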


### 2.2. Merging Data
You can merge multiple dataframes based on common values
between any columns of the two dataframes.

In [18]:
# Import the pandas library and give it the alias 'pd'
import pandas as pd

# Define a list of dictionaries representing the first set of subject scores
scores1 = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Chemistry', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
]

# Define another list of dictionaries representing the second set of subject scores
scores2 = [
    {'Subject': 'Arts', 'Score': 70, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Physics', 'Score': 75, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'English', 'Score': 92, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'Chemistry', 'Score': 91, 'Grade': 'A', 'Remarks': 'Excellent'},
]

# Convert the first list of dictionaries into a Pandas DataFrame
scores1_df = pd.DataFrame(scores1)

# Convert the second list of dictionaries into a Pandas DataFrame
scores2_df = pd.DataFrame(scores2)

In [19]:
# Display the first 5 rows of the DataFrame named scores1_df
scores1_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Mathematics,85,B,Good
1,History,98,A,Excellent
2,English,76,C,Fair
3,Chemistry,72,C,Fair


In [20]:
# Display the first 5 rows of the DataFrame named scores2_df to preview its contents
scores2_df.head()

Unnamed: 0,Subject,Score,Grade,Remarks
0,Arts,70,C,Fair
1,Physics,75,C,Fair
2,English,92,A,Excellent
3,Chemistry,91,A,Excellent


In [21]:
# Perform an inner join on two DataFrames (scores1_df and scores2_df) using the 'Subject' column as the key.
# An inner join returns only the rows with matching values in both DataFrames.
join_inner_df = scores1_df.merge(scores2_df, on='Subject', how='inner')

# Display the first 5 rows of the resulting DataFrame to inspect the result of the join operation.
join_inner_df.head()

Unnamed: 0,Subject,Score_x,Grade_x,Remarks_x,Score_y,Grade_y,Remarks_y
0,English,76,C,Fair,92,A,Excellent
1,Chemistry,72,C,Fair,91,A,Excellent


In [22]:
# Perform a left join (merge) on two DataFrames: scores1_df and scores2_df
# The join is based on the 'Subject' column in both DataFrames
# 'how="left"' means all rows from scores1_df will be kept,
# and matching rows from scores2_df will be added where available
join_left_df = scores1_df.merge(scores2_df, on='Subject', how='left')

# Display the first 5 rows of the resulting merged DataFrame
join_left_df.head()

Unnamed: 0,Subject,Score_x,Grade_x,Remarks_x,Score_y,Grade_y,Remarks_y
0,Mathematics,85,B,Good,,,
1,History,98,A,Excellent,,,
2,English,76,C,Fair,92.0,A,Excellent
3,Chemistry,72,C,Fair,91.0,A,Excellent


In [23]:
# Perform a right join (right outer join) between scores1_df and scores2_df on the 'Subject' column.
# The resulting DataFrame includes all rows from scores2_df (the right DataFrame),
# and only the matching rows from scores1_df (the left DataFrame) based on the 'Subject' column.
join_right_df = scores1_df.merge(scores2_df, on='Subject', how='right')

# Display the first 5 rows of the resulting DataFrame to preview the join result.
join_right_df.head()

Unnamed: 0,Subject,Score_x,Grade_x,Remarks_x,Score_y,Grade_y,Remarks_y
0,Arts,,,,70,C,Fair
1,Physics,,,,75,C,Fair
2,English,76.0,C,Fair,92,A,Excellent
3,Chemistry,72.0,C,Fair,91,A,Excellent


In [24]:
# Perform an outer join on the two DataFrames: scores1_df and scores2_df
# The join is based on the 'Subject' column, so rows are matched where the 'Subject' values are the same
# An outer join returns all rows from both DataFrames, filling in NaNs where there is no match
# Note: head() shows only the first 5 of the 6 resulting rows (the 'Physics' row is not displayed)
join_outer_df = scores1_df.merge(scores2_df, on='Subject', how='outer')

# Display the first 5 rows of the resulting merged DataFrame
join_outer_df.head()

Unnamed: 0,Subject,Score_x,Grade_x,Remarks_x,Score_y,Grade_y,Remarks_y
0,Arts,,,,70.0,C,Fair
1,Chemistry,72.0,C,Fair,91.0,A,Excellent
2,English,76.0,C,Fair,92.0,A,Excellent
3,History,98.0,A,Excellent,,,
4,Mathematics,85.0,B,Good,,,
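
Two optional `merge` parameters can make joined output easier to read: `suffixes` replaces the default `_x` and `_y` column endings, and `indicator=True` adds a `_merge` column recording which dataframe each row came from. A minimal sketch on hypothetical `left` and `right` dataframes:

```python
import pandas as pd

# Hypothetical toy score tables that share only the 'English' subject
left = pd.DataFrame({"Subject": ["Mathematics", "English"], "Score": [85, 76]})
right = pd.DataFrame({"Subject": ["English", "Physics"], "Score": [92, 75]})

merged = left.merge(
    right,
    on="Subject",
    how="outer",
    suffixes=("_first", "_second"),  # instead of the default _x / _y
    indicator=True,                  # adds a '_merge' column
)

print(merged)
```

The `_merge` column takes the values `left_only`, `right_only`, or `both`, which is handy for checking how many rows actually matched.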


## 3. Removing Duplicates
Your datasets will often contain duplicate values, and frequently, you
will need to remove these duplicate values. In this section, you will
see how to remove duplicate values from your Pandas dataframes.

In [25]:
# Import the pandas library and give it an alias 'pd'
import pandas as pd

# Define a list of lists where each sublist contains:
# [Subject name, Score, Another subject name]
scores = [
    ['Mathematics', 85, 'Science'],
    ['English', 91, 'Arts'],
    ['History', 95, 'Chemistry'],
    ['History', 95, 'Chemistry'],
    ['English', 95, 'Chemistry'],
]

# Create a DataFrame from the list of scores
# The DataFrame has 3 columns: 'Subject', 'Score', and 'Subject' again (duplicate column names)
# Pandas allows duplicate column names, but it's generally not recommended
my_df = pd.DataFrame(scores, columns=['Subject', 'Score', 'Subject'])

# Display the first 5 rows of the DataFrame
my_df.head()

Unnamed: 0,Subject,Score,Subject.1
0,Mathematics,85,Science
1,English,91,Arts
2,History,95,Chemistry
3,History,95,Chemistry
4,English,95,Chemistry


### 3.1. Removing Duplicate Rows
To remove duplicate rows, you can call the drop_duplicates() method,
which keeps the first instance and removes all the duplicate rows.

In [26]:
# Remove duplicate rows from the DataFrame `my_df`
result = my_df.drop_duplicates()

# Display the first 5 rows of the resulting DataFrame
result.head()

Unnamed: 0,Subject,Score,Subject.1
0,Mathematics,85,Science
1,English,91,Arts
2,History,95,Chemistry
4,English,95,Chemistry


In [27]:
# Remove duplicate rows from the DataFrame `my_df`, keeping the **last occurrence** of each duplicate
result = my_df.drop_duplicates(keep='last')

# Display the first 5 rows of the resulting DataFrame to preview the cleaned data
result.head()

Unnamed: 0,Subject,Score,Subject.1
0,Mathematics,85,Science
1,English,91,Arts
3,History,95,Chemistry
4,English,95,Chemistry


In [28]:
# Remove all rows that have duplicate values across all columns.
# 'keep=False' means **drop all** instances of a duplicate, not just the later or earlier one.
result = my_df.drop_duplicates(keep=False)

# Display the first 5 rows of the resulting DataFrame after duplicates have been removed.
result.head()

Unnamed: 0,Subject,Score,Subject.1
0,Mathematics,85,Science
1,English,91,Arts
4,English,95,Chemistry


In [29]:
# Remove duplicate rows from the DataFrame `my_df` based on the 'Score' column.
# Only the first occurrence of each unique 'Score' is kept.
result = my_df.drop_duplicates(subset=['Score'])

# Display the first 5 rows of the resulting DataFrame
result.head()

Unnamed: 0,Subject,Score,Subject.1
0,Mathematics,85,Science
1,English,91,Arts
2,History,95,Chemistry
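
Before dropping anything, it can help to inspect which rows would count as duplicates. The `duplicated()` method accepts the same `subset` and `keep` arguments as `drop_duplicates()` and returns a boolean mask. A small sketch on a hypothetical dataframe `df`:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, and all rows share the same score
df = pd.DataFrame({
    "Subject": ["History", "History", "English"],
    "Score": [95, 95, 95],
})

# True for every row that repeats an earlier row across all columns
full_dupes = df.duplicated()

# keep=False flags every row whose 'Score' appears more than once
score_dupes = df.duplicated(subset=["Score"], keep=False)

print(full_dupes.tolist())
print(score_dupes.tolist())

# The mask can be used to look at the affected rows before removing them
print(df[full_dupes])
```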


### 3.2. Removing Duplicate Columns
There are two main ways to remove duplicate columns in Pandas.
You can remove columns that share the same name, keeping only the
first one, or you can remove columns that contain the same values in
every row as another column.

In [30]:
# Import the pandas library and give it the alias 'pd'
import pandas as pd

# Define a list of scores, where each element is a list containing:
# two subjects and their corresponding scores.
scores = [
    ['Mathematics', 85, 'Science', 85],
    ['English', 91, 'Arts', 91],
    ['History', 95, 'Chemistry', 95],
    ['History', 95, 'Chemistry', 95],
    ['English', 95, 'Chemistry', 95],
]

# Create a DataFrame from the scores list
# The DataFrame will have four columns, with two of them having the same name: 'Subject'
# The columns are: 'Subject', 'Score', 'Subject', 'Percentage'
# Note: Having duplicate column names is allowed in pandas but can lead to confusion
my_df = pd.DataFrame(scores, columns=['Subject', 'Score', 'Subject', 'Percentage'])

# Display the first five rows of the DataFrame
my_df.head()

Unnamed: 0,Subject,Score,Subject.1,Percentage
0,Mathematics,85,Science,85
1,English,91,Arts,91
2,History,95,Chemistry,95
3,History,95,Chemistry,95
4,English,95,Chemistry,95


In [31]:
# Remove duplicate columns from the DataFrame `my_df`
# `my_df.columns.duplicated()` returns a boolean array that is True for each column name already seen earlier
# `~` negates the boolean Series to select only non-duplicated columns
# `.loc[:, ...]` selects all rows (`:`) and only the non-duplicated columns
result = my_df.loc[:, ~my_df.columns.duplicated()]

# Display the first 5 rows of the resulting DataFrame
result.head()

Unnamed: 0,Subject,Score,Percentage
0,Mathematics,85,85
1,English,91,91
2,History,95,95
3,History,95,95
4,English,95,95


In [32]:
# Transpose the DataFrame (swap rows and columns)
transposed_df = my_df.T

# Drop duplicate rows in the transposed DataFrame (which correspond to duplicate columns in the original)
transposed_df_no_duplicates = transposed_df.drop_duplicates()

# Transpose the DataFrame back to its original orientation (columns and rows are restored)
result = transposed_df_no_duplicates.T

# Display the first 5 rows of the resulting DataFrame
result.head()

Unnamed: 0,Subject,Score,Subject.1
0,Mathematics,85,Science
1,English,91,Arts
2,History,95,Chemistry
3,History,95,Chemistry
4,English,95,Chemistry
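
One side effect of the transpose approach is worth knowing: transposing a dataframe with mixed column types typically produces `object` columns, so numeric columns may lose their dtype on the round trip. The sketch below, on a hypothetical dataframe `df` with unique column names, shows this and uses `infer_objects()` as one possible way to recover the numeric dtype.

```python
import pandas as pd

# Hypothetical toy data: 'Percentage' repeats the values of 'Score'
df = pd.DataFrame({
    "Subject": ["Mathematics", "English"],
    "Score": [85, 91],
    "Percentage": [85, 91],
})

# Transpose, drop duplicate rows, transpose back
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))
print(deduped.dtypes)

# Ask pandas to re-infer better dtypes for the object columns
restored = deduped.infer_objects()
print(restored.dtypes)
```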


## 4. Pivot and Crosstab
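Pivoting reshapes a Pandas dataframe: the unique values of one column
become the row index, and the unique values of another column become
the column headers. A crosstab builds a similar table that counts how
often each combination of two variables occurs.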

In [33]:
# Import the matplotlib.pyplot module for creating plots
import matplotlib.pyplot as plt

# Import the seaborn library for statistical data visualization
import seaborn as sns

# Load the built-in "flights" dataset from seaborn
# This dataset contains the number of passengers flying each month from 1949 to 1960
flights_data = sns.load_dataset('flights')

# Display the first five rows of the dataset to get an overview of the data
flights_data.head()

Unnamed: 0,year,month,passengers
0,1949,Jan,112
1,1949,Feb,118
2,1949,Mar,132
3,1949,Apr,129
4,1949,May,121


In [35]:
# Create a pivot table from 'flights_data': months become the rows and years become the columns
# pivot_table aggregates with the mean by default, which is why the values display as floats
# observed=False keeps all categories of categorical groupers and avoids a FutureWarning
flights_data_pivot = flights_data.pivot_table(index='month',
                                               columns='year',
                                               values='passengers',
                                               observed=False)  

# Display the first 5 rows of the pivot table
flights_data_pivot.head()

year,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960
month,,,,,,,,,,,,
Jan,112.0,115.0,145.0,171.0,196.0,204.0,242.0,284.0,315.0,340.0,360.0,417.0
Feb,118.0,126.0,150.0,180.0,196.0,188.0,233.0,277.0,301.0,318.0,342.0,391.0
Mar,132.0,141.0,178.0,193.0,236.0,235.0,267.0,317.0,356.0,362.0,406.0,419.0
Apr,129.0,135.0,163.0,181.0,235.0,227.0,269.0,313.0,348.0,348.0,396.0,461.0
May,121.0,125.0,172.0,183.0,229.0,234.0,270.0,318.0,355.0,363.0,420.0,472.0
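
In the flights data every month and year pair occurs exactly once, so the default mean simply returns the original value. When pairs repeat, the choice of `aggfunc` matters, and plain `pivot()` refuses to reshape at all. A sketch on a hypothetical `sales` dataframe in which the pair (Jan, 1949) appears twice:

```python
import pandas as pd

# Hypothetical toy data: (Jan, 1949) occurs twice
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Jan"],
    "year": [1949, 1950, 1949, 1950, 1949],
    "passengers": [100, 115, 118, 126, 124],
})

# Default aggregation is the mean of the repeated entries
mean_table = sales.pivot_table(index="month", columns="year", values="passengers")

# An explicit aggfunc changes how repeated entries are combined
sum_table = sales.pivot_table(index="month", columns="year",
                              values="passengers", aggfunc="sum")

# pivot() does no aggregation, so duplicate index/column pairs raise an error
try:
    sales.pivot(index="month", columns="year", values="passengers")
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(mean_table)
print(sum_table)
print(pivot_failed)
```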


In [37]:
# Import necessary libraries
import matplotlib.pyplot as plt  # For creating static, animated, and interactive visualizations in Python
import seaborn as sns            # For making statistical graphics built on top of matplotlib
import pandas as pd              # For data manipulation and analysis

# Set Seaborn's default plot style to dark grid background
sns.set_style("darkgrid")

# Load the Titanic dataset from Seaborn's built-in datasets
titanic_data = sns.load_dataset('titanic')

In [38]:
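# Create a cross-tabulation (contingency table): count the passengers in each combination of 'class' and 'age'
# 'age' is numeric, so every distinct age becomes its own column
# Rows with a missing age are excluded, which is why the overall total is 714 rather than 891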

pd.crosstab(
    titanic_data["class"],    # Rows: Passenger class (e.g., First, Second, Third)
    titanic_data["age"],      # Columns: Age of passengers
    margins=True              # Adds a row and column labeled "All" that show the total counts
)

age,0.42,0.67,0.75,0.83,0.92,1.0,2.0,3.0,4.0,5.0,...,63.0,64.0,65.0,66.0,70.0,70.5,71.0,74.0,80.0,All
class,,,,,,,,,,,,,,,,,,,,,
First,0,0,0,0,1,0,1,0,1,0,...,1,2,2,0,1,0,2,0,1,186
Second,0,1,0,2,0,2,2,3,2,1,...,0,0,0,1,1,0,0,0,0,173
Third,1,0,2,0,0,5,7,3,7,3,...,1,0,1,0,0,1,0,1,0,355
All,1,1,2,2,1,7,10,6,10,4,...,2,2,3,1,2,1,2,1,1,714
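
A crosstab is usually easier to read when both variables have only a few distinct values. The sketch below uses a hypothetical `toy` dataframe with two categorical columns, and also shows `normalize="index"`, which converts the counts in each row into proportions.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "cls": ["First", "First", "Third", "Third", "Third"],
    "alive": ["yes", "no", "no", "no", "yes"],
})

# Raw counts, with 'All' totals for rows and columns
counts = pd.crosstab(toy["cls"], toy["alive"], margins=True)

# Row-wise proportions: each row sums to 1
shares = pd.crosstab(toy["cls"], toy["alive"], normalize="index")

print(counts)
print(shares)
```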


## 5. Discretization and Binning
Discretization or binning refers to creating categories or bins using
numeric data. For instance, based on age, you may want to assign
categories such as toddler, young, adult, and senior to the
passengers in the Titanic dataset. You can do this using binning.

In [39]:
# Importing the matplotlib library for plotting graphs and visualizations
import matplotlib.pyplot as plt

# Importing seaborn, a statistical data visualization library built on top of matplotlib
import seaborn as sns

# Loading the Titanic dataset using seaborn's built-in dataset loader
# This dataset contains information about Titanic passengers (e.g., age, sex, class, survival)
titanic_data = sns.load_dataset('titanic')

In [40]:
# Create a new column 'age_group' by binning the 'age' values
# pd.cut uses right-closed intervals by default:
# (0, 5] toddler, (5, 20] young, (20, 60] adult, (60, 100] senior
# Passengers with a missing age get a missing 'age_group' as well
titanic_data['age_group'] = pd.cut(
    x=titanic_data['age'],  # the column to be binned
    bins=[0, 5, 20, 60, 100],  # edges of the age groups
    labels=["toddler", "young", "adult", "senior"]  # labels for each bin
)

# Count the number of passengers in each age group and display the result
titanic_data['age_group'].value_counts()

age_group
adult      513
young      135
toddler     44
senior      22
Name: count, dtype: int64
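
How values that fall exactly on a bin edge are assigned depends on the `right` parameter of `pd.cut`. The sketch below applies the same bins and labels to a hypothetical `ages` series, once with the default right-closed intervals and once with `right=False`.

```python
import pandas as pd

# Hypothetical ages, several of which sit exactly on a bin edge
ages = pd.Series([0.42, 5.0, 5.5, 20.0, 60.0, 80.0])
bins = [0, 5, 20, 60, 100]
labels = ["toddler", "young", "adult", "senior"]

# Default right=True: intervals (0, 5], (5, 20], (20, 60], (60, 100]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals [0, 5), [5, 20), [20, 60), [60, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

With the default setting, an age of exactly 5 counts as a toddler; with `right=False`, it counts as young.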