# Pandas Datetime, Aggregation, and Merges


▶️ Import `pandas` and `numpy`.


In [1]:
import pandas as pd
import numpy as np

▶️ Create a new `DataFrame` named `df_emp`.


In [2]:
df_emp = pd.DataFrame(
    {
        "emp_id": [30, 40, 10, 20],
        "name": ["Toby", "Jim", "Pam", "Kelly"],
        "dept": ["HR", "Sales", "Sales", "Customer Service"],
        "office_phone": ["(217)123-4500", np.nan, np.nan, "(217)987-6600"],
        "start_date": ["2017-05-01", "2018-02-01", "2020-08-01", "2019-12-01"],
        "salary": [202000, 185000, 240000, 160500],
    }
)

# Create a backup copy of the original DataFrame
df_emp_backup = df_emp.copy()

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Toby,HR,(217)123-4500,2017-05-01,202000
1,40,Jim,Sales,,2018-02-01,185000
2,10,Pam,Sales,,2020-08-01,240000
3,20,Kelly,Customer Service,(217)987-6600,2019-12-01,160500


▶️ Run `df_emp.info()` for a summary of the DataFrame, including the data types of each column and the number of non-null entries.


In [3]:
df_emp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   emp_id        4 non-null      int64 
 1   name          4 non-null      object
 2   dept          4 non-null      object
 3   office_phone  2 non-null      object
 4   start_date    4 non-null      object
 5   salary        4 non-null      int64 
dtypes: int64(2), object(4)
memory usage: 324.0+ bytes


---

## 🗓️ Working with Datetime Values

Datetime values are common in datasets. They can represent dates, times, or both. Pandas provides powerful tools to work with datetime data.

A few common datetime formats include:

- `YYYY-MM-DD` (e.g., `2021-03-15`)
- `MM/DD/YYYY` (e.g., `03/15/2021`)
- `DD-Mon-YYYY` (e.g., `15-Mar-2021`)
- `YYYYMMDD` (e.g., `20210315`)
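
As a rough sketch, each of the formats listed above can be parsed with `pd.to_datetime()` by passing a matching `format` string. The sample strings are the examples from the list; the variable names (`samples`, `parsed`) are made up for illustration.

```python
import pandas as pd

# One sample string per format from the list above (keys are strftime-style format codes)
samples = {
    "%Y-%m-%d": "2021-03-15",   # YYYY-MM-DD
    "%m/%d/%Y": "03/15/2021",   # MM/DD/YYYY
    "%d-%b-%Y": "15-Mar-2021",  # DD-Mon-YYYY (assumes English month abbreviations)
    "%Y%m%d": "20210315",       # YYYYMMDD
}

# Parse each string using its explicit format
parsed = {fmt: pd.to_datetime(text, format=fmt) for fmt, text in samples.items()}

for fmt, ts in parsed.items():
    print(fmt, "->", ts.date())
```

All four strings describe the same calendar day, so every parsed value should be the same `Timestamp`.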

When you read a CSV file (or build a DataFrame from date strings, as we did above), pandas does not automatically convert date-like columns to a datetime type.

▶️ You can verify this by checking the data type of the `"start_date"` column.


In [4]:
display(df_emp["start_date"])
print(str(df_emp["start_date"].dtype))  # Check the data type of the "start_date" column

0    2017-05-01
1    2018-02-01
2    2020-08-01
3    2019-12-01
Name: start_date, dtype: object

object


The `"start_date"` column is currently of type `object`, which means it is treated as a string.

To use datetime functionality, we need to convert this column to a datetime type. There are two common ways to do this:

1. **During CSV Import**: Use the `parse_dates` parameter in `pd.read_csv()`.

   ```python
   df = pd.read_csv('data.csv', parse_dates=['date_column'])
   ```

2. **After CSV Import**: Use `pd.to_datetime()` to convert a column.

   ```python
   df['date_column'] = pd.to_datetime(df['date_column'])
   ```
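
The two approaches can be tried without a file on disk. The sketch below assumes a tiny in-memory CSV (`csv_text` is invented for illustration) and also shows `errors="coerce"`, which turns unparseable values into `NaT` instead of raising an error.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical data)
csv_text = "name,start_date\nToby,2017-05-01\nJim,2018-02-01\n"

# Without parse_dates, the column is read as plain text
df_raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: parse while importing
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_date"])

# Option 2: convert after importing
df_raw["start_date_parsed"] = pd.to_datetime(df_raw["start_date"])

# Invalid values become NaT (missing) when errors="coerce" is used
messy = pd.Series(["2021-03-15", "not a date"])
cleaned = pd.to_datetime(messy, errors="coerce")

print(df_parsed.dtypes)
print(cleaned)
```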


---

**🎯 Example: Parse the `start_date` column as datetime**

We will create a new column `"start_date_parsed"` that contains the parsed datetime values.


In [5]:
df_emp["start_date_parsed"] = pd.to_datetime(df_emp["start_date"])

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary,start_date_parsed
0,30,Toby,HR,(217)123-4500,2017-05-01,202000,2017-05-01
1,40,Jim,Sales,,2018-02-01,185000,2018-02-01
2,10,Pam,Sales,,2020-08-01,240000,2020-08-01
3,20,Kelly,Customer Service,(217)987-6600,2019-12-01,160500,2019-12-01


▶️ Run `str(df_emp["start_date_parsed"].dtype)` to check the data type of the new `"start_date_parsed"` column.


In [6]:
str(df_emp["start_date_parsed"].dtype)

'datetime64[ns]'

▶️ Drop the `"start_date"` column and rename `"start_date_parsed"` to `"start_date"`.

This effectively replaces the original string column with the new datetime column.


In [7]:
df_emp.drop(columns=["start_date"], inplace=True)
df_emp.rename(columns={"start_date_parsed": "start_date"}, inplace=True)

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,salary,start_date
0,30,Toby,HR,(217)123-4500,202000,2017-05-01
1,40,Jim,Sales,,185000,2018-02-01
2,10,Pam,Sales,,240000,2020-08-01
3,20,Kelly,Customer Service,(217)987-6600,160500,2019-12-01


▶️ Check the data types of the columns to confirm the change.


In [8]:
df_emp.dtypes

emp_id                   int64
name                    object
dept                    object
office_phone            object
salary                   int64
start_date      datetime64[ns]
dtype: object

Now the `"start_date"` column is of type `datetime64[ns]`, which allows us to perform datetime operations on it.
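
To give a sense of what "datetime operations" means, here is a minimal sketch that filters by date and computes elapsed days. It rebuilds a small copy of the employee data so it runs on its own; `emp`, `cutoff`, and `reference` are illustrative names, and the reference date is an arbitrary choice.

```python
import pandas as pd

# Small stand-in for df_emp with an already-parsed datetime column
emp = pd.DataFrame(
    {
        "name": ["Toby", "Jim", "Pam", "Kelly"],
        "start_date": pd.to_datetime(
            ["2017-05-01", "2018-02-01", "2020-08-01", "2019-12-01"]
        ),
    }
)

# Comparison: keep employees who started on or after a cutoff date
cutoff = pd.Timestamp("2019-01-01")
recent = emp[emp["start_date"] >= cutoff]

# Subtraction: days between each start date and an arbitrary reference date
reference = pd.Timestamp("2021-01-01")
emp["days_employed"] = (reference - emp["start_date"]).dt.days

print(recent)
print(emp)
```

Neither operation would work as intended on the original string column, which is why the conversion step matters.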


---

### 🔢 Extract Date Components

We can easily extract components like year, month, and day from datetime columns using the `.dt` accessor.

```python
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day'] = df['date_column'].dt.day
df['weekday'] = df['date_column'].dt.weekday          # e.g., 0=Monday, 6=Sunday
df['weekday_name'] = df['date_column'].dt.day_name()  # e.g., 'Monday'
```

We can also extract more specific components like hour, minute, and second if the datetime includes time.

```python
df['hour'] = df['date_column'].dt.hour
df['minute'] = df['date_column'].dt.minute
df['second'] = df['date_column'].dt.second
```
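
The employee data only contains dates, so the time components are easier to see on a separate example. This sketch assumes two made-up timestamps (`stamps`) and collects a few components into a DataFrame.

```python
import pandas as pd

# Two illustrative timestamps that include a time of day
stamps = pd.Series(pd.to_datetime(["2021-03-15 09:30:45", "2021-03-20 18:05:00"]))

# Collect several components side by side
parts = pd.DataFrame(
    {
        "weekday_name": stamps.dt.day_name(),
        "hour": stamps.dt.hour,
        "minute": stamps.dt.minute,
        "second": stamps.dt.second,
    }
)

print(parts)
```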


---

▶️ Extract the year, month, and day from the `"start_date"` column into new columns `"start_year"`, `"start_month"`, and `"start_day"`.


In [9]:
df_emp["start_year"] = df_emp["start_date"].dt.year
df_emp["start_month"] = df_emp["start_date"].dt.month
df_emp["start_day"] = df_emp["start_date"].dt.day

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,salary,start_date,start_year,start_month,start_day
0,30,Toby,HR,(217)123-4500,202000,2017-05-01,2017,5,1
1,40,Jim,Sales,,185000,2018-02-01,2018,2,1
2,10,Pam,Sales,,240000,2020-08-01,2020,8,1
3,20,Kelly,Customer Service,(217)987-6600,160500,2019-12-01,2019,12,1


If you're working with quarterly or weekly data, you can extract those components as well.

```python
df['quarter'] = df['date_column'].dt.quarter          # e.g., 1, 2, 3, 4
df['week'] = df['date_column'].dt.isocalendar().week  # ISO week number, 1-53
```

▶️ Extract the quarter and week from the `"start_date"` column. This time, we will not create new columns but will display the results directly.


In [10]:
df_emp["start_date"].dt.quarter

0    2
1    1
2    3
3    4
Name: start_date, dtype: int32

In [11]:
df_emp["start_date"].dt.isocalendar().week

0    18
1     5
2    31
3    48
Name: week, dtype: UInt32
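
One detail worth knowing: `isocalendar()` follows the ISO 8601 calendar, so dates near New Year can belong to a week of the neighboring year. The sketch below uses three illustrative dates around the start of 2021 to show this.

```python
import pandas as pd

# Dates around a year boundary (chosen for illustration)
dates = pd.Series(pd.to_datetime(["2020-12-31", "2021-01-01", "2021-01-04"]))

# isocalendar() returns a DataFrame with ISO year, week, and day columns
iso = dates.dt.isocalendar()

print(iso)
```

Here January 1, 2021 is reported as part of ISO year 2020, so pairing `.dt.year` with the ISO week number can give misleading results at year boundaries. Using the `year` column returned by `isocalendar()` avoids that.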

---

## 🔬 Grouping and Aggregating Data with Pandas

A common task in data analysis is to summarize data by certain criteria, such as calculating averages or totals for different groups within the data.

In the employees dataset, you might want to find out the average salary by department, or the total number of employees hired each year.

Pandas allows you to use the `groupby()` method to group data by one or more columns, and then apply aggregation functions like `mean()`, `sum()`, `count()`, etc.

![groupby](images/pandas/groupby-01.png)

This follows the [_split-apply-combine_](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) pattern used in many data analysis workflows.

The split-apply-combine pattern consists of three steps:

1. **Split**: Divide the data into groups based on one or more keys (columns).
2. **Apply**: Apply a function (e.g., aggregation, transformation) to each group.
3. **Combine**: Combine the results back into a DataFrame.
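
The three steps can be spelled out by hand to see what `groupby()` does for you. This is only a teaching sketch on a made-up `toy` DataFrame, not how pandas implements grouping internally.

```python
import pandas as pd

# Illustrative data: two teams with two scores each
toy = pd.DataFrame({"team": ["A", "B", "A", "B"], "score": [10, 20, 30, 40]})

# 1. Split: one sub-DataFrame per team
pieces = {team: part for team, part in toy.groupby("team")}

# 2. Apply: compute the mean score within each piece
applied = {team: part["score"].mean() for team, part in pieces.items()}

# 3. Combine: gather the per-group results into one object
combined = pd.Series(applied, name="score")

# The usual one-liner performs all three steps at once
shortcut = toy.groupby("team")["score"].mean()

print(combined)
print(shortcut)
```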


---

To illustrate the split-apply-combine process, we will create a new `DataFrame`.

▶️ Create a new `DataFrame` named `df`.


In [12]:
df = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

df

Unnamed: 0,name,dept,salary
0,Mary,Finance,240000
1,Roy,Purchase,160000
2,John,Finance,250000
3,Joe,Purchase,170000
4,Paul,Finance,260000
5,Erin,Purchase,180000


---

▶️ Group the DataFrame `df` by the `"dept"` column without applying any aggregation function.

This creates a `DataFrameGroupBy` object, which represents the grouped data but does not perform any computations yet.


In [13]:
df.groupby("dept")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002031B0B2270>

:::{tip} What just happened?

- Internally, Pandas creates one group per department when you run `.groupby('dept')`.
- However, the groups are not displayed until you apply an aggregation function to each group (or inspect the object explicitly).
- The strange-looking output (in the form of `<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012345678910>`) tells us that the result is a `DataFrameGroupBy` object.

:::

This diagram illustrates the `DataFrameGroupBy` object created by `df.groupby('dept')`.

![groupby object](images/pandas/df-groupby-object-01.png)
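
If you are curious about what is inside a `DataFrameGroupBy` object, a few attributes and methods let you peek at it. The sketch below rebuilds the same six-row data under the hypothetical name `df_demo` so it can run on its own.

```python
import pandas as pd

# Same data as df, rebuilt here so the example is self-contained
df_demo = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

grouped = df_demo.groupby("dept")

print(grouped.ngroups)  # number of groups
print(grouped.groups)   # mapping of group key -> row labels
print(grouped.size())   # number of rows in each group

# Pull out the rows belonging to a single group
finance = grouped.get_group("Finance")
print(finance)
```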


---

### 🗂️ Aggregate a `DataFrameGroupBy` object

▶️ Group the DataFrame `df` by the `"dept"` column and calculate the mean salary for each department using the `agg()` method.


In [14]:
df_salary_by_dept = df.groupby("dept").agg({"salary": "mean"})

df_salary_by_dept

Unnamed: 0_level_0,salary
dept,Unnamed: 1_level_1
Finance,250000.0
Purchase,170000.0


The result is a new `DataFrame` that shows the average salary for each department.

---

#### Columns and Index (Row Labels) in Aggregated DataFrames

▶️ Check the columns of the resulting `DataFrame`.


In [15]:
print(df_salary_by_dept.columns)

Index(['salary'], dtype='object')


Although the displayed output appears to have two columns (`dept` and `salary`), printing the columns only shows `salary`. This is because `df_salary_by_dept` uses the `"dept"` values as its index.

An index in pandas holds the row labels of a DataFrame. It is not a regular column, so it is not included in the `columns` attribute. When you group by a column, the unique values of that column become the index of the resulting DataFrame by default.

![groupby agg result](images/pandas/df-groupby-agg-as-index-true-01.png)
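
A short sketch of what having `dept` as the index means in practice: you can look rows up by label with `.loc`, and you can move the index back into a regular column with `reset_index()`. The data is rebuilt under the illustrative name `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

by_dept = df_demo.groupby("dept").agg({"salary": "mean"})

# The grouping column lives in the index, not in the columns
print(by_dept.index)

# Row labels allow direct lookup by department name
finance_mean = by_dept.loc["Finance", "salary"]
print(finance_mean)

# reset_index() turns the index back into a regular column
as_columns = by_dept.reset_index()
print(as_columns.columns)
```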


---

### 📂 Aggregate a `DataFrameGroupBy` object with optional `as_index=False`

▶️ Specify `as_index=False` in the `groupby()` method to keep the grouping column as a regular column instead of setting it as the index.


In [16]:
df_salary_by_dept2 = df.groupby("dept", as_index=False).agg({"salary": "mean"})

df_salary_by_dept2

Unnamed: 0,dept,salary
0,Finance,250000.0
1,Purchase,170000.0


▶️ Check the columns of the resulting `DataFrame`.


In [17]:
print(df_salary_by_dept2.columns)

Index(['dept', 'salary'], dtype='object')


Since we specified `as_index=False`, the `"dept"` column is retained as a regular column in the resulting DataFrame. Now, printing the columns shows both `dept` and `salary`.

![groupby agg result](images/pandas/df-groupby-agg-as-index-false-01.png)


---

### ➗ Calculate multiple statistics

▶️ Instead of calculating just the mean salary, you can calculate multiple statistics at once. In the dictionary passed to `agg()`, map the column name to a list of aggregation functions.


In [18]:
df_salary_by_dept3 = df.groupby("dept", as_index=False).agg(
    {"salary": ["min", "max", "mean", "sum", "count", "std"]}
)

df_salary_by_dept3

Unnamed: 0_level_0,dept,salary,salary,salary,salary,salary,salary
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,sum,count,std
0,Finance,240000,260000,250000.0,750000,3,10000.0
1,Purchase,160000,180000,170000.0,510000,3,10000.0


▶️ Check the columns of the resulting `DataFrame`.


In [19]:
print(df_salary_by_dept3.columns)

MultiIndex([(  'dept',      ''),
            ('salary',   'min'),
            ('salary',   'max'),
            ('salary',  'mean'),
            ('salary',   'sum'),
            ('salary', 'count'),
            ('salary',   'std')],
           )


Notice that the columns now have a hierarchical structure (a MultiIndex) because we applied multiple aggregation functions to the `"salary"` column. The first level is the original column name (`salary`), and the second level contains the names of the aggregation functions (`min`, `max`, `mean`, `sum`, `count`, `std`).

:::{tip} What are the data types of the multiple-statistics columns?

The columns as a whole are of type `MultiIndex`, which allows for hierarchical indexing.

```python
type(df_salary_by_dept3.columns) # Output: pandas.core.indexes.multi.MultiIndex
```

Each column is represented as a tuple, where the first element is the original column name and the second element is the aggregation function name.

```python
type(df_salary_by_dept3.columns[0]) # Output: tuple
```

To access specific columns, you can use tuples inside the square brackets. For example, to access the mean salary column:

```python
df_salary_by_dept3[('salary', 'mean')]
```

:::


---

### 🪄 Flatten multi-level index columns

It is perfectly fine to work with multi-level columns, but sometimes you may want to flatten them for easier access. You can manually assign the column names after aggregation to flatten the columns.

```python
df_salary_by_dept4 = df.groupby('dept', as_index=False).agg({'salary': ['min', 'max', 'mean', 'sum', 'count', 'std']})

display(df_salary_by_dept4)
print('Columns before (multi-level, not flat):')
print(df_salary_by_dept4.columns)

# manually assign column names
df_salary_by_dept4.columns = ['dept', 'min_salary', 'max_salary', 'mean_salary', 'total_salary', 'num_employees', 'std_dev']

display(df_salary_by_dept4)
print('Columns after (flat-level):')
print(df_salary_by_dept4.columns)
```

▶️ Copy the provided code to the code cell below and run it.


In [20]:
df_salary_by_dept4 = df.groupby("dept", as_index=False).agg(
    {"salary": ["min", "max", "mean", "sum", "count", "std"]}
)

df_salary_by_dept4

Unnamed: 0_level_0,dept,salary,salary,salary,salary,salary,salary
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,sum,count,std
0,Finance,240000,260000,250000.0,750000,3,10000.0
1,Purchase,160000,180000,170000.0,510000,3,10000.0


▶️ Check the columns of the `DataFrame` before flattening.


In [21]:
print("Columns before flattening:")

df_salary_by_dept4.columns

Columns before flattening:


MultiIndex([(  'dept',      ''),
            ('salary',   'min'),
            ('salary',   'max'),
            ('salary',  'mean'),
            ('salary',   'sum'),
            ('salary', 'count'),
            ('salary',   'std')],
           )

▶️ Flatten the columns by manually assigning new column names.


In [22]:
df_salary_by_dept4.columns = [
    "dept",
    "min_salary",
    "max_salary",
    "mean_salary",
    "total_salary",
    "num_employees",
    "std_dev",
]

df_salary_by_dept4

Unnamed: 0,dept,min_salary,max_salary,mean_salary,total_salary,num_employees,std_dev
0,Finance,240000,260000,250000.0,750000,3,10000.0
1,Purchase,160000,180000,170000.0,510000,3,10000.0


▶️ Check the columns of the `DataFrame` after flattening.


In [23]:
print("Columns after flattening:")

df_salary_by_dept4.columns

Columns after flattening:


Index(['dept', 'min_salary', 'max_salary', 'mean_salary', 'total_salary',
       'num_employees', 'std_dev'],
      dtype='object')
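
As an alternative to renaming after the fact, pandas also supports named aggregation, where each keyword argument to `agg()` has the form `new_column=(source_column, function)`. The sketch below reuses the column names chosen above and rebuilds the data as `df_demo` for self-containment; treat it as one possible approach rather than the preferred one.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

# Each keyword names an output column: new_name=(source_column, function)
summary = df_demo.groupby("dept", as_index=False).agg(
    min_salary=("salary", "min"),
    max_salary=("salary", "max"),
    mean_salary=("salary", "mean"),
    total_salary=("salary", "sum"),
    num_employees=("salary", "count"),
    std_dev=("salary", "std"),
)

print(summary.columns)
print(summary)
```

Because the output names are given up front, the result has flat columns from the start and no manual reassignment is needed.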

---

## 📞 Examples Using Bank Marketing Calls Data

Let's apply what we've learned to a real-world dataset. We'll use a dataset related to direct marketing campaigns (phone calls) of a banking institution.

**Data Source**: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

### Data Dictionary

| Column Name     | Type        | Description                                                                                                                                                 |
| --------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `age`           | Numeric     | Age                                                                                                                                                         |
| `job`           | Categorical | 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown' |
| `marital`       | Categorical | 'single', 'married', 'divorced', 'unknown'                                                                                                                  |
| `education`     | Categorical | 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown'                                      |
| `contact_type`  | Categorical | 'cellular', 'telephone'                                                                                                                                     |
| `num_contacts`  | Numeric     | Number of contacts performed during this campaign for this client                                                                                           |
| `prev_outcome`  | Categorical | Outcome of the previous marketing campaign - 'failure', 'nonexistent', 'success'                                                                            |
| `place_deposit` | Numeric     | Did the client subscribe to a term deposit? This column indicates whether the campaign was successful (1) or not (0) for each client.                       |

Our goal is to analyze the dataset and discover relationships between personal factors and the marketing campaign result for each individual.

The **`place_deposit`** column indicates whether a marketing campaign was successful.

- ✅ If `1`, the individual has placed a deposit within the bank. This is considered a **successful campaign**.
- 🚫 If `0`, the individual has not placed a deposit within the bank. This is considered an **unsuccessful campaign**.


▶️ Load the bank marketing calls dataset into a new `DataFrame` named `df_bank`.


In [24]:
df_bank = pd.read_csv(
    "https://github.com/bdi475/datasets/blob/main/bank-direct-marketing.csv?raw=true"
)
df_bank

Unnamed: 0,age,job,marital,education,contact_type,num_contacts,prev_outcome,place_deposit
0,56,housemaid,married,basic.4y,telephone,1,nonexistent,False
1,57,services,married,high.school,telephone,1,nonexistent,False
2,37,services,married,high.school,telephone,1,nonexistent,False
3,40,admin.,married,basic.6y,telephone,1,nonexistent,False
4,56,services,married,high.school,telephone,1,nonexistent,False
...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,cellular,1,nonexistent,True
41184,46,blue-collar,married,professional.course,cellular,1,nonexistent,False
41185,56,retired,married,university.degree,cellular,2,nonexistent,False
41186,44,technician,married,professional.course,cellular,1,nonexistent,True


---

**🎯 Example: Marketing success rate by marital status**

Find out which marital status group has the highest marketing success rate.

▶️ Group the DataFrame by the `"marital"` column and calculate the mean of the `"place_deposit"` column for each group.


In [25]:
df_by_marital = df_bank.groupby("marital", as_index=False).agg(
    {"place_deposit": "mean"}
)
df_by_marital



Unnamed: 0,marital,place_deposit
0,divorced,0.103209
1,married,0.101573
2,single,0.140041
3,unknown,0.15


:::{note} Why does finding the mean of `place_deposit` give us the success rate?

The `place_deposit` column is binary, where 1 indicates a successful marketing campaign. In the loaded data it is displayed as `True`/`False`, and pandas treats `True` as 1 and `False` as 0 when averaging. By calculating the mean, we get the proportion of successful campaigns for each group, which is the definition of success rate.

For example, if a group has 5 individuals and 3 of them placed a deposit (i.e., `place_deposit` is 1 for those 3 individuals), the mean would be:

```
(1 + 1 + 1 + 0 + 0) / 5 = 0.6
```

This means the success rate for this group is 60%.

:::
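
The arithmetic in the note can be checked directly. This small sketch uses a made-up Series of five outcomes (three successes, two failures).

```python
import pandas as pd

# Five illustrative outcomes: three deposits placed, two not
outcomes = pd.Series([True, True, True, False, False])

successes = outcomes.sum()   # True counts as 1, False as 0
rate = outcomes.mean()       # successes divided by the number of rows

print(successes, rate)
```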


▶️ Rename the `"place_deposit"` column to `"success_rate"`.


In [26]:
df_by_marital.rename(columns={"place_deposit": "success_rate"}, inplace=True)

df_by_marital

Unnamed: 0,marital,success_rate
0,divorced,0.103209
1,married,0.101573
2,single,0.140041
3,unknown,0.15


▶️ Sort the resulting DataFrame by `"success_rate"` in descending order to see which marital status has the highest success rate.


In [27]:
df_by_marital.sort_values("success_rate", ascending=False, inplace=True)

df_by_marital

Unnamed: 0,marital,success_rate
3,unknown,0.15
2,single,0.140041
0,divorced,0.103209
1,married,0.101573
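
If you only need the top group rather than the full sorted table, `idxmax()` and `nlargest()` are possible shortcuts. The sketch below re-enters the four rates from the output above into a hypothetical `rates` DataFrame so it can run without downloading the dataset.

```python
import pandas as pd

# Success rates copied from the output above
rates = pd.DataFrame(
    {
        "marital": ["divorced", "married", "single", "unknown"],
        "success_rate": [0.103209, 0.101573, 0.140041, 0.15],
    }
)

# Row with the single highest success rate
top_row = rates.loc[rates["success_rate"].idxmax()]

# The two highest rows, already sorted in descending order
top_two = rates.nlargest(2, "success_rate")

print(top_row["marital"])
print(top_two)
```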


---

**🎯 Example: Marketing success rate by job**

Find out which job category has the highest marketing success rate.


In [28]:
df_by_job = df_bank.groupby("job", as_index=False).agg({"place_deposit": "mean"})
df_by_job.rename(columns={"place_deposit": "success_rate"}, inplace=True)
df_by_job.sort_values("success_rate", ascending=False, inplace=True)

df_by_job

Unnamed: 0,job,success_rate
8,student,0.314286
5,retired,0.252326
10,unemployed,0.142012
0,admin.,0.129726
4,management,0.112175
11,unknown,0.112121
9,technician,0.10826
6,self-employed,0.104856
3,housemaid,0.1
2,entrepreneur,0.085165


---

**🎯 Example: Marketing success rate by contact type with count**

▶️ Group the `df_bank` DataFrame by the `"contact_type"` column and calculate both the mean of the `"place_deposit"` column (to get the success rate) and the count of records for each contact type. Use `as_index=False` to keep `"contact_type"` as a regular column.


In [29]:
df_by_contact_type = df_bank.groupby("contact_type", as_index=False).agg(
    {"place_deposit": ["count", "mean"]}
)

# flatten columns
df_by_contact_type.columns = ["contact_type", "count", "success_rate"]

# sort by success_rate in descending order
df_by_contact_type.sort_values("success_rate", ascending=False, inplace=True)

df_by_contact_type

Unnamed: 0,contact_type,count,success_rate
0,cellular,26144,0.147376
1,telephone,15044,0.052313


---

**🎯 Example: Marketing success rate by marital status and contact type**

You might want to see how the success rate varies not just by marital status, but also by the type of contact method used (cellular or telephone). This will help the bank understand which combinations of marital status and contact type are most effective.

To group by both marital status and contact type, you can pass a list of the two columns to the `groupby()` method.

▶️ Group the data by both `"marital"` status and `"contact_type"` to see how the success rate varies across these two dimensions.


In [30]:
df_by_type_and_marital = df_bank.groupby(
    ["marital", "contact_type"], as_index=False
).agg({"place_deposit": "mean"})
df_by_type_and_marital.rename(columns={"place_deposit": "success_rate"}, inplace=True)
df_by_type_and_marital.sort_values("success_rate", ascending=False, inplace=True)

df_by_type_and_marital

Unnamed: 0,marital,contact_type,success_rate
6,unknown,cellular,0.207547
4,single,cellular,0.173875
0,divorced,cellular,0.13652
2,married,cellular,0.135341
5,single,telephone,0.064884
3,married,telephone,0.048755
1,divorced,telephone,0.046362
7,unknown,telephone,0.037037
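
A two-key result like this is sometimes easier to compare in a wide layout, with one row per marital status and one column per contact type. The sketch below re-enters the rates from the output above (as the hypothetical `rates` DataFrame) and reshapes them with `pivot()`.

```python
import pandas as pd

# Success rates copied from the output above
rates = pd.DataFrame(
    {
        "marital": [
            "unknown", "single", "divorced", "married",
            "single", "married", "divorced", "unknown",
        ],
        "contact_type": ["cellular"] * 4 + ["telephone"] * 4,
        "success_rate": [
            0.207547, 0.173875, 0.13652, 0.135341,
            0.064884, 0.048755, 0.046362, 0.037037,
        ],
    }
)

# One row per marital status, one column per contact type
wide = rates.pivot(index="marital", columns="contact_type", values="success_rate")

print(wide)
```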


---

## 🧲 Merging Two DataFrames (Joins)

When working with real-world data, it's common to have related information spread across multiple tables. This is especially true in relational databases, where data is normalized to reduce redundancy.

There are several types of joins you can perform when merging DataFrames. In this section, we won't cover every join type. Instead, we focus on the basic idea of merging with a _left join_, which keeps every row of the left DataFrame and adds the matching columns from the right DataFrame.

To demonstrate how merging works, we'll work with a record of transactions from a small food stand selling only two items: sweetcorn 🌽 and beer 🍺. The tables associated with the food stand's transactions are shown below.

**Products (`df_products`)**

| product_id | product_name | price |
| ---------- | ------------ | ----- |
| SC         | Sweetcorn    | 3.0   |
| CB         | Beer         | 5.0   |

**Transactions (`df_transactions`)**

| transaction_id | product_id |
| -------------- | ---------- |
| 1              | SC         |
| 2              | SC         |
| 3              | CB         |
| 4              | SC         |
| 5              | SC         |
| 6              | SC         |
| 7              | CB         |
| 8              | SC         |
| 9              | CB         |
| 10             | SC         |


▶️ Create the two tables as DataFrames.


In [31]:
df_products = pd.DataFrame(
    {
        "product_id": ["SC", "CB"],
        "product_name": ["Sweetcorn", "Beer"],
        "price": [3.0, 5.0],
    }
)

df_products

Unnamed: 0,product_id,product_name,price
0,SC,Sweetcorn,3.0
1,CB,Beer,5.0


In [32]:
df_transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "product_id": ["SC", "SC", "CB", "SC", "SC", "SC", "CB", "SC", "CB", "SC"],
    }
)

df_transactions

Unnamed: 0,transaction_id,product_id
0,1,SC
1,2,SC
2,3,CB
3,4,SC
4,5,SC
5,6,SC
6,7,CB
7,8,SC
8,9,CB
9,10,SC


---

**🎯 Example: Merge products into transactions**


In [33]:
df_merged = pd.merge(
    left=df_transactions, right=df_products, on="product_id", how="left"
)

df_merged

Unnamed: 0,transaction_id,product_id,product_name,price
0,1,SC,Sweetcorn,3.0
1,2,SC,Sweetcorn,3.0
2,3,CB,Beer,5.0
3,4,SC,Sweetcorn,3.0
4,5,SC,Sweetcorn,3.0
5,6,SC,Sweetcorn,3.0
6,7,CB,Beer,5.0
7,8,SC,Sweetcorn,3.0
8,9,CB,Beer,5.0
9,10,SC,Sweetcorn,3.0
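
In the merge above every transaction found a matching product. To see what a left join does when a key has no match, the sketch below adds a transaction with a made-up product ID (`"XX"`, purely hypothetical). It also uses two optional `pd.merge()` parameters: `indicator=True`, which adds a `_merge` column describing where each row came from, and `validate="many_to_one"`, which raises an error if `product_id` is not unique in the right table.

```python
import pandas as pd

products = pd.DataFrame(
    {
        "product_id": ["SC", "CB"],
        "product_name": ["Sweetcorn", "Beer"],
        "price": [3.0, 5.0],
    }
)

# "XX" is a hypothetical product ID with no entry in the products table
transactions = pd.DataFrame(
    {"transaction_id": [1, 2, 3], "product_id": ["SC", "CB", "XX"]}
)

merged = pd.merge(
    left=transactions,
    right=products,
    on="product_id",
    how="left",              # keep every transaction, matched or not
    indicator=True,          # add a _merge column: "both" or "left_only"
    validate="many_to_one",  # each product_id may appear only once on the right
)

print(merged)
```

The unmatched transaction is kept, and its `product_name` and `price` are filled with missing values.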


---

**🎯 Example: Find total sales by product**


In [34]:
df_sales_by_product = df_merged.groupby("product_id", as_index=False).agg(
    {"price": "sum"}
)

df_sales_by_product

Unnamed: 0,product_id,price
0,CB,15.0
1,SC,21.0


While the output is useful, it would be more informative to see the product names alongside their total sales. We can achieve this by grouping the merged DataFrame by both `"product_id"` and `"product_name"`.


In [35]:
df_sales_by_id_name = df_merged.groupby(
    ["product_id", "product_name"], as_index=False
).agg({"price": "sum"})

df_sales_by_id_name

Unnamed: 0,product_id,product_name,price
0,CB,Beer,15.0
1,SC,Sweetcorn,21.0
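
The same grouping can report units sold next to total sales. This sketch rebuilds the two tables so it runs on its own and uses named aggregation; the output column names `units_sold` and `total_sales` are illustrative choices.

```python
import pandas as pd

products = pd.DataFrame(
    {
        "product_id": ["SC", "CB"],
        "product_name": ["Sweetcorn", "Beer"],
        "price": [3.0, 5.0],
    }
)
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "product_id": ["SC", "SC", "CB", "SC", "SC", "SC", "CB", "SC", "CB", "SC"],
    }
)

merged = pd.merge(left=transactions, right=products, on="product_id", how="left")

# Count transactions and sum prices per product in a single pass
sales = merged.groupby(["product_id", "product_name"], as_index=False).agg(
    units_sold=("transaction_id", "count"),
    total_sales=("price", "sum"),
)

print(sales)
```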


:::{seealso} Why does Pandas use `merge()` instead of `join()`?

Pandas provides both `merge()` and `join()` methods for combining DataFrames, but they serve slightly different purposes and have different use cases.

`merge()` is a more general method that allows you to specify the columns to join on, and it can perform various types of joins (inner, outer, left, right) based on the keys you provide. It's similar to SQL joins and is very flexible.

On the other hand, `join()` is primarily used for joining on the index of the DataFrames. It's a simpler and more concise method when you want to join two DataFrames based on their indices.

In this example, we used `merge()` because we wanted to join the two DataFrames based on a specific column (`product_id`) rather than their indices. This gives us more control over the join operation and allows us to specify the type of join we want (in this case, a left join).

:::
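
For comparison, here is a sketch of the same left join written with `join()`. Because `join()` matches against the index of the other DataFrame, the products table is first re-indexed by `product_id`. The tables are rebuilt with shorter names (`products`, `transactions`) so the example is self-contained.

```python
import pandas as pd

products = pd.DataFrame(
    {
        "product_id": ["SC", "CB"],
        "product_name": ["Sweetcorn", "Beer"],
        "price": [3.0, 5.0],
    }
)
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "product_id": ["SC", "SC", "CB", "SC", "SC", "SC", "CB", "SC", "CB", "SC"],
    }
)

# merge(): join on a shared column
merged = pd.merge(left=transactions, right=products, on="product_id", how="left")

# join(): match the caller's product_id column against the other table's index
# (join() performs a left join by default)
joined = transactions.join(products.set_index("product_id"), on="product_id")

print(joined)
```

Both calls should produce the same table here. Which one reads better is largely a matter of whether your key already lives in the index.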
