# Lecture 14 - Pandas Datetime, Grouping, Aggregating, Merging

Thursday 2021/10/07

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [None]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported pandas with the alias pd.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with the alias np.')

---

## 🗓️ Working with Datetime Values

You will often see date-looking strings in your data. A few examples are:

- `20210315`
- `Mar 15, 2021`
- `2020-03-15`
- `2020/3/15`

In the first part of today's lecture, we'll discuss how we can *parse* and utilize datetime values.
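
As a rough illustration, the sketch below passes each of the example strings above to `pd.to_datetime()` one at a time and lets pandas infer the format. The variable names `date_strings` and `parsed` are made up for this example.

```python
import pandas as pd

# The four example strings from the list above
date_strings = ['20210315', 'Mar 15, 2021', '2020-03-15', '2020/3/15']

# Parse each string on its own so that pandas infers the format per value
parsed = [pd.to_datetime(s) for s in date_strings]

for raw, ts in zip(date_strings, parsed):
    print(f'{raw!r:>16} -> {ts.date()}')
```

Each string becomes a `Timestamp`, regardless of how the date was originally written.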

---

### 📌 Load employees data

▶️ Run the code cell below to create a new `DataFrame` named `df_emp`.

In [None]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_emp = pd.DataFrame({
    'emp_id': [30, 40, 10, 20],
    'name': ['Talal', 'Josh', 'Anika', 'Aishani'],
    'dept': ['Finance', 'Purchase', 'Finance', 'Purchase'],
    'office_phone': ['(217)123-4500', np.nan, np.nan, '(217)987-6600'],
    'start_date': ['2017-05-01', '2018-02-01', '2020-08-01', '2019-12-01'],
    'salary': [202000, 185000, 240000, 160500]
})

# Used for intermediate checks
df_emp_backup = df_emp.copy()

df_emp

---

### 📌 Concise summary of a `DataFrame`

▶️ Run `df_emp.info()` below to see a concise summary of the `DataFrame`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

**Question**: What is the data type of the `start_date` column?

▶️ Run `str(df_emp['start_date'].dtype)` below to see the data type of the `start_date` column.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

While `object` can refer to many different types, you can safely assume that all `object` data types you see in this course refer to strings.

---

### 🎯 Mini-exercise: Parse a string column as datetime

#### 👇 Tasks

- ✔️ Parse `start_date` to a `datetime` data type.
- ✔️ Store the result to a new column named `start_date_parsed`.

#### 🚀 Hints

The code below converts `date_str` column to a `datetime`-typed column and stores the converted result to a new column named `date_parsed`.

```python
my_dataframe['date_parsed'] = pd.to_datetime(my_dataframe['date_str'])
```

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_emp

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
# Check result
tc.assertEqual(set(df_emp.columns), set(df_emp_backup.columns.tolist() + ['start_date_parsed']))
pd.testing.assert_series_equal(df_emp['start_date_parsed'].reset_index(drop=True),
                               pd.to_datetime(df_emp_backup['_'.join(['sTarT', 'DaTe']).lower()])
                                  .reset_index(drop=True),
                               check_names=False)

---

▶️ Run `str(df_emp['start_date_parsed'].dtype)` below to see the data type of the `start_date_parsed` column.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS
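
If you prefer a yes/no answer over reading the dtype string, one option is `pd.api.types.is_datetime64_any_dtype()`. The sketch below uses a small made-up `DataFrame` named `toy` (not part of the lecture data) to compare a string column with its parsed version.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'date_str': ['2021-01-15', '2021-02-20']})
toy['date_parsed'] = pd.to_datetime(toy['date_str'])

# True only for datetime-typed columns
is_dt_before = pd.api.types.is_datetime64_any_dtype(toy['date_str'])
is_dt_after = pd.api.types.is_datetime64_any_dtype(toy['date_parsed'])

print(toy.dtypes)
print(is_dt_before, is_dt_after)
```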

---

### 🎯 Mini-exercise: Drop `start_date` column *in-place*

We no longer need the `start_date` column. We'll work with the new `start_date_parsed` column from this point on.

#### 👇 Tasks

- ✔️ Drop `start_date` column from `df_emp` *in-place*.

#### 🚀 Hints

The code below drops `col1` from `my_dataframe` *in-place* without creating a new variable.

```python
my_dataframe.drop(columns=['col1'], inplace=True)
```
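
To see what *in-place* means, the sketch below compares the two ways of calling `drop()` on a made-up `DataFrame` named `toy`. The names `toy`, `dropped`, and `result` are only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Without inplace=True, drop() returns a new DataFrame and leaves `toy` unchanged
dropped = toy.drop(columns=['col1'])
cols_after_plain_drop = toy.columns.tolist()

# With inplace=True, `toy` itself is modified and the call returns None
result = toy.drop(columns=['col1'], inplace=True)
cols_after_inplace_drop = toy.columns.tolist()

print(cols_after_plain_drop)
print(cols_after_inplace_drop)
print(result)
```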

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_emp

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_emp_backup.copy()
df_check['_'.join(['sTarT', 'DaTe', 'pArSeD']).lower()] = pd.to_datetime(df_check['start_date'])
df_check = df_check.drop(columns=['start_date'])

# Check result
tc.assertEqual(set(df_emp.columns), set(['start_date_parsed', 'salary', 'office_phone', 'dept', 'name', 'emp_id']))
pd.testing.assert_frame_equal(df_emp.sort_values(df_emp.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Rename `start_date_parsed` to `start_date`

#### 👇 Tasks

- ✔️ Rename `start_date_parsed` to `start_date` in `df_emp` *in-place*.

#### 🚀 Hints

The code below renames the `name_before` column to `name_after` in `my_dataframe` *in-place* without creating a new variable.

```python
my_dataframe.rename(columns={'name_before': 'name_after'}, inplace=True)
```

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_emp

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_emp_backup.copy()
df_check['_'.join(['sTarT', 'DaTe', 'pArSeD']).lower()] = pd.to_datetime(df_check['start_date'])
df_check = df_check.drop(columns=['start_date']).rename(columns={'start_date_parsed': 'start_date'})

# Check result
pd.testing.assert_frame_equal(df_emp.sort_values(df_emp.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Extract year from a datetime column

#### 👇 Tasks

- ✔️ Create a new column named `start_year` in `df_emp` that contains the starting years in integers (e.g., `2017`, `2018`).
- ✔️ Extract values from `df_emp['start_date']`.

#### 🚀 Hints

The code below extracts the year of a datetime column `my_date` and stores it in a new column named `my_year`.

```python
my_dataframe['my_year'] = my_dataframe['my_date'].dt.year
```

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_emp

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_emp_backup.copy()
df_check['_'.join(['sTarT', 'DaTe', 'pArSeD']).lower()] = pd.to_datetime(df_check['start_date'])
df_check = df_check.drop(columns=['start_date']).rename(columns={'start_date_parsed': 'start_date'})
df_check['_'.join(['sTarT', 'yEaR']).lower()] = df_check['_'.join(['sTarT', 'dAtE']).lower()].dt.year

# Check result
pd.testing.assert_frame_equal(df_emp.sort_values(df_emp.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Extract month, day from a datetime column

#### 👇 Tasks

- ✔️ Create new columns named `start_month` and `start_day` in `df_emp` that contain the starting months and days in integers.
- ✔️ Extract values from `df_emp['start_date']`.

#### 🚀 Hints

The code below extracts the months and days of a datetime column `my_date` and stores them in two new columns.

```python
my_dataframe['my_month'] = my_dataframe['my_date'].dt.month
my_dataframe['my_day'] = my_dataframe['my_date'].dt.day
```

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

df_emp

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_emp_backup.copy()
df_check['_'.join(['sTarT', 'DaTe', 'pArSeD']).lower()] = pd.to_datetime(df_check['start_date'])
df_check = df_check.drop(columns=['start_date']).rename(columns={'start_date_parsed': 'start_date'})
df_check['_'.join(['sTarT', 'yEaR']).lower()] = df_check['_'.join(['sTarT', 'dAtE']).lower()].dt.year
df_check['_'.join(['sTarT', 'mOnTh']).lower()] = df_check['_'.join(['sTarT', 'dAtE']).lower()].dt.month
df_check['_'.join(['sTarT', 'dAy']).lower()] = df_check['_'.join(['sTarT', 'dAtE']).lower()].dt.day

# Check result
pd.testing.assert_frame_equal(df_emp.sort_values(df_emp.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

▶️ Run `df_emp['start_date'].dt.weekday` below.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

Note that `dt.weekday` returns integers, where Monday is `0` and Sunday is `6`.
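
A quick sanity check of the numbering, using three dates from the week of this lecture (Thursday 2021/10/07). The sketch also uses `dt.day_name()`, which is not covered above, to show the matching weekday names.

```python
import pandas as pd

# Monday, Thursday, and Sunday of the same week
dates = pd.Series(pd.to_datetime(['2021-10-04', '2021-10-07', '2021-10-10']))

weekday_numbers = dates.dt.weekday.tolist()
weekday_names = dates.dt.day_name().tolist()

print(weekday_numbers)
print(weekday_names)
```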

---

▶️ Run `df_emp['start_date'].dt.quarter` below.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

This returns the quarter of the year as an integer from `1` to `4` (i.e., Q1 through Q4).
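
The sketch below applies `dt.quarter` to the same four start dates used in `df_emp`. Building a `Q1`-style label and converting to a quarterly period with `dt.to_period('Q')` are optional extras shown for illustration, not steps required by this lecture.

```python
import pandas as pd

# Same start dates as in df_emp
dates = pd.Series(pd.to_datetime(['2017-05-01', '2018-02-01', '2020-08-01', '2019-12-01']))

# Integer quarter (1-4)
quarters = dates.dt.quarter

# Optional: string labels built from the integers
quarter_labels = 'Q' + quarters.astype(str)

# Optional: year and quarter combined
year_quarters = dates.dt.to_period('Q').astype(str)

print(quarters.tolist())
print(quarter_labels.tolist())
print(year_quarters.tolist())
```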

---

## 🔬 Grouping and Aggregating Data

### 📌 Load employees data (a simpler one)

▶️ Run the code cell below to create a new `DataFrame` named `df`.

In [None]:
df = pd.DataFrame({
    'name': ['Mary', 'Roy', 'John', 'Joe', 'Paul', 'Erin'],
    'dept': ['Finance', 'Purchase', 'Finance', 'Purchase', 'Finance', 'Purchase'],
    'salary': [240000, 160000, 250000, 170000, 260000, 180000]}
)

df

👉 A very common task in working with data is to *group* your data by one or more criteria. As an example, you may want to see the average salary of employees in `df` **by department**.

---

### 📌 Creating a `DataFrameGroupBy` object

▶️ Run `df.groupby('dept')` below.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

![groupby object](https://github.com/bdi475/images/blob/main/pandas/df-groupby-object-01.png?raw=true)
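
A `DataFrameGroupBy` object does not display any rows by itself. As a rough sketch, you can peek inside it with `size()` (rows per group) and `get_group()` (rows of one group). The name `df_demo` is made up here and holds the same data as `df`.

```python
import pandas as pd

# Same data as `df` above, rebuilt under a hypothetical name
df_demo = pd.DataFrame({
    'name': ['Mary', 'Roy', 'John', 'Joe', 'Paul', 'Erin'],
    'dept': ['Finance', 'Purchase', 'Finance', 'Purchase', 'Finance', 'Purchase'],
    'salary': [240000, 160000, 250000, 170000, 260000, 180000]
})

grouped = df_demo.groupby('dept')

# Number of rows in each group
group_sizes = grouped.size()

# All rows that belong to the Finance group
finance_rows = grouped.get_group('Finance')

print(group_sizes)
print(finance_rows)
```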

---

### 📌 Aggregating a `DataFrameGroupBy` object

▶️ Run `df.groupby('dept').agg({'salary': 'mean'})` below.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

👉 Your resulting `DataFrame` now displays average salary by `dept`.

```python
df_salary_by_dept = df.groupby('dept').agg({'salary': 'mean'})

display(df_salary_by_dept)
print(df_salary_by_dept.columns)
```

▶️ Copy the provided code above to the code cell below and run it.

In [None]:
# YOUR CODE BEGINS




# YOUR CODE ENDS

👉 There is only one column shown when you print out `df_salary_by_dept.columns`! 🙀

This is because the column(s) you use to create groups are used as **index** by default.

![groupby agg result](https://github.com/bdi475/images/blob/main/pandas/df-groupby-agg-as-index-true-01.png?raw=true)
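
One way to turn the index back into a regular column is to call `reset_index()` on the aggregated result. The sketch below uses a hypothetical copy of the data named `df_demo`.

```python
import pandas as pd

# Same data as `df` above, rebuilt under a hypothetical name
df_demo = pd.DataFrame({
    'name': ['Mary', 'Roy', 'John', 'Joe', 'Paul', 'Erin'],
    'dept': ['Finance', 'Purchase', 'Finance', 'Purchase', 'Finance', 'Purchase'],
    'salary': [240000, 160000, 250000, 170000, 260000, 180000]
})

# `dept` becomes the index of the aggregated result
by_dept = df_demo.groupby('dept').agg({'salary': 'mean'})
print(by_dept.index.tolist())
print(by_dept.columns.tolist())

# reset_index() moves `dept` back into a regular column
flat = by_dept.reset_index()
print(flat.columns.tolist())
```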

---

### 📌 Aggregating a `DataFrameGroupBy` object with optional `as_index=False`

```python
df_salary_by_dept2 = df.groupby('dept', as_index=False).agg({'salary': 'mean'})

display(df_salary_by_dept2)
print(df_salary_by_dept2.columns)
```

▶️ Copy the provided code to the code cell below and run it.

In [None]:
# YOUR CODE BEGINS




# YOUR CODE ENDS

👉 Now, printing out the columns shows both `dept` and `salary`. Supplying `as_index=False` to `groupby()` keeps the columns you use as groupby criteria as regular columns.

![groupby agg result](https://github.com/bdi475/images/blob/main/pandas/df-groupby-agg-as-index-false-01.png?raw=true)

---

### 📌 Creating multiple aggregation measures

```python
df_salary_by_dept3 = df.groupby('dept', as_index=False).agg({'salary': ['min', 'max', 'mean', 'sum', 'count', 'std']})

display(df_salary_by_dept3)
print(df_salary_by_dept3.columns)
```

▶️ Copy the provided code to the code cell below and run it.

In [None]:
df_salary_by_dept3 = df.groupby('dept', as_index=False).agg({'salary': ['min', 'max', 'mean', 'sum', 'count', 'std']})

display(df_salary_by_dept3)
print(df_salary_by_dept3.columns)

---

## 📞 Exercises Using Bank Marketing Calls Data

For the next part of this lecture, you'll work with a dataset related to the direct marketing campaigns (phone calls) of a banking institution.

**Data Source**: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

| Column Name     | Type        | Description                                                                                                                                                     |
|-----------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `age`           | Numeric     | Age                                                                                                                                                             |
| `job`           | Categorical | 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown' |
| `marital`       | Categorical | 'single', 'married', 'divorced', 'unknown' |
| `education`     | Categorical | 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown' |
| `contact_type`  | Categorical | 'cellular', 'telephone' |
| `num_contacts`  | Numeric     | Number of contacts performed during this campaign for this client |
| `prev_outcome`  | Categorical | Outcome of the previous marketing campaign - 'failure', 'nonexistent', 'success' |
| `place_deposit` | Boolean     | Did the client subscribe to a term deposit? This column indicates whether the campaign was successful for each client. |

---


Your goal is to analyze the dataset to discover relationships between each individual's personal factors and the result of the marketing campaign.

**`place_deposit`** column indicates whether a marketing campaign was successful.

- ✅ If `True`, the individual has placed a deposit within the bank. This is considered a **successful campaign**.
- 🚫 If `False`, the individual has not placed a deposit within the bank. This is considered an **unsuccessful campaign**.

---

### 📌 Load data

▶️ Run the code cell below to create a new `DataFrame` named `df_m`.

In [None]:
df_m = pd.read_csv('https://github.com/bdi475/datasets/raw/main/bank-direct-marketing.csv')
df_m_backup = df_m.copy()
df_m

---

### 🎯 Mini-exercise: Summing a boolean column

#### 👇 Tasks

- ✔️ Using `df_m`, sum up all values in `place_deposit` column.
- ✔️ Store the result to a new variable named `num_success`.

#### 🚀 Hints

`my_series.sum()` sums up all values in a `Series`.

To sum up `my_column` of `my_dataframe`, use `my_dataframe['my_column'].sum()`.

👉 **Wait, how can you add boolean values?** Boolean `True` values are converted to `1` and `False` values are converted to `0` before running arithmetic operations. This is common in many programming languages.
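
A tiny sketch of this behavior on a made-up boolean `Series` named `flags`. Since `True` counts as `1`, `sum()` gives the number of `True` values and `mean()` gives their share.

```python
import pandas as pd

# Hypothetical boolean data, used only for this illustration
flags = pd.Series([True, False, True, True])

total_true = flags.sum()   # number of True values
share_true = flags.mean()  # fraction of True values

print(total_true)
print(share_true)
```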

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

num_success

In [None]:
tc.assertEqual(num_success, np.sum(df_m['_'.join(['pLaCe', 'dePoSIt']).lower()]))

---

### 🎯 Mini-exercise: Marketing success by marital status

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_marital` to display whether there is a difference in average success rate in direct marketing campaigns by marital status.
    - Use the `as_index=False` option.
- ✔️ We will give you the code for this one.

![code](https://github.com/bdi475/images/blob/main/lecture-notes/df_m-groupby-marital-status-average-success.png?raw=true)

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_marital

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_m_backup.groupby('marital').agg({'place_deposit': np.mean}).reset_index()
df_check = df_check[['place_deposit', 'marital'][::-1]].copy()

# Check result
pd.testing.assert_frame_equal(df_by_marital.sort_values(df_by_marital.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Marketing success by job

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_job` to display whether there is a difference in average success rate in direct marketing campaigns by job.
    - Use the `as_index=False` option.
- ✔️ `df_by_job` should only have the following two columns in the same order.
    - `job`: Job type (e.g., admin., blue-collar, technician)
    - `place_deposit`: Average success ratio (between 0 and 1)
- ✔️ Neither column should be used as an index column.
    - Printing `df_by_job.columns.tolist()` should print out `['job', 'place_deposit']`.
- ✔️ Sort `df_by_job` by `place_deposit` in descending order *in-place*.

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

df_by_job

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
# DO NOT CHANGE THE CODE BELOW
df_check = df_m_backup.groupby('job').agg({'place_deposit': np.mean}).reset_index()
df_check = df_check.sort_values('place_deposit', ascending=False)
df_check = df_check[['place_deposit', 'job'][::-1]].copy()

pd.testing.assert_frame_equal(df_by_job.reset_index(drop=True),
                              df_check.reset_index(drop=True))

---

### 📌 Marketing success by job with more details

![code](https://github.com/bdi475/images/blob/main/lecture-notes/df_m-groupby-job-success-details.png?raw=true)

▶️ Copy the code above to the code cell below and run your code.

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

---

## 🧲 Merging two DataFrames (Joins)

Another common operation with tables is to merge two or more tables into one larger table.

To demonstrate how merging works, we'll work with a record of transactions from a small food stand selling only two items: sweetcorn 🌽 and beer 🍺. The tables associated with the food stand's transactions are shown below.

### Products (`df_products`)

| product_id | product_name | price |
|---|---|---|
| SC | Sweetcorn | 3.0 |
| CB | Beer | 5.0 |

### Transactions (`df_transactions`)

| transaction_id | product_id |
|---|---|
| 1 | SC |
| 2 | SC |
| 3 | CB |
| 4 | SC |
| 5 | SC |
| 6 | SC |
| 7 | CB |
| 8 | SC |
| 9 | CB |
| 10 | SC |

▶️ Run the code below to create the two tables as DataFrames.

In [None]:
# DO NOT CHANGE THE CODE BELOW
df_products = pd.DataFrame({
    'product_id': ['SC', 'CB'],
    'product_name': ['Sweetcorn', 'Beer'],
    'price': [3.0, 5.0]
})

df_transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'product_id': ['SC', 'SC', 'CB', 'SC', 'SC', 'SC', 'CB', 'SC', 'CB', 'SC']
})

df_products_backup = df_products.copy()
df_transactions_backup = df_transactions.copy()

display(df_products)
display(df_transactions)

---

### 🎯 Mini-exercise: Merge products into transactions

#### 👇 Tasks

- ✔️ Using `df_products` and `df_transactions`, create a merged table as shown below.
- ✔️ Use a left merge.
- ✔️ Name the merged DataFrame `df_merged`.

#### 🚀 Hints

The code below merges `right_dataframe` into `left_dataframe` using `shared_key_column`. The merge type is a left merge.

```python
merged_dataframe = pd.merge(
    left=left_dataframe,
    right=right_dataframe,
    on='shared_key_column',
    how='left'
)
```
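
A left merge keeps every row of the left table, even when no matching key exists in the right table. The sketch below contrasts it with an inner merge, using made-up tables (`left`, `right`) where the key `'Z'` has no match.

```python
import pandas as pd

# Hypothetical tables, used only for this illustration
left = pd.DataFrame({
    'order_id': [1, 2, 3],
    'item_id': ['A', 'B', 'Z']   # 'Z' does not exist in `right`
})
right = pd.DataFrame({
    'item_id': ['A', 'B'],
    'item_name': ['Apple', 'Banana']
})

# Left merge: all three rows of `left` survive; the unmatched row gets NaN
left_merged = pd.merge(left=left, right=right, on='item_id', how='left')

# Inner merge: only rows with a match in both tables survive
inner_merged = pd.merge(left=left, right=right, on='item_id', how='inner')

print(left_merged)
print(inner_merged)
```

When every key on the left has a match on the right, as in the food stand tables, both merge types return the same rows.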

#### 🧭 Expected Output of `df_merged`

|  | transaction_id | product_id | product_name | price |
|---|---|---|---|---|
| 0 | 1 | SC | Sweetcorn | 3.0 |
| 1 | 2 | SC | Sweetcorn | 3.0 |
| 2 | 3 | CB | Beer | 5.0 |
| 3 | 4 | SC | Sweetcorn | 3.0 |
| 4 | 5 | SC | Sweetcorn | 3.0 |
| 5 | 6 | SC | Sweetcorn | 3.0 |
| 6 | 7 | CB | Beer | 5.0 |
| 7 | 8 | SC | Sweetcorn | 3.0 |
| 8 | 9 | CB | Beer | 5.0 |
| 9 | 10 | SC | Sweetcorn | 3.0 |

In [None]:
# YOUR CODE BEGINS






# YOUR CODE ENDS

display(df_merged)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
# DO NOT CHANGE THE CODE BELOW
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup,
    on= '_'.join(['product', 'id']),
    how='left'
).sort_values('_'.join(['transaction', 'id']))

pd.testing.assert_frame_equal(df_merged.reset_index(drop=True),
                              df_merged_SOL.reset_index(drop=True))

---

### 🎯 Mini-exercise: Total sales by product

#### 👇 Tasks

- ✔️ Using `df_merged` from the previous exercise, find the total sales by product.
- ✔️ Store the grouped result (a DataFrame) to `df_sales_by_product`.
- ✔️ Use the `groupby()` method on the `product_id` column.
- ✔️ `df_sales_by_product` should contain flat-level columns.
    - Printing `df_sales_by_product.columns.to_list()` should print out `['product_id', 'price']`.

#### 🧭 Expected Output of `df_sales_by_product`

|  | product_id | price |
|---|---|---|
| 0 | CB | 15.0 |
| 1 | SC | 21.0 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_sales_by_product)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
# DO NOT CHANGE THE CODE BELOW
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup,
    on='product_id',
    how='left'
)
df_sales_by_product_SOL = df_merged_SOL.groupby('PRODUCT_ID'.lower()).agg({'price': np.sum}).reset_index()

pd.testing.assert_frame_equal(df_sales_by_product, df_sales_by_product_SOL)

---

### 🎯 Challenge of the day: Total sales by product (OPTIONAL)

⚠️ This challenge is a chance to showcase your problem-solving abilities. The code examples used in the lecture may not be sufficient to perform this task.

#### 👇 Tasks

- ✔️ Using `df_merged` from the previous exercise, find the total sales by product.
- ✔️ Store the grouped result (a DataFrame) to `df_sales_by_id_name`.
    - This time, include the `product_name` information in addition to the `product_id` column.
- ✔️ Use the `groupby()` method.
- ✔️ `df_sales_by_id_name` should contain flat-level columns in the order shown below.
    - Printing `df_sales_by_id_name.columns.to_list()` should print out `['product_id', 'product_name', 'price']`.

#### 🧭 Expected Output of `df_sales_by_id_name`

|  | product_id | product_name | price |
|---|---|---|---|
| 0 | CB | Beer | 15.0 |
| 1 | SC | Sweetcorn | 21.0 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_sales_by_id_name)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
# DO NOT CHANGE THE CODE BELOW
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup,
    on='product_id',
    how='left'
)

df_sales_by_id_name_SOL = df_merged_SOL.groupby(['product_id', 'product_name'], as_index=False).agg({'price': 'sum'})


pd.testing.assert_frame_equal(df_sales_by_id_name, df_sales_by_id_name_SOL)