# Lecture 13 - Aggregations and Intro to Merges

Tuesday 2022/03/01

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [None]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [None]:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')

---

## 📞 Exercises Using Bank Marketing Calls Data

For the next part of this lecture, you'll work with a dataset related to the direct marketing campaigns (phone calls) of a banking institution.

**Data Source**: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

| Column Name     | Type        | Description                                                                                                                                                     |
|-----------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `age`           | Numeric     | Age                                                                                                                                                             |
| `job`           | Categorical | 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown' |
| `marital`       | Categorical | 'single', 'married', 'divorced', 'unknown' |
| `education`     | Categorical | 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown' |
| `contact_type`  | Categorical | 'cellular', 'telephone' |
| `num_contacts`  | Numeric     | Number of contacts performed during this campaign for this client                                                                                               |
| `prev_outcome`  | Categorical | Outcome of the previous marketing campaign - 'failure', 'nonexistent',   'success'                                                                              |
| `place_deposit` | Numeric     | Did the client subscribe to a term deposit? This column indicates whether the campaign was successful (1) or not (0) for each client.                                        |

---


Your goal is to analyze the dataset to discover relationships between each individual's personal factors and the result of the marketing campaign for that individual.

The **`place_deposit`** column indicates whether a marketing campaign was successful.

- ✅ If `1`, the individual has placed a deposit within the bank. This is considered a **successful campaign**.
- 🚫 If `0`, the individual has not placed a deposit within the bank. This is considered an **unsuccessful campaign**.

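Because `place_deposit` holds only `0` and `1`, its average equals the share of successful campaigns. The minimal sketch below illustrates this with a hypothetical five-row Series (the values are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data: 1 = deposit placed, 0 = no deposit
outcomes = pd.Series([1, 0, 0, 1, 0], name='place_deposit')

successes = outcomes.sum()      # number of 1s
total = outcomes.count()        # number of non-missing rows
success_rate = outcomes.mean()  # same value as successes / total

print(successes, total, success_rate)
```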
---

### 📌 Load data

▶️ Run the code cell below to create a new `DataFrame` named `df_m`.

In [4]:
df_m = pd.read_csv('bank-direct-marketing.csv')
df_m_backup = df_m.copy()
df_m

Unnamed: 0,age,job,marital,education,contact_type,num_contacts,prev_outcome,place_deposit
0,56,housemaid,married,basic.4y,telephone,1,nonexistent,0
1,57,services,married,high.school,telephone,1,nonexistent,0
2,37,services,married,high.school,telephone,1,nonexistent,0
3,40,admin.,married,basic.6y,telephone,1,nonexistent,0
4,56,services,married,high.school,telephone,1,nonexistent,0
...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,cellular,1,nonexistent,1
41184,46,blue-collar,married,professional.course,cellular,1,nonexistent,0
41185,56,retired,married,university.degree,cellular,2,nonexistent,0
41186,44,technician,married,professional.course,cellular,1,nonexistent,1


---

### 🎯 Challenge 1: Marketing success rate by marital status

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_marital` that lists the success rate (average of the `place_deposit` column) by marital status.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_marital` should have only the following two columns, in this order.
    - `marital`: Marital status (e.g., single, divorced, married, unknown)
    - `success_rate`: Average success rate (between 0-1)
- ✔️ Neither column should be used as the index.
    - Printing `df_by_marital.columns.to_list()` should print out `['marital', 'success_rate']`.
- ✔️ Sort `df_by_marital` by `success_rate` in descending order *in-place*.

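As a rough illustration of what `as_index=False` and an in-place sort do, here is a small sketch on a hypothetical toy DataFrame (the `group` and `flag` columns are invented for this example). Note that sorting keeps the original row labels, which is why a sorted result can show index values out of order.

```python
import pandas as pd

# Hypothetical toy data, not the bank marketing dataset
toy = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b'],
    'flag': [1, 1, 0, 1],
})

# Default behavior: the grouping column becomes the index
with_index = toy.groupby('group').agg({'flag': 'mean'})

# as_index=False: the grouping column stays a regular column
flat = toy.groupby('group', as_index=False).agg({'flag': 'mean'})

print(with_index.columns.to_list())
print(flat.columns.to_list())

# In-place sort: rows are reordered, but each row keeps its original label
flat.sort_values('flag', ascending=False, inplace=True)
print(flat.index.to_list())
```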
#### 🔑 Expected Output

|    | marital   |   success_rate |
|---:|:----------|---------------:|
|  3 | unknown   |       0.15     |
|  2 | single    |       0.140041 |
|  0 | divorced  |       0.103209 |
|  1 | married   |       0.101573 |

In [5]:
### BEGIN SOLUTION
df_by_marital = df_m.groupby('marital', as_index=False).agg({'place_deposit': 'mean'})
df_by_marital.rename(columns={'place_deposit': 'success_rate'}, inplace=True)
df_by_marital.sort_values('success_rate', ascending=False, inplace=True)
### END SOLUTION

df_by_marital

Unnamed: 0,marital,success_rate
3,unknown,0.15
2,single,0.140041
0,divorced,0.103209
1,married,0.101573


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [6]:
# Challenge 1 Autograder
df_check = df_m_backup.groupby('marital', as_index=bool(0)).agg({'place_deposit': np.mean}) \
    .rename(columns={'_'.join(['place', 'deposit']): 'success_rate'}) \
    .sort_values('success_rate').iloc[::-1]
df_check = df_check[['success_rate', 'marital'][::-1]].copy()

# Check result
pd.testing.assert_frame_equal(df_by_marital.reset_index(drop=True),
                              df_check.reset_index(drop=True))

---

### 🎯 Challenge 2: Marketing success rate by job

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_job` that lists the average success rate in direct marketing campaigns by job.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_job` should only have the following two columns in the same order.
    - `job`: Job (e.g., student, technician, housemaid, etc)
    - `success_rate`: Average success rate (between 0-1)
- ✔️ Neither column should be used as the index.
    - Printing `df_by_job.columns.to_list()` should print out `['job', 'success_rate']`.
- ✔️ Sort `df_by_job` by `success_rate` in descending order.

#### 🔑 Expected Output

|    | job           |   success_rate |
|---:|:--------------|---------------:|
|  8 | student       |      0.314286  |
|  5 | retired       |      0.252326  |
| 10 | unemployed    |      0.142012  |
|  0 | admin.        |      0.129726  |
|  4 | management    |      0.112175  |
| 11 | unknown       |      0.112121  |
|  9 | technician    |      0.10826   |
|  6 | self-employed |      0.104856  |
|  3 | housemaid     |      0.1       |
|  2 | entrepreneur  |      0.0851648 |
|  7 | services      |      0.0813807 |
|  1 | blue-collar   |      0.0689432 |

In [7]:
### BEGIN SOLUTION
df_by_job = df_m.groupby('job', as_index=False).agg({'place_deposit': 'mean'})
df_by_job.rename(columns={'place_deposit': 'success_rate'}, inplace=True)
df_by_job.sort_values('success_rate', ascending=False, inplace=True)
### END SOLUTION

df_by_job

Unnamed: 0,job,success_rate
8,student,0.314286
5,retired,0.252326
10,unemployed,0.142012
0,admin.,0.129726
4,management,0.112175
11,unknown,0.112121
9,technician,0.10826
6,self-employed,0.104856
3,housemaid,0.1
2,entrepreneur,0.085165


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [8]:
# Challenge 2 Autograder
df_check = df_m_backup.groupby('job').agg({'place_deposit': np.mean}).reset_index() \
    .rename(columns={'_'.join(['place', 'deposit']): 'success_rate'}) \
    .sort_values('success_rate').iloc[::-1]
df_check = df_check[['success_rate', 'job'][::-1]].copy()

pd.testing.assert_frame_equal(df_by_job.reset_index(drop=True),
                              df_check.reset_index(drop=True))

---

### 🎯 Challenge 3: Marketing success rate by contact type with count

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_contact_type` that lists the number of campaigns and the average success rate by contact type.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_contact_type` should only have the following three columns in the same order.
    - `contact_type`: Contact method (e.g., cellular, telephone)
    - `count`: Number of potential customers that were contacted using the corresponding contact type
    - `success_rate`: Average success rate (between 0-1)
- ✔️ None of the three columns should be used as the index.
    - Printing `df_by_contact_type.columns.to_list()` should print out `['contact_type', 'count', 'success_rate']`.
- ✔️ Sort `df_by_contact_type` by `success_rate` in descending order.

#### 🔑 Expected Output

|    | contact_type   |   count |   success_rate |
|---:|:---------------|--------:|---------------:|
|  0 | cellular       |   26144 |      0.147376  |
|  1 | telephone      |   15044 |      0.0523132 |

In [9]:
### BEGIN SOLUTION
df_by_contact_type = df_m.groupby('contact_type', as_index=False).agg({
    'place_deposit': ['count', 'mean']
})

# flatten columns
df_by_contact_type.columns = ['contact_type', 'count', 'success_rate']

df_by_contact_type.sort_values('success_rate', ascending=False, inplace=True)
### END SOLUTION
df_by_contact_type

Unnamed: 0,contact_type,count,success_rate
0,cellular,26144,0.147376
1,telephone,15044,0.052313

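Passing a list of functions to `agg()` produces two-level (MultiIndex) columns, which is why the solution flattens them by assigning new column names. The sketch below shows this on a hypothetical toy DataFrame (the `channel` and `flag` columns are invented here), along with pandas named aggregation as one possible alternative that yields flat columns directly. The alternative is offered as an option, not as the approach this exercise requires.

```python
import pandas as pd

# Hypothetical toy data, not the bank marketing dataset
toy = pd.DataFrame({
    'channel': ['x', 'x', 'y'],
    'flag': [1, 0, 1],
})

# A list of functions yields two-level column labels
two_level = toy.groupby('channel', as_index=False).agg({'flag': ['count', 'mean']})
print(type(two_level.columns).__name__)

# Named aggregation: output_name=(input_column, function)
named = toy.groupby('channel', as_index=False).agg(
    count=('flag', 'count'),
    success_rate=('flag', 'mean'),
)
print(named.columns.to_list())
```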

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [10]:
# Challenge 3 Autograder
df_check = df_m_backup.groupby('_'.join(['contact', 'type'])).agg({'place_deposit': ['count', np.mean]}).reset_index()
df_check.columns = ['SUCCESS_RATE'.lower(), 'COUNT'.lower(), 'CONTACT_TYPE'.lower()][::-1]
df_check = df_check \
    .sort_values('success_rate').iloc[::-1]

pd.testing.assert_frame_equal(df_by_contact_type.reset_index(drop=True),
                              df_check.reset_index(drop=True))

---

### 📌 Grouping by multiple columns

You can run a `groupby()` with two or more columns by supplying a `list` of columns to the `groupby()` function.

```python
# example
df.groupby(['column1', 'column2'], as_index=False).agg({ 'column3': ... })
```
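To make the pattern above concrete, here is a minimal runnable sketch using a hypothetical toy DataFrame (the `region`, `channel`, and `sale` columns are made up for illustration). Each distinct combination of the two grouping columns that appears in the data becomes one row of the result.

```python
import pandas as pd

# Hypothetical toy data, not the bank marketing dataset
toy = pd.DataFrame({
    'region': ['east', 'east', 'west', 'west', 'west'],
    'channel': ['web', 'web', 'web', 'store', 'store'],
    'sale': [10, 20, 5, 7, 3],
})

# Group by two columns at once by passing a list
by_region_channel = toy.groupby(['region', 'channel'], as_index=False).agg({'sale': 'sum'})
print(by_region_channel)
```

Only three rows come back here because the combination east/store never occurs in the toy data.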

---

### 🎯 Challenge 4: Marketing success rate by marital status and contact type

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_type_and_marital` that lists the average success rate in direct marketing campaigns by marital status and contact type.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_type_and_marital` should only have the following three columns in the same order.
    - `marital`: Marital status (e.g., single, divorced, married, unknown)
    - `contact_type`: Contact method (e.g., cellular, telephone)
    - `success_rate`: Average success rate (between 0-1)
- ✔️ None of the three columns should be used as the index.
    - Printing `df_by_type_and_marital.columns.to_list()` should print out `['marital', 'contact_type', 'success_rate']`.
- ✔️ Sort `df_by_type_and_marital` by `success_rate` in descending order.

#### 🔑 Expected Output

|    | marital   | contact_type   |   success_rate |
|---:|:----------|:---------------|---------------:|
|  6 | unknown   | cellular       |      0.207547  |
|  4 | single    | cellular       |      0.173875  |
|  0 | divorced  | cellular       |      0.13652   |
|  2 | married   | cellular       |      0.135341  |
|  5 | single    | telephone      |      0.0648844 |
|  3 | married   | telephone      |      0.0487554 |
|  1 | divorced  | telephone      |      0.0463615 |
|  7 | unknown   | telephone      |      0.037037  |

In [11]:
### BEGIN SOLUTION
df_by_type_and_marital = df_m.groupby(['marital', 'contact_type'], as_index=False).agg({
    'place_deposit': 'mean'
})
df_by_type_and_marital.rename(columns={'place_deposit': 'success_rate'}, inplace=True)
df_by_type_and_marital.sort_values('success_rate', ascending=False, inplace=True)
### END SOLUTION

df_by_type_and_marital

Unnamed: 0,marital,contact_type,success_rate
6,unknown,cellular,0.207547
4,single,cellular,0.173875
0,divorced,cellular,0.13652
2,married,cellular,0.135341
5,single,telephone,0.064884
3,married,telephone,0.048755
1,divorced,telephone,0.046362
7,unknown,telephone,0.037037


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [12]:
# Challenge 4 Autograder
df_check = df_m_backup.groupby(['MARITAL'.lower(), '_'.join(['contact', 'type'])]) \
    .agg({'place_deposit': np.mean}).reset_index()
df_check.rename(columns={'place_deposit': '_'.join(['SUCCESS', 'RATE']).lower()}, inplace=bool(1))
df_check = df_check \
    .sort_values('success_rate').iloc[::-1]

pd.testing.assert_frame_equal(df_by_type_and_marital.reset_index(drop=True),
                              df_check.reset_index(drop=True))

---

## 🧲 Merging two DataFrames (Joins)

Another common operation with tables is to merge two or more tables into one larger table.

To demonstrate how merging works, we'll work with a record of transactions from a small food stand selling only two items: sweetcorn 🌽 and beer 🍺. The tables associated with the food stand's transactions are shown below.

### Products (`df_products`)

| product_id | product_name | price |
|---|---|---|
| SC | Sweetcorn | 3.0 |
| CB | Beer | 5.0 |

### Transactions (`df_transactions`)

| transaction_id | product_id |
|---|---|
| 1 | SC |
| 2 | SC |
| 3 | CB |
| 4 | SC |
| 5 | SC |
| 6 | SC |
| 7 | CB |
| 8 | SC |
| 9 | CB |
| 10 | SC |

▶️ Run the code below to create the two tables as DataFrames.

In [13]:
# DO NOT CHANGE THE CODE BELOW
df_products = pd.DataFrame({
    'product_id': ['SC', 'CB'],
    'product_name': ['Sweetcorn', 'Beer'],
    'price': [3.0, 5.0]
})

df_transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'product_id': ['SC', 'SC', 'CB', 'SC', 'SC', 'SC', 'CB', 'SC', 'CB', 'SC']
})

df_products_backup = df_products.copy()
df_transactions_backup = df_transactions.copy()

display(df_products)
display(df_transactions)

Unnamed: 0,product_id,product_name,price
0,SC,Sweetcorn,3.0
1,CB,Beer,5.0


Unnamed: 0,transaction_id,product_id
0,1,SC
1,2,SC
2,3,CB
3,4,SC
4,5,SC
5,6,SC
6,7,CB
7,8,SC
8,9,CB
9,10,SC


---

### 🎯 Challenge 5: Merge products into transactions

#### 👇 Tasks

- ✔️ Using `df_products` and `df_transactions`, create a merged table as shown below.
- ✔️ Use a left merge.
- ✔️ Name the merged DataFrame `df_merged`.

#### 🚀 Hints

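The code below merges `right_dataframe` into `left_dataframe` on `shared_key_column`. Because `how='left'`, the result is a left merge: every row of `left_dataframe` is kept, and matching columns from `right_dataframe` are attached to it.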

```python
merged_dataframe = pd.merge(
    left=left_dataframe,
    right=right_dataframe,
    on='shared_key_column',
    how='left'
)
```
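The sketch below contrasts a left merge with an inner merge on a hypothetical pair of toy tables (`orders` and `items` are invented for this illustration and are not the food stand tables). The two differ only when a key in the left table has no match in the right table.

```python
import pandas as pd

# Hypothetical toy tables; item 'Z' has no match in `items`
orders = pd.DataFrame({'order_id': [1, 2, 3], 'item_id': ['A', 'B', 'Z']})
items = pd.DataFrame({'item_id': ['A', 'B'], 'item_name': ['Apple', 'Bread']})

# Left merge keeps every row of `orders`; unmatched rows get NaN
left_merged = pd.merge(left=orders, right=items, on='item_id', how='left')

# Inner merge keeps only rows whose key appears in both tables
inner_merged = pd.merge(left=orders, right=items, on='item_id', how='inner')

print(len(left_merged), len(inner_merged))
print(left_merged['item_name'].isna().sum())
```

When every key in the left table has a match, as in the food stand data, the two merge types return the same rows.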

#### 🧭 Expected Output of `df_merged`

|  | transaction_id | product_id | product_name | price |
|---|---|---|---|---|
| 0 | 1 | SC | Sweetcorn | 3.0 |
| 1 | 2 | SC | Sweetcorn | 3.0 |
| 2 | 3 | CB | Beer | 5.0 |
| 3 | 4 | SC | Sweetcorn | 3.0 |
| 4 | 5 | SC | Sweetcorn | 3.0 |
| 5 | 6 | SC | Sweetcorn | 3.0 |
| 6 | 7 | CB | Beer | 5.0 |
| 7 | 8 | SC | Sweetcorn | 3.0 |
| 8 | 9 | CB | Beer | 5.0 |
| 9 | 10 | SC | Sweetcorn | 3.0 |

In [14]:
### BEGIN SOLUTION
df_merged = pd.merge(
    left=df_transactions,
    right=df_products,
    on='product_id',
    how='left'
)
### END SOLUTION

display(df_merged)

Unnamed: 0,transaction_id,product_id,product_name,price
0,1,SC,Sweetcorn,3.0
1,2,SC,Sweetcorn,3.0
2,3,CB,Beer,5.0
3,4,SC,Sweetcorn,3.0
4,5,SC,Sweetcorn,3.0
5,6,SC,Sweetcorn,3.0
6,7,CB,Beer,5.0
7,8,SC,Sweetcorn,3.0
8,9,CB,Beer,5.0
9,10,SC,Sweetcorn,3.0


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [15]:
# Challenge 5 Autograder
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup,
    on= '_'.join(['product', 'id']),
    how='inner'
).sort_values('_'.join(['transaction', 'id']))

pd.testing.assert_frame_equal(df_merged.reset_index(drop=True),
                              df_merged_SOL.reset_index(drop=True))

---

### 🎯 Challenge 6: Total sales by product

#### 👇 Tasks

- ✔️ Using `df_merged` from the previous exercise, find the total sales by product.
- ✔️ Store the grouped result (a DataFrame) to `df_sales_by_product`.
- ✔️ Use the `groupby()` method on the `product_id` column.
- ✔️ `df_sales_by_product` should contain flat-level columns.
    - Printing `df_sales_by_product.columns.to_list()` should print out `['product_id', 'price']`.

#### 🧭 Expected Output of `df_sales_by_product`

|  | product_id | price |
|---|---|---|
| 0 | CB | 15.0 |
| 1 | SC | 21.0 |

In [16]:
### BEGIN SOLUTION
df_sales_by_product = df_merged.groupby('product_id', as_index=False).agg({ 'price': 'sum' })
### END SOLUTION

display(df_sales_by_product)

Unnamed: 0,product_id,price
0,CB,15.0
1,SC,21.0


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [17]:
# Challenge 6 Autograder
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup,
    on='product_id',
    how='left'
)
df_sales_by_product_SOL = df_merged_SOL.groupby('PRODUCT_ID'.lower()).agg({'price': np.sum}).reset_index()

pd.testing.assert_frame_equal(df_sales_by_product, df_sales_by_product_SOL)

---

### 🎯 Challenge 7: Total sales by product ID and name

#### 👇 Tasks

- ✔️ Using `df_merged` from Challenge 5, find the total sales by product.
- ✔️ Store the grouped result (a DataFrame) to `df_sales_by_id_name`.
    - This time, include the `product_name` information in addition to the `product_id` column.
- ✔️ Use the `groupby()` method.
- ✔️ `df_sales_by_id_name` should contain flat-level columns in the order shown below.
    - Printing `df_sales_by_id_name.columns.to_list()` should print out `['product_id', 'product_name', 'price']`.

#### 🧭 Expected Output of `df_sales_by_id_name`

|  | product_id | product_name | price |
|---|---|---|---|
| 0 | CB | Beer | 15.0 |
| 1 | SC | Sweetcorn | 21.0 |

In [18]:
### BEGIN SOLUTION
df_sales_by_id_name = df_merged.groupby(['product_id', 'product_name'], as_index=False).agg({'price': 'sum'})
### END SOLUTION

display(df_sales_by_id_name)

Unnamed: 0,product_id,product_name,price
0,CB,Beer,15.0
1,SC,Sweetcorn,21.0


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [19]:
# Challenge 7 Autograder
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup,
    on='product_id',
    how='left'
)

df_sales_by_id_name_SOL = df_merged_SOL.groupby(['product_id', 'product_name'], as_index=False).agg({'price': 'sum'})

pd.testing.assert_frame_equal(df_sales_by_id_name, df_sales_by_id_name_SOL)