# Pandas Missing Values, Datetime, Aggregation, and Merging


▶️ Import `pandas` and `numpy`.


In [2]:
import pandas as pd
import numpy as np

### 📌 Load employees data

▶️ Run the code cell below to create a new `DataFrame` named `df_emp`.


In [None]:
df_emp = pd.DataFrame(
    {
        "emp_id": [30, 40, 10, 20],
        "name": ["Toby", "Jim", "Pam", "Kelly"],
        "dept": ["HR", "Sales", "Sales", "Customer Service"],
        "office_phone": ["(217)123-4500", np.nan, np.nan, "(217)987-6600"],
        "start_date": ["2017-05-01", "2018-02-01", "2020-08-01", "2019-12-01"],
        "salary": [202000, 185000, 240000, 160500],
    }
)

# Create a backup copy of the original DataFrame
df_emp_backup = df_emp.copy()

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Toby,HR,(217)123-4500,2017-05-01,202000
1,40,Jim,Sales,,2018-02-01,185000
2,10,Pam,Sales,,2020-08-01,240000
3,20,Kelly,Customer Service,(217)987-6600,2019-12-01,160500


▶️ Run `df_emp.info()` for a summary of the DataFrame, including the data types of each column and the number of non-null entries.


In [4]:
df_emp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   emp_id        4 non-null      int64 
 1   name          4 non-null      object
 2   dept          4 non-null      object
 3   office_phone  2 non-null      object
 4   start_date    4 non-null      object
 5   salary        4 non-null      int64 
dtypes: int64(2), object(4)
memory usage: 324.0+ bytes


---

## 🗓️ Working with Datetime Values

Datetime values are common in datasets. They can represent dates, times, or both. Pandas provides powerful tools to work with datetime data.

A few common datetime formats include:

- `YYYY-MM-DD` (e.g., `2021-03-15`)
- `MM/DD/YYYY` (e.g., `03/15/2021`)
- `DD-Mon-YYYY` (e.g., `15-Mar-2021`)
- `YYYYMMDD` (e.g., `20210315`)
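
As a minimal sketch, each of the four formats listed above can be parsed with `pd.to_datetime()` and an explicit `format` string. The `samples` dictionary is made up for illustration, and the `%b` code assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical sample strings, one per format listed above.
# Keys are strftime-style format codes; values are the text to parse.
samples = {
    "%Y-%m-%d": "2021-03-15",
    "%m/%d/%Y": "03/15/2021",
    "%d-%b-%Y": "15-Mar-2021",  # %b assumes English month abbreviations
    "%Y%m%d": "20210315",
}

# Parse each string with its matching format.
parsed = {fmt: pd.to_datetime(text, format=fmt) for fmt, text in samples.items()}

for fmt, ts in parsed.items():
    print(f"{fmt:10} -> {ts.date()}")
```

All four strings describe the same day, so every entry in `parsed` should be the same `Timestamp`.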

Pandas does not automatically recognize datetime columns when you read a CSV file or, as with `df_emp`, build a DataFrame from date strings. We can verify this by checking the data type of the "start_date" column.


In [6]:
df_emp["start_date"]

0    2017-05-01
1    2018-02-01
2    2020-08-01
3    2019-12-01
Name: start_date, dtype: object

▶️ Run `str(df_emp["start_date"].dtype)` to check the data type of the "start_date" column.


In [5]:
str(df_emp["start_date"].dtype)

'object'

The "start_date" column is currently of type `object`, which means it is treated as a string.

To use datetime functionalities, we need to convert this column to a datetime type. There are two ways to convert a column to datetime:

1. **During CSV Import**: Use the `parse_dates` parameter in `pd.read_csv()`.

   ```python
   df = pd.read_csv('data.csv', parse_dates=['date_column'])
   ```

2. **After CSV Import**: Use `pd.to_datetime()` to convert a column.

   ```python
   df['date_column'] = pd.to_datetime(df['date_column'])
   ```
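
The two options can be compared side by side. The sketch below uses a small in-memory CSV (the `csv_text` string is hypothetical) so that it runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file.
csv_text = "emp_id,start_date\n30,2017-05-01\n40,2018-02-01\n"

# Option 1: parse the column while reading.
df_at_import = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_date"])

# Option 2: read first, then convert the column.
df_after_import = pd.read_csv(io.StringIO(csv_text))
df_after_import["start_date"] = pd.to_datetime(df_after_import["start_date"])

print(df_at_import.dtypes)
print(df_after_import.dtypes)

# Both routes should yield the same datetime values.
same_values = (df_at_import["start_date"] == df_after_import["start_date"]).all()
print(same_values)
```

Parsing at import keeps the loading code in one place, while converting afterwards is handy when the DataFrame already exists.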


---

**🎯 Example: Parse the `start_date` column as datetime**

We will create a new column "start_date_parsed" that contains the parsed datetime values.


In [7]:
df_emp["start_date_parsed"] = pd.to_datetime(df_emp["start_date"])

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary,start_date_parsed
0,30,Toby,HR,(217)123-4500,2017-05-01,202000,2017-05-01
1,40,Jim,Sales,,2018-02-01,185000,2018-02-01
2,10,Pam,Sales,,2020-08-01,240000,2020-08-01
3,20,Kelly,Customer Service,(217)987-6600,2019-12-01,160500,2019-12-01


▶️ Run `str(df_emp["start_date_parsed"].dtype)` to check the data type of the new "start_date_parsed" column.


In [None]:
str(df_emp["start_date_parsed"].dtype)

'datetime64[ns]'

▶️ Drop the "start_date" column and rename "start_date_parsed" to "start_date".

These steps could be done in a single line, but we will break them down to illustrate the process.


In [8]:
df_emp.drop(columns=["start_date"], inplace=True)
df_emp.rename(columns={"start_date_parsed": "start_date"}, inplace=True)

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,salary,start_date
0,30,Toby,HR,(217)123-4500,202000,2017-05-01
1,40,Jim,Sales,,185000,2018-02-01
2,10,Pam,Sales,,240000,2020-08-01
3,20,Kelly,Customer Service,(217)987-6600,160500,2019-12-01


▶️ Check the data types of the columns to confirm the change.


In [9]:
df_emp.dtypes

emp_id                   int64
name                    object
dept                    object
office_phone            object
salary                   int64
start_date      datetime64[ns]
dtype: object

Now the "start_date" column is of type `datetime64[ns]`, which allows us to perform datetime operations on it.
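
As a brief illustration of what a datetime column makes possible, the sketch below computes the number of days between each start date and a fixed reference date, and builds a date-based filter. The `reference_date` value is an assumption chosen only to keep the result reproducible.

```python
import pandas as pd

# Same start dates as in df_emp, rebuilt here so the example is self-contained.
start_date = pd.Series(
    pd.to_datetime(["2017-05-01", "2018-02-01", "2020-08-01", "2019-12-01"]),
    name="start_date",
)

# Hypothetical reference date, chosen only to keep the output reproducible.
reference_date = pd.Timestamp("2021-01-01")

# Subtracting datetimes gives timedeltas; .dt.days converts them to whole days.
days_employed = (reference_date - start_date).dt.days

# Datetime columns can be compared against a Timestamp to build a filter.
started_2019_or_later = start_date >= pd.Timestamp("2019-01-01")

print(days_employed)
print(started_2019_or_later)
```

Neither operation would work as intended on the original string column, where subtraction is undefined and comparisons are alphabetical.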


---

### Extracting Date Components

We can easily extract components like year, month, and day from datetime columns using the `.dt` accessor.

```python
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day'] = df['date_column'].dt.day
df['weekday'] = df['date_column'].dt.weekday          # e.g., 0=Monday, 6=Sunday
df['weekday_name'] = df['date_column'].dt.day_name()  # e.g., 'Monday'
```

We can also extract more specific components like hour, minute, and second if the datetime includes time.

```python
df['hour'] = df['date_column'].dt.hour
df['minute'] = df['date_column'].dt.minute
df['second'] = df['date_column'].dt.second
```
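
To see these accessors on concrete values, here is a small sketch with two made-up timestamps that include a time of day. The `stamps` and `components` names are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps that include a time of day.
stamps = pd.Series(pd.to_datetime(["2021-03-15 09:30:45", "2021-03-21 18:05:00"]))

# Collect several .dt components into one DataFrame for easy comparison.
components = pd.DataFrame(
    {
        "year": stamps.dt.year,
        "month": stamps.dt.month,
        "day": stamps.dt.day,
        "weekday": stamps.dt.weekday,  # 0=Monday, 6=Sunday
        "weekday_name": stamps.dt.day_name(),
        "hour": stamps.dt.hour,
        "minute": stamps.dt.minute,
        "second": stamps.dt.second,
    }
)

print(components)
```

The first timestamp falls on a Monday and the second on a Sunday, which shows both ends of the `weekday` numbering.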


---

▶️ Extract the year, month, and day from the "start_date" column into new columns "start_year", "start_month", and "start_day".

```python
df_emp['start_year'] = df_emp['start_date'].dt.year
df_emp['start_month'] = df_emp['start_date'].dt.month
df_emp['start_day'] = df_emp['start_date'].dt.day
```


In [11]:
df_emp["start_year"] = df_emp["start_date"].dt.year
df_emp["start_month"] = df_emp["start_date"].dt.month
df_emp["start_day"] = df_emp["start_date"].dt.day

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,salary,start_date,start_year,start_month,start_day
0,30,Toby,HR,(217)123-4500,202000,2017-05-01,2017,5,1
1,40,Jim,Sales,,185000,2018-02-01,2018,2,1
2,10,Pam,Sales,,240000,2020-08-01,2020,8,1
3,20,Kelly,Customer Service,(217)987-6600,160500,2019-12-01,2019,12,1


If you're working with quarterly or weekly data, you can extract those components as well.

```python
df['quarter'] = df['date_column'].dt.quarter          # e.g., 1, 2, 3, 4
df['week'] = df['date_column'].dt.isocalendar().week  # ISO week number, 1-53
```


In [15]:
df_emp["start_date"].dt.quarter

0    2
1    1
2    3
3    4
Name: start_date, dtype: int32

In [16]:
df_emp["start_date"].dt.isocalendar().week

0    18
1     5
2    31
3    48
Name: week, dtype: UInt32
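
One detail worth knowing is that ISO weeks follow the ISO 8601 calendar, so dates near New Year can belong to a week of the neighboring ISO year. The sketch below uses three made-up dates around the 2020/2021 boundary to show this.

```python
import pandas as pd

# Hypothetical dates around a year boundary.
dates = pd.Series(pd.to_datetime(["2020-12-31", "2021-01-01", "2021-01-04"]))

# isocalendar() returns a DataFrame with ISO "year", "week", and "day" columns.
iso = dates.dt.isocalendar()
quarters = dates.dt.quarter

print(iso)
print(quarters)
```

Here January 1, 2021 still falls in ISO week 53 of ISO year 2020, which is why grouping by `week` alone can mix dates from different calendar years.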

---

## 🔬 Grouping and Aggregating Data with Pandas

A common task in data analysis is to summarize data by certain criteria, such as calculating averages or totals for different groups within the data.

In the employees dataset, you might want to find out the average salary by department, or the total number of employees hired each year.

Pandas allows you to use the `groupby()` method to group data by one or more columns, and then apply aggregation functions like `mean()`, `sum()`, `count()`, etc. This follows the [_split-apply-combine_](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) pattern used in many data analysis workflows.

The split-apply-combine pattern consists of three steps:

1. **Split**: Divide the data into groups based on one or more keys (columns).
2. **Apply**: Apply a function (e.g., aggregation, transformation) to each group.
3. **Combine**: Combine the results back into a DataFrame.
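
The three steps can be written out by hand to make the pattern concrete. This is only a sketch for illustration (the `df_demo`, `pieces`, and `manual` names are hypothetical); in practice `groupby()` followed by an aggregation does the same work in one expression.

```python
import pandas as pd

# Small hypothetical table with two departments.
df_demo = pd.DataFrame(
    {
        "dept": ["Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 260000, 180000],
    }
)

pieces = {}
# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
for dept, group in df_demo.groupby("dept"):
    # Apply: compute one summary value per group.
    pieces[dept] = group["salary"].mean()

# Combine: assemble the per-group results into a single object.
manual = pd.Series(pieces, name="salary")

# The built-in equivalent of the loop above.
built_in = df_demo.groupby("dept")["salary"].mean()

print(manual)
print(built_in)
```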


---

To illustrate the split-apply-combine process, we will create a new `DataFrame`.

▶️ Create a new `DataFrame` named `df`.


In [17]:
df = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

df

Unnamed: 0,name,dept,salary
0,Mary,Finance,240000
1,Roy,Purchase,160000
2,John,Finance,250000
3,Joe,Purchase,170000
4,Paul,Finance,260000
5,Erin,Purchase,180000


---

▶️ Run `df.groupby('dept')`.

This creates a `DataFrameGroupBy` object, which represents the grouped data but does not perform any computations yet.


In [18]:
df.groupby("dept")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000020205FD1B80>

:::{tip} What just happened?

- Internally, pandas creates one group per department when you run `.groupby('dept')`.
- However, the groups are not displayed until you apply one or more aggregation functions to them.
- The strange-looking output (in the form of `<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012345678910>`) tells us that the result is a `DataFrameGroupBy` object.
:::

This diagram illustrates the `DataFrameGroupBy` object created by `df.groupby('dept')`.

![groupby object](https://github.com/bdi475/images/blob/main/pandas/df-groupby-object-01.png?raw=true)
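
Although the printed representation of a `DataFrameGroupBy` object reveals little, the object can still be inspected before any aggregation. The sketch below rebuilds the same six-row table (as `df_demo`, a hypothetical name) and peeks at the groups.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

grouped = df_demo.groupby("dept")

# Number of groups that were formed.
print(grouped.ngroups)

# Number of rows in each group.
print(grouped.size())

# The rows that belong to one specific group.
finance_rows = grouped.get_group("Finance")
print(finance_rows)
```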


---

### 📌 Aggregating a `DataFrameGroupBy` object

▶️ Run `df.groupby('dept').agg({'salary': 'mean'})` below.


In [None]:
### BEGIN SOLUTION
df.groupby("dept").agg({"salary": "mean"})
### END SOLUTION

Unnamed: 0_level_0,salary
dept,Unnamed: 1_level_1
Finance,250000.0
Purchase,170000.0


👉 Your resulting `DataFrame` now displays average salary by `dept`.

#### Columns used as "pivots"

```python
df_salary_by_dept = df.groupby('dept').agg({'salary': 'mean'})

display(df_salary_by_dept)
print(df_salary_by_dept.columns)
```

▶️ Copy the provided code above to the code cell below and run it.


In [None]:
### BEGIN SOLUTION
df_salary_by_dept = df.groupby("dept").agg({"salary": "mean"})

display(df_salary_by_dept)
print(df_salary_by_dept.columns)
### END SOLUTION

Unnamed: 0_level_0,salary
dept,Unnamed: 1_level_1
Finance,250000.0
Purchase,170000.0


Index(['salary'], dtype='object')


👉 There is only one column shown when you print out `df_salary_by_dept.columns`! 🙀

This is because the column(s) you use to create groups are used as **index** by default.

![groupby agg result](https://github.com/bdi475/images/blob/main/pandas/df-groupby-agg-as-index-true-01.png?raw=true)
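
When the grouping column ends up in the index, it can be used for label-based lookups or moved back into a regular column with `reset_index()`. A minimal sketch, using a rebuilt copy of the table under the hypothetical name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

by_dept = df_demo.groupby("dept").agg({"salary": "mean"})

# The grouping column is now the index, not a regular column.
print(by_dept.index.name)
print(by_dept.columns.to_list())

# Index labels allow direct lookups by department.
finance_mean = by_dept.loc["Finance", "salary"]
print(finance_mean)

# reset_index() turns the index back into an ordinary column.
by_dept_flat = by_dept.reset_index()
print(by_dept_flat.columns.to_list())
```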


---

### 📌 Aggregating a `DataFrameGroupBy` object with optional `as_index=False`

```python
df_salary_by_dept2 = df.groupby('dept', as_index=False).agg({'salary': 'mean'})

display(df_salary_by_dept2)
print(df_salary_by_dept2.columns)
```

▶️ Copy the provided code to the code cell below and run it.


In [None]:
### BEGIN SOLUTION
df_salary_by_dept2 = df.groupby("dept", as_index=False).agg({"salary": "mean"})

display(df_salary_by_dept2)
print(df_salary_by_dept2.columns)
### END SOLUTION

Unnamed: 0,dept,salary
0,Finance,250000.0
1,Purchase,170000.0


Index(['dept', 'salary'], dtype='object')


👉 Now, printing out the columns shows both `dept` and `salary`. Supplying `as_index=False` to `groupby()` keeps the columns you use as the "pivot" as regular columns.

![groupby agg result](https://github.com/bdi475/images/blob/main/pandas/df-groupby-agg-as-index-false-01.png?raw=true)


---

### 📌 Creating multiple aggregation measures

```python
df_salary_by_dept3 = df.groupby('dept', as_index=False).agg({'salary': ['min', 'max', 'mean', 'sum', 'count', 'std']})

display(df_salary_by_dept3)
print(df_salary_by_dept3.columns)
```

▶️ Copy the provided code to the code cell below and run it.


In [None]:
### BEGIN SOLUTION
df_salary_by_dept3 = df.groupby("dept", as_index=False).agg(
    {"salary": ["min", "max", "mean", "sum", "count", "std"]}
)

display(df_salary_by_dept3)
print(df_salary_by_dept3.columns)
### END SOLUTION

Unnamed: 0_level_0,dept,salary,salary,salary,salary,salary,salary
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,sum,count,std
0,Finance,240000,260000,250000.0,750000,3,10000.0
1,Purchase,160000,180000,170000.0,510000,3,10000.0


MultiIndex([(  'dept',      ''),
            ('salary',   'min'),
            ('salary',   'max'),
            ('salary',  'mean'),
            ('salary',   'sum'),
            ('salary', 'count'),
            ('salary',   'std')],
           )


---

### 📌 Flattening multi-level index columns

When you apply two or more aggregation functions to a column, the resulting DataFrame has hierarchically structured (multi-level) columns. It is often easier to work with a DataFrame that has flat, single-level columns. In these cases, you can manually assign the column names after `.agg()` to _flatten_ the columns.

```python
df_salary_by_dept4 = df.groupby('dept', as_index=False).agg({'salary': ['min', 'max', 'mean', 'sum', 'count', 'std']})

display(df_salary_by_dept4)
print('Columns before (multi-level, not flat):')
print(df_salary_by_dept4.columns)

# manually assign column names
df_salary_by_dept4.columns = ['dept', 'min_salary', 'max_salary', 'mean_salary', 'total_salary', 'num_employees', 'std_dev']

display(df_salary_by_dept4)
print('Columns after (flat-level):')
print(df_salary_by_dept4.columns)
```

▶️ Copy the provided code to the code cell below and run it.


In [None]:
# YOUR CODE BEGINS
df_salary_by_dept4 = df.groupby("dept", as_index=False).agg(
    {"salary": ["min", "max", "mean", "sum", "count", "std"]}
)

display(df_salary_by_dept4)
print("Columns before (multi-level, not flat):")
print(df_salary_by_dept4.columns)

# manually assign column names
df_salary_by_dept4.columns = [
    "dept",
    "min_salary",
    "max_salary",
    "mean_salary",
    "total_salary",
    "num_employees",
    "std_dev",
]

display(df_salary_by_dept4)
print("Columns after (flat-level):")
print(df_salary_by_dept4.columns)
# YOUR CODE ENDS

Unnamed: 0_level_0,dept,salary,salary,salary,salary,salary,salary
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,sum,count,std
0,Finance,240000,260000,250000.0,750000,3,10000.0
1,Purchase,160000,180000,170000.0,510000,3,10000.0


Columns before (multi-level, not flat):
MultiIndex([(  'dept',      ''),
            ('salary',   'min'),
            ('salary',   'max'),
            ('salary',  'mean'),
            ('salary',   'sum'),
            ('salary', 'count'),
            ('salary',   'std')],
           )


Unnamed: 0,dept,min_salary,max_salary,mean_salary,total_salary,num_employees,std_dev
0,Finance,240000,260000,250000.0,750000,3,10000.0
1,Purchase,160000,180000,170000.0,510000,3,10000.0


Columns after (flat-level):
Index(['dept', 'min_salary', 'max_salary', 'mean_salary', 'total_salary',
       'num_employees', 'std_dev'],
      dtype='object')
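
As an alternative to assigning column names by hand, pandas also supports _named aggregation_, where each output column is given as `new_name=(source_column, function)`. The sketch below is one possible way to get the same flat result in a single step; `df_demo` and `df_named` are hypothetical names.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "name": ["Mary", "Roy", "John", "Joe", "Paul", "Erin"],
        "dept": ["Finance", "Purchase", "Finance", "Purchase", "Finance", "Purchase"],
        "salary": [240000, 160000, 250000, 170000, 260000, 180000],
    }
)

# Each keyword becomes an output column: name=(source column, aggregation).
df_named = df_demo.groupby("dept", as_index=False).agg(
    min_salary=("salary", "min"),
    max_salary=("salary", "max"),
    mean_salary=("salary", "mean"),
    total_salary=("salary", "sum"),
    num_employees=("salary", "count"),
    std_dev=("salary", "std"),
)

print(df_named)
print(df_named.columns.to_list())
```

Because the names are attached to the aggregations themselves, this form does not depend on the order of a separately maintained list of column names.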


---

## 📞 Exercises Using Bank Marketing Calls Data

For the next part of this lecture, you'll work with a dataset related to the direct marketing campaigns (phone calls) of a banking institution.

**Data Source**: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

| Column Name     | Type        | Description                                                                                                                                                     |
|-----------------|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `age`           | Numeric     | Age                                                                                                                                                             |
| `job`           | Categorical | 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown' |
| `marital`       | Categorical | 'single', 'married', 'divorced', 'unknown' |
| `education`     | Categorical | 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown' |
| `contact_type`  | Categorical | 'cellular', 'telephone' |
| `num_contacts`  | Numeric     | Number of contacts performed during this campaign for this client                                                                                               |
| `prev_outcome`  | Categorical | Outcome of the previous marketing campaign - 'failure', 'nonexistent',   'success'                                                                              |
| `place_deposit` | Numeric     | Did the client subscribe to a term deposit? This column indicates whether the campaign was successful (1) or not (0) for each client.                                        |

---

Your goal is to analyze the dataset to discover relationships between each individual's personal factors and the outcome of the marketing campaign for that individual.

**`place_deposit`** column indicates whether a marketing campaign was successful.

- ✅ If `1`, the individual has placed a deposit within the bank. This is considered a **successful campaign**.
- 🚫 If `0`, the individual has not placed a deposit within the bank. This is considered an **unsuccessful campaign**.


---

### 📌 Load data


▶️ Run the code cell below to create a new `DataFrame` named `df_m`.


In [None]:
df_m = pd.read_csv(
    "https://github.com/bdi475/datasets/blob/main/bank-direct-marketing.csv?raw=true"
)
df_m_backup = df_m.copy()
df_m

Unnamed: 0,age,job,marital,education,contact_type,num_contacts,prev_outcome,place_deposit
0,56,housemaid,married,basic.4y,telephone,1,nonexistent,False
1,57,services,married,high.school,telephone,1,nonexistent,False
2,37,services,married,high.school,telephone,1,nonexistent,False
3,40,admin.,married,basic.6y,telephone,1,nonexistent,False
4,56,services,married,high.school,telephone,1,nonexistent,False
...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,cellular,1,nonexistent,True
41184,46,blue-collar,married,professional.course,cellular,1,nonexistent,False
41185,56,retired,married,university.degree,cellular,2,nonexistent,False
41186,44,technician,married,professional.course,cellular,1,nonexistent,True


---

### 🎯 Challenge 11: Marketing success rate by marital status

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_marital` that lists the success rate (average of the `place_deposit` column) by marital status.
- ✔️ Use the `as_index=False` option.
- ✔️ The aggregated DataFrame should have two columns - "marital" and "success_rate".
- ✔️ `df_by_marital` should only have the following two columns in the same order.
  - `marital`: Marital status (e.g., single, divorced, married, unknown)
  - `success_rate`: Average success rate (between 0-1)
- ✔️ Neither column should be used as the index.
  - Printing `df_by_marital.columns.to_list()` should print out `['marital', 'success_rate']`.
- ✔️ Sort `df_by_marital` by `success_rate` in descending order _in-place_.

#### 🔑 Expected Output

|     | marital  | success_rate |
| --: | :------- | -----------: |
|   3 | unknown  |         0.15 |
|   2 | single   |     0.140041 |
|   0 | divorced |     0.103209 |
|   1 | married  |     0.101573 |


In [None]:
### BEGIN SOLUTION
df_by_marital = df_m.groupby("marital", as_index=False).agg({"place_deposit": "mean"})
df_by_marital.rename(columns={"place_deposit": "success_rate"}, inplace=True)
df_by_marital.sort_values("success_rate", ascending=False, inplace=True)
### END SOLUTION

df_by_marital

Unnamed: 0,marital,success_rate
3,unknown,0.15
2,single,0.140041
0,divorced,0.103209
1,married,0.101573


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 11 Autograder
df_check = (
    df_m_backup.groupby("marital", as_index=bool(0))
    .agg({"place_deposit": "mean"})
    .rename(columns={"_".join(["place", "deposit"]): "success_rate"})
    .sort_values("success_rate")
    .iloc[::-1]
)
df_check = df_check[["success_rate", "marital"][::-1]].copy()

# Check result
pd.testing.assert_frame_equal(
    df_by_marital.reset_index(drop=True), df_check.reset_index(drop=True)
)

---

### 🎯 Challenge 12: Marketing success rate by job

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_job` that lists the average success rate in direct marketing campaigns by job.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_job` should only have the following two columns in the same order.
  - `job`: Job (e.g., student, technician, housemaid, etc)
  - `success_rate`: Average success rate (between 0-1)
- ✔️ Neither column should be used as the index.
  - Printing `df_by_job.columns.to_list()` should print out `['job', 'success_rate']`.
- ✔️ Sort `df_by_job` by `success_rate` in descending order.

#### 🔑 Expected Output

|     | job           | success_rate |
| --: | :------------ | -----------: |
|   8 | student       |     0.314286 |
|   5 | retired       |     0.252326 |
|  10 | unemployed    |     0.142012 |
|   0 | admin.        |     0.129726 |
|   4 | management    |     0.112175 |
|  11 | unknown       |     0.112121 |
|   9 | technician    |      0.10826 |
|   6 | self-employed |     0.104856 |
|   3 | housemaid     |          0.1 |
|   2 | entrepreneur  |    0.0851648 |
|   7 | services      |    0.0813807 |
|   1 | blue-collar   |    0.0689432 |


In [None]:
### BEGIN SOLUTION
df_by_job = df_m.groupby("job", as_index=False).agg({"place_deposit": "mean"})
df_by_job.rename(columns={"place_deposit": "success_rate"}, inplace=True)
df_by_job.sort_values("success_rate", ascending=False, inplace=True)
### END SOLUTION

df_by_job

Unnamed: 0,job,success_rate
8,student,0.314286
5,retired,0.252326
10,unemployed,0.142012
0,admin.,0.129726
4,management,0.112175
11,unknown,0.112121
9,technician,0.10826
6,self-employed,0.104856
3,housemaid,0.1
2,entrepreneur,0.085165
7,services,0.081381
1,blue-collar,0.068943


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 12 Autograder
df_check = (
    df_m_backup.groupby("job")
    .agg({"place_deposit": "mean"})
    .reset_index()
    .rename(columns={"_".join(["place", "deposit"]): "success_rate"})
    .sort_values("success_rate")
    .iloc[::-1]
)
df_check = df_check[["success_rate", "job"][::-1]].copy()

pd.testing.assert_frame_equal(
    df_by_job.reset_index(drop=True), df_check.reset_index(drop=True)
)

---

### 🎯 Challenge 13: Marketing success rate by contact type with count

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_contact_type` that lists the number of campaigns and the average success rate by contact type.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_contact_type` should only have the following three columns in the same order.
  - `contact_type`: Contact method (e.g., cellular, telephone)
  - `count`: Number of potential customers that were contacted using the corresponding contact type
  - `success_rate`: Average success rate (between 0-1)
- ✔️ None of the three columns should be used as the index.
  - Printing `df_by_contact_type.columns.to_list()` should print out `['contact_type', 'count', 'success_rate']`.
- ✔️ Sort `df_by_contact_type` by `success_rate` in descending order.

#### 🔑 Expected Output

|     | contact_type | count | success_rate |
| --: | :----------- | ----: | -----------: |
|   0 | cellular     | 26144 |     0.147376 |
|   1 | telephone    | 15044 |    0.0523132 |


In [None]:
### BEGIN SOLUTION
df_by_contact_type = df_m.groupby("contact_type", as_index=False).agg(
    {"place_deposit": ["count", "mean"]}
)

# flatten columns
df_by_contact_type.columns = ["contact_type", "count", "success_rate"]

df_by_contact_type.sort_values("success_rate", ascending=False, inplace=True)
### END SOLUTION
df_by_contact_type

Unnamed: 0,contact_type,count,success_rate
0,cellular,26144,0.147376
1,telephone,15044,0.052313


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 13 Autograder
df_check = (
    df_m_backup.groupby("_".join(["contact", "type"]))
    .agg({"place_deposit": ["count", "mean"]})
    .reset_index()
)
df_check.columns = ["SUCCESS_RATE".lower(), "COUNT".lower(), "CONTACT_TYPE".lower()][
    ::-1
]
df_check = df_check.sort_values("success_rate").iloc[::-1]

pd.testing.assert_frame_equal(
    df_by_contact_type.reset_index(drop=True), df_check.reset_index(drop=True)
)

---

### 📌 Grouping by multiple columns

You can run a `groupby()` with two or more columns by supplying a `list` of columns to the `groupby()` function.

```python
# example
df.groupby(['column1', 'column2'], as_index=False).agg({ 'column3': ... })
```
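
A small worked example may help here. The table below is made up for illustration (the `df_staff` DataFrame and its `level` column are hypothetical); each unique combination of `dept` and `level` becomes its own group.

```python
import pandas as pd

# Hypothetical staff table with two grouping columns.
df_staff = pd.DataFrame(
    {
        "dept": ["Finance", "Finance", "Finance", "Purchase", "Purchase", "Purchase"],
        "level": ["Junior", "Senior", "Senior", "Junior", "Junior", "Senior"],
        "salary": [240000, 250000, 260000, 160000, 170000, 180000],
    }
)

# One output row per unique (dept, level) combination.
df_by_dept_level = df_staff.groupby(["dept", "level"], as_index=False).agg(
    {"salary": "mean"}
)

print(df_by_dept_level)
```

With `as_index=False`, both grouping columns stay as regular columns instead of forming a two-level index.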


---

### 🎯 Challenge 14: Marketing success rate by marital status and contact type

#### 👇 Tasks

- ✔️ Using `df_m`, create an aggregated table named `df_by_type_and_marital` that lists the average success rate in direct marketing campaigns by marital status and contact type.
- ✔️ Use the `as_index=False` option.
- ✔️ `df_by_type_and_marital` should only have the following three columns in the same order.
  - `marital`: Marital status (e.g., single, divorced, married, unknown)
  - `contact_type`: Contact method (e.g., cellular, telephone)
  - `success_rate`: Average success rate (between 0-1)
- ✔️ None of the three columns should be used as the index.
  - Printing `df_by_type_and_marital.columns.to_list()` should print out `['marital', 'contact_type', 'success_rate']`.
- ✔️ Sort `df_by_type_and_marital` by `success_rate` in descending order.

#### 🔑 Expected Output

|     | marital  | contact_type | success_rate |
| --: | :------- | :----------- | -----------: |
|   6 | unknown  | cellular     |     0.207547 |
|   4 | single   | cellular     |     0.173875 |
|   0 | divorced | cellular     |      0.13652 |
|   2 | married  | cellular     |     0.135341 |
|   5 | single   | telephone    |    0.0648844 |
|   3 | married  | telephone    |    0.0487554 |
|   1 | divorced | telephone    |    0.0463615 |
|   7 | unknown  | telephone    |     0.037037 |


In [None]:
### BEGIN SOLUTION
df_by_type_and_marital = df_m.groupby(["marital", "contact_type"], as_index=False).agg(
    {"place_deposit": "mean"}
)
df_by_type_and_marital.rename(columns={"place_deposit": "success_rate"}, inplace=True)
df_by_type_and_marital.sort_values("success_rate", ascending=False, inplace=True)
### END SOLUTION

df_by_type_and_marital

Unnamed: 0,marital,contact_type,success_rate
6,unknown,cellular,0.207547
4,single,cellular,0.173875
0,divorced,cellular,0.13652
2,married,cellular,0.135341
5,single,telephone,0.064884
3,married,telephone,0.048755
1,divorced,telephone,0.046362
7,unknown,telephone,0.037037


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 14 Autograder
df_check = (
    df_m_backup.groupby(["MARITAL".lower(), "_".join(["contact", "type"])])
    .agg({"place_deposit": "mean"})
    .reset_index()
)
df_check.rename(
    columns={"place_deposit": "_".join(["SUCCESS", "RATE"]).lower()}, inplace=bool(1)
)
df_check = df_check.sort_values("success_rate").iloc[::-1]

pd.testing.assert_frame_equal(
    df_by_type_and_marital.reset_index(drop=True), df_check.reset_index(drop=True)
)

---

## 🧲 Merging two DataFrames (Joins)

Another common operation with tables is to merge two or more tables into one larger table.

To demonstrate how merging works, we'll work with a record of transactions from a small food stand selling only two items - sweetcorn 🌽 and beer 🍺. The tables associated with the food stand's transactions are shown below.

### Products (`df_products`)

| product_id | product_name | price |
| ---------- | ------------ | ----- |
| SC         | Sweetcorn    | 3.0   |
| CB         | Beer         | 5.0   |

### Transactions (`df_transactions`)

| transaction_id | product_id |
| -------------- | ---------- |
| 1              | SC         |
| 2              | SC         |
| 3              | CB         |
| 4              | SC         |
| 5              | SC         |
| 6              | SC         |
| 7              | CB         |
| 8              | SC         |
| 9              | CB         |
| 10             | SC         |

▶️ Run the code below to create the two tables as DataFrames.


In [None]:
# DO NOT CHANGE THE CODE BELOW
df_products = pd.DataFrame(
    {
        "product_id": ["SC", "CB"],
        "product_name": ["Sweetcorn", "Beer"],
        "price": [3.0, 5.0],
    }
)

df_transactions = pd.DataFrame(
    {
        "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "product_id": ["SC", "SC", "CB", "SC", "SC", "SC", "CB", "SC", "CB", "SC"],
    }
)

df_products_backup = df_products.copy()
df_transactions_backup = df_transactions.copy()

display(df_products)
display(df_transactions)

Unnamed: 0,product_id,product_name,price
0,SC,Sweetcorn,3.0
1,CB,Beer,5.0


Unnamed: 0,transaction_id,product_id
0,1,SC
1,2,SC
2,3,CB
3,4,SC
4,5,SC
5,6,SC
6,7,CB
7,8,SC
8,9,CB
9,10,SC


---

### 🎯 Challenge 15: Merge products into transactions

#### 👇 Tasks

- ✔️ Using `df_products` and `df_transactions`, create a merged table as shown below.
- ✔️ Use a left merge.
- ✔️ Name the merged DataFrame `df_merged`.

#### 🚀 Hints

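The code below merges `right_dataframe` into `left_dataframe` using `shared_key_column`. Because `how='left'` is specified, the result is a left merge: every row of `left_dataframe` is kept, whether or not it has a match in `right_dataframe`.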

```python
merged_dataframe = pd.merge(
    left=left_dataframe,
    right=right_dataframe,
    on='shared_key_column',
    how='left'
)
```
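
To see what `how='left'` changes, it helps to merge tables where one key has no match. The sketch below uses made-up data (the `XX` product code and the `df_left`/`df_right` names are hypothetical) and compares a left merge with an inner merge.

```python
import pandas as pd

# Hypothetical tables; product "XX" has no entry in the right table.
df_left = pd.DataFrame(
    {"transaction_id": [1, 2, 3], "product_id": ["SC", "CB", "XX"]}
)
df_right = pd.DataFrame({"product_id": ["SC", "CB"], "price": [3.0, 5.0]})

# Left merge: keeps every row of df_left; unmatched rows get NaN.
# indicator=True adds a "_merge" column describing where each row came from.
left_merge = pd.merge(
    left=df_left, right=df_right, on="product_id", how="left", indicator=True
)

# Inner merge: keeps only rows whose key appears in both tables.
inner_merge = pd.merge(left=df_left, right=df_right, on="product_id", how="inner")

print(left_merge)
print(inner_merge)
```

When every key has a match, as in the food stand tables, the two merge types return the same rows.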

#### 🧭 Expected Output of `df_merged`

|     | transaction_id | product_id | product_name | price |
| --- | -------------- | ---------- | ------------ | ----- |
| 0   | 1              | SC         | Sweetcorn    | 3.0   |
| 1   | 2              | SC         | Sweetcorn    | 3.0   |
| 2   | 3              | CB         | Beer         | 5.0   |
| 3   | 4              | SC         | Sweetcorn    | 3.0   |
| 4   | 5              | SC         | Sweetcorn    | 3.0   |
| 5   | 6              | SC         | Sweetcorn    | 3.0   |
| 6   | 7              | CB         | Beer         | 5.0   |
| 7   | 8              | SC         | Sweetcorn    | 3.0   |
| 8   | 9              | CB         | Beer         | 5.0   |
| 9   | 10             | SC         | Sweetcorn    | 3.0   |


In [None]:
### BEGIN SOLUTION
df_merged = pd.merge(
    left=df_transactions, right=df_products, on="product_id", how="left"
)
### END SOLUTION

display(df_merged)

Unnamed: 0,transaction_id,product_id,product_name,price
0,1,SC,Sweetcorn,3.0
1,2,SC,Sweetcorn,3.0
2,3,CB,Beer,5.0
3,4,SC,Sweetcorn,3.0
4,5,SC,Sweetcorn,3.0
5,6,SC,Sweetcorn,3.0
6,7,CB,Beer,5.0
7,8,SC,Sweetcorn,3.0
8,9,CB,Beer,5.0
9,10,SC,Sweetcorn,3.0


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 15 Autograder
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup, on="_".join(["product", "id"]), how="left"
).sort_values("_".join(["transaction", "id"]))

pd.testing.assert_frame_equal(
    df_merged.reset_index(drop=True), df_merged_SOL.reset_index(drop=True)
)

---

### 🎯 Challenge 16: Total sales by product

#### 👇 Tasks

- ✔️ Using `df_merged` from the previous exercise, find the total sales by product.
- ✔️ Store the grouped result (a DataFrame) to `df_sales_by_product`.
- ✔️ Use the `groupby()` method on the `product_id` column.
- ✔️ `df_sales_by_product` should contain flat-level columns.
  - Printing `df_sales_by_product.columns.to_list()` should print out `['product_id', 'price']`.

#### 🧭 Expected Output of `df_sales_by_product`

|     | product_id | price |
| --- | ---------- | ----- |
| 0   | CB         | 15.0  |
| 1   | SC         | 21.0  |


In [None]:
### BEGIN SOLUTION
df_sales_by_product = df_merged.groupby("product_id", as_index=False).agg(
    {"price": "sum"}
)
### END SOLUTION

display(df_sales_by_product)

Unnamed: 0,product_id,price
0,CB,15.0
1,SC,21.0


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 16 Autograder
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup, on="product_id", how="left"
)
df_sales_by_product_SOL = (
    df_merged_SOL.groupby("PRODUCT_ID".lower()).agg({"price": "sum"}).reset_index()
)

pd.testing.assert_frame_equal(df_sales_by_product, df_sales_by_product_SOL)

---

### 🎯 Challenge of the day: Total sales by product (OPTIONAL)

⚠️ This challenge is a chance to showcase your problem-solving abilities. The code examples used in the lecture may not be sufficient to perform this task.

#### 👇 Tasks

- ✔️ Using `df_merged` from the previous exercise, find the total sales by product.
- ✔️ Store the grouped result (a DataFrame) to `df_sales_by_id_name`.
  - This time, include the `product_name` information in addition to the `product_id` column.
- ✔️ Use the `groupby()` method.
- ✔️ `df_sales_by_id_name` should contain flat-level columns in the order shown below.
  - Printing `df_sales_by_id_name.columns.to_list()` should print out `['product_id', 'product_name', 'price']`.

#### 🧭 Expected Output of `df_sales_by_id_name`

|     | product_id | product_name | price |
| --- | ---------- | ------------ | ----- |
| 0   | CB         | Beer         | 15.0  |
| 1   | SC         | Sweetcorn    | 21.0  |


In [None]:
### BEGIN SOLUTION
df_sales_by_id_name = df_merged.groupby(
    ["product_id", "product_name"], as_index=False
).agg({"price": "sum"})
### END SOLUTION

display(df_sales_by_id_name)

Unnamed: 0,product_id,product_name,price
0,CB,Beer,15.0
1,SC,Sweetcorn,21.0


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.


In [None]:
# Challenge 17 Autograder (OPTIONAL)
# DO NOT CHANGE THE CODE BELOW
df_merged_SOL = df_transactions_backup.merge(
    df_products_backup, on="product_id", how="left"
)

df_sales_by_id_name_SOL = df_merged_SOL.groupby(
    ["product_id", "product_name"], as_index=False
).agg({"price": "sum"})


pd.testing.assert_frame_equal(df_sales_by_id_name, df_sales_by_id_name_SOL)