# Pandas Merge and String Methods

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [1]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [2]:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [3]:
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')

---

### 📌 Load employees and work laptops data

For the first part, we're going to work with a small DataFrame to see how we merge two DataFrames together.

▶️ Run the code cell below to create `df_employees` and `df_laptops`.

In [4]:
df_employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Jasper', 'Gary', 'Sally'],
    'laptop_id': ['A', 'B', np.nan]
})

df_laptops = pd.DataFrame({
    'laptop_id': ['A', 'B', 'C', 'D'],
    'model': ['Red Touchbook', 'BlueGo', 'Eco Green', 'Hackbook Pro']
})

# Used for 🧭 Check Your Work sections
df_employees_check = df_employees.copy()
df_laptops_check = df_laptops.copy()
df_join_check = df_employees_check.merge(df_laptops, on='laptop_id', how='outer')

▶️ Run the code cell below to display `df_employees`.

In [5]:
df_employees

|    |   emp_id | name   | laptop_id   |
|---:|---------:|:-------|:------------|
|  0 |        1 | Jasper | A           |
|  1 |        2 | Gary   | B           |
|  2 |        3 | Sally  | NaN         |


▶️ Run the code cell below to display `df_laptops`.

In [6]:
df_laptops

|    | laptop_id   | model         |
|---:|:------------|:--------------|
|  0 | A           | Red Touchbook |
|  1 | B           | BlueGo        |
|  2 | C           | Eco Green     |
|  3 | D           | Hackbook Pro  |
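
Before starting the exercises, it may help to see how the `how` argument of `pd.merge` changes the number of rows in the result. The block below is a minimal sketch, not part of the graded notebook: `employees`, `laptops`, and `row_counts` are hypothetical names, and the frames are copies of the data above so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical copies of the notebook's frames so this sketch is self-contained.
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Jasper', 'Gary', 'Sally'],
    'laptop_id': ['A', 'B', np.nan],
})
laptops = pd.DataFrame({
    'laptop_id': ['A', 'B', 'C', 'D'],
    'model': ['Red Touchbook', 'BlueGo', 'Eco Green', 'Hackbook Pro'],
})

# Merge once per join type and record how many rows come back.
row_counts = {
    how: len(pd.merge(employees, laptops, on='laptop_id', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 4, 'outer': 5}
```

Only two `laptop_id` values appear in both frames, so the inner merge keeps two rows. The left and right merges keep every row of their respective side, and the outer merge keeps all of them.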


---

### 🎯 Exercise 1: Inner merge

#### 👇 Tasks

- ✔️ Find employees who have been assigned a work laptop.
- ✔️ In other words, merge `df_employees` and `df_laptops` using an inner merge.
- ✔️ Store the merged result to a new variable named `df_inner`.

#### 🚀 Sample Code

```python
df_inner = pd.merge(
    left=...,
    right=...,
    on='...',
    how='...'
)
```

#### 🔑 Expected Output

|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |        1 | Jasper | A           | Red Touchbook |
|  1 |        2 | Gary   | B           | BlueGo        |

In [7]:
### BEGIN SOLUTION
df_inner = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='inner'
)
### END SOLUTION

display(df_inner)

|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |        1 | Jasper | A           | Red Touchbook |
|  1 |        2 | Gary   | B           | BlueGo        |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [8]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc[df_jc['emp_id'].notna() & df_jc['laptop_id'].notna()].reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_inner.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)

---

### 🎯 Exercise 2: Left merge

#### 👇 Tasks

- ✔️ List all employees and their assigned work laptops, if any.
- ✔️ If an employee has not been assigned a work laptop, leave `'laptop_id'` and `'model'` as `np.nan` (or any other null-like value).
- ✔️ In other words, merge `df_employees` and `df_laptops` using a left merge.
- ✔️ Store the merged result to a new variable named `df_left`.

#### 🔑 Expected Output

|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |        1 | Jasper | A           | Red Touchbook |
|  1 |        2 | Gary   | B           | BlueGo        |
|  2 |        3 | Sally  | NaN         | NaN           |

In [9]:
### BEGIN SOLUTION
df_left = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='left'
)
### END SOLUTION

display(df_left)

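|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |        1 | Jasper | A           | Red Touchbook |
|  1 |        2 | Gary   | B           | BlueGo        |
|  2 |        3 | Sally  | NaN         | NaN           |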


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [10]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc[df_jc['emp_id'].notna()].reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_left.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)
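
One way to see which rows of a left merge found no partner is the `indicator` parameter of `pd.merge`, which adds a `_merge` column. The following is a small sketch under the assumption that the frames look like the ones in this notebook; `employees`, `laptops`, `left_flagged`, and `no_laptop` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Hypothetical copies of the notebook's frames so this sketch is self-contained.
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Jasper', 'Gary', 'Sally'],
    'laptop_id': ['A', 'B', np.nan],
})
laptops = pd.DataFrame({
    'laptop_id': ['A', 'B', 'C', 'D'],
    'model': ['Red Touchbook', 'BlueGo', 'Eco Green', 'Hackbook Pro'],
})

# indicator=True adds a '_merge' column with 'both', 'left_only', or 'right_only'.
left_flagged = pd.merge(
    left=employees,
    right=laptops,
    on='laptop_id',
    how='left',
    indicator=True,
)

# Employees whose row found no matching laptop.
no_laptop = left_flagged.loc[left_flagged['_merge'] == 'left_only', 'name'].tolist()
print(no_laptop)  # ['Sally']
```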

---

### 🎯 Exercise 3: Right merge

#### 👇 Tasks

- ✔️ List all laptops and the employees they are assigned to, if any.
- ✔️ If a laptop has not been assigned to an employee, leave `'emp_id'` and `'name'` as `np.nan` (or any other null-like value).
- ✔️ In other words, merge `df_employees` and `df_laptops` using a right merge.
- ✔️ Store the merged result to a new variable named `df_right`.

#### 🔑 Expected Output

|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |        1 | Jasper | A           | Red Touchbook |
|  1 |        2 | Gary   | B           | BlueGo        |
|  2 |      NaN | NaN    | C           | Eco Green     |
|  3 |      NaN | NaN    | D           | Hackbook Pro  |

In [11]:
### BEGIN SOLUTION
df_right = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='right'
)
### END SOLUTION

display(df_right)

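|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |      1.0 | Jasper | A           | Red Touchbook |
|  1 |      2.0 | Gary   | B           | BlueGo        |
|  2 |      NaN | NaN    | C           | Eco Green     |
|  3 |      NaN | NaN    | D           | Hackbook Pro  |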


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [12]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc[df_jc['laptop_id'].notna()].reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_right.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)
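
The displayed result shows `emp_id` as `1.0` and `2.0` rather than `1` and `2`. A default integer column cannot hold `NaN`, so pandas converts it to floating point when the merge introduces missing values. If integer display matters, one option is the nullable `Int64` dtype. The sketch below assumes frames like the ones in this notebook; `employees`, `laptops`, `right`, and `dtype_before` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical copies of the notebook's frames so this sketch is self-contained.
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Jasper', 'Gary', 'Sally'],
    'laptop_id': ['A', 'B', np.nan],
})
laptops = pd.DataFrame({
    'laptop_id': ['A', 'B', 'C', 'D'],
    'model': ['Red Touchbook', 'BlueGo', 'Eco Green', 'Hackbook Pro'],
})

right = pd.merge(employees, laptops, on='laptop_id', how='right')

# The unmatched laptops introduce NaN, which turns the integer column into floats.
dtype_before = right['emp_id'].dtype
print(dtype_before)  # float64

# The nullable integer dtype keeps whole numbers and marks missing values as <NA>.
right['emp_id'] = right['emp_id'].astype('Int64')
print(right['emp_id'].dtype)  # Int64
```

The check cells in this notebook pass `check_dtype=False`, so either representation is accepted there.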

---

### 🎯 Exercise 4: Full outer merge

#### 👇 Tasks

- ✔️ List all employees and all work laptops, regardless of whether they are associated with one another.
- ✔️ If an employee has not been assigned a work laptop, leave `'laptop_id'` and `'model'` as `np.nan` (or any other null-like value).
- ✔️ If a laptop has not been assigned to an employee, leave `'emp_id'` and `'name'` as `np.nan` (or any other null-like value).
- ✔️ In other words, merge `df_employees` and `df_laptops` using an outer merge.
- ✔️ Store the merged result to a new variable named `df_outer`.

#### 🔑 Expected Output

|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |        1 | Jasper | A           | Red Touchbook |
|  1 |        2 | Gary   | B           | BlueGo        |
|  2 |        3 | Sally  | NaN         | NaN           |
|  3 |      NaN | NaN    | C           | Eco Green     |
|  4 |      NaN | NaN    | D           | Hackbook Pro  |

In [13]:
### BEGIN SOLUTION
df_outer = pd.merge(
    left=df_employees,
    right=df_laptops,
    on='laptop_id',
    how='outer'
)
### END SOLUTION

display(df_outer)

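|    |   emp_id | name   | laptop_id   | model         |
|---:|---------:|:-------|:------------|:--------------|
|  0 |      1.0 | Jasper | A           | Red Touchbook |
|  1 |      2.0 | Gary   | B           | BlueGo        |
|  2 |      3.0 | Sally  | NaN         | NaN           |
|  3 |      NaN | NaN    | C           | Eco Green     |
|  4 |      NaN | NaN    | D           | Hackbook Pro  |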


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [14]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_jc = df_join_check
df_sol = df_jc.reset_index(drop=True)

pd.testing.assert_frame_equal(
    df_outer.reset_index(drop=True),
    df_sol.reset_index(drop=True),
    check_dtype=False
)
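
The check cells in Exercises 1 to 3 build their expected results by filtering an outer merge on which columns are non-null. The sketch below reproduces that idea in one place, assuming frames like the ones in this notebook; `employees`, `laptops`, `outer`, and `expected` are hypothetical names and this is not the autograder's code.

```python
import numpy as np
import pandas as pd

# Hypothetical copies of the notebook's frames so this sketch is self-contained.
employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Jasper', 'Gary', 'Sally'],
    'laptop_id': ['A', 'B', np.nan],
})
laptops = pd.DataFrame({
    'laptop_id': ['A', 'B', 'C', 'D'],
    'model': ['Red Touchbook', 'BlueGo', 'Eco Green', 'Hackbook Pro'],
})

outer = pd.merge(employees, laptops, on='laptop_id', how='outer')

# Rows that carry an employee came from the left frame,
# rows that carry a laptop came from the right frame.
has_employee = outer['emp_id'].notna()
has_laptop = outer['laptop_id'].notna()

expected = {
    'inner': outer[has_employee & has_laptop],
    'left': outer[has_employee],
    'right': outer[has_laptop],
}

for how, filtered in expected.items():
    merged = pd.merge(employees, laptops, on='laptop_id', how=how)
    pd.testing.assert_frame_equal(
        merged.reset_index(drop=True),
        filtered.reset_index(drop=True),
        check_dtype=False,  # emp_id is float in the outer result
    )
```

If the loop finishes without raising, each of the three merges matches the corresponding slice of the outer merge.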

---

## Pandas string methods

### 📌 Load textual data

▶️ Run the code cell below to create `df_libraries`.

In [15]:
df_libraries = pd.DataFrame({
    'name': ['ACES (Funk)', 'Grainger', 'Law', 'Main'],
    'amenities': [
        'Rooms,Scanner,Printer',
        'Rooms,Scanner,Printer,Cafe',
        'Cafe',
        'Rooms,Scanner,Printer,Cafe'
    ],
})

# Used for 🧭 Check Your Work sections
df_libraries_check = df_libraries.copy()

▶️ Run the code cell below to display `df_libraries`.

In [16]:
df_libraries

|    | name        | amenities                  |
|---:|:------------|:---------------------------|
|  0 | ACES (Funk) | Rooms,Scanner,Printer      |
|  1 | Grainger    | Rooms,Scanner,Printer,Cafe |
|  2 | Law         | Cafe                       |
|  3 | Main        | Rooms,Scanner,Printer,Cafe |


---

### 🎯 Exercise 5: Length of library names

#### 👇 Tasks

- ✔️ Find the number of characters (i.e., string length) of each library name.
- ✔️ Store the result to a new column named `'name_length'` in `df_libraries`.

#### 🔑 Expected Output

|    | name        | amenities                  |   name_length |
|---:|:------------|:---------------------------|--------------:|
|  0 | ACES (Funk) | Rooms,Scanner,Printer      |            11 |
|  1 | Grainger    | Rooms,Scanner,Printer,Cafe |             8 |
|  2 | Law         | Cafe                       |             3 |
|  3 | Main        | Rooms,Scanner,Printer,Cafe |             4 |

In [17]:
### BEGIN SOLUTION
df_libraries['name_length'] = df_libraries['name'].str.len()
### END SOLUTION

display(df_libraries)

|    | name        | amenities                  |   name_length |
|---:|:------------|:---------------------------|--------------:|
|  0 | ACES (Funk) | Rooms,Scanner,Printer      |            11 |
|  1 | Grainger    | Rooms,Scanner,Printer,Cafe |             8 |
|  2 | Law         | Cafe                       |             3 |
|  3 | Main        | Rooms,Scanner,Printer,Cafe |             4 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [18]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_lc = df_libraries_check
df_lc['name_length'] = df_lc['name'].str.len()

pd.testing.assert_frame_equal(
    df_libraries.reset_index(drop=True),
    df_lc.reset_index(drop=True),
    check_dtype=False
)
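
The `.str` accessor applies a string operation to every element of a Series and passes missing values through instead of raising an error. The sketch below illustrates this with `.str.len()`; the Series `names` is hypothetical and includes a missing entry that does not appear in `df_libraries`.

```python
import pandas as pd

# Hypothetical Series with one missing name, to show how .str handles it.
names = pd.Series(['ACES (Funk)', 'Grainger', None])

# .str.len() measures each string and leaves the missing entry missing.
lengths = names.str.len()
print(lengths)

# Calling Python's built-in len on the Series counts elements instead.
print(len(names))  # 3
```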

---

### 🎯 Exercise 6: Uppercase library names

#### 👇 Tasks

- ✔️ Convert the library names to uppercase.
- ✔️ Directly update the `'name'` column in `df_libraries`.

#### 🔑 Expected Output

|    | name        | amenities                  |   name_length |
|---:|:------------|:---------------------------|--------------:|
|  0 | ACES (FUNK) | Rooms,Scanner,Printer      |            11 |
|  1 | GRAINGER    | Rooms,Scanner,Printer,Cafe |             8 |
|  2 | LAW         | Cafe                       |             3 |
|  3 | MAIN        | Rooms,Scanner,Printer,Cafe |             4 |

In [19]:
### BEGIN SOLUTION
df_libraries['name'] = df_libraries['name'].str.upper()
### END SOLUTION

display(df_libraries)

|    | name        | amenities                  |   name_length |
|---:|:------------|:---------------------------|--------------:|
|  0 | ACES (FUNK) | Rooms,Scanner,Printer      |            11 |
|  1 | GRAINGER    | Rooms,Scanner,Printer,Cafe |             8 |
|  2 | LAW         | Cafe                       |             3 |
|  3 | MAIN        | Rooms,Scanner,Printer,Cafe |             4 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [20]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_lc = df_libraries_check
df_lc['name_length'] = df_lc['name'].str.len()
df_lc['name'] = df_lc['name'].str.upper()

pd.testing.assert_frame_equal(
    df_libraries.reset_index(drop=True),
    df_lc.reset_index(drop=True),
    check_dtype=False
)
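
String methods such as `.str.upper()` return a new Series and leave the original untouched, which is why the task asks you to assign the result back to the `'name'` column. A minimal sketch with a hypothetical Series `names`:

```python
import pandas as pd

# Hypothetical Series standing in for a column of library names.
names = pd.Series(['Law', 'Main'])

# .str.upper() returns a new Series; names itself is not modified.
upper = names.str.upper()
print(names.tolist())  # ['Law', 'Main']
print(upper.tolist())  # ['LAW', 'MAIN']

# Assigning the result back is what actually updates the data.
names = names.str.upper()
```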

---

### 🎯 Exercise 7: Split amenities into lists

#### 👇 Tasks

- ✔️ Split the items in the `'amenities'` column using the comma (`,`) as a delimiter.
- ✔️ Store the split result to a new column named `'amenities_list'` in `df_libraries`.

#### 🔑 Expected Output

|    | name        | amenities                  |   name_length | amenities_list                          |
|---:|:------------|:---------------------------|--------------:|:----------------------------------------|
|  0 | ACES (FUNK) | Rooms,Scanner,Printer      |            11 | ['Rooms', 'Scanner', 'Printer']         |
|  1 | GRAINGER    | Rooms,Scanner,Printer,Cafe |             8 | ['Rooms', 'Scanner', 'Printer', 'Cafe'] |
|  2 | LAW         | Cafe                       |             3 | ['Cafe']                                |
|  3 | MAIN        | Rooms,Scanner,Printer,Cafe |             4 | ['Rooms', 'Scanner', 'Printer', 'Cafe'] |

In [21]:
### BEGIN SOLUTION
df_libraries['amenities_list'] = df_libraries['amenities'].str.split(',')
### END SOLUTION

display(df_libraries)

|    | name        | amenities                  |   name_length | amenities_list                  |
|---:|:------------|:---------------------------|--------------:|:--------------------------------|
|  0 | ACES (FUNK) | Rooms,Scanner,Printer      |            11 | [Rooms, Scanner, Printer]       |
|  1 | GRAINGER    | Rooms,Scanner,Printer,Cafe |             8 | [Rooms, Scanner, Printer, Cafe] |
|  2 | LAW         | Cafe                       |             3 | [Cafe]                          |
|  3 | MAIN        | Rooms,Scanner,Printer,Cafe |             4 | [Rooms, Scanner, Printer, Cafe] |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [22]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_lc = df_libraries_check
df_lc['name_length'] = df_lc['name'].str.len()
df_lc['name'] = df_lc['name'].str.upper()
df_lc['amenities_list'] = df_lc['amenities'].str.split(',')

pd.testing.assert_frame_equal(
    df_libraries.reset_index(drop=True),
    df_lc.reset_index(drop=True),
    check_dtype=False
)
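
Once a column holds lists, it can be queried further. The sketch below is an illustration rather than part of the exercise: it counts the amenities per library with `.str.len()`, which also works on list elements, and flags libraries with a cafe. The frame `libraries` and the columns `'n_amenities'` and `'has_cafe'` are hypothetical names.

```python
import pandas as pd

# Hypothetical copy of the notebook's data so this sketch is self-contained.
libraries = pd.DataFrame({
    'name': ['ACES (Funk)', 'Grainger', 'Law', 'Main'],
    'amenities': [
        'Rooms,Scanner,Printer',
        'Rooms,Scanner,Printer,Cafe',
        'Cafe',
        'Rooms,Scanner,Printer,Cafe',
    ],
})

# Split each comma-separated string into a Python list.
libraries['amenities_list'] = libraries['amenities'].str.split(',')

# .str.len() on a column of lists returns the number of items in each list.
libraries['n_amenities'] = libraries['amenities_list'].str.len()

# Membership test on each list: does the library have a cafe?
libraries['has_cafe'] = libraries['amenities_list'].apply(lambda items: 'Cafe' in items)

print(libraries[['name', 'n_amenities', 'has_cafe']])
```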