# Exercise 6 - Working with Columns & Missing Values

- 🏆 20 points available
- ✏️ Last updated on 02/26/2022

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
import unittest
tc = unittest.TestCase()

---

### 🎯 Part 1: Import Pandas and NumPy

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [2]:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [3]:
part_name = "part-01"
available_points = 2

tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")

---

### 📌 Load data

For the remainder of this exercise, you'll be working with a small DataFrame named `df_emp`.

▶️ Run the code cell below to create `df_emp`.

In [4]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_emp = pd.DataFrame({
    "emp_id": [30, 40, 10, 20],
    "name": ["Colby", "Adam", "Eli", "Dylan"],
    "dept": ["Sales", "Marketing", "Sales", "Marketing"],
    "office_phone": ["(217)123-4500", np.nan, np.nan, "(217)987-6543"],
    "start_date": ["2017-05-01", "2018-02-01", "2020-08-01", "2019-12-01"],
    "salary": [202000, 185000, 240000, 160500]
})

# Used for intermediate checks
df_emp_backup = df_emp.copy()

df_emp

|  | emp_id | name | dept | office_phone | start_date | salary |
|---|---|---|---|---|---|---|
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |
| 1 | 40 | Adam | Marketing | NaN | 2018-02-01 | 185000 |
| 2 | 10 | Eli | Sales | NaN | 2020-08-01 | 240000 |
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |


The table below describes the columns in `df_emp`.

| Field | Description |
|---|---|
| emp_id | Employee ID |
| name | Employee first name |
| dept | Department |
| office_phone | Office phone number (missing for some employees) |
| start_date | Start date (YYYY-MM-DD) |
| salary | Salary (USD) |

---

### 🎯 Part 2: Find the number of rows and columns

#### 👇 Tasks

- ✔️ Store the number of rows in `df_emp` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_emp` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.
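
As a quick illustration of how `.shape` behaves, here is a minimal sketch on a hypothetical toy DataFrame named `df_toy` (it is not part of this exercise). `.shape` is a `(rows, columns)` tuple, so you can index it or unpack it.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df_toy = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "qty": [3, 5, 2],
})

# .shape is a (number of rows, number of columns) tuple
print(df_toy.shape)

# Tuple unpacking retrieves both counts in one statement
toy_rows, toy_cols = df_toy.shape
print(toy_rows, toy_cols)
```

Indexing with `df_toy.shape[0]` and `df_toy.shape[1]` gives the same two numbers one at a time.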

In [5]:
### BEGIN SOLUTION
num_rows = df_emp.shape[0]
num_cols = df_emp.shape[1]
### END SOLUTION

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

Number of rows: 4
Number of columns: 6


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [6]:
part_name = "part-02"
available_points = 2

tc.assertEqual(num_rows, len(df_emp_backup.index), f"Number of rows should be {len(df_emp_backup.index)}")
tc.assertEqual(num_cols, len(df_emp_backup.columns), f"Number of columns should be {len(df_emp_backup.columns)}")

---

### 🎯 Part 3: Find all rows with non-missing phone numbers

#### 👇 Tasks

- ✔️ Using `df_emp`, find rows where `office_phone` contains a non-missing value.
- ✔️ Store the filtered rows to `df_with_phones`.
- ✔️ `df_emp` should remain unaltered.
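
The idea behind this kind of filter can be sketched on a hypothetical toy DataFrame (`df_toy`, with made-up column names). `Series.notna()` builds a boolean mask, and indexing with that mask returns a new DataFrame rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df_toy = pd.DataFrame({
    "item": ["pen", "ink", "pad"],
    "note": ["blue", np.nan, "lined"],
})

# notna() returns a boolean Series: True where a value is present
has_note = df_toy["note"].notna()
print(has_note)

# Indexing with the mask returns a new DataFrame; df_toy keeps all of its rows
df_noted = df_toy[has_note]
print(df_noted)
```

The original row labels are preserved in the filtered result, which is why the expected output in this part keeps index values `0` and `3`.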

#### 🔑 Expected Output of `df_with_phones`

|  | emp_id | name | dept | office_phone | start_date | salary |
|---|---|---|---|---|---|---|
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |

In [7]:
# Reset df_emp
df_emp = df_emp_backup.copy()

### BEGIN SOLUTION
df_with_phones = df_emp[df_emp["office_phone"].notna()]
### END SOLUTION

display(df_with_phones)

|  | emp_id | name | dept | office_phone | start_date | salary |
|---|---|---|---|---|---|---|
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [8]:
part_name = "part-03"
available_points = 3

# df_emp should remain unaltered
tc.assertTrue(df_emp.equals(df_emp_backup), "The original DataFrame should remain unaltered")

# Check result
pd.testing.assert_frame_equal(df_with_phones.sort_values(df_with_phones.columns.tolist()).reset_index(drop=True),
                              df_emp_backup.query("office_phone == office_phone")
                                 .sort_values(df_emp_backup.columns.tolist()).reset_index(drop=True))

---

### 🎯 Part 4: Sort by department and salary

#### 👇 Tasks

- ✔️ Sort `df_emp` by `dept` in **ascending** order and then by `salary` in **ascending** order.
    - Employees within the same department must be sorted by `salary` in ascending order.
    - Store the result to a new variable named `df_by_dept_salary`.
- ✔️ `df_emp` should remain unaltered after your code.
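
To see how a multi-column sort behaves, here is a minimal sketch on a hypothetical toy DataFrame (`df_toy`, with made-up columns). `sort_values()` accepts a list of column names, sorts by the first one, and breaks ties using the next. The optional `ascending` list in this sketch sets a direction per column. Both columns default to ascending when it is omitted.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df_toy = pd.DataFrame({
    "team": ["B", "A", "B", "A"],
    "score": [7, 3, 2, 9],
})

# Sort by team (A to Z), then by score within each team (high to low)
df_sorted = df_toy.sort_values(["team", "score"], ascending=[True, False])
print(df_sorted)

# sort_values() returns a new DataFrame; df_toy keeps its original row order
print(df_toy)
```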

▶️ Run the code cell below to reset your `df_emp`.

#### 🔑 Expected Output of `df_by_dept_salary`

|  | emp_id | name | dept | office_phone | start_date | salary |
|---|---|---|---|---|---|---|
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |
| 1 | 40 | Adam | Marketing | NaN | 2018-02-01 | 185000 |
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |
| 2 | 10 | Eli | Sales | NaN | 2020-08-01 | 240000 |

In [9]:
# Reset df_emp
df_emp = df_emp_backup.copy()

### BEGIN SOLUTION
df_by_dept_salary = df_emp.sort_values(["dept", "salary"])
### END SOLUTION

display(df_by_dept_salary)

|  | emp_id | name | dept | office_phone | start_date | salary |
|---|---|---|---|---|---|---|
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |
| 1 | 40 | Adam | Marketing | NaN | 2018-02-01 | 185000 |
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |
| 2 | 10 | Eli | Sales | NaN | 2020-08-01 | 240000 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [10]:
part_name = "part-04"
available_points = 3

df_SOL = df_emp_backup.sort_values(["salary", "dept"][::-1], ascending=[not x for x in [False, False]])

# df_emp should remain unaltered
tc.assertTrue(df_emp.equals(df_emp_backup), "The original DataFrame should remain unaltered")

# check result
pd.testing.assert_frame_equal(
    df_by_dept_salary.reset_index(drop=True),
    df_SOL.reset_index(drop=True)
)

---

### 🎯 Part 5: Rename columns

#### 👇 Tasks

- ✔️ Rename the following two columns in `df_emp`.
    1. `name` to `first_name`
    2. `salary` to `base_salary`
- ✔️ Store the result to a new variable named `df_renamed`.
- ✔️ Your original DataFrame (`df_emp`) should remain unaltered.

#### 🚀 Hints

Use the following code to rename columns *out-of-place*.

```python
renamed_dataframe = my_dataframe.rename(columns={
    "col1_before": "col1_after",
    "col2_before": "col2_after"
})
```
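
What "out-of-place" means here can be checked with a small sketch on a hypothetical toy DataFrame (`df_toy`). By default, `rename()` returns a new DataFrame and leaves the original column labels as they were.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df_toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# rename() returns a new DataFrame by default
df_toy_renamed = df_toy.rename(columns={"a": "alpha"})

print(df_toy.columns.tolist())          # original labels are unchanged
print(df_toy_renamed.columns.tolist())  # labels after renaming
```

Columns that are not listed in the mapping, such as `b` in this sketch, keep their names.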

In [11]:
# Reset df_emp
df_emp = df_emp_backup.copy()

### BEGIN SOLUTION
df_renamed = df_emp.rename(columns={
    "name": "first_name",
    "salary": "base_salary"
})
### END SOLUTION

display(df_renamed)

|  | emp_id | first_name | dept | office_phone | start_date | base_salary |
|---|---|---|---|---|---|---|
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |
| 1 | 40 | Adam | Marketing | NaN | 2018-02-01 | 185000 |
| 2 | 10 | Eli | Sales | NaN | 2020-08-01 | 240000 |
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [12]:
part_name = "part-05"
available_points = 3

tc.assertEqual(df_emp.columns.tolist(), df_emp_backup.columns.tolist(), "Did you rename the column in-place? The original DataFrame should not be modified.")
tc.assertEqual(df_renamed.columns.tolist(), ["emp_id", "first_name", "dept", "office_phone", "start_date", "base_salary"])

---

### 🎯 Part 6: Drop columns

#### 👇 Tasks

- ✔️ Drop the `emp_id` and `start_date` columns from `df_emp`.
- ✔️ Store the result to a new variable named `df_dropped`.
- ✔️ Your `df_emp` should remain unaltered.

#### 🚀 Hints

Use the following code as a reference.

```python
dropped_dataframe = my_dataframe.drop(columns=["my_column1", "my_column2"])
```
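
The following sketch, on a hypothetical toy DataFrame (`df_toy`), shows two behaviors of `drop()` that are easy to miss: it returns a new DataFrame by default, and a label that does not exist raises a `KeyError`.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df_toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# drop() returns a new DataFrame; df_toy still has all three columns
df_small = df_toy.drop(columns=["a", "c"])
print(df_small.columns.tolist())
print(df_toy.columns.tolist())

# A label that is not in the DataFrame raises KeyError by default
try:
    df_toy.drop(columns=["not_a_column"])
except KeyError as err:
    print("KeyError:", err)
```

If you see a `KeyError` in your own solution, check the spelling of the column names first.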

In [13]:
# Reset df_emp
df_emp = df_emp_backup.copy()

### BEGIN SOLUTION
df_dropped = df_emp.drop(columns=["emp_id", "start_date"])
### END SOLUTION

df_dropped

|  | name | dept | office_phone | salary |
|---|---|---|---|---|
| 0 | Colby | Sales | (217)123-4500 | 202000 |
| 1 | Adam | Marketing | NaN | 185000 |
| 2 | Eli | Sales | NaN | 240000 |
| 3 | Dylan | Marketing | (217)987-6543 | 160500 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [14]:
part_name = "part-06"
available_points = 3

# df_emp should remain unaltered
tc.assertTrue(df_emp.equals(df_emp_backup), "The original DataFrame should remain unaltered")

# Check result
tc.assertEqual(df_emp.columns.tolist(), df_emp_backup.columns.tolist())
tc.assertEqual(df_dropped.columns.tolist(), ["name", "dept", "office_phone", "salary"])

---

### 🎯 Part 7: Add new column

#### 👇 Tasks

- ✔️ Assume all four employees have received a huge bonus of \$20,000.
- ✔️ Append a new column named `salary_with_bonus` that shows the salaries of employees **with** the bonus.
- ✔️ Update `df_emp` directly (in place).

#### 🚀 Hints

Use the following code as a reference.

```python
my_dataframe["salary_with_bonus"] = my_dataframe["salary"] + 20000
```
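
Unlike the earlier parts, bracket assignment changes the DataFrame itself. The sketch below contrasts it with `assign()`, which returns a new DataFrame instead. It uses a hypothetical toy DataFrame (`df_toy`) with made-up column names. This part asks for the bracket form.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only for illustration
df_toy = pd.DataFrame({"price": [10, 20]})

# assign() returns a new DataFrame and leaves df_toy untouched
df_with_tax = df_toy.assign(price_with_tax=df_toy["price"] + 2)
print(df_toy.columns.tolist())

# Bracket assignment adds the column to df_toy itself
df_toy["price_with_fee"] = df_toy["price"] + 1
print(df_toy.columns.tolist())
```

The scalar on the right-hand side is added to every row, which is why a single number is enough to apply the same bonus to all employees.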

In [15]:
# Reset df_emp
df_emp = df_emp_backup.copy()

### BEGIN SOLUTION
df_emp["salary_with_bonus"] = df_emp["salary"] + 20000
### END SOLUTION

display(df_emp)

|  | emp_id | name | dept | office_phone | start_date | salary | salary_with_bonus |
|---|---|---|---|---|---|---|---|
| 0 | 30 | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 | 222000 |
| 1 | 40 | Adam | Marketing | NaN | 2018-02-01 | 185000 | 205000 |
| 2 | 10 | Eli | Sales | NaN | 2020-08-01 | 240000 | 260000 |
| 3 | 20 | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 | 180500 |


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [16]:
part_name = "part-07"
available_points = 4

df_SOL = df_emp_backup.copy()
df_SOL["salary_with_bonus"] = df_SOL["salary"] + 1000 * 20

# Check result: df_emp should now include the salary_with_bonus column
pd.testing.assert_frame_equal(
    df_emp.sort_values(df_emp.columns.to_list()).reset_index(drop=True),
    df_SOL.sort_values(df_SOL.columns.to_list()).reset_index(drop=True)
)