# Lecture 14 - Pandas Sort, Add/Rename/Drop Columns

Monday 2021/03/15

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [1]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [2]:
# YOUR CODE BEGINS
import pandas as pd
import numpy as np
# YOUR CODE ENDS

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [3]:
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')

---

### 📌 Load data

▶️ Run the code cell below to create a new `DataFrame` named `df_emp`.

In [4]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_emp = pd.DataFrame({
    'emp_id': [30, 40, 10, 20],
    'name': ['Nicole', 'Erisa', 'Prit', 'Claudia'],
    'dept': ['Sales', 'Marketing', 'Sales', 'Marketing'],
    'office_phone': ['(217)123-4500', np.nan, np.nan, '(217)987-6600'],
    'start_date': ['2017-05-01', '2018-02-01', '2020-08-01', '2019-12-01'],
    'salary': [202000, 185000, 240000, 160500]
})

# Used for intermediate checks
df_emp_backup = df_emp.copy()

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


---

## 👉 Sorting by Column(s)

You can sort a `DataFrame` using `df.sort_values()`.

![sort_values usage](https://github.com/bdi475/images/blob/main/pandas/sort-values-01.png?raw=true)
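As a minimal sketch of what `sort_values()` returns, the example below uses a small made-up `DataFrame` (`df_toy` and its columns are hypothetical, not part of the lecture data). It suggests that the original `DataFrame` is left alone, that the sorted copy keeps the original index labels, and that `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({'item': ['pen', 'ink', 'pad'], 'price': [3, 9, 1]})

# sort_values() returns a new, sorted DataFrame
df_sorted = df_toy.sort_values('price')

# The original DataFrame keeps its row order
original_prices = df_toy['price'].tolist()   # [3, 9, 1]

# The sorted copy keeps the original index labels
sorted_labels = df_sorted.index.tolist()     # [2, 0, 1]

# ignore_index=True renumbers the rows 0, 1, 2, ...
df_renumbered = df_toy.sort_values('price', ignore_index=True)
```

This is why the check cells in this notebook call `reset_index(drop=True)` before comparing two sorted `DataFrame`s.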

---

### 🎯 Mini-exercise: Sort by `emp_id` ascending

#### 👇 Tasks

- ✔️ Sort `df_emp` by `emp_id` in **ascending** order.
    - Store the result to a new variable named `df_id_asc`.
- ✔️ `df_emp` should remain unaltered after your code.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` in ascending order and stores the sorted `DataFrame` to a new variable `sorted_dataframe`.

```python
sorted_dataframe = my_dataframe.sort_values('some_column')
```

▶️ Run the code cell below to reset your `df_emp`.

In [5]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [6]:
# YOUR CODE BEGINS
df_id_asc = df_emp.sort_values('emp_id')
# YOUR CODE ENDS

df_id_asc

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [7]:
# Check result
pd.testing.assert_frame_equal(df_id_asc.reset_index(drop=True),
                              df_emp_backup.sort_values('_'.join(['EmP', 'iD']).lower()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `emp_id` descending

#### 👇 Tasks

- ✔️ Sort `df_emp` by `emp_id` in **descending** order.
    - Store the result to a new variable named `df_id_desc`.
- ✔️ `df_emp` should remain unaltered after your code.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` in descending order and stores the sorted `DataFrame` to a new variable `sorted_dataframe`.

```python
sorted_dataframe = my_dataframe.sort_values('some_column', ascending=False)
```

▶️ Run the code cell below to reset your `df_emp`.

In [8]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [9]:
# YOUR CODE BEGINS
df_id_desc = df_emp.sort_values('emp_id', ascending=False)
# YOUR CODE ENDS

df_id_desc

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
1,40,Erisa,Marketing,,2018-02-01,185000
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500
2,10,Prit,Sales,,2020-08-01,240000


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [10]:
# Check result
pd.testing.assert_frame_equal(df_id_desc.reset_index(drop=True),
                              df_emp_backup.sort_values('_'.join(['EmP', 'iD']).lower(), ascending=bool(0)).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `name` ascending

#### 👇 Tasks

- ✔️ Sort `df_emp` by `name` in **ascending** order.
    - Store the result to a new variable named `df_name_asc`.
- ✔️ `df_emp` should remain unaltered after your code.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` in ascending order and stores the sorted `DataFrame` to a new variable `sorted_dataframe`.

```python
sorted_dataframe = my_dataframe.sort_values('some_column')
```

▶️ Run the code cell below to reset your `df_emp`.

In [11]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [12]:
# YOUR CODE BEGINS
df_name_asc = df_emp.sort_values('name')
# YOUR CODE ENDS

df_name_asc

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500
1,40,Erisa,Marketing,,2018-02-01,185000
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
2,10,Prit,Sales,,2020-08-01,240000


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [13]:
# Check result
pd.testing.assert_frame_equal(df_name_asc.reset_index(drop=True),
                              df_emp_backup.sort_values(''.join(['nA', 'Me']).lower()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `name` descending

#### 👇 Tasks

- ✔️ Sort `df_emp` by `name` in **descending** order.
    - Store the result to a new variable named `df_name_desc`.
- ✔️ `df_emp` should remain unaltered after your code.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` in descending order and stores the sorted `DataFrame` to a new variable `sorted_dataframe`.

```python
sorted_dataframe = my_dataframe.sort_values('some_column', ascending=False)
```

▶️ Run the code cell below to reset your `df_emp`.

In [14]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [15]:
# YOUR CODE BEGINS
df_name_desc = df_emp.sort_values('name', ascending=False)
# YOUR CODE ENDS

df_name_desc

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
2,10,Prit,Sales,,2020-08-01,240000
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [16]:
# Check result
pd.testing.assert_frame_equal(df_name_desc.reset_index(drop=True),
                              df_emp_backup.sort_values(''.join(['nA', 'Me']).lower(), ascending=bool(0)).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `dept` descending and then by `start_date` ascending

#### 👇 Tasks

- ✔️ Sort `df_emp` by `dept` in **descending** order and then by `start_date` in **ascending** order.
    - Employees within the same department must be sorted by `start_date` in ascending order.
    - Store the result to a new variable named `df_dept_desc_date_asc`.
- ✔️ `df_emp` should remain unaltered after your code.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` in ascending order and then by `another_column` in descending order. It stores the sorted `DataFrame` to a new variable named `sorted_dataframe`.

```python
sorted_dataframe = my_dataframe.sort_values(['some_column', 'another_column'], ascending=[True, False])
```
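To see how the second sort key behaves, here is a small sketch on made-up data (`df_toy`, `team`, and `score` are hypothetical names). The second column only decides the order among rows that tie on the first column.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({
    'team': ['B', 'A', 'B', 'A'],
    'score': [5, 7, 9, 2],
})

# Sort by team ascending, then by score descending within each team
df_sorted = df_toy.sort_values(['team', 'score'], ascending=[True, False])

team_order = df_sorted['team'].tolist()    # ['A', 'A', 'B', 'B']
score_order = df_sorted['score'].tolist()  # [7, 2, 9, 5]
```

The two lists line up by position: the first entry in `ascending` applies to the first column name, the second entry to the second column name.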

▶️ Run the code cell below to reset your `df_emp`.

In [17]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [18]:
# YOUR CODE BEGINS
df_dept_desc_date_asc = df_emp.sort_values(['dept', 'start_date'], ascending=[False, True])
# YOUR CODE ENDS

df_dept_desc_date_asc

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
2,10,Prit,Sales,,2020-08-01,240000
1,40,Erisa,Marketing,,2018-02-01,185000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [19]:
# Check result
pd.testing.assert_frame_equal(df_dept_desc_date_asc.reset_index(drop=True),
                              df_emp_backup.sort_values([''.join(['dE', 'Pt']).lower(), '_'.join(['sTarT', 'dAtE']).lower()], ascending=[bool(0), bool(1)]).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `dept` ascending and then by `salary` descending

#### 👇 Tasks

- ✔️ Sort `df_emp` by `dept` in **ascending** order and then by `salary` in **descending** order.
    - Employees within the same department must be sorted by `salary` in descending order.
    - Store the result to a new variable named `df_dept_asc_salary_desc`.
- ✔️ `df_emp` should remain unaltered after your code.

▶️ Run the code cell below to reset your `df_emp`.

In [20]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [21]:
# YOUR CODE BEGINS
df_dept_asc_salary_desc = df_emp.sort_values(['dept', 'salary'], ascending=[True, False])
# YOUR CODE ENDS

df_dept_asc_salary_desc

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
1,40,Erisa,Marketing,,2018-02-01,185000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500
2,10,Prit,Sales,,2020-08-01,240000
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [22]:
# Check result
pd.testing.assert_frame_equal(df_dept_asc_salary_desc.reset_index(drop=True),
                              df_emp_backup.sort_values([''.join(['dE', 'Pt']).lower(), 'sAlArY'.lower()], ascending=[bool(1), bool(0)]).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `salary` descending in-place

#### 👇 Tasks

- ✔️ Sort `df_emp` by `salary` in **descending** order *in-place*.
    - Directly update `df_emp` without creating a new variable.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` in descending order *in-place*.

```python
my_dataframe.sort_values('some_column', ascending=False, inplace=True)
```
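A common slip with `inplace=True` is assigning the result back to a variable. The sketch below, on a hypothetical `df_toy`, illustrates that the in-place call returns `None` (as far as I know, this holds for current pandas releases) while the `DataFrame` itself is reordered.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({'price': [3, 9, 1]})

# With inplace=True, the method modifies df_toy and returns None
returned = df_toy.sort_values('price', ascending=False, inplace=True)

prices_after = df_toy['price'].tolist()  # [9, 3, 1]
```

Writing `df_toy = df_toy.sort_values(..., inplace=True)` would therefore replace your `DataFrame` with `None`.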

▶️ Run the code cell below to reset your `df_emp`.

In [23]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [24]:
# YOUR CODE BEGINS
df_emp.sort_values('salary', ascending=False, inplace=True)
# YOUR CODE ENDS

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
2,10,Prit,Sales,,2020-08-01,240000
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [25]:
# Check result
pd.testing.assert_frame_equal(df_emp.reset_index(drop=True),
                              df_emp_backup.sort_values(''.join(['sAl', 'aRy']).lower(), ascending=bool(0)).reset_index(drop=True))

---

### 🎯 Mini-exercise: Sort by `dept` and `name` both descending in-place

#### 👇 Tasks

- ✔️ Sort `df_emp` by `dept` and then by `name`, both in **descending** order, *in-place*.
    - Directly update `df_emp` without creating a new variable.

#### 🚀 Hints

The code below sorts `my_dataframe` by `some_column` and `another_column` in descending order *in-place*.

```python
my_dataframe.sort_values(['some_column', 'another_column'], ascending=[False, False], inplace=True)
```

▶️ Run the code cell below to reset your `df_emp`.

In [26]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [27]:
# YOUR CODE BEGINS
df_emp.sort_values(['dept', 'name'], ascending=[False, False], inplace=True)
# YOUR CODE ENDS

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
2,10,Prit,Sales,,2020-08-01,240000
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [28]:
# Check result
pd.testing.assert_frame_equal(df_emp.reset_index(drop=True),
                              df_emp_backup.sort_values([''.join(['dE', 'Pt']).lower(), ('nA' + 'Me').lower()], ascending=[bool(0), bool(0)]).reset_index(drop=True))

---

## 👉 Renaming Column(s)

You can rename a column using `df.rename(columns={'name_before': 'name_after'})`.
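One thing worth knowing, sketched below on a hypothetical `df_toy`: by default, `rename()` silently ignores keys that do not match any column, so a typo in the old name produces no error and no change. Passing `errors='raise'` asks pandas to raise a `KeyError` instead.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({'a': [1], 'b': [2]})

# A matching key renames the column in the returned copy
df_ok = df_toy.rename(columns={'a': 'alpha'})
ok_columns = df_ok.columns.tolist()      # ['alpha', 'b']

# A key with a typo ('A' instead of 'a') is ignored by default
df_typo = df_toy.rename(columns={'A': 'alpha'})
typo_columns = df_typo.columns.tolist()  # ['a', 'b']

# errors='raise' turns the silent miss into a KeyError
try:
    df_toy.rename(columns={'A': 'alpha'}, errors='raise')
    raised = False
except KeyError:
    raised = True
```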

---

### 🎯 Mini-exercise: Rename `office_phone` to `phone_num`

#### 👇 Tasks

- ✔️ Rename `office_phone` column to `phone_num`.
- ✔️ Store the result to a new variable named `df_renamed`.
- ✔️ Your `df_emp` should remain unaltered.

#### 🚀 Hints

Use the following code to rename `col_before` column to `col_after` *out-of-place*.

```python
renamed_dataframe = my_dataframe.rename(columns={'col_before': 'col_after'})
```

▶️ Run the code cell below to reset your `df_emp`.

In [29]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [30]:
# YOUR CODE BEGINS
df_renamed = df_emp.rename(columns={'office_phone': 'phone_num'})
# YOUR CODE ENDS

df_renamed

Unnamed: 0,emp_id,name,dept,phone_num,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [31]:
# Check result
tc.assertEqual(df_emp.columns.tolist(), df_emp_backup.columns.tolist())
tc.assertEqual(df_renamed.columns.tolist(), ['emp_id', 'name', 'dept', 'phone_num', 'start_date', 'salary'])

---

### 🎯 Mini-exercise: Rename `office_phone` to `phone_num` **in-place**

#### 👇 Tasks

- ✔️ Rename `office_phone` column to `phone_num` *in-place*.
    - Directly update `df_emp` without creating a new variable.

#### 🚀 Hints

Use the following code to rename `col_before` column to `col_after` *in-place*.

```python
my_dataframe.rename(columns={'col_before': 'col_after'}, inplace=True)
```

▶️ Run the code cell below to reset your `df_emp`.

In [32]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [33]:
# YOUR CODE BEGINS
df_emp.rename(columns={'office_phone': 'phone_num'}, inplace=True)
# YOUR CODE ENDS

df_emp

Unnamed: 0,emp_id,name,dept,phone_num,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [34]:
# Check result
tc.assertEqual(df_emp.columns.tolist(), ['emp_id', 'name', 'dept', 'phone_num', 'start_date', 'salary'])

---

### 🎯 Mini-exercise: Rename `name` to `first_name` and `salary` to `base_salary` **in-place**

#### 👇 Tasks

- ✔️ Rename `name` column to `first_name` and `salary` to `base_salary` *in-place*.
    - Directly update `df_emp` without creating a new variable.

#### 🚀 Hints

Use the following code as a reference.

```python
my_dataframe.rename(columns={'col_before1': 'col_after1', 'col_before2': 'col_after2'}, inplace=True)
```

▶️ Run the code cell below to reset your `df_emp`.

In [35]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [36]:
# YOUR CODE BEGINS
df_emp.rename(columns={'name': 'first_name', 'salary': 'base_salary'}, inplace=True)
# YOUR CODE ENDS

df_emp

Unnamed: 0,emp_id,first_name,dept,office_phone,start_date,base_salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [37]:
# Check result
tc.assertEqual(df_emp.columns.tolist(), ['emp_id', 'first_name', 'dept', 'office_phone', 'start_date', 'base_salary'])

---

## 👉 Dropping Column(s)

You can drop one or more columns using `df.drop(columns=['col1', 'col2'])`.
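The sketch below shows `drop()` on a hypothetical `df_toy`. Unlike `rename()`, `drop()` raises a `KeyError` by default when a listed column does not exist; `errors='ignore'` skips the missing names instead.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

# drop() returns a copy without the listed columns
df_less = df_toy.drop(columns=['b'])
less_columns = df_less.columns.tolist()  # ['a', 'c']

# A column name that does not exist raises a KeyError by default
try:
    df_toy.drop(columns=['zzz'])
    raised = False
except KeyError:
    raised = True

# errors='ignore' drops what exists and skips the rest
df_safe = df_toy.drop(columns=['b', 'zzz'], errors='ignore')
safe_columns = df_safe.columns.tolist()  # ['a', 'c']
```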

---

### 🎯 Mini-exercise: Drop `start_date` column

#### 👇 Tasks

- ✔️ Drop `start_date` column from `df_emp`.
- ✔️ Store the result to a new variable named `df_dropped`.
- ✔️ Your `df_emp` should remain unaltered.

#### 🚀 Hints

Use the following code as a reference.

```python
dropped_dataframe = my_dataframe.drop(columns=['my_column1'])
```

▶️ Run the code cell below to reset your `df_emp`.

In [38]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [39]:
# YOUR CODE BEGINS
df_dropped = df_emp.drop(columns=['start_date'])
# YOUR CODE ENDS

df_dropped

Unnamed: 0,emp_id,name,dept,office_phone,salary
0,30,Nicole,Sales,(217)123-4500,202000
1,40,Erisa,Marketing,,185000
2,10,Prit,Sales,,240000
3,20,Claudia,Marketing,(217)987-6600,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [40]:
# Check result
tc.assertEqual(df_emp.columns.tolist(), df_emp_backup.columns.tolist())
tc.assertEqual(df_dropped.columns.tolist(), ['emp_id', 'name', 'dept', 'office_phone', 'salary'])

---

### 🎯 Mini-exercise: Drop `start_date` column **in-place**

#### 👇 Tasks

- ✔️ Drop `start_date` column *in-place*.
    - Directly update `df_emp` without creating a new variable.

#### 🚀 Hints

Use the following code as a reference.

```python
my_dataframe.drop(columns=['my_column1'], inplace=True)
```

▶️ Run the code cell below to reset your `df_emp`.

In [41]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [42]:
# YOUR CODE BEGINS
df_emp.drop(columns=['start_date'], inplace=True)
# YOUR CODE ENDS

df_emp

Unnamed: 0,emp_id,name,dept,office_phone,salary
0,30,Nicole,Sales,(217)123-4500,202000
1,40,Erisa,Marketing,,185000
2,10,Prit,Sales,,240000
3,20,Claudia,Marketing,(217)987-6600,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [43]:
# Check result
tc.assertEqual(df_emp.columns.tolist(), ['emp_id', 'name', 'dept', 'office_phone', 'salary'])

---

### 🎯 Mini-exercise: Drop `name` and `salary` columns **in-place**

#### 👇 Tasks

- ✔️ Drop `name` and `salary` columns *in-place*.
    - Directly update `df_emp` without creating a new variable.

#### 🚀 Hints

Use the following code as a reference.

```python
my_dataframe.drop(columns=['my_column1', 'my_column2'], inplace=True)
```

▶️ Run the code cell below to reset your `df_emp`.

In [44]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [45]:
# YOUR CODE BEGINS
df_emp.drop(columns=['name', 'salary'], inplace=True)
# YOUR CODE ENDS

df_emp

Unnamed: 0,emp_id,dept,office_phone,start_date
0,30,Sales,(217)123-4500,2017-05-01
1,40,Marketing,,2018-02-01
2,10,Sales,,2020-08-01
3,20,Marketing,(217)987-6600,2019-12-01


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [46]:
# Check result
tc.assertEqual(df_emp.columns.tolist(), ['emp_id', 'dept', 'office_phone', 'start_date'])

---

## 👉 Selecting Subset Columns

You can select a subset of columns from a `DataFrame` using `df[list_of_columns]`.
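The double brackets matter here. As a rough illustration on a hypothetical `df_toy`, a single string inside the brackets gives back a `Series`, while a list of names gives back a `DataFrame` whose columns follow the order of the list.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# A single column name returns a Series
one_col_series = df_toy['a']

# A list with one name returns a one-column DataFrame
one_col_frame = df_toy[['a']]

# Columns come back in the order given in the list
reordered = df_toy[['c', 'a']]
reordered_columns = reordered.columns.tolist()  # ['c', 'a']
```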

---

### 🎯 Mini-exercise: `emp_id` and `name`

#### 👇 Tasks

- ✔️ Select only `emp_id` and `name` columns from `df_emp` (in the same order).
- ✔️ Store the result to a new variable named `df_id_name`.
- ✔️ Your `df_emp` should remain unaltered.

#### 🚀 Hints

Use the following code as a reference.

```python
df_subset = df[['col1', 'col2']]
```

▶️ Run the code cell below to reset your `df_emp`.

In [47]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [48]:
# YOUR CODE BEGINS
df_id_name = df_emp[['emp_id', 'name']]
# YOUR CODE ENDS

df_id_name

Unnamed: 0,emp_id,name
0,30,Nicole
1,40,Erisa
2,10,Prit
3,20,Claudia


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [49]:
# Check result
pd.testing.assert_frame_equal(df_emp, df_emp_backup)
pd.testing.assert_frame_equal(df_emp_backup[['emp_id', 'name']], df_id_name, check_like=True)

---

### 🎯 Mini-exercise: `emp_id`, `name`, `dept`, `salary`

#### 👇 Tasks

- ✔️ Select `emp_id`, `name`, `dept`, and `salary` columns from `df_emp` (in the same order).
- ✔️ Store the result to a new variable named `df_subset`.
- ✔️ Your `df_emp` should remain unaltered.

#### 🚀 Hints

Use the following code as a reference.

```python
df_subset = df[['col1', 'col2']]
```

▶️ Run the code cell below to reset your `df_emp`.

In [50]:
df_emp = df_emp_backup.copy()
df_emp

Unnamed: 0,emp_id,name,dept,office_phone,start_date,salary
0,30,Nicole,Sales,(217)123-4500,2017-05-01,202000
1,40,Erisa,Marketing,,2018-02-01,185000
2,10,Prit,Sales,,2020-08-01,240000
3,20,Claudia,Marketing,(217)987-6600,2019-12-01,160500


In [51]:
# YOUR CODE BEGINS
df_subset = df_emp[['emp_id', 'name', 'dept', 'salary']]
# YOUR CODE ENDS

df_subset

Unnamed: 0,emp_id,name,dept,salary
0,30,Nicole,Sales,202000
1,40,Erisa,Marketing,185000
2,10,Prit,Sales,240000
3,20,Claudia,Marketing,160500


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [52]:
# Check result
pd.testing.assert_frame_equal(df_emp, df_emp_backup)
pd.testing.assert_frame_equal(df_emp_backup[['emp_id', 'name', 'dept', 'salary']], df_subset)

---

## 👉 Filtering Recap

If time permits, we'll work on these exercises.

---

### 📌 Load data

▶️ Run the code cell below to create a new `DataFrame` named `df_you`.

In [53]:
# DO NOT CHANGE THE CODE IN THIS CELL
df_you = pd.read_csv('https://raw.githubusercontent.com/bdi475/datasets/main/about-you.csv')

# Used to keep a clean copy
df_you_backup = df_you.copy()

# head() displays the first 5 rows of a DataFrame
df_you.head()

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,has_iphone
0,Citlalli,Anthropology,,Chicago,125.86,Seven Saints,True
1,Zach,Finance,Information Systems,Glenview,137.04,,False
2,Ori,Information Science,,Skokie,134.94,Culvers,True
3,Dylan,Accountancy,,Chicago,125.86,Signature Grill,True
4,Ajay,Organizational Psychology,Statistics,Fairview Heights,141.24,Chipotle,True


The table below describes each column in `df_you`.

| Column Name             | Description                                               |
|-------------------------|-----------------------------------------------------------|
| name                    | First name                                                |
| major1                  | Major                                                     |
| major2                  | Second major OR minor (blank if no second major or minor) |
| city                    | City the person is from                                   |
| distance_from_champaign | Straight distance from the city to Champaign in miles     |
| fav_restaurant          | Favorite restaurant (blank if no restaurant was given)    |
| has_iphone              | Whether the person uses an iPhone                         |

---

### 🎯 Mini-exercise: People from Skokie

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person is from `'Skokie'`.
    - Check whether the `city` column contains `'Skokie'`.
    - Store the result to a new variable named `df_skokie`.
- ✔️ `df_you` should remain unaltered after your code.
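As a reminder of how filtering works, here is a short sketch on made-up data (`df_toy` and the city names are hypothetical). Comparing a column with a value produces a boolean `Series`, and indexing the `DataFrame` with that `Series` keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical toy data, not part of the lecture dataset
df_toy = pd.DataFrame({
    'name': ['Ann', 'Bo', 'Cy'],
    'city': ['Peoria', 'Urbana', 'Peoria'],
})

# The comparison yields one True/False value per row
mask = df_toy['city'] == 'Peoria'
mask_values = mask.tolist()            # [True, False, True]

# Indexing with the mask keeps only the True rows
df_match = df_toy[mask]
matched_names = df_match['name'].tolist()  # ['Ann', 'Cy']
```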

In [54]:
# YOUR CODE BEGINS
df_skokie = df_you[df_you['city'] == 'Skokie']
# YOUR CODE ENDS

df_skokie

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,has_iphone
2,Ori,Information Science,,Skokie,134.94,Culvers,True
5,Andrew,Economics,Statistics,Skokie,134.94,,True


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [55]:
# df_you should remain unaltered
pd.testing.assert_frame_equal(df_you, df_you_backup)

# Check result
pd.testing.assert_frame_equal(df_skokie.sort_values(df_skokie.columns.tolist()).reset_index(drop=True),
                              df_you_backup.query(f'{"cItY".lower()} == "{"SkOkIe".capitalize()}"')
                                 .sort_values(df_you_backup.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Anyone with a non-missing `major2`

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person has a second major or a minor.
    - You're looking for rows where `major2` is not `NaN`.
- ✔️ `NaN` is a special value that denotes a missing value. You must use `my_series.isna()` or `my_series.notna()` to check whether each value is missing.
- ✔️ Store the result to a new variable named `df_major2`.
- ✔️ `df_you` should remain unaltered after your code.

#### 🚀 Hints

- `my_series.notna()` returns `True` for each value that is **not** missing and `False` for each missing value.

![notna](https://github.com/bdi475/images/blob/main/pandas/notna-series.png?raw=true)
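The reason `==` does not work for this check is that `NaN` is not equal to anything, including itself. The sketch below uses a hypothetical numeric `Series` to illustrate the difference between comparing with `np.nan` and calling `isna()` / `notna()`.

```python
import numpy as np
import pandas as pd

# NaN never compares equal, not even to itself
nan_equals_nan = (np.nan == np.nan)        # False

# Hypothetical toy Series with one missing value
s = pd.Series([1.5, np.nan, 3.0])

# Comparing with == never finds the missing value
eq_result = (s == np.nan).tolist()         # [False, False, False]

# isna() and notna() detect it correctly
isna_result = s.isna().tolist()            # [False, True, False]
notna_result = s.notna().tolist()          # [True, False, True]

# notna() can be used directly as a filter
non_missing = s[s.notna()].tolist()        # [1.5, 3.0]
```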

In [56]:
# YOUR CODE BEGINS
df_major2 = df_you[df_you['major2'].notna()]
# YOUR CODE ENDS

df_major2

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,has_iphone
1,Zach,Finance,Information Systems,Glenview,137.04,,False
4,Ajay,Organizational Psychology,Statistics,Fairview Heights,141.24,Chipotle,True
5,Andrew,Economics,Statistics,Skokie,134.94,,True
6,Sarah,Marketing,Theatre,Morris,86.38,,True
8,Jennifer,Food Science,Human Nutrition,Macau,7807.02,,True
10,James,Accountancy,Informatics,Orland Park,106.11,,True
12,Max,Finance,Informatics,Clarendon Hills,117.12,Chick-fil-A,True
14,Jackie,Supply Chain Management,Marketing,Wheeling,397.48,Portillos,True
15,Nicole,Accountancy,Finance,Shanghai,7154.42,,True
17,Keziah,Agricultural and Consumer Economics,Communications,Bolingbrook,109.64,,True


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [57]:
# df_you should remain unaltered
pd.testing.assert_frame_equal(df_you, df_you_backup)

# Check result
pd.testing.assert_frame_equal(df_major2.sort_values(df_major2.columns.tolist()).reset_index(drop=True),
                              df_you_backup.query(f'major2 == major2')
                                 .sort_values(df_you_backup.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Anyone with a Favorite Restaurant

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person has a favorite restaurant.
    - You're looking for rows where `fav_restaurant` is not missing (not `NaN`).
- ✔️ `NaN` is a special value that denotes a missing value. You must use `my_series.isna()` or `my_series.notna()` to check whether each value in a `Series` is missing.
- ✔️ Store the result to a new variable named `df_fav`.
- ✔️ `df_you` should remain unaltered after your code.

#### 🚀 Hints

- `my_series.notna()` returns `True` for each value that is **not** missing and `False` for each missing value.

In [58]:
# YOUR CODE BEGINS
df_fav = df_you[df_you['fav_restaurant'].notna()]
# YOUR CODE ENDS

df_fav

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,has_iphone
0,Citlalli,Anthropology,,Chicago,125.86,Seven Saints,True
2,Ori,Information Science,,Skokie,134.94,Culvers,True
3,Dylan,Accountancy,,Chicago,125.86,Signature Grill,True
4,Ajay,Organizational Psychology,Statistics,Fairview Heights,141.24,Chipotle,True
7,Ahsaas,Finance,,Muscat,7543.35,Five Guys,True
11,Jaewon,Acturial Science,,Seoul,6623.7,Jimmy Johns,True
12,Max,Finance,Informatics,Clarendon Hills,117.12,Chick-fil-A,True
13,Nick,Information Science,,Northbrook,140.84,Potbelly,True
14,Jackie,Supply Chain Management,Marketing,Wheeling,397.48,Portillos,True
16,Harsha,Accountancy,,Lisle,116.73,Chipotle,True


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [59]:
# df_you should remain unaltered
pd.testing.assert_frame_equal(df_you, df_you_backup)

# Check result
pd.testing.assert_frame_equal(df_fav.sort_values(df_fav.columns.tolist()).reset_index(drop=True),
                              df_you_backup.query(f'fav_restaurant == fav_restaurant')
                                 .sort_values(df_you_backup.columns.tolist()).reset_index(drop=True))