# Lecture 11 - Pandas Filter, Sort, Read/Update Cells

Monday 2021/03/08

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [None]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')

---

### 📌 Load data

The first part of today's lecture is all about **you**. 👻 Literally.

▶️ Run the code cell below to create a new `DataFrame` named `df_you`.

In [None]:
df_you = pd.read_csv('https://raw.githubusercontent.com/bdi475/datasets/main/about-you.csv')

# Used to keep a clean copy
df_you_backup = df_you.copy()

# head() displays the first 5 rows of a DataFrame
df_you.head()

☝️ **Hold on.** Didn't we always create `DataFrame`s using `pd.DataFrame()`?

Yes. But we can also *import* existing data as a Pandas `DataFrame` using `pd.read_csv()`. Pandas provides many similar import functions. For now, we'll mostly use `pd.read_csv()`.
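
To see what `pd.read_csv()` does without relying on a remote file, here is a minimal sketch that reads CSV text from memory. The CSV content and the names `csv_text` and `df_csv_demo` are made up for illustration and are not part of the lecture dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file or URL
csv_text = """name,city,has_iphone
Ava,Chicago,True
Ben,Seoul,False
"""

# pd.read_csv() accepts a file path, a URL, or a file-like object.
# io.StringIO wraps the string so it behaves like an open file.
df_csv_demo = pd.read_csv(io.StringIO(csv_text))

print(df_csv_demo)
print(df_csv_demo.shape)
```

The first line of the CSV text becomes the column names, and each remaining line becomes one row.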

The table below explains each column in `df_you`.

| Column Name             | Description                                               |
|-------------------------|-----------------------------------------------------------|
| name                    | First name                                                |
| major1                  | Major                                                     |
| major2                  | Second major OR minor (blank if no second major or minor) |
| city                    | City the person is from                                   |
| distance_from_champaign | Straight distance from the city to Champaign in miles     |
| fav_restaurant          | Favorite restaurant (blank if no restaurant was given)    |
| has_iphone              | Whether the person uses an iPhone                         |

---

### 📌 Concise summary of a `DataFrame`

👉 A common first step in working with a `DataFrame` is to use the `info()` method. `info()` prints a concise summary of a `DataFrame`.
- Index data type
- Column information: for each column, the following information is displayed:
    - Number of non-missing values
    - Data type of the column
- Memory usage

▶️ Run `df_you.info()` below to see the `info()` method in action.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

👉 From the result of `df_you.info()`, we can understand a couple of things:

- There are 7 columns.
- Five columns have the `object` data type.
    - In Pandas, a string data type is shown as `object`, not `str`.
        - We will skip the technical discussion for now.
- The second line of the output tells us that there are 32 entries.
- Some columns have 32 non-null values - these columns do not contain any missing value.
- Some columns have fewer than 32 non-null values - these columns contain one or more missing values.
    - Missing values are displayed as `NaN`.
    - To denote a missing value, use NumPy's `np.nan` (more on this later).
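
The relationship between non-null counts and missing values can be sketched on a tiny, made-up `DataFrame`. The name `df_missing_demo` and its contents are hypothetical. `isna().sum()` is one possible way to count missing values per column, shown here only as a complement to `info()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one restaurant value is missing
df_missing_demo = pd.DataFrame({
    'name': ['Ava', 'Ben', 'Cam'],
    'fav_restaurant': ['Taco Bell', np.nan, 'Chipotle'],
})

# info() reports 3 non-null values for name and 2 for fav_restaurant
df_missing_demo.info()

# isna() marks missing cells as True; sum() counts them per column
missing_counts = df_missing_demo.isna().sum()
print(missing_counts)
```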

---

### 📌 Number of rows and columns in a `DataFrame`

👉 How many rows and columns does `df_you` have?

▶️ Run `df_you.shape` below to see the *shape* of the `DataFrame`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

👉 Can you store the number of rows and columns to variables?

`df_you.shape` returns a `tuple` in `(num_rows, num_cols)` format. 🙀 A `tuple` works like a `list`, except that it cannot be modified once created.
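
The "cannot be modified" part can be checked directly. The sketch below uses hypothetical names (`demo_list`, `demo_tuple`) and shows that assigning to a list element works, while the same assignment on a tuple raises a `TypeError`.

```python
demo_list = [10, 20]
demo_tuple = (10, 20)

# Lists are mutable: replacing an element works
demo_list[0] = 99

# Tuples are immutable: the same assignment raises a TypeError
try:
    demo_tuple[0] = 99
    tuple_modified = True
except TypeError as err:
    tuple_modified = False
    print(f'TypeError: {err}')

# Reading values works the same way for both
first, second = demo_tuple
print(first, second)
```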

▶️ Run the code cell below to see how they work nearly identically.

In [None]:
# These two are nearly identical,
# The only difference is that my_tuple cannot be modified
my_list = [10, 20]
my_tuple = (10, 20)

print(f'my_list[1]={my_list[1]}')    # prints 20
print(f'my_tuple[1]={my_tuple[1]}')  # also prints 20

---

### 🎯 Mini-exercise: Find the number of rows and columns in a `DataFrame`

#### 👇 Tasks

- ✔️ Store the number of rows in `df_you` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_you` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.

In [None]:
# YOUR CODE BEGINS





# YOUR CODE ENDS

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(num_rows, len(df_you.index), f'Number of rows should be {len(df_you.index)}')
tc.assertEqual(num_cols, len(df_you.columns), f'Number of columns should be {len(df_you.columns)}')

---

### 📌 Filtering rows

Let's step back and return to working with a `Series`.

▶️ Create a `Series` named `nums` with the following four integers: `-20`, `-10`, `10`, `20`. 

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

nums

👉 Is there a way to *filter* the `Series` so that it only contains **positive** values? Let's first try this **manually**.

▶️ Create a new `Series` named `keep` with the following four boolean values: `False`, `False`, `True`, `True`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Check your work
pd.testing.assert_series_equal(keep,
                              pd.Series([0, 0, 1, 1]).astype(bool))

# Display keep
keep

▶️ Now, you can use the boolean `Series` to filter another `Series`. Type in `nums[keep]` below and run the cell.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

The syntax for filtering a `Series` is `my_series[keep]` where `keep` is a `Series` of boolean values indicating whether to keep an element or not. `keep` should have the exact same number of elements as `my_series`.
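
As a minimal sketch of the same syntax on different data, the example below filters a made-up `Series` with a hand-written boolean mask. The names `temps`, `keep_mask`, and `picked` are hypothetical.

```python
import pandas as pd

temps = pd.Series([68, 75, 59, 81])

# One boolean per element: True keeps the element, False drops it
keep_mask = pd.Series([True, False, False, True])

picked = temps[keep_mask]
print(picked)
```

The kept elements retain their original index labels (`0` and `3` here), which is why the check cells in this notebook call `reset_index(drop=True)` before comparing results.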

▶️ **Uncomment the code cell below first** and run it to see what happens when your `keep` does not have the same number of elements as `my_series`.

(⛔️ **Heads-up**: The code will throw an error! Once you're done running the cell, comment the lines.)

In [None]:
# keep_incorrect = pd.Series([False, False, True])
# nums[keep_incorrect]

👉 Is there a better way to *filter* the `Series` so that it only contains **positive** values? The manual method we just used is inefficient. Imagine if your `Series` contained a million elements. You would need to spend a few months continuously typing `True` and `False`! 🤡

As a data analyst, your goal is to perform tasks *programmatically*.

▶️ Type `keep_by_comparison = nums > 0` in the code cell below to perform a comparison on the `nums` Series.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

keep_by_comparison

Notice how `keep_by_comparison` is identical to the original `keep` Series?

▶️ Use `keep_by_comparison` to keep only the positive values in `nums`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

Note that applying a filter returns **a new `Series`** without modifying the original `Series`.

▶️ Run the code below.

In [None]:
print('Negative Values (filtered):')
display(nums[nums < 0])

print('\n\nOriginal Values:')
display(nums)

---

### 🎯 Mini-exercise: Filter even numbers

#### 👇 Tasks

- ✔️ Using `all_nums`, filter only even numbers.
    - Store the result to a new variable named `even_nums`.
- ✔️ `all_nums` should remain unaltered after your code.

#### 🚀 Hints

- Use the modulo operator (`%`) to check whether a number is even.
    - `some_num % 2 == 0`
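
Arithmetic operators such as `%` also work element-wise on a `Series`. The sketch below uses a different divisor (3) and a made-up `Series` named `demo_nums`, so it illustrates the idea without solving the exercise.

```python
import pandas as pd

demo_nums = pd.Series([3, 7, 9, -6, 10])

# % is applied to every element, producing a Series of remainders
remainders = demo_nums % 3

# Comparing the remainders with 0 produces a boolean Series
is_multiple_of_3 = remainders == 0

print(remainders)
print(is_multiple_of_3)
```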

In [None]:
all_nums = pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4])

# YOUR CODE BEGINS

# YOUR CODE ENDS

even_nums

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
pd.testing.assert_series_equal(all_nums, pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4]))
pd.testing.assert_series_equal(even_nums.reset_index(drop=True),
                               pd.Series([2, 4, 8, -2, 4]))

---

### 📌 Element-wise comparison in a `Series`

▶️ Run the code cell below to create a new `Series` named `countries`.

In [None]:
countries = pd.Series(['United States', 'Oman', 'United States',
                       'China', 'Korea, South', 'United States'])

display(countries)

What happens when you perform an equality comparison on strings?

▶️ Compare `countries` with the string `'United States'` using an equality comparison operator (`==`).

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

▶️ Run the code cell below to check the data type of the result.

In [None]:
type(countries == 'United States')

The result is **another `Series`** containing boolean (`True`/`False`) values. Pandas performs a string comparison (`my_str == 'United States'`) on **each element**.

You can also supply more than one condition.
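
As a rough sketch of how conditions combine, the example below uses a made-up `Series` named `scores`. Boolean `Series` are combined with `&` (and), `|` (or), and `~` (not). Each comparison is wrapped in parentheses because `&` and `|` have higher precedence than comparison operators such as `>=` and `==`.

```python
import pandas as pd

scores = pd.Series([55, 72, 88, 95, 40])

# AND: both conditions must be True for an element
in_range = (scores >= 60) & (scores < 90)

# OR: at least one condition must be True for an element
extreme = (scores < 50) | (scores > 90)

# NOT: flips every True to False and vice versa
not_extreme = ~extreme

print(scores[in_range])
print(scores[extreme])
```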

▶️ Run the code cell below to check whether a country is **either** `'Oman'` **or** `'China'`.

In [None]:
(countries == 'Oman') | (countries == 'China')

In [None]:
countries[(countries == 'Oman') | (countries == 'China')]

---

### 📌 Filtering a `DataFrame`

👉 I will keep saying this. A `DataFrame` is a combination of one or more columns. Filtering a `DataFrame` is very similar to filtering a `Series`.

▶️ Run the code cell below to create a new `DataFrame` named `df_cities`.

In [None]:
df_cities = pd.DataFrame({'city': ['Lisle', 'Muscat', 'Niles', 'Shanghai', 'Seoul', 'Chicago'],
 'country': ['United States', 'Oman', 'United States', 'China', 'Korea, South', 'United States'],
 'population': [23270, 1421409, 28938, 22120000, 21794000, 8604203]})

df_cities

To only keep rows where the `country` is `'United States'`, we can again supply a `Series` of boolean values.
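
The same idea can be sketched on a small, made-up `DataFrame`. The names `df_pets`, `is_dog`, and `df_dogs` are hypothetical. A comparison on one column produces a boolean `Series`, and that `Series` selects which rows of the whole `DataFrame` to keep.

```python
import pandas as pd

df_pets = pd.DataFrame({
    'name': ['Rex', 'Tom', 'Ivy'],
    'species': ['dog', 'cat', 'dog'],
    'age': [3, 5, 1],
})

# Comparing a single column yields one boolean per row
is_dog = df_pets['species'] == 'dog'

# Rows where the mask is True are kept; df_pets itself is unchanged
df_dogs = df_pets[is_dog]

print(df_dogs)
```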

▶️ Create a new `Series` named `keep` with the following 6 boolean values - `True`, `False`, `True`, `False`, `False`, `True`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Check your work
pd.testing.assert_series_equal(keep,
                               pd.Series([1, 0, 1, 0, 0, 1]).astype(bool))

# Display keep
keep

🤠 You know the drill now.

▶️ Type `df_cities[keep]` in the code cell below and run it.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

---

### 🎯 Mini-exercise: Cities with population over a million

#### 👇 Tasks

- ✔️ Using `df_cities`, filter rows with a population greater than a million (`1000000`).
    - Store the result to a new variable named `df_large_cities`.
- ✔️ `df_cities` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_large_cities

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
pd.testing.assert_frame_equal(df_large_cities.reset_index(drop=True),
                              pd.DataFrame({'city': ['Muscat', 'Shanghai', 'Seoul', 'Chicago'],
                                            'country': ['Oman', 'China', 'Korea, South', 'United States'],
                                            'population': [1421409, 22120000, 21794000, 8604203]}))

---

### 🎯 Mini-exercise: People who do not use an iPhone

▶️ Run the code cell below to see the **first** 3 rows of `df_you`.

In [None]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.head(3)

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person does not use an iPhone.
    - Store the result to a new variable named `df_no_iphone`.
- ✔️ `df_you` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_no_iphone

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
pd.testing.assert_frame_equal(df_no_iphone.reset_index(drop=True),
                              pd.DataFrame({'name': ['Zach', 'Mark'], 'major1': ['Finance', 'Finance'],
                                            'major2': ['Information Systems', np.nan], 'city': ['Glenview', 'Metamora'],
                                            'distance_from_champaign': [137.04, 74.97], 'fav_restaurant': [np.nan, 'Taco Bell'], 'has_iphone': [False, False]}))

---

### 🎯 Mini-exercise: People who are from outside the States

▶️ Run the code cell below to see the **last** 2 rows of `df_you`.

In [None]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.tail(2)

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person is from a city that is at least 1000 miles away.
    - Use the `distance_from_champaign` column.
    - Store the result to a new variable named `df_far`.
- ✔️ `df_you` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_far

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_you_backup[df_you_backup['distance_from_champaign'] >= 1000]

pd.testing.assert_frame_equal(df_far.sort_values(df_far.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: People who are `Economics` majors or like `Taco Bell`

▶️ Run the code cell below to see the **last** 5 rows of `df_you`.

In [None]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.tail(5)

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows that match the following criteria:
    - The person's `major1` is `Economics`, **OR**
    - The person's `fav_restaurant` is `Taco Bell`
- ✔️ Store the filtered `DataFrame` to a new variable named `df_econotacos`.
- ✔️ `df_you` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_econotacos

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_you_backup[(df_you_backup['major1'] == 'Economics') | (df_you_backup['fav_restaurant'] == 'Taco Bell')]

pd.testing.assert_frame_equal(df_econotacos.sort_values(df_econotacos.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Do your own thang

▶️ Run the code cell below to **randomly select** 5 rows from `df_you`.

In [None]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.sample(5)

#### 👇 Tasks

- ✔️ Using `df_you`, create a filter that may interest you.

In [None]:
# YOUR CODE BEGINS



# YOUR CODE ENDS

---

### 📌 Sorting a `DataFrame`

▶️ Run the code cell below to **sort** `df_you` by `distance_from_champaign`.

In [None]:
df_you.sort_values('distance_from_champaign')

▶️ Run the code cell below to **sort** `df_you` by `major1`, and then by `major2` for people with the same `major1` value.

In [None]:
df_you.sort_values(['major1', 'major2'])

▶️ Run the code cell below to **sort** `df_you` by `distance_from_champaign` in descending order.

In [None]:
df_you.sort_values('distance_from_champaign', ascending=False)
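
The sorting cells above can be summarized with a small sketch on made-up data. The names `df_sort_demo` and `df_sorted` are hypothetical. It assumes you want a different direction per column, which `sort_values()` supports by passing a list to `ascending`. Like filtering, sorting returns a new `DataFrame` and leaves the original untouched.

```python
import pandas as pd

df_sort_demo = pd.DataFrame({
    'major': ['Finance', 'Economics', 'Finance', 'Economics'],
    'distance': [120.5, 30.0, 15.2, 800.0],
})

# Sort by major (A to Z), then by distance (largest first) within each major
df_sorted = df_sort_demo.sort_values(['major', 'distance'],
                                     ascending=[True, False])

print(df_sorted)

# The original row order is unchanged
print(df_sort_demo)
```

Sorted rows keep their original index labels, so the index of `df_sorted` is no longer in order. Calling `reset_index(drop=True)` on the result renumbers the rows from 0 if that is preferred.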