# Pandas Filtering and Sorting

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [None]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
import sys
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")

---

### 🎯 Challenge 1: Create a Pandas Series

#### 👇 Tasks

- ✔️ Create a new Pandas `Series` named `my_series` with the following three values: `10`, `20`, `30`.

#### 🚀 Hint

The code below creates a new Pandas `Series` with the values `1` and `2`.

```python
my_new_series = pd.Series([1, 2])
```

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

print(my_series)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [None]:
pd.testing.assert_series_equal(my_series, pd.Series(x * 10 for x in range(1, 4)))

---

### 🎯 Challenge 2: Create a Pandas DataFrame

#### 👇 Tasks

- ✔️ You are given two lists, `product_names` and `num_reviews`, that contain the names of make-up products and their number of reviews on Sephora.com.
- ✔️ Using the two lists, create a new Pandas `DataFrame` named `df_top_products` that has the following two columns:
    1. `product_name`: Names of the products
    2. `num_review`: Number of reviews
- ✔️ Note that the column names are singular.

#### 🚀 Hint

The code below creates a new Pandas `DataFrame` from two series.

```python
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})
```

In [None]:
product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
]

num_reviews = [
    12715,
    2274,
    2766,
    724
]

# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_top_products)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [None]:
pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({"product_name": {0: "Laneige Lip Sleeping Mask",
        1: "The Ordinary Hyaluronic Acid 2% + B5",
        2: "Laneige Lip Glowy Balm",
        3: "Chanel COCO MADEMOISELLE Eau de Parfum"},
        "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}})
)

---

### 📌 Load data

The second part of today's lecture is all about **you**. 👻 Literally.

▶️ Run the code cell below to create a new `DataFrame` named `df_you`.

In [None]:
df_you = pd.read_csv("https://github.com/bdi475/datasets/raw/main/about-you.csv")

# Used to keep a clean copy
df_you_backup = df_you.copy()

# head() displays the first 5 rows of a DataFrame
df_you.head()

☝️ **Hold on.** Didn't we always create `DataFrame`s using `pd.DataFrame()`?

Yes. But we can *import* existing data as a Pandas `DataFrame` using `pd.read_csv()`. There are many other similar import methods. For now, we'll mostly use `pd.read_csv()`.
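
As a minimal sketch of what `pd.read_csv()` does, the example below reads CSV text from memory instead of a URL. The column names and rows are made up for illustration and are not taken from `about-you.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file or a URL
csv_text = """name,city,has_iphone
Ana,Lisbon,True
Ben,Chicago,False
"""

# pd.read_csv() accepts a file path, a URL, or a file-like object such as StringIO
df_example = pd.read_csv(io.StringIO(csv_text))
print(df_example)
```

The first line of the CSV becomes the column names, and each following line becomes one row.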

The table below explains each column in `df_you`.

| Column Name             | Description                                               |
|-------------------------|-----------------------------------------------------------|
| name                    | First name                                                |
| major1                  | Major                                                     |
| major2                  | Second major OR minor (blank if no second major or minor) |
| city                    | City the person is from                                   |
| country                 | Country the person is from                                |
| fav_restaurant          | Favorite restaurant (blank if no restaurant was given)    |
| fav_movie               | Favorite movie (blank if no movie was given)              |
| has_iphone              | Whether the person uses an iPhone                         |

---

### 📌 Concise summary of a `DataFrame`

👉 A common first step in working with a `DataFrame` is to use the `info()` method. `info()` prints a concise summary of a `DataFrame`.
- Index data type
- Column information: for each column, the following information is displayed:
    - Number of non-missing values
    - Data type of the column
- Memory usage

▶️ Run `df_you.info()` below to see the `info()` method in action.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

👉 From the result of `df_you.info()`, we can understand a couple of things:

- There are 8 columns.
- 7 out of 8 columns have the `object` data type.
    - In Pandas, a string data type is shown as `object`, not `str`.
        - We will skip the technical discussion for now.
- The second line of the output tells us the number of rows (i.e., observations).
- Some columns contain one or more missing values.
    - Missing values are displayed as `NaN`.
    - To denote a missing value, use NumPy's `np.nan` (more on this later).
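
To see how a missing value shows up, here is a small sketch with a hypothetical `DataFrame` (`df_example` and its contents are invented for illustration). One entry is set to `np.nan`, so `info()` should report one fewer non-null value for that column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Ben did not give a favorite movie
df_example = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho"],
    "fav_movie": ["Up", np.nan, "Coco"],
})

# The "fav_movie" column reports 2 non-null values out of 3 rows
df_example.info()

# isna().sum() counts the missing values in each column
missing_counts = df_example.isna().sum()
print(missing_counts)
```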

---

### 🎯 Challenge 3: Display first/last/random rows

▶️ Run `df_you.head()` to print the first 5 rows of `df_you`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

▶️ Run `df_you.tail(4)` to print the last 4 rows of `df_you`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

▶️ Run `df_you.sample(3)` to print 3 randomly sampled rows from `df_you`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

In [None]:
# Autograder

---

### 📌 Number of rows and columns in a `DataFrame`

👉 How many rows and columns does `df_you` have?

▶️ Run `df_you.shape` below to see the *shape* (number of rows and columns) of the `DataFrame`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

👉 Can you store the number of rows and columns to variables?

---

- `df_you.shape` returns a `tuple` in `(num_rows, num_cols)` format. 
- What is a `tuple`? 🙀
- A `tuple` is a `list` that cannot be modified once created.
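
The short sketch below uses a made-up tuple named `point` to show two things: a tuple can be *unpacked* into separate variables, and trying to change one of its elements raises an error.

```python
# Hypothetical tuple for illustration
point = (3, 4)

# Unpacking assigns each element of the tuple to its own variable
x, y = point
print(x, y)

# Trying to change an element of a tuple raises a TypeError
try:
    point[0] = 99
except TypeError as err:
    print(f"Cannot modify a tuple: {err}")
```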

▶️ Run the code cell below to see how a `tuple` is nearly identical to a `list`.

In [None]:
# These two are nearly identical.
# The only difference is that my_tuple cannot be modified once created.
my_list = [10, 20]
my_tuple = (10, 20)

print(f"my_list[1]={my_list[1]}")    # prints 20
print(f"my_tuple[1]={my_tuple[1]}")  # also prints 20

---

### 🎯 Challenge 4: Find the number of rows and columns in a `DataFrame`

#### 👇 Tasks

- ✔️ Store the number of rows in `df_you` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_you` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

print(num_rows)
print(num_cols)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(num_rows, len(df_you.index), f"Number of rows should be {len(df_you.index)}")
tc.assertEqual(num_cols, len(df_you.columns), f"Number of columns should be {len(df_you.columns)}")

---

### 📌 Filtering rows

Let's step back and go back to working with a `Series`.

▶️ Create a `Series` named `nums` with the following four integers: `-20`, `-10`, `10`, `20`. 

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

nums

👉 Is there a way to *filter* the `Series` so that it only contains **positive** values? Let's first try this **manually**.

▶️ Create a new `Series` named `keep` with the following four boolean values: `False`, `False`, `True`, `True`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Check your work
pd.testing.assert_series_equal(keep,
                              pd.Series([0, 0, 1, 1]).astype(bool))

# Display keep
keep

Let's visualize the two `Series` (`nums` and `keep`) you've created.

![nums-and-keep](https://github.com/bdi475/images/blob/main/nums-and-keep-series.png?raw=true)

▶️ Now, you can use the boolean `Series` to filter another `Series`. Type in `nums[keep]` below and run the cell.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

If you're confused about what just happened, the visualization below may give you a better idea.

![nums-and-keep-filter-result](https://github.com/bdi475/images/blob/main/nums-and-keep-filter-result.png?raw=true)

The syntax for filtering a `Series` is `my_series[keep]` where `keep` is a `Series` of boolean values indicating whether to keep an element or not. `keep` should have the exact same number of elements as `my_series`.
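
Here is the same pattern on a different, made-up `Series` (`temps` and `is_warm` are hypothetical names). The sketch also shows that a filtered result keeps the original index labels, which is why the check cells in this notebook call `reset_index(drop=True)` before comparing.

```python
import pandas as pd

# Hypothetical data for illustration
temps = pd.Series([31, 18, 25, 12])
is_warm = pd.Series([True, False, True, False])

# Keep only the elements where is_warm is True
warm_temps = temps[is_warm]
print(warm_temps)  # the kept values still carry index labels 0 and 2

# reset_index(drop=True) renumbers the result starting from 0
warm_temps_renumbered = warm_temps.reset_index(drop=True)
print(warm_temps_renumbered)
```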

▶️ **Uncomment the code cell below first** and run it to see what happens when your `keep` does not have the same number of elements as `my_series`.

(⛔️ **Heads-up**: The code will throw an error! Once you're done running the cell, comment the lines out again.)

In [None]:
# keep_incorrect = pd.Series([False, False, True])
# nums[keep_incorrect]

👉 Is there a way to *filter* the `Series` so that it only contains **positive** values without typing every boolean by hand? The manual method we just used is inefficient. Imagine if your `Series` contained a million elements. You would need to spend a few months continuously typing `True` and `False`! 🤡

As a data analyst, your goal is to perform tasks *programmatically*.

▶️ Type `keep_by_comparison = nums > 0` in the code cell below to perform a comparison on the `nums` Series.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

keep_by_comparison

Notice how `keep_by_comparison` is identical to the original `keep` Series?

▶️ Use the `keep_by_comparison` to filter positive values in `nums`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

Note that applying a filter returns **a new `Series`** without modifying the original `Series`.

▶️ Run the code below.

In [None]:
print("Negative Values (filtered):")
display(nums[nums < 0])

print("\n\nOriginal Values:")
display(nums)

---

### 🎯 Challenge 5: Filter even numbers

#### 👇 Tasks

- ✔️ Using `all_nums`, filter only even numbers.
    - Store the result to a new variable named `even_nums`.
- ✔️ `all_nums` should remain unaltered after your code.

#### 🚀 Hints

- Use the modulo operator (`%`) to check whether a number is even.
    - `some_num % 2 == 0`
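
Arithmetic operators such as `%` are applied to every element of a `Series`, just like comparisons. As a rough illustration on a different problem, the sketch below finds multiples of three in a made-up `Series` named `sample_nums`.

```python
import pandas as pd

# Hypothetical data for illustration
sample_nums = pd.Series([3, 4, 9, 10])

# % is applied to each element, then == compares each remainder with 0
is_multiple_of_three = sample_nums % 3 == 0
print(is_multiple_of_three)

multiples_of_three = sample_nums[is_multiple_of_three]
print(multiples_of_three)
```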

In [None]:
all_nums = pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4])

# YOUR CODE BEGINS

# YOUR CODE ENDS

even_nums

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
pd.testing.assert_series_equal(all_nums, pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4]))
pd.testing.assert_series_equal(even_nums.reset_index(drop=True),
                               pd.Series([2, 4, 8, -2, 4]))

---

### 📌 Filtering a `DataFrame`

👉 I will keep saying this. A `DataFrame` is a combination of one or more columns. Filtering a `DataFrame` is very similar to filtering a `Series`.

▶️ Run the code cell below to create a new `DataFrame` named `df`.

In [None]:
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

df

To only keep rows where the `name` is `'John'`, we can again supply a `Series` of boolean values. Only the first and last rows of the `DataFrame` contain `'John'`.

▶️ Create a new `Series` named `is_john` with the following boolean values - `True`, `False`, `False`, `True`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Check your work
tc.assertEqual(is_john.to_list(), pd.Series([1, 0, 0, 1]).astype(bool).to_list())

# Display is_john
is_john

▶️ Type `result = df[is_john]` in the code cell below and run it.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

result

Here is a visualization of how `df[is_john]` works.

![mini-dataframe-filter-rows](https://github.com/bdi475/images/blob/main/filter-mini-dataframe-result.png?raw=true)
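
In practice, the boolean `Series` usually comes from a comparison on one of the columns instead of being typed by hand. A minimal sketch, using a made-up `DataFrame` named `df_items`:

```python
import pandas as pd

# Hypothetical inventory data for illustration
df_items = pd.DataFrame({
    "item": ["pen", "ink", "pad"],
    "stock": [0, 12, 5],
})

# Comparing a column produces a boolean Series with one value per row
in_stock = df_items["stock"] > 0

# Passing the boolean Series keeps only the rows marked True
df_in_stock = df_items[in_stock]
print(df_in_stock)
```

As with a `Series`, the filter returns a new `DataFrame` and leaves `df_items` as it was.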

---

### 🎯 Challenge 6: Find all positive transactions

#### 👇 Tasks

- ✔️ Given `df`, filter rows with positive `amount` values.
    - Store the result to a new variable named `df_pos`.
    - `df_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

▶️ Run the code cell below to create `df`.

In [None]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_pos

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_pos.reset_index(drop=True),
                              df_check.iloc[[2, 3]].reset_index(drop=True))

---

### 📌 Logical operators in pandas `Series`

👉 There are only three *logical* operators in Pandas you need to remember.

- `&`: Logical **AND**
- `|`: Logical **OR**
- `~`: Logical **NOT**

These operators perform element-wise *logical* operations.
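
The sketch below applies the three operators to two made-up boolean `Series`, `a` and `b`. It also shows that Python's own `and` keyword cannot be used here: `and` asks for a single `True`/`False` answer for a whole `Series`, which Pandas refuses to give.

```python
import pandas as pd

# Hypothetical boolean Series for illustration
a = pd.Series([True, False, True])
b = pd.Series([True, True, False])

a_and_b = a & b  # True only where both a and b are True
a_or_b = a | b   # True where at least one of a, b is True
not_a = ~a       # flips every value in a

print(pd.DataFrame({"a": a, "b": b, "a & b": a_and_b, "a | b": a_or_b, "~a": not_a}))

# Python's `and` keyword is not element-wise and raises a ValueError
try:
    a and b
except ValueError as err:
    print(f"`and` does not work on a Series: {err}")
```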

#### 📍 Logical AND

👉 A logical AND operator `&` returns `True` only if both the operands are `True`.

![s1_AND_s2](https://github.com/bdi475/images/blob/main/s1-AND-s2.png?raw=true)

▶️ Perform a logical AND operation (`&`) on `s1` and `s2` and store the result to a new variable named `s1_AND_s2`.

In [None]:
s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

# YOUR CODE BEGINS

# YOUR CODE ENDS

# 🧭 Check your work
pd.testing.assert_series_equal(s1_AND_s2, pd.Series([1, 0, 0, 0]).astype(bool))

# Display s1, s2, s1_AND_s2 together as a DataFrame
pd.DataFrame({"s1": s1, "s2": s2, "s1_AND_s2": s1_AND_s2})

#### 📍 Logical OR

👉 A logical OR operator `|` returns `True` if at least one of the operands is `True`.

![s1_OR_s2](https://github.com/bdi475/images/blob/main/s1-OR-s2.png?raw=true)

▶️ Perform a logical OR operation (`|`) on `s1` and `s2` and store the result to a new variable named `s1_OR_s2`.

In [None]:
s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

# YOUR CODE BEGINS

# YOUR CODE ENDS

# 🧭 Check your work
pd.testing.assert_series_equal(s1_OR_s2, pd.Series([1, 1, 1, 0]).astype(bool))

# Display s1, s2, s1_OR_s2 together as a DataFrame
pd.DataFrame({"s1": s1,
              "s2": s2,
              "s1_OR_s2": s1_OR_s2})

#### 📍 Logical NOT

👉 A logical NOT operator `~` flips each boolean value: `True` becomes `False`, and `False` becomes `True`.

![NOT_s1](https://github.com/bdi475/images/blob/main/NOT-s1.png?raw=true)

▶️ Perform a logical NOT operation (`~`) on `s1` and store the result to a new variable named `NOT_s1`.

In [None]:
s1 = pd.Series([True, True, False, False])

# YOUR CODE BEGINS

# YOUR CODE ENDS

# 🧭 Check your work
pd.testing.assert_series_equal(NOT_s1, pd.Series([0, 0, 1, 1]).astype(bool))

# Display s1 and NOT_s1 together as a DataFrame
pd.DataFrame({"s1": s1,
              "NOT_s1": NOT_s1})

---

### 🎯 Challenge 7: Find John's positive transaction(s)

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is `'John'` **and** the amount is positive.
    - Store the result to a new variable named `df_john_and_pos`.
    - `df_john_and_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Create another boolean Series `is_positive` using a *greater than* comparison (`df['amount'] > 0`).
- Use a logical AND operator `&` to combine `is_john` and `is_positive`.

▶️ Run the code cell below to create `df`.

In [None]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_john_and_pos

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_john_and_pos.reset_index(drop=True),
                              df_check.iloc[[3]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![is_john_AND_is_positive](https://github.com/bdi475/images/blob/main/is-john-AND-is-positive.png?raw=true)

---

### 🎯 Challenge 8: Find transactions that are made by John OR are positive

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is `"John"` **or** the amount is positive.
    - Store the result to a new variable named `df_john_or_pos`.
    - `df_john_or_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Create another boolean Series `is_positive` using a *greater than* comparison (`df['amount'] > 0`).
- Use a logical OR operator `|` to combine `is_john` and `is_positive`.

▶️ Run the code cell below to create `df`.

In [None]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_john_or_pos

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_john_or_pos.reset_index(drop=True),
                              df_check.iloc[[0, 2, 3]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![is_john_OR_is_positive](https://github.com/bdi475/images/blob/main/is-john-OR-is-positive.png?raw=true)

---

### 🎯 Challenge 9: Find transactions that are NOT made by John

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is NOT `'John'`.
    - Store the result to a new variable named `df_not_john`.
    - `df_not_john` should be a `DataFrame`.
- ✔️ Although you can do this without the NOT operator (`~`), **your goal is to use `~`**.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Use a logical NOT operator `~` to reverse `is_john`.

▶️ Run the code cell below to create `df`.

In [None]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_not_john

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_not_john.reset_index(drop=True),
                              df_check.iloc[[1, 2]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![not_john](https://github.com/bdi475/images/blob/main/not-john.png?raw=true)

---

### 📌 Element-wise comparison in a `Series`

▶️ Run the code cell below to create a new `Series` named `countries`.

In [None]:
countries = pd.Series(["United States", "Oman", "United States",
                       "China", "South Korea", "United States"])

display(countries)

What happens when you perform an equality comparison on strings?

▶️ Compare `countries` with the string `'United States'` using an equality comparison operator (`==`).

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

▶️ Run the code cell below to check the data type of the result.

In [None]:
type(countries == "United States")

The result is **another `Series`** containing boolean (`True`/`False`) values. Pandas performs a string comparison (`my_str == 'United States'`) on **each element**.

Remember, you can also supply more than one condition using the following two operators:

1. logical OR (`|`)
2. logical AND (`&`)
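
When you combine comparisons on one line, wrap each comparison in its own parentheses. In Python, `&` and `|` are evaluated before comparison operators such as `==` and `>`, so leaving the parentheses out typically causes an error or an unintended result. A small sketch with a made-up `Series` named `scores`:

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.Series([35, 72, 88, 51])

# Each comparison sits inside its own parentheses before being combined with &
in_range = (scores >= 50) & (scores < 80)
print(in_range)

print(scores[in_range])
```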

▶️ Run the code cell below to check whether a country is **either** `'Oman'` **or** `'China'`.

In [None]:
(countries == "Oman") | (countries == "China")

In [None]:
countries[(countries == "Oman") | (countries == "China")]

---

### 📌 Another example of filtering a `DataFrame`

▶️ Run the code cell below to create a new `DataFrame` named `df_cities`.

In [None]:
df_cities = pd.DataFrame({
    "city": ["Lisle", "Dubai", "Niles", "Shanghai", "Seoul", "Chicago"],
    "country": ["United States", "United Arab Emirates", "United States", "China", "South Korea", "United States"],
    "population": [23270, 3331409, 28938, 26320000, 21794000, 8604203]
})

df_cities

To only keep rows where the `country` is `'United States'`, we can again supply a `Series` of boolean values.

▶️ Create a new `Series` named `keep` with the following 6 boolean values - `True`, `False`, `True`, `False`, `False`, `True`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Check your work
pd.testing.assert_series_equal(keep.reset_index(drop=True),
                               pd.Series([1, 0, 1, 0, 0, 1]).astype(bool).reset_index(drop=True),
                               check_names=False)

# Display keep
keep

🤠 You know the drill now.

▶️ Type `df_cities[keep]` in the code cell below and run it.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

---

### 🎯 Challenge 10: Cities with population over a million

#### 👇 Tasks

- ✔️ Using `df_cities`, filter rows with a population greater than a million (`1000000`).
    - Store the result to a new variable named `df_large_cities`.
- ✔️ `df_cities` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_large_cities

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
pd.testing.assert_frame_equal(df_large_cities.reset_index(drop=True),
                              df_cities.query('population > 1000000').reset_index(drop=True))

---

### 🎯 Challenge 11: People who do not use an iPhone

▶️ Run the code cell below to see the **first** 3 rows of `df_you`.

In [None]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.head(3)

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person does not use an iPhone.
    - Store the result to a new variable named `df_no_iphone`.
- ✔️ `df_you` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_no_iphone

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
pd.testing.assert_frame_equal(df_no_iphone.reset_index(drop=True),
                              df_you[~df_you["has_iphone"]].reset_index(drop=True))

---

### 🎯 Challenge 12: People who are Economics majors or like Shawarma Joint

▶️ Run the code cell below to see the **last** 5 rows of `df_you`.

In [None]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.tail()

#### 👇 Tasks

- ✔️ Using `df_you`, filter rows that match the following criteria:
    - The person's `major1` is `"Economics"`, **OR**
    - The person's `fav_restaurant` is `"Shawarma Joint"`
- ✔️ Store the filtered `DataFrame` to a new variable named `df_econ_or_shawarma`.
- ✔️ `df_you` should remain unaltered after your code.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_econ_or_shawarma

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_you_backup[(df_you_backup["major1"] == "Economics") | (df_you_backup["fav_restaurant"] == "Shawarma Joint")]

pd.testing.assert_frame_equal(df_econ_or_shawarma.sort_values(df_econ_or_shawarma.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

## 👉 Sorting a `DataFrame`

You can sort a `DataFrame` using `df.sort_values()`.

![sort_values usage](https://github.com/bdi475/images/blob/main/pandas/sort-values-01.png?raw=true)
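
Here is a minimal sketch of both sort directions on a made-up `DataFrame` named `df_scores`. Like filtering, `sort_values()` returns a new `DataFrame` by default, so the original keeps its order.

```python
import pandas as pd

# Hypothetical data for illustration
df_scores = pd.DataFrame({
    "player": ["Ana", "Ben", "Cho"],
    "score": [70, 95, 82],
})

# Ascending order is the default
by_score_asc = df_scores.sort_values("score")

# ascending=False sorts from largest to smallest
by_score_desc = df_scores.sort_values("score", ascending=False)

print(by_score_desc)  # rows keep their original index labels
print(df_scores)      # the original DataFrame is not reordered
```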

▶️ Run the code cell below to **sort** `df_cities` by `population`.

In [None]:
df_cities.sort_values('population')

▶️ Run the code cell below to **sort** `df_cities` by `population` in descending order.

In [None]:
df_cities.sort_values('population', ascending=False)

---

## Exercises using the Yeezys dataset

### 📌 Load data

▶️ Run the code cell below to create a new `DataFrame` named `df_sneakers`.

In [None]:
df_sneakers = pd.read_csv("https://github.com/bdi475/datasets/raw/main/yeezy_sneakers.csv")

# Used to keep a clean copy
df_sneakers_backup = df_sneakers.copy()

# head() displays the first 5 rows of a DataFrame
df_sneakers.head()

The table below describes the columns in `df_sneakers`.

| Column Name             | Description           |
|-------------------------|-----------------------|
| brand                   | Brand of the sneaker  |
| product                 | Name of the sneaker   |
| price                   | Price of the sneaker  |

---

### 📌 Concise summary of a `DataFrame`

▶️ Print out a summary of the `df_sneakers` using the `info()` method.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

---

### 📌 Number of rows and columns in a `DataFrame`

👉 How many rows and columns does `df_sneakers` have?

▶️ Run `df_sneakers.shape` below to see the *shape* of the `DataFrame`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

---

### 🎯 Challenge 13: Find the number of rows and columns in a `DataFrame`

#### 👇 Tasks

- ✔️ Store the number of rows in `df_sneakers` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_sneakers` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

print(num_rows)
print(num_cols)

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(num_rows, len(df_sneakers.index), f"Number of rows should be {len(df_sneakers.index)}")
tc.assertEqual(num_cols, len(df_sneakers.columns), f"Number of columns should be {len(df_sneakers.columns)}")

---

### 🎯 Challenge 14: Find `Adidas` sneakers

#### 👇 Tasks

- ✔️ Find Adidas sneakers and store the filtered result to `df_adidas`.
- ✔️ `df_sneakers` should remain unaltered.

#### 🔑 Expected Output

|    | brand   | product                                |   price |
|---:|:--------|:---------------------------------------|--------:|
|  0 | Adidas  | Yeezy 750 Boost Light Brown            |    1578 |
|  1 | Adidas  | Yeezy 350 Boost Pirate Black           |     910 |
|  2 | Adidas  | Yeezy Boost 350 V2 Lundmark Reflective |    1009 |
|  3 | Adidas  | Yeezy 350 Boost V2 Black/Red           |     954 |
| 10 | Adidas  | Yeezy Boost 350 V2 Black Reflective    |    1437 |
| 11 | Adidas  | Yeezy Boost 350 V2 Antlia Reflective   |     912 |
| 12 | Adidas  | Yeezy Boost 350 V2 Synth Reflective    |    1292 |
| 13 | Adidas  | Yeezy 350 Boost Turtledove             |    1279 |
| 14 | Adidas  | Yeezy 750 Boost Glow in the Dark       |     917 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_adidas

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_sneakers_copy = df_sneakers_backup.copy()

pd.testing.assert_frame_equal(
    df_sneakers_copy \
        .query("brand == 'Adidas'") \
        .sort_values(df_sneakers_copy.columns.to_list()) \
        .reset_index(drop=True),
    df_adidas.reset_index(drop=True) \
        .sort_values(df_adidas.columns.to_list()) \
        .reset_index(drop=True)
)

---

### 🎯 Challenge 15: Find Sneakers under \\$1,000

#### 👇 Tasks

- ✔️ Find sneakers under \\$1,000 and store the filtered result to `df_under_1000`.
- ✔️ `df_sneakers` should remain unaltered.

#### 🔑 Expected Output

|    | brand   | product                              |   price |
|---:|:--------|:-------------------------------------|--------:|
|  1 | Adidas  | Yeezy 350 Boost Pirate Black         |     910 |
|  3 | Adidas  | Yeezy 350 Boost V2 Black/Red         |     954 |
| 11 | Adidas  | Yeezy Boost 350 V2 Antlia Reflective |     912 |
| 14 | Adidas  | Yeezy 750 Boost Glow in the Dark     |     917 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_under_1000

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_sneakers_copy = df_sneakers_backup.copy()

pd.testing.assert_frame_equal(
    df_sneakers_copy \
        .query("price < 1000") \
        .sort_values(df_sneakers_copy.columns.to_list()) \
        .reset_index(drop=True),
    df_under_1000.reset_index(drop=True) \
        .sort_values(df_under_1000.columns.to_list()) \
        .reset_index(drop=True)
)

---

### 🎯 Challenge 16: Find `Nike` Sneakers over \\$3,000

#### 👇 Tasks

- ✔️ Find Nike sneakers over \$3,000 and store the filtered result to `df_nike_over_3000`.
- ✔️ `df_sneakers` should remain unaltered.

#### 🔑 Expected Output

|    | brand   | product                   |   price |
|---:|:--------|:--------------------------|--------:|
|  4 | Nike    | Air Yeezy Blink           |    3142 |
|  5 | Nike    | Air Yeezy 2 Red October   |    6075 |
|  6 | Nike    | Air Yeezy 2 Solar Red     |    4239 |
|  7 | Nike    | Air Yeezy 2 Pure Platinum |    3448 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_nike_over_3000

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_sneakers_copy = df_sneakers_backup.copy()

pd.testing.assert_frame_equal(
    df_sneakers_copy \
        .query("(brand == 'Nike') & (price > 3000)") \
        .sort_values(df_sneakers_copy.columns.to_list()) \
        .reset_index(drop=True),
    df_nike_over_3000.reset_index(drop=True) \
        .sort_values(df_nike_over_3000.columns.to_list()) \
        .reset_index(drop=True)
)

---

### 🎯 Challenge 17: Sort sneakers by price in descending order

#### 👇 Tasks

- ✔️ Sort sneakers by price in descending order and store the result to `df_sorted_by_price_desc`.
- ✔️ `df_sneakers` should remain unaltered.
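
#### 🚀 Hint

The code below is a minimal sketch of sorting by a single column in descending order. The `df_fruits` DataFrame is made up for illustration (it is not part of this notebook). By default, `sort_values()` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data for illustration only (unrelated to df_sneakers)
df_fruits = pd.DataFrame({
    "name": ["apple", "banana", "cherry", "kiwi"],
    "price": [3, 1, 8, 2],
})

# ascending=False puts the largest price first.
# The original row labels (index) are kept, so they appear out of order.
df_fruits_by_price_desc = df_fruits.sort_values("price", ascending=False)

print(df_fruits_by_price_desc)
```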

#### 🔑 Expected Output

|    | brand   | product                                |   price |
|---:|:--------|:---------------------------------------|--------:|
|  5 | Nike    | Air Yeezy 2 Red October                |    6075 |
|  6 | Nike    | Air Yeezy 2 Solar Red                  |    4239 |
|  7 | Nike    | Air Yeezy 2 Pure Platinum              |    3448 |
|  4 | Nike    | Air Yeezy Blink                        |    3142 |
|  9 | Nike    | Air Yeezy Zen Grey                     |    2139 |
|  8 | Nike    | Air Yeezy Net                          |    1888 |
|  0 | Adidas  | Yeezy 750 Boost Light Brown            |    1578 |
| 10 | Adidas  | Yeezy Boost 350 V2 Black Reflective    |    1437 |
| 12 | Adidas  | Yeezy Boost 350 V2 Synth Reflective    |    1292 |
| 13 | Adidas  | Yeezy 350 Boost Turtledove             |    1279 |
|  2 | Adidas  | Yeezy Boost 350 V2 Lundmark Reflective |    1009 |
|  3 | Adidas  | Yeezy 350 Boost V2 Black/Red           |     954 |
| 14 | Adidas  | Yeezy 750 Boost Glow in the Dark       |     917 |
| 11 | Adidas  | Yeezy Boost 350 V2 Antlia Reflective   |     912 |
|  1 | Adidas  | Yeezy 350 Boost Pirate Black           |     910 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_sorted_by_price_desc

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [None]:
df_sneakers_copy = df_sneakers_backup.copy()

pd.testing.assert_frame_equal(
    df_sorted_by_price_desc \
        .reset_index(drop=True),
    # Sorting ascending and then reversing with iloc[::-1] gives descending order
    df_sneakers_copy.sort_values("price").iloc[::-1] \
        .reset_index(drop=True)
)

---

### 🎯 Challenge 18: Sneakers `> 6000` or `< 1000`

#### 👇 Tasks

- ✔️ Find sneakers that are over \$6,000 **or** under \$1,000.
- ✔️ Store the result in a new DataFrame named `df_polar`.
- ✔️ `df_sneakers` should remain unaltered.
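
#### 🚀 Hint

The code below is a minimal sketch of filtering with an **or** condition. It uses a made-up `df_fruits` DataFrame (hypothetical data, not part of this notebook). Each condition is wrapped in parentheses and the two are combined with `|`, so a row is kept if **either** condition is true.

```python
import pandas as pd

# Hypothetical toy data for illustration only (unrelated to df_sneakers)
df_fruits = pd.DataFrame({
    "name": ["apple", "banana", "cherry", "kiwi"],
    "price": [3, 1, 8, 2],
})

# Keep rows where price is greater than 7 OR less than 2.
# Rows keep their original order and index labels.
df_extreme = df_fruits[(df_fruits["price"] > 7) | (df_fruits["price"] < 2)]

print(df_extreme)
```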

#### 🔑 Expected Output

|    | brand   | product                              |   price |
|---:|:--------|:-------------------------------------|--------:|
|  1 | Adidas  | Yeezy 350 Boost Pirate Black         |     910 |
|  3 | Adidas  | Yeezy 350 Boost V2 Black/Red         |     954 |
|  5 | Nike    | Air Yeezy 2 Red October              |    6075 |
| 11 | Adidas  | Yeezy Boost 350 V2 Antlia Reflective |     912 |
| 14 | Adidas  | Yeezy 750 Boost Glow in the Dark     |     917 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_polar

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [None]:
df_sneakers_copy = df_sneakers_backup.copy()

pd.testing.assert_frame_equal(
    df_polar.reset_index(drop=True) \
        .sort_values(df_polar.columns.to_list()) \
        .reset_index(drop=True),
    df_sneakers_copy \
        .query("(price > 6000) | (price < 1000)") \
        .sort_values(df_sneakers_copy.columns.to_list()) \
        .reset_index(drop=True)
)

---

### 🎯 Challenge 19: Sort by brand ascending and price descending

#### 👇 Tasks

- ✔️ Sort `df_sneakers` by (1) brand ascending and (2) price descending within each brand.
- ✔️ Store the result in a new DataFrame named `df_sorted_by_brand_price`.
- ✔️ `df_sneakers` should remain unaltered.
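
#### 🚀 Hint

The code below is a minimal sketch of sorting by two columns with a different direction for each. The `df_fruits` DataFrame is made up for illustration (it is not part of this notebook). Pass a list of column names to `sort_values()` and a list of booleans of the same length to `ascending`; the first column is the primary sort key.

```python
import pandas as pd

# Hypothetical toy data for illustration only (unrelated to df_sneakers)
df_fruits = pd.DataFrame({
    "color": ["red", "yellow", "red", "green"],
    "name": ["apple", "banana", "cherry", "kiwi"],
    "price": [3, 1, 8, 2],
})

# Sort by color A-to-Z first, then by price high-to-low within each color.
# sort_values() returns a new DataFrame; df_fruits itself is not modified.
df_fruits_sorted = df_fruits.sort_values(["color", "price"], ascending=[True, False])

print(df_fruits_sorted)
```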

#### 🔑 Expected Output

|    | brand   | product                                |   price |
|---:|:--------|:---------------------------------------|--------:|
|  0 | Adidas  | Yeezy 750 Boost Light Brown            |    1578 |
| 10 | Adidas  | Yeezy Boost 350 V2 Black Reflective    |    1437 |
| 12 | Adidas  | Yeezy Boost 350 V2 Synth Reflective    |    1292 |
| 13 | Adidas  | Yeezy 350 Boost Turtledove             |    1279 |
|  2 | Adidas  | Yeezy Boost 350 V2 Lundmark Reflective |    1009 |
|  3 | Adidas  | Yeezy 350 Boost V2 Black/Red           |     954 |
| 14 | Adidas  | Yeezy 750 Boost Glow in the Dark       |     917 |
| 11 | Adidas  | Yeezy Boost 350 V2 Antlia Reflective   |     912 |
|  1 | Adidas  | Yeezy 350 Boost Pirate Black           |     910 |
|  5 | Nike    | Air Yeezy 2 Red October                |    6075 |
|  6 | Nike    | Air Yeezy 2 Solar Red                  |    4239 |
|  7 | Nike    | Air Yeezy 2 Pure Platinum              |    3448 |
|  4 | Nike    | Air Yeezy Blink                        |    3142 |
|  9 | Nike    | Air Yeezy Zen Grey                     |    2139 |
|  8 | Nike    | Air Yeezy Net                          |    1888 |

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_sorted_by_brand_price

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [None]:
df_sneakers_copy = df_sneakers_backup.copy()

pd.testing.assert_frame_equal(
    df_sorted_by_brand_price \
        .reset_index(drop=True),
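    # Reversing a (brand descending, price ascending) sort with iloc[::-1]
    # yields brand ascending, price descending
    df_sneakers_copy.sort_values(["brand", "price"], ascending=[False, True]).iloc[::-1] \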
        .reset_index(drop=True)
)