# Lecture 11 - Pandas Filtering

Tuesday 2021/09/28

## Lecture Notes and in-class exercises

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [1]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [2]:
# YOUR CODE BEGINS
import pandas as pd
import numpy as np
# YOUR CODE ENDS

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [3]:
import sys
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")

---

### 🎯 Mini-exercise: Create a Pandas Series

#### 👇 Tasks

- ✔️ Create a new Pandas `Series` named `my_series` with the following three values: `10`, `20`, `30`.

#### 🚀 Hint

The code below creates a new Pandas `Series` with the values `1` and `2`.

```python
my_new_series = pd.Series([1, 2])
```

In [4]:
# YOUR CODE BEGINS
my_series = pd.Series([10, 20, 30])
# YOUR CODE ENDS

print(my_series)

0    10
1    20
2    30
dtype: int64


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [5]:
pd.testing.assert_series_equal(my_series, pd.Series(x * 10 for x in range(1, 4)))

---

### 🎯 Mini-exercise: Create a Pandas DataFrame

#### 👇 Tasks

- ✔️ You are given two lists, `product_names` and `num_reviews`, that contain the names of make-up products and their number of reviews on Sephora.com.
- ✔️ Using the two lists, create a new Pandas `DataFrame` named `df_top_products` that has the following two columns:
    1. `product_name`: Names of the products
    2. `num_review`: Number of reviews
- ✔️ Note that the column names are singular.

#### 🚀 Hint

The code below creates a new Pandas `DataFrame` from two series.

```python
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})
```

In [6]:
product_names = [
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
]

num_reviews = [
    12715,
    2274,
    2766,
    724
]

# YOUR CODE BEGINS
df_top_products = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews
})
# YOUR CODE ENDS

display(df_top_products)

Unnamed: 0,product_name,num_review
0,Laneige Lip Sleeping Mask,12715
1,The Ordinary Hyaluronic Acid 2% + B5,2274
2,Laneige Lip Glowy Balm,2766
3,Chanel COCO MADEMOISELLE Eau de Parfum,724


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [7]:
pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({"product_name": {0: "Laneige Lip Sleeping Mask",
        1: "The Ordinary Hyaluronic Acid 2% + B5",
        2: "Laneige Lip Glowy Balm",
        3: "Chanel COCO MADEMOISELLE Eau de Parfum"},
        "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}})
)

---

### 📌 Load data

The second part of today's lecture is all about **you**. 👻 Literally.

▶️ Run the code cell below to create a new `DataFrame` named `df_you`.

In [8]:
df_you = pd.read_csv("https://raw.githubusercontent.com/bdi475/datasets/main/about-you.csv")

# Used to keep a clean copy
df_you_backup = df_you.copy()

# head() displays the first 5 rows of a DataFrame
df_you.head()

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
1,Ana Maria,Nondegree,,Bucharest,5313.0,,,True
2,Aishani,Finance,,Naperville,115.46,Noodles and Company,The Mummy,True
3,Joshua,Advertising,,Chicago,125.86,Soho House,The Shawshank Redemption,False
4,Adrian,Finance,Information Systems,Grayslake,180.82,Choong Man Chicken,Inception,True


☝️ **Hold on.** Didn't we always create `DataFrame`s using `pd.DataFrame()`?

Yes. But we can also *import* existing data as a Pandas `DataFrame` using `pd.read_csv()`. Pandas provides similar functions for other file formats. For now, we'll mostly use `pd.read_csv()`.
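
To see what `pd.read_csv()` does without relying on a network connection, here is a minimal sketch that reads CSV text from an in-memory string. The sample rows and the names `csv_text` and `df_demo` are made up for illustration and are not part of the class dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content (not the real about-you.csv file)
csv_text = """name,city,has_iphone
Ann,Chicago,True
Bob,,False
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)
print(df_demo.dtypes)
```

Note that the empty `city` field for the second row is read as a missing value.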

The table below explains each column in `df_you`.

| Column Name             | Description                                               |
|-------------------------|-----------------------------------------------------------|
| name                    | First name                                                |
| major1                  | Major                                                     |
| major2                  | Second major OR minor (blank if no second major or minor) |
| city                    | City the person is from                                   |
| distance_from_champaign | Straight distance from the city to Champaign in miles     |
| fav_restaurant          | Favorite restaurant (blank if no restaurant was given)    |
| fav_movie               | Favorite movie (blank if no movie was given)              |
| has_iphone              | Whether the person uses an iPhone                         |

---

### 📌 Concise summary of a `DataFrame`

👉 A common first step in working with a `DataFrame` is to use the `info()` method. `info()` prints a concise summary of a `DataFrame`, including:
- Index data type
- Column information: for each column, the following information is displayed:
    - Number of non-missing values
    - Data type of the column
- Memory usage

▶️ Run `df_you.info()` below to see the `info()` method in action.

In [9]:
# YOUR CODE BEGINS
df_you.info()
# YOUR CODE ENDS

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     21 non-null     object 
 1   major1                   21 non-null     object 
 2   major2                   8 non-null      object 
 3   city                     19 non-null     object 
 4   distance_from_champaign  18 non-null     float64
 5   fav_restaurant           18 non-null     object 
 6   fav_movie                15 non-null     object 
 7   has_iphone               21 non-null     bool   
dtypes: bool(1), float64(1), object(6)
memory usage: 1.3+ KB


👉 From the result of `df_you.info()`, we can understand a couple of things:

- There are 8 columns.
- Six columns have the `object` data type.
    - In Pandas, a string data type is shown as `object`, not `str`.
        - We will skip the technical discussion for now.
- The second line of the output tells us that there are 21 entries.
- Some columns have 21 non-null values - these columns do not contain any missing value.
- Some columns have fewer than 21 non-null values - these columns contain one or more missing values.
    - Missing values are displayed as `NaN`.
    - To denote a missing value, use NumPy's `np.nan` (more on this later).
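
As a rough sketch of how missing values can be counted directly, the example below builds a small made-up `DataFrame` (the names `df_demo` and `missing_per_column` are hypothetical) and uses `isna()` and `count()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: np.nan marks a missing value
df_demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Cho"],
    "city": ["Chicago", np.nan, "Seoul"],
})

# isna() flags missing cells; sum() counts the True values per column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# count() gives the number of non-missing values,
# like the "Non-Null Count" column printed by info()
print(df_demo.count())
```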

---

### 🎯 Mini-exercise: Display first/last/random rows

▶️ Run `df_you.head()` to print the first 5 rows of `df_you`.

In [10]:
# YOUR CODE BEGINS
df_you.head()
# YOUR CODE ENDS

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
1,Ana Maria,Nondegree,,Bucharest,5313.0,,,True
2,Aishani,Finance,,Naperville,115.46,Noodles and Company,The Mummy,True
3,Joshua,Advertising,,Chicago,125.86,Soho House,The Shawshank Redemption,False
4,Adrian,Finance,Information Systems,Grayslake,180.82,Choong Man Chicken,Inception,True


▶️ Run `df_you.tail(4)` to print the last 4 rows of `df_you`.

In [11]:
# YOUR CODE BEGINS
df_you.tail(4)
# YOUR CODE ENDS

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
17,Luke,Finance,,Urbana,2.0,Pizzaria Antica,,True
18,Sean S,Advertising,,Taipei,7539.0,Dragon Gate,Let the Bullets Fly,True
19,Sarvani,Psychology,,Grayslake,180.82,Big Bowl,The Trial of Chicago 7,True
20,Leo,Statistics,Computer Science,Wuxi,7158.0,Shiquan,Interstellar,True


▶️ Run `df_you.sample(3)` to print 3 randomly sampled rows from `df_you`.

In [12]:
# YOUR CODE BEGINS
df_you.sample(3)
# YOUR CODE ENDS

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
3,Joshua,Advertising,,Chicago,125.86,Soho House,The Shawshank Redemption,False
17,Luke,Finance,,Urbana,2.0,Pizzaria Antica,,True
5,Anika,Information Sciences,,Barrington,,California Pizza Kitchen,Goldfinch,True
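
Because `sample()` is random, you will likely see different rows each time you run it. If you need the same rows every time, one option is the `random_state` parameter. The sketch below uses a made-up `DataFrame` named `df_demo`, and the seed value `42` is arbitrary.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({"value": range(10)})

# The same random_state produces the same sample
first_draw = df_demo.sample(3, random_state=42)
second_draw = df_demo.sample(3, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))
```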


---

### 📌 Number of rows and columns in a `DataFrame`

👉 How many rows and columns does `df_you` have?

▶️ Run `df_you.shape` below to see the *shape* (number of rows and columns) of the `DataFrame`.

In [13]:
# YOUR CODE BEGINS
df_you.shape
# YOUR CODE ENDS

(21, 8)

👉 Can you store the number of rows and columns to variables?

---

- `df_you.shape` returns a `tuple` in `(num_rows, num_cols)` format. 
- What is a `tuple`? 🙀
- A `tuple` is like a `list`, except that it cannot be modified once created.

▶️ Run the code cell below to see how a `tuple` is nearly identical to a `list`.

In [14]:
# These two are nearly identical,
# The only difference is that my_tuple cannot be modified once created
my_list = [10, 20]
my_tuple = (10, 20)

print(f"my_list[1]={my_list[1]}")    # prints 20
print(f"my_tuple[1]={my_tuple[1]}")  # also prints 20

my_list[1]=20
my_tuple[1]=20
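
To make the "cannot be modified" part concrete, here is a small sketch. It tries to change an element of both a `list` and a `tuple`, then unpacks the `tuple` into two variables. The variable names `tuple_was_modified`, `first`, and `second` are chosen only for this illustration.

```python
my_list = [10, 20]
my_tuple = (10, 20)

# A list can be changed in place
my_list[1] = 99

# A tuple cannot: item assignment raises a TypeError
try:
    my_tuple[1] = 99
    tuple_was_modified = True
except TypeError as err:
    tuple_was_modified = False
    print(f"TypeError: {err}")

# A tuple can be unpacked into separate variables
first, second = my_tuple
print(first, second)
```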


---

### 🎯 Mini-exercise: Find the number of rows and columns in a `DataFrame`

#### 👇 Tasks

- ✔️ Store the number of rows in `df_you` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_you` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.

In [15]:
# YOUR CODE BEGINS
num_rows = df_you.shape[0]
num_cols = df_you.shape[1]
# YOUR CODE ENDS

print(num_rows)
print(num_cols)

21
8


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [16]:
tc.assertEqual(num_rows, len(df_you.index), f"Number of rows should be {len(df_you.index)}")
tc.assertEqual(num_cols, len(df_you.columns), f"Number of columns should be {len(df_you.columns)}")

---

### 📌 Filtering rows

Let's step back and return to working with a `Series`.

▶️ Create a `Series` named `nums` with the following four integers: `-20`, `-10`, `10`, `20`. 

In [17]:
# YOUR CODE BEGINS
nums = pd.Series([-20, -10, 10, 20])
# YOUR CODE ENDS

nums

0   -20
1   -10
2    10
3    20
dtype: int64

👉 Is there a way to *filter* the `Series` so that it only contains **positive** values? Let's first try this **manually**.

▶️ Create a new `Series` named `keep` with the following four boolean values: `False`, `False`, `True`, `True`.

In [18]:
# YOUR CODE BEGINS
keep = pd.Series([False, False, True, True])
# YOUR CODE ENDS

# Check your work
pd.testing.assert_series_equal(keep,
                              pd.Series([0, 0, 1, 1]).astype(bool))

# Display keep
keep

0    False
1    False
2     True
3     True
dtype: bool

Let's visualize the two `Series` (`nums` and `keep`) you've created.

![nums-and-keep](https://github.com/bdi475/images/blob/main/nums-and-keep-series.png?raw=true)

▶️ Now, you can use the boolean `Series` to filter another `Series`. Type in `nums[keep]` below and run the cell.

In [19]:
# YOUR CODE BEGINS
nums[keep]
# YOUR CODE ENDS

2    10
3    20
dtype: int64

If you're confused about what just happened, the visualization below may give you a better idea.

![nums-and-keep-filter-result](https://github.com/bdi475/images/blob/main/nums-and-keep-filter-result.png?raw=true)

The syntax for filtering a `Series` is `my_series[keep]` where `keep` is a `Series` of boolean values indicating whether to keep an element or not. `keep` should have the exact same number of elements as `my_series`.

▶️ **Uncomment the code cell below first** and run it to see what happens when your `keep` does not have the same number of elements as `my_series`.

(⛔️ **Heads-up**: The code will throw an error! Once you're done running the cell, comment the lines out again.)

In [20]:
# keep_incorrect = pd.Series([False, False, True])
# nums[keep_incorrect]
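
If you would rather see the error without stopping the notebook, one option is to wrap the filter in `try`/`except`. This is only a sketch: the names `nums_demo`, `keep_too_short`, and `error_name` are made up, and the exception is caught broadly because the exact error class may differ between pandas versions.

```python
import pandas as pd

nums_demo = pd.Series([-20, -10, 10, 20])
keep_too_short = pd.Series([False, False, True])  # only 3 values for 4 elements

error_name = None
try:
    nums_demo[keep_too_short]
except Exception as err:  # caught broadly on purpose (see the note above)
    error_name = type(err).__name__
    print(f"{error_name}: {err}")
```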

👉 Is there a better way to *filter* the `Series` so that it only contains **positive** values? The manual method we just used is inefficient. Imagine that your `Series` contains a million elements. You would need to spend months typing `True` and `False`! 🤡

As a data analyst, your goal is to perform tasks *programmatically*.

▶️ Type `keep_by_comparison = nums > 0` in the code cell below to perform a comparison on the `nums` Series.

In [21]:
# YOUR CODE BEGINS
keep_by_comparison = nums > 0
# YOUR CODE ENDS

keep_by_comparison

0    False
1    False
2     True
3     True
dtype: bool

Notice how `keep_by_comparison` is identical to the original `keep` Series?

▶️ Use the `keep_by_comparison` to filter positive values in `nums`.

In [22]:
# YOUR CODE BEGINS
nums[keep_by_comparison]
# YOUR CODE ENDS

2    10
3    20
dtype: int64

Note that applying a filter returns **a new `Series`** without modifying the original `Series`.

▶️ Run the code below.

In [23]:
print("Negative Values (filtered):")
display(nums[nums < 0])

print("\n\nOriginal Values:")
display(nums)

Negative Values (filtered):


0   -20
1   -10
dtype: int64



Original Values:


0   -20
1   -10
2    10
3    20
dtype: int64

---

### 🎯 Mini-exercise: Filter even numbers

#### 👇 Tasks

- ✔️ Using `all_nums`, filter only even numbers.
    - Store the result to a new variable named `even_nums`.
- ✔️ `all_nums` should remain unaltered after your code.

#### 🚀 Hints

- Use the modulo operator (`%`) to check whether a number is even.
    - `some_num % 2 == 0`

In [24]:
all_nums = pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4])

# YOUR CODE BEGINS
even_nums = all_nums[all_nums % 2 == 0]
# YOUR CODE ENDS

even_nums

0    2
2    4
3    8
4   -2
8    4
dtype: int64

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [25]:
pd.testing.assert_series_equal(all_nums, pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4]))
pd.testing.assert_series_equal(even_nums.reset_index(drop=True),
                               pd.Series([2, 4, 8, -2, 4]))
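
The same modulo idea can pick out odd numbers, including negative ones. The sketch below repeats the values of `all_nums` under the made-up name `all_nums_demo` so that it runs on its own.

```python
import pandas as pd

all_nums_demo = pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4])

# In pandas (as in Python), the remainder takes the sign of the divisor,
# so -5 % 2 is 1 and negative odd numbers are detected correctly
remainders = all_nums_demo % 2
odd_nums = all_nums_demo[remainders == 1]

print(odd_nums.to_list())  # [5, -5, -11, 13]
```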

---

### 📌 Filtering a `DataFrame`

👉 I will keep saying this. A `DataFrame` is a combination of one or more columns. Filtering a `DataFrame` is very similar to filtering a `Series`.

▶️ Run the code cell below to create a new `DataFrame` named `df`.

In [26]:
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


To keep only the rows where the `name` is `'John'`, we can again supply a `Series` of boolean values. Only the first and last rows of the `DataFrame` contain `'John'`.

▶️ Create a new `Series` named `is_john` with the following boolean values - `True`, `False`, `False`, `True`.

In [27]:
# YOUR CODE BEGINS
is_john = pd.Series([True, False, False, True])
# YOUR CODE ENDS

# Check your work
tc.assertEqual(is_john.to_list(), pd.Series([1, 0, 0, 1]).astype(bool).to_list())

# Display is_john
is_john

0     True
1    False
2    False
3     True
dtype: bool

▶️ Type `result = df[is_john]` in the code cell below and run it.

In [28]:
# YOUR CODE BEGINS
result = df[is_john]
# YOUR CODE ENDS

result

Unnamed: 0,name,amount
0,John,-20
3,John,20


Here is a visualization of how `df[is_john]` works.

![mini-dataframe-filter-rows](https://github.com/bdi475/images/blob/main/filter-mini-dataframe-result.png?raw=true)
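
Notice that the filtered result keeps the original row labels (`0` and `3`). The check cells in this notebook call `reset_index(drop=True)` before comparing, which renumbers the rows. A minimal sketch of that behavior, using a copy of the data under the made-up name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"],
                        "amount": [-20, -10, 10, 20]})

filtered = df_demo[df_demo["name"] == "John"]
print(filtered.index.to_list())  # [0, 3]: the original row labels are kept

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.to_list())  # [0, 1]
```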

---

### 🎯 Mini-exercise: Find all positive transactions

#### 👇 Tasks

- ✔️ Given `df`, filter rows with positive `amount` values.
    - Store the result to a new variable named `df_pos`.
    - `df_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

▶️ Run the code cell below to create `df`.

In [29]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [30]:
# YOUR CODE BEGINS
df_pos = df[df["amount"] > 0]
# YOUR CODE ENDS

df_pos

Unnamed: 0,name,amount
2,Tom,10
3,John,20


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [31]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_pos.reset_index(drop=True),
                              df_check.iloc[[2, 3]].reset_index(drop=True))

---

### 📌 Logical operators in pandas `Series`

👉 There are only three *logical* operators in Pandas you need to remember.

- `&`: Logical **AND**
- `|`: Logical **OR**
- `~`: Logical **NOT**

These operators perform element-wise *logical* operations.

#### 📍 Logical AND

👉 A logical AND operator `&` returns `True` only if both the operands are `True`.

![s1_AND_s2](https://github.com/bdi475/images/blob/main/s1-AND-s2.png?raw=true)

▶️ Perform a logical AND operation (`&`) on `s1` and `s2` and store the result to a new variable named `s1_AND_s2`.

In [32]:
s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

# YOUR CODE BEGINS
s1_AND_s2 = s1 & s2
# YOUR CODE ENDS

# 🧭 Check your work
pd.testing.assert_series_equal(s1_AND_s2, pd.Series([1, 0, 0, 0]).astype(bool))

# Display s1, s2, s1_AND_s2 together as a DataFrame
pd.DataFrame({"s1": s1, "s2": s2, "s1_AND_s2": s1_AND_s2})

Unnamed: 0,s1,s2,s1_AND_s2
0,True,True,True
1,True,False,False
2,False,True,False
3,False,False,False


#### 📍 Logical OR

👉 A logical OR operator `|` returns `True` if either of the operands is `True`.

![s1_OR_s2](https://github.com/bdi475/images/blob/main/s1-OR-s2.png?raw=true)

▶️ Perform a logical OR operation (`|`) on `s1` and `s2` and store the result to a new variable named `s1_OR_s2`.

In [33]:
s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

# YOUR CODE BEGINS
s1_OR_s2 = s1 | s2
# YOUR CODE ENDS

# 🧭 Check your work
pd.testing.assert_series_equal(s1_OR_s2, pd.Series([1, 1, 1, 0]).astype(bool))

# Display s1, s2, s1_OR_s2 together as a DataFrame
pd.DataFrame({"s1": s1,
              "s2": s2,
              "s1_OR_s2": s1_OR_s2})

Unnamed: 0,s1,s2,s1_OR_s2
0,True,True,True
1,True,False,True
2,False,True,True
3,False,False,False


#### 📍 Logical NOT

👉 A logical NOT operator `~` reverses each operand.

![NOT_s1](https://github.com/bdi475/images/blob/main/NOT-s1.png?raw=true)

▶️ Perform a logical NOT operation (`~`) on `s1` and store the result to a new variable named `NOT_s1`.

In [34]:
s1 = pd.Series([True, True, False, False])

# YOUR CODE BEGINS
NOT_s1 = ~s1
# YOUR CODE ENDS

# 🧭 Check your work
pd.testing.assert_series_equal(NOT_s1, pd.Series([0, 0, 1, 1]).astype(bool))

# Display s1 and NOT_s1 together as a DataFrame
pd.DataFrame({"s1": s1,
              "NOT_s1": NOT_s1})

Unnamed: 0,s1,NOT_s1
0,True,False
1,True,False
2,False,True
3,False,True
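
The three operators can be combined in one expression. As a quick sketch, the example below checks that negating an AND gives the same result as OR-ing the negations (De Morgan's law). The names `left` and `right` are made up for this illustration.

```python
import pandas as pd

s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

# NOT (s1 AND s2) is the same as (NOT s1) OR (NOT s2)
left = ~(s1 & s2)
right = (~s1) | (~s2)

print(left.to_list())      # [False, True, True, True]
print(left.equals(right))  # True
```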


---

### 🎯 Mini-exercise: Find John's positive transaction(s)

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is `'John'` **and** the amount is positive.
    - Store the result to a new variable named `df_john_and_pos`.
    - `df_john_and_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Create another boolean Series `is_positive` using a *greater than* comparison (`df['amount'] > 0`).
- Use a logical AND operator `&` to combine `is_john` and `is_positive`.

▶️ Run the code cell below to create `df`.

In [35]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [36]:
# YOUR CODE BEGINS
is_john = df["name"] == "John"
is_positive = df["amount"] > 0

df_john_and_pos = df[is_john & is_positive]
# YOUR CODE ENDS

df_john_and_pos

Unnamed: 0,name,amount
3,John,20


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [37]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_john_and_pos.reset_index(drop=True),
                              df_check.iloc[[3]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![is_john_AND_is_positive](https://github.com/bdi475/images/blob/main/is-john-AND-is-positive.png?raw=true)

---

### 🎯 Mini-exercise: Find transactions that are made by John OR are positive

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is `"John"` **or** the amount is positive.
    - Store the result to a new variable named `df_john_or_pos`.
    - `df_john_or_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Create another boolean Series `is_positive` using a *greater than* comparison (`df['amount'] > 0`).
- Use a logical OR operator `|` to combine `is_john` and `is_positive`.

▶️ Run the code cell below to create `df`.

In [38]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [39]:
# YOUR CODE BEGINS
is_john = df["name"] == "John"
is_positive = df["amount"] > 0

df_john_or_pos = df[is_john | is_positive]
# YOUR CODE ENDS

df_john_or_pos

Unnamed: 0,name,amount
0,John,-20
2,Tom,10
3,John,20


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [40]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_john_or_pos.reset_index(drop=True),
                              df_check.iloc[[0, 2, 3]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![is_john_OR_is_positive](https://github.com/bdi475/images/blob/main/is-john-OR-is-positive.png?raw=true)

---

### 🎯 Mini-exercise: Find transactions that are NOT made by John

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is NOT `'John'`.
    - Store the result to a new variable named `df_not_john`.
    - `df_not_john` should be a `DataFrame`.
- ✔️ Although you can do this without the NOT operator (`~`), **your goal is to use `~`**.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Use a logical NOT operator `~` to reverse `is_john`.

▶️ Run the code cell below to create `df`.

In [41]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [42]:
# YOUR CODE BEGINS
is_john = df["name"] == "John"

df_not_john = df[~is_john]
# YOUR CODE ENDS

df_not_john

Unnamed: 0,name,amount
1,Mary,-10
2,Tom,10


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [43]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_not_john.reset_index(drop=True),
                              df_check.iloc[[1, 2]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![not_john](https://github.com/bdi475/images/blob/main/not-john.png?raw=true)

---

### 📌 Element-wise comparison in a `Series`

▶️ Run the code cell below to create a new `Series` named `countries`.

In [44]:
countries = pd.Series(["United States", "Oman", "United States",
                       "China", "South Korea", "United States"])

display(countries)

0    United States
1             Oman
2    United States
3            China
4      South Korea
5    United States
dtype: object

What happens when you perform an equality comparison on strings?

▶️ Compare `countries` with the string `'United States'` using an equality comparison operator (`==`).

In [45]:
# YOUR CODE BEGINS
countries == "United States"
# YOUR CODE ENDS

0     True
1    False
2     True
3    False
4    False
5     True
dtype: bool

▶️ Run the code cell below to check the data type of the result.

In [46]:
type(countries == "United States")

pandas.core.series.Series

The result is **another `Series`** containing boolean (`True`/`False`) values. Pandas performs a string comparison (`my_str == 'United States'`) on **each element**.

Remember, you can also supply more than one condition using the following two operators:

1. logical OR (`|`)
2. logical AND (`&`)

▶️ Run the code cell below to check whether a country is **either** `'Oman'` **or** `'China'`.

In [47]:
(countries == "Oman") | (countries == "China")

0    False
1     True
2    False
3     True
4    False
5    False
dtype: bool

In [48]:
countries[(countries == "Oman") | (countries == "China")]

1     Oman
3    China
dtype: object
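
When checking one column against several values, `isin()` is a possible alternative to chaining `|`. The sketch below compares the two approaches on a copy of the data, stored under the made-up name `countries_demo`.

```python
import pandas as pd

countries_demo = pd.Series(["United States", "Oman", "United States",
                            "China", "South Korea", "United States"])

# Each comparison is wrapped in parentheses because | binds more tightly than ==
mask_with_or = (countries_demo == "Oman") | (countries_demo == "China")

# isin() checks membership in a list of values in one step
mask_with_isin = countries_demo.isin(["Oman", "China"])

print(mask_with_or.equals(mask_with_isin))       # True
print(countries_demo[mask_with_isin].to_list())  # ['Oman', 'China']
```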

---

### 📌 Another example of filtering a `DataFrame`

▶️ Run the code cell below to create a new `DataFrame` named `df_cities`.

In [49]:
df_cities = pd.DataFrame({"city": ["Lisle", "Dubai", "Niles", "Shanghai", "Seoul", "Chicago"],
 "country": ["United States", "United Arab Emirates", "United States", "China", "South Korea", "United States"],
 "population": [23270, 3331409, 28938, 26320000, 21794000, 8604203]})

df_cities

Unnamed: 0,city,country,population
0,Lisle,United States,23270
1,Dubai,United Arab Emirates,3331409
2,Niles,United States,28938
3,Shanghai,China,26320000
4,Seoul,South Korea,21794000
5,Chicago,United States,8604203


To only keep rows where the `country` is `'United States'`, we can again supply a `Series` of boolean values.

▶️ Create a new `Series` named `keep` with the following 6 boolean values - `True`, `False`, `True`, `False`, `False`, `True`.

In [50]:
# YOUR CODE BEGINS
keep = pd.Series([True, False, True, False, False, True])
# OR
keep = df_cities["country"] == "United States"
# YOUR CODE ENDS

# Check your work
pd.testing.assert_series_equal(keep.reset_index(drop=True),
                               pd.Series([1, 0, 1, 0, 0, 1]).astype(bool).reset_index(drop=True),
                               check_names=False)

# Display keep
keep

0     True
1    False
2     True
3    False
4    False
5     True
Name: country, dtype: bool

🤠 You know the drill now.

▶️ Type `df_cities[keep]` in the code cell below and run it.

In [51]:
# YOUR CODE BEGINS
df_cities[keep]
# YOUR CODE ENDS

Unnamed: 0,city,country,population
0,Lisle,United States,23270
2,Niles,United States,28938
5,Chicago,United States,8604203


---

### 🎯 Mini-exercise: Cities with population over a million

#### 👇 Tasks

- ✔️ Using `df_cities`, filter rows with a population greater than a million (`1000000`).
    - Store the result to a new variable named `df_large_cities`.
- ✔️ `df_cities` should remain unaltered after your code.

In [52]:
# YOUR CODE BEGINS
df_large_cities = df_cities[df_cities['population'] > 1000000]
# YOUR CODE ENDS

df_large_cities

Unnamed: 0,city,country,population
1,Dubai,United Arab Emirates,3331409
3,Shanghai,China,26320000
4,Seoul,South Korea,21794000
5,Chicago,United States,8604203


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [53]:
pd.testing.assert_frame_equal(df_large_cities.reset_index(drop=True),
                              df_cities.query('population > 1000000').reset_index(drop=True))
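
The check cell above uses `query()`, which we have not covered. As a brief sketch, `query()` takes the filter condition as a string and, for a simple condition like this one, returns the same rows as boolean indexing. The data below is a made-up subset stored in `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({"city": ["Lisle", "Dubai", "Seoul"],
                        "population": [23270, 3331409, 21794000]})

# Boolean indexing, as used throughout this lecture
by_mask = df_demo[df_demo["population"] > 1000000]

# query() takes the condition as a string that refers to column names
by_query = df_demo.query("population > 1000000")

print(by_mask.equals(by_query))  # True
```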

---

### 🎯 Mini-exercise: People who do not use an iPhone

▶️ Run the code cell below to see the **first** 3 rows of `df_you`.

In [54]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.head(3)

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
1,Ana Maria,Nondegree,,Bucharest,5313.0,,,True
2,Aishani,Finance,,Naperville,115.46,Noodles and Company,The Mummy,True


#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person does not use an iPhone.
    - Store the result to a new variable named `df_no_iphone`.
- ✔️ `df_you` should remain unaltered after your code.

In [55]:
# YOUR CODE BEGINS
df_no_iphone = df_you[df_you["has_iphone"] == False]
# YOUR CODE ENDS

df_no_iphone

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
3,Joshua,Advertising,,Chicago,125.86,Soho House,The Shawshank Redemption,False
9,Colby,Statistics,Computer Science,Fremont,2169.0,Cravings,TRON: Legacy,False


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [56]:
pd.testing.assert_frame_equal(df_no_iphone.reset_index(drop=True),
                              df_you[~df_you["has_iphone"]].reset_index(drop=True))
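
The check cell above negates the column with `~` instead of comparing it with `False`. For a plain boolean column without missing values, both approaches should select the same rows, as this small sketch with made-up data (`df_demo`) suggests.

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["Ann", "Bob", "Cho"],
                        "has_iphone": [True, False, True]})

# Comparing against False and negating with ~ select the same rows
# when the column holds plain booleans with no missing values
by_comparison = df_demo[df_demo["has_iphone"] == False]
by_negation = df_demo[~df_demo["has_iphone"]]

print(by_negation["name"].to_list())      # ['Bob']
print(by_comparison.equals(by_negation))  # True
```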

---

### 🎯 Mini-exercise: People who are from outside the States

▶️ Run the code cell below to see the **last** 2 rows of `df_you`.

In [57]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.tail(2)

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
19,Sarvani,Psychology,,Grayslake,180.82,Big Bowl,The Trial of Chicago 7,True
20,Leo,Statistics,Computer Science,Wuxi,7158.0,Shiquan,Interstellar,True


#### 👇 Tasks

- ✔️ Using `df_you`, filter rows where the person is from a city that is at least 3000 miles away.
    - Use the `distance_from_champaign` column.
    - Store the result to a new variable named `df_far`.
- ✔️ `df_you` should remain unaltered after your code.

In [58]:
# YOUR CODE BEGINS
df_far = df_you[df_you["distance_from_champaign"] >= 3000]
# YOUR CODE ENDS

df_far

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
1,Ana Maria,Nondegree,,Bucharest,5313.0,,,True
8,Talal,Information Sciences,,Dubai,7345.0,Kabobi Persian,Monsters Inc.,True
10,Dylan,Mathematics,Chemistry,London,4051.0,Fast Food,,True
18,Sean S,Advertising,,Taipei,7539.0,Dragon Gate,Let the Bullets Fly,True
20,Leo,Statistics,Computer Science,Wuxi,7158.0,Shiquan,Interstellar,True


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [59]:
df_check = df_you_backup[df_you_backup["distance_from_champaign"] >= 3000]

pd.testing.assert_frame_equal(df_far.sort_values(df_far.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))
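
The check above sorts both frames and resets their indexes before comparing because `pd.testing.assert_frame_equal` also compares row order and index labels. A minimal sketch with made-up frames follows; `df_a`, `df_b`, and the `normalize` helper are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frames: same rows, different row order and index labels.
df_a = pd.DataFrame({"name": ["A", "B"], "distance": [1.0, 2.0]}, index=[5, 9])
df_b = pd.DataFrame({"name": ["B", "A"], "distance": [2.0, 1.0]}, index=[0, 1])

def normalize(df):
    # Sort by every column, then discard the old index labels.
    return df.sort_values(df.columns.tolist()).reset_index(drop=True)

# Passes because both frames were normalized first.
pd.testing.assert_frame_equal(normalize(df_a), normalize(df_b))

# Without normalization the same comparison raises an AssertionError.
try:
    pd.testing.assert_frame_equal(df_a, df_b)
    raw_frames_equal = True
except AssertionError:
    raw_frames_equal = False

print(raw_frames_equal)  # False
```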

---

### 🎯 Mini-exercise: People who are `Economics` majors or like `Chipotle`

▶️ Run the code cell below to see the **last** 5 rows of `df_you`.

In [60]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.tail()

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
16,Sean W,Mathematics,,,,,,True
17,Luke,Finance,,Urbana,2.0,Pizzaria Antica,,True
18,Sean S,Advertising,,Taipei,7539.0,Dragon Gate,Let the Bullets Fly,True
19,Sarvani,Psychology,,Grayslake,180.82,Big Bowl,The Trial of Chicago 7,True
20,Leo,Statistics,Computer Science,Wuxi,7158.0,Shiquan,Interstellar,True


#### 👇 Tasks

- ✔️ Using `df_you`, filter rows that match the following criteria:
    - The person's `major1` is `Economics`, **OR**
    - The person's `fav_restaurant` is `Chipotle`
- ✔️ Store the filtered `DataFrame` in a new variable named `df_econ_or_chipotle`.
- ✔️ `df_you` should remain unaltered after your code.
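
#### 🚀 Hint

Two conditions can be combined with `|`, the element-wise OR. The sketch below uses a made-up `df_toy` (its column names and values are assumptions). Each comparison is wrapped in parentheses because `|` binds more tightly than `==`, and Python's `or` keyword does not work on a `Series` with more than one value.

```python
import pandas as pd

# Hypothetical toy data; not the class survey.
df_toy = pd.DataFrame({
    "major": ["Economics", "Finance", "Statistics"],
    "restaurant": ["Subway", "Chipotle", "Subway"],
})

# A row is kept if either condition is True.
# The parentheses around each comparison are required.
df_toy_either = df_toy[
    (df_toy["major"] == "Economics") | (df_toy["restaurant"] == "Chipotle")
]

print(df_toy_either)
```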

In [61]:
# YOUR CODE BEGINS
df_econ_or_chipotle = df_you[(df_you["major1"] == "Economics") | (df_you["fav_restaurant"] == "Chipotle")]
# YOUR CODE ENDS

df_econ_or_chipotle

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
7,Adam,Economics,Statistics,Rockford,184.22,Seven Saints,John Wick,True
12,Michael,Information Sciences,Business,Northbrook,158.81,Chipotle,Rounders,True


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [62]:
df_check = df_you_backup[(df_you_backup["major1"] == "Economics") | (df_you_backup["fav_restaurant"] == "Chipotle")]

pd.testing.assert_frame_equal(df_econ_or_chipotle.sort_values(df_econ_or_chipotle.columns.tolist()).reset_index(drop=True),
                              df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True))

---

### 🎯 Mini-exercise: Do your own thang

▶️ Run the code cell below to **randomly select** 5 rows from `df_you`.

In [63]:
# Restore clean df_you
df_you = df_you_backup.copy()

df_you.sample(5)

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
15,Kyle,Econometrics,Quantative Economics,Highland Park,144.55,Tsu Kasa,Wolf of Wall Street,True
8,Talal,Information Sciences,,Dubai,7345.0,Kabobi Persian,Monsters Inc.,True
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
14,Jaqueline,Mathematics,,,,,,True
16,Sean W,Mathematics,,,,,,True


#### 👇 Tasks

- ✔️ Using `df_you`, create a filter that may interest you.
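
#### 🚀 Hint

If you are looking for ideas, here are a few filter patterns on a made-up `df_toy` (its column names and values are assumptions for illustration): `&` for element-wise AND, `isin()` for matching any value in a list, and `notna()` for keeping rows where a value is present.

```python
import pandas as pd

# Hypothetical toy data; not the class survey.
df_toy = pd.DataFrame({
    "major": ["Finance", "Mathematics", "Statistics", "Finance"],
    "movie": ["Inception", None, "Interstellar", None],
    "distance": [180.8, 4051.0, 7158.0, 2.0],
})

# & is the element-wise AND: both conditions must hold.
df_close_finance = df_toy[(df_toy["major"] == "Finance") & (df_toy["distance"] < 100)]

# isin() keeps rows whose value appears in the given list.
df_quant = df_toy[df_toy["major"].isin(["Mathematics", "Statistics"])]

# notna() keeps rows where the value is not missing.
df_has_movie = df_toy[df_toy["movie"].notna()]

print(df_close_finance)
print(df_quant)
print(df_has_movie)
```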

In [64]:
# YOUR CODE BEGINS



# YOUR CODE ENDS

---

### 📌 Sorting a `DataFrame`

▶️ Run the code cell below to **sort** `df_you` by `distance_from_champaign`.

In [65]:
df_you.sort_values("distance_from_champaign")

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
17,Luke,Finance,,Urbana,2.0,Pizzaria Antica,,True
2,Aishani,Finance,,Naperville,115.46,Noodles and Company,The Mummy,True
3,Joshua,Advertising,,Chicago,125.86,Soho House,The Shawshank Redemption,False
6,Mia,Computer Science,,Chicago,125.86,Olive Garden,Sabrina,True
11,Rhea,Psychology,Business,Chicago,125.86,RPM Italian,,True
15,Kyle,Econometrics,Quantative Economics,Highland Park,144.55,Tsu Kasa,Wolf of Wall Street,True
12,Michael,Information Sciences,Business,Northbrook,158.81,Chipotle,Rounders,True
13,Eli,Systems Engineering,,Lincolnshire,164.07,Maize,Wall-E,True
4,Adrian,Finance,Information Systems,Grayslake,180.82,Choong Man Chicken,Inception,True
19,Sarvani,Psychology,,Grayslake,180.82,Big Bowl,The Trial of Chicago 7,True
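
Two details about `sort_values()` are easy to miss: it returns a new sorted `DataFrame` rather than changing the original, and by default it places missing values at the end. A minimal sketch on a made-up `df_toy` (names and values are assumptions):

```python
import pandas as pd

# Hypothetical toy data; not the class survey.
df_toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "distance": [7158.0, float("nan"), 2.0],
})

# Ascending sort; the row with a missing distance goes last.
df_sorted = df_toy.sort_values("distance")

print(df_sorted["name"].tolist())  # ['C', 'A', 'B']
print(df_toy["name"].tolist())     # ['A', 'B', 'C'] (original order is unchanged)
```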


▶️ Run the code cell below to **sort** `df_you` by `major1`, and then by `major2` for people with the same `major1` value.

In [66]:
df_you.sort_values(["major1", "major2"])

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
3,Joshua,Advertising,,Chicago,125.86,Soho House,The Shawshank Redemption,False
18,Sean S,Advertising,,Taipei,7539.0,Dragon Gate,Let the Bullets Fly,True
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
6,Mia,Computer Science,,Chicago,125.86,Olive Garden,Sabrina,True
15,Kyle,Econometrics,Quantative Economics,Highland Park,144.55,Tsu Kasa,Wolf of Wall Street,True
7,Adam,Economics,Statistics,Rockford,184.22,Seven Saints,John Wick,True
4,Adrian,Finance,Information Systems,Grayslake,180.82,Choong Man Chicken,Inception,True
2,Aishani,Finance,,Naperville,115.46,Noodles and Company,The Mummy,True
17,Luke,Finance,,Urbana,2.0,Pizzaria Antica,,True
12,Michael,Information Sciences,Business,Northbrook,158.81,Chipotle,Rounders,True
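
When a list of columns is passed, later columns only break ties in earlier ones, and rows with a missing value in the tie-breaking column come last within their group. A minimal sketch on a made-up `df_toy` (names and values are assumptions):

```python
import pandas as pd

# Hypothetical toy data; not the class survey.
df_toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "major1": ["Finance", "Advertising", "Finance", "Finance"],
    "major2": [None, None, "Information Systems", "Business"],
})

# Sort by major1 first; major2 decides the order among equal major1 values.
df_sorted = df_toy.sort_values(["major1", "major2"])

print(df_sorted["name"].tolist())  # ['B', 'D', 'C', 'A']
```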


▶️ Run the code cell below to **sort** `df_you` by `distance_from_champaign` in descending order.

In [67]:
df_you.sort_values("distance_from_champaign", ascending=False)

Unnamed: 0,name,major1,major2,city,distance_from_champaign,fav_restaurant,fav_movie,has_iphone
18,Sean S,Advertising,,Taipei,7539.0,Dragon Gate,Let the Bullets Fly,True
8,Talal,Information Sciences,,Dubai,7345.0,Kabobi Persian,Monsters Inc.,True
20,Leo,Statistics,Computer Science,Wuxi,7158.0,Shiquan,Interstellar,True
1,Ana Maria,Nondegree,,Bucharest,5313.0,,,True
10,Dylan,Mathematics,Chemistry,London,4051.0,Fast Food,,True
9,Colby,Statistics,Computer Science,Fremont,2169.0,Cravings,TRON: Legacy,False
0,Khalid,Computer Science,,Fairfax,713.45,Chick-fil-a,Ford vs Ferrari,False
7,Adam,Economics,Statistics,Rockford,184.22,Seven Saints,John Wick,True
4,Adrian,Finance,Information Systems,Grayslake,180.82,Choong Man Chicken,Inception,True
19,Sarvani,Psychology,,Grayslake,180.82,Big Bowl,The Trial of Chicago 7,True
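
The `ascending` parameter also accepts a list, one value per sort column, so the directions can be mixed. A minimal sketch on a made-up `df_toy` (names and values are assumptions):

```python
import pandas as pd

# Hypothetical toy data; not the class survey.
df_toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "major": ["Finance", "Finance", "Advertising", "Advertising"],
    "distance": [2.0, 180.8, 7539.0, 125.9],
})

# Single column, largest distance first.
df_desc = df_toy.sort_values("distance", ascending=False)
print(df_desc["name"].tolist())  # ['C', 'B', 'D', 'A']

# major in A-to-Z order, then the largest distance first within each major.
df_mixed = df_toy.sort_values(["major", "distance"], ascending=[True, False])
print(df_mixed["name"].tolist())  # ['C', 'D', 'B', 'A']
```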
