# Pandas Series, DataFrame, CSV, Filtering

▶️ First, run the code cell below to import `unittest`, a module used for **🧭 Check Your Work** sections and the autograder.

In [1]:
import unittest
tc = unittest.TestCase()

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.

In [2]:
### BEGIN SOLUTION
import pandas as pd
import numpy as np
### END SOLUTION

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [3]:
tc.assertTrue("pd" in globals(), "Check whether you have correctly imported Pandas with an alias.")
tc.assertTrue("np" in globals(), "Check whether you have correctly imported NumPy with an alias.")

---

### 🎯 Challenge 1: Create a Pandas Series

#### 👇 Tasks

- ✔️ Create a new Pandas `Series` named `my_series` with the following three values: `10`, `20`, `30`.

#### 🚀 Hint

The code below creates a new Pandas `Series` with the values `1` and `2`.

```python
my_new_series = pd.Series([1, 2])
```

In [4]:
### BEGIN SOLUTION
my_series = pd.Series([10, 20, 30])
### END SOLUTION

print(my_series)

0    10
1    20
2    30
dtype: int64
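
The printed output has two parts: the left column is the index (row labels that Pandas generates automatically) and the right column holds the values. Below is a minimal sketch of reading those parts separately. The name `example_series` is made up for this illustration and is not part of the exercise.

```python
import pandas as pd

# Hypothetical Series used only for this illustration
example_series = pd.Series([10, 20, 30])

index_labels = list(example_series.index)  # automatically generated row labels
values = example_series.tolist()           # the stored values as a plain Python list

print(index_labels)          # [0, 1, 2]
print(values)                # [10, 20, 30]
print(example_series.dtype)  # data type shared by all values
```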


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [5]:
pd.testing.assert_series_equal(my_series, pd.Series(x * 10 for x in range(1, 4)))

---

### 🎯 Challenge 2: Create a Pandas DataFrame

#### 👇 Tasks

- ✔️ You are given two `Series`, `product_names` and `num_reviews`, that contain the names of make-up products and their number of reviews on Sephora.com.
- ✔️ Using the two `Series`, create a new Pandas `DataFrame` named `df_top_products` that has the following two columns:
    1. `product_name`: Names of the products
    2. `num_review`: Number of reviews
- ✔️ Note that the column names are singular.

#### 🚀 Hint

The code below creates a new Pandas `DataFrame` from two series.

```python
my_new_dataframe = pd.DataFrame({
    "column_one": my_series1,
    "column_two": my_series2
})
```

In [6]:
product_names = pd.Series([
    "Laneige Lip Sleeping Mask",
    "The Ordinary Hyaluronic Acid 2% + B5",
    "Laneige Lip Glowy Balm",
    "Chanel COCO MADEMOISELLE Eau de Parfum"
])

num_reviews = pd.Series([
    12715,
    2274,
    2766,
    724
])

### BEGIN SOLUTION
df_top_products = pd.DataFrame({
    "product_name": product_names,
    "num_review": num_reviews
})
### END SOLUTION

display(df_top_products)

Unnamed: 0,product_name,num_review
0,Laneige Lip Sleeping Mask,12715
1,The Ordinary Hyaluronic Acid 2% + B5,2274
2,Laneige Lip Glowy Balm,2766
3,Chanel COCO MADEMOISELLE Eau de Parfum,724


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix any incorrect parts.

In [7]:
pd.testing.assert_frame_equal(
    df_top_products.reset_index(drop=True),
    pd.DataFrame({"product_name": {0: "Laneige Lip Sleeping Mask",
        1: "The Ordinary Hyaluronic Acid 2% + B5",
        2: "Laneige Lip Glowy Balm",
        3: "Chanel COCO MADEMOISELLE Eau de Parfum"},
        "num_review": {0: 12715, 1: 2274, 2: 2766, 3: 724}})
)

---

### 📌 Concise summary of a `DataFrame`

👉 A common first step in working with a `DataFrame` is to call the `info()` method, which prints a concise summary of the `DataFrame`:
- Index data type
- Column information: for each column, the following information is displayed:
    - Number of non-missing values
    - Data type of the column
- Memory usage

▶️ Run `df_top_products.info()` below to see the `info()` method in action.

In [8]:
### BEGIN SOLUTION
df_top_products.info()
### END SOLUTION

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   product_name  4 non-null      object
 1   num_review    4 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes


👉 From the result of `df_top_products.info()`, we can understand a couple of things:

- There are 2 columns.
- `product_name` column has an `object` data type.
    - In Pandas, a string data type is shown as `object`, not `str`.
        - We will skip the technical discussion for now.
- The second line of the output tells us that there are 4 entries (rows).
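
Each piece of the `info()` summary can also be read on its own. The sketch below assumes a small stand-in `DataFrame` (`df_example`, which is hypothetical and not part of the exercise) and uses `dtypes`, `count()`, and `memory_usage()` to retrieve the column types, the non-missing counts, and the memory usage separately.

```python
import pandas as pd

# Hypothetical stand-in DataFrame for illustration
df_example = pd.DataFrame({
    "product_name": ["A", "B", "C"],
    "num_review": [5, 12, 7],
})

column_types = df_example.dtypes    # data type of each column
non_missing = df_example.count()    # number of non-missing values per column
memory = df_example.memory_usage()  # bytes used by the index and by each column

print(column_types)
print(non_missing)
print(memory)
```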

---

### 🎯 Challenge 3: Display first/last/random rows

▶️ Run `df_top_products.head(2)` to print the first 2 rows of `df_top_products`.

In [9]:
### BEGIN SOLUTION
df_top_products.head(2)
### END SOLUTION

Unnamed: 0,product_name,num_review
0,Laneige Lip Sleeping Mask,12715
1,The Ordinary Hyaluronic Acid 2% + B5,2274


▶️ Run `df_top_products.tail(2)` to print the last 2 rows of `df_top_products`.

In [10]:
### BEGIN SOLUTION
df_top_products.tail(2)
### END SOLUTION

Unnamed: 0,product_name,num_review
2,Laneige Lip Glowy Balm,2766
3,Chanel COCO MADEMOISELLE Eau de Parfum,724


▶️ Run `df_top_products.sample(2)` to randomly sample 2 rows from `df_top_products`.

In [11]:
### BEGIN SOLUTION
df_top_products.sample(2)
### END SOLUTION

Unnamed: 0,product_name,num_review
1,The Ordinary Hyaluronic Acid 2% + B5,2274
2,Laneige Lip Glowy Balm,2766
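
Because `sample()` picks rows at random, the output can change from run to run. If you need the same rows every time, `sample()` accepts a `random_state` argument. Here is a minimal sketch with a hypothetical stand-in `DataFrame` named `df_example`. Which rows are drawn for a given seed is not shown because it depends on the random number generator.

```python
import pandas as pd

# Hypothetical stand-in DataFrame for illustration
df_example = pd.DataFrame({"letter": ["a", "b", "c", "d"], "number": [1, 2, 3, 4]})

# Using the same random_state twice yields the same rows both times
first_draw = df_example.sample(2, random_state=42)
second_draw = df_example.sample(2, random_state=42)

print(first_draw.equals(second_draw))  # True
```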


---

### 📌 Number of rows and columns in a `DataFrame`

👉 How many rows and columns does `df_top_products` have?

▶️ Run `df_top_products.shape` below to see the *shape* (number of rows and columns) of the `DataFrame`.

In [12]:
### BEGIN SOLUTION
df_top_products.shape
### END SOLUTION

(4, 2)

👉 Can you store the number of rows and columns in variables?

---

- `df.shape` returns a `tuple` in `(num_rows, num_cols)` format.
- What is a `tuple`? 🙀
- A `tuple` is like a `list`, except that it cannot be modified once created.

▶️ Run the code cell below to see how a `tuple` is nearly identical to a `list`.

In [13]:
# These two are nearly identical.
# The only difference is that my_tuple cannot be modified once created.
my_list = [10, 20]
my_tuple = (10, 20)

print(f"my_list[1]={my_list[1]}")    # prints 20
print(f"my_tuple[1]={my_tuple[1]}")  # also prints 20

my_list[1]=20
my_tuple[1]=20
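
To see the difference in practice, the sketch below tries to change an element of each. Assigning to a `list` element works, while assigning to a `tuple` element raises a `TypeError`. It also shows that a `tuple` can be unpacked into separate variables.

```python
my_list = [10, 20]
my_tuple = (10, 20)

my_list[1] = 99  # lists can be changed in place

try:
    my_tuple[1] = 99  # tuples cannot be changed
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print(f"TypeError: {err}")

# A tuple can be unpacked into separate variables in one line
first, second = my_tuple
print(first, second)  # 10 20
```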


---

### 🎯 Challenge 4: Find the number of rows and columns in a `DataFrame`

#### 👇 Tasks

- ✔️ Store the number of rows in `df_top_products` to a new variable named `num_rows`.
- ✔️ Store the number of columns in `df_top_products` to a new variable named `num_cols`.
- ✔️ Use `.shape`, not `len()`.

In [14]:
### BEGIN SOLUTION
num_rows = df_top_products.shape[0]
num_cols = df_top_products.shape[1]
### END SOLUTION

print(num_rows)
print(num_cols)

4
2


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [15]:
tc.assertEqual(num_rows, len(df_top_products.index), f"Number of rows should be {len(df_top_products.index)}")
tc.assertEqual(num_cols, len(df_top_products.columns), f"Number of columns should be {len(df_top_products.columns)}")

---

### 📌 Filtering rows

Let's step back and return to working with a `Series`.

▶️ Create a `Series` named `nums` with the following four integers: `-20`, `-10`, `10`, `20`. 

In [16]:
### BEGIN SOLUTION
nums = pd.Series([-20, -10, 10, 20])
### END SOLUTION

nums

0   -20
1   -10
2    10
3    20
dtype: int64

👉 Is there a way to *filter* the `Series` so that it only contains **positive** values? Let's first try this **manually**.

▶️ Create a new `Series` named `keep` with the following four boolean values: `False`, `False`, `True`, `True`.

In [17]:
### BEGIN SOLUTION
keep = pd.Series([False, False, True, True])
### END SOLUTION

# Check your work
pd.testing.assert_series_equal(keep,
                              pd.Series([0, 0, 1, 1]).astype(bool))

# Display keep
keep

0    False
1    False
2     True
3     True
dtype: bool

Let's visualize the two `Series` (`nums` and `keep`) you've created.

![nums-and-keep](https://github.com/bdi475/images/blob/main/nums-and-keep-series.png?raw=true)

▶️ Now, you can use the boolean `Series` to filter another `Series`. Type in `nums[keep]` below and run the cell.

In [18]:
### BEGIN SOLUTION
nums[keep]
### END SOLUTION

2    10
3    20
dtype: int64

If you're confused about what just happened, the visualization below may give you a better idea.

![nums-and-keep-filter-result](https://github.com/bdi475/images/blob/main/nums-and-keep-filter-result.png?raw=true)

The syntax for filtering a `Series` is `my_series[keep]` where `keep` is a `Series` of boolean values indicating whether to keep an element or not. `keep` should have the exact same number of elements as `my_series`.

▶️ **Uncomment the code cell below first** and run it to see what happens when your `keep` does not have the same number of elements as `my_series`.

(⛔️ **Heads-up**: The code will throw an error! Once you're done running the cell, comment the lines out again.)

In [19]:
# keep_incorrect = pd.Series([False, False, True])
# nums[keep_incorrect]
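
If you would rather observe the error without interrupting the notebook, one option is to wrap the filter in `try`/`except`. This is only a sketch: it recreates `nums` so that it runs on its own, and it prints the error type instead of assuming a specific one.

```python
import pandas as pd

nums = pd.Series([-20, -10, 10, 20])
keep_incorrect = pd.Series([False, False, True])  # one element short

try:
    nums[keep_incorrect]
    mismatch_raised = False
except Exception as err:
    # Pandas cannot line up the three booleans with the four values
    mismatch_raised = True
    print(type(err).__name__)

print(mismatch_raised)  # True
```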

👉 Is there a better way to *filter* the `Series` so that it only contains **positive** values? The manual method we just used is inefficient. Imagine that your `Series` contains a million elements. You would need to spend months typing `True` and `False`! 🤡

As a data analyst, your goal is to perform tasks *programmatically*.

▶️ Type `keep_by_comparison = nums > 0` in the code cell below to perform a comparison on the `nums` Series.

In [20]:
### BEGIN SOLUTION
keep_by_comparison = nums > 0
### END SOLUTION

keep_by_comparison

0    False
1    False
2     True
3     True
dtype: bool

Notice how `keep_by_comparison` is identical to the original `keep` Series?

▶️ Use `keep_by_comparison` to filter positive values in `nums`.

In [21]:
### BEGIN SOLUTION
nums[keep_by_comparison]
### END SOLUTION

2    10
3    20
dtype: int64

Note that applying a filter returns **a new `Series`** without modifying the original `Series`.

▶️ Run the code below.

In [22]:
print("Negative Values (filtered):")
display(nums[nums < 0])

print("\n\nOriginal Values:")
display(nums)

Negative Values (filtered):


0   -20
1   -10
dtype: int64



Original Values:


0   -20
1   -10
2    10
3    20
dtype: int64
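
A filtered `Series` also keeps the index labels of the rows that survive, which is why the check cells in this notebook call `reset_index(drop=True)` before comparing results. A minimal sketch, using a hypothetical `nums_example` so the block runs on its own:

```python
import pandas as pd

nums_example = pd.Series([-20, -10, 10, 20])

positives = nums_example[nums_example > 0]     # keeps the original labels 2 and 3
renumbered = positives.reset_index(drop=True)  # relabels the rows starting from 0

print(list(positives.index))   # [2, 3]
print(list(renumbered.index))  # [0, 1]
print(len(nums_example))       # 4, the original is unchanged
```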

---

### 🎯 Challenge 5: Filter even numbers

#### 👇 Tasks

- ✔️ Using `all_nums`, filter only even numbers.
    - Store the result to a new variable named `even_nums`.
- ✔️ `all_nums` should remain unaltered after your code.

#### 🚀 Hints

- Use the modulo operator (`%`) to check whether a number is even.
    - `some_num % 2 == 0`
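
The modulo operator works element-wise on a `Series`, just like the comparison operators. The sketch below uses a made-up `sample_nums` and filters *odd* numbers instead, so the exercise itself is left for you.

```python
import pandas as pd

# Hypothetical values for illustration
sample_nums = pd.Series([3, 4, -5, -6])

remainders = sample_nums % 2             # element-wise remainder after dividing by 2
odd_nums = sample_nums[remainders == 1]  # keep elements whose remainder is 1

print(remainders.tolist())  # [1, 0, 1, 0]
print(odd_nums.tolist())    # [3, -5]
```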

In [23]:
all_nums = pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4])

### BEGIN SOLUTION
even_nums = all_nums[all_nums % 2 == 0]
### END SOLUTION

even_nums

0    2
2    4
3    8
4   -2
8    4
dtype: int64

#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [24]:
pd.testing.assert_series_equal(all_nums, pd.Series([2, 5, 4, 8, -2, -5, -11, 13, 4]))
pd.testing.assert_series_equal(even_nums.reset_index(drop=True),
                               pd.Series([2, 4, 8, -2, 4]))

---

### 📌 Filtering a `DataFrame`

👉 I will keep saying this. A `DataFrame` is a combination of one or more columns. Filtering a `DataFrame` is very similar to filtering a `Series`.

▶️ Run the code cell below to create a new `DataFrame` named `df`.

In [25]:
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


To only keep rows where the `name` is `'John'`, we can again supply a `Series` of boolean values. Only the first and last rows of the `DataFrame` contain `'John'`.

▶️ Create a new `Series` named `is_john` with the following boolean values - `True`, `False`, `False`, `True`.

In [26]:
### BEGIN SOLUTION
is_john = pd.Series([True, False, False, True])
### END SOLUTION

# Check your work
tc.assertEqual(is_john.to_list(), pd.Series([1, 0, 0, 1]).astype(bool).to_list())

# Display is_john
is_john

0     True
1    False
2    False
3     True
dtype: bool

▶️ Type `result = df[is_john]` in the code cell below and run it.

In [27]:
### BEGIN SOLUTION
result = df[is_john]
### END SOLUTION

result

Unnamed: 0,name,amount
0,John,-20
3,John,20


Here is a visualization of how `df[is_john]` works.

![mini-dataframe-filter-rows](https://github.com/bdi475/images/blob/main/filter-mini-dataframe-result.png?raw=true)

---

### 🎯 Challenge 6: Find all positive transactions

#### 👇 Tasks

- ✔️ Given `df`, filter rows with positive `amount` values.
    - Store the result to a new variable named `df_pos`.
    - `df_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

▶️ Run the code cell below to create `df`.

In [28]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [29]:
### BEGIN SOLUTION
df_pos = df[df["amount"] > 0]
### END SOLUTION

df_pos

Unnamed: 0,name,amount
2,Tom,10
3,John,20


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [30]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_pos.reset_index(drop=True),
                              df_check.iloc[[2, 3]].reset_index(drop=True))

---

### 📌 Logical operators in pandas `Series`

👉 There are only three *logical* operators in Pandas you need to remember.

- `&`: Logical **AND**
- `|`: Logical **OR**
- `~`: Logical **NOT**

These operators perform element-wise *logical* operations.
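
Python's own keywords `and`, `or`, and `not` do not work element-wise on a `Series`. They ask for a single `True`/`False` answer for the whole `Series`, which Pandas refuses to give. A minimal sketch of the difference:

```python
import pandas as pd

s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

# The & operator compares the two Series element by element
elementwise = s1 & s2
print(elementwise.tolist())  # [True, False, False, False]

# The and keyword needs one truth value for the whole Series, so it fails
try:
    s1 and s2
    keyword_failed = False
except ValueError as err:
    keyword_failed = True
    print(f"ValueError: {err}")
```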

#### 📍 Logical AND

👉 A logical AND operator `&` returns `True` only if both operands are `True`.

![s1_AND_s2](https://github.com/bdi475/images/blob/main/s1-AND-s2.png?raw=true)

▶️ Perform a logical AND operation (`&`) on `s1` and `s2` and store the result to a new variable named `s1_AND_s2`.

In [31]:
s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

### BEGIN SOLUTION
s1_AND_s2 = s1 & s2
### END SOLUTION

# 🧭 Check your work
pd.testing.assert_series_equal(s1_AND_s2, pd.Series([1, 0, 0, 0]).astype(bool))

# Display s1, s2, s1_AND_s2 together as a DataFrame
pd.DataFrame({"s1": s1, "s2": s2, "s1_AND_s2": s1_AND_s2})

Unnamed: 0,s1,s2,s1_AND_s2
0,True,True,True
1,True,False,False
2,False,True,False
3,False,False,False


#### 📍 Logical OR

👉 A logical OR operator `|` returns `True` if either of the operands is `True`.

![s1_OR_s2](https://github.com/bdi475/images/blob/main/s1-OR-s2.png?raw=true)

▶️ Perform a logical OR operation (`|`) on `s1` and `s2` and store the result to a new variable named `s1_OR_s2`.

In [32]:
s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

### BEGIN SOLUTION
s1_OR_s2 = s1 | s2
### END SOLUTION

# 🧭 Check your work
pd.testing.assert_series_equal(s1_OR_s2, pd.Series([1, 1, 1, 0]).astype(bool))

# Display s1, s2, s1_OR_s2 together as a DataFrame
pd.DataFrame({"s1": s1,
              "s2": s2,
              "s1_OR_s2": s1_OR_s2})

Unnamed: 0,s1,s2,s1_OR_s2
0,True,True,True
1,True,False,True
2,False,True,True
3,False,False,False


#### 📍 Logical NOT

👉 A logical NOT operator `~` inverts each value: `True` becomes `False` and `False` becomes `True`.

![NOT_s1](https://github.com/bdi475/images/blob/main/NOT-s1.png?raw=true)

▶️ Perform a logical NOT operation (`~`) on `s1` and store the result to a new variable named `NOT_s1`.

In [33]:
s1 = pd.Series([True, True, False, False])

### BEGIN SOLUTION
NOT_s1 = ~s1
### END SOLUTION

# 🧭 Check your work
pd.testing.assert_series_equal(NOT_s1, pd.Series([0, 0, 1, 1]).astype(bool))

# Display s1 and NOT_s1 together as a DataFrame
pd.DataFrame({"s1": s1,
              "NOT_s1": NOT_s1})

Unnamed: 0,s1,NOT_s1
0,True,False
1,True,False
2,False,True
3,False,True
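
The three operators can be combined in one expression. As a small sketch, "neither `s1` nor `s2` is `True`" can be written in two equivalent ways:

```python
import pandas as pd

s1 = pd.Series([True, True, False, False])
s2 = pd.Series([True, False, True, False])

neither = ~(s1 | s2)    # NOT (s1 OR s2)
same_thing = ~s1 & ~s2  # (NOT s1) AND (NOT s2)

print(neither.tolist())            # [False, False, False, True]
print(neither.equals(same_thing))  # True
```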


---

### 🎯 Challenge 7: Find John's positive transaction(s)

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is `'John'` **and** the amount is positive.
    - Store the result to a new variable named `df_john_and_pos`.
    - `df_john_and_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Create another boolean Series `is_positive` using a *greater than* comparison (`df['amount'] > 0`).
- Use a logical AND operator `&` to combine `is_john` and `is_positive`.

▶️ Run the code cell below to create `df`.

In [34]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [35]:
### BEGIN SOLUTION
is_john = df["name"] == "John"
is_positive = df["amount"] > 0

df_john_and_pos = df[is_john & is_positive]
### END SOLUTION

df_john_and_pos

Unnamed: 0,name,amount
3,John,20


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [36]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_john_and_pos.reset_index(drop=True),
                              df_check.iloc[[3]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![is_john_AND_is_positive](https://github.com/bdi475/images/blob/main/is-john-AND-is-positive.png?raw=true)

---

### 🎯 Challenge 8: Find transactions that are made by John OR are positive

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is `"John"` **or** the amount is positive.
    - Store the result to a new variable named `df_john_or_pos`.
    - `df_john_or_pos` should be a `DataFrame`.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Create another boolean Series `is_positive` using a *greater than* comparison (`df['amount'] > 0`).
- Use a logical OR operator `|` to combine `is_john` and `is_positive`.

▶️ Run the code cell below to create `df`.

In [37]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [38]:
### BEGIN SOLUTION
is_john = df["name"] == "John"
is_positive = df["amount"] > 0

df_john_or_pos = df[is_john | is_positive]
### END SOLUTION

df_john_or_pos

Unnamed: 0,name,amount
0,John,-20
2,Tom,10
3,John,20


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [39]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_john_or_pos.reset_index(drop=True),
                              df_check.iloc[[0, 2, 3]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![is_john_OR_is_positive](https://github.com/bdi475/images/blob/main/is-john-OR-is-positive.png?raw=true)

---

### 🎯 Challenge 9: Find transactions that are NOT made by John

#### 👇 Tasks

- ✔️ Given `df`, find rows where the name is NOT `'John'`.
    - Store the result to a new variable named `df_not_john`.
    - `df_not_john` should be a `DataFrame`.
- ✔️ Although you can do this without the NOT operator (`~`), **your goal is to use `~`**.
- ✔️ `df` should remain unaltered after running your code.

#### 🚀 Hints

- Create a boolean Series `is_john` using an equality comparison (`df['name'] == 'John'`).
- Use a logical NOT operator `~` to reverse `is_john`.

▶️ Run the code cell below to create `df`.

In [40]:
# DO NOT CHANGE THE CODE IN THIS CELL
df = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})
df

Unnamed: 0,name,amount
0,John,-20
1,Mary,-10
2,Tom,10
3,John,20


In [41]:
### BEGIN SOLUTION
is_john = df["name"] == "John"

df_not_john = df[~is_john]
### END SOLUTION

df_not_john

Unnamed: 0,name,amount
1,Mary,-10
2,Tom,10


#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [42]:
df_check = pd.DataFrame({"name": ["John", "Mary", "Tom", "John"], "amount": [-20, -10, 10, 20]})

pd.testing.assert_frame_equal(df, df_check)
pd.testing.assert_frame_equal(df_not_john.reset_index(drop=True),
                              df_check.iloc[[1, 2]].reset_index(drop=True))

#### ⚜️ A diagram to help your understanding
![not_john](https://github.com/bdi475/images/blob/main/not-john.png?raw=true)

---

### 📌 Element-wise comparison in a `Series`

▶️ Run the code cell below to create a new `Series` named `countries`.

In [43]:
countries = pd.Series(["United States", "Oman", "United States",
                       "China", "South Korea", "United States"])

display(countries)

0    United States
1             Oman
2    United States
3            China
4      South Korea
5    United States
dtype: object

What happens when you perform an equality comparison on strings?

▶️ Compare `countries` with the string `'United States'` using an equality comparison operator (`==`).

In [44]:
### BEGIN SOLUTION
countries == "United States"
### END SOLUTION

0     True
1    False
2     True
3    False
4    False
5     True
dtype: bool

▶️ Run the code cell below to check the data type of the result.

In [45]:
type(countries == "United States")

pandas.core.series.Series

The result is **another `Series`** containing boolean (`True`/`False`) values. Pandas performs a string comparison (`my_str == 'United States'`) on **each element**.
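
One way to picture the element-wise comparison is as a loop that checks every element in turn. The sketch below uses a shortened, hypothetical `countries_example` and shows that the comparison and an explicit list comprehension give the same booleans.

```python
import pandas as pd

# Shortened stand-in for the countries Series
countries_example = pd.Series(["United States", "Oman", "China"])

# Pandas compares every element for you
vectorized = (countries_example == "United States").tolist()

# The same comparison written as an explicit loop
looped = [country == "United States" for country in countries_example]

print(vectorized)            # [True, False, False]
print(vectorized == looped)  # True
```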

Remember, you can also supply more than one condition using the following two operators:

1. logical OR (`|`)
2. logical AND (`&`)

▶️ Run the code cell below to check whether a country is **either** `'Oman'` **or** `'China'`.

In [46]:
(countries == "Oman") | (countries == "China")

0    False
1     True
2    False
3     True
4    False
5    False
dtype: bool

In [47]:
countries[(countries == "Oman") | (countries == "China")]

1     Oman
3    China
dtype: object
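
The parentheses around each comparison matter. In Python, `|` and `&` bind more tightly than `==`, so without parentheses Python tries to evaluate `"Oman" | countries` first. The sketch below uses a shortened, hypothetical `countries_example` and prints the error type instead of assuming a specific one.

```python
import pandas as pd

# Shortened stand-in for the countries Series
countries_example = pd.Series(["United States", "Oman", "China"])

# With parentheses, each comparison runs first and the results are then combined
with_parens = (countries_example == "Oman") | (countries_example == "China")
print(with_parens.tolist())  # [False, True, True]

# Without parentheses, the | operator is applied before the comparisons
try:
    countries_example == "Oman" | countries_example == "China"
    failed = False
except Exception as err:
    failed = True
    print(type(err).__name__)
```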

---

### 📌 Another example of filtering a `DataFrame`

▶️ Run the code cell below to create a new `DataFrame` named `df_cities`.

In [48]:
df_cities = pd.DataFrame({"city": ["Lisle", "Dubai", "Niles", "Shanghai", "Seoul", "Chicago"],
 "country": ["United States", "United Arab Emirates", "United States", "China", "South Korea", "United States"],
 "population": [23270, 3331409, 28938, 26320000, 21794000, 8604203]})

df_cities

Unnamed: 0,city,country,population
0,Lisle,United States,23270
1,Dubai,United Arab Emirates,3331409
2,Niles,United States,28938
3,Shanghai,China,26320000
4,Seoul,South Korea,21794000
5,Chicago,United States,8604203


To only keep rows where the `country` is `'United States'`, we can again supply a `Series` of boolean values.

▶️ Create a new `Series` named `keep` with the following 6 boolean values - `True`, `False`, `True`, `False`, `False`, `True`.

In [49]:
### BEGIN SOLUTION
keep = pd.Series([True, False, True, False, False, True])
# OR
keep = df_cities["country"] == "United States"
### END SOLUTION

# Check your work
pd.testing.assert_series_equal(keep.reset_index(drop=True),
                               pd.Series([1, 0, 1, 0, 0, 1]).astype(bool).reset_index(drop=True),
                               check_names=False)

# Display keep
keep

0     True
1    False
2     True
3    False
4    False
5     True
Name: country, dtype: bool

🤠 You know the drill now.

▶️ Type `df_cities[keep]` in the code cell below and run it.

In [50]:
### BEGIN SOLUTION
df_cities[keep]
### END SOLUTION

Unnamed: 0,city,country,population
0,Lisle,United States,23270
2,Niles,United States,28938
5,Chicago,United States,8604203
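
Conditions on different columns can be combined with the logical operators covered earlier. As a sketch, the code below finds cities that are in the United States **and** have fewer than 100,000 people. It rebuilds the data under the hypothetical name `df_cities_example` so that it runs on its own; the threshold is chosen only for illustration.

```python
import pandas as pd

df_cities_example = pd.DataFrame({
    "city": ["Lisle", "Dubai", "Niles", "Shanghai", "Seoul", "Chicago"],
    "country": ["United States", "United Arab Emirates", "United States",
                "China", "South Korea", "United States"],
    "population": [23270, 3331409, 28938, 26320000, 21794000, 8604203],
})

is_us = df_cities_example["country"] == "United States"
is_small = df_cities_example["population"] < 100000

# Keep rows where both conditions are True
df_small_us = df_cities_example[is_us & is_small]
print(df_small_us["city"].tolist())  # ['Lisle', 'Niles']
```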


---

### 🎯 Challenge 10: Cities with population over a million

#### 👇 Tasks

- ✔️ Using `df_cities`, filter rows with a population greater than a million (`1000000`).
    - Store the result to a new variable named `df_large_cities`.
- ✔️ `df_cities` should remain unaltered after your code.

In [51]:
### BEGIN SOLUTION
df_large_cities = df_cities[df_cities['population'] > 1000000]
### END SOLUTION

df_large_cities

Unnamed: 0,city,country,population
1,Dubai,United Arab Emirates,3331409
3,Shanghai,China,26320000
4,Seoul,South Korea,21794000
5,Chicago,United States,8604203


#### 🧭 Check Your Work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [52]:
pd.testing.assert_frame_equal(df_large_cities.reset_index(drop=True),
                              df_cities.query('population > 1000000').reset_index(drop=True))
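
The check above uses `query()`, which filters rows using a condition written as a string. The sketch below, built on a hypothetical `df_example`, suggests that it selects the same rows as a boolean filter.

```python
import pandas as pd

# Hypothetical stand-in DataFrame for illustration
df_example = pd.DataFrame({
    "city": ["Lisle", "Dubai", "Seoul"],
    "population": [23270, 3331409, 21794000],
})

by_mask = df_example[df_example["population"] > 1000000]  # boolean filter
by_query = df_example.query("population > 1000000")       # same condition as a string

print(by_mask.equals(by_query))  # True
```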