## Part 4: Filtering — Using Conditionals to Filter Rows and Columns

### 1. Boolean masking

We start with a small DataFrame in which two people share the last name `Jones`, so that a filter can return more than one row:

In [None]:
import pandas as pd

people = {
    "first": ["Alice", "Bob", "Carol", "Sarah"],
    "last": ["Smith", "Jones", "Lee", "Jones"],
    "email": [
        "alice@example.com",
        "bob@example.com",
        "carol@example.com",
        "sarah@example.com",
    ],
    "uid": ["AS100293", "BJ240806", "CL150510", "SJ251203"],
    "year": [1993, 2006, 2010, 2003],
    "age": [32, 19, 15, 22],
}
df_small = pd.DataFrame(people)

#### Basic filter: single condition

In [None]:
# Create a boolean Series where last name is "Jones"
mask_jones = (df_small["last"] == "Jones")
print(mask_jones)  # True/False per row

# Apply the mask to get only people with last name Jones
print("People with last name Jones:")
df_small[mask_jones]

In [None]:
# You can also do it inline:
df_small[df_small["last"] == "Jones"]

# Or with `.loc`:
df_small.loc[df_small["last"] == "Jones"]

#### Combining conditions: AND / OR

In [None]:
# AND: last name Jones and first name Bob
both_conditions = df_small[(df_small["last"] == "Jones") & (df_small["first"] == "Bob")]
print("Last name Jones AND first name Bob:")
both_conditions

In [None]:
# OR: last name Jones OR born before 2005
or_condition = df_small[(df_small["last"] == "Jones") | (df_small["year"] < 2005)]
print("Last name Jones OR born before 2005:")
or_condition

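> **Important:** Use `&` for logical AND and `|` for OR, and always wrap each comparison in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==`. Don’t use Python’s `and`/`or` here: they don’t work elementwise on a Series and raise a `ValueError`.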
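
To see the difference in practice, here is a minimal sketch on a throwaway frame (the name `demo` is illustrative and not part of the tutorial data). It tries the `and` version first, catches the resulting error, and then runs the `&` version:

```python
import pandas as pd

# Illustrative frame; `demo` is a hypothetical name used only for this sketch
demo = pd.DataFrame(
    {"first": ["Alice", "Bob", "Sarah"], "last": ["Smith", "Jones", "Jones"]}
)

# `and` asks for one truth value for the whole Series, which pandas refuses to give
try:
    demo[(demo["last"] == "Jones") and (demo["first"] == "Bob")]
    and_failed = False
except ValueError as err:
    and_failed = True
    print("ValueError:", err)

# `&` compares row by row; each comparison is wrapped in parentheses
ok = demo[(demo["last"] == "Jones") & (demo["first"] == "Bob")]
print(ok)
```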

#### Negation

In [None]:
# Opposite of last name Jones
not_jones = df_small[~(df_small["last"] == "Jones")]
print("People whose last name is NOT Jones:")
not_jones
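
For a single equality test, `!=` selects the same rows as negating `==` with `~`. The `~` operator is more useful when the mask comes from a method call, since it can invert any boolean Series. A small sketch, again on an illustrative frame named `demo`:

```python
import pandas as pd

# Illustrative frame; `demo` is a hypothetical name used only for this sketch
demo = pd.DataFrame(
    {
        "first": ["Alice", "Bob", "Carol", "Sarah"],
        "last": ["Smith", "Jones", "Lee", "Jones"],
    }
)

# Two ways to express "last name is not Jones"
via_tilde = demo[~(demo["last"] == "Jones")]
via_not_equal = demo[demo["last"] != "Jones"]
same_rows = via_tilde.equals(via_not_equal)

# `~` also inverts masks built by methods, here "last name not in this list"
not_in_list = demo[~demo["last"].isin(["Jones", "Lee"])]
print(not_in_list)
```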

### 2. Filtering the big Stack Overflow survey DataFrame

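Here `df` is the main survey DataFrame, loaded from the CSV file in the next cell. `ResponseId` stays a regular column (rather than the index) because the solutions at the end of this part select it by name. We'll work with compensation and other fields.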

#### High salary filter

In [None]:
df = pd.read_csv("data/survey_results_public.csv")

In [None]:
# Filter respondents with total compensation greater than 70,000
high_salary = df["CompTotal"] > 70000

print("High earners (CompTotal > 70000) with selected columns:")
df.loc[high_salary, ["Country", "LanguageHaveWorkedWith", "CompTotal"]]
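
One detail worth keeping in mind: a comparison against a missing value evaluates to `False`, so respondents who left the compensation field empty are silently excluded by a filter like this. The sketch below uses made-up numbers (the `pay` frame is hypothetical, not survey data) to show the effect and how to count the missing entries separately:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from the survey
pay = pd.DataFrame(
    {
        "Country": ["Germany", "India", "Canada"],
        "CompTotal": [90000.0, np.nan, 50000.0],
    }
)

# NaN > 70000 is False, so the row with missing compensation drops out
high = pay["CompTotal"] > 70000
n_high = int(high.sum())  # True counts as 1

# Count the missing entries explicitly so they are not overlooked
n_missing = int(pay["CompTotal"].isna().sum())
print(n_high, n_missing)
```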

#### Filtering by a list of countries

In [None]:
countries_of_interest = ["Switzerland", "United States of America", "Germany", "India", "Canada"]

print(f"Respondents from {countries_of_interest}:")
country_mask = df["Country"].isin(countries_of_interest)
df.loc[country_mask, "Country"]
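
`.isin` is shorthand for a chain of `==` comparisons joined with `|`. The following sketch, on a small illustrative frame (`demo` and `wanted` are hypothetical names), checks that both forms produce the same mask, including for a missing value:

```python
import pandas as pd

# Illustrative frame; not survey data
demo = pd.DataFrame({"Country": ["Germany", "France", "India", None]})
wanted = ["Germany", "India"]

# One call with a list of accepted values
via_isin = demo["Country"].isin(wanted)

# The equivalent chain of comparisons, one per value
via_or = (demo["Country"] == "Germany") | (demo["Country"] == "India")

same_mask = via_isin.tolist() == via_or.tolist()
print(via_isin.tolist())
```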

You can combine filters, for example high salary **and** specific countries:

In [None]:
print("High earners in selected countries:")
df.loc[high_salary & country_mask, ["Country", "LanguageHaveWorkedWith", "CompTotal"]]

### 3. String-based filtering

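Many columns (e.g., languages worked with) store several values in one semicolon-separated string. Comparing with `==` only matches rows where the whole string equals the value, so use `.str.contains` to test whether a value appears anywhere in it. Pass `na=False` so that missing entries count as non-matches instead of producing `NaN` in the mask.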

In [None]:
# First, create a mask for knowing Python; protect against NaNs by using na=False
knows_python = df["LanguageHaveWorkedWith"].str.contains("Python", na=False)

# Show some respondents who know Python
print("Respondents who use Python:")
df.loc[knows_python, "LanguageHaveWorkedWith"]
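
Because `.str.contains` looks for substrings, a search for `"Java"` also matches `"JavaScript"`. One way around this, sketched below on made-up data, is a regular expression that only accepts the name as a whole entry between semicolons. This assumes entries are separated by `;` with no surrounding spaces; the `langs` Series is hypothetical.

```python
import pandas as pd

# Made-up entries for illustration; not taken from the survey
langs = pd.Series(["Python;Java", "Python;JavaScript", "Java;C#", None])

# Plain substring test: "JavaScript" counts as a hit for "Java"
substring_hit = langs.str.contains("Java", na=False)

# Whole-entry test: "Java" must sit between string boundaries or semicolons.
# Non-capturing groups (?:...) avoid pandas' warning about match groups.
exact_hit = langs.str.contains(r"(?:^|;)Java(?:;|$)", na=False)

print(substring_hit.tolist())
print(exact_hit.tolist())
```

An alternative is to split each string with `.str.split(";")` and test list membership, which avoids regular expressions at the cost of an extra step.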

### Exercises for Part 4

1. **Exercise 1:**
    - Filter all respondents in the 35–54 range using the `"Age"` column.
    - Report how many respondents are in that age range.
    - Filter for people in the 35–54 range who are from **Switzerland**.

2. **Exercise 2:**
    - Filter for people who have worked with the Python language but not with the Java language.

#### Solutions

In [None]:
# Exercise 1

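# Filter respondents whose Age label falls in 35–54 (i.e., "35-44" or "45-54")
age_mask = df["Age"].str.contains("35-44", na=False) | df["Age"].str.contains("45-54", na=False)
# Equivalent, using a regular-expression alternation:
age_mask = df["Age"].str.contains("35-44|45-54", na=False)

# Number of respondents in that age range (each True counts as 1)
print("Respondents aged 35–54:", age_mask.sum())
df.loc[age_mask, "Age"]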

In [None]:
# Further restrict to those in Switzerland
df.loc[age_mask & (df["Country"] == "Switzerland"), ["ResponseId", "Age", "Country"]]

In [None]:
# Exercise 2

# Mask: has worked with Python
worked_python = df["LanguageHaveWorkedWith"].str.contains("Python", na=False)

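# Mask: has not worked with Java. Match "Java" as a whole semicolon-delimited
# entry (assuming no spaces around ";"); a plain "Java" substring test would
# also match "JavaScript" and wrongly exclude those respondents.
not_worked_java = ~df["LanguageHaveWorkedWith"].str.contains(r"(?:^|;)Java(?:;|$)", na=False)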

# Combined filter
df.loc[worked_python & not_worked_java, ['ResponseId', 'LanguageHaveWorkedWith']]

In [None]:
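# For comparison: both Python and Java present in the semicolon-delimited worked-with field
# ("Java" is matched as a whole entry so that "JavaScript" alone does not count)
worked_both = (
    df["LanguageHaveWorkedWith"].str.contains("Python", na=False)
    & df["LanguageHaveWorkedWith"].str.contains(r"(?:^|;)Java(?:;|$)", na=False)
)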
df.loc[worked_both, ['ResponseId', 'LanguageHaveWorkedWith']]