# QTM 151 - Introduction to Statistical Computing II
## Lecture 10 - Subsetting Data
**Author:** Danilo Freire (danilo.freire@emory.edu, Emory University)

# I hope you all had a very nice weekend! 😊

## Today's lecture

- Today we will dive a little deeper into the `pandas` library
- More specifically, we will learn how to **subset data**
- This is a **very important skill** to have when working with data
- We will look into four ways to subset data
  - Using VS Code's [`Data Wrangler` extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler)
  - Using `[]`
  - Using `.iloc[]`
  - Using `.query()`
- We will also learn how to **sort data** with `.sort_values()`

![](figures/pandas.png)

## Subsetting data with `Data Wrangler`

- You can visualise data in VS Code by using the [`Data Wrangler` extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler)
- It allows you to see the data in a table format and gives you some basic statistics
- You can also filter the data, sort it, fill missing values, and more 
- This new extension is _super convenient_ when you are working with data you are not familiar with
- It also has the advantage of using a **sandboxed environment**, so you can interact with the data without changing the original file (unless you explicitly save your changes)
- Install the extension by going to the Extensions view (`Ctrl+Shift+X`) and searching for `Data Wrangler`, or click [here](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler)
- [Full documentation](https://code.visualstudio.com/docs/datascience/data-wrangler)

![](figures/data-wrangler01.png)

- The extension is automatically integrated with the Jupyter extension, so you can open a `.csv` file and start wrangling the data
- Or you can open a Jupyter notebook, load your dataset with `pandas`, and start wrangling the data from there

## Subsetting data with `Data Wrangler`

- Let's see a quick example!
- The quickest way to open the `Data Wrangler` is by right-clicking on a `.csv` file and selecting `Open in Data Wrangler`
- Let's visualise the `features.csv` dataset. It is located in the `data_raw` folder
- The dataset is about cars and has the following variables:
  - `mpg`: Miles per gallon - Fuel efficiency of the car
  - `cylinders`: Number of cylinders 
  - `displacement`: Volume of all the cylinders in the engine 
  - `horsepower`: Power of the engine in horsepower
  - `weight`: Weight in pounds 
  - `acceleration`: Acceleration from 0 to 60 mph
  - `vehicle_id`: Unique identifier for each car

![](figures/data-wrangler02.png)

## Subsetting data with `Data Wrangler`

![](figures/data-wrangler03.png)

## Subsetting data with `Data Wrangler`

![](figures/data-wrangler04.png)

## Subsetting data with `Data Wrangler`

![](figures/data-wrangler05.png)

# Any questions so far? 🤔

## Subsetting data with `Data Wrangler`

- When you click on `Filter`, you will see a menu like this:

![](figures/data-wrangler06.png)

- Select the column you want to filter, and click on `Add Filter`
- Then you can select the condition you want to filter by and click on `Apply`

![](figures/data-wrangler07.png)

- You can also sort the data by clicking on `Sort` and selecting the column you want to sort by

![](figures/data-wrangler08.png)

- `Data Wrangler` will also show the Python code that corresponds to the operations you are doing!
- This is a great way to learn how to use `pandas`! 🐼
- You can then `Export to notebook` and continue working on your data in a Jupyter notebook, `Export as file`, or `Copy all code` and paste it in your Python script

## Let's practice!

- Please open the `features.csv` dataset in the `Data Wrangler`
- We will filter the data to show only cars with 6 or more `cylinders`
- And sort the data by `mpg` in descending order
- Finally, we will export the code to a Jupyter notebook
- Let's do it together!

- [Download the dataset](https://github.com/danilofreire/qtm151/blob/main/lectures/lecture-11/data_raw/features.csv) if you don't have it (click on `Download Raw File`)
- Open the dataset in the `Data Wrangler` and do the following:
  - Click on `Filter`
  - Select `cylinders` in `Choose column`
  - Select `Greater than or equal to` in `condition`
  - Type `6` in `Value`
  - Click on `Apply`

![](figures/data-wrangler09.png)
![](figures/data-wrangler10.png)
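
The filter above, together with the descending sort by `mpg`, corresponds roughly to the `pandas` code below. This is a hand-written sketch, not the code that Data Wrangler exports (its generated code may look different), and the small `cars` DataFrame is made-up illustrative data rather than the real `features.csv`.

```python
import pandas as pd

# Made-up illustrative data (not the real features.csv)
cars = pd.DataFrame({
    "mpg": [18.0, 31.0, 15.0, 24.0, 13.0],
    "cylinders": [8, 4, 8, 6, 8],
})

# Keep only the cars with 6 or more cylinders
cars_filtered = cars[cars["cylinders"] >= 6]

# Sort the remaining cars by mpg, highest first
cars_sorted = cars_filtered.sort_values(by="mpg", ascending=False)

print(cars_sorted)
```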

## Try it yourself! 🚗 {#sec:exercise-01}

*(This exercise is intended to be done using the Data Wrangler extension in VS Code, as described in the slides. The goal is to observe the generated Python code.)*

- Now, let's add two more conditions to the filter in Data Wrangler:
- We want to find the fastest, least fuel-efficient cars with 6 or more cylinders.
- Add a new condition to the filter:
  - Select `acceleration` in `Choose column`
  - Select `Less than or equal to` in `condition`
  - Type `15` in `Value`
- Then, add the second condition:
  - Select `mpg` in `Choose column`
  - Select `Less than or equal to` in `condition`
  - Type `12` in `Value`
  - Click on `Apply`
  
- **How many cars are left?** (Observe in Data Wrangler)

## Subsetting data with `pandas`
### The `[]` operator

- We saw how `pandas` can be used to subset data in the `Data Wrangler`
- Now, let's see how we can do the same in a Jupyter notebook or directly in a Python script
- Let's start with the `[]` operator
- This is the most common way to subset data in `pandas`, and we can use it to select columns by name, or to select rows by index (though `.loc[]` and `.iloc[]` are preferred for explicit row selection).
- To select columns by name, we type `df['column']` or `df[['col1', 'col2']]`
- But what if we don't know the name of the column? We can use the `.columns` attribute to see all the columns in the dataset

- Let's load the dataset and see the columns
- We can do so by typing the name of the dataset followed by a period (`.`) and the attribute name `columns`
- If you want to select many columns, it is often a good idea to store the column names in a separate variable, as we will do below

In [None]:
import pandas as pd
from IPython.display import display # For richer display in Jupyter

# Ensure 'data_raw/features.csv' is in the correct path
try:
    carfeatures = pd.read_csv('data_raw/features.csv')
    car_colnames = carfeatures.columns
    print(car_colnames)
except FileNotFoundError:
    print("File 'data_raw/features.csv' not found. Please check the path.")
    carfeatures = pd.DataFrame() # Create an empty DataFrame to avoid further errors

## Subsetting data with `pandas`
### The `[]` operator

- We can subset the data either by typing the column name directly or by looking the name up by its position in the list of column names.

- With the index (note the brackets):

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures[car_colnames[0]].head()) # Displaying head for brevity
else:
    print("carfeatures DataFrame not loaded.")

- With the name:

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures['mpg'].head()) # Displaying head for brevity
else:
    print("carfeatures DataFrame not loaded.")

## Selecting multiple columns

- We can select multiple columns by passing a list of column names
- Let's say we want to select the columns `mpg` and `weight`
- We can do so by typing:

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures[['mpg', 'weight']].head())
else:
    print("carfeatures DataFrame not loaded.")

- We can also store the columns in a variable and pass it to the `[]` operator

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    list_subsetcols = ["mpg", "weight"]
    subcols_carfeatures = carfeatures[list_subsetcols]
    display(subcols_carfeatures.head())
else:
    print("carfeatures DataFrame not loaded.")

## Subsetting data with `pandas`

- Note that `pandas` uses **double brackets** when selecting multiple columns
- This is because `pandas` **interprets a single bracket as a series** (a single column)
- So we need two brackets to pass a list of columns
- An explanation for this behavior is that `pandas` is trying to avoid ambiguity
- Using `df['col1', 'col2']` would be interpreted as trying to access a single column with the name `('col1', 'col2')` 
- Yes, it's a bizarre name, but it could happen if you, say, merged two columns and named it like that (which is not a good idea!)
- More on this [here](https://www.reddit.com/r/learnpython/comments/yl9htk/pandas_module_double_square_brackets_purpose/) and [here](https://stackoverflow.com/questions/33417991/pandas-why-are-double-brackets-needed-to-select-column-after-boolean-indexing)
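
A small sketch of the difference between single and double brackets, assuming a made-up two-column DataFrame called `cars` (not the lecture dataset):

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({"mpg": [18.0, 31.0], "weight": [3504, 2130]})

single = cars["mpg"]            # one column name -> Series
double = cars[["mpg"]]          # list with one name -> DataFrame
both = cars[["mpg", "weight"]]  # list with two names -> DataFrame

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame

# Without the inner list, pandas looks for ONE column labelled ('mpg', 'weight')
try:
    cars["mpg", "weight"]
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

print(tuple_lookup_failed)  # True
```

The `KeyError` is what you would typically see if you forget the second pair of brackets.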

## Try it yourself! 🚗 {#sec:exercise-02}

- Extract the `weight` and `acceleration` columns from the `carfeatures` DataFrame.
- Display the head of the resulting subset.

In [None]:
# Your code here
# if 'carfeatures' in locals() and not carfeatures.empty:
    # subset_df = carfeatures[ ['weight', 'acceleration'] ] # Hint!
    # display(subset_df.head())
# else:
#     print("carfeatures DataFrame not loaded.")

## Filtering whole dataframes with `[]`

- We can also use the `[]` operator to filter the whole dataframe, not just the columns
- For example, we can filter the dataframe to show only cars with `mpg` greater than or equal to 25

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures[carfeatures['mpg'] >= 25].head())
else:
    print("carfeatures DataFrame not loaded.")

- This will return a new dataframe with only the rows that satisfy the condition
- Note that we are using the `[]` operator twice
- The inner `[]` (`carfeatures['mpg']`) selects the `mpg` column, and the comparison `>= 25` turns it into a boolean Series with one `True`/`False` value per row
- The outer `[]` then keeps only the rows where that Series is `True`. You can optionally add another `[]` afterwards to select specific columns from the filtered rows
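
To see what happens inside the brackets, the filter can be split into two steps. The sketch below uses a made-up `cars` DataFrame, and the names `mask`, `efficient` and `efficient_mpg` are only illustrative:

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({"mpg": [18.0, 31.0, 25.0, 15.0]})

# Step 1: the comparison gives one True/False value per row
mask = cars["mpg"] >= 25
print(mask.tolist())  # [False, True, True, False]

# Step 2: passing the mask to [] keeps only the rows marked True
efficient = cars[mask]
print(efficient)

# Optional extra []: pick a column from the filtered rows
efficient_mpg = cars[mask]["mpg"]
print(efficient_mpg.tolist())  # [31.0, 25.0]
```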

## Filtering whole dataframes with `[]`

- One can make more complex queries by combining multiple conditions using `&` (and) or `|` (or). Remember to use parentheses for each condition.

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures[(carfeatures['acceleration']>=10) & (carfeatures['acceleration']<18)].head())
else:
    print("carfeatures DataFrame not loaded.")
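
The parentheses matter because Python applies `&` before comparisons such as `>=` and `<`. The sketch below, on made-up data, shows the parenthesised version working and the unparenthesised version failing. The exact error type may depend on the column type and the `pandas` version, so the example catches both `TypeError` and `ValueError`.

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({"acceleration": [9.5, 12.0, 17.9, 18.0, 21.0]})

# With parentheses: each comparison is evaluated first, then combined
in_range = cars[(cars["acceleration"] >= 10) & (cars["acceleration"] < 18)]
print(in_range["acceleration"].tolist())  # [12.0, 17.9]

# Without parentheses, Python evaluates 10 & cars["acceleration"] first,
# which is not what we meant and raises an error
try:
    cars[cars["acceleration"] >= 10 & cars["acceleration"] < 18]
    unparenthesised_ok = True
except (TypeError, ValueError):
    unparenthesised_ok = False

print(unparenthesised_ok)
```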

## Subsetting by row/column position 
### The `.iloc[]` (integer location) method

- We use `.iloc[]` to select data by row and column position (integer indices).
- `.iloc[]` can be used to select a single row:
  - `df.iloc[0]` selects the first row.
- Or multiple rows (pass a list of indices):
  - `df.iloc[[0, 1, 2]]` selects the first three rows.
- Multiple rows and columns:
  - `df.iloc[[0, 1, 2], [0, 1, 2]]` selects the first three rows and first three columns.
- The `:` is used to select all rows or columns, or a range of rows or columns (slicing):
  - `df.iloc[:, 0]` selects all rows, first column.
- Or multiple columns by slice:
  - `df.iloc[:, 0:3]` selects all rows, first three columns (columns at index 0, 1, 2).
- A range of rows and columns:
  - `df.iloc[0:3, 0:3]` selects the first three rows and first three columns.
- Remember that Python is zero-indexed and the last index in a slice `[start:stop]` is *not* included.
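
The patterns above can be tried on a tiny DataFrame. This is a sketch with made-up data; the column names `a` to `d` are placeholders:

```python
import pandas as pd

# Made-up illustrative data with four rows and four columns
df = pd.DataFrame({
    "a": [10, 20, 30, 40],
    "b": [1.5, 2.5, 3.5, 4.5],
    "c": ["w", "x", "y", "z"],
    "d": [True, False, True, False],
})

first_row = df.iloc[0]            # single row -> Series
first_three = df.iloc[[0, 1, 2]]  # list of positions -> DataFrame
first_col = df.iloc[:, 0]         # all rows, first column -> Series
block = df.iloc[0:3, 0:3]         # rows 0-2 and columns 0-2 (stop is excluded)

print(first_three.shape)    # (3, 4)
print(block.shape)          # (3, 3)
print(list(block.columns))  # ['a', 'b', 'c']
```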

## Subsetting by row/column position 
### The `.iloc[]` method

- Let's see some examples!
- We can combine `.iloc[]` with `sort_values(by = 'variable')` to sort the data

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    carsorted = carfeatures.sort_values(by = "mpg", ascending = False)
    print("Sorted carfeatures (head):")
    display(carsorted.head())
else:
    print("carfeatures DataFrame not loaded.")
    carsorted = pd.DataFrame() # Define as empty to avoid errors in next cells

- For example, to select the first three rows:

In [None]:
# Select the first three rows 
if 'carsorted' in locals() and not carsorted.empty:
    print("First three rows of sorted data:")
    display(carsorted.iloc[[0, 1, 2]])
else:
    print("carsorted DataFrame not available.")

In [None]:
# Select the first three rows and columns
if 'carsorted' in locals() and not carsorted.empty:
    print("\nFirst three rows and columns of sorted data:")
    display(carsorted.iloc[0:3, 0:3])
else:
    print("carsorted DataFrame not available.")

In [None]:
# Compare with the original dataset
if 'carfeatures' in locals() and not carfeatures.empty:
    print("\nFirst three rows and columns of original data:")
    display(carfeatures.iloc[0:3, 0:3])
else:
    print("carfeatures DataFrame not loaded.")
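
The sorted and original data give different results because `.iloc[]` counts positions in the current row order, while the index labels on the left still refer to the original rows. A minimal sketch with made-up data follows; note that `.loc[]` and `reset_index()` are used here only for illustration and are not covered in detail in this lecture.

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({"mpg": [18.0, 31.0, 15.0, 24.0]})
cars_sorted = cars.sort_values(by="mpg", ascending=False)

# sort_values reorders the rows but keeps their original index labels
print(cars_sorted.index.tolist())  # [1, 3, 0, 2]

# .iloc[0] is the first row by position: the highest mpg after sorting
print(cars_sorted.iloc[0]["mpg"])  # 31.0

# .loc[0] is the row labelled 0: the first row of the ORIGINAL data
print(cars_sorted.loc[0]["mpg"])   # 18.0

# reset_index(drop=True) renumbers the rows so labels match positions again
cars_renumbered = cars_sorted.reset_index(drop=True)
print(cars_renumbered.index.tolist())  # [0, 1, 2, 3]
```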

## The `.iloc[]` method

- The following command extracts all columns for row zero
- In this first example, we will show the car with the highest `mpg` value

In [None]:
# Select the first row, all columns
if 'carsorted' in locals() and not carsorted.empty:
    display(carsorted.iloc[0,:])
else:
    print("carsorted DataFrame not available.")

In [None]:
# Select the first three rows, all columns
if 'carsorted' in locals() and not carsorted.empty:
    display(carsorted.iloc[[0,1,2],:])
else:
    print("carsorted DataFrame not available.")

In [None]:
# The `:` can be omitted when selecting all columns
if 'carsorted' in locals() and not carsorted.empty:
    display(carsorted.iloc[[0,1,2]])
else:
    print("carsorted DataFrame not available.")

## Subset blocks of data

In [None]:
# Extract rows 0 to 5 (exclusive of 5, so rows 0, 1, 2, 3, 4)
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures.iloc[0:5,:])
else:
    print("carfeatures DataFrame not loaded.")

In [None]:
# Extract rows 0 up to 8 (exclusive of 8, so rows 0 through 7)
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures.iloc[:8,:]) 
else:
    print("carfeatures DataFrame not loaded.")

## Try it yourself! 🚗 {#sec:exercise-03}

- Create a new DataFrame called `car_ascendingmpg` which sorts `carfeatures` from lowest to highest "mpg".
- Using `.iloc[]`, subset the data of the 5 cars with the lowest "mpg" from `car_ascendingmpg`.
- Display this subset.

In [None]:
# Your code here
# if 'carfeatures' in locals() and not carfeatures.empty:
    # car_ascendingmpg = carfeatures.sort_values(by="mpg") # Hint: ascending is default
    # lowest_5_mpg_cars = car_ascendingmpg.iloc[0:5] # Hint: first 5 rows
    # display(lowest_5_mpg_cars)
# else:
#     print("carfeatures DataFrame not loaded.")

## The `.query()` method

- The `.query()` method is a powerful way to subset data in `pandas`
- It allows you to filter data using a string expression, similar to natural language and SQL
- The syntax is `df.query('expression')`
- The expression is written as if you were writing a sentence
- For example, to keep only the cars with more than 6 cylinders, we would write:
  - `df.query('cylinders > 6')`
- Several conditions can go in the same expression:
  - `df.query('cylinders > 6 & acceleration < 15 & mpg < 12')`

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures.query("mpg >= 25").head())
else:
    print("carfeatures DataFrame not loaded.")
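
`.query()` and the `[]` operator with a boolean condition select the same rows. The sketch below checks this on a made-up DataFrame using `.equals()`; the variable names are illustrative only.

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({
    "mpg": [18.0, 31.0, 25.0, 11.0],
    "cylinders": [8, 4, 4, 8],
})

# One condition, written both ways
with_query = cars.query("mpg >= 25")
with_mask = cars[cars["mpg"] >= 25]
print(with_query.equals(with_mask))  # True

# Two conditions, written both ways
both_query = cars.query("(cylinders > 6) & (mpg < 12)")
both_mask = cars[(cars["cylinders"] > 6) & (cars["mpg"] < 12)]
print(both_query.equals(both_mask))  # True
```

Which one to use is mostly a matter of readability: `.query()` tends to be shorter when there are several conditions.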

## Let's see more examples!

- Combine multiple conditions with `and`, `&`, `or`, `|`

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures.query("(acceleration >= 10) and (acceleration < 18)").head())
else:
    print("carfeatures DataFrame not loaded.")

## Let's see more examples!

- Use `.query()` and `sort_values` together

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures.query("(cylinders == 8) & (mpg <= 15) & (acceleration <= 12)").sort_values(by = "mpg").head())
else:
    print("carfeatures DataFrame not loaded.")

## Let's see more examples!

- Use `.query()` and select columns
- Note that we are using the `[[]]` operator to select the columns
- Also note that we don't need parentheses when using `&` and `|` in the expression, but it is a good practice to use them

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    display(carfeatures.query("(cylinders == 8) & (mpg <= 15) and acceleration <= 12")[["mpg", "cylinders", "acceleration"]].head())
else:
    print("carfeatures DataFrame not loaded.")

## Let's see more examples!

- Combine `.query()` with Python variables defined outside the query string (use the `@` prefix for variable names in the query string)

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    threshold = 25
    data_varthreshold_mpg = carfeatures.query("mpg >= @threshold")
    display(data_varthreshold_mpg.head())
else:
    print("carfeatures DataFrame not loaded.")
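
The `@` prefix also works with lists, which can be combined with the `in` keyword inside the query string. A minimal sketch, assuming made-up data and the hypothetical variable names `min_mpg` and `wanted_cylinders`:

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({
    "mpg": [18.0, 31.0, 25.0, 20.0],
    "cylinders": [8, 4, 6, 6],
})

min_mpg = 20
wanted_cylinders = [4, 6]

# Keep cars at or above the mpg threshold whose cylinder count is in the list
subset = cars.query("mpg >= @min_mpg and cylinders in @wanted_cylinders")
print(subset)
```

Changing `min_mpg` or `wanted_cylinders` changes the result without touching the query string.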

## Let's see more examples!

- Expression with variable names that contain spaces
- Use backticks (\` \`) to refer to the variable name

In [None]:
if 'carfeatures' in locals() and not carfeatures.empty:
    carfeatures["new variable"] = carfeatures["mpg"] # Create a column with a space in its name
    data_spacesthreshold_mpg = carfeatures.query("`new variable` >= 25")
    display(data_spacesthreshold_mpg.head())
    # Clean up the added column if you want
    # carfeatures = carfeatures.drop(columns=["new variable"])
else:
    print("carfeatures DataFrame not loaded.")

## Try it yourself! 🚗 {#sec:exercise-04}

- Using the `carfeatures` DataFrame:
  - Subset the data to include only cars where `mpg` is greater than or equal to 25 AND `cylinders` is equal to 8.
  - From this subset, select only the `mpg` and `cylinders` columns.
  - Display the head of this final selection.

In [None]:
# Your code here
# if 'carfeatures' in locals() and not carfeatures.empty:
    # query_str = "(mpg >= 25) & (cylinders == 8)" # Hint for the query
    # columns_to_select = ["mpg", "cylinders"] # Hint for columns
    # subset_df = carfeatures.query(query_str)[columns_to_select]
    # display(subset_df.head())
# else:
#     print("carfeatures DataFrame not loaded.")

## Plotting Subsets

- `matplotlib` can also be used to plot subsets of data
- The syntax is similar to what we have seen before
- For example, you can just add other `plt.scatter()` or `plt.hist()` commands to the same cell
- Or you can create a `for` loop to plot multiple subsets
- Let's see an example using `cylinders`

- First, we need to import `matplotlib` and use `pd.unique()` to extract a list with the unique elements in that column

In [None]:
import matplotlib.pyplot as plt
import pandas as pd # Ensure pandas is imported

if 'carfeatures' in locals() and not carfeatures.empty and 'cylinders' in carfeatures.columns:
    list_unique_cylinders = pd.unique(carfeatures["cylinders"])
    print(list_unique_cylinders)
else:
    print("carfeatures DataFrame not loaded or 'cylinders' column missing.")
    list_unique_cylinders = [] # Define as empty list to avoid errors in next cell
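
`pd.unique()` returns the values in the order in which they first appear, not in sorted order. If a predictable order is preferred (for a legend, say), the values can be sorted first. A small sketch with made-up data:

```python
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({"cylinders": [8, 4, 8, 6, 4, 3]})

# pd.unique() returns the values in order of first appearance
in_order_seen = pd.unique(cars["cylinders"])
print(in_order_seen.tolist())  # [8, 4, 6, 3]

# sorted() puts them in ascending order
in_sorted_order = sorted(cars["cylinders"].unique())
print([int(c) for c in in_sorted_order])  # [3, 4, 6, 8]
```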

## Plotting Subsets

In [None]:
# If we call plt.scatter() twice, it will display both plots on the same graph
# We also include plt.show() at the very end.
if 'carfeatures' in locals() and not carfeatures.empty:
    df_8 = carfeatures.query("cylinders == 8")
    df_4 = carfeatures.query("cylinders == 4")

    plt.figure() # Create a figure for the plot
    if not df_8.empty:
        plt.scatter(x = df_8["weight"],y = df_8["acceleration"], label = "8")
    if not df_4.empty:
        plt.scatter(x = df_4["weight"],y = df_4["acceleration"], label = "4")
    
    # Each label is attached to its own scatter call, so the legend stays
    # correct even if one of the subsets is empty and is not plotted
    plt.legend(title  = "Cylinders")
    plt.xlabel("Weight")
    plt.ylabel("Acceleration")
    plt.title("Weight vs. Acceleration by Cylinders")
    plt.show()

# Note: If we put plt.show() in between the plots, then the results will
# be shown on separate graphs instead.
else:
    print("carfeatures DataFrame not loaded.")

## Plotting Subsets

- Using a `for` loop to plot multiple subsets

In [None]:
# Plot one scatter series per unique cylinder count
# list_unique_cylinders should be defined from a previous cell
if 'carfeatures' in locals() and not carfeatures.empty and 'list_unique_cylinders' in locals() and len(list_unique_cylinders) > 0:
    plt.figure()
    # Use a for loop to plot a scatter plot between "weight" and "acceleration"
    # for each category. Each plot will have a different color

    for category in list_unique_cylinders:
        df_loop_subset = carfeatures.query("cylinders == @category") # Use @ to refer to the variable 'category'
        if not df_loop_subset.empty:
             plt.scatter(x = df_loop_subset["weight"],y = df_loop_subset["acceleration"])
        
    # Add labels and a legend
    plt.xlabel("Weight")
    plt.ylabel("Acceleration")
    plt.legend(labels = list_unique_cylinders,
               title  = "Cylinders")
    plt.title("Weight vs. Acceleration for different Cylinder counts")
    plt.show()
else:
    print("carfeatures DataFrame or list_unique_cylinders not available/empty.")
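
An alternative way to build the legend is to pass `label=` to each `plt.scatter()` call, so every legend entry is tied to the points it describes. The sketch below uses made-up data and selects the non-interactive `Agg` backend only so that it runs outside a notebook (that line is an assumption about the environment and can be removed in Jupyter, where you would call `plt.show()` instead of `plt.close()`).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up illustrative data
cars = pd.DataFrame({
    "weight": [3504, 2130, 4341, 2833, 2372],
    "acceleration": [12.0, 14.5, 10.0, 15.5, 15.0],
    "cylinders": [8, 4, 8, 6, 4],
})

fig = plt.figure()
for category in sorted(cars["cylinders"].unique()):
    subset = cars.query("cylinders == @category")
    # label= ties the legend entry to this specific group of points
    plt.scatter(x=subset["weight"], y=subset["acceleration"], label=str(category))

plt.xlabel("Weight")
plt.ylabel("Acceleration")
legend = plt.legend(title="Cylinders")

# Read the legend entries back to confirm they match the plotted groups
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)  # ['4', '6', '8']
plt.close(fig)
```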

## Try it yourself! 🚗 {#sec:exercise-05}

- Using the `carfeatures` DataFrame:
  - Compute a histogram of "mpg" for each unique cylinder count (from `list_unique_cylinders`).
  - Plot all these histograms on the same figure to compare distributions.
  - Make the histograms transparent by setting the `alpha` parameter in `plt.hist()` to `0.5` (e.g., `plt.hist(x = ..., alpha = 0.5)`).
  - Add an overall title, x-label ("Miles per gallon"), y-label ("Frequency"), and a legend to identify which histogram corresponds to which cylinder count.

In [None]:
# Your code here
# import matplotlib.pyplot as plt
# import pandas as pd

# if 'carfeatures' in locals() and not carfeatures.empty and 'list_unique_cylinders' in locals() and len(list_unique_cylinders) > 0:
    # plt.figure()
    # for category in list_unique_cylinders:
        # df_category = carfeatures.query("cylinders == @category")
        # if not df_category.empty:
            # plt.hist(x = df_category["mpg"], alpha = 0.5, bins = 10, label=str(category)) # Added label for legend

    # plt.xlabel("Miles per gallon")
    # plt.ylabel("Frequency")
    # plt.title("MPG Distribution by Cylinder Count")
    # plt.legend(title = "Cylinders")
    # plt.show()
# else:
#     print("carfeatures DataFrame or list_unique_cylinders not available/empty.")

# And that's it for today! 🎉

# See you next time! 🚀