# QTM 151 - Introduction to Statistical Computing II
## Lecture 08 - Data Wrangling with Pandas
**Author:** Danilo Freire (danilo.freire@emory.edu, Emory University)

# Hello again! 🥳

# Recap of last class 📚

## In our last class, we learned

- How to write functions with `def` and `return`
- What **parameters, arguments, and return values** are
- How to combine functions with `if` statements
- How to use [lambda](https://realpython.com/python-lambda/) to create quick, throwaway functions

![](figures/functions.webp)
![](figures/lambda.jpg)

## Today's plan 📅

- Introduction to `pandas`, the main library for data manipulation in Python
- Learn how to apply functions to many variables at once
- How to use the `apply` and `map` functions
- Learn how to recode and replace variables in a dataset
- Specifically focus on replacing `NaN` values ("Not a Number" - missing data)
- Cover how to convert variables from one type to another
- Learn how to create new variables based on existing ones
- Finally, we will learn about `.py` files and how to import them as modules

![](figures/pandas.png)

# Operations over many variables using Pandas 🐼

## Pandas 🐼

- `pandas` is the main library for **data manipulation** in Python 🐼
- We will use it a lot in this course (and in your life as a data scientist!)
- It is built on top of `numpy`, integrates with `matplotlib` for plotting, and has [a gazillion functions to work with data](https://pandas.pydata.org/docs/reference/index.html) 😁
- If you use `R` already, think about it as the `dplyr` of Python
  - A list of [equivalences between `dplyr` and `pandas`](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html)
- We will learn more about it in the next classes

## Applying functions to a dataset

- The `apply` function is used to **apply a function to a dataset**
  - (This course is full of surprises, isn't it? 😄)
- It is a **method of a pandas DataFrame**
- It can be used with built-in functions, custom functions, or lambda functions
  - `df.apply(function)`
- You can apply functions to rows or columns
  - `df.apply(function, axis=0)` applies the function to each column (default)
  - `df.apply(function, axis=1)` applies the function to each row

## Applying functions to a dataset

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

print(df.apply(np.sqrt))

          A         B         C
0  1.000000  2.000000  2.645751
1  1.414214  2.236068  2.828427
2  1.732051  2.449490  3.000000


In [2]:
print(df.apply(np.sum, axis=1))

0    12
1    15
2    18
dtype: int64


In [3]:
print(df.apply(lambda x: x**2))

   A   B   C
0  1  16  49
1  4  25  64
2  9  36  81


## Applying functions to a dataset

- Let's do a quick exercise

In [4]:
# Create an empty DataFrame
data = pd.DataFrame()

# Add variables
data["age"] = [18,29,15,32,6]
data["num_underage_siblings"] = [0,0,1,1,0]
data["num_adult_siblings"] = [1,0,0,1,0]

from IPython.display import display # To match Quarto's display output
display(data)

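|   | age | num_underage_siblings | num_adult_siblings |
|---|-----|-----------------------|--------------------|
| 0 | 18 | 0 | 1 |
| 1 | 29 | 0 | 0 |
| 2 | 15 | 1 | 0 |
| 3 | 32 | 1 | 1 |
| 4 | 6 | 0 | 0 |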


## Applying functions to a dataset

- Now let's define some functions

In [5]:
# The first two functions return True/False depending on age constraints
# The third function returns the sum of two numbers
# The fourth function returns a string with the age bracket

fn_iseligible_vote = lambda age: age >= 18

fn_istwenties = lambda age: (age >= 20) & (age < 30)

fn_sum = lambda x,y: x + y

def fn_agebracket(age):
    if (age >= 18):
        status = "Adult"
    elif (age >= 10) & (age < 18):
        status = "Adolescent"
    else:
        status = "Child"
    return status

## Applying functions to a dataset

- Now let's apply the functions to the `data["age"]` column

In [6]:
data["can_vote"]    = data["age"].apply(fn_iseligible_vote)
data["in_twenties"] = data["age"].apply(fn_istwenties)
data["age_bracket"] = data["age"].apply(fn_agebracket)

display(data)

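|   | age | num_underage_siblings | num_adult_siblings | can_vote | in_twenties | age_bracket |
|---|-----|-----------------------|--------------------|----------|-------------|-------------|
| 0 | 18 | 0 | 1 | True | False | Adult |
| 1 | 29 | 0 | 0 | True | True | Adult |
| 2 | 15 | 1 | 0 | False | False | Adolescent |
| 3 | 32 | 1 | 1 | True | False | Adult |
| 4 | 6 | 0 | 0 | False | False | Child |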
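
The examples above apply a function to a single column. As a minimal sketch of `axis=1`, the snippet below uses the `fn_sum` lambda defined earlier to add two columns row by row. The dataset is rebuilt so the snippet runs on its own, and the column name `total_siblings` is a hypothetical name chosen only for this illustration.

```python
import pandas as pd

# Rebuild the small example dataset so the snippet is self-contained
data = pd.DataFrame({
    "age": [18, 29, 15, 32, 6],
    "num_underage_siblings": [0, 0, 1, 1, 0],
    "num_adult_siblings": [1, 0, 0, 1, 0],
})

fn_sum = lambda x, y: x + y

# With axis=1, apply() passes one row at a time to the function.
# Each row is a Series, so columns are accessed by name.
# "total_siblings" is a hypothetical column name for this example.
data["total_siblings"] = data.apply(
    lambda row: fn_sum(row["num_underage_siblings"], row["num_adult_siblings"]),
    axis=1,
)

print(data[["num_underage_siblings", "num_adult_siblings", "total_siblings"]])
```

For a simple sum like this one, `data["num_underage_siblings"] + data["num_adult_siblings"]` gives the same result without `apply`; the row-wise version is mainly useful when the function needs more complex logic.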


## Creating a new variable

- You can also create a new variable using the `apply` function

In [7]:
# Creating a new variable
data["new_var"] = data["age"].apply(lambda age: age >= 18)

display(data)

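|   | age | num_underage_siblings | num_adult_siblings | can_vote | in_twenties | age_bracket | new_var |
|---|-----|-----------------------|--------------------|----------|-------------|-------------|---------|
| 0 | 18 | 0 | 1 | True | False | Adult | True |
| 1 | 29 | 0 | 0 | True | True | Adult | True |
| 2 | 15 | 1 | 0 | False | False | Adolescent | False |
| 3 | 32 | 1 | 1 | True | False | Adult | True |
| 4 | 6 | 0 | 0 | False | False | Child | False |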


## Deleting a variable

- You can also delete a variable using the `drop` function

In [8]:
data = data.drop(columns = ["new_var"])

display(data)

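|   | age | num_underage_siblings | num_adult_siblings | can_vote | in_twenties | age_bracket |
|---|-----|-----------------------|--------------------|----------|-------------|-------------|
| 0 | 18 | 0 | 1 | True | False | Adult |
| 1 | 29 | 0 | 0 | True | True | Adult |
| 2 | 15 | 1 | 0 | False | False | Adolescent |
| 3 | 32 | 1 | 1 | True | False | Adult |
| 4 | 6 | 0 | 0 | False | False | Child |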


## Mapping functions to a list, array, or series

- The `map` function is used to **apply a function to a list, an array, or a series**
  - A series is a single column of a pandas DataFrame
- **In pandas**, `map` works very similarly to the `apply` function, and they are interchangeable when working with series
- `map` can be faster than `apply` for simple functions, but `apply` is more flexible as it can be used with DataFrames (many columns)
- However, if you are using regular lists (e.g., `list01 = [1,2,3]`), you should use `map` instead of `apply`
  - `apply` is not a built-in Python function for lists in the same way `map` is.

In [9]:
data["age_bracket01"] = data["age"].map(fn_agebracket)

display(data[["age","age_bracket01"]])

|   | age | age_bracket01 |
|---|-----|---------------|
| 0 | 18 | Adult |
| 1 | 29 | Adult |
| 2 | 15 | Adolescent |
| 3 | 32 | Adult |
| 4 | 6 | Child |


In [10]:
data["age_bracket02"] = data["age"].apply(fn_agebracket)

display(data[["age","age_bracket02"]])

|   | age | age_bracket02 |
|---|-----|---------------|
| 0 | 18 | Adult |
| 1 | 29 | Adult |
| 2 | 15 | Adolescent |
| 3 | 32 | Adult |
| 4 | 6 | Child |


## Mapping functions to a list, array, or series

- Using `map` with a list and an array

In [11]:
# Create a list
list01 = [1,2,3,4,5]

# Map a function to the list
list02 = list(map(lambda x: x**2, list01))

print(list02)

[1, 4, 9, 16, 25]


In [12]:
# Create a numpy array
array01 = np.array([1,2,3,4,5])

# Map a function to the array
array02 = np.array(list(map(lambda x: x**2, array01)))

print(array02)

[ 1  4  9 16 25]
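
For numpy arrays specifically, `map` is often unnecessary, because arithmetic operators already work element by element. The sketch below repeats the squaring example without `map` or a lambda; the name `array03` is introduced only for this illustration.

```python
import numpy as np

array01 = np.array([1, 2, 3, 4, 5])

# Arithmetic on a numpy array is applied to every element at once
array03 = array01 ** 2

print(array03)
```

This vectorised form does not work on a regular Python list (`[1, 2, 3] ** 2` raises a `TypeError`), which is why `map` or a list comprehension is used for lists.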


- Trying to use `apply` with a list or an array will raise an error
```python
# Create a list
list01 = [1,2,3,4,5]

# Apply a function to the list
# list02 = list(apply(lambda x: x**2, list01)) # This would cause NameError

# print(list02)
```
```verbatim
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[168], line 5
      2 list01 = [1,2,3,4,5]
      4 # Apply a function to the list
----> 5 list02 = list(apply(lambda x: x**2, list01))
      7 print(list02)

NameError: name 'apply' is not defined
```

## Try it yourself! 🚀 {#sec:exercise-02}

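- Write a lambda function checking whether the number of siblings is $\ge 1$ (the dataset has no `num_siblings` column, so pick one of the sibling columns, e.g. `num_adult_siblings`)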
- Add a variable to the dataset called `has_siblings`
- Assign True/False to this variable using `apply()`

In [13]:
# Your code here
# fn_has_siblings = lambda ... : ...

# data["has_siblings"] = data[...].apply(...)

# display(data[["num_adult_siblings","has_siblings"]]) # Assuming you use num_adult_siblings

# Importing modules 📦

## Importing modules
### What is a module?

- While `.ipynb` files are great for learning and teaching, they are not the best for sharing code
- When you write a lot of functions, you should save them in a `.py` file, which is a **Python script**
- A Python script, or module, is just a file containing Python code
- This code can be functions, classes, or variables
- A folder containing Python scripts is called a **package**
- You can import modules to use their code in your own code

- We can import functions into the working environment from a file 

```python
# import scripts.example_functions as ef

# print(ef.fn_quadratic(2))
# print(ef.fn_cubic(3))

# ef.message_hello("Juan")
```
*(This code assumes you have a folder named `scripts` with a file `example_functions.py` in it, containing the respective functions. For this notebook, we won't run this cell as the file structure isn't set up here.)*
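
To try the idea without setting up the `scripts` folder, here is a minimal sketch that writes a small module to a temporary folder and imports it. The bodies of `fn_quadratic` and `fn_cubic` are assumptions made for this illustration (the original file is not shown here), and the temporary folder stands in for `scripts`.

```python
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary folder.
# The function bodies below are assumed for illustration only.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "example_functions.py").write_text(
    "def fn_quadratic(x):\n"
    "    return x ** 2\n"
    "\n"
    "def fn_cubic(x):\n"
    "    return x ** 3\n"
)

# Tell Python where to look for the file, then import it as a module
sys.path.insert(0, str(module_dir))
import example_functions as ef

print(ef.fn_quadratic(2))  # 4
print(ef.fn_cubic(3))      # 27
```

The prefix `ef.` makes it clear where each function comes from, which helps when several modules define functions with similar names.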

## Importing modules
### Importing variables

- You can also import variables from a module
- However, it is not recommended to import variables
- It is better to import functions and use them to create variables
- This is because an imported variable can silently overwrite a variable with the same name in your environment (or be changed inside the module), leading to unexpected results

- Example:

```python
# import scripts.example_variables as ev

# # When we run this code
# # the value of alpha will be overwritten

# alpha = 1
# print(alpha)
# print(ev.alpha)

# from scripts.example_variables import *

# print(alpha)
# print(beta)
# print(gamma)
# print(delta)
```
*(This code assumes a `scripts/example_variables.py` file. We won't run this.)*

# Loading packages and dataset 📦

## Our dataset: Formula 1 World Championships 🏁🏎️

- First, we will load the packages we need

In [14]:
import pandas as pd
import numpy as np

- Then, we will load the dataset

In [15]:
# Ensure the file 'data_raw/circuits.csv' is in the correct path relative to the notebook
# Or use the direct URL
try:
    circuits = pd.read_csv("data_raw/circuits.csv")
except FileNotFoundError:
    print("Local file 'data_raw/circuits.csv' not found. Attempting to download from internet...")
    circuits_url = "https://raw.githubusercontent.com/danilofreire/qtm151-summer/main/lectures/lecture-08/data_raw/circuits.csv"
    try:
        circuits = pd.read_csv(circuits_url)
        print("Successfully downloaded circuits.csv from GitHub.")
    except Exception as e:
        print(f"Could not download from URL. Error: {e}")
        circuits = pd.DataFrame() # Create an empty DataFrame if loading fails

from IPython.display import display
if not circuits.empty:
    display(circuits.head(2))
else:
    print("Failed to load the circuits dataset.")

|   | circuitId | circuitRef | name | location | country | lat | lng | alt | url |
|---|-----------|------------|------|----------|---------|-----|-----|-----|-----|
| 0 | 1 | albert_park | Albert Park Grand Prix Circuit | Melbourne | Australia | -37.8497 | 144.968 | 10 | http://en.wikipedia.org/wiki/Melbourne_Grand_P... |
| 1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.738 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... |


## Our dataset: Formula 1 World Championships 🏁🏎️

- The dataset contains information about F1 circuits, such as their names, locations, latitudes, longitudes, and more
- You can find more information about the dataset [here](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020/data)
- The dataset is available in the course's GitHub repository [here](https://github.com/danilofreire/qtm151-summer/blob/main/lectures/lecture-08/data_raw/circuits.csv)
  - Or you can download it using the command above
- Let's see what the codebook looks like
- More information about [Formula 1 here](https://en.wikipedia.org/wiki/Formula_One)

## Codebook 📚

![](figures/codebook.png)

- `Field` - Name of the variable
- `Type` - Type of the variable
  - Integer (`int`), string (`str` - `varchar`), and float (`float`)
- `Description` - Label with a description of the variable
- **Quick discussion**: What does `varchar(255)` mean?

The dataset has 9 columns (variables) and 77 rows (observations).

The columns are:
  - `circuitId`: Unique identifier for the circuit
  - `circuitRef`: Unique reference for the circuit
  - `name`: Name of the circuit
  - `location`: Location 
  - `country`: Country where the circuit is located
  - `lat`: Latitude 
  - `lng`: Longitude
  - `alt`: Altitude
  - `url`: URL of the circuit's Wikipedia page

# NaN values 🚫

## What is a `NaN` value?

- `NaN` stands for "Not a Number"
- It is a special floating-point value that `numpy` and `pandas` use to represent missing data
- `NaN` values can be found in datasets for various reasons
  - Data entry errors
  - Data cleaning and processing errors
  - Data collection errors
  - Data transformation errors
- We (often) need to handle `NaN` values before we can analyse the data

- `NaN` values can be found in different types of variables
  - Numeric variables
  - Categorical variables
  - Date variables
  - Text variables
- We will focus on numeric variables today
- `pandas` and `numpy` have functions to handle `NaN` values
  - Note: they handle `NaN` values differently!

## Operations with `NaN` values

- `NaN` is a special number, available in `numpy`

In [16]:
import numpy as np # Ensure numpy is imported
np.nan

nan

- Often, we cannot perform operations with `NaN` values
- Thus, we need to handle them before we can analyse the data

- Let's see some examples. We start with `numpy` arrays

In [17]:
# Create two arrays, with and without "NaNs"
# The "np.array()" function converts
# a list to an array

vec_without_nans = np.array([1,1,1])
vec_with_nans    = np.array([np.nan,4,5])

# When you combine the vectors with
# arithmetic operators, you will produce
# a NaN on any entries with "NaNs"
print(vec_without_nans * vec_with_nans)
print(vec_without_nans / vec_with_nans)
print(vec_without_nans + vec_with_nans)
print(vec_without_nans - vec_with_nans)

[nan  4.  5.]
[ nan 0.25 0.2 ]
[nan  5.  6.]
[nan -3. -4.]
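
Before handling `NaN` values, we need to find them. A detail worth knowing is that `NaN` is not equal to anything, not even to itself, so `==` cannot be used to detect it. The sketch below uses `np.isnan()` instead; the name `is_missing` is introduced only for this illustration.

```python
import numpy as np

vec_with_nans = np.array([np.nan, 4, 5])

# NaN is not equal to anything, including itself
print(np.nan == np.nan)

# np.isnan() returns True wherever an entry is NaN
is_missing = np.isnan(vec_with_nans)
print(is_missing)

# Keep only the entries that are not missing
print(vec_with_nans[~is_missing])
```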


## Summary statistics with `NaN` values
### Arrays

- Some summary statistics functions will not work with `NaN` values
- For example, the `mean()` function

In [18]:
print(np.mean(vec_with_nans))

nan


- The `mean()` function will return `NaN` if there are `NaN` values in the array

- To calculate the mean without `NaN` values, we can use the `nanmean()` function

In [19]:
print(np.nanmean(vec_with_nans))

4.5


- The `nanmean()` function will ignore `NaN` values and calculate the mean with the remaining values

## Summary statistics with `NaN` values
### Pandas DataFrames

- Let's create an empty DataFrame and create a new column `x` with `NaN` values

In [20]:
import pandas as pd # Ensure pandas is imported
dataset = pd.DataFrame()
dataset["x"] = vec_with_nans # vec_with_nans was defined in a previous cell
dataset

|   | x |
|---|-----|
| 0 | NaN |
| 1 | 4.0 |
| 2 | 5.0 |


- You will see that `pandas` will handle `NaN` values differently: it will **ignore them**

In [21]:
print(dataset["x"].mean())

4.5


- **For R users**: This is the same as `na.rm = TRUE` in R. `pandas` does that by default
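
The default can be switched off. As a short sketch, the `skipna` argument of `mean()` controls whether `NaN` values are ignored, and `isna()` counts how many values are missing. The variable names below are chosen for this illustration.

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({"x": [np.nan, 4, 5]})

# Default behaviour: NaN values are skipped
mean_default = dataset["x"].mean()

# skipna=False makes the NaN propagate, as numpy's mean() does
mean_strict = dataset["x"].mean(skipna=False)

# Count the missing values in the column
n_missing = dataset["x"].isna().sum()

print(mean_default, mean_strict, n_missing)
```

Checking `n_missing` before computing statistics is a useful habit, since the default silently drops the missing rows.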

# Data Cleaning 🧹🧽

## Data cleaning

- Data cleaning is the process of preparing data for analysis
- It involves identifying and handling missing data, outliers, and other data quality issues
- **You guys have no idea** how much time you will spend cleaning data in your life 😅
- According to a [Forbes survey](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/), data scientists spend 60% of their time cleaning and preparing data, and 57% say it's the least enjoyable part of their work
  - I can **really** relate to that 😂
- But remember that **clean data are good data** 🥳

- Let's get the data types of the columns in the `circuits` dataset
- We use the `dtypes` attribute for that
- `object` means that the variable is a string or a variable with mixed types (e.g., numbers and strings)

In [22]:
if 'circuits' in locals() and not circuits.empty:
    print(circuits.dtypes)
else:
    print("Circuits DataFrame not loaded.")

circuitId       int64
circuitRef     object
name           object
location       object
country        object
lat           float64
lng           float64
alt            object
url            object
dtype: object


## Check rows with numeric values

- Here we will use the `.str.isnumeric()` function
- This expression actually combines two pieces: the `.str` accessor and the `.isnumeric()` method
- The `.str` accessor is used to apply string methods to each element in a Series.
- The `.isnumeric()` method then checks if each string consists only of numeric characters.
- **Why do we need both?** Because DataFrame columns can sometimes be of `object` type (which can hold strings, numbers, or mixed types). We need to treat the elements as strings first (`.str`) before checking if those strings represent numbers (`.isnumeric()`).
- If we used only `.isnumeric()` on a Series that isn't of string type or on a Series with non-string elements, it might not work as expected or could raise an error.

- Calling one method after another, separated by dots, is called **method chaining**
- It is a way to call multiple methods sequentially on an object
- If you use `R`, this is similar to the `%>%` operator in `dplyr`
- Let's see how it works

In [23]:
# Check if the variable "alt" is numeric
if 'circuits' in locals() and not circuits.empty and 'alt' in circuits.columns:
    # After .astype(str), missing values become the string "nan", so .str.isnumeric() returns False for them.
    # It also returns False for decimals and negative signs, because it only accepts Unicode numeric characters.
    # For a more robust check if a string can be a float, pd.to_numeric with errors='coerce' is better.
    # However, for this specific example from slides, we'll use .str.isnumeric()
    print(circuits["alt"].astype(str).str.isnumeric()) # Convert to string first to use .str methods safely
else:
    print("Circuits DataFrame not loaded or 'alt' column missing.")

0      True
1      True
2      True
3      True
4      True
      ...  
72     True
73     True
74     True
75    False
76    False
Name: alt, Length: 77, dtype: bool
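
To see exactly which strings pass the check, here is a small sketch on a made-up Series. The values in `examples` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up strings covering the typical cases
examples = pd.Series(["10", "-7", "1.5", "\\N", ""])

# Only strings made up entirely of numeric characters return True
flags = examples.str.isnumeric()

print(pd.DataFrame({"value": examples, "isnumeric": flags}))
```

Only `"10"` passes. The minus sign, the decimal point, the `\N` marker, and the empty string all return `False`, so a `False` here does not always mean the value is not a number.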


## Other examples of chaining methods

In [24]:
# Check if the variable 
# "circuitRef" is numeric
if 'circuits' in locals() and not circuits.empty and 'circuitRef' in circuits.columns:
    print(circuits["circuitRef"].astype(str).str.isnumeric())
else:
    print("Circuits DataFrame not loaded or 'circuitRef' column missing.")

0     False
1     False
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: circuitRef, Length: 77, dtype: bool


In [25]:
# Convert the variable 
# `location` to lowercase
if 'circuits' in locals() and not circuits.empty and 'location' in circuits.columns:
    print(circuits["location"].str.lower())
else:
    print("Circuits DataFrame not loaded or 'location' column missing.")

0        melbourne
1     kuala lumpur
2           sakhir
3         montmeló
4         istanbul
          ...     
72        portimão
73         mugello
74          jeddah
75       al daayen
76           miami
Name: location, Length: 77, dtype: object


## Extract list of non-numeric values

- We can filter the rows of a DataFrame with a condition. The cell below builds a boolean condition and uses it to subset the rows.
  - The same filter can be written with `query()`, a DataFrame method that takes the condition as a string and has [many useful options](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html)!
  - We will use it more in the future!
- Here we will combine the filtered rows with `pd.unique()` to extract the unique values in the `alt` column that are not purely numeric strings.
- The `pd.unique()` function returns an array of the unique values in a Series.

In [26]:
# Extract a list of non-numeric values
# The pd.unique() function extracts unique values from a list
# Check each value in the alt column to see if it is not numeric
# True if it is not numeric, False if it is numeric
if 'circuits' in locals() and not circuits.empty and 'alt' in circuits.columns:
    # Ensure 'alt' is string type for .str.isnumeric()
    # .str.isnumeric() will be False for NaN, empty strings, decimals, negatives.
    # We are looking for strings that are *not* purely digits.
    condition = circuits['alt'].astype(str).str.isnumeric() == False
    subset = circuits[condition]
    list_unique = pd.unique(subset["alt"])
    print(list_unique)
else:
    print("Circuits DataFrame not loaded or 'alt' column missing.")

['\\N' '-7']
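
For comparison, the same filter can be sketched with `query()`, where the condition is written as a string. The `toy` DataFrame below is made up for illustration, and `engine="python"` is passed on the assumption that string methods inside `query()` are most reliably evaluated by the Python engine.

```python
import pandas as pd

# A made-up column that mimics the raw "alt" values
toy = pd.DataFrame({"alt": ["10", "\\N", "-7", "18", "\\N"]})

# Keep the rows whose "alt" value is not a purely numeric string
subset = toy.query("alt.str.isnumeric() == False", engine="python")

# Extract the unique values, in order of appearance
list_unique = pd.unique(subset["alt"])
print(list_unique)
```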


## Replace certain values

- The `replace` function is used to replace values in a variable
- The syntax is `dataframe["variable"].replace(list_old, list_new)`
- More information about the function can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)

In [27]:
if 'circuits' in locals() and not circuits.empty and 'alt' in circuits.columns:
    # "list_old" encodes values we want to change
    # From the list_unique, we see that the values we want to change are '\N' and potentially others like '-7' (if it was read as string)
    # "list_new" encodes the values that will replace the old
    list_old = ['\\N','-7'] # Note: '\N' needs to be escaped as '\\N' if it's a literal string in the data
    list_new = [np.nan, -7] # np.nan for missing, -7 as a number

    # This command replaces the values of the "alt" column
    circuits["alt"] = circuits["alt"].replace(list_old, list_new)
    print("Values in 'alt' after replacement (first 5 unique values):")
    print(pd.unique(circuits['alt'].dropna())[:5]) # Show some unique non-NaN values
else:
    print("Circuits DataFrame not loaded or 'alt' column missing.")

Values in 'alt' after replacement (first 5 unique values):
['10' '18' '7' '109' '130']


- After the cleaning process is done, you may want to store the dataset again
- It's **strongly recommended** to do this in a separate file from the original
- Use `to_csv()` to save the dataset as a `.csv` file

```python
# circuits.to_csv("data_clean/circuits_clean.csv", index=False)
```
*(Make sure you have a `data_clean` directory or adjust the path)*
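
As a minimal sketch of the save-and-reload step, the snippet below writes a small made-up DataFrame to a temporary `data_clean` folder and reads it back. The DataFrame `clean` and the temporary location are assumptions for illustration; in practice you would use your own project folder.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# A made-up cleaned dataset with one missing value
clean = pd.DataFrame({"name": ["a", "b", "c"], "alt": [10.0, np.nan, -7.0]})

# Create the output folder if it does not exist yet
out_dir = Path(tempfile.mkdtemp()) / "data_clean"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "circuits_clean.csv"

# index=False avoids writing the row index as an extra column
clean.to_csv(out_file, index=False)

# Reading the file back: the missing value is stored as an empty cell
# and comes back as NaN
reloaded = pd.read_csv(out_file)
print(reloaded)
```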

## Try it yourself! 🧠 {#sec:exercise-04}

- Use `.replace()` with the "country" column in the `circuits` DataFrame.
- Replace "UK" with "United Kingdom".
- Display the unique values of the "country" column after replacement to verify.

In [28]:
# Your code here
# if 'circuits' in locals() and not circuits.empty and 'country' in circuits.columns:
    # circuits["country"] = ... .replace(... , ...)
    # display(circuits[["country"]].head())
    # print(pd.unique(circuits['country']))
# else:
#     print("Circuits DataFrame not loaded or 'country' column missing.")

## Try it yourself! 🧠 {#sec:exercise-05}

- What is the column type of "lat" or "lng" in the `circuits` DataFrame?
- Does it have any string variables (values that are strings, not just the column type being 'object' if it contains mixed types)?
- Can we use `.str.isnumeric()` here directly? Why or why not?

In [29]:
# Your code and explanations here
# if 'circuits' in locals() and not circuits.empty:
    # print("Data type of 'lat':", circuits['lat'].dtype)
    # print("Data type of 'lng':", circuits['lng'].dtype)
    
    # # To check for string values within a potentially numeric column:
    # has_string_in_lat = circuits['lat'].apply(type).eq(str).any()
    # print(f"Does 'lat' column contain any string values? {has_string_in_lat}")

    # # Attempting to use .str.isnumeric() on a float column will typically raise an AttributeError
    # try:
    #     print(circuits['lat'].str.isnumeric())
    # except AttributeError as e:
    #     print(f"Error using .str.isnumeric() on 'lat': {e}")
    # print("Explanation: .str accessor is for string-like operations. 'lat' is likely float64. ")
    # print("If it were object type containing strings, you'd first convert to string or check types.")
# else:
#     print("Circuits DataFrame not loaded.")

# Recoding Numeric Variables 🔄

## Recoding numeric variables

- Recoding is the process of changing the values of a variable
- We can recode variables for various reasons
  - To create new variables
  - To standardise variables
  - To simplify the analysis
- Please remember to convert the variable to the correct type before recoding

In [30]:
# Check the data type of the "alt" column
if 'circuits' in locals() and not circuits.empty and 'alt' in circuits.columns:
    print(circuits["alt"].dtype)
else:
    print("Circuits DataFrame not loaded or 'alt' column missing.")

object


- `pd.to_numeric()` is used to convert a variable to a numeric type

In [31]:
# pd.to_numeric() converts 
# a column to numeric
# Before you use this option, 
# make sure to "clean" the variable
# as we did before by checking what
# the non-numeric values are
if 'circuits' in locals() and not circuits.empty and 'alt' in circuits.columns:
    circuits["alt_numeric"] = pd.to_numeric(circuits["alt"], errors='coerce') # errors='coerce' will turn unparseable strings into NaN
    print(circuits["alt_numeric"].mean())
else:
    print("Circuits DataFrame not loaded or 'alt' column missing.")

248.1891891891892


In [32]:
if 'circuits' in locals() and not circuits.empty and 'alt_numeric' in circuits.columns:
    print(circuits["alt_numeric"].min())
    print(circuits["alt_numeric"].max())
else:
    print("Circuits DataFrame not loaded or 'alt_numeric' column missing.")

-7.0
2227.0
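
To make the role of `errors='coerce'` concrete, here is a small sketch on made-up strings. Without it, `pd.to_numeric()` stops with a `ValueError` at the first string it cannot parse; with it, that entry becomes `NaN`. The names `raw`, `raised`, and `converted` are chosen for this illustration.

```python
import pandas as pd

# Made-up values, including the "\N" missing-data marker
raw = pd.Series(["10", "18", "\\N", "-7"])

# Default behaviour: an unparseable string raises a ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as err:
    raised = True
    print(f"Conversion failed: {err}")

# errors="coerce" turns unparseable strings into NaN instead
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
```

Note that the string `"-7"` converts without trouble, even though `.str.isnumeric()` flags it as non-numeric.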


## Recode variables based on an interval {#sec:recoding}

- Imagine that we want to recode the `alt` variable into an interval

$$x_{bin} = \begin{cases} \text{"A"} &\text{ if } x_1 < x \le x_2 \\
                             \text{"B"} &\text{ if } x_2 < x \le x_3 \end{cases} $$

- We can use the `pd.cut()` function to do this
- The syntax is `df["new_variable"] = pd.cut(df["variable"], bins = [x1, x2, x3], labels = ["A", "B"])`
- Where `bins` are the intervals and `labels` are the new values
- More information about the function can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)

In [33]:
# Recode the "alt" variable into an interval
bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500",
            "Between 2500 and 5000"]

circuits["bins_alt"] = pd.cut(circuits["alt_numeric"],
                              bins = bins_x,
                              right = True,
                              labels = labels_x)
np.random.seed(1)
display(circuits.sample(5))

|   | circuitId | circuitRef | name | location | country | lat | lng | alt | url | alt_numeric | bins_alt |
|---|-----------|------------|------|----------|---------|-----|-----|-----|-----|-------------|----------|
| 31 | 32 | rodriguez | Autódromo Hermanos Rodríguez | Mexico City | Mexico | 19.4042 | -99.0907 | 2227 | http://en.wikipedia.org/wiki/Aut%C3%B3dromo_He... | 2227.0 | Between 0 and 2500 |
| 43 | 44 | las_vegas | Las Vegas Street Circuit | Nevada | USA | 36.1162 | -115.174 | 639 | http://en.wikipedia.org/wiki/Las_Vegas_Street_... | 639.0 | Between 0 and 2500 |
| 26 | 27 | estoril | Autódromo do Estoril | Estoril | Portugal | 38.7506 | -9.39417 | 130 | http://en.wikipedia.org/wiki/Aut%C3%B3dromo_do... | 130.0 | Between 0 and 2500 |
| 74 | 77 | jeddah | Jeddah Corniche Circuit | Jeddah | Saudi Arabia | 21.6319 | 39.1044 | 15 | http://en.wikipedia.org/wiki/Jeddah_Street_Cir... | 15.0 | Between 0 and 2500 |
| 58 | 59 | boavista | Circuito da Boavista | Oporto | Portugal | 41.1705 | -8.67325 | 28 | http://en.wikipedia.org/wiki/Circuito_da_Boavista | 28.0 | Between 0 and 2500 |
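
One detail of `pd.cut()` is easy to miss: with `right = True`, each interval excludes its left edge, so values at or below the first bin edge fall outside every interval and become `NaN`. The sketch below shows this on a few made-up altitudes (the dataset's minimum is -7), along with the `include_lowest` argument, which makes the first interval include its left edge.

```python
import pandas as pd

# Made-up altitudes chosen to sit on and around the bin edges
alt = pd.Series([-7, 0, 10, 2500, 2501])

bins_x = [0, 2500, 5000]
labels_x = ["Between 0 and 2500", "Between 2500 and 5000"]

# Intervals are (0, 2500] and (2500, 5000], so -7 and 0 get NaN
binned = pd.cut(alt, bins=bins_x, right=True, labels=labels_x)
print(binned)

# include_lowest=True makes the first interval [0, 2500], so 0 is kept
binned_incl = pd.cut(alt, bins=bins_x, right=True,
                     labels=labels_x, include_lowest=True)
print(binned_incl)
```

If negative altitudes should be kept, one option is to start the bins below the minimum value, for example at -100.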


# And that's it for today! 🎉

# Thanks very much! 😊