
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Pandas Overview Lab

In this lab, you will use <a href="https://pandas.pydata.org/docs/" target="_blank">pandas</a> for basic data manipulation.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.
Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.
1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:
    - In the drop-down, select **More**.
    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it before you can select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.
1. Find the triangle icon to the right of your compute cluster name and click it.
1. Wait a few minutes for the cluster to start.
1. Once the cluster is running, complete the steps above to select your cluster.


#### Problem 1: Create a `DataFrame`

Create a **`DataFrame`** called **`df`** that represents the table of dogs below.

| Name   | Age | Breed               |
| ------ | --- | ------------------- |
| Buddy  | 3   | Australian Shepherd |
| Harley | 10  | Labrador            |
| Luna   | 2   | Golden Retriever    |
| Bailey | 8   | Chihuahua           |

In [0]:
import pandas as pd

In [0]:
data = [["Buddy", 3, "Australian Shepherd"], ["Harley", 10, "Labrador"], ["Luna", 2, "Golden Retriever"], ["Bailey", 8, "Chihuahua"]]

column_names = ['Name', 'Age', 'Breed']

df = pd.DataFrame(data=data, columns=column_names)
df

|   | Name   | Age | Breed               |
| - | ------ | --- | ------------------- |
| 0 | Buddy  | 3   | Australian Shepherd |
| 1 | Harley | 10  | Labrador            |
| 2 | Luna   | 2   | Golden Retriever    |
| 3 | Bailey | 8   | Chihuahua           |
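
As a point of comparison, the same table can also be built from a dictionary that maps each column name to its list of values. The sketch below is only an illustration: the names `dogs`, `df_from_dict`, and `df_from_rows` are hypothetical and are not part of the lab. It checks that both construction styles produce the same DataFrame.

```python
import pandas as pd

# Column-oriented construction: each key is a column name,
# each value is the list of entries for that column.
dogs = {
    "Name": ["Buddy", "Harley", "Luna", "Bailey"],
    "Age": [3, 10, 2, 8],
    "Breed": ["Australian Shepherd", "Labrador", "Golden Retriever", "Chihuahua"],
}
df_from_dict = pd.DataFrame(dogs)

# Row-oriented construction, as used in the exercise: one inner list per row.
rows = [
    ["Buddy", 3, "Australian Shepherd"],
    ["Harley", 10, "Labrador"],
    ["Luna", 2, "Golden Retriever"],
    ["Bailey", 8, "Chihuahua"],
]
df_from_rows = pd.DataFrame(data=rows, columns=["Name", "Age", "Breed"])

# equals() compares shape, values, and column dtypes.
same = df_from_dict.equals(df_from_rows)
print(same)
```

The row-oriented form mirrors how the table reads on the page, while the dictionary form keeps each column's values together. Which one is clearer depends on how the source data arrives.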


<button onclick="myFunction2()" >Click for Hint</button>

<div id="myDIV2" style="display: none;">
  Remember to specify the data and columns parameters
</div>
<script>
function myFunction2() {
  var x = document.getElementById("myDIV2");
  if (x.style.display === "none") {
    x.style.display = "block";
  } else {
    x.style.display = "none";
  }
}
</script>


**Check your work by running the cell below**

In [0]:
assert (df.columns == ["Name", "Age", "Breed"]).all(), "The columns are named incorrectly"
assert [df.iloc[0][x] for x in ["Name", "Age", "Breed"]] == ["Buddy", 3, "Australian Shepherd"], "First row defined incorrectly"
assert [df.iloc[1][x] for x in ["Name", "Age", "Breed"]] == ["Harley", 10, "Labrador"], "Second row defined incorrectly"
assert [df.iloc[2][x] for x in ["Name", "Age", "Breed"]] == ["Luna", 2, "Golden Retriever"], "Third row defined incorrectly"
assert [df.iloc[3][x] for x in ["Name", "Age", "Breed"]] == ["Bailey", 8, "Chihuahua"], "Fourth row defined incorrectly"
print("Test passed!")

Test passed!



#### Problem 2: What are the `dtypes`?

Print out the **`dtypes`** attribute of your DataFrame to see the types of each column.

In [0]:
df.dtypes

Name     object
Age       int64
Breed    object
dtype: object


<button onclick="myFunction2()" >Click for Hint</button>

<div id="myDIV2" style="display: none;">
  Remember we access attributes like this: object.attribute
</div>
<script>
function myFunction2() {
  var x = document.getElementById("myDIV2");
  if (x.style.display === "none") {
    x.style.display = "block";
  } else {
    x.style.display = "none";
  }
}
</script>
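
To illustrate the hint, the sketch below reads `dtypes` as an attribute (no parentheses) on a small stand-in DataFrame. The name `df_example` and the use of `pd.api.types.is_integer_dtype` are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical two-row stand-in for the lab's DataFrame.
df_example = pd.DataFrame({"Name": ["Buddy", "Harley"], "Age": [3, 10]})

# dtypes is an attribute, not a method: it returns a Series
# indexed by column name, holding each column's dtype.
column_types = df_example.dtypes
print(column_types)

# Individual columns can be checked with the helpers in pd.api.types.
age_is_integer = pd.api.types.is_integer_dtype(df_example["Age"])
name_is_integer = pd.api.types.is_integer_dtype(df_example["Name"])
print(age_is_integer, name_is_integer)
```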



#### Problem 3: Subset of Columns

Select only the **`Name`** and **`Age`** columns, assigning to the new variable **`name_age_df`**.

In [0]:
name_age_df = df[["Name", "Age"]]
name_age_df

|   | Name   | Age |
| - | ------ | --- |
| 0 | Buddy  | 3   |
| 1 | Harley | 10  |
| 2 | Luna   | 2   |
| 3 | Bailey | 8   |
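
The double brackets in the selection above matter. A minimal sketch, using a hypothetical `df_example`, shows that a single label returns a Series while a list of labels returns a DataFrame:

```python
import pandas as pd

# Hypothetical two-row stand-in for the lab's DataFrame.
df_example = pd.DataFrame(
    {
        "Name": ["Buddy", "Harley"],
        "Age": [3, 10],
        "Breed": ["Australian Shepherd", "Labrador"],
    }
)

# A single column label selects one column as a Series.
single = df_example["Name"]

# A list of labels selects those columns and keeps the result a DataFrame.
subset = df_example[["Name", "Age"]]

print(type(single).__name__, type(subset).__name__)
```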


**Check your work by running the cell below**

The assert below uses [**iloc**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) to do integer-location based indexing for selection by position.

In [0]:
assert (name_age_df.columns == ["Name", "Age"]).all(), "The columns are named incorrectly"
assert name_age_df.shape == (4, 2), "There are not the right number of rows or columns"
assert [name_age_df.iloc[0][x] for x in ["Name", "Age"]] == ["Buddy", 3], "First row defined incorrectly"
assert [name_age_df.iloc[1][x] for x in ["Name", "Age"]] == ["Harley", 10], "Second row defined incorrectly"
assert [name_age_df.iloc[2][x] for x in ["Name", "Age"]] == ["Luna", 2], "Third row defined incorrectly"
assert [name_age_df.iloc[3][x] for x in ["Name", "Age"]] == ["Bailey", 8], "Fourth row defined incorrectly"
print("Test passed!")

Test passed!
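
The position-based `iloc` indexing used in the check above can be contrasted with label-based `loc`. In the sketch below, the string index labels `"a"`, `"b"`, `"c"` are hypothetical, chosen only so that positions and labels are visibly different.

```python
import pandas as pd

df_example = pd.DataFrame(
    {"Name": ["Buddy", "Harley", "Luna"], "Age": [3, 10, 2]},
    index=["a", "b", "c"],  # hypothetical labels, not used in the lab
)

# iloc selects by integer position: row 0 is the first row.
first_row = df_example.iloc[0]

# loc selects by label: "a" is the label of that same row.
same_row = df_example.loc["a"]

# Both accept a (row, column) pair to reach a single cell.
cell_by_position = df_example.iloc[1, 1]   # second row, second column
cell_by_label = df_example.loc["b", "Age"]  # row labeled "b", column "Age"

print(first_row["Name"], cell_by_position, cell_by_label)
```

With the default integer index used in this lab, positions and labels happen to coincide, which is why the distinction is easy to miss.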


#### Problem 4: Create a New Column

Let's assume one year in dog years is equal to 7 years in human years. Create a new column called **`Human Age`** in our **`df`** that takes the dog's age and multiplies it by 7.

In [0]:
df["Human Age"] = df.Age * 7


**Check your work by running the cell below**

In [0]:
assert df.shape == (4, 4), "There are not the correct number of rows or columns"
assert (df.columns == ["Name", "Age", "Breed", "Human Age"]).all(), "The columns are named incorrectly"
assert [df.iloc[0][x] for x in ["Name", "Age", "Breed", "Human Age"]] == ["Buddy", 3, "Australian Shepherd", 21], "First row defined incorrectly"
assert [df.iloc[1][x] for x in ["Name", "Age", "Breed", "Human Age"]] == ["Harley", 10, "Labrador", 70], "Second row defined incorrectly"
assert [df.iloc[2][x] for x in ["Name", "Age", "Breed", "Human Age"]] == ["Luna", 2, "Golden Retriever", 14], "Third row defined incorrectly"
assert [df.iloc[3][x] for x in ["Name", "Age", "Breed", "Human Age"]] == ["Bailey", 8, "Chihuahua", 56], "Fourth row defined incorrectly"
print("Test passed!")

Test passed!
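
Multiplying a column by a number applies the operation to every row at once. As a sketch of an alternative to in-place assignment, `DataFrame.assign` returns a new DataFrame with the extra column and leaves the original untouched. The names `df_example`, `DOG_YEAR_FACTOR`, and `with_human_age` are hypothetical.

```python
import pandas as pd

# Hypothetical two-row stand-in for the lab's DataFrame.
df_example = pd.DataFrame({"Name": ["Buddy", "Harley"], "Age": [3, 10]})

DOG_YEAR_FACTOR = 7  # conversion factor assumed by the exercise

# Element-wise arithmetic: every value in the Age column is multiplied by 7.
human_age = df_example["Age"] * DOG_YEAR_FACTOR

# assign() takes column names as keyword arguments; because "Human Age"
# contains a space, it is passed through a dictionary unpacked with **.
with_human_age = df_example.assign(**{"Human Age": human_age})

print(with_human_age)
```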


#### Problem 5: Extract a value

Programmatically extract Buddy's **`Breed`** from the DataFrame and assign it to the given **`breed`** variable.

In [0]:
breed = df.loc[0, "Breed"]
breed

'Australian Shepherd'

**Check your work by running the cell below**

In [0]:
assert breed == "Australian Shepherd", "Breed is not defined correctly"
print("Test passed!")

Test passed!
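
Looking the value up by row position works here because Buddy is in the first row. A sketch of a lookup that does not depend on row order filters on the `Name` column instead. The names `df_example`, `is_buddy`, and `buddy_breed` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in where Buddy is deliberately not the first row.
df_example = pd.DataFrame(
    {
        "Name": ["Harley", "Buddy"],
        "Breed": ["Labrador", "Australian Shepherd"],
    }
)

# Boolean mask: True for rows whose Name is "Buddy".
is_buddy = df_example["Name"] == "Buddy"

# loc with a mask and a column label returns a Series of matching values;
# iloc[0] takes the first match as a plain Python value.
buddy_breed = df_example.loc[is_buddy, "Breed"].iloc[0]

print(buddy_breed)
```

This sketch assumes at least one row matches. If no row matched, `iloc[0]` would raise an `IndexError`.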


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>