
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Data Visualization


In this lesson, you will explore data visualization of **`pandas`** DataFrames using:
- Databricks built-in plotting
- **`pandas`** plotting methods
- **`seaborn`** plotting functionality

Let's import **`pandas`** and load our Airbnb dataset.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.
Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.
1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:
    - In the drop-down, select **More**.
    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.
**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:
1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.
1. Find the triangle icon to the right of your compute cluster name and click it.
1. Wait a few minutes for the cluster to start.
1. Once the cluster is running, complete the steps above to select your cluster.

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
import pandas as pd

In [0]:
file_path = f"{DA.paths.datasets}/sf-airbnb/sf-airbnb.csv".replace("dbfs:", "/dbfs")
df = pd.read_csv(file_path)
df.head(3)

## Built-in Plotting

Databricks provides built-in data visualization tools that we can use in a Databricks notebook.

To use them, we call the built-in **`display()`** function that Databricks provides, passing it a pandas DataFrame.

In [0]:
display(df)


## Plot Options

To create a different type of plot, click the **`+`** icon in the result below.
From there, we can specify the kind of plot we want and which columns of the DataFrame to use.

For example, suppose we want to view the average number of bedrooms per neighborhood.

To do this:

1. Click the **`+`** icon and select Visualization.
2. Set the visualization type to **Bar**.
3. For the **X column**, select **`neighborhood`**.
4. Click Add Column for the **Y column** and select **`bedrooms`**.
5. Change the aggregate function to **Average**.

In [0]:
display(df)

Note that the visualization initially shows only a preview of the first 1000 rows, but when we click **`apply`** it runs on all of them.
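
The same aggregation can also be expressed in code. The following is a minimal sketch, assuming a small hypothetical DataFrame (`sample_df`, with made-up values) that stands in for the Airbnb data and has `neighborhood` and `bedrooms` columns:

```python
import pandas as pd

# Hypothetical sample standing in for the Airbnb DataFrame (values are made up).
sample_df = pd.DataFrame({
    "neighborhood": ["Mission", "Mission", "Marina", "Marina", "Marina"],
    "bedrooms": [1.0, 3.0, 2.0, 2.0, 5.0],
})

# Group by the X column and average the Y column,
# mirroring the Bar chart settings chosen in the steps above.
avg_bedrooms = sample_df.groupby("neighborhood")["bedrooms"].mean()
print(avg_bedrooms)
```

Calling `avg_bedrooms.plot.bar()` on the result should then draw a comparable bar chart with pandas.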

## Pandas Plotting

**`pandas`** also provides some plotting functionality. 

We can create a histogram using the **`hist()`** method on a **`Series`**.

Let's create a histogram of the number of bedrooms:

In [0]:
df["bedrooms"].hist()

We can also specify the number of bins by passing an argument to the **`bins`** parameter.

In [0]:
df["bedrooms"].hist(bins=20)
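
To see what the number of bins changes, it can help to look at the counts behind a histogram. The sketch below uses `numpy.histogram` on a small hypothetical Series (made-up values, not the Airbnb data). It illustrates equal-width binning in general and is not necessarily the exact code path pandas takes when plotting:

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts (made-up values, not the Airbnb data).
bedrooms = pd.Series([0, 1, 1, 1, 2, 2, 3, 4])

# Split the value range [0, 4] into 4 equal-width bins and count values per bin.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, edges = np.histogram(bedrooms.to_numpy(), bins=4)
print(counts)  # number of values in each bin
print(edges)   # the 5 bin edges that delimit the 4 bins
```

More bins give narrower intervals and a finer-grained picture; fewer bins smooth the distribution out.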

We can also create box plots with pandas.

We call the **`boxplot([cols])`** method on a **`DataFrame`** to create a box plot for each of the specified columns:

In [0]:
df.boxplot(["bedrooms", "bathrooms"])
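
A box plot summarizes each column by its quartiles: the box spans the first to the third quartile, with a line at the median. As a rough sketch, the same summary can be computed with `DataFrame.quantile()`. The `sample_df` below is hypothetical, with made-up values:

```python
import pandas as pd

# Hypothetical sample standing in for the Airbnb DataFrame (values are made up).
sample_df = pd.DataFrame({
    "bedrooms": [1.0, 1.0, 2.0, 3.0, 5.0],
    "bathrooms": [1.0, 1.0, 1.5, 2.0, 3.0],
})

# First quartile, median, and third quartile for each column.
quartiles = sample_df[["bedrooms", "bathrooms"]].quantile([0.25, 0.5, 0.75])
print(quartiles)
```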

# Seaborn

[seaborn](https://seaborn.pydata.org/) is a very popular data visualization library that works with pandas DataFrames. 

It is popular both for being relatively easy to use and for producing nice-looking visualizations.

Let's import **`seaborn`**: it is common practice to use **`sns`** as the alias.

In [0]:
import seaborn as sns

## Scatter plot

Let's first create a scatter plot. We'll plot **`bedrooms`** on the x-axis and **`bathrooms`** on the y-axis.

To do this, we call **`sns.scatterplot(data=, x=, y=)`**.

We pass a **`DataFrame`** as the **`data`** parameter, and the names of the columns we want as the **`x`** and **`y`** parameters.

In [0]:
sns.scatterplot(data=df, x="bedrooms", y="bathrooms")

You might also want to plot a line of best fit for the scatter plot. 

We can do this by passing the same parameters to the **`regplot()`** function:

In [0]:
sns.regplot(data=df, x="bedrooms", y="bathrooms")
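
By default, **`regplot()`** draws a linear regression fit over the scatter points. To inspect the slope and intercept of such a line numerically, one option is a degree-1 least-squares fit with `numpy.polyfit`. This is a sketch on a hypothetical `sample_df` with made-up values, and it is not a claim about how seaborn computes its fit internally:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the Airbnb DataFrame (values are made up).
sample_df = pd.DataFrame({
    "bedrooms": [1.0, 2.0, 3.0, 4.0],
    "bathrooms": [1.0, 1.5, 2.0, 2.5],
})

# Degree-1 least-squares fit: bathrooms is approximated by slope * bedrooms + intercept.
slope, intercept = np.polyfit(
    sample_df["bedrooms"].to_numpy(),
    sample_df["bathrooms"].to_numpy(),
    deg=1,
)
print(slope, intercept)
```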

&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>