# More Plotly Visualizations


▶️ Import the following Python packages.

1. `pandas`: Use alias `pd`.
2. `numpy`: Use alias `np`.
3. `plotly.express`: Use alias `px`.
4. `plotly.graph_objects`: Use alias `go`.


In [1]:
import pandas as pd
import numpy as np

import plotly.graph_objects as go
import plotly.express as px

---

### 📌 Import dataset

![BLUEbikes](https://github.com/bdi475/images/blob/main/lecture-notes/dataviz-python/bluebike-transparent-bike.png?raw=true)

Today, we work with a bikesharing trips dataset 🚲 to uncover insights about trips made by subscribers and casual riders of Bluebikes in Boston. The original dataset was downloaded from [https://www.bluebikes.com/system-data](https://www.bluebikes.com/system-data) and preprocessed for this exercise.


▶️ Import the dataset. This sampled dataset is fairly large with ~200,000 rows, **so it may take up to a few minutes**.


In [2]:
# Display all columns
pd.set_option("display.max_columns", 50)

df_trips = pd.read_csv(
    "https://github.com/bdi475/datasets/blob/main/bluebikes-trip-data-2020-sampled.csv.gz?raw=true",
    compression="gzip",
    parse_dates=["start_time", "stop_time"],
)

df_trips_backup = df_trips.copy()

display(df_trips)

Unnamed: 0,trip_duration,start_station_name,end_station_name,user_type,start_time,stop_time
0,344,175 N Harvard St,Harvard Kennedy School at Bennett St / Eliot St,Customer,2020-08-06 16:52:57,2020-08-06 16:58:42
1,451,Harvard University River Houses at DeWolfe St ...,175 N Harvard St,Subscriber,2020-08-04 10:43:49,2020-08-04 10:51:21
2,750,Prudential Center - 101 Huntington Ave,Central Sq Post Office / Cambridge City Hall a...,Subscriber,2020-01-19 18:44:08,2020-01-19 18:56:38
3,1422,Cambridge St - at Columbia St / Webster Ave,Mass Ave T Station,Customer,2020-08-15 18:11:50,2020-08-15 18:35:32
4,3405,Mattapan T Stop,Mattapan T Stop,Subscriber,2020-01-11 12:51:52,2020-01-11 13:48:37
...,...,...,...,...,...,...
199995,255,MIT Vassar St,MIT Stata Center at Vassar St / Main St,Subscriber,2020-03-04 15:54:25,2020-03-04 15:58:41
199996,1451,Farragut Rd at E. 6th St,Aquarium T Stop - 200 Atlantic Ave,Subscriber,2020-07-07 20:20:51,2020-07-07 20:45:03
199997,454,Harvard Square at Mass Ave/ Dunster,Lesley University,Subscriber,2020-09-21 15:33:06,2020-09-21 15:40:41
199998,322,Harvard Law School at Mass Ave / Jarvis St,Verizon Innovation Hub 10 Ware Street,Subscriber,2020-02-05 17:07:50,2020-02-05 17:13:13


---

## 📦 Box plots and histograms review


▶️ Create a box plot of the trip duration for trips shorter than 30 minutes (1,800 seconds).


In [3]:
fig = px.box(
    df_trips[df_trips["trip_duration"] < 1800],
    x="trip_duration",
    title="Trip Duration in Seconds (for trips shorter than 30 minutes)",
)
fig.show()
# YOUR CODE ENDS
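
As a refresher on what the box encodes, here is a minimal sketch that computes the quartiles and the common 1.5 × IQR fences with pandas. The durations are made-up illustrative values, not rows from the Bluebikes data, and Plotly's quartile method may differ slightly from pandas' default interpolation.

```python
import pandas as pd

# Hypothetical trip durations in seconds (illustrative values only)
durations = pd.Series([120, 300, 420, 480, 600, 660, 900, 1200, 1500, 1740])

# The box spans Q1 to Q3, with a line at the median
q1 = durations.quantile(0.25)
median = durations.quantile(0.50)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Tukey-style fences: points outside them are usually drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}], outliers: {len(outliers)}")
```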

▶️ Create a box plot of the trip duration by user type.


In [4]:
fig = px.box(
    df_trips,
    x="trip_duration",
    y="user_type",
    title="Trip Duration in Seconds by User Type",
)
fig.show()
# YOUR CODE ENDS

▶️ Create a histogram of the trip duration with 36 bins.


In [5]:
fig = px.histogram(
    df_trips, x="trip_duration", title="Trip Duration Distribution", nbins=36
)
fig.show()
# YOUR CODE ENDS
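
To make the binning concrete, the sketch below counts values into 36 equal-width bins with NumPy. The exponential sample is an assumption standing in for `trip_duration`. Plotly's reference describes the underlying bin-count setting as a maximum, so the chart may end up with fewer, rounder bins than the ones computed here; treat that as something to verify against your Plotly version.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for trip durations in seconds (not the real data)
durations = rng.exponential(scale=900, size=1_000)

# Split the observed range into 36 equal-width bins and count values in each
counts, bin_edges = np.histogram(durations, bins=36)

bin_width = bin_edges[1] - bin_edges[0]
print(f"{len(counts)} bins, each about {bin_width:.1f} seconds wide")
print(f"Tallest bin holds {counts.max()} of {counts.sum()} values")
```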

---

## 📈 Line chart

A line chart is used to visualize data points connected by lines. It is particularly useful for showing trends over time or continuous data.


▶️ Import a gold price dataset for a simple demo.


In [6]:
df_gold = pd.read_csv(
    "https://github.com/bdi475/datasets/raw/main/gold-annual-closing-price.csv"
)
df_gold

Unnamed: 0,Year,Closing Price
0,1950,34.720
1,1951,34.660
2,1952,34.790
3,1953,34.850
4,1954,35.040
...,...,...
66,2016,1152.165
67,2017,1265.674
68,2018,1249.887
69,2019,1480.025


▶️ Create the line chart.


In [7]:
fig = px.line(
    df_gold, x="Year", y="Closing Price", title="Annual Closing Price of Gold"
)
fig.show()

▶️ Create an aggregated DataFrame with the number of trips by date.


In [8]:
df_num_trips_by_date = df_trips.groupby(
    df_trips["start_time"].dt.date, as_index=False
).size()

df_num_trips_by_date.rename(
    columns={"start_time": "date", "size": "num_trips"}, inplace=True
)

display(df_num_trips_by_date)

Unnamed: 0,date,num_trips
0,2020-01-01,156
1,2020-01-02,364
2,2020-01-03,421
3,2020-01-04,160
4,2020-01-05,214
...,...,...
330,2020-11-26,153
331,2020-11-27,411
332,2020-11-28,429
333,2020-11-29,443
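
An alternative sketch of the same aggregation names the grouping key up front, which avoids the separate `rename()` step. It runs on a tiny made-up DataFrame (`df_demo` and its rows are illustrative), and `reset_index(name=...)` labels the count column.

```python
import pandas as pd

# Tiny illustrative stand-in for df_trips
df_demo = pd.DataFrame(
    {
        "start_time": pd.to_datetime(
            [
                "2020-01-01 08:00:00",
                "2020-01-01 17:30:00",
                "2020-01-02 09:15:00",
            ]
        )
    }
)

# Rename the grouping Series so the key column is called "date" directly
date_key = df_demo["start_time"].dt.date.rename("date")

# size() counts rows per group; reset_index(name=...) names the count column
df_demo_by_date = df_demo.groupby(date_key).size().reset_index(name="num_trips")

print(df_demo_by_date)
```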


▶️ Create a line chart that displays the number of trips by date.


In [9]:
fig = px.line(
    df_num_trips_by_date,
    x="date",
    y="num_trips",
    title="Number of Trips by Date in 2020",
)
fig.show()
# YOUR CODE ENDS

▶️ Create a scatter plot that displays the number of trips by date.


In [10]:
fig = px.scatter(
    df_num_trips_by_date,
    x="date",
    y="num_trips",
    title="Number of Trips by Date in 2020",
)
fig.show()
# YOUR CODE ENDS

:::{tip} `px.line()` vs. `px.scatter()`

Did you notice that changing the line chart to a scatter plot is as simple as changing `px.line()` to `px.scatter()`? Both functions share similar parameters, making it easy to switch between line and scatter plots based on your visualization needs.

This is what makes Plotly Express so versatile and user-friendly! The ability to easily switch between different types of plots allows you to explore your data from various perspectives without needing to learn new syntax for each plot type.

:::


▶️ Create an aggregated DataFrame with the number of trips by date and user type.


In [11]:
df_num_trips_by_date_and_user_type = df_trips.groupby(
    [df_trips["start_time"].dt.date, "user_type"], as_index=False
).size()

df_num_trips_by_date_and_user_type.rename(
    columns={"start_time": "date", "size": "num_trips"}, inplace=True
)

display(df_num_trips_by_date_and_user_type)
# YOUR CODE ENDS

Unnamed: 0,date,user_type,num_trips
0,2020-01-01,Customer,36
1,2020-01-01,Subscriber,120
2,2020-01-02,Customer,38
3,2020-01-02,Subscriber,326
4,2020-01-03,Customer,49
...,...,...,...
665,2020-11-28,Subscriber,312
666,2020-11-29,Customer,115
667,2020-11-29,Subscriber,328
668,2020-11-30,Customer,18


▶️ Create a line chart that displays the number of trips by date and user type.


In [12]:
fig = px.line(
    df_num_trips_by_date_and_user_type,
    x="date",
    y="num_trips",
    color="user_type",
    title="Number of Trips by Date and User Type in 2020",
)
fig.show()
# YOUR CODE ENDS

▶️ Create a scatter plot that displays the number of trips by date and user type.


In [13]:
fig = px.scatter(
    df_num_trips_by_date_and_user_type,
    x="date",
    y="num_trips",
    color="user_type",
    title="Number of Trips by Date and User Type in 2020",
)
fig.show()
# YOUR CODE ENDS

---

## 📊 Bar chart

A bar chart is used to represent categorical data with rectangular bars. The length of each bar is proportional to the value it represents. Bar charts are useful for comparing different categories or groups.

There are two types of bar charts: vertical and horizontal. In vertical bar charts, the bars extend vertically from the x-axis, while in horizontal bar charts, the bars extend horizontally from the y-axis. Vertical bar charts are sometimes called column charts.


---

▶️ Create an aggregated DataFrame with the number of trips by month.


In [14]:
df_num_trips_by_month = df_trips.groupby(
    df_trips["start_time"].dt.month, as_index=False
).size()

df_num_trips_by_month.rename(
    columns={"start_time": "month", "size": "num_trips"}, inplace=True
)

display(df_num_trips_by_month)
# YOUR CODE ENDS

Unnamed: 0,month,num_trips
0,1,13256
1,2,13850
2,3,11010
3,4,4398
4,5,11553
5,6,18624
6,7,25870
7,8,28652
8,9,30937
9,10,25324


▶️ Create a bar chart that displays the number of trips by month.


In [15]:
fig = px.bar(
    df_num_trips_by_month,
    x="month",
    y="num_trips",
    title="Number of Trips by Month in 2020",
)
fig.show()
# YOUR CODE ENDS

---

## More Types of Visualizations with Plotly


▶️ Import the Chicago Airbnb listings dataset.

We will use this dataset to create other types of charts.


In [16]:
df_listings = pd.read_csv(
    "https://github.com/bdi475/datasets/raw/main/case-studies/airbnb-sql/Chicago.csv"
)
df_listings_backup = df_listings.copy()
df_listings.head(3)

Unnamed: 0,name,neighbourhood,room_type,bedrooms,bathrooms,accommodates,minimum_nights,price,availability_365,number_of_reviews,review_score,latitude,longitude,is_superhost
0,"Hyde Park - Walk to UChicago, 10 min to McCormick",Hyde Park,Private room,1.0,1.0,1,2,65.0,355,181,100.0,41.7879,-87.5878,1
1,394 Great Reviews. 127 y/o House. 40 yds to tr...,South Lawndale,Entire home/apt,3.0,1.0,7,2,117.0,184,395,96.0,41.85495,-87.69696,1
2,Tiny Studio Apartment 94 Walk Score,West Town,Entire home/apt,3.0,1.0,2,2,70.0,365,389,93.0,41.90289,-87.68182,1


▶️ Sample 100 listings with price under $200.


In [17]:
df_under_200_sample = df_listings[df_listings["price"] < 200].sample(100)
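
Note that `sample(100)` draws a different random subset on every run, so the charts that follow will vary. If you want repeatable results, one option is to pass `random_state`. The sketch below shows this on a small made-up DataFrame; the seed `42` is an arbitrary choice.

```python
import pandas as pd

# Illustrative stand-in for df_listings (prices are made up)
df_demo = pd.DataFrame({"price": [65, 117, 70, 250, 180, 95, 310, 45]})

under_200 = df_demo[df_demo["price"] < 200]

# The same seed selects the same rows on every run
sample_a = under_200.sample(3, random_state=42)
sample_b = under_200.sample(3, random_state=42)

print(sample_a.equals(sample_b))
```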

### 3D scatter plot

A 3D scatter plot is used to visualize data points in three-dimensional space. It allows you to see the relationships between three variables simultaneously. Each point in the plot represents a data point with three coordinates (x, y, z).

A 3D visualization requires interactive capabilities to rotate and explore the data from different angles, which Plotly provides.


▶️ Create a 3D scatter plot with the following axes:

- `x`: Number of bedrooms
- `y`: Number of bathrooms
- `z`: Price


In [18]:
fig = px.scatter_3d(
    df_under_200_sample,
    title="Bedrooms, Bathrooms, Price 3D Scatter Plot",
    x="bedrooms",
    y="bathrooms",
    z="price",
    color="room_type",
    template="plotly_dark",
    width=800,
    height=600,
)
fig.show()
# YOUR CODE ENDS

:::{danger} Avoid using 3D charts

A reminder: 3D charts are generally hard to read and interpret. If you need to show multiple dimensions, consider using color, size, or facets instead of adding a third spatial dimension.

:::


▶️ Find the top 20 neighbourhoods by number of listings.


In [19]:
top_20_neighbourhoods = (
    df_listings["neighbourhood"].value_counts().head(20).index.tolist()
)

top_20_neighbourhoods

['West Town',
 'Lake View',
 'Logan Square',
 'Near North Side',
 'Lincoln Park',
 'Near West Side',
 'Lower West Side',
 'Edgewater',
 'Uptown',
 'North Center',
 'Irving Park',
 'Loop',
 'Avondale',
 'Rogers Park',
 'Near South Side',
 'Bridgeport',
 'Lincoln Square',
 'Grand Boulevard',
 'Hyde Park',
 'Armour Square']

▶️ Filter listings in the top 20 neighbourhoods with a price under $300.


In [20]:
df_filtered = df_listings[
    (df_listings["neighbourhood"].isin(top_20_neighbourhoods))
    & (df_listings["price"] < 300)
]

### Pie chart

A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions. Each slice represents a category's contribution to the whole, making it easy to see relative sizes at a glance. Pie charts are best used when you want to show parts of a whole and compare proportions among categories.

Some best practices for using pie charts effectively include:

- Limiting the number of slices to avoid clutter,
- Using clear labels or a legend to identify each slice,
- Avoiding 3D effects or excessive decoration that can distract from the data.
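
One way to limit the number of slices is to keep the largest categories and lump the rest into an "Other" group before plotting. The sketch below does this with pandas on made-up counts; `top_n` is a hypothetical cutoff you would tune for your own chart.

```python
import pandas as pd

# Illustrative neighbourhood labels (counts are made up)
neighbourhoods = pd.Series(
    ["West Town"] * 5
    + ["Lake View"] * 4
    + ["Logan Square"] * 3
    + ["Uptown"] * 2
    + ["Loop"] * 1
)

top_n = 3  # hypothetical cutoff
counts = neighbourhoods.value_counts()
top_names = counts.head(top_n).index

# Keep the top categories and relabel everything else as "Other"
lumped = neighbourhoods.where(neighbourhoods.isin(top_names), "Other")
slice_counts = lumped.value_counts()

print(slice_counts)
```

The resulting `slice_counts` has four slices instead of five; with real data the reduction is usually much larger.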


▶️ Create a pie chart that shows the distribution of listings across the top 20 neighbourhoods.


In [21]:
fig = px.pie(
    df_filtered,
    names="neighbourhood",
    title="Neighbourhood breakdown",
    width=800,
    height=700,
)

fig.show()
# YOUR CODE ENDS

▶️ Find the aggregated statistics by neighbourhood and room type.


In [22]:
df_by_neighbourhood_room_type = (
    df_filtered.groupby(["neighbourhood", "room_type"], as_index=False)
    .agg(
        {
            "name": "count",
            "bedrooms": "mean",
            "bathrooms": "mean",
            "accommodates": "mean",
            "price": "mean",
        }
    )
    .rename(columns={"name": "num_listings"})
)

display(df_by_neighbourhood_room_type.head(5))

Unnamed: 0,neighbourhood,room_type,num_listings,bedrooms,bathrooms,accommodates,price
0,Armour Square,Entire home/apt,22,3.090909,1.545455,9.0,167.454545
1,Armour Square,Private room,25,1.0,1.76,1.92,44.48
2,Avondale,Entire home/apt,59,1.932203,1.152542,4.932203,92.084746
3,Avondale,Private room,9,1.111111,1.055556,1.666667,70.888889
4,Avondale,Shared room,1,1.0,1.0,1.0,30.0
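
The dictionary form of `.agg()` above needs a follow-up `rename()`. As an alternative sketch, pandas' named aggregation sets the output column names directly. The rows below are made up for illustration.

```python
import pandas as pd

# Illustrative stand-in for df_filtered
df_demo = pd.DataFrame(
    {
        "neighbourhood": ["Avondale", "Avondale", "Loop"],
        "room_type": ["Private room", "Private room", "Entire home/apt"],
        "name": ["A", "B", "C"],
        "price": [60.0, 80.0, 150.0],
    }
)

# Each keyword is an output column: (input column, aggregation function)
df_summary = df_demo.groupby(["neighbourhood", "room_type"], as_index=False).agg(
    num_listings=("name", "count"),
    price=("price", "mean"),
)

print(df_summary)
```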


### Treemap chart

A treemap is a visualization that displays hierarchical data using nested rectangles. Each rectangle represents a category or subcategory, and its size is proportional to a specific value, such as count or sum. Treemaps are useful for visualizing large datasets with multiple levels of hierarchy, allowing you to see the relative sizes of different categories at a glance.


▶️ Create a treemap chart that shows the distribution of listings across the top 20 neighbourhoods.


In [23]:
fig = px.treemap(
    df_by_neighbourhood_room_type,
    path=["neighbourhood"],
    title="Top 20 neighbourhoods breakdown",
    values="num_listings",
    height=700,
)

fig.show()
# YOUR CODE ENDS

A treemap can also represent multiple levels of hierarchy. For example, we can visualize both neighbourhoods and room types within each neighbourhood.


▶️ Create a treemap chart that shows the distribution of listings across the top 20 neighbourhoods, broken down by room type.


In [24]:
fig = px.treemap(
    df_by_neighbourhood_room_type,
    path=["neighbourhood", "room_type"],
    title="Top 20 neighbourhoods breakdown",
    values="num_listings",
    height=700,
)

fig.show()
# YOUR CODE ENDS

### Sunburst chart

A sunburst chart is a radial visualization that displays hierarchical data using concentric circles. Each level of the hierarchy is represented by a ring, with the innermost circle representing the root node and outer rings representing child nodes. Sunburst charts are useful for visualizing hierarchical relationships and proportions within a dataset.

Think of a sunburst chart as a circular version of a treemap, where the size of each segment corresponds to a specific value, such as count or sum. Or you can think of it as a pie chart with multiple levels, where each level represents a different layer of the hierarchy.


▶️ Create a sunburst chart that shows the distribution of listings across the top 20 neighbourhoods, broken down by room type.


In [25]:
fig = px.sunburst(
    df_by_neighbourhood_room_type,
    path=["neighbourhood", "room_type"],
    title="Listings Breakdown by Neighbourhood and Room Type",
    values="num_listings",
    width=800,
    height=800,
)

fig.show()
# YOUR CODE ENDS

### Back to the basics

While these advanced visualizations can be powerful, it's essential to remember the importance of clarity and simplicity in data visualization. Always consider your audience and the message you want to convey when choosing the type of chart to use. Sometimes, sticking to basic charts like bar charts or line charts can be more effective in communicating your insights clearly.


▶️ Create a bar chart that shows the distribution of listings across the top 20 neighbourhoods, broken down by room type.


In [26]:
fig = px.bar(
    df_by_neighbourhood_room_type,
    x="num_listings",
    y="neighbourhood",
    color="room_type",
    template="plotly_dark",
    title="Listings Breakdown by Neighbourhood and Room Type",
    height=600,
)

fig.update_yaxes(categoryorder="total ascending")

fig.show()
# YOUR CODE ENDS
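
The `categoryorder="total ascending"` setting orders the neighbourhoods by the sum of their stacked segments. The sketch below reproduces that ordering with pandas on a few rows from the aggregated table, as a way to sanity-check the axis order; it approximates the idea and is not Plotly's internal logic.

```python
import pandas as pd

# A few rows from the aggregated neighbourhood / room type table
df_demo = pd.DataFrame(
    {
        "neighbourhood": [
            "Armour Square",
            "Armour Square",
            "Avondale",
            "Avondale",
            "Avondale",
        ],
        "room_type": [
            "Entire home/apt",
            "Private room",
            "Entire home/apt",
            "Private room",
            "Shared room",
        ],
        "num_listings": [22, 25, 59, 9, 1],
    }
)

# Sum the segments per neighbourhood, then sort from smallest to largest
totals = (
    df_demo.groupby("neighbourhood")["num_listings"].sum().sort_values(ascending=True)
)

print(totals)
```

On a horizontal bar chart the first category sits at the bottom of the y-axis, so ascending order places the neighbourhood with the most listings at the top.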