# Visualizations

The three most popular visualization libraries in Python are:
- `matplotlib`: old, but powerful, not the most convenient to use
- `seaborn`: very nice plots out of the box, often used for publications, built upon `matplotlib`
- `plotly`: newest library, interactive plots, easy syntax, often used for analytics

We will be using `plotly` in this course, more specifically `plotly.express` as the most convenient option.

In [39]:
import plotly.express as px
import pandas as pd
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.colors import n_colors
import numpy as np

In [40]:
print(pd.__version__)

2.3.3


## Scatter Plots

Let's start with scatter plots using the `iris` dataset as an example: https://en.wikipedia.org/wiki/Iris_flower_data_set  

**Tasks:**
1. Load the Iris data set by executing the following cell. Then display the DataFrame by writing `df` as the last (or only) line of a cell and executing it.

In [41]:
df = px.data.iris()

In [42]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,3
146,6.3,2.5,5.0,1.9,virginica,3
147,6.5,3.0,5.2,2.0,virginica,3
148,6.2,3.4,5.4,2.3,virginica,3


Sepals and petals are different kinds of leaves of the flower. What we see here are the measured lengths and widths of different iris flowers belonging to 3 different species.

2. Visualize the `sepal_width` on the x-axis and the `sepal_length` on the y-axis, by completing the following code:

In [None]:
# Disabled alternative: a scatter matrix of the two sepal columns
# px.scatter_matrix(
#     df,
#     dimensions=["sepal_length", "sepal_width"],
#     color="species",
# )

In [None]:
fig = px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
)
fig.show()

3. Add a color template and adjust the width and the height of the plot:

In [None]:
px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    template="simple_white",  # a simple color template without grid lines; you could e.g. also use "plotly_white" if you prefer grid lines
    width=800,  # the width in pixels; try out 800
    height=500,  # the height in pixels; try out 500
)

4. Color the markers by their respective iris species:

In [46]:
px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",  # column name that gives the categories to be colored
    template="simple_white",
    width=800,
    height=500,
)

5. Add the `petal_length` to be displayed as marker size:

In [47]:
px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    size="petal_length",  # column name that gives the values for the marker size
    template="simple_white",
    width=800,
    height=500,
)

6. There is one more property that we have not included in the plot so far. Include `petal_width` so that it is shown when hovering over the data points:

In [None]:
px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    size="petal_length",
    hover_data=[
        "petal_width",
    ],  # list of column names to be displayed when hovering over data points, in addition to those already shown
    template="simple_white",
    width=800,
    height=500,
)

7. Add a trendline using `"lowess"`, which stands for "Locally WEighted Scatterplot Smoothing" (LOWESS). Trendlines require the `statsmodels` package to be installed. Check out https://plotly.com/python/linear-fits/ for more trendline options.

In [None]:
px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    size="petal_length",
    hover_data=["petal_width"],
    trendline="lowess",
    # trendline="ols",  # alternative: ordinary least squares regression line
    template="simple_white",
    width=800,
    height=500,
)

8. Lastly, give the plot a title and set the labels with which the columns are displayed. Use the labels in the following table:
| Column Name | Label |
| ----------- | ----------- |
| "sepal_length"  | "Sepal Length [cm]" |
| "sepal_width" | "Sepal Width [cm]" |
| "petal_length" | "Petal Length [cm]" |
| "petal_width" | "Petal Width [cm]" |
| "species" | "IRIS Species" |

In [None]:
px.scatter(
    df,
    x="sepal_width",
    y="sepal_length",
    color="species",
    size="petal_length",
    hover_data=["petal_width"],
    trendline="lowess",
    template="simple_white",
    width=800,
    height=500,
    title="My Plot",  # title to be displayed
    labels={
        "sepal_length": "Sepal Length [cm]",
        "sepal_width": "Sepal Width [cm]",
        "petal_length": "Petal Length [cm]",
        "petal_width": "Petal Width [cm]",
        "species": "IRIS Species",
    },  # a dictionary that maps each column name to the label that should be displayed
)

## Histograms

Let's now use a different data set that contains recorded tip amounts along with several other properties of each restaurant visit.

**Tasks:**
1. Load the Tips data set by executing the following cell. Then display the DataFrame by writing `df` as the last (or only) line of a cell and executing it.

In [51]:
df = px.data.tips()

In [52]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


2. Create a histogram that shows how often each `total_bill` amount occurred in the data set:  
   *Note:* Since histograms display counts (or percentages) on the y-axis, we don't need to specify a y-axis from the dataset.

In [53]:
px.histogram(
    df,
    x="total_bill",  # column name that gives the x-values
)
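
What a histogram with only an `x` column computes can be sketched by hand: the value range is split into equal-width bins and the values in each bin are counted. The snippet below is a minimal illustration in plain Python. The bill amounts and the bin width of 10 are made-up example values, not taken from the Tips data set, and plotly picks its own bin edges.

```python
from collections import Counter

# Illustrative bill amounts (hypothetical values, not the Tips data)
bills = [7.5, 12.0, 14.3, 18.9, 21.0, 23.7, 24.6, 31.2, 48.3]
bin_width = 10  # assumed bin width for this sketch

# Map every value to the lower edge of its bin, then count per bin
counts = Counter(int(b // bin_width) * bin_width for b in bills)

for lower in sorted(counts):
    print(f"[{lower}, {lower + bin_width}): {counts[lower]}")
```

The counts per bin are exactly what ends up on the y-axis, which is why no y column has to be specified.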

3. Adjust the number of bins, as well as the width, height and template to suitable values:

In [54]:
px.histogram(
    df,
    x="total_bill",
    nbins=60,  # this specifies the maximum number of bins (not the exact number - plotly tries to round to nice numbers)
    width=1200,
    height=1000,
    template="simple_white",
)

4. So far we can only see the distribution of a single variable. Let's add some color to see some effects. Use `color` to color the histogram by sex. Use `barmode` and try out three different options: `stack`, `group`, `overlay`. What are the use-cases for the different options?  
*Note:* When using `overlay`, you can also add the option `opacity` and give it a value between 0 and 1 to adjust how transparent the plots are.

In [55]:
px.histogram(
    df,
    x="total_bill",
    color="sex",
    barmode="group",  # choice of "stack", "group" or "overlay"
    nbins=10,
    template="simple_white",
    width=500,
    height=400,
)

5. Another way to display distributions is the box plot. We can easily add a box plot to the histogram by setting `marginal` to `box`. Feel free to try out the other options as well:

In [56]:
px.histogram(
    df,
    x="total_bill",
    color="sex",
    marginal="box",  # choice of "rug", "box", "violin", "histogram"
    barmode="overlay",
    nbins=10,
    opacity=0.8,
    template="simple_white",
    width=500,
    height=400,
)

6. When comparing distributions we are often not only interested in absolute amounts, but also in relative differences. To visualize those better, let's create a 100% stacked histogram: Set `barnorm` (careful: not `barmode`) to `percent` and observe what changes in the histogram. Which information do you gain, which do you lose?

In [None]:
px.histogram(
    df,
    x="total_bill",
    color="sex",
    barnorm="percent",
    nbins=10,
    template="simple_white",
    width=500,
    height=400,
)
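
The normalization behind a 100% stacked histogram can be sketched in a few lines: within every bin, each group's count is divided by the bin total. The counts below are hypothetical numbers chosen for illustration, not values from the Tips data set.

```python
# Illustrative counts per bill bin and sex (hypothetical numbers)
counts = {
    "0-10": {"Female": 6, "Male": 4},
    "10-20": {"Female": 30, "Male": 60},
    "20-30": {"Female": 10, "Male": 40},
}

percentages = {}
for bin_label, by_sex in counts.items():
    total = sum(by_sex.values())
    # Each group's share of the bin; the shares of one bin add up to 100
    percentages[bin_label] = {sex: 100 * n / total for sex, n in by_sex.items()}

for bin_label, shares in percentages.items():
    print(bin_label, {sex: round(share, 1) for sex, share in shares.items()})
```

In this sketch the bin with 10 bills and the bin with 90 bills both fill the full height, so the relative shares become easy to compare while the absolute counts are no longer visible.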

7. Let's check if there is a difference in smoking frequency between male- and female-identifying individuals. Add a `pattern_shape` that shows the distribution between smokers and non-smokers.

In [58]:
px.histogram(
    df,
    x="sex",  # we are now using sex for both x and color, just to keep the colorization that we used before (consistency in your plots is important!)
    color="sex",
    pattern_shape="smoker",
    template="simple_white",
    width=400,
    height=400,
)

8. Since we have one category that is overrepresented in the data set, create a 100% stacked histogram again to come to a conclusion.

In [59]:
px.histogram(
    df,
    x="sex",  # we are now using sex for both x and color, just to keep the colorization that we used before (consistency in your plots is important!)
    color="sex",
    pattern_shape="smoker",
    template="simple_white",
    barnorm="percent",  # normalize each bar to 100%
    width=400,
    height=400,
)

9. How much was earned in tips per weekday? Create a histogram that shows the sum of the tips per day.

In [61]:
px.histogram(
    df,
    x="day",
    y="tip",  # by adding a y-value, the histogram will not display the count but the summed values of that column
    template="simple_white",
    width=800,
    height=500,
)

10. Since our days have an order, we would also like our histogram to show them in the proper order. Use `category_orders` to sort the days on your x-axis correctly.

In [62]:
px.histogram(
    df,
    x="day",
    y="tip",
    category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},  # maps a column name to the desired order of its categories
    template="simple_white",
    width=800,
    height=500,
)
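
The aggregation behind the last two plots can be reproduced with pandas: sum the tips per day, then arrange the result in weekday order. This is a minimal sketch with a small made-up DataFrame (the rows are hypothetical, not the Tips data); the variable names `tips`, `day_order` and `tip_sums` are chosen for illustration only.

```python
import pandas as pd

# Illustrative rows (hypothetical values, not the Tips data)
tips = pd.DataFrame(
    {
        "day": ["Sun", "Thur", "Sat", "Thur", "Fri", "Sun"],
        "tip": [1.0, 2.0, 3.5, 1.5, 2.5, 4.0],
    }
)

day_order = ["Thur", "Fri", "Sat", "Sun"]

# Sum of tips per day, then arrange the result in weekday order
tip_sums = tips.groupby("day")["tip"].sum().reindex(day_order)
print(tip_sums)
```

Without the explicit order, the days would appear in whatever order the grouping produces, which is rarely the calendar order.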

11. We already saw one way to set a title and x- and y-axis labels. Here is another one that can be a bit easier to read:

In [63]:
px.histogram(
    df,
    x="day",
    y="tip",
    category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
    template="simple_white",
    width=800,
    height=500,
).update_layout(
    title="Tips per Day",  # add a title
    yaxis_title="Sum of Tips",  # add y-axis label
    xaxis_title=None,  # you can use None if you don't want any axis label
)

12. Add `sex` as `color` again and make it a grouped histogram.

In [None]:
px.histogram(
    df,
    x="day",
    y="tip",
    category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
    template="simple_white",
    width=800,
    height=500,
    color="sex",
    barmode="group",  # show the bars of each sex side by side
).update_layout(
    title="Tips per Day",  # add a title
    yaxis_title="Sum of Tips",  # add y-axis label
    xaxis_title=None,  # you can use None if you don't want any axis label
)

## Box Plots

A box plot shows the distribution of the data around the *median*. The box reaches from the lower quartile (Q1) to the upper quartile (Q3); its height is the interquartile range, IQR = Q3 - Q1. The "whiskers" extend to the most extreme data points that still lie within 1.5 × IQR of the box edges. Data points beyond that range are generally considered outliers w.r.t. the distribution of this variable.
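
The quantities described above can be computed by hand. The snippet below is a minimal sketch using Python's `statistics` module on made-up values (not the Tips data). It uses linear interpolation for the quartiles; plotly's own quartile computation may differ slightly, so treat the numbers as an illustration of the idea rather than a reproduction of the plot.

```python
import statistics

# Illustrative bill amounts (hypothetical values, not the Tips data)
values = sorted([8, 10, 12, 13, 15, 16, 18, 20, 22, 24, 50])

# Quartiles with linear interpolation between neighboring data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences: points beyond them are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = [v for v in values if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers={whiskers}, outliers={outliers}")
```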

**Tasks:**
1. Create a box plot showing the distribution of the `total_bill`.

In [65]:
px.box(
    df,
    y="total_bill",
    template="plotly_white",
    width=900,
    height=700,
).update_layout(
    yaxis_title="Total Bill",
)

2. Add `day` as a second variable using the `color` argument. Also add the proper `category_orders` again.

In [66]:
px.box(
    df,
    y="total_bill",
    color="day",
    category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
    template="simple_white",
    width=600,
    height=500,
).update_layout(
    yaxis_title="Total Bill",
)

3. Use `notched=True` to add a "rough guide of the significance of the difference of medians" to the box plots. For the interpretation of the notches, the plotly library refers to https://en.wikipedia.org/wiki/Box_plot#Variations  
   The notches extend around the median by $\pm \dfrac{1.58\times IQR}{\sqrt{n}}$, where $n$ is the number of data points in the box.

In [67]:
px.box(
    df,
    y="total_bill",
    color="day",
    notched=True,
    category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
    template="simple_white",
    width=600,
    height=500,
).update_layout(
    yaxis_title="Total Bill",
)

4. Let's see how the notches change, when we artificially increase the number of datapoints in our dataset. The following code repeats every datapoint in our dataset 10 times. Repeat the visualization with this new dataset and observe the changes.

In [68]:
df_big = pd.concat([df] * 10, ignore_index=True)
px.box(
    df_big,
    y="total_bill",
    color="day",
    notched=True,
    category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
    template="simple_white",
    width=600,
    height=500,
).update_layout(
    yaxis_title="Total Bill",
)
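
The effect on the notches follows directly from the formula $\pm 1.58 \times IQR / \sqrt{n}$. The sketch below evaluates it for an assumed IQR and group size (both numbers are hypothetical) and for ten times as many data points, under the simplifying assumption that repeating the data leaves the IQR unchanged.

```python
import math


def notch_half_width(iqr, n):
    """Distance from the median to the notch boundary: 1.58 * IQR / sqrt(n)."""
    return 1.58 * iqr / math.sqrt(n)


iqr = 10.0  # assumed interquartile range
n = 60  # assumed number of data points in one box

original = notch_half_width(iqr, n)
repeated = notch_half_width(iqr, 10 * n)  # same data repeated 10 times

print(f"original: +/- {original:.2f}")
print(f"repeated: +/- {repeated:.2f}")
print(f"shrink factor: {original / repeated:.2f}")
```

The notches shrink by a factor of $\sqrt{10}$ although the repeated data contains no new information, so narrow notches on artificially inflated data overstate the evidence for a difference in medians.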

## Distribution Plots

Box plots only make sense when the data is *unimodal*, i.e. it has only one peak, so that the spread of the data around the median gives us a good idea of the data distribution, the IQR and the outliers.

**Tasks:**
1. Create a distribution plot to check if our data is unimodal. The syntax is a bit more involved, so let's start simple. Plot the distribution of the `tip` variable.

In [None]:
# Define histogram data and the names to be shown
hist_data = [
    df["tip"],
]
group_labels = [
    "Tip",
]

# Create the distplot
ff.create_distplot(
    hist_data,
    group_labels,
)
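
The smooth curve in a distribution plot is, by default, a kernel density estimate: every data point contributes a small bell curve, and the sum of these curves is the estimated density. The following plain-Python sketch shows the idea and counts the peaks of the result. The sample values are made up, the helper names `gaussian_kde` and `count_peaks` are hypothetical, and the bandwidth rule (Scott's rule of thumb) is an assumption of this sketch, so the curve will not match plotly's output exactly.

```python
import math
import statistics


def gaussian_kde(sample, grid):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`."""
    n = len(sample)
    # Scott's rule of thumb for the bandwidth (an assumption of this sketch)
    bandwidth = statistics.stdev(sample) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm
        for x in grid
    ]


def count_peaks(density):
    """Count strict local maxima of a curve given as a list of values."""
    return sum(
        1
        for left, mid, right in zip(density, density[1:], density[2:])
        if mid > left and mid > right
    )


# Illustrative sample with two clusters (hypothetical values)
sample = [1, 1.5, 2, 2, 2.5, 3, 8, 8.5, 9, 9, 9.5, 10]
grid = [-6 + 0.05 * i for i in range(461)]  # evaluation points from -6 to 17

density = gaussian_kde(sample, grid)
print("peaks:", count_peaks(density))
```

A sample with one peak would be called unimodal; the two clusters in this sketch produce two peaks, which is the situation in which a box plot hides the structure of the data.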

2. Add our usual stylings using `.update_layout()`:

In [None]:
# Define data and names
hist_data = [
    df["tip"],
]
group_labels = [
    "Tip",
]

# Create the distplot
ff.create_distplot(
    hist_data,
    group_labels,
).update_layout(
    title="Distribution of Tips",  # add a title
    yaxis_title="Density",  # add y-axis label
    xaxis_title="Tip",
    template="simple_white",
    width=1000,
    height=1000,
)

3. Disable the histogram and the "rug" on the bottom to only show the distribution by setting `show_hist` and `show_rug` to `False`.

In [None]:
# Define histogram data and the names to be shown
hist_data = [
    df["tip"],
]
group_labels = [
    "Tip",
]

# Create the distplot
ff.create_distplot(
    hist_data,
    group_labels,
    show_hist=False,
    show_rug=False,
).update_layout(
    title=None,
    xaxis_title="Tip",
    yaxis_title="Density",
    template="simple_white",
    width=600,
    height=500,
)

4. Let's see if the distribution of tips changes between smokers and non-smokers. Check out the following code to do so and try to understand what's going on. Then change the code to show the distribution of the `total_bill` for each `day` to finally be able to check if the distributions are *unimodal*:

In [None]:
# Define histogram data and the names to be shown
hist_data = [
    df.loc[df["smoker"] == value, "tip"] for value in ["No", "Yes"]
]  # creates a list of arrays, holding the data of non-smokers and smokers
group_labels = ["Non-Smoker", "Smoker"]

# Create the distplot
ff.create_distplot(
    hist_data, group_labels, show_hist=False, show_rug=False
).update_layout(
    title=None,
    xaxis_title="Tip",
    yaxis_title="Density",
    template="simple_white",
    width=600,
    height=500,
)

In [None]:
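# Define histogram data and the names to be shown
days = ["Thur", "Fri", "Sat", "Sun"]
hist_data = [
    df.loc[df["day"] == day, "total_bill"] for day in days
]  # creates a list of Series, one with the bills of each day
group_labels = days

# Create the distplot
ff.create_distplot(
    hist_data,
    group_labels,
    show_hist=False,
    show_rug=False,
).update_layout(
    title=None,
    xaxis_title="Total Bill",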
    yaxis_title="Density",
    template="simple_white",
    width=600,
    height=500,
)

## Knowledge Discovery

Assume you are a waiter/waitress at a restaurant. Who would be your favorite customer? Also: there is one more variable that we haven't investigated yet, the size of the party. Does it play a role?  
**Task:** Create as many visualizations as you need to find out the perfect guest or group of guests to serve.

In [105]:
hist_data = [
    df.loc[df["sex"] == value, "tip"] for value in ["Male", "Female"]
]
group_labels = ["Male", "Female"]

# Create the distribution plot
fig = ff.create_distplot(
    hist_data,
    group_labels,
    show_hist=False,  # Focus on the density curve
    show_rug=False
)

# Update layout with more descriptive titles and a clean theme
fig.update_layout(
    title_text="Distribution of Tips by Gender", # It's good practice to have a title
    xaxis_title="Tip Amount ($)", # Adding units for clarity
    yaxis_title="Probability Density", # More explicit than just "Density"
    template="simple_white",
    legend_title_text="Gender",
    width=1200, # Adjusted for better readability
    height=700
)

fig.show()

## Extra: Map of Natural Disasters

To give you an idea of what else is possible, execute the following two code cells. The first one fetches current data on natural disasters from NASA's EONET project (this will take a moment to run). The second code cell renders the events on an interactive globe. Have fun!

In [None]:
import requests


def fetch_all_eonet():
    BASE_URL = "https://eonet.gsfc.nasa.gov/api/v3/events"
    params = {
        "status": "all",  # open + closed events
    }

    r = requests.get(BASE_URL, params=params, timeout=120)  # give up if the server does not answer in time
    r.raise_for_status()
    data = r.json()

    records = []
    for event in data["events"]:
        for category in event["categories"]:
            for geometry in event["geometry"]:
                if geometry["type"] == "Point":
                    record = {
                        "id": event["id"],
                        "title": event["title"],
                        "description": event["description"],
                        "category": category["title"],
                        "date": geometry["date"],
                        "latitude": geometry["coordinates"][1],
                        "longitude": geometry["coordinates"][0],
                        "magnitudeValue": geometry.get("magnitudeValue"),  # None if the key is missing
                        "magnitudeUnit": geometry.get("magnitudeUnit"),
                    }
                    records.append(record)

    df_records = pd.DataFrame(records)

    return df_records


df = fetch_all_eonet()
df = df.loc[df.date.between("2024", "2025")]
print(f"Fetched {len(df)} events — Dates: {df['date'].min()} to {df['date'].max()}")

Fetched 7573 events — Dates: 2024-01-01T00:00:00Z to 2024-12-31T19:00:00Z
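
The date filter in the cell above compares plain strings, not parsed dates. This works because ISO 8601 timestamps sort alphabetically in chronological order, assuming they all share the same format and time zone. A minimal sketch with a few hand-written timestamps:

```python
# Timestamps in the ISO 8601 format seen in the output above
dates = [
    "2023-12-31T23:00:00Z",
    "2024-01-01T00:00:00Z",
    "2024-12-31T19:00:00Z",
    "2025-01-01T00:00:00Z",
]

# Plain string comparison with inclusive bounds, as in between("2024", "2025")
in_2024 = [d for d in dates if "2024" <= d <= "2025"]
print(in_2024)
```

Every timestamp from 2025 is longer than the bound `"2025"` and starts with it, so it sorts after the bound and is filtered out, while every timestamp from 2024 sorts between the two bounds.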


In [None]:
# https://developer.mozilla.org/en-US/docs/Web/CSS/named-color
# https://plotly.com/python/map-configuration/
# https://plotly.com/python-api-reference/generated/plotly.express.scatter_geo
# https://plotly.com/python/reference/layout/geo/

px.scatter_geo(
    df,
    lat="latitude",
    lon="longitude",
    projection="orthographic",
    hover_data=["category", "date"],
    color="category",
    width=1200,
    height=800,
).update_traces(
    # mode='markers', marker=dict(size=4, color='crimson'),
).update_geos(
    showland=True,
    landcolor="wheat",
    showocean=True,
    oceancolor="dodgerblue",
    showlakes=True,
    lakecolor="lightskyblue",
    showcountries=True,
    countrycolor="gray",
    showrivers=True,
    rivercolor="lightskyblue",
)