# Exercise set 7

>The goal of this exercise is to learn how to perform a
>**principal component analysis (PCA)**. We will here focus
>on how we can plot and inspect the scores and loadings,
>and the variance explained by different principal components (PCs).

## Exercise 7.1

In the exercise, we will investigate if we can
"discover" the periodic table from a data set that
contains information on the first 86 elements (periods 1&ndash;6).
The variables present in this data set are described
in Table 1.

|**Column**        | **Description**                                         | **Unit** |
|------------------|---------------------------------------------------------|----------|
|element           | The symbol for the element (e.g. H, He, etc.)           | —        |
|metal             | Classification of the element as a metal (yes) or not (no) | —        |
|mass              | Atomic weight                                           | u        |
|density           | Density of the element                                  | g/cm³    |
|atomic_radius     | Radius of the element                                   | Å        |
|electronegativity | The electronegativity of the element                    | —        |
|first_ionization  | The first ionization energy of the element              |          |
|neutrons          | The number of neutrons in the element                   | —        |
|protons           | The number of protons in the element                    | —        |
|electrons         | The number of electrons in the element                  | —        |
|1s, 2s, 2p, etc.  | The number of electrons in different orbitals           | —        |

**Table 1:** *Data columns present in the file [Data/periodic_table.csv](./Data/periodic_table.csv)*

We will use principal component analysis to investigate the data set,
and in this exercise, we will focus on creating plots for
the scores, loadings, and explained variance.

### 7.1(a)
Begin by loading the data. This can be done with:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

%matplotlib inline
# For interactive use: %matplotlib notebook

sns.set_theme(style="ticks", context="notebook", palette="muted")

data = pd.read_csv("Data/periodic_table.csv")
data.head()

Investigate the correlations between the variables `mass`, `atomic_radius`,
`electronegativity`, `first_ionization`, `neutrons`, `protons`, and `electrons`. Are these variables
correlated as you expect?

In [None]:
# Select the variables to use for correlations
select = [
    "mass",
    "atomic_radius",
    "electronegativity",
    "first_ionization",
    "neutrons",
    "protons",
    "electrons",
]

In [None]:
# Your code here. Hint: Use the .corr() method of a pandas DataFrame.
corr = data[select].corr()
corr.style.background_gradient(cmap="vlag")

#### Your answer to question 7.1(a):

Here are some interesting correlations:

- We have a correlation of 1 between protons and electrons,
  as expected (atoms have the same number of protons and electrons).

- The correlation between the number of protons (or electrons) and the number of neutrons is almost 1:
  Atoms with more protons generally have more neutrons, but the two numbers are not necessarily equal
  (and isotopes of the same element differ in their number of neutrons). If you plot the number of
  neutrons as a function of the number of protons, you will get a nearly straight line (as suggested
  by the high correlation coefficient).

- Further, we have that the mass is also highly positively correlated with the number of
  protons (or electrons or neutrons), as expected since the number of particles determines the mass.
  
- The number of particles (protons, electrons, or neutrons) is not strongly correlated with
  radius, electronegativity, or the first ionization energy. There is some correlation here,
  but the number of protons does not solely determine properties like the radius. We know
  that the radius mainly increases with the number of protons when we go down the rows in the
  periodic table, but it decreases with the number of protons when we move across one row
  (e.g., compare F and B).
  
- The electronegativity and the first ionization energy are positively correlated.
  The ionization energy is the energy needed to remove an electron from a neutral
  atom. Electronegativity quantifies how "well" an atom attracts shared electrons in a
  chemical bond. These two quantities reflect the same underlying properties (the electron
  configuration and the forces between the nucleus and the electrons), so we expect
  them to be positively correlated.
  
- The atomic radius is negatively correlated with both the electronegativity and the first ionization
  energy. If we simplify things, the ionization energy is determined by the Coulomb force between the
  nucleus and the outermost electron. Since this drops with increasing radius, we expect a
  negative correlation with the atomic radius. We can use the same reasoning for the
  electronegativity; the "pulling" force on extra electrons will increase with a smaller
  radius (distance), so we also expect a negative correlation here.
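
The near-perfect correlation between protons and neutrons can be checked on a small hand-picked sample. The sketch below is only an illustration: it does not read the course data file, and instead uses the proton and neutron counts of the most common isotope of seven elements (H, He, Li, C, O, Fe, Au), a selection made here as an assumption.

```python
import numpy as np

# Proton and neutron counts for the most common isotope of
# H, He, Li, C, O, Fe, and Au (a toy sample, not the course data set).
protons = np.array([1, 2, 3, 6, 8, 26, 79])
neutrons = np.array([0, 2, 4, 6, 8, 30, 118])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(protons, neutrons)[0, 1]
print(f"Correlation between protons and neutrons: {r:.3f}")
```

The counts are not equal (gold has far more neutrons than protons), yet the relationship is close to linear, which is what the correlation coefficient measures.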

### 7.1(b)
Next, run a principal component analysis on the data you just
loaded. This can be done with:

In [None]:
variables = [i for i in data.columns if i not in ("element", "metal")]
elements = data["element"].values
X = data[variables]

X = scale(X)

# Run PCA and obtain the scores:
pca = PCA()
scores = pca.fit_transform(X)

Notice here that we do not include the metal classification in the
data we analyze. This is because we will use this information
later and want to check if this classification is
something the PCA can discover from the other variables.

The last line in the code above
performs the principal component analysis and returns the *scores*.
In your own words, how would you describe scores? Check the
dimensionality of the scores matrix &ndash; is this as expected?

In [None]:
print("Dimensions for scores:", scores.shape)
print("Dimensions for original data:", X.shape)

#### Your answer to question 7.1(b):

The scores are the coordinates in the new coordinate system found by the PCA. The directions in the new coordinate system point toward the directions of the largest variance in the original data.

There are 86 rows and 23 columns in the scores matrix. 86 is the number of observations (number of elements) in the data set, and 23 is the number of principal components found in the PCA.
The maximum number of principal components we can obtain equals the original number of variables (23). We see here that the PCA we just did defaults to finding
the maximum number of principal components. If you rerun the analysis with a fixed number of components, say ``pca = PCA(n_components=5)``, you should see a change in the dimensionality of the scores:

In [None]:
scores2 = PCA(n_components=5).fit_transform(X)
print("Dimensions for scores with 5 principal components:", scores2.shape)
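
To make the idea of scores as "coordinates in a new coordinate system" concrete, here is a minimal sketch that computes them by hand with NumPy's singular value decomposition. It runs on randomly generated stand-in data (an assumption, not the periodic table file), and it is not necessarily how scikit-learn computes the scores internally; the signs of individual components may also differ from what `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the data: 20 observations, 4 variables.
X = rng.normal(size=(20, 4))
# Standardize each column (zero mean, unit variance).
X = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data: X = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T  # one column per principal component
scores = X @ loadings  # coordinates of each observation on the new axes

print("All components:", scores.shape)
print("First two components:", scores[:, :2].shape)
```

The scores matrix has one row per observation and one column per component, and keeping fewer components simply means keeping fewer columns.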

### 7.1(c)
Plot the scores for principal component number 1 against the scores
of principal component number 2. This can be done with:

In [None]:
# Plot scores for the two first principal components:
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(scores[:, 0], scores[:, 1])
ax.set(xlabel=f"Scores PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Scores PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")

Here, you can also
show labels for the elements with the following modification to the code above:

In [None]:
# Plot scores for the two first principal components
# + add element symbols
fig, ax = plt.subplots(constrained_layout=True)
ax.scatter(scores[:, 0], scores[:, 1])
ax.set(xlabel=f"Scores PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Scores PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")

# Add labels for the elements:
for i, symbol in enumerate(elements):
    ax.text(scores[i, 0], scores[i, 1], symbol, fontsize="small")

Do you observe any groupings or trends in the data? How does this compare with the periodic system?

#### Your answer to question 7.1(c):

Along PC1, we have six groups in our data. On closer inspection, we see that each group contains elements from one row in the periodic table. So the PCA has found the rows in the periodic table! Further, the elements within one such group seem to be "sorted" (more or less!) along PC2 according to the number of protons (or electrons) so that the elements with more protons are higher up on the y-axis.

### 7.1(d)
Add some color to your scores plot by coloring the elements according to their
classification as metals or not. This can be done by using the `hue` argument
of the `sns.scatterplot` function:

In [None]:
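# Plot scores for the two first principal components,
# colored by the metal classification + add element symbols
fig, ax = plt.subplots(constrained_layout=True)
# Pass ax explicitly so the points end up in the axes we label below:
sns.scatterplot(
    data=data, x=scores[:, 0], y=scores[:, 1], hue="metal", ax=ax
)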
ax.set(xlabel=f"Scores PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Scores PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")

# Add labels for the elements:
for i, symbol in enumerate(elements):
    ax.text(scores[i, 0], scores[i, 1], symbol, fontsize="small")
sns.despine(fig=fig)

Do you observe any new groupings/trends in the data after
adding this extra color? Here, you can also experiment with using different
columns for coloring the data, for instance, the number of electrons.

In [None]:
# Plot scores for the two first principal components,
# colored by metal classification and sized by the number of protons
fig, ax = plt.subplots(constrained_layout=True)
sns.scatterplot(
    data=data,
    x=scores[:, 0],
    y=scores[:, 1],
    hue="metal",
    size="protons",
    ax=ax,
)
ax.set(xlabel=f"Scores PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Scores PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")

# Add labels for the elements:
for i, symbol in enumerate(elements):
    ax.text(scores[i, 0], scores[i, 1], symbol, fontsize="small")
sns.despine(fig=fig)

#### Your answer to question 7.1(d):

The metallic elements are positioned in the figure's lower part and the lower part of each group.
There is no clear separation between the elements labeled as metals and non-metals. This is expected; many elements have the characteristics of both metals and non-metals!

### 7.1(e)
Next, we will investigate how much of the variance we explain
with the different principal components.
The variance explained by a particular
component can be accessed by using `pca.explained_variance_ratio_`.
Below, you can find some code that will plot
the explained variance per component in a bar plot:

In [None]:
# Plot the explained variance:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(8, 4)
)
variance = pca.explained_variance_ratio_ * 100
components = 1 + np.arange(len(variance))
ax1.bar(components, variance)
ax1.set_xticks(components[::2])
ax1.set(
    xlabel="No. of principal components",
    ylabel="Percentage of variance explained",
);

Add a line plot to the second axis, `ax2`, in the plot above that
shows the total variance explained by $x$ components. That is, the $x$-axis should
show the number of components used, and the $y$-axis should show the summed explained variance
when using $x$ components. For calculating the summed explained variance, you can
use the cumulative sum which can be obtained by `np.cumsum(pca.explained_variance_ratio_)`.

Based on the plot you just created for the explained variance,
how many principal components are needed
to explain at least 90\% of the variance?

In [None]:
# Plot the explained variance:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, sharex=True, figsize=(8, 4)
)
variance = pca.explained_variance_ratio_ * 100
components = 1 + np.arange(len(variance))
ax1.bar(components, variance)
ax1.set(
    xlabel="No. of principal components",
    ylabel="Percentage of variance explained",
)
ax2.plot([0] + list(components), [0] + list(np.cumsum(variance)), marker="o")
ax2.set(
    xlabel="No. of principal components",
    ylabel="Cumulative percentage of variance explained",
)
ax2.axhline(y=90, ls=":", color="k")
sns.despine(fig=fig)

variance_sum = np.cumsum(variance)
# Indices of the components where the cumulative sum has reached 90 %:
idx = np.where(variance_sum >= 90)[0]
print("Number of components needed:", components[idx[0]])

#### Your answer to question 7.1(e): How many components are needed to explain at least 90% of the variance?

From the plot above: We need at least six components.
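
For intuition about where the explained variance ratios come from, the sketch below computes them by hand: each ratio is one component's share of the total variance, which can be obtained from the squared singular values of the centered data. The data are random numbers with deliberately unequal spread (an assumption for illustration, not the periodic table data), so the number printed here says nothing about the answer above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 50 observations of 5 variables with unequal spread.
X = rng.normal(size=(50, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)  # PCA works on centered data

# The squared singular values are proportional to the variance
# captured by each principal component.
S = np.linalg.svd(X, compute_uv=False)
ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 90 %.
n_needed = int(np.argmax(cumulative >= 0.9)) + 1
print("Explained variance ratio:", np.round(ratio, 3))
print("Components needed for 90 %:", n_needed)
```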

### 7.1(f)
Next, we will investigate the loadings. In your own words, how would you
explain what the loadings are?

#### Your answer to question 7.1(f):
The loadings describe how we transform the original variables to the new coordinate system (to the principal
components). Specifically, the principal components are linear combinations of the original variables, and
the loadings contain the coefficients for this linear combination.
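
Written out as a formula, the score of observation $i$ on principal component $k$ is a weighted sum of the (scaled) variables, with the loadings as weights. The symbols $\mathbf{X}$ (scaled data matrix), $\mathbf{T}$ (scores matrix), and $\mathbf{P}$ (matrix with the loadings vectors $\mathbf{p}_k$ as columns) are introduced here for illustration:

$$
t_{ik} = \sum_{j} x_{ij}\, p_{jk}
\qquad \text{or, in matrix form,} \qquad
\mathbf{T} = \mathbf{X}\mathbf{P}.
$$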

### 7.1(g)
Let $\mathbf{p}_1$ and $\mathbf{p}_2$ be the
vectors with loadings for the first and second principal components from the analysis you have
just carried out. Verify that the vectors are normalized (e.g., $\mathbf{p}_1 \cdot \mathbf{p}_1 = 1$) and
that they are orthogonal to each other (i.e., $\mathbf{p}_1 \cdot \mathbf{p}_2 = 0$).

The loadings can be accessed with:

In [None]:
# Get the loadings for PC1 and PC2:
loadings = pca.components_.T
pc1 = loadings[:, 0]
pc2 = loadings[:, 1]

**Hint:** You can use `np.dot` to take the dot product.

In [None]:
print("pc1 * pc1 =", np.dot(pc1, pc1))
print("pc2 * pc2 =", np.dot(pc2, pc2))
print("pc1 * pc2 =", np.dot(pc1, pc2))

#### Your answer to question 7.1(g):
When we take the dot product, we find that
the vectors in question are normalized (dot product equal to one) and orthogonal to each other (dot product equal to zero).
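
The same check can be done for all loadings vectors at once. The sketch below assumes random stand-in data and loadings obtained from NumPy's SVD (not the `pca` object from the exercise): if the loadings are collected as columns of a matrix $\mathbf{P}$, then entry $(i, j)$ of $\mathbf{P}^\top \mathbf{P}$ is the dot product of loadings vectors $i$ and $j$, so an orthonormal set gives the identity matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 30 observations of 6 variables.
X = rng.normal(size=(30, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Loadings from the SVD; column k holds the loadings for PC k+1.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T

# Diagonal entries: norms squared (should be 1).
# Off-diagonal entries: dot products between different vectors (should be 0).
gram = P.T @ P
print(np.round(gram, 10))
```

With the loadings from the exercise, the equivalent check would be `np.allclose(loadings.T @ loadings, np.eye(loadings.shape[1]))`.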

### 7.1(h)
For a particular loadings vector, the $i$'th component contains the
contribution from the original variable $i$ to the principal
component described by this loadings vector. This contribution
is a number between $-1$ and $1$.

We can get an overview of the contributions to
principal component number 1 and 2 by plotting the loadings in a bar plot as follows:

In [None]:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, nrows=2, sharex=True, sharey=True
)
position = np.arange(len(pc1))
ax1.bar(position, pc1)
ax1.set_xticks(position)
ax1.set_xticklabels(variables, rotation=90)
ax1.axhline(y=0, ls=":", color="k")  # Horizontal line to show 0
ax2.bar(position, pc2)
ax2.set_xticks(position)
ax2.set_xticklabels(variables, rotation=90)
ax2.axhline(y=0, ls=":", color="k")  # Horizontal line to show 0
ax1.set_title("PC1", loc="left")
ax2.set_title("PC2", loc="left")
sns.despine(fig=fig)

Make a bar plot for the two first principal components and inspect
the contributions from the different variables. The plots
should indicate that the variables neutrons, protons, and electrons contribute
almost equally to both the first and second
principal components. Can you provide an interpretation of this
observation?

#### Your answer to question 7.1(h):

We have already discussed the correlations in [7.1(a)](#Your-answer-to-question-7.1(a):). The
contributions to the principal components reflect this:

- The number of protons and electrons should always be the same for the elements. So when we calculate the scores, we will weigh the protons and electrons equally (we could use one of them since they are equal), and we see this in the identical coefficients for these two variables.

- We also expect something similar for the number of neutrons, but not a perfect correlation, as discussed in  [7.1(a)](#Your-answer-to-question-7.1(a):). Here, this shows up in the coefficients; they are similar but unequal.
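
This behavior can be reproduced on synthetic data. In the sketch below (made-up random numbers, with loadings taken from NumPy's SVD rather than scikit-learn), two columns are exact copies of each other, mimicking protons and electrons, a third follows them closely but not perfectly, mimicking neutrons, and a fourth is unrelated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)
# Columns 0 and 1 are exact copies (like protons and electrons);
# column 2 tracks them closely but not perfectly (like neutrons);
# column 3 is unrelated.
X = np.column_stack([a, a, a + 0.1 * rng.normal(size=n), b])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the loadings vectors, ordered by explained variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]
print("PC1 loadings:", np.round(pc1, 3))
print("PC2 loadings:", np.round(pc2, 3))
```

The two duplicated columns receive identical loadings on both components, while the "almost duplicate" column receives a similar but not identical value.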

### 7.1(i)
The loadings plot is usually easier to interpret than the plot we just made.
Here, we will make the loadings plot
for principal component 1 and principal component 2.
The current case we are investigating is a bit complex since we have many
variables. We will show the loadings by drawing arrows.

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
ax.set(xlabel=f"Loadings PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Loadings PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")
# Make the scale for the x- and y-axis the same:
ax.set_xlim(-0.4, 0.4)
ax.set_ylim(ax.get_xlim())
ax.set_aspect("equal")
# Add x=0 and y=0 lines to help locate positive and negative values:
ax.axhline(y=0, ls=":", color="k")
ax.axvline(x=0, ls=":", color="k")
# Add the arrows:
for i, vari in enumerate(variables):
    x, y = pc1[i], pc2[i]
    ax.text(x, y, vari, fontsize="x-small")
    # Draw arrow from the origin to the point:
    ax.annotate(
        "",
        xy=(x, y),
        xytext=(0, 0),
        arrowprops=dict(
            arrowstyle="-|>", lw=2, color="red", mutation_scale=25
        ),
    )

One way to make this plot easier to read is to remove the text and use different colors for the arrows.
We can select different colors using the [color_palette](https://seaborn.pydata.org/generated/seaborn.color_palette.html) function from seaborn:

In [None]:
sns.color_palette("husl", as_cmap=True)

In [None]:
sns.color_palette("flare", as_cmap=True)

In [None]:
sns.color_palette("pastel")

Select a color map you like and generate some colors with:

In [None]:
colors = sns.color_palette("husl", len(variables))
# We use len(variables) to get one color per variable

We can use the colors as follows:

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
ax.set(xlabel=f"Loadings PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Loadings PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")
# Make the scale for the x- and y-axis the same:
ax.set_xlim(-0.4, 0.4)
ax.set_ylim(ax.get_xlim())
ax.set_aspect("equal")
# Add x=0 and y=0 lines to help locate positive and negative values:
ax.axhline(y=0, ls=":", color="k")
ax.axvline(x=0, ls=":", color="k")
# Add the arrows:
arrows = []
for i, vari in enumerate(variables):
    x, y = pc1[i], pc2[i]
    # Draw arrow from the origin to the point:
    arrow = ax.annotate(
        "",
        xy=(x, y),
        xytext=(0, 0),
        arrowprops=dict(
            arrowstyle="-|>", lw=2, color=colors[i], mutation_scale=25
        ),
        label=vari,
    )
    arrows.append(arrow)

ax.legend(
    [i.arrow_patch for i in arrows],
    [i.get_label() for i in arrows],
    fontsize="xx-small",
)


After you have made the loadings plot, locate the
electronegativity and the atomic radius. Are these located (relative
to each other) as you would expect? How about the electrons and protons?

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
ax.set(xlabel=f"Loadings PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax.set(ylabel=f"Loadings PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")
# Make the scale for the x- and y-axis the same:
ax.set_xlim(-0.4, 0.4)
ax.set_ylim(ax.get_xlim())
ax.set_aspect("equal")
# Add x=0 and y=0 lines to help locate positive and negative values:
ax.axhline(y=0, ls=":", color="k")
ax.axvline(x=0, ls=":", color="k")
# Add the arrows:
arrows = []
colors2 = sns.color_palette("husl", 4)
j = 0
for i, vari in enumerate(variables):
    if vari not in (
        "atomic_radius",
        "electronegativity",
        "electrons",
        "protons",
    ):
        continue

    x, y = pc1[i], pc2[i]
    ax.text(x, y, vari, fontsize="x-small")
    # Draw arrow from the origin to the point:
    arrow = ax.annotate(
        "",
        xy=(x, y),
        xytext=(0, 0),
        arrowprops=dict(
            arrowstyle="-|>", lw=2, color=colors2[j], mutation_scale=25
        ),
        label=vari,
    )
    arrows.append(arrow)
    j += 1

ax.legend(
    [i.arrow_patch for i in arrows],
    [i.get_label() for i in arrows],
    fontsize="xx-small",
)

#### Your answer to question 7.1(i):

The plot places the atomic radius and electronegativity along a diagonal line. This is an example
of a negative correlation: when the atomic radius decreases, the electronegativity increases, as discussed in
[7.1(a)](#Your-answer-to-question-7.1(a):).

The arrows for electrons and protons are harder to see, but they are on top of each other (so very much positively correlated)!
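
The link between negative correlation and opposite directions in the loadings plot can be illustrated with two synthetic variables. In the sketch below, the names `radius` and `electroneg` are only labels for made-up random numbers (an assumption for illustration, not values from the data set), and the loadings come from NumPy's SVD.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
radius = rng.normal(size=n)
# Toy "electronegativity" that decreases as the toy "radius" increases.
electroneg = -radius + 0.3 * rng.normal(size=n)
X = np.column_stack([radius, electroneg])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# The first row of Vt holds the loadings for PC1.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print("Correlation:", round(r, 3))
print("PC1 loadings:", np.round(pc1, 3))
```

The two loadings on PC1 have opposite signs and equal magnitude, so the corresponding arrows would point in opposite directions along PC1.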

### 7.1(j)
Interpreting the scores and loadings together can be instructive. Create a new figure where you show the scores and loadings next to each other. You can create such a figure with:

In [None]:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(8, 4)
)
ax1.text(0.5, 0.5, "Plot the scores in ax1", ha="center")
ax2.text(0.5, 0.5, "Plot the loadings in ax2", ha="center")

In [None]:
fig, (ax1, ax2) = plt.subplots(
    constrained_layout=True, ncols=2, figsize=(10, 5)
)

# Scores:
ax1.scatter(scores[:, 0], scores[:, 1])
ax1.set(xlabel=f"Scores PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax1.set(ylabel=f"Scores PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")

# Loadings:
ax2.set(xlabel=f"Loadings PC1 ({pca.explained_variance_ratio_[0]*100:.2g}%)")
ax2.set(ylabel=f"Loadings PC2 ({pca.explained_variance_ratio_[1]*100:.2g}%)")
# Make the scale for the x- and y-axis the same:
ax2.set_xlim(-0.4, 0.4)
ax2.set_ylim(ax2.get_xlim())  # Use ax2 here, not `ax` from an earlier cell
ax2.set_aspect("equal")
# Add x=0 and y=0 lines to help locate positive and negative values:
ax2.axhline(y=0, ls=":", color="k")
ax2.axvline(x=0, ls=":", color="k")
# Add the arrows:
arrows = []
for i, vari in enumerate(variables):
    x, y = pc1[i], pc2[i]
    # Draw arrow from the origin to the point:
    arrow = ax2.annotate(
        "",
        xy=(x, y),
        xytext=(0, 0),
        arrowprops=dict(
            arrowstyle="-|>", lw=2, color=colors[i], mutation_scale=25
        ),
        label=vari,
    )
    arrows.append(arrow)

ax2.legend(
    [i.arrow_patch for i in arrows],
    [i.get_label() for i in arrows],
    fontsize="xx-small",
)

After you have completed the plot above:

1. Does the direction of increasing mass correspond to what you would expect?

2. In this case, it is not so easy to interpret the loadings since we have many variables,
   and many seem to be equally important. But, if you were to give a simplified description of
   the two principal components, how would you describe them, and does this fit with your
   understanding of the periodic system?

#### Your answer to question 7.1(j):
1. Yes, the mass is increasing upwards and to the left. This is consistent with the increasing size
   of the elements when moving to the left in the scores plot.
   
2. Let us give some overall interpretation of PC1 and PC2:

   * For PC1, the most dominating factors can be related to size. The mass and atomic radius point toward the left, which is also consistent with the elements being "bigger" (in terms of the number of protons/electrons/neutrons and mass) to the left.

   * For PC2, there are also size contributions and impact of properties like electronegativity. The
     interpretation of the electronic structure is more complex, as there are correlations here. If
     we have some electrons in a higher orbital, say 5s, then we know that 1s, 2s, 2p, and so on are filled. Also, if we have filled orbitals up to, say, 3s and we do not have more electrons, then we know that all the higher
     orbitals are not filled (and thus do not contribute to the scores). In general, lower orbitals point toward the negative PC2 direction, while higher orbitals point toward the positive PC2 direction.
     A simplified interpretation of the PC2 direction is that it reflects the distribution of electrons: Within a group (a group in the plot!), elements further down have electrons in lower orbitals. If we
     check the elements, we find the noble gases ("full" electron configuration) at the top (along PC2)
     in their respective groups.
     
   In general, it fits with the periodic system that reflects the electron configuration of the elements!

# Extra: Interactive plots with [bokeh](https://docs.bokeh.org/en/latest/)

Some of the plots we have made here are a bit crowded, and it can be difficult to make out the labels. One solution is to add some interactivity; for instance, we can display the names of the variables in the loadings plot when we hover the mouse over a symbol. This is not so easy with matplotlib, but it is relatively easy with
[bokeh](https://docs.bokeh.org/en/latest/). Below is some code to make more interactive versions of the scores and loadings plot. It is included here as an "extra" part since we have to use a new Python library that requires some extra coding. The method defined below might be overly complex;
the [bokeh gallery](https://docs.bokeh.org/en/latest/docs/gallery.html)
has more to-the-point examples.

In [None]:
# Imports for bokeh:
from bokeh.io import output_notebook
from bokeh.models import (
    ColorBar,
    ColumnDataSource,
    HoverTool,
    LabelSet,
)
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap

In [None]:
# Set up output for Jupyter notebook:
output_notebook()

In [None]:
def bokeh_2d_scatter(
    x_data,
    y_data,
    names,
    title="Scatter plot",
    xlabel="x",
    ylabel="y",
    color_by=None,
    color_by_feature_name="Color feature",
    add_labels=False,
):
    """Create a 2D scatter plot with bokeh.

    Parameters
    ----------
    x_data : object like numpy.array
        The x-coordinates for the scatter plot.
    y_data : object like numpy.array
        The y-coordinates for the scatter plot.
    names : list of strings
        The name of the items in the scatter plot.
    title : string, optional
        Title of the plot.
    xlabel : string, optional
        Label for the x-axis.
    ylabel : string, optional
        Label for the y-axis.
    color_by : object like numpy.array, optional
        Numbers to color the items in the scatter plot by.
        These numbers will be used to set up a color map.
    color_by_feature_name : string, optional
        Name of the feature the color in `color_by` represents.
    add_labels : boolean, optional
        If True, also write the names above the symbols.
    """
    plot_data = {
        "x": x_data,
        "y": y_data,
        "name": names,
    }

    tool_html = [
        '<div><span style="font-weight: bold;">@name</span></div>',
    ]

    color_mapper = None
    extra_kw = {}

    if color_by is not None:
        plot_data["color_by"] = color_by
        tool_html.append(f"<div>{color_by_feature_name}: @color_by</div>")
        color_mapper = linear_cmap(
            field_name="color_by",
            palette="Viridis256",
            low=min(color_by),
            high=max(color_by),
        )
        extra_kw = {"color": color_mapper, "marker": "circle"}

    tool_html = "<div>" + "\n".join(tool_html) + "</div>"
    source = ColumnDataSource(data=plot_data)

    fig = figure(
        title=title,
        active_scroll="wheel_zoom",
        background_fill_color="#fafafa",
    )
    fig.scatter(
        x="x",
        y="y",
        size=12,
        fill_alpha=0.6,
        name="points",
        source=source,
        **extra_kw,
    )

    hover = HoverTool(
        name="points",
        tooltips=tool_html,
    )
    fig.add_tools(hover)
    fig.xaxis.axis_label = xlabel
    fig.yaxis.axis_label = ylabel

    if color_by is not None:
        color_bar = ColorBar(
            color_mapper=color_mapper["transform"],
            width=10,
            title=color_by_feature_name,
        )
        fig.add_layout(color_bar, "right")
    if add_labels:
        labels = LabelSet(
            x="x",
            y="y",
            text="name",
            y_offset=8,
            text_font_size="11px",
            text_color="#555555",
            source=source,
            text_align="center",
        )
        fig.add_layout(labels)
    return fig

In [None]:
fig = bokeh_2d_scatter(
    scores[:, 0],
    scores[:, 1],
    data["element"].values,
    title="Plot of scores",
    xlabel="PC1",
    ylabel="PC2",
    color_by=data["atomic_radius"].to_numpy(),
    color_by_feature_name="Atomic radius",
    add_labels=True,
)
show(fig)

In [None]:
fig = bokeh_2d_scatter(
    pc1,
    pc2,
    variables,
    title="Plot of loadings",
    xlabel="PC1",
    ylabel="PC2",
    add_labels=True,
)
show(fig)