Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and email below:

In [None]:
# Full name
NAME = ""
# Institutional email (hm.edu or hmtm.de)
EMAIL = ""

---

# Day 5 - Visualizing the similarity of painters' biographies

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 11.06.2024
+ **Author:** Lenny Martinez Dominguez, Ph.D candidate at Sorbonne Université

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A5_semantic_similarities_visualization/painter_biography_analysis.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## 0. Getting Started

### Introduction

This fifth day of class will teach you:

- How to browse machine learning models on [HuggingFace](https://huggingface.co/), a platform for developing and hosting machine learning models;
- How to compute and visualize similarities between artists' biographies.

### Content of the repository

- `data`: A folder containing the summary information for artists gathered from [Wikipedia](https://en.wikipedia.org/).
- `painter_semantic_distance.ipynb`: This notebook you are reading right now, in which you will perform your analysis.

### Assignment

Your task is to cluster and visualize painters according to the similarity of their biographies, found on [Wikipedia](https://en.wikipedia.org/). The dataset comprises 3939 artists' biographies.

### Installation required

Please execute the next cell to make sure you have the necessary packages installed for today.

In [None]:
!pip install pandas numpy matplotlib seaborn sentence-transformers umap-learn plotly

---
## 4.1. The dataset

The dataset was extracted using the [`wikipedia-api` package](https://pypi.org/project/Wikipedia-API/). It's a collection of summaries from painter pages on Wikipedia. The painter pages come from the Wikipedia article ["List of painters by name"](https://en.wikipedia.org/wiki/List_of_painters_by_name). While it lists a lot of painters, it is important to note that it is far from covering _all_ painters documented on Wikipedia.

The dataset is divided into two sections:

- The main file is `painter_summaries_all.csv`; it has data on all 3900+ painters listed in the Wikipedia article. One listed painter, who still appears in the partial files, has been removed from this dataset; the IDs of the remaining painters have not been changed.
- There are also six files in the `partial` directory, named `painter_summaries_part#.csv`. These files have the data split into smaller chunks based on how the data was gathered.

### Inspecting the data

Open the `painter_summaries_all.csv` file in a spreadsheet program (Excel, Numbers, Sheets, etc.) and look at the data.

<div class="alert alert-info">
<b>Instruction:</b> What are the columns in this dataset? What do they each contain?
</div>

YOUR ANSWER HERE

### Loading the data

Let's load the complete dataset and inspect it using pandas.

In [None]:
import pandas as pd

painter_summaries_df = pd.read_csv("data/painter_summaries_all.csv")

painter_summaries_df.head(5)

<div class="alert alert-info">
<b>Instruction:</b> How many painters are in the dataset?
</div>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<div class="alert alert-info">
<b>Instruction:</b> Print all rows of painters with identical names. Are they likely homonyms or duplicates?
</div>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Cleaning the dataset 
<div class="alert alert-info">
<b>Instruction:</b> Create a new dataframe <strong>painter_summaries_clean</strong> that does not have duplicates based on the <em>painter_name</em> column.
</div>
To do that, you can use the [`drop_duplicates`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) method from pandas.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Now that the dataset is duplicate-free, we can start working with it for our analysis.

If you look at the data file in a spreadsheet program, you will notice that the summaries are of various lengths. Let's keep track of that somehow because we may want to filter later on.

In [None]:
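def count_words(text):
    """Count whitespace-separated words; non-text values (e.g. NaN) count as 0."""
    return len(text.split()) if isinstance(text, str) else 0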

In [None]:
painter_summaries_clean["summary_length"] = painter_summaries_clean["summary"].apply(count_words)
painter_summaries_clean.head(10)

We'll save the data as it is now and then we can work with these summaries.

In [None]:
painter_summaries_clean.to_csv("data/painter_summaries_clean.csv", index=False)
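
The word counts are useful if you later decide to leave out very short summaries. The following is a minimal sketch of such a filter on made-up data: the dataframe `toy_df`, its contents, and the threshold `min_words` are assumptions for illustration only, not values taken from the real dataset.

```python
import pandas as pd

# Toy data standing in for painter_summaries_clean (names and texts are made up).
toy_df = pd.DataFrame({
    "painter_name": ["Painter A", "Painter B", "Painter C"],
    "summary": [
        "A painter.",
        "A landscape painter known for large oil paintings of mountains and rivers.",
        "A portrait painter who worked mostly in watercolor.",
    ],
})

# Count whitespace-separated words in each summary.
toy_df["summary_length"] = toy_df["summary"].str.split().str.len()

# Hypothetical threshold: keep only summaries with at least 5 words.
min_words = 5
long_enough = toy_df[toy_df["summary_length"] >= min_words]

print(long_enough[["painter_name", "summary_length"]])
```

Which threshold makes sense for the real data is a judgment call; looking at the distribution of `summary_length` first (for example with `describe()`) can help you pick one.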


---

## 4.2. Sentence Similarity

Let's take a step back and think about where we want to end up and where we are currently. Right now, we have a dataset of different painters' biographies (with some length differences). We want to end up with a visual of the painters clustered based on their biographies.

We could manually take each biography, interpret the text, and try to group the painters ourselves. In some cases, we might group painters by their nationality (e.g., Dutch painters), their style (e.g., Surrealist painters), their subject matter (e.g., still life painters), or the period they lived in (e.g., Renaissance painters). 

<div class="alert alert-info">
<b>Instruction:</b> How many painter biographies would you go through before getting bored or burning out?
</div>

YOUR ANSWER HERE

We can use machine learning to assist us in clustering these biographies by comparing how similar or different the summaries are. This task is also known as Sentence Similarity and you can read more about it here: [https://huggingface.co/tasks/sentence-similarity](https://huggingface.co/tasks/sentence-similarity). 

For now we will play a bit with the widget on the page. First let's get a series of painter summaries to work with. I picked names that might have some obvious groupings so we can do sanity checks as we work.

In [None]:
select_painter_names = [
    "Albrecht Dürer",
    "Leonardo da Vinci",
    "Michelangelo",
    "Raphael",
    "Titian",
    "Joaquín Sorolla",
    "Pablo Picasso",
    "Salvador Dalí",
    "Andy Warhol",
    "Vincent van Gogh",
    "Johannes Vermeer",
    "Sandro Botticelli",
    "Hokusai"
]

select_painter_bios = painter_summaries_clean[
    painter_summaries_clean["painter_name"].isin(select_painter_names)
]

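# For this short dataset, we don't care about the other columns.
# .copy() gives us an independent dataframe, so adding columns later does not trigger pandas' SettingWithCopyWarning.
select_painter_bios = select_painter_bios[["painter_name", "summary"]].copy()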
select_painter_bios

<div class="alert alert-info">
<b>Instruction:</b> Cluster the 13 painters based on what you already know or can quickly read about them.
</div>

YOUR ANSWER HERE

Now let's play with the sentence similarity widget on Hugging Face. For that we need the full summary for each painter. I will save the previous table to a CSV file for faster copy and paste, but you can also use the Python code below it to get the bio for a particular artist.

In [None]:
select_painter_bios.to_csv("data/select_painter_bios.csv", index=False)

In [None]:
painter_name = "Vincent van Gogh"
select_painter_bios[select_painter_bios["painter_name"] == painter_name]["summary"].values[0]

<div class="alert alert-info">
<b>Instruction:</b> Pick 5 painters from our test set. Put their bios in the <a href="https://huggingface.co/tasks/sentence-similarity">Sentence Similarity demo</a> and write down the values. Then add your interpretation of the values. Are they high or low? Why might that be? Fill in the table below:
</div>

YOUR ANSWER HERE

This Sentence Similarity demo is quite cool. It takes each summary and converts it into an **embedding**, a numerical vector representation of the text that does a good job of capturing the semantics of the text. This is the part connected to machine learning. In the demo, the pre-trained model `all-MiniLM-L6-v2` is used to compute the embeddings. We'll work with this same model below.

Once all the embeddings are computed, the rest is math. The demo takes the source embedding (whichever artist you entered first) and compares it with each of the other embeddings, one pair at a time. For each pair, say *source_painter* and *painter_1*, it produces a similarity score. For these biographies the score will typically fall between 0 and 1, where 0 means there is no similarity and 1 means the texts are identical. There are many ways to compute similarity; a popular one is cosine similarity, which can in principle range from -1 to 1. There is some info on the demo page linked above, reproduced here:
> The similarity of the embeddings is evaluated mainly on cosine similarity. It is calculated as the cosine of the angle between two vectors. It is particularly useful when your texts are not the same length
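
To make the quoted definition concrete, here is a minimal sketch of cosine similarity computed with NumPy. The helper `cosine_similarity` and the 3-component vectors are made up for illustration (real embeddings from `all-MiniLM-L6-v2` have many more components), and this is not necessarily how the demo implements the computation internally.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings", invented for this example.
source = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # points the same way, but is twice as long
orthogonal = [0.0, 0.0, 3.0]      # shares no direction with the source
opposite = [-1.0, -2.0, 0.0]      # points the opposite way

for name, vector in [
    ("same_direction", same_direction),
    ("orthogonal", orthogonal),
    ("opposite", opposite),
]:
    print(name, round(cosine_similarity(source, vector), 3))
```

Only the direction of the vectors matters: scaling a vector does not change the score. Mathematically the value can range from -1 (opposite directions) through 0 (unrelated) to 1 (same direction).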

---

## 4.3. Visualizing the `select_painter_bios`

### Create embeddings
The first step toward clustering and visualizing the painters is to compute the embeddings. We will store them as an extra column in our `select_painter_bios` dataframe.

In [None]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

select_painter_bios["embeddings"] = select_painter_bios["summary"].apply(
    lambda x: model.encode(x).tolist()
)
select_painter_bios

<div class="alert alert-info">
<b>Instruction:</b> Save the dataframe with the embeddings as <em>select_painter_embeddings.csv</em>
</div>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

We have created our embeddings using one specific model, `all-MiniLM-L6-v2`. That is one of many models we can use. See a full list here: [https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers).

<div class="alert alert-info">
<b>Instruction:</b> Pick a model from the link above and create a new set of embeddings. Name that new column <em>embeddings2</em>. Fill in the table below, and save the data for that run as well.
</div>

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Reducing dimensions

Now that we have at least one set of embeddings, we can work to visualize them. This is the embedding using the first model for Vincent van Gogh:

In [None]:
van_gogh = select_painter_bios[select_painter_bios["painter_name"] == "Vincent van Gogh"]["embeddings"].values[0]

van_gogh

<div class="alert alert-info">
<b>Instruction:</b> How long is this vector?
</div>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

This embedding has 384 components. It would be very difficult to visualize all 384 dimensions of this vector directly in a way that is interpretable. We are better off if we can somehow get these 384 dimensions down to 2 or 3 dimensions (using 1 dimension might be too simplistic). This process of taking a large number of dimensions and reducing them to fewer dimensions is known as dimensionality reduction, or projection.
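
To make the idea of projection concrete, here is a minimal sketch using principal component analysis (PCA), a simple linear technique chosen only because it fits in a few lines of NumPy. The data are random numbers standing in for embeddings, and the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in data: 13 random vectors with 384 components each (not real embeddings).
fake_embeddings = rng.normal(size=(13, 384))

# PCA via SVD: center the data, then keep the two directions of largest variance.
centered = fake_embeddings - fake_embeddings.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
projected_2d = centered @ components[:2].T

print(fake_embeddings.shape)  # (13, 384)
print(projected_2d.shape)     # (13, 2)
```

Each row still corresponds to one painter, but it is now described by 2 numbers instead of 384, which is what makes a scatterplot possible.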

The technique that we will use is called [UMAP](https://umap-learn.readthedocs.io/en/latest/), or Uniform Manifold Approximation and Projection for Dimension Reduction. There are others, like SNE and t-SNE, that are worth looking into.

In [None]:
import umap

umap_model = umap.UMAP(n_components=2, n_neighbors=5, min_dist=0.3, metric="cosine")
embeddings = select_painter_bios["embeddings"].tolist()
embedded_data_2d = umap_model.fit_transform(embeddings)
embedded_data_2d

What we have done is use the UMAP technique to project all 384 dimensions of the original embedding into 2 dimensions that we can now visualize.

Each of the parameters in `umap.UMAP()` can affect our output:
* `n_components`: This parameter controls the dimensionality of the reduction. We set it to 2 because we want to end up with 2 components in the end (that we can visualize).
* `n_neighbors`: This parameter tweaks how UMAP balances local vs global patterns. Play around with this if your visualization later looks off.
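* `min_dist`: This parameter controls how tightly points can be packed together in the projection.
* `metric`: This parameter sets how the distance between two embeddings is measured. We use `"cosine"`, in line with the cosine similarity discussed in section 4.2.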

You can read more about these parameters, and see some visuals of how they affect the output at the UMAP website [here](https://umap-learn.readthedocs.io/en/latest/parameters.html).

Let's add those dimensions to our dataframe. We'll name these new columns `umap1_x` and `umap1_y` because we're using the first set of embeddings that were created using the `all-MiniLM-L6-v2` model.

In [None]:
select_painter_bios["umap1_x"] = embedded_data_2d[:, 0]
select_painter_bios["umap1_y"] = embedded_data_2d[:, 1]

### Scatterplot visualization

Now that we have reduced the 384-component embeddings to 2 dimensions, let's visualize them using a scatterplot.

In [None]:
import plotly.express as px

# Create a scatter plot with Plotly
fig = px.scatter(select_painter_bios, x="umap1_x", y="umap1_y", hover_data=["painter_name"], width=800, height=800)

# Show the plot
fig.show()


<div class="alert alert-info">
<b>Instruction:</b> How do you interpret your figure?
</div>

YOUR ANSWER HERE

---

## 4.5. Visualizing the embeddings for the model you chose

Now that you have visualized the first embedding using the `all-MiniLM-L6-v2` model, do it for the model you chose. Feel free to reuse code that is above, but be sure to write comments and notes explaining your process.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<div class="alert alert-info">
<b>Instruction:</b> Use the space below to interpret the final visualization of your embedding. How does it compare with the previous visual?
</div>

YOUR ANSWER HERE

---

## 4.6. Visualizing the entire dataset

Taking all the tools from above, visualize the entire dataset of artists. The code may take longer to run, but the process is still the same.

1. Compute embeddings using one of the models from [this page](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers).
1. Reduce dimensions using UMAP.
1. Plot the result.



In [None]:
# YOUR CODE HERE
raise NotImplementedError()