Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and email below:

In [None]:
# Full name
NAME = ""
# Institutional email (hm.edu or hmtm.de)
EMAIL = ""

---

# Day 5 - Visualizing the similarity of painters' biographies

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 11.06.2024
+ **Author:** Lenny Martinez Dominguez, Ph.D candidate at Sorbonne Université

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A5_semantic_similarities_visualization/painter_biography_analysis.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## 0. Getting Started

### Introduction

This fifth day of class will teach you:

- How to browse machine learning models on [HuggingFace](https://huggingface.co/), a platform for developing and hosting machine learning models;
- How to compute and visualize similarities between artists' biographies.

### Content of the repository

- `data`: A folder containing the summary information for artists gathered from [Wikipedia](https://en.wikipedia.org/).
- `painter_semantic_distance.ipynb`: This notebook you are reading right now, in which you will perform your analysis.

### Assignment

Your task is to cluster and visualize painters according to the similarity of their biographies, found on [Wikipedia](https://en.wikipedia.org/). The dataset comprises 3939 artists' biographies.

### Installation required

Please execute the next cell to make sure you have the necessary packages installed for today.

In [1]:
!pip install pandas numpy matplotlib seaborn sentence-transformers umap-learn plotly



---
## 4.1. The dataset

The dataset was extracted using the [`wikipedia-api` package](https://pypi.org/project/Wikipedia-API/). It's a collection of summaries from painter pages on Wikipedia. The painter pages come from the Wikipedia article ["List of painters by name"](https://en.wikipedia.org/wiki/List_of_painters_by_name). While it has a lot of painters, it is important to note that it is far from covering _all_ painters documented on Wikipedia.

The dataset is divided into two sections:

- The main file is `painter_summaries_all.csv`; it has data on all 3900+ painters listed in the Wikipedia article. One listed painter has been removed from this file but still appears in the partial files; the IDs have not been renumbered.
- There are also six files in the `partial` directory with the format `painter_summaries_part#.csv`. These files have the data split into smaller chunks based on how the data was gathered.

### Inspecting the data

Open the `painter_summaries_all.csv` file in a spreadsheet program (Excel, Numbers, Sheets, etc.) and look at the data.

<div class="alert alert-info">
<b>Instruction:</b> What are the columns in this dataset? What do they each contain?
</div>

- `painter_id`: a numerical identifier for each painter
- `painter_name`: name of the painter
- `summary`: the painter's bio according to Wikipedia
- `url`: the address of the painter's Wikipedia page (you never know if you need it later)

### Loading the data

Let's load the complete dataset and inspect it using pandas.

In [18]:
import pandas as pd

painter_summaries_df = pd.read_csv("data/painter_summaries_all.csv")

painter_summaries_df.head(5)

Unnamed: 0,painter_id,painter_name,summary,url
0,1,Alfred Richard Gurrey Sr.,Alfred Richard Gurrey Sr. (1852–1944) was an ...,https://en.wikipedia.org/wiki/Alfred_Richard_G...
1,2,Edward Otho Cresap Ord II,"Edward Otho Cresap Ord, II (November 9, 1858 –...",https://en.wikipedia.org/wiki/Edward_Otho_Cres...
2,3,George Barret Jr.,"George Barret Jr. (1767–1842), sometimes refer...",https://en.wikipedia.org/wiki/George_Barret_Jr.
3,4,George Barret Sr.,George Barret Sr. (c. 1730 – 29 May 1784) was...,https://en.wikipedia.org/wiki/George_Barret_Sr.
4,5,Henry Ives Cobb Jr.,"Henry Ives Cobb Jr. (March 24, 1883 – August 1...",https://en.wikipedia.org/wiki/Henry_Ives_Cobb_Jr.


<div class="alert alert-info">
<b>Instruction:</b> How many painters are in the dataset?
</div>

In [19]:
# Non-null names minus distinct names = number of rows that repeat an earlier name
painter_summaries_df["painter_name"].count() - painter_summaries_df["painter_name"].nunique()

9
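
The subtraction above counts surplus rows rather than painters. As a minimal sketch, the three related counts can be computed side by side. The `toy_df` dataframe and its rows are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for painter_summaries_df
toy_df = pd.DataFrame(
    {"painter_name": ["Painter A", "Painter B", "Painter A", "Painter C"]}
)

n_rows = len(toy_df)                         # every row, duplicates included
n_unique = toy_df["painter_name"].nunique()  # distinct names
n_surplus = n_rows - n_unique                # rows that repeat an earlier name

print(n_rows, n_unique, n_surplus)
```

Running the same three lines on `painter_summaries_df` gives the row count, the number of distinct names, and the number of rows that would be dropped by de-duplicating on the name.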

<div class="alert alert-info">
<b>Instruction:</b> Print all rows of painters with identical names. Are they likely homonyms or duplicates?
</div>

In [11]:
painter_summaries_df["painter_name"].value_counts()

painter_name
Galli da Bibiena family    4
Walter Emerson Baum        2
Hristofor Žefarović        2
Giulio Clovio              2
Domenichino                2
                          ..
Giorgio de Chirico         1
Giorgio De Vincenzi        1
Giorgio Morandi            1
Giorgione                  1
Þórarinn B. Þorláksson     1
Name: count, Length: 3925, dtype: int64
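
`value_counts()` shows how often each name occurs, but not the rows themselves. One possible way to print the full rows is `DataFrame.duplicated` with `keep=False`, which flags every member of a repeated group. The sketch below uses a made-up `toy_df` (hypothetical names and summaries) rather than the real file:

```python
import pandas as pd

# Hypothetical toy data standing in for painter_summaries_df
toy_df = pd.DataFrame(
    {
        "painter_name": ["Painter A", "Painter B", "Painter A", "Painter C"],
        "summary": ["First bio", "Another bio", "First bio", "Third bio"],
    }
)

# keep=False marks all rows of a repeated name, not only the later ones
is_repeated = toy_df.duplicated(subset="painter_name", keep=False)

# Sorting by name puts the members of each group next to each other
repeated_rows = toy_df[is_repeated].sort_values("painter_name")
print(repeated_rows)
```

Comparing the `summary` and `url` values within each group helps to tell true duplicates (the same page listed twice) from homonyms (different people sharing a name).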

### Cleaning the dataset 
<div class="alert alert-info">
<b>Instruction:</b> Create a new dataframe <strong>painter_summaries_clean</strong> that does not have duplicates based on the <em>painter_name</em> column.
</div>
To do that, you can use the [`drop_duplicates`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) method from pandas.

In [21]:
painter_summaries_clean = painter_summaries_df.drop_duplicates(subset="painter_name")
painter_summaries_clean.head(10)


painter_summaries_clean["painter_name"].count() - painter_summaries_clean["painter_name"].nunique()

0

Now that the dataset is duplicate-free, we can start preparing it for our analysis.

If you look at the data file in a spreadsheet program, you will notice that the summaries are of various lengths. Let's keep track of that somehow because we may want to filter later on.

In [23]:
def count_words(text):
    return len(text.split())


sample_text = "I am Lenny."
count_words(sample_text)

3
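
`count_words` assumes that its input is a string. If a summary were missing from the CSV, pandas would hand the function a float `NaN` and `.split()` would raise an error. A defensive variant is sketched below, under the assumption that a missing summary should count as zero words; the name `count_words_safe` is hypothetical:

```python
def count_words_safe(text):
    """Count whitespace-separated words; treat non-string input as empty."""
    if not isinstance(text, str):
        return 0
    return len(text.split())


print(count_words_safe("I am Lenny."))
print(count_words_safe(float("nan")))
```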

In [24]:
# Work on an explicit copy so that adding a column does not trigger pandas' SettingWithCopyWarning
painter_summaries_clean = painter_summaries_clean.copy()
painter_summaries_clean["summary_length"] = painter_summaries_clean["summary"].apply(count_words)
painter_summaries_clean.head(10)


Unnamed: 0,painter_id,painter_name,summary,url,summary_length
0,1,Alfred Richard Gurrey Sr.,Alfred Richard Gurrey Sr. (1852–1944) was an ...,https://en.wikipedia.org/wiki/Alfred_Richard_G...,164
1,2,Edward Otho Cresap Ord II,"Edward Otho Cresap Ord, II (November 9, 1858 –...",https://en.wikipedia.org/wiki/Edward_Otho_Cres...,84
2,3,George Barret Jr.,"George Barret Jr. (1767–1842), sometimes refer...",https://en.wikipedia.org/wiki/George_Barret_Jr.,27
3,4,George Barret Sr.,George Barret Sr. (c. 1730 – 29 May 1784) was...,https://en.wikipedia.org/wiki/George_Barret_Sr.,263
4,5,Henry Ives Cobb Jr.,"Henry Ives Cobb Jr. (March 24, 1883 – August 1...",https://en.wikipedia.org/wiki/Henry_Ives_Cobb_Jr.,71
5,6,John Byrne (English artist),John Byrne (1786–1847) was an English painter ...,https://en.wikipedia.org/wiki/John_Byrne_(Engl...,29
6,7,John Frederick Herring Jr.,John Frederick Herring Jr. (1820–1907) was an ...,https://en.wikipedia.org/wiki/John_Frederick_H...,17
7,8,John Frederick Herring Sr.,John Frederick Herring Sr. (12 September 1795 ...,https://en.wikipedia.org/wiki/John_Frederick_H...,61
8,9,A. B. Jackson (painter),"Alexander Brooks Jackson (April 18, 1925 – Mar...",https://en.wikipedia.org/wiki/A._B._Jackson_(p...,14
9,10,A. J. Casson,"Alfred Joseph Casson (May 17, 1898 – February...",https://en.wikipedia.org/wiki/A._J._Casson,66


We'll save the data as it is now and then we can work with these summaries.

In [25]:
painter_summaries_clean.to_csv("data/painter_summaries_clean.csv", index=False)
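
The `summary_length` column makes it possible to filter out very short biographies later. A minimal sketch, assuming a cutoff of 20 words (a hypothetical threshold, not one prescribed by this assignment) and a made-up `toy_df`:

```python
import pandas as pd

# Hypothetical toy data with the same summary_length column as above
toy_df = pd.DataFrame(
    {
        "painter_name": ["Painter A", "Painter B", "Painter C"],
        "summary_length": [164, 14, 27],
    }
)

MIN_WORDS = 20  # hypothetical threshold; tune it for your own analysis

# Boolean indexing keeps only rows whose summary is long enough
long_enough = toy_df[toy_df["summary_length"] >= MIN_WORDS]
print(long_enough["painter_name"].tolist())
```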


---

## 4.2. Sentence Similarity

Let's take a step back and think about where we want to end up and where we are currently. Right now, we have a dataset of different painters' biographies (with some length differences). We want to end up with a visual of the painters clustered based on their biographies.

We could manually take each biography, interpret the text, and try to group the painters ourselves. In some cases, we might group painters by their nationality (e.g., Dutch painters), their style (e.g., Surrealist painters), their subject matter (e.g., still life painters), or the period they lived in (e.g., Renaissance painters). 

<div class="alert alert-info">
<b>Instruction:</b> How many painter biographies would you go through before getting bored or burning out?
</div>

YOUR ANSWER HERE

We can use machine learning to assist us in clustering these biographies by comparing how similar or different the summaries are. This task is also known as Sentence Similarity and you can read more about it here: [https://huggingface.co/tasks/sentence-similarity](https://huggingface.co/tasks/sentence-similarity). 

For now we will play a bit with the widget on the page. First let's get a series of painter summaries to work with. I picked names that might have some obvious groupings so we can do sanity checks as we work.

In [27]:
select_painter_names = [
    "Albrecht Dürer",
    "Leonardo da Vinci",
    "Michelangelo",
    "Raphael",
    "Titian",
    "Joaquín Sorolla",
    "Pablo Picasso",
    "Salvador Dalí",
    "Andy Warhol",
    "Vincent van Gogh",
    "Johannes Vermeer",
    "Sandro Botticelli",
    "Hokusai",
]

select_painter_bios = painter_summaries_clean[
    painter_summaries_clean["painter_name"].isin(select_painter_names)
]

# For this short dataset, we don't care about the other columns.
select_painter_bios = select_painter_bios[["painter_name", "summary"]]
select_painter_bios

Unnamed: 0,painter_name,summary
116,Albrecht Dürer,Albrecht Dürer (; German: [ˈʔalbʁɛçt ˈdyːʁɐ]; ...
255,Andy Warhol,Andy Warhol (; born Andrew Warhola Jr.; August...
1559,Hokusai,"Katsushika Hokusai (葛飾 北斎, c. 31 October 1760 ..."
1936,Joaquín Sorolla,Joaquín Sorolla y Bastida (Valencian: Joaquim ...
1975,Johannes Vermeer,"Johannes Vermeer (, Dutch: [vərˈmeːr], see bel..."
2375,Leonardo da Vinci,Leonardo di ser Piero da Vinci (15 April 1452 ...
2685,Michelangelo,Michelangelo di Lodovico Buonarroti Simoni (It...
2874,Pablo Picasso,Pablo Ruiz Picasso (25 October 1881 – 8 April ...
3062,Raphael,Raffaello Sanzio da Urbino (Italian: [raffaˈɛl...
3263,Salvador Dalí,Salvador Domingo Felipe Jacinto Dalí i Domènec...


In [29]:
select_painter_names2 = [
    "Albrecht Dürer",
    "Leonardo da Vinci",
    "Michelangelo",
    "Raphael",
    "Titian",
    "Joaquín Sorolla",
    "Pablo Picasso",
    "Salvador Dalí",
    "Andy Warhol",
    "Vincent van Gogh",
    "Johannes Vermeer",
    "Sandro Botticelli",
    "Hokusai",
    "Frida Kahlo",
    "Agnes Lawrence Pelton",
    "Agnes Martin",
    "Alison Debenham",
    "Alison Geissler",
    "Alison Kinnaird",
    "Alison Watt (Scottish painter)",
    "Caro Niederer",
    "Cecilia Beaux",
    "Cecily Brown",
    "Celia Fiennes (artist)",
    "Celia Frances Bedford"
]

select_painter_bios = painter_summaries_clean[
    painter_summaries_clean["painter_name"].isin(select_painter_names2)
]

# For this short dataset, we don't care about the other columns.
select_painter_bios = select_painter_bios[["painter_name", "summary"]]
select_painter_bios

Unnamed: 0,painter_name,summary
74,Agnes Lawrence Pelton,"Agnes Lawrence Pelton (August 22, 1881 – March..."
75,Agnes Martin,"Agnes Bernice Martin (March 22, 1912 – Decemb..."
116,Albrecht Dürer,Albrecht Dürer (; German: [ˈʔalbʁɛçt ˈdyːʁɐ]; ...
182,Alison Debenham,Alison Edith Debenham (later Le Plat; 1903–196...
183,Alison Geissler,"Alison Cornwall Geissler MBE, née McDonald (13..."
184,Alison Kinnaird,"Alison Kinnaird MBE, MA, FGE (born 30 April 19..."
185,Alison Watt (Scottish painter),Alison Watt OBE FRSE RSA (born 1965) is a Bri...
255,Andy Warhol,Andy Warhol (; born Andrew Warhola Jr.; August...
539,Caro Niederer,Caro Niederer (born 1963 in Zürich) is a conte...
556,Cecilia Beaux,"Eliza Cecilia Beaux (May 1, 1855 – September 1..."


<div class="alert alert-info">
<b>Instruction:</b> Cluster the 13 painters based on what you already know or can quickly read about them.
</div>

YOUR ANSWER HERE

Now let's play with the sentence similarity widget on Hugging Face. For that we need the full summaries for each painter. I will save the previous table to a CSV for faster copy+paste, but you can also use the Python code under that to get the bio for a particular artist.

In [30]:
select_painter_bios.to_csv("data/select_painter_bios.csv", index=False)

In [34]:
painter_name = "Leonardo da Vinci"
select_painter_bios[select_painter_bios["painter_name"] == painter_name]["summary"].values[0]

"Leonardo di ser Piero da Vinci (15 April 1452 – 2 May 1519) was an Italian polymath of the High Renaissance who was active as a painter, draughtsman, engineer, scientist, theorist, sculptor, and architect. While his fame initially rested on his achievements as a painter, he has also become known for his notebooks, in which he made drawings and notes on a variety of subjects, including anatomy, astronomy, botany, cartography, painting, and palaeontology. Leonardo is widely regarded to have been a genius who epitomised the Renaissance humanist ideal, and his collective works comprise a contribution to later generations of artists matched only by that of his younger contemporary Michelangelo.\r\nBorn out of wedlock to a successful notary and a lower-class woman in, or near, Vinci, he was educated in Florence by the Italian painter and sculptor Andrea del Verrocchio. He began his career in the city, but then spent much time in the service of Ludovico Sforza in Milan. Later, he worked in F

<div class="alert alert-info">
<b>Instruction:</b> Pick 5 painters from our test set. Put their bios in the <a href="https://huggingface.co/tasks/sentence-similarity">Sentence Similarity demo</a> and write down the values. Then add your interpretation of the values. Are they high or low? Why might that be? Fill in the table below:
</div>

YOUR ANSWER HERE

This Sentence Similarity demo is quite cool. It takes each summary and converts it into an **embedding**, a numerical vector representation of the text that does a good job of capturing the semantics of the text. This is the part connected to machine learning. In the demo, the pre-trained model `all-MiniLM-L6-v2` is used to compute the embeddings. We'll work with this same model below.

Once all the embeddings are computed, the rest is math. The demo takes the source embedding (whichever artist you entered first) and compares it with each of the other embeddings, one pair at a time. There are many ways to compute similarity, and a popular one is cosine similarity. For each pair that is compared, say *source_painter* and *painter_1*, it produces a score. Mathematically, cosine similarity ranges from -1 to 1; for text embeddings like these, scores usually fall between 0 and 1, where values near 0 mean there is little similarity and 1 means the two embeddings point in exactly the same direction. There is some info on the demo page linked above, reproduced here:
> The similarity of the embeddings is evaluated mainly on cosine similarity. It is calculated as the cosine of the angle between two vectors. It is particularly useful when your texts are not the same length.

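To make the quoted definition concrete, cosine similarity is $\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$. The sketch below computes it with NumPy on tiny made-up vectors; the helper name `cosine_similarity` is ours and is not a function from the demo:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel, different magnitudes
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # pointing opposite ways
print(same_direction, orthogonal, opposite)
```

The first pair illustrates that only the direction of the vectors matters, not their magnitude.
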
---

## 4.3. Visualizing the `select_painter_bios`

### Create embeddings
The first step to being able to cluster and visualize the painters is to compute the embeddings. We will store them as an extra column in the `select_painter_bios` dataframe.

In [47]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

select_painter_bios["embeddings"] = select_painter_bios["summary"].apply(
    lambda x: model.encode(x).tolist()
)
select_painter_bios



Unnamed: 0,painter_name,summary,embeddings,embeddings2,umap1_x,umap1_y
74,Agnes Lawrence Pelton,"Agnes Lawrence Pelton (August 22, 1881 – March...","[0.00925831962376833, 0.020879311487078667, -0...","[-0.3555273413658142, 0.6258305907249451, -0.3...",10.906144,3.149799
75,Agnes Martin,"Agnes Bernice Martin (March 22, 1912 – Decemb...","[0.04175714775919914, 0.08346845954656601, 0.0...","[-0.05190335959196091, 0.01991790533065796, -0...",11.422478,3.53962
116,Albrecht Dürer,Albrecht Dürer (; German: [ˈʔalbʁɛçt ˈdyːʁɐ]; ...,"[-0.056997984647750854, 0.024535084143280983, ...","[-0.5610215663909912, 0.7210466861724854, -0.3...",7.799601,-4.351664
182,Alison Debenham,Alison Edith Debenham (later Le Plat; 1903–196...,"[-0.06023209169507027, 0.015084055252373219, 0...","[0.3454815149307251, 0.4316037893295288, -0.27...",10.954934,1.849854
183,Alison Geissler,"Alison Cornwall Geissler MBE, née McDonald (13...","[-0.04076974838972092, -0.009458957239985466, ...","[-0.6484456062316895, 0.4758595824241638, 0.13...",11.779065,1.953383
184,Alison Kinnaird,"Alison Kinnaird MBE, MA, FGE (born 30 April 19...","[0.02746589109301567, -0.022002195939421654, -...","[-0.15582501888275146, 0.8702155351638794, 0.2...",11.899119,1.458376
185,Alison Watt (Scottish painter),Alison Watt OBE FRSE RSA (born 1965) is a Bri...,"[-0.07076051831245422, -0.017952891066670418, ...","[-0.16642820835113525, 0.9753232598304749, 0.3...",11.289492,1.380466
255,Andy Warhol,Andy Warhol (; born Andrew Warhola Jr.; August...,"[0.0025709313340485096, -0.08942723274230957, ...","[-0.14185909926891327, 0.6947842836380005, 0.0...",5.964389,-2.551351
539,Caro Niederer,Caro Niederer (born 1963 in Zürich) is a conte...,"[-0.06210412457585335, 0.05394889414310455, 0....","[-0.26326656341552734, 0.21122246980667114, 0....",6.073942,-2.057461
556,Cecilia Beaux,"Eliza Cecilia Beaux (May 1, 1855 – September 1...","[-0.00541235925629735, 0.005950777791440487, 0...","[-0.36029744148254395, 0.4289925992488861, -0....",11.97283,3.287737


<div class="alert alert-info">
<b>Instruction:</b> Save the dataframe with the embeddings as <em>select_painter_embeddings.csv</em>
</div>

In [37]:
select_painter_bios.to_csv("data/select_painter_embeddings.csv", index=False)
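
One thing to keep in mind when saving embeddings this way: a CSV file stores each list as text, so after reading the file back the `embeddings` column contains strings rather than lists. The sketch below shows one possible round trip using `json.loads`; the `toy_df` dataframe is made up and an in-memory buffer stands in for the file on disk:

```python
import io
import json

import pandas as pd

# Hypothetical toy data with a short, made-up embedding
toy_df = pd.DataFrame(
    {"painter_name": ["Painter A"], "embeddings": [[0.25, -0.5, 1.0]]}
)

# Writing to CSV turns each list into its text form, e.g. "[0.25, -0.5, 1.0]"
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, the column holds strings; json.loads rebuilds the lists
restored = pd.read_csv(buffer)
restored["embeddings"] = restored["embeddings"].apply(json.loads)
print(restored["embeddings"][0])
```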

We have created our embeddings using one specific model, `all-MiniLM-L6-v2`. It is one of many models we can use. See a full list here: [https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers).

<div class="alert alert-info">
<b>Instruction:</b> Pick a model from the link above and create a new set of embeddings. Name that new column <em>embeddings2</em>. Fill in the table below, and save the data for that run as well.
</div>

YOUR ANSWER HERE

In [38]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer("sentence-transformers/nli-bert-base")

select_painter_bios["embeddings2"] = select_painter_bios["summary"].apply(
    lambda x: model.encode(x).tolist()
)
select_painter_bios

Unnamed: 0,painter_name,summary,embeddings,embeddings2
74,Agnes Lawrence Pelton,"Agnes Lawrence Pelton (August 22, 1881 – March...","[0.00925831962376833, 0.020879311487078667, -0...","[-0.3555273413658142, 0.6258305907249451, -0.3..."
75,Agnes Martin,"Agnes Bernice Martin (March 22, 1912 – Decemb...","[0.04175714775919914, 0.08346845954656601, 0.0...","[-0.05190335959196091, 0.01991790533065796, -0..."
116,Albrecht Dürer,Albrecht Dürer (; German: [ˈʔalbʁɛçt ˈdyːʁɐ]; ...,"[-0.056997984647750854, 0.024535084143280983, ...","[-0.5610215663909912, 0.7210466861724854, -0.3..."
182,Alison Debenham,Alison Edith Debenham (later Le Plat; 1903–196...,"[-0.06023209169507027, 0.015084055252373219, 0...","[0.3454815149307251, 0.4316037893295288, -0.27..."
183,Alison Geissler,"Alison Cornwall Geissler MBE, née McDonald (13...","[-0.04076974838972092, -0.009458957239985466, ...","[-0.6484456062316895, 0.4758595824241638, 0.13..."
184,Alison Kinnaird,"Alison Kinnaird MBE, MA, FGE (born 30 April 19...","[0.02746589109301567, -0.022002195939421654, -...","[-0.15582501888275146, 0.8702155351638794, 0.2..."
185,Alison Watt (Scottish painter),Alison Watt OBE FRSE RSA (born 1965) is a Bri...,"[-0.07076051831245422, -0.017952891066670418, ...","[-0.16642820835113525, 0.9753232598304749, 0.3..."
255,Andy Warhol,Andy Warhol (; born Andrew Warhola Jr.; August...,"[0.0025709313340485096, -0.08942723274230957, ...","[-0.14185909926891327, 0.6947842836380005, 0.0..."
539,Caro Niederer,Caro Niederer (born 1963 in Zürich) is a conte...,"[-0.06210412457585335, 0.05394889414310455, 0....","[-0.26326656341552734, 0.21122246980667114, 0...."
556,Cecilia Beaux,"Eliza Cecilia Beaux (May 1, 1855 – September 1...","[-0.00541235925629735, 0.005950777791440487, 0...","[-0.36029744148254395, 0.4289925992488861, -0...."


### Reducing dimensions

Now that we have at least one set of embeddings, we can work to visualize them. This is the embedding using the first model for Vincent van Gogh:

In [41]:
van_gogh = select_painter_bios[select_painter_bios["painter_name"] == "Vincent van Gogh"]["embeddings"].values[0]

van_gogh

[0.08493974059820175,
 0.03074062615633011,
 0.01686738058924675,
 0.016839053481817245,
 0.05989311635494232,
 0.03570970147848129,
 0.03619306907057762,
 0.020847732201218605,
 -0.05887269973754883,
 -0.05923779308795929,
 -0.03598632290959358,
 -0.024541286751627922,
 0.015377012081444263,
 -0.00451229652389884,
 0.02243201620876789,
 0.06546244770288467,
 -0.011373251676559448,
 0.009499303065240383,
 -0.022144265472888947,
 0.01504893135279417,
 -0.06678536534309387,
 0.005186570808291435,
 -0.006639833562076092,
 -0.1190100833773613,
 0.016290709376335144,
 0.013186655938625336,
 0.037642695009708405,
 -0.11208225786685944,
 0.01721331477165222,
 0.01671992801129818,
 0.007216630503535271,
 0.03301443159580231,
 -0.07266112416982651,
 -0.016496745869517326,
 -0.020407019183039665,
 -0.031391970813274384,
 -0.035718824714422226,
 0.09690754860639572,
 -0.02765747159719467,
 0.048770707100629807,
 -0.014674311503767967,
 -0.02429278753697872,
 -0.11335866153240204,
 -0.024795353412

<div class="alert alert-info">
<b>Instruction:</b> How long is this vector?
</div>

In [42]:
len(van_gogh)

384

This embedding has 384 components. It would be very difficult to visualize all 384 dimensions of this vector directly in a way that is interpretable. We are better off reducing these 384 dimensions to 2 or 3 (using 1 dimension might be too simplistic). This process of taking a large number of dimensions and reducing them to fewer dimensions is also known as projection.

The technique that we will use is called [UMAP](https://umap-learn.readthedocs.io/en/latest/), or Uniform Manifold Approximation and Projection for Dimension Reduction. There are others, like SNE and t-SNE, that are worth looking into.
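
To build intuition for what a projection does, here is a sketch that uses PCA, a simpler linear technique, instead of UMAP. It is not the method used in this notebook; it only shows the idea of mapping 384 components down to 2. The `fake_embeddings` array is random, made-up data standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for 25 embeddings with 384 components each
fake_embeddings = rng.normal(size=(25, 384))

# PCA via SVD: center the data, then keep the two directions of largest variance
centered = fake_embeddings - fake_embeddings.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
projected_2d = centered @ components[:2].T

print(projected_2d.shape)
```

Each row of `projected_2d` is one point that can be drawn on a 2D plot. UMAP produces an array of the same shape, but it tries to preserve neighborhoods rather than overall variance.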

In [48]:
import umap

umap_model = umap.UMAP(n_components=2, n_neighbors=5, min_dist=0.3, metric="cosine")
embeddings = select_painter_bios["embeddings"].tolist()
embedded_data_2d = umap_model.fit_transform(embeddings)
embedded_data_2d

array([[ 5.0519643 ,  3.5796976 ],
       [ 5.484154  ,  3.0990794 ],
       [ 4.7706885 ,  0.09465224],
       [ 4.193799  ,  4.233761  ],
       [ 3.4740229 ,  3.6880765 ],
       [ 3.1062825 ,  3.9613674 ],
       [ 3.5933201 ,  4.3301787 ],
       [ 8.264462  ,  3.9351776 ],
       [ 8.696705  ,  4.0380235 ],
       [ 5.1805325 ,  2.498256  ],
       [ 5.205619  ,  4.114147  ],
       [ 3.9598088 ,  2.765612  ],
       [ 4.6344633 ,  2.8937194 ],
       [ 9.4224205 ,  3.7682912 ],
       [ 4.0952725 , -1.6702468 ],
       [10.056022  ,  3.7298973 ],
       [ 7.5145636 ,  3.142597  ],
       [ 4.337919  , -0.96336174],
       [ 3.7503948 , -1.2492996 ],
       [ 9.147716  ,  3.0453088 ],
       [ 4.103911  , -0.4424793 ],
       [ 9.662557  ,  3.1186569 ],
       [ 3.1610255 , -0.86426497],
       [ 3.4506395 , -0.4267401 ],
       [ 8.165937  ,  2.9829137 ]], dtype=float32)

What we have done is use the UMAP technique to project all 384 dimensions of the original embedding into 2 dimensions that we can now visualize.

Each of the parameters in `umap.UMAP()` can affect our output:
* `n_components`: This parameter controls the dimensionality of the reduction. We set it to 2 because we want to end up with 2 components in the end (that we can visualize).
* `n_neighbors`: This parameter tweaks how UMAP balances local vs global patterns. Play around with this if your visualization later looks off.
* `min_dist`: This parameter controls how tightly points can be packed together in the reduced space.
* `metric`: This parameter sets how the distance between two embeddings is measured. We use `"cosine"` to match the similarity measure discussed above.

You can read more about these parameters, and see some visuals of how they affect the output at the UMAP website [here](https://umap-learn.readthedocs.io/en/latest/parameters.html).

Let's add those dimensions to our dataframe. We'll name these new columns `umap1_x` and `umap1_y` because we're using the first set of embeddings that were created using the `all-MiniLM-L6-v2` model.

In [49]:
select_painter_bios["umap1_x"] = embedded_data_2d[:, 0]
select_painter_bios["umap1_y"] = embedded_data_2d[:, 1]

select_painter_bios

Unnamed: 0,painter_name,summary,embeddings,embeddings2,umap1_x,umap1_y
74,Agnes Lawrence Pelton,"Agnes Lawrence Pelton (August 22, 1881 – March...","[0.00925831962376833, 0.020879311487078667, -0...","[-0.3555273413658142, 0.6258305907249451, -0.3...",5.051964,3.579698
75,Agnes Martin,"Agnes Bernice Martin (March 22, 1912 – Decemb...","[0.04175714775919914, 0.08346845954656601, 0.0...","[-0.05190335959196091, 0.01991790533065796, -0...",5.484154,3.099079
116,Albrecht Dürer,Albrecht Dürer (; German: [ˈʔalbʁɛçt ˈdyːʁɐ]; ...,"[-0.056997984647750854, 0.024535084143280983, ...","[-0.5610215663909912, 0.7210466861724854, -0.3...",4.770689,0.094652
182,Alison Debenham,Alison Edith Debenham (later Le Plat; 1903–196...,"[-0.06023209169507027, 0.015084055252373219, 0...","[0.3454815149307251, 0.4316037893295288, -0.27...",4.193799,4.233761
183,Alison Geissler,"Alison Cornwall Geissler MBE, née McDonald (13...","[-0.04076974838972092, -0.009458957239985466, ...","[-0.6484456062316895, 0.4758595824241638, 0.13...",3.474023,3.688076
184,Alison Kinnaird,"Alison Kinnaird MBE, MA, FGE (born 30 April 19...","[0.02746589109301567, -0.022002195939421654, -...","[-0.15582501888275146, 0.8702155351638794, 0.2...",3.106282,3.961367
185,Alison Watt (Scottish painter),Alison Watt OBE FRSE RSA (born 1965) is a Bri...,"[-0.07076051831245422, -0.017952891066670418, ...","[-0.16642820835113525, 0.9753232598304749, 0.3...",3.59332,4.330179
255,Andy Warhol,Andy Warhol (; born Andrew Warhola Jr.; August...,"[0.0025709313340485096, -0.08942723274230957, ...","[-0.14185909926891327, 0.6947842836380005, 0.0...",8.264462,3.935178
539,Caro Niederer,Caro Niederer (born 1963 in Zürich) is a conte...,"[-0.06210412457585335, 0.05394889414310455, 0....","[-0.26326656341552734, 0.21122246980667114, 0....",8.696705,4.038023
556,Cecilia Beaux,"Eliza Cecilia Beaux (May 1, 1855 – September 1...","[-0.00541235925629735, 0.005950777791440487, 0...","[-0.36029744148254395, 0.4289925992488861, -0....",5.180532,2.498256


### Scatterplot visualization

Now that we have reduced the 384-component embeddings to 2 dimensions, let's visualize them using a scatterplot.

In [50]:
import plotly.express as px

# Create a scatter plot with Plotly
fig = px.scatter(select_painter_bios, x="umap1_x", y="umap1_y", hover_data=["painter_name"], width=800, height=800)

# Show the plot
fig.show()


<div class="alert alert-info">
<b>Instruction:</b> How do you interpret your figure?
</div>

YOUR ANSWER HERE

---

## 4.5 - Visualizing the embeddings for the model you chose

Now that you have visualized the first embedding using the `all-MiniLM-L6-v2` model, do it for the model you chose. Feel free to reuse code that is above, but be sure to write comments and notes explaining your process.

In [53]:
# Project the second set of embeddings (from the model you chose) down to 2 dimensions
umap_model = umap.UMAP(n_components=2, n_neighbors=5, min_dist=0.1, metric="cosine")
embeddings = select_painter_bios["embeddings2"].tolist()
embedded_data_2d = umap_model.fit_transform(embeddings)

# Store the coordinates under new column names so the first projection is kept
select_painter_bios["umap2_x"] = embedded_data_2d[:, 0]
select_painter_bios["umap2_y"] = embedded_data_2d[:, 1]

# Create a scatter plot with Plotly
fig = px.scatter(
    select_painter_bios,
    x="umap2_x",
    y="umap2_y",
    hover_data=["painter_name"],
    width=800,
    height=800,
)

fig.update_traces(marker=dict(size=30))
# Show the plot
fig.show()

<div class="alert alert-info">
<b>Instruction:</b> Use the space below to interpret the final visualization of your embedding. How does it compare with the previous visual?
</div>

YOUR ANSWER HERE

---

## 4.6 - Visualizing the entire dataset

Taking all the tools from above, visualize the entire dataset of artists. The code may take longer to run, but the process is still the same.

1. Compute embeddings using one of the models from [this page](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers)
1. Reduce dimensions using UMAP
1. Plot the result.



### PART 1 - Show dataset

In [54]:
painter_summaries_clean.head(15)

Unnamed: 0,painter_id,painter_name,summary,url,summary_length
0,1,Alfred Richard Gurrey Sr.,Alfred Richard Gurrey Sr. (1852–1944) was an ...,https://en.wikipedia.org/wiki/Alfred_Richard_G...,164
1,2,Edward Otho Cresap Ord II,"Edward Otho Cresap Ord, II (November 9, 1858 –...",https://en.wikipedia.org/wiki/Edward_Otho_Cres...,84
2,3,George Barret Jr.,"George Barret Jr. (1767–1842), sometimes refer...",https://en.wikipedia.org/wiki/George_Barret_Jr.,27
3,4,George Barret Sr.,George Barret Sr. (c. 1730 – 29 May 1784) was...,https://en.wikipedia.org/wiki/George_Barret_Sr.,263
4,5,Henry Ives Cobb Jr.,"Henry Ives Cobb Jr. (March 24, 1883 – August 1...",https://en.wikipedia.org/wiki/Henry_Ives_Cobb_Jr.,71
5,6,John Byrne (English artist),John Byrne (1786–1847) was an English painter ...,https://en.wikipedia.org/wiki/John_Byrne_(Engl...,29
6,7,John Frederick Herring Jr.,John Frederick Herring Jr. (1820–1907) was an ...,https://en.wikipedia.org/wiki/John_Frederick_H...,17
7,8,John Frederick Herring Sr.,John Frederick Herring Sr. (12 September 1795 ...,https://en.wikipedia.org/wiki/John_Frederick_H...,61
8,9,A. B. Jackson (painter),"Alexander Brooks Jackson (April 18, 1925 – Mar...",https://en.wikipedia.org/wiki/A._B._Jackson_(p...,14
9,10,A. J. Casson,"Alfred Joseph Casson (May 17, 1898 – February...",https://en.wikipedia.org/wiki/A._J._Casson,66


### PART 2 - Compute Embeddings

In [55]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer("msmarco-distilbert-cos-v5")

painter_summaries_clean["embeddings"] = painter_summaries_clean["summary"].apply(
    lambda x: model.encode(x).tolist()
)
painter_summaries_clean.head(10)


Unnamed: 0,painter_id,painter_name,summary,url,summary_length,embeddings
0,1,Alfred Richard Gurrey Sr.,Alfred Richard Gurrey Sr. (1852–1944) was an ...,https://en.wikipedia.org/wiki/Alfred_Richard_G...,164,"[-0.017916500568389893, 0.016814740374684334, ..."
1,2,Edward Otho Cresap Ord II,"Edward Otho Cresap Ord, II (November 9, 1858 –...",https://en.wikipedia.org/wiki/Edward_Otho_Cres...,84,"[0.04755578190088272, 0.06714501231908798, -0...."
2,3,George Barret Jr.,"George Barret Jr. (1767–1842), sometimes refer...",https://en.wikipedia.org/wiki/George_Barret_Jr.,27,"[0.046365100890398026, 0.04546622931957245, 0...."
3,4,George Barret Sr.,George Barret Sr. (c. 1730 – 29 May 1784) was...,https://en.wikipedia.org/wiki/George_Barret_Sr.,263,"[0.05642620474100113, 0.025986645370721817, -0..."
4,5,Henry Ives Cobb Jr.,"Henry Ives Cobb Jr. (March 24, 1883 – August 1...",https://en.wikipedia.org/wiki/Henry_Ives_Cobb_Jr.,71,"[0.03517553582787514, -0.013261149637401104, -..."
5,6,John Byrne (English artist),John Byrne (1786–1847) was an English painter ...,https://en.wikipedia.org/wiki/John_Byrne_(Engl...,29,"[0.05095544457435608, 0.028257250785827637, -0..."
6,7,John Frederick Herring Jr.,John Frederick Herring Jr. (1820–1907) was an ...,https://en.wikipedia.org/wiki/John_Frederick_H...,17,"[0.029810236766934395, -0.017383666709065437, ..."
7,8,John Frederick Herring Sr.,John Frederick Herring Sr. (12 September 1795 ...,https://en.wikipedia.org/wiki/John_Frederick_H...,61,"[0.028860753402113914, 0.044107433408498764, -..."
8,9,A. B. Jackson (painter),"Alexander Brooks Jackson (April 18, 1925 – Mar...",https://en.wikipedia.org/wiki/A._B._Jackson_(p...,14,"[-0.004729640204459429, 0.029076918959617615, ..."
9,10,A. J. Casson,"Alfred Joseph Casson (May 17, 1898 – February...",https://en.wikipedia.org/wiki/A._J._Casson,66,"[-0.053276173770427704, -0.010091795586049557,..."


In [56]:
# As a sanity check, let's see the length of the embeddings for Vincent van Gogh.

van_gogh = painter_summaries_clean[
    painter_summaries_clean["painter_name"] == "Vincent van Gogh"
]["embeddings"].values[0]
len(van_gogh)

768
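
Each summary is now a vector of 768 numbers, and two biographies can be compared by comparing their vectors. The following is a minimal sketch, assuming cosine similarity is the measure of interest (the same metric is passed to UMAP in Part 3). The helper `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of the assignment data; in the notebook you would pass two entries of the `embeddings` column instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins for the real 768-dimensional embeddings.
painter_a = [1.0, 0.0, 1.0]
painter_b = [1.0, 0.0, 1.0]  # same direction as painter_a
painter_c = [0.0, 1.0, 0.0]  # orthogonal to painter_a

same = cosine_similarity(painter_a, painter_b)
different = cosine_similarity(painter_a, painter_c)
print(same, different)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate unrelated directions.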

### PART 3 - Reduce Dimensions

In [57]:
import umap

# Project the 768-dimensional embeddings down to 2D for plotting
umap_model = umap.UMAP(n_components=2, n_neighbors=16, min_dist=0.4, metric="cosine")
embeddings = painter_summaries_clean["embeddings"].tolist()
embedded_data_2d = umap_model.fit_transform(embeddings)

# Store the 2D coordinates alongside each painter
painter_summaries_clean["umap_x"] = embedded_data_2d[:, 0]
painter_summaries_clean["umap_y"] = embedded_data_2d[:, 1]





In [65]:
import plotly.express as px

# Create a scatter plot with Plotly
fig = px.scatter(
    painter_summaries_clean,
    x="umap_x",
    y="umap_y",
    hover_data=["painter_name"],
    width=1600,
    height=1600,
)

fig.update_traces(marker=dict(size=10))
# Show the plot
fig.show()
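
Hovering over a point shows the painter's name, but it can also help to list a painter's closest neighbors on the map programmatically. The sketch below assumes plain Euclidean distance on the 2D coordinates; the `points` dictionary and the `nearest_neighbors` helper are hypothetical stand-ins for the `umap_x` and `umap_y` columns. Distances in a UMAP projection are only a rough guide to similarity, so treat the result as exploratory.

```python
import math

# Hypothetical toy coordinates standing in for the umap_x / umap_y columns.
points = {
    "Painter A": (0.0, 0.0),
    "Painter B": (1.0, 0.0),
    "Painter C": (0.0, 3.0),
    "Painter D": (5.0, 5.0),
}


def nearest_neighbors(name, coords, k=2):
    """Return the k names closest to `name` by Euclidean distance in 2D."""
    origin = coords[name]
    distances = [
        (math.dist(origin, xy), other)
        for other, xy in coords.items()
        if other != name
    ]
    distances.sort()  # smallest distance first
    return [other for _, other in distances[:k]]


neighbors = nearest_neighbors("Painter A", points)
print(neighbors)
```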