# Visualise the Embeddings in 2D

To see how the categorisation task works, we first look at how the topic embeddings cluster. We reduce the embeddings to two dimensions so that they can be drawn on a 2D scatter plot.

In [22]:
import pandas as pd
import numpy as np
from ast import literal_eval

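# Load the topic embeddings (gathered from Cohere) and convert them into a matrix.
# As used later in this notebook: column 0 is the topic, column 1 the description,
# and column 2 the embedding stored as a stringified list.
df = pd.read_csv('./topic_embeddings.csv', sep=',', header=None)
matrix = np.array(df[2].apply(literal_eval).to_list())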
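
The loading step above relies on each embedding being stored in the CSV as a stringified list. As a minimal sketch of that parsing, the example below uses an in-memory CSV with three made-up rows and three-dimensional vectors. The rows, the `Sport` topic, and the names `sample_csv`, `sample_df`, and `sample_matrix` are assumptions for illustration; they are not taken from the real file.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical sample rows: topic, description, stringified embedding.
# Real embeddings have far more than three dimensions.
sample_csv = io.StringIO(
    'Climate Science,this is about sea level rise,"[0.1, 0.2, 0.3]"\n'
    'Climate policy,this is about carbon taxes,"[0.2, 0.1, 0.4]"\n'
    'Sport,this is about football,"[0.9, 0.8, 0.1]"\n'
)

# header=None gives integer column names 0, 1 and 2.
sample_df = pd.read_csv(sample_csv, sep=",", header=None)

# Each cell in column 2 is a string such as "[0.1, 0.2, 0.3]";
# literal_eval turns it back into a Python list of floats.
sample_matrix = np.array(sample_df[2].apply(literal_eval).to_list())

print(sample_matrix.shape)  # (number of rows, embedding dimensions)
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing values read from a file.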


Once the embeddings are in a NumPy matrix, we reduce their dimensionality to 2 so that we can visualise them. We do this using Uniform Manifold Approximation and Projection (UMAP), an unsupervised dimensionality-reduction technique.

In [None]:
import umap.umap_ as umap

umap_instance = umap.UMAP(n_components=2, n_neighbors=10, random_state=42)
vis_dims = umap_instance.fit_transform(matrix)
vis_dims.shape

We can then visualise this in a two-dimensional space and see how related topics cluster together. This is a simplified representation of the data: in reality, many of these topics overlap across multiple dimensions. Even so, the simplified view lets us see which topics share similar themes.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Assign one colour per topic. zip() stops at the shorter list, so `colors`
# needs at least as many entries as there are unique topics.
colors = ["red", "darkorange", "gold", "turquoise", "darkgreen", "yellow", "black"]
topic_names = df[0].unique().tolist()
topic_to_color = dict(zip(topic_names, colors))

# Split the 2D coordinates into x and y series.
xs = vis_dims[:, 0]
ys = vis_dims[:, 1]

fig, ax = plt.subplots()
ax.scatter(xs, ys, color=[topic_to_color[t] for t in df[0]], alpha=0.5)
ax.axis("off")

# Label each point with its description, minus the "this is about " prefix.
for idx, (px, py) in enumerate(vis_dims):
    tag = df[1][idx].replace("this is about ", "").lower()
    ax.annotate(tag, (px + 0.05, py), size=5)

handles = [mpatches.Patch(color=color, label=topic) for topic, color in topic_to_color.items()]
legend = ax.legend(handles=handles, prop={'size': 6})
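
Because the 2D plot is a simplification, it can be useful to check how much of the original neighbourhood structure survives the projection. The sketch below is one rough way to do that with NumPy alone: for each point, it compares the k nearest neighbours in the original space with the k nearest neighbours in the reduced space. The function `neighbour_overlap` and the synthetic `demo_high` and `demo_low` arrays are hypothetical stand-ins for `matrix` and `vis_dims`, and this is not part of the UMAP library.

```python
import numpy as np


def neighbour_overlap(high_dim, low_dim, k=5):
    """Mean fraction of each point's k nearest neighbours shared by both spaces."""

    def knn(points):
        # Pairwise Euclidean distances, with self-distances set to infinity
        # so that a point is never counted as its own neighbour.
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        np.fill_diagonal(dists, np.inf)
        return np.argsort(dists, axis=1)[:, :k]

    high_nn = knn(np.asarray(high_dim, dtype=float))
    low_nn = knn(np.asarray(low_dim, dtype=float))
    shared = [len(set(h) & set(l)) for h, l in zip(high_nn, low_nn)]
    return float(np.mean(shared)) / k


# Synthetic data standing in for `matrix` and `vis_dims`.
rng = np.random.default_rng(0)
demo_high = rng.normal(size=(30, 16))
demo_low = demo_high[:, :2]  # crude projection: keep the first two coordinates

score = neighbour_overlap(demo_high, demo_low, k=5)
print(f"Shared neighbours: {score:.2f}")
```

With the real data this would be called as `neighbour_overlap(matrix, vis_dims, k=5)`. A score near 1 suggests the plot keeps most local neighbourhoods intact, while a low score is a reminder not to read too much into which points sit side by side.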

We can use this projection to see where related content falls. For instance, let's look at an article published by Ars Technica:

### ***The climate is changing so fast that we haven’t seen how bad extreme weather could get: Decades-old statistics no longer represent what is possible in the present day.***

We have also embedded this story, so we can visualise where it sits on the 2D scatter plot above. Doing so reveals something about the content of the story.


In [None]:
article_df = pd.read_csv('./article_embedding.csv', header=None)

# Append the article embedding to the topic embeddings so they share one projection.
topic_list = pd.concat([df[2].apply(literal_eval), article_df[1].apply(literal_eval)])

topic_and_article_matrix = np.array(topic_list.to_list())
# Refitting UMAP on the combined data means the layout can differ from the first plot.
topic_and_article_dims = umap_instance.fit_transform(topic_and_article_matrix)

# The article is the last row; everything before it is a topic.
topic_dims = topic_and_article_dims[:-1]
article_dims = topic_and_article_dims[-1:]

# Plot all the topics again.
fig, ax = plt.subplots()
ax.scatter(topic_dims[:, 0], topic_dims[:, 1], color=[topic_to_color[t] for t in df[0]], alpha=0.5)
ax.axis("off")

for idx, (px, py) in enumerate(topic_dims):
    tag = df[1][idx].replace("this is about ", "").lower()
    ax.annotate(tag, (px + 0.05, py), size=5)

# Mark the article with a red cross.
article_x, article_y = article_dims[0]
ax.annotate("ArsTechnica Article", (article_x, article_y), size=5)
scatterplot = ax.scatter(article_x, article_y, marker='x', color='red', s=100)

# Build the legend handles and attach the legend to the plot.
handles = [mpatches.Patch(color=color, label=topic) for topic, color in topic_to_color.items()]
legend = ax.legend(handles=handles, prop={'size': 6})


Because of its subject matter, the article is plotted close to both the Climate Science and Climate policy topics. Hopefully this gives you an idea of how embeddings might be used to sort stories like this one into categories.
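
One simple way to turn this proximity into a categorisation is to rank topics by cosine similarity to the article in the original embedding space, rather than by distance on the 2D plot. The sketch below illustrates the idea with invented three-dimensional vectors. The function `nearest_topics`, the toy vectors, and the `Sport` topic are assumptions for illustration, and this is not necessarily the method used elsewhere in this notebook.

```python
import numpy as np


def nearest_topics(article_vec, topic_matrix, topic_labels, top_n=2):
    """Rank topics by cosine similarity to the article embedding."""
    article_vec = np.asarray(article_vec, dtype=float)
    topic_matrix = np.asarray(topic_matrix, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = topic_matrix @ article_vec / (
        np.linalg.norm(topic_matrix, axis=1) * np.linalg.norm(article_vec)
    )
    # Indices of the most similar topics, highest similarity first.
    order = np.argsort(sims)[::-1][:top_n]
    return [(topic_labels[i], float(sims[i])) for i in order]


# Toy vectors invented for illustration; real embeddings are much longer.
toy_labels = ["Climate Science", "Climate policy", "Sport"]
toy_topics = np.array([
    [1.0, 0.1, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.1, 1.0],
])
toy_article = np.array([0.9, 0.3, 0.05])

ranked = nearest_topics(toy_article, toy_topics, toy_labels)
print(ranked)
```

With the real data, `topic_matrix` would be `matrix`, `topic_labels` would be `df[0].tolist()`, and `article_vec` would be the parsed article embedding.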