Link to the example Git repository: https://github.com/dylanjcastillo/random/tree/main/self-organizing-news

Clustering Documents
You should think of the clustering process in three steps:

1. Generate numerical vector representations of documents using OpenAI’s embedding capabilities.
2. Apply a clustering algorithm on the vectors to group the documents.
3. Generate a title for each cluster summarizing the articles contained in it.
That’s it! Now, you’ll see how that looks in practice.

Import the Required Packages
Start by importing the required Python libraries. Copy the following code into your notebook:

In [None]:
import os

import hdbscan
import numpy as np
import pandas as pd
import plotly.express as px
import requests
from dotenv import load_dotenv
from openai import OpenAI
from umap import UMAP

load_dotenv()

In [None]:
# Load the dataset into a DataFrame
df = pd.read_csv("news_data_dedup.csv")

# Build docs: a list of strings, one per apartment listing, combining its URL and extended description

docs = [f"{url}\n{description_extendida}" for url, description_extendida in zip(df.url, df.description_extendida)]


Each element is a string made up of the URL and the description_extendida (extended description) of an apartment-for-sale listing. A list comprehension iterates over the URLs and descriptions of the listings in the DataFrame df.

The function zip(df.url, df.description_extendida) pairs each URL with its description.

The syntax f"{url}\n{description_extendida}" formats each pair into a single string, with the URL and the description separated by a newline \n.

So docs is a list of strings, each representing one listing with its URL and its extended description.
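
To see the pattern in isolation, here is a minimal sketch on two plain lists standing in for df.url and df.description_extendida. The sample URLs and descriptions are made up for illustration.

```python
# Hypothetical sample values standing in for df.url and df.description_extendida.
urls = ["https://example.com/listing/1", "https://example.com/listing/2"]
descriptions = [
    "Bright two-bedroom apartment near the park.",
    "Renovated studio with a terrace.",
]

# zip pairs the i-th URL with the i-th description;
# the f-string joins each pair with a newline.
docs = [f"{url}\n{description}" for url, description in zip(urls, descriptions)]

for doc in docs:
    print(doc)
    print("---")
```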

Then, initialize the OpenAI client and generate the embeddings:

In [None]:
client = OpenAI()
response = client.embeddings.create(input=docs, model="text-embedding-3-small")
embeddings = [np.array(x.embedding) for x in response.data]
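
With many documents, a single request may run into the API's per-request input limits. One way to handle that is to send the documents in batches. The sketch below is an illustration under stated assumptions, not part of the original notebook: embed_in_batches and embed_fn are hypothetical names, and the batch size of 100 is arbitrary. embed_fn stands for any callable that takes a list of strings and returns one vector per string.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    docs: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # arbitrary illustrative value, not an API constant
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of docs and concatenate the results in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(docs), batch_size):
        batch = list(docs[start : start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedding function so the sketch runs without network access:
# it maps each string to a one-dimensional "vector" holding its length.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


sample_docs = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(sample_docs, fake_embed, batch_size=2)
print(vectors)
```

With the real client, embed_fn could wrap the call shown above, for example a function that returns [x.embedding for x in client.embeddings.create(input=batch, model="text-embedding-3-small").data].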

Cluster Documents
Once you have the embeddings, you can cluster them using hdbscan:

In [None]:
hdb = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=3).fit(embeddings)
df["cluster"] = hdb.labels_.astype(str)  # string labels; "-1" marks noise

This code fits the HDBSCAN algorithm on the embeddings and stores the resulting cluster labels, as strings, in a new cluster column of df, which the later cells rely on. In this case, I set min_samples and min_cluster_size to 3, but the right values depend on your data. Check HDBSCAN’s documentation to learn more about these parameters.
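
Before moving on, it can help to check how many documents landed in each cluster. HDBSCAN assigns the label -1 to points it treats as noise, so those documents belong to no cluster. A minimal sketch, using a made-up list of labels in place of hdb.labels_:

```python
from collections import Counter

# Hypothetical labels standing in for hdb.labels_.tolist(); -1 marks noise points.
labels = [0, 0, 0, 1, 1, 1, 1, -1, -1, 2, 2, 2]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)  # separate noise from the real clusters
n_clusters = len(counts)

print(f"clusters: {n_clusters}, noise points: {n_noise}")
for label, size in sorted(counts.items()):
    print(f"cluster {label}: {size} documents")
```

If most points end up as noise, that is usually a hint to revisit min_samples and min_cluster_size.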

Next, you’ll visualize the clusters and then create a topic title for each one based on its contents.

Visualize the Clusters
After you’ve generated the clusters, you can visualize them using UMAP:

In [None]:
umap = UMAP(n_components=2, random_state=42, n_neighbors=80, min_dist=0.1)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=['x', 'y'])
    .assign(cluster=lambda df: hdb.labels_.astype(str))
    .query('cluster != "-1"')
    .sort_values(by='cluster')
)

fig = px.scatter(df_umap, x='x', y='y', color='cluster')
fig.show()

Create a Topic Title per Cluster
For each cluster, you’ll generate a topic title summarizing the articles in that cluster. Copy the following code to your notebook:

In [None]:
df["cluster_name"] = "Uncategorized"

def generate_topic_titles():
    system_message = "You're an expert journalist. You're helping me write short but compelling topic titles for groups of news articles."
    user_template = "Using the following articles, write a 4 to 5 word title that summarizes them.\n\nARTICLES:\n\n{}\n\nTOPIC TITLE:"

    for c in df.cluster.unique():
        sample_articles = df.query(f"cluster == '{c}'").to_dict(orient="records")
        articles_str = "\n\n".join(
            [
                f"[{i}] {article['title']}\n{article['description'][:200]}{'...' if len(article['description']) > 200 else ''}"
                for i, article in enumerate(
                    sample_articles, start=1
                )
            ]
        )
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_template.format(articles_str)},
        ]
        response = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages, temperature=0.7, seed=42
        )

        topic_title = response.choices[0].message.content
        df.loc[df.cluster == c, "cluster_name"] = topic_title


# Run the function so that df["cluster_name"] is filled in
generate_topic_titles()

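This code takes all the articles in each cluster and uses gpt-3.5-turbo to generate a relevant topic title from them. It goes through each cluster, builds a prompt from the articles in it, and stores the generated title in the cluster_name column. Note that the prompt reads the title and description columns; if your data uses other column names, such as url and description_extendida above, adjust the f-string accordingly.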
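
The prompt-building step can be isolated into a small helper, which makes it easier to check without calling the API. The sketch below is illustrative: build_articles_str, max_chars, and max_articles are hypothetical names, and capping the number of articles per prompt is an added assumption to keep prompts short for large clusters, not something the code above does.

```python
from typing import Dict, List


def build_articles_str(
    articles: List[Dict[str, str]],
    max_chars: int = 200,    # same truncation length as the code above
    max_articles: int = 20,  # hypothetical cap, not in the original code
) -> str:
    """Format articles as numbered entries with truncated descriptions."""
    entries = []
    for i, article in enumerate(articles[:max_articles], start=1):
        description = article["description"]
        suffix = "..." if len(description) > max_chars else ""
        entries.append(f"[{i}] {article['title']}\n{description[:max_chars]}{suffix}")
    return "\n\n".join(entries)


# Made-up records in the shape produced by df.to_dict(orient="records").
sample = [
    {"title": "First title", "description": "Short description."},
    {"title": "Second title", "description": "x" * 250},
]
articles_str = build_articles_str(sample)
print(articles_str)
```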

Finally, you can check the resulting clusters and topic titles, as follows:



In [None]:
c = 6  # label of the cluster to inspect
with pd.option_context("display.max_colwidth", None):
    print(df.query(f"cluster == '{c}'").cluster_name.values[0])
    display(df.query(f"cluster == '{c}'").drop(columns=["cluster_name"]).head())

In my case, running this code produces the following articles and topic titles:

