In [None]:
!which python

### News Summarizer

In this post I will try to implement a news summarizer. 

Over the past month I have been collecting a lot of news articles from major Congolese news websites. I have those articles saved in a Postgres database. There is a lot of fun stuff I can do with them, and a news summarizer is one of them. I want to analyze the daily news and find out which main stories the websites are talking about.

In this series of posts I will try to build that news summarizer. For now I will structure it as follows:
- K-means clustering
- Text Summarization with a Language Model
- Deployment to Production and Building the UI

### Data Collection

We have the data saved as text in a Postgres database. In this section we will query the database and load the data into a pandas dataframe for easier analysis. The code to connect to and read from the Postgres database lives in its own modules.

In [None]:
%load_ext dotenv

In [None]:
%dotenv ./.env_prod -o

The above line loads the database credentials so that we can query the database.

In [None]:
from src.rag.shared.database import execute_query, generate_database_connection

In [None]:
yesterday_article_query = "select content, title, posted_at,url from article where posted_at::date = CURRENT_DATE - interval '1 day'"

In [None]:
from os import getenv

In [None]:
database_user = getenv('POSTGRES_USER')
database_password = getenv('POSTGRES_PASSWORD')
database_host = getenv('POSTGRES_HOST')
database_port = getenv('POSTGRES_PORT')
database_name = getenv('POSTGRES_DB')

In [None]:
database_credentials = {
    'user': database_user,
    'password': database_password,
    'host': database_host,
    'port': database_port,
    'database': database_name
}

In [None]:
connection = generate_database_connection(database_crendentials=database_credentials)

With the credentials, the database connection, and the query, we can go ahead and query the database to retrieve the data.

In [None]:
results = execute_query(query=yesterday_article_query, database_connection=connection)

In [None]:
results[0].title

We have our results in a list; now we can put them in a dataframe for further analysis.

In [None]:
import pandas as pd

In [None]:
news_df = pd.DataFrame.from_records(results)

In [None]:
news_df.head()

In [None]:
news_df.columns = ["content", "title", "posted_at", "url"]
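
As a side note, the column names can also be passed when the dataframe is built. The sketch below assumes the rows behave like named tuples (the `ArticleRow` type and its values are made up for illustration; the real rows come from `execute_query`):

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the rows returned by execute_query.
ArticleRow = namedtuple("ArticleRow", ["content", "title", "posted_at", "url"])
example_rows = [
    ArticleRow("Body of article A", "Title A", "2024-01-01", "https://example.com/a"),
    ArticleRow("Body of article B", "Title B", "2024-01-01", "https://example.com/b"),
]

# Passing `columns` names the fields at construction time,
# so no separate assignment to `.columns` is needed.
example_df = pd.DataFrame.from_records(
    example_rows, columns=["content", "title", "posted_at", "url"]
)
print(example_df.columns.tolist())
```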

In [None]:
from pathlib import Path

In [None]:
current_directory = Path.cwd()

In [None]:
news_directory = current_directory.joinpath("datasets", "today_news")

In [None]:
news_directory.mkdir(parents=True, exist_ok=True)

In [None]:
from datetime import datetime

In [None]:
today = datetime.now().strftime("%Y-%m-%d")

In [None]:
news_df.to_csv(news_directory.joinpath(f"{today}-news.csv"))

In [None]:
news_df.head()

We now have our news dataset and need to do some preprocessing. The only preprocessing step will be to drop the articles with duplicate content.

In [None]:
news_df = news_df.drop_duplicates(subset="content").reset_index(drop=True)

In [None]:
news_df
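
To make the effect of this step concrete, here is a small sketch on made-up data (the `toy_df` frame and its URLs are hypothetical). When two rows share the same `content`, only the first one is kept, and `reset_index(drop=True)` renumbers the remaining rows from 0:

```python
import pandas as pd

# Made-up rows: the first two articles have identical content.
toy_df = pd.DataFrame(
    {
        "content": ["same text", "same text", "other text"],
        "url": [
            "https://example.com/1",
            "https://example.com/2",
            "https://example.com/3",
        ],
    }
)

# Keep the first row of each duplicated content, then renumber the index.
deduplicated_df = toy_df.drop_duplicates(subset="content").reset_index(drop=True)
print(deduplicated_df)
```

Keeping a clean 0-based index matters later, because the row positions are used to look up the matching rows in the embedding matrix.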

Once we have the dataset, we need an embedding model to learn a representation of our articles in an embedding space.

We will be using `dunzhang/stella_en_400M_v5`, a model from Hugging Face. Despite its size, it scores well on different tasks in both French and English on the MTEB leaderboard.

In [None]:
embedding_model_id = "dunzhang/stella_en_400M_v5"

In [None]:
current_directory

In [None]:
model_path  = current_directory.joinpath(embedding_model_id)

In [None]:
embedding_model_path = current_directory.joinpath("models", embedding_model_id)

In [None]:

transformer_kwargs = {"model_name_or_path": str(embedding_model_path),
                      "trust_remote_code": True,
                      "device": "cpu",
                      "config_kwargs": {"use_memory_efficient_attention": False,
                                        "unpad_inputs": False},
                      "cache_folder": model_path}

In [None]:
from sentence_transformers import SentenceTransformer
sentence_transformer_model = SentenceTransformer(**transformer_kwargs)

In [None]:
today_news_embeddings = sentence_transformer_model.encode(
    news_df.content.tolist())

In [None]:
today_news_embeddings.shape

Now we have encoded our news as embeddings: for each article we have an embedding vector of size 1024. With those embeddings we can start clustering our news.



## K-means

In this step, we will group our news embeddings into clusters using the K-means algorithm. The algorithm groups the news based on the similarity of their embedding vectors, so that after the clustering, similar articles end up in the same cluster.
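
To give an intuition of what K-means does, here is a minimal sketch of the classic Lloyd iteration in NumPy. It is only an illustration on made-up 2D points, not what scikit-learn runs internally; the function name `lloyd_kmeans` and its parameters are my own:

```python
import numpy as np


def lloyd_kmeans(points, n_clusters, n_iterations=100, seed=0):
    """Toy K-means: alternate between assigning points and moving centers."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct points picked at random.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iterations):
        # Assignment step: each point goes to its closest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array(
            [
                points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(n_clusters)
            ]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two obvious groups of made-up points.
toy_points = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
toy_labels, toy_centers = lloyd_kmeans(toy_points, n_clusters=2)
print(toy_labels)
```

In the rest of the post we rely on `sklearn.cluster.KMeans`, which adds a better initialization and several restarts on top of this basic loop.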

### How do we pick the number of clusters?

We will use the silhouette score to pick the best number of clusters.

>The Silhouette Coefficient is a measure of how well samples are clustered with samples that are similar to themselves. Clustering models with a high Silhouette Coefficient are said to be dense, where samples in the same cluster are similar to each other, and well separated, where samples in different clusters are not very similar to each other.



Given a point $x_i$ with cluster label $c_i$, we compute the silhouette score as follows:
- We compute the mean distance from $x_i$ to all the other points in cluster $c_i$ and call it $a_i$:

  ${\displaystyle a_i={\frac {1}{|C_{I}|-1}}\sum _{j\in C_{I},i\neq j}d(i,j)}$

  Note that we divide by $|C_I| - 1$ because we don't include the current point itself when computing the mean distance.

- $b_i$ measures how dissimilar the point $x_i$ in cluster $c_i$ is to the other clusters $c_j$, with $c_j \neq c_i$.

For each cluster other than $c_i$, we compute the mean distance between $x_i$ and all the points in that cluster. Then we take the cluster with the smallest mean distance as the closest cluster to $x_i$.

We define $b_i$ as:

${\displaystyle b_i=\min _{J\neq I}{\frac {1}{|C_{J}|}}\sum _{j\in C_{J}}d(i,j)}$


With $a_i$ and $b_i$, we define the silhouette score $s_i$ of the point $x_i$ as:

${\displaystyle s_i={\frac {b_i-a_i}{\max\{a_i,b_i\}}}}$

This value varies between -1 and 1. A value close to 1 means the point sits well inside its own cluster and far from the others, a value close to 0 means it lies between two clusters, and a negative value means it is probably assigned to the wrong cluster.
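
The formulas above can be checked on a tiny example. The sketch below computes $a_i$, $b_i$ and $s_i$ by hand on four made-up points and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_of_point` is my own, and it assumes every cluster has at least two points:

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_of_point(points, labels, i):
    """Silhouette score of points[i], following the a_i / b_i definitions."""
    distances = np.linalg.norm(points - points[i], axis=1)
    own_cluster = labels == labels[i]
    # a_i: mean distance to the other members of the same cluster.
    # The distance of the point to itself is 0, so summing over the whole
    # cluster and dividing by (size - 1) excludes it.
    a_i = distances[own_cluster].sum() / (own_cluster.sum() - 1)
    # b_i: smallest mean distance to the points of any other cluster.
    b_i = min(
        distances[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b_i - a_i) / max(a_i, b_i)


toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

manual_scores = np.array(
    [silhouette_of_point(toy_points, toy_labels, i) for i in range(len(toy_points))]
)
sklearn_scores = silhouette_samples(toy_points, toy_labels)
print(manual_scores)
print(sklearn_scores)
```

The `silhouette_score` function used later in this post is simply the mean of these per-point values.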

Let us write a Python function that performs the clustering and returns the estimator whose k gives us the best clustering.


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:

def find_best_estimator(X):
    """Fit K-means for each candidate k and return the estimator that maximizes the silhouette score.

    Also returns the list of silhouette scores, one per candidate k.
    """
    # Candidate values of k run from 3 to n_samples - 1: the silhouette score
    # is only defined when there are fewer clusters than samples.
    k_mean_estimators = [
        (f"KMeans_{i}", KMeans(n_clusters=i, random_state=42, max_iter=3000))
        for i in range(3, X.shape[0])
    ]
    scores = []

    best_estimator = None
    best_metric = float("-inf")
    for estimator_name, estimator in k_mean_estimators:
        estimator.fit(X)
        labels = estimator.labels_
        score = silhouette_score(X, labels)
        if score > best_metric:
            best_metric = score
            best_estimator = estimator
        print(estimator_name, score)
        scores.append(score)
    return best_estimator, scores

In [None]:
best_estimator, scores = find_best_estimator(today_news_embeddings)

In the above function we compute the silhouette score for values of k ranging from 3 to one less than the number of data points in our dataset.
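
As a sanity check of this selection strategy, here is a small sketch on synthetic data where the right answer is known. The four blobs, their centers and the range of candidate k values are all made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four tight, well-separated blobs placed on the corners of a square.
toy_X, _ = make_blobs(
    n_samples=60,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate number of clusters.
toy_scores = {}
for k in range(2, 8):
    toy_cluster_labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(toy_X)
    toy_scores[k] = silhouette_score(toy_X, toy_cluster_labels)

toy_best_k = max(toy_scores, key=toy_scores.get)
print(toy_best_k)
```

On data this clean the silhouette score should peak at the true number of blobs. Real news embeddings are much noisier, so the peak is usually less sharp.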


Let us now plot the silhouette score and see how it changes with the number of clusters.

In [None]:
import matplotlib.pyplot as plt

In [None]:
axes = plt.figure(figsize=(5, 10))

In [None]:
fig, ax = plt.subplots()
ax.plot(range(3, today_news_embeddings.shape[0]), scores)

In [None]:
best_estimator

We can see that the best estimator has 29 clusters.

In [None]:
news_df["k_means_labels"] = best_estimator.labels_

Now let us analyze the clustering results.

In [None]:
def analyse_embeddings(dataframe, embeddings, index, label_column="labels"):
    """Display the titles of the documents in cluster `index` and return their pairwise similarity.

    `embeddings` is the embedding matrix; its rows must be aligned with the rows of `dataframe`.
    """
    document_in_index = dataframe.query(f"{label_column} == {index}")
    with pd.option_context('display.max_colwidth', None):
        display(document_in_index.title)
    document_index = document_in_index.index
    vectors = embeddings[document_index]
    return sentence_transformer_model.similarity(vectors,  vectors)

In [None]:
# K-means labels run from 0 to n_clusters - 1, so 28 is the last of the 29 clusters.
analyse_embeddings(news_df, today_news_embeddings, 28, label_column="k_means_labels")

A first look at the results shows that they are good: we have 29 news clusters for 72 articles.
Some clusters have only one article, which makes sense, and others have up to 6 articles. In the later analysis we will only keep the clusters that have more than one article.

Can we do better than that? Let us now try hierarchical clustering.

## Hierarchical Clustering

Hierarchical clustering is a clustering method that iteratively builds a hierarchy of clusters, which can be drawn as a dendrogram.


**How do we build a dendrogram?**

Assume we have `n` points that we would like to cluster and a similarity metric to compare them.
In the first step, the `n` points form `n` clusters, as each observation is treated as its own cluster.
Then, the two most similar clusters are fused into one cluster; at this point, we have `n-1` clusters.
The algorithm proceeds iteratively, fusing the two most similar clusters at each step, until only one cluster is left.
With one cluster, our dendrogram is complete.
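
These merge steps can be inspected directly with scipy's `linkage` function. The sketch below runs it on five made-up 1D points with the Euclidean distance, just to show the shape of the output:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up points on a line: two close pairs and one outlier.
toy_points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

toy_mergings = linkage(toy_points, method="complete", metric="euclidean")

# One row per merge, so n - 1 rows for n points. Each row holds:
# [index of cluster A, index of cluster B, distance between them, size of the new cluster]
print(toy_mergings)
```

The last row describes the final merge, where all the points end up in a single cluster.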

**How do we compute similarity between clusters?**

We have a notion of similarity between two points, but how do we compute the similarity between a point and a cluster, or between two clusters?
The notion of similarity between two points can be extended into the notion of `linkage`, which is how we evaluate the dissimilarity between two groups of observations, or clusters.


The linkage between two clusters is computed as follows.

All linkage metrics start by computing the pairwise dissimilarities between the observations in cluster A and those in cluster B.

The linkage metric is then defined by how we compute the overall dissimilarity from those pairwise dissimilarities.

The linkage is called:

- __complete__: When the overall dissimilarity is the largest of the pairwise dissimilarities.

- __single__: When the overall dissimilarity is the smallest of the pairwise dissimilarities.

- __average__: When the overall dissimilarity is the average of the pairwise dissimilarities.
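
The three linkage types can be computed by hand on a small example. The sketch below uses two made-up clusters of 2D points and the Euclidean distance as the pairwise dissimilarity:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters with two points each.
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Pairwise dissimilarities: one row per point in A, one column per point in B.
pairwise = cdist(cluster_a, cluster_b, metric="euclidean")

toy_linkages = {
    "complete": pairwise.max(),   # largest pairwise dissimilarity
    "single": pairwise.min(),     # smallest pairwise dissimilarity
    "average": pairwise.mean(),   # mean of all pairwise dissimilarities
}
print(toy_linkages)
```

The single linkage is always the smallest of the three and the complete linkage the largest, with the average linkage in between.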

The result of hierarchical clustering is a tree, which can be visualized as a dendrogram.

The _y_ axis represents the distance cut-off used when merging clusters.
The _x_ axis represents the documents, which are grouped into clusters based on the colour.

In [None]:
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram

In [None]:
# Complete Linkage
plt.figure(figsize = (20,10))
mergings = linkage(today_news_embeddings,
                   method='complete', metric='cosine')
dendrogram(mergings)
plt.show()

The `linkage` function from scipy performs the hierarchical clustering, using the cosine distance between our embeddings as the metric.
On the above plot, the x axis represents the documents, which are grouped into clusters based on the colour, and the y axis represents the distance cut-off used when merging clusters.
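
Note that scipy's `metric='cosine'` is a distance, defined as one minus the cosine similarity, so vectors pointing in the same direction have distance 0. A small sketch on made-up 2D vectors:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The first two vectors point in the same direction; the third is orthogonal to them.
toy_vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

# pdist returns the condensed distances; squareform turns them into a full matrix.
toy_cosine_distances = squareform(pdist(toy_vectors, metric="cosine"))
print(toy_cosine_distances)
```

The length of the vectors does not matter, only their direction. This is why the values on the y axis of the dendrogram read as cosine distances and not as similarities.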

In [None]:
from scipy.cluster.hierarchy import fcluster
import numpy as np

From the linkage matrix we can get the cluster labels using a distance cut-off.

How do we select the best distance cut-off to use for the clustering?
We will use the silhouette score and follow the same approach we used with K-means: we select the cut-off that gives us the highest silhouette score.
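
To see how the cut-off controls the number of clusters, here is a sketch using `fcluster` on five made-up 1D points. The cut-off values are arbitrary and only chosen to show the effect:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five made-up points on a line: two close pairs and one outlier.
toy_points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
toy_mergings = linkage(toy_points, method="complete", metric="euclidean")

toy_cluster_counts = []
for cutoff in (0.5, 2.0, 10.0, 25.0):
    # Points stay together only if they were merged at a distance <= cutoff.
    toy_labels = fcluster(toy_mergings, t=cutoff, criterion="distance")
    toy_cluster_counts.append(np.unique(toy_labels).size)
    print(cutoff, toy_labels)
```

A small cut-off leaves every point in its own cluster, and a large one merges everything into a single cluster. Note also that the labels returned by `fcluster` start at 1, not 0.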

In [None]:
def select_best_distance(X, merging):
    """Given the document embeddings X and the linkage matrix, find the distance cut-off that maximizes the silhouette score."""
    max_shilouette = float("-inf")
    return_labels = np.zeros(X.shape[0])
    scores = []
    number_of_clusters = []
    best_k = 0
    for k in np.arange(0.2, 0.7, 0.01):
        labels = fcluster(merging, k, criterion="distance")
        score = silhouette_score(
            X, labels
        )
        scores.append(score)
        n_clusters = np.unique(labels).shape[0]
        number_of_clusters.append(n_clusters)
        if score > max_shilouette:
            max_shilouette = score
            return_labels = labels
            best_k = k
    return scores, return_labels, number_of_clusters, best_k

In [None]:
scores, label_hierarchical, number_of_clusters, best_k =  select_best_distance(today_news_embeddings, mergings)

In [None]:
fig, ax = plt.subplots()
ax.plot(np.arange(0.2, 0.7, 0.01), scores)
ax.set_xlabel("Distance metric")
ax.set_ylabel("silhouette score")
ax.set_title("silhouette score vs distance metric")

In [None]:
np.unique(label_hierarchical).shape

In [None]:
max(scores)

In [None]:
best_k

In [None]:
fig, ax = plt.subplots()
ax.plot(np.arange(0.2, 0.7, 0.01), number_of_clusters)
ax.set_xlabel("Distance metric")
ax.set_ylabel("number of clusters")
ax.set_title("distance vs number of clusters")

In [None]:
news_df["label_hierachical"] = label_hierarchical

In [None]:
# fcluster labels start at 1, so the first cluster has label 1.
news_df.query("label_hierachical == 1")

In [None]:
analyse_embeddings(news_df, today_news_embeddings, 4, "label_hierachical")

Once I have the best labeling, I can go ahead and select the most important clusters.

These are all the clusters with more than one document; each cluster with a single document is considered noise.

In [None]:
cluster_counts = news_df.label_hierachical.value_counts()
labels_with_more_than_one = cluster_counts[cluster_counts > 1].index

In [None]:
important_news_df = news_df.loc[news_df.label_hierachical.isin(labels_with_more_than_one)]

In [None]:
important_news_df.head()

In [None]:
important_news_df.to_csv(news_directory.joinpath(f"{today}-important-news-clusters.csv"))

At this point we have a notebook with the clustering results, and those results are saved back to the folder. The next step will be to build a news cluster component that will be used in a downstream application.