<a href="https://colab.research.google.com/github/azizdhaoui/Text-Similarity-and-Clustering-Project/blob/main/Text%20Similarity%20and%20Clustering%20Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering
import numpy as np

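def calculate_similarity(X, method='cosine'):
    """Return a pairwise similarity matrix for the sparse TF-IDF matrix X."""
    if method == 'cosine':
        return cosine_similarity(X)
    elif method == 'euclidean':
        # Convert sparse matrix to dense array for Euclidean distance calculation
        X_dense = X.toarray()
        # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
        sq_norms = np.sum(X_dense**2, axis=1)
        distances = sq_norms[:, None] - 2 * X_dense.dot(X_dense.T) + sq_norms[None, :]
        # Clip tiny negative values caused by floating-point rounding
        distances = np.maximum(distances, 0)
        # Convert distances to similarities in (0, 1]
        return 1 / (1 + distances)
    raise ValueError(f"Unknown similarity method: {method!r}")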


# Step 1: Vectorization of documents
documents = [
    "Premier document.",
    "Deuxième document différent.",
    "Un autre document pour tester.",
    "Document similaire au premier.",
    "Encore un document pour l'exemple."
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# List of similarity methods to use
similarity_methods = ['cosine', 'euclidean']

for method in similarity_methods:
    # Step 2: Calculation of similarity
    similarities = calculate_similarity(X, method=method)

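    # Step 3: Convert similarities to distances
    # ('precomputed' expects a distance matrix, not a similarity matrix)
    distances = np.clip(1 - similarities, 0, None)

    # Step 4: Clustering algorithm
    # With n_clusters equal to the number of documents, each document forms its own cluster
    n_clusters = 5
    # scikit-learn >= 1.2 names this parameter `metric`; older releases call it `affinity`
    clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='precomputed', linkage='average')
    clustering.fit(distances)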

    # Step 5: Segmentation into sub-corpora
    sub_corpora = {}
    for doc_id, cluster_id in enumerate(clustering.labels_):
        if cluster_id not in sub_corpora:
            sub_corpora[cluster_id] = []
        sub_corpora[cluster_id].append(documents[doc_id])

    print(f"Results for similarity method {method}:")
    for cluster_id, docs in sub_corpora.items():
        print(f"Sub-corpus {cluster_id + 1}:")
        for doc in docs:
            print(doc)
        print()
    print()


Results for similarity method cosine:
Sub-corpus 3:
Premier document.

Sub-corpus 4:
Deuxième document différent.

Sub-corpus 5:
Un autre document pour tester.

Sub-corpus 2:
Document similaire au premier.

Sub-corpus 1:
Encore un document pour l'exemple.


Results for similarity method euclidean:
Sub-corpus 3:
Premier document.

Sub-corpus 4:
Deuxième document différent.

Sub-corpus 5:
Un autre document pour tester.

Sub-corpus 2:
Document similaire au premier.

Sub-corpus 1:
Encore un document pour l'exemple.
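
The Euclidean branch of `calculate_similarity` relies on the expansion ‖a − b‖² = ‖a‖² − 2a·b + ‖b‖². A minimal sketch for checking that expansion against scikit-learn's own `euclidean_distances` on the same sample documents follows. The variable names `manual_sq`, `reference_sq`, and `matches` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

documents = [
    "Premier document.",
    "Deuxième document différent.",
    "Un autre document pour tester.",
    "Document similaire au premier.",
    "Encore un document pour l'exemple.",
]

# Same vectorization step as in the notebook cell
X = TfidfVectorizer().fit_transform(documents)
X_dense = X.toarray()

# Manual expansion: ||a||^2 - 2 a.b + ||b||^2 for every pair of rows
sq_norms = np.sum(X_dense ** 2, axis=1)
manual_sq = sq_norms[:, None] - 2.0 * (X_dense @ X_dense.T) + sq_norms[None, :]
# Rounding can leave tiny negative entries, so clip them to zero
manual_sq = np.maximum(manual_sq, 0.0)

# Reference values computed by scikit-learn (squared distances)
reference_sq = euclidean_distances(X, squared=True)

matches = np.allclose(manual_sq, reference_sq, atol=1e-8)
print("Manual expansion matches scikit-learn:", matches)
```

For larger corpora, calling `euclidean_distances` directly on the sparse matrix is probably preferable, since it avoids the `toarray()` conversion.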

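In the results above, requesting five clusters for five documents places every document in its own sub-corpus, so the two similarity methods cannot be told apart. The sketch below requests fewer clusters and passes a distance matrix (1 − cosine similarity) to the clustering step. The helper `cluster_documents` and the choice of `n_clusters=2` are assumptions made for illustration, not part of the original notebook. The `try`/`except` covers the rename of `affinity` to `metric` in scikit-learn 1.2.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Premier document.",
    "Deuxième document différent.",
    "Un autre document pour tester.",
    "Document similaire au premier.",
    "Encore un document pour l'exemple.",
]


def cluster_documents(docs, n_clusters=2):
    """Group docs into n_clusters sub-corpora (hypothetical helper)."""
    X = TfidfVectorizer().fit_transform(docs)
    # 'precomputed' expects distances, so convert similarity to distance
    distances = np.clip(1.0 - cosine_similarity(X), 0.0, None)
    np.fill_diagonal(distances, 0.0)
    try:
        # scikit-learn >= 1.2
        model = AgglomerativeClustering(
            n_clusters=n_clusters, metric="precomputed", linkage="average"
        )
    except TypeError:
        # Older scikit-learn releases use `affinity` instead of `metric`
        model = AgglomerativeClustering(
            n_clusters=n_clusters, affinity="precomputed", linkage="average"
        )
    labels = model.fit_predict(distances)
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[int(label)].append(doc)
    return dict(groups)


sub_corpora = cluster_documents(documents, n_clusters=2)
for cluster_id, docs in sorted(sub_corpora.items()):
    print(f"Sub-corpus {cluster_id + 1}:")
    for doc in docs:
        print(doc)
    print()
```

Cluster labels are arbitrary integers, so only the grouping of documents is meaningful, not the numbering of the sub-corpora.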



