In [None]:
!pip install pytesseract transformers torch scikit-learn
!apt-get update
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev

## Setting the folder path and batch size

In [None]:
import os
import pytesseract
from PIL import Image
from transformers import BertTokenizerFast, BertModel, LayoutLMTokenizer, LayoutLMModel
import torch
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import gc
from google.colab import drive
import logging
import math

# Suppress transformers logging
logging.getLogger("transformers").setLevel(logging.ERROR)

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

# Set up paths
folder_path = '/content/drive/My Drive/resume-processed'  # Adjust the path to your folder in Google Drive
output_folder_path = '/content/drive/My Drive/resume-clusters_bert-layoutlm'
file_limit = 2500  # Set the limit for the number of files to process
batch_size = 1  # Number of files per batch; raise this to make better use of a large (e.g. 40 GB) GPU

## Checking whether a GPU is available and using it

---



In [None]:
# Check for CUDA device and set device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## Defining function to extract features from the images processed by denoiser

In this section, we define a function `extract_text_from_image` that utilizes Optical Character Recognition (OCR) to extract text from an image file.
The function takes an image file path as input, resizes the image to a standard width while maintaining the aspect ratio, and then applies OCR using the Tesseract library to extract and return the text content from the image.

The steps involved in the function are:
1. Open the image file.
2. Resize the image to a base width of 1000 pixels using the LANCZOS filter for high-quality downsampling.
3. Apply Tesseract OCR to the resized image to extract text.
4. Return the extracted text.

This preprocessing step ensures that the text in the images is readable and standardized for further analysis.
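
The resize arithmetic in step 2 can be checked in isolation. The sketch below is a minimal illustration, not part of the pipeline: `scaled_size` is a hypothetical helper name, and it only reproduces the width-ratio calculation without opening any image.

```python
def scaled_size(width, height, base_width=1000):
    """Return (new_width, new_height) that preserves the aspect ratio."""
    scale = base_width / float(width)  # same ratio the OCR function computes
    return base_width, int(height * scale)


# A 2000x3000 scan is halved; a 500x700 scan is doubled.
print(scaled_size(2000, 3000))
print(scaled_size(500, 700))
```

Note that images narrower than the base width are enlarged rather than downsampled, so the LANCZOS filter is used for upsampling in that case.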

In [None]:
def extract_text_from_image(image_path):
    # Open the image file
    img = Image.open(image_path)

    # Resizing the image to a base width of 1000 pixels while maintaining aspect ratio
    base_width = 1000
    w_percent = (base_width / float(img.size[0]))  # Calculate the width percentage
    h_size = int((float(img.size[1]) * float(w_percent)))  # Calculate the new height based on the width percentage
    img = img.resize((base_width, h_size), Image.LANCZOS)  # Resize the image using LANCZOS filter for high-quality downsampling

    # Perform OCR on the resized image to extract text
    text = pytesseract.image_to_string(img)

    # Return the extracted text
    return text

### Extract Text Features Using BERT

In this section, we define a function `extract_text_features` that utilizes a pre-trained BERT model to extract feature embeddings from a given text. The function takes the text, a tokenizer, and a model as input, tokenizes the text, processes it through the BERT model, and returns the feature embeddings.

The steps involved in the function are:
1. **Tokenize the Text**: The text is tokenized using the provided tokenizer with truncation and padding to ensure the input length is consistent and within the model's limits.
2. **Generate Input Tensors**: The tokenized inputs are converted to PyTorch tensors and moved to the appropriate device (CPU or GPU).
3. **Model Inference**: The BERT model processes the input tensors to generate output embeddings, with computations done in a `torch.no_grad()` context to avoid gradient calculation.
4. **Extract Features**: The function extracts the mean of the last hidden state embeddings from the model output, converts them to a numpy array, and returns the features.

This function is crucial for converting raw text into meaningful numerical representations (embeddings) that capture the semantic information of the text for further analysis.
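
Step 4 pools the token embeddings by averaging them. As a rough illustration of what that pooling does, the sketch below uses plain Python lists in place of tensors; `masked_mean` is a hypothetical helper, not a function from `transformers` or `torch`. It also shows why an attention mask matters once several texts are padded to a common length. For a single text with `padding=True`, no padding tokens are added, so the plain mean used in this notebook gives the same result.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors of size 2; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

pooled = masked_mean(hidden, mask)              # ignores the padding row
unmasked = masked_mean(hidden, [1, 1, 1])       # padding drags the mean toward zero
print(pooled, unmasked)
```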

#### Technical Details and Model Choice

**BERT (Bidirectional Encoder Representations from Transformers)**:
- **Architecture**: BERT uses a transformer-based architecture, specifically the encoder part of the transformer. It consists of multiple layers (12 in the base version) of bidirectional self-attention mechanisms, which allow it to consider both left and right context simultaneously.
- **Training**: BERT is pre-trained on a large corpus of text (e.g., Wikipedia, BookCorpus) using two unsupervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). This pre-training enables BERT to capture deep contextual representations of language.
- **Tokenization**: BERT uses WordPiece tokenization, which breaks down words into subword units, allowing it to handle a large vocabulary efficiently and manage out-of-vocabulary words effectively.
- **Fine-tuning**: BERT can be fine-tuned for specific tasks with relatively small amounts of labeled data, leveraging its pre-trained knowledge to achieve state-of-the-art performance in various NLP tasks.
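
To make the WordPiece bullet concrete, here is a minimal sketch of greedy longest-match-first subword splitting. The vocabulary `toy_vocab` is invented for illustration and is far smaller than BERT's real vocabulary; the real tokenizer also lowercases and splits punctuation before this step, which the sketch omits.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "resume"}  # hypothetical vocabulary
print(wordpiece("unaffable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```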

**Why BERT?**:
1. **Contextual Understanding**: BERT's bidirectional attention mechanism allows it to understand the context of a word based on its surrounding words, providing rich, contextual embeddings.
2. **Pre-trained Knowledge**: BERT's pre-training on vast amounts of text data makes it highly effective at capturing semantic nuances, even with limited labeled data for fine-tuning.
3. **Versatility**: BERT can be applied to a wide range of NLP tasks, including text classification, named entity recognition, and question answering, making it a versatile choice for feature extraction.
4. **State-of-the-art Performance**: BERT has consistently achieved state-of-the-art results on numerous NLP benchmarks, demonstrating its effectiveness in understanding and generating text representations.

By using BERT for text feature extraction, we leverage its ability to generate rich, contextual embeddings that capture the semantic meaning of the text, providing a strong foundation for subsequent analysis.


In [None]:
def extract_text_features(text, tokenizer, model):
    # Tokenize the text with truncation and padding, and convert to PyTorch tensors
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512).to(device)

    # Perform model inference without gradient calculation
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract the mean of the last hidden state embeddings and convert to numpy array
    features = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

    return features

### Extract Layout Features Using LayoutLM

In this section, we define a function `extract_layout_features` that utilizes a pre-trained LayoutLM model to extract layout-aware feature embeddings from an image. The function takes an image file path, a tokenizer, and a model as input, processes the image to extract text and layout information, and returns the feature embeddings.

The steps involved in the function are:
1. **Open and Resize Image**: The image is opened and resized to a standard width while maintaining the aspect ratio to ensure text readability.
2. **Extract OCR Data**: Tesseract OCR is used to extract text and layout information (bounding boxes) from the image.
3. **Normalize Bounding Boxes**: The bounding boxes are normalized relative to the image dimensions.
4. **Tokenize Text and Layout Information**: The extracted text and bounding boxes are tokenized using the LayoutLM tokenizer with truncation and padding.
5. **Model Inference**: The LayoutLM model processes the input tensors to generate output embeddings, with computations done in a `torch.no_grad()` context.
6. **Extract Features**: The function extracts the mean of the last hidden state embeddings from the model output, converts them to a numpy array, and returns the features.

This function is crucial for capturing both textual and spatial information from document images, providing a comprehensive feature representation for further analysis.
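
For step 3, the convention documented for LayoutLM is that each box is given as integer `[x0, y0, x1, y1]` coordinates on a 0 to 1000 grid, independent of the image's pixel size. The sketch below shows one way to do that conversion from Tesseract-style `left, top, width, height` values; `normalize_box` is a hypothetical helper name, and the clamping is a defensive assumption for boxes that touch the page edge.

```python
def normalize_box(x, y, w, h, page_width, page_height, scale=1000):
    """Convert a pixel box (left, top, width, height) to [x0, y0, x1, y1] on a 0..scale grid."""
    def clamp(value):
        return max(0, min(scale, value))

    return [
        clamp(int(scale * x / page_width)),
        clamp(int(scale * y / page_height)),
        clamp(int(scale * (x + w) / page_width)),
        clamp(int(scale * (y + h) / page_height)),
    ]


# A word at (100, 50) that is 200 px wide and 25 px tall on a 1000x500 page.
print(normalize_box(100, 50, 200, 25, 1000, 500))
```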

#### Technical Details and Model Choice

**LayoutLM (Layout-Aware Language Model)**:
- **Architecture**: LayoutLM extends the BERT architecture by incorporating layout information. It uses the same transformer-based architecture but adds an additional input embedding for the spatial layout (bounding boxes) of the text.
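- **Training**: LayoutLM is pre-trained on a large collection of scanned document images (the IIT-CDIP Test Collection), learning to model both text and its spatial arrangement. Its main objective is Masked Visual-Language Modeling (MVLM), which masks token text but keeps the 2-D position embeddings, so the model must predict words from both context and layout. The original paper also describes an optional multi-label document classification objective.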
- **Tokenization**: LayoutLM uses a tokenizer similar to BERT but additionally requires bounding box coordinates for each token. These coordinates help the model understand the spatial structure of the document.
- **Fine-tuning**: LayoutLM can be fine-tuned for various document understanding tasks, such as form understanding, receipt parsing, and document classification, by leveraging its pre-trained knowledge of text and layout.
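
Because the model expects one bounding box per token rather than per word, a word that splits into several subword pieces has its box repeated for each piece. The sketch below illustrates that alignment under stated assumptions: `align_boxes_to_tokens` and `toy_tokenize` are hypothetical names, `toy_tokenize` merely stands in for a real subword tokenizer, and the `[0, 0, 0, 0]` and `[1000, 1000, 1000, 1000]` boxes for the special tokens follow the convention shown in the Hugging Face LayoutLM documentation.

```python
def align_boxes_to_tokens(words, boxes, tokenize, max_tokens=510):
    """Repeat each word's box for every subword piece, then add special tokens."""
    tokens, token_boxes = [], []
    for word, box in zip(words, boxes):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_boxes.extend([box] * len(pieces))

    # Leave room for [CLS] and [SEP] within a 512-token window.
    tokens = tokens[:max_tokens]
    token_boxes = token_boxes[:max_tokens]

    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]
    return tokens, token_boxes


def toy_tokenize(word):
    """Hypothetical stand-in for a subword tokenizer: split into 4-character chunks."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


tokens, token_boxes = align_boxes_to_tokens(
    ["Python", "dev"],
    [[10, 10, 60, 20], [70, 10, 100, 20]],
    toy_tokenize,
)
print(tokens)
print(token_boxes)
```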

**Why LayoutLM?**:
1. **Text and Layout Integration**: LayoutLM captures both textual and spatial information, making it ideal for tasks where the layout of the text is crucial for understanding the document.
2. **Pre-trained on Document Data**: LayoutLM is pre-trained on a large corpus of documents, allowing it to generalize well to various document types and structures.
3. **Versatility**: LayoutLM can be fine-tuned for a wide range of document-related tasks, providing flexibility and robustness.
4. **State-of-the-art Performance**: LayoutLM has achieved state-of-the-art results on several document understanding benchmarks, demonstrating its effectiveness in capturing the interplay between text and layout.

By using LayoutLM for layout feature extraction, we leverage its ability to understand the spatial relationships between text elements, providing a comprehensive feature representation that includes both textual and layout information.


In [None]:
def extract_layout_features(image_path, tokenizer, model, max_length=512):
    # Open the image and convert to RGB
    image = Image.open(image_path).convert("RGB")

    # Resize the image to a base width of 1000 pixels while maintaining aspect ratio
    base_width = 1000
    w_percent = base_width / float(image.size[0])  # Width scaling factor
    h_size = int(float(image.size[1]) * w_percent)  # New height from the same factor
    image = image.resize((base_width, h_size), Image.LANCZOS)
    width, height = image.size  # Dimensions of the resized image

    # Use Tesseract to extract words and their pixel bounding boxes
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    tokens, token_boxes = [], []
    for i in range(len(data['text'])):
        word = data['text'][i].strip()
        if not word:
            continue  # Skip the empty rows Tesseract emits for pages, blocks, and lines

        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        # LayoutLM expects integer [x0, y0, x1, y1] boxes on a 0-1000 scale
        box = [
            max(0, min(1000, int(1000 * x / width))),
            max(0, min(1000, int(1000 * y / height))),
            max(0, min(1000, int(1000 * (x + w) / width))),
            max(0, min(1000, int(1000 * (y + h) / height))),
        ]

        # Each subword token inherits the box of the word it came from
        word_tokens = tokenizer.tokenize(word)
        tokens.extend(word_tokens)
        token_boxes.extend([box] * len(word_tokens))

    # Truncate, leaving room for the [CLS] and [SEP] special tokens
    tokens = tokens[:max_length - 2]
    token_boxes = token_boxes[:max_length - 2]

    input_ids = [tokenizer.cls_token_id] + tokenizer.convert_tokens_to_ids(tokens) + [tokenizer.sep_token_id]
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

    # Build a batch of one document and move it to the appropriate device
    input_ids = torch.tensor([input_ids], device=device)
    bbox = torch.tensor([token_boxes], device=device)
    attention_mask = torch.ones_like(input_ids)

    # Perform model inference without gradient calculation; bbox must be passed explicitly
    with torch.no_grad():
        outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)

    # Mean of the last hidden state gives one fixed-size vector per document
    features = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

    return features


### Initialize Tokenizers and Models

In this section, we initialize the tokenizers and models for both text and layout feature extraction using pre-trained BERT and LayoutLM models. These models are loaded from the Hugging Face model hub and moved to the appropriate device (CPU or GPU).

The steps involved in this section are:
1. **Initialize BERT Tokenizer and Model**: Load the pre-trained BERT tokenizer and model for text feature extraction.
2. **Initialize LayoutLM Tokenizer and Model**: Load the pre-trained LayoutLM tokenizer and model for layout-aware feature extraction.
3. **Move Models to Device**: Move the models to the appropriate device (CPU or GPU) to leverage hardware acceleration for faster processing.

This setup ensures that we have the necessary tools for extracting both textual and spatial information from document images, providing a comprehensive feature representation for further analysis.


In [None]:
text_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text_model = BertModel.from_pretrained('bert-base-uncased').to(device)
layout_tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
layout_model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased").to(device)

### Extract Features from All Images with Progress Tracking

In this section, we prepare to extract text and layout features from all images in the specified folder with progress tracking. The code collects the image paths and runs a first OCR pass over each image to count its words; that total is then used to size the progress bar for the actual feature extraction.

The steps involved in this section are:
1. **Initialize Lists for Features**: Create empty lists to store extracted texts, text features, and layout features.
2. **Print Extraction Message**: Inform the user that feature extraction is starting.
3. **Collect File Paths**: Gather all file paths of `.tif` images in the specified folder.
4. **Estimate Total Number of Words**: Calculate the total number of words to process for a progress bar:
   - **Initialize Total Words Counter**: Set up a counter to track the total number of words.
   - **Initialize Progress Bar**: Use `tqdm` to create a progress bar for tracking the estimation process.
   - **Process Each Image**: For each image, perform OCR to extract text elements and count the total number of words.

This setup ensures that we can monitor the progress of feature extraction and handle large datasets efficiently.


In [None]:
# Initialize lists to store texts, text features, and layout features
texts = []
text_features = []
layout_features = []


print("Extracting features from images...")

# Initialize file count
file_count = 0

# Collect file paths for all .tif images in the specified folder
file_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.tif')]

# Estimate total number of words to process for the progress bar
print("Estimating total number of words to process...")
total_words = 0

# Initialize the progress bar for word estimation
with tqdm(total=len(file_paths[:file_limit]), desc="Estimating words", unit="file", leave=False) as pbar:
    for file_path in file_paths[:file_limit]:
        # Open the image and convert to RGB
        data = pytesseract.image_to_data(Image.open(file_path).convert("RGB"), output_type=pytesseract.Output.DICT)

        # Count only non-empty entries; Tesseract also emits empty strings for page, block, and line rows
        total_words += sum(1 for word in data['text'] if word.strip())

        # Update the progress bar
        pbar.update(1)


# Process a Batch of Images

In this section, we define a series of functions to process a batch of image files, extracting text, text features, and layout features. The functions leverage previously defined methods for OCR-based text extraction, text feature extraction using BERT, and layout feature extraction using LayoutLM. This modular approach allows for efficient and scalable batch processing, especially useful for large datasets.

## Steps Involved

1. **Extract Text from Each Image:** Using `process_batch_1`, the text is extracted from each image file in the batch using OCR.
2. **Extract Text Features:** Using `process_batch_2`, text features are computed for each extracted text using the BERT model and tokenizer.
3. **Extract Layout Features:** Using `process_batch_3`, layout features are computed for each image file in the batch using the LayoutLM model and tokenizer.
4. **Clear Memory:** The `clear_memory` function is used to release GPU memory and perform garbage collection to maintain efficiency.
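
The batching pattern behind these steps can be sketched independently of OCR or any model. The helper below is illustrative only: `iter_batches` is a hypothetical name, and it simply slices a list into consecutive batches while respecting both the list length and an optional file limit.

```python
def iter_batches(items, batch_size, limit=None):
    """Yield consecutive slices of at most batch_size items, stopping at limit."""
    n = len(items) if limit is None else min(limit, len(items))
    for start in range(0, n, batch_size):
        yield items[start:min(start + batch_size, n)]


files = ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"]
for batch in iter_batches(files, batch_size=2, limit=3):
    print(batch)
```

Bounding the loop by the smaller of the limit and the list length avoids both empty batches and overshooting the limit with a partial final batch.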

In [None]:
def process_batch_1(batch_files):
    batch_texts = [extract_text_from_image(file_path) for file_path in batch_files]
    return batch_texts

def process_batch_2(batch_files, batch_texts):
    batch_text_features = [extract_text_features(text, text_tokenizer, text_model) for text in batch_texts]
    return batch_text_features

def process_batch_3(batch_files):
    batch_layout_features = [extract_layout_features(file_path, layout_tokenizer, layout_model) for file_path in batch_files]
    return batch_layout_features

def clear_memory():
    torch.cuda.empty_cache()
    gc.collect()

# Process Files in Batches

In this section, we process the files in batches to extract and store text, text features, and layout features from each image. A progress bar is used to track the tokenization process, providing a visual indication of the progress. Additionally, memory is managed by clearing the cache after each batch is processed. The process is limited to a specified number of files to ensure efficiency and manageability.

## Steps Involved

1. **Initialize Progress Bar:** Create a progress bar to track the tokenization of images based on the total number of words estimated.
2. **Process Files in Batches:** Loop through the file paths in batches, processing a subset of files at a time, up to a defined file limit.
3. **Extract Features:** For each batch, extract texts, text features, and layout features using the respective functions.
4. **Store Extracted Features:** Append the extracted features to the respective lists for all files.
5. **Update Progress Bar:** Update the progress bar based on the number of words processed in each batch.
6. **Clear Cache:** Clear the cache and free up memory after processing each batch to ensure efficient memory usage.

In [None]:
file_count = 0
file_limit = 2500
# Never iterate past the files that actually exist
n_files = min(file_limit, len(file_paths))
with tqdm(total=total_words, desc="Tokenizing images", unit="word", leave=True) as pbar:
    for i in range(0, n_files, batch_size):
        # Select the batch of files to process, without exceeding the file limit
        batch_files = file_paths[i:min(i + batch_size, n_files)]

        # Process the batch to extract texts
        batch_texts = process_batch_1(batch_files)
        # Clear memory before proceeding to the next step
        clear_memory()

        # Process the batch to extract text features
        batch_text_features = process_batch_2(batch_files, batch_texts)
        text_features.extend(batch_text_features)
        # Clear memory before proceeding to the next step
        clear_memory()

        # Process the batch to extract layout features
        batch_layout_features = process_batch_3(batch_files)
        layout_features.extend(batch_layout_features)
        # Clear memory before storing results
        clear_memory()

        # Store the extracted texts along with their file names
        texts.extend([(os.path.basename(file_path), text) for file_path, text in zip(batch_files, batch_texts)])

        # Update the progress bar with the number of words OCR found in this batch
        pbar.update(sum(len(text.split()) for text in batch_texts))

        # Track how many files have been processed so far
        file_count += len(batch_files)

        # Clear cache after processing each batch to free up memory
        del batch_texts, batch_layout_features, batch_text_features
        clear_memory()

### Calculate Maximum Feature Length

In this section, we calculate the maximum feature length from the extracted text and layout features. This is important for ensuring that all feature vectors have a consistent length, which is necessary for further processing such as clustering or dimensionality reduction.

In [None]:
max_length = max([len(np.concatenate((np.squeeze(text_feat), np.squeeze(layout_feat).flatten()))) for text_feat, layout_feat in zip(text_features, layout_features)])


### Combine and Normalize Features with Padding or Truncating

In this section, we define a function `pad_or_truncate` to ensure that all feature vectors have a consistent length by either padding or truncating them to a fixed length. We then combine the text and layout features, apply the padding/truncating function, and convert the result into a NumPy array for further analysis.

The steps involved in this section are:
1. **Define Padding/Truncating Function**: Create a function to pad or truncate feature vectors to a specified length.
2. **Combine Features**: Concatenate text and layout features for each document.
3. **Apply Padding/Truncating**: Use the padding/truncating function to ensure all combined feature vectors have the same length.
4. **Store Combined Features**: Append the normalized feature vectors to a list and convert the list to a NumPy array.



In [None]:
# Function to pad or truncate features to a fixed length
def pad_or_truncate(feature, length):
    if len(feature) > length:
        return feature[:length]
    elif len(feature) < length:
        return np.pad(feature, (0, length - len(feature)), 'constant')
    else:
        return feature

# Combine text and layout features with padding or truncating
combined_features = []
for text_feat, layout_feat in zip(text_features, layout_features):
    text_feat = np.squeeze(text_feat)
    layout_feat = np.squeeze(layout_feat).flatten()  # Ensure layout features are flattened
    combined_feature = np.concatenate((text_feat, layout_feat))
    combined_feature = pad_or_truncate(combined_feature, max_length)
    combined_features.append(combined_feature)

combined_features = np.array(combined_features)  # Convert to a NumPy array

Checking the dimensions of the combined features

In [None]:
print(f"Combined features shape: {combined_features.shape}")
print(f"Combined features size in memory: {combined_features.nbytes / 1e9} GB")

# Dimensionality reduction using PCA and K-Means for clustering

## PCA Elbow Plot

In this section, we define a function to perform Principal Component Analysis (PCA) on a dataset and plot the cumulative explained variance ratio. The plot helps determine the optimal number of principal components to retain by showing the "elbow" point where the explained variance starts to level off.

In [None]:
from sklearn.decomposition import PCA

def plot_pca_elbow(data, max_components=200):
    # Cap the number of components: PCA allows at most min(n_samples, n_features)
    pca = PCA(n_components=min(max_components, *data.shape))

    # Fit the PCA model on the data
    pca.fit(data)

    # Calculate the cumulative explained variance ratio
    explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

    # Plot the cumulative explained variance ratio
    plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
    plt.xlabel('Number of Components')
    plt.ylabel('Variance Explained')
    plt.title('Explained Variance by PCA Components')
    plt.grid(True)
    plt.show()

In [None]:
# Plot the PCA elbow for the combined features
print("Plotting PCA elbow...")
plot_pca_elbow(combined_features, max_components=1000)

## Applying PCA to Meet a Variance Threshold

In this section, we define a function to apply Principal Component Analysis (PCA) on a dataset, selecting the number of components required to meet a specified explained variance threshold.
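
The selection rule is simple enough to verify by hand: accumulate the per-component variance ratios and take the first position where the running total reaches the threshold. The sketch below does this with the standard library only; `components_for_threshold` is a hypothetical helper, and the variance ratios are made-up numbers, not results from this dataset.

```python
from bisect import bisect_left
from itertools import accumulate


def components_for_threshold(variance_ratios, threshold=0.95):
    """Smallest number of leading components whose cumulative variance reaches threshold."""
    cumulative = list(accumulate(variance_ratios))
    # bisect_left finds the first index with cumulative[index] >= threshold
    return min(bisect_left(cumulative, threshold) + 1, len(cumulative))


ratios = [0.5, 0.3, 0.15, 0.05]  # illustrative values only
print(components_for_threshold(ratios, threshold=0.9))
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then picks the component count for that variance fraction itself, which would avoid fitting the model twice.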

In [None]:
def apply_pca(data, explained_variance_threshold=0.95):
    # Initialize PCA without specifying the number of components
    pca = PCA()

    # Fit the PCA model on the data
    pca.fit(data)

    # Calculate the cumulative explained variance ratio
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

    # Find the number of components required to meet the explained variance threshold
    n_components = np.searchsorted(cumulative_variance, explained_variance_threshold) + 1

    # Apply PCA with the estimated number of components
    pca = PCA(n_components=n_components)
    transformed_data = pca.fit_transform(data)

    return transformed_data, n_components

transformed_data, n_components = apply_pca(combined_features, explained_variance_threshold=0.95)
print(f"Number of components chosen: {n_components}")


## Finding the Optimal Number of Clusters

In this section, we define a function to determine the optimal number of clusters for K-Means clustering using the silhouette score. The silhouette score helps evaluate the quality of the clustering by measuring how similar an object is to its own cluster compared to other clusters.
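
For a point with mean intra-cluster distance $a$ and mean distance $b$ to the nearest other cluster, the silhouette coefficient is $(b - a) / \max(a, b)$, and the score is the average over all points. The sketch below computes this directly for a handful of 2-D points. It is a teaching aid with a hypothetical function name, assumes at least two clusters and no duplicate points, and is not a replacement for `sklearn.metrics.silhouette_score`.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient for a small, labelled set of points."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [
            dist(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == label and j != i
        ]
        if not same:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(same) / len(same)  # mean distance within the point's own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels)
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # clusters match the two visible groups
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, while negative scores suggest points sit closer to a neighboring cluster than to their own.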

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_clusters(data, max_k):
    # Define the range of cluster numbers to try
    iters = range(2, max_k+1, 2)

    # Initialize a list to store silhouette scores
    s = []

    # Iterate through the range of cluster numbers
    for k in iters:
        # Perform K-Means clustering with k clusters
        kmeans = KMeans(n_clusters=k, random_state=42).fit(data)

        # Calculate the silhouette score for the current clustering
        s.append(silhouette_score(data, kmeans.labels_))

        # Print the number of clusters and corresponding silhouette score
        print(f'k: {k}, Silhouette Score: {s[-1]}')

    # Create a plot to visualize the silhouette scores for different cluster numbers
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, s, marker='o')
    ax.set_xlabel('Number of Clusters (k)')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('Silhouette Score')
    ax.set_title('Silhouette Scores for Various Clusters')
    plt.show()

## Perform Clustering with the Optimal Number of Clusters

In this section, we use the optimal number of clusters identified from the silhouette analysis to perform K-Means clustering.

In [None]:
# Perform clustering with the optimal number of clusters
optimal_clusters = 4  # Adjust this based on the silhouette analysis
print(f"Clustering into {optimal_clusters} clusters...")

# Initialize K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)

# Fit the K-Means model and predict the cluster for each data point
clusters = kmeans.fit_predict(transformed_data)

## Evaluate Clustering with Multiple Metrics

In this section, we evaluate the quality of the clustering using three different metrics: Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index. These metrics provide a comprehensive assessment of the clustering performance.

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Evaluate the clustering using the three metrics
silhouette_avg = silhouette_score(transformed_data, clusters)
davies_bouldin_avg = davies_bouldin_score(transformed_data, clusters)
calinski_harabasz_avg = calinski_harabasz_score(transformed_data, clusters)

print(f"Silhouette Score: {silhouette_avg}")
print(f"Davies-Bouldin Index: {davies_bouldin_avg}")
print(f"Calinski-Harabasz Index: {calinski_harabasz_avg}")

# Dimensionality reduction using t-SNE and K-Means for clustering

## Apply t-SNE for Dimensionality Reduction

In this section, we define a function to apply t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction. t-SNE is a powerful technique for visualizing high-dimensional data by mapping it to a lower-dimensional space, typically 2 or 3 dimensions.

In [None]:
from sklearn.manifold import TSNE

def apply_tsne(data, n_components=2, perplexity=30.0, n_iter=1000):
    tsne = TSNE(n_components=n_components, perplexity=perplexity, n_iter=n_iter)
    transformed_data = tsne.fit_transform(data)
    return transformed_data

In [None]:
perplexity = 30  # Perplexity value
n_iter = 1000  # Number of iterations
transformed_data = apply_tsne(combined_features, n_components=2, perplexity=perplexity, n_iter=n_iter)

## Plot t-SNE Results

In this section, we define a function to apply t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and plot the resulting lower-dimensional data. This visualization helps in understanding the structure and distribution of high-dimensional data in a 2D space.

In [None]:
# Function to plot t-SNE results
def plot_tsne(data, perplexity, n_iter):
    """
    Apply t-SNE and plot the results.

    Parameters:
    - data: The input data to be transformed.
    - perplexity: The perplexity parameter for t-SNE.
    - n_iter: The number of iterations for optimization.
    """
    # Initialize t-SNE with the specified parameters
    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter)

    # Fit and transform the data using t-SNE
    transformed_data = tsne.fit_transform(data)

    # Plot the transformed data
    plt.scatter(transformed_data[:, 0], transformed_data[:, 1], s=5)
    plt.title(f't-SNE Visualization (perplexity={perplexity}, n_iter={n_iter})')
    plt.xlabel('t-SNE Component 1')
    plt.ylabel('t-SNE Component 2')
    plt.grid(True)
    plt.show()

# Compare different parameter settings on the combined features
plot_tsne(combined_features, perplexity=5, n_iter=300)
plot_tsne(combined_features, perplexity=30, n_iter=300)
plot_tsne(combined_features, perplexity=50, n_iter=300)
plot_tsne(combined_features, perplexity=30, n_iter=1000)


## Finding the optimal clusters

Here we find the optimal number of clusters for the t-SNE output, using the `find_optimal_clusters` function defined earlier.

In [None]:
print("Finding the optimal number of clusters...")
find_optimal_clusters(transformed_data, 10)

## Perform Clustering with the Optimal Number of Clusters

In this section, we use the optimal number of clusters identified from the silhouette analysis to perform K-Means clustering.

In [None]:
# Perform clustering with the optimal number of clusters
optimal_clusters = 4  # Adjust this based on the silhouette analysis
print(f"Clustering into {optimal_clusters} clusters...")

# Initialize K-Means with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)

# Fit the K-Means model and predict the cluster for each data point
clusters = kmeans.fit_predict(transformed_data)


## Evaluate Clustering with Multiple Metrics

In this section, we evaluate the quality of the clustering using three different metrics: Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index. These metrics provide a comprehensive assessment of the clustering performance.

In [None]:
# Evaluate the clustering using the three metrics
silhouette_avg = silhouette_score(transformed_data, clusters)
davies_bouldin_avg = davies_bouldin_score(transformed_data, clusters)
calinski_harabasz_avg = calinski_harabasz_score(transformed_data, clusters)

print(f"Silhouette Score: {silhouette_avg}")
print(f"Davies-Bouldin Index: {davies_bouldin_avg}")
print(f"Calinski-Harabasz Index: {calinski_harabasz_avg}")


# Saving the clustering with the best metrics

In [None]:
# Create directories for each cluster
import shutil
os.makedirs(output_folder_path, exist_ok=True)

# `clusters` holds the K-Means labels assigned above
for cluster_label in np.unique(clusters):
    cluster_dir = os.path.join(output_folder_path, f"cluster_{cluster_label}")
    os.makedirs(cluster_dir, exist_ok=True)

# Copy images to the corresponding cluster directories.
# Only the first len(clusters) files were processed, so restrict file_paths to match.
for image_path, cluster_label in zip(file_paths[:len(clusters)], clusters):
    shutil.copy(image_path, os.path.join(output_folder_path, f"cluster_{cluster_label}"))

print("Images have been copied to corresponding cluster directories.")
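
Before copying files, it can help to inspect which images would land in which folder. The sketch below groups paths by label without touching the file system; `group_by_cluster` is a hypothetical helper, and the paths and labels are placeholder values. It relies on the paths and labels being in the same order, as they are when both come from the same processing run.

```python
from collections import defaultdict


def group_by_cluster(paths, labels):
    """Map each cluster folder name to the list of paths assigned to it."""
    groups = defaultdict(list)
    for path, label in zip(paths, labels):
        groups[f"cluster_{label}"].append(path)
    return dict(groups)


preview = group_by_cluster(["a.tif", "b.tif", "c.tif"], [1, 0, 1])
for folder, members in sorted(preview.items()):
    print(folder, members)
```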