<a href="https://colab.research.google.com/github/amaye15/stackoverflow-question-classifier/blob/main/code/N3_Non_Supervised_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Dimensionality Reduction Techniques
Apply appropriate techniques to reduce high-dimensional data to two dimensions, and plot the result for exploratory analysis.

### CE1: Implementing Dimensionality Reduction
- You have implemented at least one dimensionality reduction technique (LDA, PCA, t-SNE, UMAP, or another technique).

### CE2: 2D Graphical Representation
- You have produced at least one plot of the data reduced to 2D (for example, with LDAvis for topics).

### CE3: Analysis of the 2D Plot
- You have carried out and written up an analysis of the 2D plot.

## Libraries & Functions

In [None]:
%pip install datasets --quiet

In [None]:
# Standard library imports
import math
import os
import re
import string

# Third-party imports
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import tensorflow as tf
import tensorflow_hub as hub
import torch

# Import Functions/Classes
from datasets import load_dataset
from gensim.models import Word2Vec, FastText
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE
from transformers import BertTokenizer, BertModel
from wordcloud import WordCloud
from tqdm.notebook import trange, tqdm

# Typing imports (if needed)
from typing import List

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

def clean_text(text: str) -> str:
    """
    Cleans a given text by converting it to lowercase, removing punctuation,
    and eliminating English stop words.

    Parameters:
    text (str): The text to be cleaned.

    Returns:
    str: The cleaned text with all words in lowercase, no punctuation, and
         without stop words.

    Note:
    This function requires the NLTK library and its 'stopwords' dataset.
    Make sure to download the dataset using nltk.download('stopwords') before using this function.
    """
    # Convert text to lower case
    text = text.lower()

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])

    return text

def tokenize(text: str) -> List[str]:
    """
    Tokenizes the given text into individual words.

    This function splits a string into a list of words, using NLTK's word_tokenize method.
    It's useful for breaking down a text into its constituent words for further natural
    language processing tasks.

    Parameters:
    text (str): The text to be tokenized.

    Returns:
    List[str]: A list of words obtained by tokenizing the input text.

    Note:
    This function requires the NLTK library. Make sure to have NLTK installed
    and its relevant datasets downloaded as needed.
    """
    return word_tokenize(text)



def stem_sentence(sentence: str) -> str:
    """
    Stems each word in a given sentence.

    This function applies the Porter stemming algorithm (using NLTK's PorterStemmer)
    to each word in the input sentence. Stemming is a process of reducing words to their
    root or base form. For example, "running" would be stemmed to "run".

    Parameters:
    sentence (str): The sentence whose words are to be stemmed.

    Returns:
    str: A string containing the stemmed version of each word in the input sentence,
         with words separated by spaces.

    Note:
    This function requires the NLTK library and specifically the PorterStemmer module.
    Ensure that NLTK is installed and available.
    """
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in sentence.split()])


def lemmatize_sentence(sentence: str) -> str:
    """
    Lemmatizes each word in a given sentence.

    This function applies word lemmatization using NLTK's WordNetLemmatizer. Lemmatization
    reduces words to their dictionary form (lemma), unlike stemming, which only strips suffixes.
    Because no part-of-speech tag is passed, every word is treated as a noun: "cars" becomes
    "car", whereas "better" would only be lemmatized to "good" with pos="a".

    Parameters:
    sentence (str): The sentence whose words are to be lemmatized.

    Returns:
    str: A string containing the lemmatized version of each word in the input sentence,
         with words separated by spaces.

    Note:
    This function requires the NLTK library, specifically the WordNetLemmatizer. Ensure that NLTK
    and the WordNet data are installed and available.
    """
    lemmatizer = WordNetLemmatizer()
    return " ".join([lemmatizer.lemmatize(word) for word in sentence.split()])

def is_top_k(row, y_col, y_pred_col, k):
    """
    Check if the actual value in a specified column is within the top 'k' predicted values in another column.

    This function is designed to operate on a row of a pandas DataFrame. It compares the actual value from one column
    ('y_col') with a list of predicted values in another column ('y_pred_col'), and checks if the actual value is within
    the top 'k' elements of the predicted list.

    Parameters:
    row (pd.Series): A row from a pandas DataFrame.
    y_col (str): The name of the column containing the actual value.
    y_pred_col (str): The name of the column containing the list of predicted values.
    k (int): The number of top elements from the predicted values list to consider.

    Returns:
    bool: True if the actual value is within the top 'k' predicted values, False otherwise.
    """
    return row[y_col] in row[y_pred_col][:k]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
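
To make the behaviour of `is_top_k` concrete, here is a minimal sketch on two hypothetical rows. Plain dictionaries stand in for pandas rows, and the tag values are invented for illustration.

```python
def is_top_k(row, y_col, y_pred_col, k):
    """Return True if row[y_col] appears among the first k entries of row[y_pred_col]."""
    return row[y_col] in row[y_pred_col][:k]


# Hypothetical rows; plain dicts stand in for pandas rows here.
rows = [
    {"Main_Tag": "python", "Predicted_Tags": ["pandas", "python", "numpy"]},
    {"Main_Tag": "java", "Predicted_Tags": ["c#", "kotlin", "java"]},
]

# With k=2 only the first row is a hit; with k=3 both rows are.
hits_at_2 = [is_top_k(r, "Main_Tag", "Predicted_Tags", k=2) for r in rows]
hits_at_3 = [is_top_k(r, "Main_Tag", "Predicted_Tags", k=3) for r in rows]
```

The check depends on the order of the predicted list, so it only makes sense when predictions are sorted from most to least likely.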


## Setup

In [None]:
# Constants
NAME = "amaye15/Stack-Overflow-Zero-Shot-Classification"
REPOSITORY = "amaye15/Stack-Overflow-Zero-Shot-Classification"
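# Credentials are read from the environment rather than hard-coded in the notebook
# (the environment variable names below are assumptions).
STACK_KEY = os.environ.get("STACK_KEY")
HF_KEY = os.environ.get("HF_KEY")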
K = 20
COMPONENTS = 2
RANDOM_STATE = 42

# Load the dataset from the Hugging Face Hub
ds = load_dataset(NAME)
df = ds["train"].to_pandas()

# Dataframe Manipulation
df["Main_Tag"] = df["Tags"].str.replace(" ", "").apply(lambda x: next(iter(x.split(","))))
df["Predicted_Main_Tag"] = df["Predicted_Tags"].str.replace(" ", "").apply(lambda x: next(iter(x.split(","))))
df["Predicted_Tags"] = df["Predicted_Tags"].str.replace(" ", "").str.split(",")

# Keep only rows whose main tag is among the top K predicted tags
df = df[df.apply(lambda row: is_top_k(row, y_col = "Main_Tag", y_pred_col = "Predicted_Tags", k = K), axis=1)].copy()

# Ten most frequent main tags (value_counts sorts by descending frequency)
top_ten = df["Main_Tag"].value_counts().head(10).index.to_list()

# Masking
mask = df["Main_Tag"].isin(top_ten).to_list()
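
The main-tag extraction and the top-ten mask above can be sketched on toy data with the standard library only. The tag strings below are hypothetical, and `n = 2` is used instead of ten so the result is easy to check by hand.

```python
from collections import Counter

# Hypothetical comma-separated tag strings, in the same shape as the "Tags" column.
tags = ["python, pandas", "java, spring", "python, numpy", "c#, .net", "python", "java"]

# First tag of each entry, with spaces removed as in the cell above.
main_tags = [t.replace(" ", "").split(",")[0] for t in tags]

# The n most frequent main tags (n = 2 here; the notebook keeps ten).
top_n = [tag for tag, _ in Counter(main_tags).most_common(2)]

# Boolean mask marking rows whose main tag is among the most frequent ones.
toy_mask = [tag in top_n for tag in main_tags]
```

The mask has one entry per row, so it stays usable only as long as the rows it describes are neither reordered nor dropped.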


## Count Vectoriser

In [None]:
# Text processing: clean, stem, and lemmatize each title (computed once and reused below)
processed_titles = df["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence)

# Tokenized titles
text = processed_titles.apply(tokenize)

# Vectorization
count_vectorizer = CountVectorizer(min_df=5, lowercase=False)
count_vectorized = count_vectorizer.fit_transform(processed_titles)
count_dataframe = pd.DataFrame(count_vectorized.toarray(), columns=count_vectorizer.get_feature_names_out())
#count_dataframe
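
As a rough illustration of what `min_df` does, the sketch below builds a bag-of-words matrix by hand on three hypothetical titles. It is not scikit-learn's implementation, only the same idea: terms that occur in too few documents are dropped from the vocabulary.

```python
from collections import Counter

# Hypothetical pre-processed titles.
docs = ["read csv file", "read json file", "parse csv"]
MIN_DF = 2  # the notebook uses min_df=5

# Document frequency: number of documents containing each term at least once.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Keep terms that appear in at least MIN_DF documents, in sorted order.
vocab = sorted(term for term, n in doc_freq.items() if n >= MIN_DF)

# One row of raw term counts per document.
counts = [[doc.split().count(term) for term in vocab] for doc in docs]
```

Here "json" and "parse" each occur in a single document, so they are excluded and every row has three columns.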

## TF-IDF Vectoriser

In [None]:
# Vectorization
tfidf_vectorizer = TfidfVectorizer(min_df=5, use_idf=True)
tfidf_vectorized = tfidf_vectorizer.fit_transform(df["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence))

# Use the appropriate method depending on your scikit-learn version
try:
    feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
except AttributeError:
    # Fallback for scikit-learn versions prior to 1.0
    feature_names_tfidf = tfidf_vectorizer.get_feature_names()

# Create a dataframe with the feature names as columns
tfidf_dataframe = pd.DataFrame(tfidf_vectorized.toarray(), columns=feature_names_tfidf)
#tfidf_dataframe
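
For reference, the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, L2-normalised rows) can be reproduced by hand. This is a minimal sketch on hypothetical titles, without the `min_df` filter, and is meant to show the formula rather than replace the library.

```python
import math

# Hypothetical pre-processed titles.
docs = ["read csv file", "read json file", "parse csv"]
vocab = sorted({t for d in docs for t in d.split()})
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
df_counts = {t: sum(t in d.split() for d in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Raw term counts times IDF, scaled so the row has unit Euclidean norm."""
    tokens = doc.split()
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


matrix = [tfidf_row(d) for d in docs]
```

Rare terms such as "json" receive a larger weight than terms shared by several titles, which is the property the PCA projection below relies on.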

### Visualisation

In [None]:
# Principal Component Analysis (PCA) initialization with a specified number of components and a fixed random state for reproducibility.
pca = PCA(n_components=COMPONENTS, random_state=RANDOM_STATE)

# Applying PCA to the tfidf_dataframe and transforming the data.
reduced_data_pca = pca.fit_transform(tfidf_dataframe.values)

# Creating a DataFrame from the reduced data with named dimensions.
reduced_df = pd.DataFrame(reduced_data_pca, columns=["Dim 1", "Dim 2"])

# Adding a 'Label' column to the DataFrame from 'Main_Tag' column of the original df.
reduced_df["Label"] = df["Main_Tag"].values

# Applying a mask to the DataFrame and creating a copy of the filtered data.
reduced_df = reduced_df[mask].copy()

# Extracting unique labels from the 'Label' column.
unique_labels = reduced_df['Label'].unique()

# Generating all unique pairs of labels using a set to avoid duplicates.
unique_label_pairs = set()
for label1 in unique_labels:
    for label2 in unique_labels:
        if label1 != label2:
            unique_label_pairs.add(tuple(sorted([label1, label2])))

# Converting the set of unique label pairs to a sorted list, limiting to the first 9 pairs.
unique_label_pairs = [(row[2], row[3]) for row in pd.DataFrame(unique_label_pairs).sort_values(by=[0, 1]).reset_index().itertuples()][:9]

# Plotting scatter plots for each pair of labels.
for i, (label1, label2) in enumerate(unique_label_pairs):

    # Creating a temporary DataFrame with only the two selected labels.
    tmp_df = reduced_df[reduced_df['Label'].isin([label1, label2])].copy()

    # Plotting a scatter plot using Plotly Express.
    fig = px.scatter(tmp_df, x="Dim 1", y="Dim 2", color="Label", title=f'{label1} vs {label2}')
    fig.show()
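
When analysing a 2D PCA plot, it helps to know how much of the total variance the two axes retain. The sketch below computes this with NumPy's SVD on hypothetical random data; with scikit-learn, the same quantity is exposed as `pca.explained_variance_ratio_` after fitting.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 samples, 5 features, most variance along the first two axes.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of the total variance carried by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Projection onto the first two components, and the variance they retain.
projected = Xc @ Vt[:2].T
kept = explained_variance_ratio[:2].sum()
```

A low retained share would suggest that overlapping clusters in the scatter plots may still be separable in the original space.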


## BERT

In [None]:
# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

# Move model to the chosen device
model.to(device)

# Pre-process the titles with the text processing functions defined above
text = df["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence).to_list()

# Define batch size
batch_size = 128  # Adjust batch size based on your GPU memory

# Placeholder for batch encoded inputs
batch_encoded_inputs = []

# Batch encode in a loop
for start_idx in tqdm(range(0, len(text), batch_size), desc="Encoding"):
    # Get the batch
    batch = text[start_idx:start_idx + batch_size]

    # Encode the batch and move to the same device as model
    batch_encoded = tokenizer(batch, padding="longest", truncation=True, return_tensors='pt').to(device)

    # Process with the model
    with torch.no_grad():
        encoded_results = model(**batch_encoded)

    # Mean-pool the token embeddings over the sequence dimension (padding positions included), then move the result to CPU
    batch_results = encoded_results.last_hidden_state.mean(dim=1).cpu().tolist()

    # Store the processed batch
    batch_encoded_inputs.extend(batch_results)
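
The cell above averages `last_hidden_state` over every position, and with `padding="longest"` that includes padded tokens in shorter titles. A common alternative is to weight the average by the attention mask. The sketch below shows the idea with NumPy on hypothetical values; `masked_mean` is an illustrative helper, not part of the notebook or of Transformers.

```python
import numpy as np


def masked_mean(hidden, attention_mask):
    """Average token vectors per sequence, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1, None)            # avoid division by zero
    return summed / counts


# Hypothetical hidden states: 2 sequences, 3 positions, 2 dimensions.
# The last position of the first sequence is padding.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                   [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, attention_mask)
plain = hidden.mean(axis=1)
```

In this toy case the padded position dominates the plain mean of the first sequence, while the masked mean ignores it. Whether the difference matters for the plots in this notebook has not been checked here.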


### Visualisation

In [None]:
# Initialize PCA with specified number of components and random state for reproducibility.
pca = PCA(n_components=COMPONENTS, random_state=RANDOM_STATE)

# Fit PCA on the batch_encoded_inputs and transform the data.
reduced_data_pca = pca.fit_transform(batch_encoded_inputs)

# Create a DataFrame from the PCA reduced data with named dimensions.
reduced_df = pd.DataFrame(reduced_data_pca, columns=["Dim 1", "Dim 2"])

# Add a 'Label' column from 'Main_Tag' column of the original dataframe.
reduced_df["Label"] = df["Main_Tag"].values

# Apply a mask to the DataFrame and create a copy of the filtered data.
reduced_df = reduced_df[mask].copy()

# Extract unique labels from the 'Label' column.
unique_labels = reduced_df['Label'].unique()

# Generate all unique pairs of labels using a set to avoid duplicates.
unique_label_pairs = set()
for label1 in unique_labels:
    for label2 in unique_labels:
        if label1 != label2:
            unique_label_pairs.add(tuple(sorted([label1, label2])))

# Convert the set of unique label pairs to a sorted list, limiting to the first 9 pairs.
unique_label_pairs = [(row[2], row[3]) for row in pd.DataFrame(unique_label_pairs).sort_values(by=[0, 1]).reset_index().itertuples()][:9]

# Plot scatter plots for each pair of labels.
for i, (label1, label2) in enumerate(unique_label_pairs):

    # Creating a temporary DataFrame with only the selected labels.
    tmp_df = reduced_df[(reduced_df['Label'] == label1) | (reduced_df['Label'] == label2)].copy()

    # Plotting a scatter plot using Plotly Express.
    fig = px.scatter(tmp_df, x="Dim 1", y="Dim 2", color="Label", title=f'{label1} vs {label2}')
    fig.show()
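
The sorted list of label pairs can also be built with `itertools.combinations`, which yields each unordered pair once and in lexicographic order when the input is sorted. This is a sketch with hypothetical labels, assumed to be equivalent to the set-and-DataFrame approach used above.

```python
from itertools import combinations

# Hypothetical labels standing in for reduced_df['Label'].unique().
unique_labels = ["python", "c#", "java", "javascript"]

# All unordered pairs, sorted, limited to the first three for this example.
label_pairs = list(combinations(sorted(unique_labels), 2))[:3]
```

Every pair kept here contains the alphabetically first label. The same would hold for the first nine pairs when ten labels are present, so the plots compare that one label against each of the others.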


## Word2Vec (FastText)

In [None]:
# Apply a series of text preprocessing functions: cleaning, stemming, lemmatizing, and tokenizing the 'Title' column of the dataframe.
text = df["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence).apply(tokenize).to_list()

# Identify indices of entries in 'text' that are empty after preprocessing.
remove_idx = [idx for idx, val in enumerate(text) if len(val) == 0]

# Remove empty entries from 'text' and the corresponding entries from 'mask'.
for i in reversed(remove_idx):
    text.pop(i)
    mask.pop(i)

# Initialize and train a FastText model on the preprocessed text data.
model = FastText(text, vector_size=128, window=128, min_count=1, workers=4, sg=1)

# Generate one vector per title by averaging the embeddings of its words.
word2vec = np.vstack([model.wv[sentence].mean(axis=0) for sentence in text])
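
Dropping empty titles changes the number of rows, so labels have to be filtered by the same positions to stay aligned with the sentence vectors. The sketch below illustrates this with hypothetical two-dimensional word vectors in a plain dictionary, used here in place of a trained model.

```python
import numpy as np

# Hypothetical 2-d word vectors standing in for a trained embedding model.
vectors = {
    "read": np.array([1.0, 0.0]),
    "csv": np.array([0.0, 1.0]),
    "file": np.array([1.0, 1.0]),
}

# Tokenized titles; the second one is empty after pre-processing.
tokens = [["read", "csv"], [], ["file"]]
labels = np.array(["python", "java", "r"])

# Positions of the non-empty titles.
keep = np.array([len(t) > 0 for t in tokens])

# One averaged vector per non-empty title, with labels filtered the same way.
sentence_vectors = np.vstack([np.mean([vectors[w] for w in t], axis=0) for t in tokens if t])
kept_labels = labels[keep]
```

Filtering by position rather than by index label keeps the two arrays the same length regardless of how the source DataFrame is indexed.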


### Visualisation

In [None]:
# Initialize PCA with specified number of components and random state for reproducibility.
pca = PCA(n_components=COMPONENTS, random_state=RANDOM_STATE)

# Fit PCA on the word vectors and transform the data.
reduced_data_pca = pca.fit_transform(word2vec)

# Create a DataFrame from the PCA reduced data with named dimensions.
reduced_df = pd.DataFrame(reduced_data_pca, columns=["Dim 1", "Dim 2"])

# Add a 'Label' column from 'Main_Tag', dropping every row removed as empty (remove_idx holds positions, not index labels).
reduced_df["Label"] = np.delete(df["Main_Tag"].to_numpy(), remove_idx)

# Apply a mask to the DataFrame and create a copy of the filtered data.
reduced_df = reduced_df[mask].copy()

# Extract unique labels from the 'Label' column.
unique_labels = reduced_df['Label'].unique()

# Generate all unique pairs of labels using a set to avoid duplicates.
unique_label_pairs = set()
for label1 in unique_labels:
    for label2 in unique_labels:
        if label1 != label2:
            unique_label_pairs.add(tuple(sorted([label1, label2])))

# Convert the set of unique label pairs to a sorted list, limiting to the first 9 pairs.
unique_label_pairs = [(row[2], row[3]) for row in pd.DataFrame(unique_label_pairs).sort_values(by=[0, 1]).reset_index().itertuples()][:9]

# Plot scatter plots for each pair of labels.
for i, (label1, label2) in enumerate(unique_label_pairs):

    # Creating a temporary DataFrame with only the selected labels.
    tmp_df = reduced_df[(reduced_df['Label'] == label1) | (reduced_df['Label'] == label2)].copy()

    # Plotting a scatter plot using Plotly Express.
    fig = px.scatter(tmp_df, x="Dim 1", y="Dim 2", color="Label", title=f'{label1} vs {label2}')
    fig.show()



## USE

In [None]:
# Check if GPU is available and set memory growth to avoid memory allocation errors
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

# Load the Universal Sentence Encoder model
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Pre-process the titles with the text processing functions defined above
text = df["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence).to_list()

# Define batch size
batch_size = 128  # Adjust based on your memory availability

# Placeholder for batch encoded inputs
batch_encoded_inputs = []

# Batch encode in a loop
for start_idx in tqdm(range(0, len(text), batch_size), desc="Encoding"):
    # Get the batch
    batch = text[start_idx:start_idx + batch_size]

    # Encode the batch using the model
    encoded_results = model(batch)

    # Store the encoded batch
    for result in encoded_results.numpy().tolist():
        batch_encoded_inputs.append(result)



### Visualisation

In [None]:
# Recreate the top-ten mask over all rows, since the Word2Vec cell removed entries from the earlier one.
mask = df["Main_Tag"].isin(top_ten).to_list()

# Initialize PCA with specified components and random state.
pca = PCA(n_components=COMPONENTS, random_state=RANDOM_STATE)

# Apply PCA transformation on the batch_encoded_inputs.
reduced_data_pca = pca.fit_transform(batch_encoded_inputs)

# Create a DataFrame from PCA results with named dimensions.
reduced_df = pd.DataFrame(reduced_data_pca, columns=["Dim 1", "Dim 2"])

# Add a 'Label' column from 'Main_Tag'.
reduced_df["Label"] = df["Main_Tag"].values

# Apply the mask to filter rows based on the top ten tags.
reduced_df = reduced_df[mask].copy()

# Extract unique labels from the filtered DataFrame.
unique_labels = reduced_df['Label'].unique()

# Generate all unique pairs of labels avoiding duplicates.
unique_label_pairs = set()
for label1 in unique_labels:
    for label2 in unique_labels:
        if label1 != label2:
            unique_label_pairs.add(tuple(sorted([label1, label2])))

# Convert the set of unique label pairs to a sorted list, limit to first 9 pairs.
unique_label_pairs = [(row[2], row[3]) for row in pd.DataFrame(unique_label_pairs).sort_values(by=[0, 1]).reset_index().itertuples()][:9]

# Plot scatter plots for each pair of labels.
for i, (label1, label2) in enumerate(unique_label_pairs):

    # Create a temporary DataFrame for the selected labels.
    tmp_df = reduced_df[(reduced_df['Label'] == label1) | (reduced_df['Label'] == label2)].copy()

    # Plot using Plotly Express.
    fig = px.scatter(tmp_df, x="Dim 1", y="Dim 2", color="Label", title=f'{label1} vs {label2}')
    fig.show()

