<a href="https://colab.research.google.com/github/amaye15/stackoverflow-question-classifier/blob/main/code/N4_Supervised_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Dimensionality Reduction Techniques
Apply appropriate techniques to reduce high-dimensional data to two dimensions, and plot the result to support exploratory analysis.

## CE1: Implementing Dimensionality Reduction
- You have implemented at least one dimensionality reduction technique (LDA, PCA, t-SNE, UMAP, or another technique).

## CE2: 2D Graphical Representation
- You have produced at least one plot of the data reduced to 2D (for example, with LDAvis for the topics).

## CE3: Analysis of the 2D Plot
- You have carried out and written up an analysis of the 2D plot.
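
As a rough illustration of the first two criteria, the sketch below reduces a handful of titles to two dimensions with TF-IDF followed by PCA. The titles and tags are invented for the example and are not taken from the dataset; the column names `x`, `y`, and `tag` are also assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles and tags, invented for illustration (not from the dataset)
titles = [
    "How to merge two pandas dataframes",
    "Pandas groupby with multiple columns",
    "JavaScript promise not resolving",
    "Async await in JavaScript loops",
    "SQL join returns duplicate rows",
    "Optimise slow SQL query with index",
]
tags = ["python", "python", "javascript", "javascript", "sql", "sql"]

# Step 1: high-dimensional representation, one TF-IDF column per vocabulary term
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(titles).toarray()  # dense input works across scikit-learn versions

# Step 2: project onto the two directions of largest variance
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)

# Step 3: tidy frame that a 2D scatter plot can consume
reduced = pd.DataFrame(coords, columns=["x", "y"])
reduced["tag"] = tags
print(reduced)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

A frame like `reduced` could then be passed to a scatter plot, for example `px.scatter(reduced, x="x", y="y", color="tag")`, to obtain the 2D figure that the analysis would comment on.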

# Libraries

In [1]:
%pip install pyLDAvis==3.4.0 datasets --quiet


In [11]:
# Standard library imports
import math
import os
import re
import string

# Third-party imports
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import tensorflow as tf
import tensorflow_hub as hub
import torch

# Functions and classes
from datasets import load_dataset
from gensim.corpora import Dictionary
from gensim.models import FastText, LdaModel, Word2Vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE
from tqdm.notebook import trange, tqdm
from transformers import BertTokenizer, BertModel
from wordcloud import WordCloud

def is_top_k(row, y_col, y_pred_col, k):
    """
    Check if the actual value in a specified column is within the top 'k' predicted values in another column.

    This function is designed to operate on a row of a pandas DataFrame. It compares the actual value from one column
    ('y_col') with a list of predicted values in another column ('y_pred_col'), and checks if the actual value is within
    the top 'k' elements of the predicted list.

    Parameters:
    row (pd.Series): A row from a pandas DataFrame.
    y_col (str): The name of the column containing the actual value.
    y_pred_col (str): The name of the column containing the list of predicted values.
    k (int): The number of top elements from the predicted values list to consider.

    Returns:
    bool: True if the actual value is within the top 'k' predicted values, False otherwise.
    """
    return row[y_col] in row[y_pred_col][:k]
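

To make the behaviour of `is_top_k` concrete, here is a minimal sketch that applies it row by row to a toy DataFrame. The rows are invented for illustration; the function body is repeated only so that the snippet runs on its own.

```python
import pandas as pd

def is_top_k(row, y_col, y_pred_col, k):
    # Same logic as the notebook's helper, repeated so the snippet is self-contained
    return row[y_col] in row[y_pred_col][:k]

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Main_Tag": ["python", "java", "sql"],
    "Predicted_Tags": [
        ["pandas", "python", "numpy"],   # true tag is ranked 2nd
        ["c#", "kotlin", "java"],        # true tag is ranked 3rd
        ["python", "pandas", "numpy"],   # true tag is absent
    ],
})

# With k=2 only the first row counts as a hit
hits = toy.apply(lambda row: is_top_k(row, "Main_Tag", "Predicted_Tags", k=2), axis=1)
print(hits.tolist())  # [True, False, False]
print("Top-2 accuracy:", hits.mean())
```

Averaging the boolean result, as in the last line, gives a top-k accuracy, which is one way the same helper could be reused for evaluation.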



# Data Preparation

In [9]:
# Define the dataset name and repository name for Stack Overflow Zero-Shot Classification
NAME = "amaye15/Stack-Overflow-Zero-Shot-Classification"
RESPOSITORY = "amaye15/Stack-Overflow-Zero-Shot-Classification"
K = 20

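# Read the Stack Overflow and Hugging Face API keys from the environment
# instead of hard-coding secrets in the notebook
# (the environment variable names below are an assumption)
STACK_KEY = os.environ.get("STACK_KEY")
HF_KEY = os.environ.get("HF_KEY")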

# Load the dataset from the Hugging Face hub using the dataset name
ds = load_dataset(NAME)

# Convert the 'train' split of the dataset to a pandas DataFrame
df = ds["train"].to_pandas()

# Uncomment the following line to push the dataset to the Hugging Face hub
# ds2.push_to_hub(RESPOSITORY, token=HF_KEY)

# Extract the main tag from the 'Tags' column and store it in a new column 'Main_Tag'
df["Main_Tag"] = df["Tags"].str.replace(" ", "").apply(lambda x: next(iter(x.split(","))))

# Extract the main predicted tag from 'Predicted_Tags' and store it in 'Predicted_Main_Tag'
df["Predicted_Main_Tag"] = df["Predicted_Tags"].str.replace(" ", "").apply(lambda x: next(iter(x.split(","))))

# Process 'Predicted_Tags' to create a list of tags by removing spaces and splitting on commas
df["Predicted_Tags"] = df["Predicted_Tags"].str.replace(" ", "").str.split(",")

# Filter the DataFrame to keep only rows where the main tag is within the top k predicted tags
# Here, 'is_top_k' is a function that checks if the main tag is in the top k predicted tags
df = df[df.apply(lambda row: is_top_k(row, y_col = "Main_Tag", y_pred_col = "Predicted_Tags", k = K), axis=1)].copy()

# Take the ten most frequent 'Main_Tag' values as a list.
# value_counts() sorts by descending frequency, so the first ten index labels are the top ten;
# reading them from the index avoids depending on pandas-version-specific column names.
top_ten = df["Main_Tag"].value_counts().head(10).index.to_list()

# Create a mask (a list of boolean values) indicating whether each row's 'Main_Tag' is in the top_ten list
mask = df["Main_Tag"].isin(top_ten).to_list()
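

The top-tag selection and the resulting mask can be checked on a small example. The sketch below uses an invented `Main_Tag` column and keeps the two most frequent tags instead of ten; `top_n` and `top_tags` are names introduced only for this illustration.

```python
import pandas as pd

# Toy column, invented for illustration
toy = pd.DataFrame(
    {"Main_Tag": ["python", "sql", "python", "java", "python", "sql", "rust"]}
)

top_n = 2  # the notebook keeps 10; 2 keeps the example readable

# value_counts() orders tags from most to least frequent
top_tags = toy["Main_Tag"].value_counts().head(top_n).index.to_list()

# Boolean mask: True where the row's tag is one of the most frequent tags
mask = toy["Main_Tag"].isin(top_tags)

print(top_tags)   # ['python', 'sql']
print(toy[mask])  # the five rows tagged python or sql
```

A mask like this is typically used to restrict a plot to the most frequent classes so that the colour legend stays legible.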



# LDA Visualisation

In [12]:
topics = 10
documents = df["Title"].str.split().tolist()

# 'documents' is a list of tokenized titles (each title is a list of whitespace-separated tokens);
# build the gensim vocabulary and the bag-of-words corpus from it
dictionary = Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=topics)

# Prepare the visualization
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)

# Display the interactive visualization
pyLDAvis.display(vis)


