
## Using Dimensionality Reduction Techniques
Use appropriate techniques to reduce high-dimensional data to two dimensions and plot the result in order to carry out an exploratory analysis.

### CE1: Implementing Dimensionality Reduction
- You have implemented at least one dimensionality reduction technique (LDA, PCA, t-SNE, UMAP, or another technique).

### CE2: 2D Graphical Representation
- You have produced at least one plot of the data reduced to 2D (for example with LDAvis for topics).

### CE3: Analysis of the 2D Plot
- You have carried out and written up an analysis of the 2D plot.


## Preprocessing Text Data
Preprocess unstructured text data, taking intellectual property rules into account, and perform feature engineering suited to the learning models in order to obtain a usable dataset.

### CE1: Text Cleaning
- You have cleaned the text fields (removal of punctuation and of words).
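
The `clean_text` helper that this notebook imports from `functions` is not shown. The block below is a minimal sketch of what such a function could look like, assuming it lowercases the text and deletes punctuation without inserting spaces (this matches the `TitleClean` values displayed further down, but the actual implementation may differ).

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical stand-in for functions.clean_text (an assumption, not the author's code)."""
    text = text.lower()
    # Delete every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse any whitespace runs left behind and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("document.querySelector(...) is null error"))
# documentqueryselector is null error
```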

### CE2: Tokenization
- You have written a function that tokenizes a sentence.
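
The notebook's own `tokenize` function lives in the unshown `functions` module and may rely on `nltk.tokenize.word_tokenize`, which needs the `punkt` data. As a dependency-free illustration, here is a regex-based sketch; the function body is an assumption, not the author's implementation.

```python
import re


def tokenize(sentence: str) -> list:
    """Hypothetical regex tokenizer: lowercase, then keep runs of word characters."""
    return re.findall(r"\w+", sentence.lower())


print(tokenize("how to delete files from amazon s3 bucket?"))
```

On text that has already been cleaned of punctuation, this behaves like a plain `str.split()`.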

### CE3: Stemming
- You have written a function that stems a sentence.
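
A possible shape for the `stem_sentence` helper, sketched with NLTK's `PorterStemmer` (which needs no corpus download). The choice of the Porter algorithm is an assumption: it is consistent with stems such as `decod` and `entiti` in the output further down, but the source does not state which stemmer is used.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()


def stem_sentence(sentence: str) -> str:
    """Hypothetical stand-in for functions.stem_sentence: stem each whitespace-separated word."""
    return " ".join(_stemmer.stem(word) for word in sentence.split())


print(stem_sentence("javascript decoding html entities"))
# javascript decod html entiti
```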

### CE4: Lemmatization
- You have written a function that lemmatizes a sentence.

### CE5: Feature Engineering
- You have built bag-of-words features (standard bag-of-words with word counts, and Tf-idf), with additional cleaning steps: a word-frequency threshold and word normalization.

### CE6: Test on an Example
- You have tested a sentence or short example text to illustrate that the 5 preceding steps work correctly.

### CE7: Advanced Embedding Techniques
- In addition to the bag-of-words approach, you have implemented 3 word/sentence embedding approaches: Word2Vec (or Doc2Vec, GloVe, or FastText), BERT, and USE (Universal Sentence Encoder).

### CE8: Respecting Intellectual Property
- You have made sure that the processed text is not covered by intellectual property rights that prohibit its use or modification.


In [1]:
import os
import nltk
import umap
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec, FastText
from transformers import BertTokenizer, BertModel
from tqdm.notebook import tqdm
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px
import tensorflow as tf
import tensorflow_hub as hub

from functions import clean_text, tokenize, stem_sentence, lemmatize_sentence, downsample_to_number

nltk.download('punkt')
nltk.download('wordnet')

ROOT_PATH = os.getcwd()
DATA_LOAD_FILE = os.path.join(ROOT_PATH, "data/StackOverFlow.csv.gz")
DATA_SAVE_FILE = os.path.join(ROOT_PATH, "data/StackOverFlowEmbedded.csv.gz")

SAMPLE_SIZE = 1000
TOP_N = 2

df = pd.read_csv(DATA_LOAD_FILE)
df["Label"] = df.Tags.str.split(",").apply(lambda x: x[0])
TopLanguages = list(df.Label.value_counts().keys())[:TOP_N]
TopLanguagesDataframe = df.loc[df.Label.isin(TopLanguages)].copy()
# Downsample each class to SAMPLE_SIZE observations
DataframeDownSampled = downsample_to_number(TopLanguagesDataframe, 'Label', SAMPLE_SIZE)
DataframeDownSampled["TitleClean"] = DataframeDownSampled["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence)
DataframeDownSampled["TitleCleanTokenised"] = DataframeDownSampled["TitleClean"].apply(tokenize)
DataframeDownSampled



| | Date | Title | Tags | Score | Label | TitleClean | TitleCleanTokenised |
|---|---|---|---|---|---|---|---|
| 387977 | 2013-12-10 14:33:34 | document.querySelector(...) is null error | javascript | 66 | javascript | documentqueryselector is null error | [documentqueryselector, is, null, error] |
| 551201 | 2016-01-28 05:57:25 | How to set activeClassName for wrapper element... | javascript, reactjs, react-router, redux | 50 | javascript | how to set activeclassnam for wrapper element ... | [how, to, set, activeclassnam, for, wrapper, e... |
| 403081 | 2012-05-23 10:01:29 | Javascript decoding html entities | javascript, jquery | 64 | javascript | javascript decod html entiti | [javascript, decod, html, entiti] |
| 373313 | 2013-05-26 01:50:13 | check if function is a generator | javascript, generator, yield, ecmascript-6 | 68 | javascript | check if function is a gener | [check, if, function, is, a, gener] |
| 175550 | 2012-05-15 13:40:52 | Convert long number into abbreviated string in... | javascript, numbers | 120 | javascript | convert long number into abbrevi string in jav... | [convert, long, number, into, abbrevi, string,... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6296 | 2014-10-08 23:00:19 | How do I count the NaN values in a column in p... | python, pandas, dataframe | 785 | python | how do i count the nan valu in a column in pan... | [how, do, i, count, the, nan, valu, in, a, col... |
| 109783 | 2010-06-29 14:54:07 | how to delete files from amazon s3 bucket? | python, amazon-s3, bucket | 165 | python | how to delet file from amazon s3 bucket | [how, to, delet, file, from, amazon, s3, bucket] |
| 3585 | 2013-06-04 18:46:56 | Writing a pandas DataFrame to CSV file | python, csv, pandas, dataframe | 1091 | python | write a panda datafram to csv file | [write, a, panda, datafram, to, csv, file] |
| 11026 | 2013-02-08 11:41:57 | How do I change the figure size with subplots? | python, matplotlib, subplot, figure | 640 | python | how do i chang the figur size with subplot | [how, do, i, chang, the, figur, size, with, su... |
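
The `downsample_to_number` helper used in the cell above comes from the unshown `functions` module. A minimal sketch of one way to write it with pandas follows; the signature, the `seed` parameter, and the sampling strategy are assumptions made for illustration.

```python
import pandas as pd


def downsample_to_number(df: pd.DataFrame, label_col: str, n: int, seed: int = 42) -> pd.DataFrame:
    """Hypothetical stand-in: keep at most n randomly chosen rows per class."""
    parts = [
        # Sample without replacement; classes smaller than n are kept whole
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


toy = pd.DataFrame({"Label": ["python"] * 5 + ["javascript"] * 3, "Title": list("abcdefgh")})
balanced = downsample_to_number(toy, "Label", 3)
print(balanced["Label"].value_counts().to_dict())
```

The original row index is preserved, so sampled rows can still be traced back to the source file.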


# Count Vectoriser

In [2]:
count_vectorizer = CountVectorizer(min_df=5, lowercase=False)
count_vectorized = count_vectorizer.fit_transform(DataframeDownSampled["TitleClean"])
count_dataframe = pd.DataFrame(count_vectorized.toarray(), columns=count_vectorizer.get_feature_names_out())
count_dataframe

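| | access | activ | ad | add | after | ajax | all | alreadi | altern | amp | ... | with | within | without | work | worker | write | year | you | zero | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1996 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1997 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1998 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |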


# Tf-idf Vectoriser

In [3]:
tfidf_vectorizer = TfidfVectorizer(min_df=5, use_idf=True)
tfidf_vectorized = tfidf_vectorizer.fit_transform(DataframeDownSampled["TitleClean"])

# Use the appropriate method depending on your scikit-learn version
try:
    feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
except AttributeError:
    # Fallback for scikit-learn versions prior to 1.0, which lack get_feature_names_out
    feature_names_tfidf = tfidf_vectorizer.get_feature_names()

# Create a dataframe with the feature names as columns
tfidf_dataframe = pd.DataFrame(tfidf_vectorized.toarray(), columns=feature_names_tfidf)
tfidf_dataframe

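| | access | activ | ad | add | after | ajax | all | alreadi | altern | amp | ... | with | within | without | work | worker | write | year | you | zero | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.243027 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1996 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.506763 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1998 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.282944 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 |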
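
For readers wondering where the Tf-idf weights come from: with its default settings, `TfidfVectorizer` uses the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each row to unit L2 norm. The sketch below checks both facts on three made-up documents (illustrative data only).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["read csv file", "write csv file", "pars json"]

vectorizer = TfidfVectorizer()  # defaults: use_idf=True, smooth_idf=True, norm="l2"
tfidf = vectorizer.fit_transform(docs).toarray()

n_docs = len(docs)
# Document frequency: number of documents in which each term has a non-zero weight
doc_freq = (tfidf > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_matches = np.allclose(manual_idf, vectorizer.idf_)
row_norms = np.linalg.norm(tfidf, axis=1)
print(idf_matches)
print(row_norms)
```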


# Dimensionality Reduction (PCA, UMAP, t-SNE)

In [4]:
COMPONENTS = 2
RANDOM_STATE = 42

pca = PCA(n_components=COMPONENTS, random_state=RANDOM_STATE)
reduced_data_pca = pca.fit_transform(tfidf_dataframe.values)

reduced_df = pd.DataFrame(reduced_data_pca, columns=["Dim 1", "Dim 2"])
reduced_df["Label"] = DataframeDownSampled["Label"].values

fig1 = px.scatter(reduced_df, x = "Dim 1", y = "Dim 2", color="Label")
fig1.show()

reducer = umap.UMAP(n_components=COMPONENTS, random_state=RANDOM_STATE)
reduced_data_umap = reducer.fit_transform(tfidf_dataframe.values)

reduced_df = pd.DataFrame(reduced_data_umap, columns=["Dim 1", "Dim 2"])
reduced_df["Label"] = DataframeDownSampled["Label"].values

fig2 = px.scatter(reduced_df, x = "Dim 1", y = "Dim 2", color="Label")
fig2.show()

tsne = TSNE(n_components=COMPONENTS, random_state=RANDOM_STATE, init="random")
reduced_data_tsne = tsne.fit_transform(tfidf_dataframe.values)

reduced_df = pd.DataFrame(reduced_data_tsne, columns=["Dim 1", "Dim 2"])
reduced_df["Label"] = DataframeDownSampled["Label"].values

fig3 = px.scatter(reduced_df, x = "Dim 1", y = "Dim 2", color="Label")
fig3.show()
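
When analysing the PCA scatter plot, it can help to know how much of the total variance the two plotted axes retain. The sketch below shows how to read that from `explained_variance_ratio_`; it runs on synthetic data standing in for `tfidf_dataframe.values`, so the printed numbers say nothing about the notebook's actual projection.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for the Tf-idf matrix: 200 rows, 50 features
X = rng.normal(size=(200, 50))
X[:, 0] *= 5.0  # give one direction noticeably more variance

pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)

ratios = pca.explained_variance_ratio_
print(X_2d.shape)
print(ratios, ratios.sum())  # share of variance kept by each axis, and in total
```

A low total suggests that the 2D picture discards most of the structure, in which case apparent overlap between classes should be interpreted with caution. UMAP and t-SNE have no directly comparable quantity.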






# BERT Embeddings

In [5]:
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_model.eval()  # inference mode: disables dropout

# Batch size: choose a value that fits in memory
batch_size = 128

# One sentence embedding (a list of floats) per title
bert_embeddings = []

for start_idx in tqdm(range(0, len(DataframeDownSampled), batch_size), desc="Encoding"):
    # Slice the next batch of cleaned titles by position
    batch = DataframeDownSampled["TitleClean"].iloc[start_idx:start_idx + batch_size].tolist()
    # Tokenize, padding every title to the longest one in the batch
    batch_encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    # Forward pass without gradient tracking, which saves memory at inference time
    with torch.no_grad():
        encoded_results = bert_model(**batch_encoded)
    # Mean-pool the token vectors into one vector per title
    # (this plain mean also includes padded positions)
    bert_embeddings.extend(encoded_results.last_hidden_state.mean(dim=1).tolist())

DataframeDownSampled["BertEmbeddings"] = bert_embeddings
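
Because `padding=True` pads every title to the longest one in its batch, the plain `.mean(dim=1)` in the cell above also averages the vectors at padded positions. A common alternative is to weight the mean by the attention mask. The NumPy sketch below illustrates that arithmetic on a tiny made-up array (not the notebook's data); the function name `masked_mean` is hypothetical, and the same operations are available on PyTorch tensors.

```python
import numpy as np


def masked_mean(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token, 0 = padding.
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts


# One sentence with two real tokens and one padded position
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(hidden, mask)
print(pooled)                # [[2. 2.]]
print(hidden.mean(axis=1))   # the plain mean is pulled towards the padded vector
```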


# FastText Embeddings

In [6]:
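# Train a skip-gram (sg=1) FastText model on the tokenised titles
fasttext_model = FastText(sentences=DataframeDownSampled["TitleCleanTokenised"].tolist(), vector_size=128, window=128, min_count=1, workers=4, sg=1)
# Sentence embedding = mean of the word vectors of its tokens, stored as a plain list
DataframeDownSampled["FastTextEmbeddings"] = [fasttext_model.wv[tokens].mean(axis=0).tolist() for tokens in DataframeDownSampled["TitleCleanTokenised"]]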

# USE Embeddings

In [7]:
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Batch size: choose a value that fits in memory
batch_size = 128

# One sentence embedding (a list of floats) per title
use_embeddings = []

for start_idx in tqdm(range(0, len(DataframeDownSampled), batch_size), desc="Encoding"):
    # Slice the next batch of cleaned titles by position
    batch = DataframeDownSampled["TitleClean"].iloc[start_idx:start_idx + batch_size].tolist()
    # USE takes raw strings directly, so no separate tokenisation step is needed
    encoded_results = use_model(batch)
    # Convert the TensorFlow tensor to plain Python lists, one per title
    use_embeddings.extend(encoded_results.numpy().tolist())

DataframeDownSampled["UseEmbeddings"] = use_embeddings


# Save Embeddings

In [8]:
# List-valued embedding columns are written to the CSV as their string representation
DataframeDownSampled.to_csv(DATA_SAVE_FILE, index=False, compression="gzip")
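
One thing to keep in mind when reloading this file: CSV has no list type, so each embedding comes back as a string and has to be parsed. The sketch below shows a round trip on a tiny made-up frame using an in-memory buffer; the column names are illustrative, and `json.loads` is only one possible parser.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"Title": ["a", "b"], "Embedding": [[0.1, 0.2], [0.3, 0.4]]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # each list cell is written as its string form
buffer.seek(0)

reloaded = pd.read_csv(buffer)
raw_type = type(reloaded.loc[0, "Embedding"])
print(raw_type)  # the cell is a string at this point

# Parse the string form of each Python list back into a list of floats
reloaded["Embedding"] = reloaded["Embedding"].apply(json.loads)
print(reloaded.loc[0, "Embedding"])
```

This works for plain Python lists. NumPy arrays are stringified without commas, so converting them with `.tolist()` before saving keeps the column parseable.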

In [9]:
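# Unused sketch, left commented out. Note that this is Linear Discriminant Analysis,
# not the Latent Dirichlet Allocation topic model that LDAvis visualises.
#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Assuming `X` is the feature matrix and `y` the target labels.
# scikit-learn caps n_components at min(n_features, n_classes - 1), so with the
# TOP_N = 2 classes used here only n_components=1 would be accepted.
#lda = LDA(n_components=2)
#X_lda = lda.fit_transform(X, y)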
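
As a hedged illustration of the commented-out idea, the sketch below runs scikit-learn's `LinearDiscriminantAnalysis` on synthetic two-class data (a stand-in for the embedded titles, not the notebook's data). It shows that two classes yield a single discriminant axis and that requesting two components is rejected.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic two-class data: 50 points per class, 5 features
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)), rng.normal(2.0, 1.0, size=(50, 5))])
y = np.array(["javascript"] * 50 + ["python"] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # one discriminant axis for two classes

rejected = False
try:
    # n_components may not exceed min(n_features, n_classes - 1)
    LinearDiscriminantAnalysis(n_components=2).fit(X, y)
except ValueError as err:
    rejected = True
    print("n_components=2 rejected:", err)
```

To obtain a 2D LDA projection, at least three classes would be needed, for example by raising `TOP_N` to 3 or more.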
