
## Using Dimensionality Reduction Techniques
Use appropriate techniques to reduce high-dimensional data to two dimensions and plot the result for exploratory analysis.

### CE1: Implementing Dimensionality Reduction
- You have implemented at least one dimensionality reduction technique (LDA, PCA, t-SNE, UMAP, or another technique).

### CE2: 2D Graphical Representation
- You have produced at least one plot of the data reduced to 2D (for example with LDAvis for topics).

### CE3: Analysis of the 2D Plot
- You have carried out and written up an analysis of the 2D plot.


## Preprocessing Text Data
Preprocess unstructured text data, taking intellectual property rules into account, and carry out feature engineering suited to the learning models in order to obtain a usable dataset.

### CE1: Text Cleaning
- You have cleaned the text fields (removal of punctuation and of unwanted words, such as stop words).

### CE2: Tokenization
- You have written a function that tokenizes a sentence.

### CE3: Stemming
- You have written a function that stems a sentence.

### CE4: Lemmatization
- You have written a function that lemmatizes a sentence.

### CE5: Feature Engineering
- You have built bag-of-words features (standard bag-of-words, i.e. word counts, and TF-IDF), with additional cleaning steps: a word-frequency threshold and word normalization.

### CE6: Test on an Example
- You have tested a sentence or a short example text to illustrate that the 5 preceding steps work correctly.

### CE7: Advanced Embedding Techniques
- In addition to the bag-of-words approach, you have implemented 3 word/sentence embedding approaches: Word2Vec (or Doc2Vec, GloVe, or FastText), BERT, and USE (Universal Sentence Encoder).

### CE8: Respecting Intellectual Property
- You have made sure that the processed text is not covered by intellectual property rights that prohibit its use or modification.


In [1]:
import os
import nltk
import pandas as pd
import re
import string
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

ROOT_PATH = os.getcwd()
DATA_LOAD_FILE = os.path.join(ROOT_PATH, "data/StackOverFlow.csv.gz")

df = pd.read_csv(DATA_LOAD_FILE)
df

[nltk_data] Downloading package punkt to /Users/typhaine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/typhaine/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Date,Title,Tags,Score
0,2010-07-21 00:27:42,Iterating over dictionaries using &#39;for&#39...,"python, dictionary",4235
1,2013-05-10 09:04:49,How to iterate over rows in a DataFrame in Pandas,"python, pandas, dataframe, loops",4000
2,2011-06-24 17:55:08,Catch multiple exceptions in one line (except ...,"python, exception",3783
3,2010-08-09 04:52:50,Does Python have a string &#39;contains&#39; s...,"python, string, substring, contains",3587
4,2010-07-08 21:31:22,How do I list all files of a directory?,"python, directory",3466
...,...,...,...,...
245,2014-04-25 15:31:47,Asking the user for input until they give a va...,"python, validation, input",750
246,2011-04-15 14:22:01,How can I fill out a Python string with spaces?,"python, string, string-formatting, padding",746
247,2010-12-14 02:41:29,How do I append one string to another in Python?,"python, string, append",746
248,2015-05-01 04:25:16,Where does pip install its packages?,"python, django, pip, virtualenv",745


In [2]:

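def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return text

def tokenize(text):
    """Split the text into word tokens with NLTK."""
    return word_tokenize(text)

def stem_sentence(sentence):
    """Apply the Porter stemmer to each whitespace-separated word."""
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in sentence.split()])

def lemmatize_sentence(sentence):
    """Apply the WordNet lemmatizer (default noun POS) to each word."""
    lemmatizer = WordNetLemmatizer()
    return " ".join([lemmatizer.lemmatize(word) for word in sentence.split()])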


In [3]:
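# Clean, stem, then lemmatize each title. Note that lemmatizing after
# stemming has little effect, since stems are often not dictionary words.
df["TitleClean"] = df["Title"].apply(clean_text).apply(stem_sentence).apply(lemmatize_sentence)
# Tokenize the cleaned titles into lists of words.
df["TitleCleanTokenised"] = df["TitleClean"].apply(tokenize)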
df

Unnamed: 0,Date,Title,Tags,Score,TitleClean,TitleCleanTokenised
0,2010-07-21 00:27:42,Iterating over dictionaries using &#39;for&#39...,"python, dictionary",4235,iter over dictionari use 39for39 loop,"[iter, over, dictionari, use, 39for39, loop]"
1,2013-05-10 09:04:49,How to iterate over rows in a DataFrame in Pandas,"python, pandas, dataframe, loops",4000,how to iter over row in a datafram in panda,"[how, to, iter, over, row, in, a, datafram, in..."
2,2011-06-24 17:55:08,Catch multiple exceptions in one line (except ...,"python, exception",3783,catch multipl except in one line except block,"[catch, multipl, except, in, one, line, except..."
3,2010-08-09 04:52:50,Does Python have a string &#39;contains&#39; s...,"python, string, substring, contains",3587,doe python have a string 39contains39 substr m...,"[doe, python, have, a, string, 39contains39, s..."
4,2010-07-08 21:31:22,How do I list all files of a directory?,"python, directory",3466,how do i list all file of a directori,"[how, do, i, list, all, file, of, a, directori]"
...,...,...,...,...,...,...
245,2014-04-25 15:31:47,Asking the user for input until they give a va...,"python, validation, input",750,ask the user for input until they give a valid...,"[ask, the, user, for, input, until, they, give..."
246,2011-04-15 14:22:01,How can I fill out a Python string with spaces?,"python, string, string-formatting, padding",746,how can i fill out a python string with space,"[how, can, i, fill, out, a, python, string, wi..."
247,2010-12-14 02:41:29,How do I append one string to another in Python?,"python, string, append",746,how do i append one string to anoth in python,"[how, do, i, append, one, string, to, anoth, i..."
248,2015-05-01 04:25:16,Where does pip install its packages?,"python, django, pip, virtualenv",745,where doe pip instal it packag,"[where, doe, pip, instal, it, packag]"
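
The `39for39` and `39contains39` tokens in the table above are what remains of the HTML entity `&#39;` (an apostrophe) once punctuation is stripped. Below is a minimal sketch of one way to avoid this, assuming the titles only need entity decoding before the existing cleaning steps. `clean_title` and the tiny `STOP_WORDS` set are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import html
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "do", "i", "how"}

PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def clean_title(text, stop_words=STOP_WORDS):
    """Decode HTML entities, lowercase, strip punctuation, drop stop words."""
    text = html.unescape(text)      # "&#39;for&#39;" becomes "'for'"
    text = text.lower()
    text = PUNCT_RE.sub("", text)   # same punctuation rule as clean_text
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_title("Iterating over dictionaries using &#39;for&#39; loops"))
# prints: iterating over dictionaries using for loops
```

A real stop-word list (for example the one shipped with NLTK) would replace the toy set if this approach were adopted.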


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df=1, lowercase=False)
count_vectorized = count_vectorizer.fit_transform(df["TitleClean"])
count_dataframe = pd.DataFrame(count_vectorized.toarray(), columns=count_vectorizer.get_feature_names_out())

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=2, use_idf=True)
tfidf_vectorized = tfidf_vectorizer.fit_transform(df["TitleClean"])

# Use the appropriate method depending on your scikit-learn version
try:
    feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()
except AttributeError:
    # Fallback for scikit-learn versions prior to 1.0,
    # where get_feature_names_out() did not exist yet
    feature_names_tfidf = tfidf_vectorizer.get_feature_names()

# Create a dataframe with the feature names as columns
tfidf_dataframe = pd.DataFrame(tfidf_vectorized.toarray(), columns=feature_names_tfidf)
tfidf_dataframe

Unnamed: 0,after,all,an,and,anoth,append,are,argpars,argument,array,...,version,virtualenv,way,what,whi,with,without,work,write,you
0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,0.0,0.500384,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
246,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.389439,0.0,0.0,0.0,0.0
247,0.0,0.000000,0.0,0.0,0.514226,0.465822,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
248,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.00000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
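
To make the weights in the table above easier to interpret, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The three-document corpus and the function name `tfidf_rows` are made up for illustration, and the whitespace tokenisation is a simplification of what scikit-learn does.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["python string append", "python string format", "python pip instal"]
vocab, rows = tfidf_rows(docs)
weights = dict(zip(vocab, rows[0]))
print({term: round(w, 3) for term, w in weights.items() if w > 0})
```

In the first toy document, `append` (present in one document) gets a higher weight than `string` (two documents), which in turn outweighs `python` (all three).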


In [6]:
from gensim.models import Word2Vec

# Train Word2Vec on the cleaned (not stemmed) and tokenized titles
sentences = [word_tokenize(clean_text(sentence)) for sentence in df['Title']]
model_w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=1)
# One 128-dimensional vector per word, grouped by title
w2v_embeddings = [[model_w2v.wv[word].tolist() for word in sentence] for sentence in sentences]
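
`w2v_embeddings` holds one vector per word, so titles of different lengths produce lists of different lengths. A common way to obtain one fixed-size vector per title, sketched here under the assumption that simple averaging is good enough for exploration, is to take the mean of the word vectors. The `mean_vector` helper and the toy two-dimensional lookup table are hypothetical; with the notebook's model, `model_w2v.wv` would play the role of `vectors`.

```python
import numpy as np

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy lookup table standing in for a trained embedding model.
toy_vectors = {
    "python": np.array([1.0, 0.0]),
    "string": np.array([0.0, 1.0]),
}

title_vec = mean_vector(["python", "string", "unknown"], toy_vectors, dim=2)
print(title_vec)
```

Unknown tokens are skipped rather than treated as zeros, so they do not pull the average towards the origin.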

In [7]:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize all titles, padding to the longest title in the batch and
# truncating at the model limit (512 tokens for bert-base-uncased)
encoded_inputs = tokenizer(df["TitleClean"].values.tolist(), padding=True, truncation=True, max_length=tokenizer.model_max_length, return_tensors="pt")

# Gradients are not needed to extract embeddings, so skip tracking them
with torch.no_grad():
    encoded_results = model(**encoded_inputs)

# One vector per token: shape (n_titles, sequence_length, hidden_size)
bert_embeddings = encoded_results.last_hidden_state.tolist()
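
`last_hidden_state` contains one vector per token, including padding positions. To reduce it to one vector per title, one option is mean pooling weighted by the attention mask. The sketch below shows the arithmetic on small NumPy arrays standing in for the model output. The function name `masked_mean` and the toy values are assumptions, and other pooling choices (for example the `[CLS]` vector) are equally possible.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Mean over the sequence axis, ignoring positions where mask == 0.

    hidden: (batch, seq_len, dim) token vectors
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# One title with two real tokens and one padding position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean(hidden, mask))
```

With the notebook's variables, `encoded_results.last_hidden_state` and `encoded_inputs["attention_mask"]` would be the corresponding inputs once converted to NumPy arrays.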

In [8]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(["sample sentence"])




In [10]:
embeddings 

<tf.Tensor: shape=(1, 512), dtype=float32, numpy=
array([[-0.03938251,  0.00283756,  0.0224392 ,  0.02451151, -0.04985985,
         0.05275183,  0.01906134,  0.03593647, -0.0448441 ,  0.01787178,
        -0.0338993 , -0.00872533, -0.02777039,  0.09363514, -0.0239977 ,
        -0.08983202, -0.03871281,  0.00450138, -0.02363321, -0.03881206,
        -0.01647639,  0.04881011, -0.00920217,  0.05666861, -0.06821937,
         0.06926924, -0.01130836, -0.0776007 ,  0.00864497, -0.0289582 ,
         0.00346942,  0.01833374, -0.06229272,  0.02653736, -0.09129824,
         0.02586212, -0.00045969,  0.03273475, -0.06651255,  0.03915773,
        -0.03467301,  0.0575624 , -0.00661325, -0.0138374 , -0.04131497,
        -0.02762498,  0.0424963 , -0.00791725,  0.04456054, -0.01393929,
        -0.00280851,  0.07165937, -0.03305272,  0.07435055, -0.06612381,
        -0.01491496,  0.00656823,  0.04087039,  0.06835364,  0.01885627,
        -0.01099553, -0.05027118, -0.02475891,  0.02324308,  0.00917101,
 

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

high_dim_data = tfidf_vectorized

tsne = TSNE(n_components=2, random_state=42, init="random")
reduced_data_tsne = tsne.fit_transform(high_dim_data)

plt.figure(figsize=(8, 8))
plt.scatter(reduced_data_tsne[:, 0], reduced_data_tsne[:, 1])
plt.title('2D t-SNE of High-Dimensional Data')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
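
An unlabelled scatter plot is hard to analyse. One possible aid, sketched below, is to colour each point by a tag from the `Tags` column. Since every visible row starts with `python`, the hypothetical `secondary_tag` helper takes the first tag that is not `python`. The integer codes it produces could then be passed to `plt.scatter(..., c=codes)`. The short `tags_column` list is a stand-in for `df["Tags"]`.

```python
def secondary_tag(tags, skip="python", default="other"):
    """Return the first comma-separated tag that differs from `skip`."""
    for tag in tags.split(","):
        tag = tag.strip()
        if tag and tag != skip:
            return tag
    return default

def encode_labels(labels):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels], codes

# Stand-in for df["Tags"]; the last entry has no secondary tag.
tags_column = [
    "python, dictionary",
    "python, pandas, dataframe, loops",
    "python, string, append",
    "python",
]
labels = [secondary_tag(t) for t in tags_column]
codes, mapping = encode_labels(labels)
print(labels)
print(codes)
```

With many distinct tags the colours become hard to tell apart, so grouping rare tags under `other` may be worth considering.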


In [None]:
import umap

# Fix the seed so the layout is reproducible, as in the t-SNE cell
reducer = umap.UMAP(random_state=42)
reduced_data_umap = reducer.fit_transform(high_dim_data)

plt.figure(figsize=(8, 8))
plt.scatter(reduced_data_umap[:, 0], reduced_data_umap[:, 1])
plt.title('2D UMAP of High-Dimensional Data')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()


In [None]:
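# Linear Discriminant Analysis is supervised: it needs a feature matrix `X`
# and target labels `y`, neither of which is defined in this notebook yet.
# n_components can be at most min(n_classes - 1, n_features).
#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

#lda = LDA(n_components=2)
#X_lda = lda.fit_transform(X, y)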
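
The "LDA" named in the criteria (alongside LDAvis for topics) most likely refers to Latent Dirichlet Allocation, a topic model, rather than the Linear Discriminant Analysis class shown in this cell. Below is a minimal sketch of the topic-model variant with scikit-learn, assuming bag-of-words counts as input. The toy corpus and `n_components=2` are illustrative choices, not values from the notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned titles.
toy_titles = [
    "python string append",
    "python string format padding",
    "pandas dataframe iterate rows",
    "pandas dataframe column rename",
]

# LDA works on raw term counts rather than TF-IDF weights.
counts = CountVectorizer().fit_transform(toy_titles)

# Two topics keep the example small.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # shape: (n_documents, n_topics)

print(doc_topics.shape)
print(doc_topics.sum(axis=1))            # each row is a distribution over topics
```

In the notebook, `count_vectorized` from the `CountVectorizer` cell would be the natural input in place of `counts`.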


In [None]:
from sklearn.decomposition import PCA

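# `high_dim_data` is the sparse TF-IDF matrix from the t-SNE cell. PCA in
# older scikit-learn releases rejects sparse input, so convert it to a
# dense array first (cheap at this size).
pca = PCA(n_components=2)
reduced_data_pca = pca.fit_transform(high_dim_data.toarray())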

plt.figure(figsize=(8, 8))
plt.scatter(reduced_data_pca[:, 0], reduced_data_pca[:, 1])
plt.title('2D PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
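
When analysing a PCA plot, it helps to know how much of the variance the two axes retain. The sketch below computes this from a centred SVD in NumPy on random toy data (an assumption for illustration). With scikit-learn, the same quantity is exposed as `pca.explained_variance_ratio_` after fitting.

```python
import numpy as np

def explained_variance_ratio(X, n_components=2):
    """Share of total variance captured by the leading principal components."""
    X_centred = X - X.mean(axis=0)
    # Squared singular values of the centred data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(X_centred, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

rng = np.random.default_rng(0)
# Toy data: 100 samples, 5 features, most variance on the first feature.
X = rng.normal(size=(100, 5)) * np.array([10.0, 3.0, 1.0, 1.0, 1.0])
ratios = explained_variance_ratio(X)
print(ratios, ratios.sum())
```

A low total for the first two components would suggest that the 2D plot hides much of the structure in the data, which is worth stating in the written analysis.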
