## Notebook de Feature Engineering ##

Les données textuelles étant nettoyées, traitées et explorées, il faut maintenant que je les rende exploitables pour une future modélisation.  
Problème, les modèles de Machine Learning ne prennent que des entrées numériques.   
L'objectif est d'extraire une valeur mathématique du texte, plus précisement d'obtenir une représentation vectorielle de chaque document du corpus.  
Pour cela, je vais utiliser deux familles de technique d'extraction de features:
1. Les techniques basées sur la fréquence des mots (*Bag of Words, TF-IDF*) 
2. Les techniques basées sur la sémantique des mots (techniques d'embeddings comme *Word2Vec*, *BERT* ou *USE*)

### Importation et fonctions ###


#### Environnement de travail ####

In [4]:
# Générique
import random

# Manipulation de données
import pandas as pd
import numpy as np

# NLP
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from transformers import AlbertTokenizer, AlbertModel
import torch
import tensorflow_hub as hub
import tensorflow as tf
from sentence_transformers import SentenceTransformer


# Project modules
from config.paths import DATA_DIR

#### Importation des données ###

In [5]:
data = pd.read_json(f'{DATA_DIR}/silver/processed_data.json').sample(10000)
raw_data = data

In [6]:
data.head()

Unnamed: 0,processed_title_tokens,processed_body_tokens,processed_tags
0,"[convert, decimal, double, c]","[want, assign, decimal, variable, trans, doubl...","c#,floating-point,type-conversion,double,decimal"
1,"[calculate, relative, time, c]","[given, specific, datetime, value, display, re...","c#,datetime,time,datediff,relative-time-span"
2,"[determine, user, timezone]","[standard, way, web, server, able, determine, ...","html,browser,timezone,user-agent,timezone-offset"
3,"[fastest, way, get, value, π]","[looking, fastest, way, obtain, value, π, pers...","performance,algorithm,language-agnostic,unix,pi"
4,"[use, c, socket, api, c, z, o]","[issue, getting, c, socket, api, work, properl...","c++,c,sockets,mainframe,zos"


#### Définition des fonctions ####

In [6]:
def frequency_unigram_bag_of_words(corpus):
    vectorizer = CountVectorizer()
    bow_matrix = vectorizer.fit_transform(corpus)
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    return bow_df

def frequency_unigram_tf_idf(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(corpus)
    tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
    return tfidf_df

def multi_label_binarizer(corpus, sep=' '):
    corpus_list = corpus.apply(lambda x: x.split(sep))
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(corpus_list)
    tags_binarized_df = pd.DataFrame(y, columns=mlb.classes_)
    return tags_binarized_df

def display_token_info(corpus):
    print(f'Le corpus contient {len(corpus)} tokens')
    unique_tokens = set(corpus.split())
    print(f"Le corpus contient {len(unique_tokens)} tokens uniques")
    print(f"Occurences moyennes par token: {len(corpus) / len(unique_tokens)}")

def inspect_non_null_matrix_values(matrix):
    column_names = matrix.columns
    column_name = random.choice(column_names)
    print("Colonne choisie:", column_name)
    non_zero_column = matrix[matrix[column_name] > 0]
    print(non_zero_column[[column_name]].head())

def get_document_vector(doc, model):
    vectors = [model.wv[token] for token in doc if token in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

def get_bert_embeddings(model, docs):
    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    document_embeddings = []

    for i, tokens in enumerate(docs):
        print(f"{i} embeddings générés sur {len(docs)}, {i/len(docs)*100}% effectués")

        tokens = {key: value.to(device) for key, value in tokens.items()}

        with torch.no_grad():
            outputs = model(**tokens)
            cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            document_embeddings.append(cls_embedding)
    
    return document_embeddings


### Techniques basées sur la fréquence ###

Les techniques basées sur la fréquence représentent mathématiquement un document textuel à travers le nombre d'occurences de ces éléments, bien souvent des mots individuels, par rapport au vocabulaire entier du corpus.

#### Bag of Words ####

Je vais d'abord vectoriser mon corpus via un Bag of Words, qui va compter la fréquence d'apparation de chaque mot du vocabulaire, dans chaque document.  
Pour ce faire, je combine d'abord les tokens de titre et de corps, pour former un seul corpus.

In [7]:
data['combined_text'] = data['processed_title_tokens'].apply(lambda x: ' '.join(x)) + ' ' + data['processed_body_tokens'].apply(lambda x: ' '.join(x))

In [8]:
data.head()

Unnamed: 0,processed_title_tokens,processed_body_tokens,processed_tags,combined_text
0,"[convert, decimal, double, c]","[want, assign, decimal, variable, trans, doubl...","c#,floating-point,type-conversion,double,decimal",convert decimal double c want assign decimal v...
1,"[calculate, relative, time, c]","[given, specific, datetime, value, display, re...","c#,datetime,time,datediff,relative-time-span",calculate relative time c given specific datet...
2,"[determine, user, timezone]","[standard, way, web, server, able, determine, ...","html,browser,timezone,user-agent,timezone-offset",determine user timezone standard way web serve...
3,"[fastest, way, get, value, π]","[looking, fastest, way, obtain, value, π, pers...","performance,algorithm,language-agnostic,unix,pi",fastest way get value π looking fastest way ob...
4,"[use, c, socket, api, c, z, o]","[issue, getting, c, socket, api, work, properl...","c++,c,sockets,mainframe,zos",use c socket api c z o issue getting c socket ...


Analysons ce nouveau corpus combiné:

In [9]:
display_token_info(' '.join(data.combined_text))

Le corpus contient 37110150 tokens
Le corpus contient 196190 tokens uniques
Occurences moyennes par token: 189.15413629644732


Je peux donc passer au Bag of Words, où je m'attends à une matrice contenant des vecteurs d'environ 196000 dimensions.

In [10]:
bow_matrix = frequency_unigram_bag_of_words(data.combined_text)

In [11]:
bow_matrix.head()

Unnamed: 0,00,000,0000,00000,000000,0000000,00000000,000000000,000000000000,0000000000000000,...,香川県,高知県,鳥取県,鹿児島県,麗安,龥а,한글,ｻｿ,ﾎｱﾎｻﾎｷﾎｼﾎｭﾏ,ﾎｺﾏ狐πｼﾎｵ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
inspect_non_null_matrix_values(bow_matrix)

Colonne choisie: nuthin
      nuthin
6414       1


Les vecteurs ont une très forte dimensionalité, et sont majoritairement creux (beaucoup de 0, peu de 1).

#### TF-IDF ####

Pour prendre en compte la rareté relative d'un mot et pondérer sa fréquence par son importance dans le document, TF-IDF peut être utilisé. 

In [13]:
tf_idf_matrix = frequency_unigram_tf_idf(data.combined_text)

In [14]:
tf_idf_matrix.sample(5)

Unnamed: 0,00,000,0000,00000,000000,0000000,00000000,000000000,000000000000,0000000000000000,...,香川県,高知県,鳥取県,鹿児島県,麗安,龥а,한글,ｻｿ,ﾎｱﾎｻﾎｷﾎｼﾎｭﾏ,ﾎｺﾏ狐πｼﾎｵ
41760,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29645,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31954,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23830,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
inspect_non_null_matrix_values(tf_idf_matrix)

Colonne choisie: advancedautocomplete
       advancedautocomplete
12368              0.153954


Malgré la pondération, la matrice du TF-IDF garde les mêmes caractéristiques, et inconvénients.

#### Multi-label Binarizer ####

Pour les tags, une approche fréquemment utilisée est de transformer la liste de labels en vecteurs binaires; 1 si le tag est présent dans le document, 0 si non.  
L'approche est semblable au Bag of Words, mais sera très utile lors de l'étape de modélisation et de classification supervisée.

In [16]:
mlb_matrix = multi_label_binarizer(data.processed_tags, sep=',')

In [17]:
mlb_matrix.sample(5)

Unnamed: 0,16-bit,24-bit,256color,2d,2d-3d-conversion,2d-games,2phase-commit,3-tier,3-way-merge,32-bit,...,zope.interface,zorba,zos,zpl,zpl-ii,zpt,zsh,zsh-completion,zsi,zsync
19987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13192,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7988,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20798,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
37602,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
inspect_non_null_matrix_values(mlb_matrix)

Colonne choisie: cdo.message
       cdo.message
1609             1
40833            1


In [19]:
mlb_matrix['python'].sum()

2684

### Techniques basées sur la sémantique ###

Les techniques basées sur la sémantique font la promesse de représenter les mots selon leur contexte, en rapprochant mathématiquement les mots sémantiquement proches, tout en gardant une dimensionnalité mesurées par rapport aux techniques fréquentistes. Les vecteurs résultants de ces techniques sont appelés *embeddings*.

#### Word2Vec ####

Word2Vec est une des premières techniques d'embedding qui a vu le jour.  
Il génère des embeddings en capturant les contextes des mots présents, et en rapprochant les mots aux contextes ressemblants.  
Après avoir formaté le corpus en fonction des besoins du modèle, je l'entraîne pour obtenir mes embeddings.

In [20]:
sentences = data['combined_text'].apply(lambda x: simple_preprocess(x)).tolist()
model_w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

In [21]:
model_w2v.wv['python']

array([ 2.4548879e-01, -5.4499847e-01,  3.5737474e+00,  7.4911016e-01,
        1.7693719e+00,  9.3174100e-01, -1.8137324e+00,  7.2189242e-01,
        1.0077080e+00, -9.2219323e-01,  5.7912534e-01, -1.9168587e+00,
       -1.3907452e+00, -6.8253934e-01,  2.1553979e+00, -4.8772052e-01,
        8.9186102e-01, -1.3156390e+00, -2.0365989e-01,  3.8702691e-01,
        2.2898646e-01,  3.7270589e+00,  2.6033237e+00,  2.6249492e+00,
        6.9601542e-01, -3.1900844e-01,  2.1758277e+00,  4.2216396e-01,
        1.5308858e+00,  2.0585020e+00,  2.6807017e+00, -2.5929253e+00,
        3.4810323e-01,  2.6474607e+00, -3.3339436e+00, -1.4467429e-01,
        1.1285890e+00,  5.1117235e-01, -8.5784709e-01,  2.7665594e-01,
        3.7365857e-01, -1.5771922e-01,  1.0731281e+00, -1.5863495e+00,
       -4.9431810e-01,  5.2018380e-01,  3.3274967e-02, -2.1991923e+00,
       -9.1936046e-01, -2.2869666e+00, -5.2085418e-01, -1.5742459e+00,
       -1.0868225e+00, -2.2610958e+00,  5.0426376e-01, -2.0774324e+00,
      

Word2Vec génère des embeddings individuels; j'utilise donc une technique d'agrégation pour obtenir un embedding par document.

In [22]:
document_vectors = [get_document_vector(doc, model_w2v) for doc in sentences]

In [23]:
len(document_vectors)

50000

In [24]:
document_vectors[0]

array([-4.2622167e-01,  2.7647746e-01,  5.1938003e-01,  1.2976211e-01,
       -5.2795641e-02, -1.1486957e+00, -5.9895384e-01,  1.2291791e+00,
        7.8579724e-01, -3.9071169e-01, -4.9083763e-01, -5.4303080e-01,
        6.2502436e-02, -4.6656066e-01,  9.9519950e-01, -1.3299716e-01,
       -5.2796316e-01, -7.3648864e-01,  9.2447102e-02, -3.0692586e-01,
        3.2516250e-01,  1.4903946e-01,  2.1990379e-02, -4.7518584e-01,
        4.9597815e-01,  4.1939402e-01, -8.4135807e-01,  2.4281338e-01,
       -5.0183588e-01,  3.8527343e-01,  9.2106491e-01,  1.2163569e-01,
       -3.5071781e-01,  6.7121655e-01, -7.0472580e-01, -1.3362548e-01,
        6.2825400e-01, -6.4591366e-01, -8.1976706e-01,  7.8207225e-01,
        4.5443037e-01,  9.1509473e-01, -1.7083572e-01, -6.5209574e-01,
       -6.0167901e-02, -6.3294250e-01,  2.6322955e-01, -5.6927526e-01,
        2.4256280e-02,  2.5795752e-01,  7.3007032e-02, -5.7895046e-01,
       -1.0459632e+00,  5.1859361e-01, -4.9308769e-02,  1.9598074e-01,
      

In [None]:
np.save(f'{DATA_DIR}/gold/embeddings_word2vec.npy', document_vectors)

J'obtiens bien les embeddings de chaque document ici.  
L'inconvénient principal de Word2Vec se situe dans son appréciation limitée du contexte des mots; un même mot aux multiples sens n'aura qu'un embedding, au lieu de un par sens différent.

#### BERT ####

BERT est un embedder pré-entraîné, basé sur l'architecture Transformer, ce qui le rend très puissant pour comprendre le contexte d'un mot en profondeur.  
Je vais donc obtenir des embeddings contextuels, basés sur les données qui ont servies à son entraînement, et adaptés au document fourni.

In [24]:
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertModel.from_pretrained('albert-base-v2')

if torch.cuda.is_available():
    model = model.to('cuda')



In [25]:
documents = data['combined_text'].tolist()
tokenized_docs = [tokenizer(doc, return_tensors='pt', truncation=True, padding=True) for doc in documents]

In [26]:
bert_embeddings = get_bert_embeddings(model, tokenized_docs)

50000

In [None]:
len(bert_embeddings)

In [None]:
bert_embeddings[0]

In [None]:
np.save(f'{DATA_DIR}/gold/embeddings_bert.npy', bert_embeddings)

#### USE ####

USE (Universal Sentence Encoder) est un autre embedder pré-entraîné, développé par Google.  
Il est plus efficace et moins gourmand en ressources que BERT, au compromis d'un contexte moins granulaire.

In [4]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

ValueError: Trying to load a model of incompatible/unknown type. '/var/folders/r2/ss_qsp851qbd7s57ljmw44480000gn/T/tfhub_modules/063d866c06683311b44b4992fd46003be952409c' contains neither 'saved_model.pb' nor 'saved_model.pbtxt'.

In [30]:
use_embeddings = embed(documents).numpy()

In [34]:
use_embeddings.shape

(50000, 512)

In [32]:
use_embeddings[0]

array([-0.04490682, -0.06579038, -0.00030356,  0.06503192,  0.05765333,
       -0.00393299, -0.02149728, -0.06094711, -0.03997018,  0.06284467,
       -0.00920186, -0.02412126, -0.05465421,  0.06364406,  0.00042385,
        0.06579082, -0.05108438,  0.00583293,  0.03009957, -0.06381157,
       -0.03322913, -0.05842859,  0.04841243, -0.01356049, -0.03274378,
        0.02555372, -0.03763048, -0.03119965, -0.05024819, -0.02886915,
       -0.00857334, -0.03321578, -0.03993635, -0.06399625, -0.06562254,
        0.03072463, -0.05723202, -0.05265672,  0.02365536,  0.04956209,
        0.03708434, -0.01723283,  0.04148487,  0.04844912,  0.06577364,
       -0.04618986, -0.06129446,  0.02435905, -0.05565225,  0.06529625,
       -0.06421264,  0.00464133, -0.04064857, -0.03056659,  0.03985272,
       -0.04536984, -0.05194784,  0.02306649, -0.05845733, -0.03359056,
       -0.04997613,  0.04903685,  0.05373524, -0.02376573, -0.06419381,
       -0.0608025 , -0.04520904,  0.0475619 ,  0.03352387,  0.00

In [37]:
np.save(f'{DATA_DIR}/gold/embeddings_use.npy', use_embeddings)

J'ai généré tous mes embeddings; je peux maintenant exporter la matrice de features complète.

In [39]:
data.to_json(f'{DATA_DIR}/gold/feature_matrix.json')