## Notebook de Feature Engineering ##

Les données textuelles étant nettoyées, traitées et explorées, il faut maintenant que je les rende exploitables pour une future modélisation.  
Problème, les modèles de Machine Learning ne prennent que des entrées numériques.   
L'objectif est d'extraire une valeur mathématique du texte, plus précisement d'obtenir une représentation vectorielle de chaque document du corpus.  
Pour cela, je vais utiliser deux familles de technique d'extraction de features:
1. Les techniques basées sur la fréquence des mots (*Bag of Words, TF-IDF*)
2. Les techniques basées sur la sémantique des mots (techniques d'embeddings comme *Word2Vec*, *BERT* ou *USE*)

### Importation et fonctions ###


#### Environnement de travail ####

In [221]:
# Générique
import random

# Manipulation de données
import pandas as pd
import numpy as np

# NLP
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from transformers import BertTokenizer, BertModel
import torch
import tensorflow_hub as hub


# Project modules
from config.paths import DATA_DIR

#### Importation des données ###

In [222]:
data = pd.read_json(f'{DATA_DIR}/silver/processed_data.json').sample(10000, random_state=42)
raw_data = data

In [223]:
data.head()

Unnamed: 0,processed_title_tokens,processed_body_tokens,processed_tags
33553,[],"[following, problem, someone, else, running, a...","glassfish,classpath,jax-ws,cxf,java-metro-fram..."
9427,"[way, get, h]","[application, uses, http, want, access, intern...","c++,windows,proxy,sdk,winhttp"
199,"[java, configuration, framework]","[process, values, java, library, wondering, fr...","java,xml,configuration,frameworks,configuratio..."
12447,"[instead, two, directly]","[see, first, faster, since, register, seem, si...","assembly,x86,nasm,accumulator,addressing-mode"
39489,"[high, load, parse, creating, errors]","[trying, access, problem, sometimes, parse, ab...","c#,xml,rest,streamreader,linq-to-xml"


#### Définition des fonctions ####

In [224]:
def frequency_unigram_bag_of_words(corpus):
    vectorizer = CountVectorizer()
    bow_matrix = vectorizer.fit_transform(corpus)
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    return bow_df

def frequency_unigram_tf_idf(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(corpus)
    tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
    return tfidf_df

def multi_label_binarizer(corpus, sep=' '):
    corpus_list = corpus.apply(lambda x: x.split(sep))
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(corpus_list)
    tags_binarized_df = pd.DataFrame(y, columns=mlb.classes_)
    return tags_binarized_df

def display_token_info(corpus):
    print(f'Le corpus contient {len(corpus)} tokens')
    unique_tokens = set(corpus.split())
    print(f"Le corpus contient {len(unique_tokens)} tokens uniques")
    print(f"Occurences moyennes par token: {len(corpus) / len(unique_tokens)}")

def inspect_non_null_matrix_values(matrix):
    column_names = matrix.columns
    column_name = random.choice(column_names)
    print("Colonne choisie:", column_name)
    non_zero_column = matrix[matrix[column_name] > 0]
    print(non_zero_column[[column_name]].head())

def get_document_vector(doc, model):
    vectors = [model.wv[token] for token in doc if token in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

def get_bert_embeddings(model, docs):
    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    document_embeddings = []

    for i, tokens in enumerate(docs):
        print(f"{i} embeddings générés sur {len(docs)}, {i/len(docs)*100}% effectués")

        tokens = {key: value.to(device) for key, value in tokens.items()}

        with torch.no_grad():
            outputs = model(**tokens)
            cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            document_embeddings.append(cls_embedding)

    return document_embeddings


### Préparation des données ###

Puisque la phase de modélisation sera réalisée avec des données d'entraînement et de test pour évaluer la performance des modèles, il est important de réaliser le split des données avant de génerer les embeddings, pour éviter de potentielles fuites de données et l'overfitting.  
Seuls les modèles pré-entraînés comme *BERT* et *USE* ne comportent pas de risque.

Je combine d'abord les tokens de titre et de corps, pour former un seul corpus.

In [225]:
data['combined_text'] = data['processed_title_tokens'].apply(lambda x: ' '.join(x)) + ' ' + data['processed_body_tokens'].apply(lambda x: ' '.join(x))

In [226]:
data.head()

Unnamed: 0,processed_title_tokens,processed_body_tokens,processed_tags,combined_text
33553,[],"[following, problem, someone, else, running, a...","glassfish,classpath,jax-ws,cxf,java-metro-fram...",following problem someone else running applic...
9427,"[way, get, h]","[application, uses, http, want, access, intern...","c++,windows,proxy,sdk,winhttp",way get h application uses http want access in...
199,"[java, configuration, framework]","[process, values, java, library, wondering, fr...","java,xml,configuration,frameworks,configuratio...",java configuration framework process values ja...
12447,"[instead, two, directly]","[see, first, faster, since, register, seem, si...","assembly,x86,nasm,accumulator,addressing-mode",instead two directly see first faster since re...
39489,"[high, load, parse, creating, errors]","[trying, access, problem, sometimes, parse, ab...","c#,xml,rest,streamreader,linq-to-xml",high load parse creating errors trying access ...


Analysons ce nouveau corpus combiné:

In [227]:
display_token_info(' '.join(data.combined_text))

Le corpus contient 3735885 tokens
Le corpus contient 1066 tokens uniques
Occurences moyennes par token: 3504.5825515947467


Je peux maintenant séparer mes données.

In [228]:
X = data.combined_text
y = data.processed_tags

In [229]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [230]:
X_train.to_csv(f'{DATA_DIR}/gold/x_train.csv')
X_test.to_csv(f'{DATA_DIR}/gold/x_test.csv')
y_train.to_csv(f'{DATA_DIR}/gold/y_train.csv')
y_test.to_csv(f'{DATA_DIR}/gold/y_test.csv')

### Techniques basées sur la fréquence ###

Les techniques basées sur la fréquence représentent mathématiquement un document textuel à travers le nombre d'occurences de ces éléments, bien souvent des mots individuels, par rapport au vocabulaire entier du corpus.

#### Bag of Words ####

In [231]:
bow_matrix_train = frequency_unigram_bag_of_words(X_train)
bow_matrix_test = frequency_unigram_bag_of_words(X_test)

In [232]:
bow_matrix_train.head()

Unnamed: 0,able,accept,access,accomplish,according,account,achieve,across,action,active,...,written,wrong,wrote,www,xaml,xml,xmlns,xp,yes,yet
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,1,0,0,0,0,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [233]:
inspect_non_null_matrix_values(bow_matrix_train)
inspect_non_null_matrix_values(bow_matrix_test)

Colonne choisie: namespace
     namespace
13           5
31           1
47           1
72           1
100          1
Colonne choisie: direction
     direction
39           1
260          1
279          1
341          1
352          1


Les vecteurs ont une très forte dimensionalité, et sont majoritairement creux (beaucoup de 0, peu de 1).

Vu la taille des matrices BoW, il est plus efficace de les regénérer lors de la modélisation que de les enregistrer sur le disque.

#### TF-IDF ####

Pour prendre en compte la rareté relative d'un mot et pondérer sa fréquence par son importance dans le document, TF-IDF peut être utilisé.

In [234]:
tf_idf_matrix_train = frequency_unigram_tf_idf(X_train)
tf_idf_matrix_test = frequency_unigram_tf_idf(X_test)

In [235]:
tf_idf_matrix_train.sample(5)

Unnamed: 0,able,accept,access,accomplish,according,account,achieve,across,action,active,...,written,wrong,wrote,www,xaml,xml,xmlns,xp,yes,yet
762,0.0,0.0,0.256549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
291,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2751,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [236]:
inspect_non_null_matrix_values(tf_idf_matrix_train)
inspect_non_null_matrix_values(tf_idf_matrix_test)

Colonne choisie: equivalent
     equivalent
47     0.328038
181    0.105157
274    0.129607
335    0.058054
348    0.147460
Colonne choisie: found
       found
6   0.056284
8   0.109623
11  0.089815
22  0.118922
23  0.076740


Malgré la pondération, la matrice du TF-IDF garde les mêmes caractéristiques, et inconvénients.

Vu la taille des matrices TF-IDF, il est plus efficace de les regénérer lors de la modélisation que de les enregistrer sur le disque.

#### Multi-label Binarizer ####

Pour les tags, une approche fréquemment utilisée est de transformer la liste de labels en vecteurs binaires; 1 si le tag est présent dans le document, 0 si non.  
L'approche est semblable au Bag of Words, mais sera très utile lors de l'étape de modélisation et de classification supervisée.

In [237]:
mlb_matrix_train = multi_label_binarizer(y_train, sep=',')
mlb_matrix_test = multi_label_binarizer(y_test, sep=',')

In [238]:
mlb_matrix_train.sample(5)

Unnamed: 0,16-bit,24-bit,2d,2d-3d-conversion,2d-games,32-bit,32bit-64bit,3d,3d-engine,3dsmax,...,zigbee,zip,zipcode,zlib,zodb,zooming,zope,zos,zpl-ii,zsh
2239,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3608,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3533,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4390,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5174,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [239]:
inspect_non_null_matrix_values(mlb_matrix_train)
inspect_non_null_matrix_values(mlb_matrix_test)

Colonne choisie: image-resizing
      image-resizing
5430               1
Colonne choisie: speex
      speex
1242      1


In [240]:
mlb_matrix_train['python'].sum()

404

Vu la taille des matrices MLB, il est plus efficace de les regénérer lors de la modélisation que de les enregistrer sur le disque.

### Techniques basées sur la sémantique ###

Les techniques basées sur la sémantique font la promesse de représenter les mots selon leur contexte, en rapprochant mathématiquement les mots sémantiquement proches, tout en gardant une dimensionnalité mesurées par rapport aux techniques fréquentistes. Les vecteurs résultants de ces techniques sont appelés *embeddings*.

#### Word2Vec ####

Word2Vec est une des premières techniques d'embedding qui a vu le jour.  
Il génère des embeddings en capturant les contextes des mots présents, et en rapprochant les mots aux contextes ressemblants.  
Après avoir formaté le corpus en fonction des besoins du modèle, je l'entraîne pour obtenir mes embeddings.

In [241]:
sentences = X_train.apply(lambda x: simple_preprocess(x)).tolist()
model_w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

In [242]:
model_w2v.wv['python']

array([-7.3888284e-01, -2.6823559e-01, -2.7656135e-01, -6.2002796e-01,
        1.3953106e+00, -2.8750449e-01,  4.1973820e-01, -3.4023654e-01,
       -7.3049700e-01,  8.5858488e-01, -5.1141310e-01, -5.7205546e-01,
       -2.5969136e-01,  2.1545164e-01, -3.9454076e-02,  5.8829822e-02,
        3.7618980e-01, -5.7778698e-01, -1.6280451e-01,  7.5852849e-02,
        8.4530145e-02,  3.2980472e-01,  5.5893576e-01, -1.1185496e+00,
        6.8130606e-01, -1.3107446e-01,  2.1685879e-01,  2.0051165e-01,
       -1.5103164e-01, -2.7555305e-01,  9.6739158e-03, -4.5091778e-01,
       -5.0024599e-01, -1.5688300e-01, -2.5187525e-01,  3.7546805e-01,
       -1.4072478e+00,  4.0402254e-01, -7.6856804e-01, -8.7406045e-01,
       -6.8202168e-01,  3.1363800e-01,  5.3694475e-01,  6.7929879e-02,
        9.3399477e-01,  4.3450254e-01, -2.1891499e-01, -1.0374304e-03,
       -5.1109934e-01, -6.6454571e-01, -3.3985298e-02, -1.2507688e+00,
       -3.0737284e-01, -1.9733450e-01, -1.0384946e+00,  3.7000751e-01,
      

Word2Vec génère des embeddings individuels; j'utilise donc une technique d'agrégation pour obtenir un embedding par document.

In [243]:
document_vectors = [get_document_vector(doc, model_w2v) for doc in sentences]

In [244]:
len(document_vectors)

8000

In [245]:
document_vectors[0]

array([-0.38528138,  0.42924562,  0.00812935,  0.14093404,  0.27120703,
       -0.3692592 ,  0.19311023,  0.43646142, -0.2203794 ,  0.02098007,
       -0.16667688, -0.04881778, -0.09998286,  0.28202903,  0.10962275,
       -0.10154137,  0.04147709, -0.21877539, -0.23100865, -0.28658992,
       -0.07408542,  0.30683455, -0.28319526, -0.19478424,  0.1856493 ,
       -0.34556034, -0.46153903, -0.16264074, -0.13519461, -0.00652798,
        0.31922004,  0.23193571,  0.0821247 , -0.05971844, -0.17588572,
        0.27147436, -0.01353427,  0.00733893,  0.01213345, -0.15598193,
        0.3530207 , -0.16683549, -0.08432421, -0.18202911,  0.09026979,
        0.07227368, -0.24481387,  0.15793084,  0.22435562, -0.08442316,
        0.03567999, -0.3084324 , -0.22738582,  0.07932704, -0.02229683,
        0.43055663,  0.01113515,  0.10044792, -0.04085264,  0.03259767,
       -0.10687108,  0.18109202,  0.14277211, -0.26142734, -0.15814063,
       -0.07950474,  0.3001827 , -0.37898818, -0.3443628 ,  0.02

In [246]:
np.save(f'{DATA_DIR}/gold/embeddings_word2vec_train.npy', document_vectors)

Et la même chose pour les données de test:

In [247]:
sentences = X_test.apply(lambda x: simple_preprocess(x)).tolist()
model_w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

In [248]:
document_vectors = [get_document_vector(doc, model_w2v) for doc in sentences]

In [249]:
len(document_vectors)

2000

In [250]:
np.save(f'{DATA_DIR}/gold/embeddings_word2vec_test.npy', document_vectors)

J'obtiens bien les embeddings de chaque document ici.  
L'inconvénient principal de Word2Vec se situe dans son appréciation limitée du contexte des mots; un même mot aux multiples sens n'aura qu'un embedding, au lieu de un par sens différent.

#### BERT ####

BERT est un embedder pré-entraîné, basé sur l'architecture Transformer, ce qui le rend très puissant pour comprendre le contexte d'un mot en profondeur.  
Je vais donc obtenir des embeddings contextuels, basés sur les données qui ont servies à son entraînement, et adaptés au document fourni.

Ici, il n'y a pas de risque de fuite de données, car les embeddings de BERT sont basés sur les données d'entraînement qu'il a reçu, et non sur le corpus de documents.  
Je peux donc générer tous les embeddings en même temps pour être plus efficace.

In [251]:
X = pd.concat([X_train, X_test], ignore_index=True)

In [252]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

if torch.cuda.is_available():
    model = model.to('cuda')

In [253]:
documents = X.tolist()
tokenized_docs = [tokenizer(doc, return_tensors='pt', truncation=True, padding=True) for doc in documents]

In [254]:
bert_embeddings = get_bert_embeddings(model, tokenized_docs)

0 embeddings générés sur 10000, 0.0% effectués
1 embeddings générés sur 10000, 0.01% effectués
2 embeddings générés sur 10000, 0.02% effectués
3 embeddings générés sur 10000, 0.03% effectués
4 embeddings générés sur 10000, 0.04% effectués
5 embeddings générés sur 10000, 0.05% effectués
6 embeddings générés sur 10000, 0.06% effectués
7 embeddings générés sur 10000, 0.06999999999999999% effectués
8 embeddings générés sur 10000, 0.08% effectués
9 embeddings générés sur 10000, 0.09% effectués
10 embeddings générés sur 10000, 0.1% effectués
11 embeddings générés sur 10000, 0.11% effectués
12 embeddings générés sur 10000, 0.12% effectués
13 embeddings générés sur 10000, 0.13% effectués
14 embeddings générés sur 10000, 0.13999999999999999% effectués
15 embeddings générés sur 10000, 0.15% effectués
16 embeddings générés sur 10000, 0.16% effectués
17 embeddings générés sur 10000, 0.16999999999999998% effectués
18 embeddings générés sur 10000, 0.18% effectués
19 embeddings générés sur 10000, 0.1

In [255]:
len(bert_embeddings)

10000

In [256]:
bert_embeddings[0]

array([[-1.76758006e-01, -2.98788063e-02, -2.28762060e-01,
         1.33168146e-01, -2.40332380e-01, -2.03285381e-01,
         3.79662752e-01,  1.72479659e-01, -5.57469249e-01,
        -3.94626744e-02, -5.37935436e-01,  2.05345705e-01,
        -9.79465768e-02, -2.58598570e-02,  1.16248727e-02,
        -2.59797322e-03, -1.89464718e-01,  2.55033046e-01,
        -3.67105268e-02,  7.11916834e-02,  6.63179830e-02,
        -5.64715862e-01, -4.22118247e-01, -7.38211334e-01,
         3.71324420e-01,  3.70861962e-02, -5.16646020e-02,
        -2.14907050e-01, -2.04250291e-01,  1.08965978e-01,
        -2.54747957e-01,  1.26145169e-01,  5.29107489e-02,
        -3.89305323e-01,  2.83152163e-01, -5.29168367e-01,
         8.65663648e-01, -1.24159105e-01,  5.76573193e-01,
        -1.27737626e-01,  1.79162323e-01, -5.56504242e-02,
         6.68562412e-01,  7.07482025e-02,  3.10712665e-01,
         2.58386493e-01, -2.60688138e+00, -2.97698289e-01,
        -8.68685395e-02, -2.84966737e-01, -3.30708176e-0

In [257]:
X_train

2141     python difference difference class child def s...
46172    sql server issue query declare int insert valu...
18558    interface know view property called view howev...
32956    wpf allow user images method within control wp...
13094    database user linq asp net mvc app must write ...
                               ...                        
49939    two trying two able effect object fill far pre...
6420     create shared object linux gcc create shared o...
48421    font working client site css font font src url...
26037    c iphone set object references nil developing ...
31247    draw send binary text file http post request s...
Name: combined_text, Length: 8000, dtype: object

In [258]:
bert_embeddings_train = bert_embeddings[:len(X_train)]
bert_embeddings_test = bert_embeddings[len(X_train):]

In [259]:
np.save(f'{DATA_DIR}/gold/embeddings_bert_train.npy', bert_embeddings_train)
np.save(f'{DATA_DIR}/gold/embeddings_bert_test.npy', bert_embeddings_test)

In [260]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True).cuda()

Some weights of the model checkpoint at dunzhang/stella_en_400M_v5 were not used when initializing NewModel: ['new.pooler.dense.bias', 'new.pooler.dense.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [261]:
documents = X.tolist()

In [262]:
len(documents)

10000

In [263]:
stella_embeddings = model.encode(documents)


In [264]:
len(stella_embeddings)

10000

In [265]:
stella_embeddings

array([[-0.23159392, -0.04278202, -0.56382173, ..., -0.62945193,
        -0.5078363 , -1.1755458 ],
       [ 0.24233355,  0.11380592, -1.7524884 , ..., -1.3706971 ,
        -0.03118822, -0.17352524],
       [ 0.3918854 ,  0.19225791, -2.0567582 , ..., -0.5557479 ,
        -0.4804774 , -0.7205195 ],
       ...,
       [-0.5420255 ,  0.24317196, -1.2618406 , ..., -0.19470531,
         0.04484007, -0.37410057],
       [ 0.02674921, -0.13940753, -1.4761802 , ..., -0.13243905,
        -0.6358518 , -0.74601567],
       [-0.06496362,  0.6788909 , -1.3826499 , ..., -0.52944666,
        -0.98358214, -0.5240142 ]], dtype=float32)

In [266]:
stella_embeddings_train = stella_embeddings[:len(X_train)]
stella_embeddings_test = stella_embeddings[len(X_train):]

In [267]:
np.save(f'{DATA_DIR}/gold/embeddings_stella_train.npy', stella_embeddings_train)
np.save(f'{DATA_DIR}/gold/embeddings_stella_test.npy', stella_embeddings_test)

#### USE ####

USE (Universal Sentence Encoder) est un autre embedder pré-entraîné, développé par Google.  
Il est plus efficace et moins gourmand en ressources que BERT, au compromis d'un contexte moins granulaire.

In [268]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [269]:
documents = X.tolist()

In [270]:
use_embeddings = embed(documents).numpy()

In [271]:
use_embeddings.shape

(10000, 512)

In [272]:
use_embeddings[0]

array([ 1.79275777e-02, -4.32720818e-02, -6.36002794e-02,  7.92602170e-03,
       -4.94121015e-02,  6.98096752e-02,  4.84647453e-02, -2.50613317e-04,
       -1.14597511e-02,  6.94168806e-02,  4.49978895e-02,  6.53874427e-02,
        6.01173230e-02,  6.01881780e-02,  1.11237625e-02,  6.98550865e-02,
       -6.78726584e-02,  1.55871911e-02,  2.53479686e-02, -6.85332790e-02,
        1.46965105e-02,  4.39925529e-02, -5.26262820e-02, -1.09589926e-03,
        1.24888318e-02, -6.34953305e-02, -4.27166931e-02, -1.02059962e-02,
        6.65891692e-02,  2.68530250e-02,  6.17833398e-02,  3.84953246e-02,
       -5.56277670e-02,  1.76625084e-02,  5.85132055e-02, -3.39230001e-02,
       -3.78989801e-02, -3.81119587e-02, -1.76813118e-02, -2.80282553e-03,
        6.56576082e-02, -8.48801248e-03,  2.33982727e-02, -7.02778855e-03,
        6.98441714e-02, -5.60104810e-02, -5.48693687e-02, -4.38087657e-02,
        6.77721649e-02, -6.75557479e-02, -5.23977738e-04,  6.61484301e-02,
       -3.48508842e-02,  

In [273]:
use_embeddings_train = use_embeddings[:len(X_train)]
use_embeddings_test = use_embeddings[len(X_train):]

In [274]:
np.save(f'{DATA_DIR}/gold/embeddings_use_train.npy', use_embeddings_train)
np.save(f'{DATA_DIR}/gold/embeddings_use_test.npy', use_embeddings_test)