<a href="https://colab.research.google.com/github/fpgmina/DeepNLP/blob/main/L2_Word_and_Sentence_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Giuseppe Gallipoli

**Credits:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings are able to map words into a semantic-aware vector space.
- There are multiple architectures for the generation of word embeddings.
- Each architecture has its advantages and disadvantages.
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream task).
- It is possible to use a pre-trained word embedding model or to train one from scratch on a large amount of text.
- Using pre-trained word embedding models is common practice in NLP and removes the need to train a word embedding model from scratch (which can be very time-consuming and computationally expensive).

Gensim is a Python library for natural language processing (NLP) that specializes in unsupervised learning of word and document representations, particularly for working with large text corpora efficiently.

It’s widely used for topic modeling, semantic similarity, and word embeddings.

### **Question 1**

Train a new Word2Vec model using gensim with the text8 corpus available in the Python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Compute the training time for the model and store it for subsequent steps.

**Hint:** you can use the following code to load the text8 corpus:

```python
import gensim.downloader as api
from gensim.models import Word2Vec
import time
dataset = api.load("text8")
```

In [1]:
! pip install --upgrade gensim



In [2]:
import gensim.downloader as api
from gensim.models import Word2Vec
import time

# --- Load the text8 corpus ---
print("Loading text8 dataset...")
dataset = api.load("text8")  # Iterable of token lists (chunks of the corpus rather than true sentences)
data = list(dataset)         # Convert to list for multiple passes
print(f"Number of sentences: {len(data)}")

# --- Define model parameters ---
params = {
    "vector_size": 100,  # Dimensionality of word embeddings
    "window": 5,         # Context window size
    "min_count": 5,      # Ignore words with total frequency lower than this
    "workers": 4,        # Number of CPU threads
    "sg": 0              # 0 = CBOW, 1 = Skip-gram
}

Loading text8 dataset...
Number of sentences: 1701


In [27]:
data[0][:20]

['anarchism',
 'originated',
 'as',
 'a',
 'term',
 'of',
 'abuse',
 'first',
 'used',
 'against',
 'early',
 'working',
 'class',
 'radicals',
 'including',
 'the',
 'diggers',
 'of',
 'the',
 'english']

In [6]:
# --- Train the model and measure time ---
start_time = time.time()
model = Word2Vec(sentences=data, **params)
training_time = time.time() - start_time

print(f"\n✅ Training completed in {training_time:.2f} seconds.")

# --- Store model and training time ---
model.save("word2vec_text8.model")
with open("training_time.txt", "w") as f:
    f.write(f"{training_time:.2f} seconds")

print("\nModel and training time saved successfully!")


✅ Training completed in 158.10 seconds.

Model and training time saved successfully!


### **Question 2**
Perform **intrinsic** evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv).

1. read CSV file
2. group analogy entries by type (column: `type`)
3. for each type of entry (**in the lab, just set type="family"** to reduce the required time) use the first 3 word vectors to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v`
    - Get the most similar vectors to `res_v`
    - Compute in how many cases the correct word is among the top k (i.e., whether `Iraq` is among the k words most similar to `res_v`) with `k = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top k and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
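
The metric above can be sketched in a few lines of plain Python. This is only a minimal illustration: it assumes each entry has already been reduced to a ranked list of predicted words plus its target, and the helper name `top_k_accuracy` and the toy data are hypothetical.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of entries whose target appears among the first k predictions."""
    if not targets:
        return 0.0
    hits = sum(
        1
        for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Toy data (made up for illustration): one ranked list per analogy entry.
ranked_predictions = [
    ["iraq", "iran", "syria"],        # target found at rank 1
    ["bangkok", "thailand", "laos"],  # target found at rank 2
    ["tokyo", "japan", "korea"],      # target not found
]
targets = ["iraq", "thailand", "china"]

for k in (1, 3):
    print(f"top({k}) = {top_k_accuracy(ranked_predictions, targets, k):.3f}")
```

With these three toy entries, one target is found at rank 1 and a second one within the top 3, so the two values are 1/3 and 2/3.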

**Notes:**
1. Try the model trained on `text8`. Is there any issue? If so, how can you solve it?
2. Test the model trained on Google News available in gensim.
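
One likely issue with the `text8` model is vocabulary coverage. The corpus is entirely lowercase, so capitalised entries such as `Athens` cannot be found, and rare words may have been dropped by `min_count`. The sketch below separates the two cases; it uses a hypothetical toy vocabulary in place of the trained model's vocabulary, and the helper names are illustrative.

```python
# Hypothetical toy vocabulary standing in for the trained model's vocabulary.
vocab = {"athens", "greece", "baghdad", "iraq", "boy", "girl"}

entries = [
    ("Athens", "Greece", "Baghdad", "Iraq"),
    ("boy", "girl", "brother", "sister"),
]


def in_vocab(words, vocab):
    """True only if every word of the entry is known to the model."""
    return all(w in vocab for w in words)


def normalize(words):
    """Lowercase every word so it can match a lowercase-only corpus."""
    return tuple(w.lower() for w in words)


covered_raw = [in_vocab(e, vocab) for e in entries]
covered_lower = [in_vocab(normalize(e), vocab) for e in entries]
print("as written:", covered_raw)
print("lowercased:", covered_lower)
```

Lowercasing recovers the first entry, while the second stays uncovered because some of its words are missing from the toy vocabulary altogether. Entries of that kind can only be skipped, or handled by a model that builds vectors for unseen words.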

In [7]:
import pandas as pd

url = "https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv"
analogies = pd.read_csv(url)

print("Columns:", analogies.columns)
print("Unique types:", analogies["type"].unique())
analogies.head()

Columns: Index(['Unnamed: 0', 'type', 'word1', 'word2', 'word3', 'target'], dtype='object')
Unique types: ['capital-common-countries' 'capital-world' 'currency' 'city-in-state'
 'family' 'gram1-adjective-to-adverb' 'gram2-opposite' 'gram3-comparative'
 'gram4-superlative' 'gram5-present-participle'
 'gram6-nationality-adjective' 'gram7-past-tense' 'gram8-plural'
 'gram9-plural-verbs']


Unnamed: 0.1,Unnamed: 0,type,word1,word2,word3,target
0,0,capital-common-countries,Athens,Greece,Baghdad,Iraq
1,1,capital-common-countries,Athens,Greece,Bangkok,Thailand
2,2,capital-common-countries,Athens,Greece,Beijing,China
3,3,capital-common-countries,Athens,Greece,Berlin,Germany
4,4,capital-common-countries,Athens,Greece,Bern,Switzerland


In [8]:
# To reduce computation time, we’ll start with "family" relations such as “boy : girl :: father : mother”.
subset = analogies[analogies["type"] == "family"]
print(f"Number of family analogies: {len(subset)}")

Number of family analogies: 506


In [10]:
subset.head()

Unnamed: 0.1,Unnamed: 0,type,word1,word2,word3,target
8363,8363,family,boy,girl,brother,sister
8364,8364,family,boy,girl,brothers,sisters
8365,8365,family,boy,girl,dad,mom
8366,8366,family,boy,girl,father,mother
8367,8367,family,boy,girl,grandfather,grandmother


In [11]:
# Let's load our trained model
from gensim.models import Word2Vec
model = Word2Vec.load("word2vec_text8.model")

In [24]:
from gensim.models import KeyedVectors

# --------------------------------------------------------
# Evaluate word analogies (works for Word2Vec or KeyedVectors)
# --------------------------------------------------------
def evaluate_analogies(model: Word2Vec | KeyedVectors, df: pd.DataFrame, top_ks: tuple = (1, 3, 5, 10)):
    """
    Evaluate word analogies of the form:
        word1 : word2 :: word3 : target
    using vector arithmetic and similarity ranking.

    This function automatically detects if 'model' is:
    - Word2Vec object (use model.wv for vectors)
    - KeyedVectors object (use model directly)

    Parameters
    ----------
    model : gensim.models.Word2Vec or gensim.models.KeyedVectors
        The word embeddings to use for evaluation.
    df : pandas.DataFrame
        DataFrame with columns: word1, word2, word3, target
    top_ks : sequence of int
        The top-k values to check accuracy for (default (1, 3, 5, 10))

    Returns
    -------
    dict
        Dictionary mapping each top-k to accuracy, computed over the valid
        analogies only (entries with out-of-vocabulary words are skipped)

    NB: a model loaded with gensim.downloader (e.g. w2v_google_news_model) is already
    a KeyedVectors object, not a full Word2Vec model. KeyedVectors has no .wv
    attribute, because it is itself the set of word vectors.
    """
    # --------------------------------------------------------
    # Detect if the model is a full Word2Vec model or KeyedVectors
    # --------------------------------------------------------
    if hasattr(model, "wv"):  # Word2Vec object
        vectors = model.wv  #  wv stands for word-vectors
    else:  # KeyedVectors object
        vectors = model  # see NB above

    # Initialize result counters for each top-k
    results = {k: 0 for k in top_ks}
    total = 0  # number of valid analogies evaluated

    # --------------------------------------------------------
    # Loop through all analogies in the DataFrame
    # --------------------------------------------------------
    for _, row in df.iterrows():
        w1, w2, w3, target = row["word1"], row["word2"], row["word3"], row["target"]  # str

        # Skip analogy if any of the words are not in the vocabulary
        if not all(w in vectors.key_to_index for w in [w1, w2, w3, target]):
            continue

        # compute the analogy vector
        res_v = vectors[w2] - vectors[w1] + vectors[w3] ## numpy array

        # finds the words whose vectors are most similar to the vector res_v.
        sims = vectors.similar_by_vector(res_v, topn=max(top_ks))

        # Extract only the words (ignore similarity scores)
        predicted_words = [word for word, _ in sims]

        # --------------------------------------------------------
        # Check if the true target word (w4) is in the top-k predictions
        # --------------------------------------------------------
        for k in top_ks:
            results[k] += 1 if target in predicted_words[:k] else 0

        total += 1

    # --------------------------------------------------------
    # Compute final accuracy as fraction of correct predictions
    # --------------------------------------------------------
    accuracies = {k: results[k] / total if total > 0 else 0 for k in top_ks}
    print(f"Evaluated {total} valid analogies.")
    return accuracies


In [25]:
scores = evaluate_analogies(model, subset)
print("\nResults for text8 model:")
for k, v in scores.items():
    print(f"Top-{k}: {v:.3f}")


Evaluated 420 valid analogies.

Results for text8 model:
Top-1: 0.069
Top-3: 0.650
Top-5: 0.717
Top-10: 0.767


In [19]:
# Executing this cell could take ~5 minutes
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')  # 1.6GB object



In [26]:
scores_google = evaluate_analogies(w2v_google_news_model, subset)
print("\nResults for Google News model:")
for k, v in scores_google.items():
    print(f"Top-{k}: {v:.3f}")

Evaluated 506 valid analogies.

Results for Google News model:
Top-1: 0.350
Top-3: 0.879
Top-5: 0.937
Top-10: 0.972
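
For both models, Top-1 is far lower than Top-3. A plausible explanation is that the nearest neighbours of `res_v` often include the query words themselves (typically `word3`), which `similar_by_vector` does not filter out. The sketch below illustrates the effect with hand-made 3-d vectors; the vectors and the helper `rank_by_cosine` are assumptions for illustration, not output of a trained model.

```python
import numpy as np

# Hand-made 3-d vectors, purely for illustration (not from a trained model).
toy = {
    "boy":    np.array([1.0, 0.0, 0.0]),
    "girl":   np.array([1.0, 0.5, 0.0]),
    "father": np.array([0.0, 0.0, 2.0]),
    "mother": np.array([0.3, 1.5, 2.0]),
    "table":  np.array([2.0, 0.0, 0.2]),
}


def rank_by_cosine(query, vectors, exclude=()):
    """Return words sorted by decreasing cosine similarity to `query`."""
    scores = {}
    for word, vec in vectors.items():
        if word in exclude:
            continue
        scores[word] = float(
            np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        )
    return sorted(scores, key=scores.get, reverse=True)


w1, w2, w3, target = "boy", "girl", "father", "mother"
res_v = toy[w2] - toy[w1] + toy[w3]

raw = rank_by_cosine(res_v, toy)
filtered = rank_by_cosine(res_v, toy, exclude={w1, w2, w3})
print("without exclusion:", raw[0])
print("with exclusion:   ", filtered[0])
```

In this toy setting the unfiltered top hit is `word3` itself, and the target only reaches rank 1 once the three query words are excluded. In gensim, `KeyedVectors.most_similar(positive=[w2, w3], negative=[w1], topn=k)` performs a comparable query and leaves the input words out of its results. It operates on unit-normalised vectors, so rankings may differ slightly from the raw arithmetic used here.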


### **Question 3**

Train a new FastText model using gensim with text8 corpus available in the Python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Compute the training time for the model and store it for subsequent steps.

- Is there any significant difference in training time compared with Word2Vec training?

In [None]:
# your code here

### **Question 4**
Provide the same evaluation done in Question 2 for the FastText model. In this case, you can use the same type of analogy (family) and the same k values.

**Notes:**
- Try the model trained on `text8`. Is there any issue? What does it mean?
- Test the model trained on Wikipedia+News available in gensim.
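
When comparing with Word2Vec, it helps to remember that FastText represents a word through its character n-grams, which lets it build a vector even for a word never seen in training. The sketch below shows only the n-gram extraction step. It assumes the `<` and `>` boundary markers and n-gram lengths 3 to 6 as defaults; the function name `char_ngrams` is hypothetical, and the real implementation additionally hashes n-grams into buckets and combines their vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("mom", min_n=3, max_n=4))

# Two related word forms share many n-grams, which is what lets a
# subword model relate an unseen form to a seen one.
shared = set(char_ngrams("grandmother")) & set(char_ngrams("grandmothers"))
print(len(shared) > 0)
```

Because of this, a vocabulary check that skips unknown words may behave differently for a FastText model than for Word2Vec, which is worth keeping in mind when interpreting the number of valid analogies.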

In [None]:
# Executing this cell could take ~5 minutes
import gensim.downloader
ft_wiki_news_model = gensim.downloader.load('fasttext-wiki-news-subwords-300')

In [None]:
# your code here

### **Question 5** (optional)

Provide a complete evaluation of the best performing models (Word2Vec and FastText) by leveraging the complete dataset of analogy entries. In this case, you should use all the analogy types, and you can use the same k values provided in Question 2.

In [None]:
# your code here

## Sentence Embeddings

Sentence embeddings represent a sentence as a vector. The vector space is usually learned from a large corpus of text. They are used in many NLP tasks, such as text classification, text similarity, and question answering. In this practice, we will work with both Doc2Vec and InferSent models.

**Key takeaways** from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework.
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph.
- Vectors for the training documents are pre-computed; however, you can infer vectors for new documents.
- InferSent exploits a deep learning architecture to learn sentence representations in a supervised way.
- InferSent can use either Word2Vec or FastText as its word embedding model.

### **Question 6**

Train a new Doc2Vec model using the [APIs provided by gensim](https://radimrehurek.com/gensim/models/doc2vec.html) with the text8 corpus.

- What is the training time for the model? Is it comparable with the Word2Vec and FastText training times?

**Note:** Store the model to a file for subsequent steps.

In [41]:
import gensim.downloader as api
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import time

# --- Load the text8 corpus ---
print("Loading text8 dataset...")
dataset = api.load("text8")  # Iterable of token lists (chunks of the corpus rather than true sentences)
data = list(dataset)         # Convert to list for multiple passes
print(f"Number of sentences: {len(data)}")

# --- Define model parameters ---
params = {
    "vector_size": 100,  # Dimensionality of document (and word) vectors
    "window": 5,         # Context window size
    "min_count": 5,      # Ignore words with total frequency lower than this
    "workers": 4,        # Number of CPU threads
    "dm": 1              # 1=PV-DM (Distributed Memory), 0=PV-DBOW (Distributed Bag of Words)
}

Loading text8 dataset...
Number of sentences: 1701


In [38]:
# Each sentence is treated as a "document" and tagged with a unique ID
tagged_data = [TaggedDocument(words=sentence, tags=[str(i)]) for i, sentence in enumerate(data)]

In [42]:
# --- Train the model and measure time ---
start_time = time.time()
doc2vec_model = Doc2Vec(tagged_data, **params)
training_time = time.time() - start_time

print(f"\n✅ Training completed in {training_time:.2f} seconds.")

# --- Store model and training time ---
doc2vec_model.save("doc2vec_text8.model")
with open("doc2vec_training_time.txt", "w") as f:  # separate file, so the Word2Vec time is not overwritten
    f.write(f"{training_time:.2f} seconds")

print("\nModel and training time saved successfully!")


✅ Training completed in 315.78 seconds.

Model and training time saved successfully!


### **Question 7 (Doc2Vec qualitative evaluation)**
Perform some **qualitative** experiments by computing the cosine similarities between sentences you write yourself.
For example, you can use the following sentences:

```python
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"
```

Please try to interact with the model by providing different sentences and check the results. Is the model able to capture the semantic meaning of the sentences? Are you satisfied with the results?

In [44]:
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"

sentences = [s1, s2, s3, s4, "I love eating pizza"]

# Tokenize sentences
tokenized_sentences = [s.lower().split() for s in sentences]



In [47]:
tokenized_sentences[0]

['the', 'president', 'of', 'the', 'united', 'states', 'is', 'donald', 'trump']

In [49]:
# Compute embeddings using infer_vector
sentence_vectors = [doc2vec_model.infer_vector(tokens) for tokens in tokenized_sentences]  # tokens is a list of strings

In [55]:
type(sentence_vectors), len(sentence_vectors), sentence_vectors[0]

(list,
 5,
 array([-0.07011464, -0.02384174, -0.01650893, -0.01788367, -0.01315612,
        -0.08039293,  0.01192861,  0.04928732,  0.04888473,  0.04247653,
        -0.01953345, -0.07276103,  0.00525262, -0.00095218,  0.00383172,
        -0.17174438, -0.00244455,  0.04429134,  0.02908824, -0.01924216,
        -0.02298632, -0.0171893 , -0.02812737,  0.00239561, -0.02214707,
         0.04351545, -0.02495292,  0.02746939,  0.06938585, -0.023519  ,
         0.01564159, -0.0225121 , -0.05348649,  0.01414427, -0.05161237,
         0.02597961,  0.02302339,  0.05740812, -0.04849053, -0.10878731,
         0.03447248,  0.01261412,  0.0489866 , -0.06500457, -0.05454819,
         0.01369612,  0.05463054,  0.0590937 ,  0.00231931, -0.05295352,
         0.00482217, -0.0219288 ,  0.0222112 , -0.06215452, -0.00994387,
         0.07463907,  0.0242512 ,  0.06850968, -0.04577455, -0.04118698,
        -0.01237703,  0.09592188,  0.03604743, -0.0523212 ,  0.0571046 ,
        -0.08112856, -0.077112  , -0.054

In [58]:
import numpy as np
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / np.linalg.norm(v1) / np.linalg.norm(v2)

for sentence_vector in sentence_vectors[1:]:  # compare s1 with every other sentence
    print(cosine_similarity(sentence_vectors[0], sentence_vector))



0.59613466
0.58729553
0.6417819
0.56600237


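The results are poor: all four similarities fall between roughly 0.57 and 0.64, and the near-identical pair (`s1`, `s2`) scores lower than the unrelated pair (`s1`, `s4`). The model barely distinguishes related sentences from unrelated ones.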

### **Question 8**

Load the InferSent model provided by Facebook Research ([reference](https://github.com/facebookresearch/InferSent)) and perform the same qualitative evaluation done in Question 7. In this case, you can use the InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent).

Try to find some sentences for which InferSent is able to capture the semantic meaning of the sentences as opposed to Doc2Vec. Are you satisfied with the results? Which model is able to better capture the semantic meaning of the sentences? What can be the reason for this?

**Note:**
Please find below the code to download the InferSent model.

In [None]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [None]:
from InferSent.models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

**Note:** Due to compatibility issues between newer NumPy versions and InferSent, you may encounter the following error when calling the `encode` method of the InferSent object:
> ValueError: setting an array element with a sequence...

If this occurs, you can fix it as follows:
- Restart the session
- Before loading the InferSent class, modify the `models.py` file in the InferSent folder by replacing line 207 with `sentences = np.array(sentences, dtype=object)[idx_sort]`
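
The reason the `dtype=object` fix works can be illustrated independently of InferSent. The sketch below is an assumption-laden toy example, not the library's code: it reorders tokenized sentences of different lengths by an index array, which is the kind of operation the replaced line performs.

```python
import numpy as np

# Tokenized sentences of different lengths (a "ragged" list of lists).
sentences = [["a", "cat"], ["hello"], ["one", "two", "three"]]
lengths = np.array([len(s) for s in sentences])
idx_sort = np.argsort(-lengths)  # indices from longest to shortest sentence

# With dtype=object, NumPy stores each sentence as a single Python object,
# so indexing with idx_sort simply reorders the sentences.
sorted_sentences = np.array(sentences, dtype=object)[idx_sort]
print([list(s) for s in sorted_sentences])
```

Without `dtype=object`, recent NumPy versions refuse to build a regular array from rows of unequal length, which is consistent with the `ValueError` quoted above.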

In [None]:
# your code here

### **Question 9** (Extrinsic Evaluation)

**Extrinsic** evaluation aims at measuring the performance of the word/sentence/paragraph embedding model when used in a downstream task. In this case, we will use the model to perform a text classification task.
We can use different configurations, training corpora, or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset). It contains news headlines and the corresponding category. The dataset consists of 200,846 headlines divided into multiple categories (e.g. politics, business, sports, etc.).

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab. You can use the complete data collection at home to achieve better results.

Compute the accuracy of 3 classification models each one built with one of the models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a machine learning (multi-class) classifier (e.g., MLP)
2. Split the data collection into train/test (80%/20%)
3. Use the text vectors produced by the pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models


**Note:** For word embedding models you must use an aggregation strategy to obtain a single vector for each sentence. You can use the average of the word vectors or the sum of the word vectors. In both cases, the output vector can be used as input of the classifier.
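
The two aggregation strategies can be compared on a toy example. The sketch below uses hypothetical 2-d vectors in place of a pretrained model, and the helper `aggregate` is illustrative. It also falls back to a zero vector when no token is known, which is one possible design choice rather than the only one.

```python
import numpy as np

# Hypothetical 2-d word vectors standing in for a pretrained model.
toy_vectors = {
    "trump": np.array([1.0, 0.0]),
    "wins":  np.array([0.0, 1.0]),
    "vote":  np.array([1.0, 1.0]),
}


def aggregate(tokens, vectors, dim, how="mean"):
    """Combine known word vectors into one sentence vector; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    stacked = np.vstack(known)
    return stacked.mean(axis=0) if how == "mean" else stacked.sum(axis=0)


tokens = ["trump", "wins", "vote", "unknownword"]
mean_vec = aggregate(tokens, toy_vectors, dim=2, how="mean")
sum_vec = aggregate(tokens, toy_vectors, dim=2, how="sum")
print("mean:", mean_vec)
print("sum: ", sum_vec)
```

Both vectors point in the same direction and differ only in length: the sum grows with the number of known tokens, while the mean does not. This can matter for classifiers that are sensitive to input scale.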

Report the performance of each classification pipeline. Which model has better performance? Why? Try to elaborate on the results.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [59]:
# Load dataset from URL
url = "https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv"
df = pd.read_csv(url)

# Keep first 10,000 rows for the lab
df = df.iloc[:10000].copy()  # copy, so that adding columns later does not act on a slice of the original frame

# Quick look
print(df.head())
print(df['category'].value_counts())

   Unnamed: 0       category  \
0           0          CRIME   
1           1  ENTERTAINMENT   
2           2  ENTERTAINMENT   
3           3  ENTERTAINMENT   
4           4  ENTERTAINMENT   

                                            headline  
0  There Were 2 Mass Shootings In Texas Last Week...  
1  Will Smith Joins Diplo And Nicky Jam For The 2...  
2    Hugh Grant Marries For The First Time At Age 57  
3  Jim Carrey Blasts 'Castrato' Adam Schiff And D...  
4  Julianna Margulies Uses Donald Trump Poop Bags...  
category
POLITICS          3604
ENTERTAINMENT     1906
WORLD NEWS         683
QUEER VOICES       512
COMEDY             495
BLACK VOICES       443
SPORTS             382
MEDIA              329
WOMEN              283
WEIRD NEWS         242
CRIME              201
BUSINESS           112
LATINO VOICES      105
IMPACT              86
TRAVEL              76
RELIGION            76
STYLE               70
PARENTS             66
GREEN               66
TECH                65
HEALTHY 

In [60]:
def preprocess(text):
    return text.lower().split()  # simple tokenization by whitespaces. TODO: use a better tokenization if available

df['tokens'] = df['headline'].apply(preprocess)

In [61]:
df.head()

Unnamed: 0.1,Unnamed: 0,category,headline,tokens
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,"[there, were, 2, mass, shootings, in, texas, l..."
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,"[will, smith, joins, diplo, and, nicky, jam, f..."
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,"[hugh, grant, marries, for, the, first, time, ..."
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,"[jim, carrey, blasts, 'castrato', adam, schiff..."
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,"[julianna, margulies, uses, donald, trump, poo..."


**Word2Vec + Average aggregation function**

In [63]:
def sentence_vector(tokens, model):
    """
    Compute sentence vector as average of word vectors
    """
    vectors = []

    # Check if model is KeyedVectors or Word2Vec
    if hasattr(model, 'wv'):
        vectors_model = model.wv
    else:
        vectors_model = model

    for word in tokens:
        if word in vectors_model.key_to_index:
            vectors.append(vectors_model[word])

    if len(vectors) > 0:
        return np.mean(vectors, axis=0)  # average embeddings for each token in the sentence
    else:
        raise ValueError(f"No word vectors found for tokens: {tokens}")

# Assuming w2v_google_news_model is already loaded
df['w2v_vector'] = df['tokens'].apply(lambda x: sentence_vector(x, w2v_google_news_model))

df.head()

Unnamed: 0.1,Unnamed: 0,category,headline,tokens,w2v_vector
0,0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,"[there, were, 2, mass, shootings, in, texas, l...","[0.006882888, -0.019878682, 0.06519993, 0.0732..."
1,1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,"[will, smith, joins, diplo, and, nicky, jam, f...","[0.0009543679, 0.03844105, 0.058582652, 0.0953..."
2,2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,"[hugh, grant, marries, for, the, first, time, ...","[0.04625787, 0.042371962, -0.024556478, 0.0947..."
3,3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,"[jim, carrey, blasts, 'castrato', adam, schiff...","[0.025887625, 0.040213447, -0.010271345, 0.126..."
4,4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,"[julianna, margulies, uses, donald, trump, poo...","[0.080883786, 0.064416505, 0.041757204, 0.1413..."


In [64]:
from sklearn.model_selection import train_test_split

# Labels
y = df['category']

# Example for Word2Vec vectors
X_w2v = np.vstack(df['w2v_vector'].values)  # convert list of arrays to 2D array

X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, test_size=0.2, random_state=42)


In [66]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Define classifier
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=100, random_state=42)

# Train
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Word2Vec classification accuracy: {acc:.3f}")


Word2Vec classification accuracy: 0.593
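
To put this accuracy in context, it can be compared with a trivial baseline that always predicts the most frequent category. The sketch below uses the counts printed by `value_counts()` for the first 10,000 headlines. The value for the 20% test split may differ slightly, since the split is not stratified.

```python
# Counts taken from the value_counts() output for the first 10,000 headlines.
n_headlines = 10_000
n_politics = 3_604  # POLITICS is the most frequent category

majority_baseline = n_politics / n_headlines
print(f"Majority-class baseline: {majority_baseline:.3f}")
```

An accuracy of 0.593 is therefore well above what a classifier ignoring the headline text would be expected to achieve on this subset.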




**FastText + Average aggregation function**

In [None]:
# your code here

**Doc2Vec (Text8)**

In [None]:
# your code here

**InferSent**

In [None]:
# your code here