# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Lorenzo Vaiani

**Credits:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings map words into a vector space that captures semantic relationships
- Several architectures exist for generating word embeddings
- Each architecture has its own advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream tasks)
- You can either use a pre-trained word embedding model or train one from scratch on a large amount of text
- Using pre-trained word embedding models is common practice in NLP and avoids training a model from scratch, which can be very time-consuming and computationally expensive


### **Question 1**

Train a new Word2Vec model using gensim on the text8 corpus available in the Python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time of the model and store it for subsequent steps.

**Hint:** you can use the following code to load the text8 corpus:

```python
import gensim.downloader as api
from gensim.models import Word2Vec
import time
dataset = api.load("text8")
```

In [None]:
import gensim.downloader as api
from gensim.models import Word2Vec, FastText
import time

dataset = api.load("text8")

start = time.time()
word2vec_model = Word2Vec(sentences=dataset)
end = time.time()

elapsed = end - start
word2vec_model.save("word2vec.model")
print(f"Word2Vec model trained on text8. Time elapsed: {elapsed} seconds.")

Word2Vec model trained on text8. Time elapsed: 130.257670879364 seconds.


### **Question 2**
Perform an **intrinsic** evaluation of the model on the word analogy task, using the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv).

1. Read the CSV file.
2. Group the analogy entries by type (column: `type`).
3. For each type (**in the lab, just set type="family"** to reduce the required time), use the first 3 word vectors to compute the fourth:
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v`
    - Get the vectors most similar to `res_v`
    - Count the cases in which the correct word is among the top K (i.e., `v(Iraq)` is among the K vectors most similar to `res_v`), with `K = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top $k$ predictions and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
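
The analogy step and the $top(k)$ score defined above can be sketched in plain Python. This is a minimal illustration rather than the gensim implementation: the 2-D vectors in `toy_vectors` and the helper names `rank_analogy` and `top_k_accuracy` are made up for the example.

```python
import math

# Hypothetical 2-D embeddings, chosen by hand for illustration only
toy_vectors = {
    "athens": (1.0, 0.0),
    "greece": (1.0, 1.0),
    "baghdad": (2.0, 0.0),
    "iraq": (2.0, 1.0),
    "banana": (-1.0, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def rank_analogy(vectors, word1, word2, word3):
    """Rank candidates by similarity to v(word2) - v(word1) + v(word3)."""
    res_v = tuple(
        b - a + c for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    )
    # The three input words are excluded from the candidates
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return sorted(candidates, key=lambda w: cosine(vectors[w], res_v), reverse=True)


def top_k_accuracy(rankings, targets, ks=(1, 3, 5, 10)):
    """Return {k: share of entries whose target is among the first k words}."""
    return {
        k: sum(target in ranking[:k] for ranking, target in zip(rankings, targets))
        / len(targets)
        for k in ks
    }


entries = [("athens", "greece", "baghdad", "iraq")]
rankings = [rank_analogy(toy_vectors, w1, w2, w3) for w1, w2, w3, _ in entries]
targets = [target for *_, target in entries]
scores = top_k_accuracy(rankings, targets)
print(scores)
```

With gensim, each ranking would come from a single `most_similar(..., topn=10)` call, which is enough to score every K up to 10 without querying the model again.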

**Notes:**
1. Try the model trained on `text8`. Is there any issue? If so, how can you solve it?
2. Test the pre-trained Google News model available in gensim.
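
One possible way to handle analogy entries that contain words missing from a model's vocabulary is to separate them before scoring. The sketch below is only an assumption about how this could be done: `split_by_coverage` is a hypothetical helper, and a plain Python `set` stands in for the model vocabulary (with gensim 4 this could be `word2vec_model.wv.key_to_index`). Since text8 is lowercased, an optional normalisation step is included.

```python
def split_by_coverage(entries, vocabulary, normalise=None):
    """Split analogy entries into those fully covered by the vocabulary and the rest.

    entries: iterable of (word1, word2, word3, target) tuples
    vocabulary: any container supporting `in` (e.g. a set of known words)
    normalise: optional function applied to each word first (e.g. str.lower)
    """
    covered, skipped = [], []
    for entry in entries:
        words = tuple(normalise(w) if normalise else w for w in entry)
        if all(w in vocabulary for w in words):
            covered.append(words)
        else:
            skipped.append(words)
    return covered, skipped


# Toy vocabulary standing in for a model trained on lowercased text
vocab = {"boy", "girl", "king", "queen", "athens", "greece", "baghdad", "iraq"}
entries = [
    ("boy", "girl", "king", "queen"),
    ("Athens", "Greece", "Baghdad", "Iraq"),
]

covered_raw, skipped_raw = split_by_coverage(entries, vocab)
covered_lower, skipped_lower = split_by_coverage(entries, vocab, normalise=str.lower)
print(len(covered_raw), len(skipped_raw))      # coverage without normalisation
print(len(covered_lower), len(skipped_lower))  # coverage after lowercasing
```

Reporting the number of skipped entries next to the accuracy makes it clear whether a low score comes from wrong predictions or from missing vocabulary.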

In [None]:
%%capture
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
! pip install --upgrade pandas

In [None]:
# Executing this cell could take ~5 minutes
import gensim.downloader

w2v_google_news_model = gensim.downloader.load("word2vec-google-news-300")

In [None]:
import pandas as pd

analogies_dataset = pd.read_csv("google_analogies.csv")

family_dataset = analogies_dataset[analogies_dataset["type"] == "family"]
correct = 0

for idx, row in family_dataset.iterrows():
    word1 = row["word1"]
    word2 = row["word2"]
    word3 = row["word3"]
    target = row["target"]

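    try:
        # topn=10, so the score printed below is the top-10 accuracy (K = 10)
        most_similar_words = word2vec_model.wv.most_similar(
            positive=[word2, word3], negative=[word1], topn=10
        )
    except KeyError:
        # a word missing from the text8 vocabulary raises KeyError;
        # the entry is skipped but still counted in the denominator
        continue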

    predicted_words = [word for (word, _) in most_similar_words]
    if target in predicted_words:
        correct += 1

print(
    f"Accuracy for word2vec (text8) with families dataset: {100 * correct/len(family_dataset)}%"
)

Accuracy for word2vec (text8) with families dataset: 67.58893280632411%


In [None]:
correct = 0

for idx, row in family_dataset.iterrows():
    word1 = row["word1"]
    word2 = row["word2"]
    word3 = row["word3"]
    target = row["target"]

    most_similar_words = w2v_google_news_model.most_similar(
        positive=[word2, word3], negative=[word1], topn=10
    )

    predicted_words = [word for (word, _) in most_similar_words]
    if target in predicted_words:
        correct += 1

print(
    f"Accuracy for word2vec (google-news) with families dataset: {100 * correct/len(family_dataset)}%"
)

Accuracy for word2vec (google-news) with families dataset: 97.43083003952569%


### **Question 3**

Train a new FastText model using gensim on the text8 corpus available in the Python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time of the model and store it for subsequent steps.

- Is there any significant difference in training time compared with Word2Vec?
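
To keep the training times of several models side by side, a small timing helper can be handy. This is a minimal sketch under the assumption that wall-clock time is what should be compared; `timer` and `training_times` are hypothetical names, and the summation in the `with` block is only a placeholder for an actual training call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


training_times = {}

with timer("placeholder", training_times):
    # Placeholder workload; a model training call would go here instead
    sum(range(100_000))

for label, seconds in training_times.items():
    print(f"{label}: {seconds:.3f} seconds")
```

For a fair comparison, the timed blocks should cover the same steps (vocabulary building and training) and use the same number of epochs for every model.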

In [None]:
dataset = api.load("text8")

start = time.time()
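# Note: training below runs for 10 epochs, whereas Word2Vec in Question 1 used
# gensim's default (5 epochs), so the two timings are not directly comparable.
fasttext_model = FastText()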
fasttext_model.build_vocab(corpus_iterable=dataset)
fasttext_model.train(
    corpus_iterable=dataset,
    total_examples=fasttext_model.corpus_count,
    total_words=fasttext_model.corpus_total_words,
    epochs=10,
)
end = time.time()

elapsed = end - start
fasttext_model.save("fasttext.model")
print(f"FastText model trained on text8. Time elapsed: {elapsed} seconds.")

FastText model trained on text8. Time elapsed: 999.1348195075989 seconds.


### **Question 4**
Repeat the evaluation from Question 2 for the FastText model, using the same analogy type (family) and the same K values.

**Notes:**
- Try the model trained on `text8`. Is there any issue? What does it mean?
- Test the pre-trained Wikipedia+News model available in gensim.
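
A useful piece of background for the first note is that FastText represents a word through its character n-grams, so it can build a vector even for a word it never saw during training. The sketch below only illustrates how such n-grams are extracted; `char_ngrams` is a hypothetical helper, and the default range of 3 to 6 characters is assumed to match gensim's `min_n` and `max_n` defaults. The real model additionally hashes the n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    return [
        wrapped[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# Two words sharing n-grams also share part of their representation
shared = set(char_ngrams("grandmother")) & set(char_ngrams("grandfather"))
print(sorted(shared))
```

Because the vector of an unseen word is assembled from n-grams it shares with known words, a lookup that fails in Word2Vec can still succeed in a FastText model trained with subword information.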

In [None]:
import gensim.downloader as api

In [None]:
# Executing this cell could take ~5 minutes
ft_wiki_news_model = api.load('fasttext-wiki-news-subwords-300')

In [None]:
correct = 0

for idx, row in family_dataset.iterrows():
    word1 = row["word1"]
    word2 = row["word2"]
    word3 = row["word3"]
    target = row["target"]

    most_similar_words = fasttext_model.wv.most_similar(
        positive=[word2, word3], negative=[word1], topn=10
    )

    predicted_words = [word for (word, _) in most_similar_words]
    if target in predicted_words:
        correct += 1

print(
    f"Accuracy for fasttext (text8) with families dataset: {100 * correct/len(family_dataset)}%"
)

Accuracy for fasttext (text8) with families dataset: 67.78656126482214%


In [None]:
correct = 0

for idx, row in family_dataset.iterrows():
    word1 = row["word1"]
    word2 = row["word2"]
    word3 = row["word3"]
    target = row["target"]

    most_similar_words = ft_wiki_news_model.most_similar(
        positive=[word2, word3], negative=[word1], topn=10
    )

    predicted_words = [word for (word, _) in most_similar_words]
    if target in predicted_words:
        correct += 1

print(
    f"Accuracy for fasttext (ft_wiki_news) with families dataset: {100 * correct/len(family_dataset)}%"
)

Accuracy for fasttext (ft_wiki_news) with families dataset: 98.41897233201581%


### **Question 5** (optional)

Provide a complete evaluation of the best-performing models (Word2Vec and FastText) on the complete dataset of analogy entries. In this case, you should use all the analogy types, and you can use the same K values provided in Question 2.
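
When all analogy types are evaluated, a per-type breakdown is often more informative than a single global number. The following is a small sketch of that bookkeeping, assuming each evaluated entry has already been reduced to a pair of its type and whether the target was found; `accuracy_by_type` and the sample `results` are hypothetical.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Compute accuracy per analogy type from (analogy_type, hit) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for analogy_type, hit in results:
        totals[analogy_type] += 1
        hits[analogy_type] += int(hit)
    return {t: hits[t] / totals[t] for t in totals}


# Made-up outcomes, for illustration only
results = [
    ("family", True),
    ("family", False),
    ("capital-common-countries", True),
]

per_type = accuracy_by_type(results)
for analogy_type, accuracy in per_type.items():
    print(f"{analogy_type}: {100 * accuracy:.1f}%")
```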

In [None]:
import pandas as pd

analogies_dataset = pd.read_csv("google_analogies.csv")

correct = 0

for idx, row in analogies_dataset.iterrows():
    word1 = row["word1"]
    word2 = row["word2"]
    word3 = row["word3"]
    target = row["target"]

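    try:
        # topn=10, so the score printed below is the top-10 accuracy (K = 10)
        most_similar_words = word2vec_model.wv.most_similar(
            positive=[word2, word3], negative=[word1], topn=10
        )
    except KeyError:
        # a word missing from the text8 vocabulary raises KeyError;
        # the entry is skipped but still counted in the denominator
        continue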

    predicted_words = [word for (word, _) in most_similar_words]
    if target in predicted_words:
        correct += 1

print(
    f"Accuracy for word2vec (text8) with full dataset: {100 * correct/len(analogies_dataset)}%"
)

correct = 0

for idx, row in analogies_dataset.iterrows():
    word1 = row["word1"]
    word2 = row["word2"]
    word3 = row["word3"]
    target = row["target"]

    most_similar_words = fasttext_model.wv.most_similar(
        positive=[word2, word3], negative=[word1], topn=10
    )

    predicted_words = [word for (word, _) in most_similar_words]
    if target in predicted_words:
        correct += 1

print(
    f"Accuracy for fasttext (text8) with full dataset: {100 * correct/len(analogies_dataset)}%"
)

Accuracy for word2vec (text8) with full dataset: 24.43716741711011%
Accuracy for fasttext (text8) with full dataset: 41.439828080229226%


## Sentence Embeddings

Sentence embeddings represent a sentence as a vector. The vector space is usually learned from a large text corpus. Sentence embeddings are used in many NLP tasks, such as text classification, text similarity, and question answering. In this practice, we will work with both Doc2Vec and InferSent models.

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework.
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph.
- Vectors for the training documents are computed during training; vectors for new documents can be inferred afterwards.
- InferSent uses a deep learning architecture trained with supervision to learn sentence representations.
- InferSent builds on pre-trained word embeddings (e.g., the fastText vectors used by the v2 encoder).

### **Question 6**

Train a new Doc2Vec model using the [APIs provided by gensim](https://radimrehurek.com/gensim/models/doc2vec.html) on the text8 corpus.

- What is the training time of the model? Is it comparable with the Word2Vec and FastText training times?

NB. **Store** the model to a file for subsequent steps.

In [None]:
import nltk

nltk.download("punkt")

import gensim
import gensim.downloader as api
import time

dataset = api.load("text8")
data = [d for d in dataset]


def generator_tag_docs(data):
    for i, l in enumerate(data):
        yield gensim.models.doc2vec.TaggedDocument(l, [i])


training_dataset = list(generator_tag_docs(data))

print(training_dataset[:1])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'

In [None]:
from gensim.models import Doc2Vec

doc2vec_model = Doc2Vec()
doc2vec_model.build_vocab(training_dataset)
start = time.time()
doc2vec_model.train(
    training_dataset,
    total_examples=doc2vec_model.corpus_count,
    epochs=doc2vec_model.epochs,
)
end = time.time()
doc2vec_model.save("doc2vec_text8.model")
print(f"Doc2Vec model trained. {end-start} seconds elapsed.")

Doc2Vec model trained. 225.96279573440552 seconds elapsed.


### **Question 7 (Doc2Vec qualitative evaluation)**
Perform some **qualitative** experiments by computing the cosine similarity between sentences that you write yourself.
For example, you can use the following sentences:

```python
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"
```

Please try to interact with the model by providing different sentences and check the results. Is the model able to capture the semantic meaning of the sentences? Are you satisfied with the results?
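
As a reminder of what is being measured, cosine similarity is the dot product of two vectors divided by the product of their norms, so it ranges from -1 (opposite directions) to 1 (same direction). Below is a minimal plain-Python sketch; the function name `cosine` and the zero-vector convention (returning 0.0) are choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention for this sketch: similarity with a zero vector is 0
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Since only the direction of the vectors matters, two sentences of very different length can still obtain a high score.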

In [None]:
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"
s5 = "I have written a new sentence with a different structure"
s6 = "This model struggles to capture meaning of the sentences"
s7 = "The model is undergoing difficulties related to inferring context for the phrases"


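def get_similarity(s1, s2):
    # Note: text8 is lowercased, so capitalised tokens (e.g. "The", "United")
    # are out-of-vocabulary and ignored by infer_vector
    v1 = doc2vec_model.infer_vector(s1.split())
    v2 = doc2vec_model.infer_vector(s2.split())
    # cosine_similarities returns an array with one value per compared vector
    return doc2vec_model.dv.cosine_similarities(v1, [v2])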

print(f"Cosine similarity between s1 and s2: {get_similarity(s1, s2)}")
print(f"Cosine similarity between s3 and s4: {get_similarity(s3, s4)}")
print(f"Cosine similarity between s1 and s4: {get_similarity(s1, s4)}")
print(f"Cosine similarity between s2 and s5: {get_similarity(s2, s5)}")
print(f"Cosine similarity between s3 and s6: {get_similarity(s3, s6)}")
print(f"Cosine similarity between s6 and s7: {get_similarity(s6, s7)}")

Cosine similarity between s1 and s2: [0.822847]
Cosine similarity between s3 and s4: [0.7301078]
Cosine similarity between s1 and s4: [0.70971566]
Cosine similarity between s2 and s5: [0.58107346]
Cosine similarity between s3 and s6: [0.24988657]
Cosine similarity between s6 and s7: [0.576627]


### **Question 8**

Load the InferSent model provided by Facebook Research ([reference](https://github.com/facebookresearch/InferSent)) and repeat the qualitative evaluation from Question 7, using the pre-trained InferSent model (v2).

Try to find some sentences whose meaning InferSent captures but Doc2Vec does not. Are you satisfied with the results? Which model captures the semantic meaning of the sentences better? What could be the reason for this?

**Notes:**
The code to download the InferSent model is provided below.

In [2]:
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"
s5 = "I have written a new sentence with a different structure"
s6 = "This model struggles to capture meaning of the sentences"
s7 = "The model is undergoing difficulties related to inferring context for the phrases"

sentences = [s1, s2, s3, s4, s5, s6, s7]

In [None]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [None]:
from InferSent.models import InferSent
import torch

V = 2
MODEL_PATH = "encoder/infersent%s.pkl" % V
params_model = {
    "bsize": 64,
    "word_emb_dim": 300,
    "enc_lstm_dim": 2048,
    "pool_type": "max",
    "dpout_model": 0.0,
    "version": V,
}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = "fastText/crawl-300d-2M.vec"
infersent.set_w2v_path(W2V_PATH)

In [None]:
infersent.build_vocab(sentences, tokenize=True)
embeddings = infersent.encode(sentences, tokenize=True)

Found 40(/40) words with w2v vectors
Vocab size : 40


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def get_similarity_infersent(s1, s2):
    index_s1 = sentences.index(s1)
    index_s2 = sentences.index(s2)

    emb1 = np.array(embeddings[index_s1]).reshape(1, -1)
    emb2 = np.array(embeddings[index_s2]).reshape(1, -1)

    similarity = cosine_similarity(emb1, emb2)
    return similarity[0][0]


print(f"Cosine similarity between s1 and s2: {get_similarity_infersent(s1, s2)}")
print(f"Cosine similarity between s3 and s4: {get_similarity_infersent(s3, s4)}")
print(f"Cosine similarity between s1 and s4: {get_similarity_infersent(s1, s4)}")
print(f"Cosine similarity between s2 and s5: {get_similarity_infersent(s2, s5)}")
print(f"Cosine similarity between s3 and s6: {get_similarity_infersent(s3, s6)}")
print(f"Cosine similarity between s6 and s7: {get_similarity_infersent(s6, s7)}")

Cosine similarity between s1 and s2: 0.8108417987823486
Cosine similarity between s3 and s4: 0.28624817728996277
Cosine similarity between s1 and s4: 0.09907999634742737
Cosine similarity between s2 and s5: 0.06086566671729088
Cosine similarity between s3 and s6: 0.10468509793281555
Cosine similarity between s6 and s7: 0.6279629468917847


### **Question 9** (Extrinsic Evaluation)

**Extrinsic** evaluation measures the performance of a word/sentence/paragraph embedding model when it is used in a downstream task. In this case, we will use the models to perform a text classification task.
We can use different configurations, training corpora, or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset). It contains news headlines and the corresponding categories. The dataset consists of 200846 headlines divided into multiple categories (e.g., politics, business, sports).

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab. You can use the complete data collection at home to achieve better results.

Compute the accuracy of the classification models, each built on one of the embedding models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a (multi-class) machine learning classifier (e.g., MLP)
2. Split the data collection into train/test sets (80%/20%)
3. Use the text vectors obtained from the pre-trained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models


**Note:** For word embedding models you must use an aggregation strategy to obtain a single vector for each sentence. You can use the average of the word vectors or the sum of the word vectors. In both cases, the output vector can be used as input of the classifier.
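
The two aggregation strategies mentioned in the note can be sketched as follows. This is an illustration in plain Python with a toy dictionary standing in for a pre-trained model; `toy_embeddings`, `average_vector`, and `sum_vector` are hypothetical names, and returning a zero vector when no token is known is one possible convention.

```python
def sum_vector(tokens, embeddings, dim):
    """Sum the vectors of the known tokens (zero vector if none is known)."""
    total = [0.0] * dim
    for token in tokens:
        if token in embeddings:
            total = [t + x for t, x in zip(total, embeddings[token])]
    return total


def average_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens (zero vector if none is known)."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [x / len(known) for x in sum_vector(known, embeddings, dim)]


# Toy 2-D embeddings, for illustration only
toy_embeddings = {"stock": [1.0, 3.0], "market": [3.0, 5.0]}

headline = ["stock", "market", "zzz"]  # "zzz" is out of vocabulary
avg = average_vector(headline, toy_embeddings, dim=2)
total = sum_vector(headline, toy_embeddings, dim=2)
print(avg, total)
```

Averaging keeps the scale of the sentence vector independent of the headline length, whereas the sum grows with the number of known tokens.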

Report the performance of each classification pipeline. Which model has better performance? Why? Try to elaborate on the results.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [10]:
import pandas as pd

dataset = pd.read_csv("news_headline_classification.csv")
sentences = dataset["headline"].tolist()
labels = dataset["category"].tolist()

**Word2Vec + Average aggregation function**

In [None]:
from nltk import word_tokenize
import numpy as np
from tqdm import tqdm

list_w2v_vectors = []
for s in tqdm(sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(w2v_google_news_model[w])
        except KeyError:
            # skip tokens missing from the pre-trained vocabulary
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_w2v_vectors.append(sentence_vector)


from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    list_w2v_vectors, labels, test_size=0.20, random_state=42
)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

100%|██████████| 200847/200847 [00:37<00:00, 5292.50it/s]


Iteration 1, loss = 3.34525746
Iteration 2, loss = 3.28456339
Iteration 3, loss = 3.28466959
Iteration 4, loss = 3.28466676
Iteration 5, loss = 3.28459998
Iteration 6, loss = 3.28454753
Iteration 7, loss = 3.28453252
Iteration 8, loss = 3.28458385
Iteration 9, loss = 3.28454836
Iteration 10, loss = 3.28454491
Iteration 11, loss = 3.28447029
Iteration 12, loss = 3.28451829
Iteration 13, loss = 3.28448608
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
0.16370425690814042


  _warn_prf(average, modifier, msg_start, len(result))


                precision    recall  f1-score   support

          ARTS       0.00      0.00      0.00       314
ARTS & CULTURE       0.00      0.00      0.00       272
  BLACK VOICES       0.00      0.00      0.00       874
      BUSINESS       0.00      0.00      0.00      1139
       COLLEGE       0.00      0.00      0.00       216
        COMEDY       0.00      0.00      0.00      1082
         CRIME       0.00      0.00      0.00       682
CULTURE & ARTS       0.00      0.00      0.00       208
       DIVORCE       0.00      0.00      0.00       694
     EDUCATION       0.00      0.00      0.00       207
 ENTERTAINMENT       0.00      0.00      0.00      3182
   ENVIRONMENT       0.00      0.00      0.00       255
         FIFTY       0.00      0.00      0.00       305
  FOOD & DRINK       0.00      0.00      0.00      1301
     GOOD NEWS       0.00      0.00      0.00       273
         GREEN       0.00      0.00      0.00       518
HEALTHY LIVING       0.00      0.00      0.00  



**FastText + Average aggregation function**

In [None]:
# FastText + Avg
from nltk import word_tokenize

list_ft_vectors = []
for s in tqdm(sentences):
    words = word_tokenize(s)
    words_vectors = []
    for w in words:
        try:
            words_vectors.append(ft_wiki_news_model[w])
        except KeyError:
            # skip tokens missing from the pre-trained vocabulary
            continue
    if len(words_vectors) > 0:
        sentence_vector = np.mean(words_vectors, axis=0)
    else:
        sentence_vector = np.zeros(300)
    list_ft_vectors.append(sentence_vector)

X_train, X_test, y_train, y_test = train_test_split(
    list_ft_vectors, labels, test_size=0.20, random_state=42
)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

100%|██████████| 200847/200847 [00:57<00:00, 3518.93it/s]


Iteration 1, loss = 2.83543552
Iteration 2, loss = 2.20636800
Iteration 3, loss = 1.97942892
Iteration 4, loss = 1.87505088
Iteration 5, loss = 1.81175184
Iteration 6, loss = 1.76877782
Iteration 7, loss = 1.73823985
Iteration 8, loss = 1.71467398
Iteration 9, loss = 1.69626063
Iteration 10, loss = 1.68089348
Iteration 11, loss = 1.66792000
Iteration 12, loss = 1.65679504
Iteration 13, loss = 1.64694452
Iteration 14, loss = 1.63850418
Iteration 15, loss = 1.63108974
Iteration 16, loss = 1.62381656
Iteration 17, loss = 1.61762031
Iteration 18, loss = 1.61178362
Iteration 19, loss = 1.60693028
Iteration 20, loss = 1.60201445
Iteration 21, loss = 1.59781175
Iteration 22, loss = 1.59376225
Iteration 23, loss = 1.58963672
Iteration 24, loss = 1.58636926
Iteration 25, loss = 1.58288734
Iteration 26, loss = 1.57972741
Iteration 27, loss = 1.57668554
Iteration 28, loss = 1.57402112
Iteration 29, loss = 1.57145189
Iteration 30, loss = 1.56942948
Iteration 31, loss = 1.56693735
Iteration 32, los



0.5553895942245457
                precision    recall  f1-score   support

          ARTS       0.28      0.17      0.21       314
ARTS & CULTURE       0.33      0.18      0.24       272
  BLACK VOICES       0.41      0.28      0.34       874
      BUSINESS       0.41      0.38      0.40      1139
       COLLEGE       0.38      0.31      0.34       216
        COMEDY       0.47      0.29      0.35      1082
         CRIME       0.53      0.55      0.54       682
CULTURE & ARTS       0.46      0.22      0.30       208
       DIVORCE       0.64      0.54      0.59       694
     EDUCATION       0.36      0.30      0.33       207
 ENTERTAINMENT       0.56      0.72      0.63      3182
   ENVIRONMENT       0.44      0.25      0.32       255
         FIFTY       0.35      0.15      0.21       305
  FOOD & DRINK       0.60      0.71      0.65      1301
     GOOD NEWS       0.32      0.21      0.26       273
         GREEN       0.38      0.32      0.35       518
HEALTHY LIVING       0.32   

**Doc2Vec (Text8)**

In [None]:
# Doc2Vec
from nltk import word_tokenize
import numpy as np

list_d2v_vectors = []
for s in tqdm(sentences):
    words = word_tokenize(s)
    try:
        sentence_vector = doc2vec_model.infer_vector(words)
    except Exception as e:
        print(e)
        # fall back to a zero vector with the model's own dimensionality
        sentence_vector = np.zeros(doc2vec_model.vector_size)

    list_d2v_vectors.append(sentence_vector)

100%|██████████| 200847/200847 [02:32<00:00, 1320.85it/s]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    list_d2v_vectors, labels, test_size=0.20, random_state=42
)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Iteration 1, loss = 3.32060992
Iteration 2, loss = 3.24378457
Iteration 3, loss = 3.23996465
Iteration 4, loss = 3.23798726
Iteration 5, loss = 3.23647417
Iteration 6, loss = 3.23538712
Iteration 7, loss = 3.23450545
Iteration 8, loss = 3.23387396
Iteration 9, loss = 3.23326831
Iteration 10, loss = 3.23268930
Iteration 11, loss = 3.23209980
Iteration 12, loss = 3.23174501
Iteration 13, loss = 3.23124310
Iteration 14, loss = 3.23082757
Iteration 15, loss = 3.23048963
Iteration 16, loss = 3.23014778
Iteration 17, loss = 3.22986866
Iteration 18, loss = 3.22958293
Iteration 19, loss = 3.22920412
Iteration 20, loss = 3.22898827
Iteration 21, loss = 3.22886572
Iteration 22, loss = 3.22857688
Iteration 23, loss = 3.22836714
Iteration 24, loss = 3.22806703
Iteration 25, loss = 3.22780451
Iteration 26, loss = 3.22767547
Iteration 27, loss = 3.22751127
Iteration 28, loss = 3.22730324
Iteration 29, loss = 3.22708308
Iteration 30, loss = 3.22690418
Iteration 31, loss = 3.22674716
Iteration 32, los

  _warn_prf(average, modifier, msg_start, len(result))


                precision    recall  f1-score   support

          ARTS       0.08      0.00      0.01       314
ARTS & CULTURE       0.00      0.00      0.00       272
  BLACK VOICES       0.00      0.00      0.00       874
      BUSINESS       0.05      0.00      0.00      1139
       COLLEGE       0.00      0.00      0.00       216
        COMEDY       0.11      0.00      0.00      1082
         CRIME       0.00      0.00      0.00       682
CULTURE & ARTS       0.00      0.00      0.00       208
       DIVORCE       0.00      0.00      0.00       694
     EDUCATION       1.00      0.00      0.01       207
 ENTERTAINMENT       0.11      0.01      0.01      3182
   ENVIRONMENT       0.00      0.00      0.00       255
         FIFTY       0.00      0.00      0.00       305
  FOOD & DRINK       0.12      0.00      0.00      1301
     GOOD NEWS       0.00      0.00      0.00       273
         GREEN       0.00      0.00      0.00       518
HEALTHY LIVING       0.00      0.00      0.00  



**InferSent**

In [14]:
from nltk import word_tokenize
import nltk

nltk.download("punkt")

infersent.build_vocab(sentences[:10_000], tokenize=True)
infersent_embeddings = infersent.encode(sentences[:10_000], tokenize=True)

X_train, X_test, y_train, y_test = train_test_split(
    infersent_embeddings, labels[:10_000], test_size=0.20, random_state=42
)
mlp = MLPClassifier(hidden_layer_sizes=(50,), verbose=True)
mlp = mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Found 13646(/15245) words with w2v vectors
Vocab size : 13646
Iteration 1, loss = 2.26206645
Iteration 2, loss = 1.77062995
Iteration 3, loss = 1.56431892
Iteration 4, loss = 1.42191421
Iteration 5, loss = 1.31684641
Iteration 6, loss = 1.23298598
Iteration 7, loss = 1.16323816
Iteration 8, loss = 1.10335614
Iteration 9, loss = 1.04740975
Iteration 10, loss = 0.99942179
Iteration 11, loss = 0.95425177
Iteration 12, loss = 0.91252584
Iteration 13, loss = 0.87618822
Iteration 14, loss = 0.83649724
Iteration 15, loss = 0.80598241
Iteration 16, loss = 0.77426202
Iteration 17, loss = 0.74632017
Iteration 18, loss = 0.71954325
Iteration 19, loss = 0.69045799
Iteration 20, loss = 0.66599044
Iteration 21, loss = 0.64391445
Iteration 22, loss = 0.62142377
Iteration 23, loss = 0.60238641
Iteration 24, loss = 0.57603642
Iteration 25, loss = 0.55841694
Iteration 26, loss = 0.53797434
Iteration 27, loss = 0.52157096
Iteration 28, loss = 0.50192538
Iteration 29, loss = 0.48230846
Iteration 30, loss 