# 3 - Preprocessing and transformations

In this phase, we build a few text-specific models so that we can then train them.

In [1]:
# installation path for the mltoolkit package, with custom metrics and plots
# !pip install git+ssh://git@github.com/flimao/mltoolkit

In [2]:
import pandas as pd
import numpy as np
from matplotlib import rcParams, rcParamsDefault, pyplot as plt
import seaborn as sns
from mltoolkit import metrics, plots, NLP
import spacy

rcParams.update(rcParamsDefault)
rcParams['figure.dpi'] = 120
rcParams['figure.figsize'] = (10, 8)

In [3]:
# !python -m spacy download pt_core_news_lg
# !python -m spacy download pt_core_news_md
# !python -m spacy download pt_core_news_sm
nlp = spacy.load("pt_core_news_lg")

## Importing the data

First, we import the data and apply the transformations used in the previous phase:

In [4]:
# we will not touch the submission set

tweets_raw = pd.read_csv(
    r'../data/Train3Classes.csv',
)

In [5]:
# change dtypes to speed up processing (smaller memory footprint)
# and enable possible pandas-internal optimizations for certain dtypes
def mudar_tipos(df):
    df = df.copy()

    df['id'] = df['id'].astype('string')
    df['tweet_date'] = pd.to_datetime(df['tweet_date'])
    df['sentiment'] = df['sentiment'].astype('category')

    return df

def remover_duplicatas(df):
    df = df.copy()

    df = df.drop_duplicates(subset = 'id')

    return df

# the id becomes the index, since there are no repeated ids
# advantage: the id is automatically kept out of the features when we split into training and test sets.
def setar_index(df):
    df = df.copy()

    df = df.set_index('id')

    return df

tweets_full = (tweets_raw
    .pipe(mudar_tipos)
    .pipe(remover_duplicatas)
    .pipe(setar_index)
)

## Text preprocessing

Let's now implement the text preprocessing from the previous phase (Exploratory Text Analysis).

First, we import the *stopwords*:

In [6]:
with open(r'../data/stopwords_alopes.txt', encoding = 'utf8') as stopword_list:
    lst = stopword_list.read().splitlines()

stopwords_alopes = set([ stopword.strip() for stopword in lst ])

# in sentiment analysis, we don't want to remove words with a negative connotation
remover_stopwords = {
    'não', 
}

stopwords_alopes -= remover_stopwords

In [7]:
preprocessing_full = lambda s: NLP.preprocessing(s, preproc_funs_args = [
    NLP.remove_links,
    NLP.remove_hashtags,
    NLP.remove_mentions,
    NLP.remove_numbers,
    NLP.remove_special_caract,
    NLP.lowercase,
    #remove_punkt,
    #(remove_stopwords, dict(stopword_list = stopword_list_alopes)),
    (NLP.tokenize_remove_stopwords_get_radicals_spacy, dict(
        nlp = nlp,
        stopword_list = stopwords_alopes,
    )),
])

We then apply this preprocessing to a sample of the *tweets* dataset (so we can iterate quickly if needed).

Later, we will train on the full dataset.

In [8]:
amostra_eda = 5000
radicais = tweets_full.sample(amostra_eda)['tweet_text'].apply(preprocessing_full)

tweets = tweets_full.copy()
tweets['radicais'] = radicais
tweets = tweets[tweets.radicais.notna()]

In [9]:
tweets.sample(10)

id,tweet_text,tweet_date,sentiment,query_used,radicais
1046253121217974272,queria beijar os dedinhso dos pes e das maos d...,2018-09-30 04:19:17+00:00,0,:(,querer beijar dedinhso pes maos deposi caro ba...
1046933383132192768,ai gente vai dar merda essa eleição :(,2018-10-02 01:22:24+00:00,0,:(,ai gente merda eleicao
1047555735192842240,"@LuscaTurtle ""Pros senadores tenho que ainda q...",2018-10-03 18:35:24+00:00,1,:),pros senador escolher outro alar suplicy lt pr...
1047468298554826752,"dólar derretendo !! R$3,848 com -2,09% !!! :))",2018-10-03 12:47:58+00:00,1,:),dolar derreter r$
1042941480732581888,Servidoras do Itamaraty fizeram vaquinha para ...,2018-09-21 01:00:00+00:00,2,jornaloglobo,servidor itamaraty fazer boi pagar cirurgia mu...
1046925538282151936,Welder não vai vim pra cá começo do ano :((((,2018-10-02 00:51:13+00:00,0,:(,welder nao vir pra ca comeco ano
1049315231376269313,Eu quase chorei hj na sala de tanto que as gur...,2018-10-08 15:07:01+00:00,0,:(,quase chorar hj sala guri encher sacar excluiram
1046782264301015045,@yagodeluque 🌻 que lindo o perfil 💛 8/10 ⭐ voc...,2018-10-01 15:21:54+00:00,1,:),lindar perfil voce ta interagir comigo dia cha...
1048465998213537792,@GustavoHSalesVi @BlogdoNoblat Entendo q no wi...,2018-10-06 06:52:28+00:00,1,:),entender q wikipedia so dia dar confusao
1049223716482240513,@SputNico_ perdi :( Eu tava morrendo de dor de...,2018-10-08 09:03:22+00:00,0,:(,perder tava morrer dor cabeca dormir x_x sorry


## Model options

Let's now look at some models we can use.

We will define the candidate models and then compare them.

In [10]:
X = tweets['radicais']
y = tweets['sentiment']

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size = 0.3,
    stratify = y,
)

X_trains = {}
X_tests = {}

### 1. *Bag of Words* / `CountVectorizer`

*Bag of Words* is the process of translating the already-processed text into a numerical representation that a *Machine Learning* model can interpret: each document becomes a vector of counts, one per vocabulary term.
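
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words matrix. The toy corpus and the names `docs`, `vocab` and `bow` are illustrative assumptions, not part of the dataset, and the sketch only mimics the counting step rather than reproducing `CountVectorizer`'s own tokenization rules.

```python
from collections import Counter

# Toy corpus (hypothetical): already preprocessed, tokens separated by spaces.
docs = ["gente eleicao", "dolar derreter", "gente gente dolar"]

# Vocabulary: every distinct token, sorted so the column order is deterministic.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, each cell a raw count.
bow = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Each row has as many columns as there are distinct terms, which is why the real matrix is wide and mostly zeros.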

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X_trains['cv'] = cv.fit_transform(X_train).toarray()
X_tests['cv'] = cv.transform(X_test).toarray()

In [13]:
X_trains['cv']

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [14]:
X_trains['cv'].shape

(3500, 6419)

### 2. TF-IDF

***Term Frequency-Inverse Document Frequency*** (TF-IDF) is a transformation that scores the relevance of each word by multiplying its **Term Frequency** by its **Inverse Document Frequency**.

In this context, a **document** is each individual text in the *dataset* (here, each tweet). Let's look at each factor:

> **TF - Term Frequency**: how often a term/word appears in each of the documents analyzed (this helps us assess how relevant the word is to that document);

> **IDF - Inverse Document Frequency**: here we look at how many documents contain the term/word; the fewer documents it appears in, the higher its weight (this tells us how useful the term is for telling texts apart).
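
The two factors above can be worked through by hand. The following is a minimal sketch, assuming the smoothed IDF variant that scikit-learn documents as its default, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, followed by L2 normalization of each row. The toy corpus and all names (`docs`, `idf`, `tfidf_row`) are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [["gente", "eleicao"], ["dolar", "derreter"], ["gente", "gente", "dolar"]]
n_docs = len(docs)
vocab = sorted({token for doc in docs for token in doc})

# Document frequency: number of documents that contain each term at least once.
df = Counter(token for doc in docs for token in set(doc))

# Smoothed inverse document frequency (assumed variant, see the lead-in).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw count times IDF for every vocabulary term, then L2-normalized."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

tfidf = [tfidf_row(doc) for doc in docs]

for term in vocab:
    print(term, round(idf[term], 4))
```

Terms that occur in a single document (`eleicao`, `derreter`) receive a larger IDF than terms shared by two documents (`gente`, `dolar`), which is the weighting effect described above.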

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(use_idf = True)

X_trains['tfidf'] = tfidf.fit_transform(X_train).todense()
X_tests['tfidf']  = tfidf.transform(X_test).todense()

In [16]:
X_trains['tfidf']

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [17]:
X_trains['tfidf'].shape

(3500, 6419)

### 3. *Word2vec*

*Word2vec* ([Wikipedia](https://en.wikipedia.org/wiki/Word2vec), [Gensim](https://radimrehurek.com/gensim/models/word2vec.html)) is a shallow neural network model that learns a vector for each word. The vectors are meant to capture the semantic relationships between words.

For example, if our vocabulary contains the words *king*, *queen*, *man* and *woman*, we would expect the following vector operation to hold approximately:

$ \vec{v}_{\text{king}} - \vec{v}_{\text{man}} + \vec{v}_{\text{woman}} \approx \vec{v}_{\text{queen}}$
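
The analogy can be illustrated with hand-made vectors. This is only a sketch: the three-dimensional "embeddings" below are invented for the example (axes chosen as royalty, male, female) and are not learned by any model. The query words are excluded from the candidates, as is common practice when evaluating analogies.

```python
import math

# Hand-made 3-D vectors (hypothetical); axes are (royalty, male, female).
vectors = {
    "king":  (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "man":   (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "child": (0.0, 0.5, 0.5),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component by component.
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)

# Rank the remaining words by cosine similarity to the target vector.
query_words = {"king", "man", "woman"}
candidates = {word: vec for word, vec in vectors.items() if word not in query_words}
best = max(candidates, key=lambda word: cosine(target, candidates[word]))

print(best)
```

With learned embeddings the match is never exact, so the nearest neighbour by cosine similarity is used instead of strict equality.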

In [18]:
from gensim.models import Word2Vec

X_train_tokens = X_train.str.split(' ').to_list()
X_test_tokens = X_test.str.split(' ').to_list()

w2v_model = Word2Vec(
    sentences = X_train_tokens, 
    vector_size = 2,  # this parameter is equivalent to the number of features
    min_count = 1, 
    workers = 2
)

In [19]:
w2v_model.wv.most_similar(positive = ['bolsonaro'], negative = ['haddad'])

[('bgl', 0.9999997615814209),
 ('itabirito', 0.9999997019767761),
 ('fundir', 0.9999992251396179),
 ('eai', 0.9999991655349731),
 ('adolescencia', 0.9999955892562866),
 ('humilhacao', 0.9999939799308777),
 ('aquii', 0.9999939799308777),
 ('store', 0.9999932646751404),
 ('assistir', 0.9999839067459106),
 ('beija-mao', 0.9999828338623047)]

In [20]:
w2v_model.wv.similarity('bolsonaro', 'haddad')

0.97749794

In [21]:
# function that, given a word2vec model and a set of tokenized phrases (list of lists),
# builds the vector associated with each phrase

def build_word2vec_vectors(model, phrases, vector_combination):

    X = []
    vector_size = model.vector_size

    for phrase in phrases:

        ntokens = len(phrase)
        vectors = np.zeros(shape = (ntokens, vector_size))

        for i, token in enumerate(phrase):
            try:

                vectors[i, :] = model.wv[token]
            except KeyError:  # token not present in corpus
                vectors[i, :] = 0

        X.append(vector_combination(vectors))
    
    return np.asarray(X)

# function that, given training and test sets of phrases, builds the
# associated vectors
def build_word2vec_model(
    X_train, X_test, 
    vector_combination,
    is_token = False,
    **kwargs
):
    # kwargs = arguments for Word2Vec class
    
    if is_token:
        X_train_tokens = X_train
        X_test_tokens = X_test
    else:
        X_train_tokens = X_train.str.split(' ').to_list()
        X_test_tokens = X_test.str.split(' ').to_list()
    
    # instantiate, build and train model
    w2v_model = Word2Vec(
        sentences = X_train_tokens, 
        **kwargs
    )

    # build vectors
    X_train_w2v = build_word2vec_vectors(
        model = w2v_model, 
        phrases = X_train_tokens, 
        vector_combination = vector_combination
    )

    X_test_w2v = build_word2vec_vectors(
        model = w2v_model, 
        phrases = X_test_tokens, 
        vector_combination = vector_combination
    )

    return w2v_model, X_train_w2v, X_test_w2v
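
The `vector_combination` argument receives an array of shape `(ntokens, vector_size)` for one phrase and must return a single vector. As a sketch of two possible choices, the helpers below (`sum_vectors` and `mean_vectors` are hypothetical names, not part of `mltoolkit` or Gensim) show a sum and a mean. The mean guards against phrases with no tokens, for which `np.mean` alone would produce `nan`.

```python
import numpy as np

def sum_vectors(vectors):
    """Element-wise sum of the word vectors; longer phrases yield larger norms."""
    return np.sum(vectors, axis=0)

def mean_vectors(vectors):
    """Element-wise mean of the word vectors, returning zeros for an empty phrase."""
    if vectors.shape[0] == 0:
        return np.zeros(vectors.shape[1])
    return np.mean(vectors, axis=0)

# Two tokens with 2-D vectors, and a phrase with no tokens at all.
tokens = np.array([[1.0, 2.0], [3.0, 4.0]])
empty = np.zeros(shape=(0, 2))

print(sum_vectors(tokens))
print(mean_vectors(tokens))
print(mean_vectors(empty))
```

Either function can be passed as `vector_combination`. The mean makes phrase vectors less sensitive to tweet length, whereas the sum keeps that information.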

In [22]:
X_train[:1]

id
1043224595644522496    sinal importante detectar area buscar submarin...
Name: radicais, dtype: object

In [23]:
# quick check on the first two tweets only
w2v_model, X_trains['word2vec_sum'], X_tests['word2vec_sum'] = build_word2vec_model(
    X_train[:2], X_test[:2], 
    is_token = False,
    # --- word2vec model parameters
    vector_size = 2,  # this parameter is equivalent to the number of features
    min_count = 1, workers = 2,
    alpha = 0.025, window = 5, max_vocab_size = None, sample = 1e-3, seed = 1,
    min_alpha = 0.0001, sg = 0, hs = 0, negative = 5, cbow_mean = 1, hashfxn = hash,
    epochs = 5, null_word = 0, trim_rule = None, sorted_vocab = 1, batch_words = 10000,
    # --- vector_combination
    vector_combination = lambda x: np.sum(x, axis = 0),
)

In [24]:
X_trains['word2vec_sum']

array([[-0.09479237,  0.56492567],
       [-0.02681136,  0.01182151]])

In [25]:
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.exceptions import NotFittedError

from gensim import models

class W2VTransformer(TransformerMixin, BaseEstimator):
    """Base Word2Vec module, wraps :class:`~gensim.models.word2vec.Word2Vec`.
    For more information please have a look to `Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: "Efficient
    Estimation of Word Representations in Vector Space" <https://arxiv.org/abs/1301.3781>`_.
    """
    def __init__(self, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=1e-3, seed=1,
                 workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, epochs=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=10000, vector_combination = lambda x: np.sum(x, axis = 0)):
        """
        Parameters
        ----------
        vector_size : int
            Dimensionality of the feature vectors.
        alpha : float
            The initial learning rate.
        window : int
            The maximum distance between the current and predicted word within a sentence.
        min_count : int
            Ignores all words with total frequency lower than this.
        max_vocab_size : int
            Limits the RAM during vocabulary building; if there are more unique
            words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM.
            Set to `None` for no limit.
        sample : float
            The threshold for configuring which higher-frequency words are randomly downsampled,
            useful range is (0, 1e-5).
        seed : int
            Seed for the random number generator. Initial vectors for each word are seeded with a hash of
            the concatenation of word + `str(seed)`. Note that for a fully deterministically-reproducible run,
            you must also limit the model to a single worker thread (`workers=1`), to eliminate ordering jitter
            from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires
            use of the `PYTHONHASHSEED` environment variable to control hash randomization).
        workers : int
            Use these many worker threads to train the model (=faster training with multicore machines).
        min_alpha : float
            Learning rate will linearly drop to `min_alpha` as training progresses.
        sg : int {1, 0}
            Defines the training algorithm. If 1, skip-gram is used; otherwise, CBOW is employed.
        hs : int {1,0}
            If 1, hierarchical softmax will be used for model training.
            If set to 0, and `negative` is non-zero, negative sampling will be used.
        negative : int
            If > 0, negative sampling will be used, the int for negative specifies how many "noise words"
            should be drawn (usually between 5-20).
            If set to 0, no negative sampling is used.
        cbow_mean : int {1,0}
            If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
        hashfxn : callable (object -> int), optional
            A hashing function. Used to create an initial random reproducible vector by hashing the random seed.
        epochs : int
            Number of iterations (epochs) over the corpus.
        null_word : int {1, 0}
            If 1, a null pseudo-word will be created for padding when using concatenative L1 (run-of-words)
        trim_rule : function
            Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary,
            be trimmed away, or handled using the default (discard if word count < min_count).
            Can be None (min_count will be used, look to :func:`~gensim.utils.keep_vocab_item`),
            or a callable that accepts parameters (word, count, min_count) and returns either
            :attr:`gensim.utils.RULE_DISCARD`, :attr:`gensim.utils.RULE_KEEP` or :attr:`gensim.utils.RULE_DEFAULT`.
            Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part
            of the model.
        sorted_vocab : int {1,0}
            If 1, sort the vocabulary by descending frequency before assigning word indexes.
        batch_words : int
            Target size (in words) for batches of examples passed to worker threads (and
            thus cython routines).(Larger batches will be passed if individual
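            texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
        vector_combination : callable
            Function that reduces the (n_tokens, vector_size) array of word vectors of one phrase
            to a single vector. Defaults to the element-wise sum.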
        """
        self.gensim_model = None
        self.vector_size = vector_size
        self.alpha = alpha
        self.window = window
        self.min_count = min_count
        self.max_vocab_size = max_vocab_size
        self.sample = sample
        self.seed = seed
        self.workers = workers
        self.min_alpha = min_alpha
        self.sg = sg
        self.hs = hs
        self.negative = negative
        self.cbow_mean = int(cbow_mean)
        self.hashfxn = hashfxn
        self.epochs = epochs
        self.null_word = null_word
        self.trim_rule = trim_rule
        self.sorted_vocab = sorted_vocab
        self.batch_words = batch_words
        self.vector_combination = vector_combination

    def fit(self, X, y=None):
        """Fit the model according to the given training data.
        Parameters
        ----------
        X : pandas.Series of str
            The input corpus: one preprocessed phrase per entry, with tokens separated by single spaces.
            Each phrase is split into tokens before being passed to :class:`~gensim.models.word2vec.Word2Vec`.
        Returns
        -------
        W2VTransformer
            The fitted transformer.
        """

        X_tokens = X.str.split(' ').to_list()

        self.gensim_model = models.Word2Vec(
            sentences=X_tokens, vector_size=self.vector_size, alpha=self.alpha,
            window=self.window, min_count=self.min_count, max_vocab_size=self.max_vocab_size,
            sample=self.sample, seed=self.seed, workers=self.workers, min_alpha=self.min_alpha,
            sg=self.sg, hs=self.hs, negative=self.negative, cbow_mean=self.cbow_mean,
            hashfxn=self.hashfxn, epochs=self.epochs, null_word=self.null_word, trim_rule=self.trim_rule,
            sorted_vocab=self.sorted_vocab, batch_words=self.batch_words
        )
        return self

    def transform(self, words):
        """Build one combined vector per input phrase.
        Parameters
        ----------
        words : pandas.Series of str
            Preprocessed phrases, with tokens separated by single spaces.
        Returns
        -------
        np.ndarray
            One row per phrase: the word vectors of its tokens combined by `vector_combination`
            (shape [`len(words)`, `vector_size`] for the default sum).
        """
        if self.gensim_model is None:
            raise NotFittedError(
                "This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method."
            )

        # # The input as array of array
        # if isinstance(words, six.string_types):
        #     words = [words]
        # vectors = [self.gensim_model.wv[word] for word in words]
        # return np.reshape(np.array(vectors), (len(words), self.vector_size))

        phrases = words.str.split(' ').to_list()

        wvs = build_word2vec_vectors(
            model = self.gensim_model,
            phrases = phrases,
            vector_combination = self.vector_combination
        )

        return wvs

    def partial_fit(self, X):
        raise NotImplementedError(
            "'partial_fit' has not been implemented for W2VTransformer. "
            "However, the model can be updated with a fixed vocabulary using Gensim API call."
        )

In [26]:
from mltoolkit.NLP import W2VTransformer as WtoVTransformer

In [27]:
WtoVTransformer()

In [None]:
X_train[:2].str.split(' ').to_list()

[['perder', 'correntinha', 'ourar', 'o', 'trilho'],
 ['lancamento',
  'single',
  'conflito',
  'noticiar',
  'portal',
  'voce',
  'nao',
  'ouvir',
  'acesse',
  'aumentar',
  'som']]

In [28]:
w2vt = W2VTransformer(    
    # --- word2vec model parameters
    vector_size = 2, # this parameter is equivalent to the number of features
    min_count = 1, workers = 2,
)

xtr = w2vt.fit_transform(X_train[:2])
xtst = w2vt.transform(X_test[:2])


In [29]:
xtr

array([[-0.09479237,  0.56492567],
       [-0.02681136,  0.01182151]])

In [30]:
w2vtm = WtoVTransformer(    
    # --- word2vec model parameters
    vector_size = 2, # this parameter is equivalent to the number of features
    min_count = 1, workers = 2,
)

xtr = w2vtm.fit_transform(X_train[:2])
xtst = w2vtm.transform(X_test[:2])


In [31]:
xtr

array([[-0.09479237,  0.56492567],
       [-0.02681136,  0.01182151]])

### 4. *Doc2Vec*

*Doc2Vec* ([Gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)) is a model similar to *Word2Vec* that also learns a vector for each document (here, each tweet), so the context of the whole phrase is taken into account when building the vectors.

In [None]:
from gensim.models import doc2vec

# function to read the corpus and tag the documents (in this case, tweets)
def read_corpus(list_sentences, tokens_only = False):
    if tokens_only:
        return list_sentences
    else:
        # For training data, add tags
        lista = []
        for i, line in enumerate(list_sentences):
            lista.append(doc2vec.TaggedDocument(line, [i]))

        return lista
    
train_corpus = read_corpus(X_train_tokens)
test_corpus = read_corpus(X_test_tokens, tokens_only = True)

d2v_model = doc2vec.Doc2Vec(vector_size = 50, min_count = 2, epochs = 20)

d2v_model.build_vocab(train_corpus)

d2v_model.train(
    train_corpus, 
    total_examples = d2v_model.corpus_count, 
    epochs = d2v_model.epochs)

In [None]:
# example: vector of a phrase containing two words: 'bolsonaro' and 'haddad'

d2v_model.infer_vector(['bolsonaro', 'haddad'])

array([ 1.7442044e-02,  2.5535559e-02,  2.3881635e-02, -1.2053747e-02,
       -2.3669552e-02, -3.6800306e-02,  2.2385698e-02,  4.4900913e-02,
       -5.1939674e-02, -5.1815566e-03, -1.0489226e-02, -3.7510429e-02,
       -6.9495644e-03, -3.1758684e-03, -1.2563300e-02, -1.6216686e-02,
        4.1447401e-02, -3.3214556e-03, -5.2924678e-02, -1.9637613e-02,
       -4.1356515e-03,  1.9877413e-02,  5.5215832e-02, -7.6075710e-05,
        2.5362268e-02,  1.2883319e-02, -2.8340450e-02,  9.9135889e-04,
       -3.4626570e-02,  7.1958336e-03,  2.1095302e-02, -2.6432965e-02,
       -2.4708290e-02,  2.3136113e-02, -1.8435480e-02,  3.7445284e-02,
        2.1635253e-02,  1.3988094e-02,  1.2153922e-02, -1.2700081e-02,
        3.7212495e-02, -1.9925709e-03, -1.2544175e-02, -1.2507794e-03,
        6.6553988e-02,  1.7250776e-02, -1.5909132e-02, -5.1287767e-02,
        1.4884579e-02,  1.6710229e-02], dtype=float32)

In [None]:
# function that, given a doc2vec model and a collection of tokenized phrases,
# builds the matrix of inferred vectors
def build_doc2vec_vector(d2v_model, phrases):
    # infer one vector per phrase and stack them into an array of shape (n_phrases, vector_size)
    X_d2v = np.array([d2v_model.infer_vector(phrase) for phrase in phrases])

    return X_d2v

# function that, given training and test sets of phrases, builds the
# associated vectors

def build_doc2vec_model(
    X_train, X_test, 
    is_token = False,
    **kwargs
):

    if is_token:
        X_train_tokens = X_train
        X_test_tokens = X_test
    else:
        X_train_tokens = X_train.str.split(' ').to_list()
        X_test_tokens = X_test.str.split(' ').to_list()
    
    # make corpus
    train_corpus = read_corpus(X_train_tokens)
    test_corpus = read_corpus(X_test_tokens, tokens_only = True)

    # instantiate doc2vec model
    d2v_model = doc2vec.Doc2Vec(**kwargs)

    # build vocabulary
    d2v_model.build_vocab(train_corpus)

    # train model
    d2v_model.train(
        train_corpus, 
        total_examples = d2v_model.corpus_count, 
        epochs = d2v_model.epochs
    )

    # build vectors
    X_train_d2v = build_doc2vec_vector(
        d2v_model = d2v_model, 
        phrases = X_train_tokens
    )
    X_test_d2v = build_doc2vec_vector(
        d2v_model = d2v_model, 
        phrases = X_test_tokens
    )

    return d2v_model, X_train_d2v, X_test_d2v

In [None]:
d2v_model, X_train_d2v, X_test_d2v = build_doc2vec_model(
    X_train, X_test, 
    is_token = False,
    vector_size = 50, min_count = 2, epochs = 20,  # doc2vec model arguments
)

In [None]:
X_train_d2v.shape

(3500, 50)

In [None]:
X_test_d2v.shape

(1500, 50)