# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [3]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [4]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['ffffffffff', 'alright', 'no', 'way', 'can', 'meet', 'up', 'with', 'you', 'sooner'], tags=[0])
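
A tagged document simply pairs a token list with one or more tags. In gensim, `TaggedDocument` is a named tuple with `words` and `tags` fields, so its shape can be sketched with the standard library alone. The `TaggedDoc` stand-in and the sample messages below are hypothetical and serve only as illustration; in the notebook itself, keep using `gensim.models.doc2vec.TaggedDocument`.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the shape of gensim's TaggedDocument
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])

# Made-up, already-tokenized messages (not taken from the spam dataset)
sample_messages = [
    ['free', 'entry', 'in', 'weekly', 'competition'],
    ['see', 'you', 'at', 'lunch'],
]

# Tag each message with its position, mirroring the enumerate() pattern
sample_tagged = [TaggedDoc(words, [i]) for i, words in enumerate(sample_messages)]

first = sample_tagged[0]
print(first.words)  # the token list
print(first.tags)   # the list of tags, here a single integer index
```

Using the integer position as the tag gives every training message a unique identifier, which is what lets the model learn one vector per document.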

In [5]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).
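
The error arises because `infer_vector()` expects a tokenized document, the same form the model was trained on. One way to guard against it is to normalize input before inference. The sketch below uses a hypothetical helper, `to_tokens`, whose regex tokenizer only approximates `gensim.utils.simple_preprocess` (lowercasing and keeping alphabetic tokens of 2 to 15 characters); in the notebook you could call `simple_preprocess` directly instead.

```python
import re

def to_tokens(doc):
    """Return a list of lowercase tokens from a raw string or a token list.

    Hypothetical helper: a rough approximation of simple_preprocess,
    not gensim's actual implementation.
    """
    if isinstance(doc, str):
        # Split a raw string into alphabetic tokens of 2 to 15 characters
        return [t for t in re.findall(r'[a-z]+', doc.lower()) if 2 <= len(t) <= 15]
    # Already tokenized: just make sure every item is a lowercase string
    return [str(t).lower() for t in doc]

print(to_tokens('text'))               # ['text']
print(to_tokens('I am learning NLP'))  # ['am', 'learning', 'nlp']
```

With a helper like this, `d2v_model.infer_vector(to_tokens('text'))` would receive `['text']` rather than a bare string.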

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-4.67233919e-03,  1.76530443e-02,  3.56972427e-03, -4.47185012e-03,
       -2.67395610e-03, -2.86629833e-02, -2.15632981e-03,  3.24200056e-02,
       -1.15820765e-02, -1.68551374e-02, -8.69195629e-03, -2.30628233e-02,
        6.38049387e-05,  6.91069756e-03,  3.55699705e-03, -2.48369686e-02,
        1.09322099e-02, -1.58905126e-02, -8.00711010e-03, -2.38338672e-02,
        5.36618475e-03,  9.83556546e-03,  4.76273894e-03, -8.40839278e-03,
        9.84873739e-04,  5.48338518e-03, -1.32455993e-02, -2.36246060e-03,
       -8.47336091e-03,  6.49851793e-03,  1.84935685e-02,  1.44915516e-03,
        6.72278088e-03, -1.29135912e-02, -4.74671461e-03,  1.74263120e-02,
        4.10890300e-03, -1.89434271e-02, -1.37343826e-02, -2.43625045e-02,
       -7.16899103e-03, -1.55243566e-02, -3.10475659e-03, -8.59075133e-03,
        1.75379906e-02, -4.39357292e-03, -1.50317540e-02, -1.36221980e-03,
        1.38186226e-02,  1.39956148e-02,  1.02134915e-02, -1.66556947e-02,
        5.08253090e-03, ...
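
Each call to `infer_vector()` returns one vector of length `vector_size` (100 here), so inferring a vector for every message yields fixed-width features that a downstream classifier could consume. A minimal sketch, assuming a hypothetical helper `infer_all` that accepts any model object exposing `infer_vector`:

```python
def infer_all(model, token_lists):
    """Infer one document vector per token list.

    Hypothetical helper: `model` is assumed to be a trained Doc2Vec model
    (or any object with an infer_vector(list_of_str) method).
    """
    vectors = []
    for tokens in token_lists:
        # infer_vector expects a list of strings, not a single string
        vectors.append(model.infer_vector(list(tokens)))
    return vectors

# Example usage in the notebook (assumes d2v_model, X_train, X_test exist):
# train_vectors = infer_all(d2v_model, X_train)
# test_vectors = infer_all(d2v_model, X_test)
```

Inference is itself a short randomized training pass, so repeated calls on the same tokens can return slightly different vectors.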

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no convenient API for loading them, as there is for `word2vec`, so using them takes more work.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!