# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['is', 'xy', 'in', 'ur', 'car', 'when', 'picking', 'me', 'up'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                 vector_size=100,
                                 window=5,
                                 min_count=2)

In [7]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [8]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i','am','learning','nlp'])

array([-8.5632975e-04, -3.9026937e-03, -3.6878157e-03,  5.6106187e-03,
        8.6533381e-03, -9.7892992e-03, -1.5537147e-03,  1.2896167e-03,
       -2.4252459e-03,  8.4318593e-03,  5.0075897e-03, -7.8988411e-03,
       -1.5394895e-02, -5.2403705e-03, -5.0717085e-03,  1.0019312e-02,
       -2.9283275e-03,  5.9946552e-03, -1.1062729e-02,  6.3244934e-04,
        7.5742360e-03,  1.7748809e-04,  3.4917910e-03,  8.4907571e-03,
        9.5894169e-03,  5.5189379e-03, -1.1816436e-02,  4.2434298e-03,
       -1.0419705e-02,  2.6313700e-03, -1.3480376e-02,  4.5070462e-03,
        3.5293093e-03,  6.4457683e-03, -6.9238050e-03,  1.8148954e-03,
        9.3880882e-03,  2.1608647e-02,  7.0175328e-03, -1.6295344e-02,
        1.1062544e-02,  1.8493634e-03, -5.2587669e-03, -5.5133025e-03,
        7.6915151e-03,  6.3382625e-03, -1.4956533e-03,  3.0792190e-03,
       -1.2131914e-03, -1.6585179e-02,  2.3198621e-03,  1.3161394e-02,
        1.6037961e-02, -1.5455454e-02,  2.6978958e-02,  5.7861645e-04,
      

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no easy API for loading them, as there is for `word2vec`, so using them takes more effort.

Pre-trained vectors from training on Wikipedia and Associated Press News can be found [here](https://github.com/jhlau/doc2vec). Feel free to explore on your own!