# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [1]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('spam.csv', encoding='latin-1')
# Drop the empty trailing columns and give the remaining two readable names
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [2]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [3]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['where', 'are', 'you', 'what', 'do', 'you', 'do', 'how', 'can', 'you', 'stand', 'to', 'be', 'away', 'from', 'me', 'doesn', 'your', 'heart', 'ache', 'without', 'me', 'don', 'you', 'wonder', 'of', 'me', 'don', 'you', 'crave', 'me'], tags=[0])

In [4]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs,
                                  vector_size=100,
                                  window=5,
                                  min_count=2)

In [5]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [6]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-8.87602568e-03,  1.09385774e-02,  5.11323055e-03, -6.38195779e-03,
       -7.25458469e-03, -1.13833696e-03, -6.78820815e-03, -1.70726702e-03,
       -3.32039990e-03,  1.98232054e-04, -3.10476450e-03, -4.78714332e-03,
        5.33253001e-03,  1.45982401e-02,  9.74501017e-03,  5.23984293e-03,
        6.12392928e-03, -5.25344955e-03, -2.68952758e-03, -8.21829215e-03,
       -8.90680961e-03, -7.21965916e-04, -3.64406733e-03, -8.28683563e-03,
        7.45848985e-03, -4.47585061e-03, -1.33870905e-02, -3.34707974e-03,
        6.89136097e-03,  1.11705114e-04,  2.12026876e-04, -4.58791066e-04,
        1.59915863e-03, -1.15711205e-02, -2.57483544e-03,  1.36145744e-02,
        1.09776948e-02,  5.33281709e-04,  3.53552634e-03,  1.56332254e-02,
       -3.47791682e-03,  1.65766443e-03,  1.36225391e-02,  6.11415040e-03,
       -6.41800230e-03,  7.22425990e-03,  2.14356207e-03, -2.71046138e-03,
       -1.39552215e-03,  4.19839658e-03, -1.09695885e-02, -5.55958366e-03,
        9.61300824e-03, -

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no easy API for reading them in, as there is for `word2vec`, so using them is more time-consuming.

Pre-trained vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!

# doc2vec: How To Prep Document Vectors For Modeling

In [7]:
# What does a document vector look like again?
d2v_model.infer_vector(['convert', 'words', 'to', 'vectors'])

array([-0.00862826,  0.0071864 ,  0.00389735, -0.00425909, -0.00516394,
       -0.0037196 , -0.00259269, -0.00301915, -0.0016474 , -0.00027165,
       -0.00623732,  0.0017613 ,  0.00341308,  0.00484667,  0.00611341,
        0.00290644,  0.00543516,  0.00146407,  0.00103653, -0.0077876 ,
       -0.0061283 ,  0.00466484,  0.001793  , -0.00073053,  0.01112898,
       -0.00975147, -0.00778145, -0.00735493,  0.00770985, -0.00276762,
        0.00418301,  0.00120887,  0.0017338 , -0.00641951, -0.00638191,
        0.00973248,  0.00934952, -0.00273729, -0.0006277 ,  0.00635287,
       -0.0082921 ,  0.00341398,  0.00330042,  0.00868116, -0.00667403,
        0.00203242,  0.00369902, -0.00717073, -0.00025978,  0.01032502,
       -0.00880408, -0.00486287,  0.00879907, -0.00749715, -0.00053251,
       -0.00608589,  0.00057463,  0.00435847,  0.01151353,  0.00404367,
        0.00387285,  0.00121799,  0.0124625 , -0.00067046, -0.0008349 ,
        0.01147593,  0.00045557, -0.00614358,  0.00686876, -0.00

In [8]:
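# How do we prepare these vectors to be used in a machine learning model?
# Infer one vector per tokenized message in the test set.
# Note: each vector is wrapped in a one-element list, so vectors[0] is [array(...)]
vectors = [[d2v_model.infer_vector(words)] for words in X_test]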

In [9]:
vectors[0]

[array([-5.77358156e-03,  1.06707392e-02,  3.93922348e-03, -7.57708331e-04,
        -2.58477381e-03, -1.92638807e-04, -4.11166903e-03,  1.42892252e-03,
        -3.60492477e-03,  4.61327657e-03, -9.04983748e-03,  6.16300851e-04,
         1.71596487e-03,  7.40852766e-03,  1.69022530e-02, -4.84264638e-06,
         7.71544641e-03, -5.36660850e-03,  2.04246282e-03, -6.78323070e-03,
        -6.98415888e-03, -6.06735048e-05,  2.13853503e-03, -3.96416103e-03,
         5.87929925e-03, -6.58658426e-03, -5.63623197e-03, -6.96596922e-03,
         1.16070481e-02,  3.42267728e-03,  2.08046148e-03,  5.73769910e-04,
         7.51322089e-03, -5.91283245e-03, -6.49547437e-03,  4.98791132e-03,
         1.54486941e-02, -6.46633422e-03, -5.08132624e-04,  1.61939599e-02,
        -9.78285540e-03,  7.37955840e-03,  9.21900850e-03,  7.10050995e-03,
        -3.65699898e-03,  6.58337492e-03,  9.00582317e-03, -2.08294671e-03,
        -3.35605931e-03,  1.05053801e-02, -1.37188034e-02, -1.96189946e-03,
         4.7