# doc2vec: How To Implement doc2vec

### Train Our Own Model

In [2]:
# Read in data, clean it, and then split into train and test sets
import gensim
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('C:\\Users\\Yauheni_Leaniuk\\Documents\\Python\\Data_Engineer\\Advanced NLP Python for ML\\data\\spam.csv', encoding='latin-1')
messages = messages.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
messages.columns = ["label", "text"]
# Lowercase and tokenize each message into a list of words
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)

X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [3]:
# Create tagged document objects to prepare to train the model
tagged_docs = [gensim.models.doc2vec.TaggedDocument(v, [i]) for i, v in enumerate(X_train)]

In [4]:
# Look at what a tagged document looks like
tagged_docs[0]

TaggedDocument(words=['married', 'local', 'women', 'looking', 'for', 'discreet', 'action', 'now', 'real', 'matches', 'instantly', 'to', 'your', 'phone', 'text', 'match', 'to', 'msg', 'cost', 'stop', 'txt', 'stop', 'bcmsfwc', 'xx'], tags=[0])

In [5]:
# Train a basic doc2vec model
d2v_model = gensim.models.Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2)

In [6]:
# What happens if we pass in a single word like we did for word2vec?
d2v_model.infer_vector('text')

TypeError: Parameter doc_words of infer_vector() must be a list of strings (not a single string).

In [7]:
# What happens if we pass in a list of words?
d2v_model.infer_vector(['i', 'am', 'learning', 'nlp'])

array([-9.89732798e-03,  9.23072547e-03,  1.44394226e-02, -3.14423814e-03,
       -1.71968294e-03, -2.63507850e-02,  4.60825255e-03,  3.94569859e-02,
       -2.05114465e-02, -1.89312994e-02, -6.45141676e-03, -2.42996104e-02,
        1.06551568e-03,  5.46619808e-03, -8.02520895e-04, -1.60838198e-02,
        4.24711825e-03, -2.43957769e-02, -1.08059738e-02, -3.85802686e-02,
        1.32546267e-02,  6.50887750e-03, -1.05333503e-03, -9.51079093e-03,
        2.50722212e-03, -7.33733177e-05, -1.90268308e-02, -1.15709202e-02,
       -2.31137332e-02, -1.23270310e-03,  1.85645446e-02,  7.33188074e-03,
       -2.09702714e-03, -4.70012380e-03, -5.82600187e-04,  1.58381518e-02,
        7.73539720e-03, -2.20430717e-02, -1.98278367e-03, -2.34691240e-02,
       -2.86667986e-04, -9.58838314e-03, -8.45596287e-03, -4.33282834e-03,
        9.78386495e-03, -1.10928128e-02, -1.26501946e-02, -2.38483702e-03,
        6.24766480e-03,  8.67208652e-03,  9.87984985e-03, -1.42486598e-02,
        2.14386987e-03, -

### What About Pre-trained Document Vectors?

Fewer pre-trained options exist for document vectors than for word vectors. There is also no simple API for loading them, as there is for `word2vec`, so using them takes more time.

Pre-trained document vectors trained on Wikipedia and Associated Press News are available in the [jhlau/doc2vec repository](https://github.com/jhlau/doc2vec). Feel free to explore on your own!