# Doc2Vec

Playing around with Doc2Vec, from the paper [Distributed Representations of Sentences and Documents](https://arxiv.org/pdf/1405.4053.pdf).

## Intro to Word2Vec
**Word2Vec** learns word embeddings with a neural network that has a single hidden layer; the hidden-layer weights are the embeddings for the words in the dataset.
* **CBOW, or Continuous Bag of Words**
    * Use context words to predict a target word  
    * Better on smaller datasets  
* **Skip-gram models**
    * Use a word to predict context words  
    * The input is a one-hot vector; multiplying it by the hidden-layer weight matrix selects that word's embedding
    * Better on larger datasets and for rare words
* **Objective:** maximize the [avg. log likelihood](https://datascience.stackexchange.com/questions/28259/connection-between-cross-entropy-and-likelihood-for-multi-class-soft-label-class)
* [Hierarchical softmax](https://paperswithcode.com/method/hierarchical-softmax) used for prediction layer
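
The difference between the two training setups is easiest to see in the (input, target) pairs they generate. The sketch below is illustrative only: `training_pairs` and the toy sentence are hypothetical and not part of this notebook, and real implementations also apply subsampling and dynamic window sizes.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and skip-gram training pairs from one token list."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: all context words together predict the target word
        cbow.append((context, target))
        # Skip-gram: the target word predicts each context word separately
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow, skipgram = training_pairs(["the", "cat", "sat", "down"], window=1)
print(cbow)
print(skipgram)
```

Skip-gram produces one pair per (word, context word) combination, so it sees more individual training examples from the same text than CBOW does.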

## Paragraph Vector Framework
* Addition of a paragraph ID vector to Word2Vec Model  
* Paragraph token can be thought of as another word  
* Paragraph vector is then used as the feature for the paragraph  
* Doc2Vec can be used in this way to learn both paragraph and word embeddings

## Doc2Vec Architecture
* Each word is fed in as a one-hot vector of dimension 1×V, where V is the vocabulary size (the number of distinct words in the corpus)  
* Each paragraph is fed in as a one-hot vector of dimension 1×C, where C is the number of paragraphs  
* The weight matrices W (V×N) and D (C×N) hold the word and paragraph embeddings, where N is the embedding dimension

![doc2vec architecture](https://shuzhanfan.github.io/assets/images/doc2vec.jpg)
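
The one-hot lookups described above can be sketched in NumPy. The sizes and random matrices are toy assumptions; in a trained model W and D would hold learned weights. The paper combines the paragraph vector with the context word vectors by averaging or concatenating them, and both are shown.

```python
import numpy as np

V, C, N = 5, 3, 4                  # toy vocabulary size, paragraph count, embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # word embedding matrix
D = rng.normal(size=(C, N))        # paragraph embedding matrix


def one_hot(index, size):
    """Return a one-hot row vector with a 1 at the given index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


# Multiplying a one-hot vector by a weight matrix selects one row of that matrix
word_vec = one_hot(2, V) @ W       # equals W[2]
para_vec = one_hot(1, C) @ D       # equals D[1]

# Combine the paragraph vector with three context word vectors
context_ids = [0, 2, 4]
averaged = np.mean(np.vstack([para_vec, W[context_ids]]), axis=0)
concatenated = np.concatenate([para_vec, W[context_ids].ravel()])
print(averaged.shape, concatenated.shape)
```

In practice the multiplication is never carried out; implementations index the row directly, which gives the same result.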

In [1]:
!pip install nltk




## Imports

In [2]:
import pandas as pd
import regex as re
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from nltk.stem import WordNetLemmatizer

### Load data

In [3]:
tile_df=pd.read_csv('tile_descs.csv',encoding = "ISO-8859-1")

In [None]:
tile_df

In [35]:
tile_id_dict = dict(zip(tile_df['Tile Name'], tile_df['Tile ID']))

### Preprocessing

In [5]:
tile_df['Text']=tile_df['Text'] + ' ' +tile_df['Theme'] + ' ' +tile_df['Story'] 

In [6]:
# Series.replace only matches whole cell values; .str.replace removes the characters inside each string
tile_df['Text'] = tile_df['Text'].str.replace(r'[\x1a\x93\x94]', '', regex=True)
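
`Series.replace` and `Series.str.replace` behave differently here: the first matches whole cell values by default, while the second works on substrings. A minimal sketch on a hypothetical two-row Series (the strings are invented for illustration):

```python
import pandas as pd

toy = pd.Series(["pay\x1a bill", "\x1a"])

# Whole-value match: only the cell that is exactly '\x1a' changes
whole_value = toy.replace("\x1a", "")

# Substring match: the character is removed wherever it appears
substring = toy.str.replace("\x1a", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```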

In [None]:
# Inspect the cleaned text of one tile (positional lookup avoids hard-coding the row label)
tile_df[tile_df['Tile Name']=='myb_mob_order_pendingplanchange'].Text.iloc[0]

In [10]:
# Tokenize and clean: preprocess_string applies gensim's default filters (lowercasing,
# stripping tags, punctuation, extra whitespace and digits, removing stopwords and
# very short tokens, then stemming) and returns a list of tokens
tile_df['Text'] = tile_df['Text'].apply(preprocess_string)
# Optional lemmatization (disabled)
# lemmatizer = WordNetLemmatizer()
# tile_df['Text']=tile_df['Text'].apply(lambda x: [lemmatizer.lemmatize(w) for w in x])
# Remove blank tokens and any leftover '\x1a' substitute characters
tile_df['Text'] = tile_df['Text'].apply(lambda x: [w for w in x if w is not None and w != '' and w != '\x1a'])

In [11]:
tile_dict = dict(zip(tile_df['Tile Name'], tile_df.Text))

In [12]:
tile_df.columns

Index(['Tile ID', 'Tile Name', 'Text', 'Theme', 'Story'], dtype='object')

## Doc2Vec

In [14]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [15]:
documents=[TaggedDocument(val,[key]) for key, val in tile_dict.items()]
documents_dict = dict(zip(tile_dict.keys(),documents))

In [17]:
len(tile_dict.keys())

38

In [18]:
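# 10-dimensional vectors, 100 training epochs; the seed is fixed for repeatability
# (gensim also needs workers=1 for a fully deterministic run)
model = Doc2Vec(documents, vector_size=10, epochs=100, seed=1)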

In [None]:
# Evaluate similar tiles: re-infer each tile's vector and rank all trained tile vectors against it
for doc_id, doc in documents_dict.items():
    inferred_vector = model.infer_vector(doc.words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    print(f'tile: {doc_id}')
    print(f'   words: {doc.words}')
    # sims is sorted by descending cosine similarity, so sims[-1] is the least similar tile
    for label, (tag, score) in [('most similar', sims[0]), ('2nd most similar', sims[1]), ('least similar', sims[-1])]:
        print(f'{label}: {(tag, score)}\n   words: {documents_dict[tag].words}')
    print('-------------------------')
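
`most_similar` ranks the stored vectors by cosine similarity to the query vector. The pure-Python sketch below reproduces that ranking on made-up 2-D vectors; `cosine_similarity`, `rank_by_similarity`, and the toy tags are hypothetical names for illustration, not gensim functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, tagged_vectors):
    """Return (tag, similarity) pairs sorted from most to least similar."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in tagged_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


toy_vectors = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [-1.0, 0.0]}
ranking = rank_by_similarity([1.0, 0.0], toy_vectors)
print(ranking)
```

A tile's inferred vector will usually be closest to its own trained vector, so when the model has fit the data the first entry of the ranking tends to be the tile itself.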

In [41]:
# tile_ids= []
# vectors= []

# for name, tile_id in tile_id_dict.items():
#     tile_ids.append(tile_id)
#     vectors.append(model.dv[name])
    
# tile_id_to_vector_dict = dict(zip(tile_ids, vectors))

In [43]:
# import pickle
# pickle.dump(tile_id_to_vector_dict, open('tile_embeddings.pkl', 'wb'))
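
If the export above is enabled, a round trip with context managers might look like the sketch below. The dictionary contents and the temporary path are placeholders for illustration, not real tile data.

```python
import os
import pickle
import tempfile

# Hypothetical tile IDs mapped to toy vectors
toy_embeddings = {101: [0.1, 0.2], 102: [0.3, 0.4]}

# Write to a temporary directory so the example leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "tile_embeddings.pkl")

with open(path, "wb") as f:   # the context manager closes the file even on error
    pickle.dump(toy_embeddings, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_embeddings)
```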