## Document Vectors
Doc2vec lets us directly learn representations for texts of arbitrary length (phrases, sentences, paragraphs, and documents) by taking the context of words in the text into account.

In this notebook, we will create a document vector by averaging word vectors with spaCy. [spaCy](https://spacy.io/) is a Python library for Natural Language Processing (NLP) with many built-in capabilities and features. spaCy offers several types of models. The default model for the English language is '**en_core_web_sm**'.
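To make the averaging idea concrete before bringing in spaCy, here is a minimal sketch in plain Python. The three-dimensional `toy_vectors` and the helper `average_vector` are hypothetical and exist only for illustration. The numbers are made up and are not taken from any spaCy model.

```python
# Toy 3-dimensional word vectors. These values are invented for illustration
# and do NOT come from en_core_web_sm or any other model.
toy_vectors = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 0.0, 0.0],
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors for the given tokens."""
    rows = [vectors[token] for token in tokens]
    # zip(*rows) groups the values dimension by dimension.
    return [sum(column) / len(rows) for column in zip(*rows)]


doc_vector = average_vector("dog bites man".split(), toy_vectors)
print(doc_vector)  # [1.0, 1.0, 1.0]
```

With fixed per-word vectors like these, a plain average ignores word order, so "man bites dog" yields the same document vector as "dog bites man" in this toy setup.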

In [None]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

# !pip install spacy==2.2.4

# ===========================

In [None]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [1]:
# downloading en_core_web_sm, assuming spacy is already installed
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 4.7 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# Import spaCy and load the model
import spacy
nlp = spacy.load("en_core_web_sm")  # nlp is the loaded 'en_core_web_sm' language model instance

In [3]:
# Treat each sentence in `documents` as a separate document.
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
# Lowercase each document and strip the periods.
processed_docs = [doc.lower().replace(".", "") for doc in documents]
processed_docs

print("Document After Pre-Processing:", processed_docs)


# Run each document through the spaCy pipeline.
for doc in processed_docs:
    doc_nlp = nlp(doc)  # a spaCy "Doc" object, a container for accessing linguistic annotations

    print("-" * 30)
    # doc_nlp.vector is the average of the document's token vectors
    print("Average Vector of '{}'\n".format(doc), doc_nlp.vector)
    for token in doc_nlp:
        print()
        print(token.text, token.vector)  # the text of each token in the doc and its vector
        

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

Document After Pre-Processing: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
------------------------------
Average Vector of 'dog bites man'
 [ 0.35036066  0.10273071  0.33009622 -0.2030462   0.397346   -0.05984474
 -0.2201394   0.25512496 -0.3575406   0.39600918 -0.75429106 -0.5888314
  0.29484284 -0.63774514 -0.23350935  0.5900816  -0.2566284  -0.71845955
  0.20040572  0.7679166  -0.26526675 -0.6816276  -0.0701522   0.04820635
  0.1266749   0.2589217  -0.6932214  -0.3419633   1.0904325  -0.32465276
  1.4362421  -0.5931116   0.32251295 -0.341225   -0.12486354 -0.7798831
 -0.29717746  0.4014299  -0.1318171   0.910722   -0.41182932  0.04191664
  0.59365046 -0.04422406 -0.18440922 -0.05003772  0.59136873 -0.6386824
  1.8019737  -0.04936111  0.27116123  0.21994926 -0.2368415  -0.23461938
  0.22323321  1.0983983  -0.39096567  0.10752393 -0.06386908  0.14312072
  0.37180772 -0.34773377 -0.42992604 -0.4652144  -0.58004665  0.37198398
  0.04235339 -0.4719428   0.281242