This notebook shows an example recommendation system built with doc2vec. We will use the CMU Book Summaries [dataset](http://www.cs.cmu.edu/~dbamman/booksummaries.html). Alternatively, the dataset's link can be found in the `BookSummaries_Link.md` file under the Data folder in Ch7.


In [None]:
!pip install gensim
!pip install nltk
!wget -O booksummaries.tar.gz -q http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
!tar --overwrite -xzf booksummaries.tar.gz



In [None]:
import os

import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# word_tokenize needs the punkt tokenizer data.
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')

# Read the dataset’s README to understand the data format.
data_path = os.path.join("booksummaries", "booksummaries.txt")
mydata = {}  # maps book title -> plot summary
with open(data_path, encoding="utf-8") as f:
    for line in f:
        temp = line.rstrip("\n").split("\t")
        mydata[temp[2]] = temp[6]  # field 2 is the title, field 6 is the summary

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
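
The loop above relies on the field order described in the dataset's README. As a minimal sketch, assuming each record is a single tab-separated line with seven fields (Wikipedia ID, Freebase ID, title, author, publication date, genres, summary), the parsing step can be checked on a made-up record. The helper `parse_record` and the sample line are hypothetical and only illustrate the indexing; they are not part of the dataset or the notebook.

```python
def parse_record(line):
    """Return (title, summary) from one tab-separated record, or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        return None  # skip lines that do not have all seven fields
    return fields[2], fields[6]

# Made-up record for illustration only (not real dataset content).
sample_line = "123\t/m/0abc\tExample Title\tJane Doe\t1999\t{}\tA short example summary.\n"

summaries = {}
record = parse_record(sample_line)
if record is not None:
    title, summary = record
    summaries[title] = summary  # a later record with the same title would overwrite this entry

print(summaries)
```

Because the dictionary is keyed by title, books that share a title collapse into a single entry; keying by one of the ID fields instead would avoid that.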


In [None]:
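# Prepare the data for doc2vec, then build and save a doc2vec model.
# Each summary is tokenized and tagged with its book title.
train_doc2vec = [TaggedDocument(word_tokenize(summary), tags=[title]) for title, summary in mydata.items()]
# dm=1 selects the distributed-memory (PV-DM) training algorithm.
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm=1, epochs=100)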
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")
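
To make the shape of the training corpus concrete, here is a minimal sketch that mimics it with the standard library only. It assumes that a tagged document is simply a pair of a token list and a tag list; `TaggedDoc` is a hypothetical stand-in for gensim's `TaggedDocument`, and `str.split` stands in for `word_tokenize`, so the tokens will differ from what NLTK produces.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument: a (words, tags) pair.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Made-up titles and summaries for illustration only.
toy_data = {
    "Book A": "A farm is taken over by its animals.",
    "Book B": "A detective solves a case in London.",
}

# str.split() is a crude substitute for word_tokenize (punctuation stays attached).
toy_corpus = [TaggedDoc(words=summary.split(), tags=[title])
              for title, summary in toy_data.items()]

for doc in toy_corpus:
    print(doc.tags, doc.words[:4])
```

The tag is what a similarity query later returns, which is why the notebook tags each summary with its book title.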

In [None]:
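%time
# Note: a bare %time line only times an empty statement; put %%time on the
# first line of the cell to time the whole cell.
# Use the model to look for similar texts.
model = Doc2Vec.load("d2v.model")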

#This is a sentence from the summary of “Animal Farm” on Wikipedia:
#https://en.wikipedia.org/wiki/Animal_Farm
sample = """
Napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs who will run the farm.
 """
new_vector = model.infer_vector(word_tokenize(sample))
# gensim 4 renamed model.docvecs to model.dv (on gensim 3.x, use model.docvecs).
sims = model.dv.most_similar([new_vector])
print(sims)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
[('Animal Farm', 0.7018142342567444), ('The Wild Irish Girl', 0.6415804624557495), ('Ponni', 0.6028823256492615), ("Snowball's Chance", 0.590345561504364), ('Family Matters', 0.5861057043075562), ('The Big Country', 0.5768381357192993), ('The Prophet', 0.5702881217002869), ('Poor White', 0.5645689964294434), ("Family Guy: Stewie's Guide to World Domination", 0.5617426633834839), ('The Land of Little Rain', 0.556604266166687)]
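
The scores in the output are cosine similarities between the inferred vector and each stored document vector. As a rough sketch of that ranking step, assuming tiny two-dimensional vectors in place of the 50-dimensional ones the model learns, the same computation can be written in plain Python. The names `cosine` and `rank_by_similarity` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, doc_vectors, topn=10):
    """Return (tag, score) pairs sorted from most to least similar."""
    scored = [(tag, cosine(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors for illustration only.
doc_vectors = {"Book A": [1.0, 0.0], "Book B": [0.0, 1.0], "Book C": [1.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.0], doc_vectors)
print(ranking)
```

Since `infer_vector` involves random initialization, rerunning the notebook can give slightly different scores and ordering than those shown.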
