# Collecting Quotes
In this notebook, we will be creating a dataframe with two columns:
1. Text - Full or abbr. quote of Krishnamurti
2. URL - Link to full or find the quote

Here are the website we will be scraping to create our dataframe:
1. [Good Reads](https://www.goodreads.com/author/quotes/850512.J_Krishnamurti?page=1)
2. [Brainy Quotes](https://www.brainyquote.com/authors/jiddu-krishnamurti-quotes)
3.  [wikiquotes](https://en.wikiquote.org/wiki/Jiddu_Krishnamurti)
4.  [jkrishnamurti.org](https://jkrishnamurti.org/jksearch?keyword=&page=1&type=16618)


**Installing Dependencies**

In [None]:
import requests                     # To make 'get' requests through chrome browser
from bs4 import BeautifulSoup       # To parse html file in python tree object
import pandas as pd                 # To create Dataframe and save data into JSON file

import random
import time
from pprint import pprint

from google.colab import data_table
from vega_datasets import data
data_table.enable_dataframe_formatter()

**Getting the Hyperinks**

In this section, we will:
1. Get all of the pages of quotes from goodreads
2. Associate each quote with a hyperlink
3. Create a dataframe from (quote, hyperlink pairs)


In [None]:
%%capture
# There are 33 pages of quotes we want to collect from
urls = []
base_url = "https://www.goodreads.com/author/quotes/850512.J_Krishnamurti?page={}"
search_pages = [base_url.format(str(i)) for i in range(1,34)]

**Single Page**

In [None]:
plain_quotes = []
goodreads_quotes = []

for search_page in search_pages:
  # Grabbing a list of quotes:
  quote_list = []
  page = requests.get(search_page)
  soup = BeautifulSoup(page.content, 'html.parser')

  # Using CSS Selector to grab all quotes
  quotes = [quote.get_text() for quote in soup.find_all("div", class_="quoteText")]
  quotes = [quote.split("\n")[1].strip() for quote in quotes]
  plain_quotes.extend(quotes)
  quotes = [(quote, search_page) for quote in quotes]

  goodreads_quotes.extend(quotes)

In [None]:
df = pd.DataFrame.from_records(
    data=goodreads_quotes,
    columns=["Quote", "URL"]
)

In [99]:
df = df['Quote']
df.to_json('krish_quotes')

In [None]:
%%capture
df.head(100)

## Great!
Now that we have our quotes dataframe, we will associate each quote with a document centroid vector using word embeddings. The library we will be using for this word2vec task in **Gensim** (*gen*erate *sim*ilar), which has open-source software we can use to easily represent documents (or individual words) as vectors.

*Fun fact: Gensim was a python project that arose in the effort to find the most similar mathematical articles to a given on in a Czech digital academic collection.*

In [None]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [None]:
# Remove quotations from plain quotes
plain_quotes = [quote[1:-1] for quote in plain_quotes]
plain_quotes[0]

'It is no measure of health to be well adjusted to a profoundly sick society.'

In [None]:
# Tegging list of quotes for gensim model

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(plain_quotes)]
pprint(tagged_data)

In [None]:
# Training gensim model
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

In [None]:
from gensim.models.doc2vec import Doc2Vec

model= Doc2Vec.load("d2v.model")

# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)


# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

[('970', 0.8046198487281799), ('47', 0.7607197761535645), ('965', 0.74647057056427), ('414', 0.7463820576667786), ('647', 0.7418216466903687), ('967', 0.7404366135597229), ('490', 0.735985279083252), ('56', 0.7330352067947388), ('275', 0.732215166091919), ('877', 0.7304850816726685)]
[-1.730693    1.5829514  -1.0686712   4.099093    0.5436246  -0.10553455
  1.5978084  -1.0038096  -1.149201    1.446069   -0.74242496  1.3126647
  4.3996644  -3.6409996  -1.6735785  -1.9980305   1.724739   -2.013659
 -1.1099366  -0.37500638]


# Training a Doc2Vec Model


In [None]:
# Installing dependencies

# Importing generate similarities library for doc2vec
import gensim 

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
plain_quotes[0]

'It is no measure of health to be well adjusted to a profoundly sick society.'

In [None]:
# First we want to Tag and preprocess our quotes, so that they can be turned into vectors

"""For each quote we create a TaggedDocument Object, with:
(1) words= Words tokens
(2) tag= index"""
train_corpus = [gensim.models.doc2vec.TaggedDocument(quote.lower().split(), [i]) for i, quote in enumerate(plain_quotes)]
train_corpus[0]

In [None]:
%%capture
# Now we will create a gensim Doc2vec model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, # dimension of word embeddings
                                      min_count=2, # discard word with one or less occurence
                                      epochs= 40 # number of iterations over documents (diminishing returns)
                                      )




In [None]:
# Building vocabulary
model.build_vocab(train_corpus)

In [None]:
# Training model
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [None]:
# Using model to infer a vector, and compare to other vectors using cosine similarity
vector1 = model.infer_vector("The hardest problem in life the struggle".lower().split())
vector2 = model.infer_vector("Love is for the kindess people")
print(type(vector1))

<class 'numpy.ndarray'>


In [None]:
from gensim.models import doc2vec
from scipy import spatial

cos_distance = spatial.distance.cosine(vector1, vector2)
print(cos_distance)

0.39973020553588867
