# Doc2Vec Notebook
Author: Andrew Auyeung  

References: [Detecting Document Similarity with doc2vec](https://towardsdatascience.com/detecting-document-similarity-with-doc2vec-f8289a9a7db7)

Versions:
* doc2vec_v1 - FSM.
* doc2vec_v2 - retrained with doctags as an entire element.
* doc2vec_v3 - removed "TDS Editors" from the authors. Their articles were mainly links to other articles and had no body text.

In [60]:
# Load Libraries
import pandas as pd
from gensim.models.doc2vec import Doc2Vec

In [2]:
# Load Model
model = Doc2Vec.load('../models/d2v.model')

In [71]:
# Load articles
articles = pd.read_csv('../src/TDS_articles.csv', sep='\t', index_col=0)

Let's check that the word2vec part of the model works.

In [17]:
model.most_similar("integrity".split())

[('quality', 0.3656102120876312),
 ('consistency', 0.33530813455581665),
 ('reliability', 0.325817734003067),
 ('availability', 0.31300684809684753),
 ('compliance', 0.3017086684703827),
 ('frameits', 0.2888759672641754),
 ('lineage', 0.2830910384654999),
 ('security', 0.2805323600769043),
 ('privacy', 0.2762299180030823),
 ('followingconsumer', 0.27621662616729736)]

In [18]:
model.most_similar("pyspark")

[('spark', 0.43477481603622437),
 ('scala', 0.3810179531574249),
 ('keras', 0.3605492115020752),
 ('pandas', 0.3562259376049042),
 ('panda', 0.3393400013446808),
 ('geopandas', 0.3307809829711914),
 ('bigquery', 0.32360467314720154),
 ('pytorch', 0.32095879316329956),
 ('python', 0.3196185231208801),
 ('dask', 0.31692153215408325)]

In [19]:
model.most_similar("king")

[('younglooking', 0.26352280378341675),
 ('giroux', 0.25816887617111206),
 ('muufl', 0.2439907193183899),
 ('gawande', 0.24306219816207886),
 ('valuenothing', 0.23543716967105865),
 ('adornment', 0.23438489437103271),
 ('criteriaone', 0.23263442516326904),
 ('underlaying', 0.23235675692558289),
 ('overcomplete', 0.23194748163223267),
 ('emtech', 0.2314153015613556)]

At first glance, this looks okay.
* `integrity` returns other nouns that describe the soundness of something (quality, consistency, reliability).
* `pyspark` is similar to other Python libraries, with the closest matches being its relatives (Spark and Scala).
* `king` returns a mix of unrelated terms, all with low scores (below 0.27). The similar terms seem loosely positive in tone. It makes sense that `queen` does not show up, because this corpus may not even contain that term.

We can test using a variant of the famous King - Man + Woman = Queen example (the cell below computes only King - Man).

We don't get exact matches for the traditional King and Queen, but that is to be expected with a small corpus.
*However*, if we try combinations with a data science context, we get better results!

In [27]:
model.similar_by_vector(model['king'] - model['man'])

[('king', 0.6999312043190002),
 ('asymptotic', 0.23131296038627625),
 ('processmachine', 0.2307613044977188),
 ('younglooking', 0.2303832322359085),
 ('saliencyguided', 0.22898586094379425),
 ('underlaying', 0.22815591096878052),
 ('aj', 0.2259904444217682),
 ('muufl', 0.22169552743434906),
 ('shao', 0.2213234305381775),
 ('softened', 0.2211374193429947)]
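The query above subtracts `man` but never adds `woman`, so `king` itself dominates the result. As a minimal sketch of the full analogy arithmetic, the toy example below uses hand-made 3-dimensional vectors (hypothetical values, not taken from the trained model) and excludes the input words from the candidates. With gensim word vectors, the same query is usually written as `most_similar(positive=['king', 'woman'], negative=['man'])`, which applies that exclusion itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings: axis 0 ~ royalty, axis 1 ~ male, axis 2 ~ female
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(positive, negative, vectors, topn=3):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Leave out the query words, otherwise they tend to rank first
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(["king", "woman"], ["man"], toy_vectors)
print(result)
```

On this toy data `queen` ranks first. On the real corpus the outcome depends on whether `queen` and `woman` are in the vocabulary at all.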

In [39]:
model.similar_by_vector(model['python'] + model['scala'])

[('scala', 0.8702682256698608),
 ('python', 0.788196325302124),
 ('java', 0.4596204161643982),
 ('pytorch', 0.43595853447914124),
 ('julia', 0.4249792993068695),
 ('pyspark', 0.4242379665374756),
 ('plotly', 0.42139214277267456),
 ('javascript', 0.4102274775505066),
 ('sql', 0.37776073813438416),
 ('ggplot', 0.3705998659133911)]

In [48]:
model.similar_by_vector(model['python'] + model['project'])

[('project', 0.7329937815666199),
 ('python', 0.7092459201812744),
 ('projects', 0.5556811690330505),
 ('pythonthe', 0.3842272460460663),
 ('tutorial', 0.35284870862960815),
 ('task', 0.35004037618637085),
 ('projectif', 0.3434600234031677),
 ('exercise', 0.34310513734817505),
 ('pytorch', 0.3402515649795532),
 ('startup', 0.33660030364990234)]

### To Do:
* Go back and check for typos in the raw text. Some words are concatenated (e.g. "followingconsumer", "pythonthe"), which may not be intentional.
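
One possible way to find such tokens, sketched below, is to flag any word that splits cleanly into two other vocabulary words. The vocabulary here is a small hand-made list for illustration; with the trained model the word list would come from its vocabulary. `find_concatenations` and `min_part_len` are hypothetical names, and `min_part_len` only limits false positives from very short pieces.

```python
def find_concatenations(vocab, min_part_len=3):
    """Map each word that is exactly two vocabulary words glued together
    to the (left, right) pair it splits into."""
    vocab = set(vocab)
    flagged = {}
    for word in sorted(vocab):
        # Try every cut point that leaves both halves long enough
        for cut in range(min_part_len, len(word) - min_part_len + 1):
            left, right = word[:cut], word[cut:]
            if left in vocab and right in vocab:
                flagged[word] = (left, right)
                break
    return flagged


# Hand-made sample vocabulary (illustrative only)
sample_vocab = [
    "following", "consumer", "followingconsumer",
    "python", "the", "pythonthe",
    "project", "if", "projectif",
    "pandas",
]

flagged = find_concatenations(sample_vocab)
for word, parts in flagged.items():
    print(word, "->", parts)
```

Legitimate compounds (for example "dataset" when "data" and "set" are both in the vocabulary) would also be flagged, so the output is a list of candidates to review rather than a list of confirmed typos.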

## Doc2Vec Document Similarity

*Document vectors are stored in the model by index. To look up a specific article, you have to find the index associated with its article_id tag.*
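
A minimal sketch of that lookup: walk every index once and build a reverse dictionary from tag to index. The helper name `build_tag_to_index` and the stand-in tag list are hypothetical; with the model loaded above, the callable would be `model.docvecs.index_to_doctag` and the count would be the number of rows in `model.docvecs.vectors_docs`.

```python
def build_tag_to_index(index_to_doctag, n_docs):
    """Return a dict mapping each integer article id to its model index."""
    return {int(index_to_doctag(i)): i for i in range(n_docs)}


# Stand-in for the model: tags are stored as strings, in index order
# (the positions here are made up for illustration)
stand_in_tags = ["3406", "17887", "18957"]

tag_to_index = build_tag_to_index(lambda i: stand_in_tags[i], len(stand_in_tags))
print(tag_to_index)
```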

In [67]:
# Let's look at the first document in the model
id_1 = int(model.docvecs.index_to_doctag(0))
id_1

3406

In [73]:
articles.loc[id_1]['title']

'Iteratively Finding a Good Machine Learning Model'

In [102]:
def show_similar_titles(model, articles, idx, n_to_show=5):
    """
    Helper function to look up similar articles from the corpus.

    idx is the document's position in the model, not its article id.
    """
    check_id = int(model.docvecs.index_to_doctag(idx))
    print(f"The title of the article selected is: {articles.loc[check_id]['title']}")
    print(f"It's article id is: {check_id}\n")

    print(f"The {n_to_show} most similar articles are:")
    # Ask for exactly n_to_show neighbours so values above the default of 10 also work
    similar_articles = model.docvecs.most_similar(idx, topn=n_to_show)
    print("Article ID:\tTitle:")
    for doctag, _score in similar_articles:
        current_id = int(doctag)
        print(f"{current_id}\t\t{articles.loc[current_id]['title']}")

In [103]:
show_similar_titles(model, articles, 934)

The title of the article selected is: The essence of eigenvalues and eigenvectors in Machine Learning
It's article id is: 17887

The 5 most similar articles are:
Article ID:	Title:
57051		My take on 25 Questions to test a Data Scientist on Image Processing with Interactive Code- Part 1
52744		Local Outlier Factor for Anomaly Detection
62876		Understanding how to explain predictions with “explanation vectors”
41447		nan
40384		nan


In [104]:
show_similar_titles(model, articles, 2000)

The title of the article selected is: Coreference Resolution in Python
It's article id is: 18957

The 5 most similar articles are:
Article ID:	Title:
54207		Large-scale Graph Mining with Spark: Part 2
58192		Text Similarity with TensorFlow.js Universal Sentence Encoder
46848		Natural Language Processing: A beginner’s guide part-II
16318		A Beginner’s Guide to Rasa NLU for Intent Classification and Named-entity Recognition
55230		A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text


In [106]:
show_similar_titles(model, articles, 3874)

The title of the article selected is: How to be Good at Algorithms?
It's article id is: 25970

The 5 most similar articles are:
Article ID:	Title:
57650		8 Common Data Structures every Programmer must know
44321		Introduction to 8 Essential Data Structures
32068		How I Taught Myself Linked Lists
18084		3 Programming Concepts for Data Scientists
46402		Data Structures — Simplified and Classified


## Using Doc2Vec to Match a Text Query to Similar Documents
* Convert the query string to an inferred document vector
* Compare the inferred vector to the trained document vectors

In [108]:
query_vector = model.infer_vector("Beginnner Projects for Data Science".split())

In [121]:
model.docvecs.most_similar([query_vector])

[('51487', 0.5366455316543579),
 ('51375', 0.533961296081543),
 ('40853', 0.5326160192489624),
 ('51431', 0.5314090251922607),
 ('51538', 0.5269762277603149),
 ('40972', 0.5228385925292969),
 ('40384', 0.5165457725524902),
 ('40662', 0.5165322422981262),
 ('41511', 0.515238881111145),
 ('41198', 0.5151280760765076)]
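
Conceptually, passing a raw vector to `most_similar` ranks the stored document vectors by cosine similarity to it. The sketch below reproduces that ranking in plain Python on made-up 2-dimensional vectors. The tags and values are hypothetical, and `rank_by_cosine` is an illustrative helper rather than part of gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_cosine(query_vector, doc_vectors, topn=10):
    """Return (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query_vector, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up document vectors keyed by string tags, as in the output above
toy_doc_vectors = {
    "101": [1.0, 0.0],
    "102": [0.7, 0.7],
    "103": [-1.0, 0.2],
}
toy_query = [0.9, 0.1]

ranked = rank_by_cosine(toy_query, toy_doc_vectors)
for tag, score in ranked:
    print(tag, round(score, 3))
```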

In [152]:
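def similar_doc_by_query(query, n_to_show=5):
    """
    Infer a vector for a free-text query and print the closest articles.
    Relies on the global `model` and `articles` loaded above.
    """
    assert isinstance(query, str)
    query_vector = model.infer_vector(query.split())
    similar_articles = model.docvecs.most_similar([query_vector], topn=n_to_show)
    for doctag, _score in similar_articles:
        current_id = int(doctag)
        # Strip the opening brace and keep the text before the first '","' separator
        intro = articles.loc[current_id]['body'].replace('{','').split('","')[0]
        print(f"Article ID:\t{current_id}")
        print(f"Title: \t\t{articles.loc[current_id]['title']}")
        print(f"Intro: \t\t{intro}")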

In [162]:
similar_doc_by_query("Random Forest For Classification")

Article ID:	21761
Title: 		Exploring the Python Pandas Library
Intro: 		"Pandas is a python library used for analyzing, transforming, and generating statistics from data. In this post, we will discuss several useful methods in Pandas for data wrangling and exploration. For our purposes, we will be using the Medical Cost Personal Datasets data from Kaggle.
Article ID:	21478
Title: 		Pivoting to Efficient Data Summaries
Intro: 		"One of the most powerful tools across all professions and industry is the Pivot Table. In most traditional analytics, Microsoft Excel serves as a must-have skill and a pivot table is the core of data exploration. They are dynamic, relatively straight forward, and provide vital summaries at both surface and in-depth levels of the data.
Article ID:	24847
Title: 		Using clustering to improve classification — a use case
Intro: 		"In today’s blog, we are going to give the intuition of one of our early articles published in a Hindawi Journal named “International Schol

In [173]:
similar_doc_by_query("Random Forest Tutorial")

Article ID:	47065
Title: 		Exploring Greater Sydney suburbs
Intro: 		"This was part of my IBM Data Science Professional Certificate’s capstone project. Read on and follow the link to source code and feel free to use.
Article ID:	46525
Title: 		An Introduction to Nine Essential Machine Learning Algorithms
Intro: 		"If this is the kind of stuff that you like, be one of the FIRST to subscribe to my new YouTube channel here! While there aren’t any videos yet, I’ll be sharing lots of amazing content like this but in video form. Thanks for your support :)
Article ID:	34409
Title: 		The Math Behind Deepfakes
Intro: 		"Although many are familiar with the incredible results produced by deepfakes, most people find it hard to understand how the deepfakes actually work. Hopefully, this article will demystify some of the math that goes into creating a deepfake.
Article ID:	51550
Title: 		Markov Chain Monte Carlo
Intro: 		"When I learned Markov Chain Monte Carlo (MCMC) my instructor told us there we
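
The results above are only loosely related to the queries. One possible cause: every vocabulary word shown earlier is lowercase (`pyspark`, `scala`, `king`), so capitalised query tokens such as `Random` or `Forest` may fall outside the vocabulary. The sketch below normalises a query under the assumption that the training text was lowercased and stripped of punctuation. The notebook does not show the preprocessing, so this is a guess to verify, and `normalize_query` is a hypothetical helper.

```python
import re


def normalize_query(query):
    """Lowercase the query and keep runs of letters and digits as tokens.

    Assumption: this mirrors the preprocessing used at training time.
    """
    return re.findall(r"[a-z0-9]+", query.lower())


tokens = normalize_query("Random Forest For Classification")
print(tokens)
```

The resulting token list could then be passed to `model.infer_vector` in place of `query.split()`.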

In [157]:
articles.loc[51375]['body'].replace('{', '').split('","')[0]

'"Latest picks:"}'

In [163]:
len(articles[articles.author == 'TDS Editors']) # Need to remove TDS Editors from Corpus! ~300 documents 

300
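
A minimal sketch of that clean-up before retraining, shown on a tiny stand-in DataFrame. The rows and ids are made up; only the `author`, `title` and `body` column names follow the notebook.

```python
import pandas as pd

# Stand-in for the articles table (illustrative rows only)
sample = pd.DataFrame(
    {
        "author": ["TDS Editors", "Jane Doe", "John Roe"],
        "title": ["Latest picks", "Intro to PCA", "DBSCAN by example"],
        "body": [
            '{"Latest picks:"}',
            '{"PCA reduces dimensionality.","It keeps the largest variance."}',
            '{"DBSCAN groups dense regions.","Sparse points become noise."}',
        ],
    },
    index=[1, 2, 3],
)

# Drop every article written by the editorial account
filtered = sample[sample["author"] != "TDS Editors"]
print(f"Kept {len(filtered)} of {len(sample)} articles")
```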

In [165]:
model.total_train_time/3600

3.701691695346365

## Topic Model with Doc2Vec
We will try reducing the document vectors with PCA(2) and then running DBSCAN to check for clusters.  
Steps:  
1. Build a conversion from the model's document index to its associated document tag.  
2. Use PCA to visualize. The PCA visualization plot may need to be rewritten to add the context of terms. Article titles could be used for visual analysis.  
3. Use DBSCAN to search for clusters based on density. The distance metric is not yet decided.  

In [50]:
model.docvecs.vectors_docs

array([[-0.2086484 ,  0.74766654, -0.13281198, ..., -4.418513  ,
         3.863445  , -1.003898  ],
       [ 0.6898131 ,  0.07900671,  0.6799487 , ..., -3.2480214 ,
         3.2630968 ,  0.27556705],
       [-2.7516708 ,  0.10931432,  0.22373906, ...,  0.50017786,
        -0.6704044 , -2.082502  ],
       ...,
       [ 3.4422028 ,  1.0311323 , -4.259427  , ..., -2.8754826 ,
        -0.26456988, -0.451631  ],
       [-2.4910774 ,  1.1628321 , -2.2151136 , ..., -0.9286146 ,
         0.50013655, -2.3644364 ],
       [ 0.21698952, -0.774499  , -2.809891  , ...,  0.57830733,
         0.59509945, -0.08918346]], dtype=float32)

In [49]:
from sklearn.decomposition import PCA

In [None]:
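pca = PCA(n_components=2)  # keep the first two principal components
pca_features = pca.fit_transform(model.docvecs.vectors_docs)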
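
The PCA and DBSCAN steps from the plan can be sketched as follows, assuming scikit-learn and NumPy are available. Random blobs stand in for `model.docvecs.vectors_docs` so the example runs without the trained model. The `eps` and `min_samples` values are placeholders that would need tuning on the real vectors, and Euclidean distance on the 2-D projection is just one option, since the metric is still undecided.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the document vectors: two tight, well-separated blobs in 20 dimensions
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 20))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 20))
doc_vectors = np.vstack([blob_a, blob_b]).astype(np.float32)

# Step 2: project onto the first two principal components for plotting
pca = PCA(n_components=2)
coords = pca.fit_transform(doc_vectors)

# Step 3: density-based clustering on the 2-D projection
# (placeholder parameters; label -1 would mark noise points)
dbscan = DBSCAN(eps=1.0, min_samples=5)
labels = dbscan.fit_predict(coords)

n_clusters = len(set(labels.tolist()) - {-1})
print(f"Projected shape: {coords.shape}, clusters found: {n_clusters}")
```

Clustering on the 2-D projection is convenient for plotting but discards most of the variance. Running DBSCAN on the full vectors, or on more components, is an alternative worth comparing.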