# Stackoverflow Doc2Vec

<a href="https://colab.research.google.com/github/fmcooper/stackoverflow-doc2vec/blob/master/stackoverflow-doc2vec.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Data from: https://www.kaggle.com/stackoverflow/pythonquestions#Tags.csv

---

In [1]:
import sys
import numpy as np
import pandas as pd
from time import time
import codecs # handles errors on data import
import gensim
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import logging # to show progress of training model
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



print("\n---------- versions ----------\n")
print("python version: " + sys.version)
print("pandas version: " + pd.__version__)
print("numpy version: " + np.__version__)
print("matplotlib version: " + mpl.__version__)
print("gensim version: " + gensim.__version__)
print("sklearn version: " + sklearn.__version__)
print("nltk version: " + nltk.__version__)
print()

from google.colab import drive
drive.mount('/content/gdrive')

NUM_TESTING = 10000
TESTING = True

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.

---------- versions ----------

python version: 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
pandas version: 0.24.2
numpy version: 1.16.4
matplotlib version: 3.0.3
gensim version: 3.6.0
sklearn version: 0.21.2
nltk version: 3.2.5


### Downloading the data

In [2]:
print("\n---------- downloading data ----------\n")
DATA_PATH = '/content/gdrive/My Drive/Colab/stackoverflow-doc2vec/data/Questions.csv'

content = ""
with codecs.open(DATA_PATH, 'r', encoding='utf-8', errors='ignore') as file:
    content = file.read()
    
df = pd.read_csv(DATA_PATH, encoding = "ISO-8859-1")
  


---------- downloading data ----------
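
Two decoding strategies appear in this notebook when reading the CSV: UTF-8 with `errors='ignore'` and ISO-8859-1. As a rough illustration of how they differ, here is a minimal sketch on a made-up string (the sample text is an assumption and is not taken from `Questions.csv`):

```python
# Toy illustration of the two decoding strategies.
# The sample text is hypothetical and not taken from the dataset.

latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

# UTF-8 with errors='ignore' silently drops bytes it cannot decode.
dropped = latin1_bytes.decode("utf-8", errors="ignore")

# ISO-8859-1 maps every byte to a character, so decoding never fails,
# but UTF-8 input decoded this way turns into mojibake.
mojibake = utf8_bytes.decode("latin-1")

print(repr(dropped))   # 'caf'
print(repr(mojibake))  # 'cafÃ©'
```

Neither strategy raises an error, so it may be worth spot-checking a few titles with non-ASCII characters after loading.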



### Exploring the data

In [3]:
print("\n---------- exploring the data ----------\n")

# reduce data size if testing
if TESTING:
    print("Reduced data size to " + str(NUM_TESTING) + " entries")
    df = df[:NUM_TESTING]
    
print("Data shape: " + str(df.shape))
print("\ncolumn counts:\n" + str(df.count()))

num_missing = df.isnull().sum()
print("\nMissing values before:\n" + str(num_missing))

pd.set_option('display.expand_frame_repr', False)
print("\nFirst 5 data rows: \n" + str(df.head(n=5)))



---------- exploring the data ----------

Reduced data size to 10000 entries
Data shape: (10000, 6)

column counts:
Id              10000
OwnerUserId      9051
CreationDate    10000
Score           10000
Title           10000
Body            10000
dtype: int64

Missing values before:
Id                0
OwnerUserId     949
CreationDate      0
Score             0
Title             0
Body              0
dtype: int64

First 5 data rows: 
    Id  OwnerUserId          CreationDate  Score                                              Title                                               Body
0  469        147.0  2008-08-02T15:11:16Z     21  How can I find the full path to a font from it...  <p>I am using the Photoshop's javascript API t...
1  502        147.0  2008-08-02T17:01:58Z     27            Get a preview JPEG of a PDF on Windows?  <p>I have a cross-platform (Python) applicatio...
2  535        154.0  2008-08-02T18:43:54Z     40  Continuous Integration System for a Python Cod...  <p>I'm

### Preparing the data

In [4]:
print("\n---------- preparing the data ----------\n")
# drop the columns we aren't looking at
df_dropped = df.drop(['Id', 'OwnerUserId', 'CreationDate', 'Score', 'Body'], axis = 1)
print("\nFirst 5 data rows after column dropping: \n" + str(df_dropped.head(n=5)))

# change to a numpy array
np_data = np.asarray(df_dropped)
print("\nData as an np array: \n" + str(np_data[0:5]))
print("\nData shape: " + str(np_data.shape) + "\n")

# tokenise on word characters (drops punctuation), remove stop words, lower-case and lemmatise
# note: the stop-word check runs on the original-case token, so capitalised stop words are kept
tokenizer = nltk.RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.WordNetLemmatizer()
docs = []
for d in np_data:
    wordsList = tokenizer.tokenize(d[0])
    wordsList = [lemmatizer.lemmatize(w.lower()) for w in wordsList if w not in stop_words]
    docs.append(wordsList)

# tag the data
tagged_data = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
print("\nTagged data: " + str(tagged_data[0:5]))
# tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(np_data)]


---------- preparing the data ----------


First 5 data rows after column dropping: 
                                               Title
0  How can I find the full path to a font from it...
1            Get a preview JPEG of a PDF on Windows?
2  Continuous Integration System for a Python Cod...
3     cx_Oracle: How do I iterate over a result set?
4  Using 'in' to match an attribute of Python obj...

Data as an np array: 
[['How can I find the full path to a font from its display name on a Mac?']
 ['Get a preview JPEG of a PDF on Windows?']
 ['Continuous Integration System for a Python Codebase']
 ['cx_Oracle: How do I iterate over a result set?']
 ["Using 'in' to match an attribute of Python objects in an array"]]

Data shape: (10000, 1)


Tagged data: [TaggedDocument(words=['how', 'i', 'find', 'full', 'path', 'font', 'display', 'name', 'mac'], tags=[0]), TaggedDocument(words=['get', 'preview', 'jpeg', 'pdf', 'window'], tags=[1]), TaggedDocument(words=['continuous', 'integration', 's
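
The first tagged document above still contains `'how'` and `'i'`, because the stop-word check compares the original-case token against a lower-case stop-word list. A minimal sketch of the difference, assuming a small hand-written stop-word set in place of nltk's list and skipping lemmatisation (the function names are hypothetical):

```python
import re

# Hypothetical stand-in for nltk's English stop-word list, which is all lower case.
STOP_WORDS = {"how", "can", "i", "the", "to", "a", "from", "its", "on"}


def tokens_case_sensitive(title):
    """Filter on the original-case token, then lower-case what is left."""
    words = re.findall(r"\w+", title)
    return [w.lower() for w in words if w not in STOP_WORDS]


def tokens_case_insensitive(title):
    """Lower-case first, so capitalised stop words are removed as well."""
    words = re.findall(r"\w+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


title = "How can I find the full path to a font from its display name on a Mac?"
print(tokens_case_sensitive(title))
# ['how', 'i', 'find', 'full', 'path', 'font', 'display', 'name', 'mac']
print(tokens_case_insensitive(title))
# ['find', 'full', 'path', 'font', 'display', 'name', 'mac']
```

Whether question words such as "how" should be dropped is a modelling choice; the sketch only shows how to make the filter behave consistently regardless of capitalisation.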

### Training the model

In [5]:
print("\n---------- training the model ----------\n")
t = time()
num_epochs = 100
alpha = 0.025

# create model (alpha is the initial learning rate; 0.025 is also gensim's default)
model = Doc2Vec(vector_size=200,
                window=5,
                alpha=alpha,
                compute_loss=True,
                dm=1)

# build the vocab
model.build_vocab(tagged_data)

# train the model
model.train(tagged_data, total_examples=model.corpus_count, epochs=num_epochs, report_delay=1)
model.save("/content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model")
    
time_taken = round((time() - t) / 60, 2)
print("Time to train model: " + str(time_taken) + " mins")

2019-06-29 14:55:45,062 : INFO : collecting all words and their counts
2019-06-29 14:55:45,066 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-06-29 14:55:45,093 : INFO : collected 6286 word types and 10000 unique tags from a corpus of 10000 examples and 59751 words
2019-06-29 14:55:45,094 : INFO : Loading a fresh vocabulary
2019-06-29 14:55:45,103 : INFO : effective_min_count=5 retains 1635 unique words (26% of original 6286, drops 4651)
2019-06-29 14:55:45,103 : INFO : effective_min_count=5 leaves 52430 word corpus (87% of original 59751, drops 7321)
2019-06-29 14:55:45,111 : INFO : deleting the raw counts dictionary of 6286 items
2019-06-29 14:55:45,112 : INFO : sample=0.001 downsamples 47 most-common words
2019-06-29 14:55:45,113 : INFO : downsampling leaves estimated 40362 word corpus (77.0% of prior 52430)
2019-06-29 14:55:45,118 : INFO : estimated required memory for 1635 words and 200 dimensions: 11433500 bytes
2019-06-29 14:55:45,119 : INFO


---------- training the model ----------



2019-06-29 14:55:45,590 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-29 14:55:45,596 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-29 14:55:45,604 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-29 14:55:45,604 : INFO : EPOCH - 1 : training on 59751 raw words (50408 effective words) took 0.4s, 138048 effective words/s
2019-06-29 14:55:45,974 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-29 14:55:45,981 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-29 14:55:45,985 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-29 14:55:45,986 : INFO : EPOCH - 2 : training on 59751 raw words (50249 effective words) took 0.4s, 135166 effective words/s
2019-06-29 14:55:46,328 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-29 14:55:46,336 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-0

Time to train model: 0.65 mins
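
The `effective_min_count=5` lines in the log above report how many word types and tokens survive vocabulary pruning (1635 of 6286 types, 52430 of 59751 words). A minimal sketch of that kind of frequency pruning on a toy corpus, using a hypothetical `prune_vocab` helper rather than gensim's own implementation:

```python
from collections import Counter


def prune_vocab(docs, min_count):
    """Count every token and keep only words seen at least min_count times."""
    counts = Counter(word for doc in docs for word in doc)
    kept = {word: c for word, c in counts.items() if c >= min_count}
    return counts, kept


# Toy corpus (made up for illustration).
docs = [
    ["python", "list", "sort"],
    ["python", "dict"],
    ["python", "list", "comprehension"],
]

counts, kept = prune_vocab(docs, min_count=2)

print("word types kept:", len(kept), "of", len(counts))               # 2 of 5
print("tokens kept:", sum(kept.values()), "of", sum(counts.values()))  # 5 of 8
```

With short documents such as question titles, a high `min_count` can remove most of the rare, topic-specific words, so the retained percentages in the log are worth checking when changing the corpus size.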


### Exploring the model
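
The cell in this section ranks documents with `model.docvecs.most_similar`, which scores candidates by cosine similarity to the query vector. A minimal pure-Python sketch of that ranking, assuming tiny two-dimensional vectors and hypothetical helper names (this illustrates the idea and is not gensim's code):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, doc_vectors, topn=10):
    """Return (tag, score) pairs sorted from most to least similar."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy document vectors (made up); the list index plays the role of the document tag.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]

ranked = rank_by_similarity(query, doc_vectors, topn=len(doc_vectors))
print([tag for tag, _ in ranked])  # [0, 2, 1, 3]
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` scores 1.0 against `[1.0, 0.0]`, and the least similar document is the one pointing in the opposite direction.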

In [6]:
print("\n---------- Exploring the model ----------\n")
model = Doc2Vec.load("/content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model")

# choosing a test phrase to find a similar document to
test_phrase = 'list comprehension'
print("Testing the phrase: '", test_phrase, "'")
test_data = word_tokenize(test_phrase.lower())
inferred_phrase_vector = model.infer_vector(test_data)
print("Inferred vector for phrase: ", inferred_phrase_vector)

# finding the most similar documents to the phrase
print()
print("10 most similar docs")
t = time()
similar_docs = model.docvecs.most_similar([inferred_phrase_vector])
time_taken = round((time() - t) / 60, 2)
print("Time to find 10 most similar docs: ", time_taken, " mins")

print("Most similar docs (tags and scores): ", similar_docs)

print("Most similar docs (scores and text):")
for doc in similar_docs:
    tag = doc[0]
    score = doc[1]
    text = np_data[tag][0]
    print("score:", score, " doc:", text)

# now looking at similarity to all documents
print()
print(model.docvecs.count, "most similar docs")
t = time()
similar_docs = model.docvecs.most_similar([inferred_phrase_vector], topn=model.docvecs.count)
time_taken = round((time() - t) / 60, 2)
print("Time to find " + str(model.docvecs.count) + " similar docs: " + str(time_taken) + " mins")

print("Most similar doc: ", np_data[similar_docs[0][0]]) 
print("Median similar doc: ", np_data[similar_docs[int(model.docvecs.count/2)][0]])
print("Least similar doc: ", np_data[similar_docs[-1][0]])
print()

2019-06-29 14:56:24,058 : INFO : loading Doc2Vec object from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
2019-06-29 14:56:24,145 : INFO : loading vocabulary recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.vocabulary.* with mmap=None
2019-06-29 14:56:24,146 : INFO : loading trainables recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.trainables.* with mmap=None
2019-06-29 14:56:24,147 : INFO : loading wv recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.wv.* with mmap=None
2019-06-29 14:56:24,150 : INFO : loading docvecs recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.docvecs.* with mmap=None
2019-06-29 14:56:24,151 : INFO : loaded /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model
2019-06-29 14:56:24,170 : INFO : precomp


---------- Exploring the model ----------

Testing the phrase: ' list comprehension '
Inferred vector for phrase:  [ 0.08814396 -0.19688198 -0.08834332  0.06076005 -0.07049202 -0.11756884
  0.15121663  0.12571563  0.14121367 -0.16799547  0.19702432  0.13667159
 -0.01472403  0.20087966  0.03922275 -0.12506829 -0.14694007  0.12052501
 -0.02133354  0.00272268 -0.02368275  0.11390471 -0.09726799  0.13297288
 -0.1222138  -0.18324319  0.11362614 -0.15548204 -0.123593   -0.06920963
  0.00122982 -0.14186504  0.02757566 -0.03109406 -0.10153558  0.0470976
  0.00248729  0.088322    0.05436746 -0.14143631 -0.12887034  0.0971546
 -0.22033861 -0.01942477  0.17039259  0.13670775 -0.04282744  0.07833874
 -0.02183347  0.00111994  0.27038702 -0.11864842 -0.00058999  0.1976604
  0.32713258 -0.0808882   0.17750269  0.1629374   0.14259183 -0.14527313
 -0.00897255  0.09979337 -0.07497867  0.05100015 -0.14338923  0.06945376
 -0.07643405  0.04549519  0.07599924 -0.07360131 -0.1404713   0.05072852
  0.0315482