# Stack Overflow Doc2Vec

<a href="https://colab.research.google.com/github/fmcooper/stackoverflow-doc2vec/blob/master/stackoverflow-doc2vec.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Data from: https://www.kaggle.com/stackoverflow/pythonquestions#Tags.csv

---

In [17]:
import sys
import numpy as np
import pandas as pd
from time import time
import codecs # handles errors on data import
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import logging # to show progress of training model
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



print("\n---------- versions ----------\n")
print("python version: " + sys.version)
print("pandas version: " + pd.__version__)
print("numpy version: " + np.__version__)
print("matplotlib version: " + mpl.__version__)
print("gensim version: " + gensim.__version__)
print("sklearn version: " + sklearn.__version__)
print("nltk version: " + nltk.__version__)
print()

from google.colab import drive
drive.mount('/content/gdrive')

NUM_TESTING = 10000
TESTING = True

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

---------- versions ----------

python version: 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0]
pandas version: 0.24.2
numpy version: 1.16.4
matplotlib version: 3.0.3
gensim version: 3.6.0
sklearn version: 0.21.2
nltk version: 3.2.5

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Downloading the data

In [18]:
print("\n---------- downloading data ----------\n")
DATA_PATH = '/content/gdrive/My Drive/Colab/stackoverflow-doc2vec/data/Questions.csv'

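# raw file text decoded as UTF-8, with undecodable bytes dropped (not used again below)
content = ""
with codecs.open(DATA_PATH, 'r', encoding='utf-8', errors='ignore') as file:
    content = file.read()

# parse the same file into a DataFrame, this time decoding it as ISO-8859-1
df = pd.read_csv(DATA_PATH, encoding="ISO-8859-1")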
  


---------- downloading data ----------
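
The cell above decodes the same file twice: once as UTF-8 with `errors='ignore'`, and once as ISO-8859-1. The following is a minimal sketch of how those two choices can differ. It uses made-up byte strings rather than the Kaggle file, so the sample values are assumptions for illustration only.

```python
# Hypothetical sample bytes; they are not taken from Questions.csv.
latin1_bytes = "caf\u00e9".encode("ISO-8859-1")  # b'caf\xe9', not valid UTF-8
utf8_bytes = "caf\u00e9".encode("utf-8")         # b'caf\xc3\xa9'

# UTF-8 with errors='ignore' silently drops bytes it cannot decode.
dropped = latin1_bytes.decode("utf-8", errors="ignore")

# ISO-8859-1 maps every byte to a character, so decoding never fails,
# but UTF-8 input comes out garbled.
roundtrip = latin1_bytes.decode("ISO-8859-1")
garbled = utf8_bytes.decode("ISO-8859-1")

print(repr(dropped), repr(roundtrip), repr(garbled))
```

If the CSV is actually UTF-8, reading it as ISO-8859-1 will not raise an error but may garble non-ASCII characters in titles, so it is probably worth spot-checking a few rows.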



### Exploring the data

In [19]:
print("\n---------- exploring the data ----------\n")

# reduce data size if testing
if TESTING:
    print("Reduced data size to " + str(NUM_TESTING) + " entries")
    df = df[:NUM_TESTING]
    
print("Data shape: " + str(df.shape))
print("\ncolumn counts:\n" + str(df.count()))

num_missing = df.isnull().sum()
print("\nMissing values before:\n" + str(num_missing))

pd.set_option('display.expand_frame_repr', False)
print("\nFirst 5 data rows: \n" + str(df.head(n=5)))



---------- exploring the data ----------

Reduced data size to 10000 entries
Data shape: (10000, 6)

column counts:
Id              10000
OwnerUserId      9051
CreationDate    10000
Score           10000
Title           10000
Body            10000
dtype: int64

Missing values before:
Id                0
OwnerUserId     949
CreationDate      0
Score             0
Title             0
Body              0
dtype: int64

First 5 data rows: 
    Id  OwnerUserId          CreationDate  Score                                              Title                                               Body
0  469        147.0  2008-08-02T15:11:16Z     21  How can I find the full path to a font from it...  <p>I am using the Photoshop's javascript API t...
1  502        147.0  2008-08-02T17:01:58Z     27            Get a preview JPEG of a PDF on Windows?  <p>I have a cross-platform (Python) applicatio...
2  535        154.0  2008-08-02T18:43:54Z     40  Continuous Integration System for a Python Cod...  <p>I'm

### Preparing the data

In [20]:
print("\n---------- preparing the data ----------\n")
# drop the columns we aren't looking at
df_dropped = df.drop(['Id', 'OwnerUserId', 'CreationDate', 'Score', 'Body'], axis = 1)
print("\nFirst 5 data rows after column dropping: \n" + str(df_dropped.head(n=5)))

# change to a numpy array
np_data = np.asarray(df_dropped)
print("\nData as an np array: \n" + str(np_data[0:5]))
print("\nData shape: " + str(np_data.shape) + "\n")

# remove punctuation and stop words, lower-case and lemmatise each title
tokenizer = nltk.RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.WordNetLemmatizer()
docs = []
for d in np_data:
    wordsList = tokenizer.tokenize(d[0])
    # NOTE: the stop-word test runs before lower-casing, so capitalised stop words
    # such as 'How' and 'I' are kept; test w.lower() instead to drop them as well
    wordsList = [lemmatizer.lemmatize(w.lower()) for w in wordsList if w not in stop_words]
    docs.append(wordsList)

# tag the data
tagged_data = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
print("\nTagged data: " + str(tagged_data[0:5]))
# tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(np_data)]


---------- preparing the data ----------


First 5 data rows after column dropping: 
                                               Title
0  How can I find the full path to a font from it...
1            Get a preview JPEG of a PDF on Windows?
2  Continuous Integration System for a Python Cod...
3     cx_Oracle: How do I iterate over a result set?
4  Using 'in' to match an attribute of Python obj...

Data as an np array: 
[['How can I find the full path to a font from its display name on a Mac?']
 ['Get a preview JPEG of a PDF on Windows?']
 ['Continuous Integration System for a Python Codebase']
 ['cx_Oracle: How do I iterate over a result set?']
 ["Using 'in' to match an attribute of Python objects in an array"]]

Data shape: (10000, 1)


Tagged data: [TaggedDocument(words=['how', 'i', 'find', 'full', 'path', 'font', 'display', 'name', 'mac'], tags=[0]), TaggedDocument(words=['get', 'preview', 'jpeg', 'pdf', 'window'], tags=[1]), TaggedDocument(words=['continuous', 'integration', 's
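
The first tagged document above still contains `'how'` and `'i'`, although both are English stop words. This appears to follow from the order of operations in the cell: words are compared with the (lower-case) stop-word list before they are lower-cased. A minimal sketch of the two orderings is given here, using only the standard library. The stop-word set is a hypothetical subset and lemmatisation is left out, so this is an illustration rather than a drop-in replacement for the NLTK pipeline.

```python
import re

# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"how", "can", "i", "the", "to", "a", "from", "its", "on"}

def tokens_filter_then_lower(title):
    """Same order as the cell above: stop-word test first, lower-case after."""
    return [w.lower() for w in re.findall(r"\w+", title) if w not in STOP_WORDS]

def tokens_lower_then_filter(title):
    """Alternative order: lower-case first, so 'How' and 'I' are dropped too."""
    return [w for w in re.findall(r"\w+", title.lower()) if w not in STOP_WORDS]

title = "How can I find the full path to a font from its display name on a Mac?"
print(tokens_filter_then_lower(title))
print(tokens_lower_then_filter(title))
```

Whether capitalised stop words should be kept is a modelling choice. Lower-casing first gives a slightly smaller vocabulary.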

### Training the model

In [21]:
print("\n---------- training the model ----------\n")
t = time()
num_epochs = 100
alpha = 0.025  # gensim's default initial learning rate; not passed to the model below

# create model
model = Doc2Vec(vector_size=200, 
                window=5, 
                compute_loss=True,
                dm=1)
  
# build the vocab
model.build_vocab(tagged_data)

# train the model, then save it to Drive
model.train(tagged_data, total_examples=model.corpus_count, epochs=num_epochs, report_delay=1)
model.save("/content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model")
    
time_taken = round((time() - t) / 60, 2)
print("Time to train model: " + str(time_taken) + " mins")

2019-06-08 19:20:41,415 : INFO : collecting all words and their counts
2019-06-08 19:20:41,417 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-06-08 19:20:41,450 : INFO : collected 6286 word types and 10000 unique tags from a corpus of 10000 examples and 59751 words
2019-06-08 19:20:41,451 : INFO : Loading a fresh vocabulary
2019-06-08 19:20:41,460 : INFO : effective_min_count=5 retains 1635 unique words (26% of original 6286, drops 4651)
2019-06-08 19:20:41,460 : INFO : effective_min_count=5 leaves 52430 word corpus (87% of original 59751, drops 7321)
2019-06-08 19:20:41,469 : INFO : deleting the raw counts dictionary of 6286 items
2019-06-08 19:20:41,471 : INFO : sample=0.001 downsamples 47 most-common words
2019-06-08 19:20:41,472 : INFO : downsampling leaves estimated 40362 word corpus (77.0% of prior 52430)
2019-06-08 19:20:41,478 : INFO : estimated required memory for 1635 words and 200 dimensions: 11433500 bytes
2019-06-08 19:20:41,478 : INFO


---------- training the model ----------



2019-06-08 19:20:41,606 : INFO : training model with 3 workers on 1635 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-06-08 19:20:42,074 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-08 19:20:42,076 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-08 19:20:42,084 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-08 19:20:42,086 : INFO : EPOCH - 1 : training on 59751 raw words (50388 effective words) took 0.5s, 107374 effective words/s
2019-06-08 19:20:42,529 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-08 19:20:42,531 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-08 19:20:42,540 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-08 19:20:42,541 : INFO : EPOCH - 2 : training on 59751 raw words (50394 effective words) took 0.4s, 113472 effective words/s
2019-06-08 19:20:43,012 : INFO : worker

Time to train model: 0.78 mins
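
The training log reports that `effective_min_count=5` retains 1635 of 6286 word types. The model is created without a `min_count` argument, so gensim's default of 5 applies, and words seen fewer than five times are dropped before training. The following is a rough sketch of that pruning step on a toy corpus. The function name and the toy documents are made up for illustration, and this is not gensim's implementation.

```python
from collections import Counter

def prune_vocab(docs, min_count=5):
    """Count every word, then keep only those seen at least min_count times."""
    counts = Counter(word for doc in docs for word in doc)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept

# Hypothetical toy corpus: 'python' x7, 'list' x5, 'dict' x2, 'regex' x1.
toy_docs = [["python", "list"]] * 5 + [["python", "dict"]] * 2 + [["regex"]]
counts, kept = prune_vocab(toy_docs)

print("word types kept:", len(kept), "of", len(counts))
print("corpus words kept:", sum(kept.values()), "of", sum(counts.values()))
```

With short titles and a 10,000-row test subset, many words are rare, which is consistent with most word types being dropped while most of the corpus is retained.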


### Exploring the model

In [23]:
print("\n---------- Exploring the model ----------\n")
model = Doc2Vec.load("/content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model")

# choosing a test phrase to find a similar document to
test_phrase = 'list comprehension'
print("Testing the phrase: '", test_phrase, "'")
test_data = word_tokenize(test_phrase.lower())
inferred_phrase_vector = model.infer_vector(test_data)
print("Inferred vector for phrase: ", inferred_phrase_vector)

# finding the most similar documents to the phrase
print()
print("10 most similar docs")
t = time()
similar_docs = model.docvecs.most_similar([inferred_phrase_vector])
time_taken = round((time() - t) / 60, 2)
print("Time to find 10 most similar docs: ", time_taken, " mins")

print("Most similar docs (tags and scores): ", similar_docs)

print("Most similar docs (scores and text):")
for doc in similar_docs:
    tag = doc[0]
    score = doc[1]
    text = np_data[tag][0]
    print("score:", score, " doc:", text)

# now looking at similarity to all documents
print()
print(model.docvecs.count, "most similar docs")
t = time()
similar_docs = model.docvecs.most_similar([inferred_phrase_vector], topn=model.docvecs.count)
time_taken = round((time() - t) / 60, 2)
print("Time to find " + str(model.docvecs.count) + " similar docs: " + str(time_taken) + " mins")

print("Most similar doc: ", np_data[similar_docs[0][0]]) 
print("Median similar doc: ", np_data[similar_docs[model.docvecs.count // 2][0]])
print("Least similar doc: ", np_data[similar_docs[-1][0]])
print()

2019-06-08 19:22:05,077 : INFO : loading Doc2Vec object from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model
2019-06-08 19:22:05,180 : INFO : loading vocabulary recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.vocabulary.* with mmap=None
2019-06-08 19:22:05,182 : INFO : loading trainables recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.trainables.* with mmap=None
2019-06-08 19:22:05,183 : INFO : loading wv recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.wv.* with mmap=None
2019-06-08 19:22:05,184 : INFO : loading docvecs recursively from /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model.docvecs.* with mmap=None
2019-06-08 19:22:05,185 : INFO : loaded /content/gdrive/My Drive/Colab/stackoverflow-doc2vec/saved_model.model
2019-06-08 19:22:05,201 : INFO : precomp


---------- Exploring the model ----------

Testing the phrase: ' list comprehension '
Inferred vector for phrase:  [-0.0303454   0.03146004  0.06117353  0.11697804  0.15151669  0.08276034
 -0.09367646  0.02419818  0.03362042 -0.17787649 -0.12787664 -0.05211365
  0.16136287  0.09925481  0.19775678 -0.18140951 -0.01212998  0.03110356
 -0.09260952 -0.18974523  0.13384499  0.00079723 -0.04463546 -0.0570041
  0.0812666   0.13048825  0.1623901  -0.04154849  0.0723967  -0.04093298
  0.06035692 -0.11277607  0.11763982 -0.02965449  0.072163    0.09545857
 -0.07982928 -0.0394634  -0.06494092  0.07373034  0.14936884 -0.19790508
  0.00925783 -0.06225413  0.14222968  0.03906593 -0.05303317 -0.2320815
 -0.1320135  -0.09805014 -0.0880162   0.0162906  -0.09882345  0.13940069
 -0.03867751 -0.08261397  0.02370062  0.1531846  -0.05535223  0.08236048
  0.10152145 -0.03202575 -0.03014833  0.09819843  0.01812522  0.00866298
 -0.15402807 -0.16700388 -0.11383224  0.23744929  0.02000295 -0.04555642
 -0.112223
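
`most_similar` ranks the stored document vectors by cosine similarity to the inferred vector. The following is a minimal pure-Python sketch of that ranking, with made-up 2-dimensional vectors standing in for the 200-dimensional ones. The function names here are hypothetical and are not gensim's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, doc_vectors, topn=10):
    """Return (tag, score) pairs, highest cosine similarity first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical document vectors; the list index plays the role of the tag.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = rank_by_similarity([1.0, 0.1], toy_vectors, topn=len(toy_vectors))

print("most similar tag:", ranked[0][0])
print("median tag:", ranked[len(ranked) // 2][0])
print("least similar tag:", ranked[-1][0])
```

Asking for `topn` equal to the number of documents, as the cell does, returns the full ranking, so the last entry is the least similar document.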

