# Problem

Stack Overflow, the popular question-and-answer (Q&A) site, and its sister Stack Exchange sites publish monthly data dumps that can be downloaded from https://archive.org/details/stackexchange.

With this data, can we classify the questions and answers so as to:

* Identify similar questions?
* Determine whether one question is related to another, either as a next step or as a prerequisite?
* Predict the next question a user may ask, based on their current search?

The resulting taxonomy could give a student of the area a useful lay of the land.

# Schema

The schema for the data is documented at https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.
    
Unfortunately, the data is dumped as XML, so some preliminary effort is needed to convert it to CSV. We have written a converter (convert2csv.py) for the tables of interest.

# Conversion from XML to CSV

Run python convert2csv.py to convert each XML file to its CSV equivalent. The converter base64-encodes columns (attributes) that contain free text, which avoids having to escape quotes and separator characters.

When the data is read back into a dataframe, those columns must be base64-decoded. The converter also creates a 100-row sample file for each XML dump it converts.
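The encode/decode round trip described above can be sketched as follows. This is a minimal illustration rather than the code in convert2csv.py: the helper names `encode_text` and `decode_text` are hypothetical, and UTF-8 is assumed as the text encoding.

```python
import base64

def encode_text(text):
    """Base64-encode a text value so it is safe to store in a CSV cell."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_text(encoded):
    """Reverse encode_text: recover the original text from a CSV cell."""
    return base64.b64decode(encoded).decode("utf-8")

# A value containing a comma, quotes and a line break, all awkward in CSV
raw = 'He said, "use can, not could"\nand left.'
cell = encode_text(raw)       # contains only base64 characters
restored = decode_text(cell)  # identical to raw
```

If a column is still encoded after loading, a decode step along the lines of posts['Body'].apply(decode_text) would restore it (again assuming the converter used UTF-8).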

In [493]:
#imports
import pandas as pd
import math
import re
import gensim
from gensim import corpora, models,similarities
from gensim.models import word2vec, doc2vec
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()


SAMPLE_SIZE = 20000

In [515]:
posts = pd.read_csv('data/posts.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Body','Title'])
posts['Tags'] = posts['Tags'].apply(lambda x : x.replace('<',' ').replace('>',' '))
posts[['Body','Title', 'Tags']].head(2)


Unnamed: 0,Body,Title,Tags
0,when should i use can when should i use could...,when do i use can or could,word-choice tenses politeness subjunctive-...
1,doesn t quint mean five what does that h...,where does the quint in quintessential com...,etymology


In [453]:
#comments = pd.read_csv('data/comments.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna()
#comments[['Score','Text']].head(5)

In [454]:
posthistory = pd.read_csv('data/posthistory.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Text'])
posthistory[['Text']].head(5)

Unnamed: 0,Text
0,when could i use can or when can i use could ...
1,when do i could can could
3,doesn t quint mean five what does that hav...
4,where does the quint in quintessential come ...
6,which is the correct use of these two words an...


In [455]:
users = pd.read_csv('data/users.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['AboutMe','Location'])
users[['Location','AboutMe']].head(5)

Unnamed: 0,Location,AboutMe
0,on the server farm,hi i m not really a person i m a background ...
1,corvallis or,developer on the stack overflow team find me...
2,new york ny,developer on the stack overflow team was dubb...
3,raleigh nc,i design stuff for stack exchange also a prof...
4,california,i slip my front end into the back end and the...


## Further cleansing

* Remove HTML tags and carriage returns from the Text field
* Remove stop words (using the NLTK stop-word list)
* Stem words with PorterStemmer
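The three steps above could look roughly like the sketch below. It is dependency-free for illustration only: `cleanse`, `TAG_RE` and the tiny `STOP_WORDS` set are hypothetical names, and the notebook itself uses NLTK's English stop-word list. Stemming is optional here and is switched on by passing a function such as PorterStemmer().stem.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's list.
STOP_WORDS = {"the", "a", "is", "of", "in", "to", "i", "do"}

# Matches anything that looks like an HTML tag, e.g. <p> or </b>
TAG_RE = re.compile(r"<[^>]+>")

def cleanse(text, stop_words=STOP_WORDS, stem=None):
    """Strip HTML tags and line breaks, drop stop words, optionally stem."""
    text = TAG_RE.sub(" ", text)                        # remove HTML tags
    text = text.replace("\r", " ").replace("\n", " ")   # remove line breaks
    tokens = [t for t in text.lower().split() if t not in stop_words]
    if stem is not None:                                # e.g. PorterStemmer().stem
        tokens = [stem(t) for t in tokens]
    return tokens

tokens = cleanse("<p>When do I use <b>can</b>\r\nor could</p>")
# tokens is ['when', 'use', 'can', 'or', 'could']
```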

In [456]:
#global
p_stemmer = PorterStemmer()
stop_words = stopwords.words('english')
stop_words.append('use')
#print(stop_words)

In [457]:
class SentenceTokens():
    def __init__(self,df,field):
        self.field = field
        self.df = df
    
    def __iter__(self):
        for index, row in self.df.iterrows():
            raw_sentence = row[self.field]
            # split on runs of spaces and drop empty strings
            raw_tokens = filter(None, re.split("[ ]+", raw_sentence))
            #stem_tokens = [p_stemmer.stem(tok) for tok in raw_tokens]
            # keep tokens that are not stop words and are longer than one character
            yield [tok for tok in raw_tokens if tok not in stop_words and len(tok) > 1]

#allposts is an iterable of token lists; one list of tokens is produced for each post
allposts = SentenceTokens(posts,'Title')


class LabeledLineSentence(object):
    def __init__(self,df,field,tag):
        self.df = df
        self.field = field
        self.tag = tag

    def __iter__(self):
        for index, row in self.df.iterrows():
            # prefix each of the post's tags with 'T_' and use them as document tags
            tags = ['T_' + tag for tag in row[self.tag].split()]
            yield doc2vec.TaggedDocument(words=row[self.field].split(), tags=tags)

lablines = LabeledLineSentence(posts,'Title','Tags')
#print([p for p in allposts])

In [458]:
# Build the dictionary, which assigns an integer id to every unique token.
# It is the basis for the document-term matrix (how often each term occurs in each document).
dictionary = corpora.Dictionary(allposts)
#print(dictionary.token2id) # maps tokens to ids

In [504]:
#bag of words
corpus = [dictionary.doc2bow(text) for text in allposts]
# corpus is a list with one vector per document.
# Each document vector is a list of (term ID, term frequency) tuples.
#print(corpus[0])
#print(len(corpus))
#print(corpus)
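To make the (term ID, term frequency) representation concrete, here is a small pure-Python sketch of the idea. It is not gensim's implementation: `build_token_ids` and `to_bow` are hypothetical helpers, and assigning ids in order of first appearance is an assumption made for readability.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (term id, term frequency) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["word", "plural", "word"], ["plural", "singular"]]
token2id = build_token_ids(docs)            # {'word': 0, 'plural': 1, 'singular': 2}
bow = [to_bow(d, token2id) for d in docs]   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
unseen = to_bow(["unknown"], token2id)      # [] because the token has no id
```

A token with no id contributes nothing to the vector, so a document made up entirely of unseen tokens maps to an empty list.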

In [494]:
# convert tokenized documents to vectors
new_doc = "bunary randam trees unorder"
new_vec = dictionary.doc2bow(new_doc.lower().split())

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
vec_lsi = lsi[new_vec] # convert the query to LSI space

index = similarities.MatrixSimilarity(lsi[corpus]) 

In [520]:
sims = index[vec_lsi] # perform a similarity query against the corpus
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
sims = sorted(enumerate(sims), key=lambda item: -item[1])
#print(sims) # print sorted (document number, similarity score) 2-tuples
for docid, score in sims:
    print(str(score * 100) + " : " + posts.iloc[docid]['Title'])

0.0 : when do i use  can  or  could  
0.0 : where does the  quint  in  quintessential  come from 
0.0 : when should i use  shall  versus  will  
0.0 : when did  while  and  whilst  become interchangeable 
0.0 :  may     might   what s the right context 
0.0 : is it appropriate to use short form of  have    ve  when it means possession 
0.0 : which words in a title should be capitalized 
0.0 : when is it appropriate to end a sentence in a preposition 
0.0 : where did the term  ok okay  come from 
0.0 : what is the proper plural of the word  freshman  
0.0 : where did the singular  innings  come from 
0.0 : are  betwixt    trebble   etc   acceptable in american english 
0.0 : is it ever acceptable for a period to come after a quote at the end of a sentence 
0.0 :  what s wrong in with this question  
0.0 : is there a correct gender neutral  singular pronoun   his  versus  her  versus  their    
0.0 : can a word be contracted twice  e g   i ven t   
0.0 : what   s the rule for using    wh
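A plausible explanation for the all-zero scores above is that none of the query tokens occur in the dictionary. The bag-of-words vector is then empty, its projection into LSI space is empty as well, and the cosine similarity that MatrixSimilarity computes is 0 against every document. The sketch below illustrates this with plain Python; `cosine_similarity` is a hypothetical helper and the vectors are made-up numbers, not values from this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 2-topic document vectors (illustrative values only)
doc_vectors = [[0.8, 0.1], [0.2, 0.9]]
empty_query = [0.0, 0.0]   # query whose tokens are all out of vocabulary
known_query = [0.8, 0.1]   # query that lands on the first document

empty_scores = [cosine_similarity(empty_query, d) for d in doc_vectors]
known_scores = [cosine_similarity(known_query, d) for d in doc_vectors]
```

Here empty_scores is all zeros, while known_scores ranks the first document highest. Checking whether new_vec is empty before querying would confirm or rule out this explanation.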

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=30)

#num_topics: the number of topics to generate. The user has to choose it (gensim falls back to 100 if it is omitted).
#id2word: the dictionary built earlier, used to map ids back to strings.
#passes: optional. The number of passes the model makes through the corpus.

In [461]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

In [462]:
@interact(num_topics=5, num_words=3)
def understand(num_topics, num_words):
    return ldamodel.print_topics(num_topics, num_words)
    
#Each generated topic is separated by a comma.
#Within each topic are the three most probable words to appear in that topic.

[(0, u'0.026*pronunciation + 0.014*vs + 0.013*verb'),
 (1, u'0.072*sentence + 0.019*word + 0.016*end'),
 (9, u'0.080*vs + 0.031*plural + 0.018*singular'),
 (3, u'0.062*english + 0.048*words + 0.034*word'),
 (7, u'0.122*mean + 0.011*phrase + 0.011*names')]

In [463]:
pyLDAvis.gensim.prepare(ldamodel,corpus,dictionary)

In [464]:
#wordmodel = word2vec.Word2Vec(allposts)
#wordmodel.init_sims(replace=True)
#docmodel = doc2vec.Doc2Vec(lablines, min_count=1)


docmodel = doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
docmodel.build_vocab(lablines)
for epoch in range(10):
    docmodel.train(lablines)
    docmodel.alpha -= 0.002  # decrease the learning rate
    docmodel.min_alpha = docmodel.alpha  # fix the learning rate, no decay
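The loop above lowers the learning rate by hand, once per epoch. The schedule it produces can be written out as a small sketch. `alpha_schedule` is a hypothetical helper, and it computes each rate directly instead of by repeated subtraction, so the values can differ from the loop's by floating-point rounding.

```python
def alpha_schedule(start=0.025, step=0.002, epochs=10):
    """Learning rate used in each epoch when it is lowered by `step` per epoch."""
    return [start - step * epoch for epoch in range(epochs)]

schedule = alpha_schedule()
# The first epoch trains at about 0.025 and the tenth at about 0.007
```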



In [491]:
from pprint import pprint
def showsimilar(q):
    norm_input = [tok for tok in q.split() if not tok in stop_words and len(tok) > 1 ]
    input_vec = docmodel.infer_vector(norm_input)
    similar_vecs = docmodel.docvecs.most_similar(positive=[input_vec])
    similartags = [s[0] for s in similar_vecs]

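    def matches(data,col,values):
        # boolean mask: True for rows whose `col` mentions any of the similar tags
        mask = pd.Series(False, index=data.index)
        for v in values:
            # regex=False so that tag names are matched literally
            mask = mask | data[col].str.contains(v.replace('T_',''), regex=False)
        return mask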

    similarposts = posts[matches(posts,'Tags',similartags)]
    pprint(similarposts[['Title']])


In [492]:
from IPython.display import display
from ipywidgets import widgets

def handler(sender):
    # called when the user presses Enter in the text box
    showsimilar(text.value)

text = widgets.Text()
# on_submit may emit a DeprecationWarning in newer ipywidgets releases
text.on_submit(handler)
display(text)

## References

* https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
* LDA Viz - http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
* This dashboard @ https://github.com/dhruvaray/soml
    