# Problem

Stack Overflow, the popular question-and-answer (Q&A) site, and its sister sites publish monthly data dumps for download at https://archive.org/details/stackexchange.

With this data, can we:

* Identify similar questions?
* Determine whether a question is related to another as a next step or a prerequisite?
* Predict the next question a user may ask based on their current search?

The resulting taxonomy could serve as a useful lay of the land for a student of the area.

# Schema

The schema for the data is documented at https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.

The data is dumped in XML format, so a preliminary step converts it to CSV. We have written a converter (convert2csv.py) for the tables of interest.

# Conversion from XML to CSV

Run python convert2csv.py to convert each XML file to its CSV equivalent. For columns/attributes that contain textual data, the converter applies base64 encoding, which avoids having to handle quotes and special characters (separators).

When the data is read back into a dataframe, the corresponding base64 decode needs to happen. The converter also creates a 100-row sample file for each XML data dump it converts.
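
The encode/decode round trip described above can be sketched as follows. This is a minimal illustration and not the code in convert2csv.py: the helper names `encode_field` and `decode_field` and the sample row are hypothetical, and the text is assumed to be UTF-8.

```python
import base64
import csv
import io

def encode_field(text):
    """Base64-encode a text field so quotes and separators cannot break the CSV."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_field(encoded):
    """Reverse encode_field after the CSV has been read back."""
    return base64.b64decode(encoded).decode("utf-8")

# Hypothetical row whose body contains commas, quotes and a newline
row = {"Id": "1", "Body": 'He said, "quint" means five,\nright?'}

# Write the row to an in-memory CSV with the text column encoded
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "Body"])
writer.writerow([row["Id"], encode_field(row["Body"])])

# Read it back and decode the text column
buffer.seek(0)
decoded_rows = [
    {"Id": r["Id"], "Body": decode_field(r["Body"])}
    for r in csv.DictReader(buffer)
]
```

With pandas, the same decode could be applied to a column after loading, for example with `df['Body'].apply(decode_field)`.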

In [157]:
#imports
import pandas as pd
import math
import re
import gensim
from gensim import corpora, models,similarities
from gensim.models import word2vec, doc2vec
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from pprint import pprint                        # pretty-printer

%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

SAMPLE_SIZE = 20000

## Data loading
* Load post title and body from post files
* Load post text from history files
* Load user info from user files

In [158]:
posts = pd.read_csv('data/posts.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Body','Title'])
posts['Tags'] = posts['Tags'].apply(lambda x : x.replace('<',' ').replace('>',' '))
posts[['Body','Title', 'Tags']].head(2)

Unnamed: 0,Body,Title,Tags
0,when should i use can when should i use could...,when do i use can or could,word-choice tenses politeness subjunctive-...
1,doesn t quint mean five what does that h...,where does the quint in quintessential com...,etymology


In [159]:
#comments = pd.read_csv('data/comments.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna()
#comments[['Score','Text']].head(5)

In [160]:
posthistory = pd.read_csv('data/posthistory.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Text'])
posthistory[['Text']].head(5)

Unnamed: 0,Text
0,when could i use can or when can i use could ...
1,when do i could can could
3,doesn t quint mean five what does that hav...
4,where does the quint in quintessential come ...
6,which is the correct use of these two words an...


In [161]:
users = pd.read_csv('data/users.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['AboutMe','Location'])
users[['Location','AboutMe']].head(5)

Unnamed: 0,Location,AboutMe
0,on the server farm,hi i m not really a person i m a background ...
1,corvallis or,developer on the stack overflow team find me...
2,new york ny,developer on the stack overflow team was dubb...
3,raleigh nc,i design stuff for stack exchange also a prof...
4,california,i slip my front end into the back end and the...


## Further cleansing

* Remove (html) tags & carriage returns from the Text field
* Remove stop words (pick up the nltk stop words)
* Use PorterStemmer to stem words
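
The first two steps above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `cleanse`, `TAG_RE` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses nltk's English stop word list), and the regular expression is a simple tag stripper, not a full HTML parser.

```python
import re

# Small stand-in stop list; the notebook uses nltk's English stop words instead
STOP_WORDS = {"where", "the", "in", "does", "from", "a", "is", "what"}

# Matches anything that looks like an HTML tag, e.g. <p> or </b>
TAG_RE = re.compile(r"<[^>]+>")

def cleanse(text):
    """Strip HTML tags and line breaks, lowercase, and drop stop words."""
    no_tags = TAG_RE.sub(" ", text)
    no_breaks = no_tags.replace("\r", " ").replace("\n", " ")
    tokens = re.findall(r"[a-z0-9']+", no_breaks.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS and len(tok) > 1]

cleaned = cleanse("<p>Where does the <b>quint</b> in\r\nquintessential come from?</p>")
```

Stemming is left out of the sketch; with nltk it could be added by passing each surviving token through `PorterStemmer().stem(tok)`.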

In [162]:
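class StopWords():
    def __init__(self):
        #p_stemmer = PorterStemmer()
        # nltk's English stop words, plus the extra word 'use'
        self.stop_words = stopwords.words('english')
        self.stop_words.append('use')

    def remove(self, sentence):
        # Split on runs of ; space , ? or * and drop empty strings
        raw_tokens = filter(None, re.split(r";+| +|,+|\?+|\*+", sentence))
        # Matching is case-sensitive and nltk's stop words are lowercase
        return [tok for tok in raw_tokens if tok not in self.stop_words and len(tok) > 1]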


In [163]:
#global
#p_stemmer = PorterStemmer()
#stop_words = stopwords.words('english')
#stop_words.append('use')
stop_words = StopWords()

In [164]:
class SentenceTokens():
    def __init__(self,df,field):
        self.field = field
        self.df = df
    
    def __iter__(self):
        for index, row in self.df.iterrows():
            raw_sentence = row[self.field]
            yield stop_words.remove(raw_sentence)

            #raw_tokens = filter(None, re.split("[ ]+",raw_sentence))
            #stem_tokens = [p_stemmer.stem(tok) for tok in raw_tokens]
            #yield [tok for tok in raw_tokens if not tok in stop_words and len(tok) > 1 ]


In [165]:
#allposts is an iterable of token lists. One list of tokens is produced per post each time it is iterated
allposts = SentenceTokens(posts,'Title')
#print([p for p in allposts])


In [171]:
#Map each distinct token to an integer id. The dictionary is the first step towards a document-term matrix.
dictionary = corpora.Dictionary(allposts)

In [172]:
#bag of words
#corpus is a list with one vector per document.
#Each document vector is a list of (token id, count) tuples.
corpus = [dictionary.doc2bow(post) for post in allposts]
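
To make the bag-of-words representation concrete, here is a small pure-Python sketch of the same idea. It is not gensim's implementation: `token2id` and `to_bow` are hypothetical names, the two documents are made up, and gensim may assign token ids in a different order.

```python
from collections import Counter

# Two hypothetical tokenised titles
docs = [
    ["quint", "quintessential", "come"],
    ["santa", "santa", "claus", "come"],
]

# Assign each distinct token an integer id in order of first appearance
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
```

Tokens that never appeared when the vocabulary was built are silently dropped, which is why a query containing only unseen words produces an empty vector.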

# Try Bag of Words
* Take a sample question
* Remove stop words
* Convert into a vector using bag of words
* Search vector using LSI model

In [173]:
# Find similar questions by converting it into vector
samples = ['Where does the quint in quintessential come from?',
           'Where does goodness me come from?']

sampleIndex = 0
sampleVector = dictionary.doc2bow(stop_words.remove(samples[sampleIndex]))
pprint(stop_words.remove(samples[sampleIndex]))

['Where', 'quint', 'quintessential', 'come']


In [174]:
# Heuristic: set the number of topics to the number of known tokens in the sample vector
numberOfTopics = len(sampleVector)
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=numberOfTopics)

# convert the query (sample vector) to LSI space
vec_lsi = lsi[sampleVector]

In [322]:
index = similarities.MatrixSimilarity(lsi[corpus]) 

# perform a similarity query against the corpus
sims = index[vec_lsi]

# Sort in descending order - highest matching percentage on top
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Show top 10 matches only
for docid, matchPercentage in sims[:10]:
    print("{:10.3f}% : {}".format(matchPercentage * 100, posts.iloc[docid]['Title']))

   100.000% :  tit for tat    where does this come from 
   100.000% : where does the  quint  in  quintessential  come from 
   100.000% : where does  ta   come from 
   100.000% : are  come round  and  visit  interchangeable 
   100.000% : where did the  juices  in  creative juices  come from 
   100.000% : where does  santa  in santa claus come from 
   100.000% : where does  can t be arsed  come from 
    99.999% : where does  hot damn   come from 
    99.998% : should i use  will  or  would  when i suggest that something will would come in handy 
    99.997% : where did the saying  bite the dust  come from 
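
The percentages above are cosine similarities between the query and each document in LSI space. The ranking step can be sketched in plain Python as follows; the vectors and document names are hypothetical, and this is an illustration of the measure, not gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-topic vectors for a query and three documents
query = [0.9, 0.1, 0.0]
doc_vectors = {
    "doc_a": [1.8, 0.2, 0.0],   # same direction as the query
    "doc_b": [0.0, 0.0, 1.0],   # orthogonal to the query
    "doc_c": [0.5, 0.5, 0.0],
}

# Highest similarity first
ranked = sorted(
    ((name, cosine(query, vec)) for name, vec in doc_vectors.items()),
    key=lambda item: -item[1],
)
```

Cosine similarity depends only on direction, not magnitude. With very few topics, many titles can end up pointing in nearly the same direction, which may be why several unrelated titles score close to 100% here.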


# Try LDA model

In [176]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word = dictionary, passes=30)

#num_topics: required. An LDA model requires the user to determine how many topics should be generated. 
#id2word: required. The LdaModel class requires our previous dictionary to map ids to strings.
#passes: optional. The number of passes the model makes through the corpus during training.

In [177]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

In [178]:
@interact(num_topics=5, num_words=3)
def understand(num_topics, num_words):
    return ldamodel.print_topics(num_topics, num_words)
    
#Each generated topic is separated by a comma.
#Within each topic are the three most probable words to appear in that topic.

[(0, u'0.030*sentence + 0.021*correct + 0.014*meaning'),
 (1, u'0.032*difference + 0.031*vs + 0.015*say'),
 (2, u'0.028*mean + 0.018*meaning + 0.015*versus'),
 (3, u'0.038*word + 0.028*english + 0.015*mean')]
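
Each topic string above is a weighted list of words. For further processing it can be split into (word, weight) pairs; the sketch below assumes the `weight*word + weight*word` format shown in this output, and `parse_topic` is a hypothetical helper. Quotes are stripped because some gensim versions print the words in double quotes.

```python
def parse_topic(topic_string):
    """Split '0.030*sentence + 0.021*correct' into [(word, weight), ...]."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip('"'), float(weight)))
    return pairs

topic = parse_topic("0.030*sentence + 0.021*correct + 0.014*meaning")
```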

In [179]:
pyLDAvis.gensim.prepare(ldamodel,corpus,dictionary)

# Try Doc2Vec

In [368]:
class LabeledLineSentence(object):
    def __init__(self,df,field,tag):
        self.df = df
        self.field = field
        self.tag = tag

    def __iter__(self):
        for index, row in self.df.iterrows():
            tokens = stop_words.remove(row[self.field])
            yield doc2vec.TaggedDocument(words=tokens,tags=[row[self.tag]])

lablines = LabeledLineSentence(posts,'Title','Id')
# print([p for p in lablines])

In [369]:
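# Start and end at the same learning rate; it is lowered by hand after each epoch
docmodel = doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
docmodel.build_vocab(lablines)
for epoch in range(10):
    # Newer gensim releases also require total_examples and epochs arguments here
    docmodel.train(lablines)
    docmodel.alpha -= 0.002  # decrease the learning rate
    docmodel.min_alpha = docmodel.alpha  # fix the learning rate, no decay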
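
The manual schedule in the loop above can be worked out without training anything. The sketch below only reproduces the arithmetic (start at 0.025, subtract 0.002 after each of 10 epochs); the variable names are illustrative.

```python
initial_alpha = 0.025
step = 0.002
epochs = 10

# Learning rate used in each epoch: the value before the post-epoch decrement
schedule = [round(initial_alpha - step * epoch, 3) for epoch in range(epochs)]

# Value left in the model after the last decrement
final_alpha = round(initial_alpha - step * epochs, 3)
```

The rate therefore falls linearly from 0.025 in the first epoch to 0.007 in the last, and the model is left with alpha at 0.005 once the loop finishes.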

In [370]:
class MatchingPost(object):
    """A post title paired with its similarity score for a query."""

    def __init__(self, matchingPercentage, title):
        self.matchingPercentage = matchingPercentage
        self.title = title


def showsimilar(question):
    # Coerce non-string input (e.g. widget values) to a string
    if not isinstance(question, str):
        question = str(question)

    norm_input = stop_words.remove(question) # question.split()
    q_vector = docmodel.infer_vector(norm_input)
    # most_similar returns the 10 nearest document tags by default
    similar_vecs = docmodel.docvecs.most_similar(positive=[q_vector])
    similarTitles = []

    for post_id, similarity in similar_vecs:
        post = posts[posts['Id'] == post_id]
        # Skip tags with no matching row in the posts sample
        if len(post) == 0:
            continue
        similarTitles.append(MatchingPost(similarity, post['Title'].iloc[0]))

    # Print each match as a percentage followed by the title
    for match in similarTitles:
        print("{:10.2f}% : {}".format(match.matchingPercentage * 100, match.title))

    return similarTitles


similarTitles = showsimilar("Where does the quint in quintessential come from?")

     90.76% : which is correct   one or more is  or  one or more are  
     87.12% : what s the meaning of  get one s finger in the air  
     86.85% : what is the difference between  used to  and  i was used to  
     86.75% : what s the difference between these sentences 
     86.13% : what s the difference between these sentences 
     85.96% : how did  mad  come to be a determiner 
     85.89% : where does  santa  in santa claus come from 
     85.86% : where does the  quint  in  quintessential  come from 
     85.83% : where does  pull it off  come from 
     85.80% : where does  can t be arsed  come from 


In [330]:
from IPython.display import display
from ipywidgets import widgets 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

def handler(sender):
    showsimilar(text.value)
    
text = widgets.Text()    
display(text)    
text.on_submit(handler)


     66.27% : what s the meaning of  get one s finger in the air  
     59.08% : can  found  be used  as it is in this sentence  in the future tense 
     58.78% : what does  capacity  mean in this question 
     54.69% : difference between  get  and  take 
     52.43% : does anybody pronounce the word  pillow  as  pellow  
     50.19% : when does a word become a  word  
     49.68% : colons and semi colons
     48.07% : how long does it take to mull something over 
     47.89% : can a book be divided in categories 
     47.71% : where does the  quint  in  quintessential  come from 
     55.30% :  prove me     prove to me     confirm one s belief 
     52.84% : colons and semi colons
     52.78% : difference between  get  and  take 
     49.12% : what s the meaning of  get one s finger in the air  
     49.11% :  in the middle of riddle  means what 
     48.67% : is it wrong to pronounce  pizza  as  peedtza  
     48.03% : can you use   sic   in other contexts 
     47.69% : can we say

## References

* https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
* LDA Viz - http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
* This dashboard - https://github.com/dhruvaray/soml
    