# Problem

Popular question and answer (qna) site -  stackoverflow (+ their sister sites) allows for download of monthly data dumps from https://archive.org/details/stackexchange.

With this data, can we classify the questions/answers based on

* Conceptual v/s howto question 
* Beginner v/s intermediate v/s hard/trick
* A particular question is associated with another question in terms of the next things to do or perhaps the pre-requisites?
* Predict the next question a user may ask based on this current search

The taxanomy could be a useful layout of the land for a student of the area.

# Schema

The schema for their data is located @ https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.
    
Unfortunately, the data is dumped in an XML format and there is preliminary effort to convert that data into CSV format. We have written a converter (convert2csv.py) for the tables of interest.

The schemas for the tables of interest are shown below.


## Posts
-----------
- Id
- PostTypeId
  - 1: Question
  - 2: Answer
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName="Jeff Atwood"
- LastEditDate="2009-03-05T22:28:34.823"
- LastActivityDate="2009-03-11T12:51:01.480"
- CommunityOwnedDate="2009-03-11T12:51:01.480"
- ClosedDate="2009-03-11T12:51:01.480"
- Title=
- Tags=
- AnswerCount
- CommentCount
- FavoriteCount

## Comments
---------------------------
- Id
- PostId
- Score
- Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
- CreationDate, e.g.:"2008-09-06T08:07:10.730"
- UserId

## Post History
---------------------------
- Id
- PostHistoryTypeId
    - 1: Initial Title - The first title a question is asked with.
    - 2: Initial Body - The first raw body text a post is submitted with.
    - 3: Initial Tags - The first tags a question is asked with.
    - 4: Edit Title - A question's title has been changed.
    - 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown.
    - 6: Edit Tags - A question's tags have been changed.
    - 7: Rollback Title - A question's title has reverted to a previous version.
    - 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here.
    - 9: Rollback Tags - A question's tags have reverted to a previous version.
    - 10: Post Closed - A post was voted to be closed.
    - 11: Post Reopened - A post was voted to be reopened.
    - 12: Post Deleted - A post was voted to be removed.
    - 13: Post Undeleted - A post was voted to be restored.
    - 14: Post Locked - A post was locked by a moderator.
    - 15: Post Unlocked - A post was unlocked by a moderator.
    - 16: Community Owned - A post has become community owned.
    - 17: Post Migrated - A post was migrated.
    - 18: Question Merged - A question has had another, deleted question merged into itself.
    - 19: Question Protected - A question was protected by a moderator
    - 20: Question Unprotected - A question was unprotected by a moderator
    - 21: Post Disassociated - An admin removes the OwnerUserId from a post.
    - 22: Question Unmerged - A previously merged question has had its answers and votes restored.
- PostId
- RevisionGUID: At times more than one type of history record can be recorded by a single action.  
- CreationDate: "2009-03-05T22:28:34.823"
- UserId
- UserDisplayName: populated if a user has been removed and no longer referenced by user Id
- Comment: This field will contain the comment made by the user who edited a post
- Text: A raw version of the new value for a given revision. 
- CloseReasonId
    - 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question.
    - 2: off-topic
    - 3: subjective
    - 4: not a real question
    - 7: too localized
       
       
## Users
---------------------------
 - Id
 - Reputation
 - CreationDate
 - DisplayName
 - EmailHash
 - LastAccessDate
 - WebsiteUrl
 - Location
 - Age
 - AboutMe
 - Views
 - UpVotes
 - DownVotes
       

# Conversion from XML to CSV

Run python convert2csv.py to convert each of the xml files to their CSV equivalents. For columns/attributes which contain textual data, the converter encodes them with base64 encoding so that handling of quotes and special characters (separators) is avoided. 

When the data is read back into the dataframe, the corresponding decode (from base64) needs to happen. The converter also creates a sample file of 100 rows for each xml data dump converted.

In [426]:
#imports
import pandas as pd
import math
import re
import gensim
from gensim import corpora, models
from gensim.models import word2vec, doc2vec
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()


SAMPLE_SIZE = 5000

In [427]:
posts = pd.read_csv('data/posts.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Body','Title'])
posts['Tags'] = posts['Tags'].apply(lambda x : x.replace('<',' ').replace('>',' '))
posts[['Body','Title', 'Tags']].head(2)

Unnamed: 0,Body,Title,Tags
0,when should i use can when should i use could...,when do i use can or could,word-choice tenses politeness subjunctive-...
1,doesn t quint mean five what does that h...,where does the quint in quintessential com...,etymology


In [428]:
comments = pd.read_csv('data/comments.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna()
comments[['Score','Text']].head(5)

Unnamed: 0,Score,Text
0,9,i think you need to edit the title of your que...
1,12,it s correct when you re accessing a method of...
2,2,yes i would think in almost any context where...
3,0,would you say it can certainly be acceptable...
4,4,serg would you expect anything less on a ...


In [429]:
posthistory = pd.read_csv('data/posthistory.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['Text'])
posthistory[['Text']].head(5)

Unnamed: 0,Text
0,when could i use can or when can i use could ...
1,when do i could can could
3,doesn t quint mean five what does that hav...
4,where does the quint in quintessential come ...
6,which is the correct use of these two words an...


In [430]:
users = pd.read_csv('data/users.csv.gz',compression='gzip',nrows=SAMPLE_SIZE).dropna(subset=['AboutMe','Location'])
users[['Location','AboutMe']].head(5)

Unnamed: 0,Location,AboutMe
0,on the server farm,hi i m not really a person i m a background ...
1,corvallis or,developer on the stack overflow team find me...
2,new york ny,developer on the stack overflow team was dubb...
3,raleigh nc,i design stuff for stack exchange also a prof...
4,california,i slip my front end into the back end and the...


## Further cleansing

* Remove (html) tags & carriage returns from the Text field
* Remove stop words (pick up the nltk stop words)
* Use PorterStemmer to stem words

In [431]:
#global
p_stemmer = PorterStemmer()
stop_words = stopwords.words('english')
stop_words.append('use')
#print(stop_words)

In [432]:
class SentenceTokens():
    def __init__(self,df,field):
        self.field = field
        self.df = df
    
    def __iter__(self):
      for index, row in self.df.iterrows():
         raw_sentence = row[self.field]
         raw_tokens = filter(None, re.split("[ ]+",raw_sentence))
         #stem_tokens = [p_stemmer.stem(tok) for tok in raw_tokens]
         yield [tok for tok in raw_tokens if not tok in stop_words and len(tok) > 1 ]

#all posts is a list of (list of tokens). The inner list of tokens is created once for each post
allposts = SentenceTokens(posts,'Title')


class LabeledLineSentence(object):
    def __init__(self,df,field,tag):
        self.df = df
        self.field = field
        self.tag = tag

    def __iter__(self):
        for index, row in self.df.iterrows():
           tags = ['T_' + tag for tag in row[self.tag].split()]
           yield doc2vec.TaggedDocument(words=row[self.field].split(),tags=tags)

lablines = LabeledLineSentence(posts,'Title','Tags')
#print([p for p in allposts])

In [433]:
#How frequently each term occurs within each document? We construct a document-term matrix.
dictionary = corpora.Dictionary(allposts)
#print(dictionary.token2id) #maps ids to tokens

In [441]:
#bag of words
corpus = [dictionary.doc2bow(text) for text in allposts]
#corpus, is a list of vectors equal to the number of documents. 
#In each document vector is a series of tuples. 
print(corpus[0])
print(len(corpus))
#(term ID, term frequency) pairs

[(0, 1)]
1371


In [435]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=30)

#num_topics: required. An LDA model requires the user to determine how many topics should be generated. 
#id2word: required. The LdaModel class requires our previous dictionary to map ids to strings.
#passes: optional. The number of laps the model will take through corpus.

In [436]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

In [437]:
@interact(num_topics=5, num_words=6)
def understand(num_topics, num_words):
    return ldamodel.print_topics(num_topics, num_words)
    
#Each generated topic is separated by a comma.
#Within each topic are the three most probable words to appear in that topic.

[(1,
  u'0.030*pronunciation + 0.023*correct + 0.019*word + 0.016*usage + 0.012*different + 0.010*mean'),
 (3,
  u'0.032*words + 0.030*english + 0.014*difference + 0.012*correct + 0.011*word + 0.011*american'),
 (4,
  u'0.071*vs + 0.015*word + 0.013*correct + 0.011*sentence + 0.009*proper + 0.008*appropriate'),
 (6,
  u'0.043*used + 0.023*using + 0.017*word + 0.015*instead + 0.010*versus + 0.010*correct'),
 (8,
  u'0.021*sentence + 0.020*correct + 0.014*question + 0.013*one + 0.012*right + 0.011*grammatically')]

In [438]:
pyLDAvis.gensim.prepare(ldamodel,corpus,dictionary)

In [439]:
#wordmodel = word2vec.Word2Vec(allposts)
#wordmodel.init_sims(replace=True)
#docmodel = doc2vec.Doc2Vec(lablines, min_count=1)


docmodel = doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
docmodel.build_vocab(lablines)
for epoch in range(10):
    docmodel.train(lablines)
    docmodel.alpha -= 0.002  # decrease the learning rate
    docmodel.min_alpha = docmodel.alpha  # fix the learning rate, no decay



In [440]:
input = 'how should one answer a business phone'
norm_input = [tok for tok in input.split() if not tok in stop_words and len(tok) > 1 ]
print(norm_input)
input_vec = docmodel.infer_vector(norm_input)
similar_vecs = docmodel.docvecs.most_similar(positive=[input_vec])
similartags = [s[0] for s in similar_vecs]

def matches(data,col,values):
    match = False
    for v in values:
        match =  data[col].str.contains(v.replace('T_',''))
        if match is True:
            return match
    return match

similarposts = posts[matches(posts,'Tags',similartags)]
similarposts[['Body','Title', 'Tags']].head(50)

['one', 'answer', 'business', 'phone']


Unnamed: 0,Body,Title,Tags
1201,when i was a child pretty much every children...,english term for a word that differs from anot...,single-word-requests terminology word-games
3617,having studied latin at high school and not be...,what s the origin of pig latin,history word-games


## References

* https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
* LDA Viz - http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
* This dashboard @ https://github.com/dhruvaray/soml
    