# Problem

Stack Overflow, the popular question-and-answer (Q&A) site, and its sister sites make monthly data dumps available for download from https://archive.org/details/stackexchange.

With this data, can we classify the questions and answers along the following lines?

* Conceptual vs. how-to questions
* Beginner vs. intermediate vs. hard/trick questions
* Whether a particular question is associated with another question, either as a next step or as a prerequisite
* Which question a user is likely to ask next, based on their current search

The resulting taxonomy could give a student of the area a useful lay of the land.

# Schema

The schema for the data is documented at https://ia800500.us.archive.org/22/items/stackexchange/readme.txt.
    
Unfortunately, the data is dumped in XML format, so a preliminary step is to convert it into CSV format. We have written a converter (convert2csv.py) for the tables of interest.

The schemas for the tables of interest are shown below.


## Posts
-----------
- Id
- PostTypeId
  - 1: Question
  - 2: Answer
- ParentId (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName, e.g.: "Jeff Atwood"
- LastEditDate, e.g.: "2009-03-05T22:28:34.823"
- LastActivityDate, e.g.: "2009-03-11T12:51:01.480"
- CommunityOwnedDate, e.g.: "2009-03-11T12:51:01.480"
- ClosedDate, e.g.: "2009-03-11T12:51:01.480"
- Title
- Tags
- AnswerCount
- CommentCount
- FavoriteCount

## Comments
---------------------------
- Id
- PostId
- Score
- Text, e.g.: "@Stu Thompson: Seems possible to me - why not try it?"
- CreationDate, e.g.: "2008-09-06T08:07:10.730"
- UserId

## Post History
---------------------------
- Id
- PostHistoryTypeId
    - 1: Initial Title - The first title a question is asked with.
    - 2: Initial Body - The first raw body text a post is submitted with.
    - 3: Initial Tags - The first tags a question is asked with.
    - 4: Edit Title - A question's title has been changed.
    - 5: Edit Body - A post's body has been changed, the raw text is stored here as markdown.
    - 6: Edit Tags - A question's tags have been changed.
    - 7: Rollback Title - A question's title has reverted to a previous version.
    - 8: Rollback Body - A post's body has reverted to a previous version - the raw text is stored here.
    - 9: Rollback Tags - A question's tags have reverted to a previous version.
    - 10: Post Closed - A post was voted to be closed.
    - 11: Post Reopened - A post was voted to be reopened.
    - 12: Post Deleted - A post was voted to be removed.
    - 13: Post Undeleted - A post was voted to be restored.
    - 14: Post Locked - A post was locked by a moderator.
    - 15: Post Unlocked - A post was unlocked by a moderator.
    - 16: Community Owned - A post has become community owned.
    - 17: Post Migrated - A post was migrated.
    - 18: Question Merged - A question has had another, deleted question merged into itself.
    - 19: Question Protected - A question was protected by a moderator
    - 20: Question Unprotected - A question was unprotected by a moderator
    - 21: Post Disassociated - An admin removes the OwnerUserId from a post.
    - 22: Question Unmerged - A previously merged question has had its answers and votes restored.
- PostId
- RevisionGUID: A single action can produce more than one type of history record; records created by the same action share the same RevisionGUID.
- CreationDate: "2009-03-05T22:28:34.823"
- UserId
- UserDisplayName: populated if a user has been removed and is no longer referenced by user Id
- Comment: This field will contain the comment made by the user who edited a post
- Text: A raw version of the new value for a given revision. 
- CloseReasonId
    - 1: Exact Duplicate - This question covers exactly the same ground as earlier questions on this topic; its answers may be merged with another identical question.
    - 2: off-topic
    - 3: subjective
    - 4: not a real question
    - 7: too localized
       
       
## Users
---------------------------
 - Id
 - Reputation
 - CreationDate
 - DisplayName
 - EmailHash
 - LastAccessDate
 - WebsiteUrl
 - Location
 - Age
 - AboutMe
 - Views
 - UpVotes
 - DownVotes
       

# Conversion from XML to CSV

Run python convert2csv.py to convert each of the XML files to its CSV equivalent. For columns (attributes) that contain text, the converter applies base64 encoding, which avoids having to handle quotes and special characters such as separators.

When the data is read back into a dataframe, the corresponding base64 decode needs to be applied. The converter also creates a 100-row sample file for each XML data dump it converts.
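
The conversion described above can be illustrated with a minimal sketch. This is not the actual convert2csv.py; it only assumes that each record in a dump is a `<row .../>` element whose attributes are the columns. The function name `xml_rows_to_csv`, the `TEXT_COLUMNS` set, and the in-memory sample XML are hypothetical and exist only for illustration. The sketch is written for Python 3 and uses only the standard library.

```python
import base64
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical choice of text columns; the real converter may encode a different set.
TEXT_COLUMNS = {"Body", "Title", "Text"}

def xml_rows_to_csv(xml_source, csv_target, columns):
    """Stream <row .../> elements from xml_source and write one CSV line per row."""
    writer = csv.writer(csv_target)
    writer.writerow(columns)
    count = 0
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        record = []
        for col in columns:
            value = elem.attrib.get(col, "")
            # base64 keeps quotes, commas and line breaks out of the CSV cell
            if value and col in TEXT_COLUMNS:
                value = base64.b64encode(value.encode("utf-8")).decode("ascii")
            record.append(value)
        writer.writerow(record)
        count += 1
        elem.clear()  # release the attributes of the row we just wrote
    return count

# Made-up sample input in the shape assumed above
sample_xml = b'''<?xml version="1.0" encoding="utf-8"?>
<comments>
  <row Id="1" PostId="10" Score="9" Text="Seems &quot;possible&quot;, why not try it?" />
  <row Id="2" PostId="11" Score="0" Text="Line one&#xA;line two" />
</comments>'''

out = io.StringIO()
n = xml_rows_to_csv(io.BytesIO(sample_xml), out, ["Id", "PostId", "Score", "Text"])

# Round trip: read the CSV back and decode the text column
rows = list(csv.DictReader(io.StringIO(out.getvalue())))
decoded = base64.b64decode(rows[0]["Text"]).decode("utf-8")
```

Missing attributes are written as empty cells in this sketch, so a later `dropna` on the text columns would discard those rows.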

In [1]:
#imports
import pandas as pd
import base64
import math
import re
import gensim
from gensim import corpora, models
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup 
from nltk.stem.porter import PorterStemmer

%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

In [2]:
posts = pd.read_csv('posts.sample.csv').dropna(subset=['Body','Title'])
posts['Body'] = posts['Body'].apply(lambda x : BeautifulSoup(base64.b64decode(x),"lxml").get_text())
posts['Title'] = posts['Title'].apply(lambda x : BeautifulSoup(base64.b64decode(x),"lxml").get_text())
posts[['Body','Title']].head(5)

Unnamed: 0,Body,Title
0,When should I use can? When should I use could...,"When do I use ""can"" or ""could""?"
1,"Doesn't ""quint"" mean ""five""? What does that h...","Where does the ""quint"" in ""quintessential"" com..."
2,"Which is the correct use of these two words, a...","When should I use ""shall"" versus ""will""?"
4,"I think most folk happily use either ""while"" o...","When did ""while"" and ""whilst"" become interchan..."
5,\n\nI may not be coming in tomorrow... \nI mig...,"""May"" & ""Might"": What's the right context?"


### Notice that we have stripped out the HTML formatting tags with BeautifulSoup before assigning the text back to the dataframe
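
To make the decode-and-strip step concrete, here is a small sketch that applies it to a single value. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages (Python 3). The helper names `_TextExtractor` and `decode_and_strip` are hypothetical, and the output may differ from BeautifulSoup's `get_text()` on malformed markup.

```python
import base64
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def decode_and_strip(encoded):
    """Decode a base64 cell and return its text content without HTML tags."""
    html = base64.b64decode(encoded).decode("utf-8")
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)

# Made-up cell value, encoded the way the CSV stores text columns
encoded = base64.b64encode(b"<p>When should I use <em>can</em>?</p>").decode("ascii")
text = decode_and_strip(encoded)
```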

In [3]:
comments = pd.read_csv('comments.sample.csv').dropna()
comments['Text'] = comments['Text'].apply(lambda x : BeautifulSoup(base64.b64decode(x),"lxml").get_text())
comments[['Score','Text']].head(5)

Unnamed: 0,Score,Text
0,9,I think you need to edit the title of your que...
1,12,It's correct when you're accessing a method of...
2,2,"Yes, I would think in almost any context where..."
3,0,"Would you say `It can certainly be ""acceptable..."
4,4,@serg555: Would you expect anything less on a ...


In [4]:
posthistory = pd.read_csv('posthistory.sample.csv').dropna(subset=['Text'])
posthistory['Text'] = posthistory['Text'].apply(lambda x : BeautifulSoup(base64.b64decode(x),"lxml").get_text())
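# Preview the comments again, this time with their creation dates.
# To preview the decoded post history instead, display posthistory[['CreationDate','Text']].head(5).
comments[['CreationDate','Text']].head(5)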

Unnamed: 0,CreationDate,Text
0,2010-08-05T19:48:13.987,I think you need to edit the title of your que...
1,2010-08-05T20:01:43.273,It's correct when you're accessing a method of...
2,2010-08-05T20:10:25.270,"Yes, I would think in almost any context where..."
3,2010-08-05T20:11:23.957,"Would you say `It can certainly be ""acceptable..."
4,2010-08-05T20:12:43.500,@serg555: Would you expect anything less on a ...


In [5]:
users = pd.read_csv('users.sample.csv').dropna(subset=['AboutMe','Location'])
users['AboutMe'] = users['AboutMe'].apply(lambda x : BeautifulSoup(base64.b64decode(x),"lxml").get_text())
users['Location'] = users['Location'].apply(lambda x : BeautifulSoup(base64.b64decode(x),"lxml").get_text())
users[['Location','AboutMe']].head(5)

Unnamed: 0,Location,AboutMe
0,on the server farm,"Hi, I'm not really a person.\nI'm a background..."
1,"Corvallis, OR",Developer on the Stack Overflow team. Find me...
2,"New York, NY",Developer on the Stack Overflow team.\nWas dub...
3,"Raleigh, NC",I design stuff for Stack Exchange. Also a prof...
4,California,"I slip my front end into the back end, and the..."


## Further cleansing

* Remove HTML tags and line breaks from the text fields
* Remove stop words (pick up the nltk stop words)
* Use PorterStemmer to stem words
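
The cleansing steps listed above can be sketched on a single string as follows. This is a standard-library illustration (Python 3) under stated assumptions: `clean_tokens` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for the nltk list, and stemming is passed in as an optional callable so the sketch runs without nltk. It also splits on punctuation, which is stricter than splitting on spaces alone.

```python
import re

# Tiny illustrative stop-word set; the nltk English list is much larger.
DEMO_STOP_WORDS = {"i", "the", "a", "to", "be", "in", "not", "may", "when", "should", "use"}

def clean_tokens(text, stop_words, stem=lambda tok: tok):
    """Lowercase, tokenize on word characters, drop stop words, then stem."""
    # flatten line breaks so words on adjacent lines are not glued together
    flat = text.replace("\r", " ").replace("\n", " ").lower()
    # keep runs of letters/digits, optionally followed by an apostrophe suffix (e.g. don't)
    raw_tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", flat)
    # remove stop words before stemming, because stemmed forms may no longer match the list
    kept = [tok for tok in raw_tokens if tok not in stop_words]
    return [stem(tok) for tok in kept]

tokens = clean_tokens('I may not be coming in tomorrow...\nWhen should I use "might"?',
                      DEMO_STOP_WORDS)
```

Passing `PorterStemmer().stem` as the `stem` argument, together with the nltk stop-word list, would apply the same steps with the tools named in the list.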

In [6]:
#global
p_stemmer = PorterStemmer()
stop_words = stopwords.words('english')
print(stop_words)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [7]:
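class Sentences(object):
    """Iterable that yields one cleaned token list per row of df[field]."""
    def __init__(self, df, field):
        self.field = field
        self.df = df

    def __iter__(self):
        for index, row in self.df.iterrows():
            # flatten line breaks (to a space, so adjacent words are not joined) and lowercase
            raw_sentence = row[self.field].replace('\n', ' ').lower()
            # split on spaces and question marks only; other punctuation stays attached to tokens
            raw_tokens = filter(None, re.split("[ ?]+", raw_sentence))
            # drop stop words before stemming: stemmed forms (e.g. 'thi' for 'this') no longer match the list
            kept_tokens = [tok for tok in raw_tokens if tok not in stop_words]
            tokens = [p_stemmer.stem(tok) for tok in kept_tokens]
            yield tokens

# allposts is an iterable of token lists; the token list for each post is rebuilt on every pass
allposts = Sentences(posts, 'Body')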

In [8]:
dictionary = corpora.Dictionary(allposts)
#print(dictionary.token2id) shows the mapping from tokens to ids

In [9]:
#bag of words
corpus = [dictionary.doc2bow(text) for text in allposts]
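
For readers unfamiliar with the bag-of-words representation, the sketch below shows the idea with the standard library only: each document becomes a sorted list of (token id, count) pairs. It is an illustration, not gensim's implementation; `build_vocab` and `to_bow` are hypothetical helpers, and the ids gensim assigns to tokens may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Made-up token lists standing in for cleaned posts
docs = [["can", "could", "can"], ["shall", "will", "can"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
```

Word order is discarded in this representation; only per-document token counts are kept.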

In [10]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=300, id2word = dictionary, passes=20)

In [11]:
print(ldamodel.print_topics(num_topics=3, num_words=4))

[(163, u'0.003*that\u2019 + 0.003*whom. + 0.003*up, + 0.003*said'), (241, u'0.003*that\u2019 + 0.003*whom. + 0.003*up, + 0.003*said'), (196, u'0.003*that\u2019 + 0.003*whom. + 0.003*up, + 0.003*said')]
