# Post Classification Experiment Using scikit-learn

* Date: 20/02/18
* Dylan Butler

## Task
The overall task of this experiment is to train a classifier that correctly predicts whether or not a post is useful for quizzes and knowledge testing of core Java concepts.

## Data
The data for this experiment consists of a manually labelled dataset of 1500 Stack Overflow posts. These posts have been filtered according to the following characteristics:

* They possess the structure of either a "how-to" (procedural intent) or a "why" (causal intent) type of question
* They have a minimum score of 7 (post score)
* They have not been deleted
* They have not been closed
* They have an accepted answer
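
The filter criteria above can be sketched roughly as follows. This is a minimal illustration, not the query that was actually used to build the dataset: the dictionary keys (`score`, `deleted`, `closed`, `accepted_answer_id`), the sample posts, and the title-prefix heuristic for detecting intent are all assumptions made for this example.

```python
import re

# Hypothetical heuristic: infer the intent from how the title starts.
PROCEDURAL = re.compile(r"^\s*how\s+(to|do|can|does|should)\b", re.IGNORECASE)
CAUSAL = re.compile(r"^\s*why\b", re.IGNORECASE)

def question_intent(title):
    """Return 'procedural', 'causal', or None for a post title."""
    if PROCEDURAL.search(title):
        return "procedural"
    if CAUSAL.search(title):
        return "causal"
    return None

def keep_post(post, min_score=7):
    """Apply the five filter criteria to one post (a dict with assumed keys)."""
    return (
        question_intent(post["title"]) is not None
        and post["score"] >= min_score
        and not post["deleted"]
        and not post["closed"]
        and post["accepted_answer_id"] is not None
    )

# Made-up posts, purely for illustration.
posts = [
    {"title": "How to split a string in Java?", "score": 1500,
     "deleted": False, "closed": False, "accepted_answer_id": 42},
    {"title": "Why is this loop slow?", "score": 3,
     "deleted": False, "closed": False, "accepted_answer_id": 7},
    {"title": "Best Java IDE?", "score": 90,
     "deleted": False, "closed": True, "accepted_answer_id": None},
]

kept = [p["title"] for p in posts if keep_post(p)]
print(kept)
```

A prefix heuristic like this is deliberately crude. Titles that express a procedural or causal intent without starting with "how" or "why" would be missed, so it should be read as a starting point only.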

After extracting this data I conducted an analysis on the resulting dataset to gain a deeper understanding of the data:

### Extracted Data insights
* Group 1 (useful for quizzes):
    * How to split a string in Java?
    * Read and convert an input stream to a string?
    * How to read all files in a folder in Java?
    * How to round a number to n decimal places in Java?
    * How to parse JSON in Java?
    * How do I declare and initialize an array in Java?
    * Why is it faster to process an unsorted array vs a sorted array
    * How do I compare strings in Java?
* Group 2 (not useful for quizzes):
    * How do I fix android.os.NetworkOnMainThreadException?
    * How do you assert that a certain exception is thrown in JUnit 4 tests?
    * How to fix java.lang.UnsupportedClassVersionError: Unsupported major.minor version
    * How to add local jar files to a Maven project?
    * How do I set up IntelliJ IDEA for Android applications?
    * How does autowiring work in Spring?
    * How do I tell Maven to use the latest version of a dependency?
    * Unfortunately MyApp has stopped. How can I solve this?
    * Why is subtracting these two times (in 1927) giving a strange result?

### Key Findings
* Questions that are not useful
    * A key difference I can spot is that most of these questions relate to an environment, framework, or tool that uses Java (for example Android, Maven, IntelliJ IDEA, Spring, JUnit) rather than to the language itself.
    * Their verbs (set up, fix, stopped) are generic, everyday words rather than Java-specific ones.
* Useful questions
    * The main words in the useful questions (split, string, read, Java, JSON, declare, initialize) are all closely related to Java and to programming concepts in general.
    * The verbs/action words used in the useful questions are closely associated with Java itself.
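
One rough way to check these observations is to count the most frequent title words in each group. The sketch below uses only the example titles listed under "Extracted Data insights"; the stop list and the `top_words` helper are assumptions introduced for illustration, not part of the original analysis.

```python
import re
from collections import Counter

useful = [
    "How to split a string in Java?",
    "Read and convert an input stream to a string?",
    "How to read all files in a folder in Java?",
    "How to round a number to n decimal places in Java?",
    "How to parse JSON in Java?",
    "How do I declare and initialize an array in Java?",
    "Why is it faster to process an unsorted array vs a sorted array",
    "How do I compare strings in Java?",
]
not_useful = [
    "How do I fix android.os.NetworkOnMainThreadException?",
    "How do you assert that a certain exception is thrown in JUnit 4 tests?",
    "How to fix java.lang.UnsupportedClassVersionError: Unsupported major.minor version",
    "How to add local jar files to a Maven project?",
    "How do I set up IntelliJ IDEA for Android applications?",
    "How does autowiring work in Spring?",
    "How do I tell Maven to use the latest version of a dependency?",
    "Unfortunately MyApp has stopped. How can I solve this?",
    "Why is subtracting these two times (in 1927) giving a strange result?",
]

# Small hand-picked stop list, for illustration only (an assumption).
STOP = {"how", "to", "do", "i", "a", "an", "the", "in", "is", "it", "of",
        "and", "vs", "that", "you", "for", "can", "this", "these", "why",
        "does", "use", "up", "n"}

def top_words(titles, k=5):
    """Return the k most common lowercase alphabetic words outside STOP."""
    words = []
    for title in titles:
        words += [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP]
    return Counter(words).most_common(k)

print("useful:    ", top_words(useful))
print("not useful:", top_words(not_useful))
```

With only a handful of titles per group the counts are anecdotal, but the same function could be run over the full labelled dataset.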
    
    
# Experiment Process

1. Chunk the tags, titles, and bodies into a single body
    * Eliminate code snippets
    * Remove stop words
    * Lemmatise each body
2. Extract the core features from the text that the algorithm can learn from
3. Train a classifier
4. Evaluate
5. Improve results

# 1) Generating the data
The format I will be converting the data into for this first experiment is a flattened chunk of (tags, title, body) for each post.

1. Remove all the code snippets from the bodies and titles of the text, using BeautifulSoup
2. Clean the post tags, converting the `<[tag | tag-tag]>` format to `[tag | tag tag]`
3. Merge the tags, titles, and bodies into a single chunk
4. Remove all stop words
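
Steps 2 to 4 could be sketched in plain Python as follows. This is a hedged illustration under stated assumptions: the helper names (`clean_tags`, `remove_stop_words`, `build_chunk`) and the tiny stop list are hypothetical, and a real run would use a fuller stop-word list.

```python
import re

def clean_tags(raw_tags):
    """Turn '<java><plugin-architecture>' into 'java plugin architecture'."""
    tags = re.findall(r"<([^<>]+)>", raw_tags)
    return " ".join(tag.replace("-", " ") for tag in tags)

# Tiny illustrative stop list (an assumption, far from complete).
STOP_WORDS = {"a", "an", "the", "i", "to", "in", "my", "that", "can", "be", "want"}

def remove_stop_words(text):
    """Drop whitespace-separated words that appear in STOP_WORDS (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def build_chunk(tags, title, body_text):
    """Merge cleaned tags, title, and code-free body text into one chunk."""
    merged = " ".join([clean_tags(tags), title, body_text])
    return remove_stop_words(merged)

chunk = build_chunk(
    "<java><audio>",
    "How can I play sound in Java?",
    "I want to be able to play sound files",
)
print(chunk)
```

Splitting on whitespace keeps punctuation attached to words (for example `Java?`), which is acceptable here because the vectorizer applies its own tokenisation later.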


In [2]:
import pandas 
df = pandas.read_csv('./data/procedural_casual_Q_1500_SO_Java.csv')

In [3]:
df.head()

Unnamed: 0,Id,Score,Body,Title,Tags,AnswerCount,CommentCount,FavoriteCount,ViewCount,OK
0,13225,8,<p>I've recently inherited a internationalized...,How can I refactor HTML markup out of my prope...,<java><jsp><internationalization><struts>,4,0,,2078,0
1,24991,19,<p>I have defined a Java function:</p>\n\n<pre...,Why can't I explicitly pass the type argument ...,<java><generics><syntax>,4,1,6.0,21171,1
2,24866,11,<p>I am using Java back end for creating an XM...,Is it essential that I use libraries to manipu...,<java><xml>,11,0,,690,0
3,25449,29,<p>I want to create a Java program that can be...,How to create a pluginable Java program?,<java><plugins><plugin-architecture>,6,1,18.0,17544,1
4,26305,151,<p>I want to be able to play sound files in my...,How can I play sound in Java?,<java><audio>,9,1,57.0,262318,1


Merge each post's body and title into a single chunk

In [8]:
df.columns[2:4]

Index(['Body', 'Title'], dtype='object')

In [10]:
# Merge body and title into a single comma-separated chunk
# (columns are selected by name, which is less fragile than a positional slice)
df['Title_Body_Chunk'] = df[['Body', 'Title']].apply(','.join, axis=1)

In [21]:
df.drop(['Body', 'Title'], axis=1)

Unnamed: 0,Id,Score,Tags,AnswerCount,CommentCount,FavoriteCount,ViewCount,OK,Title_Body_Chunk
0,13225,8,<java><jsp><internationalization><struts>,4,0,,2078,0,<p>I've recently inherited a internationalized...
1,24991,19,<java><generics><syntax>,4,1,6.0,21171,1,<p>I have defined a Java function:</p>\n\n<pre...
2,24866,11,<java><xml>,11,0,,690,0,<p>I am using Java back end for creating an XM...
3,25449,29,<java><plugins><plugin-architecture>,6,1,18.0,17544,1,<p>I want to create a Java program that can be...
4,26305,151,<java><audio>,9,1,57.0,262318,1,<p>I want to be able to play sound files in my...
5,40545683,8,<java><mappedbytebuffer>,2,11,1.0,134,0,<p>I have just encountered an error in my open...
6,16192410,11,<java><mysql>,3,0,3.0,34848,0,"<p>I have a java program, when i log in, after..."
7,43802,60,<java><date><calendar>,5,1,19.0,115813,1,<p>I have a <code>String</code> representation...
8,8110975,19,<java><swing><colors><paintcomponent><transluc...,2,0,1.0,34897,0,<p>I'm trying to paint a rectangle on my appli...
9,47177,137,<java><memory><cross-platform><cpu><diskspace>,10,4,87.0,202224,1,<p>I would like to monitor the following syste...


In [25]:
from bs4 import BeautifulSoup
from bs4 import Tag

In [39]:
def _remove_attrs(soup):
    # Strip every attribute from every tag; attrs must stay a dict, not None
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup

In [51]:
# Initialise a new column to hold the cleaned text
df['cleaned_body_title'] = ""

# Loop through the data frame, one post at a time
for index, row in df.iterrows():
    # Parse the HTML chunk (the 'html5lib' parser must be installed separately)
    soup = BeautifulSoup(row['Title_Body_Chunk'], 'html5lib')

    # Drop every <code> element so code snippets do not reach the classifier
    for code in soup.find_all("code"):
        code.decompose()

    # Keep only the remaining visible text, with all HTML tags stripped
    df.loc[index, "cleaned_body_title"] = soup.get_text()

In [55]:
df = df.drop(['Title', 'Body', 'Title_Body_Chunk'], axis=1)

In [56]:
df.head()

Unnamed: 0,Id,Score,Tags,AnswerCount,CommentCount,FavoriteCount,ViewCount,OK,cleaned_body_title
0,13225,8,<java><jsp><internationalization><struts>,4,0,,2078,0,I've recently inherited a internationalized an...
1,24991,19,<java><generics><syntax>,4,1,6.0,21171,1,I have defined a Java function:\n\n\n\nOne way...
2,24866,11,<java><xml>,11,0,,690,0,I am using Java back end for creating an XML s...
3,25449,29,<java><plugins><plugin-architecture>,6,1,18.0,17544,1,I want to create a Java program that can be ex...
4,26305,151,<java><audio>,9,1,57.0,262318,1,I want to be able to play sound files in my pr...


Generate a DataFrame containing only the text chunk and its classification label

In [57]:
df_new = df[['cleaned_body_title', 'OK']]

In [59]:
#df_new

# 2) Extracting Features from the documents

In [60]:
import numpy as np

In [61]:
from sklearn.feature_extraction.text import CountVectorizer

In [62]:
cv = CountVectorizer()
counts = cv.fit_transform(df_new['cleaned_body_title'].values)

In [71]:
counts

<1499x8026 sparse matrix of type '<class 'numpy.int64'>'
	with 85058 stored elements in Compressed Sparse Row format>
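
To make the shape of this matrix concrete, the following pure-Python sketch approximates what `CountVectorizer` does with its default settings: it lowercases the text, keeps tokens of two or more word characters, sorts the vocabulary alphabetically, and counts occurrences per document. The function names are hypothetical and the sketch returns dense lists rather than a sparse matrix, so it illustrates the idea rather than reproducing the library.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern: 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase a document and split it into tokens."""
    return TOKEN.findall(doc.lower())

def fit_vocabulary(docs):
    """Collect the alphabetically sorted set of tokens seen in docs."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocab):
    """Return one row of counts per document, one column per vocabulary term."""
    known = set(vocab)
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in known)
        rows.append([counts[tok] for tok in vocab])
    return rows

docs = [
    "How to split a string in Java?",
    "How to parse a JSON string in Java?",
]
vocab = fit_vocabulary(docs)
rows = transform(docs, vocab)
print(vocab)
for row in rows:
    print(row)
```

Each row corresponds to a post and each column to a vocabulary term, which is why the real matrix above has one row per post and 8026 columns. Words not seen during fitting are ignored at transform time.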

### List all of the features (vocabulary terms) in the CountVectorizer

In [75]:
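# Note: on scikit-learn 1.0+ use cv.get_feature_names_out(); get_feature_names() was removed in 1.2
cv.get_feature_names()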

['00',
 '000',
 '0000',
 '000000',
 '00000000000000000',
 '0001',
 '0007',
 '000z',
 '001',
 '0014',
 '01',
 '02',
 '03',
 '04',
 '05',
 '06',
 '07',
 '08',
 '09',
 '099',
 '0_07',
 '0_101',
 '0_144',
 '0_15',
 '0_20',
 '0_25',
 '0_26',
 '0_29',
 '0_40',
 '0_55',
 '0_60',
 '0_74',
 '0_91',
 '0a2',
 '0s',
 '0x0a',
 '0x2',
 '0x7fffffff',
 '0xff000000',
 '10',
 '100',
 '1000',
 '10000000',
 '10000000x2',
 '1001',
 '100d',
 '100ms',
 '100th',
 '101',
 '1024',
 '1028',
 '1033',
 '106',
 '109',
 '10k',
 '10m',
 '10x',
 '11',
 '11111111',
 '1112',
 '1123',
 '1139',
 '11g',
 '12',
 '1225328',
 '123',
 '1234',
 '12345',
 '12346',
 '12347',
 '1235',
 '124',
 '1245',
 '125',
 '1264622',
 '127',
 '1274883865000',
 '1274883865399',
 '1275552',
 '128',
 '12c',
 '12clarification',
 '13',
 '1300x',
 '132',
 '133',
 '1345',
 '135',
 '14',
 '143',
 '1433',
 '144',
 '145',
 '145007',
 '146',
 '149',
 '15',
 '150k',
 '151',
 '1586',
 '15mb',
 '15pm',
 '15s',
 '16',
 '166',
 '17',
 '17858',
 '17pm',
 '18',

# 3) Classifying the Posts

The first classifier I will be implementing is a Naive Bayes classifier. It applies Bayes' theorem under the "naive" assumption that each feature (in this case, a word count) is conditionally independent of every other feature given the class, so each feature contributes separately to the probability that an example belongs to a particular class.
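
As a rough illustration of that idea, the sketch below implements a small multinomial Naive Bayes from scratch with additive (Laplace) smoothing. It is a simplified stand-in for explanation, not scikit-learn's implementation: the function names and the four toy documents are made up, and `alpha=1.0` is the smoothing parameter.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # log P(class): fraction of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}

    # log P(word | class) with additive smoothing so unseen words get non-zero mass
    log_likelihood = {}
    for c in docs_per_class:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed word log-likelihoods."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Independence assumption: each word adds its own term to the score.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training set (made up): 1 = useful, 0 = not useful
docs = [
    ["split", "string", "java"],
    ["parse", "json", "java"],
    ["fix", "maven", "dependency"],
    ["set", "up", "intellij"],
]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, ["split", "json"]))
print(predict_nb(model, ["fix", "intellij"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.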

## Create, Initialize and train a new MultinomialNB

In [76]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
#targets are the OK column in the df_new dataframe above
targets = df_new['OK'].values
#train the NB classifier
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Test out the classifier

In [79]:
examples = ["How do I explicitly pass the type argument to a generic Java method? I do not understand how to achieve this", "How do I generate a new eclipse project? I am trying to create a new eclipse project and I need help setting it up"]
example_counts = cv.transform(examples)
predictions = classifier.predict(example_counts)

In [80]:
predictions

array([1, 0], dtype=int64)

#### Notes on the above:

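The classifier labels both hand-written examples as expected: the generics question is predicted as 1 (useful) and the Eclipse set-up question as 0 (not useful). Note that the first example closely paraphrases a post in the training data (row 1 in the `df.head()` output), so this is a sanity check rather than evidence of how well the classifier generalises.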
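
For the "Evaluate" step of the experiment process, a held-out test set gives a fairer picture than hand-written examples. The following is a minimal sketch under stated assumptions: the eight texts and labels are made-up stand-ins for `df_new['cleaned_body_title']` and `df_new['OK']`, and the 25% test split and `random_state=0` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data (made up): 1 = useful, 0 = not useful
texts = [
    "how to split a string in java",
    "how to parse json in java",
    "how do i compare strings in java",
    "how to declare and initialize an array",
    "how to add local jar files to a maven project",
    "how do i set up intellij idea for android",
    "how does autowiring work in spring",
    "how do i tell maven to use the latest dependency",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a quarter of the posts; stratify keeps both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Learn the vocabulary from the training split only, then reuse it on the test split
cv = CountVectorizer()
clf = MultinomialNB()
clf.fit(cv.fit_transform(X_train), y_train)

y_pred = clf.predict(cv.transform(X_test))
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

Fitting the vectorizer on the training split alone avoids leaking test-set vocabulary into the model. With the real dataset, precision and recall per class would be more informative than accuracy alone.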

# Pipelining - connecting the process