# Classifying the Java Posts from SO

* Dylan Butler
* 26/02/18

## Overview
This notebook documents the process of classifying new data, which contains all the Java-tagged how-to and why questions from Stack Overflow, into two categories: OK (1) for quizzes or NOT OK (0) for quizzes. A pipeline is created to automate preprocessing the raw data, passing it to the model, and storing the OK posts for the application to work from.

## Process
1. Load the data into a dataframe
2. Preprocess the data:
    * Clean the tags
    * Chunk the title and tags into a single column
3. Pass each instance into the trained model and label it 1 (OK) or 0 (NOT OK)
4. Discard all NOT OK posts
5. Save postID, Title, Tags and Accepted Answer to a PostgreSQL DB 

# 1) Load the data

In [1]:
import pandas as pd
df = pd.read_csv('./data/StackoverflowCompleteDS_JAVA.csv')

# 2) Preprocess the data

In [2]:
df = df[['Id', 'Title','Body','Tags', 'body']]
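The load and column-selection steps can be combined into one guarded helper. This is a minimal sketch, not the notebook's own code: `load_posts` and `REQUIRED` are hypothetical names, the in-memory CSV is invented sample data, and dropping rows without a title or tags is an assumption (the later join would fail on missing values).

```python
import io
import pandas as pd

# Columns the later steps rely on (taken from the selection above)
REQUIRED = ["Id", "Title", "Body", "Tags", "body"]

def load_posts(source):
    """Read the CSV and keep only the columns the later steps use."""
    frame = pd.read_csv(source)
    missing = [c for c in REQUIRED if c not in frame.columns]
    if missing:
        raise ValueError("missing columns: {}".format(missing))
    # Rows without a title or tags cannot be chunked later, so drop them early
    return frame[REQUIRED].dropna(subset=["Title", "Tags"])

# Invented two-row sample; the second row has no title
sample = io.StringIO(
    "Id,Title,Body,Tags,body\n"
    "1,How to parse XML?,<p>q</p>,<java><xml>,<p>a</p>\n"
    "2,,<p>q</p>,<java>,<p>a</p>\n"
)
posts = load_posts(sample)
print(len(posts))
```

In a real run the path `./data/StackoverflowCompleteDS_JAVA.csv` would be passed instead of the sample buffer.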

In [3]:
df.head()

Unnamed: 0,Id,Title,Body,Tags,body
0,5328,Why can't I use a try block around my super() ...,"<p>So, in Java, the first line of your constru...",<java><exception><mocking><try-catch>,"<p>Unfortunately, compilers can't work on theo..."
1,15690,How do you begin designing a large system?,<p>It's been mentioned to me that I'll be the ...,<java><oop><design><architecture>,"<p>Do you know much about OOP? If so, look in..."
2,24866,Is it essential that I use libraries to manipu...,<p>I am using Java back end for creating an XM...,<java><xml>,"<p>It's not essential, but advisable. However,..."
3,25449,How to create a pluginable Java program?,<p>I want to create a Java program that can be...,<java><plugins><plugin-architecture>,<p>I've done this for software I've written in...
4,24991,Why can't I explicitly pass the type argument ...,<p>I have defined a Java function:</p>\n\n<pre...,<java><generics><syntax>,<p>When the java compiler cannot infer the par...


## a) Clean the tags

In [4]:
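def clean_tags(raw_tags):
    # Turn '<java><xml>' into space-separated tag names and drop 'java'.
    # Note: the substring replace also strips 'java' from inside longer tags
    # (e.g. 'java-ee' becomes '-ee', as seen in the filtered output further down).
    cleaned_tags = raw_tags.replace('>', " ").replace('<', " ").replace('java', '')
    return cleaned_tags

# Apply the cleaning to every row of the Tags column
df['Tags'] = df['Tags'].apply(clean_tags)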
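The substring replace above also removes `java` from inside longer tags. A minimal sketch of an alternative follows, assuming tags always arrive in the `<tag1><tag2>` form shown in the data. `parse_tags` is a hypothetical helper, not part of the notebook, and a model trained on the original cleaning would see slightly different input.

```python
import re

def parse_tags(raw_tags, drop=("java",)):
    """Return space-separated tag names, dropping only exact matches in `drop`."""
    # Each tag is delimited by angle brackets, e.g. '<java><java-ee><jvm>'
    tags = re.findall(r"<([^<>]+)>", raw_tags)
    return " ".join(tag for tag in tags if tag not in drop)

print(parse_tags("<java><exception><mocking><try-catch>"))
print(parse_tags("<java><java-ee><jvm>"))
```

Matching whole tags keeps names such as `java-ee` intact while still removing the redundant `java` tag.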

## b) Chunk the title and tags per post

In [5]:
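# Note: after the column selection in In [2], df.columns[1:3] is ['Title', 'Body'],
# so this joins the title with the question body rather than with the tags
# (the recorded outputs below reflect that).
df['title_tag_chunk'] = df[df.columns[1:3]].apply(lambda x: ','.join(x), axis=1)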
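Selecting columns by position depends on the column order. Here is a minimal sketch that selects by name instead, assuming the intent is to join Title and Tags as the heading says. The `toy` frame and its values are invented for illustration, and the trained model should be fed whatever input shape it was trained on.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are illustrative)
toy = pd.DataFrame({
    "Id": [1, 2],
    "Title": ["How to parse XML?", "Why is my loop slow?"],
    "Body": ["<p>body one</p>", "<p>body two</p>"],
    "Tags": ["xml", "performance loops"],
})

# Select the columns by name so the result does not depend on column order
toy["title_tag_chunk"] = toy["Title"].str.cat(toy["Tags"], sep=",")
print(toy["title_tag_chunk"].tolist())
```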

# 3) Load the Trained NaiveBayes Model and Filter Dataset

In [6]:
import pickle
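# Open the file in a context manager so it is closed after loading.
# Only unpickle files from a trusted source.
with open('./models/multinomialnb_classifier_ngrams_title_tag.sav', 'rb') as model_file:
    trained_NB_model = pickle.load(model_file)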
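The loaded object is called with raw strings in the cells that follow, which suggests it bundles a text vectoriser with the classifier. Below is a minimal sketch of how such a model could be built, assuming a scikit-learn pipeline. The n-gram range, the toy training data and the name `toy_model` are assumptions for illustration, not the author's training setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only): 1 = OK for quizzes, 0 = NOT OK
texts = [
    "How to reverse a string,string",
    "How to sort a list,collections sorting",
    "Why does my build server crash,jenkins maven",
    "How do I configure my IDE plugin,eclipse plugins",
]
labels = [1, 1, 0, 0]

# Vectoriser + classifier in one object, so predict() accepts raw strings
toy_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
toy_model.fit(texts, labels)

# predict() takes a list of strings and returns one label per string
predictions = toy_model.predict(["How to reverse a list,collections"])
print(predictions.shape)
```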

## a) Test the approach on a sample of the dataset and analyse the results

In [7]:
import numpy as np

In [8]:
tmp_df = df[50:100]

In [9]:
test_list = list(tmp_df['title_tag_chunk'])

In [10]:
len(trained_NB_model.predict(test_list))

50

In [11]:
for l in test_list:
    prediction = trained_NB_model.predict([l])
    print("Question: {}\n Prediction: {}\n".format(l, prediction))

Question: How to convert a Reader to InputStream and a Writer to OutputStream?,<p>Is there an easy way to avoid dealing with text encoding problems?</p>

 Prediction: [1]

Question: Why am I getting a ClassCastException when generating javadocs?,<p>I'm using ant to generate javadocs, but get this exception over and over - why?</p>

<p>I'm using JDK version <strong>1.6.0_06</strong>.</p>

<pre><code>[javadoc] java.lang.ClassCastException: com.sun.tools.javadoc.ClassDocImpl cannot be cast to com.sun.javadoc.AnnotationTypeDoc
  [javadoc]     at com.sun.tools.javadoc.AnnotationDescImpl.annotationType(AnnotationDescImpl.java:46)
  [javadoc]     at com.sun.tools.doclets.formats.html.HtmlDocletWriter.getAnnotations(HtmlDocletWriter.java:1739)
  [javadoc]     at com.sun.tools.doclets.formats.html.HtmlDocletWriter.writeAnnotationInfo(HtmlDocletWriter.java:1713)
  [javadoc]     at com.sun.tools.doclets.formats.html.HtmlDocletWriter.writeAnnotationInfo(HtmlDocletWriter.java:1702)
  [javadoc]     

## b) Filter the entire dataset

In [12]:
# Create the target column
df['OK'] = None

# Iterate over the dataframe one row at a time
for index, row in df.iterrows():

    # Extract the text the model expects
    data = df.loc[index, 'title_tag_chunk']
    # Predict whether the post is OK (1) or NOT OK (0);
    # predict returns a one-element array, so the cell holds e.g. [1]
    prediction = trained_NB_model.predict([data])
    # Save the prediction to the row
    df.loc[index, 'OK'] = prediction
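Since `predict` already accepted a list of 50 strings in In [10], the whole column could probably be labelled in one call rather than row by row. Here is a minimal sketch under that assumption: `StubModel` is a hypothetical stand-in for the trained classifier and the `toy` rows are invented.

```python
import pandas as pd

class StubModel:
    """Stand-in for the trained classifier: flags texts that start with 'How'."""
    def predict(self, texts):
        return [1 if t.startswith("How") else 0 for t in texts]

toy = pd.DataFrame({"title_tag_chunk": [
    "How to round up integer division?,math",
    "Why is my server slow?,performance",
    "How do I sort a Set?,collections",
]})

# One predict call for the whole column instead of one call per row
toy["OK"] = StubModel().predict(toy["title_tag_chunk"].tolist())

# Labels are stored as plain integers, so filtering needs no unwrapping step
df_ok = toy[toy["OK"] == 1]
print(df_ok.index.tolist())
```

Storing flat labels this way would make the later `str.get(0)` unwrapping unnecessary.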

In [13]:
df.OK.count()

10728

In [14]:
df

Unnamed: 0,Id,Title,Body,Tags,body,title_tag_chunk,OK
0,5328,Why can't I use a try block around my super() ...,"<p>So, in Java, the first line of your constru...",exception mocking try-catch,"<p>Unfortunately, compilers can't work on theo...",Why can't I use a try block around my super() ...,[1]
1,15690,How do you begin designing a large system?,<p>It's been mentioned to me that I'll be the ...,oop design architecture,"<p>Do you know much about OOP? If so, look in...","How do you begin designing a large system?,<p>...",[0]
2,24866,Is it essential that I use libraries to manipu...,<p>I am using Java back end for creating an XM...,xml,"<p>It's not essential, but advisable. However,...",Is it essential that I use libraries to manipu...,[0]
3,25449,How to create a pluginable Java program?,<p>I want to create a Java program that can be...,plugins plugin-architecture,<p>I've done this for software I've written in...,"How to create a pluginable Java program?,<p>I ...",[0]
4,24991,Why can't I explicitly pass the type argument ...,<p>I have defined a Java function:</p>\n\n<pre...,generics syntax,<p>When the java compiler cannot infer the par...,Why can't I explicitly pass the type argument ...,[1]
5,32041,How to remove debug statements from production...,<p>Is it possible for the compiler to remove s...,debugging compiler-construction,<p>Two recommendations.</p>\n\n<p><strong>Firs...,How to remove debug statements from production...,[0]
6,32529,How do I restrict JFileChooser to a directory?,<p>I want to limit my users to a directory and...,swing jfilechooser,<p>You can probably do this by setting your ow...,How do I restrict JFileChooser to a directory?...,[0]
7,33262,How do I load an org.w3c.dom.Document from XML...,<p>I have a complete XML document in a string ...,xml document w3c,<p>This works for me in Java 1.5 - I stripped ...,How do I load an org.w3c.dom.Document from XML...,[0]
8,35186,How do I fix a NoSuchMethodError?,<p>I'm getting a <code>NoSuchMethodError</code...,nosuchmethoderror,<p>Without any more information it is difficul...,"How do I fix a NoSuchMethodError?,<p>I'm getti...",[0]
9,37089,How can an application use multiple cores or C...,<p>When launching a thread or a process in .NE...,c# multithreading,"<p>If you're using multiple threads, the opera...",How can an application use multiple cores or C...,[1]


In [15]:
df['OK'] = df['OK'].str.get(0)

In [16]:
df_ok = df[df.OK == 1]

In [17]:
df_ok = df_ok.drop(['title_tag_chunk', 'OK'], axis=1)

In [18]:
df_ok

Unnamed: 0,Id,Title,Body,Tags,body
0,5328,Why can't I use a try block around my super() ...,"<p>So, in Java, the first line of your constru...",exception mocking try-catch,"<p>Unfortunately, compilers can't work on theo..."
4,24991,Why can't I explicitly pass the type argument ...,<p>I have defined a Java function:</p>\n\n<pre...,generics syntax,<p>When the java compiler cannot infer the par...
9,37089,How can an application use multiple cores or C...,<p>When launching a thread or a process in .NE...,c# multithreading,"<p>If you're using multiple threads, the opera..."
10,37335,"How to deal with ""java.lang.OutOfMemoryError: ...",<p>I am writing a client-side <strong>Swing</s...,-ee jvm out-of-memory heap-memory,<p>Ultimately you always have a finite max of ...
11,41107,How to generate a random alpha-numeric string?,<p>I've been looking for a <em>simple</em> Jav...,string random alphanumeric,<h2>Algorithm</h2>\n\n<p>To generate a random ...
14,64036,How do you make a deep copy of an object in Java?,<p>In java it's a bit difficult to implement a...,class clone,"<p>A safe way is to serialize the object, then..."
15,71585,How to get parametrized Class instance,"<p>Since generics were introduced, Class is pa...",generics,<p>The Class class is a run-time representatio...
16,71625,Why would a static nested interface be used in...,<p>I have just found a static nested interface...,interface static,<p>The static keyword in the above example is ...
18,86780,How to check if a String contains another Stri...,"<p>Say I have two strings,</p>\n\n<pre><code>S...",string,"<p>Yes, contains is case sensitive. You can u..."
20,107823,Why is my Java program leaking memory when I c...,"<p>(Jeopardy-style question, I wish the answer...",multithreading memory-leaks,"<p>This is a known bug in Java 1.4:\n<a href=""..."


In [19]:
df_ok.to_csv('./data/filtered_data_ready_for_app.csv')
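Step 5 of the process names a PostgreSQL database, while this cell writes a CSV. The sketch below shows what the database write might look like, using the standard-library `sqlite3` module purely as a stand-in so that the example runs anywhere. The table name `ok_posts`, its columns and the sample rows are hypothetical. A PostgreSQL driver such as psycopg2 would need its own connection settings and uses `%s` placeholders instead of `?`.

```python
import sqlite3

# Invented rows: (post id, title, tags, accepted answer)
rows = [
    (1, "Example question one", "tag-a tag-b", "<p>Example answer one</p>"),
    (2, "Example question two", "tag-c", "<p>Example answer two</p>"),
]

# In-memory SQLite database as a stand-in for the PostgreSQL target
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS ok_posts (
        post_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        tags TEXT,
        accepted_answer TEXT
    )
    """
)

# Parameterised insert keeps HTML and quotes in the answers from breaking the SQL
conn.executemany(
    "INSERT INTO ok_posts (post_id, title, tags, accepted_answer) VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute("SELECT post_id, title FROM ok_posts ORDER BY post_id").fetchall()
print(stored)
conn.close()
```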

In [20]:
for item in list(df_ok.Title):
    print(item)

Why can't I use a try block around my super() call?
Why can't I explicitly pass the type argument to a generic Java method?
How can an application use multiple cores or CPUs in .NET or Java?
How to deal with "java.lang.OutOfMemoryError: Java heap space" error (64MB heap size)
How to generate a random alpha-numeric string?
How do you make a deep copy of an object in Java?
How to get parametrized Class instance
Why would a static nested interface be used in Java?
How to check if a String contains another String in a case insensitive manner in Java?
Why is my Java program leaking memory when I call run() on a Thread object?
How do I remove objects from an array in Java?
How to perform string Diffs in Java?
How do I list / export private keys from a keystore?
How to round up the result of integer division?
How do you convert binary data to Strings and back in Java?
Is it possible to detect if an exception occurred before I entered a finally block?
How can I play sound in Java?
How do you g

How can I find out the serialVersionUID of a serialized Java object?
How to suppress all checks for a file in Checkstyle?
How to convert a byte array to its numeric value (Java)?
How can I save a PNG with a tEXt or iTXt chunk from Java?
How to use a bitwise operator to pass multiple Integer values into a function for Java?
Why the awkward Design of System.out?
Is it possible to extend a class with no constructors in Java?
Why isn't my @BeforeClass method running?
How to calculate the difference between two Java java.sql.Timestamps?
How do I sort a Set to a List in Java?
How can I make sure N threads run at roughly the same speed?
Why won't this generic java code compile?
How to analyze simple English sentences
How to walk through Java class resources?
How can I exchange the first and last characters of a string in Java?
Why does the Java List interface not support getLast()?
How can I handle an IOException which I know can never be thrown, in a safe and readable manner?
How does synchr

How to find the number of days between two dates in java or groovy?
Is it worth cleaning ThreadLocals in Filter to solve thread pool-related issues?
How to parse a cookie string
How to merge two sorted arrays into a sorted array?
Why do my SwingWorker threads keep running even though they are done executing?
How to set time to a date object in java
How do you specify a byte literal in Java?
How uncheck items in AlertDialog (setMultiChoiceItems)?
How to make file sparse?
how to wait for Android runOnUiThread to be finished?
How to combine two byte arrays
Why does Java limit the size of a method to 65535 byte?
How to pass parameters to anonymous class?
How to create IN OUT or OUT parameters in Java
Why java has a lot of duplicate methods?
Is it possible in java make something like Comparator but for implementing custom equals() and hashCode()
How to convert Map<String, String> to Map<Long, String> using guava
How to read a .NET Guid into a Java UUID
Why no readUnsignedInt in RandomAccess

How to create a variable that can be set only once but isn't final in Java
How to acquire a lock by a key
How to check whether a string contains at least one alphabet in java?
How does the visitor pattern not violate the Open Close Priniciple?
How to remove control characters from java string?
How serialization works when only subclass implements serializable
How to get all descendants of an element using webdriver?
Why Guava does not provide a way to transform map keys
How to stop a java program if it is determined it should not run?
Why does Java's RoundingMode HALF_UP round -2.5 to -3?
How to use enum in switch case
How does an Enumeration variable works?
Why I can't have int in the type of ArrayList?
Why do I need to escape unicode in java source files?
How to re-throw an exception
Is it possible that write a program with Java bytecode instructions directly?
Why java provide facility to declare interface inside interface
Why is BigDecimal.equals specified to compare both value and 

Is it possible to use sun.misc.Unsafe to call C functions without JNI?
Why does select() consume so much CPU time in my program?
Why this weird output with truncate and BigDecimal?
Why is the hash table resized by doubling it?
How to obtain the end of the day when given a LocalDate?
How does the JVM decided to JIT-compile a method (categorize a method as "hot")?
How to convert Java assignment expression to Kotlin
Why is the Java 8 'Collector' class designed in this way?
How to clear Java 9 JShell Console?
Why doesn't Java have true multidimensional arrays?
How to know if an array can be sorted by one swap or less?
Why can't AtomicBoolean be a replacement for Boolean?
Why java += get wrong result, and how can I prevent this?
Is it allowed/adviseable to reuse a Collector?
How do I get the most frequent word in a Map and it's corresponding frequency of occurrence using Java 8 streams?
Why is Arrays.binarySearch not improving the performance compared to walking the array?
Why is `synchroni