# Classifying the Java Posts from SO

* Dylan Butler
* 26/02/18

## Overview
This notebook documents how new data, consisting of all Java-tagged how-to and why questions from Stack Overflow, is classified into two categories: OK for quizzes (1) or NOT OK for quizzes (0). A pipeline automates preprocessing the raw data, feeding it to the trained model, and storing the OK posts for the application to use.

## Process
1. Load the data into a dataframe
2. Preprocess the data:
    * Clean the tags
    * Chunk the title and tags into a single column
3. Pass each instance to the trained model and label it 1 (OK) or 0 (NOT OK)
4. Discard all NOT OK posts
5. Save postID, Title, Tags and Accepted Answer to a PostgreSQL DB 
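
Steps 2 to 4 can be sketched as a single function. This is a minimal sketch rather than the notebook's actual code: `classify_posts` and `HowOnlyStub` are hypothetical names, and the stub stands in for the trained model so that the example runs on its own. The two rows are taken from the sample output shown later in this notebook.

```python
import numpy as np
import pandas as pd


def clean_tags(raw_tags):
    # Same cleaning as step 2: drop the angle brackets and the 'java' substring.
    return raw_tags.replace(">", " ").replace("<", " ").replace("java", "")


def classify_posts(posts, model):
    """Steps 2-4: preprocess, label, and keep only the OK (1) posts."""
    posts = posts.copy()
    posts["Tags"] = posts["Tags"].apply(clean_tags)
    chunks = (posts["Title"] + "," + posts["Tags"]).tolist()
    labels = np.asarray(model.predict(chunks))
    return posts[labels == 1]


class HowOnlyStub:
    """Hypothetical stand-in for the trained classifier (not the real model)."""

    def predict(self, texts):
        # Label a post OK (1) only when its text starts with "How".
        return np.array([int(text.startswith("How")) for text in texts])


posts = pd.DataFrame({
    "Id": [25449, 24991],
    "Title": [
        "How to create a pluginable Java program?",
        "Why can't I explicitly pass the type argument ...",
    ],
    "Tags": ["<java><plugins><plugin-architecture>", "<java><generics><syntax>"],
})
ok_posts = classify_posts(posts, HowOnlyStub())
print(ok_posts["Id"].tolist())  # [25449]
```

Step 5 (storing the result) is left out of the sketch on purpose, since it depends on where the application reads its data from.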

# 1) Load the data

In [1]:
import pandas as pd
df = pd.read_csv('./data/StackoverflowCompleteDS_JAVA.csv')

# 2) Preprocess the data

In [2]:
# 'Body' is the question text; 'body' appears to hold the accepted answer
df = df[['Id', 'Title', 'Body', 'Tags', 'body']].copy()

In [3]:
df.head()

| | Id | Title | Body | Tags | body |
|---|---|---|---|---|---|
| 0 | 5328 | Why can't I use a try block around my super() ... | `<p>So, in Java, the first line of your constru...` | `<java><exception><mocking><try-catch>` | `<p>Unfortunately, compilers can't work on theo...` |
| 1 | 15690 | How do you begin designing a large system? | `<p>It's been mentioned to me that I'll be the ...` | `<java><oop><design><architecture>` | `<p>Do you know much about OOP?  If so, look in...` |
| 2 | 24866 | Is it essential that I use libraries to manipu... | `<p>I am using Java back end for creating an XM...` | `<java><xml>` | `<p>It's not essential, but advisable. However,...` |
| 3 | 25449 | How to create a pluginable Java program? | `<p>I want to create a Java program that can be...` | `<java><plugins><plugin-architecture>` | `<p>I've done this for software I've written in...` |
| 4 | 24991 | Why can't I explicitly pass the type argument ... | `<p>I have defined a Java function:</p>\n\n<pre...` | `<java><generics><syntax>` | `<p>When the java compiler cannot infer the par...` |


## a) Clean the tags

In [4]:
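def clean_tags(raw_tags):
    # Turn '<java><xml>' into space-separated tag names and drop 'java'.
    # Note: this removes the substring 'java' everywhere, so a tag such as
    # 'javascript' becomes 'script'.
    cleaned_tags = raw_tags.replace('>', " ").replace('<', " ").replace('java', '')
    return cleaned_tags

# Apply the cleaning to every row without an explicit Python loop
df['Tags'] = df['Tags'].apply(clean_tags)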
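
The `replace('java', '')` call in `clean_tags` strips the substring from every tag that contains it, not only from the `<java>` tag itself. A minimal sketch of a stricter alternative follows, assuming tags always arrive in the `<tag1><tag2>` form seen in the sample output. `clean_tags_exact` is a hypothetical name, and swapping it in only makes sense if the model was trained on text cleaned the same way.

```python
import re


def clean_tags_exact(raw_tags, drop=("java",)):
    """Split '<a><b>' into tag names and drop only exact matches of `drop`."""
    tags = re.findall(r"<([^<>]+)>", raw_tags)
    return " ".join(tag for tag in tags if tag not in drop)


print(clean_tags_exact("<java><exception><try-catch>"))  # exception try-catch
print(clean_tags_exact("<java><javascript><java-8>"))    # javascript java-8
```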

## b) Chunking the title and tags per post

In [5]:
# Join the title and the cleaned tags into one comma-separated string per post
df['title_tag_chunk'] = df['Title'] + ',' + df['Tags']

# 3) Load the Trained NaiveBayes Model and Filter Dataset

In [6]:
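import pickle
# Only unpickle files from a trusted source; pickle can execute arbitrary code
with open('./models/multinomialnb_classifier_ngrams_title_tag.sav', 'rb') as model_file:
    trained_NB_model = pickle.load(model_file)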
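
Because `predict` is later called on raw strings, the pickled object presumably bundles a text vectoriser with the classifier. The sketch below shows one object of that kind, assuming a scikit-learn `Pipeline` of `CountVectorizer` (with n-grams) and `MultinomialNB`. The file name hints at this structure, but the training code is not part of this notebook, so the n-gram range, the training texts, and the labels here are all made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data in the same "title,tags" shape as title_tag_chunk.
texts = [
    "How to reverse a string?, string",
    "Why does this loop never end?, loops",
    "How do you begin designing a large system?, oop design architecture",
    "Is it essential that I use libraries?, xml",
]
labels = [1, 1, 0, 0]  # 1 = OK for quizzes, 0 = NOT OK (illustrative only)

# Assumed structure: unigram and bigram counts feeding a multinomial Naive Bayes.
model = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("classifier", MultinomialNB()),
])
model.fit(texts, labels)

# Round-trip through pickle, mirroring how the .sav file is loaded above.
restored = pickle.loads(pickle.dumps(model))
print(restored.predict(["How to reverse a string?, string"]))
```

A pipeline like this accepts a list of strings and returns one label per string, which matches how `trained_NB_model` is used in the following cells.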

## a) Test out approach on sample of data set and analyse results

In [7]:
import numpy as np

In [8]:
tmp_df = df[50:100]

In [10]:
test_list = list(tmp_df['title_tag_chunk'])

In [11]:
len(trained_NB_model.predict(test_list))

50

In [12]:
for question in test_list:
    prediction = trained_NB_model.predict([question])
    #print("Question: {}\n Prediction: {}\n".format(question, prediction))

## b) Filter the entire dataset

In [13]:
#create the target column
df['OK'] = None

#iterates over the dataframe
for index, row in df.iterrows():
    
    #extract the correct data to feed model
    data = df.loc[index, 'title_tag_chunk']
    #predicts whether or not it is ok
    prediction = trained_NB_model.predict([data])
    #saves prediction to row
    df.loc[index, 'OK'] = prediction

In [14]:
df.OK.count()

10728

In [22]:
#df

In [15]:
# Each stored prediction is a length-1 array; unwrap it to a plain 0/1 label
df['OK'] = df['OK'].str.get(0)
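
The unwrapping step is needed because each `predict([data])` call returns an array with a single element. Passing the whole column in one call returns a flat array of labels, which avoids both the row loop and the unwrapping. A minimal sketch, using a hypothetical `StubModel` in place of the pickled classifier and a toy frame with made-up rows:

```python
import numpy as np
import pandas as pd


class StubModel:
    """Hypothetical stand-in for the pickled classifier."""

    def predict(self, texts):
        # Label a post OK (1) when its text starts with "How", else 0.
        return np.array([1 if text.startswith("How") else 0 for text in texts])


stub_model = StubModel()
toy_df = pd.DataFrame({
    "title_tag_chunk": [
        "How to sort a list?, sorting",
        "Why is this slow?, performance",
    ]
})

# A single-row call returns an array of shape (1,), not a scalar.
print(stub_model.predict([toy_df.loc[0, "title_tag_chunk"]]).shape)  # (1,)

# One call for the whole column yields one scalar label per row.
toy_df["OK"] = stub_model.predict(toy_df["title_tag_chunk"].tolist())
print(toy_df["OK"].tolist())  # [1, 0]
```

With a real scikit-learn model the batched call is also typically much faster than one call per row, since the text is vectorised once.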

In [16]:
df_ok = df[df.OK == 1]

In [17]:
df_ok = df_ok.drop(['title_tag_chunk', 'OK'], axis=1)

In [26]:
test_list = list(df_ok.Tags.head(10))

In [51]:
#df_ok

In [47]:
df_ok.to_csv('./data/filtered_data_ready_for_app.csv', index=False)

In [48]:
# Optional manual check: print every retained title
# for item in list(df_ok.Title):
#     print(item)