# Post Classification Experiment Using scikit-learn

* Date 20/02/18
* Dylan Butler

## Task
The overall task of this experiment is to train a classifier that correctly decides whether or not a post is useful for quizzes and knowledge testing of core Java concepts.

## Data
The data for this experiment consists of a manually labelled dataset of 1500 Stack Overflow posts. These posts have been filtered according to the following characteristics:

* They possess the structure of either a "how-to" (procedural intent) or a "why" (causal intent) type of question
* They have a minimum score of 7 (post score)
* They have not been deleted
* They have not been closed
* They have an accepted answer
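
The filters above could be expressed along the following lines. This is only a sketch: the column names (`Score`, `ClosedDate`, `DeletionDate`, `AcceptedAnswerId`, `Title`) and the toy rows are assumptions made for illustration, not the schema of the actual extract, and the title prefix check is a crude stand-in for real how-to/why intent detection.

```python
import pandas as pd

# Hypothetical posts table; column names and rows are assumed, not taken from the real extract.
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Title": ["How to split a string in Java?",
              "Why is my build slow?",
              "Best Java IDE?",
              "How do I compare strings in Java?"],
    "Score": [120, 3, 50, 45],
    "ClosedDate": [None, None, "2015-01-01", None],
    "DeletionDate": [None, None, None, None],
    "AcceptedAnswerId": [10.0, 11.0, None, 12.0],
})

# Crude stand-in for intent detection: titles that open with "how" or "why".
has_intent = posts["Title"].str.match(r"(?:how|why)\b", case=False)

filtered = posts[
    has_intent
    & (posts["Score"] >= 7)                 # minimum post score of 7
    & posts["DeletionDate"].isna()          # not deleted
    & posts["ClosedDate"].isna()            # not closed
    & posts["AcceptedAnswerId"].notna()     # has an accepted answer
]

print(filtered["Id"].tolist())
```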

After extracting this data I analysed the resulting dataset to gain a deeper understanding of it:

### Extracted Data insights
* Group 1 (useful for quizzes):
    * How to split a string in Java?
    * Read and convert an input stream to a string?
    * How to read all files in a folder in Java?
    * How to round a number to n decimal places in Java?
    * How to parse JSON in Java?
    * How do I declare and initialize an array in Java?
    * Why is it faster to process an unsorted array vs a sorted array
    * How do I compare strings in Java?
* Group 2 (not useful for quizzes):
    * How do I fix android.os.NetworkOnMainThreadException?
    * How do you assert that a certain exception is thrown in JUnit 4 tests?
    * How to fix java.lang.UnsupportedClassVersionError: Unsupported major.minor version
    * How to add local jar files to a Maven project?
    * How do I set up IntelliJ IDEA for Android applications?
    * How does autowiring work in Spring?
    * How do I tell Maven to use the latest version of a dependency?
    * Unfortunately MyApp has stopped. How can I solve this?
    * Why is subtracting these two times (in 1927) giving a strange result?

### Key Findings
* Questions that are not useful
    * A key difference I can spot is that most of the questions that are of no use relate to an environment, tool or framework, and focus on a technology that uses Java rather than on Java itself.
    * Their verbs (set up, fix, stopped, ...) are less Java-specific and more generic, as used in everyday language.
* Useful questions
    * The useful questions seem to follow a pattern in which the main words (split, string, read, java, JSON, declare, initialize) are all closely related to Java and to programming concepts in general.
    * The verbs/action words used in the useful questions are closely associated with Java itself.
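
One rough way to check these observations is to compare which words appear in one group's titles but never in the other's. The sketch below uses a handful of the titles listed above; the helper name `word_counts` and the simple alphabetic tokenisation are my own assumptions, not part of the original analysis.

```python
import re
from collections import Counter

# A few of the titles listed above, grouped by label.
useful = [
    "How to split a string in Java?",
    "How to parse JSON in Java?",
    "How do I declare and initialize an array in Java?",
    "How do I compare strings in Java?",
]
not_useful = [
    "How do I fix android.os.NetworkOnMainThreadException?",
    "How to add local jar files to a Maven project?",
    "How do I set up IntelliJ IDEA for Android applications?",
    "How do I tell Maven to use the latest version of a dependency?",
]

def word_counts(titles):
    """Count lower-cased alphabetic tokens across a list of titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z]+", title.lower()))
    return counts

useful_counts = word_counts(useful)
not_useful_counts = word_counts(not_useful)

# Words that occur in one group but never in the other.
only_useful = sorted(set(useful_counts) - set(not_useful_counts))
only_not_useful = sorted(set(not_useful_counts) - set(useful_counts))

print("only in useful titles:    ", only_useful)
print("only in not-useful titles:", only_not_useful)
```

Function words such as "in" also show up in these lists, which is one reason stop words are removed before any features are extracted.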
    
    
# Experiment Process

1. Chunk titles and bodies into a single body
    * eliminate code snippets 
    * remove stop words
    * lemmatise each body
2. Extract the core features from the text that the algorithm can learn from
3. Train a classifier
4. Evaluate
5. Improve results

# 1) Generating the data
The format I will be converting the data into for this first experiment is a flattened chunk of (tags, title and body) for each post.

1. Remove all the code snippets from the bodies and titles of the text --> using BeautifulSoup
2. Merge the title and body into a single chunk
3. Remove all stop words


In [None]:
import pandas 
df = pandas.read_csv('./data/procedural_casual_Q_1500_SO_Java.csv')

In [None]:
df.head()

Merge each post's body and title into a single chunk

In [None]:
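from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# chain feature extraction and classification into a single estimator
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# fit on the raw text; the vectorizer step is applied automatically
pipeline.fit(df_new['text'].values, df_new['OK'].values)
pipeline.predict(examples)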

# 4) Cross-validating the model - K-fold

At this stage the model needs to be cross-validated, i.e. its performance is measured on data it was not trained on, to check that it can give accurate predictions when faced with new data.

Shuffle the data first so that the training and test sets are not biased by the original row order. Note that the 10-fold scheme used here holds out 10% of the rows per fold, i.e. a 90:10 rather than an 80:20 training:test split.

In [None]:
# frac keyword - specifies the fraction of rows to return in the random
# sample -> 1 returns all rows, in shuffled order
df_new = df_new.sample(frac=1)
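
As a side note, the shuffle can be made repeatable. The sketch below uses a made-up stand-in for `df_new`; the seed value 42 is arbitrary and the frame `toy` is purely illustrative.

```python
import pandas as pd

# Toy stand-in for df_new; the values are made up for illustration.
toy = pd.DataFrame({"text": ["a", "b", "c", "d", "e"], "OK": [1, 0, 1, 0, 1]})

# frac=1 returns every row in random order; random_state (an arbitrary seed)
# makes the shuffle repeatable, and reset_index discards the old row labels.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

Fixing the seed means the folds, and therefore the scores, stay the same between runs, which makes it easier to compare models.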

## create an instance of K-Fold CV

In [None]:
import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix, f1_score

kf = KFold(n_splits=10)
scores = [] #holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf.split(df_new):
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['text'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #held-out test data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['text'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train on this fold's training rows and predict its test rows
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    #accumulate the confusion matrix and record this fold's F1 score
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [None]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)
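
To help read the printed results, here is how precision, recall, F1 and accuracy follow from a 2x2 confusion matrix. The numbers are made up and are not the results of this experiment; the layout `[[TN, FP], [FN, TP]]` is the one scikit-learn's `confusion_matrix` produces for labels 0 and 1.

```python
# Made-up confusion matrix in scikit-learn's layout:
# rows are true labels, columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
confusion = [[600, 150],
             [100, 650]]

(tn, fp), (fn, tp) = confusion

precision = tp / (tp + fp)   # of the posts predicted useful, how many really were
recall = tp / (tp + fn)      # of the truly useful posts, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note that the mean of the per-fold F1 scores is not necessarily identical to the F1 computed from the summed confusion matrix, so the two figures should be reported as separate quantities.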

In [None]:
type(pipeline)

### save the model 

In [None]:
import pickle
# a context manager makes sure the file is flushed and closed
with open('./models/multinomialnb_post_classifier.sav', 'wb') as f:
    pickle.dump(pipeline, f)
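
For completeness, a saved pipeline can be loaded back and used directly on raw text. The sketch below trains a tiny stand-in pipeline on made-up data and writes it to a temporary, hypothetical path rather than to `./models/`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in pipeline trained on made-up data, just to have something to save.
toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(["split string java", "fix maven build"], [1, 0])

# Hypothetical location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_classifier.sav")

with open(path, "wb") as f:
    pickle.dump(toy_pipeline, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline accepts raw text exactly like the original.
print(restored.predict(["split string java"]))
```

Pickle files should only be loaded from trusted sources, and a model is best unpickled with the same scikit-learn version that saved it.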

# Generating more features with N-grams

The counts were generated using the "bag of words" approach, which counts single instances of words. Using n-grams we can also count phrases, for example "this is a phrase" --> "this is", "is a", "a phrase".

CountVectorizer can be instructed to use this approach through its `ngram_range` parameter; `(1, 2)` counts both single words and two-word phrases.
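
The idea of word bigrams can be sketched in a few lines of plain Python. The helper `word_ngrams` is hypothetical and only illustrates the concept; it is not how CountVectorizer is implemented.

```python
def word_ngrams(text, n):
    """Return the list of n-word phrases found in text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("this is a phrase", 2))
```

One caveat: with its default `token_pattern`, CountVectorizer only keeps tokens of two or more characters, so a single-letter word such as "a" is dropped before the n-grams are formed and the bigrams it produces for this sentence differ from the ones above.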

In [None]:
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

In [None]:
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

kf = KFold(n_splits=10)
scores = [] #holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf.split(df_new):
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['text'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #held-out test data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['text'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train on this fold's training rows and predict its test rows
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    #accumulate the confusion matrix and record this fold's F1 score
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [None]:
import pickle
# a context manager makes sure the file is flushed and closed
with open('./models/ngrams_multinomialnb_post_classifier.sav', 'wb') as f:
    pickle.dump(pipeline, f)

## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),  # ngram_range=(1, 2) left disabled for this run
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

This model performs markedly worse than the previous two, with a recorded overall accuracy of 54%. We can disregard it for the moment.
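
For reference, TF-IDF scales each raw count by how rare the term is across documents. The sketch below uses the smoothed IDF formula that `TfidfTransformer` applies by default (`smooth_idf=True`) and leaves out the L2 normalisation it performs afterwards; the toy documents and helper names are made up.

```python
import math
from collections import Counter

# Made-up, already-tokenised documents.
docs = [
    ["java", "split", "string"],
    ["java", "parse", "json"],
    ["java", "maven", "build"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term count multiplied by idf, without any normalisation."""
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(docs[0])
print(weights)
```

A word such as "java" that occurs in every document receives the lowest possible weight, while rarer words are boosted.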

# Bernoulli Naive Bayes Model

This algorithm focuses on whether each n-gram occurs rather than on how often it occurs. Each document is represented as a vector of booleans marking the presence or absence of each n-gram.

After some research I found that this model is said to perform better on shorter documents. 
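
A minimal sketch of what such a pipeline might look like is given below. It is trained on made-up data standing in for `df_new['text']` and `df_new['OK']`, and the choice of `ngram_range=(1, 2)` is an assumption carried over from the n-gram experiment rather than a tested configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Made-up training data standing in for df_new['text'] and df_new['OK'].
texts = ["split string java", "parse json java",
         "fix maven build", "set up intellij maven"]
labels = [1, 1, 0, 0]

bernoulli_pipeline = Pipeline([
    # binary=True records presence (1) or absence (0) instead of counts
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2), binary=True)),
    ("classifier", BernoulliNB()),
])

bernoulli_pipeline.fit(texts, labels)
print(bernoulli_pipeline.predict(texts))
```

Unlike the multinomial model, BernoulliNB also penalises a class for n-grams that are absent from a document, which is part of why it tends to be suggested for short texts.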

In [None]:
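from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# assumption: stopWords and wordnet_lemmatizer come from NLTK
# (needs the 'stopwords' and 'wordnet' corpora to be downloaded)
stopWords = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

#initialise a new column
df_new['text'] = ""

# loop through the data frame
for index, row in df_new.iterrows():
    
    #target chunk of data
    words = row['cleaned_body_title']
    tmp =[]
    for word in words.split():
        #stopword removal
        if word not in stopWords:
            #lemmatise
            word = wordnet_lemmatizer.lemmatize(word)
            tmp.append(word)
    df_new.loc[index, 'text'] = ' '.join(tmp)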

In [None]:
df_new = df_new.drop(['cleaned_body_title'], axis=1)

In [None]:
df_new.head()

# 2) Extracting Features from the documents

In [None]:
import numpy as np

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer()
counts = cv.fit_transform(df_new['text'].values)

In [None]:
counts

### list all of the elements in the CountVectorizer

In [None]:
#cv.get_feature_names_out()  # named get_feature_names() before scikit-learn 1.0

# 3) Classifying the Posts

The first classifier I will be implementing is a naive Bayes classifier. It applies Bayes' theorem under the "naive" assumption that each feature (in this case a word count) is independent of every other one given the class, and that each one contributes to the probability that an example belongs to a particular class.
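
Written out, the multinomial naive Bayes decision rule looks roughly as follows. The notation is introduced here for illustration: $x_i$ is the count of word $i$ in a post, $c$ ranges over the two classes, $N_{ci}$ is the total count of word $i$ in the training posts of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the vocabulary size and $\alpha$ is the smoothing parameter (1 by default in `MultinomialNB`).

```latex
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{ci} \Big],
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
```

The sum over words is where the independence assumption shows up: each word adds its own term to the score, regardless of which other words are present.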

## Create, Initialize and train a new MultinomialNB

In [None]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

#targets are the OK column in the df_new dataframe above
targets = df_new['OK'].values
#train the NB classifier
classifier.fit(counts, targets)

### test out the classifier

In [None]:
df.columns[2:4]

In [None]:
#merges title and body into a single chunk
df['Title_Body_Chunk'] = df[df.columns[2:4]].apply(lambda x: ','.join(x),axis=1)

In [None]:
df.Title_Body_Chunk = df.Title_Body_Chunk.apply(str.lower)

In [None]:
from bs4 import BeautifulSoup
from bs4 import Tag

In [None]:
def _remove_attrs(soup):
    # strip every attribute from every tag in the parsed document
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup

In [None]:
#initialise a new column
df['cleaned_body_title'] = ""

# loop through the data frame
for index, row in df.iterrows():

    soup = BeautifulSoup(row['Title_Body_Chunk'], 'html5lib')

    # drop every <code> element so code snippets do not reach the classifier
    for code in soup.find_all("code"):
        code.decompose()
    cleaned = soup.get_text()

    # store the cleaned text in the new column
    df.loc[index, "cleaned_body_title"] = cleaned

In [None]:
df = df.drop(['Title', 'Body', 'Title_Body_Chunk'], axis=1)

Generate a DataFrame with only the classification and the chunk of text

In [None]:
# .copy() so that columns can be added to df_new later without a SettingWithCopyWarning
df_new = df[['cleaned_body_title', 'OK']].copy()

## remove all stopwords and lemmatise remaining values

In [None]:
examples = ["How do I explicitly pass the type argument to a generic Java method? I do not understand how to achieve this", "How do I generate a new eclipse project? I am trying to create a new eclipse project and I need help setting it up"]
example_counts = cv.transform(examples)
predictions = classifier.predict(example_counts)

In [None]:
predictions

#### Notes on the above:

The classifier correctly distinguishes between the two examples above, using only the chunk of text provided for each.

## Pipelining - connecting the process

A pipeline can be introduced to merge both the feature extraction and the classification into one operation.
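
The sketch below illustrates the point on made-up data: the manual two-step route and the pipeline route give the same predictions, but the pipeline takes raw text directly. The texts, labels and variable names here are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up data standing in for df_new['text'] and df_new['OK'].
texts = ["split string java", "parse json java",
         "fix maven build", "set up intellij maven"]
labels = [1, 1, 0, 0]
new_posts = ["compare string java", "maven build fails"]

# Manual route: vectorise, then classify.
cv = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(cv.fit_transform(texts), labels)
manual = classifier.predict(cv.transform(new_posts))

# Pipeline route: one object performs both steps.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(texts, labels)
piped = pipeline.predict(new_posts)

print(manual, piped)
```

Besides being shorter, the pipeline guarantees that during cross-validation the vocabulary is learned from the training fold only, since the vectorizer is refitted on every call to `fit`.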