# Training and Classifying with Title and Tag Chunks

In this experiment I compare two techniques. First, I use only the tags as input to the classification models. Then I combine the tags and the title of each post and check whether this improves the F1 score.

Steps:

1. Clean the tags from `<tag>` format to `tag` (removing the angle brackets)
2. Merge the tags and the titles into a single chunk
3. Convert all words to lowercase
4. Extract features using a `CountVectorizer` instance
5. Train and evaluate various classification models:
    * Naive Bayes
    * Naive Bayes n-grams
    * Naive Bayes n-grams TF-IDF
    * Bernoulli Naive Bayes
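
Steps 1 to 3 can be sketched on a single post as follows. This is only an illustration: the helper name `build_chunk` is hypothetical and the notebook cells below do the same work with pandas on the whole dataframe (and additionally drop the `java` tag).

```python
import re


def build_chunk(tags: str, title: str) -> str:
    """Turn '<java><audio>' plus a title into one lowercase text chunk."""
    # Step 1: pull the tag names out of the <tag> markers.
    tag_names = re.findall(r"<([^<>]+)>", tags)
    # Step 2: merge the tags and the title into a single string.
    chunk = " ".join(tag_names) + " " + title
    # Step 3: lowercase everything.
    return chunk.lower()


chunk = build_chunk("<java><audio>", "How can I play sound in Java?")
print(chunk)
```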

In [1]:
import pandas 
df = pandas.read_csv('./data/procedural_casual_Q_1500_SO_Java.csv')

In [2]:
df.head()

Unnamed: 0,Id,Score,Body,Title,Tags,AnswerCount,CommentCount,FavoriteCount,ViewCount,OK
0,13225,8,<p>I've recently inherited a internationalized...,How can I refactor HTML markup out of my prope...,<java><jsp><internationalization><struts>,4,0,,2078,0
1,24991,19,<p>I have defined a Java function:</p>\n\n<pre...,Why can't I explicitly pass the type argument ...,<java><generics><syntax>,4,1,6.0,21171,1
2,24866,11,<p>I am using Java back end for creating an XM...,Is it essential that I use libraries to manipu...,<java><xml>,11,0,,690,0
3,25449,29,<p>I want to create a Java program that can be...,How to create a pluginable Java program?,<java><plugins><plugin-architecture>,6,1,18.0,17544,1
4,26305,151,<p>I want to be able to play sound files in my...,How can I play sound in Java?,<java><audio>,9,1,57.0,262318,1


# 1- Training and Evaluating - Tags Only!

## Create a new dataframe with only tags and target

In [3]:
df_new = df[['Tags', 'OK']]

In [4]:
df_new.Tags.head()

0    <java><jsp><internationalization><struts>
1                     <java><generics><syntax>
2                                  <java><xml>
3         <java><plugins><plugin-architecture>
4                                <java><audio>
Name: Tags, dtype: object

In [5]:
# Work on an explicit copy so pandas does not warn about writing to a slice of df
df_new = df_new.copy()
# Replace the angle brackets with spaces and drop the substring 'java'
tags = df_new['Tags'].str.replace('>', ' ').str.replace('<', ' ')
df_new['cleaned_tags'] = tags.str.replace('java', '')
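
One side effect of the cleaning cell is worth seeing in isolation: `.replace('java', '')` removes the substring wherever it occurs, so a tag such as `javascript` becomes `script`. Whether that is intended is not stated here. The sketch below compares the notebook's chained replace with a hypothetical alternative, `clean_tags`, that drops only the exact tag `java`.

```python
import re


def clean_tags(raw: str, drop=("java",)) -> str:
    """Hypothetical helper: keep every tag except exact matches in `drop`."""
    tags = re.findall(r"<([^<>]+)>", raw)
    return " ".join(tag for tag in tags if tag not in drop)


raw = "<java><javascript><xml>"

# The notebook's approach: substring removal also shortens 'javascript'.
substring_cleaned = raw.replace(">", " ").replace("<", " ").replace("java", "")
# The alternative: only the standalone 'java' tag is removed.
exact_cleaned = clean_tags(raw)

print(substring_cleaned.split())
print(exact_cleaned.split())
```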


In [6]:
df_new = df_new.drop(['Tags'], axis=1)

# Training models on tags only

In [7]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score



In [8]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
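
To make the vectorizer step concrete, here is a small sketch of what `CountVectorizer` produces for tag strings. The two input strings are made-up examples in the style of `cleaned_tags`. With the default tokenizer, a hyphenated tag such as `plugin-architecture` is split into two tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up tag strings in the style of the cleaned_tags column.
docs = ["jsp internationalization struts", "plugins plugin-architecture"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort for a stable listing.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.toarray())
```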

In [9]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['cleaned_tags'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['cleaned_tags'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)
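
The `sklearn.cross_validation` module used here was removed in later scikit-learn releases. A minimal sketch of the same loop with `sklearn.model_selection.KFold` is shown below, assuming a recent scikit-learn (the `zero_division` argument needs 0.22 or newer). The texts and labels are toy stand-ins, not the dataset used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical), alternating positive and negative labels.
texts = np.array(["generics syntax", "xml parsing", "audio sound", "maven jar",
                  "generics wildcard", "xml jar", "audio playback", "maven build"])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

kf = KFold(n_splits=4)  # replaces KFold(n=..., n_folds=...)
scores = []
confusion = np.zeros((2, 2), dtype=int)

# split() yields the index arrays that the old KFold object yielded directly.
for train_ind, test_ind in kf.split(texts):
    toy_pipeline.fit(texts[train_ind], labels[train_ind])
    predictions = toy_pipeline.predict(texts[test_ind])
    confusion += confusion_matrix(labels[test_ind], predictions, labels=[0, 1])
    scores.append(f1_score(labels[test_ind], predictions, pos_label=1, zero_division=0))

print("Mean F1 over folds:", sum(scores) / len(scores))
print(confusion)
```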

In [10]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.813187302328
Confusion matrix:
[[781  86]
 [140 492]]
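
scikit-learn lays the binary confusion matrix out as `[[TN, FP], [FN, TP]]`, so precision, recall and a pooled F1 can be read off the printed matrix. The sketch below does this for the matrix above. The pooled F1 is computed over all folds at once, so it can differ slightly from the printed score, which is the mean of the per-fold F1 values.

```python
# Summed confusion matrix printed above, laid out as [[TN, FP], [FN, TP]].
confusion = [[781, 86], [140, 492]]
(tn, fp), (fn, tp) = confusion

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
pooled_f1 = 2 * tp / (2 * tp + fp + fn)

print("precision: %.4f" % precision)
print("recall:    %.4f" % recall)
print("pooled F1: %.4f" % pooled_f1)
```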


# N-grams

In [11]:
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])
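
With `ngram_range=(1, 2)` the vectorizer keeps the single tokens and adds every pair of adjacent tokens as an extra feature. A small sketch on made-up tag strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up tag strings; the first has two tokens, the second only one.
docs = ["generics syntax", "xml"]

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(docs)

# Unigrams plus the one adjacent pair that occurs in the data.
features = sorted(bigram_vectorizer.vocabulary_)
print(features)
```

Because bigrams are built from adjacent tokens, the order in which tags appear in a post influences which bigram features are created.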

In [12]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['cleaned_tags'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['cleaned_tags'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [13]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.816174667223
Confusion matrix:
[[783  84]
 [139 493]]


# TF-IDF

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),  # ngram_range=(1, 2) left disabled: unigrams only
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
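
The effect of the `TfidfTransformer` step can be seen on a tiny made-up corpus. With the default settings, each row is scaled to unit L2 length, and a token that occurs in fewer documents receives a larger weight than one that occurs everywhere.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up corpus: 'xml' is in both documents, 'audio' only in the second.
docs = ["xml xml parsing", "xml audio"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts).toarray()

# Length of each document vector after the transform.
row_norms = np.linalg.norm(tfidf, axis=1)

# Weights of the two tokens in the second document.
audio_weight = tfidf[1, count_vectorizer.vocabulary_["audio"]]
xml_weight = tfidf[1, count_vectorizer.vocabulary_["xml"]]

print(row_norms)
print(audio_weight, xml_weight)
```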

In [15]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['cleaned_tags'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['cleaned_tags'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [16]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.804761296112
Confusion matrix:
[[791  76]
 [155 477]]


# Bernoulli 

In [17]:
from sklearn.naive_bayes import BernoulliNB

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', BernoulliNB(binarize=0.3))
])
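
`BernoulliNB` turns each feature into a 0/1 flag: values above the `binarize` threshold become 1, everything else 0. The sketch below applies the same thresholding with `sklearn.preprocessing.binarize` to a made-up count matrix. Since the counts from `CountVectorizer` are whole numbers, a threshold of 0.3 gives the same flags as a threshold of 0.0.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Made-up term counts for two documents and three features.
counts = np.array([[0.0, 1.0, 2.0],
                   [3.0, 0.0, 1.0]])

# Values strictly greater than the threshold map to 1, the rest to 0.
flags_at_03 = binarize(counts, threshold=0.3)
flags_at_00 = binarize(counts, threshold=0.0)

print(flags_at_03)
print(np.array_equal(flags_at_03, flags_at_00))
```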

In [18]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['cleaned_tags'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['cleaned_tags'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [19]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.806190380159
Confusion matrix:
[[769  98]
 [138 494]]


# Combining the tags with title 

In [20]:
# Copy the post titles over from the original dataframe (both share the same index)
df_new['title'] = df['Title']
    

In [21]:
df_new['title_tag_chunk'] = df_new['cleaned_tags'] + ',' + df_new['title']

In [22]:
df_new.title_tag_chunk = df_new.title_tag_chunk.apply(str.lower)

In [23]:
df_new = df_new.drop(['cleaned_tags', 'title'], axis=1)

In [27]:
from nltk.stem.snowball import SnowballStemmer
import nltk
stemmer = SnowballStemmer("english")

In [28]:
# loop through the data frame
for index, row in df_new.iterrows():
    
    #target chunk of data
    words = row['title_tag_chunk']
    tmp =[]
    for word in words.split():
        #stem the word (Snowball stemming, not lemmatisation)
        word = stemmer.stem(word)
        tmp.append(word)
    df_new.loc[index, 'title_tag_chunk'] = ' '.join(tmp)

In [31]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [32]:
from sklearn.naive_bayes import MultinomialNB


In [33]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

In [34]:
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score

kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['title_tag_chunk'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['title_tag_chunk'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [35]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.855323483373
Confusion matrix:
[[762 105]
 [ 77 555]]


In [69]:
import pickle
# Persist the fitted pipeline; the context manager closes the file handle
with open('./models/multinomialnb_post_classifier_title_tag.sav', 'wb') as f:
    pickle.dump(pipeline, f)
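
A saved pipeline can be loaded back with `pickle.load` and used for prediction straight away. The sketch below shows the round trip on a toy pipeline written to a temporary directory. The training texts, labels and file name are made up for the illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical), only to have a fitted pipeline to save.
texts = ["iterate over map", "loop over list", "add jar maven", "maven build fails"]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
]).fit(texts, labels)

# Write the fitted pipeline to disk and read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_classifier.sav")
with open(path, "wb") as f:
    pickle.dump(toy_pipeline, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["iterate over list"]))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.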

### Predictions

In [36]:
examples = ["how to efficiently iterate over each entry in a map", "how to add local jar files to a maven project"]
pipeline.predict(examples)

array([1, 0])
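
Besides hard labels, the pipeline can also report class probabilities through `predict_proba`, which gives a rough idea of how confident the model is. The sketch below uses a toy pipeline with made-up training data, so its numbers say nothing about the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical).
texts = ["iterate over map entries", "loop over list",
         "add jar to maven", "maven build fails"]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
]).fit(texts, labels)

# One row per input; columns follow the order of classes_.
probabilities = toy_pipeline.predict_proba(["how to iterate over a list"])
print(toy_pipeline.classes_)
print(probabilities)
```

The training chunks in this notebook were lowercased and stemmed before fitting, so applying the same preprocessing to new inputs may be worth trying before calling `predict`.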

# N-Grams 

In [37]:
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

In [38]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['title_tag_chunk'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['title_tag_chunk'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [39]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.852974814957
Confusion matrix:
[[763 104]
 [ 81 551]]


In [73]:
import pickle
# Persist the fitted pipeline; the context manager closes the file handle
with open('./models/multinomialnb_classifier_ngrams_title_tag.sav', 'wb') as f:
    pickle.dump(pipeline, f)

# TF-IDF

In [40]:
from sklearn.feature_extraction.text import TfidfTransformer

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),  # ngram_range=(1, 2) left disabled: unigrams only
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

In [41]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['title_tag_chunk'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['title_tag_chunk'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [42]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.840610880414
Confusion matrix:
[[785  82]
 [112 520]]


# Bernoulli

In [43]:
from sklearn.naive_bayes import BernoulliNB

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('classifier', BernoulliNB(binarize=0.0))
])

In [44]:
kf = KFold(n=len(df_new), n_folds = 10)
scores = []  # holds the F1 score for each fold
confusion = np.array([[0,0], [0,0]]) #initialize the confusion matrix

for train_ind, test_ind in kf:
    
    #training data(x) and classification(y)
    train_x = df_new.iloc[train_ind]['title_tag_chunk'].values
    train_y = df_new.iloc[train_ind]['OK'].values
    
    #testing data(x) and classification(y)
    test_x = df_new.iloc[test_ind]['title_tag_chunk'].values
    test_y = df_new.iloc[test_ind]['OK'].values
    
    #train and predict each of the values
    pipeline.fit(train_x, train_y)
    predictions = pipeline.predict(test_x)
    
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=1)
    scores.append(score)

In [45]:
print('Total posts classified:', len(df_new))
print('Score:', sum(scores)/len(scores))
print('Confusion matrix:')
print(confusion)

Total posts classified: 1499
Score: 0.829413991311
Confusion matrix:
[[761 106]
 [106 526]]


# Reports

## Tags-only

| Classifier               | Features           | F1-Score       | TN  | FP | FN  | TP  |
|--------------------------|--------------------|----------------|-----|----|-----|-----|
| Multinomial Naive Bayes  | Bag-of-words       | 0.813187302328 | 781 | 86 | 140 | 492 |
| Multinomial Naive Bayes  | Bigram Counts      | 0.816174667223 | 783 | 84 | 139 | 493 |
| Multinomial Naive Bayes  | Unigram TF-IDF     | 0.804761296112 | 791 | 76 | 155 | 477 |
| Bernoulli Naive Bayes    | Bigram Occurrences | 0.806190380159 | 769 | 98 | 138 | 494 |

## Tags-Titles Chunked

| Classifier               | Features            | F1-Score       | TN  | FP  | FN  | TP  |
|--------------------------|---------------------|----------------|-----|-----|-----|-----|
| Multinomial Naive Bayes  | Bag-of-words        | 0.855323483373 | 762 | 105 | 77  | 555 |
| Multinomial Naive Bayes  | Bigram Counts       | 0.852974814957 | 763 | 104 | 81  | 551 |
| Multinomial Naive Bayes  | Unigram TF-IDF      | 0.840610880414 | 785 | 82  | 112 | 520 |
| Bernoulli Naive Bayes    | Unigram Occurrences | 0.829413991311 | 761 | 106 | 106 | 526 |