# Programming Assignment 04
Student: John Wu  

For this project, I applied naive Bayes algorithm to the **Systematic Review** data set. The machine learning library, `scikit-learn` is used extensively for this assignment. 

In [1]:
import sys, nltk, os, sklearn.preprocessing, sklearn.metrics
import tokenHelper as tkn # import custom function for tokenization of text
import numpy as np
from sklearn import naive_bayes as NB, svm as SVM, linear_model as LM
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline as skPipeline

In [2]:
d = os.path.join('e:', 'VirtualMachines', 'shared', 'tmp')
#d = os.path.join('c:', 'data', 'JHU', 'InfoRetriev', 'Program04')
trainFile = os.path.join(d, 'phase1.train.shuf.tsv')
devFile = os.path.join(d, 'phase1.dev.shuf.tsv')
testFile = os.path.join(d, 'phase1.test.shuf.tsv')

## Loading of Files ###

In [3]:
header = ['assessment','docID','title','authors','journal','ISSN','year',
		'language', 'abstract', 'keywords'] # the variable names for file

def readFile(fName):
	return np.genfromtxt(fName, delimiter='\t', dtype=None, names=header,
		comments=None, encoding='utf-8') # load file

In [4]:
# reading in files
train = readFile(trainFile) # training set
dev = readFile(devFile) # development set
test = readFile(testFile) # test set

### Helper Functions ###

In [5]:
# Calculates various stats related to validation
def validationStats(y_Prd, y_Act, msg='', algo='naive Bayes'):
    # confusion matrix, T=true, F=false, N=negative, P=positive
    TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_Act, y_Prd).ravel()
    precision,recall = TP/(TP+FP) , TP/(TP+FN) # precision and recall
    corr,tot = TN+TP , TN+TP+FN+FP # used for accuracy calculation
    print("Using %s, %s"%(algo,msg))
    print("\tRecall: %u/%u = %.1f%%" % (TP, TP+FN, recall*100) )
    print("\tPrecision: %u/%u = %.1f%%" % (TP, TP+FP, precision*100) )
    print("\tF1 score: %.3f" % (2*precision*recall / (precision+recall)) )
    print("\tAccuracy: %u/%u = %.1f%%" % (corr,tot,corr/tot*100) )
    return (TN, FP, FN, TP)

# get columns from raw data and concatenate data
def concatColsForFeat(rawTxt, cols, delDashStar=True):
    if delDashStar:
        tranTab = str.maketrans('/*','  ')
    else:
        tranTab = str.maketrans('','')
    f = np.vectorize(lambda x: ' '.join(x).translate(tranTab))
    return f(rawTxt[cols])

## Naive Bayes - Using Title Only
The following section builds a pipeline for a naive Bayes classifier. The pipeline includes two parts:
1. **TF-IDF vectorizer**, which extracts features from input text and builds a document-term matrix based on TF-IDF values. 
  * The text is tokenized via NLTK `word_tokenize` function.
    * It is based on [Treebank tokenization](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html) developed at UPenn.
    * It splits on all whitespaces as well as contractions i.e. "can't" -> "ca", "n't"
    * It tokenizes any consecutive number of punctuations, such as “,”, “?”, “—“, or “…”
    * Punctuations inmixed with letters, such as “03/20/2018” would be tokenized as one word, as well as
things like URL or hyphenated words like “open-faced”
  * Tokens are removed if they are on a list of NLTK English stopwords or any consecutive punctuation.
  * Only the top 10K terms by document frequency is retained, as well as any terms with less than DF<2 are also removed.
  * TF-IDF weights are used for document-term matrix.
1. **Complement Naive Bayes** model is used for the "baselie" classification
  * The algorithm is described in [Rennie et al (2003)](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf), and corrects for severe assumptions of multinomial or Bernoulli naive Bayes.
  * It is useful for when the training set has unbalanced classes (in this case 3.2% of the training sample is positive)
  * Laplace smoothing is used with $\alpha=0.05$, due to the large number of features and high possibility of a term not appearing in the training set.

However, to work correctly in `sklearn`, we must first binarize the asessment labels by marking +1 as positive, and the rest as negative. This binarizer are then used to mark the development sample also.

In [6]:
## Create binarizer for labeling assessments in the data
labelr = sklearn.preprocessing.LabelBinarizer(pos_label=1)
y_train = labelr.fit_transform(train['assessment']).ravel() # training set
y_actual = labelr.transform(dev['assessment']).ravel() # validation set

The following code builds a pipeline as described at the beginning of the section, combining TF-IDF vectorizer and a complement naive Bayes classifier.

In [7]:
NB_pipe = skPipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=10000, min_df=2) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_pred = NB_pipe.fit(train['title'], y_train).predict(dev['title'])

The performance of Naive Bayes model using only title when validated against the development sample is presented as follows, where the various performance statistics are printed.

In [8]:
validationStats(y_pred, y_actual, 'features from title only.');

Using naive Bayes, features from title only.
	Recall: 93/150 = 62.0%
	Precision: 93/569 = 16.3%
	F1 score: 0.259
	Accuracy: 4317/4850 = 89.0%


## Using Title, Abstract, and Keywords

First, we must extract text by cocatenating three fields from CSV file. This is done for both training and development samples.

In [9]:
txt3_train = concatColsForFeat(train, ['title','abstract','keywords'])
txt3_dev = concatColsForFeat(dev, ['title','abstract','keywords'])

Like in the previous section, a pipeline is built, but this time using the expanded set of features. With the increased amouont of text, the maximum allowed number of features is increased as well as the minimum document frequency to qualify as a feature.

In [10]:
NB_pipe_TAK = skPipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=15000, min_df=3) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])
################################################################################
# extract, train, and predict
y3_pred = NB_pipe_TAK.fit(txt3_train,y_train).predict(txt3_dev)

The performance of Naive Bayes model improves appreciably, with a large increase in recall and a small increase in precision. However, training and fitting time also increases as well.

In [11]:
validationStats(y3_pred, y_actual, 'features from title+abstract+keywords');

Using naive Bayes, features from title+abstract+keywords
	Recall: 118/150 = 78.7%
	Precision: 118/562 = 21.0%
	F1 score: 0.331
	Accuracy: 4374/4850 = 90.2%


## Naive Bayes with Alternative Hyperparameters
This section present several alternative Naive Bayes models with different tokenization, vectorization, feature selection, and algorithm. Since fitting a model with title, abstract, and keywords take quite a bit longer, this section will only use title for features to so as to allow the running of multiple alternative setups. The result will be benchmarked with the baselinse.

In [12]:
# 5 Stemming title text
NB_pipe2 = skPipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopwordNStem, 
                             max_features=5000, min_df=2) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_2_pred = NB_pipe2.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_2_pred, y_actual, 'title features, 5-stemmed');

Using naive Bayes, title features, 5-stemmed
	Recall: 96/150 = 64.0%
	Precision: 96/690 = 13.9%
	F1 score: 0.229
	Accuracy: 4202/4850 = 86.6%


5-stemming the title degrades performance, resulting in slightly worse precision.

In [13]:
# Using counts instead of TF-IDF weights
NB_pipe3 = skPipeline([ # establish pipeline
    ('vect', CountVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=10000, min_df=3) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_3_pred = NB_pipe3.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_3_pred, y_actual, 'title features, count vectors');

Using naive Bayes, title features, count vectors
	Recall: 90/150 = 60.0%
	Precision: 90/549 = 16.4%
	F1 score: 0.258
	Accuracy: 4331/4850 = 89.3%


Using count instead of TF-IDF document-term vectors result in similar performance, with slight drop in recall.

In [14]:
# Using word 2 and 3-gram 
NB_pipe4 = skPipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=10000, min_df=1, ngram_range=(2,3)) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_4_pred = NB_pipe4.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_4_pred, y_actual, 'features from title only.');

Using naive Bayes, features from title only.
	Recall: 77/150 = 51.3%
	Precision: 77/659 = 11.7%
	F1 score: 0.190
	Accuracy: 4195/4850 = 86.5%


Using word 2 and 3-gram result in significantly worse performance..

In [15]:
# Using character 4-5 grams
NB_pipe5 = skPipeline([ # establish pipeline
    ('vect', CountVectorizer(max_features=15000, min_df=3, analyzer='char', 
                             ngram_range=(4,5)) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_5_pred = NB_pipe5.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_5_pred, y_actual, 'features from title only.');

Using naive Bayes, features from title only.
	Recall: 101/150 = 67.3%
	Precision: 101/676 = 14.9%
	F1 score: 0.245
	Accuracy: 4226/4850 = 87.1%


Using character 4-5 grams is a large step-up over using word n-grams, but still significantly worse than using word features.

## Alternative Machine Learning Algorithms
In this section, support vector machine (SVM) algorithm is explored to test whether it can provide a better performance than naive Bayes. Like with naive Bayes classifier, we also run into the problem of unbalanced class. Since so much of the data are of the negative class, SVM classifier would get overwhelmed and predict all classes as negative. Therefore, one must weigh the classes so that the training sample would not be biased. Likewise, the input also need to be scaled as SVM works best when the features are between 0 and 1.

In [16]:
# Using features from title only
SVM_pipe = skPipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=10000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, 
                    class_weight='balanced') )
])

y_svm_pred = SVM_pipe.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_svm_pred, y_actual, 'features from title only.', 'SVM');

Using SVM, features from title only.
	Recall: 54/150 = 36.0%
	Precision: 54/207 = 26.1%
	F1 score: 0.303
	Accuracy: 4601/4850 = 94.9%


In [17]:
# Using features from title+abstract+keywords only
SVM3_pipe = skPipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=15000, min_df=3) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, 
                    class_weight='balanced') )
])

y3_svm_pred = SVM_pipe.fit(txt3_train, y_train).predict(txt3_dev)
validationStats(y3_svm_pred, y_actual, 'features from title only.', 'SVM');

Using SVM, features from title only.
	Recall: 22/150 = 14.7%
	Precision: 22/32 = 68.8%
	F1 score: 0.242
	Accuracy: 4712/4850 = 97.2%


From the results above, using SVM increases precision quite a bit, but at a high cost of low recall. Conceptually, this makes sense. SVM are better at splitting between the two classes as the algorithm keeps running until differences are smaller than a threshold. However, this also means the algorithm may not be computationally tractable and is prone to overfitting.

## Predicting Testing Set
The best performing setup so far was complement naive Bayes with all three fields as feature. It strictly dominates all the other naive Bayes method, and while the precision is not as good as using SVM with titles only, the recall is so much higher that it makes up for it. At the same time, it does not suffer from computational complexity problems of SVM algorithm. Therefore, it is best to predict the test set using this setup.

In [18]:
# get features on test set and predict the Y
txt3_test = concatColsForFeat(test, ['title','abstract','keywords'])
y3_test = NB_pipe_TAK.predict(txt3_test)

In [19]:
# write the output file
with open('wu-prog4.txt', 'w') as f:
    for x in zip(test['docID'], y3_test):
        f.write('%s\t%i\n' % (x[0], -1 if x[1]==0 else 1))