## Importing and Paths

For this project, 

In [70]:
import sys, nltk, os, sklearn.preprocessing, sklearn.metrics
import tokenHelper as tkn # import custom function for tokenization of text
import numpy as np
from sklearn import naive_bayes as NB, svm as SVM, linear_model as LM
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline

In [2]:
#d = os.path.join('e:', 'VirtualMachines', 'shared', 'tmp')
d = os.path.join('c:', 'data', 'JHU', 'InfoRetriev', 'Program04')
trainFile = os.path.join(d, 'phase1.train.shuf.tsv')
devFile = os.path.join(d, 'phase1.dev.shuf.tsv')
testFile = os.path.join(d, 'phase1.test.shuf.tsv')

## Loading of Files ###

In [3]:
header = ['assessment','docID','title','authors','journal','ISSN','year',
		'language', 'abstract', 'keywords'] # the variable names for file
convFun = {1: lambda b: b[5:]} # to capture hash code without prefix

def readFile(fName):
	return np.genfromtxt(fName, delimiter='\t', dtype=None, names=header,
		comments=None, converters=convFun, encoding='utf-8') # load file

In [4]:
train = readFile(trainFile) # training set
dev = readFile(devFile) # development set
test = readFile(testFile) # test set

In [None]:
## Create binarizer for labeling assessments in the data
# mark +1 as positive, and rest as negative based on data
labelr = sklearn.preprocessing.LabelBinarizer(pos_label=1)
labelr.fit(train['assessment']) # fit on training labels
y_train = labelr.transform(train['assessment']).ravel()
y_actual = labelr.transform(dev['assessment']).ravel()

### Helper Functions ###

In [9]:
# Calculates various stats related to validation
def validationStats(y_Prd, y_Act, msg=''): 
    # confusion matrix, T=true, F=false, N=negative, P=positive
    TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_Act, y_Prd).ravel()
    precision,recall = TP/(TP+FP) , TP/(TP+FN) # precision and recall
    corr,tot = TN+TP , TN+TP+FN+FP # used for accuracy calculation
    print("Using Naive Bayes, %s"%msg)
    print("\tRecall: %u/%u = %.1f%%" % (TP, TP+FN, recall*100) )
    print("\tPrecision: %u/%u = %.1f%%" % (TP, TP+FP, precision*100) )
    print("\tF1 score: %.3f" % (2*precision*recall / (precision+recall)) )
    print("\tAccuracy: %u/%u = %.1f%%" % (corr,tot,corr/tot*100) )
    return (TN, FP, FN, TP)

################################################################################
def concatColsForFeat(rawTxt, cols, delDashStar=True):
    if delDashStar:
        tranTab = str.maketrans('/*','  ')
    else:
        tranTab = str.maketrans('','')
    f = np.vectorize(lambda x: ' '.join(x).translate(tranTab))
    return f(rawTxt[cols])

## Naive Bayes - Using Title Only
The following section builds a pipeline for a Naive Bayes classifier. The pipeline includes two parts:
1. **TF-IDF vectorizer**, which extracts features from input text and builds a document-term matrix based on TF-IDF values. 
  * The text is tokenized via NLTK `word_tokenize` function.
  * Tokens are removed if they are on a list of NLTK English stopwords or any consecutive punctuation.
  * Only the top 10K terms by document frequency is retained, as well as any terms with less than DF<3 are also removed.
  * TF-IDF weights are used for document-term matrix.
1. **Complement Naive Bayes** model is used for classification
  * The algorithm is described in [Rennie et al (2003)](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf), and corrects for severe assumptions of Multinomial or Bernoulli Naive Bayes.
  * It is useful for when the training set has unbalanced classes (in this case 3.2% of the training sample is positive)
  * Laplace smoothing is used with $\alpha=0.05$, due to the large number of features.

In [6]:
NB_pipe = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=5000, min_df=3) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_pred = NB_pipe.fit(train['title'], y_train).predict(dev['title'])

AttributeError: module 'sklearn.naive_bayes' has no attribute 'ComplementNB'

The performance of Naive Bayes model using only title when validated against the development sample is presented as follows

In [291]:
# calculate validation stats

validationStats(y_pred, y_actual, 'features from title only.');

Using Naive Bayes, features from title only.
	Recall: 96/150 = 64.0%
	Precision: 96/622 = 15.4%
	F1 score: 0.249
	Accuracy: 4270/4850 = 88.0%


## Using Title, Abstract, and Keywords

In [50]:
# extract text by cocatenating three fields from CSV file
txt3_train = concatColsForFeat(train, ['title','abstract','keywords'])
txt3_dev = concatColsForFeat(dev, ['title','abstract','keywords'])

In [296]:
NB_pipe_TAK = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=5000, min_df=5) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])
################################################################################
# extract, train, and predict
y3_pred = NB_pipe.fit(txt3_train,y_train).predict(txt3_dev)

The performance of Naive Bayes model using only title when validated against the development sample is presented as follows

In [297]:
validationStats(y3_pred, y_actual, 'features from title+abstract+keywords');

Using Naive Bayes, features from title+abstract+keywords
	Recall: 135/150 = 90.0%
	Precision: 135/1373 = 9.8%
	F1 score: 0.177
	Accuracy: 3597/4850 = 74.2%


## Naive Bayes with Alternative Hyperparameters

In [310]:
NB_pipe2 = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopwordNStem, 
                             max_features=5000, min_df=3) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_2_pred = NB_pipe2.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_2_pred, y_actual, 'features from title only.');

Using Naive Bayes, features from title only.
	Recall: 97/150 = 64.7%
	Precision: 97/705 = 13.8%
	F1 score: 0.227
	Accuracy: 4189/4850 = 86.4%


In [314]:
NB_pipe3 = Pipeline([ # establish pipeline
    ('vect', CountVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=5000, min_df=3) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_3_pred = NB_pipe3.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_3_pred, y_actual, 'features from title only.');

Using Naive Bayes, features from title only.
	Recall: 92/150 = 61.3%
	Precision: 92/584 = 15.8%
	F1 score: 0.251
	Accuracy: 4300/4850 = 88.7%


In [316]:
################################################################################
NB_pipe4 = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=5000, min_df=3, ngram_range=(2,3)) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_4_pred = NB_pipe4.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_4_pred, y_actual, 'features from title only.');

Using Naive Bayes, features from title only.
	Recall: 78/150 = 52.0%
	Precision: 78/716 = 10.9%
	F1 score: 0.180
	Accuracy: 4140/4850 = 85.4%


In [324]:
NB_pipe5 = Pipeline([ # establish pipeline
    ('vect', CountVectorizer(max_features=5000, min_df=3, analyzer='char', 
                             ngram_range=(5,5) ) ), 
    ('clf', NB.ComplementNB(alpha=0.01))
])

y_5_pred = NB_pipe5.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_5_pred, y_actual, 'features from title only.');

Using Naive Bayes, features from title only.
	Recall: 110/150 = 73.3%
	Precision: 110/856 = 12.9%
	F1 score: 0.219
	Accuracy: 4064/4850 = 83.8%


## Alternative Machine Learning Algorithms

In [69]:
SVM_pipe = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=5000, min_df=3, ngram_range=(2,3)) ), 
    ('clf', LM.SGDClassifier(loss='log', penalty='l2', alpha=1e-3, 
                             random_state=1, tol=1e-3) )
])

y_svm_pred = SVM_pipe.fit(train['title'], y_train).predict(dev['title'])
#validationStats(y_svm_pred, y_actual, 'features from title only.');

print(sum(y_svm_pred))

0


In [90]:
SVM_pipe = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=tkn.tokenizeNoPunctStopword, 
                             max_features=5000, min_df=3) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(max_iter=-1, random_state=1, class_weight='balanced') )
])
################################################################################
y_svm_pred = SVM_pipe.fit(train['title'], y_train).predict(dev['title'])
validationStats(y_svm_pred, y_actual, 'features from title only.');

print(sum(y_svm_pred))

Using Naive Bayes, features from title only.
	Recall: 45/150 = 30.0%
	Precision: 45/162 = 27.8%
	F1 score: 0.288
	Accuracy: 4628/4850 = 95.4%
162
