# Award Type Classification

<img src="img/AwardType_rep.png" width="500">

There are four types of award:

1. **Standard Grant** <br>
    NSF provides a specific level of support for a specified period of time with
    no statement of NSF intent to provide additional future support without
    submission of another proposal.<br><br>

2. **Continuing grant** <br>
    NSF provides a specific level of support for an initial specified period of time,
    usually a year, with a statement of intent to provide additional support of the
    project for additional periods, provided funds are available and the results
    achieved warrant further support.<br><br>

3. **Fellowship** <br>
    The purpose of the NSF Graduate Research Fellowship Program (GRFP) is to help
    ensure the vitality and diversity of the scientific and engineering workforce
    of the United States. The program recognizes and supports outstanding graduate
    students who are pursuing research-based master's and doctoral degrees in science,
    technology, engineering, and mathematics (STEM) or in STEM education. The GRFP
    provides three years of support for the graduate education of individuals who 
    have demonstrated their potential for significant research achievements in STEM 
    or STEM education.<br><br>

4. **Cooperative Agreement** <br>
    A type of assistance award which should be used when substantial agency
    involvement is anticipated during the project performance period. Substantial
    agency involvement may be necessary when an activity is technically and/or
    managerially complex and requires extensive or close coordination between NSF
    and the awardee.<br><br>

Three other award types (GAA, Intergovernmental Personnel Award, and Personnel Agreement) were discarded because they are rare and/or their meaning could not be determined. Combined, they represent less than 0.05% of the data set.
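
The share of the discarded types can be checked before dropping them. The snippet below is a minimal sketch on an invented list of labels, not the project's actual filtering code; the label spellings are assumed to match the names above.

```python
from collections import Counter

# Hypothetical award-type column; the real data set is far larger.
award_types = ["Standard Grant"] * 6 + ["Continuing grant"] * 3 + ["GAA"]

# Labels to drop (spellings assumed, taken from the text above).
DISCARDED = {"GAA", "Intergovernmental Personnel Award", "Personnel Agreement"}

counts = Counter(award_types)
# Fraction of rows that belong to one of the discarded types.
share_discarded = sum(counts[t] for t in DISCARDED) / len(award_types)
# Keep only the rows whose type is not discarded.
kept = [t for t in award_types if t not in DISCARDED]

print("discarded share: {:.2%}".format(share_discarded))
print(Counter(kept))
```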


## Predict Award Type based on Abstract

The idea is to run a bag-of-words or N-gram analyzer on each abstract and use the resulting terms as features for a classification algorithm.
Each abstract is preprocessed to remove leftover markup tags and web links. Stop words and first names are also removed, and lemmatization is applied to reduce plural forms to their singular.
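
The cleaning and N-gram steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code of the custom `utilsvectorizer` module: the regular expressions, the tiny stop-word set, and the function names `clean_abstract`, `tokenize`, and `ngrams` are all hypothetical, and lemmatization is left out.

```python
import re

# Hypothetical stand-ins for the project's own helpers and word lists.
TAG_RE = re.compile(r"<[^>]+>")                   # markup tags such as <br/>
URL_RE = re.compile(r"(?:https?://|www\.)\S+")    # web links
TOKEN_RE = re.compile(r"\b[a-zA-Z][a-zA-Z]+\b")   # words of two or more letters
STOP_WORDS = {"the", "of", "and", "is", "at", "by", "see", "john"}  # toy list with one first name


def clean_abstract(text):
    """Replace tags and web links with spaces, then lowercase."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return text.lower()


def tokenize(text):
    """Split cleaned text into word tokens and drop stop words."""
    return [tok for tok in TOKEN_RE.findall(clean_abstract(text))
            if tok not in STOP_WORDS]


def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


raw = "The study of polar ice<br/>is led by John; see http://example.org/ice"
tokens = tokenize(raw)
print(tokens)
print(ngrams(tokens, 1, 2))
```

With `n_min = n_max = 1` the result is a plain bag of words; widening the range adds longer word sequences as extra features.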

Scikit-learn was used to build a pipeline that chains vectorization (bag of words/N-grams) with a classification algorithm (Multinomial Naive Bayes). Vectorization is followed by tf-idf weighting (term frequency multiplied by inverse document frequency), which normalizes the counts and limits the effect of high-frequency words.
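
To make the tf-idf weighting concrete, here is a small worked sketch in plain Python. It assumes the default settings of scikit-learn's `TfidfTransformer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); the three-document corpus and the helper names `idf` and `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data).
docs = [
    ["grant", "research", "research"],
    ["grant", "fellowship"],
    ["grant", "research", "agreement"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf(doc):
    """Raw counts times idf, then scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


for doc in docs:
    print({term: round(w, 3) for term, w in sorted(tfidf(doc).items())})
```

A term such as "grant", which occurs in every document, gets the minimum idf of 1, while a term confined to one document ("fellowship") is weighted more heavily.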

After optimization, an accuracy of 77% was achieved.
The confusion matrix is shown below:

<img src="img/ConfusionMat_AwardType.png" width="700">
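
For reference, accuracy and a confusion matrix can be computed from paired true and predicted labels as in the minimal sketch below. The eight label pairs are invented for illustration and are unrelated to the reported results. Rows are true labels and columns are predicted labels, the same orientation that `sklearn.metrics.confusion_matrix` returns (the notebook transposes the matrix before plotting, so its figure has true labels on the x-axis).

```python
from collections import Counter

labels = ["Continuing grant", "Cooperative Agreement", "Fellowship", "Standard Grant"]

# Hypothetical true and predicted labels for eight abstracts.
y_true = ["Standard Grant", "Standard Grant", "Standard Grant", "Standard Grant",
          "Continuing grant", "Continuing grant", "Fellowship", "Cooperative Agreement"]
y_pred = ["Standard Grant", "Standard Grant", "Standard Grant", "Continuing grant",
          "Continuing grant", "Standard Grant", "Fellowship", "Standard Grant"]

# Count each (true, predicted) pair once.
pair_counts = Counter(zip(y_true, y_pred))
# matrix[i][j]: number of samples with true label i predicted as label j.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Accuracy is the diagonal (correct predictions) over all samples.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
# Recall per class: correct predictions over all samples of that true class.
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

for label, row in zip(labels, matrix):
    print("{:>22} {}".format(label, row))
print("accuracy = {:.3f}".format(accuracy))
print(recall)
```

In this toy example the majority class absorbs most of the errors, which is why per-class recall is worth reading alongside overall accuracy when classes are imbalanced.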

## Python modules
  * utilsvectorizer (custom module for vectorization: bag of words, N-grams)
  * Abstract_transformation (custom module to clean Abstract data)
  * AwardInstr_transformation (custom module to clean Award type data)
  * sklearn (Multinomial Naive Bayes)
  * nltk

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Wed Nov  8 23:52:41 2017

@author: herma
"""

# import custom vectorizer and associated function
from utils import utilsvectorizer

# don't use the stop words from nltk
#from nltk.corpus import stopwords
#sw = stopwords.words('english')
## the sklearn stop word list is more extensive; ENGLISH_STOP_WORDS is the same
## list as stop_words='english' for CountVectorizer
## (imported from sklearn.feature_extraction.text, which works across sklearn versions)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# add the list of first names from nltk. ATTENTION: names has duplicates, so use
# union(); each name starts with a CAPITAL LETTER, hence lower()
from nltk.corpus import names
firstname_corp = [na.lower() for na in names.words()]
sw = ENGLISH_STOP_WORDS.union(firstname_corp)


# GET DATA
#######-----------------------------------------------------------------#######
from utils import Abstract_transformation as abt
# get data set
df_corpus = abt.get_Abstract('Abstract_full_Startdate.csv')
#######-----------------------------------------------------------------#######
from utils import AwardInstr_transformation as awt
# get Target
df_Award_Instr_target = awt.get_Award_Instrument('DB_1960_to_2017.csv')
#######-----------------------------------------------------------------#######

# MERGE
#######-----------------------------------------------------------------#######
import pandas as pd
# merge corpus and target on AwardID. AwardID is conserved
df = pd.merge(df_corpus, df_Award_Instr_target, how='inner', on=['AwardID'])
#######-----------------------------------------------------------------#######
# temporarily downsize the data (for quick tests)
#df = df.iloc[:10]

# LABEL
#######-----------------------------------------------------------------#######
# label those categories
target = df.AwardInstrument
from sklearn.preprocessing import LabelEncoder
Award_Instr_encoder = LabelEncoder()
Award_Instr_coded = Award_Instr_encoder.fit_transform(target)
#######-----------------------------------------------------------------#######


###############################################################################
# divide data into train and test set
from sklearn.model_selection import train_test_split
# split data/target in train and test sets
corpus_train, corpus_test, target_train, target_test = train_test_split(
    df.Raw_Abstract, Award_Instr_coded, test_size=0.3, random_state=42)
# retrieve target names
target_train_names = Award_Instr_encoder.inverse_transform(target_train)
target_test_names = Award_Instr_encoder.inverse_transform(target_test)
target_names_list = Award_Instr_encoder.classes_

################################################################################
## Define pipeline
## CountVectorizer--->tf-idf--->Naive Bayes
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from time import time
import matplotlib.pyplot as plt
import seaborn as sns

text_clf = Pipeline([
    ('CustomVect', utilsvectorizer.CustomVectorizer(
        ngram_range=(3, 4),
        min_df=1,
        max_df=1.0,
        analyzer='word',
        stop_words=sw,
        strip_accents='unicode',
        token_pattern=r'(?u)\b[a-zA-Z][a-zA-Z]+\b',
        preprocessor=utilsvectorizer.remove_Tag_Http)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB(alpha=1e-2)),
])
t0 = time()
# train, get model named text_clf
text_clf.fit(corpus_train, target_train) 
print("done in %0.3fs" % (time() - t0))
# test
predicted = text_clf.predict(corpus_test)
print( 'Accuracy = {:.4f}'.format( np.mean(predicted == target_test) ) )
# 69 %
# more metrics
from sklearn import metrics
# precision, recall, f1 score
print(metrics.classification_report(target_test,
                                    predicted,
                                    target_names=target_names_list))
# confusion matrix
mat = metrics.confusion_matrix(target_test, predicted)
print(mat)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=target_names_list,
            yticklabels=target_names_list)
plt.xlabel('true label')
plt.ylabel('predicted label');

## save model
## (sklearn.externals.joblib was removed in scikit-learn 0.23; use the joblib package)
#import joblib
#joblib.dump(text_clf, 'NB_default_Model.pkl')
##text_clf = joblib.load('NB_default_Model.pkl')

# ngram_range=(1, 2) ---> 77%

# ngram_range=(2, 3) ---> 78%
#Accuracy = 0.78
#                       precision    recall  f1-score   support
#
#     Continuing grant       0.71      0.59      0.64     31671
#Cooperative Agreement       0.72      0.27      0.40      1058
#           Fellowship       0.95      0.61      0.74      1850
#       Standard Grant       0.80      0.88      0.84     63765
#
#          avg / total       0.77      0.78      0.77     98344

###############################################################################
## Grid search
#from sklearn.model_selection import GridSearchCV
#
## ngram_range: unigrams only (bag of words) or unigrams plus bigrams
## use_idf: enable inverse-document-frequency reweighting
## alpha: additive (Laplace/Lidstone) smoothing parameter of MultinomialNB
## 3 parameters with 2 choices each: 2^3 = 8 combinations
## parameter keys must use the pipeline step names ('CustomVect', 'tfidf', 'clf')
#parameters = {'CustomVect__ngram_range': [(1, 1), (1, 2)],
#               'tfidf__use_idf': (True, False),
#               'clf__alpha': (1e-2, 1e-3),
#}
#
#gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
#gs_clf = gs_clf.fit(corpus_train, target_train)
#
#print(gs_clf.best_score_)
#for param_name in sorted(parameters.keys()):
#     print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
## load the results into a pandas DataFrame to inspect them
#pd.DataFrame(gs_clf.cv_results_)
#
## save model
## (sklearn.externals.joblib was removed in scikit-learn 0.23; use the joblib package)
#import joblib
#joblib.dump(text_clf, 'SVM_default_Model.pkl')
##text_clf = joblib.load('SVM_default_Model.pkl')