# Algorithm Benchmarking

The goal of this notebook is to benchmark and explore different algorithms and models for classifying O\*Net Task data into General Work Activities (GWA). The most promising algorithms will then be tuned and used for a similar task on Occupational Requirements Survey data.

**Last Updated**: Wednesday July 24, 2019
<br> **Author**: Rebecca Hu

In [50]:
import pandas as pd
import numpy as np

In [51]:
#Load in O*Net data
onet = pd.read_csv('onet_tasks_gwas.csv')
onet.head()

| Unnamed: 0 | Task | GWA |
|---|---|---|
| 0 | Review and analyze legislation, laws, or publi... | Analyzing Data or Information |
| 1 | Review and analyze legislation, laws, or publi... | Provide Consultation and Advice to Others |
| 2 | Direct or coordinate an organization's financi... | Guiding, Directing, and Motivating Subordinates |
| 3 | Confer with board members, organization offici... | Communicating with Supervisors, Peers, or Subo... |
| 4 | Analyze operations to evaluate performance of ... | Analyzing Data or Information |


In [52]:
#Create training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(onet['Task'], onet['GWA'], test_size=.10)
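
The split above is neither seeded nor stratified, so each run produces a different test set and rare GWAs can end up missing from it. `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. As a rough illustration of what a seeded, stratified split does, here is a minimal pure-Python sketch; the function name `stratified_split` and the toy data are hypothetical and not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.10, seed=0):
    """Split so each class sends roughly `test_size` of its rows to the test set."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    test_idx = set()
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.update(idxs[:n_test])

    train = [(texts[i], labels[i]) for i in range(len(texts)) if i not in test_idx]
    test = [(texts[i], labels[i]) for i in sorted(test_idx)]
    return train, test

# Toy data (hypothetical): 20 rows of class "A" and 10 rows of class "B".
toy_labels = ["A"] * 20 + ["B"] * 10
toy_texts = [f"task {i}" for i in range(30)]
train, test = stratified_split(toy_texts, toy_labels, test_size=0.10, seed=42)
print(len(train), len(test))
```

With a 10% test size, class "A" contributes two test rows and class "B" contributes one, and rerunning with the same seed returns the same split.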

In [53]:
# Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)
analyzer = TfidfVectorizer(stop_words = 'english').build_analyzer() #TfidfVectorizer(stop_words = None).build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
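
Every pipeline below passes the stemmed tokens to `TfidfVectorizer`. As a rough illustration of what that step computes, the sketch below builds TF-IDF weights by hand using the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which corresponds to scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). It tokenises on whitespace and does no stemming; the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Toy documents (hypothetical), loosely styled after task descriptions.
docs = ["analyze financial data", "analyze legal documents", "repair engines"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in many documents (here "analyze") receive a lower weight than terms unique to one document.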

### Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support

logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('logit', LogisticRegression())])

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.645
**********************************************************
Test Accuracy:  0.52




In [7]:
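# Per-class metrics follow the order of the `labels` argument (y_train.unique()),
# so class names must be read from that same array, not from y_test.unique().
class_names = y_train.unique()
for i in range(len(test_p)):
    print(class_names[i])
    print('******************************')
    print('Precision: ', test_p[i])
    print('Recall: ', test_r[i])
    print('F1 Score: ', test_f1[i])
    print('Support: ', test_s[i])
    print('                                ')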

Making Decisions and Solving Problems
******************************
Precision:  0.6071428571428571
Recall:  0.5862068965517241
F1 Score:  0.5964912280701754
Support:  58
                                
Documenting/Recording Information
******************************
Precision:  0.65625
Recall:  0.6268656716417911
F1 Score:  0.6412213740458016
Support:  67
                                
Monitoring and Controlling Resources
******************************
Precision:  0.5833333333333334
Recall:  0.25
F1 Score:  0.35000000000000003
Support:  28
                                
Provide Consultation and Advice to Others
******************************
Precision:  0.4748743718592965
Recall:  0.78099173553719
F1 Score:  0.590625
Support:  242
                                
Interpreting the Meaning of Information for Others
******************************
Precision:  0.325
Recall:  0.24528301886792453
F1 Score:  0.27956989247311825
Support:  53
                                
Handling and M

In [8]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.55
Average Recall:  0.383
Average F1 Score:  0.416
Average Support:  54.676


scikit-learn's implementation of logistic regression handles multiple classes automatically, but it assigns exactly one label per task; it does not perform multi-label classification. A simple logistic regression works quite well (test accuracy 0.52) considering we have 41 (really only 37) unique classes. We will treat this model as our baseline. Next, let's try multinomial naive Bayes.
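
The first two rows of `onet.head()` appear to pair the same task text with two different GWAs. If that pattern is common, a single-label classifier can match at most one of those rows per task, which caps the accuracy it can reach. The sketch below estimates that ceiling and collects the label set per task; the function names and the shortened toy labels are hypothetical, and how often duplicates occur in the real data is not checked here.

```python
from collections import Counter, defaultdict

def label_sets(tasks, gwas):
    """Group every GWA seen for a task text into one set per task."""
    grouped = defaultdict(set)
    for task, gwa in zip(tasks, gwas):
        grouped[task].add(gwa)
    return dict(grouped)

def single_label_ceiling(tasks, gwas):
    """Best accuracy any one-label-per-text classifier could reach on these rows."""
    counts = defaultdict(Counter)
    for task, gwa in zip(tasks, gwas):
        counts[task][gwa] += 1
    # For each distinct text, the best a classifier can do is predict its most common label.
    best_hits = sum(max(c.values()) for c in counts.values())
    return best_hits / len(tasks)

# Toy rows (hypothetical): the first task appears twice with different labels.
tasks = ["review legislation", "review legislation", "direct finances", "confer with board"]
gwas = ["analyzing", "advising", "directing", "communicating"]

print(label_sets(tasks, gwas)["review legislation"])
print(single_label_ceiling(tasks, gwas))
```

On the toy rows the ceiling is 0.75, because only one of the two "review legislation" rows can ever be predicted correctly.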

---

### Multinomial Naive Bayes

In [11]:
from sklearn.naive_bayes import MultinomialNB

mnb_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('multinom nb', MultinomialNB())])

mnb_pipe.fit(X_train, y_train)

train_predicted = mnb_pipe.predict(X_train)
test_predicted = mnb_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.485
**********************************************************
Test Accuracy:  0.402




In [12]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.327
Average Recall:  0.194
Average F1 Score:  0.195
Average Support:  54.676


Multinomial naive Bayes performed noticeably worse than the baseline: test accuracy fell from 0.52 to 0.402 and average F1 from 0.416 to 0.195. Let's try a more complex model, a random forest.
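
A note on reading these numbers: the "Average" lines are unweighted (macro) means over classes, so a class with 28 test rows counts as much as one with 242. Accuracy, by contrast, equals the support-weighted mean of per-class recall. That is one plausible reason the naive Bayes model can show a test accuracy of 0.402 next to an average recall of 0.194: a model that favours frequent classes scores well on accuracy and poorly on the macro average. The toy labels below are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return ({label: recall}, {label: support}) for every label in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recall = {label: hits[label] / support[label] for label in support}
    return recall, support

# Toy data (hypothetical): one large class and two small ones.
y_true = ["A"] * 8 + ["B"] + ["C"]
y_pred = ["A"] * 10  # a model that always predicts the majority class

recall, support = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in recall) / len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("macro recall:   ", round(macro_recall, 3))
print("weighted recall:", round(weighted_recall, 3))
print("accuracy:       ", round(accuracy, 3))
```

In scikit-learn the same distinction is available through the `average='macro'` and `average='weighted'` options of `precision_recall_fscore_support`.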

---

### Random Forest

In [35]:
from sklearn.ensemble import RandomForestClassifier

rf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('rf', RandomForestClassifier(n_estimators = 1000, min_samples_leaf = 3, max_depth = 100))])

rf_pipe.fit(X_train, y_train)

train_predicted = rf_pipe.predict(X_train)
test_predicted = rf_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.548
**********************************************************
Test Accuracy:  0.461




In [36]:
results = pd.DataFrame({'task': X_test, 'actual': y_test, 'predicted':list(test_predicted)})

In [41]:
print('Average Precision:', round(np.mean(test_p),3))
print('Average Recall:\t  ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support:  ', round(np.mean(test_s), 3))

Average Precision: 0.421
Average Recall:	   0.297
Average F1 Score:  0.318
Average Support:   54.676


We were able to tune the model's parameters to improve its overall performance, but it still doesn't perform as well as the baseline (test accuracy 0.461 vs. 0.52) and takes longer to run. Next, we'll try a linear SVM.

---

### Support Vector Machine with Linear Kernel

In [116]:
from sklearn.svm import SVC

svc_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('svc', SVC(C = 0.4, kernel = 'linear'))])

svc_pipe.fit(X_train, y_train)

train_predicted = svc_pipe.predict(X_train)
test_predicted = svc_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.653
**********************************************************
Test Accuracy:  0.515




In [117]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.502
Average Recall:  0.382
Average F1 Score:  0.407
Average Support:  54.676


Once the model's parameters were slightly altered, the results were quite similar to those of the logistic regression (test accuracy 0.515 vs. 0.52; average F1 0.407 vs. 0.416). The precision and recall scores for the model are not too far apart. This could be a promising model. Next, we'll try an ensemble method, gradient boosting.
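
Why the balance between precision and recall matters: F1 is their harmonic mean, which is pulled toward the smaller of the two. The short sketch below checks the formula against one per-class result printed in the logistic regression section and shows that the reported "Average F1 Score" (a mean of per-class F1 values) is not the same as the F1 of the averaged precision and recall. The helper name `f1` is hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-class values printed in the logistic regression output
# (precision 0.5833..., recall 0.25, reported F1 0.35).
print(round(f1(0.5833333333333334, 0.25), 3))

# F1 of the SVM's averaged precision and recall; the notebook's
# "Average F1 Score" of 0.407 is instead the mean of per-class F1 values.
print(round(f1(0.502, 0.382), 3))

# Same arithmetic mean of precision and recall, very different F1.
print(round(f1(0.5, 0.5), 3), round(f1(0.8, 0.2), 3))
```

The last line illustrates the design choice behind F1: a balanced 0.5/0.5 pair scores higher than an unbalanced 0.8/0.2 pair even though both average to 0.5.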

---

### Gradient Boosting

In [107]:
from sklearn.ensemble import GradientBoostingClassifier

gb_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('gb', GradientBoostingClassifier(n_estimators = 100, min_samples_leaf = 3, max_depth = 100))])

gb_pipe.fit(X_train, y_train)

train_predicted = gb_pipe.predict(X_train)
test_predicted = gb_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.875
**********************************************************
Test Accuracy:  0.396


In [106]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.397
Average Recall:  0.353
Average F1 Score:  0.363
Average Support:  54.676


Gradient boosting overfits badly: the training accuracy is high (0.875) but the test accuracy is quite low (0.396). The model is also very slow to train. Next, we'll go back to a tree model. ExtraTrees differs from RandomForest in how splits are chosen, which typically results in larger trees. We'll try this algorithm to see if it generalizes better than the RandomForest.
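
To make the difference in split selection concrete, here is a simplified sketch on a single numeric feature. `best_threshold` mimics the exhaustive search a random forest tree performs over candidate cut points, while `random_threshold` mimics the uniform random draw used by extremely randomized trees. The function names and toy data are hypothetical; real implementations also subsample features, and ExtraTrees keeps the best of several random draws (one per candidate feature).

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two sides of a split."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """RandomForest-style: try every midpoint and keep the purest split."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """ExtraTrees-style: draw the cut point uniformly between min and max."""
    return rng.uniform(min(values), max(values))

# Toy feature (hypothetical) that separates the two classes cleanly.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["A", "A", "A", "B", "B", "B"]
rng = random.Random(0)

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print("best:  ", t_best, split_impurity(values, labels, t_best))
print("random:", round(t_rand, 3), split_impurity(values, labels, t_rand))
```

A random cut is never purer than the searched one, so more splits may be needed to isolate the classes, which is consistent with the larger trees mentioned above.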

---

### Extremely Randomized Trees

In [58]:
from sklearn.ensemble import ExtraTreesClassifier

xtrees_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('extra trees', ExtraTreesClassifier(n_estimators = 100, max_depth = 100, min_samples_leaf = 5))])

xtrees_pipe.fit(X_train, y_train)

train_predicted = xtrees_pipe.predict(X_train)
test_predicted = xtrees_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.496
**********************************************************
Test Accuracy:  0.463




In [59]:
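# Per-class metrics follow the order of the `labels` argument (y_train.unique()),
# so class names must be read from that same array, not from y_test.unique().
class_names = y_train.unique()
for i in range(len(test_p)):
    print(class_names[i])
    print('******************************')
    print('Precision: ', test_p[i])
    print('Recall: ', test_r[i])
    print('F1 Score: ', test_f1[i])
    print('Support: ', test_s[i])
    print('                                ')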

Getting Information
******************************
Precision:  0.44206008583690987
Recall:  0.7862595419847328
F1 Score:  0.5659340659340659
Support:  131
                                
Updating and Using Relevant Knowledge
******************************
Precision:  0.49333333333333335
Recall:  0.40217391304347827
F1 Score:  0.4431137724550898
Support:  92
                                
Performing General Physical Activities
******************************
Precision:  0.2857142857142857
Recall:  0.07017543859649122
F1 Score:  0.11267605633802817
Support:  57
                                
Selling or Influencing Others
******************************
Precision:  0.39236641221374047
Recall:  0.9178571428571428
F1 Score:  0.5497326203208557
Support:  280
                                
Training and Teaching Others
******************************
Precision:  0.6385542168674698
Recall:  0.39849624060150374
F1 Score:  0.4907407407407407
Support:  133
                                
Esti

In [60]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.447
Average Recall:  0.279
Average F1 Score:  0.295
Average Support:  54.676


We were able to reduce overfitting by configuring the parameters, but the precision and recall of this algorithm are not well balanced (0.447 vs. 0.279), and it failed to perform as well as the baseline model.

---

### Multi-layer Perceptron Classifier

In [None]:
# Available MLPClassifier solvers: 'lbfgs', 'sgd', 'adam'

In [87]:
from sklearn.neural_network import MLPClassifier

mlp_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('mlp', MLPClassifier(activation = 'tanh', alpha = 0.5, learning_rate = 'adaptive'))])

mlp_pipe.fit(X_train, y_train)

train_predicted = mlp_pipe.predict(X_train)
test_predicted = mlp_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.523
**********************************************************
Test Accuracy:  0.474




In [82]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.259
Average Recall:  0.239
Average F1 Score:  0.224
Average Support:  54.676


We were able to configure the alpha parameter to decrease over-fitting, and the precision and recall scores are quite balanced, but the performance of the model is still worse than our baseline model. Next, we'll try another ensemble method known as AdaBoost. (I would try XGBoost, but I don't have it installed on this computer; I would recommend trying it though!)

---

### AdaBoost

In [102]:
from sklearn.ensemble import AdaBoostClassifier

ada_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('ada', AdaBoostClassifier(n_estimators = 1000, algorithm = 'SAMME', learning_rate = 2.5))])

ada_pipe.fit(X_train, y_train)

train_predicted = ada_pipe.predict(X_train)
test_predicted = ada_pipe.predict(X_test)

print('Training Accuracy: ', round(sum(train_predicted == y_train)/len(y_train), 3))
train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('**********************************************************')
print('Test Accuracy: ', round(sum(test_predicted == y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())

Training Accuracy:  0.35
**********************************************************
Test Accuracy:  0.332




In [98]:
print('Average Precision: ', round(np.mean(test_p),3))
print('Average Recall: ', round(np.mean(test_r), 3))
print('Average F1 Score: ', round(np.mean(test_f1), 3))
print('Average Support: ', round(np.mean(test_s), 3))

Average Precision:  0.352
Average Recall:  0.08
Average F1 Score:  0.083
Average Support:  54.676


We were able to manipulate the parameters to reduce overfitting, but the AdaBoost algorithm does not perform as well as the baseline. Also, the recall score is quite low (0.08).
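
To compare the models side by side, the sketch below collects the figures printed in the outputs above and ranks them by test accuracy, with the train/test gap as a rough overfitting indicator. The numbers are copied from this notebook; note that some metric cells carry a different execution counter than their fit cell (for example `In [106]` vs. `In [107]`), so a few averages may come from an earlier run and the table should be read as indicative only.

```python
# (train accuracy, test accuracy, avg precision, avg recall, avg F1),
# copied from the outputs printed in this notebook.
reported = {
    "Logistic Regression": (0.645, 0.520, 0.550, 0.383, 0.416),
    "Multinomial NB":      (0.485, 0.402, 0.327, 0.194, 0.195),
    "Random Forest":       (0.548, 0.461, 0.421, 0.297, 0.318),
    "Linear SVM":          (0.653, 0.515, 0.502, 0.382, 0.407),
    "Gradient Boosting":   (0.875, 0.396, 0.397, 0.353, 0.363),
    "Extra Trees":         (0.496, 0.463, 0.447, 0.279, 0.295),
    "MLP":                 (0.523, 0.474, 0.259, 0.239, 0.224),
    "AdaBoost":            (0.350, 0.332, 0.352, 0.080, 0.083),
}

# Rank by test accuracy, highest first.
ranked = sorted(reported.items(), key=lambda item: item[1][1], reverse=True)

print(f"{'model':<22}{'train':>7}{'test':>7}{'gap':>7}{'F1':>7}")
for name, (train_acc, test_acc, _prec, _rec, avg_f1) in ranked:
    gap = train_acc - test_acc  # large gap suggests overfitting
    print(f"{name:<22}{train_acc:>7.3f}{test_acc:>7.3f}{gap:>7.3f}{avg_f1:>7.3f}")
```

On these figures, logistic regression and the linear SVM lead on both test accuracy and average F1, while gradient boosting shows by far the largest train/test gap.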