# Exploring One-Vs-Rest Techniques

The purpose of this notebook is to explore using one-vs-rest as a technique to address the multi-label classification part of our project. 

**Last Updated**: Wednesday July 24, 2019
<br> **Author**: Rebecca Hu

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('onet_tasks_gwas.csv')

Here is the pre-processing step.

In [5]:
#Remove punctuation & stopwords, use stemming to remove inflections, calculate TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)
analyzer = TfidfVectorizer(lowercase = True, stop_words = 'english', ngram_range = (1, 3)).build_analyzer() 

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

Here is the model we'll be using, a simple logistic regression.

In [6]:
#Logit pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
    ('logit', LogisticRegression())])

### OneVsRest Scikit-Learn

---

First we'll try scikit-learn's OneVsRestClassifier, since it's already pre-made.

In [8]:
# transform data to have one indicator column per unique class, because this is what the sklearn function requires
from sklearn.model_selection import train_test_split

y_dict = {}
for c in df.GWA.unique():
    labels = df.GWA.apply(lambda x: 1 if x == c else 0)
    y_dict[c] = labels
    
skdf = pd.concat([df,pd.DataFrame(y_dict)], axis = 1)

classes = skdf.GWA.unique()
train, test = train_test_split(skdf, test_size=0.33, shuffle=True)
X_train = train.Task
X_test = test.Task

In [9]:
# Define a pipeline combining a text feature extractor with a multi-label classifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

lr_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(analyzer=stemmed_words)),
                ('clf', OneVsRestClassifier(LogisticRegression())),
            ])

for c in classes:
    print('... Processing {}'.format(c))
    # train the pipeline on the raw task text and the binary labels for class c
    lr_pipeline.fit(X_train, train[c])
    # compute the testing accuracy
    prediction = lr_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[c], prediction)))

... Processing Analyzing Data or Information
Test accuracy is 0.9674857656577764
... Processing Provide Consultation and Advice to Others
Test accuracy is 0.9709319748276896
... Processing Guiding, Directing, and Motivating Subordinates
Test accuracy is 0.948157027270003
... Processing Communicating with Supervisors, Peers, or Subordinates
Test accuracy is 0.9565477974228349
... Processing Making Decisions and Solving Problems
Test accuracy is 0.968684447108181
... Processing Developing Objectives and Strategies
Test accuracy is 0.9862151633203476
... Processing Resolving Conflicts and Negotiating with Others
Test accuracy is 0.9926580761162721
... Processing Documenting/Recording Information
Test accuracy is 0.9484566976326041
... Processing Communicating with Persons Outside Organization
Test accuracy is 0.9875636799520527
... Processing Interpreting the Meaning of Information for Others
Test accuracy is 0.9845669763260414
... Processing Selling or Influencing Others
Test accuracy is

At first glance, the scikit-learn implementation of one-vs-rest appears to do very well. But remember that this implementation doesn't use any sampling methods and our data is highly imbalanced. So although the accuracy is high, the model is probably just predicting the "rest" class instead of the target class for almost everything. In other words, there are many true negatives but few, if any, true positives.
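To make the concern concrete, here is a minimal sketch in plain Python. The labels are hypothetical (a 3% positive rate chosen for illustration, not taken from our data), and the "classifier" is a stand-in that always predicts the "rest" class. It shows how accuracy can look excellent while recall on the target class is zero.

```python
# Hypothetical labels: 1 = target class, 0 = "rest". Only 3% of rows are positive.
y_true = [1] * 30 + [0] * 970
# A degenerate stand-in "classifier" that always predicts the "rest" class.
y_pred = [0] * len(y_true)

# Count the confusion-matrix cells that matter here.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positives at all.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f}  target recall={recall:.3f}")
```

This is why the later cells report per-class precision, recall, and F1 rather than accuracy alone.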

So next, let's implement some sampling techniques that will balance our data. We'll over-sample the target class and under-sample the "rest" classes to create a training set that is more evenly split between the target class and the "rest", then train on this new, more balanced dataset. First, let's try a target class that labels a large proportion of the dataset: Handling and Moving Objects.
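The sampling idea can be sketched as a small helper. This is an illustration under stated assumptions, not the code used in the cells that follow: the function name `balanced_one_vs_rest`, its parameters, and the toy frame are hypothetical, and sampling with replacement is only switched on when more rows are requested than a class contains.

```python
import pandas as pd

def balanced_one_vs_rest(df, target, n_target, n_rest, label_col="GWA", seed=0):
    """Return a shuffled sample with a binary one-vs-rest label column (hypothetical helper)."""
    is_target = df[label_col] == target
    # Sample with replacement only when over-sampling beyond the rows available.
    pos = df[is_target].sample(
        n=n_target, replace=bool(n_target > is_target.sum()), random_state=seed
    )
    neg = df[~is_target].sample(
        n=n_rest, replace=bool(n_rest > (~is_target).sum()), random_state=seed
    )
    # assign() returns a new frame, so no value is set on a slice of df.
    neg = neg.assign(**{label_col: "Not " + target})
    # Shuffle so the two groups are interleaved.
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

# Toy data standing in for the real task table: 10 target rows, 90 "rest" rows.
toy = pd.DataFrame({
    "Task": [f"task {i}" for i in range(100)],
    "GWA": ["Handling and Moving Objects"] * 10 + ["Analyzing Data or Information"] * 90,
})
balanced = balanced_one_vs_rest(toy, "Handling and Moving Objects", n_target=20, n_rest=20)
print(balanced["GWA"].value_counts())
```

Relabelling with `assign` on a fresh sample also sidesteps the "value is trying to be set on a copy of a slice" warning that pandas emits when a column is assigned on a filtered frame.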

<b> NOTE (08/06/2019) </b> : I performed the sampling method here before I realized I needed to reformat the data before sampling. This method allows some of the same tasks to appear in both the training and test sets, but labeled with different GWAs. The sampling method used in the final classifier fixes this problem by reformatting the data into "easy-to-read" and "one-hotted" dataframes. See task_classification_helper_functions.py for the code for these functions. The code below gives a general idea of the sampling method I ended up using, but isn't implemented correctly.
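One way to avoid that kind of leakage is to split on unique task text rather than on rows, so every copy of a task lands on the same side. The sketch below is only an illustration of that idea in plain Python. It is not the approach in task_classification_helper_functions.py, and the function name `split_by_task` and the toy rows are hypothetical.

```python
import random

def split_by_task(rows, test_size=0.33, seed=0):
    """Split (task, label) rows so that all copies of a task share one split."""
    # Work with unique task texts; sorting keeps the shuffle reproducible.
    tasks = sorted({task for task, _ in rows})
    random.Random(seed).shuffle(tasks)
    n_test = max(1, round(len(tasks) * test_size))
    test_tasks = set(tasks[:n_test])
    train = [r for r in rows if r[0] not in test_tasks]
    test = [r for r in rows if r[0] in test_tasks]
    return train, test

# Toy rows: the first task appears twice with two different (made-up) labels.
rows = [
    ("task a", "label 1"),
    ("task a", "label 2"),
    ("task b", "label 3"),
    ("task c", "label 1"),
]
train, test = split_by_task(rows)
# Tasks that appear on both sides of the split; this should be empty.
overlap = {t for t, _ in train} & {t for t, _ in test}
print(len(train), len(test), overlap)
```

Because membership is decided per unique task, a task with several GWA labels can never appear in training under one label and in test under another.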

---

### Target Class: Handling and Moving Objects

In [10]:
# baseline preprocessing + sampling
from sklearn.metrics import precision_recall_fscore_support

hamo_sample = df[df['GWA'] == 'Handling and Moving Objects'].sample(2000)
nonhamo_df = df[df['GWA'] != 'Handling and Moving Objects'].sample(3000)
nonhamo_df['GWA'] = ['Not Handling and Moving Objects'] * 3000
equal_hamo_data = pd.concat([hamo_sample, nonhamo_df])
X_train, X_test, y_train, y_test = train_test_split(equal_hamo_data['Task'], 
                                                    equal_hamo_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

In [11]:
# fit the model and make predictions
logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())

In [12]:
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)
#results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
#results[results['Actual'] != results['Predicted']].head(10)

Training Error:  0.059
Precision:  [0.93673616 0.94877765]
Recall:  [0.96733482 0.90254707]
F1-Score:  [0.95178963 0.92508513]
**********************************************************
Test Error:  0.15
Precision:  [0.87138264 0.81481481]
Recall:  [0.88562092 0.79381443]
F1-Score:  [0.87844408 0.80417755]


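We still get relatively high accuracy with the sampling technique (test error of 0.15), and test precision and recall for both classes fall between roughly 0.79 and 0.89. So we can be more confident that our model is truly differentiating between the target class and other labels.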

---

Let's compare this to when we pre-process differently just to get an idea of what's working well.

In [15]:
# pre-processing without stemming + sampling

nonstem_logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase = True, stop_words = 'english', ngram_range = (1, 3))),
    ('logit', LogisticRegression())])

nonstem_logit_pipe.fit(X_train, y_train)

train_predicted = nonstem_logit_pipe.predict(X_train)
test_predicted = nonstem_logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

Training Error:  0.043
Precision:  [0.94150299 0.97405005]
Recall:  [0.9752322  0.93881197]
F1-Score:  [0.95807082 0.95610644]
**********************************************************
Test Error:  0.156
Precision:  [0.81568627 0.87346939]
Recall:  [0.87029289 0.81992337]
F1-Score:  [0.84210526 0.8458498 ]


Unnamed: 0,Task,Actual,Predicted
14022,Track animals by checking for signs such as dr...,Handling and Moving Objects,Not Handling and Moving Objects
4807,"Adjust or maintain equipment, such as lasers, ...",Handling and Moving Objects,Not Handling and Moving Objects
11753,Perform embalming duties as necessary.,Handling and Moving Objects,Not Handling and Moving Objects
11511,"Train horses or other equines for riding, harn...",Handling and Moving Objects,Not Handling and Moving Objects
15050,Measure and mark locations for installation of...,Not Handling and Moving Objects,Handling and Moving Objects
18334,Operate auxiliary equipment and control multip...,Not Handling and Moving Objects,Handling and Moving Objects
14374,"Dig ditches or trenches, backfill excavations,...",Not Handling and Moving Objects,Handling and Moving Objects
11784,"Order, display, and maintain supplies.",Handling and Moving Objects,Not Handling and Moving Objects
16268,Dismantle machinery and equipment for shipment...,Handling and Moving Objects,Not Handling and Moving Objects
13786,Assign to workers duties such as trees to be c...,Not Handling and Moving Objects,Handling and Moving Objects


In [16]:
# baseline preprocessing, no sampling

hamo_sample = df[df['GWA'] == 'Handling and Moving Objects']
nonhamo_df = df[df['GWA'] != 'Handling and Moving Objects']
nonhamo_df['GWA'] = ['Not Handling and Moving Objects'] * nonhamo_df.shape[0]
equal_hamo_data = pd.concat([hamo_sample, nonhamo_df])
X_train, X_test, y_train, y_test = train_test_split(equal_hamo_data['Task'], 
                                                    equal_hamo_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Training Error:  0.09
Precision:  [0.90925947 0.92781316]
Recall:  [0.99568528 0.35819672]
F1-Score:  [0.95051184 0.51685393]
**********************************************************
Test Error:  0.094
Precision:  [0.92340653 0.69871795]
Recall:  [0.97346132 0.43253968]
F1-Score:  [0.9477735  0.53431373]


Unnamed: 0,Task,Actual,Predicted
15997,"Repair holes in tire tubes, using scrapers and...",Not Handling and Moving Objects,Handling and Moving Objects
18047,"Stretch webbing and fabric, using webbing stre...",Handling and Moving Objects,Not Handling and Moving Objects
16553,"Repair and adjust safes, vault doors, and vaul...",Not Handling and Moving Objects,Handling and Moving Objects
15733,Fill small dents that cannot be worked out wit...,Handling and Moving Objects,Not Handling and Moving Objects
7953,Select and assemble equipment and required bac...,Handling and Moving Objects,Not Handling and Moving Objects
15236,"Start, stop, and control drilling speed of mac...",Not Handling and Moving Objects,Handling and Moving Objects
20093,"Adjust controls to guide, position, or move eq...",Not Handling and Moving Objects,Handling and Moving Objects
14259,Form a smooth foundation by stapling plywood o...,Handling and Moving Objects,Not Handling and Moving Objects
12093,"Perform work activities of subordinates, such ...",Handling and Moving Objects,Not Handling and Moving Objects
17316,"Position cores into lower sections of molds, a...",Not Handling and Moving Objects,Handling and Moving Objects


In [22]:
# no stemming, no sampling 
nonstem_logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase = True, stop_words = 'english', ngram_range = (1, 3))),
    ('logit', LogisticRegression())])

hamo_sample = df[df['GWA'] == 'Handling and Moving Objects']
nonhamo_df = df[df['GWA'] != 'Handling and Moving Objects']
nonhamo_df['GWA'] = ['Not Handling and Moving Objects'] * nonhamo_df.shape[0]
equal_hamo_data = pd.concat([hamo_sample, nonhamo_df])
X_train, X_test, y_train, y_test = train_test_split(equal_hamo_data['Task'], 
                                                    equal_hamo_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

nonstem_logit_pipe.fit(X_train, y_train)

train_predicted = nonstem_logit_pipe.predict(X_train)
test_predicted = nonstem_logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results#[results['Actual'] != results['Predicted']].head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Training Error:  0.097
Precision:  [0.90186835 0.93291925]
Recall:  [0.99656969 0.30553295]
F1-Score:  [0.94685699 0.4603126 ]
**********************************************************
Test Error:  0.087
Precision:  [0.9287613  0.70422535]
Recall:  [0.9765232  0.42735043]
F1-Score:  [0.9520436  0.53191489]


Unnamed: 0,Task,Actual,Predicted
8171,"Order, perform, and interpret tests and analyz...",Not Handling and Moving Objects,Not Handling and Moving Objects
8588,Provide training and supervision in therapy te...,Not Handling and Moving Objects,Not Handling and Moving Objects
6595,Develop and use multimedia course materials an...,Not Handling and Moving Objects,Not Handling and Moving Objects
4182,"Identify, procure, or develop test equipment, ...",Not Handling and Moving Objects,Not Handling and Moving Objects
15187,Push levers and brake pedals to control gasoli...,Not Handling and Moving Objects,Not Handling and Moving Objects
9484,"Order, label, and count stock of medications, ...",Not Handling and Moving Objects,Not Handling and Moving Objects
15137,Install solar thermal system controllers and s...,Handling and Moving Objects,Handling and Moving Objects
20058,Weigh materials or products and record weight ...,Not Handling and Moving Objects,Not Handling and Moving Objects
19493,Maintain knowledge of first-aid procedures.,Not Handling and Moving Objects,Not Handling and Moving Objects
5303,Recommend materials for reliable performance i...,Not Handling and Moving Objects,Not Handling and Moving Objects


In the experiments without sampling, the resulting error is always around 10%. The "Handling and Moving Objects" class is approximately 10% of the data, so we have the same problem as before: the model is likely predicting "Not Handling and Moving Objects" most of the time. That produces many true negatives, but few true positives and many false negatives, as illustrated in the pd.DataFrame above.
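The link between overall error and class share can be made explicit: the error rate is a prevalence-weighted mix of each class's miss rate. The sketch below uses illustrative values only (a 10% target share, and recalls of 0.43 and 0.97 picked to be of similar magnitude to the no-sampling outputs). It is not a recomputation of the results above.

```python
def error_rate(prevalence, recall_target, recall_rest):
    """Overall error as a prevalence-weighted mix of per-class miss rates."""
    return prevalence * (1 - recall_target) + (1 - prevalence) * (1 - recall_rest)

# Illustrative values: the target class is 10% of rows and is missed 57% of the time.
err = error_rate(prevalence=0.10, recall_target=0.43, recall_rest=0.97)
# A model that never predicts the target class at all.
always_rest = error_rate(prevalence=0.10, recall_target=0.0, recall_rest=1.0)

print(round(err, 3), round(always_rest, 3))  # 0.084 0.1
```

Even when more than half of the target rows are missed, the overall error stays below the error of a model that never predicts the target class, which is why error alone says little here.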

Now let's try the one-vs-rest approach with a smaller class.

---

### Target Class: Analyzing Data or Information

In [18]:
#baseline preprocessing + sampling
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

adoi_sample = df[df['GWA'] == 'Analyzing Data or Information'].sample(600)
nonadoi_df = df[df['GWA'] != 'Analyzing Data or Information'].sample(600)
nonadoi_df['GWA'] = ['Not Analyzing Data or Information'] * 600
equal_adoi_data = pd.concat([adoi_sample, nonadoi_df])
X_train, X_test, y_train, y_test = train_test_split(equal_adoi_data['Task'], 
                                                    equal_adoi_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

Training Error:  0.009
Precision:  [0.98366606 0.99810964]
Recall:  [0.99815838 0.98324022]
F1-Score:  [0.99085923 0.99061914]
**********************************************************
Test Error:  0.233
Precision:  [0.71641791 0.83018868]
Recall:  [0.84210526 0.6984127 ]
F1-Score:  [0.77419355 0.75862069]


Unnamed: 0,Task,Actual,Predicted
4357,Calculate excavation tonnage and prepare graph...,Analyzing Data or Information,Not Analyzing Data or Information
521,"Review invoices, work orders, consumption repo...",Analyzing Data or Information,Not Analyzing Data or Information
12293,Supply the latest price quotes on any security...,Not Analyzing Data or Information,Analyzing Data or Information
2892,Evaluate local area network (LAN) or wide area...,Not Analyzing Data or Information,Analyzing Data or Information
8437,"Design or use surveillance tools, such as scre...",Not Analyzing Data or Information,Analyzing Data or Information
5415,Determine methods to incorporate geomethane or...,Not Analyzing Data or Information,Analyzing Data or Information
3066,Transfer or rescale information from original ...,Not Analyzing Data or Information,Analyzing Data or Information
8629,Review physician's referral and patient's medi...,Analyzing Data or Information,Not Analyzing Data or Information
5464,Evaluate research data in terms of its impact ...,Not Analyzing Data or Information,Analyzing Data or Information
7643,Study films or scripts to determine how musica...,Not Analyzing Data or Information,Analyzing Data or Information


In [19]:
# pre-processing without stemming + sampling

nonstem_logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase = True, stop_words = 'english', ngram_range = (1, 3))),
    ('logit', LogisticRegression())])

adoi_sample = df[df['GWA'] == 'Analyzing Data or Information'].sample(600)
nonadoi_df = df[df['GWA'] != 'Analyzing Data or Information'].sample(600)
nonadoi_df['GWA'] = ['Not Analyzing Data or Information'] * 600
equal_adoi_data = pd.concat([adoi_sample, nonadoi_df])
X_train, X_test, y_train, y_test = train_test_split(equal_adoi_data['Task'], 
                                                    equal_adoi_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

nonstem_logit_pipe.fit(X_train, y_train)

train_predicted = nonstem_logit_pipe.predict(X_train)
test_predicted = nonstem_logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

Training Error:  0.006
Precision:  [0.99626866 0.99264706]
Recall:  [0.99256506 0.99630996]
F1-Score:  [0.99441341 0.99447514]
**********************************************************
Test Error:  0.133
Precision:  [0.88333333 0.85      ]
Recall:  [0.85483871 0.87931034]
F1-Score:  [0.86885246 0.86440678]


Unnamed: 0,Task,Actual,Predicted
12360,Visit establishments to evaluate needs or to p...,Not Analyzing Data or Information,Analyzing Data or Information
11,Review reports submitted by staff members to r...,Analyzing Data or Information,Not Analyzing Data or Information
4113,Verify energy bills and meter readings.,Analyzing Data or Information,Not Analyzing Data or Information
9842,Perform tests to identify any potential hazard...,Not Analyzing Data or Information,Analyzing Data or Information
1820,Review employer practices or employee data to ...,Not Analyzing Data or Information,Analyzing Data or Information
16828,Bend inner coils of springs away from or towar...,Not Analyzing Data or Information,Analyzing Data or Information
1999,"Observe, interview, and survey employees and c...",Analyzing Data or Information,Not Analyzing Data or Information
15976,Test and inspect engines to determine malfunct...,Not Analyzing Data or Information,Analyzing Data or Information
3457,"Review existing standards, controls, or equipm...",Analyzing Data or Information,Not Analyzing Data or Information
3953,Conduct automotive design reviews.,Analyzing Data or Information,Not Analyzing Data or Information


In [20]:
# baseline preprocessing, no sampling

adoi_sample = df[df['GWA'] == 'Analyzing Data or Information']
nonadoi_df = df[df['GWA'] != 'Analyzing Data or Information']
nonadoi_df['GWA'] = ['Not Analyzing Data or Information'] * nonadoi_df.shape[0]
equal_adoi_data = pd.concat([adoi_sample, nonadoi_df])
X_train, X_test, y_train, y_test = train_test_split(equal_adoi_data['Task'], 
                                                    equal_adoi_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Training Error:  0.032
Precision:  [0.96826881 1.        ]
Recall:  [1.         0.02698145]
F1-Score:  [0.98387863 0.05254516]
**********************************************************
Test Error:  0.034
Precision:  [0.9681592  0.61538462]
Recall:  [0.99743721 0.11111111]
F1-Score:  [0.98258016 0.18823529]


Unnamed: 0,Task,Actual,Predicted
4698,Assemble or analyze electronics technologies t...,Analyzing Data or Information,Not Analyzing Data or Information
2122,"Analyze impact on, and risk to, essential busi...",Analyzing Data or Information,Not Analyzing Data or Information
9255,Identify tissue structures or cell components ...,Analyzing Data or Information,Not Analyzing Data or Information
135,Analyze the effectiveness of marketing tactics...,Analyzing Data or Information,Not Analyzing Data or Information
4717,Collect and analyze data related to quality or...,Analyzing Data or Information,Not Analyzing Data or Information
5065,Collect and dissect animal specimens and exami...,Analyzing Data or Information,Not Analyzing Data or Information
277,Perform tax planning work.,Analyzing Data or Information,Not Analyzing Data or Information
1859,"Review contractual commitments, customer speci...",Analyzing Data or Information,Not Analyzing Data or Information
2337,Examine documents to determine degree of risk ...,Analyzing Data or Information,Not Analyzing Data or Information
3990,"Evaluate data to develop new mining products, ...",Analyzing Data or Information,Not Analyzing Data or Information


In [21]:
# no stemming, no sampling

nonstem_logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase = True, stop_words = 'english', ngram_range = (1, 3))),
    ('logit', LogisticRegression())])

adoi_sample = df[df['GWA'] == 'Analyzing Data or Information']
nonadoi_df = df[df['GWA'] != 'Analyzing Data or Information']
nonadoi_df['GWA'] = ['Not Analyzing Data or Information'] * nonadoi_df.shape[0]
equal_adoi_data = pd.concat([adoi_sample, nonadoi_df])
X_train, X_test, y_train, y_test = train_test_split(equal_adoi_data['Task'], 
                                                    equal_adoi_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

nonstem_logit_pipe.fit(X_train, y_train)

train_predicted = nonstem_logit_pipe.predict(X_train)
test_predicted = nonstem_logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Training Error:  0.033
Precision:  [0.96722754 0.92857143]
Recall:  [0.99994315 0.02134647]
F1-Score:  [0.9833133  0.04173355]
**********************************************************
Test Error:  0.024
Precision:  [0.97665176 0.9       ]
Recall:  [0.99949161 0.16071429]
F1-Score:  [0.9879397  0.27272727]


Unnamed: 0,Task,Actual,Predicted
9883,Determine or coordinate treatment plans by req...,Analyzing Data or Information,Not Analyzing Data or Information
8117,Analyze test results and develop a treatment p...,Not Analyzing Data or Information,Analyzing Data or Information
8464,Evaluate medical information to determine pati...,Analyzing Data or Information,Not Analyzing Data or Information
1917,Compare locations or environmental policies of...,Analyzing Data or Information,Not Analyzing Data or Information
3269,Apply mathematical theories and techniques to ...,Analyzing Data or Information,Not Analyzing Data or Information
4698,Assemble or analyze electronics technologies t...,Analyzing Data or Information,Not Analyzing Data or Information
1731,"Identify relevant guidance documents, internat...",Analyzing Data or Information,Not Analyzing Data or Information
54,"Review financial statements, sales or activity...",Analyzing Data or Information,Not Analyzing Data or Information
13201,"Select shipment routes, based on nature of goo...",Analyzing Data or Information,Not Analyzing Data or Information
19870,Visit development or work sites to determine p...,Analyzing Data or Information,Not Analyzing Data or Information


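Sampling still appears effective for a target class that is a small proportion of the overall dataset: with sampling, test recall for both classes stays at roughly 0.70 or higher, whereas without sampling one class's test recall falls to between 0.11 and 0.16.  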
Let's continue to experiment on an even smaller target class.

---

### Target Class: Performing for or Working Directly with the Public

In [23]:
#baseline preprocessing + sampling
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

pwdwp_sample = df[df['GWA'] == 'Performing for or Working Directly with the Public'].sample(100)
nonpwdwp_df = df[df['GWA'] != 'Performing for or Working Directly with the Public'].sample(100)
nonpwdwp_df['GWA'] = ['Not PWDWP'] * 100
equal_pwdwp_data = pd.concat([pwdwp_sample, nonpwdwp_df])
X_train, X_test, y_train, y_test = train_test_split(equal_pwdwp_data['Task'], 
                                                    equal_pwdwp_data['GWA'], 
                                                    test_size = 0.15,
                                                    shuffle = True)

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

Training Error:  0.0
Precision:  [1. 1.]
Recall:  [1. 1.]
F1-Score:  [1. 1.]
**********************************************************
Test Error:  0.067
Precision:  [0.86666667 1.        ]
Recall:  [1.         0.88235294]
F1-Score:  [0.92857143 0.9375    ]


Unnamed: 0,Task,Actual,Predicted
7471,Create and approve storyboards in conjunction ...,Performing for or Working Directly with the Pu...,Not PWDWP
7464,"Consult with writers, producers, or actors abo...",Performing for or Working Directly with the Pu...,Not PWDWP
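
The printed arrays can be reproduced by hand, which also shows which entry belongs to which class. The counts below are inferred from the output rather than printed by the notebook: a 15% split of 200 rows gives 30 test rows, a test error of 0.067 corresponds to 2 mistakes, and both misclassified rows are public-facing tasks predicted as Not PWDWP. A split of 13 Not PWDWP and 17 target-class rows is the one consistent with the printed precision and recall.

```python
# Confusion counts inferred from the printed metrics (assumption, not notebook output)
tp_pub, fn_pub = 15, 2    # public-facing tasks: predicted correctly / predicted 'Not PWDWP'
tp_not, fn_not = 13, 0    # 'Not PWDWP' tasks: predicted correctly / predicted as public-facing

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# False positives of one class are the false negatives of the other
not_p, not_r, not_f1 = prf(tp_not, fp=fn_pub, fn=fn_not)
pub_p, pub_r, pub_f1 = prf(tp_pub, fp=fn_not, fn=fn_pub)
test_error = (fn_pub + fn_not) / (tp_pub + fn_pub + tp_not + fn_not)

print('Not PWDWP:', round(not_p, 4), round(not_r, 4), round(not_f1, 4))
print('PWDWP:    ', round(pub_p, 4), round(pub_r, 4), round(pub_f1, 4))
print('Test error:', round(test_error, 3))
```

These values agree with the printed ones, so in this run the first entry of each array is Not PWDWP and the second is the target class.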


In [30]:
#no stemming + sampling
nonstem_logit_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase = True, stop_words = 'english', ngram_range = (1, 3))),
    ('logit', LogisticRegression())])

pwdwp_sample = df[df['GWA'] == 'Performing for or Working Directly with the Public'].sample(100)
nonpwdwp_df = df[df['GWA'] != 'Performing for or Working Directly with the Public'].sample(100)
nonpwdwp_df['GWA'] = ['Not PWDWP'] * 100
equal_pwdwp_data = pd.concat([pwdwp_sample, nonpwdwp_df])
X_train, X_test, y_train, y_test = train_test_split(equal_pwdwp_data['Task'], 
                                                    equal_pwdwp_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

nonstem_logit_pipe.fit(X_train, y_train)

train_predicted = nonstem_logit_pipe.predict(X_train)
test_predicted = nonstem_logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results  # every test prediction is shown; filter on results['Actual'] != results['Predicted'] to list only the errors

Training Error:  0.0
Precision:  [1. 1.]
Recall:  [1. 1.]
F1-Score:  [1. 1.]
**********************************************************
Test Error:  0.25
Precision:  [0.54545455 1.        ]
Recall:  [1.         0.64285714]
F1-Score:  [0.70588235 0.7826087 ]


Unnamed: 0,Task,Actual,Predicted
13143,Hear and resolve complaints from customers or ...,Performing for or Working Directly with the Pu...,Performing for or Working Directly with the Pu...
3883,Recommend design modifications to eliminate ma...,Not PWDWP,Not PWDWP
13382,Confer or correspond with establishment repres...,Performing for or Working Directly with the Pu...,Performing for or Working Directly with the Pu...
11673,"Answer customer inquiries or explain cost, ava...",Performing for or Working Directly with the Pu...,Performing for or Working Directly with the Pu...
9642,"Make appointments, keep records, or perform ot...",Not PWDWP,Performing for or Working Directly with the Pu...
8862,"Instruct individuals, families, or other group...",Not PWDWP,Performing for or Working Directly with the Pu...
13689,"Communicate with customers, employees, and oth...",Performing for or Working Directly with the Pu...,Performing for or Working Directly with the Pu...
16674,Fit and fasten sheet metal coverings to surfac...,Not PWDWP,Not PWDWP
12508,Contact utility companies for service hookups ...,Not PWDWP,Performing for or Working Directly with the Pu...
8761,Explain treatment procedures to patients to ga...,Not PWDWP,Not PWDWP
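
Because `labels = y_train.unique()` follows the order in which classes first appear in the shuffled training set, the position of each class in the printed arrays can change between runs. In the table above, the visible errors are Not PWDWP tasks predicted as the target class, which suggests the first entry here is the target class, the reverse of the previous cell. One way to avoid this ambiguity is to pass the label order explicitly. The sketch below uses toy labels, and the `label_order` name is an assumption.

```python
from sklearn.metrics import precision_recall_fscore_support

target = 'Performing for or Working Directly with the Public'
label_order = [target, 'Not PWDWP']   # fixed order: target class first (illustrative choice)

# Toy labels standing in for y_test and test_predicted
y_true = [target, target, target, 'Not PWDWP', 'Not PWDWP']
y_pred = [target, target, 'Not PWDWP', 'Not PWDWP', 'Not PWDWP']

p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=label_order)
for name, p_i, r_i, f_i in zip(label_order, p, r, f1):
    print(name[:20].ljust(20), round(p_i, 3), round(r_i, 3), round(f_i, 3))
```

With a fixed order, the first number in every array always refers to the target class, so results from different cells can be compared directly.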


In [25]:
#baseline preprocessing + no sampling (the equal_pwdwp_data name is reused, but the classes are imbalanced here)
pwdwp_sample = df[df['GWA'] == 'Performing for or Working Directly with the Public']
nonpwdwp_df = df[df['GWA'] != 'Performing for or Working Directly with the Public']
nonpwdwp_df['GWA'] = ['Not PWDWP'] * nonpwdwp_df.shape[0]
equal_pwdwp_data = pd.concat([pwdwp_sample, nonpwdwp_df])
X_train, X_test, y_train, y_test = train_test_split(equal_pwdwp_data['Task'], 
                                                    equal_pwdwp_data['GWA'], 
                                                    test_size = 0.15,
                                                    shuffle = True)

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Training Error:  0.006
Precision:  [0.99377509 0.        ]
Recall:  [1. 0.]
F1-Score:  [0.99687783 0.        ]
**********************************************************
Test Error:  0.01
Precision:  [0.98978247 0.        ]
Recall:  [1. 0.]
F1-Score:  [0.994865 0.      ]


Unnamed: 0,Task,Actual,Predicted
12350,"Answer customers' questions about products, pr...",Performing for or Working Directly with the Pu...,Not PWDWP
12639,Answer customer questions regarding problems w...,Performing for or Working Directly with the Pu...,Not PWDWP
7580,Coordinate dancing with that of partners or da...,Performing for or Working Directly with the Pu...,Not PWDWP
11456,Answer patrons' questions about gaming machine...,Performing for or Working Directly with the Pu...,Not PWDWP
11427,Monitor stations and games and move dealers fr...,Performing for or Working Directly with the Pu...,Not PWDWP
12167,Answer telephones to provide information and r...,Performing for or Working Directly with the Pu...,Not PWDWP
7411,Sing or dance during dramatic or comedic perfo...,Performing for or Working Directly with the Pu...,Not PWDWP
11922,"Resolve any problems with itineraries, service...",Performing for or Working Directly with the Pu...,Not PWDWP
7664,"Play musical instruments as soloists, or as me...",Performing for or Working Directly with the Pu...,Not PWDWP
7645,"Collaborate with other colleagues, such as cop...",Performing for or Working Directly with the Pu...,Not PWDWP
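
The warning at the top of this output is pandas' `SettingWithCopyWarning`: `nonpwdwp_df` is a filtered slice of `df`, and assigning a new `GWA` column to it leaves pandas unsure whether `df` was meant to change as well. A minimal sketch of one way to avoid it, using a toy frame and placeholder class names in place of `df`, is to take an explicit copy before relabelling.

```python
import pandas as pd

target = 'Performing for or Working Directly with the Public'

# Toy frame standing in for df (the other class names are placeholders)
toy = pd.DataFrame({
    'Task': ['greet customers', 'weld pipes', 'file reports'],
    'GWA': [target, 'Other activity A', 'Other activity B'],
})

# .copy() makes the subset an independent frame, so the assignment is unambiguous
non_target = toy[toy['GWA'] != target].copy()
non_target['GWA'] = 'Not PWDWP'   # a scalar is broadcast to every row

print(non_target)
print(toy['GWA'].tolist())   # the original labels are untouched
```

Assigning a scalar also removes the need to build a list of repeated labels sized to the frame.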


In [29]:
#no stemming + no sampling (the equal_pwdwp_data name is reused, but the classes are imbalanced here)
pwdwp_sample = df[df['GWA'] == 'Performing for or Working Directly with the Public']
nonpwdwp_df = df[df['GWA'] != 'Performing for or Working Directly with the Public']
nonpwdwp_df['GWA'] = ['Not PWDWP'] * nonpwdwp_df.shape[0]
equal_pwdwp_data = pd.concat([pwdwp_sample, nonpwdwp_df])
X_train, X_test, y_train, y_test = train_test_split(equal_pwdwp_data['Task'], 
                                                    equal_pwdwp_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

nonstem_logit_pipe.fit(X_train, y_train)

train_predicted = nonstem_logit_pipe.predict(X_train)
test_predicted = nonstem_logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results  # every test prediction is shown; filter on results['Actual'] != results['Predicted'] to list only the errors

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Training Error:  0.006
Precision:  [0.99357143 0.        ]
Recall:  [1. 0.]
F1-Score:  [0.99677535 0.        ]
**********************************************************
Test Error:  0.01
Precision:  [0.98961938 0.        ]
Recall:  [1. 0.]
F1-Score:  [0.99478261 0.        ]


Unnamed: 0,Task,Actual,Predicted
15585,"Adjust, repair, or replace defective wiring an...",Not PWDWP,Not PWDWP
11826,"Study production information, such as characte...",Not PWDWP,Not PWDWP
11331,Move and arrange furniture and turn mattresses.,Not PWDWP,Not PWDWP
5921,Supervise the work of survey interviewers.,Not PWDWP,Not PWDWP
3151,Identify need for initial or supplemental proj...,Not PWDWP,Not PWDWP
5853,"Assemble, operate, or maintain field or labora...",Not PWDWP,Not PWDWP
7900,"Organize recording sessions and prepare areas,...",Not PWDWP,Not PWDWP
10487,"Testify in court cases involving fires, suspec...",Not PWDWP,Not PWDWP
3074,Design or prepare graphic representations of G...,Not PWDWP,Not PWDWP
18200,"Note malfunctions of equipment, instruments, o...",Not PWDWP,Not PWDWP


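So far the results of the sampling method are promising, although test performance gets worse as the target class gets smaller. Without sampling, both pipelines report a test error of only 0.01 yet score zero precision and recall on the target class: they label every task as Not PWDWP, so the low error reflects the class imbalance rather than a useful model. Lastly, let's try the sampling method on the smallest target class.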
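
The unsampled runs show why error rate alone is misleading on imbalanced data. As a sketch with synthetic labels (the 1% positive rate is an assumption chosen to roughly match the precision printed for the majority class), a classifier that always predicts the majority class has a low error but never finds the target class.

```python
# Synthetic labels: 10 target tasks out of 1000 (assumed proportion, for illustration)
y_true = ['PWDWP'] * 10 + ['Not PWDWP'] * 990
y_pred = ['Not PWDWP'] * len(y_true)   # always predict the majority class

# Fraction of rows where the prediction differs from the true label
error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the target class: correctly found targets / all actual targets
true_pos = sum(t == 'PWDWP' and p == 'PWDWP' for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 'PWDWP' for t in y_true)
recall = true_pos / actual_pos

print('Error: ', error)
print('Recall on target class: ', recall)
```

This is why the per-class precision and recall printed in each cell are more informative here than the overall error.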

---

### Target Class: Coaching and Developing Others

In [27]:
#baseline preprocessing + sampling
cado_sample = df[df['GWA'] == 'Coaching and Developing Others'].sample(40)
noncado_df = df[df['GWA'] != 'Coaching and Developing Others'].sample(40)
noncado_df['GWA'] = ['Not Coaching and Developing Others'] * 40
equal_cado_data = pd.concat([cado_sample, noncado_df])
X_train, X_test, y_train, y_test = train_test_split(equal_cado_data['Task'], 
                                                    equal_cado_data['GWA'], 
                                                    test_size = 0.10,
                                                    shuffle = True)

logit_pipe.fit(X_train, y_train)

train_predicted = logit_pipe.predict(X_train)
test_predicted = logit_pipe.predict(X_test)

train_p, train_r, train_f1, train_s = precision_recall_fscore_support(y_train, train_predicted, labels = y_train.unique())
print('Training Error: ', round(sum(train_predicted != y_train)/len(y_train), 3))
print('Precision: ', train_p)
print('Recall: ', train_r)
print('F1-Score: ', train_f1)
print('**********************************************************')
print('Test Error: ', round(sum(test_predicted != y_test)/len(y_test), 3))
test_p, test_r, test_f1, test_s = precision_recall_fscore_support(y_test, test_predicted, labels = y_train.unique())
print('Precision: ', test_p)
print('Recall: ', test_r)
print('F1-Score: ', test_f1)

results = pd.DataFrame({'Task': X_test, 'Actual': y_test, 'Predicted': test_predicted})
results[results['Actual'] != results['Predicted']].head(10)

Training Error:  0.0
Precision:  [1. 1.]
Recall:  [1. 1.]
F1-Score:  [1. 1.]
**********************************************************
Test Error:  0.0
Precision:  [1. 1.]
Recall:  [1. 1.]
F1-Score:  [1. 1.]


Unnamed: 0,Task,Actual,Predicted


This class is so small that it is difficult to put much confidence in the results of this experiment: with 80 sampled tasks and `test_size = 0.10`, the test set contains only 8 examples, so the perfect test scores say little about how well the model generalizes. In general, the sampling method seems to be quite effective, so we'll use this procedure in our final model.
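
To put a rough number on that uncertainty, one option is a Wilson score interval for the test accuracy. This is a sketch under the assumption that the 8 test predictions behave like independent draws; the helper name `wilson_interval` is not from the notebook.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p_hat = correct / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 80 sampled rows with test_size = 0.10 leave 8 test rows, all classified correctly
low, high = wilson_interval(correct=8, n=8)
print(round(low, 3), round(high, 3))
```

With 8 of 8 correct, the lower bound is roughly 0.68, so the result is still compatible with a noticeably less accurate model.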