The goal of this notebook is to test the performance of the pipeline on multiple test datasets.

In [1]:
import pandas as pd

In [2]:
test_data_1 = pd.read_csv("../clean_data/Cleaned_test_text_with_pii_2018_12_31_05_35_46_815414.csv")
test_data_2 = pd.read_csv("../clean_data/Cleaned_test_text_with_pii_2019_01_18_06_21_07_588706.csv")
test_data_3 = pd.read_csv("../clean_data/Cleaned_test_text_with_pii_2019_01_18_06_39_35_811991.csv")

In [3]:
from w2vpipe.pipeline import text_clean, word_embedding

In [4]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [5]:
word2vec_pipe = joblib.load('word2vec_pipe_cv_production.pkl')

In [6]:
%%time 
binary_pred = word2vec_pipe.predict(test_data_1["Text"])
binary_true = test_data_1["Target"]

from sklearn.metrics import classification_report

print(classification_report(y_true = binary_true, y_pred = binary_pred))

  0%|          | 0/80000 [00:00<?, ?it/s]

transforming while training word2vec model with new data.


100%|██████████| 80000/80000 [00:08<00:00, 8987.96it/s] 
100%|██████████| 80000/80000 [00:00<00:00, 221243.95it/s]


              precision    recall  f1-score   support

           0       0.95      0.95      0.95     10000
           1       0.99      0.99      0.99     70000

   micro avg       0.99      0.99      0.99     80000
   macro avg       0.97      0.97      0.97     80000
weighted avg       0.99      0.99      0.99     80000

CPU times: user 31.7 s, sys: 8.37 s, total: 40 s
Wall time: 20.7 s


In [7]:
%%time 
binary_pred = word2vec_pipe.predict(test_data_2["Text"])
binary_true = test_data_2["Target"]

from sklearn.metrics import classification_report

print(classification_report(y_true = binary_true, y_pred = binary_pred))

  0%|          | 0/80000 [00:00<?, ?it/s]

transforming while training word2vec model with new data.


100%|██████████| 80000/80000 [00:09<00:00, 8538.03it/s] 
100%|██████████| 80000/80000 [00:00<00:00, 218832.87it/s]


              precision    recall  f1-score   support

           0       0.97      0.92      0.95     10000
           1       0.99      1.00      0.99     70000

   micro avg       0.99      0.99      0.99     80000
   macro avg       0.98      0.96      0.97     80000
weighted avg       0.99      0.99      0.99     80000

CPU times: user 32.8 s, sys: 8.53 s, total: 41.4 s
Wall time: 21.8 s


In [8]:
%%time 
binary_pred = word2vec_pipe.predict(test_data_3["Text"])
binary_true = test_data_3["Target"]

from sklearn.metrics import classification_report

print(classification_report(y_true = binary_true, y_pred = binary_pred))

  0%|          | 0/80000 [00:00<?, ?it/s]

transforming while training word2vec model with new data.


100%|██████████| 80000/80000 [00:09<00:00, 8801.63it/s] 
100%|██████████| 80000/80000 [00:00<00:00, 222339.35it/s]


              precision    recall  f1-score   support

           0       0.97      0.93      0.95     10000
           1       0.99      1.00      0.99     70000

   micro avg       0.99      0.99      0.99     80000
   macro avg       0.98      0.96      0.97     80000
weighted avg       0.99      0.99      0.99     80000

CPU times: user 33.1 s, sys: 9.38 s, total: 42.5 s
Wall time: 21.7 s
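
The three evaluation cells above repeat the same steps, so they could be folded into one helper. The sketch below is only an illustration under stated assumptions: `evaluate_datasets` and `KeywordStandIn` are hypothetical names that do not exist in the notebook or in `w2vpipe`, and the stand-in model merely replaces `word2vec_pipe` so the example runs on its own. It relies on the `output_dict=True` option of `classification_report` to return the metrics as a dictionary.

```python
import pandas as pd
from sklearn.metrics import classification_report


def evaluate_datasets(pipe, datasets, text_col="Text", target_col="Target"):
    """Return one classification report (as a dict) per named test set."""
    reports = {}
    for name, frame in datasets.items():
        predictions = pipe.predict(frame[text_col])
        reports[name] = classification_report(
            y_true=frame[target_col], y_pred=predictions, output_dict=True
        )
    return reports


class KeywordStandIn:
    """Hypothetical stand-in for word2vec_pipe; only here to make the sketch runnable."""

    def predict(self, texts):
        # Predict label 1 when the text contains the word "name", else 0.
        return [1 if "name" in text else 0 for text in texts]


# Made-up toy data with the same column names as the test sets.
toy = pd.DataFrame(
    {
        "Text": ["my name is Ann", "see you soon", "name: Bob", "nice weather"],
        "Target": [1, 0, 1, 0],
    }
)
reports = evaluate_datasets(KeywordStandIn(), {"toy": toy})
print(reports["toy"]["macro avg"]["f1-score"])
```

With the real objects, the call would presumably look like `evaluate_datasets(word2vec_pipe, {"test_1": test_data_1, "test_2": test_data_2, "test_3": test_data_3})`.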


The next step is to check that the PII values in the test data do not appear too often in the training data.

In [9]:
train_data = pd.read_csv("../clean_data/Cleaned_train_text_with_pii_2018_12_29_07_26_56_266227.csv")

In [45]:
# Count the test rows whose PII value also appears in the training data,
# excluding the value 'None'.
def pii_overlap(test, train):

    overlap_pii_binary = test.isin(train)
    none_type_binary = test != "None"

    # Combine the two masks instead of subtracting the 'None' count, so the
    # result stays correct even if 'None' never occurs in the training data.
    return int((overlap_pii_binary & none_type_binary).sum())

In [46]:
pii_overlap(test_data_1['PII'], train_data['PII'])

10992

In [47]:
pii_overlap(test_data_2['PII'], train_data['PII'])

11056

In [48]:
pii_overlap(test_data_3['PII'], train_data['PII'])

11097

In [49]:
test_data_3.shape[0]

80000

In [60]:
# a function to get all the overlap values
def pii_overlap_value(test, train):
    
    overlap_pii_binary = test.isin(train)
    none_type_binary = test != "None"
    
    combined_binary = none_type_binary & overlap_pii_binary
    
    return test.loc[combined_binary]
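
As a sanity check of the masking logic used above, the same steps can be run on a few made-up values. This is a minimal sketch: `toy_train` and `toy_test` are hypothetical stand-ins for `train_data['PII']` and a test `PII` column, not real data.

```python
import pandas as pd

# Toy stand-ins for the train and test PII columns; the values are made up.
toy_train = pd.Series(["Michael", "Smith", "None", "None"])
toy_test = pd.Series(["Michael", "None", "Zelda", "Smith", "Michael"])

in_train = toy_test.isin(toy_train)   # True where the test value also occurs in train
is_pii = toy_test != "None"           # True for every row whose value is not 'None'

overlap_rows = toy_test.loc[in_train & is_pii]
print(len(overlap_rows))              # number of overlapping test rows
print(overlap_rows.value_counts())    # how often each overlapping value occurs
```

Note that the count is per test row, not per distinct value, so a name that occurs many times in the test data is counted once for each occurrence.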

In [67]:
pii_overlap_value(test_data_3['PII'], train_data['PII']).value_counts().head(15)

Michael        72
Smith          63
Thomas         62
Williams       61
John           55
David          54
Joseph         47
James          46
Christopher    44
Matthew        43
Stephanie      42
Jennifer       41
Rodriguez      41
Wilson         40
Mark           38
Name: PII, dtype: int64

The majority of the overlap in PII values comes from names, which is reasonable since names vary less than other PII types.

The total PII overlap between training and test data is about 11,000 of the 80,000 rows in each test set, roughly 1/7 of the test data.
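
The 1/7 estimate can be checked from the counts printed by `pii_overlap` earlier and the 80,000-row size of each test set. The snippet below is just that arithmetic; the dictionary keys are labels chosen for illustration.

```python
# Overlap counts printed by pii_overlap above, and the size of each test set.
overlap_counts = {"test_data_1": 10992, "test_data_2": 11056, "test_data_3": 11097}
n_test_rows = 80000

# Share of test rows whose PII value also occurs in the training data.
fractions = {name: count / n_test_rows for name, count in overlap_counts.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

All three shares come out slightly below 1/7 (about 14.3%).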

## In conclusion, the entire pipeline works on unseen PII data.