In this notebook:

Modelling pipeline: grid search over all classes.

Next step: make bigrams and sentence filtering optional.

In [28]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
import time
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

# pre-processing
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack

# modelling
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# modelling pipeline
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

# modelling evaluation
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

Future pipeline:

For each classifier -><br>
Separate to X and Y<br>
TF-IDF here option 1 <br>
Step for SF'ing<br>
TF-IDF here option 2 <br>
Split into folds (5-fold CV)<br>
3x3 SVM Hyperparameters<br>
Find best neg F1 score

Plus anything else
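A rough sketch of how the two "TF-IDF here" options and the optional bigrams might be handled: if the vectoriser sits inside the pipeline, the grid search can treat `ngram_range` as just another hyperparameter, and TF-IDF is refitted on each training fold. The toy dataframe, the `cf_email` column and the `label` column are made up for illustration; sentence filtering is not covered here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: one text column, one crafted-feature column, one target.
toy_df = pd.DataFrame({
    "segment_text": [
        "we collect your email address", "please send us your email",
        "your email address is stored", "we may email you updates",
        "provide an email to register", "email address required for account",
        "cookies track browsing activity", "location data improves maps",
        "device identifiers are logged", "third parties receive analytics",
        "advertising partners use cookies", "gps location is optional",
    ],
    "cf_email": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "label":    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# TF-IDF is applied to the text column; crafted features pass through unchanged.
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words="english"), "segment_text"),
    ("cfs", "passthrough", ["cf_email"]),
])
pipe = Pipeline([("features", features), ("svc", SVC())])

# Unigrams only vs unigrams plus bigrams becomes part of the grid.
param_grid = {
    "features__tfidf__ngram_range": [(1, 1), (1, 2)],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],
}

# cv=2 only because the toy data is tiny; the notebook uses 5 folds.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1")
search.fit(toy_df[["segment_text", "cf_email"]], toy_df["label"])
print(search.best_params_)
```

Whether bigrams actually help would then show up directly in `search.best_params_`.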

Pipeline to make now:

1. Separate into classifiers. For each classifier:
2. Apply SF'd
3. Separate into X and Y
4. Create TF-IDF matrix
5. Split each set into 5 folds
6. Grid search over SVM hyperparameters to optimise F1 score

Output.

This will be a moderate approximation of most of their work. The main missing element is better text pre-processing, which should improve the results from the crafted features (CFs) and sentence filtering (SF).

Do it for one classifier, then find how to generalise it.

Train, Validate and Test dataframes to use:

In [104]:
df_for_pipelining = pd.read_pickle("crafted_features_df.pkl")

df_for_pipelining_train = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TRAINING' ].copy()
df_for_pipelining_val = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'VALIDATION' ].copy()
df_for_pipelining_test = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TEST' ].copy()

# now that I have used the 'policy_type' column for referring to train/validate/test,
# I can delete that column along with other unnecessary columns.
for dataframe in [df_for_pipelining_train, df_for_pipelining_val, df_for_pipelining_test]:
    dataframe.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
           'policy_segment_id', 'annotations', 'sentences'], inplace=True)

In [109]:
print(df_for_pipelining_train.shape)
print(df_for_pipelining_val.shape)
print(df_for_pipelining_test.shape)

(8068, 614)
(2651, 614)
(4824, 614)


# Step 1: select classifier

Let's start with 1st Party.

In [105]:
classifier = "1st_party"

# Step 2: apply SF'ing

1. Get CFs for 1st Party to use for SF'ing

In [106]:
annotation_features = pd.read_pickle("annotation_features.pkl")

# filtering the table to get the list object from the same row that lists the classifier
classifier_features = annotation_features[ annotation_features['annotation'] == classifier ].reset_index().at[0,'features']

classifier_features

[' we ', ' you ', ' us ', ' our ', 'the app', 'the software']

2. Filter the DF for rows where any of those features is 1.

In [107]:
df_for_pipelining_train.shape

(8068, 614)

In [6]:
# keep rows where at least one of the classifier's crafted features is present
sf_mask = (df_for_pipelining_train[classifier_features] > 0).any(axis=1)
df_for_pipelining_train_SF = df_for_pipelining_train[sf_mask].reset_index(drop=True)
df_for_pipelining_train_SF.shape

(5101, 614)
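To make the filtering step concrete, here is a minimal sketch on a toy dataframe. The rows and the three feature columns are invented for illustration; only the masking logic mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for the crafted-features dataframe (values are illustrative).
toy_df = pd.DataFrame({
    "segment_text": ["we collect data", "nothing relevant", "contact us at our office"],
    " we ": [1, 0, 0],
    " us ": [0, 0, 1],
    " our ": [0, 0, 1],
})
toy_features = [" we ", " us ", " our "]

# Keep a row if at least one of the classifier's feature columns is non-zero.
keep_mask = (toy_df[toy_features] > 0).any(axis=1)
toy_df_sf = toy_df[keep_mask].reset_index(drop=True)

print(toy_df_sf.shape)  # (2, 4)
```

The middle row is dropped because none of its feature columns fire.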

# Step 3: Separate into X and Y

## Create X
X requires a union of the Crafted Features columns and the TF-IDF matrix.

Create TF-IDF matrix:

In [7]:
tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)

train_tfidf = tfidfTransformer.fit_transform(df_for_pipelining_train_SF['segment_text'])

Extract CF columns from X_train and convert to sparse so that it can be combined with TF-IDF:

In [8]:
# Extract CF columns:
classifier_X_train_cfs = df_for_pipelining_train_SF.loc[:,'contact info':].copy()
# Use every column after and including the first crafted feature, which happens to be 'contact info'
print(f"Should be left with the 579 crafted features (CF). CF shape is: {classifier_X_train_cfs.shape}")

#convert to sparse
classifier_X_train_cfs = csr_matrix(classifier_X_train_cfs)

# combine CF columns with TF-IDF to create X
classifier_X_train = hstack([classifier_X_train_cfs, train_tfidf ])

Should be left with the 579 crafted features (CF). CF shape is: (5101, 579)
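A small sketch of the same CF-plus-TF-IDF union on toy data, to show how the column counts add up. The texts and the `cf_a`/`cf_b` columns are hypothetical. The `.tocsr()` call is an optional extra that guarantees a row-sliceable format whatever `hstack` returns.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts and crafted-feature counts.
toy_text = pd.Series(["we collect your email", "our app uses cookies", "you may contact us"])
toy_cfs = pd.DataFrame({"cf_a": [1, 0, 0], "cf_b": [0, 1, 2]})

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf = vectorizer.fit_transform(toy_text)

# Crafted features first, TF-IDF columns after, as in the cell above.
toy_X = hstack([csr_matrix(toy_cfs.values), toy_tfidf]).tocsr()

n_cf = toy_cfs.shape[1]
n_tfidf = len(vectorizer.vocabulary_)
print(toy_X.shape == (3, n_cf + n_tfidf))  # True
```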


## Create y

In [9]:
classifier_y_train = df_for_pipelining_train_SF.loc[:,classifier].copy()

In [10]:
# Ensure Y_train only has binary values (cap any count above 1 at 1)
classifier_y_train = classifier_y_train.clip(upper=1)
print(f"Highest value should be one. Highest value is: {classifier_y_train.max()}") # should be 1

Highest value should be one. Highest value is: 1


# Step 4: 5-fold CV Grid Search over hyperparameters

In [11]:
cachedir = mkdtemp() # temporary directory used by Pipeline(memory=...) to cache fitted transformers

pipeline_sequences = [
        ('SVC', SVC()) ]
pipe = Pipeline(pipeline_sequences, memory = cachedir)

svc_params = {'SVC__C': [0.1, 1, 10],
             'SVC__gamma': [0.001, 0.01, 0.1]}

# Create grid search object
grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=1, n_jobs=-1, scoring='f1')

In [12]:
%%time
fitted_search = grid_search_object.fit(classifier_X_train, classifier_y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
CPU times: user 5.5 s, sys: 144 ms, total: 5.65 s
Wall time: 50.7 s
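Once a search has been fitted, the cross-validated scores can be read straight off the search object. The sketch below uses synthetic data from `make_classification` rather than the policy data, so the numbers it prints mean nothing for this project. `best_score_` is the mean F1 across the held-out folds, so it should be a less optimistic figure than predicting on the data the model was fitted to.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, for illustration only.
X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=0)

toy_search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]},
    cv=5,
    scoring="f1",
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)   # best C / gamma combination
print(toy_search.best_score_)    # mean cross-validated F1 of that combination

# One row per candidate: 3 x 3 = 9 rows.
cv_table = pd.DataFrame(toy_search.cv_results_)[
    ["param_C", "param_gamma", "mean_test_score", "rank_test_score"]
]
print(cv_table.sort_values("rank_test_score"))
```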


# Evaluation on Train set

To compare to the per-classifier results given in the paper (Table 1 pg 4), I only need to look at F1 score.

In [13]:
classifier_prediction = fitted_search.predict(classifier_X_train)
print(classification_report(classifier_y_train, classifier_prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3745
           1       1.00      1.00      1.00      1356

    accuracy                           1.00      5101
   macro avg       1.00      1.00      1.00      5101
weighted avg       1.00      1.00      1.00      5101



Okay, I need to do my CV grid search on just the Train set, then evaluate performance using the Validate set, since it's massively overfitting on the train set.

# Evaluation on Validate set

Pre-processing steps to prepare the validate set for prediction:

In [44]:
df_for_pipelining_val = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'VALIDATION' ].copy()
df_for_pipelining_val.reset_index(inplace=True, drop=True)

In [45]:
val_tfidf = tfidfTransformer.transform(df_for_pipelining_val['segment_text'])
# Extract CF columns:
classifier_X_val_cfs = df_for_pipelining_val.loc[:,'contact info':].copy()
#convert to sparse
classifier_X_val_cfs = csr_matrix(classifier_X_val_cfs)

# combine CF columns with TF-IDF to create X
classifier_X_val = hstack([classifier_X_val_cfs, val_tfidf ])

classifier_y_val = df_for_pipelining_val.loc[:,classifier].copy()
# Ensure Y_val only has binary values (cap any count above 1 at 1)
classifier_y_val = classifier_y_val.clip(upper=1)
print(f"Highest value should be one. Highest value is: {classifier_y_val.max()}") # should be 1

Highest value should be one. Highest value is: 1


Scoring:

In [47]:
classifier_val_prediction = fitted_search.predict(classifier_X_val)

model_results[classifier] = [fitted_search, classifier_y_val, classifier_val_prediction]
model_results.to_pickle("model_results.pkl")

# print(classification_report(classifier_y_val, classifier_val_prediction))

Nice! Looks like this scored well. Let's set up the pipeline for all the other classifiers and score them too. I'm still not sure whether I want the positive F1 score or the negative F1 score. I think the negative F1 score matters because it relates to cases where a policy fails to mention an important practice: if a policy does not mention it, we want the classifier to correctly report that it is not mentioned.

Negative recall: of the segments where the practice was not mentioned, what proportion did the classifier predict as not mentioned?<br>
Negative precision: of the segments the classifier predicted as not mentioned, what proportion really did not mention it?

Positive recall: of the segments where the practice was mentioned, what proportion did the classifier identify?<br>
Positive precision: of the segments the classifier predicted as mentioned, what proportion really mentioned it?
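These four quantities can be computed with scikit-learn by switching which label counts as "positive". A minimal sketch on made-up labels (not the validation data): `pos_label=0` gives the negative-class metrics and `pos_label=1` the positive-class ones.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels: 5 actual negatives, 3 actual positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Treat class 0 ("not mentioned") as the class of interest.
neg_p, neg_r, neg_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=0, average="binary"
)
# Treat class 1 ("mentioned") as the class of interest.
pos_p, pos_r, pos_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary"
)

print(f"negative: precision={neg_p:.3f} recall={neg_r:.3f} f1={neg_f1:.3f}")
# negative: precision=0.800 recall=0.800 f1=0.800
print(f"positive: precision={pos_p:.3f} recall={pos_r:.3f} f1={pos_f1:.3f}")
# positive: precision=0.667 recall=0.667 f1=0.667
```

Here there are 4 true negatives, 1 false positive, 1 false negative and 2 true positives, so negative precision and recall are both 4/5 and positive precision and recall are both 2/3.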

In [None]:
cf_matrix = confusion_matrix(classifier_y_val, classifier_val_prediction)
# rows of confusion_matrix are the actual classes, columns are the predicted classes
cf_df = pd.DataFrame(
    cf_matrix, columns=["Predicted Negative", "Predicted Positive"], index=["Actual Negative", "Actual Positive"])

I think that I want to store negative precision, negative recall, negative F1 and positive F1.

This seems like a lot of things to store for each classifier.

I think I can just store the tuple of `(classifier_y_val, classifier_val_prediction)`

Then from that I can extract and populate a bigger table if I want.

Could store as lists... for each model, have a list with 3 values: fitted search, classifier_y_val, classifier_val_prediction.  Then could store each of those lists in a series where the index is the classifier.

Could store as dictionaries.  Each key is the classifier and each value is the list.

Then could loop through to get matrix (table) of scores.

In fact I think it will be helpful to have the order be the same as the order that I pass the classifiers, so I should use a series.
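A sketch of the "store the pair, build the table later" idea, assuming each stored entry holds `(y_true, y_pred)`. The classifier names and labels are placeholders. A plain dict is used here for brevity (it keeps insertion order); a Series indexed by classifier could be looped over with `.items()` in the same way.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder results: classifier name -> (true labels, predicted labels).
toy_results = {
    "classifier_a": ([0, 0, 1, 1], [0, 1, 1, 1]),
    "classifier_b": ([0, 1, 0, 1], [0, 1, 0, 0]),
}

rows = []
for name, (y_true, y_pred) in toy_results.items():
    rows.append({
        "classifier": name,
        "neg_precision": precision_score(y_true, y_pred, pos_label=0),
        "neg_recall": recall_score(y_true, y_pred, pos_label=0),
        "neg_f1": f1_score(y_true, y_pred, pos_label=0),
        "pos_f1": f1_score(y_true, y_pred, pos_label=1),
    })

# One row per classifier, in the order the results were stored.
score_table = pd.DataFrame(rows).set_index("classifier")
print(score_table.round(3))
```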

### Requirements for modelling pipeline:

- List of all classifiers
- df_for_pipelining_train/val/test
- Empty table of classifier results

In [111]:
df_for_pipelining = pd.read_pickle("crafted_features_df.pkl")

df_for_pipelining_train = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TRAINING' ].copy()
df_for_pipelining_val = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'VALIDATION' ].copy()
df_for_pipelining_val.reset_index(inplace=True, drop=True)
df_for_pipelining_test = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TEST' ].copy()
df_for_pipelining_test.reset_index(inplace=True, drop=True)

# now that I have used the 'policy_type' column for referring to train/validate/test,
# I can delete that column along with other unnecessary columns.
for dataframe in [df_for_pipelining_train, df_for_pipelining_val, df_for_pipelining_test]:
    dataframe.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
           'policy_segment_id', 'annotations', 'sentences'], inplace=True)

List of all classifiers can be taken from any of the previous dataframes

In [112]:
list_of_18_classifiers = ['Contact', 'Contact_E_Mail_Address', 'Contact_Phone_Number', 
                       'Identifier_Cookie_or_similar_Tech', 'Identifier_Device_ID', 'Identifier_IMEI',
                        'Identifier_MAC', 'Identifier_Mobile_Carrier',
                        'Location', 'Location_Cell_Tower', 'Location_GPS', 'Location_WiFi',
                        'SSO', 'Facebook_SSO',
                        '1st_party', '3rd_party',
                        'PERFORMED', 'NOT_PERFORMED'] # cross-checked from table on pg 4 of the paper

In [113]:
model_results = pd.Series(range(len(list_of_18_classifiers)),
                          index=list_of_18_classifiers, dtype=object)

In [125]:
def full_modelling_pipeline(classifier, inspect_flow=False):
    
    """
    Passing inspect_flow=True will print out the shape of dataframes moving through the flow 
    """
    
    # step 1
    print(f"Running for classifier: {classifier}")
    start_code_time = time.time()
    
    # step 2:
    annotation_features = pd.read_pickle("annotation_features.pkl")
    df_for_pipelining_train_SF = model_pipeline_step_2(classifier, annotation_features)
    if inspect_flow == True: print(f"df_for_pipelining_train_SF: {df_for_pipelining_train_SF.shape}")
    
    # step 3:
    
    classifier_X_train, tfidfTransformer = model_pipeline_step_3_1(df_for_pipelining_train_SF)
    
    classifier_y_train = model_pipeline_step_3_2(df_for_pipelining_train_SF)
    
    if inspect_flow == True: 
        print(f"classifier_X_train (made of CFs plus tf-idf matrix): {classifier_X_train.shape}")
        print(f"classifier_y_train: {classifier_y_train.shape}")
    
    # step 4:
    
    fitted_search = model_pipeline_step_4(classifier_X_train, classifier_y_train)
    
    # step 5:
    
    classifier_X_val, classifier_y_val = model_pipeline_step_5_1(df_for_pipelining_val, tfidfTransformer)
    if inspect_flow == True: 
        print(f"classifier_X_val: {classifier_X_val.shape}")
        print(f"classifier_y_val: {classifier_y_val.shape}")
    
    model_pipeline_step_5_2(classifier, fitted_search, classifier_X_val, classifier_y_val)
    
    # the placeholder integer is still in place if step 5 did not store anything
    if isinstance(model_results[classifier], int):
        print("Model results not saved.")
        raise RuntimeError("Check model results")
    
    print(f"The runtime for {classifier} was {round(time.time() - start_code_time, 5)}")
    print()

In [115]:
def model_pipeline_step_2(classifier, annotation_features):
    
    # step 2 – Get CFs for classifier to use for SF'ing
    
    # filtering the table to get the list object from the same row that lists the classifier:
    classifier_features = annotation_features[ annotation_features['annotation'] == classifier ].reset_index().at[0,'features']
    
    # Filter the DF for rows where any of those features is 1:
    sf_mask = (df_for_pipelining_train[classifier_features] > 0).any(axis=1)
    df_for_pipelining_train_SF = df_for_pipelining_train[sf_mask].reset_index(drop=True)
    print(f"Shape of {classifier} train df after sentence filtering is: {df_for_pipelining_train_SF.shape}")
    
    return df_for_pipelining_train_SF

In [116]:
def model_pipeline_step_3_1(df_for_pipelining_train_SF):
    # separate into X
    
    tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)

    train_tfidf = tfidfTransformer.fit_transform(df_for_pipelining_train_SF['segment_text'])
    
    # Extract CF columns:
    classifier_X_train_cfs = df_for_pipelining_train_SF.loc[:,'contact info':].copy()
    # Use every column after and including the first crafted feature, which happens to be 'contact info'
    
    if classifier_X_train_cfs.shape[1] != 579:
        print(f"Should be left with the 579 crafted features (CF). CF shape is: {classifier_X_train_cfs.shape}")
        raise ValueError("Crafted features not being applied correctly")

    #convert to sparse
    classifier_X_train_cfs = csr_matrix(classifier_X_train_cfs)

    # combine CF columns with TF-IDF to create X
    classifier_X_train = hstack([classifier_X_train_cfs, train_tfidf ])
    return classifier_X_train, tfidfTransformer

In [117]:
def model_pipeline_step_3_2(df_for_pipelining_train_SF):
    # separate into y
    # NOTE: `classifier` is not a parameter of this function, so it resolves to the
    # notebook-level variable set in Step 1 rather than the value passed to full_modelling_pipeline.
    classifier_y_train = df_for_pipelining_train_SF.loc[:,classifier].copy()
    # Ensure Y_train only has binary values (cap any count above 1 at 1)
    classifier_y_train = classifier_y_train.clip(upper=1)
    
    if classifier_y_train.max() != 1:
        print(f"Highest value should be one. Highest value is: {classifier_y_train.max()}")
        raise ValueError("train target column not binary")
    
    return classifier_y_train

In [118]:
def model_pipeline_step_4(classifier_X_train, classifier_y_train):
    cachedir = mkdtemp() # temporary directory used by Pipeline(memory=...) to cache fitted transformers

    pipeline_sequences = [
            ('SVC', SVC()) ]
    pipe = Pipeline(pipeline_sequences, memory = cachedir)

    svc_params = {'SVC__C': [0.1, 1, 10],
                 'SVC__gamma': [0.001, 0.01, 0.1]}

    # Create grid search object
    grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=0, n_jobs=-1, scoring='f1')
    
    fitted_search = grid_search_object.fit(classifier_X_train, classifier_y_train)
    
    return fitted_search

In [119]:
def model_pipeline_step_5_1(df_for_pipelining_val, tfidfTransformer):
    # create validate X and y

    val_tfidf = tfidfTransformer.transform(df_for_pipelining_val['segment_text'])
    # Extract CF columns:
    classifier_X_val_cfs = df_for_pipelining_val.loc[:,'contact info':].copy()
    #convert to sparse
    classifier_X_val_cfs = csr_matrix(classifier_X_val_cfs)

    # combine CF columns with TF-IDF to create X
    classifier_X_val = hstack([classifier_X_val_cfs, val_tfidf ])

    # NOTE: `classifier` is not a parameter of this function, so it resolves to the
    # notebook-level variable set in Step 1 rather than the value passed to full_modelling_pipeline.
    classifier_y_val = df_for_pipelining_val.loc[:,classifier].copy()
    # Ensure Y_val only has binary values (cap any count above 1 at 1)
    classifier_y_val = classifier_y_val.clip(upper=1)
    
    if classifier_y_val.max() != 1:
        print(f"Highest value should be one. Highest value is: {classifier_y_val.max()}")
        raise ValueError("Validation target column not binary")
    
    return classifier_X_val, classifier_y_val

In [120]:
def model_pipeline_step_5_2(classifier, fitted_search, classifier_X_val, classifier_y_val):
    
    # scoring
    classifier_val_prediction = fitted_search.predict(classifier_X_val)

    model_results[classifier] = [fitted_search, classifier_y_val, classifier_val_prediction]
    
    model_results.to_pickle("model_results.pkl")

In [121]:
three_classifiers = ['Identifier_Cookie_or_similar_Tech', 'Identifier_Device_ID', 'Identifier_IMEI']

In [122]:
for each_classifier in three_classifiers:
    full_modelling_pipeline(each_classifier, inspect_flow=True)

Running for classifier: Identifier_Cookie_or_similar_Tech
Shape of Identifier_Cookie_or_similar_Tech train df after sentence filtering is: (537, 614)
df_for_pipelining_train_SF: (537, 614)
classifier_X_train: (537, 16469)
classifier_y_train: (537,)
classifier_X_val: (2651, 16469)
classifier_y_val: (2651,)
The runtime for Identifier_Cookie_or_similar_Tech was 3.68092

Running for classifier: Identifier_Device_ID
Shape of Identifier_Device_ID train df after sentence filtering is: (151, 614)
df_for_pipelining_train_SF: (151, 614)
classifier_X_train: (151, 7172)
classifier_y_train: (151,)
classifier_X_val: (2651, 7172)
classifier_y_val: (2651,)
The runtime for Identifier_Device_ID was 0.53466

Running for classifier: Identifier_IMEI
Shape of Identifier_IMEI train df after sentence filtering is: (1, 614)
df_for_pipelining_train_SF: (1, 614)
classifier_X_train: (1, 701)
classifier_y_train: (1,)


ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=1.
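The error above comes from asking for 5 folds when sentence filtering has left a single row. One possible guard, sketched here with a hypothetical helper name, is to check the filtered target before calling the grid search and skip (or fall back to fewer folds for) classifiers that fail the check. The rule used is deliberately conservative: it asks for both classes to be present and for each to have at least `n_splits` members.

```python
import pandas as pd

def enough_samples_for_cv(y, n_splits=5):
    """Conservative check: two classes present, each with at least n_splits rows."""
    counts = pd.Series(y).value_counts()
    return bool(len(counts) >= 2 and counts.min() >= n_splits)

print(enough_samples_for_cv([0] * 5 + [1] * 5))   # True
print(enough_samples_for_cv([1]))                 # False (the single-row case above)
print(enough_samples_for_cv([0] * 10 + [1] * 2))  # False
```

Inside the loop over classifiers this could be used as `if not enough_samples_for_cv(classifier_y_train): continue`, leaving the placeholder value in `model_results` for that classifier.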

In [123]:
display(model_results)

Contact                                                                              0
Contact_E_Mail_Address                                                               1
Contact_Phone_Number                                                                 2
Identifier_Cookie_or_similar_Tech    [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Device_ID                 [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_IMEI                                                                      5
Identifier_MAC                                                                       6
Identifier_Mobile_Carrier                                                            7
Location                                                                             8
Location_Cell_Tower                                                                  9
Location_GPS                                                                        10
Location_WiFi                              