In this notebook:

Modelling pipeline: grid search over all classes.

Then make bigrams and sentence filtering optional.

In [66]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
import time
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

# pre-processing
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack

# modelling
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# modelling pipeline
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

# modelling evaluation
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

Future pipeline:

For each classifier -><br>
Separate to X and Y<br>
TF-IDF here option 1 <br>
Step for SF'ing<br>
TF-IDF here option 2 <br>
Split into folds (5-fold CV)<br>
3x3 SVM Hyperparameters<br>
Find best neg F1 score

Plus anything else

Pipeline to make now:

1. Separate into classifiers. For each classifier:
2. Apply sentence filtering (SF)
3. Separate into X and Y
4. Create TF-IDF matrix
5. Split each set into 5 folds
6. Grid search over SVM hyperparameters to optimise F1 score

Output.

This will be a reasonable approximation of most of the paper's work. The main missing element will be better text pre-processing, which should improve the results from the crafted features (CFs) and sentence filtering (SF).

Do it for one classifier, then find how to generalise it.

Train, Validate and Test dataframes to use:

In [67]:
df_for_pipelining = pd.read_pickle("crafted_features_df.pkl")

df_for_pipelining_train = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TRAINING' ].copy()
df_for_pipelining_val = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'VALIDATION' ].copy()
df_for_pipelining_test = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TEST' ].copy()

# now that I have used the 'policy_type' column for referring to train/validate/test, 
# I can delete that column along with other unnecessary columns.
for dataframe in [df_for_pipelining_train, df_for_pipelining_val, df_for_pipelining_test]:
    dataframe.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
           'policy_segment_id', 'annotations', 'sentences'], inplace=True)

In [68]:
print(df_for_pipelining_train.shape)
print(df_for_pipelining_val.shape)
print(df_for_pipelining_test.shape)

(8068, 511)
(2651, 511)
(4824, 511)


In [69]:
# annotation features to use for sentence filtering
clean_annotation_features = pd.read_pickle("clean_annotation_features.pkl")

# Step 1: select classifier

Let's start with 1st Party as an example.

In [70]:
classifier = "1st_party"

# Step 2: apply SF'ing

1. Get CFs for 1st Party to use for SF'ing

In [71]:
# filtering the table to get the list object from the same row that lists the classifier
classifier_features = clean_annotation_features[ clean_annotation_features['annotation'] == classifier ].reset_index().at[0,'features']

classifier_features

['we', 'you', 'us', 'our', 'the app', 'the software']

2. Filter the DF for rows where any of those features is 1.

In [72]:
df_for_pipelining_train.shape

(8068, 511)

In [73]:
df_for_pipelining_train_SF = df_for_pipelining_train[( (df_for_pipelining_train[classifier_features] > 0).sum(axis=1) > 0 )]
df_for_pipelining_train_SF.reset_index(inplace=True, drop=True)
df_for_pipelining_train_SF.shape

(7297, 511)
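The row filter above can be sketched on a toy dataframe. The column names and values here are made up for illustration, and `.any(axis=1)` is used as an equivalent, slightly shorter form of `.sum(axis=1) > 0`.

```python
import pandas as pd

# Hypothetical toy data: one text column plus two crafted-feature count columns
toy_df = pd.DataFrame({
    "segment_text": ["we collect data", "contact support", "our app uses gps", "no keywords here"],
    "we": [1, 0, 0, 0],
    "our": [0, 0, 2, 0],
})
toy_features = ["we", "our"]  # stands in for classifier_features

# Keep rows where at least one of the listed feature columns is non-zero
mask = (toy_df[toy_features] > 0).any(axis=1)
toy_sf = toy_df[mask].reset_index(drop=True)

print(toy_sf["segment_text"].tolist())
```

Rows with no matching feature are dropped, which is why the training set shrinks from 8068 to 7297 rows for `1st_party`.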

# Step 3: Separate into X and Y

## Create X
X requires a union of the Crafted Features columns and the TF-IDF matrix.

Create TF-IDF matrix:

In [74]:
tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)

train_tfidf = tfidfTransformer.fit_transform(df_for_pipelining_train_SF['segment_text'])
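As a rough illustration of what these vectoriser settings do, here is a minimal sketch on made-up sentences (the toy texts are assumptions, not data from the corpus). With `ngram_range=(1,2)` the vocabulary holds both single words and bigrams, and `binary=True` caps the term-frequency part at 1 before the IDF weighting is applied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts standing in for 'segment_text'
toy_train = ["we collect your email address", "we share location data"]
toy_val = ["email address and phone number"]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", binary=True)
train_matrix = vec.fit_transform(toy_train)  # learns the vocabulary and IDF weights
val_matrix = vec.transform(toy_val)          # reuses them; terms unseen in training are ignored

print(sorted(vec.vocabulary_))
print(train_matrix.shape, val_matrix.shape)
```

Fitting on the training text only and calling `transform` on later data keeps the column layout identical across the train and validation matrices.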

Extract CF columns from X_train and convert to sparse so that it can be combined with TF-IDF:

In [75]:
# Extract CF columns:
classifier_X_train_cfs = df_for_pipelining_train_SF.loc[:,'contact info':].copy()
# Use every column after and including the first crafted feature, which happens to be 'contact info'
print(f"Should be left with the 476 different crafted features (CFs). CF shape is: {classifier_X_train_cfs.shape}")

#convert to sparse
classifier_X_train_cfs = csr_matrix(classifier_X_train_cfs)

# combine CF columns with TF-IDF to create X
classifier_X_train = hstack([classifier_X_train_cfs, train_tfidf ])

Should be left with the 476 different crafted features (CFs). CF shape is: (7297, 476)
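The column-wise union of the two blocks can be sketched with small made-up matrices. The numbers below are illustrative only; the final `.tocsr()` is an optional extra (an assumption, not something the notebook does) that makes row slicing convenient.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# 3 rows x 2 crafted-feature columns (hypothetical counts)
cf_block = csr_matrix(np.array([[1, 0], [0, 2], [0, 0]]))
# 3 rows x 3 TF-IDF columns (hypothetical weights)
tfidf_block = csr_matrix(np.array([[0.5, 0.0, 0.1],
                                   [0.0, 0.3, 0.0],
                                   [0.2, 0.0, 0.0]]))

# Stack side by side: rows must match, columns are concatenated
X = hstack([cf_block, tfidf_block]).tocsr()
print(X.shape)
```

The crafted-feature columns come first, so column `j` of the CF block stays at position `j` in the combined matrix.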


## Create y

In [76]:
classifier_y_train = df_for_pipelining_train_SF.loc[:,classifier].copy()

In [77]:
# Ensure y_train only has binary values: cap any value above 1 at 1
classifier_y_train = classifier_y_train.clip(upper=1)
print(f"Highest value should be one. Highest value is: {classifier_y_train.max()}") # should be 1

Highest value should be one. Highest value is: 1


# Step 4: 5-fold CV Grid Search over hyperparameters

In [78]:
cachedir = mkdtemp() # Temporary directory in which the Pipeline can cache fitted transformers

pipeline_sequences = [
        ('SVC', SVC()) ]
pipe = Pipeline(pipeline_sequences, memory = cachedir)

svc_params = {'SVC__C': [0.1, 1, 10],
             'SVC__gamma': [0.001, 0.01, 0.1]}

# Create grid search object
grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=1, n_jobs=-1, scoring='f1')

In [79]:
%%time
fitted_search = grid_search_object.fit(classifier_X_train, classifier_y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
CPU times: user 9.11 s, sys: 230 ms, total: 9.34 s
Wall time: 1min 15s
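Once a search like this is fitted, the cross-validated score of the winning candidate can be read directly from the search object. The sketch below uses a synthetic dataset from `make_classification` (an assumption for illustration, not the policy data) and the same 3x3 grid of `C` and `gamma`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data, purely for illustration
X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid=param_grid, cv=5, scoring="f1")
search.fit(X_toy, y_toy)

print(search.best_params_)  # the chosen C and gamma
print(search.best_score_)   # mean F1 across the 5 held-out folds for that candidate
```

Because `best_score_` is computed on held-out folds, it is usually a less optimistic figure than a score computed on the same rows the model was fitted to.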


# Evaluation on Train set

To compare to the per-classifier results given in the paper (Table 1 pg 4), I only need to look at F1 score.

In [80]:
classifier_prediction = fitted_search.predict(classifier_X_train)
print(classification_report(classifier_y_train, classifier_prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5504
           1       1.00      1.00      1.00      1793

    accuracy                           1.00      7297
   macro avg       1.00      1.00      1.00      7297
weighted avg       1.00      1.00      1.00      7297



Okay, I need to do my CV grid search on just the Train set, then evaluate performance using the Validate set, since it's massively overfitting on the train set.

# Evaluation on Validate set

Pre-processing steps to prepare the validate set for prediction:

In [81]:
df_for_pipelining_val = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'VALIDATION' ].copy()
df_for_pipelining_val.reset_index(inplace=True, drop=True)

In [82]:
val_tfidf = tfidfTransformer.transform(df_for_pipelining_val['segment_text'])
# Extract CF columns:
classifier_X_val_cfs = df_for_pipelining_val.loc[:,'contact info':].copy()
#convert to sparse
classifier_X_val_cfs = csr_matrix(classifier_X_val_cfs)

# combine CF columns with TF-IDF to create X
classifier_X_val = hstack([classifier_X_val_cfs, val_tfidf ])

classifier_y_val = df_for_pipelining_val.loc[:,classifier].copy()
# Ensure y_val only has binary values: cap any value above 1 at 1
classifier_y_val = classifier_y_val.clip(upper=1)
print(f"Highest value should be one. Highest value is: {classifier_y_val.max()}") # should be 1

Highest value should be one. Highest value is: 1


Scoring:

In [83]:
classifier_val_prediction = fitted_search.predict(classifier_X_val)

model_results[classifier] = [fitted_search, classifier_y_val, classifier_val_prediction]
model_results.to_pickle("model_results.pkl")

# print(classification_report(classifier_y_val, classifier_val_prediction))

Nice! Looks like this scored well. Let's set up the pipeline for all the other classifiers and score them too. I'm still not sure whether I want the positive F1 score or the negative F1 score. I think the negative F1 score is important because it relates to cases where a policy fails to mention an important practice: if a policy does not mention it, we want the classifier to correctly state that it is not mentioned.

Negative recall: of the segments where the practice was not present, what proportion did the classifier label as not present?<br>
Negative precision: of the segments the classifier labelled as not present, what proportion really were not present?

Positive recall: of the segments where the practice was present, what proportion did the classifier identify?<br>Positive precision: of the segments the classifier labelled as present, what proportion really were present?

In [84]:
def show_confusion_matrix(classifier_y_val, classifier_val_prediction):
    cf_matrix = confusion_matrix(classifier_y_val, classifier_val_prediction)
    cf_df = pd.DataFrame(
        cf_matrix, columns=["Predicted Negative", "Predicted Positive"], index=["True Negative", "True Positive"])
    display(cf_df)
show_confusion_matrix(classifier_y_val, classifier_val_prediction)

               Predicted Negative  Predicted Positive
True Negative                1893                  81
True Positive                 107                 570
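To make the four definitions concrete, the sketch below computes them from the counts in the `1st_party` validation confusion matrix above (rows are the actual class, columns the predicted class). The helper name `f1` is just for this illustration.

```python
# Counts read off the validation confusion matrix for 1st_party
tn, fp = 1893, 81   # actual negatives: predicted negative / predicted positive
fn, tp = 107, 570   # actual positives: predicted negative / predicted positive

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

neg_precision = tn / (tn + fn)  # predicted "not present" that really were not present
neg_recall = tn / (tn + fp)     # actual "not present" that were labelled not present
pos_precision = tp / (tp + fp)  # predicted "present" that really were present
pos_recall = tp / (tp + fn)     # actual "present" that were identified

neg_f1 = f1(neg_precision, neg_recall)
pos_f1 = f1(pos_precision, pos_recall)

print(f"Negative: precision={neg_precision:.4f} recall={neg_recall:.4f} F1={neg_f1:.4f}")
print(f"Positive: precision={pos_precision:.4f} recall={pos_recall:.4f} F1={pos_f1:.4f}")
```

The same two F1 values are what `f1_score(..., pos_label=0)` and `f1_score(..., pos_label=1)` would return for these predictions.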


I think that I want to store negative precision, negative recall, negative F1 and positive F1.

This seems like a lot of things to store for each classifier.

I think I can just store the tuple of `(classifier_y_val, classifier_val_prediction)`

Then from that I can extract and populate a bigger table if I want.

Could store as lists... for each model, have a list with 3 values: fitted search, classifier_y_val, classifier_val_prediction.  Then could store each of those lists in a series where the index is the classifier.

Could store as dictionaries.  Each key is the classifier and each value is the list.

Then could loop through to get matrix (table) of scores.

In fact I think it will be helpful to have the order be the same as the order that I pass the classifiers, so I should use a series.
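As an aside on the choice of container, plain dictionaries keep insertion order in Python 3.7 and later, so a dict would also preserve the order in which the classifiers are passed. A minimal sketch, with a shortened classifier list and placeholder values standing in for the real fitted objects:

```python
# Shortened, illustrative list of classifier names
classifier_names = ["Contact", "Location", "SSO"]

results = {}
for name in classifier_names:
    # Placeholders stand in for [fitted_search, classifier_y_val, classifier_val_prediction]
    results[name] = [None, [0, 1, 1], [0, 1, 0]]

# Iterating over the dict yields the keys in the order they were inserted
print(list(results))
```

A series is still convenient if the results are later turned into a dataframe indexed by classifier.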

### Requirements for modelling pipeline:

- List of all classifiers
- df_for_pipelining_train/val/test
- Empty table of classifier results

In [172]:
df_for_pipelining = pd.read_pickle("crafted_features_df.pkl")

df_for_pipelining_train = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TRAINING' ].copy()
df_for_pipelining_val = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'VALIDATION' ].copy()
df_for_pipelining_val.reset_index(inplace=True, drop=True)
df_for_pipelining_test = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TEST' ].copy()
df_for_pipelining_test.reset_index(inplace=True, drop=True)

# now that I have used the 'policy_type' column for referring to train/validate/test, 
# I can delete that column along with other unnecessary columns.
for dataframe in [df_for_pipelining_train, df_for_pipelining_val, df_for_pipelining_test]:
    dataframe.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
           'policy_segment_id', 'annotations', 'sentences'], inplace=True)

List of all classifiers can be taken from any of the previous dataframes

In [173]:
list_of_18_classifiers = ['Contact', 'Contact_E_Mail_Address', 'Contact_Phone_Number', 
                       'Identifier_Cookie_or_similar_Tech', 'Identifier_Device_ID', 'Identifier_IMEI',
                        'Identifier_MAC', 'Identifier_Mobile_Carrier',
                        'Location', 'Location_Cell_Tower', 'Location_GPS', 'Location_WiFi',
                        'SSO', 'Facebook_SSO',
                        '1st_party', '3rd_party',
                        'PERFORMED', 'NOT_PERFORMED'] # cross-checked from table on pg 4 of the paper

In [118]:
model_results = pd.Series(range(len(list_of_18_classifiers)),
                          index=list_of_18_classifiers, dtype=object)

In [174]:
def full_modelling_pipeline(classifier, model_results_series, sentence_filtering=True, inspect_flow=False):
    
    """
    Inputs:
        classifier: name of the annotation (target column) to model.
        
        model_results_series: the series in which to save the results. After each run the series is also pickled to "most_recent_model_results.pkl".
        
        sentence_filtering: boolean, default True. If False, the flow omits the sentence filtering step.
        
        inspect_flow: passing inspect_flow=True prints the shape of the dataframes moving through the flow.
    """
    
    # step 1
    print(f"Running for classifier: {classifier}")
    start_code_time = time.time()
    
    # step 2:
    clean_annotation_features = pd.read_pickle("clean_annotation_features.pkl")
    
    if sentence_filtering:
        df_for_pipelining_train_SF = model_pipeline_step_2(classifier, clean_annotation_features)
    else:
        df_for_pipelining_train_SF = df_for_pipelining_train.copy()
        df_for_pipelining_train_SF.reset_index(inplace=True, drop=True)
    
    if inspect_flow:
        print(f"df_for_pipelining_train_SF: {df_for_pipelining_train_SF.shape}")
    
    # step 3:
    # NOTE: steps 3_2 and 5_1 are not passed `classifier`; they read it from the notebook's global scope.
    
    classifier_X_train, tfidfTransformer = model_pipeline_step_3_1(df_for_pipelining_train_SF)
    
    classifier_y_train = model_pipeline_step_3_2(df_for_pipelining_train_SF)
    
    if inspect_flow:
        print(f"classifier_X_train (made of CFs plus tf-idf matrix): {classifier_X_train.shape}")
        print(f"classifier_y_train: {classifier_y_train.shape}")
    
    # step 4:
    
    fitted_search = model_pipeline_step_4(classifier_X_train, classifier_y_train)
    
    # step 5:
    
    classifier_X_val, classifier_y_val = model_pipeline_step_5_1(df_for_pipelining_val, tfidfTransformer)
    if inspect_flow:
        print(f"classifier_X_val: {classifier_X_val.shape}")
        print(f"classifier_y_val: {classifier_y_val.shape}")
    
    model_pipeline_step_5_2(classifier, fitted_search, classifier_X_val, classifier_y_val, model_results_series)
    
    # The series starts out filled with integers, so an int here means nothing was stored
    if isinstance(model_results_series[classifier], int):
        print("Model results not saved.")
        # NotSavedError was never defined in this notebook, so raise a built-in exception
        raise RuntimeError("Check model results")
    
    print(f"The runtime for {classifier} was {round(time.time() - start_code_time, 5)}")
    print()

In [175]:
def model_pipeline_step_2(classifier, clean_annotation_features):
    
    # step 2 – Get CFs for classifier to use for SF'ing
    
    # filtering the annotations & features table to get the list object from the same row that lists the classifier:
    classifier_features = clean_annotation_features[ clean_annotation_features['annotation'] == classifier ].reset_index().at[0,'features']
    
    # Filter the DF for rows where any of those features is 1:
    df_for_pipelining_train_SF = df_for_pipelining_train[( (df_for_pipelining_train[classifier_features] > 0).sum(axis=1) > 0 )]
    df_for_pipelining_train_SF.reset_index(inplace=True, drop=True)
    print(f"Shape of {classifier} train df after sentence filtering is: {df_for_pipelining_train_SF.shape}")
    
    return df_for_pipelining_train_SF

In [176]:
def model_pipeline_step_3_1(df_for_pipelining_train_SF):
    # separate into X
    
    tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)

    train_tfidf = tfidfTransformer.fit_transform(df_for_pipelining_train_SF['segment_text'])
    
    # Extract CF columns:
    classifier_X_train_cfs = df_for_pipelining_train_SF.loc[:,'contact info':].copy()
    # Use every column after and including the first crafted feature, which happens to be 'contact info'
    
    if classifier_X_train_cfs.shape[1] != 476:
        print(f"Should be left with the 476 crafted features (CF). CF shape is: {classifier_X_train_cfs.shape}")
        # Step_3_CF_error was never defined in this notebook, so raise a built-in exception
        raise ValueError("Crafted features not being applied correctly")

    #convert to sparse
    classifier_X_train_cfs = csr_matrix(classifier_X_train_cfs)

    # combine CF columns with TF-IDF to create X
    classifier_X_train = hstack([classifier_X_train_cfs, train_tfidf ])
    return classifier_X_train, tfidfTransformer

In [177]:
def model_pipeline_step_3_2(df_for_pipelining_train_SF):
    # separate into y
    # NOTE: `classifier` is not a parameter of this function, so it is read from the
    # notebook's global scope rather than from the full_modelling_pipeline call.
    
    classifier_y_train = df_for_pipelining_train_SF.loc[:,classifier].copy()
    # Ensure y_train only has binary values: cap any value above 1 at 1
    classifier_y_train = classifier_y_train.clip(upper=1)
    
    if classifier_y_train.max() != 1:
        print(f"Highest value should be one. Highest value is: {classifier_y_train.max()}")
        # Step_3_y_error was never defined in this notebook, so raise a built-in exception
        raise ValueError("train target column not binary")
    
    return classifier_y_train

In [178]:
def model_pipeline_step_4(classifier_X_train, classifier_y_train):
    cachedir = mkdtemp() # Temporary directory in which the Pipeline can cache fitted transformers

    pipeline_sequences = [
            ('SVC', SVC()) ]
    pipe = Pipeline(pipeline_sequences, memory = cachedir)

    svc_params = {'SVC__C': [0.1, 1, 10],
                 'SVC__gamma': [0.001, 0.01, 0.1]}

    # Create grid search object
    grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=0, n_jobs=-1, scoring='f1')
    
    fitted_search = grid_search_object.fit(classifier_X_train, classifier_y_train)
    
    return fitted_search

In [179]:
def model_pipeline_step_5_1(df_for_pipelining_val, tfidfTransformer):
    # create validate X and y

    val_tfidf = tfidfTransformer.transform(df_for_pipelining_val['segment_text'])
    # Extract CF columns:
    classifier_X_val_cfs = df_for_pipelining_val.loc[:,'contact info':].copy()
    #convert to sparse
    classifier_X_val_cfs = csr_matrix(classifier_X_val_cfs)

    # combine CF columns with TF-IDF to create X
    classifier_X_val = hstack([classifier_X_val_cfs, val_tfidf ])

    # NOTE: `classifier` is not a parameter of this function; it is read from the notebook's global scope.
    classifier_y_val = df_for_pipelining_val.loc[:,classifier].copy()
    # Ensure y_val only has binary values: cap any value above 1 at 1
    classifier_y_val = classifier_y_val.clip(upper=1)
    
    if classifier_y_val.max() != 1:
        print(f"Highest value should be one. Highest value is: {classifier_y_val.max()}")
        # Step_5_val_error was never defined in this notebook, so raise a built-in exception
        raise ValueError("Validation target column not binary")
    
    return classifier_X_val, classifier_y_val

In [180]:
def model_pipeline_step_5_2(classifier, fitted_search, classifier_X_val, classifier_y_val, model_results_series):
    
    # scoring
    classifier_val_prediction = fitted_search.predict(classifier_X_val)

    model_results_series[classifier] = [fitted_search, classifier_y_val, classifier_val_prediction]
    
    model_results_series.to_pickle("most_recent_model_results.pkl")

# Run all classifiers through the pipeline

My initial results will be with sentence filtering included.

In [132]:
initial_results_sf = pd.Series(range(len(list_of_18_classifiers)),
                          index=list_of_18_classifiers, dtype=object)

In [133]:
for each_classifier in list_of_18_classifiers:
    full_modelling_pipeline(each_classifier, model_results_series=initial_results_sf, 
                            sentence_filtering=True, inspect_flow=True)

Running for classifier: Contact
Shape of Contact train df after sentence filtering is: (366, 511)
df_for_pipelining_train_SF: (366, 511)
classifier_X_train (made of CFs plus tf-idf matrix): (366, 13708)
classifier_y_train: (366,)
classifier_X_val: (2651, 13708)
classifier_y_val: (2651,)
The runtime for Contact was 0.9684

Running for classifier: Contact_E_Mail_Address
Shape of Contact_E_Mail_Address train df after sentence filtering is: (557, 511)
df_for_pipelining_train_SF: (557, 511)
classifier_X_train (made of CFs plus tf-idf matrix): (557, 16513)
classifier_y_train: (557,)
classifier_X_val: (2651, 16513)
classifier_y_val: (2651,)
The runtime for Contact_E_Mail_Address was 1.23097

Running for classifier: Contact_Phone_Number
Shape of Contact_Phone_Number train df after sentence filtering is: (487, 511)
df_for_pipelining_train_SF: (487, 511)
classifier_X_train (made of CFs plus tf-idf matrix): (487, 16970)
classifier_y_train: (487,)
classifier_X_val: (2651, 16970)
classifier_y_val: 

In [142]:
initial_results_sf.head(2)

Contact                   [GridSearchCV(cv=5,\n             estimator=Pi...
Contact_E_Mail_Address    [GridSearchCV(cv=5,\n             estimator=Pi...
dtype: object

In [138]:
print(classification_report(initial_results_sf['Contact_E_Mail_Address'][1] , initial_results_sf['Contact_E_Mail_Address'][2]))


              precision    recall  f1-score   support

           0       0.88      0.07      0.12      1974
           1       0.26      0.97      0.41       677

    accuracy                           0.30      2651
   macro avg       0.57      0.52      0.27      2651
weighted avg       0.72      0.30      0.20      2651



# Creating model evaluation table

In [154]:
initial_model_results = pd.read_pickle("most_recent_model_results.pkl")

In [158]:
# take the classifiers as the index for our model results df
initial_f1s = pd.DataFrame(initial_model_results, columns=["Neg F1"]).copy()

# then populate the dataframe with F1 scores by reference to the 'initial_model_results' series.
# the initial_model_results series stores the actual and predicted y-values in the form [model_object, true_y, predicted_y]
for index in initial_f1s.index:
    initial_f1s.loc[index, "Neg F1"] = f1_score(initial_model_results[index][1].copy(), initial_model_results[index][2].copy(), pos_label=0)
    initial_f1s.loc[index, "Pos F1"] = f1_score(initial_model_results[index][1].copy(), initial_model_results[index][2].copy(), pos_label=1)

In [159]:
initial_f1s

                                     Neg F1    Pos F1
Contact                             0.89243  0.664075
Contact_E_Mail_Address             0.121641  0.414335
Contact_Phone_Number               0.778331  0.598834
Identifier_Cookie_or_similar_Tech  0.635948  0.503122
Identifier_Device_ID               0.323308  0.442916
Identifier_IMEI                         0.0  0.406851
Identifier_MAC                     0.899355  0.623771
Identifier_Mobile_Carrier          0.456351  0.448918
Location                           0.521452  0.493204
Location_Cell_Tower                0.056835   0.40969


In [160]:
print(f"Negative F1 mean: {initial_f1s['Neg F1'].mean()}")
print(f"Positive F1 mean: {initial_f1s['Pos F1'].mean()}")

Negative F1 mean: 0.5937491023789958
Positive F1 mean: 0.5168821926220422


Many things to discuss based on the above:
- three 0 scores
- wide range of scores
- Neg F1 tends to be higher than Pos F1
- compare with results from the paper


# Discussion of Zero F1 scores
There is one zero negative F1 score (`Identifier_IMEI`) and two zero positive F1 scores (the two SSO classifiers). Let's start by investigating `Identifier_IMEI`.

## Zero score for IMEI Identifier

A segment is annotated with 'Identifier_IMEI' if the segment describes how the company uses IMEI data to identify a customer. IMEI (International Mobile Equipment Identity) is a unique identification number all phone devices have, and can be used to track the history of the handset (including checking whether the phone has ever been reported as stolen).  First let's look at the confusion matrix. 

In [162]:
show_confusion_matrix(initial_model_results['Identifier_IMEI'][1].copy(), initial_model_results['Identifier_IMEI'][2].copy())

               Predicted Negative  Predicted Positive
True Negative                   0                1974
True Positive                   0                 677


The classifier never predicted that this label is not present.  This is likely because in the subset of the data that the classifier was trained on, all of the observations were positive cases. This will be the result of the sentence filtering – the classifier was only trained using text segments that mentioned phrases related to IMEI, all of which happened to be annotated as indicating that the practice was performed or not performed.

This could mean that a rule-based classifier would be suitable for classifying the IMEI practice.  For a model-based classifier, I would need to use a wider range of features for sentence filtering, or would not conduct sentence filtering.
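One way to test this hypothesis would be to count the labels in the filtered training subset before fitting. The sketch below uses made-up label lists, and the helper `has_both_classes` is a hypothetical name introduced for illustration.

```python
from collections import Counter

def has_both_classes(labels):
    """Return True if the labels contain at least one 0 and at least one 1."""
    counts = Counter(int(v) for v in labels)
    return counts[0] > 0 and counts[1] > 0

# Hypothetical label lists, not taken from the real data
toy_filtered_labels = [1, 1, 1, 1]   # every filtered segment is positive
toy_full_labels = [0, 1, 0, 0, 1]    # unfiltered data has both classes

print(has_both_classes(toy_filtered_labels))
print(has_both_classes(toy_full_labels))
```

Running a check like this on `classifier_y_train` for each classifier would show directly whether sentence filtering leaves only one class to learn from.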

We can see the features used:

In [54]:
clean_annotation_features[clean_annotation_features['annotation']=='Identifier_IMEI']['features'].values

array([list(['imei', 'international mobile equipment', 'equipment id'])],
      dtype=object)

As expected there are very few related phrases.

Another implication could be that because IMEI is so specific, companies will not mention anything related to it in their privacy policies unless they collect this data.

## Zero score for SSO

The SSO ("Single Sign On") annotation is applied to a segment if it discusses what the company does with the data used to facilitate a single sign on service, such as when you sign into an app through your Google or Facebook account.  This is always passed to the appropriate third party to enact the service. The scores for both "SSO" and "Facebook SSO" are the same because the crafted features and most of the annotations are the same.  Again let's start by looking at the confusion matrix.

In [58]:
show_confusion_matrix(initial_model_results['SSO'][1].copy(), initial_model_results['SSO'][2].copy())

               Predicted Negative  Predicted Positive
True Negative                1974                   0
True Positive                 677                   0


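This time the model never predicted that a segment contained a practice relating to SSO. The positive and negative counts in the validation labels (1,974 and 677) also match those for Identifier_IMEI exactly, and they match the row totals of the 1st_party confusion matrix shown earlier (1,893 + 81 and 107 + 570). It's unlikely that three different annotations share identical label counts organically, so there must be a problem with the way the data has been processed.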
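One candidate cause worth checking, offered as a hypothesis rather than a confirmed diagnosis: `model_pipeline_step_3_2` and `model_pipeline_step_5_1` use `classifier` without taking it as a parameter, so Python resolves the name from the notebook's global scope. The toy sketch below (all names and data are hypothetical) shows how a helper written that way ignores the value passed to the outer function.

```python
# Hypothetical label columns for two annotations
toy_labels = {
    "1st_party": [1, 0, 1],
    "SSO": [0, 0, 1],
}

classifier = "1st_party"  # notebook-level variable left over from an earlier cell

def get_y_from_global(labels):
    # `classifier` is not a parameter, so it is looked up in the global scope
    return labels[classifier]

def get_y_explicit(labels, classifier_name):
    # The target name is passed in, so the caller controls which column is used
    return labels[classifier_name]

def run_pipeline(classifier):
    # This `classifier` is local to run_pipeline; get_y_from_global cannot see it
    return get_y_from_global(toy_labels), get_y_explicit(toy_labels, classifier)

from_global, explicit = run_pipeline("SSO")
print(from_global)  # labels for the global classifier, not the one requested
print(explicit)     # labels for the requested classifier
```

If this is what is happening, every run would be trained and scored against the same target column, which would explain identical label counts across classifiers.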

# Modelling without sentence filtering

In [186]:
second_results_no_sf = pd.Series(range(len(list_of_18_classifiers)),
                          index=list_of_18_classifiers, dtype=object)

In [181]:
for each_classifier in ["Contact_E_Mail_Address", "Contact_Phone_Number"]:
    full_modelling_pipeline(each_classifier, second_results_no_sf, sentence_filtering=False, inspect_flow=True)

Running for classifier: Contact_E_Mail_Address
df_for_pipelining_train_SF: (8068, 511)
classifier_X_train (made of CFs plus tf-idf matrix): (8068, 113262)
classifier_y_train: (8068,)
classifier_X_val: (2651, 113262)
classifier_y_val: (2651,)
The runtime for Contact_E_Mail_Address was 80.01306

Running for classifier: Contact_Phone_Number
df_for_pipelining_train_SF: (8068, 511)
classifier_X_train (made of CFs plus tf-idf matrix): (8068, 113262)
classifier_y_train: (8068,)
classifier_X_val: (2651, 113262)
classifier_y_val: (2651,)
The runtime for Contact_Phone_Number was 76.93241



In [182]:
second_results_no_sf

Contact                              [GridSearchCV(cv=5,\n             estimator=Pi...
Contact_E_Mail_Address               [GridSearchCV(cv=5,\n             estimator=Pi...
Contact_Phone_Number                 [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Cookie_or_similar_Tech    [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Device_ID                 [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_IMEI                      [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_MAC                       [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Mobile_Carrier            [GridSearchCV(cv=5,\n             estimator=Pi...
Location                             [GridSearchCV(cv=5,\n             estimator=Pi...
Location_Cell_Tower                  [GridSearchCV(cv=5,\n             estimator=Pi...
Location_GPS                         [GridSearchCV(cv=5,\n             estimator=Pi...
Location_WiFi                        [GridS

In [183]:
# take the classifiers as the index for our model results df
f1s_no_sf = pd.DataFrame(second_results_no_sf, columns=["Neg F1"]).copy()

# then populate the dataframe with F1 scores by reference to the 'second_results_no_sf' series.
# the second_results_no_sf series stores the actual and predicted y-values in the form [model_object, true_y, predicted_y]
for index in f1s_no_sf.index:
    f1s_no_sf.loc[index, "Neg F1"] = f1_score(second_results_no_sf[index][1].copy(), second_results_no_sf[index][2].copy(), pos_label=0)
    f1s_no_sf.loc[index, "Pos F1"] = f1_score(second_results_no_sf[index][1].copy(), second_results_no_sf[index][2].copy(), pos_label=1)
    

In [184]:
second_results_no_sf["3rd_party"][2]

array([0, 0, 0, ..., 0, 0, 0])

In [185]:
f1s_no_sf

                                     Neg F1    Pos F1
Contact                            0.954009  0.861678
Contact_E_Mail_Address             0.954009  0.861678
Contact_Phone_Number               0.954009  0.861678
Identifier_Cookie_or_similar_Tech  0.954009  0.861678
Identifier_Device_ID               0.954009  0.861678
Identifier_IMEI                    0.954009  0.861678
Identifier_MAC                     0.954009  0.861678
Identifier_Mobile_Carrier          0.954009  0.861678
Location                           0.954009  0.861678
Location_Cell_Tower                0.954009  0.861678


In [187]:
second_results_no_sf

Contact                               0
Contact_E_Mail_Address                1
Contact_Phone_Number                  2
Identifier_Cookie_or_similar_Tech     3
Identifier_Device_ID                  4
Identifier_IMEI                       5
Identifier_MAC                        6
Identifier_Mobile_Carrier             7
Location                              8
Location_Cell_Tower                   9
Location_GPS                         10
Location_WiFi                        11
SSO                                  12
Facebook_SSO                         13
1st_party                            14
3rd_party                            15
PERFORMED                            16
NOT_PERFORMED                        17
dtype: object

In [188]:
testingproblem = pd.read_pickle("most_recent_model_results.pkl")

In [189]:
testingproblem

Contact                              [GridSearchCV(cv=5,\n             estimator=Pi...
Contact_E_Mail_Address               [GridSearchCV(cv=5,\n             estimator=Pi...
Contact_Phone_Number                 [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Cookie_or_similar_Tech    [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Device_ID                 [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_IMEI                      [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_MAC                       [GridSearchCV(cv=5,\n             estimator=Pi...
Identifier_Mobile_Carrier            [GridSearchCV(cv=5,\n             estimator=Pi...
Location                             [GridSearchCV(cv=5,\n             estimator=Pi...
Location_Cell_Tower                  [GridSearchCV(cv=5,\n             estimator=Pi...
Location_GPS                         [GridSearchCV(cv=5,\n             estimator=Pi...
Location_WiFi                        [GridS

In [192]:
f1_score(testingproblem["Contact_E_Mail_Address"][1].copy(), testingproblem["Contact_E_Mail_Address"][2].copy(), pos_label=0)

0.9540085448605177