In this notebook:

Modelling pipeline: grid search over all classes.

Then add optional bigrams and sentence filtering.

In [56]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

# pre-processing
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack

# modelling
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

# modelling pipeline
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

# modelling evaluation
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

Ideal pipeline:

For each classifier -><br>
Separate to X and Y<br>
Step for SF'ing<br>
Split into folds (5-fold CV)<br>
TFIDF (Include bi-grams?)<br>
3x3 SVM Hyperparameters

Plus anything else

Pipeline to make now:

1. Separate into classifiers. For each classifier:
2. Apply SF
3. Separate into X and Y
4. Split each set into 5 folds
5. TF-IDF each fold
6. Grid search over SVM Hyperparameters

Output.

This will be a reasonable approximation of a replication of most of their work. The main missing element is better text pre-processing, which would improve the results from the crafted features (CFs) and sentence filtering (SF).

Do it for one classifier, then find how to generalise it.
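One way the generalisation could look is a single function that takes the classifier name and its feature list, then runs steps 2 to 6. This is only a sketch on invented toy data: `run_classifier_search`, its arguments, and the toy frame are hypothetical names, not part of this notebook, and `cv=2` is used only because the toy set is tiny.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def run_classifier_search(df, classifier, sf_features, cf_columns, cv=5):
    """Hypothetical helper: steps 2-6 of the plan for one classifier."""
    # Step 2: sentence filtering, keep rows where any SF feature fires
    filtered = df[(df[sf_features] > 0).any(axis=1)].reset_index(drop=True)
    # Step 3: X is the text plus crafted features, y is the binarised label
    X = filtered[["segment_text"] + cf_columns]
    y = filtered[classifier].clip(upper=1)
    # Steps 4-6: the pipeline refits TF-IDF inside every CV fold
    pipe = Pipeline([
        ("tfidf", ColumnTransformer(
            [("tfidf", TfidfVectorizer(), "segment_text")],
            remainder="passthrough")),
        ("SVC", SVC()),
    ])
    grid = {"SVC__C": [0.1, 1, 10], "SVC__gamma": [0.001, 0.01, 0.1]}
    return GridSearchCV(pipe, grid, cv=cv).fit(X, y)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "segment_text": ["we collect your email", "third parties may share data",
                     "we store our logs", "partners receive the data"] * 3,
    "1st_party": [2, 0, 1, 0] * 3,
    " we ": [1, 0, 1, 0] * 3,
    "the data": [0, 1, 0, 1] * 3,
})
search = run_classifier_search(
    toy, "1st_party", [" we ", "the data"], [" we ", "the data"], cv=2
)
print(search.best_params_)
```

Looping this over a list of classifier names, with each one's feature list looked up per classifier, would cover step 1.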

Train and Test Dataframes to use:

In [6]:
df_for_pipelining = pd.read_pickle("crafted_features_df.pkl")
df_for_pipelining_train = df_for_pipelining.loc[df_for_pipelining['policy_type'] != 'TEST' ].copy()
df_for_pipelining_test = df_for_pipelining.loc[df_for_pipelining['policy_type'] == 'TEST' ].copy()
df_for_pipelining_train.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'annotations', 'sentences'], inplace=True)
df_for_pipelining_test.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'annotations', 'sentences'], inplace=True)

# Step 1: select classifier

Let's start with 1st Party.

In [10]:
classifier = "1st_party"

# Step 2: apply SF'ing

1. Get CFs for 1st Party to use for SF'ing

In [20]:
annotation_features = pd.read_pickle("annotation_features.pkl")
classifier_features = annotation_features[ annotation_features['annotation'] == classifier ].reset_index().at[0,'features']
# filtering the table to get the list object from the same row that lists the classifier
classifier_features

[' we ', ' you ', ' us ', ' our ', 'the app', 'the software']

2. Filter the DF for rows where any of those features is greater than 0.

In [29]:
# Keep rows where at least one of the classifier's features is present
sf_mask = (df_for_pipelining_train[classifier_features] > 0).any(axis=1)
df_for_pipelining_train_SF = df_for_pipelining_train[sf_mask].reset_index(drop=True)
df_for_pipelining_train_SF.shape

(6850, 614)
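As a sanity check on the filter, here is the same row-wise test on a tiny frame. The toy rows and counts are invented for illustration; only the feature column names follow the list printed earlier.

```python
import pandas as pd

# Toy frame, invented for illustration; columns mimic crafted-feature counts
toy = pd.DataFrame({
    "segment_text": ["we collect data", "weather today", "contact us"],
    " we ": [1, 0, 0],
    " us ": [0, 0, 2],
})
features = [" we ", " us "]

# Row-wise: True if any of the listed feature counts is above zero
mask = (toy[features] > 0).any(axis=1)
toy_sf = toy[mask].reset_index(drop=True)
print(toy_sf["segment_text"].tolist())
```

Only the middle row, which matches neither feature, is dropped. `.any(axis=1)` gives the same mask as `.sum(axis=1) > 0`.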

# Step 3: Separate into X and Y

In [38]:
df_for_pipelining_train_SF.head(3)

Unnamed: 0,segment_text,SSO,Facebook_SSO,1st_party,3rd_party,Contact,Contact_Address_Book,Contact_City,Contact_E_Mail_Address,Contact_Password,...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,"Privacy Policy 360 Security (the ""Software"") i...",0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,"Information we collect: Unless you use the ""Fi...",0,0,2,0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,Information we get from your use of our servic...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
# Extract the CF columns: keep everything from the first crafted feature
# (which happens to be 'contact info') onwards, dropping every column before it
classifier_X_train_cfs = df_for_pipelining_train_SF.loc[:,'contact info':].copy()

print(f"Should be left with the 579 crafted features. Shape is: {classifier_X_train_cfs.shape}") # should have 579 features

# Keep the segment_text
classifier_X_train = pd.concat([classifier_X_train_cfs, df_for_pipelining_train_SF['segment_text'] ], axis=1)
print(f"Columns should now be 580: {classifier_X_train.shape}")

Should be left with the 579 crafted features. Shape is: (6850, 579)
Columns should now be 580: (6850, 580)


In [132]:
classifier_y_train = df_for_pipelining_train_SF.loc[:,classifier].copy()

In [133]:
# Ensure Y_train only has binary values: cap any count above 1 at 1
classifier_y_train = classifier_y_train.clip(upper=1)
print(f"Highest value should be one. Highest value is: {classifier_y_train.max()}") # should be 1

Highest value should be one. Highest value is: 1


# Step 4: Split X into 5 folds

For Pipeline I will need:

- TFIDF Transformer on the segment text column
- Column transformer to remove the segment text column(?)
- Column transformer to combine the resulting TFIDF matrix with the CFs matrix
   - This is equivalent to appending the classifier_X_train_cfs dataframe (I think?)
- SVM

OR

- Pass in the segment_text column, filtered to the correct rows after SF'ing
- TFIDF
- Append the appropriate CFs matrix... but now it's a certain fold.

OR

- Pass in the segment_text and CFs dataframe
- Apply a custom transformer:
   - TF-IDF the segment_text column and save the result
   - Remove the segment_text column
   - Convert the remaining df to a sparse matrix
   - Combine the TF-IDF matrix with the sparse remaining df
- SVM
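The third option could be sketched roughly as below. `TextPlusFeatures` is a hypothetical class name, the toy frame is invented, and the sketch assumes every non-text column is numeric; it is one possible shape for such a transformer, not something this notebook ends up using.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class TextPlusFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: TF-IDF one text column, append the rest."""

    def __init__(self, text_column="segment_text", ngram_range=(1, 1)):
        self.text_column = text_column
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        # Learn the vocabulary from the data passed to fit (the training fold)
        self.vectorizer_ = TfidfVectorizer(ngram_range=self.ngram_range)
        self.vectorizer_.fit(X[self.text_column])
        return self

    def transform(self, X):
        tfidf = self.vectorizer_.transform(X[self.text_column])
        # Drop the text column and convert the remaining features to sparse
        rest = csr_matrix(X.drop(columns=[self.text_column]).to_numpy(dtype=float))
        # Column-wise concatenation: TF-IDF terms first, crafted features last
        return hstack([tfidf, rest]).tocsr()


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "segment_text": ["we collect your email", "we share data"],
    "contact info": [0, 1],
})
out = TextPlusFeatures().fit_transform(toy)
print(out.shape)
```

Because the vectorizer is fitted inside `fit`, putting this transformer in a `Pipeline` means each CV fold gets its own vocabulary, which also resolves the "but now it's a certain fold" concern from the second option.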

In [124]:
classifier_X_train.iloc[:,:-1]

Unnamed: 0,contact info,contact details,contact data,"e.g., your name",contact you,your contact,"identify, contact",identifying information,"your name, address, and e-mail address",including e-mail,...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6845,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6846,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6847,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6848,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [143]:
tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)

col_transforms = [("tfidf", tfidfTransformer, 'segment_text')]

In [144]:
pipeline_sequences = [
        ("tfidf", ColumnTransformer(col_transforms, remainder='passthrough')),
        ('SVC', SVC())
    ]

In [145]:
cachedir = mkdtemp() # Temporary directory where the pipeline caches fitted transformers
pipe = Pipeline(pipeline_sequences, memory = cachedir)

svc_params = {'SVC__C': [0.1, 1, 10],
             'SVC__gamma': [0.001, 0.01, 0.1]}

# Create grid search object
grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=4, n_jobs=-1)

In [147]:
%%time
fitted_search = grid_search_object.fit(classifier_X_train, classifier_y_train) #.iloc[:,:-1]

Fitting 5 folds for each of 9 candidates, totalling 45 fits


If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  X, fitted_transformer = fit_transform_one_cached(

Traceback (most recent call last):
  File "/Users/chinchcliffe/opt/anaconda3/envs/priv_pol_nlp/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/chinchcliffe/opt/anaconda3/envs/priv_pol_nlp/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 429, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Users/chinchcliffe/opt/anaconda3/envs/priv_pol_nlp/lib/python3.10/site-packages/sklearn/pipeline.py", line 699, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
  File "/Users/chinchcliffe/opt/anaconda3/envs/priv_pol_nlp/lib/python3.10/site-packages/sklearn/base.py", line 666, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Users/chinchcliffe/opt/anaconda3/envs/priv_pol_nlp/lib/python3.10/site-packages/sklearn/svm/_base.py", line 810, in predict
    y = super().predict(X)
  File "/Users/chinchcli

KeyboardInterrupt: 

In [134]:
# The SVC part works. Next, think about adding the TF-IDF parameters to the grid search.

1
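Adding TF-IDF settings to the grid search mostly comes down to getting the nested parameter name right. A minimal sketch, assuming the same step names as the pipeline in this notebook (`"tfidf"` for the `ColumnTransformer` step, `"tfidf"` again for the vectorizer inside it, and `"SVC"`); the choice of `ngram_range` values is illustrative.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ("tfidf", ColumnTransformer(
        [("tfidf", TfidfVectorizer(stop_words="english", binary=True), "segment_text")],
        remainder="passthrough")),
    ("SVC", SVC()),
])

# Nested names: <pipeline step>__<transformer name>__<parameter>
param_grid = {
    "tfidf__tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "SVC__C": [0.1, 1, 10],
    "SVC__gamma": [0.001, 0.01, 0.1],
}

# Check that every key is a real parameter of the pipeline before fitting
assert set(param_grid) <= set(pipe.get_params())

search = GridSearchCV(pipe, param_grid, cv=5)
```

This grid has 2 x 3 x 3 = 18 candidates, so 90 fits with 5-fold CV, twice the cost of the SVC-only grid.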

Looks like something is wrong: the fit threw joblib performance warnings and ended in a KeyboardInterrupt before finishing.

Hypothesis 1: the error comes from the transformer producing a sparse matrix that is then combined with a pandas DataFrame.

Test for hypothesis 1: run the transformer on data by itself and see whether it works, and what it outputs.

In [84]:
classifier_X_train.loc[:,['segment_text', 'contact info']]

Unnamed: 0,segment_text,contact info
0,"Privacy Policy 360 Security (the ""Software"") i...",0
1,"Information we collect: Unless you use the ""Fi...",0
2,Information we get from your use of our servic...,0
3,Log information. When you use our services and...,0
4,Information collected relating to installed pr...,0
...,...,...
6845,When you sign into your Zynga account or enter...,0
6846,While we take precautions against possible sec...,0
6847,back to top Changes to Our Privacy Policy If w...,0
6848,back to top Contact Us If you have any questio...,0


In [96]:
hyp1 = classifier_X_train.copy()

tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)

col_transforms = [("tfidf", tfidfTransformer, 'segment_text')]

# Create the column transformer
col_trans = ColumnTransformer(col_transforms, remainder='passthrough')

col_trans.fit(hyp1)
hyp1_transformed = col_trans.transform(hyp1)
hyp1_transformed

# col_trans.get_feature_names_out()

<6850x108824 sparse matrix of type '<class 'numpy.float64'>'
	with 480123 stored elements in Compressed Sparse Row format>

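Shapes so far:

- 6850x107985 with only the segment_text and 'contact info' columns
- 6850x107985 with all columns

So it's replacing / deleting my other columns? Yes: ColumnTransformer drops any column not named in a transformer by default (remainder='drop').

After adding remainder='passthrough': 6850x108824.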

In [97]:
len(col_trans.get_feature_names_out())

108564

In [93]:
hyp1.shape

(6850, 580)

In [102]:
hyp1 = classifier_X_train.copy()
tfidfTransformer = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)
col_transforms = [("tfidf", tfidfTransformer, 'segment_text')]
# Create the column transformer
col_trans = ColumnTransformer(col_transforms, remainder='drop')
col_trans.fit(hyp1)
hyp1_transformed = col_trans.transform(hyp1)
hyp1_transformed

<6850x107985 sparse matrix of type '<class 'numpy.float64'>'
	with 420459 stored elements in Compressed Sparse Row format>
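The effect of `remainder` can be reproduced on a toy frame, where the column counts are easy to check by hand. The data below is invented for illustration. In the runs above, the 107985 TF-IDF columns plus the 579 crafted features would give 108564, which matches the `get_feature_names_out()` length; the 108824 in the matrix repr does not match that sum, possibly because it comes from a different run of the cell (an assumption, the outputs do not say).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "segment_text": ["we collect your email", "we share data"],
    "contact info": [0, 1],
    "contact you": [1, 0],
})

shapes = {}
for remainder in ("drop", "passthrough"):
    ct = ColumnTransformer(
        [("tfidf", TfidfVectorizer(), "segment_text")],
        remainder=remainder,
    )
    shapes[remainder] = ct.fit_transform(toy).shape

# passthrough should add exactly the non-text columns
extra = shapes["passthrough"][1] - shapes["drop"][1]
print(shapes, extra)
```

The difference between the two widths equals the number of non-text columns, which is the check to apply to the full data.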