In [1]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

Description of steps taken:

Approaching EVERYTHING at the segment level.

Create targets to be classified for:
- Each practice (30 practices)
- Performed; not performed (2 modalities)
- 1st Party; 3rd Party (2 parties)

Initially summed for EDA then converted to binary.

Create "Crafted Features":<br>
579 Crafted Features created. Each target has a set of corresponding crafted features. The presence of that feature could indicate the presence of that target.

Not yet actioned: some pre-processing to apply to:
- Crafted Features
- Segments

Pre-processing steps:
- Remove whitespace
- Normalize punctuation
- Remove non-ASCII characters
- Convert to lowercase

Then the Crafted Features columns could be re-populated and TF-IDF could be updated.
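A minimal sketch of the four pre-processing steps, using only the standard library. The helper name `preprocess_text` and the punctuation map are hypothetical, and "remove whitespace" is assumed here to mean collapsing repeated whitespace and trimming the ends.

```python
import re
import unicodedata

# Hypothetical punctuation map (not from the original notebook).
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def preprocess_text(text):
    """Apply the four pre-processing steps to a single string."""
    text = text.translate(_PUNCT_MAP)                        # normalize punctuation
    text = unicodedata.normalize("NFKD", text)               # split accents from letters
    text = text.encode("ascii", "ignore").decode("ascii")    # remove non-ASCII characters
    text = text.lower()                                      # convert to lowercase
    return re.sub(r"\s+", " ", text).strip()                 # collapse and trim whitespace

print(preprocess_text("  We DON\u2019T  collect\tyour  Caf\u00e9 data.  "))
```

The same function would need to be applied to both the segment text and the crafted-feature phrases (for example with `train_set['segment_text'].map(preprocess_text)`) so that the two are matched in the same normalized form.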

**TF-IDF:**<br>
`TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)`<br>
Could use the NLTK stop-word list instead, as there are known issues with the built-in `'english'` list in `TfidfVectorizer`.

**Scaling**<br>
Scaling is not mentioned in the paper, but I could try `StandardScaler` to see whether it improves the results.
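One thing to keep in mind is that the feature matrix here is sparse, and `StandardScaler` refuses to centre sparse input. A small sketch on made-up data (`X_toy` is a stand-in, not the real `X_train`), assuming scaling by the standard deviation alone is acceptable:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Toy sparse matrix standing in for X_train (values are made up).
X_toy = csr_matrix(np.array([[0.0, 2.0, 0.0],
                             [1.0, 0.0, 0.0],
                             [0.0, 4.0, 3.0]]))

# with_mean=False skips centring, so zeros stay zeros and the matrix stays sparse.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_toy)

print(X_scaled.shape, X_scaled.nnz)
```

The scaler would be fitted on the train matrix only and then applied to the test matrix with `transform`.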

**PCA**<br>
PCA is not required but I should explain why.

Modelling:
- LinearSVC
- `class_weight='balanced'`

Could try SVC with kernel='linear', which is what they did, instead of LinearSVC.


5-fold CV grid search over:
- C=[0.1, 1, 10]
- gamma=[0.001, 0.01, 0.1] (note: `gamma` has no effect with a linear kernel, so only `C` matters for the models above)

Could try different models and different hyperparameters
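A rough sketch of the planned search on synthetic data, assuming the linear-kernel setup described above. `X_demo` and `y_demo` are made-up stand-ins for one practice target, and `scoring="f1"` is an assumption about which metric to select on. Since the linear kernel ignores `gamma`, only `C` is searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for one practice target (not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=0
)

# kernel='linear' ignores gamma, so only C is searched here.
param_grid = {"C": [0.1, 1, 10]}
search = GridSearchCV(
    SVC(kernel="linear", class_weight="balanced"),
    param_grid=param_grid,
    cv=5,
    scoring="f1",  # F1 of the positive class (assumed metric)
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```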

## Create Train set

In [82]:
train_set = pd.read_pickle("crafted_features_df.pkl")

In [83]:
train_set.head(3)

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,SSO,Facebook_SSO,1st_party,...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W...",0,0,2,...,0,0,0,0,0,0,0,0,0,0


In [84]:
train_set.columns

Index(['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'segment_text', 'annotations', 'sentences', 'SSO',
       'Facebook_SSO', '1st_party',
       ...
       'never be acquired', 'never be viewed', 'never be located',
       'never be asked', 'never be utilized', 'never be requested',
       'never be transmitted', 'never be communicated', 'nor do we collect',
       'does not tell us'],
      dtype='object', length=620)

In [85]:
print(f"Old size {train_set.shape}")
train_set = train_set.loc[train_set['policy_type'] != 'TEST' ]
print(f"New size {train_set.shape}")

Old size (15543, 620)
New size (10719, 620)


In [86]:
# reset index
train_set = train_set.reset_index(drop = True)
# Remove other columns not required
train_set.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'annotations', 'sentences'], inplace=True)

In [88]:
train_set.head(3)

Unnamed: 0,segment_text,SSO,Facebook_SSO,1st_party,3rd_party,Contact,Contact_Address_Book,Contact_City,Contact_E_Mail_Address,Contact_Password,...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,Home Find my phone Blog,0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,"Privacy Policy 360 Security (the ""Software"") i...",0,0,0,0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,"Information we collect: Unless you use the ""Fi...",0,0,2,0,0.0,0.0,0.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [90]:
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words='english', binary=True)
tfidf.fit(train_set['segment_text'])

In [91]:
train_tfidf = tfidf.transform(train_set['segment_text'])
print(type(train_tfidf))
print(train_tfidf.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(10719, 142578)


In [92]:
train_tfidf.sum() 

68222.02135058281

In [93]:
train_set.iloc[:,1:].sum().sum()

84299.0

The targets need to be taken from the `train_set` DataFrame and saved as `Y_train`.

In [94]:
Y_train = train_set.loc[:,'SSO':'NOT_PERFORMED']
print(Y_train.shape) # should have 34 targets
display(Y_train.head(3))

(10719, 34)


Unnamed: 0,SSO,Facebook_SSO,1st_party,3rd_party,Contact,Contact_Address_Book,Contact_City,Contact_E_Mail_Address,Contact_Password,Contact_Phone_Number,...,Identifier_SIM_Serial,Identifier_SSID_BSSID,Location,Location_Bluetooth,Location_Cell_Tower,Location_GPS,Location_IP_Address,Location_WiFi,PERFORMED,NOT_PERFORMED
0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
1,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
2,0,0,2,0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0


In [97]:
# Ensure Y_train only has binary values: any count above 1 becomes 1
Y_train = Y_train.clip(upper=1)
Y_train.max().max() # should be 1

1.0

X_train will be made up of the crafted features plus the tfidf matrix.  I will start by removing the targets from the train set to just be left with the crafted features.  Then I will combine with the tfidf matrix.

In [63]:
# Remove the targets from X_train
train_cfs = train_set.loc[:,'contact info':]
print(train_cfs.shape) # should have 579 features
display(train_cfs.head(3))

(10719, 579)


Unnamed: 0,contact info,contact details,contact data,"e.g., your name",contact you,your contact,"identify, contact",identifying information,"your name, address, and e-mail address",including e-mail,...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
4824,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4825,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4826,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [64]:
from scipy.sparse import csr_matrix, hstack

In [65]:
# convert to sparse
X_train = csr_matrix(train_cfs)
print(type(X_train))
print(X_train.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(10719, 579)


In [66]:
X_train = hstack([X_train,train_tfidf])
print(X_train.shape)

(10719, 143157)


## Create Test set

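Exactly the same steps as for the train set, except that the TF-IDF vectorizer fitted on the train set is reused (`transform` only, no re-fitting).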

In [160]:
test_set = pd.read_pickle("crafted_features_df.pkl")
print(f"Old size {test_set.shape}")
test_set = test_set.loc[test_set['policy_type'] == 'TEST' ]
print(f"New size {test_set.shape}")
# reset index
test_set = test_set.reset_index(drop = True)
# Remove other columns not required
test_set.drop(columns=['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'annotations', 'sentences'], inplace=True)

test_tfidf = tfidf.transform(test_set['segment_text'])

# Take the targets from the test_set df and save them as Y_test
Y_test = test_set.loc[:,'SSO':'NOT_PERFORMED']
print(f"Y_test shape: {Y_test.shape} should equal 34 targets") # should have 34 targets

# Ensure Y_test only has binary values: any count above 1 becomes 1
Y_test = Y_test.clip(upper=1)
print(f"{Y_test.max().max()} should be one") # should be 1

# Remove the targets from X_test
test_cfs = test_set.loc[:,'contact info':]
print(f"test_cfs shape: {test_cfs.shape} should have 579 features") # should have 579 features

# convert to sparse
X_test = csr_matrix(test_cfs)
# combine
X_test = hstack([X_test,test_tfidf])
print(f"X_test shape: {X_test.shape}")

Old size (15543, 620)
New size (4824, 620)
Y_test shape: (4824, 34) should equal 34 targets
1.0 should be one
test_cfs shape: (4824, 579) should have 579 features
X_test shape: (4824, 143157)


I'm going to have to make a list of all the pre-processing/pipeline options so that I know what I have done and what I can play around with.

**Note: Optional EDA can be done on the tf-idf**

Setup used in the paper: linear kernel (`kernel='linear'`), balanced class weights (`class_weight='balanced'`), and a grid search with five-fold cross-validation over the penalty (`C=[0.1, 1, 10]`) and gamma (`gamma=[0.001, 0.01, 0.1]`).

In [121]:
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

I would try using `LinearSVC` instead of what they used (`SVC` with `kernel='linear'`), because the scikit-learn documentation says it should scale better to large numbers of samples.
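A quick way to see how far the two implementations agree is to fit both on the same data and compare predictions. This is only a sketch on synthetic data (`X_demo`, `y_demo` are made up). The two are not expected to match exactly, since `LinearSVC` uses the squared hinge loss by default and also regularizes the intercept.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic, imbalanced stand-in data (not the real feature matrix).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, weights=[0.9, 0.1], random_state=0
)

svc = SVC(kernel="linear", class_weight="balanced", C=0.1).fit(X_demo, y_demo)
lin = LinearSVC(class_weight="balanced", C=0.1, max_iter=10000).fit(X_demo, y_demo)

# Fraction of samples on which the two models give the same label.
agreement = (svc.predict(X_demo) == lin.predict(X_demo)).mean()
print(f"Share of identical predictions: {agreement:.3f}")
```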

In [76]:
# Note: gamma is ignored by the linear kernel, so gamma=0.1 has no effect here
SVM_model = SVC(kernel='linear', class_weight='balanced', C=0.1, gamma=0.1)

In [101]:
%%time
SVM_model.fit(X_train, Y_train["SSO"]) # just doing a single class to start with

CPU times: user 7.93 s, sys: 94.8 ms, total: 8.02 s
Wall time: 8.06 s


In [105]:
%%time
firstprediction = SVM_model.predict(X_train)

CPU times: user 8.17 s, sys: 37.7 ms, total: 8.2 s
Wall time: 8.29 s


In [108]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [107]:
print(accuracy_score(Y_train["SSO"], firstprediction))

0.9802220356376528


In [115]:
cf_matrix = confusion_matrix(Y_train["SSO"], firstprediction)
cf_df = pd.DataFrame(
    cf_matrix, columns=["Predicted Negative", "Predicted Positive"], index=["True Negative", "True Positive"])
cf_df

Unnamed: 0,Predicted Negative,Predicted Positive
True Negative,10309,212
True Positive,0,198


In [112]:
from sklearn.metrics import recall_score
recall_score(Y_train["SSO"], firstprediction) 

1.0

In [113]:
Y_train["SSO"].sum()

198

In [117]:
from sklearn.metrics import classification_report
print(classification_report(Y_train["SSO"], firstprediction))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     10521
           1       0.48      1.00      0.65       198

    accuracy                           0.98     10719
   macro avg       0.74      0.99      0.82     10719
weighted avg       0.99      0.98      0.98     10719



A ROC curve can be created for an SVM by using the decision-function scores as a proxy for the threshold, but it is not necessary for my purposes in comparing to the performance in MAPS. It may be interesting for a few selected classifiers, though.
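A sketch of how that could look, assuming the signed distance returned by `decision_function` is used as the score to threshold. The data is synthetic and the variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (not the real segments).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = SVC(kernel="linear", class_weight="balanced", C=0.1).fit(X_tr, y_tr)

# Signed distance to the separating hyperplane acts as the score to threshold.
scores = model.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.3f}")
```

`fpr` and `tpr` can then be passed straight to `plt.plot(fpr, tpr)`.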

# Including or omitting Crafted Features

From the paper: "Based on a grid search, in cases where it improves classifier performance, we remove a segment’s sentences from further processing if they do not contain keywords related to the classifier in question. For example, the Location classifier is not trained on sentences which only describe cookies."

Hmm.  I think I will have to find a way to reduce the dataframe.  I guess I can do it just for SSO, and see what steps I did.  Then I can either:
- include it in the grid search. If it doubles training time, I might have to take it out.
- Test it on only the fitted grid search – so only one extra model for each classifier. In fact I might struggle to make it part of the grid search.  This is the quickest option so let's do this for now.

**Full Pipeline plans**

Which practice classifiers to train?

CV Grid Search
- Which Hyperparams to loop over – `C=[0.1, 1, 10]` and `gamma=[0.001, 0.01, 0.1]`
- Whether to include Crafted Features ("Based on a grid search, in cases where it improves classifier performance, we remove a segment’s sentences from further processing if they do not contain keywords related to the classifier in question. For example, the Location classifier is not trained on sentences which only describe cookies.") I can do this using the hyperparameters set by the grid search.

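How will it select the "best" model from each run? `GridSearchCV` picks the parameter combination with the highest mean cross-validated score, using the metric set by `scoring` (the estimator's default `score`, i.e. accuracy for a classifier, if none is given).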
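The selection can be checked by hand from `cv_results_`. A small sketch on synthetic data (made-up `X_demo`, `y_demo`), using the default scoring:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (not the real segments).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}, cv=5
)
search.fit(X_demo, y_demo)

# Mean score over the 5 validation folds, one entry per parameter combination.
mean_scores = search.cv_results_["mean_test_score"]
best = int(np.argmax(mean_scores))
print(search.cv_results_["params"][best], mean_scores[best])
print(search.best_params_, search.best_score_)
```

By default (`refit=True`) the winning combination is then re-fitted on the whole training set, which is the model that `predict` and `score` use afterwards.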

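Consider using a cache directory to make the grid search run faster, as below (`memory = cachedir`). Note that `Pipeline` only caches fitted transformers and never the final step, so this will only help once the pipeline has steps before the estimator.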

(After replication – Which other models to compare)

In [125]:
from tempfile import mkdtemp
cachedir = mkdtemp() # Memory dump to help with processing

In [127]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

**Create Pipeline**

In [140]:
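# NOTE: SVC() defaults to kernel='rbf' with no class weighting, so this search
# does not yet match the paper's setup (kernel='linear', class_weight='balanced').
estimators = [('SVC', SVC())]

pipe = Pipeline(estimators, memory = cachedir)

svc_params = {'SVC__C': [0.1, 1, 10],
             'SVC__gamma': [0.001, 0.01, 0.1]}

# Create grid search object (default scoring for a classifier is accuracy)
grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=2, n_jobs=-1)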

In [141]:
y_train_SSO = Y_train["SSO"]
print(X_train.shape)
print(y_train_SSO.shape)

(10719, 143157)
(10719,)


In [143]:
%%time
fitted_search = grid_search_object.fit(X_train, y_train_SSO)

32 seconds!?!?

In [149]:
print(fitted_search.best_estimator_)

Pipeline(memory='/var/folders/d2/s5sb3p416xbgzjf589bgpp_w0000gn/T/tmp9cxkrvu2',
         steps=[('SVC', SVC(C=10, gamma=0.1))])


**Fitted Search Evaluation**

In [150]:
fitted_search.score(X_train, y_train_SSO)

0.9997201231458158

In [153]:
grid_search_prediction = fitted_search.predict(X_train)

In [156]:
print(classification_report(y_train_SSO, grid_search_prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10521
           1       0.99      0.99      0.99       198

    accuracy                           1.00     10719
   macro avg       1.00      0.99      1.00     10719
weighted avg       1.00      1.00      1.00     10719



In [158]:
cf_matrix2 = confusion_matrix(y_train_SSO, grid_search_prediction)
cf_df2 = pd.DataFrame(
    cf_matrix2, columns=["Predicted Negative", "Predicted Positive"], index=["True Negative", "True Positive"])
cf_df2

Unnamed: 0,Predicted Negative,Predicted Positive
True Negative,10520,1
True Positive,2,196


Well this is suspiciously overfitted.  Let's compare the results on the test data.

**Results for test_SSO**

In [163]:
test_grid_search_prediction = fitted_search.predict(X_test)

In [164]:
print(f"Accuracy score is {fitted_search.score(X_test, Y_test['SSO'])} ")
print(classification_report(Y_test['SSO'], test_grid_search_prediction))

Accuracy score is 0.9859038142620232 
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      4748
           1       0.59      0.34      0.43        76

    accuracy                           0.99      4824
   macro avg       0.79      0.67      0.71      4824
weighted avg       0.98      0.99      0.98      4824



Thought: do I need to compare train and test scores to make sure the model is not overfitting? Using CV helps with that. My worry is that if I keep looking at the test score, I will just end up overfitting to the test set.
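One way to gauge overfitting without touching the test set is to compare the training F1 with an out-of-fold F1 computed on the training data. A sketch on synthetic data (made-up `X_demo`, `y_demo`), reusing the `C=10, gamma=0.1` values that the search selected above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic, imbalanced and slightly noisy stand-in data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], flip_y=0.05, random_state=0
)

# F1 on the same data the model was fitted on.
model = SVC(C=10, gamma=0.1).fit(X_demo, y_demo)
train_f1 = f1_score(y_demo, model.predict(X_demo), zero_division=0)

# Each sample is predicted by a model that never saw it during fitting.
oof_pred = cross_val_predict(SVC(C=10, gamma=0.1), X_demo, y_demo, cv=5)
cv_f1 = f1_score(y_demo, oof_pred, zero_division=0)

print(f"train F1 = {train_f1:.2f}, out-of-fold F1 = {cv_f1:.2f}")
```

A large gap between the two numbers points to overfitting, and the test set can stay untouched until the final evaluation.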

**What if the grid search selects on F1 score instead of accuracy?**

In [166]:
estimators = [('SVC', SVC())]

pipe = Pipeline(estimators, memory = cachedir)

svc_params = {'SVC__C': [0.1, 1, 10],
             'SVC__gamma': [0.001, 0.01, 0.1]}

# Create grid search object
grid_search_object = GridSearchCV(estimator=pipe, param_grid = svc_params, cv = 5, verbose=2, n_jobs=-1, scoring='f1')

In [167]:
fitted_search = grid_search_object.fit(X_train, y_train_SSO)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


In [168]:
print(fitted_search.best_estimator_)

Pipeline(memory='/var/folders/d2/s5sb3p416xbgzjf589bgpp_w0000gn/T/tmp9cxkrvu2',
         steps=[('SVC', SVC(C=10, gamma=0.1))])


It results in the same model (on this occasion).

## Comparison with reduced segments (removing those that contain no CFs when training)

Since MAPS got much worse scores on SSO when doing Sentence Filtering, I expect to get much worse results here.

Steps:
- Do it for one row for one column
- Do it for one row for all columns for that target
- Do it for all rows for all columns for that target
- Do it for all rows for all columns for all targets < Make a function?
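The last step could be sketched as a small function. Everything here is hypothetical: the name `filter_segments`, the toy frame, and the `features_by_target` mapping, which stands in for whatever the annotation/feature lookup provides.

```python
import pandas as pd

def filter_segments(df, feature_cols):
    """Keep rows where at least one of the given crafted-feature columns is non-zero."""
    present = [c for c in feature_cols if c in df.columns]
    mask = (df[present] > 0).any(axis=1)
    return df.loc[mask]

# Toy data standing in for train_set (made up).
toy = pd.DataFrame({
    "segment_text": ["log in with facebook", "we use cookies", "third party platform login"],
    "SSO": [1, 0, 1],
    "third party platform": [0, 0, 1],
    "application authentication options": [1, 0, 0],
})

# Hypothetical mapping from each target to its crafted features.
features_by_target = {
    "SSO": ["third party platform", "application authentication options"],
}

filtered = {t: filter_segments(toy, cols) for t, cols in features_by_target.items()}
print(filtered["SSO"].index.tolist())
```

The surviving row index can then be used to select the matching rows of the sparse feature matrix and of the target column.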

1. Do it for one row for one column

Using target as "SSO"

Let's find the features

In [169]:
annotation_features = pd.read_pickle("annotation_features.pkl")
annotation_features.columns

Index(['annotation', 'features'], dtype='object')

In [172]:
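# Pull the list of crafted features for the SSO annotation (row label 29 in this frame)
SSO_features = annotation_features[ annotation_features['annotation'] == "SSO" ].at[29,'features']
SSO_features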

['login credentials from one of your accounts',
 'application authentication options',
 'receives your information from an SNS',
 'accessed on a third party platform or social network',
 'register using your User credentials to certain social media sites',
 'allow us to access and/or collect certain information from your Third Party Platform profile/account',
 'logging in to the Application using Third Party Social Network',
 'accessing the Services through a social network',
 'when you choose to connect with those services',
 'if you choose to register your App account via such social media providers',
 'third party platform']

In [184]:
(train_set['application authentication options'] == 1)

Unnamed: 0,application authentication options,application authentication options.1
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
10714,False,False
10715,False,False
10716,False,False
10717,False,False
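The two-column result suggests that the column name `application authentication options` appears twice in `train_set`. A sketch of how duplicated names could be found and dropped, on a made-up frame. Keeping only the first copy assumes the duplicates hold identical values, which has not been checked here.

```python
import pandas as pd

# Toy frame with a repeated column name, mimicking the two-column result.
toy = pd.DataFrame(
    [[0, 0, 1], [1, 1, 0]],
    columns=[
        "application authentication options",
        "application authentication options",
        "third party platform",
    ],
)

# Names that occur more than once.
dupe_names = toy.columns[toy.columns.duplicated()].tolist()
print(dupe_names)

# Keep the first occurrence of each column name.
deduped = toy.loc[:, ~toy.columns.duplicated()]
print(deduped.columns.tolist())
```

After this, `deduped['application authentication options'] == 1` returns a single boolean Series that can be used directly as a row mask.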


In [165]:
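# Rows where at least one SSO crafted feature (CF) is present; isin() also covers duplicated column names
sso_mask = (train_set.loc[:, train_set.columns.isin(SSO_features)] >= 1).any(axis=1)
new_x_train = X_train.tocsr()[sso_mask.to_numpy()]  # filtered feature matrix
new_y_train = y_train_SSO[sso_mask]  # matching SSO targets
print(new_x_train.shape)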

Note that the F1 scores in the table I would like to compare to were based on the Train & Validate sets. I'm still not sure whether they used the F1 score of the "practice occurs" class (1, not 0), but I'm pretty sure they did. So the remaining difference in performance could be due to less overfitting from the smaller train size? Or, because of poor pre-processing, my use of CFs is actually decreasing performance.

I could test both possibilities, but I think these are not my priorities right now.

FRIDAY

Main goals: 
- Work out how to employ Sentence Filtering for each class
- Work out how to generate scores for each class (only 22 classes to consider I think – will have to cut down Y)
- Evaluation
- Review which classifiers to focus on
- Try other models / improvements? Could drop bigrams for the classifiers where they found that this helped
- Go back to pre-processing, but focus on just the ones that will make the CFs match better.
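For the per-class scores, one option is to loop over the target columns and collect the F1 of the positive class for each. A sketch with made-up labels and predictions (these are not real results, and the three column names are just examples from the target list):

```python
import pandas as pd
from sklearn.metrics import f1_score

# Made-up labels and predictions for three targets (not real results).
y_true = pd.DataFrame({"SSO": [1, 0, 0, 1], "Contact": [0, 1, 1, 0], "Location": [0, 0, 1, 1]})
y_pred = pd.DataFrame({"SSO": [1, 0, 1, 1], "Contact": [0, 1, 0, 0], "Location": [0, 0, 1, 1]})

# F1 of the positive ("practice occurs") class, one value per target.
scores = pd.Series(
    {col: f1_score(y_true[col], y_pred[col], zero_division=0) for col in y_true.columns},
    name="f1_positive_class",
)
print(scores.round(2))
```

Cutting `Y` down to the classes of interest then amounts to selecting the relevant columns before running the loop.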

# Things to talk to someone about

- Question I asked Shifath re train/validate
- General discussion about scoring in CV Grid Search
- AUC for SVM even when not mentioned in MAPS