In [1]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

Story et al. (2019) "Natural Language Processing for Mobile App Privacy Compliance", available from https://usableprivacy.org/publications

### Story et al. did not build classifiers for all targets

In [2]:
segment_annots_df = pd.read_pickle("segment_annots_df.pkl")
segment_annots_df.columns

Index(['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'segment_text', 'annotations', 'sentences', 'SSO',
       'Facebook_SSO', '1st_party', '3rd_party', 'Contact',
       'Contact_Address_Book', 'Contact_City', 'Contact_E_Mail_Address',
       'Contact_Password', 'Contact_Phone_Number', 'Contact_Postal_Address',
       'Contact_ZIP', 'Demographic', 'Demographic_Age', 'Demographic_Gender',
       'Identifier', 'Identifier_Ad_ID', 'Identifier_Cookie_or_similar_Tech',
       'Identifier_Device_ID', 'Identifier_IMEI', 'Identifier_IMSI',
       'Identifier_IP_Address', 'Identifier_MAC', 'Identifier_Mobile_Carrier',
       'Identifier_SIM_Serial', 'Identifier_SSID_BSSID', 'Location',
       'Location_Bluetooth', 'Location_Cell_Tower', 'Location_GPS',
       'Location_IP_Address', 'Location_WiFi', 'PERFORMED', 'NOT_PERFORMED'],
      dtype='object')

I have 34 'targets' and could make 34 classifiers.

Story et al. made 4 rule-based classifiers for:
- Identifier
- Identifier IMSI
- Identifier SIM Serial
- Identifier SSID BSSID

That takes the number of ML classifiers down to 30. They also decided against ML classifiers for:
- Contact city
- Contact ZIP code
- Contact postal address
- Contact password
- Identifier ad ID
- Contact address book
- Location Bluetooth
- Identifier IP address
- Location IP address
- Demographic (general)
- Demographic age
- Demographic gender

That takes it down to 18 ML classifiers that I am trying to improve upon.

## Preprocessing

DID do:

1. normalize whitespace and punctuation, remove non-ASCII characters, and lowercase all policy text.

An optional preprocessing step of sentence filtering:<br>
Based on a grid search, in cases where it improves classifier performance, we remove a segment’s sentences from further processing if they do not contain keywords related to the classifier in question. For example, the Location classifier is not trained on sentences which only describe cookies.
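
A rough sketch of what keyword-based sentence filtering could look like. The keyword list and the naive sentence splitter are my own placeholders (the paper's actual keywords and tokenizer are not reproduced here), so treat this as an illustration of the idea only.

```python
import re

# Hypothetical keyword list for the Location classifier (an assumption, not from the paper).
LOCATION_KEYWORDS = ["location", "gps", "geo"]

def split_sentences(segment):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", segment) if s.strip()]

def filter_sentences(segment, keywords):
    """Keep only the sentences that mention at least one keyword (case-insensitive)."""
    kept = [s for s in split_sentences(segment) if any(k in s.lower() for k in keywords)]
    return " ".join(kept)

segment = "We use cookies to remember you. We collect your GPS location. Contact us anytime."
filtered = filter_sentences(segment, LOCATION_KEYWORDS)
print(filtered)
```

Whether filtering is switched on would then be one more option to try per classifier, as the excerpt describes.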

union of a TF-IDF vector and a vector of manually crafted features. Our TF-IDF vector is created using the TfidfVectorizer (scikit-learn developers 2016a) configured with English stopwords (stop_words='english'), unigrams and bigrams (ngram_range=(1, 2)), and binary term counts (binary=True).
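
A minimal sketch of how that feature union could be assembled. The three segments and the crafted feature strings are made-up stand-ins, and encoding each crafted feature as a binary substring indicator is my assumption; the excerpt does not spell out the encoding.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-normalized segments (stand-ins for segment_text).
segments = [
    "we collect your email address and contact information",
    "we use cookies and your device id for analytics",
    "your gps location is shared with third parties",
]
# Hypothetical crafted features (the real ones live in annotation_features.pkl).
crafted_features = ["email address", "device id", "location"]

# TF-IDF configured as described in the excerpt.
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), binary=True)
X_tfidf = tfidf.fit_transform(segments)

# Assumed encoding: 1 if the crafted feature occurs as a substring of the segment, else 0.
X_crafted = csr_matrix(
    [[int(feature in segment) for feature in crafted_features] for segment in segments]
)

# Union of the two vectors: stack the columns side by side.
X = hstack([X_tfidf, X_crafted]).tocsr()
print(X.shape)
```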

Did NOT do: 

Because stemming did not lead to performance improvements, we are omitting it.

### Task: confirm how to do each of the following:<br>
Normalize whitespace (using regex; code is easily available online).<br>
Normalize punctuation: check what punctuation is in the crafted features (CFs) and keep it; all other punctuation can be removed because it won't affect the feature columns (CFs or TF-IDF).
<br>
   - (What does this mean? Well, I should check what punctuation the CFs have. I should also check how TF-IDF interacts with punctuation. It might require punctuation to be removed, or it might have no problem with it.)
   - TfidfVectorizer does have an option for "character normalization" (the strip_accents parameter).
   - I think the only reason to keep punctuation would be if it prevented n-grams being built from the end of one sentence and the start of the next. TfidfVectorizer builds n-grams in the same way as CountVectorizer. So how does it do it?

*While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.*

So basically TF-IDF doesn't care about punctuation, so I can just focus on the crafted features. Let's confirm that that's how the TfidfVectorizer works.

Remove non-ASCII characters, which could cause issues with TF-IDF. Look up how to do this. Option 1: encode + decode. Option 2: keep only characters where ord(char) < 128. <br>

Lowercase all policy text. TfidfVectorizer can do this itself, but doing it up front is probably better for matching the crafted features.
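
A sketch of the normalization steps discussed above, showing both non-ASCII options. The function names and the default keep_punct value are my own placeholders; keep_punct should be set to whatever punctuation the crafted features turn out to contain.

```python
import re
import string

def drop_non_ascii_codec(text):
    """Option 1: encode to ASCII, ignoring characters that cannot be encoded."""
    return text.encode("ascii", errors="ignore").decode("ascii")

def drop_non_ascii_ord(text):
    """Option 2: keep only characters whose code point is below 128."""
    return "".join(ch for ch in text if ord(ch) < 128)

def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()

def normalize(text, keep_punct=".,-/()'"):
    """Normalize whitespace, drop non-ASCII, lowercase, and strip unwanted punctuation."""
    # Whitespace first: \s matches the non-breaking space \xa0, so it becomes
    # a plain space instead of vanishing in the non-ASCII step.
    text = collapse_whitespace(text)
    text = drop_non_ascii_ord(text).lower()
    unwanted = "".join(ch for ch in string.punctuation if ch not in keep_punct)
    text = text.translate(str.maketrans("", "", unwanted))
    return collapse_whitespace(text)

raw = "Your  E-mail\xa0Address\n(e.g., \u201cname\u201d) is #collected!"
print(normalize(raw))
```

The order matters here: if non-ASCII characters were removed before whitespace is collapsed, words joined by a non-breaking space would be glued together.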

Now I need to check the Crafted Features to see what punctuation they contain and whether they are all lowercase (probably)

In [60]:
# Get a list of all the crafted features
annotation_features = pd.read_pickle('annotation_features.pkl')
list_all_crafted_features = [feature for row in annotation_features['features'] for feature in row]
len(list_all_crafted_features) # verify – should be 579

579

In [61]:
list_all_crafted_features

['contact info',
 'contact details',
 'contact data',
 'e.g., your name',
 'contact you',
 'your contact',
 'identify, contact',
 'identifying information',
 'your name, address, and e-mail address',
 'including e-mail',
 'phone book',
 'phonebook',
 'contact information in your device',
 'address book',
 'contacts',
 'contact names',
 'contact list',
 'contacts list',
 'phone contacts',
 'contact\xa0entries',
 'import contacts',
 'friend list',
 'friends list',
 'city',
 'hometown',
 'e-mail address',
 'email address',
 'e-mail and mailing address',
 'email and mailing address',
 'e-mail or mailing address',
 'email or mailing address',
 'password',
 'authentication process',
 'credential',
 'authentication token',
 'phone',
 'number call',
 'mailing address',
 'street address',
 ' address,',
 ' address ',
 'postal address',
 'billing address',
 'shipping address',
 'home or work address',
 'other address',
 'physical address',
 'your address',
 'home address',
 'residential address',

Checking punctuation:

In [65]:
%%time
import string

# Collect every distinct non-letter character used in the crafted features,
# in order of first appearance.
list_of_punc = []
for ft in list_all_crafted_features:
    for char in ft:
        if char not in string.ascii_letters and char not in list_of_punc:
            list_of_punc.append(char)

CPU times: user 635 µs, sys: 1e+03 ns, total: 636 µs
Wall time: 640 µs


In [66]:
list_of_punc

[' ', '.', ',', '-', '\xa0', '/', '(', ')', "'"]

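So the crafted features do contain some punctuation that we'll need to keep when pre-processing the segments. Note that '\xa0' is a non-breaking space, which is non-ASCII, so a feature such as 'contact\xa0entries' cannot match segment text once non-ASCII characters have been removed. Now to check upper / lower case as well.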

In [67]:
%%time
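import string

# Collect every distinct uppercase letter used in the crafted features,
# in order of first appearance.
list_of_upper = []
for ft in list_all_crafted_features:
    for char in ft:
        if char in string.ascii_uppercase and char not in list_of_upper:
            list_of_upper.append(char)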

CPU times: user 1.26 ms, sys: 1e+03 ns, total: 1.26 ms
Wall time: 1.27 ms


In [68]:
list_of_upper

['S', 'N', 'U', 'T', 'P', 'A', 'I']

Interestingly, the crafted features contain some capital letters (though only a handful of distinct ones). It would be interesting to see which features these come from. It also matters for populating the CF columns: a feature containing capitals cannot match lowercased segment text unless the feature is lowercased too. I've already populated all the CF columns, so I'm not sure whether that's an issue.
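
A quick way to see which features the capitals come from, and why it matters for matching. The sample list is a made-up stand-in; in this notebook the same comprehension would run over list_all_crafted_features.

```python
# Made-up stand-in for list_all_crafted_features.
sample_features = ["contact info", "IP address", "SIM serial", "e-mail address"]

# Features that contain at least one uppercase character.
features_with_upper = [ft for ft in sample_features if any(ch.isupper() for ch in ft)]
print(features_with_upper)

# A mixed-case feature never matches lowercased text unless it is lowercased too.
segment = "We log your IP address."
matches_raw = "IP address" in segment.lower()
matches_lowered = "IP address".lower() in segment.lower()
print(matches_raw, matches_lowered)
```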

In [15]:
segment_annots_df['segment_text'].tolist()

['PRIVACY POLICY This privacy policy (hereafter referred to as the "Privacy Policy") is applicable to our websites, apps and to all games and other activities(hereafter referred to as the "our products") that are offered by us on or through our products. Tiny Piece, having its registered office at Ajeltake Road, Ajeltake Island, Majuro, Republic of the Marshall Island MH 96960 (hereafter referred to as "6677g"). 6677g may use affiliates\' or reputable third parties\' services for the processing of personal data collected on or through our products. By using or accessing our products, you are accepting the practices described in this Privacy Policy.',
 '1. ABOUT OUR PRODUCTS 1.1 Our products offer a diverse, current, and exciting mix of games created by 6677g, as well as games created by independent developers and 6677g partners. Players can access our products to play games without registering; however, they may choose to register to create a public or semi-public profile and to save g

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

Demonstration that tfidf is unaffected by punctuation:

In [54]:
# "\\here" avoids the invalid "\h" escape sequence; the resulting string is unchanged
words_with_punctuation = ["word####sock )(*&^:%$) dog. \n But)(*&;''^%also[][]\\here"]
tfidf = TfidfVectorizer(ngram_range=(1,2))
tfidf.fit_transform(words_with_punctuation)
print(words_with_punctuation)

["word####sock )(*&^:%$) dog. \n But)(*&;''^%also[][]\\here"]


In [55]:
tfidf.get_feature_names_out()

array(['also', 'also here', 'but', 'but also', 'dog', 'dog but', 'here',
       'sock', 'sock dog', 'word', 'word sock'], dtype=object)

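We can see that every unigram and every bigram of adjacent words is still there, even when the words are separated in the original text by full stops, a newline and a range of other punctuation. Note that 'dog but' spans a sentence boundary, so punctuation does not stop n-grams from being built across sentences.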
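
The reason is the tokenizer: by default TfidfVectorizer lowercases the text and extracts tokens with the regular expression (?u)\b\w\w+\b, so anything that is not a word character acts purely as a separator (and single-character tokens are dropped). The sketch below mimics that with plain re; it is only an approximation of the vectorizer's analyzer, not its actual code.

```python
import re

# Default token_pattern documented for TfidfVectorizer / CountVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "word####sock )(*&^:%$) dog. \n But)(*&;''^%also[][]\\here"

# Lowercase, then pull out runs of two or more word characters.
tokens = re.findall(TOKEN_PATTERN, text.lower())

# Bigrams are built from adjacent tokens, regardless of what separated them.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

vocabulary = sorted(set(tokens + bigrams))
print(vocabulary)
```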

## Modelling:

We train those with a linear kernel (kernel='linear'), balanced class weights (class_weight='balanced'), and a grid search with five-fold cross-validation over the penalty (C=[0.1, 1, 10]) and gamma (gamma=[0.001, 0.01, 0.1]) parameters.
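
A sketch of that training setup with scikit-learn's SVC and GridSearchCV. The data is synthetic, and scoring by F1 is my assumption (the excerpts report F1 but do not say what the grid search optimised). One thing worth noting: scikit-learn documents gamma as the kernel coefficient for the 'rbf', 'poly' and 'sigmoid' kernels, so with kernel='linear' the gamma grid should not change the fitted models.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and one binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Parameter grid as quoted in the excerpt.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}

search = GridSearchCV(
    SVC(kernel="linear", class_weight="balanced"),
    param_grid,
    cv=5,          # five-fold cross-validation
    scoring="f1",  # assumption: optimise F1
)
search.fit(X, y)
print(search.best_params_)
```

One such search would be run per target column, giving one fitted classifier per target.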

We create rule-based classifiers for four data types (Identifier, Identifier IMSI, Identifier SIM Serial, and Identifier SSID BSSID) due to the limited amount of data and their superior performance. Our rule-based classifiers identify the presence or absence of a data type based on indicative text strings.
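
A rule-based classifier of this kind could be as simple as a substring check. The indicative strings below are guesses of mine for illustration only; the paper's actual rule lists are not reproduced in these notes.

```python
# Hypothetical indicative strings per target (assumptions, not the paper's rules).
RULES = {
    "Identifier": ["unique identifier", "device identifier"],
    "Identifier_IMSI": ["imsi", "international mobile subscriber identity"],
    "Identifier_SIM_Serial": ["sim serial", "iccid"],
    "Identifier_SSID_BSSID": ["ssid", "bssid"],
}

def rule_based_predict(segment, indicative_strings):
    """Return 1 if any indicative string occurs in the segment, else 0."""
    text = segment.lower()
    return int(any(s in text for s in indicative_strings))

segment = "We may read your device's IMSI and the Wi-Fi SSID."
predictions = {target: rule_based_predict(segment, strings) for target, strings in RULES.items()}
print(predictions)
```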

a baseline configuration (Baseline), in which the TF-IDF vectors only include unigrams. For the baseline, bigrams (Bigrams) and manually crafted features (C.F.) are not included, and keyword-based sentence filtering (S.F.) is not performed.

Our final configuration (Final) includes bigrams as well as crafted features; sentence filtering is enabled on a per-classifier basis using a grid search.

Table 1 shows the effects of our features and preprocessing steps on the F1 scores of our non-rule-based classifiers. The performance is calculated using our training and validation sets. We made sentence filtering an optional part of preprocessing because of the large detrimental effect it has on some of our classifiers, as highlighted in Table 2. In general, our results suggest that the chosen feature and preprocessing steps improve classifier performance. However, ideally they should be chosen on a per-classifier basis to avoid any negative performance impact.

Which targets did they not keep?

*In preliminary tests we also considered city, ZIP code, postal
address, username, password, ad ID, address book, Bluetooth, IP
address (identifier and location), age, and gender practices. However, we ultimately decided against further pursuing those as we
had insufficient data, unreliable annotations, or difficulty identifying a corresponding API for the app analysis.*

## Evaluation

Since our definition of potential compliance issues does not depend on the negative modality classifier, we do not include it in the table.

"We focus on segments instead of entire policies to make effective use of the annotated data and to identify the specific policy text locations that describe a certain practice." Story et al. (2019) pg 3



"First, approaching the classification task at the level of segments,
as suggested by prior work (Wilson et al. 2016; Liu et al.
2018), can pose difficulties for our subproblem classifiers.
For example, if a segment describes a 1stParty performing the Location practice, and a 3rdParty performing
Contact, our classifiers cannot distinguish which party
should be associated with which practice. Thus, performing classifications at the level of sentences may yield performance improvements. 

Second, the variety of technical
language in privacy policies poses challenges. For example, we observed a false positive when “location” was used
in the context of “co-location facility”, and a false negative
when “clear gifs” was used to refer to web beacons. Such
errors might be prevented by training on more data or using
domain-specific word embeddings (Kumar et al. 2019). 

Finally, a more sophisticated semantic representation might be
necessary in certain cases. For example, we observed misclassification of a sentence which said that although the first
party does not perform a practice, third parties do perform
the practice."

## Some reasons my model may be different to theirs:

- Different version of scikit-learn for TF-IDF. It could behave differently in ways I haven't accounted for.
- Different pre-processing. Some of the steps listed were vague. 'Normalise punctuation' could be done in a variety of ways.