Dataset and related content taken from Natural Language Processing for Mobile App Privacy Compliance. Peter Story, Sebastian Zimmeck, Abhilasha Ravichander, Daniel Smullen, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. AAAI Spring Symposium on Privacy Enhancing AI and Language Technologies (PAL 2019), Mar 2019.

Available here – APP350: https://usableprivacy.org/data

EDA should be in the context of my data.  I should state what I expect BEFORE checking for it in the data.

Also at the top I can include links, context, goals

In [3]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

Put the first policy into a df

## Populating top-level df

In [63]:
all_policies_df = priv_pol_funcs.load_all_policies()
all_policies_df.head(5)

Unnamed: 0,policy_id,policy_name,policy_type,contains_synthetic,segments
0,1,6677G,TEST,False,"[{'segment_id': 0, 'segment_text': 'PRIVACY PO..."
1,2,AIFactory,TEST,False,"[{'segment_id': 0, 'segment_text': 'AI Factory..."
2,3,AppliqatoSoftware,TEST,False,"[{'segment_id': 0, 'segment_text': 'Automatic ..."
3,4,BandaiNamco,TEST,False,"[{'segment_id': 0, 'segment_text': 'MOBILE APP..."
4,5,BarcodeScanner,TEST,False,"[{'segment_id': 0, 'segment_text': 'Skip to co..."


In [64]:
all_policies_df = priv_pol_funcs.add_metadata_to_policy_df(all_policies_df)
all_policies_df.head(5)

Unnamed: 0,policy_id,policy_name,policy_type,contains_synthetic,segments,num_segments,num_annotated_segs,total_characters
0,1,6677G,TEST,False,"[{'segment_id': 0, 'segment_text': 'PRIVACY PO...",36,11,12703
1,2,AIFactory,TEST,False,"[{'segment_id': 0, 'segment_text': 'AI Factory...",14,5,5995
2,3,AppliqatoSoftware,TEST,False,"[{'segment_id': 0, 'segment_text': 'Automatic ...",8,1,2450
3,4,BandaiNamco,TEST,False,"[{'segment_id': 0, 'segment_text': 'MOBILE APP...",57,14,32323
4,5,BarcodeScanner,TEST,False,"[{'segment_id': 0, 'segment_text': 'Skip to co...",32,3,6667


# Making segments

Now to make a new dataframe where each row represents a paragraph (segment).

First I will get this to work for a single policy. Then I will loop through all the policies to apply the same manipulation.

In [65]:
all_segments_df = priv_pol_funcs.generate_segment_df(all_policies_df)
all_segments_df.head()

Unnamed: 0_level_0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences
segment_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[]
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[]
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W..."
3,1,TEST,False,3,"2.2 In addition, we store certain information ...",[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': '6677g may use cookies, web..."
4,1,TEST,False,4,(c) to remember your preferences and registrat...,[],[]


In [66]:
all_segments_df.shape

(15543, 7)

In [67]:
all_segments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15543 entries, 0 to 15542
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   source_policy_number  15543 non-null  int64 
 1   policy_type           15543 non-null  object
 2   contains_synthetic    15543 non-null  bool  
 3   policy_segment_id     15543 non-null  int64 
 4   segment_text          15543 non-null  object
 5   annotations           15543 non-null  object
 6   sentences             15543 non-null  object
dtypes: bool(1), int64(2), object(4)
memory usage: 743.9+ KB


### Export all segments

To make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.

For this dataframe, `DataFrame.to_pickle()` is a better fit than `DataFrame.to_csv()` because:
- The stored file size is smaller
- List objects in the dataframe are kept as lists, whereas a CSV round trip converts them to strings.
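
The second point can be checked on a toy dataframe. The sketch below is illustrative only: the frame, its values and the temporary file names are made up, and it simply round-trips a list-valued column through both formats.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, similar in spirit to `annotations`
toy_df = pd.DataFrame({
    "segment_text": ["first segment", "second segment"],
    "annotations": [[], [{"practice": "SSO", "modality": "PERFORMED"}]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    pkl_path = os.path.join(tmp_dir, "toy.pkl")

    # Write the same frame in both formats
    toy_df.to_csv(csv_path, index=False)
    toy_df.to_pickle(pkl_path)

    # Read both back while the temporary files still exist
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

# Compare the Python type of the same cell after each round trip
print(type(from_csv.loc[1, "annotations"]))
print(type(from_pickle.loc[1, "annotations"]))
```

After a CSV round trip the list would have to be parsed back (for example with `ast.literal_eval`) before it could be looped over, which the pickle route avoids.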

In [68]:
all_segments_df.to_pickle('all_segments_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [74]:
confirm_save_1 = pd.read_pickle('all_segments_df.pkl')
print(all_segments_df.shape == confirm_save_1.shape)
print(confirm_save_1.equals(all_segments_df))

True
True


Now the code below can all be run using the dataframe produced from the pickle file, instead of having to wait for the above code to run.

# Next step of extraction

I probably want to separate it out to the sentence level, as that is the max granularity of the annotations and some paragraphs are just one sentence anyway.

## Extracting list of practices

Employing my function to get the list of practices from the APP-350 documentation

In [10]:
list_of_practice_groups = priv_pol_funcs.get_list_of_practice_groups()

29 different groups of practices returned, containing 58 individual practices.


In [11]:
# Flatten the list of lists with a list comprehension
list_of_practices = [practice for practice_group in list_of_practice_groups for practice in practice_group]
print(f"There are {len(list_of_practices)} different practices.")

There are 58 different practices.


In [12]:
list_of_practices[:5]

['Contact_1stParty',
 'Contact_3rdParty',
 'Contact_Address_Book_1stParty',
 'Contact_Address_Book_3rdParty',
 'Contact_City_1stParty']

## Applying all annotations to columns in segment dataframe

In [92]:
# Can be used to read in the dataframe without running the above code
# all_segments_df = pd.read_pickle('all_segments_df.pkl')

In [80]:
segment_annotations = priv_pol_funcs.add_empty_annotation_columns(all_segments_df, list_of_practices)

The shape of the returned dataframe is (15543, 65)


In [81]:
# segment_annotations.head(2) # Verify columns added

In [82]:
# populate the columns with the annotations
for index in range(len(segment_annotations)):
    practices_dictionaries = segment_annotations.loc[index, 'annotations']
    for each_practice in practices_dictionaries:
        segment_annotations.loc[index, each_practice['practice']] += 1
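
The row-by-row loop above could probably be replaced by counting the practices of each segment in one pass. This is only a sketch on made-up data: `toy_segments` and `toy_practices` are hypothetical stand-ins for `all_segments_df` and `list_of_practices`, and the real helper `add_empty_annotation_columns` is not used here.

```python
from collections import Counter

import pandas as pd

# Hypothetical miniature of the segment dataframe
toy_segments = pd.DataFrame({
    "annotations": [
        [],
        [{"practice": "Contact_1stParty", "modality": "PERFORMED"},
         {"practice": "Contact_1stParty", "modality": "NOT_PERFORMED"}],
        [{"practice": "Location_GPS_3rdParty", "modality": "PERFORMED"}],
    ]
})
toy_practices = ["Contact_1stParty", "Contact_3rdParty", "Location_GPS_3rdParty"]

# One Counter per segment: practice name -> number of annotations
per_segment_counts = [
    Counter(annotation["practice"] for annotation in annotations)
    for annotations in toy_segments["annotations"]
]

# Turn the Counters into columns; practices that never occur become 0
annotation_counts = (
    pd.DataFrame(per_segment_counts, index=toy_segments.index)
    .reindex(columns=toy_practices)
    .fillna(0)
    .astype(int)
)
toy_annotated = toy_segments.join(annotation_counts)
print(toy_annotated[toy_practices])
```

The counts are the same kind of per-segment sums the loop produces, so a segment annotated twice with the same practice still shows a 2.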

In [83]:
# Verify final row annotated
segment_annotations.iloc[15542,7:].sum() # columns from index 7 onwards are the annotation columns

4

In [84]:
segment_annotations.iloc[:,7:].sum().sum() # Total paragraph annotations

10215

Further confirmation that all the annotations are added:

In [85]:
segment_annotations.head()

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,Contact_1stParty,Contact_3rdParty,Contact_Address_Book_1stParty,...,Location_Bluetooth_1stParty,Location_Bluetooth_3rdParty,Location_Cell_Tower_1stParty,Location_Cell_Tower_3rdParty,Location_GPS_1stParty,Location_GPS_3rdParty,Location_IP_Address_1stParty,Location_IP_Address_3rdParty,Location_WiFi_1stParty,Location_WiFi_3rdParty
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W...",0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,TEST,False,3,"2.2 In addition, we store certain information ...",[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': '6677g may use cookies, web...",0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,TEST,False,4,(c) to remember your preferences and registrat...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Investigating rows with multiple of same annotation

It will be interesting to know whether there are any paragraphs with more than one of the same annotation, and why.

In [86]:
# Find the annotation columns whose maximum count in any single paragraph is above 1
max_of_each_annotation_per_paragraph = segment_annotations.iloc[:,7:].max()
max_of_each_annotation_per_paragraph.loc[max_of_each_annotation_per_paragraph.values > 1]

Contact_E_Mail_Address_1stParty    2
Location_1stParty                  2
dtype: int64

Both annotations with a maximum of 2 were investigated:
- Contact_E_Mail_Address_1stParty: applied twice in only one paragraph:
    - paragraph 7194: the paragraph states that the practice is both performed AND not performed.
- Location_1stParty: applied twice in only one paragraph:
    - paragraph 11150: a very long paragraph that carries the annotation as both performed and not performed. (I note that the annotated sentences seem questionable.)

In conclusion, only two of the 15543 segments carry a duplicate annotation.

## Investigating annotations that occur very rarely

In [87]:
annotation_segment_frequencies = segment_annotations.iloc[:,7:].sum() # Number of paragraphs with each annotation

In [88]:
annotation_segment_frequencies.sum()

10215

In [89]:
annotation_segment_frequencies.loc[annotation_segment_frequencies.values < 10]

Contact_City_3rdParty             8
Identifier_IMSI_3rdParty          3
Identifier_SIM_Serial_3rdParty    3
Identifier_SSID_BSSID_3rdParty    2
dtype: int64

Some examples:
- Contact_City_3rdParty: Tends to describe a level of abstraction/anonymisation of location data
- Identifier_SSID_BSSID_3rdParty: A couple of apps use an advertising service that collects internet network info

## Export segment_annotations

To make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.  This allows the code below to be run without waiting for the code above.

In [90]:
segment_annotations.to_pickle('segment_annotations.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [93]:
confirm_save_2 = pd.read_pickle('segment_annotations.pkl')
print(segment_annotations.shape == confirm_save_2.shape)
print(confirm_save_2.equals(segment_annotations))

True
True


In [7]:
confirm_save_2 = pd.read_pickle('segment_annotations.pkl')

In [8]:
confirm_save_2.shape

(15543, 65)

---

# Reduce Segment Annotations to [1st/3rd party] / [Practice]

In [94]:
# Can be used to read in the dataframe without running the above code
# segment_annotations = pd.read_pickle('segment_annotations.pkl')

In [95]:
segment_annotations.head(3)

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,Contact_1stParty,Contact_3rdParty,Contact_Address_Book_1stParty,...,Location_Bluetooth_1stParty,Location_Bluetooth_3rdParty,Location_Cell_Tower_1stParty,Location_Cell_Tower_3rdParty,Location_GPS_1stParty,Location_GPS_3rdParty,Location_IP_Address_1stParty,Location_IP_Address_3rdParty,Location_WiFi_1stParty,Location_WiFi_3rdParty
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W...",0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [96]:
segment_annotations["1st_party"] = 0
segment_annotations["3rd_party"] = 0

In [97]:
_1st_party_practices = [column_name for column_name in segment_annotations.columns if "1stParty" in column_name]
_3rd_party_practices = [column_name for column_name in segment_annotations.columns if "3rdParty" in column_name]

I could use `_3rd_party_practices.extend(['SSO', 'Facebook_SSO'])` to add these to the 3rd party practices. I am not going to use them to train the 3rd party classifier, so I leave them out for now, but including them is a variable I should try changing when training the classifier.

In [98]:
for annotation_column in _1st_party_practices:
    for index in range(len(segment_annotations.index)):
        if segment_annotations.at[index, annotation_column] == 1:
            segment_annotations.at[index, "1st_party"] += 1

In [99]:
for annotation_column in _3rd_party_practices:
    for index in range(len(segment_annotations.index)):
        if segment_annotations.at[index, annotation_column] == 1:
            segment_annotations.at[index, "3rd_party"] += 1
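
The two loops above can probably be written as column-wise sums. The sketch below uses a made-up three-row frame; `toy` and `party_counts` are hypothetical names. It also shows how the `== 1` test differs from a `> 0` test when a cell holds a count of 2.

```python
import pandas as pd

# Hypothetical annotation counts; the last row has a duplicate annotation (2)
toy = pd.DataFrame({
    "Contact_1stParty": [0, 1, 2],
    "Location_1stParty": [0, 1, 0],
    "Contact_3rdParty": [1, 0, 0],
})

first_party_cols = [c for c in toy.columns if "1stParty" in c]
third_party_cols = [c for c in toy.columns if "3rdParty" in c]

party_counts = pd.DataFrame({
    # Same rule as the loops: count columns whose value is exactly 1
    "1st_party": toy[first_party_cols].eq(1).sum(axis=1),
    "3rd_party": toy[third_party_cols].eq(1).sum(axis=1),
    # Alternative rule: count columns with at least one annotation
    "1st_party_any": toy[first_party_cols].gt(0).sum(axis=1),
})
print(party_counts)
```

In the last toy row the exact-match rule gives 0 while the at-least-one rule gives 1, so the choice matters for segments with duplicate annotations.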

In [100]:
print(segment_annotations["1st_party"].sum())
print(segment_annotations["3rd_party"].sum())

7536
2202


In [101]:
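# Verify: the two party totals sum to 9738, not 10215. The remainder (477) is made up of
# the SSO and Facebook_SSO annotations (no party suffix) and the two cells with a count of 2,
# which the `== 1` test above does not count.
annotation_segment_frequencies.sum()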

10215

### Review of steps to complete

- Add target columns for each practice to all segments
- Add target columns for each modality to all segments
- Save this as the basic dataframe (Segments; targets)
- Add crafted feature columns for each target
- Save this as another dataframe
- Split away the Test data
- Now it's ready for grid search.
- Now for each target, can test a classifier on vectorized version, with CV grid search including dropping the created feature columns
- Compare results to MAPS
- Then think about how to exclude sentences (apply sentence filtering) for each classifier

# Add target columns for each practice to all segments

In [102]:
segment_annotations.head(3)

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,Contact_1stParty,Contact_3rdParty,Contact_Address_Book_1stParty,...,Location_Cell_Tower_1stParty,Location_Cell_Tower_3rdParty,Location_GPS_1stParty,Location_GPS_3rdParty,Location_IP_Address_1stParty,Location_IP_Address_3rdParty,Location_WiFi_1stParty,Location_WiFi_3rdParty,1st_party,3rd_party
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W...",0,0,0,...,0,0,0,0,0,0,0,0,2,0


In [103]:
# Do it for one column, then do it for all columns
practice_columns_df = segment_annotations.copy()

### Column for each practice

In [13]:
# Strip the party suffix from the practices (which come from features.yml)
the_30_practices = [practice.removesuffix("_1stParty").removesuffix("_3rdParty") for practice in list_of_practices]
the_30_practices = list(dict.fromkeys(the_30_practices)) # remove duplicates, keeping the original order

In [14]:
# Don't need to add "SSO" and "Facebook_SSO" because they were already added.
the_28_practices = [practice for practice in the_30_practices if practice not in ["SSO", "Facebook_SSO"] ]

In [106]:
practice_columns_df = practice_columns_df.reindex(
    columns = practice_columns_df.columns.tolist() + the_28_practices) # add the practices to the dataframe

#### Practice-pairs for each practice

In [108]:
practice_pair_dict = dict.fromkeys(the_30_practices)
for practice in practice_pair_dict.keys():
    practice_pair_dict[practice] = [sub_practice for sub_practice in list_of_practices
                                    if sub_practice.removesuffix("_1stParty").removesuffix("_3rdParty") == practice]

Populate each practice column:

In [109]:
%%time

for practice_column in the_28_practices: # only 28 because SSO and Facebook_SSO were populated earlier
    for index in range(len(practice_columns_df.index)):
        practice_columns_df.at[index, practice_column] = practice_columns_df.loc[index, practice_pair_dict[practice_column] ].sum()
    print(f"Finished processing {practice_column}", end="\r")

Finished processing Contact
Finished processing Contact_Address_Book
Finished processing Contact_City
Finished processing Contact_E_Mail_Address
Finished processing Contact_Password
Finished processing Contact_Phone_Number
Finished processing Contact_Postal_Address
Finished processing Contact_ZIP
Finished processing Demographic
Finished processing Demographic_Age
Finished processing Demographic_Gender
Finished processing Identifier
Finished processing Identifier_Ad_ID
Finished processing Identifier_Cookie_or_similar_Tech
Finished processing Identifier_Device_ID
Finished processing Identifier_IMEI
Finished processing Identifier_IMSI
Finished processing Identifier_IP_Address
Finished processing Identifier_MAC
Finished processing Identifier_Mobile_Carrier
Finished processing Identifier_SIM_Serial
Finished processing Identifier_SSID_BSSID
Finished processing Location
Finished processing Location_Bluetooth
Finished processing Location_Cell_Tower
Finished processing Location_GPS
Finished pro
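
The cell-by-cell loop above could likely be replaced by one row-wise sum per practice. A minimal sketch on made-up data, where `toy` stands in for `practice_columns_df` and `pairs` plays the role of `practice_pair_dict` (requires Python 3.9+ for `str.removesuffix`, as the notebook already does):

```python
import pandas as pd

# Hypothetical party-specific annotation counts for two segments
toy = pd.DataFrame({
    "Contact_1stParty": [1, 0],
    "Contact_3rdParty": [1, 0],
    "Location_GPS_1stParty": [0, 2],
    "Location_GPS_3rdParty": [0, 0],
})

# Group the party-specific columns under their shared practice name
pairs = {}
for column in toy.columns:
    base = column.removesuffix("_1stParty").removesuffix("_3rdParty")
    pairs.setdefault(base, []).append(column)

# One vectorised sum per practice instead of one .at call per cell
practice_totals = pd.DataFrame(
    {practice: toy[columns].sum(axis=1) for practice, columns in pairs.items()}
)
print(practice_totals)
```

Because the sums stay integers here, this variant would also avoid the float columns that `reindex` introduces through its NaN placeholders.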

In [110]:
practice_columns_df["Location_Cell_Tower"].sum() # Verify

166.0

In [111]:
annotation_segment_frequencies.sum() # Verify

10215

In [112]:
_30_practices_filter = practice_columns_df[ the_30_practices ] > 0

In [113]:
practice_columns_df[the_30_practices]

Unnamed: 0,Contact,Contact_Address_Book,Contact_City,Contact_E_Mail_Address,Contact_Password,Contact_Phone_Number,Contact_Postal_Address,Contact_ZIP,Demographic,Demographic_Age,...,Identifier_MAC,Identifier_Mobile_Carrier,Identifier_SIM_Serial,Identifier_SSID_BSSID,Location,Location_Bluetooth,Location_Cell_Tower,Location_GPS,Location_IP_Address,Location_WiFi
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15539,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15540,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
practice_columns_df[the_30_practices][(_30_practices_filter)].sum().sum() # Verify

10215.0

It matches! Success!

### Export practice_columns_df

As before, to make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.  This allows the code below to be run without waiting for the code above.

In [115]:
practice_columns_df.to_pickle('practice_columns_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [116]:
confirm_save_3 = pd.read_pickle('practice_columns_df.pkl')
print(practice_columns_df.shape == confirm_save_3.shape)
print(confirm_save_3.equals(practice_columns_df))

True
True


## Add target columns for each Modality to all segments

In [28]:
# Can be used to read in the dataframe without running the above code
# practice_columns_df = pd.read_pickle('practice_columns_df.pkl')

In [29]:
modality_columns_df = practice_columns_df.copy()

In [30]:
# Instantiate modality columns

In [31]:
modality_columns_df["PERFORMED"] = 0
modality_columns_df["NOT_PERFORMED"] = 0

In [32]:
# populate the modality columns with the annotations
for index in range(len(modality_columns_df)):
    practices_dictionaries = modality_columns_df.at[index, 'annotations']
    for each_practice in practices_dictionaries:
        modality_columns_df.at[index, each_practice['modality']] += 1

In [33]:
modality_columns_df["PERFORMED"].sum()

8205

In [34]:
modality_columns_df["NOT_PERFORMED"].sum()

2010

It adds up! Success!

In [35]:
modality_columns_df.columns

Index(['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'segment_text', 'annotations', 'sentences',
       'Contact_1stParty', 'Contact_3rdParty', 'Contact_Address_Book_1stParty',
       'Contact_Address_Book_3rdParty', 'Contact_City_1stParty',
       'Contact_City_3rdParty', 'Contact_E_Mail_Address_1stParty',
       'Contact_E_Mail_Address_3rdParty', 'Contact_Password_1stParty',
       'Contact_Password_3rdParty', 'Contact_Phone_Number_1stParty',
       'Contact_Phone_Number_3rdParty', 'Contact_Postal_Address_1stParty',
       'Contact_Postal_Address_3rdParty', 'Contact_ZIP_1stParty',
       'Contact_ZIP_3rdParty', 'Demographic_1stParty', 'Demographic_3rdParty',
       'Demographic_Age_1stParty', 'Demographic_Age_3rdParty',
       'Demographic_Gender_1stParty', 'Demographic_Gender_3rdParty', 'SSO',
       'Facebook_SSO', 'Identifier_1stParty', 'Identifier_3rdParty',
       'Identifier_Ad_ID_1stParty', 'Identifier_Ad_ID_3rdParty',
       'Identif

Since we now have a dataframe at the segment level with columns for each combination of annotations and columns for the specific annotations of practice, party and modality, I will name this `segment_all_targets_df`.

In [36]:
segment_all_targets_df = modality_columns_df.copy()

**Next:**

I will now remove some columns to create a new dataframe where the only target columns correspond to the specific annotations for a specific practice, the party and modality.  I will also change the target columns to be binary instead of a sum.

## Remove columns

In [37]:
segment_annots_df = segment_all_targets_df.copy()

The `list_of_practices` has all 58 specific annotations (the *practice* and whether it's *1st or 3rd party*), but two in this list ("SSO" and "Facebook_SSO") are 3rd party by default. We need to remove all the specific annotations from the current dataframe, except for those two, since they are already in the correct form.

In [42]:
the_56_specific_practices = [practice for practice in list_of_practices if practice not in ["SSO", "Facebook_SSO"] ]

In [43]:
segment_annots_df = segment_annots_df.drop(columns = the_56_specific_practices)

In [44]:
segment_annots_df.shape

(15543, 41)

In [51]:
segment_annots_df.columns

Index(['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'segment_text', 'annotations', 'sentences', 'SSO',
       'Facebook_SSO', '1st_party', '3rd_party', 'Contact',
       'Contact_Address_Book', 'Contact_City', 'Contact_E_Mail_Address',
       'Contact_Password', 'Contact_Phone_Number', 'Contact_Postal_Address',
       'Contact_ZIP', 'Demographic', 'Demographic_Age', 'Demographic_Gender',
       'Identifier', 'Identifier_Ad_ID', 'Identifier_Cookie_or_similar_Tech',
       'Identifier_Device_ID', 'Identifier_IMEI', 'Identifier_IMSI',
       'Identifier_IP_Address', 'Identifier_MAC', 'Identifier_Mobile_Carrier',
       'Identifier_SIM_Serial', 'Identifier_SSID_BSSID', 'Location',
       'Location_Bluetooth', 'Location_Cell_Tower', 'Location_GPS',
       'Location_IP_Address', 'Location_WiFi', 'PERFORMED', 'NOT_PERFORMED'],
      dtype='object')

## Convert to binary

In [90]:
# number of cells greater than 0 should not change
print(f"There are {(segment_annots_df.loc[:,'SSO':] == 0).sum().sum()} cells with 0")
print(f"There are {(segment_annots_df.loc[:,'SSO':] > 0).sum().sum()} cells above 0")

There are 510617 cells with 0
There are 17845 cells above 0


In [67]:
segment_annots_df.loc[:,'SSO':].shape

(15543, 34)

In [92]:
%%time
# For each column from SSO onwards, go into every cell, and if it's above 1, set it to 1.

for column in segment_annots_df.loc[:,'SSO':].columns:
    for i in range(len(segment_annots_df[column])):

        if segment_annots_df.at[i, column] > 1:
            segment_annots_df.at[i, column] = 1

CPU times: user 2.01 s, sys: 12.4 ms, total: 2.02 s
Wall time: 2.04 s
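
A shorter way to cap the counts at 1 is `DataFrame.clip`. This is a sketch on a made-up frame (`toy` is hypothetical), assuming as above that every column from `SSO` onwards is a target column:

```python
import pandas as pd

# Hypothetical frame: one text column followed by target columns
toy = pd.DataFrame({
    "segment_text": ["a", "b", "c"],
    "SSO": [0, 1, 0],
    "Contact": [0.0, 2.0, 1.0],
})

# Select every column from 'SSO' onwards, as in the loop above
target_columns = toy.loc[:, "SSO":].columns

# Cap all counts at 1 in a single vectorised step
toy[target_columns] = toy[target_columns].clip(upper=1)
print(toy)
```

Zeros stay zero and anything above 1 becomes 1, so the count of cells above 0 is unchanged, which is the same check the notebook performs before and after.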


In [93]:
# number of cells greater than 0 should be same as above
print(f"There are {(segment_annots_df.loc[:,'SSO':] == 0).sum().sum()} cells with 0")
print(f"There are {(segment_annots_df.loc[:,'SSO':] > 0).sum().sum()} cells above 0")

There are 510617 cells with 0
There are 17845 cells above 0


In [96]:
# should now be no target cells with more than 1
segment_annots_df.loc[:,'SSO':].max().max() # shows the max value from all target columns

1.0

### Save to pkl

As before, to make it faster to load this dataframe in this notebook and others, I will save this dataframe as a pickle file.  This allows the code below to be run without waiting for the code above.

In [97]:
segment_annots_df.to_pickle('segment_annots_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [98]:
confirm_save_4 = pd.read_pickle('segment_annots_df.pkl')
print(segment_annots_df.shape == confirm_save_4.shape)
print(confirm_save_4.equals(segment_annots_df))

True
True


The dataframe above now has the granularity and target columns required to run a baseline model.  I will follow the same preprocessing steps as the usableprivacy.org team did in the paper to create additional dataframes to use for modelling.

# Load crafted features for each target

To help create accurate classifiers, columns will be added to the dataframe for key phrases that may be found in a segment that has been annotated with a specific annotation. For example, the phrases 'phone book', 'phonebook' or 'address book' could be found in segments that have been annotated with the *Contact_Address_Book* annotation, and adding these phrases as columns could help a classifier to correctly identify *Contact_Address_Book*.

These will be used as 'crafted feature' columns and for 'sentence filtering' when modelling.

This first function gets the list of crafted features for each practice group.

In [54]:
def get_features_for_practices():
    """Return one list of crafted features per practice group in features.yml."""

    with open("APP_350_v1_1/features.yml", "r") as stream:
        try:
            features_yml = (json_normalize(yaml.safe_load(stream)))
        except yaml.YAMLError as exc:
            print(exc)
            raise  # features_yml would be undefined below, so stop here

    data_types = json_normalize(features_yml['data_types'])

    list_of_practice_features = []
    practice_groups = range(len(data_types.columns))

    for i in practice_groups:
        practices_and_features = json_normalize(data_types[i])

        list_of_practice_features.extend(practices_and_features['features'])

    # Note: the message says "practices", but it counts feature groups and individual features
    print(f"{len(list_of_practice_features)} different groups of practices returned, containing {len([practice for practice_group in list_of_practice_features for practice in practice_group])} individual practices.")

    return list_of_practice_features

In [55]:
get_features_for_practices()[2:4] # returning just two to demonstrate

29 different groups of practices returned, containing 354 individual practices.


[['city', 'hometown'],
 ['e-mail address',
  'email address',
  'e-mail and mailing address',
  'email and mailing address',
  'e-mail or mailing address',
  'email or mailing address']]
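
Since each entry under `data_types` carries both a `practices` list and a `features` list, the features could also be keyed by practice name directly instead of relying on position. The sketch below works on a hand-written dictionary shaped like what `yaml.safe_load` is assumed to return; the entries in `loaded` are made up (the SSO phrase in particular is a placeholder) and `strip_party` is a hypothetical helper.

```python
# Hand-written stand-in for the parsed features.yml (values are illustrative)
loaded = {
    "data_types": [
        {"practices": ["Contact_City_1stParty", "Contact_City_3rdParty"],
         "features": ["city", "hometown"]},
        {"practices": ["SSO", "Facebook_SSO"],
         "features": ["placeholder sso phrase"]},
    ]
}


def strip_party(practice_name):
    """Drop the party suffix, if any, from a practice name."""
    return practice_name.removesuffix("_1stParty").removesuffix("_3rdParty")


# Map every practice (without party suffix) to the features of its group
features_by_practice = {}
for group in loaded["data_types"]:
    for practice in group["practices"]:
        features_by_practice.setdefault(strip_party(practice), group["features"])

print(features_by_practice)
```

With a mapping like this, `SSO` and `Facebook_SSO` pick up the same feature list automatically, without needing to know which row number their group occupies.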

Ultimately I will add each of these 354 individual features (labelled "practices" in the output above) to the dataframe.

In [29]:
features_for_practices = get_features_for_practices()

29 different groups of practices returned, containing 354 individual practices.


In [30]:
the_29_practices = [practice for practice in the_30_practices if practice != "SSO"]

In [31]:
practice_and_created_features = pd.DataFrame(data = [the_29_practices, features_for_practices]).T
practice_and_created_features.columns = ["practice", "features"]

In [33]:
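# SSO shares its feature group with Facebook_SSO (row 11), so append SSO as row 29 with the same features
practice_and_created_features.at[len(practice_and_created_features),"practice"] = "SSO"
practice_and_created_features.at[29,"features"] = practice_and_created_features.at[11,"features"]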

In [34]:
practice_and_created_features.head(3)

Unnamed: 0,practice,features
0,Contact,"[contact info, contact details, contact data, ..."
1,Contact_Address_Book,"[phone book, phonebook, contact information in..."
2,Contact_City,"[city, hometown]"


Now I need to grab the features for the parties and modalities too, which are also stored in the 'features.yml' file.

In [35]:
with open("APP_350_v1_1/features.yml", "r") as stream:
    try:
        features_yml = (json_normalize(yaml.safe_load(stream)))
    except yaml.YAMLError as exc:
        print(exc)

features_yml

Unnamed: 0,data_types,modalities.PERFORMED,modalities.NOT_PERFORMED,parties.FirstParty,parties.ThirdParty
0,"[{'practices': ['Contact_1stParty', 'Contact_3...","[consent, permission, opt, collect, access, g...","[not collect, no longer collect, not access, n...","[ we , you , us , our , the app, the software]","[partner, third part, third-part, service prov..."


In [36]:
add_to_df = [
['1st_party', features_yml.loc[0,"parties.FirstParty"] ],
['3rd_party', features_yml.loc[0,"parties.ThirdParty"] ],
['PERFORMED', features_yml.loc[0,"modalities.PERFORMED"] ],
['NOT_PERFORMED', features_yml.loc[0,"modalities.NOT_PERFORMED"] ]
]

In [37]:
annotation_features = pd.concat( [practice_and_created_features, 
                                  pd.DataFrame(add_to_df, columns=['practice', 'features'])],
           axis = 0,
           ignore_index = True)
annotation_features.columns = ['annotation', 'features']

In [38]:
annotation_features.tail(6)

Unnamed: 0,annotation,features
28,Location_WiFi,"[wifi signal, wifi access point, wifi location..."
29,SSO,"[login credentials from one of your accounts, ..."
30,1st_party,"[ we , you , us , our , the app, the software]"
31,3rd_party,"[partner, third part, third-part, service prov..."
32,PERFORMED,"[consent, permission, opt, collect, access, g..."
33,NOT_PERFORMED,"[not collect, no longer collect, not access, n..."


Saving the file:

In [39]:
annotation_features.to_pickle('annotation_features.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [40]:
confirm_save_6 = pd.read_pickle('annotation_features.pkl')
print(annotation_features.shape == confirm_save_6.shape)
print(confirm_save_6.equals(annotation_features))

True
True


So now I have a dataframe with all the different annotations and a list of their respective crafted features called "annotation_features".

# TODO: Move all the below to the end of the 'pre-processing' notebook

# Append annotation features to dataframe

The next steps are to:
1. Append each feature as a column to the dataframe with the annotation as a prefix
2. Populate the columns using the segment text

Then I can move to modelling.

I already have a function to help with step 1 called `add_empty_annotation_columns`.  I just need to put the new features into a list.

First I want to check whether any of the features are the same.

In [41]:
list_all_crafted_features = [feature for row in annotation_features['features'] for feature in row]

In [46]:
all_features = []
duplicate_features = []
for feature in list_all_crafted_features:
    if feature in all_features:
        duplicate_features.append(feature)
    all_features.append(feature)
len(duplicate_features)

103

Hmm, a lot of the features are exactly the same.  I'm not sure that this will be an issue though: the only step performed while the dataframe has duplicate column names is populating it, which I can do by looping through the columns one at a time. It is worth keeping in mind, however, because some pandas functions are unavailable when a dataframe has several columns with the same name, so ideally I would clean this up.
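
One possible clean-up is to prefix each feature with its annotation, which is what step 1 above describes. The sketch below is an assumption about how that could look, using a made-up two-row table; `toy_annotation_features`, the `__` separator and the variable names are all hypothetical.

```python
from collections import Counter

# Hypothetical (annotation, features) rows with one shared phrase
toy_annotation_features = [
    ("Contact_City", ["city", "hometown"]),
    ("Location", ["city", "location"]),
]

# Count how often each phrase appears across all annotations
all_features = [feature for _, features in toy_annotation_features for feature in features]
feature_counts = Counter(all_features)
duplicates = {feature: n for feature, n in feature_counts.items() if n > 1}

# Same quantity as len(duplicate_features) above: repeats beyond the first occurrence
extra_occurrences = sum(n - 1 for n in duplicates.values())

# Prefixing with the annotation makes every column name unique
prefixed_names = [
    f"{annotation}__{feature}"
    for annotation, features in toy_annotation_features
    for feature in features
]
print(duplicates, extra_occurrences, prefixed_names)
```

The prefixed name still contains the phrase after the separator, so the phrase to search for in the segment text can be recovered with `name.split("__", 1)[1]`.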

## Add crafted features columns to df

In [360]:
print(len(list_all_crafted_features))
print(segment_annots_df.shape)
crafted_features_df = priv_pol_funcs.add_empty_annotation_columns(segment_annots_df, list_all_crafted_features)
# TODO: change the line above.  Instead of segment_annots_df,
# use whatever the result of the preprocessing df is: clean_segment_annots_df

579
(15543, 41)
The shape of the returned dataframe is (15543, 620)


In [361]:
crafted_features_df.iloc[:,40:].head(5)

Unnamed: 0,NOT_PERFORMED,contact info,contact details,contact data,"e.g., your name",contact you,your contact,"identify, contact",identifying information,"your name, address, and e-mail address",...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Can I populate this?

In [362]:
crafted_features_df.head(3)

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,SSO,Facebook_SSO,1st_party,...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W...",0,0,2,...,0,0,0,0,0,0,0,0,0,0


- Take the column name.
- Take the segment text.
- If the column name appears in the segment text, put 1.

In [378]:
all_rows = range(len(crafted_features_df))

In [379]:
range(41, 620)

range(41, 620)

In [382]:
%%time

for column_number in range(41, 620): # Looping through each column with a feature

    column_name = crafted_features_df.columns[column_number] # for that column feature

    for row in all_rows: # and for every row
        if column_name in crafted_features_df.at[row, "segment_text"]: # if the segment has that feature
            crafted_features_df.at[row, column_name] = 1 # make the value for that feature on that row equal 1
    
    print(f"Processing {column_number}/620", end="\r")

CPU times: user 46min 27s, sys: 20.8 s, total: 46min 47s
Wall time: 46min 29s
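The row-by-row loop took about 46 minutes. A possible faster approach, sketched here on toy data, is to let pandas test one feature against the whole `segment_text` column at once with `Series.str.contains(..., regex=False)`. The `toy_df` and `toy_features` names are made up for illustration, and the sketch assumes unique feature names, which is not the case for the real dataframe yet.

```python
import pandas as pd

# Toy stand-in for crafted_features_df (only the text column is needed here)
toy_df = pd.DataFrame({
    "segment_text": [
        "We collect your precise location.",
        "We may use your contact info to contact you.",
        "No data is collected.",
    ]
})

# Toy stand-in for the crafted features (assumed unique)
toy_features = ["precise location", "contact info", "contact you"]

for feature in toy_features:
    # regex=False gives a literal, case-sensitive substring test,
    # the same check as `feature in segment_text` in the loop above
    toy_df[feature] = (
        toy_df["segment_text"].str.contains(feature, regex=False).astype(int)
    )

print(toy_df[toy_features])
```

I have not timed this on the full 15543 x 579 problem, so the speed-up is an expectation rather than a measurement.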


In [391]:
# looking at some of the results to verify
summations = crafted_features_df.iloc[:,41:].sum()
summations[-230:-200]

location based on gps/wi-fi/communications                                                                 0
 precise location                                                                                         48
 precise geo                                                                                              33
 precise device location                                                                                   3
 precise device geo                                                                                        0
 specific location                                                                                         8
 specific geo                                                                                              7
 specific device location                                                                                  1
 specific device geo                                                                                       0
 exact location    

This looks roughly correct, so I will use it for modelling.

## Saving the df

As before, to make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.  This allows the code below to be run without waiting for the code above.

In [392]:
crafted_features_df.to_pickle('crafted_features_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [393]:
confirm_save_5 = pd.read_pickle('crafted_features_df.pkl')
print(crafted_features_df.shape == confirm_save_5.shape)
print(confirm_save_5.equals(crafted_features_df))

True
True


# Conclusion

I now have a range of dataframes that I can use for EDA and modelling.

Those dataframes are listed here:

**Dataframes (DF):**<br>
Initial DF – "all_segments_df" – contains policy metadata, segment text, and YAML-embedded annotations and sentences
<br> DF 2 – "segment_annotations" – all of the above plus columns for annotations of the form [data_practice, party] (58 additional columns)
<br> DF 3 – "practice_columns_df" – as above, plus a column for each data practice (30 additional columns). Suitable for EDA (not modelling) since it counts the number of occurrences of each practice in each segment.
<br> DF 4 – "segment_annots_df" – the first dataframe together with the 30 practice columns introduced in DF 3. Essentially it holds each segment and the targets. Suitable for modelling because each practice is binary (present or not).
<br> DF 5 – "crafted_features_df" – as above with the 579 Crafted Features added. Contains each segment, all crafted features, and columns for the practice, party and modality targets.

There is also another dataframe, "annotation_features", which lists the Crafted Features for each annotation.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('seaborn')  # note: on matplotlib 3.6+ this style is named 'seaborn-v0_8'

Options for next steps:

- Make some mapping of the Crafted Features with the respective targets.
- Read the paper again and replicate their steps.
- I might need to revise and ask about SVM

EDA to do:

- Normal EDA on dataset.  I've already done lots of this IMO although some of it is for grouped targets (practice+party).
- EDA on individual targets. Some segments contain the same annotation many times.
- TF-IDF such as most common words (see graph in Text Data notebook)
- EDA on Crafted Features – how common are they? How well do they match with the targets?
- EDA on pre-processing step: how messy are the CFs and Segments (whitespace, odd punctuation, non-ASCII characters)?
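As a starting point for the pre-processing check, here is a rough sketch that flags messy strings. Several features in the summation output earlier appear to carry a leading space (for example " precise location"), which is the kind of thing this would surface. The `toy_strings` series is invented for illustration; the same checks could be pointed at the crafted features or at `segment_text`.

```python
import pandas as pd

# Made-up examples standing in for crafted features or segment text
toy_strings = pd.Series([
    "contact info",
    " precise location",
    "exact location  ",
    "caf\u00e9 data",
    "e.g.,  your name",
])

report = pd.DataFrame({
    "text": toy_strings,
    # leading or trailing whitespace
    "edge_whitespace": toy_strings != toy_strings.str.strip(),
    # two or more consecutive spaces anywhere in the string
    "repeated_spaces": toy_strings.str.contains("  ", regex=False),
    # any character outside the ASCII range
    "non_ascii": ~toy_strings.map(str.isascii),
})

print(report)
print(report[["edge_whitespace", "repeated_spaces", "non_ascii"]].sum())
```

The column sums give a quick count of how many strings have each problem, which should help decide how much cleaning is worth doing before modelling.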