# Loading and Cleaning Data

In this notebook I aim to load all the data and manipulate it into a shape that I can then use for EDA and pre-processing.

Some custom functions for these steps are imported from the separate Python file `priv_policy_manipulation_functions.py`.  The purpose is to reduce the code length in this notebook.

Each annotated privacy policy is stored in YAML format under the file path `APP_350_v1_1/annotations/`.

Important terminology: 
- 'Practice': a description of a use of personal data is referred to as a ‘privacy practice’. Many different 'practices' are analyzed, relating to contact information, location, identification, demographics, etc.
- 'Segment': a segment is roughly equivalent to a paragraph of text data in a privacy policy.
- Annotations: each annotation contains the party (1st or 3rd party), the practice, and the modality (performed or not performed).  I will train a classifier to identify each of these.
- Targets: something I will be training a classifier to predict; one of the three elements of an annotation

The steps are as follows:

- Load each policy into a dataframe
- Making segments
- Extracting list of practices
- Applying all annotations to columns
- Separate annotations to party and practice
- Add target columns for each practice and modality
- Tidying

Separately: Load crafted features

Specifically:
1. Load each policy YAML file into a row in a dataframe
2. Create additional columns with metadata about each policy. This is for EDA.
3. Break the first data frame down further – each segment will be a row in the data frame, while maintaining some of the information around which policy it came from
4. Using another file from the APP350 data, load every different data practice annotation. In its initial form, each annotation is a combination of the party (1st or 3rd party) with the practice.
5. Apply each practice as a column to the data frame (and populate the columns), so that we can see which annotations are applied to each segment. This is useful for further EDA to see the spread of annotations.
6. I explore the data a small amount by way of verification and discuss some insights.
7. Break the annotations down further to separate the party annotations from the practice annotations. Add these columns to the data frame (and populate these columns)
    - This is because I will create separate classifiers for each party and each practice
8. Add the modality (‘PERFORMED’ or ‘NOT PERFORMED’) annotations to the dataframe
9. Tidying – removing the columns containing a combination of party and practice and reducing to binary

Finally: 
- Loading the crafted features, which will be used for modelling
- Summary


Dataset and features taken from *Natural Language Processing for Mobile App Privacy Compliance. Peter Story, Sebastian Zimmeck, Abhilasha Ravichander, Daniel Smullen, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. AAAI Spring Symposium on Privacy Enhancing AI and Language Technologies (PAL 2019), Mar 2019.*

Available here – APP350: https://usableprivacy.org/data

Importing all libraries used:

In [2]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
from collections import defaultdict
from collections import Counter

import priv_policy_manipulation_functions as priv_pol_funcs

Put the first policy into a pandas DataFrame. I will abbreviate DataFrame as "df".

# Load each policy into a dataframe

Calling a function to go into each privacy policy and add it to a row in a dataframe.

In [2]:
all_policies_df = priv_pol_funcs.load_all_policies()

# Demonstrate result
display(all_policies_df.head(3)) 

Unnamed: 0,policy_id,policy_name,policy_type,contains_synthetic,segments
0,1,6677G,TEST,False,"[{'segment_id': 0, 'segment_text': 'PRIVACY PO..."
1,2,AIFactory,TEST,False,"[{'segment_id': 0, 'segment_text': 'AI Factory..."
2,3,AppliqatoSoftware,TEST,False,"[{'segment_id': 0, 'segment_text': 'Automatic ..."


You can see that each policy has information about its:
- id,
- company name,
- Train/Validate/Test split, and
- whether it contains synthetic text.

The content and annotations are embedded in the "segments" column.

Synthetic text is text added by Story et al. to change segments stating that an action was performed to stating that the action was not performed.  This was done to help train their negative classifier (a classifier to identify that an action was not performed).

Next, calling a function to apply three columns of metadata to the dataframe: 
- Number of segments in the policy; 
- number of segments in the policy that contain annotations, and 
- the total characters in the policy.

In [3]:
all_policies_df = priv_pol_funcs.add_metadata_to_policy_df(all_policies_df)

# Demonstrate result
display(all_policies_df.head(3))

Unnamed: 0,policy_id,policy_name,policy_type,contains_synthetic,segments,num_segments,num_annotated_segs,total_characters
0,1,6677G,TEST,False,"[{'segment_id': 0, 'segment_text': 'PRIVACY PO...",36,11,12703
1,2,AIFactory,TEST,False,"[{'segment_id': 0, 'segment_text': 'AI Factory...",14,5,5995
2,3,AppliqatoSoftware,TEST,False,"[{'segment_id': 0, 'segment_text': 'Automatic ...",8,1,2450
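
The source of `add_metadata_to_policy_df` is not shown in this notebook. As a rough sketch of how the three metadata values could be derived for one policy, the helper below (`policy_metadata`, a hypothetical name) assumes each entry in `segments` is a dict with `segment_text` and `annotations` keys, which is what the outputs here suggest. The real function may count characters differently.

```python
def policy_metadata(segments):
    """Return (num_segments, num_annotated_segs, total_characters) for one policy.

    Hypothetical helper: assumes each segment is a dict with
    'segment_text' and 'annotations' keys.
    """
    num_segments = len(segments)
    # A segment counts as annotated when its annotation list is non-empty
    num_annotated_segs = sum(1 for seg in segments if seg.get("annotations"))
    total_characters = sum(len(seg["segment_text"]) for seg in segments)
    return num_segments, num_annotated_segs, total_characters


# Toy policy with two segments, one of them annotated
toy_segments = [
    {"segment_id": 0, "segment_text": "PRIVACY POLICY", "annotations": []},
    {"segment_id": 1, "segment_text": "We collect your email.",
     "annotations": [{"practice": "Contact_E_Mail_Address_1stParty",
                      "modality": "PERFORMED"}]},
]
toy_metadata = policy_metadata(toy_segments)
```

Under the same assumption, it could be applied per policy with something like `all_policies_df['segments'].apply(policy_metadata)`.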


I know that no policy file was loaded twice because the function loops through each file by number (1-350).  

A further step would be to check that none of the policies are duplicates of each other.

These additional columns allow for some EDA discussed in the next notebook.

# Making segments

Now to make a new dataframe where each row represents a paragraph (segment).  This is effectively breaking the first dataframe down further – each segment will be a row in the dataframe, while maintaining some of the information around which policy it came from.

The function works by splitting a single policy into its segments, then looping through all the policies to apply the same manipulation.

In [4]:
all_segments_df = priv_pol_funcs.generate_segment_df(all_policies_df)

# Demonstrate the result:
display(all_segments_df.head(3))

segment_id,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[]
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[]
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W..."
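
The body of `generate_segment_df` lives in the separate functions file and is not reproduced here. The sketch below shows one way to get the same shape of result on toy data. It swaps the per-policy loop for `DataFrame.explode`, and the toy column values and segment keys are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for all_policies_df: one row per policy, segments nested in a list
toy_policies = pd.DataFrame({
    "policy_id": [1, 2],
    "policy_type": ["TEST", "TRAIN"],
    "segments": [
        [{"segment_id": 0, "segment_text": "A", "annotations": []},
         {"segment_id": 1, "segment_text": "B", "annotations": []}],
        [{"segment_id": 0, "segment_text": "C", "annotations": []}],
    ],
})

# One row per (policy, segment); the policy-level columns are repeated
exploded = toy_policies.explode("segments", ignore_index=True)

# Turn each segment dict into its own columns and join them back on
segment_cols = pd.DataFrame(exploded["segments"].tolist())
toy_segments_df = (
    exploded.drop(columns="segments")
    .join(segment_cols)
    .rename(columns={"policy_id": "source_policy_number",
                     "segment_id": "policy_segment_id"})
)
toy_segments_df.index.name = "segment_id"
```

Each row of `toy_segments_df` is now one segment that still carries its source policy number and policy type.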


Inspecting the result:

In [5]:
all_segments_df.shape

(15543, 7)

Total 15543 segments in the entire dataset.

In [6]:
all_segments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15543 entries, 0 to 15542
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   source_policy_number  15543 non-null  int64 
 1   policy_type           15543 non-null  object
 2   contains_synthetic    15543 non-null  bool  
 3   policy_segment_id     15543 non-null  int64 
 4   segment_text          15543 non-null  object
 5   annotations           15543 non-null  object
 6   sentences             15543 non-null  object
dtypes: bool(1), int64(2), object(4)
memory usage: 743.9+ KB


All policies were pulled through. The data types are appropriate.

The data has had some cleaning conducted by Story et al. We see no null values.

The entire dataframe takes up only a small amount of memory:

In [7]:
print(f"{round(sys.getsizeof(all_segments_df)/(1e6),4)} MB")

9.9535 MB


### Save as a file – Export the all segments dataframe.

To make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.

`DataFrame.to_pickle()` suits this better than `DataFrame.to_csv()` because:
- The stored file size is smaller
- List objects in the dataframe are *not* converted to strings, as with csv.
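
To illustrate the second point, here is a small round-trip sketch on toy data (the toy dataframe and file names are made up for the example): a list column comes back from CSV as text but survives pickling unchanged.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with a column of lists, like the 'annotations' column
toy_df = pd.DataFrame({"annotations": [[{"practice": "SSO"}], []]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    pkl_path = os.path.join(tmp_dir, "toy.pkl")

    toy_df.to_csv(csv_path, index=False)
    toy_df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

csv_value = from_csv.loc[0, "annotations"]        # text representation of the list
pickle_value = from_pickle.loc[0, "annotations"]  # still a list of dicts
```

After a CSV round trip the annotations would need to be parsed back into Python objects before they could be looped over, which pickling avoids.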

In [8]:
all_segments_df.to_pickle('objects/all_segments_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [9]:
confirm_save_1 = pd.read_pickle('objects/all_segments_df.pkl')
print(all_segments_df.shape == confirm_save_1.shape)
print(confirm_save_1.equals(all_segments_df))

True
True


Now the code below can all be run using the dataframe loaded from the pickle file, instead of having to wait for the above code to run.

# Extracting list of practices

Using another file from the APP350 data, load every different data practice annotation. In its initial form, each annotation is a combination of the party (1st or 3rd party) with the practice.

Employing my function to get the list of practices from the APP350 documentation:

In [10]:
list_of_practice_groups = priv_pol_funcs.get_list_of_practice_groups()

29 different groups of practices returned, containing 58 individual practices.


In [11]:
# Flatten the list of lists with a list comprehension
list_of_practices = [practice for practice_group in list_of_practice_groups for practice in practice_group]
print(f"There are {len(list_of_practices)} different practices.")

There are 58 different practices.


Demonstrate the first 5 practices:

In [12]:
list_of_practices[:5]

['Contact_1stParty',
 'Contact_3rdParty',
 'Contact_Address_Book_1stParty',
 'Contact_Address_Book_3rdParty',
 'Contact_City_1stParty']

Now we can apply these practices to the dataframe.

---

# Applying all annotations to columns in the segment dataframe

Now I will apply each practice as a column to the data frame (and populate the columns), so that we can see which annotations are applied to each segment. This is useful for further EDA to see the spread of annotations.

In [13]:
# Can be used to read in the dataframe without running the above code
# all_segments_df = pd.read_pickle('objects/all_segments_df.pkl')

Using a function to add additional columns to the dataframe:

In [14]:
segment_annotations = priv_pol_funcs.add_empty_annotation_columns(all_segments_df, list_of_practices)

The shape of the returned dataframe is (15543, 65)


To demonstrate that the columns have been added:

In [15]:
display(segment_annotations.head(2) )

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,Contact_1stParty,Contact_3rdParty,Contact_Address_Book_1stParty,...,Location_Bluetooth_1stParty,Location_Bluetooth_3rdParty,Location_Cell_Tower_1stParty,Location_Cell_Tower_3rdParty,Location_GPS_1stParty,Location_GPS_3rdParty,Location_IP_Address_1stParty,Location_IP_Address_3rdParty,Location_WiFi_1stParty,Location_WiFi_3rdParty
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0


Looping through the dataframe, accessing the information within the 'annotations' column to populate the new columns I have created:

In [16]:
# populate the columns with the annotations
for index in range(len(segment_annotations)):
    practices_dictionaries = segment_annotations.loc[index, 'annotations']
    for each_practice in practices_dictionaries:
        segment_annotations.loc[index, each_practice['practice']] += 1
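
The loop above updates one cell at a time. As an alternative sketch (not the approach used in this notebook), the same counts can be built in one pass by collecting (segment, practice) pairs and cross-tabulating them. The toy dataframe and `toy_practices` list are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "annotations": [
        [],
        [{"practice": "Contact_1stParty", "modality": "PERFORMED"},
         {"practice": "Contact_1stParty", "modality": "NOT_PERFORMED"},
         {"practice": "SSO", "modality": "PERFORMED"}],
    ]
})
toy_practices = ["Contact_1stParty", "Contact_3rdParty", "SSO"]

# One (segment_id, practice) pair per annotation
pairs = pd.DataFrame(
    [(segment_id, annotation["practice"])
     for segment_id, annotations in toy["annotations"].items()
     for annotation in annotations],
    columns=["segment_id", "practice"],
)

# Count practices per segment, then restore unannotated segments and
# unused practices as rows/columns of zeros
counts = (
    pd.crosstab(pairs["segment_id"], pairs["practice"])
    .reindex(index=toy.index, columns=toy_practices, fill_value=0)
)
```

`counts` has one row per segment and one column per practice, so it could be joined onto the segment dataframe in place of the empty columns.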

As one verification step I will check that the final row has been populated. It happens to have annotations, so a non-zero sum across its annotation columns shows that the loop reached the end of the dataframe.

In [17]:
segment_annotations.iloc[15542,7:].sum() # columns from position 7 onwards are the annotation columns

4

Total segment annotations:

In [18]:
segment_annotations.iloc[:,7:].sum().sum() 

10215

## Export segment_annotations

To make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.  This allows the code below to be run without waiting for the above code.

In [19]:
segment_annotations.to_pickle('objects/segment_annotations.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [20]:
confirm_save_2 = pd.read_pickle('objects/segment_annotations.pkl')
print(segment_annotations.shape == confirm_save_2.shape)
print(confirm_save_2.equals(segment_annotations))

True
True


In [21]:
confirm_save_2 = pd.read_pickle('objects/segment_annotations.pkl')

In [22]:
confirm_save_2.shape

(15543, 65)

---

## Verifying the data loading (with some insights)

### Investigating rows with multiple instances of the same annotation

It will be interesting to know whether there are any paragraphs with more than one of the same annotation, and why.

In [23]:
max_of_each_annotation_per_paragraph = segment_annotations.iloc[:,7:].max() # creating object to use as filter
max_of_each_annotation_per_paragraph.loc[max_of_each_annotation_per_paragraph.values > 1] # filtering using the filter

Contact_E_Mail_Address_1stParty    2
Location_1stParty                  2
dtype: int64

Two annotations were investigated:
- Contact_E_Mail_Address_1stParty: only applied twice in one paragraph:
    - paragraph 7194: the paragraph mentioned that it both performed AND did not perform this practice.
- Location_1stParty: only applied twice in one paragraph:
    - paragraph 11150: it is a very long paragraph and has the annotation as both performed and not performed. (I note that the annotated sentences seem questionable.)
    
Conclusions:
- These two occurrences are NOT errors in my data loading
- **almost no segments have duplicate annotations.**  This is expected and is a point in favour of the annotators.

### Investigating annotations that occur very rarely

This is another verification step that has general insights for both modelling and app privacy policies.

In [24]:
annotation_segment_frequencies = segment_annotations.iloc[:,7:].sum() # Number of paragraphs with each annotation

Total annotations

In [25]:
annotation_segment_frequencies.sum()

10215

In [26]:
annotation_segment_frequencies.loc[annotation_segment_frequencies.values < 10]

Contact_City_3rdParty             8
Identifier_IMSI_3rdParty          3
Identifier_SIM_Serial_3rdParty    3
Identifier_SSID_BSSID_3rdParty    2
dtype: int64

The bottom three are ways to identify a user using technical information about the user's system.

Some examples:
- Contact_City_3rdParty: Tends to describe a level of abstraction/anonymisation of location data
- Identifier_SSID_BSSID_3rdParty: A couple of apps use an advertising service that collects internet network info

I am confident that these are not errors because they are rare combinations of data types to collect.

We also learn from this that apps rarely share technical identification information with third parties, or if they do, their privacy policies fail to document it.  The fact that these annotations are so rare (2/350) leads me to question the relevance of this feature at all, but it does suggest that apps prefer to identify their users in other ways.

---

# Separate Segment Annotations to party (1st/3rd party) and Practice

Next I will break the annotations down further to separate the party annotations from the practice annotations, then add these columns to the data frame and populate them. This is because I will create separate classifiers for each party and each practice.

In [27]:
# Can be used to read in the dataframe without running the above code
# segment_annotations = pd.read_pickle('objects/segment_annotations.pkl')

Reminder of the dataframe at this point: it holds the segment text and annotation columns such as "Contact_Address_Book_1stParty", as demonstrated below:

In [28]:
display(segment_annotations.head(3))

Unnamed: 0,source_policy_number,policy_type,contains_synthetic,policy_segment_id,segment_text,annotations,sentences,Contact_1stParty,Contact_3rdParty,Contact_Address_Book_1stParty,...,Location_Bluetooth_1stParty,Location_Bluetooth_3rdParty,Location_Cell_Tower_1stParty,Location_Cell_Tower_3rdParty,Location_GPS_1stParty,Location_GPS_3rdParty,Location_IP_Address_1stParty,Location_IP_Address_3rdParty,Location_WiFi_1stParty,Location_WiFi_3rdParty
0,1,TEST,False,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,TEST,False,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[],0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,TEST,False,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W...",0,0,0,...,0,0,0,0,0,0,0,0,0,0


Adding the party annotations to the dataframe:

In [29]:
segment_annotations["1st_party"] = 0
segment_annotations["3rd_party"] = 0

Now to populate these columns, I will use the 1st or 3rd party information from the columns I have previously populated.  For each row, if any of the annotations mentioning 1st party have a value, I will populate the '1st party' column (and same for 3rd party).

First I will get a list of each annotation mentioning 1st party and a list of each annotation mentioning 3rd party (by reference to the column names of the annotations already added above).

Example: the "Contact_3rdParty" column is already added, so I will add this column to the list of 3rd parties.

In [30]:
_1st_party_practices = [column_name for column_name in segment_annotations.columns if "1stParty" in column_name]
_3rd_party_practices = [column_name for column_name in segment_annotations.columns if "3rdParty" in column_name]

Now, for each row, I can check whether any of the 1st party annotation columns in the list created above has a value and, if so, populate the new '1st party' column (and the same for 3rd party).

In [31]:
for annotation_column in _1st_party_practices:
    for index in range(len(segment_annotations.index)):
        if segment_annotations.at[index, annotation_column] == 1:
            segment_annotations.at[index, "1st_party"] += 1

In [32]:
for annotation_column in _3rd_party_practices:
    for index in range(len(segment_annotations.index)):
        if segment_annotations.at[index, annotation_column] == 1:
            segment_annotations.at[index, "3rd_party"] += 1
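
The two loops above visit every cell individually. A row-wise sum over the matching columns is a more compact way to express the same idea, sketched below on toy data (the toy values are made up). One difference to be aware of: a plain sum also counts cells holding 2, whereas the `== 1` test in the loops skips them, so the sketch shows both variants.

```python
import pandas as pd

toy = pd.DataFrame({
    "Contact_1stParty":  [0, 1, 2],
    "Location_1stParty": [0, 1, 0],
    "Contact_3rdParty":  [1, 0, 0],
    "SSO":               [0, 1, 0],
})

first_party_cols = [col for col in toy.columns if col.endswith("_1stParty")]
third_party_cols = [col for col in toy.columns if col.endswith("_3rdParty")]

# Row-wise sum across the matching columns (counts repeated annotations too)
toy["1st_party"] = toy[first_party_cols].sum(axis=1)
toy["3rd_party"] = toy[third_party_cols].sum(axis=1)

# Same rule as the loops: only cells exactly equal to 1 are counted
toy["1st_party_ones_only"] = (toy[first_party_cols] == 1).sum(axis=1)
```

In the last toy row the only 1st party cell holds 2, so `1st_party` is 2 while `1st_party_ones_only` is 0.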

Verifying that many were added:

In [33]:
print(segment_annotations["1st_party"].sum())
print(segment_annotations["3rd_party"].sum())

7536
2202


I could include the practices 'SSO' and 'Facebook_SSO' when populating the 3rd party column, since they are both 3rd party by default.  I am not going to use these for training the 3rd party classifier, so I don't need to include them, although including them is a variation worth trying when training the classifier.

It means that the total 1st and 3rd party annotation columns will not quite add up to the total annotations we saw earlier.

In [34]:
annotation_segment_frequencies[['SSO', 'Facebook_SSO']]

SSO             274
Facebook_SSO    199
dtype: int64

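Sanity check: the 1st party, 3rd party, SSO and Facebook SSO totals come to 7536 + 2202 + 274 + 199 = 10211, which is 4 short of the total number of annotations seen earlier:

In [35]:
annotation_segment_frequencies.sum() # Verify

10215

The gap of 4 matches the two cells found earlier that hold a value of 2 (Contact_E_Mail_Address_1stParty and Location_1stParty): the `== 1` test in the loops above skips them, so those 2 + 2 annotations are not counted in the `1st_party` column. With that accounted for, the totals are consistent.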

### Review of further steps to complete

I am adding each specific annotation type to the dataframe (party/practice/modality). I have just added the 1st and 3rd party annotations to the dataframe.  Further steps:
- Add target columns for each practice to all segments
- Add target columns for each modality to all segments

# Add target columns for each practice to all segments

The method for this will be to:
- get a list of each practice to create a column for
- add the empty practice columns to the dataframe

Currently the dataframe has, for each practice I need, one column for 1st party and one for 3rd party. For example, I need to populate the "Contact_E_Mail_Address" column using the sum of the values in Contact_E_Mail_Address_1stParty and Contact_E_Mail_Address_3rdParty. So as a sub-step I will:
- make a dictionary where the key is the practice I want to populate and the values are the 1st and 3rd party columns to sum
- use this to populate each column
- verify

In [36]:
practice_columns_df = segment_annotations.copy()

Get a list of each practice to create a column for:

In [37]:
# Get the practices generated above from features.yaml:
the_30_practices = [practice.removesuffix("_1stParty").removesuffix("_3rdParty") for practice in list_of_practices]

# remove duplicates:
the_30_practices = list(dict.fromkeys(the_30_practices))

There is no need to add "SSO" and "Facebook_SSO" because those columns already exist, so I will remove them:

In [38]:
# in python, a good way to remove items from a list is to use a list comprehension
the_28_practices = [practice for practice in the_30_practices if practice not in ["SSO", "Facebook_SSO"] ]

Add the empty practice columns to the dataframe:

In [39]:
# combining the current columns with the new list of columns:
practice_columns_df = practice_columns_df.reindex(
    columns = practice_columns_df.columns.tolist() + the_28_practices
)

Make a dictionary where the key is the practice I want to populate and the values are the two 1st and 3rd party versions from which to sum.

In [40]:
practice_pair_dict = dict.fromkeys(the_30_practices)

for practice in practice_pair_dict.keys():
    practice_pair_dict[practice] = [sub_practice for sub_practice in list_of_practices
                                    if sub_practice.removesuffix("_1stParty").removesuffix("_3rdParty") == practice]

Populate each practice column:

This takes around 4 minutes.

In [41]:
%%time

# looping through each column name of interest and each row
for practice_column in the_28_practices: # only 28 because SSO and Facebook_SSO were populated earlier
    for index in range(len(practice_columns_df.index)):
        
        # assign the value by reference to the above dictionary
        practice_columns_df.at[index, practice_column] = practice_columns_df.loc[index, practice_pair_dict[practice_column] ].sum()
    
    # help see how long is remaining
    print(f"Finished processing {practice_column}", end="\r")

CPU times: user 4min 16s, sys: 2.1 s, total: 4min 18s
Wall time: 4min 24s
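
Most of the four minutes goes on visiting one cell at a time. As a sketch of a faster alternative (an assumption on my part, not what was run above), each practice column can be filled in a single step with a row-wise sum over its 1st and 3rd party columns. The toy dataframe and `toy_pairs` dictionary stand in for `practice_columns_df` and `practice_pair_dict`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Contact_1stParty": [1, 0, 2],
    "Contact_3rdParty": [0, 1, 1],
    "Location_GPS_1stParty": [0, 0, 1],
    "Location_GPS_3rdParty": [0, 0, 0],
})

# practice -> the party-specific columns it is summed from
toy_pairs = {
    "Contact": ["Contact_1stParty", "Contact_3rdParty"],
    "Location_GPS": ["Location_GPS_1stParty", "Location_GPS_3rdParty"],
}

for practice, party_columns in toy_pairs.items():
    # One vectorized row-wise sum per practice instead of one .at call per cell
    toy[practice] = toy[party_columns].sum(axis=1)
```

In this sketch the new columns stay integer because they are created directly, rather than first being added as empty (NaN) columns via `reindex`, which is why the notebook's sums print as floats.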


Verification:

Checking a single column:

In [42]:
practice_columns_df["Location_Cell_Tower"].sum() # Verify

166.0

This looks reasonable. Now to check all annotations.

As we saw earlier, the total annotations are 10215, so if the sum of the newly added annotations matches, this confirms that the correct amount has been added.

In [43]:
# filtering and summing twice to get the total columns and rows of interest
_30_practices_filter = practice_columns_df[ the_30_practices ] > 0
practice_columns_df[the_30_practices][(_30_practices_filter)].sum().sum() # Verify

10215.0

It matches! Success!

### Export practice_columns_df

As before, to make it faster to load this dataframe in this notebook and others, I will save it as a pickle file.  This allows the code below to be run without waiting for the above code.

In [44]:
practice_columns_df.to_pickle('objects/practice_columns_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [45]:
confirm_save_3 = pd.read_pickle('objects/practice_columns_df.pkl')
print(practice_columns_df.shape == confirm_save_3.shape)
print(confirm_save_3.equals(practice_columns_df))

True
True


# Add target columns for each modality to all segments

Now to add the modality (‘PERFORMED’ or ‘NOT PERFORMED’) annotations to the dataframe.

In [46]:
# Can be used to read in the dataframe without running the above code
# practice_columns_df = pd.read_pickle('objects/practice_columns_df.pkl')

In [47]:
modality_columns_df = practice_columns_df.copy()

Instantiate empty modality columns to be populated:

In [48]:
modality_columns_df["PERFORMED"] = 0
modality_columns_df["NOT_PERFORMED"] = 0

Populate the modality columns with the annotations.  This uses a similar process to when initially populating the segments because the modality is stored at the same level in the YAML structure.

In [49]:
for index in range(len(modality_columns_df)):
    practices_dictionaries = modality_columns_df.at[index, 'annotations']
    for each_practice in practices_dictionaries:
        modality_columns_df.at[index, each_practice['modality']] += 1

Verification:

In [50]:
modality_columns_df["PERFORMED"].sum()

8205

In [51]:
modality_columns_df["NOT_PERFORMED"].sum()

2010

Once again, these add up to the total annotations seen earlier of 10215, which is a helpful verification.

We also see that there are not as many annotations stating that the practice is not performed. The implication is that privacy policies tend to state only what they are doing, and rarely add extra clarity by stating what they are not doing.  Remember that most of these NOT_PERFORMED annotations were added synthetically, so the true number would be much lower.

To review, the columns now are:
- columns about the policy e.g. whether it is train/validate/test; 
- each segment; 
- columns for concatenated annotations (of the form practice_party), 
- columns for each element of annotation:
    - party (two columns: either 1st party or 3rd party)
    - practice (30 columns)
    - modality (two columns: either Performed or Not Performed)

In [52]:
# all columns can be seen using:
# modality_columns_df.columns

Since we now have a dataframe at the segment level with columns for each combination of annotations and columns for the specific annotations of practice, party and modality, I will name this `segment_all_targets_df`.

In [53]:
segment_all_targets_df = modality_columns_df.copy()

**Next:**

I will now remove some columns to create a new dataframe where the only target columns correspond to the specific annotations for a specific practice, the party and modality.  I will also change the target columns to be binary instead of a sum.


---

# Tidying

## Remove columns

In [54]:
segment_annots_df = segment_all_targets_df.copy()

The `list_of_practices` has all 58 concatenated annotations (the *practice* concatenated with whether it's *1st or 3rd party*), but two in this list ("SSO" and "Facebook_SSO") are 3rd party by default. We need to remove all the concatenated annotations from the current dataframe, except for those two, since they are already in the correct form.

In [55]:
the_56_specific_practices = [practice for practice in list_of_practices if practice not in ["SSO", "Facebook_SSO"] ]
segment_annots_df = segment_annots_df.drop(columns = the_56_specific_practices)

Verify columns have been dropped and we now have the correct number:

In [56]:
segment_annots_df.shape

(15543, 41)

In [57]:
segment_annots_df.loc[:,'SSO':].shape # target columns are all those that happen to be including and after "SSO"

(15543, 34)

There are 34 different 'target' columns ready.

In [58]:
# columns can be further inspected using:
# segment_annots_df.columns

---

## Convert to binary

Previously the columns were populated with the frequency of each annotation, which was helpful for EDA. For modelling, however, each column must be binary (target classes of 0 or 1), otherwise the data does not represent a binary classification problem.

I will reduce any value above 1 down to 1.  One verification step is to check that the number of cells equal to zero and the number of cells above zero do not change, so first I will note these numbers:

In [59]:
# number of cells greater than 0 should not change
print(f"There are {(segment_annots_df.loc[:,'SSO':] == 0).sum().sum()} cells with 0")
print(f"There are {(segment_annots_df.loc[:,'SSO':] > 0).sum().sum()} cells above 0")

There are 510617 cells with 0
There are 17845 cells above 0


Reducing the numbers above 1 down to one:

In [60]:
%%time
# For each column from SSO onwards, go into every cell, and if it's above 1, set it to 1.

for column in segment_annots_df.loc[:,'SSO':].columns:
    for i in range(len(segment_annots_df[column])):

        if segment_annots_df.at[i, column] > 1:
            segment_annots_df.at[i, column] = 1

CPU times: user 2.08 s, sys: 15.3 ms, total: 2.09 s
Wall time: 2.11 s
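
The same capping can be written without explicit loops using `DataFrame.clip`. The sketch below runs on a made-up toy dataframe and repeats the "cells above 0" check before and after.

```python
import pandas as pd

toy = pd.DataFrame({
    "segment_text": ["a", "b", "c"],
    "SSO": [0, 1, 0],
    "Contact": [2.0, 0.0, 1.0],
    "PERFORMED": [3, 1, 0],
})

# Target columns are everything from 'SSO' onwards, as in the notebook
target_columns = toy.loc[:, "SSO":].columns

nonzero_before = int((toy[target_columns] > 0).sum().sum())

# Cap every target value at 1; zeros and ones are left untouched
toy[target_columns] = toy[target_columns].clip(upper=1)

nonzero_after = int((toy[target_columns] > 0).sum().sum())
```

Because `clip(upper=1)` only lowers values that exceed 1, the number of non-zero cells cannot change, which is the same invariant checked in the verification cells.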


Verification: 

The number of cells greater than 0 should be the same as above:

In [61]:
print(f"There are {(segment_annots_df.loc[:,'SSO':] == 0).sum().sum()} cells with 0")
print(f"There are {(segment_annots_df.loc[:,'SSO':] > 0).sum().sum()} cells above 0")

There are 510617 cells with 0
There are 17845 cells above 0


Should now be no target cells with more than 1:

In [62]:
segment_annots_df.loc[:,'SSO':].max().max() # shows the max value from all target columns

1.0

### Save to pkl

As before, to make it faster to load this dataframe in this notebook and others, I will save this dataframe as a pickle file.  This allows the code below to be run without waiting for the above code.

In [63]:
segment_annots_df.to_pickle('objects/segment_annots_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [64]:
confirm_save_4 = pd.read_pickle('objects/segment_annots_df.pkl')
print(segment_annots_df.shape == confirm_save_4.shape)
print(confirm_save_4.equals(segment_annots_df))

True
True


The dataframe above now has the granularity and target columns required to run a baseline model using only text vectorization. In a later notebook I will conduct the same preprocessing steps as Story et al. describe in the paper to create additional dataframes to use for modelling.

In [3]:
finding_location_sentence = pd.read_pickle('objects/segment_annots_df.pkl')

In [4]:
finding_location_sentence.columns

Index(['source_policy_number', 'policy_type', 'contains_synthetic',
       'policy_segment_id', 'segment_text', 'annotations', 'sentences', 'SSO',
       'Facebook_SSO', '1st_party', '3rd_party', 'Contact',
       'Contact_Address_Book', 'Contact_City', 'Contact_E_Mail_Address',
       'Contact_Password', 'Contact_Phone_Number', 'Contact_Postal_Address',
       'Contact_ZIP', 'Demographic', 'Demographic_Age', 'Demographic_Gender',
       'Identifier', 'Identifier_Ad_ID', 'Identifier_Cookie_or_similar_Tech',
       'Identifier_Device_ID', 'Identifier_IMEI', 'Identifier_IMSI',
       'Identifier_IP_Address', 'Identifier_MAC', 'Identifier_Mobile_Carrier',
       'Identifier_SIM_Serial', 'Identifier_SSID_BSSID', 'Location',
       'Location_Bluetooth', 'Location_Cell_Tower', 'Location_GPS',
       'Location_IP_Address', 'Location_WiFi', 'PERFORMED', 'NOT_PERFORMED'],
      dtype='object')

In [30]:
some_df = finding_location_sentence[(finding_location_sentence['Location_GPS'] == 1) & (finding_location_sentence['contains_synthetic'] == False)]

In [48]:
some_df[['segment_text','Location_GPS', 'contains_synthetic']].sample(10)

Unnamed: 0,segment_text,Location_GPS,contains_synthetic
7211,Geolocation. We can collect your unique user i...,1.0,False
4484,"Read More We may sponsor contests, challenges,...",1.0,False
6494,Precise Device Location Tracking.,1.0,False
115,Some of BANDAI NAMCO's mobile applications may...,1.0,False
8409,(4)Location Data. When you have enabled geogra...,1.0,False
202,g) Location information: When you use BlackBer...,1.0,False
6495,If you authorized us and/or our service provid...,1.0,False
7167,Location information - we collect information ...,1.0,False
7584,"We do not ask you for, access or track any pre...",1.0,False
8170,"Notwithstanding anything else in this policy, ...",1.0,False


In [50]:
some_df[['segment_text','Location_GPS']].loc[8409,'segment_text']

'(4)Location Data. When you have enabled geographical location-based or GPS services on your device in relation to Our Services, we will collect information about your geographical location or GPS position based on the location of the device you are using to access our Services. That information helps us identify your physical location. Please note that we will not store or transfer your GPS information nor do we use such information to specify you.'

# Load crafted features for each target

To help create accurate classifiers, columns will be added to the dataframe for key phrases that may be found in a segment that has been annotated with a specific annotation. For example, the phrases 'phone book', 'phonebook' or 'address book' could be found in segments that have been annotated with the *Contact_Address_Book* annotation, and adding these phrases as columns could help a classifier to correctly identify *Contact_Address_Book*.
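
As a minimal sketch of what such phrase columns could look like, the example below adds one 0/1 column per phrase using a case-insensitive literal match. The toy segments are made up, and the binary encoding is an assumption for illustration. The encoding actually used for modelling is defined in a later notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "segment_text": [
        "We may access your Address Book to find friends.",
        "We collect your email address when you register.",
    ]
})
address_book_phrases = ["phone book", "phonebook", "address book"]

for phrase in address_book_phrases:
    # 1 if the phrase occurs in the segment (case-insensitive literal match), else 0
    toy[phrase] = (
        toy["segment_text"]
        .str.contains(phrase, case=False, regex=False)
        .astype(int)
    )
```

Only the first toy segment mentions an address book, so it is the only row with a 1 in the 'address book' column.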

Story et al. created these features based on their expertise and findings across the train and validation policies and have made them available along with the data.

These will be used as 'crafted feature' columns and for 'sentence filtering' when modelling. ('Sentence filtering' is another preprocessing step that Story et al. found can improve the performance of some classifiers; it is explained in my Modelling Pipeline notebook.)
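As a rough illustration of what a crafted feature column could hold, the sketch below flags whether each phrase occurs in a segment. The function name `crafted_feature_flags`, the example sentence, and the choice of case-insensitive substring matching are my own assumptions for illustration; they are not taken from Story et al. or from the pre-processing notebook.

```python
def crafted_feature_flags(segment_text, phrases):
    """Return {phrase: 0 or 1}, using a case-insensitive substring match.

    Hypothetical helper: the matching rule is an assumption, not the
    method used by Story et al.
    """
    lowered = segment_text.lower()
    return {phrase: int(phrase in lowered) for phrase in phrases}


# Phrases quoted above for the Contact_Address_Book annotation.
phrases = ["phone book", "phonebook", "address book"]

# Made-up segment text, for illustration only.
segment = "We may access your address book to help you find friends."

flags = crafted_feature_flags(segment, phrases)
print(flags)  # {'phone book': 0, 'phonebook': 0, 'address book': 1}
```

Substring matching (rather than whole-word matching) matters for features such as ' we ' that carry their own padding spaces.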

I want all the features in a dataframe with one column naming each target and another holding the list of crafted features related to it. Initially they are in a YAML file, with the crafted features nested under each target.
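The reshaping described above can be sketched on a toy structure. The dictionary below only imitates what a parsed features file might look like: the top-level keys `data_types`, `modalities` and `parties` and the `practices` key appear in the notebook output further down, but the `features` key under `data_types`, the helper names, and the shortened phrase lists are assumptions made for this example. It is not the code inside `priv_pol_funcs`.

```python
# Toy stand-in for the parsed YAML. The "features" key is an assumption.
parsed = {
    "data_types": [
        {"practices": ["Contact_City_1stParty", "Contact_City_3rdParty"],
         "features": ["city", "hometown"]},
        {"practices": ["Contact_Address_Book_1stParty",
                       "Contact_Address_Book_3rdParty"],
         "features": ["phone book", "phonebook", "address book"]},
    ],
    "modalities": {"PERFORMED": ["collect", "access"],
                   "NOT_PERFORMED": ["not collect"]},
    "parties": {"FirstParty": [" we ", " us "], "ThirdParty": ["partner"]},
}


def strip_party(name):
    """Drop a trailing _1stParty / _3rdParty marker from a practice name."""
    for suffix in ("_1stParty", "_3rdParty"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def flatten_features(parsed):
    """Return one {'target', 'features'} row per practice, party and modality."""
    rows = []
    for entry in parsed["data_types"]:
        # dict.fromkeys removes duplicates while keeping the original order.
        targets = list(dict.fromkeys(strip_party(p) for p in entry["practices"]))
        for target in targets:
            rows.append({"target": target, "features": list(entry["features"])})
    for group in ("parties", "modalities"):
        for target, features in parsed[group].items():
            rows.append({"target": target, "features": list(features)})
    return rows


rows = flatten_features(parsed)
for row in rows:
    print(row["target"], row["features"])
```

A list of rows like this can be passed straight to `pd.DataFrame` to get the two-column layout.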

This first function loads a list of each crafted feature for all of the *practices*.

In [65]:
# if running this section of the notebook by itself, you will need to also run these lines of code from above:

# list_of_practice_groups = priv_pol_funcs.get_list_of_practice_groups()
# list_of_practices = [practice for practice_group in list_of_practice_groups for practice in practice_group]
# the_30_practices = [practice.removesuffix("_1stParty").removesuffix("_3rdParty") for practice in list_of_practices]
# the_30_practices = list(dict.fromkeys(the_30_practices))

Demonstrating the function by returning just two of the lists of crafted features (the third and fourth, at indices 2 and 3):

In [66]:
priv_pol_funcs.get_features_for_practices()[2:4] 

29 different groups of features returned, containing 354 individual features.


[['city', 'hometown'],
 ['e-mail address',
  'email address',
  'e-mail and mailing address',
  'email and mailing address',
  'e-mail or mailing address',
  'email or mailing address']]

Ultimately I will add each of these individual features to the dataframe.

In [67]:
features_for_practices = priv_pol_funcs.get_features_for_practices()

29 different groups of features returned, containing 354 individual features.


Creating the dataframe from the list of practices.

The crafted features for 'SSO' are the same as for 'facebook_SSO', so I leave 'SSO' out for now; this also keeps the practice list the same length as the 29 feature groups returned above.

In [68]:
the_29_practices = [practice for practice in the_30_practices if practice != "SSO"]

In [69]:
practice_and_created_features = pd.DataFrame(data = [the_29_practices, features_for_practices]).T
practice_and_created_features.columns = ["practice", "features"]

Adding the features for SSO to the end of the dataframe, using the manually looked-up index of Facebook SSO (row 11).

Author's note: I'm not sure why I had to add SSO separately, nor why I hard-coded the index instead of doing it programmatically.

In [70]:
practice_and_created_features.at[len(practice_and_created_features),"practice"] = "SSO"
practice_and_created_features.at[29,"features"] = practice_and_created_features.at[11,"features"]
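For comparison, a programmatic version of the same step might look like the sketch below. It runs on a small toy dataframe rather than the real `practice_and_created_features`, and the practice label 'facebook_SSO' is written as it appears in the text above; the exact spelling in the real data is an assumption.

```python
import pandas as pd

# Toy stand-in for practice_and_created_features (two rows only).
toy = pd.DataFrame({
    "practice": ["Contact_City", "facebook_SSO"],
    "features": [["city", "hometown"],
                 ["login credentials from one of your accounts"]],
})

# Find the source row by name instead of by a hard-coded index.
source_features = toy.loc[toy["practice"] == "facebook_SSO", "features"].iloc[0]

# Build the new row with its own copy of the list, then append it.
sso_row = pd.DataFrame({"practice": ["SSO"],
                        "features": [list(source_features)]})
toy = pd.concat([toy, sso_row], ignore_index=True)

print(toy)
```

Looking the row up by name means the step keeps working if the order of practices changes.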

Demonstrating the resulting dataframe containing each practice and its crafted features:

In [71]:
practice_and_created_features.head(3)

Unnamed: 0,practice,features
0,Contact,"[contact info, contact details, contact data, ..."
1,Contact_Address_Book,"[phone book, phonebook, contact information in..."
2,Contact_City,"[city, hometown]"


Now I need to grab the features for the other targets (parties and modalities) too, which are also stored in the 'features.yml' file. They are stored near the top of the YAML structure.

In [72]:
with open("APP_350_v1_1/features.yml", "r") as stream:
    try:
        features_yml = (json_normalize(yaml.safe_load(stream)))
    except yaml.YAMLError as exc:
        print(exc)

features_yml

Unnamed: 0,data_types,modalities.PERFORMED,modalities.NOT_PERFORMED,parties.FirstParty,parties.ThirdParty
0,"[{'practices': ['Contact_1stParty', 'Contact_3...","[consent, permission, opt, collect, access, g...","[not collect, no longer collect, not access, n...","[ we , you , us , our , the app, the software]","[partner, third part, third-part, service prov..."


Adding these targets to the dataframe.  I make them into a list of lists, then make this into a new dataframe, then concatenate it with the above.

In [73]:
add_to_df = [
['1st_party', features_yml.loc[0,"parties.FirstParty"] ],
['3rd_party', features_yml.loc[0,"parties.ThirdParty"] ],
['PERFORMED', features_yml.loc[0,"modalities.PERFORMED"] ],
['NOT_PERFORMED', features_yml.loc[0,"modalities.NOT_PERFORMED"] ]
]

In [74]:
annotation_features = pd.concat( [practice_and_created_features, 
                                  pd.DataFrame(add_to_df, columns=['practice', 'features'])],
           axis = 0,
           ignore_index = True)
annotation_features.columns = ['annotation', 'features']

Verifying that all the new targets have been added at the end of the dataframe:

In [75]:
annotation_features.tail(6)

Unnamed: 0,annotation,features
28,Location_WiFi,"[wifi signal, wifi access point, wifi location..."
29,SSO,"[login credentials from one of your accounts, ..."
30,1st_party,"[ we , you , us , our , the app, the software]"
31,3rd_party,"[partner, third part, third-part, service prov..."
32,PERFORMED,"[consent, permission, opt, collect, access, g..."
33,NOT_PERFORMED,"[not collect, no longer collect, not access, n..."


Saving the file for use in other notebooks:

In [76]:
annotation_features.to_pickle('objects/annotation_features.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [77]:
confirm_save_6 = pd.read_pickle('objects/annotation_features.pkl')
print(annotation_features.shape == confirm_save_6.shape)
print(confirm_save_6.equals(annotation_features))

True
True


So now I have a dataframe called "annotation_features" that lists every annotation alongside its crafted features.

# Conclusion

I now have a range of dataframes that I can use for EDA and modelling.

Those dataframes are listed here:

**Dataframes (DF):**<br>
Initial DF – “all_segments_df” – contains policy metadata, segment text, and YAML-embedded annotations and sentences

DF 2 – “segment_annotations” – all the above plus columns for annotations of the form [data_practice, party] (58 additional columns)

DF 3 – “practice_columns_df” – as above with the addition of columns for annotations for each data practice (30 additional columns). Suitable for EDA (not modelling) since it counts number of occurrences of each practice in each segment.

DF 4 – "segment_annots_df" – the initial dataframe plus the same 30 practice columns as DF 3, here in binary form. Essentially it features each segment and the targets. Suitable for modelling because each practice is binary (present or not).

Another dataframe listing each Crafted Feature for each annotation – "annotation_features"

In the pre-processing notebook, the next dataframe to make will be DF 5 – "crafted_features_df" – the same as DF 4 but with crafted features added. It will contain each segment, all crafted features, and target columns for practice, party and modality.
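The difference between the count columns of DF 3 and the binary columns of DF 4 can be shown on a toy example. The values below are invented and the conversion shown is only one plausible way to get from counts to presence flags; it is not necessarily how "segment_annots_df" was built.

```python
import pandas as pd

# Invented counts of each practice per segment (DF 3 style).
counts = pd.DataFrame({
    "Location_GPS": [0, 2, 1],
    "Contact_City": [3, 0, 0],
})

# Any count above zero becomes 1, everything else 0 (DF 4 style).
binary = (counts > 0).astype(int)

print(binary)
```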