Dataset and related content taken from Natural Language Processing for Mobile App Privacy Compliance. Peter Story, Sebastian Zimmeck, Abhilasha Ravichander, Daniel Smullen, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. AAAI Spring Symposium on Privacy Enhancing AI and Language Technologies (PAL 2019), Mar 2019.

Available here – APP350: https://usableprivacy.org/data

EDA should be done in the context of my data: I should state what I expect BEFORE checking for it in the data.

The top of the notebook can also hold links, context, and goals.

In [1]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm
import statsmodels.api as sm

import sys
from collections import defaultdict
from collections import Counter

import ds_utils_callum
import priv_policy_manipulation_functions as priv_pol_funcs

Load all of the policies into a dataframe.

## Populating top-level df

In [2]:
all_policies_df = priv_pol_funcs.load_all_policies()
all_policies_df.head(5)

Unnamed: 0,policy_id,policy_name,policy_type,contains_synthetic,segments
0,1,6677G,TEST,False,"[{'segment_id': 0, 'segment_text': 'PRIVACY PO..."
1,2,AIFactory,TEST,False,"[{'segment_id': 0, 'segment_text': 'AI Factory..."
2,3,AppliqatoSoftware,TEST,False,"[{'segment_id': 0, 'segment_text': 'Automatic ..."
3,4,BandaiNamco,TEST,False,"[{'segment_id': 0, 'segment_text': 'MOBILE APP..."
4,5,BarcodeScanner,TEST,False,"[{'segment_id': 0, 'segment_text': 'Skip to co..."


In [3]:
all_policies_df = priv_pol_funcs.add_metadata_to_policy_df(all_policies_df)
all_policies_df.head(5)

Unnamed: 0,policy_id,policy_name,policy_type,contains_synthetic,segments,num_segments,num_annotated_segs,total_characters
0,1,6677G,TEST,False,"[{'segment_id': 0, 'segment_text': 'PRIVACY PO...",36,11,12703
1,2,AIFactory,TEST,False,"[{'segment_id': 0, 'segment_text': 'AI Factory...",14,5,5995
2,3,AppliqatoSoftware,TEST,False,"[{'segment_id': 0, 'segment_text': 'Automatic ...",8,1,2450
3,4,BandaiNamco,TEST,False,"[{'segment_id': 0, 'segment_text': 'MOBILE APP...",57,14,32323
4,5,BarcodeScanner,TEST,False,"[{'segment_id': 0, 'segment_text': 'Skip to co...",32,3,6667


# Making segments

Now to make a new dataframe where each row represents a paragraph (segment).

First I will get this to work for a single policy. Then I will loop through all the policies to apply the same manipulation.

In [4]:
all_segments_df = priv_pol_funcs.generate_segment_df(all_policies_df)
all_segments_df.head()

segment_id,source_policy_number,policy_segment_id,segment_text,annotations,sentences
0,1,0,PRIVACY POLICY This privacy policy (hereafter ...,[],[]
1,1,1,1. ABOUT OUR PRODUCTS 1.1 Our products offer a...,[],[]
2,1,2,2. THE INFORMATION WE COLLECT The information ...,[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': 'IP ADDRESS, COOKIES, AND W..."
3,1,3,"2.2 In addition, we store certain information ...",[{'practice': 'Identifier_Cookie_or_similar_Te...,"[{'sentence_text': '6677g may use cookies, web..."
4,1,4,(c) to remember your preferences and registrat...,[],[]
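The body of `generate_segment_df` is not shown here, so the following is only a minimal sketch of how the policy-to-segment step could be done with `explode`. The toy data and the names `toy_policies` and `toy_segments` are made up for illustration, and it assumes each segment dictionary carries `segment_id`, `segment_text`, `annotations` and `sentences` keys.

```python
import pandas as pd

# Toy stand-in for all_policies_df (hypothetical data, same assumed structure).
toy_policies = pd.DataFrame({
    "policy_id": [1, 2],
    "segments": [
        [
            {"segment_id": 0, "segment_text": "PRIVACY POLICY",
             "annotations": [], "sentences": []},
            {"segment_id": 1, "segment_text": "We collect IP addresses.",
             "annotations": [{"practice": "Identifier_IP_Address_1stParty",
                              "modality": "PERFORMED"}],
             "sentences": []},
        ],
        [
            {"segment_id": 0, "segment_text": "AI Factory",
             "annotations": [], "sentences": []},
        ],
    ],
})

# One row per segment dictionary, keeping the policy it came from.
exploded = toy_policies[["policy_id", "segments"]].explode("segments", ignore_index=True)

# Turn each segment dictionary into its own set of columns.
segment_cols = pd.DataFrame(exploded["segments"].tolist())

toy_segments = pd.concat(
    [exploded["policy_id"].rename("source_policy_number"), segment_cols], axis=1
).rename(columns={"segment_id": "policy_segment_id"})
toy_segments.index.name = "segment_id"
```

The new running index plays the role of the global `segment_id`, while `policy_segment_id` keeps the position of the segment inside its own policy.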


In [5]:
all_segments_df.shape

(15543, 5)

In [6]:
all_segments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15543 entries, 0 to 15542
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   source_policy_number  15543 non-null  int64 
 1   policy_segment_id     15543 non-null  int64 
 2   segment_text          15543 non-null  object
 3   annotations           15543 non-null  object
 4   sentences             15543 non-null  object
dtypes: int64(2), object(3)
memory usage: 607.3+ KB


# Next step of extraction

I probably want to split the data down to the sentence level, since that is the finest granularity of the annotations and some paragraphs are only one sentence anyway.

In [7]:
all_segments_df.loc[2,:]

source_policy_number                                                    1
policy_segment_id                                                       2
segment_text            2. THE INFORMATION WE COLLECT The information ...
annotations             [{'practice': 'Identifier_Cookie_or_similar_Te...
sentences               [{'sentence_text': 'IP ADDRESS, COOKIES, AND W...
Name: 2, dtype: object

In [8]:
all_segments_df.loc[2, 'segment_text']

"2. THE INFORMATION WE COLLECT The information that our products collect includes (among others) the following: A) IP ADDRESS, COOKIES, AND WEB BEACONS 2.1 When you visit our products, our servers automatically save your computer's IP address. IP addresses will be collected, along with information about the actual web pages that you visit on our products. If you arrive at our products via a link from another product, the URL of the linking product and the URL of any product that you link to next will also be collected."

In [9]:
all_segments_df.loc[2, 'annotations']

[{'practice': 'Identifier_Cookie_or_similar_Tech_1stParty',
  'modality': 'PERFORMED'},
 {'practice': 'Identifier_IP_Address_1stParty', 'modality': 'PERFORMED'}]

In [10]:
all_segments_df.loc[2, 'sentences']

[{'sentence_text': 'IP ADDRESS, COOKIES, AND WEB BEACONS',
  'annotations': [{'practice': 'Identifier_Cookie_or_similar_Tech_1stParty',
    'modality': 'PERFORMED'},
   {'practice': 'Identifier_IP_Address_1stParty', 'modality': 'PERFORMED'}]},
 {'sentence_text': 'IP addresses will be collected, along with information about the actual web pages that you visit on our products.',
  'annotations': [{'practice': 'Identifier_IP_Address_1stParty',
    'modality': 'PERFORMED'}]},
 {'sentence_text': 'The information that our products collect includes (among others) the following:',
  'annotations': [{'practice': 'Identifier_Cookie_or_similar_Tech_1stParty',
    'modality': 'PERFORMED'},
   {'practice': 'Identifier_IP_Address_1stParty', 'modality': 'PERFORMED'}]},
 {'sentence_text': "When you visit our products, our servers automatically save your computer's IP address.",
  'annotations': [{'practice': 'Identifier_IP_Address_1stParty',
    'modality': 'PERFORMED'}]}]
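Given the structure of the `sentences` column shown above, a sentence-level dataframe could be built roughly as follows. This is a sketch on made-up toy data (`toy_segments` and `toy_sentences` are hypothetical names), not the final extraction code.

```python
import pandas as pd

# Toy segment frame with the same nested 'sentences' structure (hypothetical data).
toy_segments = pd.DataFrame({
    "source_policy_number": [1, 1],
    "policy_segment_id": [0, 2],
    "sentences": [
        [],
        [
            {"sentence_text": "IP ADDRESS, COOKIES, AND WEB BEACONS",
             "annotations": [{"practice": "Identifier_IP_Address_1stParty",
                              "modality": "PERFORMED"}]},
            {"sentence_text": "IP addresses will be collected.",
             "annotations": [{"practice": "Identifier_IP_Address_1stParty",
                              "modality": "PERFORMED"}]},
        ],
    ],
})

# explode() gives one row per sentence dictionary; an empty list becomes
# a single NaN row, which is dropped here.
exploded = (
    toy_segments.explode("sentences")
    .dropna(subset=["sentences"])
    .reset_index(drop=True)
)

# Expand each sentence dictionary into sentence_text / annotations columns.
sentence_cols = pd.DataFrame(exploded["sentences"].tolist())
toy_sentences = pd.concat([exploded.drop(columns="sentences"), sentence_cols], axis=1)
```

Segments whose `sentences` list is empty disappear from the result, so the segment-level dataframe is still needed for the unannotated text.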

In [11]:
all_segments_df.loc[3, 'segment_text']

'2.2 In addition, we store certain information from your browser, using "cookies." A cookie is a piece of data stored on the user\'s computer and is tied to information about the user. 6677g may use cookies, web beacons (web bugs), or similar technologies to enhance and personalize your experience on our products, including the following: (a) to operate and improve offerings on our products; (b) to help authenticate you when you are on our products;'

In [12]:
all_segments_df.loc[4, 'segment_text']

'(c) to remember your preferences and registration information, as applicable;'

In [13]:
all_segments_df.loc[5, 'segment_text']

'(d) to present and help measure and research the effectiveness of 6677g offerings, advertisements, and email communications; and'

In [14]:
all_segments_df.loc[6, 'segment_text']

'(e) to customize the content and advertisements provided to you through our products.'

In [15]:
all_segments_df.loc[7, 'segment_text']

'2.3 6677g may also use ad network providers to help present advertisements on our products. These ad network providers use cookies, web beacons, or similar technologies to help the presenting, better targeting, and measuring of the effectiveness of their advertisements, using data gathered over time and across their networks of web pages to determine or predict the characteristics and preferences of their audiences. 6677g offers some services in connection with other products. Personal information that you provide to those sites may be sent to 6677g in order to deliver these services. 6677g processes such information in accordance with this Privacy Policy.'

## Extracting list of practices

Using my helper function to get the list of practices from the APP350 documentation.

In [21]:
list_of_practice_groups = priv_pol_funcs.get_list_of_practice_groups()

29 different groups of practices returned, containing 58 individual practices.


In [27]:
# Flatten the list of lists with a list comprehension
list_of_practices = [practice for practice_group in list_of_practice_groups for practice in practice_group]
print(f"There are {len(list_of_practices)} different practices.")

There are 58 different practices.


In [23]:
list_of_practices[:5]

['Contact_1stParty',
 'Contact_3rdParty',
 'Contact_Address_Book_1stParty',
 'Contact_Address_Book_3rdParty',
 'Contact_City_1stParty']
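The same flattening can also be written with `itertools.chain.from_iterable`. The sketch below uses a made-up two-group list (the real grouping may differ) and adds a check that no practice name appears twice, since the names later become column labels.

```python
from itertools import chain

# Hypothetical stand-in for list_of_practice_groups.
toy_groups = [
    ["Contact_1stParty", "Contact_3rdParty"],
    ["Contact_Address_Book_1stParty", "Contact_Address_Book_3rdParty"],
]

# Flatten one level of nesting, preserving order.
flat = list(chain.from_iterable(toy_groups))

# Duplicate names would produce duplicate column labels later on.
has_duplicates = len(flat) != len(set(flat))
```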

## Applying all annotations to columns in segment dataframe

In [115]:
segment_annotations = all_segments_df.copy()

# Make one zero-filled column per practice
annotation_df = pd.DataFrame(data=0, columns=list_of_practices, index=range(len(all_segments_df)))

# Put the practice columns onto the segment dataframe
segment_annotations = pd.concat([segment_annotations, annotation_df], axis=1)

print(f"The shape of the segment_annotations dataframe is {segment_annotations.shape}") # Verify

The shape of the segment_annotations dataframe is (15543, 63)


In [50]:
# segment_annotations.head(2) # Verify columns added

In [117]:
# For every segment, add 1 to the column of each practice found in its annotations
for index in range(len(segment_annotations)):
    practices_dictionaries = segment_annotations.loc[index, 'annotations']
    for each_practice in practices_dictionaries:
        segment_annotations.loc[index, each_practice['practice']] += 1

In [113]:
# Verify final row annotated
segment_annotations.iloc[15542,5:].sum()

4
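The row-by-row loop works, but repeated `.loc` writes get slow on larger frames. A possible vectorised alternative is sketched below on made-up data: it explodes the `annotations` lists and counts practices per segment with `pd.crosstab`. The names `toy`, `practices` and `counts`, and the second modality value, are illustrative assumptions.

```python
import pandas as pd

# Hypothetical practice list and segment frame (toy data).
practices = ["Contact_E_Mail_Address_1stParty", "Identifier_IP_Address_1stParty"]
toy = pd.DataFrame({
    "annotations": [
        [],
        [{"practice": "Identifier_IP_Address_1stParty", "modality": "PERFORMED"}],
        [{"practice": "Contact_E_Mail_Address_1stParty", "modality": "PERFORMED"},
         {"practice": "Contact_E_Mail_Address_1stParty", "modality": "NOT_PERFORMED"}],
    ]
})

# One row per (segment, annotation) pair; segments without annotations drop out.
pairs = toy["annotations"].explode().dropna()
practice_names = pairs.map(lambda annotation: annotation["practice"])

# Count each practice per segment, then restore the missing segments and
# practices as zero-filled rows and columns.
counts = pd.crosstab(
    practice_names.index.to_numpy(), practice_names.to_numpy()
).reindex(index=toy.index, columns=practices, fill_value=0)
```

The resulting `counts` frame has the same shape as the zero-initialised practice columns, so it could be concatenated onto the segment dataframe in one step.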

In [123]:
# Any paragraphs with more than one of the same annotation?
segment_annotations.iloc[:,5:].max() # DataFrame.max() defaults to axis=0, giving the maximum of each column

Contact_1stParty                              1
Contact_3rdParty                              1
Contact_Address_Book_1stParty                 1
Contact_Address_Book_3rdParty                 1
Contact_City_1stParty                         1
Contact_City_3rdParty                         1
Contact_E_Mail_Address_1stParty               2
Contact_E_Mail_Address_3rdParty               1
Contact_Password_1stParty                     1
Contact_Password_3rdParty                     1
Contact_Phone_Number_1stParty                 1
Contact_Phone_Number_3rdParty                 1
Contact_Postal_Address_1stParty               1
Contact_Postal_Address_3rdParty               1
Contact_ZIP_1stParty                          1
Contact_ZIP_3rdParty                          1
Demographic_1stParty                          1
Demographic_3rdParty                          1
Demographic_Age_1stParty                      1
Demographic_Age_3rdParty                      1
Demographic_Gender_1stParty             

Two rows to investigate.
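To find the actual paragraphs behind a column maximum above 1, a boolean mask over the practice columns is one option. The sketch below uses a made-up count table (`toy_counts` is hypothetical); on the real data the same expressions would apply to `segment_annotations.iloc[:, 5:]`.

```python
import pandas as pd

# Hypothetical practice-count table: rows are segments, columns are practices.
toy_counts = pd.DataFrame({
    "Contact_E_Mail_Address_1stParty": [0, 2, 1],
    "Contact_City_1stParty": [1, 0, 0],
})

# True wherever the same practice is annotated more than once in a segment.
repeated = toy_counts > 1

# Segments with at least one repeated practice, and the practices responsible.
rows_to_check = toy_counts.index[repeated.any(axis=1)].tolist()
cols_to_check = toy_counts.columns[repeated.any(axis=0)].tolist()
```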

In [124]:
# What does this mean?
max(segment_annotations.iloc[:,5:])

KeyError: 0

In [102]:
segment_annotations.iloc[:,5:].sum() # Number of paragraphs with each annotation

Contact_1stParty                               211
Contact_3rdParty                                36
Contact_Address_Book_1stParty                  219
Contact_Address_Book_3rdParty                   14
Contact_City_1stParty                           65
Contact_City_3rdParty                            8
Contact_E_Mail_Address_1stParty               1111
Contact_E_Mail_Address_3rdParty                143
Contact_Password_1stParty                      224
Contact_Password_3rdParty                       18
Contact_Phone_Number_1stParty                  565
Contact_Phone_Number_3rdParty                   69
Contact_Postal_Address_1stParty                364
Contact_Postal_Address_3rdParty                 62
Contact_ZIP_1stParty                            93
Contact_ZIP_3rdParty                            24
Demographic_1stParty                           146
Demographic_3rdParty                            73
Demographic_Age_1stParty                       260
Demographic_Age_3rdParty       

In [101]:
segment_annotations.iloc[:,5:].sum().sum() # Total paragraph annotations

10215

Need to investigate Identifier_SSID_BSSID_3rdParty because it only has 2 annotated paragraphs.
Would also like to check Contact_City_3rdParty (which has 8).
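A small helper would make it easy to read the paragraphs behind a rare practice. This is a sketch on made-up text; `segments_with_practice` is a hypothetical function, and it assumes the practice columns hold per-segment counts as built above.

```python
import pandas as pd

# Toy frame with a text column and two practice-count columns (invented text).
toy = pd.DataFrame({
    "segment_text": [
        "We log the SSID of your network.",
        "We use cookies.",
        "Partners may receive your city.",
    ],
    "Identifier_SSID_BSSID_3rdParty": [1, 0, 0],
    "Contact_City_3rdParty": [0, 0, 1],
})


def segments_with_practice(df, practice):
    """Return the text of every segment annotated with the given practice."""
    return df.loc[df[practice] > 0, "segment_text"]


ssid_segments = segments_with_practice(toy, "Identifier_SSID_BSSID_3rdParty")
city_segments = segments_with_practice(toy, "Contact_City_3rdParty")
```

On the real data, the call would be `segments_with_practice(segment_annotations, "Identifier_SSID_BSSID_3rdParty")`, followed by reading each returned paragraph in full.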

## Loop through the segments to encode the annotations

## Demo EDA