# **Privacy Policy Report Card**

## **Milestone 2 Notebook**





---


# Problem Definition and Proposed Solution


---



We are used to consenting to a website’s or app’s privacy policy assuming that our personal data will be used responsibly. This is not always the case. Most privacy policies are either too long to read or too difficult to understand. This means that most of us will have no choice but to continue accepting terms that we don’t fully understand.

We propose training a model using NLP to recognize data usage and collection clauses within online privacy policies, and to present a user with a “report card” outlining what data the service collects about the user, and how that data is used. This should allow users to feel more confident in their decisions regarding online privacy. 




---


# The Data


---



### Labeled Training Data

For our training dataset, we chose the APP 350 Corpus, a collection of 350 privacy policies whose contents have been annotated, paragraph by paragraph, by legal experts. The APP 350 uses 60 annotations to code privacy policy contents. Although the APP 350 is a fairly clean dataset on its own, it still requires some restructuring before it can be used for training.

In [None]:
!unzip Data/APP-350_v1.1.zip
!unzip Data/MAPS_Policies_Dataset_v1.0.zip

Archive:  Data/APP-350_v1.1.zip
replace APP-350_v1.1/documentation/annotator_agreement.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: Archive:  Data/MAPS_Policies_Dataset_v1.0.zip
replace MAPS Policies Dataset/LICENSE? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
import pandas as pd
import yaml
import os

######
#Reading the data
######


directory = os.path.normpath('APP-350_v1.1/annotations')

policy_list = os.listdir(directory)

raw_data = []

for file in policy_list:
    path = os.path.join(directory, file)
    with open(path) as f:
        record = yaml.safe_load(f)

    policy_id = record['policy_id']
    policy_name = record['policy_name']
    contains_synthetic = record['contains_synthetic']

    for segment in record['segments']:
        segment.update({
            'policy_id': policy_id,
            'policy_name': policy_name,
            'contains_synthetic': contains_synthetic, })
        raw_data.append(segment)


with open(os.path.normpath('APP-350_v1.1/features.yml')) as f:
    features = yaml.safe_load(f)

tags = []
for i in features['data_types']:
    for p in i['practices']:
        tags.append(p)


APP_350 = pd.DataFrame(raw_data)

Because the APP 350 uses so many annotations, each individual annotation occurs too rarely to train a model meaningfully. However, these 60 annotations fall into 6 broad categories:

1.   Web Identifier and Trackers
2.   Demographic Info
3.   Contact Info
4.   Location Data
5.   Single Sign On
6.   Sharing with 3rd parties

Aggregating annotations into these categories allows the model to recognize general patterns more easily and makes the final product easier for the user to understand.
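As a rough illustration of this aggregation, the sketch below maps fine-grained practice labels to the broad categories by case-insensitive substring matching. The practice names are hypothetical examples chosen to resemble the corpus's naming pattern; the real list comes from `features.yml`.

```python
# Category keywords, matched against upper-cased practice names.
CATEGORIES = ['3RD', 'LOCATION', 'DEMOGRAPHIC', 'CONTACT', 'IDENTIFIER', 'SSO']

def categories_for(practice):
    """Return every broad category whose keyword appears in the practice label."""
    upper = practice.upper()
    return [cat for cat in CATEGORIES if cat in upper]

# Hypothetical practice names, used only for illustration.
examples = ['Location_GPS_3rdParty', 'Contact_Phone_Number_1stParty', 'Facebook_SSO']
for name in examples:
    print(name, '->', categories_for(name))
```

Note that under this scheme a single practice can count toward more than one category: a third-party location practice is flagged as both `LOCATION` and `3RD`.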


In [None]:
####
#Making the labels
####

def parse_annotations(annotation, tag):
    """
    Function for parsing APP_350 annotations into binary response
    :param annotation: List of dicts containing 'practice' and 'modality' annotations
    :param tag: str. the tag being searched for
    :return: bool - does the annotation contain the given tag
    """
    practice_performed = False
    for n in annotation:
        if n['practice'] == tag and n['modality'] == 'PERFORMED':
            practice_performed = True

    return practice_performed



for tag in tags:
    col_name = 'y_' + tag
    APP_350[col_name] = APP_350['annotations'].apply(parse_annotations, args=[tag])


categories = ['3RD',
              'LOCATION',
              'DEMOGRAPHIC',
              'CONTACT',
              'IDENTIFIER',
              'SSO',
              ]

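# Match only the generated label columns; a plain 'y_' substring test also catches 'policy_id'
targets = [i for i in APP_350.columns if i.startswith('y_')]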

for cat in categories:
    cols = [i for i in targets if cat in i.upper()]
    APP_350[cat] = APP_350[cols].any(axis = 1, bool_only=True)

rel_cols = ['policy_id','policy_name','segment_id', 'segment_text', *categories]

cleaned_data = APP_350[rel_cols]

save_path = os.path.normpath('Data/Labeled_Data.csv')
if not os.path.exists(save_path):
    cleaned_data.to_csv(save_path)


### Unlabeled Tuning Data

The limited availability of labeled training data makes it necessary to scrape unlabeled privacy policies from the web to train larger models. This data must be parsed and divided into paragraphs to make it resemble the training data that will be used for the final classification. 

For this demonstration, we have limited the number of URLs scraped; the base dataset, however, includes 400,000 unlabeled privacy policies.


In [None]:
from bs4 import BeautifulSoup
import pandas as pd


"""
1. take a url as input and make it into a bs4 soup object
"""
from urllib.request import urlopen

def make_soup(url):
    # print('in make_soup')
    html = urlopen(url).read()
    # print(html)
    return BeautifulSoup(html, "lxml")

"""
2. take a soup object and find all the links on the page
"""
def find_links(soup):
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links

"""
3. take a soup object and find all the text on the page
"""
def find_text(soup):
    text = soup.get_text()
    return text

"""
4. take a soup object and find all the images on the page
"""
def find_images(soup):
    images = []
    for image in soup.find_all('img'):
        images.append(image.get('src'))
    return images

"""
5. take a soup object and find all the tables on the page
"""
def find_tables(soup):
    tables = []
    for table in soup.find_all('table'):
        tables.append(table)
    return tables

"""
6. take a soup object and find all the forms on the page
"""
def find_forms(soup):
    forms = []
    for form in soup.find_all('form'):
        forms.append(form)
    return forms

"""
9. take a soup object and find all the divs on the page
"""
def find_divs(soup):
    divs = []
    for div in soup.find_all('div'):
        divs.append(div)
    return divs

"""
10. take a soup object and find all the spans on the page
"""
def find_spans(soup):
    spans = []
    for span in soup.find_all('span'):
        spans.append(span)
    return spans

"""
11. take a soup object and find all the list items on the page
"""
def find_list_items(soup):
    list_items = []
    for list_item in soup.find_all('li'):
        list_items.append(list_item)
    return list_items

"""
12. take a soup object and find all the unordered list on the page
"""
def find_unordered_lists(soup):
    unordered_lists = []
    for unordered_list in soup.find_all('ul'):
        unordered_lists.append(unordered_list)
    return unordered_lists

"""
13. take a soup object and find all the ordered list on the page
"""
def find_ordered_lists(soup):
    ordered_lists = []
    for ordered_list in soup.find_all('ol'):
        ordered_lists.append(ordered_list)
    return ordered_lists

"""
14. take a soup object and find all the h1-h6 on the page
"""
import re
def find_headings(soup):
    headings = []
    for heading in soup.find_all(re.compile('^h[1-6]$')):
        headings.append(heading)
    return headings

"""
15. take a soup object and find all the paragraphs on the page
"""
def find_paragraphs(soup):
    paragraphs = []
    for paragraph in soup.find_all('p'):
        paragraphs.append(paragraph)
    return paragraphs

"""
16. Given a list of URLs, extract the paragraphs and return a pandas DataFrame
"""
def create_df(urls):
    print("in create_df")
    rows = []
    for i, url in enumerate(urls):
        try:
            soup = make_soup(url)
            paragraphs = find_paragraphs(soup)
            # One row per paragraph: (source URL, position on the page, text)
            rows.extend((url, j, p.text) for j, p in enumerate(paragraphs))
        except Exception:
            # Skip pages that fail to download or parse, but report them
            print('error at index', i, url)

    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['url', 'paragraph_index', 'paragraph_text'])


path_to_policy_urls_csv = os.path.normpath('MAPS Policies Dataset/april_2018_policies.csv')
num_policies_to_extract = 100
# load the privacy policies URL csv file
df_policies = pd.read_csv(path_to_policy_urls_csv)
# print(df_policies.head(5))

# select the first num_policies_to_extract urls for the demo
df_policies = df_policies.iloc[:num_policies_to_extract]

# extract urls of privacy policies
# TODO: Needs to be modified to include all urls in the dataset. Currently just looks at the first "Final URL" of each row.
urls = [item.split("'Final URL': '")[1].split("'}")[0] for item in df_policies['Policy Sources']]
# print(urls)
# Create a dataframe with the policy texts
df = create_df(urls)
print(df.head())
print(f"DataFrame consists of {df.shape[0]} paragraphs from {num_policies_to_extract} policies.")

# Save dataframe as csv file
df.to_csv('policy_texts.csv')

in create_df
error at index 3 http://www.kidzooly.com/privacy.html
error at index 22 http://www.kidzooly.com/privacy.html
error at index 25 http://www.kidloland.com/privacypolicy.php
error at index 26 http://www.kidzooly.com/privacy.html#privacy
error at index 28 http://corporate.mattel.com/privacy-statement-shared.aspx
error at index 29 http://corporate.mattel.com/privacy-statement-shared.aspx
error at index 32 http://corporate.mattel.com/privacy-statement-shared.aspx
error at index 33 http://corporate.mattel.com/privacy-statement-shared.aspx
error at index 36 http://corporate.mattel.com/privacy-statement-shared.aspx
error at index 37 http://ellenwhite.org/content/article/egw-writings-privacy-policy
error at index 38 http://ellenwhite.org/content/article/egw-writings-privacy-policy
error at index 40 http://ellenwhite.org/content/article/egw-writings-privacy-policy?numFound=2&collection=true&query=privacy+policy&curr=0&sqid=591334551
error at index 41 https://www.bible.com/privacy
erro

### EDA


In [None]:
######
#Labeled Data
#######

cleaned_data.sample(5)

Unnamed: 0,policy_id,policy_name,segment_id,segment_text,3RD,LOCATION,DEMOGRAPHIC,CONTACT,IDENTIFIER,SSO
4305,243,io.utk.android,2,Passwords The UTK.io staff will NEVER ask for ...,False,False,False,False,False,False
4739,17,com.atomicadd.fotos,11,Location information When you use AtomicAdd se...,False,True,False,False,True,False
7824,98,Xender,7,(d) We may collect and use such data for promo...,False,False,False,False,False,False
6871,336,Viber,39,Here are a few additional important things you...,False,False,False,False,True,False
12536,164,com.eharmony,10,"Purchase Information. To process purchases, we...",False,False,False,True,False,False


In [None]:
cleaned_data.describe()

Unnamed: 0,policy_id,segment_id
count,15507.0,15507.0
mean,174.877862,48.647127
std,102.593012,75.95288
min,1.0,0.0
25%,82.0,11.0
50%,171.0,26.0
75%,263.0,52.0
max,350.0,607.0


In [None]:
######
#Unlabeled Data
#######
df.head()

Unnamed: 0,paragraph_index,paragraph_text,url
0,0.0,Privacy Policy & Term of Use (Last updated Se...,https://www.eznetsoft.com/index.php/about-us/p...
1,1.0,Samuel J or Eznetsoft is committed to protecti...,https://www.eznetsoft.com/index.php/about-us/p...
2,2.0,"In general, you can visit us on the Web withou...",https://www.eznetsoft.com/index.php/about-us/p...
3,3.0,"Samuel J or Eznetsoft will not sell, rent or d...",https://www.eznetsoft.com/index.php/about-us/p...
4,4.0,"Information that we gather and track, in accor...",https://www.eznetsoft.com/index.php/about-us/p...


### Prepping the data for modeling


In [None]:
import pandas as pd
import os
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
import re
import string

data = cleaned_data
x = data.segment_text
y = data[[
    'IDENTIFIER',
    '3RD',
    'LOCATION',
    'DEMOGRAPHIC',
    'CONTACT',
    'SSO']]


y = y.astype(int)

x_train, x_test, y_train, y_test = train_test_split(x,y)



BATCH_SIZE = 32
TRAIN_SHUFFLE_BUFFER_SIZE = len(x_train)
VALIDATION_SHUFFLE_BUFFER_SIZE = len(x_test)
AUTOTUNE = tf.data.experimental.AUTOTUNE

def standardize_text(input_string):
  output_string = tf.strings.lower(input_string)
  output_string = tf.strings.regex_replace(output_string, "<br />", " ")
  output_string = tf.strings.regex_replace(output_string, "[%s]" % re.escape(string.punctuation), "")
  return output_string

text_vectorizer = keras.layers.TextVectorization(
    standardize=standardize_text,
    max_tokens=25000,
    output_mode="int",
    output_sequence_length=1500,
)

text_data = tf.data.Dataset.from_tensor_slices(x.values)
text_vectorizer.adapt(text_data.batch(64))

# Call the layer directly; Layer.apply is deprecated in favor of calling the layer
x_train = text_vectorizer(x_train.values)
x_test = text_vectorizer(x_test.values)





---


# Models



---



Our project's central task is text classification. However, it poses one particular challenge: because our labels are independent, any given text can belong to multiple “classes”. For example, a given paragraph may contain clauses that allow for the collection of both location data and demographic information.
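To make the multi-label setup concrete, here is a minimal sketch of how one paragraph's labels can be encoded as a vector of independent 0/1 flags. The helper name `to_multi_hot` is illustrative and not part of the notebook's pipeline; the label order follows the columns selected for `y` above.

```python
LABELS = ['IDENTIFIER', '3RD', 'LOCATION', 'DEMOGRAPHIC', 'CONTACT', 'SSO']

def to_multi_hot(present):
    """Encode a set of category names as a 0/1 vector in LABELS order."""
    return [1 if label in present else 0 for label in LABELS]

# A paragraph that permits collecting both location and demographic data.
vector = to_multi_hot({'LOCATION', 'DEMOGRAPHIC'})
print(vector)  # [0, 0, 1, 1, 0, 0]
```

Unlike a one-hot vector in single-label classification, any number of positions may be set at once, including none.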

To address this challenge, we intend to implement an ensemble of binary-response outputs, which allows for this kind of multi-label classification. Our current approach is to train a separate binary-response model for each category and to evaluate each one independently.
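One way the per-category models could be combined into a report card is sketched below. This is an assumption about the eventual design rather than the implemented pipeline: `build_report_card`, the keyword-based stand-in predictors, and the rule of flagging a category when any segment exceeds the threshold are all hypothetical.

```python
def build_report_card(segments, predictors, threshold=0.5):
    """Flag a category if any segment's predicted probability exceeds threshold.

    predictors maps a category name to a callable that takes a segment string
    and returns a probability in [0, 1]; each stands in for a trained model.
    """
    report = {}
    for category, predict in predictors.items():
        report[category] = any(predict(seg) > threshold for seg in segments)
    return report

# Stand-in "models" based on keywords, used only to make the sketch runnable.
toy_predictors = {
    'LOCATION': lambda text: 0.9 if 'location' in text.lower() else 0.1,
    'CONTACT': lambda text: 0.9 if 'email' in text.lower() else 0.1,
}
policy = ['We collect your precise location.', 'We never sell your data.']
print(build_report_card(policy, toy_predictors))
```

Because each category has its own predictor, the models can be trained, thresholded, and swapped out independently of one another.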

Because of the computational expense of building and training state-of-the-art (SOTA) NLP models, we have chosen to prototype with fairly simple, home-grown models for iteration and testing. The best method we have found for training these is demonstrated below.


In [None]:
import os
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from sklearn.metrics import confusion_matrix

#import matplotlib.pyplot as plt
y_label = "IDENTIFIER"


y_trainM = y_train[y_label]
y_testM  = y_test[y_label]


####
#Prototype/Proof of Concept Model
####

def binary_cnn_with_embeddings(sequence_length, vocab_size, embedding_dim,
                              model_name='cnn_with_embeddings'):
    model_input = keras.layers.Input(shape=(sequence_length,))

    hidden = keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, name="embedding")(model_input)

    hidden = keras.layers.Conv1D(filters=256, kernel_size=5, padding="valid", activation="relu", strides=3)(hidden)
    hidden = keras.layers.GlobalMaxPooling1D()(hidden)

    hidden = keras.layers.Dense(units=128, activation="tanh")(hidden)
    hidden = keras.layers.Dense(units=64, activation="tanh")(hidden)
    hidden = keras.layers.Dense(units=32, activation="tanh")(hidden)

    output = keras.layers.Dense(units=1, activation='sigmoid')(hidden)

    # Create model
    model = Model(inputs=model_input, outputs=output, name=model_name)

    return model

learning_rate = 0.01
epochs = 3  # number of epochs used for this prototype run
embedding_dim = 100
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
loss = keras.losses.BinaryCrossentropy()

model = binary_cnn_with_embeddings(1500, 25000, embedding_dim)
model.compile(optimizer=optimizer, loss=loss, metrics=['binary_accuracy'])
model.summary()
# Weight the positive class more heavily to offset class imbalance
model.fit(x=x_train, y=y_trainM.values, validation_data=(x_test, y_testM.values),
          epochs=epochs, class_weight={0: 0.1, 1: 0.9})

Model: "cnn_with_embeddings"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 1500)]            0         
_________________________________________________________________
embedding (Embedding)        (None, 1500, 100)         2500000   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 499, 256)          128256    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_9 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_10 (Dense)             (None, 32)        

<keras.callbacks.History at 0x7f752d5de6d0>

In [None]:
confusion_matrix(y_testM, model.predict(x_test)>.5)

array([[3206,  226],
       [  47,  398]])
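
The confusion matrix above can be summarized with standard metrics. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and copies the four counts by hand from the output.

```python
# Counts copied from the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp = 3206, 226
fn, tp = 47, 398

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

For the `IDENTIFIER` model this works out to roughly 0.64 precision and 0.89 recall, which fits the heavier weight placed on the positive class during training: the model misses few positive segments at the cost of more false positives.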

Even our simple prototype models are yielding promising results with the training data, and we believe that our SOTA models, fine-tuned with unlabeled training data, will be able to produce even better results.

Our experiments in fine-tuning some of these models are outlined in our second notebook.