# OPP-115 Corpus (ACL 2016)
The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of natural-language website privacy policies with annotations that identify the data practices described in the text. Each privacy policy was read and annotated by three graduate students in law.

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.

**Structure and contents of OPP-115** (paper: https://www.usableprivacy.org/static/files/swilson_acl_2016.pdf)

**Annotation Scheme**

The annotation scheme below is designed to capture the data practices specified by privacy policies. The final scheme consists of ten practice categories:

1. *First Party Collection/Use*: how and why a service provider collects user information.
2. *Third Party Sharing/Collection*: how user information may be shared with or collected by third parties.
3. *User Choice/Control*: choices and control options available to users.
4. *User Access, Edit, & Deletion*: if and how users may access, edit, or delete their information.
5. *Data Retention*: how long user information is stored.
6. *Data Security*: how user information is protected.
7. *Policy Change*: if and how users will be informed about changes to the privacy policy.
8. *Do Not Track*: if and how Do Not Track signals for online tracking and advertising are honored.
9. *International & Specific Audiences*: practices that pertain only to a specific group of users (e.g., children, Europeans, or California residents).
10. *Other*: additional sub-labels for introductory or general text, contact information, and practices not covered by the other categories.

An individual data practice belongs to one of the ten categories above, and it is articulated by a category-specific set of attributes. For example, a User Choice/Control data practice is associated with four mandatory attributes (Choice Type, Choice Scope, Personal Information Type, Purpose) and one optional attribute (User Type). The annotation scheme defines a set of potential values for each attribute. To ground the data practice in the policy text, each attribute also may be associated with a text span in the privacy policy.

The set of mandatory and optional attributes reflects the potential level of specificity with which a data practice of a given category may be described. Optional attributes are less common, while mandatory attributes are necessary to represent a data practice. However, privacy policies are often vague or ambiguous on many of these attributes. Therefore, a valid value for each attribute is Unspecified, allowing annotators to express an absence of information.
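
As a rough illustration of the scheme described above, a single data practice could be modeled as a category plus a mapping from attribute names to values, with `Unspecified` as the fallback for mandatory attributes. The names `AttributeValue`, `DataPractice`, and `make_user_choice_practice` are hypothetical and not part of the corpus, and the value `"Opt-out link"` is only a placeholder. Only the attribute names come from the User Choice/Control example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

UNSPECIFIED = "Unspecified"

# Attribute names taken from the User Choice/Control example in the text.
USER_CHOICE_MANDATORY = ("Choice Type", "Choice Scope",
                         "Personal Information Type", "Purpose")
USER_CHOICE_OPTIONAL = ("User Type",)


@dataclass
class AttributeValue:
    """One attribute of a data practice (hypothetical helper class)."""
    value: str = UNSPECIFIED
    selected_text: Optional[str] = None  # text span grounding the value, if any


@dataclass
class DataPractice:
    """A data practice: one category plus its attributes (hypothetical)."""
    category: str
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


def make_user_choice_practice(values: Dict[str, str]) -> DataPractice:
    """Build a User Choice/Control practice, defaulting mandatory attributes."""
    unknown = set(values) - set(USER_CHOICE_MANDATORY) - set(USER_CHOICE_OPTIONAL)
    if unknown:
        raise ValueError(f"Unknown attributes: {sorted(unknown)}")
    # Mandatory attributes are always present; missing ones become Unspecified.
    attributes = {name: AttributeValue(values.get(name, UNSPECIFIED))
                  for name in USER_CHOICE_MANDATORY}
    # Optional attributes are stored only when a value was supplied.
    for name in USER_CHOICE_OPTIONAL:
        if name in values:
            attributes[name] = AttributeValue(values[name])
    return DataPractice("User Choice/Control", attributes)


# "Opt-out link" is a placeholder value for illustration only.
practice = make_user_choice_practice({"Choice Type": "Opt-out link"})
print(practice.attributes["Purpose"].value)
```

Keeping mandatory attributes always present, even when `Unspecified`, mirrors the idea that annotators can record an absence of information explicitly.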

**Related work/data exploration:**

* https://github.com/pmayostendorp/beforeiaccept/blob/master/scripts/OPP%20Data%20Exploration.ipynb

# Exploring data

**Importing libraries:**

In [34]:
import pandas as pd
import numpy as np

In [35]:
pwd

'/Users/vildearntze/Desktop/CS299'

Set `path` to the output of the `pwd` command. This assumes the OPP-115 folder is saved in the same folder as Data handling.ipynb.

In [36]:
path = '/Users/vildearntze/Desktop/CS299'
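
A possible alternative to hard-coding the path is to build it from the current working directory with `pathlib`. This is a minimal sketch that assumes the notebook is run from the folder containing `OPP-115`; the variable names `base_dir`, `annotations_dir`, and `csv_path` are made up for illustration.

```python
from pathlib import Path

# Assumption: the notebook runs from the folder that contains OPP-115,
# so the current working directory plays the role of `path`.
base_dir = Path.cwd()
annotations_dir = base_dir / "OPP-115" / "annotations"
csv_path = annotations_dir / "20_theatlantic.com.csv"
print(csv_path)
```

`pd.read_csv` accepts a `Path` object directly, so `csv_path` can be passed in place of a concatenated string.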

In [31]:
# The CSV has no header row, so pandas labels the columns 0-8.
annotations = pd.read_csv(path + '/OPP-115/annotations/20_theatlantic.com.csv', delimiter=',', header=None)

# Give the columns used below descriptive names. pop() removes the numbered
# column and the assignment re-adds it at the end under its new name.
renames = [(1, "batch_id"), (7, "date"), (8, "policy_url"),
           (5, "category_name"), (6, "attributes_value_pairs"), (4, "segment_id")]
for old, new in renames:
    annotations[new] = annotations.pop(old)
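
The column naming could also be done while reading the file, by passing `names=` to `pd.read_csv`. The sketch below reads a single in-memory row instead of the real file so that it runs on its own. The names `annotation_id`, `annotator_id`, and `policy_id` for columns 0, 2, and 3 are assumptions (the notebook leaves those columns unnamed), and the JSON cell is shortened for readability.

```python
import io

import pandas as pd

# Names for columns 0, 2 and 3 are assumptions; the rest follow the notebook.
COLUMN_NAMES = [
    "annotation_id", "batch_id", "annotator_id", "policy_id", "segment_id",
    "category_name", "attributes_value_pairs", "date", "policy_url",
]

# One row modeled on the head() output, with a shortened JSON cell.
sample_csv = io.StringIO(
    '2840,test_category_labeling_highlight,84,3635,0,Other,'
    '"{""Other Type"": {""value"": ""Introductory/Generic""}}",'
    '1/1/15,http://www.theatlantic.com/privacy-policy/\n'
)

# header=None tells pandas the file has no header row; names= supplies one.
named = pd.read_csv(sample_csv, header=None, names=COLUMN_NAMES)
print(named.columns.tolist())
```

With the real data, `sample_csv` would be replaced by the path to one of the files in `OPP-115/annotations`.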

In [32]:
annotations.head()

| | 0 | 2 | 3 | batch_id | date | policy_url | category_name | attributes_value_pairs | segment_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2840 | 84 | 3635 | test_category_labeling_highlight | 1/1/15 | http://www.theatlantic.com/privacy-policy/ | Other | `{"Other Type": {"endIndexInSegment": 762, "sta...` | 0 |
| 1 | 3616 | 82 | 3635 | test_category_labeling_highlight | 1/1/15 | http://www.theatlantic.com/privacy-policy/ | Other | `{"Other Type": {"endIndexInSegment": 762, "sta...` | 0 |
| 2 | 4069 | 88 | 3635 | test_category_labeling_highlight | 1/1/15 | http://www.theatlantic.com/privacy-policy/ | Other | `{"Other Type": {"endIndexInSegment": 762, "sta...` | 0 |
| 3 | 2841 | 84 | 3635 | test_category_labeling_highlight | 1/1/15 | http://www.theatlantic.com/privacy-policy/ | Other | `{"Other Type": {"endIndexInSegment": 219, "sta...` | 1 |
| 4 | 3617 | 82 | 3635 | test_category_labeling_highlight | 1/1/15 | http://www.theatlantic.com/privacy-policy/ | Other | `{"Other Type": {"endIndexInSegment": 219, "sta...` | 1 |
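
Because each row is a single annotation, the same segment appears several times. A minimal sketch of counting distinct annotators per segment and category follows. It uses a small stand-in DataFrame built from the values in the `head()` output, and treating the column labeled `2` as an annotator ID is an assumption.

```python
import pandas as pd

# Stand-in for `annotations`, using values from the head() output.
# "annotator" holds the values of the column labeled 2; reading it as an
# annotator ID is an assumption.
sample = pd.DataFrame({
    "annotator": [84, 82, 88, 84, 82],
    "segment_id": [0, 0, 0, 1, 1],
    "category_name": ["Other"] * 5,
})

# Count how many distinct annotators assigned each category to each segment.
per_segment = (
    sample.groupby(["segment_id", "category_name"])["annotator"]
    .nunique()
    .rename("n_annotators")
    .reset_index()
)
print(per_segment)
```

On the full file, the same `groupby` gives a quick view of how often annotators chose the same category for a segment.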


### Description of the fields/columns in annotations

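**Level of granularity:** Each row represents one annotation that belongs to the segment (roughly one paragraph of the policy) identified by *segment_id*.
* **batch_id:** Name of the annotation batch the row belongs to (`test_category_labeling_highlight` in the sample above).
* **date:** A date recorded with each row (`1/1/15` in the sample above).
* **policy_url:** URL of the annotated privacy policy.
* **category_name:** One of the ten categories listed under "Annotation Scheme". Describes what type of annotation is given for that segment.
* **attributes_value_pairs:** A JSON string that describes in more detail why a specific annotation category is listed for that segment.
  * **Attribute name** (for example `Other Type`), which maps to:
    * **endIndexInSegment:** Character offset in the segment where the selected text ends.
    * **startIndexInSegment:** Character offset in the segment where the selected text starts.
    * **selectedText:** The text span the annotator selected.
    * **value:** The value chosen for the attribute (for example `Introductory/Generic`).

* **segment_id:** Identifies what segment the annotation belongs to.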

In [21]:
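# Column 6 was renamed to "attributes_value_pairs" above, so index it by name.
display(annotations["attributes_value_pairs"][0])
display(annotations["attributes_value_pairs"][1])
display(annotations["attributes_value_pairs"][2])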

'{"Other Type": {"endIndexInSegment": 762, "startIndexInSegment": 100, "selectedText": "At the Atlantic Monthly Group, Inc. (\\"The Atlantic\\"), we want you to enjoy and benefit from our websites and online services secure in the knowledge that we have implemented fair information practices designed to protect your privacy. Our privacy policy is applicable to The Atlantic, and The Atlantics affiliates and subsidiaries whose websites, mobile applications and other online services are directly linked (the Sites). The privacy policy describes the kinds of information we may gather during your visit to these Sites, how we use your information, when we might disclose your personally identifiable information, and how you can manage your information.", "value": "Introductory/Generic"}}'

'{"Other Type": {"endIndexInSegment": 762, "startIndexInSegment": 100, "selectedText": "At the Atlantic Monthly Group, Inc. (\\"The Atlantic\\"), we want you to enjoy and benefit from our websites and online services secure in the knowledge that we have implemented fair information practices designed to protect your privacy. Our privacy policy is applicable to The Atlantic, and The Atlantics affiliates and subsidiaries whose websites, mobile applications and other online services are directly linked (the Sites). The privacy policy describes the kinds of information we may gather during your visit to these Sites, how we use your information, when we might disclose your personally identifiable information, and how you can manage your information.", "value": "Introductory/Generic"}}'

'{"Other Type": {"endIndexInSegment": 762, "startIndexInSegment": 100, "selectedText": "At the Atlantic Monthly Group, Inc. (\\"The Atlantic\\"), we want you to enjoy and benefit from our websites and online services secure in the knowledge that we have implemented fair information practices designed to protect your privacy. Our privacy policy is applicable to The Atlantic, and The Atlantics affiliates and subsidiaries whose websites, mobile applications and other online services are directly linked (the Sites). The privacy policy describes the kinds of information we may gather during your visit to these Sites, how we use your information, when we might disclose your personally identifiable information, and how you can manage your information.", "value": "Introductory/Generic"}}'
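
The strings above are JSON, so they can be parsed with the standard `json` module and flattened into one record per attribute. This is a sketch, not part of the original notebook: `flatten_attributes` is a hypothetical helper, and the `selectedText` in `raw` is shortened from the output above.

```python
import json

# Shortened copy of the first output above (selectedText is truncated).
raw = (
    '{"Other Type": {"endIndexInSegment": 762, "startIndexInSegment": 100, '
    '"selectedText": "At the Atlantic Monthly Group, Inc. (\\"The Atlantic\\"), '
    'we want you to enjoy ...", "value": "Introductory/Generic"}}'
)


def flatten_attributes(raw_json):
    """Turn one attributes_value_pairs string into a list of flat dicts."""
    parsed = json.loads(raw_json)
    rows = []
    for attribute, details in parsed.items():
        rows.append({
            "attribute": attribute,
            "value": details.get("value"),
            "start": details.get("startIndexInSegment"),
            "end": details.get("endIndexInSegment"),
            "selected_text": details.get("selectedText"),
        })
    return rows


rows = flatten_attributes(raw)
print(rows[0]["attribute"], "->", rows[0]["value"])
```

Applied to the whole column, for instance with `annotations["attributes_value_pairs"].apply(flatten_attributes)`, this would give one list of attribute records per annotation.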