<a href="https://colab.research.google.com/github/bachvu98/Policy-NLP/blob/master/Transform%20to%20Binary%20Categories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we import the required dependencies.

In [3]:
import pandas as pd
import numpy as np
import nltk

Read in the data from the GitHub repository.

In [52]:
annotations = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/annotations.csv')
sites = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/sites.csv')
segments = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/segments.csv')

Preview the **annotations** and **segments** tables.

In [54]:
annotations.head()

Unnamed: 0,Policy UID,annotation_id,batch_id,annotator_id,segment_id,category_name,attributes_value_pairs,date,policy_url
0,1017,20137,test_category_labeling_highlight_fordham_aaaaa,121,0,Other,"{""Other Type"": {""selectedText"": ""Sci-News.com ...",,http://www.sci-news.com/privacy-policy.html
1,1017,20324,test_category_labeling_highlight_fordham_aaaaa,121,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",,http://www.sci-news.com/privacy-policy.html
2,1017,20325,test_category_labeling_highlight_fordham_aaaaa,121,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",,http://www.sci-news.com/privacy-policy.html
3,1017,20326,test_category_labeling_highlight_fordham_aaaaa,121,2,Data Retention,"{""Personal Information Type"": {""selectedText"":...",,http://www.sci-news.com/privacy-policy.html
4,1017,20327,test_category_labeling_highlight_fordham_aaaaa,121,3,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""Not sele...",,http://www.sci-news.com/privacy-policy.html


In [64]:
segments.head()

Unnamed: 0,Policy UID,segment_id,segments
0,20,0,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,This privacy policy does not apply to Sites ma...
2,20,2,"By visiting our Sites, you are accepting the p..."
3,20,3,<strong> What Information Is Collected? </stro...
4,20,4,<strong> Personally Identifiable Information <...


Merge the annotations with their corresponding segments, matching on `Policy UID` and `segment_id`.

In [86]:
# Outer join keeps rows from both tables; segments without any annotation get the category 'None'
joined = pd.merge(annotations, segments, on=['Policy UID', 'segment_id'], how='outer')
joined['category_name'] = joined['category_name'].fillna(value='None')
# Drop columns that are not used in the rest of the notebook
joined = joined.drop(['batch_id', 'attributes_value_pairs', 'date'], axis=1)
print(joined.shape)
joined.head()

(23194, 7)


Unnamed: 0,Policy UID,annotation_id,annotator_id,segment_id,category_name,policy_url,segments
0,1017,20137,121,0,Other,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
1,1017,20589,117,0,Other,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
2,1017,20233,118,0,Other,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
3,1017,20234,118,0,Policy Change,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
4,1017,20324,121,1,First Party Collection/Use,http://www.sci-news.com/privacy-policy.html,Information that Sci-News.com May Collect Onli...
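
To make the effect of the outer merge concrete, here is a minimal sketch on made-up data. The `toy_*` frames and their values are assumptions for illustration only, and `indicator=True` is added here (it is not used in the cell above) to show which table each row came from.

```python
import pandas as pd

# Toy stand-ins for the annotations and segments tables (values are invented).
toy_annotations = pd.DataFrame({
    'Policy UID': [1, 1, 1],
    'segment_id': [0, 0, 1],
    'category_name': ['Other', 'Policy Change', 'Data Security'],
})
toy_segments = pd.DataFrame({
    'Policy UID': [1, 1, 1],
    'segment_id': [0, 1, 2],
    'segments': ['intro text', 'security text', 'unlabeled text'],
})

# Same join keys and join type as above; indicator=True adds a '_merge' column.
toy_joined = pd.merge(toy_annotations, toy_segments,
                      on=['Policy UID', 'segment_id'],
                      how='outer', indicator=True)

# Segments with no annotation have a missing category, which becomes 'None'.
toy_joined['category_name'] = toy_joined['category_name'].fillna('None')
print(toy_joined)
```

Segment 0 appears twice because it has two annotations, while segment 2 appears once with the placeholder category because no annotation refers to it.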


A single segment is often labeled with more than one category. The count below shows the number of distinct categories per segment.

In [67]:
print(joined.groupby(['Policy UID', 'segment_id'])['category_name'].nunique())

Policy UID  segment_id
20          0             1
            1             1
            2             2
            3             1
            4             2
                         ..
1713        84            3
            85            1
            86            1
            87            1
            88            1
Name: category_name, Length: 3792, dtype: int64
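
As a rough illustration of how that count can be turned into a summary figure, the sketch below computes the share of segments with more than one distinct category. The toy frame and the name `multi_label_share` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data: segment 0 has two distinct categories, segment 1 has one.
toy = pd.DataFrame({
    'Policy UID': [1, 1, 1, 1],
    'segment_id': [0, 0, 0, 1],
    'category_name': ['Other', 'Other', 'Policy Change', 'Data Security'],
})

# Number of distinct categories per (policy, segment) pair.
n_categories = toy.groupby(['Policy UID', 'segment_id'])['category_name'].nunique()

# Fraction of segments that carry more than one distinct category.
multi_label_share = (n_categories > 1).mean()
print(n_categories)
print(multi_label_share)
```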


For such segments, we keep the category name that appears most often among the segment's annotations.

In [92]:
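# Get the most frequent value (mode) of every column within each segment.
# Each column is reduced independently, so annotation_id and annotator_id
# need not come from the annotation that supplied category_name.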
mode_categories = joined.groupby(['Policy UID','segment_id']).agg(lambda x: x.value_counts().index[0])
mode_categories = mode_categories.reset_index()
mode_categories.head()

Unnamed: 0,Policy UID,segment_id,annotation_id,annotator_id,category_name,policy_url,segments
0,20,0,4069,88,Other,http://www.theatlantic.com/privacy-policy/,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,4070,88,Other,http://www.theatlantic.com/privacy-policy/,This privacy policy does not apply to Sites ma...
2,20,2,2843,82,Policy Change,http://www.theatlantic.com/privacy-policy/,"By visiting our Sites, you are accepting the p..."
3,20,3,2847,84,First Party Collection/Use,http://www.theatlantic.com/privacy-policy/,<strong> What Information Is Collected? </stro...
4,20,4,4081,82,First Party Collection/Use,http://www.theatlantic.com/privacy-policy/,<strong> Personally Identifiable Information <...
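
The cell above does not handle ties explicitly. One possible variant, sketched here on invented data, reduces only `category_name` and breaks ties deterministically with `Series.mode()`, which returns tied values in sorted order. The helper `most_common` and the toy frame are hypothetical names introduced for this example.

```python
import pandas as pd

# Toy data: segment 0 has a clear majority, segment 1 is a two-way tie.
toy = pd.DataFrame({
    'Policy UID': [1, 1, 1, 1, 1],
    'segment_id': [0, 0, 0, 1, 1],
    'category_name': ['Other', 'Policy Change', 'Policy Change',
                      'Other', 'Data Security'],
})

def most_common(labels: pd.Series) -> str:
    """Return the most frequent label; ties resolve to the alphabetically first."""
    return labels.mode().iloc[0]

toy_mode = (toy.groupby(['Policy UID', 'segment_id'])['category_name']
               .agg(most_common)
               .reset_index())
print(toy_mode)
```

Alphabetical tie-breaking is an arbitrary choice; any fixed rule would do, as long as the result is reproducible.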


In [130]:
categories = list(mode_categories['category_name'].unique())
print(categories)
# Map each category name to a snake_case column name for the output table
cols = {'Other': 'other',
        'Policy Change': 'policy_change',
        'First Party Collection/Use': 'first_party_collection_use',
        'Third Party Sharing/Collection': 'third_party_sharing_collection',
        'Do Not Track': 'do_not_track',
        'User Choice/Control': 'user_choice_control',
        'International and Specific Audiences': 'international_specific_audiences',
        'Data Security': 'data_security',
        'Data Retention': 'data_retention',
        'User Access, Edit and Deletion': 'user_access_edit_deletion'}

['Other', 'Policy Change', 'First Party Collection/Use', 'Third Party Sharing/Collection', 'Do Not Track', 'User Choice/Control', 'International and Specific Audiences', 'Data Security', 'Data Retention', 'User Access, Edit and Deletion']


In [136]:
# Start from the segment keys; the loop below adds one 0/1 column per category, named as in cols
binary_categories = pd.DataFrame({'Policy UID': mode_categories['Policy UID'], 'segment_id': mode_categories['segment_id']})

In [132]:
for category in categories:
    # 1 where the segment's most frequent category equals this category, else 0
    binary_categories[cols[category]] = (mode_categories['category_name'] == category).astype(int)

In [135]:
print(binary_categories.shape)

(3792, 12)


In [112]:
binary_categories.head()

Unnamed: 0,Policy UID,segment_id,other,policy_change,first_party_collection_use,third_party_sharing_collection,do_not_track,user_choice_control,international_specific_audiences,data_security,data_retention,user_access_edit_deletion
0,20,0,1,0,0,0,0,0,0,0,0,0
1,20,1,1,0,0,0,0,0,0,0,0,0
2,20,2,0,1,0,0,0,0,0,0,0,0
3,20,3,0,0,1,0,0,0,0,0,0,0
4,20,4,0,0,1,0,0,0,0,0,0,0
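
The same kind of table can also be produced without an explicit loop. The sketch below uses `pd.get_dummies` on a small invented frame; `toy_mode`, `toy_cols`, and `toy_binary` are illustrative names, and this is an alternative to the loop above rather than the method the notebook uses.

```python
import pandas as pd

# Toy version of mode_categories with three segments.
toy_mode = pd.DataFrame({
    'Policy UID': [20, 20, 20],
    'segment_id': [0, 1, 2],
    'category_name': ['Other', 'Policy Change', 'First Party Collection/Use'],
})

# Subset of the category-to-column mapping, for the categories present here.
toy_cols = {
    'Other': 'other',
    'Policy Change': 'policy_change',
    'First Party Collection/Use': 'first_party_collection_use',
}

# One 0/1 indicator column per category, renamed to snake_case.
dummies = pd.get_dummies(toy_mode['category_name'], dtype=int).rename(columns=toy_cols)

# Attach the indicators to the segment keys.
toy_binary = pd.concat([toy_mode[['Policy UID', 'segment_id']], dummies], axis=1)
print(toy_binary)
```

Because each segment has exactly one category at this point, every row of the indicator columns sums to 1.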


In [102]:
# Requires Google Drive to be mounted at /content/drive in Colab
binary_categories.to_csv("/content/drive/My Drive/OPP-115/OPP-115/binary_segment_categories.csv")
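
As written, `to_csv` also writes the row index as an unnamed first column. The sketch below, which uses a temporary directory and a made-up two-row frame instead of the Drive path, shows the difference that `index=False` makes when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for binary_categories (values are invented).
toy_binary = pd.DataFrame({
    'Policy UID': [20, 20],
    'segment_id': [0, 1],
    'other': [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'binary_segment_categories.csv')

    # Default behaviour: the index is written as an extra unnamed column.
    toy_binary.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False writes only the data columns.
    toy_binary.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index is a matter of preference; dropping it avoids an extra `Unnamed: 0` column when the file is loaded later.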