<a href="https://colab.research.google.com/github/bachvu98/Policy-NLP/blob/master/Preprocessing_Policy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we import the required dependencies.

In [26]:
import pandas as pd
import numpy as np
import string
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Read in the data from the GitHub repository.

In [12]:
annotations = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/annotations.csv')
sites = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/sites.csv')
segments = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/segments.csv')

Preview of the **annotations** and **segments** tables.

In [13]:
annotations.head()

Unnamed: 0,Policy UID,annotation_id,batch_id,annotator_id,segment_id,category_name,attributes_value_pairs,date,policy_url
0,1017,20137,test_category_labeling_highlight_fordham_aaaaa,121,0,Other,"{""Other Type"": {""selectedText"": ""Sci-News.com ...",,http://www.sci-news.com/privacy-policy.html
1,1017,20324,test_category_labeling_highlight_fordham_aaaaa,121,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",,http://www.sci-news.com/privacy-policy.html
2,1017,20325,test_category_labeling_highlight_fordham_aaaaa,121,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",,http://www.sci-news.com/privacy-policy.html
3,1017,20326,test_category_labeling_highlight_fordham_aaaaa,121,2,Data Retention,"{""Personal Information Type"": {""selectedText"":...",,http://www.sci-news.com/privacy-policy.html
4,1017,20327,test_category_labeling_highlight_fordham_aaaaa,121,3,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""Not sele...",,http://www.sci-news.com/privacy-policy.html


In [14]:
segments.head()

Unnamed: 0,Policy UID,segment_id,segments
0,20,0,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,This privacy policy does not apply to Sites ma...
2,20,2,"By visiting our Sites, you are accepting the p..."
3,20,3,<strong> What Information Is Collected? </stro...
4,20,4,<strong> Personally Identifiable Information <...


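Merge the annotations with their corresponding segments. An outer join keeps segments that have no annotation; their category is filled with `'None'`.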

In [15]:
joined = pd.merge(annotations,segments,on=['Policy UID','segment_id'],how='outer')
joined['category_name'] = joined['category_name'].fillna(value='None')
joined = joined.drop(['batch_id','attributes_value_pairs','date','annotation_id','annotator_id','policy_url'],axis=1)
#joined = seg_ind.merge(ann_ind)
print(joined.shape)
joined.head()

(23194, 4)


Unnamed: 0,Policy UID,segment_id,category_name,segments
0,1017,0,Other,Privacy Policy <br> <br> Sci-News.com is commi...
1,1017,0,Other,Privacy Policy <br> <br> Sci-News.com is commi...
2,1017,0,Other,Privacy Policy <br> <br> Sci-News.com is commi...
3,1017,0,Policy Change,Privacy Policy <br> <br> Sci-News.com is commi...
4,1017,1,First Party Collection/Use,Information that Sci-News.com May Collect Onli...
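
To check how many rows of an outer join come from only one side, `pd.merge` accepts `indicator=True`. The following is a minimal sketch on toy data; the `toy_annotations` and `toy_segments` frames are hypothetical stand-ins, not the OPP-115 tables.

```python
import pandas as pd

# Toy stand-ins for the annotations and segments tables (hypothetical data).
toy_annotations = pd.DataFrame({
    "Policy UID": [1, 1, 1],
    "segment_id": [0, 0, 1],
    "category_name": ["Other", "Policy Change", "Data Security"],
})
toy_segments = pd.DataFrame({
    "Policy UID": [1, 1, 1],
    "segment_id": [0, 1, 2],
    "segments": ["intro text", "security text", "unlabelled text"],
})

# indicator=True adds a `_merge` column recording where each row came from.
toy_joined = pd.merge(toy_annotations, toy_segments,
                      on=["Policy UID", "segment_id"],
                      how="outer", indicator=True)

# Rows present only in the segments table have no annotation.
unlabelled = toy_joined[toy_joined["_merge"] == "right_only"]
print(unlabelled[["Policy UID", "segment_id", "segments"]])
```

Applied to the real tables, the same check would show whether any segment lacks an annotation before the `fillna` step.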


A single segment often belongs to more than one category. The count of distinct category names per segment shows this:

In [7]:
# Number of distinct category names assigned to each segment
print(joined.groupby(['Policy UID','segment_id'])['category_name'].nunique())

Policy UID  segment_id
20          0             1
            1             1
            2             2
            3             1
            4             2
                         ..
1713        84            3
            85            1
            86            1
            87            1
            88            1
Name: category_name, Length: 3792, dtype: int64


When a segment has several categories, we keep the one that appears most often among its annotations.

In [16]:
#Get the mode of each segment
mode_categories = joined.groupby(['Policy UID','segment_id']).agg(lambda x: x.value_counts().index[0])
mode_categories = mode_categories.reset_index()
mode_categories.head()

Unnamed: 0,Policy UID,segment_id,category_name,segments
0,20,0,Other,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,Other,This privacy policy does not apply to Sites ma...
2,20,2,Policy Change,"By visiting our Sites, you are accepting the p..."
3,20,3,First Party Collection/Use,<strong> What Information Is Collected? </stro...
4,20,4,First Party Collection/Use,<strong> Personally Identifiable Information <...
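
When two categories are equally frequent for a segment, the choice depends on how ties are ordered. The sketch below is one possible alternative, not the approach used above: it relies on `Series.mode()`, which returns all most-frequent values in sorted order, so ties are broken alphabetically. The `toy` frame and the `majority_label` helper are hypothetical.

```python
import pandas as pd

# Toy annotations: segment 0 has a clear majority, segment 1 has a tie.
toy = pd.DataFrame({
    "Policy UID": [1, 1, 1, 1, 1],
    "segment_id": [0, 0, 0, 1, 1],
    "category_name": ["Other", "Other", "Policy Change",
                      "Data Security", "Data Retention"],
})

def majority_label(labels):
    # Series.mode() returns every most-frequent value, sorted,
    # so taking the first one breaks ties alphabetically.
    return labels.mode().iloc[0]

toy_modes = (toy.groupby(["Policy UID", "segment_id"])["category_name"]
                .agg(majority_label)
                .reset_index())
print(toy_modes)
```

Whether an alphabetical tie-break is appropriate depends on the task; it only makes the result reproducible.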


In [17]:
categories = list(mode_categories['category_name'].unique())
print(categories)
cols = {'Other': 'other',
        'Policy Change': 'policy_change',
        'First Party Collection/Use': 'first_party_collection_use',
        'Third Party Sharing/Collection': 'third_party_sharing_collection',
        'Do Not Track': 'do_not_track',
        'User Choice/Control': 'user_choice_control',
        'International and Specific Audiences': 'international_specific_audiences',
        'Data Security': 'data_security',
        'Data Retention': 'data_retention',
        'User Access, Edit and Deletion': 'user_access_edit_deletion'}

['Other', 'Policy Change', 'First Party Collection/Use', 'Third Party Sharing/Collection', 'Do Not Track', 'User Choice/Control', 'International and Specific Audiences', 'Data Security', 'Data Retention', 'User Access, Edit and Deletion']


In [18]:
# Start from the identifying columns; the loop in the next cell adds one binary column per category, named as in cols
binary_categories = pd.DataFrame({'Policy UID':mode_categories['Policy UID'], 'segment_id':mode_categories['segment_id'], 'segments':mode_categories['segments']})

In [19]:
for category in categories:
    # 1 where the segment's most frequent category equals this category, else 0
    binary_categories[cols[category]] = (mode_categories['category_name'] == category).astype(int)

In [20]:
print(binary_categories.shape)

(3792, 13)


In [21]:
binary_categories.head()

Unnamed: 0,Policy UID,segment_id,segments,other,policy_change,first_party_collection_use,third_party_sharing_collection,do_not_track,user_choice_control,international_specific_audiences,data_security,data_retention,user_access_edit_deletion
0,20,0,<strong> Privacy Policy </strong> <br> <br> <s...,1,0,0,0,0,0,0,0,0,0
1,20,1,This privacy policy does not apply to Sites ma...,1,0,0,0,0,0,0,0,0,0
2,20,2,"By visiting our Sites, you are accepting the p...",0,1,0,0,0,0,0,0,0,0
3,20,3,<strong> What Information Is Collected? </stro...,0,0,1,0,0,0,0,0,0,0
4,20,4,<strong> Personally Identifiable Information <...,0,0,1,0,0,0,0,0,0,0
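
The same one-hot encoding can also be produced with `pd.get_dummies`. This is a minimal sketch on toy labels, assuming a reduced mapping `toy_cols` (hypothetical) in place of the full `cols` dictionary.

```python
import pandas as pd

# Toy labels and a reduced name mapping (both hypothetical).
toy_labels = pd.Series(["Other", "Policy Change", "Other", "Data Security"],
                       name="category_name")
toy_cols = {"Other": "other",
            "Policy Change": "policy_change",
            "Data Security": "data_security"}

# get_dummies creates one indicator column per distinct label;
# astype(int) normalises the dtype to 0/1 across pandas versions.
toy_one_hot = pd.get_dummies(toy_labels).astype(int).rename(columns=toy_cols)
print(toy_one_hot)
```

Because each segment keeps exactly one category, every row of the encoded table sums to 1.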


# Preprocessing segments text

In [23]:
def clean_text(text):
    # Remove @mentions, URLs, and every character that is not a letter, digit, space, or tab.
    # Note: angle brackets are removed but HTML tag names (e.g. strong, br) survive as tokens.
    text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", text).split())
    # Tokenise
    tokens = word_tokenize(text)
    # Convert to lower case
    tokens = [w.lower() for w in tokens]
    # Remove any remaining punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Drop tokens that are not purely alphabetic
    words = [word for word in stripped if word.isalpha()]
    # Filter out stop words (corpus-specific stop words can be added to this set)
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    # Stem each word
    porter = PorterStemmer()
    words = [porter.stem(word) for word in words]
    # Join the tokens back into a single string
    text = ' '.join(words)
    return text

In [27]:
#Process the segments here
binary_categories['segments'] = binary_categories['segments'].apply(clean_text)

In [28]:
binary_categories.head()

Unnamed: 0,Policy UID,segment_id,segments,other,policy_change,first_party_collection_use,third_party_sharing_collection,do_not_track,user_choice_control,international_specific_audiences,data_security,data_retention,user_access_edit_deletion
0,20,0,strong privaci polici strong br br strong effe...,1,0,0,0,0,0,0,0,0,0
1,20,1,privaci polici appli site maintain compani org...,1,0,0,0,0,0,0,0,0,0
2,20,2,visit site accept practic describ privaci poli...,0,1,0,0,0,0,0,0,0,0
3,20,3,strong inform collect strong br br collect two...,0,0,1,0,0,0,0,0,0,0
4,20,4,strong person identifi inform strong br br gen...,0,0,1,0,0,0,0,0,0,0
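
The cleaned segments above still contain tokens such as `strong` and `br`, because the cleaning step removes the angle brackets but keeps the tag names. One possible remedy, sketched here under the assumption that the markup consists of simple `<...>` tags, is to strip the tags before cleaning. The `strip_html` helper is hypothetical and not part of the original notebook.

```python
import html
import re

# Matches a simple HTML tag such as <strong>, </strong>, or <br>.
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    # Replace each tag with a space so adjacent words do not fuse together.
    no_tags = TAG_RE.sub(" ", text)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(no_tags).split())

print(strip_html("<strong> Privacy Policy </strong> <br> <br> Effective"))
```

It could be applied to the raw `segments` column ahead of `clean_text`, for example with `.apply(strip_html)`; for heavily nested or malformed markup a real HTML parser would be a safer choice.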


In [30]:
# Assumes Google Drive is already mounted at /content/drive and the target folder exists
binary_categories.to_csv("/content/drive/My Drive/OPP-115/OPP-115/binary_segment_categories.csv")
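
By default, `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference on a toy frame, using an in-memory buffer instead of a Drive path; passing `index=False` is an optional variation, not what the cell above does.

```python
import io
import pandas as pd

toy = pd.DataFrame({"segment_id": [0, 1], "other": [1, 0]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
print(pd.read_csv(with_index).columns.tolist())

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(reloaded.columns.tolist())
```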