<a href="https://colab.research.google.com/github/bachvu98/Policy-NLP/blob/all-in-one/Preprocessing_Policy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we import the required dependencies.

In [None]:
import pandas as pd
import numpy as np
import string
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Read in the data from the GitHub repository.

In [None]:
annotations = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/annotations.csv')
sites = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/sites.csv')
segments = pd.read_csv('https://raw.githubusercontent.com/bachvu98/Policy-NLP/master/OPP-115_v1_0/OPP-115/segments.csv')

Preview of the **annotations** and **segments** tables.

In [None]:
annotations.head()

Unnamed: 0,Policy UID,annotation_id,batch_id,annotator_id,segment_id,category_name,attributes_value_pairs,date,policy_url
0,1017,20137,test_category_labeling_highlight_fordham_aaaaa,121,0,Other,"{""Other Type"": {""selectedText"": ""Sci-News.com ...",,http://www.sci-news.com/privacy-policy.html
1,1017,20324,test_category_labeling_highlight_fordham_aaaaa,121,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",,http://www.sci-news.com/privacy-policy.html
2,1017,20325,test_category_labeling_highlight_fordham_aaaaa,121,1,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""nformati...",,http://www.sci-news.com/privacy-policy.html
3,1017,20326,test_category_labeling_highlight_fordham_aaaaa,121,2,Data Retention,"{""Personal Information Type"": {""selectedText"":...",,http://www.sci-news.com/privacy-policy.html
4,1017,20327,test_category_labeling_highlight_fordham_aaaaa,121,3,First Party Collection/Use,"{""Collection Mode"": {""selectedText"": ""Not sele...",,http://www.sci-news.com/privacy-policy.html


In [None]:
segments.head()

Unnamed: 0,Policy UID,segment_id,segments
0,20,0,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,This privacy policy does not apply to Sites ma...
2,20,2,"By visiting our Sites, you are accepting the p..."
3,20,3,<strong> What Information Is Collected? </stro...
4,20,4,<strong> Personally Identifiable Information <...


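Merge the annotations with their corresponding segments. The outer join keeps segments that have no annotation, and these are labelled with the category `None`.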

In [None]:
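# Outer join on the policy and segment keys so unannotated segments are kept
joined = pd.merge(annotations, segments, on=['Policy UID', 'segment_id'], how='outer')
# Segments without any annotation get the placeholder category 'None'
joined['category_name'] = joined['category_name'].fillna(value='None')
# Drop columns that are not needed for the rest of the preprocessing
joined = joined.drop(['batch_id', 'attributes_value_pairs', 'date'], axis=1)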
print(joined.shape)
joined.head()

(23194, 7)


Unnamed: 0,Policy UID,annotation_id,annotator_id,segment_id,category_name,policy_url,segments
0,1017,20137,121,0,Other,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
1,1017,20589,117,0,Other,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
2,1017,20233,118,0,Other,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
3,1017,20234,118,0,Policy Change,http://www.sci-news.com/privacy-policy.html,Privacy Policy <br> <br> Sci-News.com is commi...
4,1017,20324,121,1,First Party Collection/Use,http://www.sci-news.com/privacy-policy.html,Information that Sci-News.com May Collect Onli...
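
The effect of the outer join can be checked on a small example. The sketch below uses made-up toy tables (`toy_annotations` and `toy_segments` are hypothetical stand-ins, not the OPP-115 data) and adds `indicator=True`, which the notebook does not use, to show which rows come from segments without any annotation.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values, for illustration only).
toy_annotations = pd.DataFrame({
    "Policy UID": [1, 1, 1],
    "segment_id": [0, 0, 1],
    "category_name": ["Other", "Policy Change", "Data Retention"],
})
toy_segments = pd.DataFrame({
    "Policy UID": [1, 1, 1],
    "segment_id": [0, 1, 2],
    "segments": ["intro text", "retention text", "unannotated text"],
})

# Same join keys and join type as in the notebook; indicator=True adds a
# "_merge" column saying whether a row came from the left, right, or both tables.
toy_joined = pd.merge(
    toy_annotations, toy_segments,
    on=["Policy UID", "segment_id"], how="outer", indicator=True,
)
toy_joined["category_name"] = toy_joined["category_name"].fillna("None")

# Rows that exist only in the segments table had no annotation at all.
unannotated = toy_joined[toy_joined["_merge"] == "right_only"]
print(toy_joined)
```

Segment 2 has no annotation here, so it survives the join only because of `how='outer'` and ends up with the placeholder category.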


A single segment often belongs to multiple categories. The cell below counts the distinct categories assigned to each segment.

In [None]:
# Number of distinct categories assigned to each (policy, segment) pair
print(joined.groupby(['Policy UID', 'segment_id'])['category_name'].nunique())

Policy UID  segment_id
20          0             1
            1             1
            2             2
            3             1
            4             2
                         ..
1713        84            3
            85            1
            86            1
            87            1
            88            1
Name: category_name, Length: 3792, dtype: int64
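
To put a number on how often this happens, one option is to compute the share of segments with more than one distinct category. The sketch below runs on a small made-up frame (`toy` is hypothetical). Applying the same two lines to `joined` would give the figure for the real data, which is not reported here.

```python
import pandas as pd

# Hypothetical annotations: segment 0 has two categories, segment 1 has one.
toy = pd.DataFrame({
    "Policy UID": [1, 1, 1, 1],
    "segment_id": [0, 0, 1, 1],
    "category_name": ["Other", "Policy Change", "Data Retention", "Data Retention"],
})

# Distinct categories per segment, then the fraction of segments with more than one.
n_categories = toy.groupby(["Policy UID", "segment_id"])["category_name"].nunique()
multi_share = (n_categories > 1).mean()
print(n_categories)
print(multi_share)
```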


When a segment has more than one category, we keep the category name that appears most often for that segment.

In [None]:
# Get the most frequent value of every column within each segment
mode_categories = joined.groupby(['Policy UID','segment_id']).agg(lambda x: x.value_counts().index[0])
mode_categories = mode_categories.reset_index()
mode_categories.head()

Unnamed: 0,Policy UID,segment_id,annotation_id,annotator_id,category_name,policy_url,segments
0,20,0,4069,88,Other,http://www.theatlantic.com/privacy-policy/,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,4070,88,Other,http://www.theatlantic.com/privacy-policy/,This privacy policy does not apply to Sites ma...
2,20,2,2843,82,Policy Change,http://www.theatlantic.com/privacy-policy/,"By visiting our Sites, you are accepting the p..."
3,20,3,2847,84,First Party Collection/Use,http://www.theatlantic.com/privacy-policy/,<strong> What Information Is Collected? </stro...
4,20,4,4081,82,First Party Collection/Use,http://www.theatlantic.com/privacy-policy/,<strong> Personally Identifiable Information <...
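
Because the lambda is applied to every column separately, the `annotation_id` and `annotator_id` shown above are each column's own most frequent value and need not come from the same annotation as the chosen category. A minimal sketch on made-up data (`toy` and `toy_mode` are hypothetical names) that restricts the aggregation to `category_name` only:

```python
import pandas as pd

# Hypothetical annotations: segment 0 is labelled "Other" twice and
# "Policy Change" once; segment 1 is labelled "Data Retention" twice.
toy = pd.DataFrame({
    "Policy UID": [1, 1, 1, 1, 1],
    "segment_id": [0, 0, 0, 1, 1],
    "annotator_id": [10, 11, 12, 10, 11],
    "category_name": ["Other", "Policy Change", "Other",
                      "Data Retention", "Data Retention"],
})

# value_counts() sorts by frequency in descending order, so index[0]
# is the most frequent category within each group.
toy_mode = (
    toy.groupby(["Policy UID", "segment_id"])["category_name"]
    .agg(lambda s: s.value_counts().index[0])
    .reset_index()
)
print(toy_mode)
```

When two categories are tied within a segment, `index[0]` simply returns whichever one `value_counts()` lists first, so tied segments may deserve a separate look.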


In [None]:
mode_categories = mode_categories.drop(['annotation_id', 'policy_url', 'annotator_id'], axis=1)
mode_categories.head()

Unnamed: 0,Policy UID,segment_id,category_name,segments
0,20,0,Other,<strong> Privacy Policy </strong> <br> <br> <s...
1,20,1,Other,This privacy policy does not apply to Sites ma...
2,20,2,Policy Change,"By visiting our Sites, you are accepting the p..."
3,20,3,First Party Collection/Use,<strong> What Information Is Collected? </stro...
4,20,4,First Party Collection/Use,<strong> Personally Identifiable Information <...


In [None]:
def clean_text(text):
  # Drop @mentions, URLs, and every character that is not a letter, digit, space, or tab
  text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", text).split())
  # Tokenise
  tokens = word_tokenize(text)
  # Convert to lower case
  tokens = [w.lower() for w in tokens]
  # Remove punctuation from each word
  table = str.maketrans('', '', string.punctuation)
  stripped = [w.translate(table) for w in tokens]
  # Remove remaining tokens that are not alphabetic
  words = [word for word in stripped if word.isalpha()]
  # Filter out stop words (add domain-specific stop words here if needed)
  stop_words = set(stopwords.words('english'))
  words = [w for w in words if w not in stop_words]
  # Stem each word
  porter = PorterStemmer()
  words = [porter.stem(word) for word in words]
  # Join the stemmed words back into a single string
  return ' '.join(words)
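
The first step of `clean_text` is easiest to follow on a concrete string. The sketch below applies only the regular expression from the function to a made-up sample (the sample text and the names `PATTERN`, `sample`, and `stripped` are illustrative assumptions), without the NLTK steps that follow it.

```python
import re

# Same pattern as in clean_text: @mentions, any character that is not a
# letter, digit, space, or tab, and URLs of the form scheme://...
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

# Hypothetical input resembling a policy segment with HTML markup.
sample = ("<strong> Privacy Policy </strong> <br> "
          "See http://example.com/privacy or email @support!")

# Remove every match, then collapse runs of whitespace.
stripped = " ".join(re.sub(PATTERN, "", sample).split())
print(stripped)  # strong Privacy Policy strong br See or email
```

Note that the angle brackets are removed but the tag names `strong` and `br` are left behind as ordinary words.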

In [None]:
#Process the segments here
mode_categories['segments'] = mode_categories['segments'].apply(clean_text)

In [None]:
mode_categories.head()

Unnamed: 0,Policy UID,segment_id,category_name,segments
0,20,0,Other,strong privaci polici strong br br strong effe...
1,20,1,Other,privaci polici appli site maintain compani org...
2,20,2,Policy Change,visit site accept practic describ privaci poli...
3,20,3,First Party Collection/Use,strong inform collect strong br br collect two...
4,20,4,First Party Collection/Use,strong person identifi inform strong br br gen...
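
The cleaned segments still contain tokens such as `strong` and `br`, which are leftovers from HTML tags in the raw text. One possible remedy, not part of the original notebook, is to remove the tags before calling `clean_text`. The sketch below uses a simple regular expression. `strip_html_tags` is a hypothetical helper, and the pattern assumes well-formed tags with no `>` inside attribute values.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br> or </strong>.
TAG_RE = re.compile(r"<[^>]+>")

def strip_html_tags(text):
    """Replace HTML tags with spaces and collapse repeated whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())

raw = "<strong> Privacy Policy </strong> <br> <br> Effective date"
print(strip_html_tags(raw))  # Privacy Policy Effective date
```

It could be applied to the raw `segments` column as `mode_categories['segments'].apply(strip_html_tags)` before the cleaning step. For messier markup, a proper HTML parser would be a safer choice than a regular expression.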


In [None]:
# Save the result to Google Drive (assumes Drive is already mounted at /content/drive)
mode_categories.to_csv("/content/drive/My Drive/OPP-115/OPP-115/segment_categories.csv")
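
By default `to_csv` also writes the DataFrame index, which reappears as an `Unnamed: 0` column when the file is read back, like the one visible in the previews above. The sketch below shows the difference that `index=False` makes, using a temporary directory and a made-up frame (`toy` and the file names are hypothetical). Whether to drop the index in the notebook itself is a judgment call.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "segment_id": [0, 1],
    "category_name": ["Other", "Policy Change"],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                   # default: index is written
    toy.to_csv(without_index, index=False)   # index is omitted

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'segment_id', 'category_name']
print(cols_without)  # ['segment_id', 'category_name']
```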