<h1> Preprocessing<span class="tocSkip"></span></h1>

In this notebook, I follow the text preprocessing steps described by Story et al. (2019), "Natural Language Processing for Mobile App Privacy Compliance", available from https://usableprivacy.org/publications.

Many of the steps they describe are somewhat vague, so this is not a true replication.

The purpose of these steps is to improve the performance of the models and to populate the crafted features data more accurately. They would also benefit other tasks that could be done with the data, such as more accurate EDA, although I have not taken advantage of this.

The steps described by Story et al. are:

- Normalize whitespace
- Normalize punctuation
- Remove non-ASCII characters
- Make all text lowercase

Following that, I load the 'crafted features' provided by Story et al. I find some issues with their cleanliness, so I apply the same steps to these too.

Finally I append the crafted features to the dataframe.  The data will then be ready for modelling.

While I previously looked at classifiers at the sentence level, going forward I will investigate Story et al.'s work more closely and so will process the text at the segment level.

**Crafted features**

To help create accurate classifiers, columns will be added to the dataframe for key phrases that may be found in a segment that has been annotated with a specific annotation. For example, the phrases 'phone book', 'phonebook' or 'address book' could be found in segments that have been annotated with the Contact_Address_Book annotation, and adding these phrases as columns could help a classifier to correctly identify Contact_Address_Book.

Story et al. created these features based on their expertise and findings across the train and validation policies and have made them available along with the data.

**A note on my code**

Although these functions are only being applied to this dataset once, I still write them out as functions as they can in principle be modified for other projects and it is easier to verify that each function works in function format.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Confirm-all-text-data-is-string-format" data-toc-modified-id="Confirm-all-text-data-is-string-format-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Confirm all text data is string format</a></span></li><li><span><a href="#Setting-up-normalization-functions" data-toc-modified-id="Setting-up-normalization-functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setting up normalization functions</a></span></li><li><span><a href="#Normalize-Whitespace" data-toc-modified-id="Normalize-Whitespace-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Normalize Whitespace</a></span></li><li><span><a href="#Normalize-punctuation" data-toc-modified-id="Normalize-punctuation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Normalize punctuation</a></span></li><li><span><a href="#Remove-non-ASCII-characters" data-toc-modified-id="Remove-non-ASCII-characters-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Remove non-ASCII characters</a></span></li><li><span><a href="#Make-all-policy-text-lowercase" data-toc-modified-id="Make-all-policy-text-lowercase-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Make all policy text lowercase</a></span></li><li><span><a href="#Same-pre-processing-steps-for-Crafted-Features" data-toc-modified-id="Same-pre-processing-steps-for-Crafted-Features-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Same pre-processing steps for Crafted Features</a></span></li><li><span><a href="#Append-crafted-features-to-dataframe" data-toc-modified-id="Append-crafted-features-to-dataframe-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Append crafted features to dataframe</a></span><ul class="toc-item"><li><span><a href="#Load-crafted-features" data-toc-modified-id="Load-crafted-features-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Load crafted features</a></span></li><li><span><a href="#Add-crafted-features-columns-to-df" data-toc-modified-id="Add-crafted-features-columns-to-df-8.2"><span 
class="toc-item-num">8.2&nbsp;&nbsp;</span>Add crafted features columns to df</a></span></li></ul></li><li><span><a href="#Saving-the-dataframe-to-be-used-for-modelling" data-toc-modified-id="Saving-the-dataframe-to-be-used-for-modelling-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Saving the dataframe to be used for modelling</a></span></li></ul></div>

In [1]:
import pandas as pd
from pandas import json_normalize
import yaml
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from scipy import stats
from scipy.stats import norm

import sys
from collections import defaultdict
from collections import Counter
import re

import priv_policy_manipulation_functions as priv_pol_funcs

# Confirm all text data is string format

I realised that it is possible that a segment is not correctly stored in string format, so firstly I wish to confirm that every segment in my dataframe is stored in string format.

Loading the data:

In [2]:
clean_segment_annots_df = pd.read_pickle('objects/segment_annots_df.pkl')

Writing and verifying a function to confirm the datatype of every cell in a pandas dataframe column:

In [3]:
def column_all_dtype(dataframe_column, dtype):
    """
    Inputs: a pandas dataframe column (Series) and a specific dtype
    Example inputs for dtype: str, "<class 'numpy.int64'>", "<class 'numpy.bool_'>", "<class 'list'>"
    Outputs: True if every item in the dataframe column has the dtype passed, otherwise False.
    """
    
    for _index in range(len(dataframe_column)):
        # .iloc accesses by position, so the check does not depend on the index labels
        if str(type(dataframe_column.iloc[_index])) != str(dtype):
            return False
    return True

# validation on columns of known type:
print(column_all_dtype(clean_segment_annots_df['policy_segment_id'], "<class 'numpy.int64'>") ) # should return True
print(column_all_dtype(clean_segment_annots_df['policy_segment_id'], str) ) # should return False
print(column_all_dtype(clean_segment_annots_df['contains_synthetic'], "<class 'numpy.bool_'>") ) # should return True

True
False
True


Confirming all my text data is in string format:

In [4]:
column_all_dtype(clean_segment_annots_df['segment_text'], str)

True

This can also be verified by the fact that every cell has the same type (the number of unique types is 1):

In [5]:
clean_segment_annots_df['segment_text'].apply(type).nunique()

1

Perfect.
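
The same check can also be written with `isinstance`, which avoids comparing type names as strings. This is only a minimal sketch on a made-up Series; `all_instances_of` and `demo` are hypothetical names that are not part of this notebook.

```python
import pandas as pd

def all_instances_of(series, expected_type):
    """Return True if every value in the Series is an instance of expected_type."""
    return bool(series.map(lambda value: isinstance(value, expected_type)).all())

# hypothetical data: two strings and one float
demo = pd.Series(["a segment", "another segment", 3.5])

print(all_instances_of(demo, str))           # False, because of the float
print(all_instances_of(demo.iloc[:2], str))  # True, both remaining values are strings
```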

# Setting up normalization functions

For some of the functions below, to see that they worked, I want to check the total length of all strings in the DataFrame before and after running the function. I can use a wrapper function to run these checks around the following functions where appropriate, so that any difference made is visible.

In [6]:
# check total length
def total_length(dataframe):
    """
    Takes in a dataframe with a column named "segment_text" 
    Returns the sum of the string length of each cell in the column
    """
    total_length_all_segments = dataframe["segment_text"].str.len().sum()
    return f"Total length of all segments is {total_length_all_segments}"

In [7]:
clean_segment_annots_df["segment_text"].str.len().sum()

5917918

In [8]:
def check_length(func, dataframe):
    """
    Prints the total text length of the dataframe before and after running the function.
    Inputs: 
    - Any function that can take the dataframe as an input
    - A pandas dataframe
    Actions:
    Passing the dataframe as an argument for each function: 
    Prints the output of the total_length function, 
    then calls the function passed as argument to this check_length function,
    then prints the total_length function again
    Outputs: none
    """
    
    print(f"Before running function: {total_length(dataframe)}")
    
    func(dataframe)
    
    print(f"After running function: {total_length(dataframe)}")
    
    return None
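
The same before-and-after report could also be attached with Python's decorator syntax. The sketch below is a hypothetical variant, not the approach used in this notebook; `report_length_change`, `strip_text` and `demo_df` are illustrative names, and it assumes a "segment_text" column as above.

```python
import functools
import pandas as pd

def report_length_change(func):
    """Decorator: print the total 'segment_text' length before and after func runs."""
    @functools.wraps(func)
    def wrapper(dataframe, *args, **kwargs):
        before = int(dataframe["segment_text"].str.len().sum())
        result = func(dataframe, *args, **kwargs)
        after = int(dataframe["segment_text"].str.len().sum())
        print(f"{func.__name__}: {before} -> {after} characters")
        return result
    return wrapper

@report_length_change
def strip_text(dataframe):
    # illustrative cleaning step: remove leading and trailing whitespace
    dataframe["segment_text"] = dataframe["segment_text"].str.strip()

demo_df = pd.DataFrame({"segment_text": ["  padded  ", "clean"]})
strip_text(demo_df)  # prints the total length before and after
```

The trade-off is that a decorated function always prints, whereas `check_length` lets me choose per call whether to report.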

# Normalize Whitespace

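Using `new_string = " ".join(old_string.split())`.  Called without arguments, `.split()` splits on any run of whitespace (spaces, tabs, newlines and other Unicode whitespace) and discards leading and trailing whitespace.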
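
As a small illustration of this idiom on a single string (the sample text and the name `collapse_whitespace` are made up for the example):

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return " ".join(text.split())

# tabs, newlines, a non-breaking space (\u00a0) and padding at both ends
sample = "  privacy\tpolicy\n\nthis\u00a0policy  "
print(repr(collapse_whitespace(sample)))  # 'privacy policy this policy'
```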


In [9]:
def normalize_whitespace(dataframe):
    """
    Normalizes whitespace in all cells in the "segment_text" column: each run of whitespace becomes a single space and leading/trailing whitespace is removed.
    For verification, call this function within the 'check_length' function.
    Input: Dataframe with a column called "segment_text" that contains strings for the whitespace to be removed.
    Returns: nothing
    """
    
    dataframe["segment_text"] = dataframe["segment_text"].map(lambda x: " ".join(x.split()))
    
    return None

In [10]:
check_length(normalize_whitespace, clean_segment_annots_df)

Before running function: Total length of all segments is 5917918
After running function: Total length of all segments is 5917918


Suspiciously, nothing changed, so I will verify my function before concluding that the privacy policy segments contain no extraneous whitespace.

Verifying the function by testing it on some whitespace:
- creating a dataframe where I have intentionally added whitespace
- calling the function on this dataframe
- confirming the whitespace is removed

In [11]:
text_with_space = 'PRIVACY                                          This.'
verify_whitespace_df = clean_segment_annots_df.copy()
verify_whitespace_df.loc[0, 'segment_text'] = text_with_space
check_length(normalize_whitespace,verify_whitespace_df)

Before running function: Total length of all segments is 5917320
After running function: Total length of all segments is 5917279


I have verified that my function works and so can conclude that the whitespace in the privacy policy segments was already normalized.

# Normalize punctuation

I took the function below from [this Towards Data Science article](https://towardsdatascience.com/text-normalization-7ecc8e084e31).

In [12]:
def _simplify_punctuation(text):
    """
    This function simplifies doubled or more complex punctuation. The exception is '...'.
    """
    
    corrected = str(text)
    corrected = re.sub(r'([!?,;])\1+', r'\1', corrected)
    corrected = re.sub(r'\.{2,}', r'...', corrected)
    return corrected

In [13]:
def remove_duplicate_punctuation(dataframe):
    
    dataframe["segment_text"] = dataframe["segment_text"].map(_simplify_punctuation)
    
    return None

In [14]:
check_length(remove_duplicate_punctuation, clean_segment_annots_df)

Before running function: Total length of all segments is 5917918
After running function: Total length of all segments is 5917879


Only 39 characters were removed, which is a small number, as expected.

Further punctuation normalization, such as converting other characters to their standard English equivalents (e.g. the opening quotation mark “ to ", or the ellipsis … to ...), would be ideal. The omission of this should not affect the sentence filtering much, and won't have any effect on the tf-idf matrix, because it ignores punctuation.
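
For reference, a conversion of this kind could look roughly like the sketch below. It is not applied in this notebook; the mapping covers only a few common characters, and `PUNCTUATION_MAP` and `standardize_punctuation` are hypothetical names.

```python
# map a few typographic characters to plain ASCII equivalents (not exhaustive)
PUNCTUATION_MAP = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2026": "...",  # horizontal ellipsis
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
})

def standardize_punctuation(text):
    """Replace the mapped typographic characters with their ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)

print(standardize_punctuation("\u201cpersonal data\u201d\u2026"))  # "personal data"...
```

A step like this would need to run before the non-ASCII removal, because otherwise these characters are deleted rather than converted.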

# Remove non-ASCII characters

This can be done by checking whether each character has a Unicode code point below 128, as ASCII characters occupy code points 0 to 127.  The code point of a character is found with `ord(char)`.

In [15]:
def remove_non_ascii(string):
    """
    I found this function on this website: https://bobbyhadz.com/blog/python-remove-non-ascii-characters-from-string
    """
    return ''.join(char for char in string if ord(char) < 128)

# demonstrate function:
print(remove_non_ascii('a€bñcá')) # >> 'abc'
print(remove_non_ascii('a_b^0')) # >> a_b^0

abc
a_b^0
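
To make the code point test concrete, the sketch below prints the code point of each character in the demonstration string, and shows an equivalent way to drop non-ASCII characters using the built-in codec. The variable names are illustrative only.

```python
text = "a€bñcá"

# show each character, its code point, and whether it counts as ASCII
for char in text:
    print(repr(char), ord(char), ord(char) < 128)

# equivalent removal: encode to ASCII, silently dropping anything that does not fit
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)  # abc
```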


In [16]:
def remove_nonASCII_chars(dataframe):
        
    dataframe["segment_text"] = dataframe["segment_text"].map(remove_non_ascii)
    
    return None

In [17]:
check_length(remove_nonASCII_chars, clean_segment_annots_df)

Before running function: Total length of all segments is 5917879
After running function: Total length of all segments is 5901658


16,221 characters were removed, about 0.27% of all characters.  I hope that this makes at least a small difference for model performance.

# Make all policy text lowercase

In [18]:
def convert_to_lowercase(dataframe):

    dataframe["segment_text"] = dataframe["segment_text"].str.lower()
    
    return None

# verify
sample_df = pd.DataFrame(["ifiUFIWUNFIijnf"], columns=["segment_text"])
convert_to_lowercase(sample_df)
display(sample_df)

Unnamed: 0,segment_text
0,ifiufiwunfiijnf


In [19]:
convert_to_lowercase(clean_segment_annots_df)
clean_segment_annots_df['segment_text'].head(3) # verify

0    privacy policy this privacy policy (hereafter ...
1    1. about our products 1.1 our products offer a...
2    2. the information we collect the information ...
Name: segment_text, dtype: object

It can be seen that the text has been changed to lowercase.

**Save the above cleaned dataframe**

In [20]:
clean_segment_annots_df.to_pickle('objects/clean_segment_annots_df.pkl')

---

# Same pre-processing steps for Crafted Features

I will need to populate the segment dataframe with crafted features, but if the segments and the crafted features are in different formats, it will be more difficult to do so.  So I will check and standardize the format of the crafted features too.

I save the crafted features and the annotations that they refer to in the variable `annotation_features`.

Load the list of all the crafted features:

In [21]:
annotation_features = pd.read_pickle('objects/annotation_features.pkl')
list_all_crafted_features = [feature for row in annotation_features['features'] for feature in row]
len(list_all_crafted_features) # verify – should be 579 crafted features

579

These crafted features are stored as lists of phrases rather than in a `segment_text` column, so I cannot apply my functions to them directly. I won't check for doubled punctuation. To find uppercase letters and most non-ASCII characters, I only need to check which characters are not lowercase letters. (This check would miss lowercase non-ASCII letters such as 'é', because `islower()` returns True for them.)

Checking for any characters that are not lowercase letters:

In [22]:
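def non_asciis():
    """
    Despite its name, returns every distinct character in the crafted features
    that is not a lowercase letter (uppercase letters, punctuation, whitespace, etc.).
    """
    list_of_chars = []
    for ft in list_all_crafted_features:
        for char in ft:
            if not char.islower() and char not in list_of_chars:
                list_of_chars.append(char)
    return list_of_chars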
print(non_asciis())

[' ', '.', ',', '-', '\xa0', '/', 'S', 'N', 'U', 'T', 'P', 'A', '(', ')', 'I', "'"]


By inspecting this list I can see that the only characters that I don't expect are the uppercase letters and "\xa0", which is a non-breaking space.  "\xa0" is itself a non-ASCII character (code point 160), but whitespace normalization replaces it with an ordinary space, so I don't need a separate non-ASCII removal step.
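
A stricter check would test for non-ASCII characters directly rather than inferring them from the non-lowercase list. The sketch below is an assumption about how that could be done, run on made-up features; `unexpected_characters` and `sample_features` are hypothetical names.

```python
def unexpected_characters(features):
    """Collect characters that are non-ASCII or uppercase, in order of first appearance."""
    found = []
    for feature in features:
        for char in feature:
            if (not char.isascii() or char.isupper()) and char not in found:
                found.append(char)
    return found

# hypothetical features, not taken from the real crafted feature list
sample_features = ["contact info", "GPS\xa0location", "café data"]
print(unexpected_characters(sample_features))  # ['G', 'P', 'S', '\xa0', 'é']
```

Unlike the `islower()` test, this also flags the lowercase 'é'. `str.isascii()` requires Python 3.7 or later.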

I also noticed while manually browsing the features that Bluetooth was not listed because it had been incorrectly entered as 'bluethooth', so I will correct that now too.

In [23]:
list_all_crafted_features = [feature for row in annotation_features['features'] for feature in row]
"bluethooth" in list_all_crafted_features

True

Correcting typo and normalizing text:

In [24]:
for _row in range(len(annotation_features)):
    crafted_feature_list = annotation_features.at[_row, 'features']

    # correct "Bluetooth"
    new_crafted_feature_list = ["bluetooth" if feature=="bluethooth" else feature for feature in crafted_feature_list]
    
    # change to lowercase
    new_crafted_feature_list = [feature.lower() for feature in new_crafted_feature_list]
    
    # normalize whitespace
    new_crafted_feature_list = [" ".join(feature.split()) for feature in new_crafted_feature_list]

    
    annotation_features.at[_row, 'features'] = new_crafted_feature_list

Checking again that the bluetooth typo was corrected:

In [25]:
list_all_crafted_features = [feature for row in annotation_features['features'] for feature in row]
"bluethooth" in list_all_crafted_features

False

Checking again for non-lowercase characters:

In [26]:
print(len(list_all_crafted_features)) # verify – should be 579 crafted features

579


In [27]:
print(non_asciis())

[' ', '.', ',', '-', '/', '(', ')', "'"]


All problematic characters are now removed so this list of features can be used for modelling.

In [28]:
annotation_features.to_pickle('objects/clean_annotation_features.pkl')
confirm_save_0 = pd.read_pickle('objects/clean_annotation_features.pkl')
print(annotation_features.shape == confirm_save_0.shape)
print(confirm_save_0.equals(annotation_features))

True
True


---

# Append crafted features to dataframe

## Load crafted features

The next steps are to:
- 1. Append each feature as a column to the dataframe
- 2. Populate the columns

Then I can move to modelling.

I already have a function to help with 1 called `add_empty_annotation_columns`.  I just need to put the new features into a list.

First though, I want to check whether any of the features are duplicates.

In [29]:
clean_annotation_features = pd.read_pickle('objects/clean_annotation_features.pkl')
clean_segment_annots_df = pd.read_pickle('objects/clean_segment_annots_df.pkl')

In [30]:
list_all_crafted_features = [feature for row in clean_annotation_features['features'] for feature in row]

In [31]:
all_features = []
duplicate_features = []
for feature in list_all_crafted_features:
    if feature in all_features:
        duplicate_features.append(feature)
    all_features.append(feature)
len(duplicate_features)

103

Oddly, a lot of the features are exactly the same. I will remove these duplicates after adding them to the dataframe.
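
To see which phrases repeat and how often, `collections.Counter` could be used instead of the loop. This is a minimal sketch on a made-up list; `count_repeats` and `sample` are hypothetical names.

```python
from collections import Counter

def count_repeats(features):
    """Return a dict of each feature that appears more than once and its count."""
    counts = Counter(features)
    return {feature: n for feature, n in counts.items() if n > 1}

# hypothetical feature list with one phrase listed three times
sample = ["contact info", "phone book", "contact info", "gps", "contact info"]
repeats = count_repeats(sample)
extra_copies = sum(n - 1 for n in repeats.values())

print(repeats)       # {'contact info': 3}
print(extra_copies)  # 2, the number of redundant entries
```

Here `extra_copies` counts the same thing as `len(duplicate_features)` in the loop above: every occurrence after the first.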

## Add crafted features columns to df

In [32]:
print(len(list_all_crafted_features))
print(clean_segment_annots_df.shape)
crafted_features_df = priv_pol_funcs.add_empty_annotation_columns(clean_segment_annots_df, list_all_crafted_features) 

579
(15543, 41)
The shape of the returned dataframe is (15543, 620)


Verify that the features have been added:

In [33]:
crafted_features_df.iloc[:,40:].head(2)

Unnamed: 0,NOT_PERFORMED,contact info,contact details,contact data,"e.g., your name",contact you,your contact,"identify, contact",identifying information,"your name, address, and e-mail address",...,never be acquired,never be viewed,never be located,never be asked,never be utilized,never be requested,never be transmitted,never be communicated,nor do we collect,does not tell us
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We can see that the crafted features occupy every column from index 41 to the end.  Now I will remove the duplicate feature columns before populating them all. I expect 103 columns to be removed, bringing `crafted_features_df` down from 620 to 517 columns.

In [34]:
crafted_features_df = crafted_features_df.loc[:,~crafted_features_df.columns.duplicated()] # remove columns with duplicate names
crafted_features_df.shape

(15543, 517)

Perfect, the right number of columns have been removed.
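
As a small illustration of how `columns.duplicated()` behaves, the sketch below applies the same expression to a toy dataframe with one repeated column name. The data is made up.

```python
import pandas as pd

# toy dataframe in which the column name 'gps' appears twice
demo_df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["segment_text", "gps", "contact info", "gps"],
)

# duplicated() marks every repeat after the first occurrence; ~ inverts the mask
deduplicated = demo_df.loc[:, ~demo_df.columns.duplicated()]

print(list(deduplicated.columns))  # ['segment_text', 'gps', 'contact info']
```

Only the first column with each name is kept, which is safe here because the feature columns are all still empty at this point.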

Now to populate the crafted features columns, I will:

- take the column name for each crafted feature
- take the segment text for each row
- if the column name appears in the segment text, put 1.

In [35]:
%%time

all_rows = range(len(crafted_features_df)) # index of rows to loop through

for column_number in range(41, 517): # Looping through each column with a feature

    column_name = crafted_features_df.columns[column_number] # for that column feature

    for row in all_rows: # and for every row
        if column_name in crafted_features_df.at[row, "segment_text"]: # if the segment has that feature
            crafted_features_df.at[row, column_name] = 1 # make the value for that feature on that row equal 1
    
    print(f"Processing {column_number}/517", end="\r")

CPU times: user 28.6 s, sys: 203 ms, total: 28.8 s
Wall time: 28.9 s
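
The nested loop could probably be replaced by a vectorized substring test per column. The sketch below is an untimed alternative on a toy dataframe, not the code that produced the results in this notebook; `demo_df` and its contents are made up.

```python
import pandas as pd

# toy dataframe: one text column and three empty feature columns
demo_df = pd.DataFrame({
    "segment_text": ["we read your phone book.", "we collect your location."],
    "phone book": [0, 0],
    "location": [0, 0],
    "address book": [0, 0],
})

for column_name in ["phone book", "location", "address book"]:
    # regex=False gives a plain substring test, matching the `in` operator,
    # so characters such as '.' or '(' in a feature are treated literally
    demo_df[column_name] = (
        demo_df["segment_text"].str.contains(column_name, regex=False).astype("int8")
    )

print(demo_df[["phone book", "location", "address book"]])
```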


In [36]:
# looking at some of the results to verify
summations = crafted_features_df.iloc[:,41:].sum()
print(f"{(summations==0).sum()} features have not been populated")

133 features have not been populated


This seems like a lot of empty columns, so I manually looked through the results and checked the source text. The crafted feature columns that haven't been populated are mostly:
- unusual ways of typing a phrase (example: 'post code' instead of 'postcode')
- specific phrases for uncommon data practices (example: 'exact device location')
- negative phrases (example: 'never be requested')

The researchers presumably included these features to capture phrases that don't appear in their dataset but could appear when applying their model beyond their training data.

Overall this seems roughly correct so I will use it for modelling.

<font size="3"> **Some final tidying:** </font>

Confirming that every number across the target and feature columns equals 0 or 1:

In [37]:
((crafted_features_df.iloc[:,7:] == 0) # each cell equals 0 (parenthesized because | binds tighter than ==)
 | (crafted_features_df.iloc[:,7:] == 1) # or each cell equals 1
).nunique().nunique() # 1 means every column holds the same single boolean value (expected: True)

1

Alternate method:

In [42]:
crafted_features_df.iloc[:,7:].isin([0, 1]).all().all()

True

Changing the dtype of those same columns to int8:

In [154]:
print(f"The dtypes are:")
display(crafted_features_df.iloc[:,7:].dtypes.value_counts())

subset_df = crafted_features_df.iloc[:,7:].copy()
subset_df = subset_df.astype('int8')
crafted_features_df.iloc[:,7:] = subset_df

print(f"Now the dtypes are:")
display(crafted_features_df.iloc[:,7:].dtypes.value_counts())

The dtypes are:


int64      482
float64     28
dtype: int64

Now the dtypes are:


int8    510
dtype: int64
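
Depending on the pandas version, assigning back through `.iloc` may keep the original column dtypes rather than adopting `int8`. A dtype mapping passed to `astype` is one way to make the conversion explicit. This is a sketch on a toy dataframe; `demo_df` and `binary_columns` are hypothetical names.

```python
import pandas as pd

# toy dataframe: a text column, a float target and an int feature
demo_df = pd.DataFrame({
    "segment_text": ["a", "b"],
    "target": [0.0, 1.0],
    "feature": [1, 0],
})

# convert every column after the first to int8, leaving the text column alone
binary_columns = demo_df.columns[1:]
demo_df = demo_df.astype({column: "int8" for column in binary_columns})

print(demo_df.dtypes)
```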

# Saving the dataframe to be used for modelling

As before, to make it faster to load this dataframe in this notebook and others, I will save this dataframe as a pickle file.  This allows the code below to be run without waiting for the code above.

In [155]:
crafted_features_df.to_pickle('objects/crafted_features_df.pkl')

Verifying that the file was correctly saved and can be imported properly:

In [156]:
confirm_save_5 = pd.read_pickle('objects/crafted_features_df.pkl')
print(crafted_features_df.shape == confirm_save_5.shape)
print(confirm_save_5.equals(crafted_features_df))

True
True


---

I now have a dataframe with cleaned text, crafted features, and all targets.  Now I will pass these into a modelling pipeline to follow the steps of Story et al.