# Project Final Report 

Title: Baselines for Android API Method to Privacy Policy Phrase Mapping<br>
Student Id: luu797<br>
Course: (either CS6463 or CS4973)<br>
Index:<br>

## Overview
[Background](#Background)<br>
[Research Objective](#Research-Objective)<br>
[Data Description](#Data-Description)<br>
[Data Exploration](#Data-Exploration)<br>
- [NLTK comparison with SpaCy and Gensim](#NLTK-comparison-with-SpaCy-and-Gensim)
- [Exploration of dataset properties using SpaCy](#Exploration-of-dataset-properties-using-SpaCy)
- [Dataset Visualization](#Dataset-Visualization)

[Modeling](#Modeling)<br>
- [Analysis of modelling results](#Analysis-of-modelling-results)

[Summary](#Summary)<br>

## Background
My research lab works on privacy and software issues, and we are currently attempting to automate the verification of privacy policy adherence in Android applications. To do that, a critical mapping must be obtained: we must be able to map methods in the Android API to important words and phrases in privacy policies. This was previously done manually, but my project in the lab is to find a way to automate creating this mapping. My current model is performing well, but I need to compare it with some baseline methods. That is what I will do in this project.

## Research Objective

Objective statement:
- I will create baseline classification standards to predict whether an Android API method manipulates/returns information associated with data types defined in privacy policies by using classical sparse vector space representations of the data with standard linear and non-linear classifiers.
- As a secondary objective, I will contrast using NLTK for text processing and data exploration with using the newer spaCy and gensim packages.

## Data Description

I will be using data I scraped from the Android API documentation on this page (https://developer.android.com/reference/packages). Previously my lab published work based on API levels 24 and below, so I will only be using that data. Also, only a subset of this data was annotated so I will list the annotated data's attributes separately. I am including the unannotated data because I will use it for initializing word embeddings, which should help the spaCy model work better.

After annotating, my lab found that very few of the methods mapped to privacy policy data types, so the last category contains only information about these methods. You can clearly see that there is not a lot of data to learn from. This is why I hypothesized that a pre-trained neural model would work well. However, the model would still most likely be used in conjunction with statistical data to make the final classifier. Before then, though, I must compare different baseline text models directly. For reference, my newer deep learning model has been getting on average ~55% micro f1 and ~44% macro f1 with this data, and about 10% higher if only the classes with higher support are used.

To augment the data I transformed each method's documentation record by splitting it into sentences and adding those as independent instances. I also added a record for each method based on the method's name. This transformation added a lot of noise to the dataset, but it was necessary for any meaningful evaluation.
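
The transform described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the regex sentence splitter stands in for the real sentence tokenizer, the camelCase splitting of the method name is a guess at how a name-based record could be built, and the names `expand_method` and `split_camel_case` are hypothetical.

```python
import re

# Naive stand-in for a real sentence tokenizer (assumption): break after
# '.', '!' or '?' when followed by whitespace and an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_camel_case(name):
    """Turn 'requestLocationUpdates' into 'request Location Updates'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


def expand_method(qualified_name, documentation, labels):
    """Return one name-based record plus one record per documentation sentence.

    Every record inherits the labels of the whole method, which is the
    independence assumption behind this kind of augmentation.
    """
    simple_name = qualified_name.rsplit(".", 1)[-1]
    records = [{
        "method": qualified_name,
        "text": split_camel_case(simple_name),
        "labels": list(labels),
    }]
    for sentence in _SENTENCE_BOUNDARY.split(documentation.strip()):
        if sentence:
            records.append({
                "method": qualified_name,
                "text": sentence,
                "labels": list(labels),
            })
    return records


example_records = expand_method(
    "android.location.LocationManager.requestLocationUpdates",
    "Register for location updates. Requires permission.",
    ["location"],
)
```

Because every sentence inherits the labels of its method, sentences that say nothing about the labeled data type become noisy positive examples, which is the noise mentioned above.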

[Overview](#Overview)

## Data Exploration

This section is split into two main parts: [NLTK comparison with SpaCy and Gensim](#NLTK-comparison-with-SpaCy-and-Gensim) and [Exploration of dataset properties using SpaCy](#Exploration-of-dataset-properties-using-SpaCy).

[Overview](#Overview)

### NLTK comparison with SpaCy and Gensim
Here I will show how several common NLP tasks are accomplished (in the context of data exploration) with NLTK and with newer packages like SpaCy and Gensim. I will compare each using my largest dataset and conclude with an analysis of the different approaches.<br><br>
[Overview](#Overview)

#### Sentence Tokenization With NLTK and Gensim
Total number of documents: 29,402<br>
Number of differences: 553<br>

|   | NLTK |  Gensim|
|-----|-----|-----|
|Number of sentences: |73,109 | 73,154|
|Avg. #tokens of differences:  | 19.443 | 20.671|
|Pct. documents with the most sentences: |   47.20% |52.80%|
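
Counts like those in the table can be gathered with a small comparison harness. The sketch below is an assumption about how such a comparison might be set up, not the code used for this report: `compare_segmenters` is a hypothetical helper, and the two regex splitters are simple stand-ins for the NLTK and Gensim sentence tokenizers.

```python
import re


def compare_segmenters(documents, segment_a, segment_b):
    """Count sentences per segmenter and the documents on which they disagree."""
    stats = {"sentences_a": 0, "sentences_b": 0,
             "differing_docs": 0, "a_has_more": 0, "b_has_more": 0}
    for doc in documents:
        sents_a, sents_b = segment_a(doc), segment_b(doc)
        stats["sentences_a"] += len(sents_a)
        stats["sentences_b"] += len(sents_b)
        if sents_a != sents_b:
            stats["differing_docs"] += 1
            if len(sents_a) > len(sents_b):
                stats["a_has_more"] += 1
            elif len(sents_b) > len(sents_a):
                stats["b_has_more"] += 1
    return stats


def split_on_period(text):
    # Stand-in segmenter A: break after every period followed by whitespace.
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def split_before_capital(text):
    # Stand-in segmenter B: additionally require an uppercase letter next.
    return [s for s in re.split(r"(?<=\.)\s+(?=[A-Z])", text.strip()) if s]


docs = [
    "This method was deprecated in API level 21. use X instead. Starts a scan.",
    "Transfer the session. Only sessions can be transferred.",
]
segmentation_stats = compare_segmenters(docs, split_on_period, split_before_capital)
```

Passing the real tokenizers as `segment_a` and `segment_b` would leave the counting logic unchanged.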


EXAMPLES OF SENTENCE TOKENIZATION DIFFERENCES:

ORIGINAL: Called by a device admin to set the short support message. This will be displayed to the user in settings screens where funtionality has been disabled by the admin. The message should be limited to a short statement such as "This setting is disabled by your administrator. Contact someone@example.com for support." If the message is longer than 200 characters it may be truncated. If the short support message needs to be localized it is the responsibility of the DeviceAdminReceiver to listen to the Intent#ACTION_LOCALE_CHANGED broadcast and set a new version of this string accordingly.<br>
NLTK: Contact someone@example.com for support<br>
GENSIM: Contact someone@example.com for support." If the message is longer than 200 characters it may be truncated<br>

ORIGINAL: This method was deprecated in API level 21. use BluetoothLeScanner#startScan(List ScanSettings ScanCallback) instead. Starts a scan for Bluetooth LE devices looking for devices that advertise given services. Devices which advertise all specified services are reported using the LeScanCallback#onLeScan callback. Requires Manifest.permission.BLUETOOTH_ADMIN<br>
NLTK: This method was deprecated in API level 21. use BluetoothLeScanner#startScan(List ScanSettings ScanCallback) instead<br>
GENSIM: This method was deprecated in API level 21<br>

ORIGINAL: Transfer the session to a new owner. Only sessions that update the installing app can be transferred. After the transfer to a package with a different uid all method calls on the session will cause SecurityException s. Once this method is called the session is sealed and no additional mutations beside committing it may be performed on the session.<br>
NLTK: After the transfer to a package with a different uid all method calls on the session will cause SecurityException s. Once this method is called the session is sealed and no additional mutations beside committing it may be performed on the session<br>
GENSIM: After the transfer to a package with a different uid all method calls on the session will cause SecurityException s<br>




#### Tokenization With NLTK and SpaCy
One of the main advantages of SpaCy is that the intuitive and well documented API allows for simple extensions to most pieces of the pipeline. For this dataset it was important to detect methods, classes, and variables as much as possible. In order to do this I added a few rules to the tokenization scheme. This is why methods like 'setStructuredData(String)' were tokenized as one token.<br>
Tokenization differences<br>
Number of documents with different tokenization: 18245 Fraction of total: 24.94%

||NLTK |SpaCy|
|-----|---|---|
| Number of tokens: | 1,191,458   |   1,166,787|
| Avg. tokens per sentence:  | 16.287|      15.950|
| Number of tokens of differences:  |    400,218 |       375,547|
|Avg. #tokens of differences:  |       21.936    |    20.584|

Examples of differences:

NLTK:
['Can', 'be', 'modified', 'in-place']<br>
SpaCy:
['Can', 'be', 'modified', 'in', '-', 'place']<br>

NLTK:
['Returns', 'the', 'current', 'setStructuredData', '(', 'String', ')']<br>
SpaCy:
['Returns', 'the', 'current', 'setStructuredData(String)']<br>

NLTK:
['Return', 'a', 'Bundle', 'containing', 'optional', 'vendor-specific', 'extension', 'information']<br>
SpaCy:
['Return', 'a', 'Bundle', 'containing', 'optional', 'vendor', '-', 'specific', 'extension', 'information']<br>
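
To illustrate the kind of rule that keeps a method signature together, here is a minimal sketch that uses only Python's `re` module instead of spaCy's tokenizer API. The pattern and the `tokenize` helper are assumptions made for illustration and are not the rules used in this project.

```python
import re

# Alternatives are tried in order, so code-like tokens win over plain words.
TOKEN_PATTERN = re.compile(
    r"""
    [A-Za-z_][\w.#]*\([^()\s]*\)            # name with argument list: setStructuredData(String)
    | [A-Za-z_]\w*(?:[.#][A-Za-z_]\w*)+     # qualified name: ViewStructure.setId
    | \w+                                   # ordinary word or number
    | [^\w\s]                               # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, keeping method signatures and qualified names whole."""
    return TOKEN_PATTERN.findall(text)


method_tokens = tokenize("Returns the current setStructuredData(String)")
hyphen_tokens = tokenize("Can be modified in-place")
```

This sketch does not handle argument lists that contain spaces, such as `startScan(List ScanSettings ScanCallback)`; those would need an extra rule.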

#### POS tagging with NLTK and SpaCy
NLTK and SpaCy use different schemes to tag parts of speech, so it is difficult to compare them directly. Here I will just show some examples to give a qualitative intuition about the differences.<br>

NLTK:
[('Can', 'MD'), ('be', 'VB'), ('modified', 'VBN'), ('in-place', 'NN')]<br>
SpaCy:
[('Can', 'VERB'), ('be', 'AUX'), ('modified', 'VERB'), ('in', 'ADP'), ('-', 'PUNCT'), ('place', 'NOUN')]<br>

NLTK:
[('Requires', 'NNS'), ('android.Manifest.permission.MANAGE_ACCOUNTS', 'VBP')]<br>
SpaCy:
[('Requires', 'VERB'), ('android.Manifest.permission.MANAGE_ACCOUNTS', 'PROPN')]<br>

NLTK:
[('See', 'VB'), ('ViewStructure.setId', 'NNP'), ('for', 'IN'), ('more', 'JJR'), ('information', 'NN')]<br>
SpaCy:
[('See', 'VERB'), ('ViewStructure.setId', 'PROPN'), ('for', 'ADP'), ('more', 'ADJ'), ('information', 'NOUN')]<br>

#### NER tagging with NLTK and SpaCy
Similar to POS tagging, the schemes for NLTK and SpaCy are a bit different for named-entity recognition so I will simply display some representative examples of differences. SpaCy's API includes easy examples for adding domain specific rules, so I added one to detect Android methods and classes. These are marked with 'MT_OR_CL'.

NLTK:
[('Parcelable', 'ORGANIZATION')]<br>
SpaCy:
[('Parcelable', 'ORG')]<br>

NLTK:
[('writeToParcel', 'ORGANIZATION'), ('CONTENTS_FILE_DESCRIPTOR', 'ORGANIZATION')]<br>
SpaCy:
[('CONTENTS_FILE_DESCRIPTOR', 'MT_OR_CL')]<br>

NLTK:
[('setClipData', 'ORGANIZATION'), ('ClipData', 'ORGANIZATION')]<br>
SpaCy:
[('setClipData(ClipData)', 'MT_OR_CL')]<br>
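
A domain-specific rule of this kind can be approximated with a pattern over identifier shapes. The sketch below is a plain-regex stand-in and not the spaCy rule used for the results above; the pattern, the function name `tag_domain_entities`, and the choice of shapes (camelCase names and UPPER_SNAKE_CASE constants) are assumptions.

```python
import re

DOMAIN_ENTITY = re.compile(
    r"\b(?:"
    r"[a-z]+(?:[A-Z][a-z0-9]*)+(?:\([^()\s]*\))?"  # lowerCamelCase, optional (Args)
    r"|[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]*)+"          # UpperCamelCase with two or more humps
    r"|[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+"              # UPPER_SNAKE_CASE constant
    r")"
)


def tag_domain_entities(text):
    """Return (span, label) pairs for tokens that look like Android identifiers."""
    return [(match.group(0), "MT_OR_CL") for match in DOMAIN_ENTITY.finditer(text)]


parcel_entities = tag_domain_entities(
    "Flatten this object via writeToParcel using CONTENTS_FILE_DESCRIPTOR"
)
clip_entities = tag_domain_entities("Returns the current setClipData(ClipData)")
```

A shape-based rule like this misses single-word class names such as `Parcelable`, so it would complement a statistical tagger instead of replacing it.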

### Exploration of dataset properties using SpaCy 
In this section we will explore the differences between the subsets of our data. As a reminder, the sets are (from largest to smallest): the full set of unannotated data, the annotated data, and the data that positively map to privacy policy terms. <br><br>
[Overview](#Overview)

Info for full dataset:<br>
   >-total number of methods: 25954 <br>
	  -total records after transform: 73154 <br>
	  -number of unique records after transform: 45199 <br>
	  -method with most sentences: ('android.junit.framework.Assert.assertEquals', 19) <br>
	  -method with most tokens: ('android.java.util.logging.SimpleFormatter.format', 171) <br>
	  -total number of tokens: 1166787 <br>
	  -num unique tokens: 24996 <br>
	  -most common tokens (with 5 or more chars): [('method', 13779), ('object', 7005), ('value', 6571)] <br>
	  -most frequent POS tag: NOUN <br>
	  -most common words in that tag: ('method', 13702) <br>
	  -most frequent proper noun: ('API', 1234) <br>
	  -number of unique domain-specific named entities: 25078 <br>
	  -number of unique domain-specific named entities: 6649 <br>
	  -most frequent domain-specific named entity: ('hashCode', 785) <br>

Info for annotated dataset:<br>
   >-total number of methods: 2989 <br>
	 -total records after transform: 9590 <br>
	 -number of unique records after transform: 7787 <br>
	 -method with most sentences: ('android.graphics.Bitmap.createBitmap', 10) <br>
	 -method with most tokens: ('android.text.format.DateUtils.formatDateRange', 128) <br>
	 -total number of tokens: 145394 <br>
	 -num unique tokens: 8167 <br>
	 -most common tokens (with 5 or more chars): [('method', 1179), ('value', 556), ('should', 533)] <br>
	 -most frequent POS tag: NOUN <br>
	 -most common words in that tag: ('method', 1177) <br>
	 -most frequent proper noun: ('API', 312) <br>
	 -number of unique domain-specific named entities: 3442 <br>
	 -number of unique domain-specific named entities: 1459 <br>
	 -most frequent domain-specific named entity: ('WebView', 104) <br>

Info for positively mapped method dataset:<br>
   >-total number of classes: 76 <br>
	 -total number of methods: 104 <br>
	 -total records after transform: 392 <br>
	 -number of unique records after transform: 317 <br>
	 -method with most sentences: ('android.hardware.SensorManager.registerListener', 6) <br>
	 -method with most tokens: ('android.location.LocationManager.requestLocationUpdates', 50) <br>
	 -total number of tokens: 6009 <br>
	 -num unique tokens: 1139 <br>
	 -most common tokens (with 5 or more chars): [('device', 53), ('location', 51), ('method', 50)] <br>
	 -most frequent POS tag: NOUN <br>
	 -most common words in that tag: ('device', 53) <br>
	 -most frequent proper noun: ('API', 25) <br>
	 -number of unique domain-specific named entities: 163 <br>
	 -number of unique domain-specific named entities: 73 <br>
	 -most frequent domain-specific named entity: ('SensorEventListener', 12) <br>
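
Several of the counts listed above reduce to simple frequency tables. The following sketch shows one way to compute them with `collections.Counter`; the record layout (a method name paired with a list of tokens) and the helper name `summarize_records` are assumptions, and the POS and entity statistics are left out because they depend on a tagger.

```python
from collections import Counter


def summarize_records(records, min_chars=5, top_n=3):
    """Summarize (method_name, tokens) records produced by the transform."""
    token_counts = Counter()
    records_per_method = Counter()
    for method, tokens in records:
        token_counts.update(tokens)
        records_per_method[method] += 1
    # Keep only tokens with at least min_chars characters for the "most common" list.
    long_tokens = Counter(
        {tok: n for tok, n in token_counts.items() if len(tok) >= min_chars}
    )
    return {
        "total_methods": len(records_per_method),
        "total_records": len(records),
        "unique_records": len({tuple(tokens) for _, tokens in records}),
        "total_tokens": sum(token_counts.values()),
        "unique_tokens": len(token_counts),
        "most_common_long_tokens": long_tokens.most_common(top_n),
        "method_with_most_records": records_per_method.most_common(1)[0],
    }


toy_records = [
    ("a.B.getLocation", ["Returns", "the", "device", "location"]),
    ("a.B.getLocation", ["Requires", "location", "permission"]),
    ("a.C.getId", ["Returns", "the", "device", "location"]),
]
summary = summarize_records(toy_records, top_n=1)
```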



### Dataset Visualization
The data transform I applied was more or less a data augmentation technique. It requires an independence assumption (which is not likely to hold), but it increases the number of training instances. This plot illustrates the increases.

[Overview](#Overview)
<img src="plot1.png">
Just to satisfy our curiosity let's see what the most common words are in our different data sets. Notice that the most common words in the 'positively mapped methods' data set seem to be a little more related to privacy (e.g. location, network).
<img src="plot2.png">
Being an API, the Android documentation used many words that were unique to that corpus. These included class names, method names, and variables. This plot shows the most common of these for each data subset.
<img src="plot3.png">
This plot shows the proportion of unique domain-specific tokens (DS ENTITY) relative to unique tokens of other POS tags. The domain-specific tokens make up a significant number of the overall unique tokens.
<img src="plot4.png">



## Modeling

For modelling I used a Linear Support Vector Classifier with tf-idf weighting because it is generally considered the best performer of the standard linear classifiers (at least for text). Due to severe class imbalance, a proper train, dev, test split is not possible. Instead, I will display the results of k-fold cross-validation runs. As a reminder, with my current neural architecture I am getting .55 micro f1 and .44 macro f1.<br><br>
[Overview](#Overview)
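
A baseline of this shape can be assembled from standard scikit-learn components. The sketch below is a minimal illustration and not the exact setup behind the reported scores: the toy texts, the single-label framing, the `none` class, and the choice to score only the privacy-related classes are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (assumption): four records per class so 4-fold stratification works.
texts = [
    "Returns the last known location of the device",
    "Register for location updates from the given provider",
    "Returns the latitude of this location in degrees",
    "Removes all location updates for the listener",
    "Returns details about the currently active network",
    "Returns connection status information about all network types",
    "Requests a network that satisfies the given capabilities",
    "Reports a network connectivity problem to the framework",
    "Draws the bitmap using the specified paint",
    "Returns the width of the view in pixels",
    "Formats the date range as a string",
    "Compares this object with the specified object",
]
labels = ["location"] * 4 + ["network"] * 4 + ["none"] * 4

# tf-idf features feeding a linear SVM, evaluated with 4-fold cross-validation.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
predicted = cross_val_predict(model, texts, labels, cv=folds)

# Score only the privacy-related classes (assumption): with a negative class
# excluded, micro precision and micro recall are no longer forced to be equal.
target_classes = ["location", "network"]
micro_f1 = f1_score(labels, predicted, labels=target_classes,
                    average="micro", zero_division=0)
macro_f1 = f1_score(labels, predicted, labels=target_classes,
                    average="macro", zero_division=0)
```

Repeating the cross-validation with different `random_state` values for the folds and averaging the scores would mirror the repeated 4-fold runs reported here.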

4-fold results:<br>
RUN 1<br>
micro precision: 0.8137  micro recall: 0.3873  micro f1: 0.5239<br>
macro precision: 0.4984  macro recall: 0.2950  macro f1: 0.3411<br>

4-fold results:<br>
RUN 2<br>
micro precision: 0.8125  micro recall: 0.3840  micro f1: 0.5195<br>
macro precision: 0.4996  macro recall: 0.3187  macro f1: 0.3600<br>

4-fold results:<br>
RUN 3<br>
micro precision: 0.7932  micro recall: 0.3564  micro f1: 0.4878<br>
macro precision: 0.4555  macro recall: 0.2686  macro f1: 0.3141<br>

4-fold results:<br>
RUN 4<br>
micro precision: 0.8089  micro recall: 0.3574  micro f1: 0.4954<br>
macro precision: 0.4173  macro recall: 0.2410  macro f1: 0.2907<br>

Average of 4 4-fold runs<br>
micro precision: 0.8071  micro recall: 0.3713  micro f1: 0.5066 <----- FINAL MICRO SCORES<br>
macro precision: 0.4677  macro recall: 0.2808  macro f1: 0.3265 <----- FINAL MACRO SCORES<br>
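
For readers less familiar with the two averaging schemes, the sketch below shows how micro and macro scores are derived from per-class counts. The counts are invented purely for illustration and the helper names are hypothetical. Micro averaging pools true positives, false positives, and false negatives over all classes, while macro averaging scores each class separately and then takes the unweighted mean, so rare classes weigh as much as frequent ones.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and f1 from raw counts, treating 0/0 as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def micro_macro(per_class_counts):
    """per_class_counts maps class name -> (tp, fp, fn)."""
    counts = list(per_class_counts.values())
    # Micro: pool the counts over all classes, then score once.
    micro = precision_recall_f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: score every class on its own, then take the unweighted mean.
    per_class = [precision_recall_f1(*c) for c in counts]
    macro = tuple(sum(s[i] for s in per_class) / len(per_class) for i in range(3))
    return micro, macro


# Invented counts: one well-supported class and one rare, poorly predicted class.
toy_counts = {"location": (40, 5, 20), "contacts": (2, 3, 18)}
micro_scores, macro_scores = micro_macro(toy_counts)
```

In this toy case the rare class drags the macro f1 well below the micro f1. Note also that macro f1 is the mean of per-class f1 values, so it need not equal the harmonic mean of macro precision and macro recall.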

### Analysis of modelling results
As expected, both the micro and macro scores were lower for the linear model than for the more modern, non-linear neural model. Interestingly, the gap in macro f1 (.44 vs. .33) is much larger than the gap in micro f1 (.55 vs. .51). Because macro f1 weights every class equally, including the classes with very few examples, this would seem to support my hypothesis that neural models pre-trained on language data do well on my mapping task, where I have few examples to learn from. Another important note is that the linear model took a tiny fraction of the time to run compared to my neural model. The Linear SVC finishes a 4-fold cross-validation run in less than a minute on a CPU. My neural model, on the other hand, takes over 12 hours on a GPU. Considering this, the Linear SVC results are quite impressive.

[Overview](#Overview)

## Summary

In this project I described and explored my Android API data. I used visualization techniques to display differences in three subsets of my project data. While doing this I also demonstrated the differences between several popular Natural Language Processing packages. The differences found in this comparison were fairly minor, which suggests that users should use whichever package is most comfortable for them. I then used a Linear SVC model to create a baseline measurement for predicting a mapping between an Android method and a privacy policy phrase. The micro f1 results were comparable, but the macro f1 score was much lower for the linear model. This suggests that my neural model excels in low-resource contexts (i.e. when there are not many examples to learn from) relative to a linear model.

[Overview](#Overview)

## References

Rocky Slavin, Xiaoyin Wang, Mitra Bokaei Hosseini, James Hester, Ram Krishnan, Jaspreet Bhatia, Travis D. Breaux, and Jianwei Niu. 2016. Toward a framework for detecting privacy policy violations in android application code. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 25-36. DOI: https://doi.org/10.1145/2884781.2884855

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 2. Curran Associates Inc., USA, 3111-3119. 
