## Applying NLP methods to analyze cookie banner text
We analyze only the banner text itself and do not click through to the full cookie policy, as studies have shown that most users do not go that far to educate themselves before deciding how they want to engage with a site's cookies.

This notebook explores:
1. NLP to detect misleading text
2. Sentiment analysis: +/- framing

### 1. NLP to Detect Misleading Text
* SparkNLP
* LLMs

<font color="salmon">How did [Santos et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3463676.3485611) conduct their text analysis? What did they collect?</font>

1. CODEBOOK: Qualitative coding with [MAXQDA](https://www.maxqda.com) and intercoder-reliable **codebook**
    * OUR USE: Supervised learning, feed model examples of codes 

2. LEGAL REQ: from EU law (ePD, GDPR, case law, European Data Protection Board, DPAs)
    * OUR USE: update their categories & use them as code categories for banners
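
If we re-code banners ourselves before training a supervised model, we will want to check that two coders apply the codebook consistently. A minimal sketch of one common agreement statistic, Cohen's kappa, follows. It is an assumption that kappa is the right measure here (the notes above do not say which statistic Santos et al. used), and the labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for six banners (not real data).
coder_a = ["jargon", "vague", "vague", "misleading", "jargon", "vague"]
coder_b = ["jargon", "vague", "jargon", "misleading", "jargon", "vague"]
kappa = cohens_kappa(coder_a, coder_b)
```

Here the coders agree on 5 of 6 banners and chance agreement is 13/36, so kappa is 17/23 (about 0.74).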


<font color="salmon">Essentially, we can replicate their study:</font>
1. For the US
2. At larger scale
3. Specific things to code: 
    * Technical jargon
    * Vague, ambiguous language
    * Misleading statements
    * Positive/negative framing
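
Before any supervised model exists, the two most lexical categories above (technical jargon and vague language) could be roughly pre-flagged with keyword lists. This is only a baseline sketch: the term lists are hypothetical placeholders, not a codebook, and misleading statements and framing would still need human coding or a trained model.

```python
import re

# Hypothetical seed lexicons; a real codebook would replace these.
LEXICONS = {
    "technical_jargon": ["third-party", "pixel", "local storage", "legitimate interest"],
    "vague_language": ["may", "might", "such as", "improve your experience", "partners"],
}


def flag_banner(text):
    """Return {category: [matched terms]} for one banner text."""
    lowered = text.lower()
    flags = {}
    for category, terms in LEXICONS.items():
        # Match whole terms only, so "may" does not fire inside "mayor".
        hits = [
            term for term in terms
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered)
        ]
        if hits:
            flags[category] = hits
    return flags


example = "We and our partners may use third-party cookies to improve your experience."
flags = flag_banner(example)
```

The matched terms can double as candidate examples to hand to coders, rather than as final labels.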

### SPARK NLP

* [Annotators](https://sparknlp.org/docs/en/annotators)

### 2. Sentiment Analysis (+/- framing)

* BERT 
    - State of the art
    - Computationally intensive, a lot of environment setup required
    - Takes a while if conducted on very large datasets on a single machine
    - May require GPUs
* Spark NLP
    - Has pre-trained models for sentiment analysis that apparently work reasonably well (we can cross-validate with BERT & human labels)
    - Designed to scale and distribute computational workload across multiple machines
    - CPUs
    - Overhead for setting up Spark cluster may not be worth it...
    - Strength = real-time data streams and processing
    - Integrated into Spark ecosystem

Which to choose?

BERT
- Sentiment analysis of cookie banner text is probably best conceived as a one-time task. And I don't imagine collecting over 1 million banners (Santos et al. (2021) only coded 407).
- If the data is not too large, we use BERT

SPARK NLP
- If we want to integrate the sentiment analysis as part of a larger pipeline. Why? Maybe we are scouring the internet and have new websites that need new analysis on an ongoing basis?

VERDICT: BERT (unless the analysis has to run continuously inside a larger pipeline), assuming:
* Static dataset of cookie banner text
* Reasonably sized corpus
* We can store all of the text in one place, extract them into one worker, and conduct the sentiment analysis
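
A minimal sketch of the single-worker approach, assuming the Hugging Face `transformers` library is installed. Which checkpoint to use is an open choice: `pipeline("sentiment-analysis")` falls back to a library default, and a general-purpose sentiment model may not capture banner framing without fine-tuning. The helper names (`load_bert_classifier`, `batched`, `score_banners`) are hypothetical.

```python
def load_bert_classifier():
    """Load a pretrained sentiment classifier (downloads weights on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    # No model is pinned here, so the library picks its default checkpoint;
    # pin one explicitly for reproducible results.
    return pipeline("sentiment-analysis")


def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_banners(texts, classifier, batch_size=32):
    """Run `classifier` over banner texts and return one row per banner."""
    rows = []
    for chunk in batched(texts, batch_size):
        # truncation=True keeps long banners within the model's token limit.
        predictions = classifier(chunk, truncation=True)
        for text, pred in zip(chunk, predictions):
            rows.append({"text": text, "label": pred["label"], "score": pred["score"]})
    return rows


# Usage (requires transformers and a model download):
#   clf = load_bert_classifier()
#   rows = score_banners(["We value your privacy."], clf)
```

Passing the classifier in as an argument keeps the scoring loop easy to test and makes it simple to swap in a fine-tuned model later.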

### Pipeline

1. AWS Lambda to scrape and store cookie banner info in S3 (because we want LT)
    * Could also store momentarily in DynamoDB if any site records need updating
2. Set up EC2 instance or AWS Batch job to periodically extract and preprocess cookie banner text (cleaning, normalizing, etc.)
3. Set up EC2 instance with GPU support to run the BERT model for sentiment analysis. OR download the data and run this locally (we don't have access to GPUs on AWS I think)
4. Store results in S3 for retrieval and analysis
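
The storage and preprocessing steps above can be sketched as follows. This is a rough outline under stated assumptions: the key layout, record fields, and function names are hypothetical, and for brevity the record keeps the raw HTML and the cleaned text together rather than cleaning in a separate batch job. `s3_client` is expected to be a boto3 S3 client (for example `boto3.client("s3")` inside a Lambda handler).

```python
import hashlib
import html
import json
import re
import unicodedata


def clean_banner_text(raw):
    """Strip tags, decode entities, normalize Unicode, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)
    normalized = unicodedata.normalize("NFKC", decoded)
    return re.sub(r"\s+", " ", normalized).strip()


def banner_key(url, fetched_at):
    """Build an S3 object key; this layout is a hypothetical convention."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # fetched_at is assumed to be an ISO 8601 string; keep only the date part.
    return f"banners/{fetched_at[:10]}/{digest}.json"


def store_banner(s3_client, bucket, url, raw_html, fetched_at):
    """Write one banner record (raw and cleaned) to S3 and return its key."""
    record = {
        "url": url,
        "fetched_at": fetched_at,
        "raw_html": raw_html,
        "text": clean_banner_text(raw_html),
    }
    key = banner_key(url, fetched_at)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(record).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

Hashing the URL gives each site a stable object name, so re-scraping a site on the same date overwrites its record instead of duplicating it.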