# Data Preprocessing

In [None]:
# TODO: filter dataset to only include English tweets => 189,626 tweets

# 1. Filtering English language: using pandas, a Python data analysis library, only rows
# whose language column had the value “en” were kept.
# This step was necessary to increase the reliability of the pre-trained BERT model for
# sentiment analysis [36]. After this filtering, 189,626 of the 472,399 tweets were
# retained as English text.
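A minimal sketch of the language filter, assuming the tweets are already loaded into a pandas DataFrame. The column name `language` follows the description above; the function name `filter_english` is hypothetical.

```python
import pandas as pd


def filter_english(df: pd.DataFrame, lang_col: str = "language") -> pd.DataFrame:
    """Keep only the rows whose language field equals "en"."""
    english = df[df[lang_col] == "en"]
    # Reset the index so later steps see a contiguous 0..n-1 range.
    return english.reset_index(drop=True)
```

On the full dataset, `len(filter_english(df))` would be the place to check the retained tweet count against the figure reported above.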

In [None]:
# 2. Text lowercasing: all tweets were converted to lowercase; according to Hickman
# et al. [37], lowercasing tends to be beneficial because it reduces data dimensionality,
# thereby increasing statistical power, and usually does not reduce validity.
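Lowercasing can be sketched in a single line. The helper below works on a plain list of strings; for a DataFrame column, `df["text"].str.lower()` would be the pandas equivalent, where `text` is an assumed column name.

```python
def lowercase_tweets(texts):
    """Return a new list with every tweet converted to lowercase."""
    return [text.lower() for text in texts]
```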

In [None]:
# 3. Stop word removal: common English (function) words such as “and”, “is”, “I”, “am”,
# “what”, “of”, etc. were removed by using the Natural Language Toolkit (NLTK).
# Stop word removal has the advantages of reducing the size of the stored dataset and
# improving the overall efficiency and effectiveness of the analysis [38].
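A possible sketch of stop word removal. It tries to load NLTK's English stop word list and falls back to a tiny illustrative set when NLTK or its `stopwords` corpus is unavailable. The fallback set, the whitespace tokenization, and the function names are assumptions and not necessarily what was used here.

```python
# Tiny illustrative fallback; the real NLTK list is much longer.
FALLBACK_STOP_WORDS = {"and", "is", "i", "am", "what", "of"}


def load_stop_words():
    """Return NLTK's English stop words, or the fallback set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # LookupError is raised when the corpus has not been downloaded,
        # e.g. via nltk.download("stopwords").
        return set(FALLBACK_STOP_WORDS)


def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that is a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)
```

Splitting on whitespace keeps the sketch dependency-free; a tokenizer that handles punctuation would catch cases such as "is," that this version leaves in place.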

In [None]:
# 4. URL removal: all URLs were removed from tweets, since the text of URL strings does
# not necessarily convey any relevant information [39].
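URL removal could be done with a regular expression, as in the sketch below. The pattern is an assumption: it matches tokens that start with `http://`, `https://`, or `www.` and run until the next whitespace, which covers typical shortened tweet links but is not a full URL grammar.

```python
import re

# Assumed pattern: a scheme or "www." prefix followed by non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text):
    """Strip URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())
```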

In [None]:
# 5. Duplicate removal: all duplicate tweets were removed to eliminate redundancy and
# possible skewing of the results.
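With pandas, duplicate removal might look like the following sketch. The column name `text` is hypothetical, and keeping the first occurrence is an assumption; the description above does not say which copy was retained.

```python
import pandas as pd


def drop_duplicate_tweets(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Remove rows whose tweet text repeats an earlier row."""
    deduplicated = df.drop_duplicates(subset=text_col, keep="first")
    return deduplicated.reset_index(drop=True)
```

Running this after lowercasing and URL removal also catches tweets that differ only in case or in their link.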

In [None]:
# TODO: exclude location info (96% of the tweets lacked geolocation)

In [None]:
# TODO: assign sentiment labels using pre-trained BERT sentiment model

# Neural Network Models

- Sentiment Analysis
  - pre-trained transformer-based `BERT` model

- Anomaly Detection
  - `autoencoder`
  - `LSTM with Attention`

## Sentiment Analysis
- `nlptown/bert-base-multilingual-uncased-sentiment`: a fine-tuned version of `bert-base-multilingual-uncased`, optimized for sentiment analysis across six languages: English, Dutch, German, French, Spanish and Italian.
- Reference: Lakhanpal, S.; Gupta, A.; Agrawal, R. Leveraging Explainable AI to Analyze Researchers’ Aspect-Based Sentiment About ChatGPT. In Proceedings of the 15th International Conference on Intelligent Human Computer Interaction (IHCI 2023), Daegu, Republic of Korea, 8–10 November 2023; pp. 281–290.

- Open question: can this step be treated as part of preprocessing?


In [None]:
# TODO: Tweets were tokenized using the AutoTokenizer from HuggingFace Transformers, truncated to a maximum length of 512 tokens [41].

# TODO: The model predicted sentiment scores across five classes representing very negative to very positive sentiments.
# These categorical outputs were then converted to a continuous polarity scale ranging from −1 (strongly negative) to +1 (strongly positive) to facilitate the temporal analysis of sentiment fluctuations.
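The two TODOs above could be approached as in the sketch below. The tokenizer settings (truncation at 512 tokens) follow the description; the linear mapping from the five class indices to the values −1, −0.5, 0, 0.5, and 1 is an assumption, since the exact conversion is not stated. The function names are hypothetical, and the model is loaded lazily so the mapping helper can be used without `transformers` installed.

```python
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"


def load_model(name=MODEL_NAME):
    """Load tokenizer and classifier (downloads weights on first use)."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    return tokenizer, model


def predict_classes(texts, tokenizer, model):
    """Return the most likely class index (0..4) for each tweet in a list."""
    import torch
    encoded = tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoded).logits
    return logits.argmax(dim=-1).tolist()


def class_to_polarity(class_index, num_classes=5):
    """Map class 0 (very negative) .. 4 (very positive) linearly onto [-1, 1]."""
    if not 0 <= class_index < num_classes:
        raise ValueError(f"class index out of range: {class_index}")
    midpoint = (num_classes - 1) / 2
    # Assumed mapping: equally spaced polarity values centred on the middle class.
    return (class_index - midpoint) / midpoint
```

A typical call would be `tokenizer, model = load_model()` followed by `[class_to_polarity(c) for c in predict_classes(tweets, tokenizer, model)]`.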

## Anomaly Detection

- `autoencoder`
  - An autoencoder neural network was designed and trained to detect anomalies based on deviations in tweet sentiment patterns.
  - The input data was structured into sequences of polarity scores. 
  - The autoencoder was implemented as a fully connected feedforward network with a three-layer encoder and symmetric decoder.
  - The encoder consisted of a hidden layer with 64 neurons followed by a 16-neuron bottleneck, using rectified linear unit (ReLU) activations for encoding and decoding [42].
  - Reconstruction errors (mean squared error between actual and reconstructed sequences) were calculated, and tweets with errors above the 95th percentile threshold were flagged as anomalies. 
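The autoencoder described above could be sketched in PyTorch as follows. The framework choice, the window length `SEQ_LEN`, and the linear output layer are assumptions; only the 64-neuron hidden layer, the 16-neuron bottleneck, the ReLU activations, and the 95th percentile threshold come from the description.

```python
import torch
from torch import nn

SEQ_LEN = 10  # hypothetical window length; not stated in the description


class SentimentAutoencoder(nn.Module):
    """Fully connected autoencoder: seq_len -> 64 -> 16 -> 64 -> seq_len."""

    def __init__(self, seq_len=SEQ_LEN):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(seq_len, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
        )
        # Symmetric decoder; the final layer is left linear (an assumption).
        self.decoder = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, seq_len),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


def reconstruction_errors(model, sequences):
    """Mean squared error between each sequence and its reconstruction."""
    model.eval()
    with torch.no_grad():
        reconstructed = model(sequences)
    return ((sequences - reconstructed) ** 2).mean(dim=1)


def flag_anomalies(errors, quantile=0.95):
    """Mark entries whose error exceeds the given percentile of all errors."""
    threshold = torch.quantile(errors, quantile)
    return errors > threshold
```

Because the threshold is a percentile of the observed errors, roughly 5% of the sequences are flagged by construction, whatever the absolute error level.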

- `LSTM with Attention`
  - An LSTM neural network with an integrated attention mechanism was implemented to detect anomalies based on prediction errors.
  - Input sequences of polarity scores were processed through LSTM layers, and attention layers were applied to selectively weigh temporal dependencies within the sequences.
  - The LSTM with attention included a single-layer LSTM model with a hidden size of 32, followed by an attention mechanism.
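One way to realize an LSTM with attention in PyTorch is sketched below. The single LSTM layer and the hidden size of 32 follow the description; the attention form (a learned score per time step, softmax-normalized, then a weighted sum of the LSTM outputs) and the single next-value output are assumptions, as the description does not specify them.

```python
import torch
from torch import nn


class LSTMAttention(nn.Module):
    """Single-layer LSTM whose outputs are pooled by additive-style attention."""

    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.score = nn.Linear(hidden_size, 1)  # one attention score per time step
        self.head = nn.Linear(hidden_size, 1)   # predicts the next polarity score

    def forward(self, x):
        # x: (batch, seq_len, 1)
        outputs, _ = self.lstm(x)                            # (batch, seq_len, hidden)
        weights = torch.softmax(self.score(outputs), dim=1)  # (batch, seq_len, 1)
        context = (weights * outputs).sum(dim=1)             # (batch, hidden)
        prediction = self.head(context).squeeze(-1)          # (batch,)
        return prediction, weights.squeeze(-1)
```

Returning the attention weights alongside the prediction makes it possible to inspect which time steps the model relied on for a flagged sequence.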

- Common config
  - Both models were trained for 10 epochs using the Adam optimizer (learning rate 0.001), a batch size of 32, and mean squared error (MSE) loss.
  - Sentiment polarity scores were normalized to the [0, 1] range using MinMax scaling. The model’s output was a prediction of subsequent sentiment scores.
  - Anomalies were identified when prediction errors exceeded a threshold set at the 95th percentile, highlighting sudden or extreme shifts in sentiment.
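The shared configuration can be sketched as a small PyTorch training utility. The hyperparameters (10 epochs, Adam, learning rate 0.001, batch size 32, MSE loss) and the MinMax scaling come from the list above; the sliding-window construction, the window length, and all function names are assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def minmax_scale(values):
    """Rescale a 1-D tensor to the [0, 1] range."""
    low, high = values.min(), values.max()
    if high == low:
        return torch.zeros_like(values)  # constant series: avoid division by zero
    return (values - low) / (high - low)


def make_windows(series, seq_len):
    """Split a series into (input window, next value) pairs."""
    windows = series.unfold(0, seq_len + 1, 1)  # (n - seq_len, seq_len + 1)
    return windows[:, :-1], windows[:, -1]


def train(model, inputs, targets, epochs=10, lr=1e-3, batch_size=32):
    """Train with Adam and MSE loss; return the mean loss of each epoch."""
    loader = DataLoader(TensorDataset(inputs, targets),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    history = []
    for _ in range(epochs):
        total = 0.0
        for batch_inputs, batch_targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch_inputs), batch_targets)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(batch_inputs)
        history.append(total / len(inputs))
    return history
```

For an autoencoder, the same `train` function applies with the input windows passed as both `inputs` and `targets`.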