# Data Preprocessing

- [Dataset](https://www.kaggle.com/datasets/swaptr/turkey-earthquake-tweets)

In [1]:
import pandas as pd

# NOTE: unzip ./data/turkey_syria_earthquake_tweets/archive.zip before running this cell; `tweets.csv` is too large to keep uncompressed.
full_data = pd.read_csv("./data/turkey_syria_earthquake_tweets/tweets.csv")
full_data.head()

FileNotFoundError: [Errno 2] No such file or directory: './data/turkey_syria_earthquake_tweets/tweets.csv'

In [None]:
# Filter the dataset to English tweets only => 189,626 tweets

# 1. English-language filtering: using pandas, a Python data analysis library, only rows whose `language` column had the value "en" were kept.
# This step was necessary to increase the reliability of the pre-trained BERT model for sentiment analysis [36].
# The quoted description reports 189,626 English tweets out of 472,399; this copy of the dataset has 478,052 rows in total, but the English subset matches (see the output below).

print(len(full_data))
en_filtered_data = full_data[full_data["language"] == "en"]
print(len(en_filtered_data))
en_filtered_data.to_csv("./data/turkey_syria_earthquake_tweets/tweets_en.csv")

478052
189626


In [2]:
df = pd.read_csv("./data/turkey_syria_earthquake_tweets/tweets_en.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,1,2023-02-21 03:29:07+00:00,New search &amp; rescue work is in progress in...,"['Hatay', 'earthquakes', 'Türkiye', 'TurkiyeQu...",1.0,0.0,5697.0,True,en,,,Twitter Web App
1,2,2023-02-21 03:29:04+00:00,Can't imagine those who still haven't recovere...,"['Turkey', 'earthquake', 'turkeyearthquake2023...",0.0,0.0,1.0,False,en,,,Twitter for Android
2,3,2023-02-21 03:28:06+00:00,its a highkey sign for all of us to ponder ove...,"['turkeyearthquake2023', 'earthquake', 'Syria']",0.0,0.0,3.0,False,en,,,Twitter for Android
3,5,2023-02-21 03:27:27+00:00,"See how strong was the #Earthquake of Feb 20, ...","['Earthquake', 'Hatay', 'Turkey', 'turkeyearth...",0.0,0.0,21836.0,True,en,,,Twitter for Android
4,6,2023-02-21 03:27:11+00:00,More difficult news today on top of struggles ...,"['Türkiye', 'Syria', 'earthquake', 'Canadians']",1.0,0.0,675.0,False,en,,,Twitter for iPhone


In [3]:
# 2. Text lowercasing: all tweets were converted to lowercase; according to Hickman
# et al. [37], lowercasing tends to be beneficial because it reduces data dimensionality,
# thereby increasing statistical power, and usually does not reduce validity.
df['content'] = df['content'].str.lower()

In [7]:
# 3. Stop word removal: common English (function) words such as “and”, “is”, “I”, “am”,
# “what”, “of”, etc. were removed by using the Natural Language Toolkit (NLTK).
# Stop word removal has the advantages of reducing the size of the stored dataset and
# improving the overall efficiency and effectiveness of the analysis [38].

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Download required NLTK data
nltk.download('stopwords')
nltk.download('punkt_tab')

# Get English stop words
stop_words = set(stopwords.words('english'))

# Function to remove stop words from text
def remove_stopwords(text):
    if pd.isna(text):
        return text
    
    # Tokenize the text
    words = word_tokenize(text)
    
    # Remove stop words and return as string
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    return ' '.join(filtered_words)

df['content'] = df['content'].apply(remove_stopwords)
df.head()


[nltk_data] Downloading package stopwords to /Users/jade/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/jade/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Unnamed: 0.1,Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,1,2023-02-21 03:29:07+00:00,new search & amp ; rescue work progress # hata...,"['Hatay', 'earthquakes', 'Türkiye', 'TurkiyeQu...",1.0,0.0,5697.0,True,en,,,Twitter Web App
1,2,2023-02-21 03:29:04+00:00,ca n't imagine still n't recovered previous tr...,"['Turkey', 'earthquake', 'turkeyearthquake2023...",0.0,0.0,1.0,False,en,,,Twitter for Android
2,3,2023-02-21 03:28:06+00:00,highkey sign us ponder actions return merciful...,"['turkeyearthquake2023', 'earthquake', 'Syria']",0.0,0.0,3.0,False,en,,,Twitter for Android
3,5,2023-02-21 03:27:27+00:00,"see strong # earthquake feb 20 , 2023 # hatay ...","['Earthquake', 'Hatay', 'Turkey', 'turkeyearth...",0.0,0.0,21836.0,True,en,,,Twitter for Android
4,6,2023-02-21 03:27:11+00:00,difficult news today top struggles already fac...,"['Türkiye', 'Syria', 'earthquake', 'Canadians']",1.0,0.0,675.0,False,en,,,Twitter for iPhone


In [8]:
# 4. URL removal: all URLs were removed from tweets, since URL strings do not
# necessarily convey any relevant information [39].

import re
df['content'] = df['content'].str.replace(r'http\S+', '', regex=True)
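
The tokenizer output above shows punctuation being split into separate tokens (for example `& amp ;`), so URLs may also be broken apart before this step runs. The sketch below applies the same `http\S+` pattern to a raw string and to a pre-split string to show the difference. The helper name `strip_urls` and both sample strings are hypothetical, and the exact way `word_tokenize` splits a URL is an assumption here, not something confirmed by the notebook.

```python
import re

# Same pattern as the cell above: "http" followed by any non-whitespace run.
URL_PATTERN = re.compile(r"http\S+")


def strip_urls(text):
    """Remove http/https URLs, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()


# Hypothetical tweet text before tokenization.
raw = "rescue teams arrive https://t.co/abc123 in hatay"
# Assumed shape of the same text after a tokenizer has split the URL at ":".
tokenized = "rescue teams arrive https : //t.co/abc123 in hatay"

print(strip_urls(raw))        # rescue teams arrive in hatay
print(strip_urls(tokenized))  # rescue teams arrive : //t.co/abc123 in hatay
```

If fragments like `: //t.co/...` show up in `df['content']`, one option is to run URL removal before stop word removal; that would deviate from the step order quoted in the comments, so it is a design choice rather than a correction.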

In [15]:
# 5. Duplicate removal: all duplicate tweets were removed to eliminate redundancy and
# possible skewing of the results.

df = df.drop_duplicates(subset='content', keep='first')

In [None]:
# Exclude location info (96% of the tweets lacked geolocation) and the leftover index column.
# errors='ignore' skips any of these columns that do not exist.
df = df.drop(columns=['coordinates', 'place', 'Unnamed: 0'], errors='ignore')
df.head()

Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,source
0,2023-02-21 03:29:07+00:00,new search & amp ; rescue work progress # hata...,"['Hatay', 'earthquakes', 'Türkiye', 'TurkiyeQu...",1.0,0.0,5697.0,True,en,Twitter Web App
1,2023-02-21 03:29:04+00:00,ca n't imagine still n't recovered previous tr...,"['Turkey', 'earthquake', 'turkeyearthquake2023...",0.0,0.0,1.0,False,en,Twitter for Android
2,2023-02-21 03:28:06+00:00,highkey sign us ponder actions return merciful...,"['turkeyearthquake2023', 'earthquake', 'Syria']",0.0,0.0,3.0,False,en,Twitter for Android
3,2023-02-21 03:27:27+00:00,"see strong # earthquake feb 20 , 2023 # hatay ...","['Earthquake', 'Hatay', 'Turkey', 'turkeyearth...",0.0,0.0,21836.0,True,en,Twitter for Android
4,2023-02-21 03:27:11+00:00,difficult news today top struggles already fac...,"['Türkiye', 'Syria', 'earthquake', 'Canadians']",1.0,0.0,675.0,False,en,Twitter for iPhone


In [None]:
# TODO: assign sentiment labels using pre-trained BERT sentiment model

# Neural Network Models

- Sentiment Analysis
  - pre-trained transformer-based `BERT` model

- Anomaly Detection
  - `autoencoder`
  - `LSTM with Attention`

## Sentiment Analysis
- `nlptown/bert-base-multilingual-uncased-sentiment`: a fine-tuned version of `bert-base-multilingual-uncased`, optimized for sentiment analysis in six languages (English, Dutch, German, French, Spanish, and Italian).
- Reference: Lakhanpal, S.; Gupta, A.; Agrawal, R. Leveraging Explainable AI to Analyze Researchers’ Aspect-Based Sentiment About ChatGPT. In Proceedings of the 15th International Conference on Intelligent Human Computer Interaction (IHCI 2023), Daegu, Republic of Korea, 8–10 November 2023; pp. 281–290.

- Open question: can sentiment labeling be treated as part of preprocessing?


In [None]:
# TODO: tokenize tweets with the AutoTokenizer from HuggingFace Transformers, truncating to a maximum length of 512 tokens [41].

# TODO: predict sentiment scores across five classes, from very negative to very positive.
# These categorical outputs are then converted to a continuous polarity scale ranging from −1 (strongly negative) to +1 (strongly positive) to facilitate the temporal analysis of sentiment fluctuations.
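
The description above does not say how the five classes are mapped onto the [−1, +1] scale. A minimal sketch of one plausible mapping follows, assuming the classes are ordered from "1 star" (index 0) to "5 stars" (index 4) and spaced evenly. The names `CLASS_POLARITY`, `polarity_from_label`, and `polarity_from_probs` are hypothetical.

```python
import numpy as np

# Assumption: index 0 = very negative ... index 4 = very positive, evenly spaced.
CLASS_POLARITY = np.linspace(-1.0, 1.0, 5)  # -1.0, -0.5, 0.0, 0.5, 1.0


def polarity_from_label(class_index):
    """Map a predicted class index (0..4) to a polarity score in [-1, 1]."""
    return float(CLASS_POLARITY[class_index])


def polarity_from_probs(probs):
    """Probability-weighted polarity for one tweet (shape (5,)) or a batch (shape (n, 5))."""
    probs = np.asarray(probs, dtype=float)
    return probs @ CLASS_POLARITY
```

The label-based variant keeps only the most likely class, while the probability-weighted variant yields a smoother score; which one matches the intended method is not stated in the text.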

## Anomaly Detection

- `autoencoder`
  - An autoencoder neural network was designed and trained to detect anomalies based on deviations in tweet sentiment patterns.
  - The input data was structured into sequences of polarity scores. 
  - The autoencoder was implemented as a fully connected feedforward network with a three-layer encoder and symmetric decoder.
  - The encoder consisted of a hidden layer with 64 neurons followed by a 16-neuron bottleneck, using rectified linear unit (ReLU) activations for encoding and decoding [42].
  - Reconstruction errors (mean squared error between actual and reconstructed sequences) were calculated, and tweets with errors above the 95th percentile threshold were flagged as anomalies. 

- `LSTM with Attention`
  - An LSTM neural network with an integrated attention mechanism was implemented to detect anomalies based on prediction errors.
  - Input sequences of polarity scores were processed through LSTM layers, and attention layers were applied to selectively weigh temporal dependencies within the sequences.
  - The LSTM with attention included a single-layer LSTM model with a hidden size of 32, followed by an attention mechanism.

- Common configuration
  - Both models were trained for 10 epochs using the Adam optimizer (learning rate 0.001), with a batch size of 32 and mean squared error (MSE) loss.
  - Sentiment polarity scores were normalized to the [0, 1] range using MinMax scaling. The LSTM model's output was a prediction of subsequent sentiment scores.
  - Anomalies were identified when prediction errors exceeded a threshold set at the 95th percentile, highlighting sudden or extreme shifts in sentiment.
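
The two models and the shared configuration described above could be sketched in PyTorch roughly as follows. This is a sketch under stated assumptions, not the reference implementation: the window length `SEQ_LEN`, the additive form of the attention layer, the linear output layers, and all function and class names are hypothetical choices that the description does not specify.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

SEQ_LEN = 10  # hypothetical window length; not given in the description


def make_windows(polarity, seq_len=SEQ_LEN):
    """Min-max scale polarity scores to [0, 1] and cut them into sliding windows."""
    polarity = np.asarray(polarity, dtype=np.float32)
    span = polarity.max() - polarity.min()
    scaled = (polarity - polarity.min()) / (span if span > 0 else 1.0)
    windows = [scaled[i:i + seq_len] for i in range(len(scaled) - seq_len + 1)]
    return torch.from_numpy(np.stack(windows))


class Autoencoder(nn.Module):
    """Feedforward autoencoder: seq_len -> 64 -> 16 -> 64 -> seq_len, ReLU activations."""

    def __init__(self, seq_len=SEQ_LEN):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(seq_len, 64), nn.ReLU(), nn.Linear(64, 16), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, seq_len))

    def forward(self, x):
        return self.decoder(self.encoder(x))


class AttentionLSTM(nn.Module):
    """Single-layer LSTM (hidden size 32) with attention; predicts the next score."""

    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.score = nn.Linear(hidden_size, 1)  # assumed attention scoring layer
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                                # x: (batch, steps)
        out, _ = self.lstm(x.unsqueeze(-1))              # (batch, steps, hidden)
        weights = torch.softmax(self.score(out), dim=1)  # attention over time steps
        context = (weights * out).sum(dim=1)             # (batch, hidden)
        return self.head(context).squeeze(-1)            # (batch,)


def train(model, inputs, targets, epochs=10, lr=1e-3, batch_size=32):
    """Shared training loop: Adam, MSE loss, batch size 32, 10 epochs by default."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(inputs, targets), batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model


def reconstruction_errors(model, windows):
    """Per-window MSE between the input and its reconstruction (autoencoder)."""
    model.eval()
    with torch.no_grad():
        return ((model(windows) - windows) ** 2).mean(dim=1).numpy()


def prediction_errors(model, inputs, targets):
    """Per-window squared error between predicted and actual next score (LSTM)."""
    model.eval()
    with torch.no_grad():
        return ((model(inputs) - targets) ** 2).numpy()


def flag_anomalies(errors, percentile=95):
    """Boolean mask of errors strictly above the given percentile threshold."""
    errors = np.asarray(errors)
    return errors > np.percentile(errors, percentile)
```

In this sketch the autoencoder is trained with each window as both input and target, while the LSTM takes all but the last score of a window as input (`windows[:, :-1]`) and the last score as target (`windows[:, -1]`). Either error array can then be passed to `flag_anomalies`.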