# Data Preprocessing

- [Dataset](https://www.kaggle.com/datasets/swaptr/turkey-earthquake-tweets)

In [46]:
import os
import pandas as pd

### Load data

In [47]:
# # NOTE: unzip ./data/turkey_syria_earthquake_tweets/archive.zip before running the below code. `tweets.csv` file is too big.
# full_data = pd.read_csv("./data/turkey_syria_earthquake_tweets/tweets.csv")
# full_data.head()

# # Invalid data
# # NOTE: full_data.iloc[:, 6] contains NaN
# # DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
# print(full_data.iloc[:, 6].apply(type).value_counts())
# mask = full_data['isVerified'].apply(lambda v: isinstance(v, float))
# print(full_data.loc[mask, 'isVerified'])

df = pd.read_csv("./data/turkey_syria_earthquake_tweets/tweets_en.csv")
print(f"N = {len(df)}")
df.head()

N = 189626


Unnamed: 0.1,Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,1,2023-02-21 03:29:07+00:00,New search &amp; rescue work is in progress in...,"['Hatay', 'earthquakes', 'Türkiye', 'TurkiyeQu...",1.0,0.0,5697.0,True,en,,,Twitter Web App
1,2,2023-02-21 03:29:04+00:00,Can't imagine those who still haven't recovere...,"['Turkey', 'earthquake', 'turkeyearthquake2023...",0.0,0.0,1.0,False,en,,,Twitter for Android
2,3,2023-02-21 03:28:06+00:00,its a highkey sign for all of us to ponder ove...,"['turkeyearthquake2023', 'earthquake', 'Syria']",0.0,0.0,3.0,False,en,,,Twitter for Android
3,5,2023-02-21 03:27:27+00:00,"See how strong was the #Earthquake of Feb 20, ...","['Earthquake', 'Hatay', 'Turkey', 'turkeyearth...",0.0,0.0,21836.0,True,en,,,Twitter for Android
4,6,2023-02-21 03:27:11+00:00,More difficult news today on top of struggles ...,"['Türkiye', 'Syria', 'earthquake', 'Canadians']",1.0,0.0,675.0,False,en,,,Twitter for iPhone


In [48]:
print(df.loc[0, 'content'])

New search &amp; rescue work is in progress in #Hatay after two more #earthquakes hit #Türkiye’s southeastern province.  #TurkiyeQuakes #Turkey-#Syria  #Earthquake #turkeyearthquake2023  https://t.co/sd4WHByiQs


### Text lowercasing

All tweets were converted to lowercase; according to Hickman et al. [37], lowercasing tends to be beneficial because it reduces data dimensionality, thereby increasing statistical power, and usually does not reduce validity.

In [49]:
df['content'] = df['content'].str.lower()
print(df.loc[0, 'content'])

new search &amp; rescue work is in progress in #hatay after two more #earthquakes hit #türkiye’s southeastern province.  #turkiyequakes #turkey-#syria  #earthquake #turkeyearthquake2023  https://t.co/sd4whbyiqs


### Stop word removal

Stop word removal is useful for traditional NLP models but less effective for deep learning (DL) models.

The original paper compared traditional NLP models with DL models, so it applied stop word removal:

"common English (function) words such as “and”, “is”, “I”, “am”, “what”, “of”, etc. were removed by using the Natural Language Toolkit (NLTK). Stop word removal has the advantages of reducing the size of the stored dataset and improving the overall efficiency and effectiveness of the analysis [38]."

However, **we train only DL models in our analysis, so we do not apply it**.

In [50]:
# import nltk
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize

# NLTK_LOCAL_PATH = "./nltk_data"
# os.makedirs(NLTK_LOCAL_PATH, exist_ok=True)
# nltk.data.path.append(NLTK_LOCAL_PATH)
# # nltk.download('stopwords', download_dir=NLTK_LOCAL_PATH)
# # nltk.download('punkt', download_dir=NLTK_LOCAL_PATH)

# # Get English stop words
# stop_words = set(stopwords.words('english'))

# # Function to remove stop words from text
# def remove_stopwords(text):
#     if pd.isna(text):
#         return text
#     # Tokenize the text
#     words = word_tokenize(text)
#     # Remove stop words and return as string
#     filtered_words = [word for word in words if word.lower() not in stop_words]
#     return ' '.join(filtered_words)

# df['content'] = df['content'].apply(remove_stopwords)
# print(df.loc[0, 'content'])

### URL removal

All URLs were removed from tweets, since the text of a URL string does not necessarily convey any relevant information [39].

In [51]:
df['content'] = df['content'].str.replace(r'http\S+', '', regex=True)
print(df.loc[0, 'content'])

new search &amp; rescue work is in progress in #hatay after two more #earthquakes hit #türkiye’s southeastern province.  #turkiyequakes #turkey-#syria  #earthquake #turkeyearthquake2023  
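
As a quick check of what the pattern does, the sketch below applies the same regular expression with Python's `re` module to a made-up tweet. The sample text and the `strip_urls` helper are illustrative assumptions, not part of the notebook. It suggests two things worth knowing: the whitespace around a removed URL is left behind (as the trailing spaces in the output above show), and links without an `http` prefix are not matched.

```python
import re

URL_PATTERN = re.compile(r"http\S+")  # same pattern as the cell above

def strip_urls(text: str) -> str:
    """Hypothetical helper: remove http(s) URLs, then collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

# Made-up sample tweet (not taken from the dataset)
sample = "rescue work in #hatay  https://t.co/abc123 see www.example.org"

print(repr(URL_PATTERN.sub("", sample)))  # URL removed, surrounding spaces remain
print(repr(strip_urls(sample)))           # whitespace collapsed; www.example.org is untouched
```

Whether to collapse the leftover whitespace is a judgment call; it mainly matters for the duplicate check that follows, since two tweets that differ only in spacing would otherwise count as distinct.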


### Duplicate removal

All duplicate tweets were removed to eliminate redundancy and possible skewing of the results.

In [52]:
print(len(df))
df = df.drop_duplicates(subset='content', keep='first')
print(len(df))

189626
180915


### Exclude location info

96% of the tweets lacked geolocation, so the location columns (`coordinates` and `place`) are dropped if they exist.

In [53]:
df = df.drop(columns=['coordinates', 'place'], errors='ignore')
df.head()

Unnamed: 0.1,Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,source
0,1,2023-02-21 03:29:07+00:00,new search &amp; rescue work is in progress in...,"['Hatay', 'earthquakes', 'Türkiye', 'TurkiyeQu...",1.0,0.0,5697.0,True,en,Twitter Web App
1,2,2023-02-21 03:29:04+00:00,can't imagine those who still haven't recovere...,"['Turkey', 'earthquake', 'turkeyearthquake2023...",0.0,0.0,1.0,False,en,Twitter for Android
2,3,2023-02-21 03:28:06+00:00,its a highkey sign for all of us to ponder ove...,"['turkeyearthquake2023', 'earthquake', 'Syria']",0.0,0.0,3.0,False,en,Twitter for Android
3,5,2023-02-21 03:27:27+00:00,"see how strong was the #earthquake of feb 20, ...","['Earthquake', 'Hatay', 'Turkey', 'turkeyearth...",0.0,0.0,21836.0,True,en,Twitter for Android
4,6,2023-02-21 03:27:11+00:00,more difficult news today on top of struggles ...,"['Türkiye', 'Syria', 'earthquake', 'Canadians']",1.0,0.0,675.0,False,en,Twitter for iPhone
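
The share of tweets without geolocation can be measured before the columns are dropped. The sketch below shows one way to do that on a small made-up frame; the `toy` data is an assumption for illustration, and on the real data the same expression would be applied to `df` before the `drop` call.

```python
import pandas as pd

# Toy stand-in for the tweets table; only the two location columns matter here.
toy = pd.DataFrame({
    "content": ["a", "b", "c", "d"],
    "coordinates": [None, None, None, "36.2,36.2"],
    "place": [None, None, None, "Antakya"],
})

# Share of rows where both location fields are missing.
missing_share = toy[["coordinates", "place"]].isna().all(axis=1).mean()
print(f"{missing_share:.0%} of rows lack geolocation")

# errors="ignore" makes the drop a no-op when the columns are already gone,
# so the cell can be re-run safely.
toy = toy.drop(columns=["coordinates", "place"], errors="ignore")
toy = toy.drop(columns=["coordinates", "place"], errors="ignore")
print(list(toy.columns))
```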


# Neural Network Models


- Sentiment Analysis
  - pre-trained transformer-based `BERT` model

- Anomaly Detection
  - `autoencoder`
  - `LSTM with Attention`

## Sentiment Analysis

- `nlptown/bert-base-multilingual-uncased-sentiment`: a fine-tuned version of `bert-base-multilingual-uncased`, optimized for sentiment analysis in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts a rating from 1 to 5 stars.
- Reference: Lakhanpal, S.; Gupta, A.; Agrawal, R. Leveraging Explainable AI to Analyze Researchers’ Aspect-Based Sentiment About ChatGPT. In Proceedings of the 15th International Conference on Intelligent Human Computer Interaction (IHCI 2023), Daegu, Republic of Korea, 8–10 November 2023; pp. 281–290.

- Open question: can this step be seen as part of preprocessing?

In [None]:
# Tokenize inputs
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch

# pipe = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

inputs = df['content'].tolist()

inputs = inputs[:2500]
model_inputs = tokenizer(inputs, padding=True, truncation=True, max_length=512, return_tensors="pt")





In [10]:
# Call BERT and get predicted labels (inference only, so gradients are not tracked)
model.eval()
with torch.no_grad():
    outputs = model(**model_inputs)
logits = outputs.logits
probabilities = torch.softmax(logits, dim=-1)
predicted_labels = torch.argmax(probabilities, dim=-1)

# Convert to polarity scale as stated in the paper
star_ratings = predicted_labels + 1
polarity_scores = (star_ratings - 3) / 2.0
# Now each tweet has a sentiment polarity ∈ [-1, +1]
# Note that I had to create a subset because of resource limitations
df_subset = df.iloc[:2500].copy()
df_subset['sentiment_polarity'] = polarity_scores.numpy()


# Clean and standardize (z-score) the polarity series
pol = df_subset['sentiment_polarity'].astype(float).copy()
pol = pol.interpolate(limit_direction="both")  # fill occasional gaps
pol_mean = pol.mean()
pol_std = pol.std() if pol.std() > 0 else 1.0  # guard against division by zero
pol_norm = (pol - pol_mean) / pol_std
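
The star-to-polarity conversion in the cell above is a linear map: the model's class indices 0 to 4 stand for 1 to 5 stars (hence the `+ 1`), and `(stars - 3) / 2` centers the scale on 3 stars. The sketch below spells out the five possible values; `stars_to_polarity` is a hypothetical helper written for illustration, not a function from the notebook or the paper.

```python
def stars_to_polarity(stars: int) -> float:
    """Map a 1-5 star rating to a polarity score in [-1, +1]."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    return (stars - 3) / 2.0

mapping = {stars: stars_to_polarity(stars) for stars in range(1, 6)}
print(mapping)  # {1: -1.0, 2: -0.5, 3: 0.0, 4: 0.5, 5: 1.0}
```

Because only five values are possible, the polarity series is coarse, which is worth keeping in mind when reading the reconstruction errors later on.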


In [11]:
# Create Autoencoder neural network
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

# hyperparameters
SEQ_LEN   = 64
STRIDE    = 1
BATCH_SIZE= 128
EPOCHS    = 20
LR        = 1e-3
DEVICE    = "cuda" if torch.cuda.is_available() else "cpu"
RNG       = np.random.default_rng(42)



In [12]:
# construct input data as sequences of polarity scores. Note that I used GPT
# for this section

series = pol_norm.to_numpy().astype(np.float32)
N = len(series)
windows = []
indices = []  # store ending index of each window for mapping anomalies back
for start in range(0, N - SEQ_LEN + 1, STRIDE):
    end = start + SEQ_LEN
    windows.append(series[start:end])
    indices.append(end - 1)  # align anomaly decision at window end
X = np.stack(windows, axis=0)  # (num_windows, SEQ_LEN)
idx_map = np.array(indices)

class SeqDataset(Dataset):
    def __init__(self, X):
        self.X = torch.from_numpy(X)
    def __len__(self):
        return self.X.shape[0]
    def __getitem__(self, i):
        x = self.X[i]
        return x, x  # autoencoder: input == target

dataset = SeqDataset(X)

# Split the windows into training (80%) and validation (20%) sets
val_size = int(0.2 * len(dataset))
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=torch.Generator().manual_seed(42))
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=False)
val_dl   = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, drop_last=False)
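
The windowing loop above can also be expressed with NumPy's `sliding_window_view`. The sketch below is an equivalent formulation under that assumption, with `make_windows` as a hypothetical helper name. It also shows where the window count comes from: a series of length `N` yields `N - SEQ_LEN + 1` windows at stride 1, so 2500 scores with `SEQ_LEN = 64` give 2437 windows.

```python
import numpy as np

def make_windows(series: np.ndarray, seq_len: int, stride: int = 1):
    """Return (windows, end_positions) for a 1-D series."""
    # View of shape (len(series) - seq_len + 1, seq_len); keep every stride-th window.
    windows = np.lib.stride_tricks.sliding_window_view(series, seq_len)[::stride]
    # Position of the last element of each window, used to map results back to rows.
    end_positions = np.arange(seq_len - 1, len(series), stride)
    return windows.copy(), end_positions  # copy because the view is read-only

toy = np.arange(10, dtype=np.float32)
X_toy, ends = make_windows(toy, seq_len=4, stride=1)
print(X_toy.shape)  # (7, 4): 10 - 4 + 1 windows
print(ends)         # [3 4 5 6 7 8 9]

# With the notebook's settings: 2500 scores, SEQ_LEN = 64, STRIDE = 1
n_windows = len(make_windows(np.zeros(2500, dtype=np.float32), 64)[0])
print(n_windows)    # 2437
```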

In [13]:
# declare AutoEncoder class with layers specified in the paper
class AE(nn.Module):
    def __init__(self, in_dim=SEQ_LEN, h1=128, h2=64, bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, h1),
            nn.ReLU(),
            nn.Linear(h1, h2),
            nn.ReLU(),
            nn.Linear(h2, bottleneck),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, h2),
            nn.ReLU(),
            nn.Linear(h2, h1),
            nn.ReLU(),
            nn.Linear(h1, in_dim)  # final layer (no activation for regression)
        )
    def forward(self, x):
        z = self.encoder(x)
        out = self.decoder(z)
        return out

model = AE(in_dim=SEQ_LEN, h1=128, h2=64, bottleneck=16).to(DEVICE)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

In [14]:
# train loop
def run_epoch(dl, train=True):
    model.train(train)
    total, count = 0.0, 0
    for xb, yb in dl:
        xb, yb = xb.to(DEVICE), yb.to(DEVICE)
        if train:
            optimizer.zero_grad(set_to_none=True)
        preds = model(xb)
        loss = criterion(preds, yb)
        if train:
            loss.backward()
            optimizer.step()
        total += loss.item() * xb.size(0)
        count += xb.size(0)
    return total / max(count, 1)

best_val = float("inf")
patience, bad = 5, 0 # used to prevent overfitting, tracks if validation loss stops decreasing or starts increasing
for epoch in range(1, EPOCHS+1):
    tr = run_epoch(train_dl, train=True)
    va = run_epoch(val_dl, train=False)
    print(f"Epoch {epoch:02d} | train MSE: {tr:.6f} | val MSE: {va:.6f}")
    if va + 1e-6 < best_val:
        best_val = va
        bad = 0
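        # clone so the snapshot is not overwritten by later updates (v.cpu() is a no-op on CPU)
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}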
    else:
        bad += 1
        if bad >= patience:
            print("Early stopping.")
            break

# Restore best model
model.load_state_dict(best_state)
model.to(DEVICE)
model.eval()

Epoch 01 | train MSE: 0.986858 | val MSE: 0.978194
Epoch 02 | train MSE: 0.975218 | val MSE: 0.961202
Epoch 03 | train MSE: 0.944968 | val MSE: 0.920235
Epoch 04 | train MSE: 0.896460 | val MSE: 0.878052
Epoch 05 | train MSE: 0.853866 | val MSE: 0.846852
Epoch 06 | train MSE: 0.821692 | val MSE: 0.824133
Epoch 07 | train MSE: 0.798710 | val MSE: 0.810321
Epoch 08 | train MSE: 0.785143 | val MSE: 0.804307
Epoch 09 | train MSE: 0.771090 | val MSE: 0.791600
Epoch 10 | train MSE: 0.753499 | val MSE: 0.780109
Epoch 11 | train MSE: 0.737869 | val MSE: 0.769325
Epoch 12 | train MSE: 0.723167 | val MSE: 0.756426
Epoch 13 | train MSE: 0.706743 | val MSE: 0.746167
Epoch 14 | train MSE: 0.693631 | val MSE: 0.741966
Epoch 15 | train MSE: 0.682895 | val MSE: 0.732714
Epoch 16 | train MSE: 0.669968 | val MSE: 0.723534
Epoch 17 | train MSE: 0.658414 | val MSE: 0.714663
Epoch 18 | train MSE: 0.647138 | val MSE: 0.710504
Epoch 19 | train MSE: 0.637369 | val MSE: 0.701968
Epoch 20 | train MSE: 0.624954 

AE(
  (encoder): Sequential(
    (0): Linear(in_features=64, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=16, bias=True)
    (5): ReLU()
  )
  (decoder): Sequential(
    (0): Linear(in_features=16, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=64, bias=True)
  )
)

In [19]:
# compute recon errors
with torch.no_grad():
    X_tensor = torch.from_numpy(X).to(DEVICE)
    recon = model(X_tensor)
    mse = ((recon - X_tensor) ** 2).mean(dim=1).detach().cpu().numpy()  # per-window MSE

# 95th percentile threshold
threshold = np.percentile(mse, 95.0)
print(f"Threshold (95th pct): {threshold:.6f}")

# Flag anomalies: windows with error above threshold
is_anom = mse > threshold

# Map window anomalies back to tweet-level rows.
# idx_map holds row positions within df_subset (not index labels, which have gaps
# after drop_duplicates), so index positionally with iloc.
anom_indices = idx_map[is_anom]
df_subset['ae_recon_error'] = np.nan
df_subset.iloc[idx_map, df_subset.columns.get_loc('ae_recon_error')] = mse  # assign errors to window-end rows
df_subset['ae_anomaly_95p'] = False
df_subset.iloc[anom_indices, df_subset.columns.get_loc('ae_anomaly_95p')] = True


print(f"Detected anomalies: {is_anom.sum()} / {len(is_anom)} windows")
anomalies = df_subset.loc[df_subset['ae_anomaly_95p'] == True, ['content', 'sentiment_polarity']]
# non_anomalies = df_subset.loc[df_subset['ae_anomaly_95p'] == False, ['content', 'sentiment_polarity']]
print(anomalies.head(10))

Threshold (95th pct): 1.322654
Detected anomalies: 122 / 2437 windows
                                               content  sentiment_polarity
166  @ lefkosaturkbld , present information disaste...                -1.0
168  africa also reported splitting two . # turkey ...                -1.0
317  heart aching feel helpless # helpsyria # syria...                -1.0
319  idlib health directorate : dozens emergency ca...                -1.0
320    🔔 # earthquake ( # deprem ) m2.7 occurred 20 k...                -1.0
329      best option choose btw ➡️ # livrma match  : //...                 1.0
336  footage shaking adana city hospital magnitude ...                -1.0
339  # almayadeen 's correspondent # aleppo said 3 ...                -1.0
356  friend vietnam predicted another # earthquake ...                -1.0
628  interior minister soylu : 3 people lost lives ...                -1.0
                                             content  sentiment_polarity
0  new search & amp ; res
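
The count of 122 flagged windows follows from the percentile rule itself rather than from the data. With NumPy's default linear interpolation, the 95th percentile of 2437 values falls between the 2315th and 2316th smallest, so exactly 122 values lie strictly above it when there are no ties. The sketch below reproduces this with synthetic errors; the random values are an assumption standing in for the real per-window MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.random(2437)  # synthetic stand-in for per-window MSE values

threshold = np.percentile(errors, 95.0)  # linear interpolation between order statistics
flagged = errors > threshold

print(flagged.sum(), len(errors))  # 122 2437
```

A fixed percentile therefore always flags about 5% of windows, so the number of anomalies says little on its own; the size of the errors above the threshold is the more informative quantity.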

## Anomaly Detection

- `autoencoder`
  - An autoencoder neural network was designed and trained to detect anomalies based on deviations in tweet sentiment patterns.
  - The input data was structured into sequences of polarity scores.
  - The autoencoder was implemented as a fully connected feedforward network with a three-layer encoder and symmetric decoder.
  - The encoder consisted of a hidden layer with 64 neurons followed by a 16-neuron bottleneck, using rectified linear unit (ReLU) activations for encoding and decoding [42].
  - Reconstruction errors (mean squared error between actual and reconstructed sequences) were calculated, and tweets with errors above the 95th percentile threshold were flagged as anomalies.

- `LSTM with Attention`
  - An LSTM neural network with an integrated attention mechanism was implemented to detect anomalies based on prediction errors.
  - Input sequences of polarity scores were processed through LSTM layers, and attention layers were applied to selectively weigh temporal dependencies within the sequences.
  - The LSTM with attention included a single-layer LSTM model with a hidden size of 32, followed by an attention mechanism.

- Common config
  - Both models were trained for 10 epochs using the Adam optimizer (learning rate was set to 0.001), with a batch size of 32 and mean squared error (MSE) loss.
  - Sentiment polarity scores were normalized using MinMax scaling to the [0,1] range. The model’s output was a prediction of subsequent sentiment scores.
  - Anomalies were identified when prediction errors exceeded a threshold set at the 95th percentile, highlighting sudden or extreme shifts (changes) in sentiment.
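
MinMax scaling, as described in the common config, maps the smallest value to 0 and the largest to 1 via `(x - min) / (max - min)`. The sketch below applies it to the five possible polarity values; `min_max_scale` is a hypothetical helper for illustration. Note that the autoencoder cell earlier in this notebook standardizes with the mean and standard deviation instead.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

polarity = [-1.0, -0.5, 0.0, 0.5, 1.0]
scaled = min_max_scale(polarity)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```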

In [None]:
# TBA
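
Until this cell is filled in, the following is a minimal sketch of what the described model might look like in PyTorch: a single-layer LSTM with hidden size 32, followed by attention over the time steps and a linear head that predicts the next polarity score. The source does not specify the form of the attention, so the learned per-step score with a softmax used here is an assumption, as are the class name `LSTMAttention` and the made-up input batch.

```python
import torch
import torch.nn as nn

class LSTMAttention(nn.Module):
    """Single-layer LSTM followed by attention pooling over time steps."""

    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.score = nn.Linear(hidden_size, 1)  # one attention score per time step
        self.head = nn.Linear(hidden_size, 1)   # predicts the next polarity score

    def forward(self, x):
        # x: (batch, seq_len) polarity scores
        h, _ = self.lstm(x.unsqueeze(-1))              # (batch, seq_len, hidden)
        weights = torch.softmax(self.score(h), dim=1)  # (batch, seq_len, 1), sums to 1 over time
        context = (weights * h).sum(dim=1)             # (batch, hidden)
        return self.head(context).squeeze(-1), weights.squeeze(-1)

torch.manual_seed(0)
model_sketch = LSTMAttention(hidden_size=32)
batch = torch.rand(8, 63)  # 8 windows of 63 past scores (made-up data)
pred, attn = model_sketch(batch)
print(pred.shape, attn.shape)  # torch.Size([8]) torch.Size([8, 63])
```

Training would minimize the MSE between `pred` and the score that follows each window, and anomalies would then be the windows whose squared prediction error exceeds the 95th percentile, mirroring the autoencoder cells above.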