## Preprocessing and feature extraction notebook

### Group 8 Members
#### Spring Semester 2024-2025
- Alexandre Gonçalves - 20240738
- Bráulio Damba - 20240007
- Hugo Fonseca - 20240520
- Ricardo Pereira - 20240745
- Victoria Goon - 20240550

## 1 - Imports

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import scale
from sklearn.model_selection import GridSearchCV
import re
import string
import nltk
from sklearn.metrics.pairwise import cosine_distances
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from gensim.models import FastText

import scipy.sparse
from scipy import sparse

import contractions
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
import gensim.downloader as api

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

import pickle
from sklearn.utils import resample

from itertools import product

from sentence_transformers import SentenceTransformer

# Deep Learning libraries
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout, Flatten, Input
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, LSTM, Bidirectional, GRU
from tensorflow.keras.optimizers import Adam

from transformers import AutoTokenizer, AutoModel
import torch

# Set pd options to display all columns and rows
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 30)
pd.set_option('display.max_colwidth', None)  # Show full text without truncation


# Download required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="xgboost")

[nltk_data] Downloading package punkt to /Users/ricardo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ricardo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/ricardo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/ricardo/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# Define the base directory. `__file__` is not defined inside a notebook, so we use the
# current working directory (assumed to be the folder that contains the notebook).
BASE_DIR = os.getcwd()

# Construct full paths to the CSV files
train_path = os.path.join(BASE_DIR, "data", "train.csv")
test_path = os.path.join(BASE_DIR, "data", "test.csv")

# Load the datasets
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

## 2 - Pre-Processing

We filter out tweets with fewer than 3 words, since they are mostly noise and carry no sentiment value.

In [3]:
original_train_df = df_train.copy()
original_train_size = len(original_train_df)

def word_count(text):
    if isinstance(text, str):
        return len(text.strip().split())
    return 0

# Identify tweets to remove (< 3 words)
removed_tweets = original_train_df[original_train_df["text"].apply(word_count) < 3]

df_train = original_train_df[original_train_df["text"].apply(word_count) >= 3].reset_index(drop=True)

new_train_size = len(df_train)
removed = original_train_size - new_train_size
percent_removed = (removed / original_train_size) * 100

print(f"Removed {removed} tweets ({percent_removed:.2f}% of original data)")

print("\nSample of removed tweets (< 3 words):")
print(removed_tweets["text"].sample(n=min(5, len(removed_tweets)), random_state=42).to_string(index=False))

Removed 42 tweets (0.44% of original data)

Sample of removed tweets (< 3 words):
                    AMRN
Wipro赢得Marelli的多年战略性IT协议
 https://t.co/575AH1YRkF
                    BWAY
                 @TicToc


### 2.1 - Text cleaning

**Symeonidis, Effrosynidis, and Arampatzis (2018)** perform a comparative evaluation of pre-processing techniques and their interactions for Twitter sentiment analysis.

Based on this approach, we decided to:

1) **Handle contractions:** Strings like “won’t” and “don’t” are replaced by “will not” and “do not”, respectively. Without this step, tokenization would split “don’t” into the tokens “don” and “’t”; the second token carries little information on its own and would not be matched with the other occurrences of “not” in the texts.

2) **URL/user mention replacement:** In tweets, the majority of sentences contain a URL, a user mention, and/or a hashtag symbol. These carry no sentiment, so one approach is to replace them with tags during pre-processing, as e.g. **Agarwal et al. (2011)** do. In our project, we use the tags ‘URL’ and ‘USER’.

3) **Remove numbers:** Many researchers argue that keeping numbers is not really useful in sentiment analysis. We considered replacing them with a [NUM] tag, but the final cleaning function simply removes them.

4) **Replace punctuation repetition**, as it normalizes language and generalizes the vocabulary used to represent sentiment **(Balahur, 2013)**.

5) **Stopword removal:** Stopwords are function words that appear with high frequency across all sentences. Analyzing them is considered unnecessary, because they do not contain much useful information for sentiment analysis.

6) **Lemmatization:** This method analyzes a word morphologically and removes its inflectional ending, producing its base form or lemma as it is found in a dictionary. Lemmatization is used by **Guzman and Maalej (2014)** to reduce the number of feature descriptors for user sentiment extraction.

7) **Stemming:** The process of removing word endings in order to detect the root form or stem. By doing so, many words are merged and the dimensionality is reduced. It is a widely used method that generally yields good results. In our project, the Porter Stemmer (**Porter, 1980**) is used.
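
As a rough illustration of the URL/user mention replacement and punctuation repetition steps, here is a minimal sketch that uses only the standard `re` module. The helper name `tag_and_normalize` and the sample tweet are hypothetical and exist only for this example; the regular expressions follow the ones used in our cleaning function.

```python
import re


def tag_and_normalize(text):
    """Hypothetical helper: tag URLs and mentions, collapse repeated punctuation."""
    # Replace URLs with the tag 'URL'
    text = re.sub(r"http\S+|www\.\S+", "URL", text)
    # Replace user mentions with the tag 'USER'
    text = re.sub(r"@\w+", "USER", text)
    # Collapse runs of the same punctuation mark ('!!!' becomes '!')
    text = re.sub(r"([!?.])\1+", r"\1", text)
    return text


sample = "@trader this is huge!!! see https://t.co/abc123"  # made-up tweet
tagged = tag_and_normalize(sample)
print(tagged)
```

The order matters in this sketch: URLs are tagged before punctuation is normalized, so the dots inside a link are never touched by the repetition rule.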


In [4]:
# Source: https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet
lemmatizer = WordNetLemmatizer()

# Source: https://www.nltk.org/api/nltk.tokenize.casual.html
# Difference between TweetTokenizer and Word_Tokenize: https://stackoverflow.com/questions/61919670/how-nltk-tweettokenizer-different-from-nltk-word-tokenize
tokenizer = TweetTokenizer()

# Source: https://www.nltk.org/_modules/nltk/stem/porter.html
stemmer = PorterStemmer()

# Set of English stop words from NLTK
stop_words = set(stopwords.words('english'))

In [5]:
def clean_text_column(text, lemmatizer=None, stemmer=None, remove_stopwords=False):
    text = text.lower()

    # Replace URLs and user mentions
    text = re.sub(r"http\S+|www\.\S+", "URL", text)
    text = re.sub(r"@\w+", "USER", text)

    # Expand contractions (we use contractions library for this)
    # Contractions library Source: https://pypi.org/project/contractions/
    text = contractions.fix(text)

    # # Replace numbers with [NUM]
    # text = re.sub(r"\d+(\.\d+)?", "[NUM]", text)

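    # Replace cashtags (e.g., $aapl) with [TICKER]; the text is already lowercased at this point.
    # Note: in the outputs below, most tickers end up as "TICKER" once punctuation is removed,
    # but a ticker directly followed by ":" comes out as "TICKER]:".
    text = re.sub(r"\$[a-z]{1,5}", "[TICKER]", text)

    # Remove numbers
    text = re.sub(r"\d+", "", text)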

    # Normalize punctuation repetitions
    text = re.sub(r"([!?\.])\1+", r"\1", text)

    # Tokenize
    tokens = tokenizer.tokenize(text)

    # Remove stop words and punctuation
    if remove_stopwords:
        tokens = [token for token in tokens if token not in stop_words and token not in string.punctuation]
    else:
        tokens = [token for token in tokens if token not in string.punctuation]
    
    # Lemmatization OR stemming (if both are passed, only lemmatization is applied)
    if lemmatizer is not None:
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    elif stemmer is not None:
        tokens = [stemmer.stem(token) for token in tokens]
    # Else, leave tokens as is

    # Source: https://www.nltk.org/api/nltk.tokenize.treebank.html
    # TreebankWordDetokenizer from NLTK takes care of the correct spacing and formatting,
    # so we get a well-formed sentence that looks like natural English
    # (e.g. without TreebankWordDetokenizer: "This is an example tweet !", with it: "This is an example tweet!")
    return TreebankWordDetokenizer().detokenize(tokens)

#### Reason why we perform pre-processing with `clean_text_column` before the Train/Val/Test Split:

Doing this does **not** cause Data Leakage because the `clean_text_column` function performs **rule-based text transformations** such as:

- Replacing URLs, user mentions, and tickers with tags, and removing numbers, using regular expressions
- Expanding contractions
- Lemmatizing or stemming tokens
- etc. 

All these steps:
- **Do NOT learn or fit any parameters** from the data.
- **Do NOT extract information** (such as frequency statistics, word distributions, or labels) that could bias the model.
- Are simply applying the same set of rules to each text string independently, regardless of which dataset (train, val, test) it comes from.

**Therefore:**  
Applying this cleaning function to the entire dataset *before* splitting will NOT cause data leakage, because it does not expose our model to any information from the test/validation sets that it could "cheat" with during training or evaluation.

**Only steps that "fit" on the full data (like building a vocabulary, fitting a vectorizer, or computing mean/variance for scaling) should be done *after* splitting to avoid leakage.**
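
To contrast this with a step that does "fit" on data, the sketch below builds a tiny bag-of-words vocabulary in plain Python. It is an illustration only and not the vectorizer we use later; the helper names `fit_vocabulary` and `transform` and the toy texts are hypothetical. The point is that the vocabulary is learned from the training texts alone and then merely applied to the validation texts.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Hypothetical helper: learn a token -> column index mapping from the given texts."""
    counts = Counter(token for text in texts for token in text.split())
    return {token: idx for idx, (token, _) in enumerate(counts.most_common())}


def transform(texts, vocab):
    """Hypothetical helper: count known tokens; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocab)
        for token in text.split():
            if token in vocab:
                row[vocab[token]] += 1
        rows.append(row)
    return rows


train_texts = ["stock up today", "stock down"]  # toy training split
val_texts = ["stock crash today"]               # toy validation split

vocab = fit_vocabulary(train_texts)   # fitted on the training split only
X_val = transform(val_texts, vocab)   # validation data is only transformed

print(sorted(vocab))
print(X_val)
```

Because "crash" never appears in the training texts, it has no column in the validation matrix. Fitting the vocabulary on all texts would have added one, which is exactly the kind of information flow the split is meant to prevent.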


In [6]:
df_train_copy = df_train.copy()
df_test_copy = df_test.copy()

In [7]:
preproc_combinations = []

for lem, stm, rm_stop in product([None, lemmatizer], [None, stemmer], [False, True]):
    name = []
    name.append('lemma' if lem else 'no_lemma')
    name.append('stem' if stm else 'no_stem')
    name.append('no_stopwords' if rm_stop else 'with_stopwords')
    preproc_combinations.append({
        "lemmatizer": lem,
        "stemmer": stm,
        "remove_stopwords": rm_stop,
        "name": '_'.join(name)
    })

In [8]:
def apply_preproc_combinations(df, combinations, text_col="text"):
    for combo in combinations:
        column_name = f"text_{combo['name']}"
        print(f"Processing {column_name}...")
        df[column_name] = df[text_col].apply(
            lambda x: clean_text_column(
                x, 
                lemmatizer=combo['lemmatizer'], 
                stemmer=combo['stemmer'], 
                remove_stopwords=combo['remove_stopwords']
            )
        )
    return df

In [9]:
df_train_copy = apply_preproc_combinations(df_train_copy, preproc_combinations)
df_test_copy  = apply_preproc_combinations(df_test_copy, preproc_combinations)

Processing text_no_lemma_no_stem_with_stopwords...
Processing text_no_lemma_no_stem_no_stopwords...
Processing text_no_lemma_stem_with_stopwords...
Processing text_no_lemma_stem_no_stopwords...
Processing text_lemma_no_stem_with_stopwords...
Processing text_lemma_no_stem_no_stopwords...
Processing text_lemma_stem_with_stopwords...
Processing text_lemma_stem_no_stopwords...
Processing text_no_lemma_no_stem_with_stopwords...
Processing text_no_lemma_no_stem_no_stopwords...
Processing text_no_lemma_stem_with_stopwords...
Processing text_no_lemma_stem_no_stopwords...
Processing text_lemma_no_stem_with_stopwords...
Processing text_lemma_no_stem_no_stopwords...
Processing text_lemma_stem_with_stopwords...
Processing text_lemma_stem_no_stopwords...


In [10]:
df_train_copy.head()

Unnamed: 0,text,label,text_no_lemma_no_stem_with_stopwords,text_no_lemma_no_stem_no_stopwords,text_no_lemma_stem_with_stopwords,text_no_lemma_stem_no_stopwords,text_lemma_no_stem_with_stopwords,text_lemma_no_stem_no_stopwords,text_lemma_stem_with_stopwords,text_lemma_stem_no_stopwords
0,$BYND - JPMorgan reels in expectations on Beyond Meat https://t.co/bd0xbFGjkT,0,TICKER jpmorgan reels in expectations on beyond meat URL,TICKER jpmorgan reels expectations beyond meat URL,ticker jpmorgan reel in expect on beyond meat url,ticker jpmorgan reel expect beyond meat url,TICKER jpmorgan reel in expectation on beyond meat URL,TICKER jpmorgan reel expectation beyond meat URL,TICKER jpmorgan reel in expectation on beyond meat URL,TICKER jpmorgan reel expectation beyond meat URL
1,$CCL $RCL - Nomura points to bookings weakness at Carnival and Royal Caribbean https://t.co/yGjpT2ReD3,0,TICKER TICKER nomura points to bookings weakness at carnival and royal caribbean URL,TICKER TICKER nomura points bookings weakness carnival royal caribbean URL,ticker ticker nomura point to book weak at carniv and royal caribbean url,ticker ticker nomura point book weak carniv royal caribbean url,TICKER TICKER nomura point to booking weakness at carnival and royal caribbean URL,TICKER TICKER nomura point booking weakness carnival royal caribbean URL,TICKER TICKER nomura point to booking weakness at carnival and royal caribbean URL,TICKER TICKER nomura point booking weakness carnival royal caribbean URL
2,"$CX - Cemex cut at Credit Suisse, J.P. Morgan on weak building outlook https://t.co/KN1g4AWFIb",0,TICKER cemex cut at credit suisse j p morgan on weak building outlook URL,TICKER cemex cut credit suisse j p morgan weak building outlook URL,ticker cemex cut at credit suiss j p morgan on weak build outlook url,ticker cemex cut credit suiss j p morgan weak build outlook url,TICKER cemex cut at credit suisse j p morgan on weak building outlook URL,TICKER cemex cut credit suisse j p morgan weak building outlook URL,TICKER cemex cut at credit suisse j p morgan on weak building outlook URL,TICKER cemex cut credit suisse j p morgan weak building outlook URL
3,$ESS: BTIG Research cuts to Neutral https://t.co/MCyfTsXc2N,0,TICKER]: btig research cuts to neutral URL,TICKER]: btig research cuts neutral URL,ticker]: btig research cut to neutral url,ticker]: btig research cut neutral url,TICKER]: btig research cut to neutral URL,TICKER]: btig research cut neutral URL,TICKER]: btig research cut to neutral URL,TICKER]: btig research cut neutral URL
4,$FNKO - Funko slides after Piper Jaffray PT cut https://t.co/z37IJmCQzB,0,TICKER funko slides after piper jaffray pt cut URL,TICKER funko slides piper jaffray pt cut URL,ticker funko slide after piper jaffray pt cut url,ticker funko slide piper jaffray pt cut url,TICKER funko slide after piper jaffray pt cut URL,TICKER funko slide piper jaffray pt cut URL,TICKER funko slide after piper jaffray pt cut URL,TICKER funko slide piper jaffray pt cut URL


In [11]:
df_test_copy.head()

Unnamed: 0,id,text,text_no_lemma_no_stem_with_stopwords,text_no_lemma_no_stem_no_stopwords,text_no_lemma_stem_with_stopwords,text_no_lemma_stem_no_stopwords,text_lemma_no_stem_with_stopwords,text_lemma_no_stem_no_stopwords,text_lemma_stem_with_stopwords,text_lemma_stem_no_stopwords
0,0,"ETF assets to surge tenfold in 10 years to $50 trillion, Bank of America predicts",etf assets to surge tenfold in years to trillion bank of america predicts,etf assets surge tenfold years trillion bank america predicts,etf asset to surg tenfold in year to trillion bank of america predict,etf asset surg tenfold year trillion bank america predict,etf asset to surge tenfold in year to trillion bank of america predicts,etf asset surge tenfold year trillion bank america predicts,etf asset to surge tenfold in year to trillion bank of america predicts,etf asset surge tenfold year trillion bank america predicts
1,1,Here’s What Hedge Funds Think Evolution Petroleum Corporation (EPM),here is what hedge funds think evolution petroleum corporation epm,hedge funds think evolution petroleum corporation epm,here is what hedg fund think evolut petroleum corpor epm,hedg fund think evolut petroleum corpor epm,here is what hedge fund think evolution petroleum corporation epm,hedge fund think evolution petroleum corporation epm,here is what hedge fund think evolution petroleum corporation epm,hedge fund think evolution petroleum corporation epm
2,2,$PVH - Phillips-Van Heusen Q3 2020 Earnings Preview https://t.co/kNhCYwVnBX,TICKER phillips-van heusen q earnings preview URL,TICKER phillips-van heusen q earnings preview URL,ticker phillips-van heusen q earn preview url,ticker phillips-van heusen q earn preview url,TICKER phillips-van heusen q earnings preview URL,TICKER phillips-van heusen q earnings preview URL,TICKER phillips-van heusen q earnings preview URL,TICKER phillips-van heusen q earnings preview URL
3,3,"China is in the process of waiving retaliatory tariffs on imports of U.S. pork and soy by domestic companies, a pro… https://t.co/08mZU9TrBX",china is in the process of waiving retaliatory tariffs on imports of you s pork and soy by domestic companies a pro … URL,china process waiving retaliatory tariffs imports pork soy domestic companies pro … URL,china is in the process of waiv retaliatori tariff on import of you s pork and soy by domest compani a pro … url,china process waiv retaliatori tariff import pork soy domest compani pro … url,china is in the process of waiving retaliatory tariff on import of you s pork and soy by domestic company a pro … URL,china process waiving retaliatory tariff import pork soy domestic company pro … URL,china is in the process of waiving retaliatory tariff on import of you s pork and soy by domestic company a pro … URL,china process waiving retaliatory tariff import pork soy domestic company pro … URL
4,4,"Highlight: “When growth is scarce, investors seem very willing to pay up for growth stock"" @PNCBank's… https://t.co/rO4fBOkBG9",highlight “ when growth is scarce investors seem very willing to pay up for growth stock USER's … URL,highlight “ growth scarce investors seem willing pay growth stock USER's … URL,highlight “ when growth is scarc investor seem veri will to pay up for growth stock user' … url,highlight “ growth scarc investor seem will pay growth stock user' … url,highlight “ when growth is scarce investor seem very willing to pay up for growth stock USER's … URL,highlight “ growth scarce investor seem willing pay growth stock USER's … URL,highlight “ when growth is scarce investor seem very willing to pay up for growth stock USER's … URL,highlight “ growth scarce investor seem willing pay growth stock USER's … URL


In [12]:
df_train_cleaned = df_train_copy.copy()
df_test_cleaned = df_test_copy.copy()

In [13]:
# Using stratify to maintain the distribution of classes in the train, validation, and test sets 
# As our dataset is quite small, we use 80% for training, and split the remaining 20% into validation and test sets (10% each).

train_df, val_test_df = train_test_split(df_train_cleaned, test_size=0.2, stratify=df_train_cleaned['label'], random_state=42)
val_df, test_df = train_test_split(val_test_df, test_size=0.5, stratify=val_test_df['label'], random_state=42)
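
To make the effect of `stratify` more concrete, the sketch below re-implements a simplified stratified split in plain Python. It is not how `train_test_split` works internally; the helper `stratified_split` and the toy label list are hypothetical and only illustrate that an 80/20 split followed by a 50/50 split yields 80/10/10 while keeping the class proportions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=42):
    """Hypothetical helper: return (train_idx, test_idx) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 20 + [1] * 30 + [2] * 50  # toy, imbalanced label column

# First split: 80% train, 20% held out
train_idx, rest_idx = stratified_split(labels, test_size=0.2)

# Second split: the held-out 20% is divided equally into validation and test
rest_labels = [labels[i] for i in rest_idx]
val_pos, test_pos = stratified_split(rest_labels, test_size=0.5)
val_idx = [rest_idx[i] for i in val_pos]
test_idx = [rest_idx[i] for i in test_pos]

print(len(train_idx), len(val_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

In this toy example every split keeps the 20/30/50 class ratio of the full label list, which is the property we rely on when evaluating on the small validation and test sets.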

In [14]:
y_train = train_df['label']
y_val = val_df['label']
y_test = test_df['label']

In [15]:
train_df.drop(columns=['label'])

Unnamed: 0,text,text_no_lemma_no_stem_with_stopwords,text_no_lemma_no_stem_no_stopwords,text_no_lemma_stem_with_stopwords,text_no_lemma_stem_no_stopwords,text_lemma_no_stem_with_stopwords,text_lemma_no_stem_no_stopwords,text_lemma_stem_with_stopwords,text_lemma_stem_no_stopwords
7473,"$FTS: Fortis announced the appointment of David Hutchens as Chief Operating Officer, Fortis, effective January 1,... https://t.co/90PWjTanjp",TICKER]: fortis announced the appointment of david hutchens as chief operating officer fortis effective january URL,TICKER]: fortis announced appointment david hutchens chief operating officer fortis effective january URL,ticker]: forti announc the appoint of david hutchen as chief oper offic forti effect januari url,ticker]: forti announc appoint david hutchen chief oper offic forti effect januari url,TICKER]: fortis announced the appointment of david hutchens a chief operating officer fortis effective january URL,TICKER]: fortis announced appointment david hutchens chief operating officer fortis effective january URL,TICKER]: fortis announced the appointment of david hutchens a chief operating officer fortis effective january URL,TICKER]: fortis announced appointment david hutchens chief operating officer fortis effective january URL
9279,Ebay stock up 5.4%in Monday premarket trading,ebay stock up in monday premarket trading,ebay stock monday premarket trading,ebay stock up in monday premarket trade,ebay stock monday premarket trade,ebay stock up in monday premarket trading,ebay stock monday premarket trading,ebay stock up in monday premarket trading,ebay stock monday premarket trading
7027,Nasdaq Private Market Sets New Annual Transaction Record in 2019 - StreetInsider.com,nasdaq private market sets new annual transaction record in streetinsider.com,nasdaq private market sets new annual transaction record streetinsider.com,nasdaq privat market set new annual transact record in streetinsider.com,nasdaq privat market set new annual transact record streetinsider.com,nasdaq private market set new annual transaction record in streetinsider.com,nasdaq private market set new annual transaction record streetinsider.com,nasdaq private market set new annual transaction record in streetinsider.com,nasdaq private market set new annual transaction record streetinsider.com
8900,$CBAY (-76.9% pre) CymaBay Therapeutics Halts Clinical Development of Seladelpar - GN https://t.co/GZdaOP9sfB,TICKER pre cymabay therapeutics halts clinical development of seladelpar gn URL,TICKER pre cymabay therapeutics halts clinical development seladelpar gn URL,ticker pre cymabay therapeut halt clinic develop of seladelpar gn url,ticker pre cymabay therapeut halt clinic develop seladelpar gn url,TICKER pre cymabay therapeutic halt clinical development of seladelpar gn URL,TICKER pre cymabay therapeutic halt clinical development seladelpar gn URL,TICKER pre cymabay therapeutic halt clinical development of seladelpar gn URL,TICKER pre cymabay therapeutic halt clinical development seladelpar gn URL
2236,Dunkin' Brands lifts dividend and buyback as 2020 outlook shy of estimates,dunkin brands lifts dividend and buyback as outlook shy of estimates,dunkin brands lifts dividend buyback outlook shy estimates,dunkin brand lift dividend and buyback as outlook shi of estim,dunkin brand lift dividend buyback outlook shi estim,dunkin brand lift dividend and buyback a outlook shy of estimate,dunkin brand lift dividend buyback outlook shy estimate,dunkin brand lift dividend and buyback a outlook shy of estimate,dunkin brand lift dividend buyback outlook shy estimate
...,...,...,...,...,...,...,...,...,...
8516,Here Are 2 Key Catalysts for Kroger Stock in Fiscal 2020,here are key catalysts for kroger stock in fiscal,key catalysts kroger stock fiscal,here are key catalyst for kroger stock in fiscal,key catalyst kroger stock fiscal,here are key catalyst for kroger stock in fiscal,key catalyst kroger stock fiscal,here are key catalyst for kroger stock in fiscal,key catalyst kroger stock fiscal
3033,U.S. gasoline futures tumbled to their lowest level since 1999-with some pump prices already below $1-as coronaviru… https://t.co/Tt9W1qMrJj,you s gasoline futures tumbled to their lowest level since  with some pump prices already below  as coronaviru … URL,gasoline futures tumbled lowest level since  pump prices already  coronaviru … URL,you s gasolin futur tumbl to their lowest level sinc  with some pump price alreadi below  as coronaviru … url,gasolin futur tumbl lowest level sinc  pump price alreadi  coronaviru … url,you s gasoline future tumbled to their lowest level since  with some pump price already below  a coronaviru … URL,gasoline future tumbled lowest level since  pump price already  coronaviru … URL,you s gasoline future tumbled to their lowest level since  with some pump price already below  a coronaviru … URL,gasoline future tumbled lowest level since  pump price already  coronaviru … URL
810,Frozen Wells Fargo Bonuses Show a Peril for Bankers After Crisis,frozen wells fargo bonuses show a peril for bankers after crisis,frozen wells fargo bonuses show peril bankers crisis,frozen well fargo bonus show a peril for banker after crisi,frozen well fargo bonus show peril banker crisi,frozen well fargo bonus show a peril for banker after crisis,frozen well fargo bonus show peril banker crisis,frozen well fargo bonus show a peril for banker after crisis,frozen well fargo bonus show peril banker crisis
7767,Why 51job Shares Dropped 15% Last Month,why job shares dropped last month,job shares dropped last month,whi job share drop last month,job share drop last month,why job share dropped last month,job share dropped last month,why job share dropped last month,job share dropped last month


In [16]:
X_train_final = train_df["text_no_lemma_stem_with_stopwords"].to_numpy()
X_val_final = val_df["text_no_lemma_stem_with_stopwords"].to_numpy()
X_test_final = test_df["text_no_lemma_stem_with_stopwords"].to_numpy()
np.save("train_text_no_lemma_stem_with_stopwords.npy", X_train_final)
np.save("val_text_no_lemma_stem_with_stopwords.npy", X_val_final)
np.save("test_text_no_lemma_stem_with_stopwords.npy", X_test_final)


## 3 - Feature Extraction

We follow a general pipeline in which each feature extraction technique is paired with the classification models that are adequate for it:

- 3.1 - Statistical Methods: Bag of Words and TF-IDF -> 3.1.1 Classification Models: SVC, XGB, Logistic Regression and KNN -> 3.1.2 Hyperparameter tuning for the best feature extraction technique and for the best model

- 3.2 - Fixed Word Embedding Encoders: Word2Vec, FastText, GloVe-Twitter -> 3.2.1 Classification Models: keep the best traditional ML model from 3.1 and add BiLSTM, BiGRU, BiLSTM + Attention, BiGRU + Attention

- 3.3 - Contextual Word Embedding Encoders: ELMo (mean and concat) -> 3.3.1 Classification Models: keep the best traditional ML model from 3.1 and add BiLSTM, BiGRU, BiLSTM + Attention, BiGRU + Attention

- 3.4 - Sentence Encoders: all-mpnet-base-v2, all-distilroberta-v1, all-MiniLM-L12-v2, paraphrase-multilingual-mpnet-base-v2 (Source: https://sbert.net/docs/sentence_transformer/pretrained_models.html) -> 3.4.1 Classification Models: keep the best traditional ML model from 3.1 and add BiLSTM, BiGRU, BiLSTM + Attention, BiGRU + Attention

- 3.5 - Transformers: BERT Base, BERT Large, XLNet Base, XLNet Large, RoBERTa Base, RoBERTa Large, DistilBERT Large, DistilBERT Base, ALBERT-xlarge-v1, ALBERT-xxlarge-v2, XLM-MLM-en-2048, BART-Large

- 3.6 - Domain-Specific Transformers: FinBERT, BERTweet, FinTwitBERT (Source: https://huggingface.co/StephanAkkerman/FinTwitBERT)


As stated above, after doing the encoding for each feature extraction technique, we call the SMOTE function.

Moreover, one important note: since we have 8 different text pre-processing combinations and several models, running every combination would be quite computationally expensive. We therefore determined the best text variant in a different notebook and use only that variant here.
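
For reference, the 8 combinations come from three on/off choices (lemmatization, stemming, stopword removal). The sketch below enumerates the resulting column names with `itertools.product`; it uses plain booleans instead of the actual lemmatizer and stemmer objects, which is a simplification made for this example.

```python
from itertools import product

column_names = []
# Three binary choices give 2 * 2 * 2 = 8 pre-processing variants
for use_lemma, use_stem, remove_stop in product([False, True], repeat=3):
    parts = [
        "lemma" if use_lemma else "no_lemma",
        "stem" if use_stem else "no_stem",
        "no_stopwords" if remove_stop else "with_stopwords",
    ]
    column_names.append("text_" + "_".join(parts))

print(len(column_names))
for name in column_names:
    print(name)
```

Every model evaluated on all variants multiplies its training cost by 8, which is why only one variant is carried forward.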

In [17]:
final_combinations  = [
    "text_no_lemma_stem_with_stopwords"
]

### 3.1 - Statistical Methods - Bag of Words and TF-IDF

We use `.npz` to store sparse matrices (like BoW or TF-IDF) and `.npy` to store dense arrays (like embeddings) for later use in different notebooks.

In [18]:
for vec_type, VecClass in [('bow', CountVectorizer), ('tfidf', TfidfVectorizer)]:
    for column_name in final_combinations:
        print(f"Fitting {vec_type} vectorizer for {column_name}...")
        vectorizer = VecClass(ngram_range=(1,2), max_features=15000)
        vectorizer.fit(train_df[column_name])
        X_train = vectorizer.transform(train_df[column_name])
        X_val   = vectorizer.transform(val_df[column_name])
        X_test  = vectorizer.transform(test_df[column_name])
        sparse.save_npz(f"{vec_type}_{column_name}_train.npz", X_train)
        sparse.save_npz(f"{vec_type}_{column_name}_val.npz", X_val)
        sparse.save_npz(f"{vec_type}_{column_name}_test.npz", X_test)

Fitting bow vectorizer for text_no_lemma_stem_with_stopwords...
Fitting tfidf vectorizer for text_no_lemma_stem_with_stopwords...


### 3.2 - Fixed Word Embedding Encoders

### Word2Vec and FastText

In [19]:
def tweet_to_seq(tokens, model, max_len, embed_dim):
    seq = np.zeros((max_len, embed_dim), dtype='float32')
    for i, token in enumerate(tokens[:max_len]):
        if token in model.wv:
            seq[i] = model.wv[token]
    return seq

In [20]:
ft_vectors = {}
ft_models = {}
vector_size = 200
max_sequence_length = 32 

for enc_type, ModelClass in [('w2v', Word2Vec), ('ft', FastText)]:
    for column_name in final_combinations:
        print(f"Training {enc_type} for {column_name}...")

        # Prepare tokenized sentences
        train_sentences = [tweet.split() for tweet in train_df[column_name]]
        val_sentences   = [tweet.split() for tweet in val_df[column_name]]
        test_sentences  = [tweet.split() for tweet in test_df[column_name]]

        # Train model on train split only
        model = ModelClass(sentences=train_sentences, vector_size=vector_size, window=10, min_count=1, workers=7)

        # Get sequence vectors (3D arrays)
        X_train = np.stack([
            tweet_to_seq(tokens, model, max_sequence_length, vector_size)
            for tokens in train_sentences
        ])
        X_val = np.stack([
            tweet_to_seq(tokens, model, max_sequence_length, vector_size)
            for tokens in val_sentences
        ])
        X_test = np.stack([
            tweet_to_seq(tokens, model, max_sequence_length, vector_size)
            for tokens in test_sentences
        ])

        # Save as .npy
        np.save(f"{enc_type}_seq_{column_name}_train.npy", X_train)
        np.save(f"{enc_type}_seq_{column_name}_val.npy", X_val)
        np.save(f"{enc_type}_seq_{column_name}_test.npy", X_test)

Training w2v for text_no_lemma_stem_with_stopwords...
Training ft for text_no_lemma_stem_with_stopwords...


### GloVe-Twitter

In [21]:
glove_configs = {
    25: "glove.twitter.27B.25d.txt",
    50: "glove.twitter.27B.50d.txt",
    100: "glove.twitter.27B.100d.txt",
    200: "glove.twitter.27B.200d.txt"
}

class GloveWrapper:
    def __init__(self, glove_dict):
        self.wv = glove_dict

def load_glove_model(filepath):
    embeddings = {}
    with open(filepath, "r", encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            vector = np.asarray(parts[1:], dtype="float32")
            embeddings[word] = vector
    return GloveWrapper(embeddings)

for dim, glove_path in glove_configs.items():
    print(f"\nLoading GloVe model with {dim} dimensions...")
    glove_model = load_glove_model(glove_path)

    for column_name in final_combinations:
        print(f"Processing GloVe SEQUENCE {dim}D for {column_name}...")

        train_sentences = [tweet.split() for tweet in train_df[column_name]]
        val_sentences   = [tweet.split() for tweet in val_df[column_name]]
        test_sentences  = [tweet.split() for tweet in test_df[column_name]]

        X_train = np.stack([tweet_to_seq(tokens, glove_model, max_sequence_length, dim) for tokens in train_sentences])
        X_val   = np.stack([tweet_to_seq(tokens, glove_model, max_sequence_length, dim) for tokens in val_sentences])
        X_test  = np.stack([tweet_to_seq(tokens, glove_model, max_sequence_length, dim) for tokens in test_sentences])

        np.save(f"glove_seq_{column_name}_{dim}d_train.npy", X_train)
        np.save(f"glove_seq_{column_name}_{dim}d_val.npy", X_val)
        np.save(f"glove_seq_{column_name}_{dim}d_test.npy", X_test)


Loading GloVe model with 25 dimensions...
Processing GloVe SEQUENCE 25D for text_no_lemma_stem_with_stopwords...

Loading GloVe model with 50 dimensions...
Processing GloVe SEQUENCE 50D for text_no_lemma_stem_with_stopwords...

Loading GloVe model with 100 dimensions...
Processing GloVe SEQUENCE 100D for text_no_lemma_stem_with_stopwords...

Loading GloVe model with 200 dimensions...
Processing GloVe SEQUENCE 200D for text_no_lemma_stem_with_stopwords...


### 3.3 - Sentence Transformers

In [22]:
from sentence_transformers import SentenceTransformer

# Pretrained general-purpose sentence encoders; each key is used as the prefix of the saved .npy files
sentence_transformers = dict(
    mpnet_base_v2 = SentenceTransformer('sentence-transformers/all-mpnet-base-v2'),
    distilroberta_v1 = SentenceTransformer('sentence-transformers/all-distilroberta-v1'),
    minilm_l12_v2 = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2'),
    all_minilm_l6_v2 = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
)

for model_name, model in sentence_transformers.items():
    for column_name in final_combinations:
        for split_name, df in zip(['train', 'val', 'test'], [train_df, val_df, test_df]):
            print(f"Encoding {column_name} with {model_name} for {split_name} set...")
            text_data = df[column_name].astype(str).tolist()
            embeddings = model.encode(text_data, batch_size=64, show_progress_bar=True)
            np.save(f"{model_name}_{column_name}_{split_name}.npy", embeddings)

Encoding text_no_lemma_stem_with_stopwords with mpnet_base_v2 for train set...


Batches: 100%|██████████| 119/119 [00:10<00:00, 11.30it/s]


Encoding text_no_lemma_stem_with_stopwords with mpnet_base_v2 for val set...


Batches: 100%|██████████| 15/15 [00:01<00:00, 10.81it/s]


Encoding text_no_lemma_stem_with_stopwords with mpnet_base_v2 for test set...


Batches: 100%|██████████| 15/15 [00:01<00:00, 11.40it/s]


Encoding text_no_lemma_stem_with_stopwords with distilroberta_v1 for train set...


Batches: 100%|██████████| 119/119 [00:05<00:00, 21.92it/s]


Encoding text_no_lemma_stem_with_stopwords with distilroberta_v1 for val set...


Batches: 100%|██████████| 15/15 [00:00<00:00, 18.61it/s]


Encoding text_no_lemma_stem_with_stopwords with distilroberta_v1 for test set...


Batches: 100%|██████████| 15/15 [00:00<00:00, 22.09it/s]


Encoding text_no_lemma_stem_with_stopwords with minilm_l12_v2 for train set...


Batches: 100%|██████████| 119/119 [00:04<00:00, 29.74it/s]


Encoding text_no_lemma_stem_with_stopwords with minilm_l12_v2 for val set...


Batches: 100%|██████████| 15/15 [00:00<00:00, 29.57it/s]


Encoding text_no_lemma_stem_with_stopwords with minilm_l12_v2 for test set...


Batches: 100%|██████████| 15/15 [00:00<00:00, 29.88it/s]


Encoding text_no_lemma_stem_with_stopwords with all_minilm_l6_v2 for train set...


Batches: 100%|██████████| 119/119 [00:01<00:00, 67.51it/s]


Encoding text_no_lemma_stem_with_stopwords with all_minilm_l6_v2 for val set...


Batches: 100%|██████████| 15/15 [00:00<00:00, 60.21it/s]


Encoding text_no_lemma_stem_with_stopwords with all_minilm_l6_v2 for test set...


Batches: 100%|██████████| 15/15 [00:00<00:00, 64.76it/s]


### 3.4 - Transformers

In this first approach, we use transformers for **feature extraction**. The two usual ways to turn the token-level outputs into a single sentence embedding are **mean pooling** and **CLS pooling**. We opt for **mean pooling**, as it tends to be more robust across the majority of architectures.
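To make the difference between the two pooling strategies concrete, here is a minimal NumPy sketch on a hand-made tensor. The values, the shapes, and the names `hidden`, `mask`, `cls_vec` and `mean_vec` are illustrative assumptions and do not come from any real model; the first position simply plays the role of the [CLS] token and the last one the role of padding.

```python
import numpy as np

# Toy "last hidden state": 1 sentence, 4 token positions, hidden size 3.
# Position 0 stands in for the [CLS] token; position 3 is padding.
hidden = np.array([[[1.0, 0.0, 2.0],
                    [3.0, 2.0, 0.0],
                    [2.0, 4.0, 1.0],
                    [9.0, 9.0, 9.0]]])   # padding row, must be ignored
mask = np.array([[1, 1, 1, 0]])          # 1 = real token, 0 = padding

# CLS pooling: keep only the vector at the first position.
cls_vec = hidden[:, 0, :]

# Mean pooling: average the vectors of the real (unmasked) tokens.
m = mask[..., None].astype(hidden.dtype)
mean_vec = (hidden * m).sum(axis=1) / m.sum(axis=1)

print(cls_vec)   # [[1. 0. 2.]]
print(mean_vec)  # [[2. 2. 1.]]
```

CLS pooling relies on one position summarising the whole input, whereas mean pooling uses every real token, which is why it does not depend on how the first token was trained.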

#### 1. Feature Extraction

In [23]:
transformer_names = [
    'bert-large-uncased', 
    'roberta-large',
    'facebook/bart-large'
]

In [24]:
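import torch
from transformers import AutoModel, AutoTokenizer

# Use the Apple Silicon GPU (MPS backend) when available, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")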

We decided to implement mean pooling for all the transformers because, according to multiple sources:

- In terms of embedding quality, mean pooling is generally more robust for tasks that require comprehensive context. If the model is fine-tuned for a specific task, the [CLS] token can outperform pooling by focusing on task-relevant features; in sentiment analysis, for instance, a fine-tuned [CLS] token might isolate emotional cues better than a mean-pooled vector. However, if the model is not fine-tuned for the target task, or if the task differs significantly from the pretraining objectives, the quality of the [CLS] token may degrade, which makes mean pooling the safer general-purpose choice. CLS pooling is therefore better reserved for fine-tuned, domain-specific transformers.

- “Mean pooling is often more robust than using the [CLS] token for creating fixed-size sentence embeddings, especially for models not explicitly trained for classification tasks.” (Source: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks)
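One practical detail behind the mean-pooling choice is that the average must be taken over real tokens only, because tweets in the same batch are padded to a common length. The sketch below, in plain NumPy with random stand-in values, checks that a masked mean is unchanged by padding while a naive mean over all positions is not. The helper `masked_mean` and all the arrays are hypothetical illustrations, not part of the notebook's pipeline.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over real tokens only (mask == 1)."""
    m = mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    counts = np.clip(m.sum(axis=1), 1e-9, None)   # guard against division by zero
    return (hidden * m).sum(axis=1) / counts

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1, 5, 8))               # one 5-token sentence, hidden size 8
mask = np.ones((1, 5), dtype=int)

# The same sentence padded to length 12, as happens when it shares a batch
# with a longer tweet. Padding positions carry arbitrary values.
pad = rng.normal(size=(1, 7, 8))
padded_tokens = np.concatenate([tokens, pad], axis=1)
padded_mask = np.concatenate([mask, np.zeros((1, 7), dtype=int)], axis=1)

# Masked mean ignores the padding; a naive mean over all positions does not.
same = np.allclose(masked_mean(tokens, mask),
                   masked_mean(padded_tokens, padded_mask))
naive_same = np.allclose(tokens.mean(axis=1), padded_tokens.mean(axis=1))
print(same, naive_same)
```

This is the same role the attention mask plays in the `mean_pooling` function that follows.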

In [25]:
def mean_pooling(model, tokenizer, texts, batch_size=32, max_length=128):
    """Return one mean-pooled embedding per text as a (len(texts), hidden_size) array."""
    model.eval()
    all_embeddings = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
            outputs = model(**inputs)
            # Zero out padding positions so they do not contribute to the average
            attention_mask = inputs['attention_mask'].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
            masked_embeddings = outputs.last_hidden_state * attention_mask
            sum_embeddings = masked_embeddings.sum(dim=1)
            # Number of real tokens per text; the clamp guards against division by zero
            sum_mask = attention_mask.sum(dim=1).clamp(min=1e-9)
            mean_pooled = sum_embeddings / sum_mask
            all_embeddings.append(mean_pooled.cpu().numpy())
    return np.vstack(all_embeddings)

In [26]:
split_names = ['train', 'val', 'test']
dfs = [train_df, val_df, test_df]

for model_name in transformer_names:
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = model.to(device) 

    for column_name in final_combinations:
        if not column_name.startswith("text_"): continue
        for split_name, df in zip(split_names, dfs):
            print(f"Encoding {column_name} with {model_name} for {split_name} set...")
            texts = df[column_name].astype(str).tolist()
            embeddings = mean_pooling(model, tokenizer, texts)
            np.save(f"{model_name.replace('/','-')}_{column_name}_{split_name}_meanpooled.npy", embeddings)

Encoding text_no_lemma_stem_with_stopwords with bert-large-uncased for train set...
Encoding text_no_lemma_stem_with_stopwords with bert-large-uncased for val set...
Encoding text_no_lemma_stem_with_stopwords with bert-large-uncased for test set...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Encoding text_no_lemma_stem_with_stopwords with roberta-large for train set...
Encoding text_no_lemma_stem_with_stopwords with roberta-large for val set...
Encoding text_no_lemma_stem_with_stopwords with roberta-large for test set...


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Encoding text_no_lemma_stem_with_stopwords with facebook/bart-large for train set...
Encoding text_no_lemma_stem_with_stopwords with facebook/bart-large for val set...
Encoding text_no_lemma_stem_with_stopwords with facebook/bart-large for test set...


### 3.5 - Domain-Specific Transformers

In [27]:
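# Keys are used as file-name prefixes for the saved embeddings
domain_models = {
    "finbert": "ProsusAI/finbert",
    "berttweet": "vinai/bertweet-base",        # BERTweet checkpoint (key spelling kept to match the saved files)
    "fintwitbert": "yiyanghkust/finbert-tone"  # despite the key name, this checkpoint is FinBERT-tone
}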

In [28]:
models = {}
tokenizers = {}
for key, name in domain_models.items():
    tokenizers[key] = AutoTokenizer.from_pretrained(name)
    models[key] = AutoModel.from_pretrained(name).to(device)

In [29]:
split_names = ['train', 'val', 'test']
dfs = [train_df, val_df, test_df]

for model_key, model in models.items():
    tokenizer = tokenizers[model_key]
    for column_name in final_combinations:
        if not column_name.startswith("text_"): continue
        for split_name, df in zip(split_names, dfs):
            print(f"Encoding {column_name} with {model_key} for {split_name} set...")
            texts = df[column_name].astype(str).tolist()
            embeddings = mean_pooling(model, tokenizer, texts)
            np.save(f"{model_key}_{column_name}_{split_name}_meanpooled.npy", embeddings)

Encoding text_no_lemma_stem_with_stopwords with finbert for train set...
Encoding text_no_lemma_stem_with_stopwords with finbert for val set...
Encoding text_no_lemma_stem_with_stopwords with finbert for test set...
Encoding text_no_lemma_stem_with_stopwords with berttweet for train set...
Encoding text_no_lemma_stem_with_stopwords with berttweet for val set...
Encoding text_no_lemma_stem_with_stopwords with berttweet for test set...
Encoding text_no_lemma_stem_with_stopwords with fintwitbert for train set...
Encoding text_no_lemma_stem_with_stopwords with fintwitbert for val set...
Encoding text_no_lemma_stem_with_stopwords with fintwitbert for test set...
