# 1. Build SVM model with TF-IDF and BoW features

## 1.1 Preprocessing train and test tweet data

In [1]:
pip install wordninja

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 541.6/541.6 kB 13.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wordninja
  Building wheel for wordninja (setup.py) ... done
  Created wheel for wordninja: filename=wordninja-2.0.0-py3-none-any.whl size=541530 sha256=4fce30e8d2f0acb5749338629ffe4c155b7f5b7fd451dd895b4257a425c30f0d
  Stored in directory: /root/.cache/pip/wheels/aa/44/3a/f2a5c1859b8b541ded969b4cd12d0a58897f12408f4f51e084
Successfully built wordninja
Installing collected packages: wordninja
Successfully installed wordninja-2.0.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import re, string, time
import wordninja
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import SnowballStemmer

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### For tokenisation we use NLTK’s TweetTokenizer because it handles typical tweet elements
### (mentions, hashtags, emojis, URLs) better than a simple whitespace or regex tokenizer, or spaCy.
### It keeps tweet-specific elements intact
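
To illustrate why a tweet-aware tokenizer matters, here is a minimal sketch that does not use NLTK. The function `toy_tweet_tokenize` and its regex are hypothetical stand-ins, much simpler than TweetTokenizer, but they show how mentions, hashtags, URLs and emoticons can survive as single tokens while a naive word regex breaks them apart.

```python
import re

# Hypothetical, simplified pattern (an assumption, not TweetTokenizer's actual rules):
# URLs, then @mentions/#hashtags, then words, then emoticons, then any other symbol.
TOY_TWEET_PATTERN = re.compile(
    r"https?://\S+"
    r"|[@#]\w+"
    r"|\w+(?:'\w+)?"
    r"|[:;=8][\-o]?[\)\]\(\[dDpP]"
    r"|[^\w\s]"
)

def toy_tweet_tokenize(text):
    """Return tweet-aware tokens; tweet-specific elements stay intact."""
    return TOY_TWEET_PATTERN.findall(text)

def naive_tokenize(text):
    """Return plain word tokens; everything else is discarded."""
    return re.findall(r"\w+", text)

tweet = "@bob #wildfire http://t.co/x :)"
tweet_tokens = toy_tweet_tokenize(tweet)
naive_tokens = naive_tokenize(tweet)
print(tweet_tokens)
print(naive_tokens)
```

The naive version loses the `@`, `#` and emoticon entirely and shreds the URL into fragments, which is the behaviour the tweet-aware tokenizer is meant to avoid.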



#### For the preprocessing we are going to do the following steps:

**We remove stopwords:**

We remove standard English stopwords (like “the”, “and”) but keep words like “not” and “no” to preserve negation cues for classification.

**We replace mentions and timestamps with placeholders:**

We replace user mentions with the placeholder “@user”; this preserves the idea of someone being mentioned without leaking specific usernames. 
Timestamps (12/10, 13 pm, 13:34am) get converted to “timestamp” so the model isn’t distracted by exact times.
 
##### We remove the “#” from hashtags and split compound hashtag words with wordninja

##### We remove non-ASCII and non-alphanumeric characters (except spaces) and standalone punctuation

**Splitting “n’t”:** we split “didn’t” → “did” + “n’t” and map “n’t” → “not” to make negation more explicit
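
The placeholder and negation steps can be sketched in isolation as follows. This is a simplified illustration, not the notebook's full pipeline: `replace_placeholders` and `split_negation` are hypothetical helper names, and the timestamp regex here is a reduced version chosen for readability. Note that the negation split has to see the apostrophe, so it must run before any step that strips punctuation.

```python
import re

MENTION = re.compile(r"@\w+")
# Simplified timestamp pattern for this sketch (an assumption; it covers fewer formats)
TIMESTAMP = re.compile(
    r"\b\d{1,2}:\d{2}\s*(?:am|pm)?\b|\b\d{1,2}\s*(?:am|pm)\b",
    re.IGNORECASE,
)

def replace_placeholders(text):
    """Swap user handles and clock times for generic placeholder tokens."""
    text = MENTION.sub("@user", text)
    return TIMESTAMP.sub("timestamp", text)

def split_negation(tokens):
    """Turn tokens ending in n't into the stem followed by an explicit 'not'."""
    out = []
    for tok in tokens:
        if tok.endswith("n't") and len(tok) > 3:
            out.extend([tok[:-3], "not"])
        else:
            out.append(tok)
    return out

example = replace_placeholders("@alice didn't call at 12:30pm")
tokens = split_negation(example.split())
print(example)
print(tokens)
```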

#### We add the keyword (train_df['keyword']) and location (train_df['location']) of each tweet to the tweet text and run the preprocessing on this combined text. The keyword is placed at both the beginning and the end of each tweet

## We use stemming (SnowballStemmer) for tweet preprocessing

We choose stemming over lemmatization to reduce words to their base forms.
I ran several SVM experiments with SnowballStemmer stemming and with lemmatization using WordNetLemmatizer and spaCy.

Stemming with SnowballStemmer gave the best results, outperforming both WordNetLemmatizer and spaCy.

SnowballStemmer probably works better here because its aggressive reduction of words leads to a smaller, more consolidated feature space. This reduction may boost frequency counts for key tokens and reduce noise from inconsistent morphological variants.
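
The feature-space argument can be illustrated with a toy example. The `toy_stem` function below is a deliberately crude suffix stripper invented for this sketch; it is not the Snowball algorithm, and its suffix list is an assumption. It only shows how collapsing morphological variants shrinks the vocabulary and concentrates counts.

```python
from collections import Counter

# Toy suffix list for illustration only (NOT the Snowball rules)
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

tokens = ["flood", "floods", "flooding", "flooded", "fire", "fires", "fired"]
raw_counts = Counter(tokens)                        # one feature per surface form
stem_counts = Counter(toy_stem(t) for t in tokens)  # variants share a feature

print(len(raw_counts), "features before stemming")
print(len(stem_counts), "features after stemming:", dict(stem_counts))
```

Seven sparse features, each seen once, become two features with higher counts, which is the consolidation effect described above.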
 

In [41]:
# 1 Compile regex patterns
url_pattern = re.compile(r'http\S+')
timestamp_pattern = re.compile(
    r'\b\d{1,2}[/:]\d{1,2}(?:/\d{2,4})?\b'     # e.g. 12:30, 12/08/2025, etc.
    r'|\d{1,2}\s*(?:am|pm)'                  # e.g. 12 am, 3pm
    r'|\b\d{1,2}:\d{2}(?:am|pm)?\b',         # e.g. 12:30pm
    re.IGNORECASE
)
smiley_pattern = re.compile(r'[:;=8][\-o]?[\)\]\(\[dDpP]|<3')  # e.g. :) :D etc.

# 2 Keep negation words: remove them from the default stopword list

default_stop = set(stopwords.words('english'))
negation_words = {
    "not", "no", "nor", "don", "don't", "ain", "aren", "aren't",
    "couldn", "couldn't", "didn", "didn't", "doesn", "doesn't", 
    "hadn", "hadn't", "hasn", "hasn't", "haven", "haven't",
    "isn", "isn't", "mightn", "mightn't", "mustn", "mustn't",
    "needn", "needn't", "shan", "shan't", "shouldn", "shouldn't",
    "wasn", "wasn't", "weren", "weren't", "won", "won't",
    "wouldn", "wouldn't"
}
custom_stop = {w for w in default_stop if w not in negation_words}

tokenizer = TweetTokenizer()

stemmer = SnowballStemmer('english')



def preprocess_tweet(text):
    text = text.lower()
    
    # Replace URLs, timestamps, smileys
    text = url_pattern.sub('url', text)
    text = timestamp_pattern.sub('timestamp', text)
    text = smiley_pattern.sub('smiley', text)
    
    # Replace @handles with "@user"
    text = re.sub(r'@\w+', '@user', text)
    
    # Remove commas within numbers (13,000 -> 13000)
    text = re.sub(r'(?<=\d),(?=\d)', '', text)

    # Split hashtags: remove '#' and split compound words using wordninja
    def hashtag_split(match):
        return " " + " ".join(wordninja.split(match.group(1))) + " "
    text = re.sub(r'#(\w+)', hashtag_split, text)
    
    # Remove non-ASCII
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # Remove other non-alphanumeric chars except spaces
    text = re.sub(r'[^a-z0-9\s]', '', text)
    
    # Tokenize
    tokens = tokenizer.tokenize(text)

    # "n't" --> 'not'
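    # NOTE: the filter above already strips apostrophes ("didn't" -> "didnt"),
    # so the "n't" check below cannot match as written; to make it effective,
    # the split would have to run before that filter.
    sep_tokens = []
    for tok in tokens:
        if tok.endswith("n't") and len(tok) > 3:
            main_part = tok[:-3]
            sep_tokens.append(main_part)
            sep_tokens.append("not")  
        else:
            sep_tokens.append(tok)
    tokens = sep_tokens

    # Drop a standalone 'pm' (e.g. left over after "12:30 pm" -> "timestamp pm")
    tokens = [t for t in tokens if t != 'pm']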
    
    # Remove stopwords
    tokens = [t for t in tokens if t not in custom_stop]
    
    # Remove standalone punctuation
    tokens = [t for t in tokens if t not in string.punctuation]

    #  Stemming with Snowballstemmer
    tokens = [stemmer.stem(t) for t in tokens]

    # Rejoin
    return " ".join(tokens)

train_df = pd.read_csv("/kaggle/input/dattaset/train.csv")
test_df = pd.read_csv("/kaggle/input/dattaset/test.csv")

# Include keyword + location in tweet and preprocess tweet for train data
train_df['keyword'] = train_df['keyword'].fillna('')
train_df['location'] = train_df['location'].fillna('')

train_df['combined_text'] = train_df['keyword'] + ' ' + train_df['text'] + ' ' + train_df['location'] + ' ' + train_df['keyword']
train_df['processed_text'] = train_df['combined_text'].apply(preprocess_tweet)
print(train_df[['text','processed_text']].head())

# Include keyword + location in tweet and preprocess tweet for test data
test_df['keyword'] = test_df['keyword'].fillna('')
test_df['location'] = test_df['location'].fillna('')

test_df['combined_text'] = test_df['keyword'] + ' ' + test_df['text'] + ' ' + test_df['location'] + ' '  + test_df['keyword']
test_df['processed_text'] = test_df['combined_text'].apply(preprocess_tweet)


                                                text  \
0  Our Deeds are the Reason of this #earthquake M...   
1             Forest fire near La Ronge Sask. Canada   
2  All residents asked to 'shelter in place' are ...   
3  13,000 people receive #wildfires evacuation or...   
4  Just got sent this photo from Ruby #Alaska as ...   

                                      processed_text  
0          deed reason earthquak may allah forgiv us  
1               forest fire near la rong sask canada  
2  resid ask shelter place notifi offic no evacu ...  
3  13000 peopl receiv wildfir evacu order california  
4  got sent photo rubi alaska smoke wildfir pour ...  


In [42]:
train_df['processed_text'].to_csv('processed_tweets_stem.txt', index=False, header=False)

In [43]:
print(train_df.columns)

Index(['id', 'keyword', 'location', 'text', 'target', 'combined_text',
       'processed_text'],
      dtype='object')


In [53]:
counts = train_df['target'].value_counts()
print(counts)

max_words = train_df['processed_text'].apply(lambda x: len(x.split())).max()
print("Longest entry word count:", max_words)

target
0    4342
1    3271
Name: count, dtype: int64
Longest entry word count: 31


## 1.2. TF-IDF, BoW vectorizers

In [44]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

We tried BoW and TF-IDF with different parameters, such as n-grams up to trigrams and various vocabulary limits, min_df and max_df values; none of them contributed significantly to model performance.
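
As a rough illustration of what `ngram_range` and `min_df` control, here is a pure-Python sketch of n-gram bag-of-words counting. The helper names `ngrams` and `bow_counts` are hypothetical, and the sketch splits on whitespace only, which is simpler than CountVectorizer's own tokenization.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of length n_min..n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def bow_counts(docs, n_min=1, n_max=3, min_df=1):
    """Return (vocabulary, count matrix); terms in fewer than min_df docs are dropped."""
    per_doc = [Counter(ngrams(d.split(), n_min, n_max)) for d in docs]
    doc_freq = Counter(term for counts in per_doc for term in counts)
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_df)
    matrix = [[counts[t] for t in vocab] for counts in per_doc]
    return vocab, matrix

docs = ["forest fire near", "forest fire"]
vocab_all, _ = bow_counts(docs, n_max=2)
vocab_min2, matrix_min2 = bow_counts(docs, n_max=2, min_df=2)
print(vocab_all)
print(vocab_min2, matrix_min2)
```

Adding bigrams grows the vocabulary quickly, while `min_df` prunes the n-grams that occur in only one document; on short tweets most higher-order n-grams are that rare, which may be one reason these settings changed little.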

BoW vectorizer:

In [47]:
# initialize BoW vectorizer
bow_vectorizer = CountVectorizer(encoding='utf8')

bow_vect2 = CountVectorizer(
 encoding='utf8',
 ngram_range=(1,3),
 # min_df=2,
 # max_df=0.95,
 # max_features=10000,
 # binary=False
)

# fit BoW vectorizer on train_df
bow_vectorizer.fit(train_df['processed_text'])
bow_vect2.fit(train_df['processed_text'])

X_train_bow = bow_vectorizer.transform(train_df['processed_text'])
X_test_bow = bow_vectorizer.transform(test_df['processed_text'])

# X_train_bow2 = bow_vect2.transform(train_df['processed_text'])
# X_test_bow2 = bow_vect2.transform(test_df['processed_text'])


print(X_train_bow)

  (0, 1067)	1
  (0, 3516)	1
  (0, 4046)	1
  (0, 4832)	1
  (0, 7614)	1
  (0, 9829)	1
  (0, 12473)	1
  (1, 2424)	1
  (1, 4695)	1
  (1, 4826)	1
  (1, 6893)	1
  (1, 8267)	1
  (1, 10200)	1
  (1, 10413)	1
  (2, 1396)	1
  (2, 4373)	1
  (2, 4436)	1
  (2, 8400)	1
  (2, 8480)	1
  (2, 8620)	1
  (2, 8755)	1
  (2, 9234)	2
  (2, 9998)	1
  (2, 10704)	2
  (3, 119)	1
  :	:
  (7610, 12467)	1
  (7610, 12508)	1
  (7610, 12699)	1
  (7611, 2465)	1
  (7611, 2940)	1
  (7611, 4069)	2
  (7611, 6244)	1
  (7611, 6335)	1
  (7611, 7187)	1
  (7611, 8427)	1
  (7611, 9313)	1
  (7611, 9365)	1
  (7611, 10085)	1
  (7611, 10612)	1
  (7611, 11450)	1
  (7611, 11884)	1
  (7612, 824)	1
  (7612, 2389)	1
  (7612, 5843)	1
  (7612, 6979)	1
  (7612, 8327)	1
  (7612, 8456)	1
  (7612, 9794)	1
  (7612, 12467)	1
  (7612, 13004)	1


TF-IDF vectorizer:

In [48]:
# initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(encoding='utf8')

tfidf_vect = TfidfVectorizer(
 encoding='utf8',
 ngram_range=(1,3),
 # max_df=0.9,
 # min_df=5,
 # sublinear_tf=True
)

tfidf_vectorizer.fit(train_df['processed_text'])
tfidf_vect.fit(train_df['processed_text'])

X_train_tfidf = tfidf_vectorizer.transform(train_df['processed_text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['processed_text'])

X_train_tfidf2 = tfidf_vect.transform(train_df['processed_text'])
X_test_tfidf2 = tfidf_vect.transform(test_df['processed_text'])


print(X_train_tfidf)

  (0, 12473)	0.2565252442235441
  (0, 9829)	0.35066653098778766
  (0, 7614)	0.29726933026488295
  (0, 4832)	0.4655483796694387
  (0, 4046)	0.3190347569931321
  (0, 3516)	0.4812100428181675
  (0, 1067)	0.4156647123744579
  (1, 10413)	0.5032480027565118
  (1, 10200)	0.5263327936570664
  (1, 8267)	0.3135319388658582
  (1, 6893)	0.3453931599445707
  (1, 4826)	0.32473146237564765
  (1, 4695)	0.23343057690600238
  (1, 2424)	0.3036052857345438
  (2, 10704)	0.5875709533528314
  (2, 9998)	0.27192514823013375
  (2, 9234)	0.45255761102091774
  (2, 8755)	0.22536571706589506
  (2, 8620)	0.22032519323177235
  (2, 8480)	0.33340700261820133
  (2, 8400)	0.1588433388437048
  (2, 4436)	0.23018153570499789
  (2, 4373)	0.18258108347852875
  (2, 1396)	0.2323036222883481
  (3, 13004)	0.3288533262988225
  :	:
  (7610, 7396)	0.48925003246654286
  (7610, 5616)	0.4173588017779799
  (7610, 605)	0.44293175501353865
  (7611, 11884)	0.24822566286151615
  (7611, 11450)	0.25059447707181515
  (7611, 10612)	0.2350759349
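
For reference, the weights printed above follow scikit-learn's default TF-IDF scheme: with `smooth_idf=True` and `norm='l2'`, the inverse document frequency is idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document vector is scaled to unit length. The sketch below reimplements that formula in plain Python; `tfidf_rows` is a hypothetical helper, and it splits on whitespace instead of using the vectorizer's token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 normalisation, one dict per document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    rows = []
    for toks in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(["forest fire", "fire evacu order"])
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document "forest" gets a higher weight than "fire" because "fire" appears in both documents, so its idf is lower.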

## 1.3. Classification with SVM model 


* Linear SVMs are a proven strong baseline in text classification benchmarks

* A linear SVM’s margin maximization makes it less sensitive to outliers and noise (which matters for texts with slang, hashtags, abbreviations and misspellings) than models such as Logistic Regression or Naive Bayes

* SVMs are also less prone to overfitting on sparse, high-dimensional data than tree-based models such as Random Forests or Gradient Boosting
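
One way to read the margin argument is through the hinge loss that a soft-margin SVM minimizes: examples classified correctly with a margin of at least 1 contribute exactly zero loss, so they do not pull on the decision boundary. The snippet below is a minimal sketch of that loss for a single example; the weights and inputs are made-up numbers, not values from the trained model.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x + b)) for one example; y must be +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

w, b = [1.0, -1.0], 0.0                         # illustrative weights (assumed)
confident = hinge_loss(w, b, [3.0, 0.0], +1)    # score 3.0, outside the margin
inside = hinge_loss(w, b, [0.5, 0.0], +1)       # score 0.5, correct but inside the margin
wrong = hinge_loss(w, b, [0.0, 2.0], +1)        # score -2.0, misclassified

print(confident, inside, wrong)
```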

In [13]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.model_selection import train_test_split, PredefinedSplit, KFold, GridSearchCV
from sklearn.metrics import f1_score
from sklearn.svm import SVC
import time
import psutil, os

#### Set initial parameters of SVM model which will be further tuned

In [50]:
# SVM parameter grid
svm_param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}
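
A quick sanity check of the grid size: the dictionary above expands to 4 x 2 x 2 = 16 candidates, i.e. 80 fits with 5 folds. Since `gamma` has no effect on the linear kernel, four of those candidates are effectively duplicates. GridSearchCV also accepts a list of dictionaries as `param_grid`, so one possible alternative (shown here as an assumption, not what the notebook ran) is to split the grid per kernel.

```python
from itertools import product

svm_param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}
candidates = list(product(*svm_param_grid.values()))
n_fits = len(candidates) * 5  # 5-fold CV

# Hypothetical alternative grid: gamma is only varied for the RBF kernel
split_grid = [
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10], 'gamma': ['scale', 'auto']},
]
n_split_candidates = sum(len(list(product(*g.values()))) for g in split_grid)

print(len(candidates), n_fits, n_split_candidates)
```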

#### Set functions to run grid search and evaluation
#####  Hyperparameter search is done via 5-fold cross-validation with GridSearchCV
##### We then run a separate 5‑fold CV on the entire training set with the best hyperparameters to get the average F1 macro score
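
Since F1 macro is the metric used throughout, here is a small plain-Python sketch of how it is computed: F1 is calculated separately for each class and the per-class values are averaged with equal weight. The helper `macro_f1` is hypothetical and only meant to make the definition concrete; the notebook itself uses scikit-learn's `f1_score(..., average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because both classes count equally, the metric does not favour the larger non-disaster class (4342 vs 3271 tweets in the training data).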

In [51]:
def get_memory_usage():
    process = psutil.Process(os.getpid())  #function to get current process memory usage in MB
    return process.memory_info().rss / (1024 * 1024) 

# 1 Hyperparameter tuning (5-fold CV grid search)

def tune_svm(X, y, param_grid):
    
    cv = KFold(n_splits=5, shuffle=True, random_state=7)

    grid_search = GridSearchCV(
        estimator=SVC(),
        param_grid=param_grid,
        scoring='f1_macro',
        cv=cv,
        verbose=1,
        n_jobs=-1
    )
    grid_search.fit(X,y)
    
    print("\nBest Score (5-fold CV): {:.3f}".format(grid_search.best_score_))
    print("Best Params:", grid_search.best_params_)
    return grid_search.best_params_



# 2 5-Fold CV with best hyperparameters

def evaluate_svm_5fold(X, y, best_params):
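    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    y = np.asarray(y)  # positional indexing below also works when y is a pandas Series
    cv_f1_scores = []
    fold = 1  # 1-based fold counter (the recorded run started it at 5, hence "fold 5" to "fold 9" in the output)
    for train_index, val_index in kf.split(X):
        print(f"Training fold {fold}")
        X_tr, X_val = X[train_index], X[val_index]
        y_tr, y_val = y[train_index], y[val_index]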
        
        model = SVC(**best_params)
        model.fit(X_tr, y_tr)

        start_inference = time.time()  #start inference timer
        mem_before = get_memory_usage()
        
        preds = model.predict(X_val)

        mem_after = get_memory_usage()    #get memory after inference
        end_inference = time.time()        #end inference timer
        inference_time = end_inference - start_inference 
        print(f"Inference time: {inference_time:.3f} sec, Memory usage diff: {mem_after - mem_before:.3f} MB")

        
        fold_f1 = f1_score(y_val, preds, average='macro')
        print(f"Fold {fold} f1_macro: {fold_f1:.3f}")
        cv_f1_scores.append(fold_f1)
        fold += 1
    
    avg_f1 = np.mean(cv_f1_scores)
    print("Average 5-fold CV f1_macro:", avg_f1)
    return avg_f1


# 3 Training on full data

def train_and_predict_svm_full(X, y, X_test, best_params, out_file):
    final_model = SVC(**best_params)
    final_model.fit(X, y)

    start_inference = time.time()  #start inference timer for final prediction
    mem_before = get_memory_usage() #capture memory before final inference
    
    test_preds = final_model.predict(X_test)

    mem_after = get_memory_usage()    #capture memory after final inference
    end_inference = time.time()      #end inference timer for final prediction
    inference_time = end_inference - start_inference  #final inference time
    print(f"Final model inference time: {inference_time:.3f} sec, Memory usage diff: {mem_after - mem_before:.3f} MB")

    submission_df = pd.DataFrame({'id': test_df['id'], 'target': test_preds})
    submission_df.to_csv(out_file, index=False)
    print(f"\nSaved {out_file} using the best SVM model")
    return final_model

#### Tune SVM hyperparameters on both BoW and TF-IDF train data:

In [52]:
y_train = train_df['target']

# Start total timer 
start_total = time.time()

#  SVM on BoW

print("=== SVM on BoW ===")
bow_best_params = tune_svm(X_train_bow, y_train, svm_param_grid)
print("\nRunning 5-fold CV (BoW) with best hyperparameters:")
evaluate_svm_5fold(X_train_bow, y_train, bow_best_params)


#  SVM on TF-IDF

print("\n=== SVM on TF-IDF ===")
tfidf_best_params = tune_svm(X_train_tfidf, y_train, svm_param_grid)
print("\nRunning 5-fold CV (TF-IDF) with best hyperparameters:")
evaluate_svm_5fold(X_train_tfidf, y_train, tfidf_best_params)
# train_and_predict_svm_full(X_train_tfidf, y_train, X_test_tfidf, tfidf_best_params, "SVM_TFIDF_submission.csv")


end_total = time.time()
print("Total pipeline time (sec):", end_total - start_total)

=== SVM on BoW ===
Fitting 5 folds for each of 16 candidates, totalling 80 fits

Best Score (5-fold CV): 0.786
Best Params: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}

Running 5-fold CV (BoW) with best hyperparameters:
Training fold 5
Inference time: 0.603 sec, Memory usage diff: 0.000 MB
Fold 5 f1_macro: 0.796
Training fold 6
Inference time: 0.607 sec, Memory usage diff: 0.000 MB
Fold 6 f1_macro: 0.785
Training fold 7
Inference time: 0.611 sec, Memory usage diff: 0.000 MB
Fold 7 f1_macro: 0.808
Training fold 8
Inference time: 0.606 sec, Memory usage diff: -70.250 MB
Fold 8 f1_macro: 0.778
Training fold 9
Inference time: 0.588 sec, Memory usage diff: 0.000 MB
Fold 9 f1_macro: 0.781
Average 5-fold CV f1_macro: 0.7896273632776335

=== SVM on TF-IDF ===
Fitting 5 folds for each of 16 candidates, totalling 80 fits

Best Score (5-fold CV): 0.780
Best Params: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}

Running 5-fold CV (TF-IDF) with best hyperparameters:
Training fold 5
Inference 

### Quality, measured as the average F1 macro across folds, shows that
#### * Both the BoW and TF-IDF pipelines achieve practically the same average F1 macro score of ~0.79,
#### * with SVM on BoW scoring slightly higher (~0.790)

### Inference Time:
#### For BoW, each fold’s inference takes roughly 0.45–0.47 seconds.
#### For TF-IDF, the inference time is slightly higher (~0.5 seconds per fold)

#### The overall pipeline runtime equals 271 seconds (it includes hyperparameter tuning, cross-validation, and final model training)
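
Single wall-clock measurements like these are noisy, so a slightly more robust option is to repeat the call and report the median. The helper below is a minimal sketch using only the standard library; `time_call` is a hypothetical name, and the `sum` call merely stands in for `model.predict`.

```python
import statistics
import time

def time_call(fn, *args, repeats=5):
    """Run fn(*args) several times; return its result and the median duration in seconds."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)

# Stand-in workload (an assumption); in the notebook this would be model.predict(X_val)
result, median_seconds = time_call(sum, range(100_000))
print(f"result={result}, median time={median_seconds:.6f} sec")
```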

## Best model -- SVM on BoW
#### Make predictions for test data using SVM on BoW:

In [60]:
train_and_predict_svm_full(X_train_bow, y_train, X_test_bow, bow_best_params, "1_SVM_BoW_submission.csv")

Final model inference time: 1.355 sec, Memory usage diff: 0.000 MB

Saved 1_SVM_BoW_submission.csv using the best SVM model


# 2. NN based model -- 1D CNN

### For the classification we choose a simple 1D CNN
Tweets are short overall, and their n-grams or key phrases carry strong signals. CNNs are good at identifying *local n-gram features for short-text classification*
#### Convolutions are good at detecting patterns over short contexts, which is exactly what we have in tweets
#### They also have fewer parameters and train faster than sequential models like LSTMs or RNNs


#### The architecture: Embedding layer, Conv1D + GlobalMaxPooling1D + fully connected dense output
* The architecture starts with an embedding layer to convert words into dense vectors, 
* followed by a convolutional layer that extracts meaningful features (with tunable filter sizes and kernel sizes to experiment with different local contexts),
* then a global max pooling layer that condenses these features, 
* and finally a dense layer with dropout for regularization before the binary output.
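
To make the Conv1D + global max pooling step concrete, the following plain-Python sketch applies a single convolution filter to a short sequence of embedding vectors and then keeps only the strongest response. All numbers are made up for illustration, and the helper names are hypothetical; the real model uses Keras layers with many filters.

```python
def conv1d_single_filter(seq, kernel, bias=0.0):
    """Slide one filter over the sequence ('valid' padding) and apply ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        s = bias
        for j in range(k):
            # dot product between the embedding at position i+j and kernel row j
            s += sum(x * w for x, w in zip(seq[i + j], kernel[j]))
        out.append(max(0.0, s))
    return out

def global_max_pool(features):
    """Keep the single strongest filter response over all positions."""
    return max(features)

# 4 tokens with 2-dimensional embeddings, one filter spanning 2 tokens (a bigram detector)
seq = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1]]
features = conv1d_single_filter(seq, kernel)
pooled = global_max_pool(features)
print(features, pooled)
```

The filter fires most strongly on the first window, and max pooling reports that peak regardless of where in the tweet it occurred.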




#### We set up parameters for the CNN


In [55]:
import time
import numpy as np
import tensorflow as tf
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import keras_tuner as kt


def get_memory_usage():
    process = psutil.Process(os.getpid()) 
    return process.memory_info().rss / (1024 * 1024) 


# --- PARAMS ---
max_words = 10000  # we'll use top 10k words in vocab
maxlen = 100       # max tweet length in words


# Tokenize and pad training texts
# Initialize the Keras tokenizer and fit it on the processed tweet texts to learn the vocabulary
# Each train_df tweet is converted to a sequence of integer word indices, and the sequences are padded to maxlen
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_df['processed_text'])
X_train_seq = tokenizer.texts_to_sequences(train_df['processed_text'])
X_train = pad_sequences(X_train_seq, maxlen=maxlen)
y_train = train_df['target'].values
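
What the tokenizer and padding steps produce can be sketched in plain Python. This is a simplified approximation under stated assumptions: words are indexed from 1 by descending frequency (ties are broken alphabetically here, which is a simplification), only indices below `num_words` are kept, unknown words are dropped, and padding and truncation both happen at the front, matching the `pad_sequences` defaults. The helper names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map words to 1-based indices, most frequent first (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in t.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def to_sequence(text, word_index, num_words):
    """Convert text to indices, dropping unknown words and indices >= num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

word_index = fit_word_index(["fire fire forest", "fire flood"])
seq = to_sequence("forest fire unknown", word_index, num_words=10)
print(word_index, seq, pad_pre(seq, 4))
```

With the longest processed tweet at 31 words, `maxlen = 100` means every sequence is mostly leading zeros; a smaller value would be a reasonable thing to try.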

#### 2.1 Training the model
The build_model function defines a CNN with 
* **Embedding layer,**           ---> maps each token to a dense vector; the embeddings are learned from scratch during training (no pretrained vectors are loaded)
* **one Conv1D layer** (with tunable filters and kernel size),  ----> CNN setups for text classification often use parallel convolutional layers with different filter widths followed by global pooling; here a single layer is used and its kernel size is tuned to capture n-grams of different lengths
* **GlobalMaxPooling1D layer,** ---> helps capture the most salient features across the entire sentence
* **Dense layer, dropout,**
* **Sigmoid output**


**The hyperparameter ranges are chosen as follows** (standard values and ranges for text classification):
* **embedding dimensions of 50, 100, 200**
* **numbers of filters: 64, 128, 256**
* **kernel sizes 3, 5, 7**
* **dense units: 32, 64, 128**
* **dropout rates ranging from 0.2 to 0.5 in steps of 0.1**
#### These values balance model capacity against overfitting: the embedding dimension is high enough to capture semantic nuances,
#### while the filter counts and kernel sizes let us experiment with various n-gram sizes without overly complicating the network
#### Dropout regularizes the dense layer against overfitting
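
To get a feel for the size of this search space, the ranges can be enumerated directly. The snippet assumes the 0.1 dropout step grid includes both endpoints (0.2, 0.3, 0.4, 0.5); the `search_space` dictionary is only an illustration and is not passed to the tuner.

```python
from itertools import product

search_space = {
    'embedding_dim': [50, 100, 200],
    'filters': [64, 128, 256],
    'kernel_size': [3, 5, 7],
    'dense_units': [32, 64, 128],
    'dropout_rate': [0.2, 0.3, 0.4, 0.5],  # assumed: step grid includes both endpoints
}
n_configs = len(list(product(*search_space.values())))
print(n_configs)
```

The recorded tuner log ends at trial 30, so the search samples only a fraction of these combinations instead of exhausting the grid.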

In [56]:
# Model building for hyperparameter tuning


def build_model(hp):
    """
    function to build 1D CNN model:
    
    The model architecture:
      - Embedding layer with tunable embedding dimension
      - Conv1D layer with tunable number of filters and kernel size
      - GlobalMaxPooling1D layer to reduce sequence dimensions
      - Dense layer with tunable units.
      - Dropout layer with tunable dropout rate to reduce overfitting
      - Final Dense output layer with sigmoid activation for binary classification
      
    Hyperparameters tuned via Keras Tuner include:
      - 'embedding_dim': dimension of the word embeddings
      - 'filters': number of  convolution filters (ie output channels) in the Conv1D layer
      - 'kernel_size': width of the convolutional kernel
      - 'dense_units': number of units in the Dense layer
      - 'dropout_rate': dropout rate after the Dense layer/ proportion of neuron outputs dropped to reduce overfitting
    
    Args:
        hp: HyperParameters object for tuning
        
    Returns:
        A compiled Keras model
    """
    model = tf.keras.Sequential()
    
    # Tunable embedding dimension: we try 50, 100, 200
    # The Embedding layer is a lookup table that maps each of the top max_words tokens (from the Tokenizer) to a vector of size embedding_dim
    embedding_dim = hp.Choice('embedding_dim', values=[50, 100, 200])
    model.add(tf.keras.layers.Embedding(input_dim=max_words,
                                        output_dim=embedding_dim,
                                        input_length=maxlen))
    
    # Tunable Conv1D layer with tunable filters and kernel size
    filters = hp.Choice('filters', values=[64, 128, 256])
    kernel_size = hp.Choice('kernel_size', values=[3, 5, 7])
    model.add(tf.keras.layers.Conv1D(filters=filters,
                                     kernel_size=kernel_size,
                                     activation='relu'))
    
    # Global max pooling layer to capture the most salient features
    model.add(tf.keras.layers.GlobalMaxPooling1D())
    
    # dense layer with tunable number of units
    dense_units = hp.Choice('dense_units', values=[32, 64, 128])
    model.add(tf.keras.layers.Dense(dense_units, activation='relu'))
    
    # Dropout layer with tunable dropout rate
    dropout_rate = hp.Float('dropout_rate', min_value=0.2, max_value=0.5, step=0.1)
    model.add(tf.keras.layers.Dropout(dropout_rate))
    
    # output layer for binary classification
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

## Hyperparameter tuning with Keras Tuner
We use Keras Tuner’s Hyperband to search over the specified choices for embedding dimension, number of filters, kernel size, dense units and dropout rate.

The goal is to maximize validation accuracy on an 80/20 train/validation split, with early stopping to avoid overfitting. The best configuration is then evaluated with 5-fold cross-validation.
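
The core idea behind Hyperband is successive halving: train many configurations briefly, keep the best fraction, and give the survivors a larger budget. The sketch below shows that idea in plain Python with a toy scoring function. It is a simplified illustration only and not Keras Tuner's exact bracket schedule; `successive_halving` and `toy_score` are hypothetical names.

```python
def successive_halving(configs, score_fn, min_epochs=1, max_epochs=10, factor=3):
    """Repeatedly keep the top 1/factor of configs while multiplying the epoch budget."""
    survivors = list(configs)
    epochs = min_epochs
    while True:
        ranked = sorted(survivors, key=lambda c: score_fn(c, epochs), reverse=True)
        if len(ranked) == 1 or epochs >= max_epochs:
            return ranked[0]
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs = min(max_epochs, epochs * factor)

def toy_score(config, epochs):
    """Stand-in for validation accuracy: best at config == 5 (epochs ignored here)."""
    return -abs(config - 5)

best = successive_halving(range(9), toy_score)
print(best)
```

Most of the budget is spent on the few configurations that looked promising early, which is why the whole search fits into a few minutes.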

In [57]:
# Start total timer 
start_total = time.time()


# Hyperparameter tuning with Keras Tuner (80-20 split)
start_tuning = time.time()
tuner = kt.Hyperband(
    build_model,
    objective='val_accuracy', 
    max_epochs=10,
    factor=3,
    directory='hyper_tuning',
    project_name='cnn_disaster'
)

# early stopping callback to prevent overfitting
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# 80-20 train/validation split for hyperparams search
tuner.search(X_train, y_train, epochs=10, validation_split=0.2, callbacks=[stop_early])

# best hyperparameters and best model
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Optimal embedding_dim:", best_hps.get('embedding_dim'))
print("Optimal filters:", best_hps.get('filters'))
print("Optimal kernel_size:", best_hps.get('kernel_size'))
print("Optimal dense_units:", best_hps.get('dense_units'))
print("Optimal dropout_rate:", best_hps.get('dropout_rate'))
end_tuning = time.time()
print("Hyperparameter tuning time (sec):", end_tuning - start_tuning)


# 5-Fold Cross-Validation using the best hyperparameters
start_cv = time.time()
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
cv_times = []
fold = 1

for train_index, val_index in kf.split(X_train):
    print(f"Training fold {fold}")
    X_tr, X_val = X_train[train_index], X_train[val_index]
    y_tr, y_val = y_train[train_index], y_train[val_index]
    model = build_model(best_hps)
    start_time = time.time()
    history = model.fit(X_tr, y_tr, epochs=10, validation_data=(X_val, y_val), callbacks=[stop_early], verbose=1)
    fold_time = time.time() - start_time
    cv_times.append(fold_time)
    print(f"Fold {fold} training time: {fold_time:.2f} seconds")

    start_inference = time.time()  #start inference timer for CV inference
    mem_before = get_memory_usage()  #capture memory before inference
    
    predictions = model.predict(X_val)  
    
    mem_after = get_memory_usage()  #capture memory after inference
    end_inference = time.time()  #end inference timer for CV inference
    inference_time = end_inference - start_inference  
    print(f"Inference time: {inference_time:.3f} sec, Memory usage diff: {mem_after - mem_before:.3f} MB")

    
    preds = (predictions > 0.5).astype(int).reshape(-1)
    fold_f1 = f1_score(y_val, preds, average='macro')
    # score = model.evaluate(X_val, y_val, verbose=0)
    print(f"Fold {fold} f1_macro: {fold_f1}")
    cv_scores.append(fold_f1)
    fold += 1
end_cv = time.time()
print("Average CV f1_macro:", np.mean(cv_scores))
print("Average CV training time (sec):", np.mean(cv_times))
print("Total CV time (sec):", end_cv - start_cv)

end_total = time.time()
print("Total pipeline time (sec):", end_total - start_total)

# Retrain the best configuration on the training set (20% is held out as validation data for early stopping)
start_final = time.time()
best_model = build_model(best_hps)
history = best_model.fit(X_train, y_train, epochs=10, validation_split=0.2, callbacks=[stop_early], verbose=1)
end_final = time.time()


Trial 30 Complete [00h 00m 07s]
val_accuracy: 0.7583716511726379

Best val_accuracy So Far: 0.7813525795936584
Total elapsed time: 00h 03m 30s
Optimal embedding_dim: 100
Optimal filters: 256
Optimal kernel_size: 7
Optimal dense_units: 128
Optimal dropout_rate: 0.4
Hyperparameter tuning time (sec): 209.85265111923218
Training fold 1
Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 4s 11ms/step - accuracy: 0.6447 - loss: 0.6177 - val_accuracy: 0.7984 - val_loss: 0.4450
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8508 - loss: 0.3573 - val_accuracy: 0.7833 - val_loss: 0.5133
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9334 - loss: 0.1913 - val_accuracy: 0.7708 - val_loss: 0.6629
Epoch 4/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9640 - loss: 0.1028 - val_accuracy: 0.7525 - val_loss: 0.8164
Fold 1 training t

191/191 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.6605 - loss: 0.6088 - val_accuracy: 0.7905 - val_loss: 0.4512
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8572 - loss: 0.3477 - val_accuracy: 0.7827 - val_loss: 0.5225
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9290 - loss: 0.1947 - val_accuracy: 0.7577 - val_loss: 0.6796
Fold 2 training time: 5.55 seconds
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Inference time: 0.438 sec, Memory usage diff: 0.250 MB
Fold 2 f1_macro: 0.7545849354460479
Training fold 3
Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.6482 - loss: 0.6148 - val_accuracy: 0.8181 - val_loss: 0.4380
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8420 - loss: 0.3676 - val_accuracy: 0.8050 - val_loss: 0.4731
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9231 - loss: 0.2082 - val_accuracy: 0.7892 - val_loss: 0.6101
Epoch 4/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9620 - loss: 0.1079 - val_accuracy: 0.7840 - val_loss: 0.7412
Fold 3 training time: 6.15 seconds
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Inference time: 0.432 sec, Memory usage diff: 2.375 MB
Fold 3 f1_macro: 0.7792894599481555
Training fold 4
Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 5s 13ms/step - accuracy: 0.6492 - loss: 0.6131 - val_accuracy: 0.8016 - val_loss: 0.4548
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8469 - loss: 0.3642 - val_accuracy: 0.7753 - val_loss: 0.5430
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9287 - loss: 0.2036 - val_accuracy: 0.7714 - val_loss: 0.7070
Fold 4 training time: 6.07 seconds
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Inference time: 0.501 sec, Memory usage diff: 0.750 MB
Fold 4 f1_macro: 0.7576433121019108
Training fold 5
Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.6612 - loss: 0.5973 - val_accuracy: 0.7891 - val_loss: 0.4539
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8615 - loss: 0.3363 - val_accuracy: 0.7852 - val_loss: 0.5343
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9372 - loss: 0.1840 - val_accuracy: 0.7668 - val_loss: 0.6514
Fold 5 training time: 5.58 seconds
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Inference time: 0.429 sec, Memory usage diff: 1.875 MB
Fold 5 f1_macro: 0.761730572228521
Average CV f1_macro: 0.7607441569798971
Average CV training time (sec): 5.923155164718628
Total CV time (sec): 32.09015941619873
Total pipeline time (sec): 241.94412469863892
Epoch 1/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.6464 - loss: 0.6121 - val_accuracy: 0.7676 - val_loss: 0.4707
Epoch 2/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8380 - loss: 0.3682 - val_accuracy: 0.7695 - val_loss: 0.5653
Epoch 3/10
191/191 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9272 - loss: 0.2112 - val_accuracy: 0.7091 - val_loss: 0.8935


##### Hyperparameter tuning time: ~210 seconds
##### Average CV F1 macro (5 folds): ~0.761
##### Training and inference time:
##### Training took ~5.92 seconds per fold on average, for a total CV time of ~32.09 seconds
##### Inference after each fold's training took ~0.48 seconds on average
##### The memory usage difference during inference ranged from 0.25 to 2.38 MB per fold, i.e. the additional process memory taken up while predicting


#### The entire pipeline took ~242 seconds
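
The timing and memory figures above come from wrapping each call in `time.time()` and an RSS reading. The same bookkeeping can be sketched as a reusable context manager. This is only an illustration under stated assumptions: the helper name `measure` is hypothetical, and it swaps the notebook's psutil RSS reading for the standard-library `tracemalloc`, which tracks Python-level allocations only. Neither approach sees GPU memory, so both numbers should be read as rough indicators.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall-clock time and peak traced memory of the wrapped block.

    `measure` is a hypothetical helper, not part of the notebook.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        results[label] = {
            "seconds": elapsed,
            "peak_mb": peak / (1024 * 1024),
        }


results = {}
with measure("toy_inference", results):
    # Stand-in for a call such as model.predict(X_val)
    scores = [x * 0.5 for x in range(200_000)]

print(results)
```

Collecting the measurements in one dictionary makes it easier to average them over folds afterwards instead of reading them off the printed log.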

## 2.3 Predictions on test data

In [61]:
# Preprocess test data, make predictions
X_test_seq = tokenizer.texts_to_sequences(test_df['processed_text'])
X_test = pad_sequences(X_test_seq, maxlen=maxlen)

start_inference = time.time()    # start inference timer for test predictions
mem_before = get_memory_usage()  # capture memory before test inference

test_preds = best_model.predict(X_test)

mem_after = get_memory_usage()   # capture memory after test inference
end_inference = time.time()      # end inference timer for test predictions
inference_time = end_inference - start_inference
print(f"Test inference time: {inference_time:.3f} sec, Memory usage diff: {mem_after - mem_before:.3f} MB")

test_preds = (test_preds > 0.5).astype(int).reshape(-1)

submission = pd.DataFrame({'id': test_df['id'], 'target': test_preds})
submission.to_csv("2_submission_cnn_tuned.csv", index=False)
print("\nSaved submission.csv")

[1m102/102[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
Test inference time: 0.216 sec, Memory usage diff: 1.250 MB

Saved submission.csv


# 3. Fine-tuning a pre-trained model: RoBERTa-base

For fine-tuning we chose the RoBERTa-base model.

We are going to:
* tokenize using RobertaTokenizerFast
* use RobertaForSequenceClassification (pretrained roberta-base)
* fine-tune using Trainer with the chosen hyperparameters
* predict on the test data

In [64]:
import numpy as np
import pandas as pd
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification
from transformers import Trainer, TrainingArguments, TrainerCallback
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
import time
import psutil
import os

In [65]:
def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024)


train_texts = train_df['processed_text'].tolist()
train_labels = train_df['target'].values
test_texts = test_df['processed_text'].tolist()
test_ids = test_df['id'].tolist()

### Train/validation split:
#### We use an 80/20 split, which leaves a small validation set

In [66]:
# Split into train/val
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, random_state=42
)
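
To make the effect of the 80/20 split concrete, the sketch below works out the split sizes and the resulting number of optimizer steps. The row count of 7613 is an assumption (it is not stated in this section); it was chosen because it reproduces the 191 steps per epoch and 573 total steps that appear in the training log further down. The rounding rule (validation size rounded up) is how we understand `train_test_split` to treat a fractional `test_size`.

```python
import math

n_rows = 7613          # assumed size of train_df, not taken from the notebook
test_size = 0.2
batch_size = 32
epochs = 3

n_val = math.ceil(test_size * n_rows)   # validation rows (rounded up)
n_train = n_rows - n_val                # remaining rows go to training

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(f"train rows: {n_train}, validation rows: {n_val}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

Because the split is not stratified, the class balance of the validation set can differ slightly from that of the training set; passing `stratify=train_labels` would be one way to control this, at the cost of changing the reported numbers.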

#### We use RobertaTokenizerFast as the tokenizer
to split the text into subword tokens consistent with the roberta-base vocabulary

In [67]:
# Tokenize (max_length=128)
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings   = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
test_encodings  = tokenizer(test_texts, truncation=True, padding=True, max_length=128)

#### We create dataset objects
to wrap the tokenized inputs so that they can be passed to the Trainer

In [68]:
# Create Dataset class
class TweetDataset(torch.utils.data.Dataset):
    """
    PyTorch dataset that holds the tokenized texts and, optionally, their labels
    """
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
    def __len__(self):
        return len(self.encodings['input_ids'])

train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset   = TweetDataset(val_encodings,   val_labels)
test_dataset  = TweetDataset(test_encodings)

#### We use RobertaForSequenceClassification
##### with num_labels=2 for binary classification

In [69]:
# Callback for logging
class LoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            print(f"Step {state.global_step}: {logs}")

# Compute metrics (f1_score and accuracy_score are already imported above)
def compute_metrics(eval_pred):
    """Convert the Trainer's (logits, labels) pair into accuracy and macro F1."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per sample
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1_macro': f1_score(labels, predictions, average='macro')
    }

# Initialize model
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Training config
batch_size = 32
epochs = 3
learning_rate = 2e-5

# Training configuration with Hugging Face Trainer 
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    learning_rate=learning_rate,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=50,
    save_total_limit=2,
    report_to=["none"]
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[LoggingCallback()]
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
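
To make clear what `compute_metrics` reports, here is a minimal pure-Python sketch of the two steps it performs: taking the argmax over the logits and averaging the per-class F1 scores with equal weight. The function `macro_f1` and the toy logits are hypothetical and only illustrate the definition; the notebook itself relies on scikit-learn's `f1_score(..., average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (illustrative sketch)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy logits for six tweets: column 0 = "no disaster", column 1 = "disaster"
logits = [[2.0, -1.0], [1.5, 0.2], [0.9, 0.1], [-0.3, 0.4], [-1.0, 2.0], [0.7, 0.6]]
y_true = [0, 0, 0, 0, 1, 1]

# argmax over each row, as np.argmax(logits, axis=-1) does in the notebook
y_pred = [max(range(len(row)), key=row.__getitem__) for row in logits]

print(y_pred)                    # [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred))  # 0.625
```

Here class 0 has F1 = 0.75 and class 1 has F1 = 0.5, so the macro average is 0.625 even though 4 of 6 predictions are correct. Macro F1 weights both classes equally, which is why it is a stricter measure than accuracy when the classes are imbalanced.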


#### The model is fine-tuned and evaluated on the validation set:

In [71]:
# Train
start_total = time.time()
# fine-tune the model
trainer.train()


# Evaluate
start_inference = time.time()
mem_before = get_memory_usage()
val_metrics = trainer.evaluate()
mem_after = get_memory_usage()
end_inference = time.time()
print("Evaluation Inference time: {:.3f}s, Memory usage diff: {:.3f} MB"
      .format(end_inference - start_inference, mem_after - mem_before))
print("Validation metrics:", val_metrics)

end_total = time.time()
print("Total time (sec):", end_total - start_total)

| Epoch | Training Loss | Validation Loss | Model Preparation Time | Accuracy | F1 Macro |
|---|---|---|---|---|---|
| 1 | 0.4924 | 0.435452 | 0.0032 | 0.812869 | 0.808881 |
| 2 | 0.42 | 0.420721 | 0.0032 | 0.825345 | 0.816686 |
| 3 | 0.3647 | 0.440592 | 0.0032 | 0.820092 | 0.814986 |


Step 50: {'loss': 0.6239, 'grad_norm': 7.579275608062744, 'learning_rate': 1.825479930191972e-05, 'epoch': 0.2617801047120419}
Step 100: {'loss': 0.5207, 'grad_norm': 5.010350704193115, 'learning_rate': 1.6509598603839444e-05, 'epoch': 0.5235602094240838}
Step 150: {'loss': 0.4924, 'grad_norm': 6.428791046142578, 'learning_rate': 1.4764397905759162e-05, 'epoch': 0.7853403141361257}
Step 191: {'eval_loss': 0.43545156717300415, 'eval_model_preparation_time': 0.0032, 'eval_accuracy': 0.8128693368351937, 'eval_f1_macro': 0.8088813977541567, 'eval_runtime': 2.3577, 'eval_samples_per_second': 645.961, 'eval_steps_per_second': 20.359, 'epoch': 1.0}
Step 200: {'loss': 0.4464, 'grad_norm': 7.938277244567871, 'learning_rate': 1.3019197207678885e-05, 'epoch': 1.0471204188481675}
Step 250: {'loss': 0.4091, 'grad_norm': 7.515224456787109, 'learning_rate': 1.1273996509598604e-05, 'epoch': 1.3089005235602094}
Step 300: {'loss': 0.4095, 'grad_norm': 6.289299488067627, 'learning_rate': 9.52879581151832

Step 573: {'eval_loss': 0.44059208035469055, 'eval_model_preparation_time': 0.0032, 'eval_accuracy': 0.8200919238345371, 'eval_f1_macro': 0.8149863446123289, 'eval_runtime': 2.2994, 'eval_samples_per_second': 662.333, 'eval_steps_per_second': 20.875, 'epoch': 3.0}
Evaluation Inference time: 2.305s, Memory usage diff: 0.000 MB
Validation metrics: {'eval_loss': 0.44059208035469055, 'eval_model_preparation_time': 0.0032, 'eval_accuracy': 0.8200919238345371, 'eval_f1_macro': 0.8149863446123289, 'eval_runtime': 2.2994, 'eval_samples_per_second': 662.333, 'eval_steps_per_second': 20.875, 'epoch': 3.0}
Total time (sec): 111.2301332950592


* Training (all 3 epochs) took ~108.15 secs
* Inference time (final evaluation) was ~2.30 secs
* The overall pipeline took ~111.23 secs

* Validation accuracy ≈ 0.82 and macro F1 ≈ 0.81 (the final-epoch values and the averages over the 3 epochs both round to these figures)
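
The learning rates printed by the logging callback are consistent with a linear decay from 2e-5 to zero over the 573 optimizer steps with no warmup, which we believe is the Trainer's default schedule. A minimal sketch that reproduces the logged values follows; the function name `linear_decay_lr` is hypothetical and is not a transformers API.

```python
def linear_decay_lr(step, total_steps=573, base_lr=2e-5):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Values logged in the run above, keyed by global step
logged = {
    50: 1.825479930191972e-05,
    100: 1.6509598603839444e-05,
    150: 1.4764397905759162e-05,
}

for step, logged_lr in logged.items():
    print(step, linear_decay_lr(step), logged_lr)
```

So the RoBERTa run already uses a learning rate schedule; any further gain from scheduling would have to come from a different shape (for example adding warmup), not from introducing a schedule in the first place.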

#### Make predictions on test data:

In [73]:
# inference time and memory usage for test prediction
start_inference = time.time()
mem_before = get_memory_usage()

preds_output = trainer.predict(test_dataset)

mem_after = get_memory_usage()
end_inference = time.time()
print("Prediction Inference time: {:.3f}s, Memory usage diff: {:.3f} MB"
      .format(end_inference - start_inference, mem_after - mem_before))

logits = preds_output.predictions
preds = np.argmax(logits, axis=1)
submission_df = pd.DataFrame({"id": test_ids, "target": preds})
submission_df.to_csv("3_roberta_submission.csv", index=False)
print("Submission saved")

Prediction Inference time: 5.054s, Memory usage diff: 0.000 MB
Submission saved


Three models were assessed:
* SVM on BoW
* 1D CNN model
* pre-trained RoBERTa-base transformer model

All three models were compared on their average macro F1 score (averaged over the validation folds, or measured on the held-out validation set) and on their estimated resource usage (training/fine-tuning time plus inference time, and memory consumption during training and inference).

Average macro F1 / total pipeline time in seconds:
* SVM on BoW = 0.78 / 312
* 1D CNN model = 0.76 / 242
* pre-trained RoBERTa-base transformer model = 0.81 / 111

Based on this quality-to-resources ratio, the pre-trained RoBERTa-base model fine-tuned on the tweet dataset scores highest: it has both the best macro F1 and the shortest total pipeline time of the three.

For all three models the average macro F1 lies close to 0.80 (between 0.76 and 0.81).
The total pipeline times differ more, but not greatly either (all around 5 minutes or less).
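
The comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the rounded figures quoted above and defines the ratio as macro F1 divided by total pipeline seconds; memory usage is left out because no per-model total is listed here, so this is a simplification of the quality-to-resources idea, not a complete measure.

```python
# Rounded figures from the comparison above
results = {
    "SVM on BoW": {"f1_macro": 0.78, "total_seconds": 312},
    "1D CNN": {"f1_macro": 0.76, "total_seconds": 242},
    "RoBERTa-base": {"f1_macro": 0.81, "total_seconds": 111},
}


def quality_per_second(entry):
    """Macro F1 obtained per second of total pipeline time."""
    return entry["f1_macro"] / entry["total_seconds"]


ranking = sorted(
    results,
    key=lambda name: quality_per_second(results[name]),
    reverse=True,
)

for name in ranking:
    print(f"{name}: {quality_per_second(results[name]):.4f} F1 per second")
```

The ratio is dominated by the time differences, since the F1 scores are close to each other; it also depends on the hardware used (the transformer benefits most from a GPU), so the ranking may not carry over to a CPU-only setup.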

Difficulties encountered:
* 1) The main challenge is proper data preprocessing, i.e. deciding how to handle (remove or change) elements that are typical of tweets:
  * hashtagged compounds such as #loveyouall, which were split with **wordninja**, although this still does not guarantee that every split yields the correct tokens ("love you all" or "love youal"?). Hashtags serve as keywords in tweets, but dividing them into the correct tokens is difficult.
  * informal abbreviations and slang, which SnowballStemmer may not stem correctly, potentially losing critical sentiment or context.
  * finding the right words or n-grams to exempt from removal (they should be kept because they can change the meaning of a tweet from literal to figurative).
  * another open question is how to combine the other two columns of train_df (location and keyword) with the tweet text, and whether they carry any useful signal.
* 2) Hyperparameter search for the CNN: a CNN with too many filters or large kernel sizes may overfit to specific patterns, so balancing the capture of local n-gram features against generalization across the dataset is difficult.
* 3) Sensitivity of RoBERTa fine-tuning to hyperparameters: we fine-tune the pre-trained transformer with a small learning rate (2e-5) for 3 epochs, but its performance changes strongly with even small changes to these hyperparameters; for example, increasing the number of epochs above 5 leads to very strong overfitting and significantly reduces the performance score.
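
The hashtag-splitting difficulty in point 1 can be illustrated with a small dictionary-based segmenter. This is a hedged sketch and not wordninja's algorithm: the function `segment` and the tiny vocabularies are hypothetical, and the rule used here (prefer the split with the fewest words) is a simplification chosen for clarity.

```python
def segment(text, vocab):
    """Split `text` into vocabulary words, preferring the fewest words.

    Returns None when no full segmentation exists (illustrative sketch).
    """
    # best[i] holds the shortest word list that covers text[:i], or None
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


hashtag = "loveyouall"

# With a suitable vocabulary the intended split is recovered
print(segment(hashtag, {"love", "you", "all"}))

# With an incomplete vocabulary there is no valid split at all
print(segment(hashtag, {"love", "you"}))

# Extra entries create competing splits of equal length
print(segment(hashtag, {"love", "you", "all", "youal", "l"}))
```

In the last call "love you all" and "love youal l" both use three words, so word count alone cannot separate them. This is why practical splitters need additional evidence such as word frequencies, and why some hashtags may still be split incorrectly.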
  

How to potentially improve the results:
1) Better preprocessing: we can replace stemming with contextual embeddings (e.g. BERT, RoBERTa) and subword tokenization (e.g. BPE or WordPiece) to handle out-of-vocabulary abbreviations.
2) For the SVM: we could perform dimensionality reduction (e.g. using SVD or LDA) to reduce the complexity of the feature space.
3) For the CNN we can try a) adding convolutional layers and varying the kernel sizes to capture different n-gram patterns, b) integrating pretrained embeddings such as GloVe or FastText to improve the feature representation and reduce training time.
4) For both the CNN and RoBERTa we could apply learning rate scheduling methods to adjust the learning rate dynamically during training.

Source of code: code from lectures (for preprocessing + BoW, TF-IDF vectorization);
                o1 model for the roberta-base implementation