# Machine Learning Approaches for Medical Transcript Classification: A Comparative Study of Binary and Frequency Bag-of-Words

**• ML Classification Setup: Preprocessing, Feature Extraction, and Ensemble Models with F1 Evaluation :**

This import block sets up the environment for a **text classification project with multiple ML models.** Here is what each part does, in order:

It brings in re for regular expressions and string for working with punctuation, both often used in text preprocessing. From collections, it imports Counter to count word or character frequencies. numpy and pandas are imported for numerical operations and data handling, with pandas especially useful for working with structured datasets. The tqdm library provides progress bars for loops, which is helpful for monitoring preprocessing or training steps.

For modeling, it imports several classifiers: LogisticRegression from scikit-learn for a linear baseline, DecisionTreeClassifier for interpretable non-linear models, and RandomForestClassifier for an ensemble of decision trees. It also includes XGBClassifier from XGBoost, a powerful gradient boosting algorithm widely used in competitions and real-world tasks. Finally, f1_score from sklearn.metrics is included as the evaluation metric, useful for imbalanced classification tasks since it balances precision and recall.

In [1]:
import re
import string
from collections import Counter
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

**• Google Colab File Upload to Runtime  :**

files.upload() opens a dialog to upload files from your local system. Once uploaded, the files are stored in the Colab runtime's working directory, so later cells can read them by filename.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving valid.csv to valid.csv
Saving test.csv to test.csv
Saving train.csv to train.csv


**• Loading Train, Validation, and Test Data with Pandas :**

This code loads three datasets into pandas DataFrames.
train_df contains the training data, valid_df holds the validation data, and test_df is the unseen test set. Each CSV file (train.csv, valid.csv, test.csv) is read using pd.read_csv, making the data easy to explore, preprocess, and use for model training and evaluation.

In [3]:
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
test_df = pd.read_csv("test.csv")

**• Loading and Splitting Text and Labels for Train, Validation, and Test Sets :**

This block prepares the dataset for machine learning. It first re-reads the three CSV files (train.csv, valid.csv, and test.csv) into pandas DataFrames, repeating the previous cell. From each DataFrame, the text column is cast to strings and converted to a Python list (train_texts, valid_texts, test_texts), while the label column is extracted as a list of labels (train_labels, valid_labels, test_labels). This separation makes the data ready for preprocessing, feature extraction, and model training.

In [4]:
# 1. Load Data
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
test_df  = pd.read_csv("test.csv")
train_texts = train_df['text'].astype(str).tolist()
train_labels = train_df['label'].tolist()
valid_texts = valid_df['text'].astype(str).tolist()
valid_labels = valid_df['label'].tolist()
test_texts  = test_df['text'].astype(str).tolist()
test_labels = test_df['label'].tolist()
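
Since macro F1 is chosen later with class imbalance in mind, it can help to look at the label distribution right after loading. The following is a minimal sketch, not part of the original notebook: example_labels is a hypothetical stand-in for train_labels, and the real values come from train.csv.

```python
from collections import Counter

# Hypothetical labels standing in for train_labels (assumption for illustration).
example_labels = [0, 1, 1, 2, 1, 0, 1, 3]

label_counts = Counter(example_labels)
total = len(example_labels)

# Print each class with its absolute and relative frequency.
for label, count in sorted(label_counts.items()):
    print(f"label {label}: {count} samples ({count / total:.1%})")
```

Replacing example_labels with train_labels would show whether some classes are much rarer than others.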

**• Text Preprocessing: Lowercasing, Punctuation Removal, and Whitespace Normalization**

This code defines a simple **text preprocessing function** and applies it to the train, validation, and test sets.

The preprocess function first converts all text to lowercase for consistency. Then it removes punctuation by replacing it with spaces, using regular expressions. After that, it replaces multiple spaces with a single space and trims leading or trailing whitespace. The cleaned text is returned.

Finally, the function is applied to every sentence in train_texts, valid_texts, and test_texts, ensuring that all datasets are standardized before feature extraction and model training.

In [5]:
def preprocess(text):
  text = text.lower()
  text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
  text = re.sub(r"\s+", " ", text).strip()
  return text
train_texts = [preprocess(t) for t in train_texts]
valid_texts = [preprocess(t) for t in valid_texts]
test_texts  = [preprocess(t) for t in test_texts]
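
To make the effect of these three steps concrete, here is a small sketch that applies the same function to a made-up sentence (the sample string is an assumption, not taken from the dataset).

```python
import re
import string

def preprocess(text):
    # Same steps as the cell above: lowercase, punctuation -> space, collapse whitespace.
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Patient's BP: 120/80 mmHg, stable."  # hypothetical input
print(preprocess(sample))  # patient s bp 120 80 mmhg stable
```

Note that replacing punctuation with spaces splits tokens such as "120/80" and "patient's" into separate pieces, which affects what ends up in the vocabulary.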

**• Vocabulary Building: Top 10K Words with IDs and Frequencies from Training Data :**

This block builds a vocabulary from the training dataset by keeping only the most frequent words.

It starts with a Counter object to count word frequencies. Each text in train_texts is split into tokens, and their counts are updated. Then, the top 10,000 most common words are extracted into vocab. A mapping word2id assigns each word a unique integer ID starting from 0.

The vocabulary is saved into a file vocab.txt, where each line contains a word, its ID, and its frequency in the training set. To verify, the code prints out the first 10 words with their IDs and counts.
For example, one of the printed results looks like:

the 0 118887


which shows that the word “the” is the most frequent, with ID 0 and a count of 118,887.



In [6]:
# 3. Build Vocabulary (Top 10,000 words from TRAIN)
word_counts = Counter()
for t in train_texts:
  word_counts.update(t.split())
TOP_K = 10000
vocab = [word for word, _ in word_counts.most_common(TOP_K)]
word2id = {word: i for i, word in enumerate(vocab)}  # ids start from 0
with open("vocab.txt", "w", encoding="utf-8") as f:
  for i, word in enumerate(vocab):
    f.write(f"{word} {i} {word_counts[word]}\n")
for i, word in enumerate(vocab[:10]):
    print(f"{word} {word2id[word]} {word_counts[word]}")

the 0 118887
and 1 66917
was 2 56124
of 3 48447
to 4 41003
a 5 34316
with 6 28462
in 7 26243
is 8 21651
patient 9 19289
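
The same procedure can be traced on a toy corpus. This is a sketch under assumed inputs: toy_texts and TOY_K are hypothetical and only illustrate how most_common truncates the vocabulary.

```python
from collections import Counter

# Hypothetical, already-preprocessed texts (assumption for illustration).
toy_texts = ["the patient was stable", "the patient has fever", "the fever resolved"]

toy_counts = Counter()
for t in toy_texts:
    toy_counts.update(t.split())

TOY_K = 3  # keep only the 3 most frequent words
toy_vocab = [w for w, _ in toy_counts.most_common(TOY_K)]
toy_word2id = {w: i for i, w in enumerate(toy_vocab)}

print(toy_word2id)
print("stable" in toy_word2id)  # False: rarer words fall outside the vocabulary
```

Words that do not make the cut-off, like "stable" here, are simply dropped in the later ID conversion and vectorization steps.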


**• Converting Text to Word ID Sequences and Saving Train/Validation/Test Sets :**

This block converts the texts into sequences of word IDs based on the previously built vocabulary and saves them to files.

The convert_to_ids function loops through each text and label, replaces every word in the text with its corresponding ID from word2id (ignoring words not in the vocabulary), joins the IDs into a string, appends the label at the end, and writes each line to a file. This is done for training (train_ids.txt), validation (valid_ids.txt), and test (test_ids.txt) sets.

Each line in the saved file represents a single sample as a sequence of integers followed by the label, making it suitable for machine learning models that work with numerical inputs. For example, a line from train_ids.txt might look like:

26 248 542 27 157 424 ... 1


Here, each number corresponds to a word ID from the vocabulary, and the final number is the label for that sample.

In [7]:
# 4. Save Train/Valid/Test with Word IDs
def convert_to_ids(texts, labels, filename):
  lines = []
  with open(filename, "w", encoding="utf-8") as f:
    for text, label in zip(texts, labels):
      ids = [str(word2id[w]) for w in text.split() if w in word2id]
      line = " ".join(ids) + f" {label}\n"
      f.write(line)
      lines.append(line.strip())
  return lines
train_ids = convert_to_ids(train_texts, train_labels, "train_ids.txt")
valid_ids = convert_to_ids(valid_texts, valid_labels, "valid_ids.txt")
test_ids  = convert_to_ids(test_texts, test_labels, "test_ids.txt")
for line in train_ids[:5]:
    print(line)


26 248 542 27 157 424 232 2588 2253 3912 5154 26 157 21 364 1009 55 33 778 450 36 391 9548 1777 46 33 19 897 40 33 1034 3 0 1829 1 1089 4991 83 33 21 897 1 21 364 778 450 1945 27 29 8 27 4 26 424 1278 1079 163 55 10 424 232 26 157 1829 1278 6 315 157 1060 7 19 105 1381 299 1413 691 1860 1290 27 33 21 897 26 391 9548 1777 36 157 1829 1278 55 315 157 1060 7 19 105 1381 1094 26 248 542 1945 1829 1278 105 1381 232 364 105 897 1829 1278 2
137 205 1513 1 1214 2924 123 205 2589 1 1214 2616 34 7931 116 2778 506 3 34 0 2505 3913 2 1095 68 0 892 1 566 618 87 0 867 679 1 1769 4 0 796 399 3 0 1769 0 4992 2 33 1 0 104 1214 2749 2 247 0 1214 2639 2 33 6 0 1752 1045 1913 17 1898 117 44 0 7330 17 16 244 29 14 472 1643 5261 1 5 3284 3 17 1898 117 816 16 29 2 5 148 2240 463 6 0 5478 57 17 2489 117 44 0 7330 0 591 188 0 463 2 33 0 1012 1046 2 33 6 33 591 316 0 3914 2 716 3674 3 0 774 68 0 1769 47 2 33 87 0 796 399 260 3345 3 0 774 520 1270 14 129 3 0 6859 4512 0 2550 3 5 148 2240 463 51 2 1452 4 1678 208
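
To use the saved files later, each line has to be split back into word IDs and a label. Here is a minimal sketch of such a reader; parse_id_line is a hypothetical helper, not part of the notebook, and it assumes the label contains no spaces.

```python
def parse_id_line(line):
    """Split one line of an *_ids.txt file into (word_ids, label)."""
    parts = line.split()
    if not parts:
        raise ValueError("empty line")
    # The last token is the label; everything before it is a word ID.
    *id_tokens, label = parts
    return [int(tok) for tok in id_tokens], label

ids, label = parse_id_line("26 248 542 27 157 424 2")
print(ids, label)          # [26, 248, 542, 27, 157, 424] 2
print(parse_id_line("1"))  # ([], '1'): a text with no in-vocabulary words
```

The label is kept as a string here because the file format itself does not say whether labels are numeric.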

This code is used in Google Colab to download files from the Colab environment to your local computer.

The files.download() function takes a filename as input and prompts a download in the browser. Here, it allows you to save the generated files: vocab.txt, train_ids.txt, valid_ids.txt, and test_ids.txt. These files include the vocabulary and the train/validation/test datasets converted into sequences of word IDs, so you can use them locally or in other projects.

In [13]:
from google.colab import files
files.download("vocab.txt")
files.download("train_ids.txt")
files.download("valid_ids.txt")
files.download("test_ids.txt")


**• Text Vectorization: Binary Bag-of-Words (BBoW) vs Frequency Bag-of-Words (FBoW) :**

**vectorize_BBoW(texts)**

Initializes a zero matrix X with shape (number of texts, vocabulary size).

Iterates through each text in the dataset.

For each unique word in the text (set(t.split())), it retrieves the corresponding index from word2id. If the word exists in the vocabulary, it sets the corresponding entry in X to 1.

Returns a binary matrix where each row represents a text, and each column represents a word. A value of 1 indicates that the word appears at least once in that text; 0 means the word does not appear.

**vectorize_FBoW(texts)**

Initializes a zero matrix X with the same shape, but as float32.

For each text, it calculates the frequency of each word relative to the total number of words in that text (c / total), using Counter.

Each entry in the resulting matrix represents the relative frequency of the corresponding word in that text.

Returns a floating-point matrix capturing how often each word occurs, which can provide more nuanced information than BBoW.

**BBoW explanation:**

Binary Bag-of-Words ignores the number of times a word appears and only cares whether it is present or absent. It converts textual data into a sparse binary feature vector that is simple, interpretable, and often effective for classification tasks. For example, if the text is "patient has fever" and "patient" has index 0, "has" index 1, "fever" index 2, the BBoW vector would be [1, 1, 1, 0, ...] regardless of whether a word repeats. This is especially useful when you want to focus on word presence rather than frequency.

**FBoW explanation:**

The vectorize_FBoW function creates a normalized frequency representation. It also initializes a matrix X with zeros. For each text, it counts the occurrences of each word (Counter(words)), then divides each count by the total number of words in that text to get the relative frequency. The result is a vector where each element indicates how often a word appears relative to the text length.


How it works & purpose:

Captures both the presence and the frequency of words.

Words that appear more often in a text have higher values in the vector.

Normalization by text length prevents longer texts from dominating due to sheer word count.

Example: if a text has 5 words and "patient" appears twice, the FBoW vector will have 0.4 at the "patient" index.

Summary :

BBoW → binary vectors indicating presence/absence of words. Simple and sparse.

FBoW → normalized frequency vectors capturing relative occurrence of words. Useful for highlighting important words while accounting for text length.

Both methods convert raw text into fixed-length vectors matching the vocabulary size, making it compatible with classical ML models like logistic regression, decision trees, random forests, or XGBoost.

In [8]:
# 5. Vectorization Functions
def vectorize_BBoW(texts):
    X = np.zeros((len(texts), len(vocab)), dtype=np.uint8)
    for i, t in enumerate(tqdm(texts, desc="BBoW")):
        for word in set(t.split()):
            idx = word2id.get(word)
            if idx is not None:
                X[i, idx] = 1
    return X
def vectorize_FBoW(texts):
    X = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for i, t in enumerate(tqdm(texts, desc="FBoW")):
        words = t.split()
        total = len(words)
        if total == 0:
            continue
        counts = Counter(words)
        for word, c in counts.items():
            idx = word2id.get(word)
            if idx is not None:
                X[i, idx] = c / total
    return X
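
The two representations can be compared on a single toy text. This sketch uses plain Python lists instead of NumPy arrays and a hypothetical three-word vocabulary (toy_word2id); it follows the same logic as the functions above, including dividing by the total number of tokens.

```python
from collections import Counter

toy_word2id = {"patient": 0, "has": 1, "fever": 2}  # hypothetical vocabulary
text = "patient has fever patient today"            # "today" is out of vocabulary

words = text.split()

# Binary Bag-of-Words: 1 if the word occurs at least once.
bbow = [0] * len(toy_word2id)
for w in set(words):
    if w in toy_word2id:
        bbow[toy_word2id[w]] = 1

# Frequency Bag-of-Words: count divided by the total number of tokens.
fbow = [0.0] * len(toy_word2id)
for w, c in Counter(words).items():
    if w in toy_word2id:
        fbow[toy_word2id[w]] = c / len(words)

print(bbow)  # [1, 1, 1]
print(fbow)  # [0.4, 0.2, 0.2]
```

Because the denominator counts every token, including out-of-vocabulary ones, an FBoW row does not necessarily sum to 1; here it sums to about 0.8.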

**• Preparing BBoW Features and Encoding Labels for Classification :**

This code prepares the features and labels for machine learning models.

First, it vectorizes the text datasets using the Binary Bag-of-Words (BBoW) method. X_train_BBoW, X_valid_BBoW, and X_test_BBoW are matrices where each row represents a text and each column represents a word in the vocabulary, with 1 indicating the presence of a word and 0 its absence.

Next, it encodes the labels into numerical format using LabelEncoder from scikit-learn. The encoder is first fitted on the training labels (train_labels) and transforms them into integer class IDs (train_labels_enc). The same encoder is then used to transform the validation and test labels consistently (valid_labels_enc, test_labels_enc). This ensures that the labels are in a format suitable for classifiers like logistic regression, decision trees, or ensemble models.

In [9]:
# 6. Prepare BBoW matrices
from sklearn.preprocessing import LabelEncoder
X_train_BBoW = vectorize_BBoW(train_texts)
X_valid_BBoW = vectorize_BBoW(valid_texts)
X_test_BBoW  = vectorize_BBoW(test_texts)
le = LabelEncoder()
train_labels_enc = le.fit_transform(train_labels)
valid_labels_enc = le.transform(valid_labels)
test_labels_enc  = le.transform(test_labels)

BBoW: 100%|██████████| 4000/4000 [00:00<00:00, 8990.37it/s]
BBoW: 100%|██████████| 499/499 [00:00<00:00, 7684.33it/s]
BBoW: 100%|██████████| 500/500 [00:00<00:00, 7561.34it/s]
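
The mapping that the encoder produces can be mimicked in plain Python. This is only a sketch of the idea, with hypothetical string labels; it assumes the encoder assigns IDs in sorted order of the distinct training labels, which is how LabelEncoder orders its classes_.

```python
# Hypothetical labels; the real ones come from the label column of the CSV files.
train_labels_demo = ["neuro", "cardio", "neuro", "ortho"]
valid_labels_demo = ["ortho", "cardio"]

# "Fit": collect the distinct training labels in sorted order.
classes = sorted(set(train_labels_demo))
label2id = {label: i for i, label in enumerate(classes)}

# "Transform": reuse the training mapping for every split.
train_enc = [label2id[l] for l in train_labels_demo]
valid_enc = [label2id[l] for l in valid_labels_demo]

print(classes)    # ['cardio', 'neuro', 'ortho']
print(train_enc)  # [1, 0, 1, 2]
print(valid_enc)  # [2, 0]
```

A label that appears only in the validation or test set has no entry in the mapping; this sketch would raise a KeyError, and LabelEncoder.transform raises a ValueError in the same situation.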


**• Training and Evaluating Classical ML Models on Binary Bag-of-Words Features :**

This block trains and evaluates multiple classical machine learning models on the Binary Bag-of-Words (BBoW) features. Four models are defined: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and XGBClassifier (XGBoost). Each model is initialized with explicit hyperparameters (maximum iterations and regularization strength C for logistic regression, maximum depth and minimum leaf size for the trees, number of estimators for the ensembles) to control complexity and improve performance.

The code iterates over the feature sets (here only BBoW) and trains each model on the training data (X_train_BBoW with train_labels_enc). After training, predictions are made on the train, validation, and test sets. The f1_score with average='macro' is calculated for each split, giving a balanced metric across all classes, especially useful for datasets with class imbalance.
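
To show what average='macro' computes, here is a hand-written sketch of the metric on a tiny made-up example. macro_f1 is a hypothetical helper written for illustration; the notebook itself uses f1_score from scikit-learn, and this version should agree with it on inputs like the one below.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up imbalanced example: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.4f}")                   # accuracy: 0.8333
print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")   # macro F1: 0.7778
```

The minority class has an F1 of only 2/3, and because every class counts equally it pulls the macro average below the accuracy.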

The output shows performance comparisons:

**•** LogisticRegression achieves very high training F1 (0.9076), but validation F1 drops to 0.6879, indicating some overfitting. Test F1 is 0.7268.

**•** DecisionTree performs slightly worse on training (0.8627) but generalizes better to validation (0.7319) and test (0.7530).

**•** RandomForest fits the training set least well (0.6979) and also scores lowest on validation and test (0.4998 and 0.5181), suggesting suboptimal hyperparameters or over-regularization.

**•** XGBoost shows the best overall balance with high training F1 (0.9079) and strong validation and test F1 (0.7697 and 0.7925), indicating good generalization.

In summary, this experiment demonstrates how **different classical ML models perform on binary bag-of-words** features, highlights overfitting in some models, and shows that **XGBoost provides the best trade-off between fitting the training data and generalization.**

In [10]:
# 7. Train & Evaluate (BBoW)
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000,C=0.3,class_weight="balanced",solver="liblinear"),
    "DecisionTree": DecisionTreeClassifier(max_depth=20,min_samples_leaf=5,random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, max_depth=20,max_features="sqrt",min_samples_leaf=5,random_state=42,n_jobs=-1),
    "XGBoost": XGBClassifier(n_estimators=200, max_depth=6, eval_metric='mlogloss', random_state=42)
    }
for vec_name, X_train, X_valid, X_test in [("BBoW", X_train_BBoW, X_valid_BBoW, X_test_BBoW),]:
    print(f"\n=== {vec_name} Results ===")
    for name, model in models.items():
        model.fit(X_train, train_labels_enc)
        train_pred, valid_pred, test_pred = model.predict(X_train), model.predict(X_valid), model.predict(X_test)
        print(f"\n{name}:")
        print(f"Train F1: {f1_score(train_labels_enc, train_pred, average='macro'):.4f}")
        print(f"Valid F1: {f1_score(valid_labels_enc, valid_pred, average='macro'):.4f}")
        print(f"Test F1: {f1_score(test_labels_enc, test_pred, average='macro'):.4f}")


=== BBoW Results ===

LogisticRegression:
Train F1: 0.9076
Valid F1: 0.6879
Test F1: 0.7268

DecisionTree:
Train F1: 0.8627
Valid F1: 0.7319
Test F1: 0.7530

RandomForest:
Train F1: 0.6979
Valid F1: 0.4998
Test F1: 0.5181

XGBoost:
Train F1: 0.9079
Valid F1: 0.7697
Test F1: 0.7925


**BBoW Analysis**

**•** Logistic Regression: High training F1 (0.9076) but lower validation (0.6879) and test (0.7268) indicate some overfitting. The regularization parameter C=0.3 controls the strength of L2 regularization. A lower C would increase regularization, potentially reducing overfitting and improving generalization.

**•** Decision Tree: Train F1 = 0.8627, Test F1 = 0.7530. max_depth=20 and min_samples_leaf=5 limit tree complexity and help prevent extreme overfitting even with sparse binary vectors. Trees handle presence/absence features well.

**•** Random Forest: Train F1 = 0.6979, Test F1 = 0.5181. n_estimators=200, max_depth=20, and min_samples_leaf=5 aim to balance bias and variance, but both training and test scores are low, so the ensemble generalizes poorly with these settings on sparse binary features.

**•** XGBoost: Train F1 = 0.9079, Test F1 = 0.7925. With n_estimators=200 and max_depth=6 (the learning rate is left at its default), XGBoost extracts strong patterns from binary features despite the missing frequency information.

Role of Hyperparameters for BBoW: Hyperparameters control overfitting for the linear, tree-based, and boosting models alike. Even the linear model reaches a test F1 of 0.7268 on BBoW, so presence/absence alone carries substantial signal for this task, and XGBoost extracts the most from it.

**• Preparing FBoW Features and Encoding Labels for Classification :**

This code generates Frequency Bag-of-Words (FBoW) feature matrices for the training, validation, and test datasets by applying the vectorize_FBoW function. Each resulting matrix has dimensions corresponding to the number of texts and the vocabulary size, with each entry representing the normalized frequency of a word in the text. These matrices provide numerical input that can be directly fed into classical machine learning models such as logistic regression, decision trees, random forests, or XGBoost for text classification tasks.

In [11]:
# 8. Prepare FBoW matrices
X_train_FBoW = vectorize_FBoW(train_texts)
X_valid_FBoW = vectorize_FBoW(valid_texts)
X_test_FBoW  = vectorize_FBoW(test_texts)

FBoW: 100%|██████████| 4000/4000 [00:00<00:00, 8063.83it/s]
FBoW: 100%|██████████| 499/499 [00:00<00:00, 8152.37it/s]
FBoW: 100%|██████████| 500/500 [00:00<00:00, 7854.24it/s]


**• Training and Evaluating Classical ML Models on Frequency Bag-of-Words Features :**

This code trains and evaluates several classical machine learning models using the Frequency Bag-of-Words (FBoW) features. A dictionary models defines four classifiers with tuned hyperparameters: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and XGBClassifier. Each model is trained on the FBoW matrix of the training set and then used to predict labels on the training, validation, and test sets. The macro F1 score is computed for each split using f1_score to measure performance across all classes equally, which is especially useful for imbalanced datasets.

The printed results show how each model performed:

**•** Logistic Regression: Low F1 scores (~0.32) across train, validation, and test, indicating that the linear model struggles to capture complex patterns in the data.

**•** Decision Tree: High training F1 (0.8364) but lower validation (0.7028) and test (0.7306), showing some overfitting but still reasonably generalizes.

**•** Random Forest: Moderate training F1 (0.7361) but poor validation and test F1 (~0.46–0.47), suggesting overfitting or insufficient tuning despite being an ensemble method.

**•** XGBoost: Very high training F1 (0.9090) and strong validation (0.7619) and test F1 (0.8075), demonstrating good generalization and strong performance, likely due to gradient boosting effectively capturing non-linear relationships.

Analysis: Models like **decision trees and XGBoost can capture non-linear dependencies** in text data, whereas linear models like logistic regression may underperform on complex classification tasks. Ensemble methods like **XGBoost balance high training accuracy with better generalization compared to single trees or poorly tuned random forests.**

In [12]:
# 9. Train & Evaluate (FBoW)
models = {
    "LogisticRegression": LogisticRegression(max_iter=2000,C=0.4,class_weight="balanced",solver="liblinear",penalty="l2"),
    "DecisionTree": DecisionTreeClassifier(max_depth=18,min_samples_leaf=8,random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, max_depth=20,max_features="sqrt",min_samples_leaf=5,random_state=42,n_jobs=-1),
    "XGBoost": XGBClassifier(n_estimators=200, max_depth=6, eval_metric='mlogloss', random_state=42)
}
for vec_name, X_train, X_valid, X_test in [("FBoW", X_train_FBoW, X_valid_FBoW, X_test_FBoW),]:
    print(f"\n=== {vec_name} Results ===")
    for name, model in models.items():
        model.fit(X_train, train_labels_enc)
        train_pred, valid_pred, test_pred = model.predict(X_train), model.predict(X_valid), model.predict(X_test)
        print(f"\n{name}:")
        print(f"Train F1: {f1_score(train_labels_enc, train_pred, average='macro'):.4f}")
        print(f"Valid F1: {f1_score(valid_labels_enc, valid_pred, average='macro'):.4f}")
        print(f"Test F1: {f1_score(test_labels_enc, test_pred, average='macro'):.4f}")


=== FBoW Results ===

LogisticRegression:
Train F1: 0.3253
Valid F1: 0.3217
Test F1: 0.3178

DecisionTree:
Train F1: 0.8364
Valid F1: 0.7028
Test F1: 0.7306

RandomForest:
Train F1: 0.7361
Valid F1: 0.4639
Test F1: 0.4735

XGBoost:
Train F1: 0.9090
Valid F1: 0.7619
Test F1: 0.8075
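
With both result blocks printed, the two representations can be put side by side. The sketch below only rearranges the F1 scores copied from the outputs above (train, valid, test); the dictionary names and the gap column are illustrative choices, not part of the notebook.

```python
# (train, valid, test) macro F1 copied from the printed BBoW and FBoW results.
bbow_f1 = {
    "LogisticRegression": (0.9076, 0.6879, 0.7268),
    "DecisionTree": (0.8627, 0.7319, 0.7530),
    "RandomForest": (0.6979, 0.4998, 0.5181),
    "XGBoost": (0.9079, 0.7697, 0.7925),
}
fbow_f1 = {
    "LogisticRegression": (0.3253, 0.3217, 0.3178),
    "DecisionTree": (0.8364, 0.7028, 0.7306),
    "RandomForest": (0.7361, 0.4639, 0.4735),
    "XGBoost": (0.9090, 0.7619, 0.8075),
}

winners = {}
for name in bbow_f1:
    b_train, _, b_test = bbow_f1[name]
    f_train, _, f_test = fbow_f1[name]
    winners[name] = "BBoW" if b_test > f_test else "FBoW"
    # The train-test gap is a rough indicator of overfitting.
    print(f"{name:<18} BBoW test={b_test:.4f} (gap {b_train - b_test:+.4f}) | "
          f"FBoW test={f_test:.4f} (gap {f_train - f_test:+.4f}) | higher: {winners[name]}")
```

On these numbers, BBoW has the higher test F1 for Logistic Regression, Decision Tree, and Random Forest, and FBoW is ahead only for XGBoost.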


**FBoW Analysis**

**•** Logistic Regression: Very low F1 on all three splits (Train 0.3253, Valid 0.3217, Test 0.3178) indicates underfitting rather than overfitting. The regularization parameter C=0.4 controls the strength of L2 regularization. Because FBoW values are small fractions (each count divided by the text length), the same penalty constrains the model far more than it does on 0/1 features, so a larger C (weaker regularization) may improve the fit.

**•** Decision Tree: Train F1 = 0.8364, Test F1 = 0.7306. Hyperparameters like max_depth=18 and min_samples_leaf=8 limit tree complexity, preventing extreme overfitting while allowing enough capacity to capture non-linear patterns.

**•** Random Forest: Train F1 = 0.7361, Test F1 = 0.4735. Hyperparameters such as n_estimators=200, max_depth=20, and min_samples_leaf=5 aim to balance bias and variance. The relatively poor generalization suggests more tuning (e.g., adjusting depth or the number of features per split) might be needed.

**•** XGBoost: Train F1 = 0.9090, Test F1 = 0.8075. Hyperparameters like max_depth=6 and n_estimators=200 control tree complexity and the number of boosting rounds, while eval_metric='mlogloss' sets the metric XGBoost reports. The strong performance shows that gradient boosting effectively captures patterns in FBoW features without severe overfitting.

Role of Hyperparameters for FBoW: They determine model complexity, regularization, and ensemble size, which are critical for balancing underfitting (low train F1, as with logistic regression), overfitting (high train F1 with much lower validation/test F1), and generalization.

**Overall Analysis :**

Linear models like logistic regression are sensitive to feature representation: here the model overfits somewhat on BBoW and underfits badly on FBoW.

Single decision trees tend to overfit but still generalize reasonably well.

Random forests may underperform if not properly tuned, especially with sparse, high-dimensional bag-of-words features.

XGBoost provides the best balance between fitting the training data and generalizing to unseen data with both representations, making it the most reliable choice for this bag-of-words text classification task.

This analysis highlights the importance of both model choice and feature quality when using bag-of-words representations for multi-class text classification.

**•** Logistic Regression: BBoW dramatically outperforms FBoW (Train F1 0.9076 vs 0.3253, Test F1 0.7268 vs 0.3178). With the settings used here, the linear model learns far more from presence/absence features than from relative frequencies, whose small magnitudes are penalized heavily by the L2 regularization.

**•** Decision Tree: BBoW slightly outperforms FBoW (Test F1 0.7530 vs 0.7306). Trees split on thresholds, so presence/absence already captures most of what they use, and frequency information adds little here.

**•** Random Forest: BBoW achieves better test performance (0.5181 vs 0.4735) despite lower training F1 (0.6979 vs 0.7361), so FBoW shows the larger train-test gap. Both variants generalize poorly.

**•** XGBoost: FBoW and BBoW perform similarly, with BBoW slightly better on validation (0.7697 vs 0.7619) and FBoW slightly higher on test (0.8075 vs 0.7925). This shows that XGBoost is robust to feature type.

**•** BBoW gives the higher validation F1 for every model, so presence/absence information alone carries most of the signal in this dataset.

**•** FBoW gives a slightly higher test F1 only with XGBoost, and a difference of this size on a 500-sample test set may not be stable across datasets.

Key takeaway: **For XGBoost, both FBoW and BBoW work very well; BBoW is slightly ahead on validation**, whereas FBoW slightly edges it on the test set in this specific experiment.

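**Conclusion :**

Overall, in this experiment BBoW (Binary Bag-of-Words) is the more reliable representation. It gives higher test F1 for Logistic Regression (0.7268 vs 0.3178), Decision Tree (0.7530 vs 0.7306), and Random Forest (0.5181 vs 0.4735), with by far the largest difference for the linear model. FBoW (Frequency Bag-of-Words) comes out ahead only with XGBoost on the test split (0.8075 vs 0.7925), while trailing slightly on validation (0.7619 vs 0.7697). These results come from a single train/validation/test split, and the hyperparameters differ slightly between the two runs, so they describe this setup rather than the two representations in general. In particular, FBoW's weak Logistic Regression result may reflect the small scale of relative-frequency features under the chosen regularization rather than a lack of information in word frequencies.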