“Feature-based transfer learning” in NLP.

* Pre-trained DL model = feature extractor.

* ML algorithm = downstream task solver.

HYBRID: It’s not “pure deep learning” (since you’re not training/fine-tuning the whole transformer end-to-end).

It’s also not “pure machine learning” (since your features are learned by a deep neural network, not hand-crafted).

### SBERT + Logistic Regression = Deep learning for embeddings + Machine learning for classification → a hybrid approach that’s especially effective for small/medium labeled datasets. 

In [3]:
import pandas as pd

# Loading the dataset
path = "/Users/gozde/code/g0zzy/stress_sense/raw_data/Data.csv"
data = pd.read_csv(path)

data.drop(columns=["Unnamed: 0"], inplace=True)
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)

In [4]:
import re

def strip_urls(text: str) -> str:
    """
    Remove URLs (http, https, www, youtu links) from a string.
    """
    # remove http/https URLs
    text = re.sub(r"http\S+", "", text)
    # remove www.* URLs
    text = re.sub(r"www\.\S+", "", text)
    # remove youtube short links
    text = re.sub(r"youtu\.be\S+", "", text)
    return text.strip()
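The three substitutions above can also be folded into one compiled pattern. The sketch below is one possible variant, not the function this notebook uses: `strip_urls_compact`, `_URL_RE` and `_WS_RE` are hypothetical names. Unlike `strip_urls`, it only treats `http://` and `https://` prefixes as URLs (so a word that merely starts with "http" is kept), and it also collapses the whitespace left behind.

```python
import re

# Hypothetical names; a single alternation covers the three URL shapes handled above.
_URL_RE = re.compile(r"(?:https?://|www\.|youtu\.be)\S*")
_WS_RE = re.compile(r"\s+")


def strip_urls_compact(text: str) -> str:
    """Remove http(s), www.* and youtu.be links, then squeeze leftover whitespace."""
    without_urls = _URL_RE.sub("", text)
    return _WS_RE.sub(" ", without_urls).strip()


print(strip_urls_compact("watch https://youtu.be/abc and www.example.com now"))
# watch and now
```

Compiling the patterns once avoids re-parsing them for every row when the function is passed to `Series.apply`.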

## Cleaned data

In [5]:
data.statement = data.statement.apply(strip_urls)

## Only classify stress, anxiety and normal

In [6]:
from collections import Counter
from sklearn.model_selection import train_test_split


TARGET = {"Stress", "Anxiety", "Normal"}

# 1) Filter to the 3 classes
df = data.dropna(subset=["statement","status"])
df = df[df["status"].isin(TARGET)].reset_index(drop=True)

texts  = df["statement"].tolist()
labels = df["status"].tolist()
print(Counter(labels))

# 2) Stratified 80/20 split so each class keeps its share in train and test
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)


Counter({'Normal': 16040, 'Anxiety': 3623, 'Stress': 2296})
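The counts above are clearly imbalanced, which is what `class_weight="balanced"` in the classifier cell further down compensates for. As a rough illustration, scikit-learn's "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch applies that formula to the full-dataset counts printed above. This is an assumption for illustration only, since the classifier derives its weights from the training split.

```python
# Class counts copied from the Counter output above (full filtered dataset).
counts = {"Normal": 16040, "Anxiety": 3623, "Stress": 2296}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(f"{label:8s} weight ≈ {w:.2f}")
```

With these weights every class contributes the same total weight (`n_samples / n_classes`) to the loss, so the rarer Stress and Anxiety examples count for more individually than Normal ones.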


In [7]:
from sentence_transformers import SentenceTransformer
sbert = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode (normalize for cosine/logreg stability)
E_train = sbert.encode(X_train, normalize_embeddings=True, show_progress_bar=True)
E_test  = sbert.encode(X_test,  normalize_embeddings=True, show_progress_bar=True)

Batches: 100%|██████████| 549/549 [00:23<00:00, 23.73it/s]
Batches: 100%|██████████| 138/138 [00:05<00:00, 25.89it/s]
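With `normalize_embeddings=True`, each returned vector has length 1, so the dot product of two embeddings equals their cosine similarity. The sketch below shows that relationship with plain Python lists on made-up 2-D vectors. `l2_normalize` and `dot` are hypothetical helpers, not part of sentence-transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for two sentence embeddings.
raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(raw_a), l2_normalize(raw_b)

# Cosine similarity computed the long way on the raw vectors...
cosine = dot(raw_a, raw_b) / (math.sqrt(dot(raw_a, raw_a)) * math.sqrt(dot(raw_b, raw_b)))
# ...matches the plain dot product of the normalized vectors.
print(round(cosine, 4), round(dot(unit_a, unit_b), 4))
```

Unit-length inputs also keep all features on a comparable scale, which is one plausible reason the logistic regression converges without extra scaling.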


In [14]:
sbert.save("../models/sbert")

In [None]:
# Reload the saved model and embed one sentence as a sanity check
sbert_loaded = SentenceTransformer("../models/sbert")
embs = sbert_loaded.encode(["test sentence"])
embs

array([[ 4.29728255e-02,  9.66348425e-02, -2.12916755e-03,
         7.82683119e-02, -6.41745795e-03,  3.80002335e-02,
         9.46167856e-02,  3.93962167e-04, -5.45614287e-02,
         1.48365507e-02,  1.35712266e-01, -7.15561882e-02,
         1.98368244e-02,  4.60874522e-03,  2.93407235e-02,
        -2.44426429e-02,  2.55676303e-02, -3.15778330e-02,
        -6.94974065e-02,  2.44781375e-03,  4.08358239e-02,
        -1.56645086e-02,  6.57425355e-03,  4.47515920e-02,
         4.42355545e-03,  5.28066866e-02, -5.22431955e-02,
         2.03372296e-02,  7.58799985e-02, -2.19202787e-02,
        -2.24517044e-02,  2.38462687e-02,  9.50959045e-03,
         8.76056328e-02,  5.16428873e-02, -5.79361245e-03,
         6.01320434e-03,  2.46345298e-03,  1.73994303e-02,
        -2.02387641e-03, -1.28350954e-03, -1.18024223e-01,
         6.54889569e-02, -1.59119023e-03,  2.21067723e-02,
         3.98671394e-03, -5.12573980e-02,  4.59731221e-02,
        -5.94871677e-02, -4.02599275e-02, -4.32095118e-0

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc  = le.transform(y_test)

model = LogisticRegression(
    max_iter=2000,
    class_weight="balanced",      # helpful when classes are imbalanced
    # multi_class is left at its default ("auto"); the argument is deprecated in recent scikit-learn
    n_jobs=-1
)
model.fit(E_train, y_train_enc)

pred = model.predict(E_test)
print("Accuracy:", accuracy_score(y_test_enc, pred))
print(classification_report(y_test_enc, pred, target_names=le.classes_))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

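The tokenizers warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in the first cell, before the tokenizer is used for the first time:

```python
import os

# Set before sentence_transformers / tokenizers do any work, so that later
# process forks do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Whether the forks come from `n_jobs=-1` in the classifier is an assumption; the setting only silences the warning and disables tokenizer-level parallelism.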
Accuracy: 0.9043715846994536
              precision    recall  f1-score   support

     Anxiety       0.83      0.87      0.85       725
      Normal       0.99      0.93      0.96      3208
      Stress       0.60      0.81      0.69       459

    accuracy                           0.90      4392
   macro avg       0.80      0.87      0.83      4392
weighted avg       0.92      0.90      0.91      4392



### 👆 Probably also good enough for our use case: about 90% accuracy and 0.83 macro F1, although Stress precision (0.60) is the weak spot.
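The two average rows in the report above weight the classes differently: the macro average treats every class equally, while the weighted average scales each class by its support. As a quick check, both F1 averages can be recomputed from the rounded per-class values in the report. Because the inputs are rounded, the last digit could differ from scikit-learn's in general.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {
    "Anxiety": (0.85, 725),
    "Normal": (0.96, 3208),
    "Stress": (0.69, 459),
}

total_support = sum(support for _, support in report.values())

# Macro: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted: mean over classes, each weighted by its number of test samples.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
# 4392 0.83 0.91
```

Because Normal makes up most of the test set, the weighted figure sits close to the Normal score, while the macro figure is pulled down by Stress.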

In [19]:
pred

array([1, 1, 0, ..., 0, 1, 0])
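The integers in `pred` are `LabelEncoder` ids, and `LabelEncoder` stores its classes in sorted order. In the notebook, `le.inverse_transform(pred)` maps them back directly. The sketch below reproduces that mapping without scikit-learn for the first three ids shown above. `classes`, `pred_sample` and `decoded` are illustrative names.

```python
TARGET = {"Stress", "Anxiety", "Normal"}

# LabelEncoder assigns ids by sorted class name: Anxiety=0, Normal=1, Stress=2.
classes = sorted(TARGET)

# First three predicted ids from the array printed above.
pred_sample = [1, 1, 0]
decoded = [classes[i] for i in pred_sample]

print(decoded)
# ['Normal', 'Normal', 'Anxiety']
```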

In [12]:
import pickle

filename = '/Users/gozde/code/g0zzy/stress_sense/models/hybrid_sbert_logreg_model.pkl'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)
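The pickled classifier on its own returns integer ids, and the label names live in the `LabelEncoder`, which is not saved here. One option is to store the class names next to the model in a single file. The helpers `save_bundle` and `load_bundle` below are hypothetical, and the demo uses a stand-in object and a temporary path rather than the notebook's model.

```python
import pickle
import tempfile
from pathlib import Path


def save_bundle(model, classes, path):
    """Pickle the classifier together with its label names (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump({"model": model, "classes": list(classes)}, f)


def load_bundle(path):
    """Load a bundle written by save_bundle; only unpickle files you trust."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["model"], bundle["classes"]


# Round-trip demo with a stand-in object instead of the fitted classifier.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "bundle.pkl"
    save_bundle({"stand_in": "model"}, ["Anxiety", "Normal", "Stress"], demo_path)
    restored_model, restored_classes = load_bundle(demo_path)

print(restored_classes)
```

In this notebook the call would look like `save_bundle(model, le.classes_, filename)`. At inference time the SBERT encoder still has to be loaded separately, since only the classifier is pickled.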