# **EXPLORING JOB DISSATISFACTION IN THE UK USING REDDIT DATA**

## **AN NLP-BASED THEMATIC AND SENTIMENT ANALYSIS**


### Student Name: Awopetu Rasheed Oluwadamilare

### Student Number: 202432121

### **Library Installation**

In [10]:
pip install praw pandas tqdm python-dotenv emoji vaderSentiment



### **Import the necessary libraries**

In [11]:
import pandas as pd
import os, time, re, json
import praw

from dotenv import load_dotenv
from datetime import datetime, timedelta
from tqdm import tqdm

### **Loading Environment Variable from .env File**

##### To keep the Reddit API credentials secure, we store them in a `.env` file instead of hardcoding them in the script. Whenever they are needed, they can be read from the environment and passed to the Reddit API.

In [12]:
environment_variable = load_dotenv()
if environment_variable:
    print("Environment variables loaded successfully.")
else:
    print("Failed to load environment variables. Please Upload .env file")

Environment variables loaded successfully.
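
The check above only confirms that something was loaded from the `.env` file, not that each individual key is present. A minimal sketch of an extra check, assuming the three variable names used in the authentication cell of this notebook (`CLIENT_ID`, `CLIENT_SECRET`, `USER_AGENT`); the helper name `missing_env_vars` is hypothetical:

```python
import os

# Names taken from the authentication cell of this notebook
REQUIRED_VARS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing Reddit credentials:", ", ".join(missing))
else:
    print("All Reddit credentials are set.")
```

Running this right after `load_dotenv()` should surface a misspelled or empty key before the first API call is made.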


### **Loading Dataset & Data Collection**

We collect Reddit posts and comments related to **job dissatisfaction and satisfaction in the UK**.  
- Subreddits: `UKJobs`, `AskUK`, `CasualUK`, `unitedkingdom`, `antiwork`, `WorkReform`, `careerguidance`, `AskHR`, `britishproblems`, `recruitinghell`, `WorkReformUK`.  
- Queries: dissatisfaction terms (e.g., *"hate my job"*, *"toxic workplace"*) and satisfaction terms (e.g., *"love my job"*).  
- UK filter: keeps posts that come from UK-focused subreddits or mention UK places.  
- Data period length: last **6 months** of activity.  
- Output: two CSV files (`reddit_posts_uk_jobs.csv`, `reddit_comments_uk_jobs.csv`).


##### First, let's create the list of subreddits we intend to search.

In [13]:
SUBREDDITS = [
    "UKJobs", "AskUK", "CasualUK", "unitedkingdom", "antiwork", "WorkReform",
    "careerguidance", "AskHR", "britishproblems", "recruitinghell", "WorkReformUK"
]

print(f"List of subreddits: {(SUBREDDITS)}")

List of subreddits: ['UKJobs', 'AskUK', 'CasualUK', 'unitedkingdom', 'antiwork', 'WorkReform', 'careerguidance', 'AskHR', 'britishproblems', 'recruitinghell', 'WorkReformUK']


##### **Query Terms**

Two sets of keyword queries were defined to guide Reddit searches:  

- **Job Dissatisfaction** → phrases like *"hate my job"*, *"toxic workplace"*, *"low pay"*, *"burnout"*, *"bad boss"*, and *"bullying at work"*.  
- **Job Satisfaction** → phrases like *"love my job"*, *"happy at work"*, *"good boss"*, *"work life balance"*, and *"fair pay"*.  

These terms ensure we capture a broad range of posts expressing both negative and positive experiences with work in the UK.


In [14]:
DISSATISFACTION_TERMS = [
    "job dissatisfaction", "hate my job", "toxic workplace", "burnout", "overworked",
    "bad boss", "micromanagement", "low pay", "underpaid", "quit my job", "resign",
    "stress at work", "bullying at work", "zero hours", "unhappy at work", "stressful job", "poor management", "burnt out", "miserable at work",
    "dead-end job", "exploited at work", "no work-life balance", "hate going to work", "rejection", "really bad", "unreasonable", "management cuts benefits"
]

print(f"List of Dissatisfaction terms: {(DISSATISFACTION_TERMS)}")

List of Dissatisfaction terms: ['job dissatisfaction', 'hate my job', 'toxic workplace', 'burnout', 'overworked', 'bad boss', 'micromanagement', 'low pay', 'underpaid', 'quit my job', 'resign', 'stress at work', 'bullying at work', 'zero hours', 'unhappy at work', 'stressful job', 'poor management', 'burnt out', 'miserable at work', 'dead-end job', 'exploited at work', 'no work-life balance', 'hate going to work', 'rejection', 'really bad', 'unreasonable', 'management cuts benefits']


In [15]:
SATISFACTION_TERMS = [
    "love my job", "happy at work", "good boss", "great team", "work life balance",
    "supportive manager", "flexible working", "fair pay"
]
print(f"List of Satisfaction terms: {(SATISFACTION_TERMS)}")

List of Satisfaction terms: ['love my job', 'happy at work', 'good boss', 'great team', 'work life balance', 'supportive manager', 'flexible working', 'fair pay']


##### A regex filter **(`UK_REGEX`)** was applied to retain only posts from UK-focused subreddits or those mentioning UK locations (e.g., London, Manchester, Scotland).


In [16]:
UK_REGEX = re.compile(r"\b(UK|United Kingdom|England|Scotland|Wales|Northern Ireland|London|Manchester|Birmingham|Leeds|Glasgow|Bristol|Liverpool)\b", re.IGNORECASE)

##### Posts were collected from the **last 180 days (approximately 6 months)** to capture recent discussions on job dissatisfaction and satisfaction in the UK.

In [17]:
DAYS_BACK = 180
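
A minimal sketch of how a 180-day window translates into the Unix-timestamp cutoff that Reddit's `created_utc` field can be compared against. `cutoff_timestamp` is a hypothetical helper, not part of the collection code, and it uses a timezone-aware "now" so the result does not depend on the machine's local timezone:

```python
from datetime import datetime, timedelta, timezone

DAYS_BACK = 180

def cutoff_timestamp(days_back, now=None):
    """Unix timestamp (seconds, UTC) marking the start of the collection window."""
    now = datetime.now(timezone.utc) if now is None else now
    return int((now - timedelta(days=days_back)).timestamp())

since = cutoff_timestamp(DAYS_BACK)
print("Keeping posts created at or after:",
      datetime.fromtimestamp(since, tz=timezone.utc).isoformat())
```

Any post whose `created_utc` is smaller than `since` falls outside the window.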

##### **Reddit API Authentication** -> Access to Reddit data was established using the **PRAW (Python Reddit API Wrapper)** library. Authentication requires three environment variables loaded from the `.env` file: `CLIENT_ID`, `CLIENT_SECRET`, and `USER_AGENT`.

In [18]:
reddit = praw.Reddit(
    client_id=os.getenv("CLIENT_ID"),
    client_secret=os.getenv("CLIENT_SECRET"),
    user_agent=os.getenv("USER_AGENT"),
    check_for_async=False,
)

##### A helper function was used to retain only posts from UK-focused subreddits or those mentioning UK locations.  


In [19]:
def is_uk_related(text: str, subreddit: str) -> bool:
    if subreddit.lower() in {"ukjobs", "askuk", "casualuk", "unitedkingdom", "britishproblems"}:
        return True
    return bool(UK_REGEX.search(text or ""))
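
To see what the UK filter accepts and rejects, here is a small self-contained check. It repeats the regex and the helper from the cells above so it can run on its own; the example strings are invented for illustration, and `UK_SUBS` is just a name given here to the set used inside the helper:

```python
import re

UK_REGEX = re.compile(
    r"\b(UK|United Kingdom|England|Scotland|Wales|Northern Ireland|London|"
    r"Manchester|Birmingham|Leeds|Glasgow|Bristol|Liverpool)\b",
    re.IGNORECASE,
)
UK_SUBS = {"ukjobs", "askuk", "casualuk", "unitedkingdom", "britishproblems"}

def is_uk_related(text, subreddit):
    # UK-focused subreddits pass regardless of the text
    if subreddit.lower() in UK_SUBS:
        return True
    # Otherwise the text must mention a UK place as a whole word
    return bool(UK_REGEX.search(text or ""))

examples = [
    ("I hate my job", "UKJobs"),
    ("Burnt out after moving to Leeds", "antiwork"),
    ("Burnt out in Ukraine", "antiwork"),
    ("Toxic workplace in Birmingham, Alabama", "antiwork"),
    (None, "AskHR"),
]
for text, sub in examples:
    print(f"{sub:10} {text!r:45} -> {is_uk_related(text, sub)}")
```

Two things stand out in this sketch: place names shared with other countries (Birmingham, Alabama) still pass the regex, and `WorkReformUK` appears in `SUBREDDITS` but not in the UK-focused set, so its posts are kept only when the text itself matches.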

##### A custom function was used to search target subreddits with defined queries, filter by UK relevance and time window, and return labelled posts with top-level comments for analysis.  

#### Pushshift API documentation (reference only; the function below uses PRAW) -> https://github.com/pushshift/api


In [20]:
def search_and_collect(queries, label):
    """
    Search posts for each query across target subreddits, INCLUDE metadata AND top comments.
    label = 'dissatisfaction' or 'satisfaction'
    """
    # Unix-time cutoff; time.time() is already UTC-based, so no naive-datetime conversion is needed
    since = int(time.time()) - DAYS_BACK * 24 * 60 * 60
    rows_posts, rows_comments = [], []

    for sr in SUBREDDITS:
        sub = reddit.subreddit(sr)
        for q in queries:
            results = sub.search(q, sort="new", limit=500)
            for s in tqdm(results, desc=f"{sr} | {q}"):
                if s.created_utc < since:
                    continue

                # UK filter
                blob = f"{s.title}\n{s.selftext or ''}"
                if not is_uk_related(blob, sr):
                    continue

                post_row = {
                    "id": s.id,
                    "subreddit": sr,
                    "created_utc": s.created_utc,
                    "created_iso": pd.Timestamp(s.created_utc, unit="s").isoformat(),  # UTC; avoids deprecated utcfromtimestamp
                    "author": str(s.author) if s.author else None,
                    "title": s.title,
                    "selftext": s.selftext,
                    "score": s.score,
                    "num_comments": s.num_comments,
                    "url": s.url,
                    "permalink": f"https://www.reddit.com{s.permalink}",
                    "query": q,
                    "label": label,  # satisfaction OR dissatisfaction
                    "source": "reddit_api",
                }
                rows_posts.append(post_row)

                # Fetch a few top level comments for context
                s.comment_sort = "top"
                s.comments.replace_more(limit=0)
                for c in s.comments[:25]:
                    rows_comments.append({
                        "post_id": s.id,
                        "comment_id": c.id,
                        "created_utc": c.created_utc,
                        "created_iso": pd.Timestamp(c.created_utc, unit="s").isoformat(),  # UTC; avoids deprecated utcfromtimestamp
                        "author": str(c.author) if c.author else None,
                        "body": c.body,
                        "score": c.score,
                        "permalink": f"https://www.reddit.com{c.permalink}",
                        "subreddit": sr,
                        "post_query": q,
                        "post_label": label,
                    })

                time.sleep(0.6)

    return pd.DataFrame(rows_posts), pd.DataFrame(rows_comments)

###### **Data Collection and Storage**
The script collects Reddit posts for both dissatisfaction and satisfaction queries, merges and de-duplicates results, and saves labelled posts and comments as CSV files for subsequent NLP, sentiment, and thematic analysis.  


In [21]:
if __name__ == "__main__":
    posts_diss, comments_diss = search_and_collect(DISSATISFACTION_TERMS, "dissatisfaction")
    posts_sat,  comments_sat  = search_and_collect(SATISFACTION_TERMS, "satisfaction")

    # Combine & de-duplicate by id
    posts = pd.concat([posts_diss, posts_sat], ignore_index=True)
    posts = posts.drop_duplicates(subset=["id"])
    comments = pd.concat([comments_diss, comments_sat], ignore_index=True)
    comments = comments.drop_duplicates(subset=["comment_id"])

    # Save
    os.makedirs("data", exist_ok=True)
    posts.to_csv("data/reddit_posts_uk_jobs.csv", index=False)
    comments.to_csv("data/reddit_comments_uk_jobs.csv", index=False)

    print("Saved data/reddit_posts_uk_jobs.csv and data/reddit_comments_uk_jobs.csv")

UKJobs | job dissatisfaction: 12it [00:03,  3.22it/s]
UKJobs | hate my job: 225it [01:58,  1.89it/s]
UKJobs | toxic workplace: 69it [00:11,  5.78it/s]
UKJobs | burnout: 83it [00:29,  2.78it/s]
UKJobs | overworked: 72it [00:07,  9.03it/s]
UKJobs | bad boss: 145it [00:24,  5.90it/s]
UKJobs | micromanagement: 109it [00:31,  3.45it/s]
UKJobs | low pay: 234it [01:44,  2.24it/s]
UKJobs | underpaid: 229it [00:59,  3.88it/s]
UKJobs | quit my job: 237it [04:07,  1.04s/it]
UKJobs | resign: 237it [01:42,  2.30it/s]
UKJobs | stress at work: 237it [02:41,  1.47it/s]
UKJobs | bullying at work: 99it [00:16,  5.83it/s]
UKJobs | zero hours: 178it [00:35,  5.06it/s]
UKJobs | unhappy at work: 106it [00:21,  4.99it/s]
UKJobs | stressful job: 243it [03:32,  1.14it/s]
UKJobs | poor management: 

Saved data/reddit_posts_uk_jobs.csv and data/reddit_comments_uk_jobs.csv
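
Because the same post can be returned by both a dissatisfaction and a satisfaction query, `drop_duplicates(subset=["id"])` keeps whichever row comes first after concatenation, which here is the dissatisfaction row. A minimal sketch with made-up ids (not real Reddit data) that shows the effect and how the overlap could be inspected before it is dropped:

```python
import pandas as pd

# Toy frames standing in for the two search results; ids are invented
posts_diss = pd.DataFrame({"id": ["a1", "b2"], "label": ["dissatisfaction"] * 2})
posts_sat = pd.DataFrame({"id": ["b2", "c3"], "label": ["satisfaction"] * 2})

posts = pd.concat([posts_diss, posts_sat], ignore_index=True)

# Rows whose id appears more than once, i.e. posts matched by both query sets
overlap = posts[posts.duplicated(subset=["id"], keep=False)]
print("Posts matched by both query sets:", sorted(overlap["id"].unique()))

# keep="first" is the default, so the earlier (dissatisfaction) row survives
deduped = posts.drop_duplicates(subset=["id"])
print(deduped)
```

The `label` and `query` columns therefore describe the first query that found a post, not every query that matched it.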


In [22]:
posts = pd.read_csv("data/reddit_posts_uk_jobs.csv")
comments = pd.read_csv("data/reddit_comments_uk_jobs.csv")

print("Posts:", posts.shape)
print("Comments:", comments.shape)

Posts: (3718, 14)
Comments: (41318, 11)


In [23]:
import re, emoji, nltk, spacy
from nltk.corpus import stopwords

nltk.download("stopwords")
EN_STOP = set(stopwords.words("english"))
nlp = spacy.load("en_core_web_sm", exclude=["ner","parser","textcat"])

# cleaning helpers
def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = emoji.replace_emoji(text, replace=" ")   # remove emojis
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s']", " ", text)          # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()
    return text

def lemmatize(text):
    doc = nlp(text)
    return " ".join(
        tok.lemma_ for tok in doc
        if tok.lemma_ not in EN_STOP and tok.is_alpha
    )

# posts: combine title + selftext
posts["title"] = posts["title"].fillna("")
posts["selftext"] = posts["selftext"].fillna("")
posts["text_raw"] = (posts["title"] + " " + posts["selftext"]).str.strip()
posts["text_clean"] = posts["text_raw"].apply(clean_text)
posts["text_lemma"] = posts["text_clean"].apply(lemmatize)

# comments: use body
comments["body"] = comments["body"].fillna("")
comments["text_raw"] = comments["body"]
comments["text_clean"] = comments["text_raw"].apply(clean_text)
comments["text_lemma"] = comments["text_clean"].apply(lemmatize)

print("Cleaned & lemmatized posts and comments")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Cleaned & lemmatized posts and comments


In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load tokenizer + model + pipeline
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# We ask for all class scores so we can keep neg/neu/pos columns like VADER
clf = pipeline(
    task="sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    top_k=None,                  # return scores for all three classes (return_all_scores is deprecated)
    truncation=True,
    max_length=128,              # long posts are truncated to their first 128 tokens
)

def _batch_hf_sentiment(text_series: pd.Series, batch_size: int = 64) -> pd.DataFrame:
    """
    Run HF sentiment in batches over a pandas Series of text and return
    a DataFrame with columns: neg, neu, pos, sentiment (string label).
    """
    # Ensure strings (empty string for NaNs)
    text_series = text_series.fillna("").astype(str)

    # Run the pipeline on the full list (the pipeline batches internally)
    outputs = clf(list(text_series), batch_size=batch_size)

    rows = []
    for scores in outputs:
        # scores is a list like: [{'label': 'Negative', 'score': 0.05}, {'label': 'Neutral', ...}, {'label': 'Positive', ...}]
        label_to_score = {d["label"].lower(): float(d["score"]) for d in scores}
        # Argmax label (negative / neutral / positive)
        top_label = max(label_to_score, key=label_to_score.get)

        rows.append({
            "neg": label_to_score.get("negative"),
            "neu": label_to_score.get("neutral"),
            "pos": label_to_score.get("positive"),
            "sentiment": top_label
        })

    return pd.DataFrame(rows, index=text_series.index)

posts_scores = _batch_hf_sentiment(posts["text_clean"])
comments_scores = _batch_hf_sentiment(comments["text_clean"])

posts = pd.concat([posts, posts_scores], axis=1)
comments = pd.concat([comments, comments_scores], axis=1)

print(posts[["text_clean", "sentiment"]].head())

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]
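
The row-building step inside `_batch_hf_sentiment` can be checked without downloading the model. The sketch below applies the same logic to one hand-written result; the scores are made up for illustration and are not output from the classifier, and `scores_to_row` is a hypothetical name:

```python
def scores_to_row(scores):
    """Turn one pipeline result (a list of {'label', 'score'} dicts) into a flat row."""
    label_to_score = {d["label"].lower(): float(d["score"]) for d in scores}
    # The predicted sentiment is the label with the highest score
    top_label = max(label_to_score, key=label_to_score.get)
    return {
        "neg": label_to_score.get("negative"),
        "neu": label_to_score.get("neutral"),
        "pos": label_to_score.get("positive"),
        "sentiment": top_label,
    }

# Made-up scores for illustration only; not produced by the model
fake_output = [
    {"label": "negative", "score": 0.70},
    {"label": "neutral", "score": 0.20},
    {"label": "positive", "score": 0.10},
]
row = scores_to_row(fake_output)
print(row)
```

Because the lookup is by label name rather than by position, the order in which the pipeline returns the three classes does not matter.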

In [None]:
display(comments)

In [None]:
posts["source"] = "post"
comments["source"] = "comment"

In [None]:
# For posts → combine title + selftext
posts["title"] = posts["title"].fillna("")
posts["selftext"] = posts["selftext"].fillna("")
posts["text_raw"] = (posts["title"] + " " + posts["selftext"]).str.strip()

# For comments → use body
comments["body"] = comments["body"].fillna("")
comments["text_raw"] = comments["body"]

In [None]:
# Columns to keep. "compound" (a VADER-style score) is not produced by the
# RoBERTa cell above, so select only the columns that actually exist.
KEEP_COLS = [
    "id","subreddit","created_utc","author","text_raw","score",
    "query","label","source","neg","neu","pos","compound","sentiment","permalink"
]

posts_small = posts[[c for c in KEEP_COLS if c in posts.columns]]

# Comments: rename to match the post column names first
comments_small = comments.rename(columns={
    "comment_id": "id",
    "post_query": "query",
    "post_label": "label"
})
comments_small = comments_small[[c for c in KEEP_COLS if c in comments_small.columns]]


In [None]:
dataset = pd.concat([posts_small, comments_small], ignore_index=True)
print("Combined dataset shape:", dataset.shape)
dataset.head()

In [None]:
df_two = dataset.loc[:, dataset.columns == 'sentiment']

In [None]:
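# If sentiment was scored more than once in this session, `dataset` can hold
# several columns named "sentiment"; give them unique names for inspection
df_two = dataset.loc[:, dataset.columns == 'sentiment'].copy()
df_two.columns = [f"sentiment_{i+1}" for i in range(df_two.shape[1])]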

In [None]:
df_two

In [None]:
dataset.to_csv('dataset.csv', index=False)

In [None]:
dataset = dataset.drop(columns=['sentiment'])

In [None]:
# Keep the most recently added sentiment column (sentiment_2 when two exist)
dataset['sentiment'] = df_two.iloc[:, -1]

In [None]:
print(dataset['sentiment'].value_counts())

In [None]:
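import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.countplot(
    x="sentiment",
    hue="sentiment",             # recent seaborn expects a hue variable when a palette is given
    data=dataset,
    palette={"positive": "green", "negative": "red", "neutral": "blue"},
    order=["positive","negative", "neutral"],
    legend=False
)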

# Add percentages above bars
total = len(dataset)
for p in ax.patches:
    count = p.get_height()
    percentage = 100 * count / total
    ax.annotate(f"{percentage:.1f}%",
                (p.get_x() + p.get_width() / 2., count),
                ha="center", va="bottom", fontsize=11, color="black", xytext=(0, 3),
                textcoords="offset points")

# Labels & title
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution (%)")
plt.show()

In [None]:
texts = dataset["text_raw"].dropna().astype(str).tolist()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Convert text to bag-of-words
vectorizer = CountVectorizer(max_df=0.9, min_df=10, stop_words="english")
X = vectorizer.fit_transform(texts)

# Fit LDA model
lda = LatentDirichletAllocation(n_components=6, random_state=42)
lda.fit(X)

# Show top words per topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}: ", [terms[i] for i in topic.argsort()[:-11:-1]])

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_wordcloud(words):
    wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(words))
    plt.figure(figsize=(10,5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# For LDA
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[:-21:-1]]
    print(f"Topic {idx}: {top_words}")
    plot_wordcloud(top_words)

In [None]:
doc_topics = lda.transform(X).argmax(axis=1)
dataset["topic"] = doc_topics

In [None]:
# Sentiment per topic
sent_summary = dataset.groupby("topic")["sentiment"].value_counts(normalize=True).unstack().fillna(0)
print(sent_summary)

# Plot
sent_summary.plot(kind="bar", stacked=True, figsize=(10,6))
plt.title("Sentiment Distribution per Topic")
plt.ylabel("Proportion")
plt.show()

In [None]:
# bertopic
!pip install bertopic sentence-transformers umap-learn hdbscan

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, show_progress_bar=True)

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(texts, embeddings)

# Get summary
topic_info = topic_model.get_topic_info()
print(topic_info.head())

# Show top words from a topic
print(topic_model.get_topic(0))


In [None]:
# Topic hierarchy and intertopic distance maps
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=10)

In [None]:
topic_model.visualize_hierarchy()

In [None]:
#LDA

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = dataset["text_raw"].dropna().astype(str).tolist()

vectorizer = CountVectorizer(max_df=0.9, min_df=10, stop_words="english")
X = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=6, random_state=42)
doc_topics = lda.fit_transform(X)

# Assign each document its dominant topic
dataset["topic"] = doc_topics.argmax(axis=1)

# Print top words per topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}: ", [terms[i] for i in topic.argsort()[:-11:-1]])


In [None]:
# Count how many posts per (topic, sentiment)
sent_topic = dataset.groupby(["topic","sentiment"]).size().unstack(fill_value=0)

# Convert to percentages
sent_topic_pct = sent_topic.div(sent_topic.sum(axis=1), axis=0)

print(sent_topic)
print(sent_topic_pct)


In [None]:
import matplotlib.pyplot as plt

# Stacked bar chart (percentages)
sent_topic_pct.plot(kind="bar", stacked=True, figsize=(10,6),
                    color={"negative":"red", "neutral":"blue", "positive":"green"})

plt.title("Sentiment Distribution by Topic")
plt.ylabel("Proportion")
plt.xlabel("Topic")
plt.legend(title="Sentiment")
plt.show()

In [None]:
topic_labels = {
    0: "Low Pay & Wage Issues",
    1: "Toxic Management",
    2: "Work Stress & Burnout",
    3: "Career Uncertainty",
    4: "Work-Life Balance",
    5: "Job Insecurity"
}

dataset["topic_label"] = dataset["topic"].map(topic_labels)


In [None]:
dataset["topic_label"]

In [None]:
import matplotlib.pyplot as plt

# Group by topic label + sentiment
sent_topic = dataset.groupby(["topic_label","sentiment"]).size().unstack(fill_value=0)

# Convert to percentages
sent_topic_pct = sent_topic.div(sent_topic.sum(axis=1), axis=0) * 100

# Plot stacked bar
ax = sent_topic_pct.plot(
    kind="bar",
    stacked=True,
    figsize=(10,6),
    color={"negative":"red","neutral":"blue","positive":"green"}
)

# Add percentages on each bar segment
for c in ax.containers:
    labels = [f"{v.get_height():.1f}%" if v.get_height() > 0 else "" for v in c]
    ax.bar_label(c, labels=labels, label_type="center", fontsize=9, color="white")

plt.title("Sentiment Distribution by Theme")
plt.ylabel("Percentage")
plt.xlabel("Theme")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Sentiment")
plt.tight_layout()
plt.show()

### **CNN Sentiment Classifier**

In [None]:
import pandas as pd

df = dataset.copy()

# Prefer a cleaned column if you have one; else fall back to raw
TEXT_COL = "text_clean" if "text_clean" in df.columns else "text_raw"
df = df.dropna(subset=[TEXT_COL, "sentiment"]).copy()
df = df[df[TEXT_COL].str.strip().astype(bool)]

df["sentiment"] = df["sentiment"].str.lower().map({
    "negative":"negative", "neutral":"neutral", "positive":"positive"
})
df = df[df["sentiment"].isin(["negative","neutral","positive"])]

print(df[["sentiment"]].value_counts(normalize=True).mul(100).round(1))
print("Samples:", len(df))


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models, callbacks

# Encode labels
le = LabelEncoder()
y = le.fit_transform(df["sentiment"])  # 0..C-1

# Split
X_train, X_val, y_train, y_val = train_test_split(
    df[TEXT_COL].values, y, test_size=0.2, random_state=42, stratify=y
)

# Tokenize
MAX_WORDS = 40000
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)
Xtr = tokenizer.texts_to_sequences(X_train)
Xva = tokenizer.texts_to_sequences(X_val)

# Pad
MAXLEN = 200
Xtr = pad_sequences(Xtr, maxlen=MAXLEN, padding="post", truncating="post")
Xva = pad_sequences(Xva, maxlen=MAXLEN, padding="post", truncating="post")

# Class weights (to mitigate imbalance)
classes = np.unique(y_train)
cw = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, cw)}
print("Class weights:", class_weight, "Label map:", dict(zip(le.classes_, classes)))

# CNN model (1D conv)
EMB_DIM = 128
model = models.Sequential([
    layers.Embedding(input_dim=MAX_WORDS, output_dim=EMB_DIM),  # input_length is deprecated in Keras 3
    layers.Conv1D(256, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.3),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(len(le.classes_), activation="softmax")
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
es = callbacks.EarlyStopping(patience=3, restore_best_weights=True, monitor="val_accuracy")

hist = model.fit(
    Xtr, y_train,
    validation_data=(Xva, y_val),
    epochs=10,
    batch_size=128,
    class_weight=class_weight,
    callbacks=[es],
    verbose=1
)

# Evaluate
val_loss, val_acc = model.evaluate(Xva, y_val, verbose=0)
print(f"Validation accuracy: {val_acc:.3f}")

# Inference helper
def predict_sentiment(texts):
    seq = tokenizer.texts_to_sequences(texts)
    seq = pad_sequences(seq, maxlen=MAXLEN, padding="post", truncating="post")
    probs = model.predict(seq, verbose=0)
    preds = probs.argmax(axis=1)
    return le.inverse_transform(preds), probs

# Quick sanity check
print(predict_sentiment(["I love my job", "My boss is awful and I want to quit"])[0])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.metrics import classification_report

X = df[TEXT_COL].values
y = df["sentiment"].values

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

smote_clf = ImbPipeline(steps=[
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), max_features=100_000)),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=200, class_weight=None, n_jobs=-1))
])

smote_clf.fit(X_train, y_train)
y_pred = smote_clf.predict(X_val)
print(classification_report(y_val, y_pred, digits=3))

In [None]:
texts = df[TEXT_COL].astype(str).tolist()

use_bertopic = False
try:
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    use_bertopic = True
except ImportError:
    print("BERTopic not available; will use LDA + NMF.")

topic_labels = None
if use_bertopic:
    model_st = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model_st.encode(texts, show_progress_bar=True, normalize_embeddings=True)
    tm = BERTopic(calculate_probabilities=True, verbose=True, min_topic_size=30, nr_topics="auto")
    topics, probs = tm.fit_transform(texts, emb)
    df["topic_id"] = topics
    # Human-readable names (you can edit later)
    info = tm.get_topic_info()
    print(info.head())
else:
    # LDA (bow)
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation, NMF
    import numpy as np

    # LDA
    cv = CountVectorizer(stop_words="english", min_df=10, max_df=0.9)
    X_bow = cv.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=8, random_state=42)
    W = lda.fit_transform(X_bow)
    df["topic_id"] = W.argmax(axis=1)
    terms_bow = np.array(cv.get_feature_names_out())
    print("LDA topics (top words):")
    for k, comp in enumerate(lda.components_):
        print(k, terms_bow[comp.argsort()[:-11:-1]])

    # NMF (tf-idf)
    tfidf = TfidfVectorizer(stop_words="english", min_df=10, max_df=0.9)
    X_tfidf = tfidf.fit_transform(texts)
    nmf = NMF(n_components=8, random_state=42)
    H = nmf.fit_transform(X_tfidf)
    terms_tfidf = np.array(tfidf.get_feature_names_out())
    print("\nNMF topics (top words):")
    for k, comp in enumerate(nmf.components_):
        print(k, terms_tfidf[comp.argsort()[:-11:-1]])

In [None]:
# Create an initial mapping from top words you printed above (edit to fit your data)
# Example placeholders:
topic_map = {
    0: "Low Pay & Wage Issues",
    1: "Toxic Management",
    2: "Work Stress & Burnout",
    3: "Job Insecurity/Redundancy",
    4: "Work-Life Balance",
    5: "Scheduling & Hours",
    6: "Career Progression",
    7: "Remote/Flexible Work",
}
df["topic_label"] = df["topic_id"].map(topic_map).fillna("Other/Outlier")

In [None]:
import matplotlib.pyplot as plt

# Group to percentages
ct = (df.groupby(["topic_label","sentiment"])
        .size().unstack(fill_value=0))
pct = (ct.div(ct.sum(axis=1), axis=0) * 100).sort_index()

ax = pct.plot(kind="bar", stacked=True, figsize=(11,6),
              color={"negative":"red","neutral":"blue","positive":"green"})
for container in ax.containers:
    labels = [f"{h:.1f}%" if h>0 else "" for h in container.datavalues]
    ax.bar_label(container, labels=labels, label_type="center", fontsize=9, color="white")

plt.title("Sentiment distribution by topic (percentage)")
plt.ylabel("Percentage")
plt.xlabel("Topic")
plt.xticks(rotation=45, ha="right")
plt.legend(title="Sentiment")
plt.tight_layout()
plt.show()
