## **Data Gathering for Job Dissatisfaction in the UK Using the Reddit Archive (2020-2022 Dataset)**

#### **Import the necessary libraries**

In [1]:
import sys
import subprocess

# Install zstandard before importing it, so the import cannot fail on a fresh environment.
subprocess.check_call([sys.executable, "-m", "pip", "install", "zstandard"])

import zstandard as zstd
import json
from datetime import datetime, timezone
import pandas as pd
import re
import glob
import os

0

#### **Data Source**

The data was gathered from the **Reddit archive** available at [Eye.eu](http://eye.eu/).  
The files are provided in **`.zst`** (Zstandard) format, each holding the compressed Reddit data for a single subreddit.

A total of **10 subreddits** were downloaded, including both **submissions (posts)** and **comments**, to ensure a comprehensive view of discussions related to job dissatisfaction.  

The subreddits include:

- `UKJobs`  
- `AskUK`  
- `CasualUK`  
- `unitedkingdom`  
- `antiwork`  
- `WorkReform`  
- `careerguidance`  
- `AskHR`  
- `britishproblems`  
- `recruitinghell`  
- `WorkReformUK`  

Data from these subreddits was collected in full and later filtered to the years **2020 to 2022**.  
A keyword-based filtering process was then applied to extract posts and comments specifically related to **job dissatisfaction** and **workplace sentiment**.

###### To find posts and comments that talk about **job dissatisfaction**, we created a list of keywords that people often use when expressing frustration about their jobs. These keywords helped us filter through all the Reddit data collected from **2020 to 2022** and focus only on discussions that mention things like **burnout**, **bad management**, **low pay**, or simply **hating their job**.

In [2]:
DISSATISFACTION_TERMS = [
    "job dissatisfaction", "hate my job", "toxic workplace", "burnout", "overworked",
    "bad boss", "micromanagement", "low pay", "underpaid", "quit my job", "resign",
    "stress at work", "bullying at work", "zero hours", "unhappy at work",
    "stressful job", "poor management", "burnt out", "miserable at work",
    "dead-end job", "exploited at work", "no work-life balance",
    "hate going to work", "rejection", "really bad", "unreasonable",
    "management cuts benefits"
]

DISSATISFACTION_TERMS = [t.lower() for t in DISSATISFACTION_TERMS]
print(DISSATISFACTION_TERMS)

['job dissatisfaction', 'hate my job', 'toxic workplace', 'burnout', 'overworked', 'bad boss', 'micromanagement', 'low pay', 'underpaid', 'quit my job', 'resign', 'stress at work', 'bullying at work', 'zero hours', 'unhappy at work', 'stressful job', 'poor management', 'burnt out', 'miserable at work', 'dead-end job', 'exploited at work', 'no work-life balance', 'hate going to work', 'rejection', 'really bad', 'unreasonable', 'management cuts benefits']


##### Just like with job dissatisfaction, we also wanted to find posts and comments where people talk positively about their work. To do this, we created a list of keywords that capture feelings of **job satisfaction**, such as enjoying work, having a good boss, fair pay, or a supportive team. These keywords were used to filter Reddit data (2020–2022) for content that reflects **positive work experiences** and **employee satisfaction**.


In [3]:
SATISFACTION_TERMS = [
    "love my job", "happy at work", "good boss", "great team", "work life balance",
    "supportive manager", "flexible working", "fair pay"
]

SATISFACTION_TERMS = [t.lower() for t in SATISFACTION_TERMS]
print(SATISFACTION_TERMS)

['love my job', 'happy at work', 'good boss', 'great team', 'work life balance', 'supportive manager', 'flexible working', 'fair pay']


#### **Extracting the Year from Unix Timestamps**

##### Each Reddit post and comment in the dataset includes a **Unix timestamp** (the number of seconds since January 1, 1970). To make the data easier to analyze by year, we created a small helper function that converts each timestamp into its corresponding **UTC year**.

In [4]:
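from datetime import timezone

def get_year_from_utc(ts):
    """Convert Unix timestamp to a year (UTC)."""
    # datetime.utcfromtimestamp() is deprecated in recent Python versions;
    # the timezone-aware call below returns the same UTC year.
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).year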
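
As a quick check of the conversion, the sketch below tests timestamps that sit exactly on the edges of the 2020 to 2022 window. It is only an illustration: the name `year_from_utc` and the sample timestamps are our own, and it uses the timezone-aware `datetime.fromtimestamp(..., tz=timezone.utc)` call, which gives the same UTC year as the helper above.

```python
from datetime import datetime, timezone

def year_from_utc(ts):
    """Return the UTC calendar year for a Unix timestamp (int or integer string)."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).year

# Timestamps on the boundaries of the study window (all in UTC).
print(year_from_utc(1577836800))    # 2020-01-01 00:00:00 UTC
print(year_from_utc("1672531199"))  # 2022-12-31 23:59:59 UTC
print(year_from_utc(1672531200))    # 2023-01-01 00:00:00 UTC, outside the window
```

Working in UTC keeps the year assignment independent of the time zone of the machine running the notebook.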

#### **Sentiment Classification Function**

##### To identify whether a Reddit post or comment expresses **job satisfaction**, **job dissatisfaction**, or **neutral** sentiment, we created a simple keyword-based function called `classify_sentiment()`. This function checks each text entry against predefined keyword lists:
- `DISSATISFACTION_TERMS` = Negative experiences (e.g., “hate my job”, “toxic workplace”)  
- `SATISFACTION_TERMS` = Positive experiences (e.g., “love my job”, “great team”)  

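##### If none of the keywords are found, the function returns `"none"`. Dissatisfaction terms are checked first, so a text that contains keywords from both lists is labelled `"dissatisfaction"`.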

In [5]:
def classify_sentiment(text):
    """Label text by plain substring matching against the keyword lists."""
    text_l = (text or "").lower()

    # Dissatisfaction is checked first, so it wins when both kinds of term appear.
    if any(term in text_l for term in DISSATISFACTION_TERMS):
        return "dissatisfaction"

    # Satisfaction is checked second.
    if any(term in text_l for term in SATISFACTION_TERMS):
        return "satisfaction"

    return "none"
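
Plain substring matching can misfire: `"low pay"` is a substring of "allow payment", and "no work life balance" (written without a hyphen) misses the dissatisfaction term but hits the satisfaction term `"work life balance"`. The sketch below shows one possible stricter variant using word boundaries and treating spaces and hyphens as interchangeable. It is not the method used in this notebook; `compile_terms`, `classify_sentiment_strict`, and the shortened keyword lists are hypothetical names and data chosen for illustration.

```python
import re

# Shortened keyword lists, for illustration only.
DISSATISFACTION_TERMS = ["low pay", "hate my job", "no work-life balance"]
SATISFACTION_TERMS = ["love my job", "work life balance"]

def compile_terms(terms):
    """Build one case-insensitive pattern that matches any term as a whole phrase."""
    parts = []
    for term in terms:
        # Split on spaces/hyphens so "work-life" and "work life" are treated alike.
        words = re.split(r"[\s-]+", term)
        parts.append(r"[\s-]+".join(re.escape(w) for w in words))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

DISSAT_RE = compile_terms(DISSATISFACTION_TERMS)
SAT_RE = compile_terms(SATISFACTION_TERMS)

def classify_sentiment_strict(text):
    """Same labels as classify_sentiment(), but terms must match on word boundaries."""
    text = text or ""
    if DISSAT_RE.search(text):
        return "dissatisfaction"
    if SAT_RE.search(text):
        return "satisfaction"
    return "none"

# The naive substring test fires on an unrelated sentence...
print("low pay" in "they allow payment by card")
# ...while the boundary-aware variant does not.
print(classify_sentiment_strict("They allow payment by card"))
print(classify_sentiment_strict("I have no work life balance"))
```

Whether the stricter matching is preferable depends on how much recall the study is willing to trade for precision.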

#### **Defining Years of Interest**

##### Since the Reddit data was collected between **2020 and 2022**, we created a simple structure to help organize the posts and comments by year. This makes it easier to process, analyze, or export data for each year separately (for example, 2020 data → one file, 2021 data → another).

In [6]:
YEARS_OF_INTEREST = [2020, 2021, 2022]

year_buckets = {year: [] for year in YEARS_OF_INTEREST}
print(year_buckets)

{2020: [], 2021: [], 2022: []}


#### **Filtering UK-Related Content**

##### Because this project focuses on **UK-based job discussions**, we needed a way to identify whether each Reddit post or comment was **UK-related**.  
To do that, we created a small helper function called `is_uk_related()`.

This function checks:
1. If the post comes from a **UK-focused subreddit** (like `r/UKJobs` or `r/AskUK`), **or**  
2. If the post text mentions **UK regions or cities** (like “London”, “Scotland”, “Manchester”, etc.).


In [7]:
UK_REGEX = re.compile(
    r"\b(UK|United Kingdom|England|Scotland|Wales|Northern Ireland|London|Manchester|Birmingham|Leeds|Glasgow|Bristol|Liverpool)\b",
    re.IGNORECASE
)

def is_uk_related(text: str, subreddit: str) -> bool:
    """
    Return True if the post is considered UK-related.

    We treat a post as UK-related if:
    - it's from a known UK subreddit, OR
    - the text contains UK place/region keywords
    """
    # Guard against a missing (None) subreddit value before lower-casing it.
    if (subreddit or "").lower() in {"ukjobs", "askuk", "casualuk", "unitedkingdom", "britishproblems"}:
        return True
    return bool(UK_REGEX.search(text or ""))
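
A few illustrative calls show how the two rules interact. The block repeats the definitions so that it runs on its own, and the sample sentences are made up. The last call is a reminder that place-name matching can produce false positives, since "New England" is in the United States.

```python
import re

UK_REGEX = re.compile(
    r"\b(UK|United Kingdom|England|Scotland|Wales|Northern Ireland|London|"
    r"Manchester|Birmingham|Leeds|Glasgow|Bristol|Liverpool)\b",
    re.IGNORECASE,
)
UK_SUBREDDITS = {"ukjobs", "askuk", "casualuk", "unitedkingdom", "britishproblems"}

def is_uk_related(text, subreddit):
    """True if the subreddit is UK-focused or the text names a UK place."""
    if (subreddit or "").lower() in UK_SUBREDDITS:
        return True
    return bool(UK_REGEX.search(text or ""))

print(is_uk_related("", "AskUK"))                        # UK subreddit, text not needed
print(is_uk_related("I work in Leeds", "antiwork"))      # place name in the text
print(is_uk_related("My boss is awful", "antiwork"))     # neither rule applies
print(is_uk_related("The war in Ukraine", "AskHR"))      # "UK" must be a whole word
print(is_uk_related("Moved to New England", "AskHR"))    # matches "England" anyway
```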

#### **Processing Reddit Submission Files (`.zst`)**

To build the final dataset, we process each compressed Reddit file (`.zst`) and extract only the posts that match our criteria:
- Within the years **2020–2022**
- Related to the **UK**
- Expressing **job satisfaction** or **job dissatisfaction**

The function below reads a `.zst` file **line by line** (in a memory-efficient way), filters the data, and stores the results into `year_buckets`.


In [8]:
def process_submission_file(zst_path, year_buckets):
    """Stream one .zst dump and append every matching row to year_buckets."""
    dctx = zstd.ZstdDecompressor()

    with open(zst_path, "rb") as compressed_file:
        with dctx.stream_reader(compressed_file) as reader:
            buffer = b""

            while True:
                # Decompress 1 MiB at a time so the whole file never sits in memory.
                chunk = reader.read(2**20)
                buffer += chunk

                if chunk:
                    # The last piece may be an incomplete line; keep it for the next chunk.
                    *lines, buffer = buffer.split(b"\n")
                else:
                    # End of stream: flush a final line that has no trailing newline.
                    lines, buffer = [buffer], b""

                for line in lines:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        obj = json.loads(line)
                    except json.JSONDecodeError:
                        continue

                    created_utc = obj.get("created_utc")
                    if created_utc is None:
                        continue

                    year = get_year_from_utc(created_utc)
                    if year not in YEARS_OF_INTEREST:
                        continue

                    data_id = obj.get("id", "")
                    subreddit = obj.get("subreddit") or ""
                    # Posts keep their text in "selftext", comments in "body".
                    # The post title is not part of the text that is matched.
                    full_text = (obj.get("selftext") or obj.get("body") or "").strip()

                    if not is_uk_related(full_text, subreddit):
                        continue

                    sent = classify_sentiment(full_text)
                    if sent == "none":
                        continue

                    row = {
                        "id": data_id,
                        "created_utc": int(created_utc),
                        "year": year,
                        "subreddit": subreddit,
                        "sentiment": sent,
                    }

                    year_buckets[year].append(row)

                if not chunk:
                    break
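
The chunk-and-split buffering used above can be tried in isolation. The sketch below is a simplified stand-in, not the notebook's function: it reads from an in-memory `io.BytesIO` stream instead of a zstd reader, uses a deliberately tiny chunk size so that lines are split across reads, and the helper name `iter_json_lines` and the sample records are hypothetical.

```python
import io
import json

def iter_json_lines(reader, chunk_size=16):
    """Yield parsed objects from a byte stream of newline-delimited JSON."""
    buffer = b""
    while True:
        chunk = reader.read(chunk_size)
        buffer += chunk
        if chunk:
            # Complete lines go to `lines`; the unfinished tail stays in `buffer`.
            *lines, buffer = buffer.split(b"\n")
        else:
            # End of stream: treat whatever is left as the final line.
            lines, buffer = [buffer], b""
        for line in lines:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of stopping the stream
        if not chunk:
            break

raw = (
    b'{"id": "a1", "created_utc": 1577836800}\n'
    b"not json\n"
    b'{"id": "b2", "created_utc": 1609459200}'  # no trailing newline
)
ids = [obj["id"] for obj in iter_json_lines(io.BytesIO(raw))]
print(ids)
```

Both valid records are recovered even though each one spans several 16-byte reads and the last line has no newline after it.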

##### **Running the Processing Pipeline**

After defining the helper functions for filtering and classification,  
we use `run_process()` to **automate the processing** of all `.zst` files in the project directory.

This function finds every compressed Reddit file whose name contains the `types` argument (for example, `"submissions"` or `"comments"`)  
and processes the files one by one using the `process_submission_file()` function.


In [9]:
def run_process(types):
    """Process every .zst file in the working directory whose name contains `types`."""
    file_pattern = f"*{types}*.zst*"
    zst_files = glob.glob(os.path.join(os.getcwd(), file_pattern))

    for path in zst_files:
        print("Processing:", path)
        process_submission_file(path, year_buckets)
    print("Done streaming all files.")
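
To see which files such a wildcard pattern selects, the sketch below builds a throwaway directory with a few empty files and runs the same kind of `glob` call. The file names are invented for the example. Note that a re-downloaded copy such as `AskUK_comments (1).zst` also matches, so both copies would be processed.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files (hypothetical names).
    for name in [
        "AskUK_submissions.zst",
        "AskUK_comments.zst",
        "AskUK_comments (1).zst",
        "notes.txt",
    ]:
        open(os.path.join(tmp, name), "wb").close()

    pattern = os.path.join(tmp, "*comments*.zst")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```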

#### **Running the Data Processing Pipeline for Posts**

After setting up all helper functions and defining the keyword filters,  
the next step is to **run the pipeline** on all `.zst` Reddit files.

We start by processing all **submission** files (Reddit posts).

In [10]:
run_process("submissions")

Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\antiwork_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\AskHR_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\AskUK_submissions (1).zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\britishproblems_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\careerguidance_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\CasualUK_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\recruitinghell_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\UKJobs_submissions (1).zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\unitedkingdom_submissions.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\WorkReform_submissions.zst
Done streaming all files.


#### **Checking Collected Data by Year for Submissions (Posts)**

After running the data processing pipeline, it is important to verify how many posts were collected for each year (2020–2022). This quick check helps confirm that the filtering and extraction process worked correctly.

In [11]:
for y in YEARS_OF_INTEREST:
    print(y, "items collected:", len(year_buckets[y]))

2020 items collected: 317
2021 items collected: 589
2022 items collected: 996


#### **Saving Processed Results**

Once all Reddit data has been streamed, filtered, and organized into yearly buckets,  
the final step is to **save each year’s dataset** as a CSV file for further analysis.

The function `save_result()` loops through the collected data for each year,  
creates a DataFrame, and saves it into the `result` folder.

In [12]:
def save_result(filetype):
    """Write one CSV per year of interest into the `result` folder."""
    output_folder = "result"
    os.makedirs(output_folder, exist_ok=True)  # create the folder if it is missing

    for year in YEARS_OF_INTEREST:
        rows = year_buckets[year]
        if len(rows) == 0:
            print(f"Year {year}: no matches, skipping CSV.")
            continue
        df = pd.DataFrame(rows)

        # os.path.join picks the right path separator for the operating system.
        out_path = os.path.join(output_folder, f"subreddit_{year}_{filetype}.csv")
        df.to_csv(out_path, index=False, encoding="utf-8")
        print(f"Saved {len(df)} rows for {year} -> {out_path}")

##### **Saving the posts using the function that was created**

In [13]:
save_result("post")

Saved 317 rows for 2020 -> result\subreddit_2020_post.csv
Saved 589 rows for 2021 -> result\subreddit_2021_post.csv
Saved 996 rows for 2022 -> result\subreddit_2022_post.csv


#### **Running the Data Processing Pipeline for Comments**

In [14]:
run_process("comments")

Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\antiwork_comments (1).zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\AskHR_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\AskUK_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\britishproblems_comments (1).zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\britishproblems_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\careerguidance_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\CasualUK_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\recruitinghell_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\UKJobs_comments (2).zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\unitedkingdom_comments.zst
Processing: C:\users\923826\OneDrive - hull.ac.uk\Desktop\dammy\Dissertation\W

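#### **Checking Collected Data by Year for Comments**

Note that `year_buckets` is shared between runs and is not cleared after the submissions run, so the totals below include the 317, 589, and 996 posts collected earlier in addition to the comments.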

In [15]:
for y in YEARS_OF_INTEREST:
    print(y, "items collected:", len(year_buckets[y]))

2020 items collected: 14356
2021 items collected: 23216
2022 items collected: 38062
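
Because the yearly buckets are module-level lists that every run appends to, a comment-only count would need the buckets to be emptied before the comment files are processed. A minimal sketch of such a reset follows; `reset_buckets` is a hypothetical helper that is not part of the notebook, and the single sample row stands in for the posts gathered earlier.

```python
YEARS_OF_INTEREST = [2020, 2021, 2022]
year_buckets = {year: [] for year in YEARS_OF_INTEREST}

def reset_buckets(buckets):
    """Empty every yearly list in place so the next run starts from zero."""
    for rows in buckets.values():
        rows.clear()

# Pretend one post was collected during the submissions run.
year_buckets[2020].append({"id": "p1", "sentiment": "dissatisfaction"})
print({year: len(rows) for year, rows in year_buckets.items()})

# Clearing before the comments run keeps posts out of the comment totals.
reset_buckets(year_buckets)
print({year: len(rows) for year, rows in year_buckets.items()})
```

Clearing the lists in place, rather than rebinding `year_buckets` to a new dictionary, means any function that already holds a reference to the same dictionary sees the emptied buckets.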


#### **Saving the comments using the function that was created**

In [16]:
save_result("comment")

Saved 14356 rows for 2020 -> result\subreddit_2020_comment.csv
Saved 23216 rows for 2021 -> result\subreddit_2021_comment.csv
Saved 38062 rows for 2022 -> result\subreddit_2022_comment.csv
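
The log of the comment run lists both `britishproblems_comments (1).zst` and `britishproblems_comments.zst`. If these are two copies of the same dump (an assumption; the notebook does not say), the same comment could appear twice in the buckets. One way to guard against that is to keep only the first row per Reddit `id` before saving. The helper `dedupe_rows` below is hypothetical and the sample rows are invented.

```python
def dedupe_rows(rows):
    """Keep the first row seen for each Reddit id, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        unique.append(row)
    return unique

# Invented rows: "c1" appears twice, as it would if one file were read twice.
rows = [
    {"id": "c1", "year": 2021, "sentiment": "dissatisfaction"},
    {"id": "c2", "year": 2021, "sentiment": "satisfaction"},
    {"id": "c1", "year": 2021, "sentiment": "dissatisfaction"},
]
unique_rows = dedupe_rows(rows)
print(len(rows), "->", len(unique_rows))
```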
