## Import libraries

In [398]:
import os
import pandas as pd
import tarfile
from datetime import datetime
import string
import re

## Step 1: Load raw data and metadata + combine the datasets

## 🧩 Step 1: Data Loading

In this first step, we load the data required for our task.  
To begin, we use a **small dataset (10K samples)** to test and validate the workflow.

Depending on the task’s needs, we will later expand the experiments to include:
- **Medium dataset:** 100K samples  
- **Large dataset:** 1M samples  

This scaling will allow us to compare results across different dataset sizes.

---

## 🌍 Selected Languages

We include three languages with different word order characteristics:

| Language | Word Order Type | Years Observed |
|-----------|------------------|----------------|
| **English** | Fixed word order | 2018, 2019, 2020, 2023, 2024 |
| **German** | Non-fixed word order | 2020, 2021, 2022, 2023, 2024 |
| **Russian** | Non-fixed word order | 2020, 2021, 2022, 2023, 2024 |

Each language dataset spans **five years**, depending on data availability.

---

## 🗓️ Metadata and Critical Year Handling

We also load **metadata** that links each sentence to its **source** (as described in the `README.md` file).  
The source information includes the **date** of each sentence, which is particularly important for handling the **critical year of 2022** — the year when **ChatGPT** became available.


In [399]:
base_dir = "data/raw"
languages = ["english", "german", "russian"]
data = {lang: [] for lang in languages}

for root, _, files in os.walk(base_dir):
    lang = next((l for l in languages if l in root.lower().split(os.sep)), None)
    if not lang:
        continue

    for f in files:
        if not f.endswith(".tar"):
            continue

        tar_path = os.path.join(root, f)
        with tarfile.open(tar_path, "r") as tar:
            # List members in the archive
            members = tar.getnames()

            # Identify target files inside
            sentence_file = next((m for m in members if "sentence" in m.lower()), None)
            source_file   = next((m for m in members if "source" in m.lower()), None)
            inv_so_file   = next((m for m in members if "inv_so" in m.lower()), None)

            # Extract & read content directly from memory
            if all([sentence_file, source_file, inv_so_file]):
                def read_text(member_name):
                    with tar.extractfile(member_name) as fobj:
                        return fobj.read().decode("utf-8").splitlines()

                sentences = read_text(sentence_file)
                sources   = read_text(source_file)
                inv_sos   = read_text(inv_so_file)

                data[lang].append({
                    "archive": f,
                    "sentences": sentences,
                    "sources": sources,
                    "inv_so": inv_sos
                })


Lets explore what has been loaded.

In [400]:
print("\n📜 Loaded internal files:")

for lang, entries in data.items():
    print(f"\n🌍 {lang.capitalize()}:")
    if not entries:
        print("  (no data loaded)")
        continue

    for e in entries:
        print(f"  📦 From archive: {e['archive']}")
        print(f"    - sentences ({len(e['sentences'])} lines)")
        print(f"    - sources   ({len(e['sources'])} lines)")
        print(f"    - inv_so    ({len(e['inv_so'])} lines)")



📜 Loaded internal files:

🌍 English:
  📦 From archive: eng_news_2020_10K.tar
    - sentences (10000 lines)
    - sources   (9939 lines)
    - inv_so    (10000 lines)
  📦 From archive: eng_news_2023_10K.tar
    - sentences (10000 lines)
    - sources   (9968 lines)
    - inv_so    (10000 lines)
  📦 From archive: eng_news_2018_10K.tar
    - sentences (10000 lines)
    - sources   (9960 lines)
    - inv_so    (10000 lines)
  📦 From archive: eng_news_2019_10K.tar
    - sentences (10000 lines)
    - sources   (9929 lines)
    - inv_so    (10000 lines)
  📦 From archive: eng_news_2024_10K.tar
    - sentences (10000 lines)
    - sources   (9972 lines)
    - inv_so    (10000 lines)

🌍 German:
  📦 From archive: deu_news_2024_10K.tar
    - sentences (10000 lines)
    - sources   (9948 lines)
    - inv_so    (10000 lines)
  📦 From archive: deu_news_2020_10K.tar
    - sentences (10000 lines)
    - sources   (9952 lines)
    - inv_so    (10000 lines)
  📦 From archive: deu_news_2021_10K.tar
    - se

Lets explore in which format are sentences structured.

In [401]:
print("\n🧾 First 5 sentences from each 'sentences' file:")

for lang, entries in data.items():
    print(f"\n🌍 {lang.capitalize()}:")
    if not entries:
        print("  (no data loaded)")
        continue

    for e in entries:
        print(f"\n  📦 Archive: {e['archive']}")
        sentences = e["sentences"]

        if not sentences:
            print("    (no sentences found)")
            continue

        # Print up to 5 sentences
        for i, line in enumerate(sentences[:5], start=1):
            print(f"    {i:>2}. {line}")



🧾 First 5 sentences from each 'sentences' file:

🌍 English:

  📦 Archive: eng_news_2020_10K.tar
     1. 1	“18 months ago, we expelled a boy at Nations for selling drugs in six schools.
     2. 2	” 41 Nigeria Centre for Disease Control (NCDC) staff and 17 World Health Organisation (WHO) staff are deployed at the moment to support the Kano response.
     3. 3	⏰8.00pm ⚽️Liverpool v Arsenal Watch the match live at The Arch on six screens with surround sound commentary!
     4. 4	A 13-year veteran of the department, he worked his way up the ranks of the department starting as an ambulance paramedic at Station 49, then going to the SFFD Academy and graduating as a paramedic and firefighter.
     5. 5	A 1975 class ring from Robert E. Lee High School in Houston was found near the bones.

  📦 Archive: eng_news_2023_10K.tar
     1. 1	£1 must be staked on Willie Mullins to have a winning horse in the Gold Cup.
     2. 2	£27 million facility at Altens was built using the latest technology at the 

We want to replace row_id for each sentence with date.

In [402]:
for lang, entries in data.items():
    for e in entries:
        # Build mappings for quick lookup
        inv_map = {}      # sentence_id -> source_id
        source_map = {}   # source_id -> date

        # Parse inv_so file
        for line in e["inv_so"]:
            parts = line.strip().split()
            if len(parts) >= 2:
                source_id, sentence_id = parts[0], parts[1]
                inv_map[sentence_id] = source_id

        # Parse sources file
        for line in e["sources"]:
            parts = line.strip().split()
            if len(parts) >= 3:
                source_id, _, date = parts[0], parts[1], parts[2]
                source_map[source_id] = date

        # Replace sentence_id with corresponding date (tab-based split)
        new_sentences = []
        for line in e["sentences"]:
            parts = line.strip().split("\t", 1)  # split once by tab
            if len(parts) != 2:
                new_sentences.append(line)
                continue

            sentence_id, text = parts
            source_id = inv_map.get(sentence_id)
            date = source_map.get(source_id, "UNKNOWN_DATE")

            # Replace the numeric ID with the date
            new_sentences.append(f"{date}\t{text}")

        # Store the updated sentences back
        e["sentences"] = new_sentences

Now, lets print our sentences again, to check if we joined the data correctly.

In [403]:
for lang, entries in data.items():
    print(f"\n🌍 {lang.capitalize()}:")
    if not entries:
        print("  (no data loaded)")
        continue

    for e in entries:
        print(f"\n  📦 Archive: {e['archive']}")
        sentences = e["sentences"]

        if not sentences:
            print("    (no sentences found)")
            continue

        for i, line in enumerate(sentences[:5], start=1):
            print(f"    {i:>2}. {line}")



🌍 English:

  📦 Archive: eng_news_2020_10K.tar
     1. 2020-10-19	“18 months ago, we expelled a boy at Nations for selling drugs in six schools.
     2. 2020-05-05	” 41 Nigeria Centre for Disease Control (NCDC) staff and 17 World Health Organisation (WHO) staff are deployed at the moment to support the Kano response.
     3. 2020-09-28	⏰8.00pm ⚽️Liverpool v Arsenal Watch the match live at The Arch on six screens with surround sound commentary!
     4. 2020-10-11	A 13-year veteran of the department, he worked his way up the ranks of the department starting as an ambulance paramedic at Station 49, then going to the SFFD Academy and graduating as a paramedic and firefighter.
     5. 2020-11-21	A 1975 class ring from Robert E. Lee High School in Houston was found near the bones.

  📦 Archive: eng_news_2023_10K.tar
     1. 2023-03-11	£1 must be staked on Willie Mullins to have a winning horse in the Gold Cup.
     2. 2023-07-09	£27 million facility at Altens was built using the latest tech

Lets join the data into one data set per language, and then split it into 2 datasets, one for period before chatgpt release, and one for the after period.

In [404]:
cutoff = datetime(2022, 11, 30)

split_datasets = {}

for lang, entries in data.items():
    # combine all SENTENCES (not sources)
    all_sentences = []
    for e in entries:
        all_sentences.extend(e["sentences"])

    before, after = [], []
    for line in all_sentences:
        # extract the date at the start of the line
        match = re.match(r"^(\S+)\s+(.*)", line.strip())
        if not match:
            continue
        date_str, sentence_text = match.groups()

        try:
            date_obj = datetime.strptime(date_str, "%Y-%m-%d")
        except ValueError:
            continue  # skip malformed or unknown dates

        if date_obj < cutoff:
            before.append(sentence_text)
        else:
            after.append(sentence_text)

    split_datasets[lang] = {
        "before_2022_11_30": before,
        "after_2022_11_30": after
    }

Lets explore how dataset is split for each of the languages.

In [405]:
print("\n📊 Sources split summary (before & after 30 Nov 2022):\n")
print(f"{'Language':<12} {'Before (#)':>12} {'Before (%)':>12} {'After (#)':>12} {'After (%)':>12} {'Total':>8}")
print("-" * 70)

for lang, splits in split_datasets.items():
    before_count = len(splits["before_2022_11_30"])
    after_count = len(splits["after_2022_11_30"])
    total = before_count + after_count

    if total == 0:
        before_pct = after_pct = 0.0
    else:
        before_pct = (before_count / total) * 100
        after_pct = (after_count / total) * 100

    print(f"{lang.capitalize():<12} "
          f"{before_count:>11} {before_pct:>11.2f}% "
          f"{after_count:>11} {after_pct:>11.2f}% "
          f"{total:>8}")



📊 Sources split summary (before & after 30 Nov 2022):

Language       Before (#)   Before (%)    After (#)    After (%)    Total
----------------------------------------------------------------------
English            30000       60.00%       20000       40.00%    50000
German             29164       58.33%       20836       41.67%    50000
Russian            29127       58.25%       20873       41.75%    50000


In [406]:
#for lang, splits in split_datasets.items():
#    for key, lines in splits.items():
#        new_lines = []
#        for line in lines:
#            # remove any leading date pattern like YYYY-MM-DD
#            cleaned = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", line.strip())
#            new_lines.append(cleaned)
#        splits[key] = new_lines

Before we go into actual pre-processing, lets do final check of what the data looks like.

In [407]:
print("\n🧾 Preview: first 5 rows for each split per language\n")

for lang, splits in split_datasets.items():
    print(f"\n🌍 {lang.capitalize()}")

    for split_name, lines in splits.items():
        print(f"  📂 {split_name} (total: {len(lines)}):")

        if not lines:
            print("     (no data)")
            continue

        for i, line in enumerate(lines[:5], start=1):
            print(f"     {i:>2}. {line}")

        print()  # blank line between splits



🧾 Preview: first 5 rows for each split per language


🌍 English
  📂 before_2022_11_30 (total: 30000):
      1. “18 months ago, we expelled a boy at Nations for selling drugs in six schools.
      2. ” 41 Nigeria Centre for Disease Control (NCDC) staff and 17 World Health Organisation (WHO) staff are deployed at the moment to support the Kano response.
      3. ⏰8.00pm ⚽️Liverpool v Arsenal Watch the match live at The Arch on six screens with surround sound commentary!
      4. A 13-year veteran of the department, he worked his way up the ranks of the department starting as an ambulance paramedic at Station 49, then going to the SFFD Academy and graduating as a paramedic and firefighter.
      5. A 1975 class ring from Robert E. Lee High School in Houston was found near the bones.

  📂 after_2022_11_30 (total: 20000):
      1. £1 must be staked on Willie Mullins to have a winning horse in the Gold Cup.
      2. £27 million facility at Altens was built using the latest technology at the

Nice!

## Step 2: Basic pre-processing

In [408]:
def clean_text(sentence):
    s = sentence.lower()
    s = re.sub(r"http\S+|www\.\S+|@\w+|#\w+", "", s)
    s = s.replace("“", '"').replace("”", '"').replace("„", '"').replace("«", '"').replace("»", '"')
    s = s.replace("—", "-").replace("–", "-")

    # fallback: remove common emoji/symbol ranges (not perfect, but helps)
    s = re.sub(r'[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF\U00002600-\U000027BF]+', ' ', s)

    # remove non-printable/control characters
    s = re.sub(r'[\x00-\x1f\x7f-\x9f]', ' ', s)

    # collapse repeated punctuation
    s = re.sub(r"([!?.,;:])\1+", r"\1", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

In [409]:
for lang, splits in split_datasets.items():
    for split_name, lines in splits.items():
        splits[split_name] = [clean_text(line) for line in lines]


In [410]:
print("\n🧽 Refined cleaned sentences (first 5 per split)\n")

for lang, splits in cleaned_datasets.items():
    print(f"\n🌍 {lang.capitalize()}")
    for split_name, lines in splits.items():
        print(f"  📂 {split_name} (total: {len(lines)}):")
        for i, line in enumerate(lines[:5], start=1):
            print(f"     {i:>2}. {line}")
        print()



🧽 Refined cleaned sentences (first 5 per split)


🌍 English
  📂 before_2022_11_30 (total: 30000):
      1. 18 months ago, we expelled a boy at nations for selling drugs in six schools.
      2. 41 nigeria centre for disease control ncdc staff and 17 world health organisation who staff are deployed at the moment to support the kano response.
      3. 00pm liverpool v arsenal watch the match live at the arch on six screens with surround sound commentary!
      4. a 13-year veteran of the department, he worked his way up the ranks of the department starting as an ambulance paramedic at station 49, then going to the sffd academy and graduating as a paramedic and firefighter.
      5. a 1975 class ring from robert e. lee high school in houston was found near the bones.

  📂 after_2022_11_30 (total: 20000):
      1. 1 must be staked on willie mullins to have a winning horse in the gold cup.
      2. 27 million facility at altens was built using the latest technology at the time.
      3. 40