# Planning the Gold Layer

**goal**

1. make sure silver is not cooked (missing key data, duplicates, extra_fields not flattened properly, etc.)

2. create the features for downstream clustering and analysis (figure out what I want first, and run dummy processes before I do the full 2M rows)

3. Embeddings for sure, NER FEATURES (PERSON / ORG / GPE / MONEY), STRUCTURAL FEATURES (word_count, avg_sentence_length), PUBLICATION CATEGORY (publication_group_id), LEXICON SCORES (econ_score for example), TEMPORAL FEATURES (is_weekend, pre_market, during_market, post_market, when possible to calculate).

4. Write to gold delta table
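
The temporal features in the list are the only ones not prototyped in the cells that follow, so here is a rough sketch of how they could be derived. It assumes a regular 09:30 to 16:00 session, timestamps already in exchange-local time, and no holiday calendar; none of that is confirmed anywhere in this notebook. The `has_time` flag is a hypothetical stand-in for whatever `time_precision` ends up encoding.

```python
from datetime import datetime, time

# Assumption: regular US equity session hours, timestamps in exchange-local time.
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def temporal_features(ts: datetime, has_time: bool = True) -> dict:
    """Return is_weekend plus a coarse market-session label for one timestamp."""
    is_weekend = ts.weekday() >= 5  # Monday=0 ... Saturday=5, Sunday=6
    if not has_time:
        session = None  # date-only rows: the session cannot be calculated
    elif is_weekend:
        session = "closed"
    elif ts.time() < MARKET_OPEN:
        session = "pre_market"
    elif ts.time() < MARKET_CLOSE:
        session = "during_market"
    else:
        session = "post_market"
    return {"is_weekend": is_weekend, "market_session": session}


print(temporal_features(datetime(2016, 9, 13, 10, 58, 22)))
```

Keeping the session as one categorical label (rather than three booleans) makes it easy to one-hot later; exchange holidays would still be labelled as trading days under this sketch.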

In [1]:
#getting spark fired up
from pathlib import Path
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, trim, to_date, to_timestamp,
    current_timestamp, length, from_json
)
from pyspark.sql.types import *
from delta import configure_spark_with_delta_pip

def build_spark():
    builder = (
        SparkSession.builder
        .appName("gold_transform")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.driver.memory", "3g")
        .config("spark.executor.memory", "3g")
        .config("spark.sql.shuffle.partitions", "16")
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()

spark = build_spark()

# gold notebook base dir
BASE_DIR = Path.cwd()  # /pipelines/gold

# silver input (one level up + silver folder)
SILVER_DELTA = BASE_DIR.parent / "silver" / "delta_news_silver"

# gold output (current directory)
GOLD_DELTA = BASE_DIR / "delta_news_gold"

print("Silver Delta path:", SILVER_DELTA)
print("Gold Delta path:", GOLD_DELTA)

#loading silver
df_silver = spark.read.format("delta").load(str(SILVER_DELTA))

print("\n Silver LOADED")
df_silver.printSchema()

print("\n Silver preview (5 rows):")
df_silver.show(5, truncate=False)



25/12/02 14:14:59 WARN Utils: Your hostname, david-ThinkPad-T490 resolves to a loopback address: 127.0.1.1; using 172.16.0.186 instead (on interface wlp0s20f3)
25/12/02 14:14:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/david/School/CapStone/.venv/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/david/.ivy2/cache
The jars for the packages stored in: /home/david/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-55f0218a-5364-4b6e-a0ea-9a81283771f6;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.1.0 in central
	found io.delta#delta-storage;3.1.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 802ms :: artifacts dl 32ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.1.0 from central in [default]
	io.delta#delta-storage;3.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0  

Silver Delta path: /home/david/School/CapStone/pipelines/silver/delta_news_silver
Gold Delta path: /home/david/School/CapStone/pipelines/gold/delta_news_gold


25/12/02 14:15:22 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                


 Silver LOADED
root
 |-- date: timestamp (nullable = true)
 |-- text: string (nullable = true)
 |-- ingestion_ts: timestamp (nullable = true)
 |-- source_file: string (nullable = true)
 |-- publication: string (nullable = true)
 |-- author: string (nullable = true)
 |-- url: string (nullable = true)
 |-- text_type: string (nullable = true)
 |-- time_precision: string (nullable = true)
 |-- dataset_source: string (nullable = true)
 |-- dataset: string (nullable = true)
 |-- source: string (nullable = true)
 |-- raw_type: string (nullable = true)
 |-- tz_hint: string (nullable = true)
 |-- date_raw: timestamp (nullable = true)
 |-- date_trading: timestamp (nullable = true)
 |-- anchor_policy: string (nullable = true)
 |-- len_text: integer (nullable = true)
 |-- silver_ingestion_ts: timestamp (nullable = true)


 Silver preview (5 rows):


25/12/02 14:15:31 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


                                                                                

In [5]:
from pyspark.sql.functions import col, length, sum as fsum, count, countDistinct

MIN_TEXT_LEN = 50
total = df_silver.count()

# -----------------------------
# Missing-value metrics
# -----------------------------
missing_df = (
    df_silver.select(
        fsum(col("date").isNull().cast("int")).alias("missing_date"),
        fsum(
            (col("text").isNull() | (length(col("text")) < MIN_TEXT_LEN))
            .cast("int")
        ).alias("missing_text")
    )
    .withColumn("pct_missing_date", col("missing_date") / total)
    .withColumn("pct_missing_text", col("missing_text") / total)
)


# -----------------------------
# Duplicate detection
# -----------------------------
dup_key = ["date", "text"]   # adjust if needed

duplicate_groups = (
    df_silver.groupBy(dup_key)
             .count()
             .where(col("count") > 1)
)

duplicate_count = duplicate_groups.count()
duplicate_pct = duplicate_count / total

duplicate_examples = duplicate_groups.orderBy(col("count").desc()).limit(5)


# -----------------------------
# Display results
# -----------------------------
print("=== Missing value summary ===")
missing_df.show(truncate=False)
print("Total rows:", total)

print("\n=== Duplicate summary ===")
print("Duplicate groups:", duplicate_count)
print("Percent duplicate groups:", duplicate_pct)

print("\n=== Example duplicate groups ===")
duplicate_examples.show(truncate=False)


                                                                                

=== Missing value summary ===


                                                                                

+------------+------------+----------------+----------------+
|missing_date|missing_text|pct_missing_date|pct_missing_text|
+------------+------------+----------------+----------------+
|0           |0           |0.0             |0.0             |
+------------+------------+----------------+----------------+

Total rows: 1954081

=== Duplicate summary ===
Duplicate groups: 0
Percent duplicate groups: 0.0

=== Example duplicate groups ===




+----+----+-----+
|date|text|count|
+----+----+-----+
+----+----+-----+



                                                                                

**silver looks about right...**
- proper data types
- extra_fields flattened fine
- core fields: no missing values / no empty strings
- no duplicates

### Gold Features I want to make

**Embeddings**
- reduce with PCA

**Lexicon scores using our 6 lexicon sets** 

1. *Loughran–McDonald (finance, econ, risk)*
    
2. *General Inquirer (politics, legal, institutional)*
    
3. *ACLED + UCDP + CrisisLex (war, conflict, crisis)*
    
4. *IPCC + NOAA climate/environment terms*
    
5. *MeSH + WHO (health & disease)*
    
6. *NRC Emotion Lexicon (fear, anger, etc. as event markers)*

**NER**

NER model to generate:

- num_PERSON
    
- num_ORG
    
- num_GPE
    
- num_LOC
    
- num_MONEY

**STRUCTURAL FEATURES (word_count, avg_sentence_length)**
- trivial to make

-> write to gold delta


So the next few cells are just me testing this out before I make a full gold script that works through the entire 2M-row dataset.
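
For the full run, the texts will not fit through the model in one call, so the eventual script will probably need some chunking. A minimal sketch of that idea, assuming the encoder is passed in as a plain function (`encode_fn` and `chunk_size` are hypothetical names, not part of any library):

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def embed_in_chunks(texts, encode_fn, chunk_size=1000):
    """Clean each chunk, apply encode_fn to it, and yield (chunk_index, vectors)."""
    for i, chunk in enumerate(batched(texts, chunk_size)):
        # Same cleaning rule as the dummy cells: non-strings become empty strings.
        cleaned = [t.strip() if isinstance(t, str) else "" for t in chunk]
        yield i, encode_fn(cleaned)


# Stand-in encoder so the sketch runs without a model: text length as the "vector".
for idx, vecs in embed_in_chunks(["a ", None, "bcd"], lambda xs: [len(x) for x in xs], chunk_size=2):
    print(idx, vecs)
```

With the real model, `encode_fn` could be something like `lambda xs: model.encode(xs, normalize_embeddings=True)`, and each yielded chunk could be appended to storage before the next one is computed.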

### **dummy embeddings process**

In [8]:
# ----------------------------------------
# 0. SAMPLE A FEW ROWS SAFELY FROM SPARK
# ----------------------------------------
import pandas as pd

# select only simple columns
cols = ["date", "text"]

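# take(10) grabs the first 10 rows Spark returns (cheap, but not a random sample)
rows = df_silver.select(*cols).take(10)

# convert Row objects to dicts so pandas gets plain Python types
pdf = pd.DataFrame([r.asDict() for r in rows])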

print(pdf)


# ----------------------------------------
# 1. IMPORTS
# ----------------------------------------
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


# ----------------------------------------
# 2. LOAD EMBEDDING MODEL
# ----------------------------------------
model = SentenceTransformer("intfloat/e5-large-v2")


# ----------------------------------------
# 3. EMBEDDING FUNCTION
# ----------------------------------------
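def embed_texts(texts, model):
    # NOTE: the E5 model card recommends prefixing each input with "query: " or
    # "passage: "; no prefix is added here, which may affect embedding quality.
    cleaned = [t.strip() if isinstance(t, str) else "" for t in texts]
    return model.encode(cleaned, normalize_embeddings=True)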


# ----------------------------------------
# 4. GENERATE RAW EMBEDDINGS (1024-d)
# ----------------------------------------
embeddings = embed_texts(pdf["text"], model)
embeddings = np.vstack(embeddings)

print("Raw embeddings shape:", embeddings.shape)


# ----------------------------------------
# 5. SCALE + PCA REDUCE
# NOTE: tiny sample → MUST use small n_components (≤ n_samples-1)
# ----------------------------------------
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(embeddings)

# choose safe number of components for dummy test
n_components = min(5, embeddings.shape[0] - 1)

pca = PCA(n_components=n_components, random_state=42)
emb_reduced = pca.fit_transform(emb_scaled)

print("Reduced embedding shape:", emb_reduced.shape)


# ----------------------------------------
# 6. ATTACH EMBEDDINGS BACK TO PANDAS
# ----------------------------------------
pdf["embedding_1024"] = list(embeddings)
pdf["embedding_reduced"] = list(emb_reduced)


# ----------------------------------------
# 7. PREVIEW
# ----------------------------------------
print("\nExample reduced embedding (first 10 dims):")
print(pdf["embedding_reduced"].iloc[0][:10])

print("\nPreview DataFrame:")
print(pdf[["text", "embedding_reduced"]].head())


                 date                                               text
0 2016-09-13 10:58:22  iOS 10 review: the coming of age of apps – Tec...
1 2016-09-13 11:01:23  Market Falls After Forecast of Weaker Demand f...
2 2016-09-13 11:02:08  Apple just released tvOS 10 and here’s what’s ...
3 2016-09-13 11:12:20  Pneumonia, Polyps and Gunshots: A Short Histor...
4 2016-09-13 11:15:08  Back in Los Angeles, Rams Lose in a Rare N.F.L...
5 2016-09-13 11:25:48  Big Eats: Pete Wells on Food, Restaurants and ...
6 2016-09-13 11:30:02  Here are the three biggest changes coming with...
7 2016-09-13 11:36:00  The Best Pairing for Indian Food? It’s Not Bee...
8 2016-09-13 12:15:33  Fetch Robotics CEO Melonee Wise welcomes our n...
9 2016-09-13 12:17:19  Defense Department reaffirms its commitment to...
Raw embeddings shape: (10, 1024)
Reduced embedding shape: (10, 5)

Example reduced embedding (first 10 dims):
[16.585844   5.5314584  3.672019  -4.874273   1.889009 ]

Preview DataFrame:
          

### **dummy lexicon process**

In [None]:
import pandas as pd




from pyspark.sql.functions import rand

cols = ["date", "text"]

rows = (
    df_silver
        .select(*cols)
        .orderBy(rand())       # random shuffle on the cluster
        .limit(30)             # pick 30 random rows
        .collect()             # safe for small sample
)

pdf = pd.DataFrame([r.asDict() for r in rows])
print(pdf.head())




# ---------------------------
# LM lexicon
# ---------------------------
lm = pd.read_csv("lexicons/Loughran-McDonald_MasterDictionary_1993-2024.csv")
lm["Word"] = lm["Word"].str.lower()

LEX_LM_POS = set(lm[lm["Positive"] > 0]["Word"])
LEX_LM_NEG = set(lm[lm["Negative"] > 0]["Word"])
LEX_LM_UNC = set(lm[lm["Uncertainty"] > 0]["Word"])
LEX_LM_LIT = set(lm[lm["Litigious"] > 0]["Word"])
LEX_LM_CON = set(lm[lm["Constraining"] > 0]["Word"])


# ---------------------------
# GI lexicon
# ---------------------------
gi = pd.read_excel("lexicons/inquireraugmented.xls")
gi["Entry"] = gi["Entry"].str.lower().str.strip()

# A GI category is TRUE if the cell is NOT empty/NaN
LEX_GI_ECON  = set(gi.loc[gi["Econ@"].notna(),  "Entry"])
LEX_GI_POLIT = set(gi.loc[gi["Polit@"].notna(), "Entry"])
LEX_GI_LEGAL = set(gi.loc[gi["Legal"].notna(),  "Entry"])

print("GI Econ words:", len(LEX_GI_ECON))
print("GI Polit words:", len(LEX_GI_POLIT))
print("GI Legal words:", len(LEX_GI_LEGAL))

# ----------------------------------------
# 2. SIMPLE TOKENIZER
# ----------------------------------------
def tokenize(text):
    if not isinstance(text, str):
        return []
    return text.lower().split()


# ----------------------------------------
# 3. LEXICON SCORING FUNCTION
# ----------------------------------------
def lexicon_features(tokens):
    # deduplicate tokens: each feature below counts DISTINCT matching words,
    # not total occurrences
    tset = set(tokens)

    return {
        # LM categories
        "lm_pos": len(tset & LEX_LM_POS),
        "lm_neg": len(tset & LEX_LM_NEG),
        "lm_unc": len(tset & LEX_LM_UNC),
        "lm_lit": len(tset & LEX_LM_LIT),
        "lm_con": len(tset & LEX_LM_CON),

        # GI categories
        "gi_econ": len(tset & LEX_GI_ECON),
        "gi_polit": len(tset & LEX_GI_POLIT),
        "gi_legal": len(tset & LEX_GI_LEGAL),
    }


# ----------------------------------------
# 4. APPLY TO SAMPLE PDF
# ----------------------------------------
lex_rows = []

for text in pdf["text"]:
    tokens = tokenize(text)
    feats = lexicon_features(tokens)
    lex_rows.append(feats)

lex_df = pd.DataFrame(lex_rows)


# ----------------------------------------
# 5. MERGE FEATURES BACK INTO PDF
# ----------------------------------------
pdf = pd.concat([pdf, lex_df], axis=1)


# ----------------------------------------
# 6. PREVIEW
# ----------------------------------------
print("\nLexicon features preview:")
print(pdf.head())

print("\nExample feature vector:")
print(pdf.iloc[0][["lm_pos","lm_neg","lm_unc","lm_lit","lm_con",
                  "gi_econ","gi_polit","gi_legal"]])


                                                                                

                 date                                               text
0 2016-01-19 00:00:00  Ample storage space for oil may limit price mo...
1 2017-04-26 01:00:00  BRIEF-Vail Resorts provide fiscal year 2017 gu...
2 2019-01-13 00:00:00  Scoop: Trump dressed down Mulvaney in front of...
3 2018-05-10 01:00:00  Russia's Lavrov says Nord Stream 2 gas pipelin...
4 2018-05-10 13:30:01  Equifax denied passport numbers were involved ...
GI Econ words: 511
GI Polit words: 264
GI Legal words: 193

Lexicon features preview:
                 date                                               text  \
0 2016-01-19 00:00:00  Ample storage space for oil may limit price mo...   
1 2017-04-26 01:00:00  BRIEF-Vail Resorts provide fiscal year 2017 gu...   
2 2019-01-13 00:00:00  Scoop: Trump dressed down Mulvaney in front of...   
3 2018-05-10 01:00:00  Russia's Lavrov says Nord Stream 2 gas pipelin...   
4 2018-05-10 13:30:01  Equifax denied passport numbers were involved ...   

   lm_pos  lm_neg  
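
One thing to watch in the dummy above: `text.lower().split()` leaves punctuation glued to tokens (`"loss,"` will not match `"loss"`), and the set intersection counts distinct words rather than occurrences. A possible alternative is sketched below, using a regex tokenizer and a `Counter`; the tiny `toy` lexicon is made up for illustration and `tokenize_words` / `lexicon_counts` are hypothetical names.

```python
import re
from collections import Counter

# Lowercase ASCII words, optionally with one straight-apostrophe suffix ("don't").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize_words(text):
    """Lowercase the text and return word tokens with punctuation stripped."""
    if not isinstance(text, str):
        return []
    return TOKEN_RE.findall(text.lower())


def lexicon_counts(tokens, lexicons):
    """Count total token occurrences per lexicon (lexicons: name -> set of words)."""
    freq = Counter(tokens)
    return {
        name: sum(n for word, n in freq.items() if word in words)
        for name, words in lexicons.items()
    }


toy = {"neg": {"loss", "decline"}}  # illustrative only, not the real LM list
print(lexicon_counts(tokenize_words("Loss, loss and decline."), toy))
```

Whether distinct-word or occurrence counts are the better feature is a modelling choice; occurrence counts correlate with article length, so dividing by `word_count` may be worth trying. Curly apostrophes, which do appear in this corpus, are treated as separators by this regex.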

### **NER dummy**

In [22]:
# ----------------------------------------
# 0. SAMPLE A FEW ROWS SAFELY FROM SPARK
# ----------------------------------------
import pandas as pd
from pyspark.sql.functions import rand

cols = ["date", "text"]

rows = (
    df_silver
        .select(*cols)
        .orderBy(rand())     # random sample
        .limit(10)
        .collect()
)

pdf = pd.DataFrame([r.asDict() for r in rows])
print(pdf)


# ----------------------------------------
# 1. IMPORTS
# ----------------------------------------
import spacy


# ----------------------------------------
# 2. LOAD NER MODEL
# ----------------------------------------
# transformer version (slower, more accurate):
# nlp = spacy.load("en_core_web_trf")

# small fast version (recommended for testing + Gold pipeline):
nlp = spacy.load("en_core_web_sm")


# ----------------------------------------
# 3. NER FEATURE EXTRACTOR
# ----------------------------------------
def ner_features(text):
    if not isinstance(text, str):
        return {"PERSON":0, "ORG":0, "GPE":0, "LOC":0, "MONEY":0}

    doc = nlp(text)

    counts = {"PERSON":0, "ORG":0, "GPE":0, "LOC":0, "MONEY":0}
    for ent in doc.ents:
        label = ent.label_
        if label in counts:
            counts[label] += 1

    return counts


# ----------------------------------------
# 4. APPLY NER TO SAMPLE ROWS
# ----------------------------------------
ner_rows = [ner_features(t) for t in pdf["text"]]
ner_df = pd.DataFrame(ner_rows)


# ----------------------------------------
# 5. MERGE BACK TO PANDAS
# ----------------------------------------
pdf = pd.concat([pdf, ner_df], axis=1)


# ----------------------------------------
# 6. PREVIEW
# ----------------------------------------
print("\nNER Preview:")
print(pdf[["text", "PERSON", "ORG", "GPE", "LOC", "MONEY"]].head())


                                                                                

                 date                                               text
0 2019-03-06 00:00:00  UPDATE 2-Abercrombie predicts strong 2019 sale...
1 2017-08-15 01:00:00  Scaramucci: If it were up to me, Bannon would ...
2 2017-06-17 09:59:36  Amazon’s Move Signals End of Line for Many Cas...
3 2018-12-11 00:00:00  DNC hustles to line up donors as 2020 pack sta...
5 2019-10-31 11:05:30  Russian students to be taught how to assemble ...
6 2017-06-26 01:00:00  BRIEF-M&C completes allocation of shares subje...
7 2018-11-20 05:00:03  Palm Oil Was Supposed to Help Save the Planet....
8 2019-06-19 01:00:00  TREASURIES-U.S. yields fall after Fed signals ...
9 2019-01-06 00:00:00  Vinyl and cassette sales saw double digit grow...

NER Preview:
                                                text  PERSON  ORG  GPE  LOC  \
0  UPDATE 2-Abercrombie predicts strong 2019 sale...       8    8    4    0   
1  Scaramucci: If it were up to me, Bannon would ...       9   11    2    0   
2  Amazon’s Move Si
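
Calling `nlp(text)` once per article will be slow at 2M rows. spaCy's `nlp.pipe` processes texts in batches, so the full script could look roughly like the sketch below. The loaded pipeline is passed in as `nlp`; `ner_counts_batched` and the default `batch_size` are hypothetical choices, not tuned values.

```python
from collections import Counter

NER_LABELS = ("PERSON", "ORG", "GPE", "LOC", "MONEY")


def ner_counts_batched(texts, nlp, batch_size=64):
    """Return one {label: count} dict per input text, using nlp.pipe for batching."""
    # Non-string inputs become empty strings so output rows stay aligned with inputs.
    cleaned = [t if isinstance(t, str) else "" for t in texts]
    rows = []
    for doc in nlp.pipe(cleaned, batch_size=batch_size):
        seen = Counter(ent.label_ for ent in doc.ents)
        rows.append({label: seen.get(label, 0) for label in NER_LABELS})
    return rows
```

Usage would be `ner_df = pd.DataFrame(ner_counts_batched(pdf["text"], nlp))`. Passing `disable=[...]` to `spacy.load` to skip components such as the parser or lemmatizer might speed this up further, though it would be worth checking that entity output stays the same.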

### **Structural Dummy**

In [24]:
# ----------------------------------------
# 0. SAMPLE A FEW ROWS SAFELY FROM SPARK
# ----------------------------------------
import pandas as pd
from pyspark.sql.functions import rand

cols = ["date", "text"]

rows = (
    df_silver
        .select(*cols)
        .orderBy(rand())
        .limit(10)
        .collect()
)

pdf = pd.DataFrame([r.asDict() for r in rows])
print(pdf.head())


# ----------------------------------------
# 1. IMPORTS
# ----------------------------------------
import re
import numpy as np


# ----------------------------------------
# 2. STRUCTURAL FEATURE EXTRACTOR
# ----------------------------------------
sentence_split = re.compile(r"[.!?]+")
word_split = re.compile(r"\w+")

def structural_features(text):
    if not isinstance(text, str) or not text.strip():
        return {
            "word_count": 0,
            "char_count": 0,
            "avg_word_length": 0.0,
            "num_sentences": 0,
            "avg_sentence_length": 0.0,
            "punct_count": 0,
            "upper_words": 0,
        }

  

    words = word_split.findall(text)
    word_count = len(words)

    avg_word_len = np.mean([len(w) for w in words]) if words else 0.0

    sentences = [s.strip() for s in sentence_split.split(text) if s.strip()]
    num_sent = len(sentences)
    avg_sent_len = np.mean([len(word_split.findall(s)) for s in sentences]) if num_sent else 0.0

    punct_count = sum(ch in ".,;:!?()" for ch in text)

    upper_words = sum(w.isupper() for w in words if len(w) > 1)

    return {
        "word_count": word_count,
        "char_count": len(text),  # was missing here but present in the empty-text branch
        "avg_word_length": avg_word_len,
        "num_sentences": num_sent,
        "avg_sentence_length": avg_sent_len,
        "punct_count": punct_count,
        "upper_words": upper_words,
    }


# ----------------------------------------
# 3. APPLY TO SAMPLE ROWS
# ----------------------------------------
struct_rows = [structural_features(t) for t in pdf["text"]]
struct_df = pd.DataFrame(struct_rows)


# ----------------------------------------
# 4. MERGE BACK INTO PDF
# ----------------------------------------
pdf = pd.concat([pdf, struct_df], axis=1)


# ----------------------------------------
# 5. PREVIEW
# ----------------------------------------
print("\nStructural feature preview:")
print(
    pdf[
        [
            "text",
            "word_count",
            "avg_word_length",
            "num_sentences",
            "avg_sentence_length",
            "punct_count",
            "upper_words",
        ]
    ].head()
)




                 date                                               text
0 2019-02-25 00:00:00  Diversity wins at the Oscars\n\nGreen Book, a ...
1 2019-04-28 01:00:00  These two city-building puzzle games play very...
2 2017-10-11 01:00:00  Kushner praised Bannon for Fox News interview ...
3 2018-03-30 11:10:02  What this Silicon Valley VC learned on the ‘Ru...
4 2020-03-17 18:15:30  Historic surge in coronavirus phishing meets n...

Structural feature preview:
                                                text  word_count  \
0  Diversity wins at the Oscars\n\nGreen Book, a ...         310   
1  These two city-building puzzle games play very...         999   
2  Kushner praised Bannon for Fox News interview ...         260   
3  What this Silicon Valley VC learned on the ‘Ru...        1153   
4  Historic surge in coronavirus phishing meets n...        1080   

   avg_word_length  num_sentences  avg_sentence_length  punct_count  \
0         4.803226             16            19.37500

                                                                                

## Gold Schema — `delta_news_gold`

The Gold layer aggregates all **semantic**, **structural**, and **entity-level** features needed for
clustering, topic modelling, market alignment, and downstream ML tasks.  
Whereas Silver provides clean text and metadata, **Gold attaches meaning**, producing a fully enriched,
model-ready representation of each article.

---

## 1. Columns Inherited Directly from Silver

Gold retains all essential Silver fields so each enriched row remains grounded in its original context:

| Column               | Type      | Description                                                |
|----------------------|-----------|------------------------------------------------------------|
| date                 | timestamp | Canonical publication date.                                |
| text                 | string    | Clean article text.                                         |
| publication          | string    | Normalised publication name.                               |
| author               | string    | Author string (optional).                                   |
| url                  | string    | Canonical URL.                                              |
| text_type            | string    | Article format label.                                       |
| time_precision       | string    | Timestamp granularity.                                      |
| date_trading         | timestamp | Market-aligned timestamp.                                   |
| tz_hint              | string    | Timezone hint.                                              |
| dataset              | string    | Dataset label.                                              |
| dataset_source       | string    | Provenance info.                                            |
| source               | string    | Dataset source identifier.                                  |
| anchor_policy        | string    | Timestamp alignment policy.                                 |
| source_file          | string    | Originating Bronze file.                                    |
| len_text             | integer   | QA field: cleaned text length.                              |
| silver_ingestion_ts  | timestamp | Timestamp of Silver creation.                               |

---

## 2. New Semantic Features Created in Gold

### **2.1 Embedding Features**

| Column            | Type              | Description                                              |
|-------------------|-------------------|----------------------------------------------------------|
| embedding_1024    | array\<float\>    | Raw E5-large embeddings (1024 dimensions).               |
| embedding_reduced | array\<float\>    | PCA-reduced embedding (5–50 dimensions).                 |

Embeddings form the backbone of clustering and semantic similarity.
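
The 5–50 range for `embedding_reduced` is left open above. One way to pick the size is by cumulative explained variance, sketched here with plain NumPy SVD rather than the scikit-learn `PCA` used in the dummy cell. The `var_target` and `max_components` defaults are assumptions, and the `StandardScaler` step from the notebook is omitted.

```python
import numpy as np


def pca_fit(X, var_target=0.95, max_components=50):
    """Fit PCA via SVD and keep the fewest components reaching var_target."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()  # cumulative explained-variance ratio
    k = int(np.searchsorted(ratio, var_target)) + 1
    k = min(k, max_components, len(s))
    return mean, Vt[:k], ratio[:k]


def pca_transform(X, mean, components):
    """Project rows of X onto the fitted principal axes."""
    return (np.asarray(X, dtype=np.float64) - mean) @ components.T


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))  # synthetic stand-in for real embeddings
mean, comps, ratio = pca_fit(X, var_target=0.90)
print(comps.shape, pca_transform(X, mean, comps).shape)
```

Centred data with `n` rows has rank at most `n - 1`, which is why a tiny sample can only support a handful of components. For the full corpus, the fit would presumably run on a large random sample and the stored `mean` and `components` would then be applied chunk by chunk.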

---

### **2.2 Lexicon Features**

Gold incorporates domain-specific lexicon counts from  
**Loughran–McDonald** (finance/econ sentiment) and  
**General Inquirer** (economic / political / legal domains).

#### *Loughran–McDonald*

| Column   | Type    | Description                       |
|----------|---------|-----------------------------------|
| lm_pos   | integer | Positive sentiment words.         |
| lm_neg   | integer | Negative sentiment words.         |
| lm_unc   | integer | Uncertainty-related terms.        |
| lm_lit   | integer | Litigation/legal-risk terms.      |
| lm_con   | integer | Constraining or restrictive words.|

#### *General Inquirer*

| Column    | Type    | Description                       |
|-----------|---------|-----------------------------------|
| gi_econ   | integer | Economic-domain words.            |
| gi_polit  | integer | Political-domain words.           |
| gi_legal  | integer | Legal/governance-related words.   |

Lexicons add interpretable anchor signals to each article’s semantic profile.

---

### **2.3 Named Entity Recognition (NER) Features**

Counts of named entities detected via spaCy:

| Column | Type    | Description                                    |
|--------|---------|------------------------------------------------|
| PERSON | integer | Mentions of individuals.                        |
| ORG    | integer | Companies, institutions, organisations.         |
| GPE    | integer | Countries, states, cities (geo-political).     |
| LOC    | integer | Non-political locations.                        |
| MONEY  | integer | Currency-denominated amounts.                   |

NER helps identify *who* the article is about and *where* the events take place.

---

### **2.4 Structural Text Features**

Shape-level indicators describing the structure of the article:

| Column              | Type    | Description                                 |
|---------------------|---------|---------------------------------------------|
| word_count          | integer | Total number of words.                      |
| char_count          | integer | Character count.                             |
| avg_word_length     | double  | Mean word length.                            |
| num_sentences       | integer | Sentence count.                              |
| avg_sentence_length | double  | Mean sentence length (words).                |
| punct_count         | integer | Number of punctuation characters.            |
| upper_words         | integer | Uppercase words (>1 character).              |

These features help differentiate short bulletins, long analyses, editorials, and disclosures.

---

## 3. Final Gold Table Schema

| Column               | Type              | Description                                          |
|----------------------|-------------------|------------------------------------------------------|
| date                 | timestamp         | Canonical publication date.                          |
| text                 | string            | Clean article text.                                  |
| publication          | string            | Publication identifier.                              |
| author               | string            | Author field.                                        |
| url                  | string            | Canonical URL.                                       |
| text_type            | string            | Format label.                                        |
| time_precision       | string            | Timestamp granularity.                               |
| date_trading         | timestamp         | Market-aligned timestamp.                            |
| tz_hint              | string            | Timezone hint.                                       |
| dataset              | string            | Dataset label.                                       |
| dataset_source       | string            | Provenance metadata.                                 |
| source               | string            | Dataset source identifier.                           |
| anchor_policy        | string            | Timestamp alignment logic.                           |
| source_file          | string            | Originating Bronze file.                             |
| len_text             | integer           | Character length for QA.                             |
| silver_ingestion_ts  | timestamp         | Silver transformation timestamp.                     |
| embedding_1024       | array\<float\>    | High-dimensional embedding.                          |
| embedding_reduced    | array\<float\>    | PCA-compressed embedding.                            |
| lm_pos               | integer           | LM positive sentiment.                               |
| lm_neg               | integer           | LM negative sentiment.                               |
| lm_unc               | integer           | LM uncertainty terms.                                |
| lm_lit               | integer           | LM litigious terms.                                  |
| lm_con               | integer           | LM constraining terms.                               |
| gi_econ              | integer           | GI economic-domain terms.                            |
| gi_polit             | integer           | GI political-domain terms.                           |
| gi_legal             | integer           | GI legal-domain terms.                               |
| PERSON               | integer           | Person entities.                                     |
| ORG                  | integer           | Organisation entities.                               |
| GPE                  | integer           | Geo-political entities.                              |
| LOC                  | integer           | Location entities.                                   |
| MONEY                | integer           | Monetary entities.                                   |
| word_count           | integer           | Word count.                                          |
| char_count           | integer           | Character count.                                     |
| avg_word_length      | double            | Average word length.                                 |
| num_sentences        | integer           | Sentence count.                                      |
| avg_sentence_length  | double            | Mean sentence length.                                |
| punct_count          | integer           | Punctuation count.                                   |
| upper_words          | integer           | Uppercase word count.                                |
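
A schema this wide is easy to get subtly wrong when the per-feature DataFrames are merged. A small guard like the one below could run on a few sample rows before writing. The column names are taken from the table above; `missing_features` is a hypothetical helper.

```python
# Feature columns added in Gold, as listed in the schema table.
GOLD_FEATURE_COLUMNS = [
    "embedding_1024", "embedding_reduced",
    "lm_pos", "lm_neg", "lm_unc", "lm_lit", "lm_con",
    "gi_econ", "gi_polit", "gi_legal",
    "PERSON", "ORG", "GPE", "LOC", "MONEY",
    "word_count", "char_count", "avg_word_length",
    "num_sentences", "avg_sentence_length", "punct_count", "upper_words",
]


def missing_features(record: dict) -> list:
    """Return the Gold feature columns that are absent or None in one row."""
    return [c for c in GOLD_FEATURE_COLUMNS if record.get(c) is None]


row = {c: 0 for c in GOLD_FEATURE_COLUMNS}
del row["char_count"]
print(missing_features(row))
```

Applied to `pdf.to_dict("records")` from the dummy cells, this should flag any feature that is missing from a row.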

---

## 4. Design Principles of the Gold Layer

- **Semantic richness:** embeddings, lexicons, and NER capture multiple linguistic axes.  
- **Interpretability:** lexicon and structural features make clusters explainable.  
- **Completeness:** all model-ready features are present for every retained row.  
- **No lossy transformations:** original text and core metadata remain intact.  
- **Model-readiness:** this table feeds clustering, PCA, regressions, and financial aggregation.

Gold is the **final enriched representation** of the full news corpus and the foundation for
topic discovery, behavioural insights, and market-reaction modelling.
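
The last planned step, writing to the Gold Delta table, is not prototyped in this notebook. A minimal sketch using the standard Spark `DataFrameWriter` chain is shown below. `write_gold` is a hypothetical wrapper, and `overwrite` mode is an assumption (appending per chunk may suit the full run better).

```python
def write_gold(df, path, mode="overwrite"):
    """Write a Spark DataFrame to a Delta table at `path`."""
    (
        df.write
          .format("delta")   # Delta Lake format, as configured in build_spark()
          .mode(mode)        # assumption: replace the table on each full run
          .save(str(path))   # str() so a pathlib.Path such as GOLD_DELTA works
    )
```

In this notebook the call would be `write_gold(df_gold, GOLD_DELTA)`, where `df_gold` is the enriched DataFrame that has yet to be built.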
