# BigQuery x GDELT

We retrieved the article links from the GDELT dataset on BigQuery with the following query:

```SQL
SELECT
  DocumentIdentifier AS url,
  SourceCommonName AS source,
  V2Themes,
  V2Tone,
  DATE(PARSE_TIMESTAMP('%Y%m%d%H%M%S', CAST(Date AS STRING))) AS publish_date
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
  _PARTITIONDATE BETWEEN '2025-01-01' AND '2025-12-31'
  AND LOWER(SourceCommonName) IN (
    'republika.co.id',
    'antaranews.com',
    'okezone.com',
    'kontan.co.id',
    'merdeka.com'
  )
  AND (
    V2Themes LIKE '%ECON_INFLATION%'
    OR V2Themes LIKE '%ECON_PRICE%'
    OR V2Themes LIKE '%UNGP_INTEREST_RATES%'
    OR V2Themes LIKE '%EPU_POLICY_PRICE%'
    OR V2Themes LIKE '%WB_CPI%'
    OR V2Themes LIKE '%WB_WPI%'
    OR V2Themes LIKE '%ECON_MACRO%'
  )
  AND V2Locations LIKE '%Indonesia%'
```

We decided to use five unseen sources, restricted to 2025, together with GDELT's built-in theme filters for our topic, even though these filters also return news on irrelevant topics. We then saved the result to a local SQL Server database (via SSMS).
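One reason irrelevant articles slip through is that `LIKE '%ECON_PRICE%'` is a substring match, so it also matches longer codes such as `TAX_ECON_PRICE`, which appears in the sample rows further down. The following is a minimal sketch of an exact-code check on an already downloaded `V2Themes` value, assuming the `THEME,offset;THEME,offset;...` layout seen in those sample rows. `parse_themes` and `has_exact_theme` are hypothetical helper names and are not part of the notebook.

```python
def parse_themes(v2themes: str) -> set[str]:
    """Return the set of theme codes in a V2Themes string, dropping the offsets."""
    themes = set()
    for entry in v2themes.split(';'):
        entry = entry.strip()
        if not entry:
            continue
        # Each entry looks like "THEME_CODE,offset"; keep only the code.
        themes.add(entry.split(',')[0])
    return themes


def has_exact_theme(v2themes: str, wanted: set[str]) -> bool:
    """True if at least one wanted code appears as a whole code, not a substring."""
    return not parse_themes(v2themes).isdisjoint(wanted)


# First two entries of one sample row; only a TAX_-prefixed price theme is present.
sample = "TAX_FNCACT_ORGANIZER,1812;TAX_ECON_PRICE,938;"
print('ECON_PRICE' in sample)                   # substring match, like SQL LIKE
print(has_exact_theme(sample, {'ECON_PRICE'}))  # exact code match
```

Whether the broader substring match is desirable depends on the topic; the sketch only makes the difference visible.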


# Read Data

In [None]:
import pyodbc
import pandas as pd
import re

In [None]:
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=LAPTOP-H9LRGGLD;'
    'DATABASE=InflationNews;'
    'Trusted_Connection=yes;'
)

In [None]:
query = "SELECT * FROM BigQuerry01012025_31122025_FidelityTest"
df = pd.read_sql(query, conn)

In [None]:
df.to_csv('fidelity_link_dataset.csv',index=False)

In [6]:
df = pd.read_csv('fidelity_link_dataset.csv')

# Simple EDA

In [7]:
df.head()

Unnamed: 0,url,source,V2Themes,V2Tone,publish_date
0,https://www.antaranews.com/berita/5269469/menh...,antaranews.com,"MEDIA_MSM,4021;TAX_FNCACT_REPORTER,4021;TAX_FN...","4.67128027681661,6.05536332179931,1.3840830449...",2025-11-26
1,https://www.antaranews.com/berita/4756189/sim-...,antaranews.com,"TAX_FNCACT_ORGANIZER,1812;TAX_ECON_PRICE,938;T...","2.10526315789474,3.15789473684211,1.0526315789...",2025-04-07
2,https://www.antaranews.com/berita/5182401/teru...,antaranews.com,"TAX_FNCACT_NOBLE,214;TAX_FNCACT_COMPANIONS,152...","14.4781144781145,15.1515151515152,0.6734006734...",2025-10-17
3,https://www.antaranews.com/berita/5135121/harg...,antaranews.com,"WB_336_NON_BANK_FINANCIAL_INSTITUTIONS,33;WB_3...","17.6,18,0.4,18.4,36,0,226",2025-09-26
4,https://www.antaranews.com/berita/5269481/kkp-...,antaranews.com,"WB_1983_HEALTHY_OCEANS,2243;WB_1979_NATURAL_RE...","8.22784810126582,9.07172995780591,0.8438818565...",2025-11-26


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40632 entries, 0 to 40631
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   url           40632 non-null  object
 1   source        40632 non-null  object
 2   V2Themes      40632 non-null  object
 3   V2Tone        40632 non-null  object
 4   publish_date  40632 non-null  object
dtypes: object(5)
memory usage: 1.6+ MB


We collected 40,632 links, but a manual check shows that not every link is related to our topic. We filter these out in the preprocessing steps later on.

In [9]:
df['source'].value_counts()

source
antaranews.com     18287
republika.co.id     7550
kontan.co.id        6912
merdeka.com         4065
okezone.com         3818
Name: count, dtype: int64

All five local news sites contribute a substantial number of links, but `antaranews.com` dominates with 18,287 of the 40,632 links (about 45%).

# Check Date

In [10]:
df['publish_date'] = pd.to_datetime(df['publish_date'])
df['year'] = df['publish_date'].dt.year
df['month'] = df['publish_date'].dt.month
print(df.groupby('year').size())
print("")
print(df.groupby('month').size())


year
2025    40632
dtype: int64

month
1     3613
2     4271
3     4378
4     4026
5     4762
6     2175
7     3426
8     3547
9     3675
10    3637
11    3122
dtype: int64


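Every month from January through November is represented, although June has noticeably fewer links (2,175) than the other months. There are no rows for December.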
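The coverage check can be made explicit rather than read off the table. This is a small sketch using the monthly counts from the output above; the 60% cut-off for flagging a sparse month is an arbitrary assumption chosen only for illustration.

```python
# Monthly link counts copied from the groupby output above.
month_counts = {1: 3613, 2: 4271, 3: 4378, 4: 4026, 5: 4762, 6: 2175,
                7: 3426, 8: 3547, 9: 3675, 10: 3637, 11: 3122}

# Months of the year with no rows at all.
missing_months = [m for m in range(1, 13) if m not in month_counts]

# Months with fewer than 60% of the mean monthly count (arbitrary cut-off).
mean_count = sum(month_counts.values()) / len(month_counts)
sparse_months = [m for m, n in month_counts.items() if n < 0.6 * mean_count]

print(missing_months, sparse_months)  # [12] [6]
```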

# Preprocessing

## Drop unused columns

In [11]:
df = df.drop(columns=['V2Themes','V2Tone'])

## Add slug column from url

In [12]:
df['slug'] = df['url'].apply(lambda x: x.split('/')[-1] if isinstance(x, str) else None)
df['slug'] = df['slug'].str.replace('-', ' ')  # replace hyphens with spaces
df.head()

Unnamed: 0,url,source,publish_date,year,month,slug
0,https://www.antaranews.com/berita/5269469/menh...,antaranews.com,2025-11-26,2025,11,menhub ajak masyarakat gunakan diskon tarif st...
1,https://www.antaranews.com/berita/4756189/sim-...,antaranews.com,2025-04-07,2025,4,sim keliling kembali dibuka di lima lokasi di ...
2,https://www.antaranews.com/berita/5182401/teru...,antaranews.com,2025-10-17,2025,10,terus meroket emas di pegadaian ada yang sentu...
3,https://www.antaranews.com/berita/5135121/harg...,antaranews.com,2025-09-26,2025,9,harga tiga produk emas di pegadaian jumat ini ...
4,https://www.antaranews.com/berita/5269481/kkp-...,antaranews.com,2025-11-26,2025,11,kkp tiga penyuluh kp raih satyalancana wira ka...
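Taking the last `/`-separated piece works for the URLs shown, but it would return an empty string for a URL ending in a slash and would keep any query string. A slightly more defensive sketch is given below, assuming the slug is always the last non-empty path segment. `extract_slug` is a hypothetical helper and the example URLs are made up.

```python
from urllib.parse import urlparse


def extract_slug(url):
    """Return the last non-empty path segment with hyphens replaced by spaces."""
    if not isinstance(url, str):
        return None
    # urlparse separates the path from any query string or fragment.
    segments = [s for s in urlparse(url).path.split('/') if s]
    if not segments:
        return None
    return segments[-1].replace('-', ' ')


# Hypothetical URLs used only to illustrate the edge cases.
print(extract_slug('https://example.com/berita/123/harga-emas-naik'))
print(extract_slug('https://example.com/berita/123/harga-emas-naik/?page=2'))
```

It could be applied with `df['url'].apply(extract_slug)` in place of the two lines above.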


## Cleaning Slugs into Clean Headline

In [13]:
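def clean_text(text):
    """Lowercase the slug and keep only letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # drop digits and punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
    return text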

df['clean_headline'] = df['slug'].apply(clean_text)
df.head()

Unnamed: 0,url,source,publish_date,year,month,slug,clean_headline
0,https://www.antaranews.com/berita/5269469/menh...,antaranews.com,2025-11-26,2025,11,menhub ajak masyarakat gunakan diskon tarif st...,menhub ajak masyarakat gunakan diskon tarif st...
1,https://www.antaranews.com/berita/4756189/sim-...,antaranews.com,2025-04-07,2025,4,sim keliling kembali dibuka di lima lokasi di ...,sim keliling kembali dibuka di lima lokasi di ...
2,https://www.antaranews.com/berita/5182401/teru...,antaranews.com,2025-10-17,2025,10,terus meroket emas di pegadaian ada yang sentu...,terus meroket emas di pegadaian ada yang sentu...
3,https://www.antaranews.com/berita/5135121/harg...,antaranews.com,2025-09-26,2025,9,harga tiga produk emas di pegadaian jumat ini ...,harga tiga produk emas di pegadaian jumat ini ...
4,https://www.antaranews.com/berita/5269481/kkp-...,antaranews.com,2025-11-26,2025,11,kkp tiga penyuluh kp raih satyalancana wira ka...,kkp tiga penyuluh kp raih satyalancana wira ka...
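Note that this rule removes digits as well as punctuation, so dates and prices disappear from the headline. The sketch below restates `clean_text` so that it runs on its own and applies it to a slug that appears later in the filtered output. `clean_text_keep_digits` is a hypothetical variant shown only for comparison; it is not what the notebook uses.

```python
import re


def clean_text(text):
    """Same rule as above: lowercase, keep only letters and whitespace."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


def clean_text_keep_digits(text):
    """Hypothetical variant that also keeps digits."""
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


slug = 'harga emas di pegadaian pada 26 mei kompak stabil'
print(clean_text(slug))              # the date "26" is dropped
print(clean_text_keep_digits(slug))  # the date is preserved
```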


In [14]:
df['clean_headline'].to_csv('clean_headline.csv', index=False)

# Similarity

In [16]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2',local_files_only=True)

# 20 manually picked relevant headlines
golden_slugs = [
    'harga emas antam naik rp pecah rekor tertinggi tembus rp juta per gram',
    'bps diskon listrik sampai cabai bawang picu deflasi jabar persen',
    'mentan amran jemput presiden brasil setelah turunkan harga pupuk persen',
    'bi proyeksi suku bunga the fed turun ke persen di di',
    'tpid pontianak antisipasi kenaikan harga pangan jelang imlek',
    'harga emas di pegadaian meroket hingga rp gram',
    'komisi iv dpr ri dorong penguatan distribusi cbp demi stabilitas harga',
    'emas di pegadaian antam galeri stabil ubs melonjak pada senin',
    'ingat harga tiket pesawat turun selama libur sekolah',
    'menko airlangga sebut inflasi karena emas dampak positif bullion bank',
    'safe haven dolar kurang menarik emas naik daun lagi',
    'kapolri jaga stabilitas harga dan ketersediaan pangan',
    'harga emas antam makin mahal dibanderol rp gram',
    'selesaikan konflik petani singkong mentan amran tetapkan harga dan larang impor',
    'kopdes merah putih metuk pasok gas hingga beras murah bagi warga',
    'ieu cepa akan bikin harga mercedesbenz lebih murah',
    'mentan pastikan harga ayam kembali normal dalam seminggu',
    'harga emas antam stabil ubs galeri turun di pegadaian sabtu ini',
    'bulog diy jamin stok beras cukup dan aman hingga akhir tahun',
    'tpid papua pegunungan dorong pengendalian harga barang cegah inflasi'
]

golden_embeddings = model.encode(golden_slugs, convert_to_tensor=True)

print("Vectoring")
all_headlines = df['clean_headline'].tolist()
all_embeddings = model.encode(all_headlines, convert_to_tensor=True, show_progress_bar=True)

print("Calculating")
cosine_scores = util.cos_sim(all_embeddings, golden_embeddings)
max_similarity_scores = cosine_scores.max(axis=1).values

df['similarity_score'] = max_similarity_scores.cpu().numpy()
similarity_threshold = 0.7
df_similarity = df[df['similarity_score'] >= similarity_threshold].copy()



Vectoring



Calculating


The goal of this step is to filter the large set of headlines by their semantic relevance to a small set of manually curated examples, so that the resulting dataset contains only headlines closely focused on the topic those examples define.


The filtering was performed with a sentence embedding model and cosine similarity. We used a small pre-trained Sentence Transformer model, `paraphrase-multilingual-MiniLM-L12-v2`.


This model converts text into dense numerical vectors (embeddings). Sentences with similar meanings are mapped to vectors that are close to one another in the vector space.


The twenty manually selected relevant headlines (`golden_slugs`) were encoded into a set of `golden_embeddings`. These embeddings define the topic space of interest.


All headlines in the dataset (`df['clean_headline']`) were also encoded to generate `all_embeddings`.

For the similarity calculation, we use the cosine similarity metric to quantify the semantic relationship between each headline and the golden examples.
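For reference, the standard definition of cosine similarity is given below. The notation is ours rather than the notebook's: $\mathbf{h}_i$ is the embedding of headline $i$ and $\mathbf{g}_j$ is the embedding of golden example $j$.

$$
\cos(\mathbf{h}_i, \mathbf{g}_j) = \frac{\mathbf{h}_i \cdot \mathbf{g}_j}{\lVert \mathbf{h}_i \rVert \, \lVert \mathbf{g}_j \rVert}
$$

The value lies in $[-1, 1]$, and values near 1 indicate embeddings that point in almost the same direction.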


Then, for every headline in the full dataset, the cosine similarity was calculated against all twenty golden embeddings.


The maximum similarity score for each headline was retained. This maximum score represents the closest semantic relationship between that headline and the target topic defined by the golden set.
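The "maximum over the golden set" step can be sketched in plain Python. The vectors here are made-up 2-D stand-ins for real embeddings, and `cosine` and `max_similarity` are hypothetical helper names. The notebook itself relies on `util.cos_sim` and a tensor `max`, which compute the same quantity in batch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_similarity(headline_vec, golden_vecs):
    """Highest cosine similarity between one headline and any golden example."""
    return max(cosine(headline_vec, g) for g in golden_vecs)


# Toy 2-D vectors standing in for real embeddings (made up for illustration).
golden = [[1.0, 0.0], [0.0, 1.0]]
headlines = {'close to first golden': [0.9, 0.1], 'between both': [1.0, 1.0]}

for name, vec in headlines.items():
    print(name, round(max_similarity(vec, golden), 3))
```

Because only the best match is kept, a headline needs to resemble just one golden example to score highly, so the composition of the golden set directly shapes what survives the filter.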

In [18]:
similarity_threshold = 0.7
df_similarity = df[df['similarity_score'] >= similarity_threshold].copy()
print(f'Total Rows: {len(df_similarity)}\n')
print(df_similarity['year'].value_counts())

Total Rows: 1911

year
2025    1911
Name: count, dtype: int64


We then apply a similarity threshold of 0.7 as the minimum semantic closeness required for a headline to be considered relevant.


After filtering, 1,911 headlines remain, about 4.7% of the original 40,632.
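Since the number of retained rows depends heavily on the cut-off, it can help to see how the count changes across a few thresholds before settling on one. This is a minimal sketch with made-up scores; in the notebook the input would be `df['similarity_score']`. `count_at_thresholds` is a hypothetical helper.

```python
def count_at_thresholds(scores, thresholds):
    """Map each threshold to the number of scores that are >= that threshold."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up scores for illustration only.
toy_scores = [0.84, 0.75, 0.76, 0.71, 0.69, 0.55, 0.42, 0.91]

for threshold, kept in count_at_thresholds(toy_scores, [0.6, 0.7, 0.8]).items():
    print(f'threshold {threshold}: {kept} of {len(toy_scores)} kept')
```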

In [19]:
df_similarity.head()

Unnamed: 0,url,source,publish_date,year,month,slug,clean_headline,similarity_score
2,https://www.antaranews.com/berita/5182401/teru...,antaranews.com,2025-10-17,2025,10,terus meroket emas di pegadaian ada yang sentu...,terus meroket emas di pegadaian ada yang sentu...,0.84413
3,https://www.antaranews.com/berita/5135121/harg...,antaranews.com,2025-09-26,2025,9,harga tiga produk emas di pegadaian jumat ini ...,harga tiga produk emas di pegadaian jumat ini ...,0.754299
5,https://www.antaranews.com/berita/5014569/harg...,antaranews.com,2025-08-04,2025,8,harga terbaru emas pegadaian galeri24 antam tu...,harga terbaru emas pegadaian galeri antam turu...,0.757368
9,https://www.antaranews.com/berita/4857989/harg...,antaranews.com,2025-05-25,2025,5,harga emas di pegadaian pada 26 mei kompak stabil,harga emas di pegadaian pada mei kompak stabil,0.835656
27,https://www.antaranews.com/berita/5123476/harg...,antaranews.com,2025-09-20,2025,9,harga emas antam galeri24 ubs di pegadaian har...,harga emas antam galeri ubs di pegadaian hari ...,0.709898


In [20]:
df_similarity['source'].value_counts()

source
antaranews.com     881
okezone.com        366
republika.co.id    302
kontan.co.id       294
merdeka.com         68
Name: count, dtype: int64

# Save

In [21]:
df_similarity.to_csv("NewsDataClean_fidelity.csv", index=False, encoding="utf-8-sig")