# Query

We retrieved the article links from the GDELT GKG dataset on BigQuery with this query:

```SQL
SELECT
  DocumentIdentifier AS url,
  SourceCommonName AS SOURCE,
  V2Themes,
  V2Tone,
  DATE(PARSE_TIMESTAMP('%Y%m%d%H%M%S', CAST(Date AS STRING))) AS publish_date
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
  _PARTITIONDATE BETWEEN '2017-01-01'
  AND '2025-12-31'
  AND LOWER(SourceCommonName) IN ( 
    'detik.com',
    'liputan6.com',
    'bisnis.com',
    'cnnindonesia.com',
    'tempo.co' )
  AND ( V2Themes LIKE '%ECON_INFLATION%'
    OR V2Themes LIKE '%ECON_PRICE%'
    OR V2Themes LIKE '%UNGP_INTEREST_RATES%'
    OR V2Themes LIKE '%EPU_POLICY_PRICE%' )
  AND V2Locations LIKE '%Indonesia%'

```

We decided to use the top five news domains that are available for scraping, covering 2017 through 2025, and to filter articles with GDELT's built-in V2Themes field.
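The `PARSE_TIMESTAMP('%Y%m%d%H%M%S', ...)` call in the query turns the integer `Date` field into a calendar date. As a minimal sketch of the same conversion on the Python side, assuming the raw value is a 14-digit `YYYYMMDDHHMMSS` integer as the format string implies (the helper name and the sample value are illustrative, not part of the original notebook):

```python
from datetime import date, datetime


def gdelt_date_to_publish_date(raw_date: int) -> date:
    """Convert a GDELT-style YYYYMMDDHHMMSS integer into a calendar date."""
    # Same format string as the PARSE_TIMESTAMP call in the query.
    return datetime.strptime(str(raw_date), "%Y%m%d%H%M%S").date()


# Illustrative value only, not taken from the dataset.
print(gdelt_date_to_publish_date(20170403101500))
```

This can be handy for spot-checking a few rows locally before running the full BigQuery job.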

# Read Data

In [None]:
import pyodbc
import pandas as pd
import re
from sentence_transformers import SentenceTransformer, util

pd.set_option('display.max_columns',None)
pd.set_option('display.width',None)
pd.set_option('display.max_colwidth',None)
pd.set_option('display.max_rows', None)

In [4]:
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=LAPTOP-H9LRGGLD;'
    'DATABASE=InflationNews;'
    'Trusted_Connection=yes;'
)

In [5]:
query = "SELECT * FROM NewsData"
df = pd.read_sql(query, conn)



# Split the dataset to find relevant topics for the later similarity test

In [None]:
# Split halfway
total_rows = len(df)

half = total_rows // 2
df1 = df.iloc[:half]
df2 = df.iloc[half:]

# Save both to CSV
df1.to_csv("relevant link folders/NewsData_part1.csv", index=False, encoding="utf-8-sig")
df2.to_csv("relevant link folders/NewsData_part2.csv", index=False, encoding="utf-8-sig")


# Simple EDA

In [7]:
df.head()

Unnamed: 0,url,source,publish_date
0,http://bali.bisnis.com/read/20141208/9/239302/hari-natal-2014-omzet-penjualan-pohon-aksesoris-natal-melonjak-300,bisnis.com,2019-02-04
1,http://bali.bisnis.com/read/20170403/1/65655/peserta-ta-periode-iii-ta-di-bali-53-wajib-pajak-pribadi-umkm,bisnis.com,2017-04-03
2,http://bali.bisnis.com/read/20180403/538/779524/transaksi-non-tunai-di-bali-masih-terhadang-sejumlah-persoalan,bisnis.com,2018-04-03
3,http://bali.bisnis.com/read/20170613/15/67084/cuaca-buruk-juga-pengaruhi-pasokan-umpan-ikan,bisnis.com,2017-06-13
4,http://bali.bisnis.com/read/20180409/538/782224/bali-paragon-resort-hadirkan-blackmud-lounge,bisnis.com,2018-04-09


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380167 entries, 0 to 380166
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   url           380167 non-null  object
 1   source        380167 non-null  object
 2   publish_date  380167 non-null  object
dtypes: object(3)
memory usage: 8.7+ MB


In total, we collected 380,167 links.

In [9]:
df['source'].value_counts()

source
bisnis.com          178684
liputan6.com        149761
cnnindonesia.com     48326
tempo.co              3393
detik.com                3
Name: count, dtype: int64

The link counts vary widely by source domain: bisnis.com and liputan6.com together account for roughly 86% of the links, while detik.com contributes only 3. We suspect that the primary cause of the low link counts from certain sources is the domain's own policy, such as the strict scraping rules that kompas.com defines in its robots.txt file.
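To make the imbalance between sources concrete, the counts from the `value_counts()` output above can be turned into percentage shares. This is only a small arithmetic sketch; the variable names are illustrative.

```python
# Link counts copied from the value_counts() output above.
source_counts = {
    "bisnis.com": 178684,
    "liputan6.com": 149761,
    "cnnindonesia.com": 48326,
    "tempo.co": 3393,
    "detik.com": 3,
}

total_links = sum(source_counts.values())

# Share of each source in percent, rounded to two decimals.
source_share = {
    source: round(100 * count / total_links, 2)
    for source, count in source_counts.items()
}

for source, share in source_share.items():
    print(f"{source:<18}{share:>6.2f}%")
```

On the DataFrame itself, `df['source'].value_counts(normalize=True)` should give the same proportions directly.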

# Check Source

In [10]:
df[df['source'] == 'detik.com']

Unnamed: 0,url,source,publish_date
71907,https://inet.detik.com/fotostop-news/d-4042717/fujifilm-x-t100-resmi-mendarat-di-indonesia-harganya,detik.com,2018-05-28
78750,https://inet.detik.com/consumer/d-3865207/di-peluncuran-nokia-8-ada-penampakan-nokia-9,detik.com,2018-02-14
360700,https://news.detik.com/kolom/d-5797345/school-waste-bank-project-as-environmental-education-at-sal-junior-high-school,detik.com,2021-11-04


Since the news from detik.com is negligible in volume and its content is unrelated to inflation, it is better to drop it.

In [11]:
df = df[df['source'] != 'detik.com']

In [12]:
df['source'].value_counts()

source
bisnis.com          178684
liputan6.com        149761
cnnindonesia.com     48326
tempo.co              3393
Name: count, dtype: int64

# Check Date

In [13]:
df['publish_date'] = pd.to_datetime(df['publish_date'])
df['year'] = df['publish_date'].dt.year
df.groupby('year').size()


year
2017    31029
2018    45315
2019    45184
2020    39763
2021    45158
2022    44742
2023    46377
2024    46401
2025    36195
dtype: int64

Every year from 2017 to 2025 is represented, and the links are spread fairly evenly across years, ranging from 31,029 (2017) to 46,401 (2024).

# Preprocessing

## Add slug column from URL

In [14]:
df['slug'] = df['url'].apply(lambda x:x.split('/')[-1] if isinstance(x,str) else None)
df['slug'] = df['slug'].str.replace('-', ' ')  # replace hyphens with spaces
df.head()

Unnamed: 0,url,source,publish_date,year,slug
0,http://bali.bisnis.com/read/20141208/9/239302/hari-natal-2014-omzet-penjualan-pohon-aksesoris-natal-melonjak-300,bisnis.com,2019-02-04,2019,hari natal 2014 omzet penjualan pohon aksesoris natal melonjak 300
1,http://bali.bisnis.com/read/20170403/1/65655/peserta-ta-periode-iii-ta-di-bali-53-wajib-pajak-pribadi-umkm,bisnis.com,2017-04-03,2017,peserta ta periode iii ta di bali 53 wajib pajak pribadi umkm
2,http://bali.bisnis.com/read/20180403/538/779524/transaksi-non-tunai-di-bali-masih-terhadang-sejumlah-persoalan,bisnis.com,2018-04-03,2018,transaksi non tunai di bali masih terhadang sejumlah persoalan
3,http://bali.bisnis.com/read/20170613/15/67084/cuaca-buruk-juga-pengaruhi-pasokan-umpan-ikan,bisnis.com,2017-06-13,2017,cuaca buruk juga pengaruhi pasokan umpan ikan
4,http://bali.bisnis.com/read/20180409/538/782224/bali-paragon-resort-hadirkan-blackmud-lounge,bisnis.com,2018-04-09,2018,bali paragon resort hadirkan blackmud lounge
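Splitting on `/` and taking the last element works for the URLs shown here, but it returns an empty string when a URL ends with a slash, and it keeps any query string attached to the slug. A slightly more defensive sketch, assuming the slug is always the last non-empty path segment (`extract_slug` is a hypothetical helper, not part of the notebook):

```python
from urllib.parse import urlparse


def extract_slug(url):
    """Return the last non-empty path segment of a URL, with hyphens as spaces."""
    if not isinstance(url, str):
        return None
    # urlparse separates the path from the query string and fragment.
    path = urlparse(url).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    return last_segment.replace("-", " ")


url = "http://bali.bisnis.com/read/20170613/15/67084/cuaca-buruk-juga-pengaruhi-pasokan-umpan-ikan"
print(extract_slug(url))
```

It could be applied with `df['url'].apply(extract_slug)` in place of the lambda, if trailing slashes or query strings turn out to occur in the data.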


## Cleaning

In [15]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text) 
    text = re.sub(r'\s+',' ', text).strip()
    return text

df['clean_headline'] = df['slug'].apply(clean_text)
df.head()

Unnamed: 0,url,source,publish_date,year,slug,clean_headline
0,http://bali.bisnis.com/read/20141208/9/239302/hari-natal-2014-omzet-penjualan-pohon-aksesoris-natal-melonjak-300,bisnis.com,2019-02-04,2019,hari natal 2014 omzet penjualan pohon aksesoris natal melonjak 300,hari natal omzet penjualan pohon aksesoris natal melonjak
1,http://bali.bisnis.com/read/20170403/1/65655/peserta-ta-periode-iii-ta-di-bali-53-wajib-pajak-pribadi-umkm,bisnis.com,2017-04-03,2017,peserta ta periode iii ta di bali 53 wajib pajak pribadi umkm,peserta ta periode iii ta di bali wajib pajak pribadi umkm
2,http://bali.bisnis.com/read/20180403/538/779524/transaksi-non-tunai-di-bali-masih-terhadang-sejumlah-persoalan,bisnis.com,2018-04-03,2018,transaksi non tunai di bali masih terhadang sejumlah persoalan,transaksi non tunai di bali masih terhadang sejumlah persoalan
3,http://bali.bisnis.com/read/20170613/15/67084/cuaca-buruk-juga-pengaruhi-pasokan-umpan-ikan,bisnis.com,2017-06-13,2017,cuaca buruk juga pengaruhi pasokan umpan ikan,cuaca buruk juga pengaruhi pasokan umpan ikan
4,http://bali.bisnis.com/read/20180409/538/782224/bali-paragon-resort-hadirkan-blackmud-lounge,bisnis.com,2018-04-09,2018,bali paragon resort hadirkan blackmud lounge,bali paragon resort hadirkan blackmud lounge


The cleaned headline slugs are used for the similarity test later on.
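As a quick check of what the cleaning step does, the sketch below reapplies the same `clean_text` logic to one slug from the table above. Note that digits are removed along with punctuation, so numeric details such as years and percentages are lost; whether that matters for the similarity test is a judgment call.

```python
import re


def clean_text(text):
    # Same steps as the notebook cell: lowercase, keep letters only, collapse spaces.
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


slug = "hari natal 2014 omzet penjualan pohon aksesoris natal melonjak 300"
print(clean_text(slug))
# -> hari natal omzet penjualan pohon aksesoris natal melonjak
```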

# Similarity

In [None]:
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# 40 manually picked relevant headlines
golden_slugs = [
    'rupiah menguat ke rp14233 di tengah lonjakan inflasi global',
    'menelusuri penyebab harga bbm naik di pom bensin pertamina dan vivo',
    'survei bi indeks harga konsumen di minggu pertama juni alami deflasi',
    'euforia penurunan suku bunga dorong indeks sp 500 ke rekor tertinggi',
    'kenaikan harga tomat kembali picu inflasi sulut november 2019',
    'konsumsi turun sebab ri catatkan deflasi di agustus',
    'pergerakan mata uang disinflasi as untungkan rupiah',
    'pemerintah lacak akar masalah lesunya daya beli',
    'inflasi indonesia naik pengaruh harga pangan global',
    'bupati tangerang bahan pokok makin mahal usai harga bbm naik',
    'indocement dongkrak harga jual produk imbas kenaikan harga energi',
    'diproyeksi terus naik harga gas jadi ancaman tarif listrik',
    'punya nikel indonesia bisa menguasai rantai pasok baterai lithium',
    'inflasi indonesia 2023 hanya 261 persen bi hasil dari konsistensi kebijakan moneter',
    'bank indonesia bakal kerek suku bunga usai harga bbm resmi naik',
    'bps harga batu bara masih naik hingga akhir 2022',
    'rupiah makin loyo dana subsidi pemerintah terancam jebol',
    'inflasi jadi ancaman nyata bagi indonesia waspada',
    'indonesia alami deflasi pertama sejak 25 tahun apa penyebabnya',
    'disinflasi global melambat apa pengaruhnya',
    'bank indonesia tak perlu khawatir dengan turunnya inflasi inti',
    'indeks harga konsumen september terjadi inflasi 013',
    'indeks harga konsumen ihk september 2019 diperkirakan deflasi 017 persen',
    'pelemahan dolar as bantu kenaikan harga emas',
    'ekonom prediksi terjadi disinflasi pada januari 2025 ini penyebabnya',
    'ekonom sebut pinjaman gadai melesat akibat daya beli turun dan badai phk',
    'permintaan masih sepi harga pangan ikut tertekan',
    'harga bahan pokok melejit pedagang pasar teriak',
    'harga minyak merangkak naik harga bbm tidak perlu turun',
    'harga listrik pembangkit sampah mahal berisiko bebani keuangan pln',
    'pemerintah pertimbangkan kenaikan tarif listrik pelanggan nonsubsidi di pertengahan tahun',
    'penutupan pelabuhan china picu masalah sistemik rantai pasok',
    'bos bi sebut suku bunga akan ditahan kebijakan moneter 2024 tetap pro stabilitas',
    'kebijakan moneter the fed tekan rupiah ke rp14450',
    'kebijakan moneter perang suku bunga makin melebar',
    'pertumbuhan ekonomi ri dibayangi kenaikan suku bunga as',
    'pemerintah tambah subsidi energi rp 749 triliun di 2022',
    'purbaya bakal tarik rp200 t uang pemerintah di bi buat ditaruh di bank',
    'airlangga klaim deflasi di ri buah sukses pemerintah kendalikan harga',
    'the fed ingatkan inflasi berpeluang naik usai trump pasang tarif'
]

golden_embeddings = model.encode(golden_slugs, convert_to_tensor=True)

print("Vectoring")
all_headlines = df['clean_headline'].tolist()
all_embeddings = model.encode(all_headlines, convert_to_tensor=True, show_progress_bar=True)

print("Calculating")
cosine_scores = util.cos_sim(all_embeddings, golden_embeddings)
max_similarity_scores = cosine_scores.max(axis=1).values

Vectoring



Calculating


The goal is to filter a large dataset of headlines based on their semantic relevance to a small set of manually curated examples. This ensures that the resulting dataset contains only headlines closely focused on the topic defined by those examples.


The filtering was performed using a sentence embedding model and cosine similarity. We used a small pre-trained Sentence Transformer model, specifically 'paraphrase-multilingual-MiniLM-L12-v2'.


This model converts text into dense numerical vectors, or embeddings. Sentences with similar meanings are mapped to vectors that are close to one another in the vector space.


The 40 manually selected relevant headlines (golden_slugs) were encoded into a set of golden_embeddings to define the topic space of interest, specifically covering key economic clusters such as Energy & Fuel, Food Commodities, Monetary Policy, Currency Exchange Rates, Government Fiscal Actions, and Macroeconomic Indicators.

All headlines in the dataset (df['clean_headline']) were also encoded to generate all_embeddings.

For the similarity calculation, we use the cosine similarity metric to quantify the semantic relationship between each headline and the golden examples.


Then, for every headline in the full dataset, the cosine similarity was calculated against all 40 golden embeddings.


The maximum similarity score for each headline was retained. This maximum score represents the closest semantic relationship between that headline and the target topic defined by the golden set.
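The max-over-golden-set step can be illustrated without the model. The sketch below uses plain Python and toy 2-D vectors standing in for real embeddings; the vectors and helper names are hypothetical, and the notebook itself relies on `util.cos_sim` on tensors rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_similarity(headline_vectors, golden_vectors):
    """For each headline vector, keep its best score against any golden vector."""
    return [
        max(cosine_similarity(h, g) for g in golden_vectors)
        for h in headline_vectors
    ]


# Toy 2-D vectors standing in for real embeddings (hypothetical values).
golden_vectors = [[1.0, 0.0], [0.0, 1.0]]
headline_vectors = [[1.0, 2.0], [-1.0, 0.2]]

scores = max_similarity(headline_vectors, golden_vectors)
print([round(s, 3) for s in scores])
```

Here the first toy headline points mostly in the direction of the second golden vector and scores high, while the second is far from both and scores low, which is the behavior the threshold relies on.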

In [None]:
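# Move the tensor to the CPU and convert it to NumPy before storing it in the DataFrame
df['similarity_score'] = max_similarity_scores.cpu().numpy()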
similarity_threshold = 0.7
df_similarity = df[df['similarity_score'] >= similarity_threshold].copy()
print(f'Total Rows: {len(df_similarity)}\n')
print(df_similarity['year'].value_counts())

Total Rows: 9293

year
2023    1521
2022    1383
2024    1362
2021     971
2018     960
2025     957
2019     916
2020     775
2017     448
Name: count, dtype: int64


A similarity threshold of 0.7 was then applied as the minimum semantic closeness required for a headline to be considered relevant.


After filtering, 9,293 headlines remain.
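The 0.7 cutoff is a judgment call, and the number of retained headlines depends on it. One way to see how sensitive the result is would be to count the surviving rows at several thresholds. A minimal sketch with made-up scores (the `count_above` helper and the sample values are hypothetical; in the notebook the scores would come from `df['similarity_score']`):

```python
def count_above(scores, thresholds):
    """Return how many scores are greater than or equal to each threshold."""
    return {t: sum(score >= t for score in scores) for t in thresholds}


# Hypothetical similarity scores, not taken from the dataset.
sample_scores = [0.91, 0.78, 0.72, 0.69, 0.55, 0.31]

survivors = count_above(sample_scores, [0.6, 0.7, 0.75, 0.8])
print(survivors)
```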

In [26]:
df_similarity.head()

Unnamed: 0,url,source,publish_date,year,slug,clean_headline,similarity_score
12,http://bali.bisnis.com/read/20180503/538/791388/satgas-pangan-bali-cermati-elpiji-oplosan-empat-risiko-inflasi,bisnis.com,2018-05-03,2018,satgas pangan bali cermati elpiji oplosan empat risiko inflasi,satgas pangan bali cermati elpiji oplosan empat risiko inflasi,0.757262
14,http://bali.bisnis.com/read/20170801/14/67987/target-pertumbuhan-ekonomi-nasional-52-cukup-berat,bisnis.com,2017-08-01,2017,target pertumbuhan ekonomi nasional 52 cukup berat,target pertumbuhan ekonomi nasional cukup berat,0.706792
15,http://bali.bisnis.com/read/20170403/16/65662/labu-siam-sumbang-inflasi-bulanan-denpasar,bisnis.com,2017-04-03,2017,labu siam sumbang inflasi bulanan denpasar,labu siam sumbang inflasi bulanan denpasar,0.713937
58,http://bandung.bisnis.com/read/20170113/34231/565987/ini-besaran-sumbangan-kenaikan-stnk-tarif-listrik-terhadap-inflasi,bisnis.com,2017-01-13,2017,ini besaran sumbangan kenaikan stnk tarif listrik terhadap inflasi,ini besaran sumbangan kenaikan stnk tarif listrik terhadap inflasi,0.744901
158,http://bandung.bisnis.com/read/20170216/34231/567541/ekonomi-ri-diprediksi-naik-53-di-2017,bisnis.com,2017-02-16,2017,ekonomi ri diprediksi naik 53 di 2017,ekonomi ri diprediksi naik di,0.783851


In [27]:
df_similarity['source'].value_counts()

source
bisnis.com          5552
liputan6.com        2822
cnnindonesia.com     866
tempo.co              53
Name: count, dtype: int64

# Save

In [28]:
df_similarity.to_csv("NewsDataClean.csv", index=False, encoding="utf-8-sig")