<a href="https://colab.research.google.com/github/billycemerson/purbaya-net/blob/main/src/purbaya-ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/billycemerson/purbaya-net

Cloning into 'purbaya-net'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 15 (delta 1), reused 14 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (15/15), 127.29 KiB | 10.61 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [3]:
%cd purbaya-net
!ls

/content/purbaya-net
data  main.py  pyproject.toml  README.md  src  uv.lock


#### Install Packages

In [4]:
!pip install torch transformers pandas



#### Import Packages

In [5]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

#### Load Model

In [6]:
# Load model
model_name = "cahya/bert-base-indonesian-NER"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Setup NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Group tokens into entities
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at cahya/bert-base-indonesian-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


#### Load Data

In [8]:
# Load data
df = pd.read_csv("data/kompas.csv")
df.head()

Unnamed: 0,media,title,description,url,date,year
0,kompas,Menperin Nilai Kebijakan Purbaya Ramah Manufaktur,Menteri Perindustrian Agus Gumiwang Kartasasmi...,https://money.kompas.com/read/2025/10/02/18114...,2 Oktober 2025,2025
1,kompas,Purbaya Cap Pertamina Malas-malasan Bangun Kilang,Menkeu Purbaya Yudhi Sadewa menyinggung Pertam...,https://money.kompas.com/read/2025/10/01/08035...,1 Oktober 2025,2025
2,kompas,Ceplas-ceplos ala Koboi Menkeu Purbaya,"Purbaya memang begitu. Gaya spontan, ceplas-ce...",https://money.kompas.com/read/2025/09/12/06450...,12 September 2025,2025
3,kompas,"J.B Sumarlin, Krisis Moneter, dan Menteri Purbaya",Ada api dalam sekam yang bisa menyeret ekonomi...,https://money.kompas.com/read/2025/10/08/10461...,8 Oktober 2025,2025
4,kompas,"Menkeu Purbaya, Koboi yang Merawat Narasi Digital",Purbaya Yudhi Sadewa berhasil menarik perhatia...,https://money.kompas.com/read/2025/09/19/10270...,19 September 2025,2025


In [9]:
# Take a single row for a quick test
row = df.iloc[0]

# Concatenate the title and description into one text for NER
text = f"{row['title']}. {row['description']}"
print(text)

Menperin Nilai Kebijakan Purbaya Ramah Manufaktur. Menteri Perindustrian Agus Gumiwang Kartasasmita menilai kebijakan Menteri Keuangan Purbaya Yudhi Sadewa sejalan dengan kepentingan industri.


#### Apply NER

In [10]:
# Apply NER
entities = ner_pipeline(text)

# See the results
for ent in entities:
    print(f"{ent['word']} -> {ent['entity_group']} ({ent['score']:.2f})")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


men -> NOR (0.54)
purba -> PER (0.56)
menteri perindustrian -> NOR (0.99)
agus gumiwang kartasasmita -> PER (0.99)
menteri keuangan -> NOR (0.96)
purbaya -> PER (0.78)
yudhi sadewa -> PER (0.92)
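
In the output above, "purbaya" and "yudhi sadewa" come back as two separate PER entities even though they form one name. A possible post-processing step is to merge consecutive entities that share a label and sit next to each other in the text. The sketch below assumes each entity dict carries the `start` and `end` character offsets that the aggregated pipeline returns. The `merge_adjacent` function, the `span` helper, the `max_gap` parameter and the choice to keep the lower of the two scores are all hypothetical, and the sample list is hand-built for illustration rather than real model output.

```python
def merge_adjacent(entities, text, max_gap=1):
    """Merge consecutive entities with the same label that are at most
    `max_gap` characters apart (for example, separated by one space)."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity_group"] == ent["entity_group"]
            and ent["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous span and re-read the surface form from the text
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
            # Assumption: keep the lower score as a conservative estimate
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input list is not mutated
    return merged


# Hand-built sample in the shape of the pipeline output (not real model output)
sample_text = "Menteri Keuangan Purbaya Yudhi Sadewa"


def span(word, label, score):
    start = sample_text.find(word)
    return {
        "entity_group": label,
        "word": word.lower(),
        "score": score,
        "start": start,
        "end": start + len(word),
    }


sample = [
    span("Menteri Keuangan", "NOR", 0.96),
    span("Purbaya", "PER", 0.78),
    span("Yudhi Sadewa", "PER", 0.92),
]

merged = merge_adjacent(sample, sample_text)
for ent in merged:
    print(f"{ent['word']} -> {ent['entity_group']} ({ent['score']:.2f})")
```

This does not repair fragments such as "men" or "purba", which are cut inside a word; those would need a different fix, for example a different `aggregation_strategy`.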


In [11]:
df_results = []

# Apply NER to every row
for i, row in df.iterrows():
    text = f"{row['title']}. {row['description']}"
    entities = ner_pipeline(text)
    for ent in entities:
        df_results.append({
            'article_id': i,
            'entity': ent['word'],
            'label': ent['entity_group'],
            'score': ent['score']
        })

# Save results
df_ner = pd.DataFrame(df_results)
df_ner.to_csv("ner_results.csv", index=False)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
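
The warning above appears because the loop calls the pipeline once per row. One way to address it is to pass all texts in a single call and let the pipeline batch them. The sketch below assumes the pipeline accepts a list of strings together with a `batch_size` argument and returns one list of entities per input text. `run_ner_batched` and `fake_ner` are hypothetical names; `fake_ner` is only a stand-in so the example runs without downloading the model.

```python
def run_ner_batched(ner, ids, texts, batch_size=16):
    """Run NER over many texts in one call and flatten the result
    into rows shaped like the ones in `df_results`."""
    # Assumption: `ner` returns a list with one list of entity dicts per text
    outputs = ner(list(texts), batch_size=batch_size)
    rows = []
    for article_id, entities in zip(ids, outputs):
        for ent in entities:
            rows.append({
                "article_id": article_id,
                "entity": ent["word"],
                "label": ent["entity_group"],
                "score": ent["score"],
            })
    return rows


# Stand-in for the real pipeline: tags the first word of each text as PER
def fake_ner(texts, batch_size=1):
    return [
        [{"word": t.split()[0].lower(), "entity_group": "PER", "score": 0.9}]
        for t in texts
    ]


rows = run_ner_batched(
    fake_ner,
    ids=[0, 1],
    texts=["Purbaya Yudhi Sadewa", "Agus Gumiwang Kartasasmita"],
    batch_size=2,
)
print(rows)
```

With the notebook's own objects, the call would presumably look like `run_ner_batched(ner_pipeline, df.index, texts)`, where `texts` is the list of "title. description" strings built the same way as in the loop.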


In [12]:
df_ner.head()

Unnamed: 0,article_id,entity,label,score
0,0,men,NOR,0.54405
1,0,purba,PER,0.55951
2,0,menteri perindustrian,NOR,0.987625
3,0,agus gumiwang kartasasmita,PER,0.993042
4,0,menteri keuangan,NOR,0.960561


In [13]:
df_ner.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3037 entries, 0 to 3036
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   article_id  3037 non-null   int64  
 1   entity      3037 non-null   object 
 2   label       3037 non-null   object 
 3   score       3037 non-null   float32
dtypes: float32(1), int64(1), object(2)
memory usage: 83.2+ KB
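
With 3037 extracted entities, a natural next step is to count which entities occur most often per label while dropping low-confidence fragments such as "men" (0.54) and "purba" (0.56). The sketch below works on plain dict rows in the same shape as `df_results`. The `top_entities` function and the 0.6 score threshold are assumptions for illustration, and the sample rows are the five shown in `df_ner.head()`.

```python
from collections import Counter


def top_entities(rows, label, min_score=0.6, n=5):
    """Return the `n` most frequent entities for one label,
    ignoring rows whose score is below `min_score`."""
    counts = Counter(
        r["entity"]
        for r in rows
        if r["label"] == label and r["score"] >= min_score
    )
    return counts.most_common(n)


# The five rows shown in df_ner.head()
rows = [
    {"article_id": 0, "entity": "men", "label": "NOR", "score": 0.54405},
    {"article_id": 0, "entity": "purba", "label": "PER", "score": 0.55951},
    {"article_id": 0, "entity": "menteri perindustrian", "label": "NOR", "score": 0.987625},
    {"article_id": 0, "entity": "agus gumiwang kartasasmita", "label": "PER", "score": 0.993042},
    {"article_id": 0, "entity": "menteri keuangan", "label": "NOR", "score": 0.960561},
]

print(top_entities(rows, "PER"))
print(top_entities(rows, "NOR"))
```

To run this on the full result, `df_ner.to_dict("records")` yields rows in the same shape.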
