<a href="https://colab.research.google.com/github/bryanbayup/phising-detection/blob/main/test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=44fc6d5557c370673bebb8ff6af0530c45315d9d85232479efe3785c9cac11bd
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [2]:
import json
import os
import re
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from seqeval.metrics import classification_report
from transformers import AutoTokenizer, TFAutoModel, create_optimizer

#######################
# Load Data
#######################
data_path = '/content/data2.json'  # Replace with the path to your dataset
with open(data_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Collect every user utterance together with its intent and entities
utterances = []
intents = []
entities_all = []

for conv in data:
    for turn in conv['turns']:
        if turn['speaker'] == 'user':
            utt = turn['utterance']
            intent = turn['intent']
            ents = turn.get('entities', [])
            utterances.append(utt)
            intents.append(intent)
            entities_all.append(ents)

#######################
# Label Encoding
#######################
intent_labels = list(set(intents))
intent_labels.sort()
intent_label2id = {lbl: i for i, lbl in enumerate(intent_labels)}
intent_id2label = {v:k for k,v in intent_label2id.items()}

# Build the set of NER labels
ner_tags = {"O"}
for ents in entities_all:
    for e in ents:
        ent_type = e['entity']
        # Use the BIO convention
        ner_tags.add("B-"+ent_type)
        ner_tags.add("I-"+ent_type)
ner_tags = list(ner_tags)
ner_tags.sort()
ner_label2id = {lbl: i for i, lbl in enumerate(ner_tags)}
ner_id2label = {i: lbl for lbl, i in ner_label2id.items()}

#######################
# Tokenizer & Preprocessing
#######################
model_name = "indobenchmark/indobert-base-p2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def clean_text(text):
    # optional: apply further normalization here
    text = text.strip()
    return text

max_length = 64  # adjust the maximum input length as needed

def encode_data(utterances, intents, entities_all):
    input_ids_list = []
    attention_masks_list = []
    intent_ids_list = []
    ner_labels_list = []

    for utt, intent, ents in zip(utterances, intents, entities_all):
        utt_clean = clean_text(utt)
        tokens = tokenizer.tokenize(utt_clean)
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        if len(tokens) > max_length:
            tokens = tokens[:max_length-1] + ["[SEP]"]

        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        attention_mask = [1]*len(input_ids)

        # Pad
        while len(input_ids) < max_length:
            input_ids.append(tokenizer.pad_token_id)
            attention_mask.append(0)

        # Map intent
        intent_id = intent_label2id[intent]

        # Map NER
        # Start with the "O" label for every token
        ner_labels = ["O"]*(len(tokens))

        # Mark entities.
        # Assumption: each entity value appears in utt_clean as a sequence of
        # whitespace-separated words. We locate that word sequence, then label
        # the subwords it covers: the first subword gets B-<type>, the rest I-<type>.
        # Note: this alignment is simplistic. A word with attached punctuation
        # will not equal the bare entity word, so such entities stay unlabeled;
        # alignment based on the tokenizer's character offsets would be more robust.
        words = utt_clean.split()

        for e in ents:
            ent_type = e['entity']
            ent_value = e['value'].strip().split()  # entity value as a list of words
            # Find this word sequence in words (first match only)
            for i in range(len(words)-len(ent_value)+1):
                if words[i:i+len(ent_value)] == ent_value:
                    # Index of the entity's first subword; [CLS] occupies index 0
                    subword_start = 1
                    for w_i in range(i):
                        subword_start += len(tokenizer.tokenize(words[w_i]))

                    # Tag the entity's subwords, never overwriting the final [SEP]
                    for w_j, wv in enumerate(ent_value):
                        for k, _ in enumerate(tokenizer.tokenize(wv)):
                            if subword_start < len(ner_labels)-1:
                                if w_j == 0 and k == 0:
                                    ner_labels[subword_start] = "B-"+ent_type
                                else:
                                    ner_labels[subword_start] = "I-"+ent_type
                                subword_start += 1
                    break

        # Convert ner_labels to ids
        ner_ids = [ner_label2id[lbl] for lbl in ner_labels]
        while len(ner_ids) < max_length:
            ner_ids.append(ner_label2id["O"])
        if len(ner_ids) > max_length:
            ner_ids = ner_ids[:max_length]

        input_ids_list.append(input_ids)
        attention_masks_list.append(attention_mask)
        intent_ids_list.append(intent_id)
        ner_labels_list.append(ner_ids)

    return np.array(input_ids_list), np.array(attention_masks_list), np.array(intent_ids_list), np.array(ner_labels_list)


X_input_ids, X_att_mask, Y_intent, Y_ner = encode_data(utterances, intents, entities_all)

X_train_ids, X_val_ids, X_train_mask, X_val_mask, Y_train_intent, Y_val_intent, Y_train_ner, Y_val_ner = train_test_split(
    X_input_ids, X_att_mask, Y_intent, Y_ner, test_size=0.2, random_state=42
)

#######################
# Build Multi-task Model
#######################
num_intent_labels = len(intent_label2id)
num_ner_labels = len(ner_label2id)

base_model = TFAutoModel.from_pretrained(model_name)

# Input
input_ids_ = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
attention_mask_ = tf.keras.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask')

outputs = base_model(input_ids=input_ids_, attention_mask=attention_mask_)
sequence_output = outputs.last_hidden_state # (batch, seq, hidden)
cls_output = sequence_output[:,0,:]  # CLS token representation

# Intent head
intent_logits = tf.keras.layers.Dense(num_intent_labels, name='intent_classifier')(cls_output)

# NER head
ner_logits = tf.keras.layers.Dense(num_ner_labels, name='ner_classifier')(sequence_output)

model = tf.keras.Model(inputs=[input_ids_, attention_mask_], outputs=[intent_logits, ner_logits])

# Define losses
# Intent: sparse categorical crossentropy
# NER: sparse categorical crossentropy
loss_fct_intent = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fct_ner = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

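# The two losses are combined through model.compile with per-output dictionaries
# (a custom training step would be the alternative).
# Note: 'adam' uses the Keras default learning rate (1e-3), which is usually too high
# for fine-tuning a pretrained BERT; rates around 2e-5 to 5e-5 are more typical.
model.compile(optimizer='adam',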
              loss={
                  'intent_classifier': loss_fct_intent,
                  'ner_classifier': loss_fct_ner
              },
              loss_weights={
                  'intent_classifier': 1.0,
                  'ner_classifier': 1.0
              },
              metrics={
                  'intent_classifier':'accuracy'
              })

# Train
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(
    [X_train_ids, X_train_mask],
    {'intent_classifier': Y_train_intent, 'ner_classifier': Y_train_ner},
    validation_data=([X_val_ids, X_val_mask], {'intent_classifier': Y_val_intent, 'ner_classifier': Y_val_ner}),
    epochs=5,
    batch_size=16,
    callbacks=[early_stop]
)

#######################
# Evaluation
#######################

# Evaluate Intent
val_intent_preds, val_ner_preds = model.predict([X_val_ids, X_val_mask])
val_intent_labels = val_intent_preds.argmax(axis=1)
acc = np.mean(val_intent_labels == Y_val_intent)
print("Intent Accuracy:", acc)

# Evaluate NER
val_ner_labels = val_ner_preds.argmax(axis=-1)
# Convert to tag sequence
true_tags = []
pred_tags = []
for i in range(len(Y_val_ner)):
    true_seq = [ner_id2label[id_] for id_ in Y_val_ner[i]]
    pred_seq = [ner_id2label[id_] for id_ in val_ner_labels[i]]
    # strip padding
    true_seq = true_seq[:np.count_nonzero(X_val_mask[i])]
    pred_seq = pred_seq[:np.count_nonzero(X_val_mask[i])]
    true_tags.append(true_seq[1:-1]) # drop [CLS], [SEP]
    pred_tags.append(pred_seq[1:-1])

print(classification_report(true_tags, pred_tags))

#######################
# Dialog Manager
#######################
# Once the model is trained, it can be used for multi-turn dialogue.
# Each user input is tokenized, its intent and entities are predicted, and the dialog manager picks the response.

class DialogManager:
    def __init__(self):
        self.state = 'IDLE'
        self.reported_animal = None
        self.symptoms = []

    def predict_intent_ner(self, text):
        # Apply the same cleaning as in training before tokenizing
        utt_clean = clean_text(text)
        tok = tokenizer(utt_clean, truncation=True, padding='max_length', max_length=max_length, return_tensors='tf')
        intent_logits, ner_logits = model([tok['input_ids'], tok['attention_mask']], training=False)
        intent_id = np.argmax(intent_logits, axis=1)[0]
        intent = intent_id2label[intent_id]

        ner_ids = np.argmax(ner_logits, axis=-1)[0]
        tokens = tokenizer.convert_ids_to_tokens(tok['input_ids'][0])
        # Reconstruct entities
        pred_labels = [ner_id2label[i] for i in ner_ids]
        # Merge subword tokens back into entities
        # (the reverse of the alignment done during encoding)
        entities = []
        current_ent = None
        current_val = []
        for w, l in zip(tokens, pred_labels):
            if w in ['[CLS]', '[SEP]', '[PAD]']:
                continue
            if l.startswith("B-"):
                # Save previous entity
                if current_ent is not None:
                    entities.append({'entity': current_ent, 'value': " ".join(current_val)})
                current_ent = l[2:]
                current_val = [w]
            elif l.startswith("I-") and current_ent == l[2:]:
                current_val.append(w)
            else:
                # "O", or an I- tag that does not continue the current entity
                if current_ent is not None:
                    entities.append({'entity': current_ent, 'value': " ".join(current_val)})
                    current_ent = None
                    current_val = []
        if current_ent is not None:
            entities.append({'entity': current_ent, 'value': " ".join(current_val)})

        # Rejoin WordPiece continuations, e.g. "foo ##bar" -> "foobar"
        for ent in entities:
            ent['value'] = ent['value'].replace(" ##", "").replace("##", "")

        return intent, entities

    def get_next_response(self, user_input):
        intent, entities = self.predict_intent_ner(user_input)
        # Multi-turn logic; adapt this to your needs
        if intent == "Greeting":
            return "Halo! Ada yang bisa saya bantu?"
        elif intent == "Thanks":
            return "Sama-sama! Senang membantu."
        elif intent == "Rekomendasi Penanganan Awal":
            # Example logic:
            # record the animal and any symptoms the user mentions
            animal_ents = [e for e in entities if e['entity']=='animal']
            symptom_ents = [e for e in entities if e['entity']=='symptom']
            if animal_ents:
                self.reported_animal = animal_ents[0]['value']
            if symptom_ents:
                self.symptoms.extend([s['value'] for s in symptom_ents])
            return f"Saya mencatat {self.reported_animal if self.reported_animal else 'hewan'} dengan gejala {', '.join(self.symptoms)}. Apa yang Anda rasakan perlu saya bantu selanjutnya?"
        else:
            return "Maaf, saya belum mengerti."

# Example usage
dm = DialogManager()
print(dm.get_next_response("Halo, saya butuh bantuan."))
print(dm.get_next_response("Anjing saya sering menggaruk bagian telinganya."))
print(dm.get_next_response("apa yang perlu saya lakukan"))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some layers from the model checkpoint at indobenchmark/indobert-base-p2 were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at indobenchmark/indobert-base-p2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Intent Accuracy: 0.43333333333333335
              precision    recall  f1-score   support

      animal       0.00      0.00      0.00         9
   condition       0.00      0.00      0.00         3
    location       0.00      0.00      0.00         1
      status       0.00      0.00      0.00         1
     symptom       0.00      0.00      0.00        19

   micro avg       0.00      0.00      0.00        33
   macro avg       0.00      0.00      0.00        33
weighted avg       0.00      0.00      0.00        33
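
In the encoding step, padded positions are labelled `O`, and the compiled NER loss averages over all `max_length` positions, so `O` dominates the training signal. This may contribute to the all-zero entity scores above, though the notebook does not verify it. The following is a minimal pure-Python sketch of a cross-entropy that skips ignored positions. The sentinel `IGNORE_INDEX = -100` and the function name `masked_cross_entropy` are assumptions made for illustration and are not part of the notebook.

```python
import math

IGNORE_INDEX = -100  # hypothetical sentinel for positions excluded from the loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-token score lists; labels: list of int class ids.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # skip padded (and optionally special-token) positions
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]  # negative log-likelihood of the true class
        count += 1
    return total / count if count else 0.0


# Three real tokens followed by two padded positions (class 0 plays the role of "O").
logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [5.0, 0.0], [5.0, 0.0]]
masked_labels = [0, 1, 0, IGNORE_INDEX, IGNORE_INDEX]
padded_as_o = [0, 1, 0, 0, 0]

loss_masked = masked_cross_entropy(logits, masked_labels)
loss_padded = masked_cross_entropy(logits, padded_as_o)
print(loss_masked, loss_padded)
```

In this toy input, counting the padded positions as correctly predicted `O` lowers the mean loss, which illustrates how easy padding can mask errors on real tokens.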





Saya mencatat hewan dengan gejala . Apa yang Anda rasakan perlu saya bantu selanjutnya?
Saya mencatat hewan dengan gejala . Apa yang Anda rasakan perlu saya bantu selanjutnya?
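
The encoding step in the cell above matches entity values against whitespace-split words, so an entity word followed by punctuation would not match. A sketch of the character-offset alignment that the code comments recommend is given below. It uses a small regex tokenizer as a stand-in for the `offset_mapping` that fast Hugging Face tokenizers can return with `return_offsets_mapping=True`. The helper names and the entity annotations for the example sentence are hypothetical; only the sentence itself comes from the notebook.

```python
import re


def tokenize_with_offsets(text):
    """Stand-in tokenizer: (token, start, end) for each word or punctuation mark."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def bio_labels(text, entities):
    """Assign BIO tags to tokens that lie fully inside an entity's character span."""
    tokens = tokenize_with_offsets(text)
    labels = ["O"] * len(tokens)
    for ent in entities:
        start = text.find(ent["value"])  # first occurrence only
        if start < 0:
            continue  # value not present verbatim; leave tokens as "O"
        end = start + len(ent["value"])
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start >= start and tok_end <= end:
                labels[i] = ("B-" if first else "I-") + ent["entity"]
                first = False
    return [tok for tok, _, _ in tokens], labels


text = "Anjing saya sering menggaruk bagian telinganya."
# Hypothetical annotations in the same {'entity', 'value'} shape the dataset uses.
entities = [
    {"entity": "animal", "value": "Anjing"},
    {"entity": "symptom", "value": "menggaruk bagian telinganya"},
]
tokens, labels = bio_labels(text, entities)
for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

Because the match is made on character spans rather than on whole words, the trailing period becomes its own token and stays `O` while `telinganya` is still labelled.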


In [3]:
print(dm.get_next_response("Halo, saya butuh bantuan."))
print(dm.get_next_response("Anjing saya sering menggaruk bagian telinganya."))
print(dm.get_next_response("apa yang perlu saya lakukan"))

Saya mencatat hewan dengan gejala . Apa yang Anda rasakan perlu saya bantu selanjutnya?
Saya mencatat hewan dengan gejala . Apa yang Anda rasakan perlu saya bantu selanjutnya?
Saya mencatat hewan dengan gejala . Apa yang Anda rasakan perlu saya bantu selanjutnya?
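
All three inputs above receive the response that belongs to the "Rekomendasi Penanganan Awal" branch, which suggests the classifier predicts that intent regardless of input. One way to make the dialog manager degrade more gracefully is to fall back when the classifier is not confident. The sketch below is a minimal illustration under stated assumptions: the threshold of 0.6, the `"Fallback"` label, the function `pick_intent`, and the example logits are all hypothetical, not values from the notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_intent(logits, id2label, threshold=0.6, fallback="Fallback"):
    """Return (intent, confidence); use the fallback label below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    intent = id2label[best] if probs[best] >= threshold else fallback
    return intent, probs[best]


# Hypothetical label map using intent names that appear in the dialog manager.
id2label = {0: "Greeting", 1: "Thanks", 2: "Rekomendasi Penanganan Awal"}

confident = pick_intent([5.0, 0.0, 0.0], id2label)  # one class clearly ahead
unsure = pick_intent([0.1, 0.2, 0.0], id2label)     # nearly uniform scores
print(confident, unsure)
```

In `get_next_response`, a fallback result could be routed to the existing "Maaf, saya belum mengerti." reply instead of an intent-specific branch.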
