In [1]:
!pip install transformers


This notebook uses a fine-tuned model to classify new Indonesian text as abusive or not abusive.
The steps are:
1. Import the packages
2. Define the fine-tuned model
3. Create the pipeline
4. Run predictions

In [2]:
#1 Import packages
from transformers import AutoTokenizer, AutoModel, pipeline

In [3]:
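# 2. Define the model
model_name = "fathan/abusive_content_identification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# AutoModel loads only the base BERT encoder and drops the classification head,
# which is why the warning below appears. This object is not used for prediction:
# the pipeline in step 3 loads the full classifier from model_name itself.
model = AutoModel.from_pretrained(model_name)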

Some weights of the model checkpoint at fathan/abusive_content_identification were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
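
The warning above appears because `AutoModel` builds a bare `BertModel` and discards the `classifier.weight` and `classifier.bias` tensors stored in the checkpoint. A minimal sketch of one way to avoid it, assuming the checkpoint is a sequence-classification model (as the dropped weight names suggest), is to load it through `AutoModelForSequenceClassification` and hand that object to the pipeline. The helper name `build_classifier` is hypothetical, and the import is deferred so that defining the function does not download anything by itself.

```python
MODEL_NAME = "fathan/abusive_content_identification"  # same checkpoint as in the notebook


def build_classifier(model_name: str = MODEL_NAME):
    """Hypothetical helper: load the tokenizer and the full classifier, wrapped in a pipeline."""
    # Imported lazily so the function can be defined without loading transformers.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # This class keeps the classification head instead of dropping it.
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)


# Usage (downloads the checkpoint on the first call):
# clf = build_classifier()
# print(clf("Kemarin ada gempa, biasanya gak terasa"))
```

Passing the loaded model object rather than the model name means the checkpoint is read once and the same object is reused by the pipeline.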


In [4]:
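# 3. Create the pipeline
# Passing the model name (not the `model` object above) makes the pipeline
# load the checkpoint together with its classification head.
my_pipeline = pipeline("text-classification", model=model_name, tokenizer=tokenizer)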

In [16]:
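# 4. Prediction (inputs are Indonesian; the comments give approximate English glosses)
text_data = ['Kemarin ada gempa, biasanya gak terasa',  # "There was an earthquake yesterday; usually you don't feel them"
             'Bocah ingusan bodoh!',  # "Stupid snot-nosed kid!"
             'banyak menunya, udh mau buka puasa sesuatu banget, wkwkw',  # "so many menu items, almost time to break the fast, really something, lol"
             'permainannya jelek tolol semua..']  # "the play is bad, they're all idiots.."

for t in text_data:
    print(my_pipeline(t))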

[{'label': 'Bukan_konten_kasar', 'score': 0.9997958540916443}]
[{'label': 'Konten_kasar', 'score': 0.9999817609786987}]
[{'label': 'Bukan_konten_kasar', 'score': 0.9467130899429321}]
[{'label': 'Konten_kasar', 'score': 0.9999781847000122}]
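
Each call returns a list with one dictionary holding the predicted `label` and its `score`. `Konten_kasar` is Indonesian for roughly "abusive content" and `Bukan_konten_kasar` for "not abusive content". As a rough sketch of how these results might be consumed downstream, the snippet below flags a text only when the abusive label clears a confidence threshold. The function name `is_abusive`, the 0.9 threshold, and the hard-coded `results` list (copied from the output above, with scores rounded) are illustrative assumptions, not part of the original notebook.

```python
ABUSIVE_LABEL = "Konten_kasar"


def is_abusive(prediction, threshold=0.9):
    """Return True if the top prediction is the abusive label with score >= threshold."""
    top = prediction[0]  # the pipeline returns a one-element list per input text
    return top["label"] == ABUSIVE_LABEL and top["score"] >= threshold


# Outputs copied from the prediction cell above (scores rounded for readability).
results = [
    [{"label": "Bukan_konten_kasar", "score": 0.9998}],
    [{"label": "Konten_kasar", "score": 0.99998}],
    [{"label": "Bukan_konten_kasar", "score": 0.9467}],
    [{"label": "Konten_kasar", "score": 0.99998}],
]

flags = [is_abusive(r) for r in results]
print(flags)  # [False, True, False, True]
```

A threshold is one possible design choice; whether 0.9 is appropriate would depend on how costly false positives and false negatives are for the application.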


In [17]:
# Additional info: inspect the tokenization result
for t in text_data:
    print(tokenizer.tokenize(t))  # subword tokens as strings
    print('\n')
    print(tokenizer(t))  # model inputs: input_ids, token_type_ids, attention_mask

['kemarin', 'ada', 'gempa', ',', 'biasanya', 'gak', 'terasa']


{'input_ids': [3, 4454, 1684, 4447, 16, 2276, 7552, 5776, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
['bocah', 'ing', '##usan', 'bodoh', '!']


{'input_ids': [3, 7916, 1974, 2278, 10500, 5, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
['banyak', 'menun', '##ya', ',', 'udh', 'mau', 'buka', 'puasa', 'sesuatu', 'banget', ',', 'wkwkw']


{'input_ids': [3, 1814, 9128, 1494, 16, 6138, 2882, 9012, 7336, 2876, 10218, 16, 6624, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['permainan', '##nya', 'jelek', 'tolol', 'semua', '.', '.']


{'input_ids': [3, 4090, 1519, 13290, 10689, 2014, 18, 18, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
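
In the output above, pieces prefixed with `##` continue the previous token (the WordPiece convention), and each `input_ids` list is two entries longer than its token list because the tokenizer wraps the sequence in special tokens (ids 3 and 4 here; presumably `[CLS]` and `[SEP]`, the usual BERT layout, although the output does not show their names). The sketch below uses a hypothetical helper, `merge_wordpieces`, to rejoin the pieces into whole words and checks the length relationship with values copied from the output.

```python
def merge_wordpieces(tokens):
    """Rejoin WordPiece tokens: a piece starting with '##' is glued to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and append to the last word
        else:
            words.append(tok)
    return words


# Values copied from the tokenization output for 'Bocah ingusan bodoh!'.
tokens = ["bocah", "ing", "##usan", "bodoh", "!"]
input_ids = [3, 7916, 1974, 2278, 10500, 5, 4]

print(merge_wordpieces(tokens))      # ['bocah', 'ingusan', 'bodoh', '!']
print(len(input_ids) - len(tokens))  # 2 (the two special tokens)
```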
