## Installing Necessary Modules

In [1]:
!pip install simpletransformers




## Importing Libraries

In [19]:
import os
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

"""
I use simpletransformers to fine-tune my BERT model.
Simpletransformers is an NLP library designed to simplify the usage of Transformer models.
It is built on top of the Hugging Face Transformers library.
See https://simpletransformers.ai/about/ for more details.
"""
from simpletransformers.classification import ClassificationModel

## Pre-Processing Data

In [6]:
# Load the dataset, "A Benchmark Data for Turkish Text Categorization" by Savaş Yıldırım,
# which is available at https://www.kaggle.com/datasets/savasy/ttc4900
# It is based on the work of the Kemik group, http://www.kemik.yildiz.edu.tr/
# The data are already pre-processed for text categorization
train_data_df = pd.read_csv('C:/Users/aatak/Desktop/7allV03.csv', encoding='utf-8', header=None, names=['cat', 'text'])

# Note: I deleted the first row ("cat, text" column headers) from the file I downloaded from Kaggle.

train_data_df.head()

Unnamed: 0,cat,text
0,siyaset,3 milyon ile ön seçim vaadi mhp nin 10 olağan...
1,siyaset,mesut_yılmaz yüce_divan da ceza alabilirdi pr...
2,siyaset,disko lar kaldırılıyor başbakan_yardımcısı ar...
3,siyaset,sarıgül anayasa_mahkemesi ne gidiyor mustafa_...
4,siyaset,erdoğan idamın bir haklılık sebebi var demek ...
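Instead of deleting the header row by hand, the file could probably be read as downloaded. The sketch below uses a tiny made-up CSV string in place of the real file (the two rows are placeholders, not real data) and relies on `header=0` together with `names=...`, which tells pandas to treat the first row as a header and replace it with the given column names.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the Kaggle file, including its "cat,text" header row.
raw_csv = "cat,text\nsiyaset,ornek metin bir\nspor,ornek metin iki\n"

# header=0 marks the first row as the header; names=... then replaces it,
# so the header row never ends up in the data.
df = pd.read_csv(StringIO(raw_csv), header=0, names=["cat", "text"])

print(df.shape)
print(df["cat"].tolist())
```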


In [7]:
# print all the categories
categories = train_data_df.cat.unique()
print(categories)

['siyaset ' 'dunya ' 'ekonomi ' 'kultur ' 'saglik ' 'spor ' 'teknoloji ']
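Note that every category name in the output above carries a trailing space. This does not affect the integer encoding used later, but it can cause surprises when comparing strings such as `'siyaset'`. A minimal sketch of cleaning the column, using a small hypothetical frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical miniature of the `cat` column with the trailing spaces seen above.
df = pd.DataFrame({"cat": ["siyaset ", "dunya ", "siyaset "]})

# .str.strip() removes leading and trailing whitespace from every label.
df["cat"] = df["cat"].str.strip()

clean_categories = df["cat"].unique().tolist()
print(clean_categories)
```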


In [9]:
"""
We have to encode the category names as integers. I use the pandas factorize method to do this
and add the encoded category labels as an additional column, labels.
For example, siyaset is now encoded as 0.
"""
train_data_df['labels'] = pd.factorize(train_data_df.cat)[0]

train_data_df.head()

Unnamed: 0,cat,text,labels
0,siyaset,3 milyon ile ön seçim vaadi mhp nin 10 olağan...,0
1,siyaset,mesut_yılmaz yüce_divan da ceza alabilirdi pr...,0
2,siyaset,disko lar kaldırılıyor başbakan_yardımcısı ar...,0
3,siyaset,sarıgül anayasa_mahkemesi ne gidiyor mustafa_...,0
4,siyaset,erdoğan idamın bir haklılık sebebi var demek ...,0
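`pd.factorize` returns two things: the integer codes and the unique values in order of first appearance. The cell above keeps only the codes, so the sketch below shows one way to also keep the mapping, which is handy for turning predicted integers back into category names. The toy series and the names `id_to_cat` and `cat_to_id` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for train_data_df.cat
cats = pd.Series(["siyaset", "dunya", "siyaset", "spor"])

# codes: one integer per row; uniques: category names in order of first appearance
codes, uniques = pd.factorize(cats)

# Lookup tables in both directions (hypothetical helper names)
id_to_cat = dict(enumerate(uniques))
cat_to_id = {name: idx for idx, name in id_to_cat.items()}

print(codes.tolist())
print(id_to_cat)
```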


In [14]:
"""
We have to split the data into training and test sets so we can fine-tune the model on the
training set and measure its performance on the test set. I decided on 80% training data and 20% test data.
"""

train, test = train_test_split(train_data_df, test_size=0.2, random_state=99)
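The split above is purely random. A possible variation, not used in this notebook, is a stratified split, which keeps the class proportions identical in both sets. The sketch below demonstrates it on a synthetic frame with 7 balanced classes; the frame and the names `toy`, `train_s` and `test_s` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 70 rows, 7 classes with 10 rows each.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(70)],
    "labels": [i % 7 for i in range(70)],
})

# stratify=... makes the split preserve the label distribution.
train_s, test_s = train_test_split(
    toy, test_size=0.2, random_state=99, stratify=toy["labels"]
)

print(len(train_s), len(test_s))
print(test_s["labels"].value_counts().sort_index().tolist())
```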

## Fine-Tuning Experiments

I use the simpletransformers library's ClassificationModel class to create my BERT model and fine-tune it.


#### Experiment 1 - BERT Multilingual

In [15]:
model = ClassificationModel('bert', # model type
                            'bert-base-multilingual-uncased', # pretrained checkpoint to load; I start with
                                                              # multilingual BERT, which covers Turkish
                            num_labels=7, # we have 7 categories
                            use_cuda=False, # train on CPU
                            args={ # additional training arguments: number of epochs, output handling, etc.
                                'num_train_epochs': 3, # 3 epochs seemed to be enough in other people's experiments
                                                      # with BERT on this type of task. BERT is already pretrained
                                                      # and we only fine-tune it on our own dataset, so a low
                                                      # number of epochs is fine
                                'overwrite_output_dir': True,
                                'reprocess_input_data': True,
                            }
                           )

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 

In [16]:
model.train_model(train)

  0%|          | 0/3920 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/490 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/490 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/490 [00:00<?, ?it/s]

(1470, 0.4766262790647519)
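The numbers in the progress bars fit together if the training batch size is 8. That value is an assumption inferred from the output, since the notebook never sets it. The returned tuple then appears to be the total number of optimization steps followed by the average training loss. A quick check of the arithmetic:

```python
import math

train_rows = 3920   # from the first progress bar
epochs = 3          # 'num_train_epochs' in the model args
batch_size = 8      # assumption: inferred from the output, not set in the notebook

# Number of batches needed to see every training row once.
steps_per_epoch = math.ceil(train_rows / batch_size)
# Total optimization steps over all epochs.
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under that assumption the result is 490 steps per epoch and 1470 steps in total, matching the per-epoch progress bars and the first element of the tuple.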

In [17]:
# evaluate the fine-tuned model on the test set
result, model_outputs, wrong_predictions = model.eval_model(test)

predictions = model_outputs.argmax(axis=1)
actuals = test.labels.values

  0%|          | 0/980 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/123 [00:00<?, ?it/s]

In [20]:
# get accuracy score of my model
accuracy_score(actuals, predictions)

0.8948979591836734
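To make the evaluation step concrete: `model_outputs` holds one row of raw scores per test example and one column per class, `argmax(axis=1)` picks the highest-scoring class in each row, and accuracy is the share of rows where that pick equals the true label. A small sketch with made-up scores for 3 classes (the values are illustrative, not taken from the model):

```python
import numpy as np

# Made-up raw scores: 4 examples (rows) x 3 classes (columns).
model_outputs = np.array([
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.7],
    [0.9, 1.2, 0.4],
    [1.5, -0.2, 0.3],
])
actuals = np.array([0, 2, 0, 0])

# Index of the largest score in each row is the predicted class.
predictions = model_outputs.argmax(axis=1)

# Accuracy is the fraction of predictions that match the true labels.
accuracy = (predictions == actuals).mean()

print(predictions.tolist(), accuracy)
```

This is the same quantity that `accuracy_score(actuals, predictions)` computes.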

The performance is acceptable, but I'll try to improve it by using the bert-base-turkish-cased model in a second experiment.

#### Experiment 2 - BERT Turkish

The MDZ Digital Library team's Turkish BERT model is accessible at https://huggingface.co/dbmdz/bert-base-turkish-cased

In [22]:
model = ClassificationModel('bert', # model type
                            'dbmdz/bert-base-turkish-cased', # pretrained checkpoint to load; here I choose
                                                              # the Turkish cased BERT model
                            num_labels=7, # we have 7 categories
                            use_cuda=False, # train on CPU
                            args={ # additional training arguments: number of epochs, output handling, etc.
                                'num_train_epochs': 3, # 3 epochs seemed to be enough in other people's experiments
                                                      # with BERT on this type of task. BERT is already pretrained
                                                      # and we only fine-tune it on our own dataset, so a low
                                                      # number of epochs is fine
                                'overwrite_output_dir': True,
                                'reprocess_input_data': True,
                            }
                           )

Some weights of the model checkpoint at dbmdz/bert-base-turkish-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were 

In [23]:
model.train_model(train)

  0%|          | 0/3920 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/490 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/490 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/490 [00:00<?, ?it/s]

(1470, 0.2991740818034072)

In [24]:
# evaluate the fine-tuned model on the test set
result, model_outputs, wrong_predictions = model.eval_model(test)

predictions = model_outputs.argmax(axis=1)
actuals = test.labels.values

  0%|          | 0/980 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/123 [00:00<?, ?it/s]

In [25]:
# get accuracy score of my model
accuracy_score(actuals, predictions)

0.939795918367347
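The two accuracy scores are easier to compare as error counts. With 980 test rows (the size shown in the evaluation progress bars), the reported accuracies can be converted back into the number of misclassified examples. The variable names below are illustrative.

```python
test_rows = 980                         # from the evaluation progress bar
acc_multilingual = 0.8948979591836734   # Experiment 1
acc_turkish = 0.939795918367347         # Experiment 2

# Convert accuracies back into counts of wrong predictions.
errors_multilingual = test_rows - round(acc_multilingual * test_rows)
errors_turkish = test_rows - round(acc_turkish * test_rows)

# Share of the multilingual model's errors that the Turkish model avoids.
relative_error_reduction = (errors_multilingual - errors_turkish) / errors_multilingual

print(errors_multilingual, errors_turkish, round(relative_error_reduction, 3))
```

So the multilingual model misclassifies 103 test examples and the Turkish model 59, a reduction of roughly 43% in the number of errors on this particular split.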

As we can see, the Turkish model performs better: accuracy on the test set rises from about 0.895 to about 0.940. In the future, the model could be improved further by using more data, trying a different pretrained BERT model, or making other changes to the model.