# BERT text classification in Finnish. 
<br/>
Ekta Vats, Centre for Digital Humanities (CDHU), Department of ALM, Uppsala University <br>
ekta.vats@abm.uu.se <br> <br>

Matti La Mela, Department of ALM, Uppsala University <br> <br>

Using Simple Transformers - an NLP library based on the Transformers library by HuggingFace. <br>
Link: https://huggingface.co/docs/transformers/index

Dataset used: Berry corpus. <br>
Classification of OCR-ed texts into 2 categories (binary classification): <br>
Category 0: DESCRIPTIVE (i.e. descriptive articles) <br>
Category 1: ECONOMIC (i.e. economic-industrial articles) <br>

The binary division is between articles where berries / berry picking is mentioned only for some contextual or descriptive reason (category 0) and articles that treat berries as an economic activity (category 1). <br>
For example: <br> 
A snake bites a child who is picking berries => 0 <br>
Articles regarding selling berries, exports, industrial production, etc. => 1

Prerequisite: Install Simple Transformers: <br> 
Link: https://github.com/ThilinaRajapakse/simpletransformers

BERT models: we are using Finnish BERT models, and more models can be explored here: <br>
Link: https://huggingface.co/models?sort=downloads <br>
Use the search function to explore!

Note: This program runs on a CPU; CUDA support can be added for processing on a GPU. <br>
To do so, remove "use_cuda=False" from the ClassificationModel instance. <br>
Install (the version specifier is quoted so that the shell does not treat ">" as a redirect): <br>
conda install "pytorch>=1.6" cudatoolkit=11.0 -c pytorch
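
Instead of editing the "use_cuda" argument by hand, the device choice can be made at runtime. The following is a minimal sketch, assuming PyTorch may or may not be installed; the resulting flag could then be passed as `use_cuda=use_cuda` to the ClassificationModel instance.

```python
# Sketch: decide at runtime whether a GPU can be used.
try:
    import torch  # PyTorch may be missing in some environments
    use_cuda = bool(torch.cuda.is_available())
except ImportError:
    use_cuda = False  # no PyTorch found, so fall back to CPU

print("use_cuda =", use_cuda)
```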

In [1]:
import pandas as pd

In [2]:
class_list = ['0','1'] # 0 is descriptive; 1 is economic-industrial

In [3]:
# read the data
df = pd.read_csv('berries_class_binary.csv', sep=';', lineterminator='\n',encoding='utf8',names=["pred_class", "ocr_text"])

Note: For training, it is best to use verified OCR-ed texts that are free from errors. Texts used for testing may contain OCR errors.

In [4]:
df['ocr_text'] = df['ocr_text'].str.replace('\r', "") # clean

In [5]:
df['ocr_text'] = df['ocr_text'].str.replace('\n', "") # clean
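
Deleting line breaks outright joins the words on either side of the break when no space separates them. A possible alternative, sketched here with a hypothetical helper `clean_ocr_text` (not part of the original notebook), replaces line breaks with a space and collapses repeated whitespace. Whether this suits the corpus depends on how the OCR output handles words hyphenated at line ends.

```python
import re

def clean_ocr_text(text: str) -> str:
    """Replace line breaks with spaces and collapse repeated whitespace."""
    text = text.replace("\r", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample with Windows and Unix line breaks
sample = "Puolukkain wienti.\r\nWaasasta on taas\nhiljan lähetetty"
cleaned = clean_ocr_text(sample)
print(cleaned)
```

With pandas, the helper could be applied as `df['ocr_text'].map(clean_ocr_text)`.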

In [6]:
df.head()

Unnamed: 0,pred_class,ocr_text
0,0,Säten tallan laillifen ebesroasta= uksen uhall...
1,0,Käärmeen pistolta pelastunut. Eräs kaupnntimme...
2,0,"Kadonnut. Talollisen lesken Oma Tirkkosen, Lcp..."
3,0,Lapsi kadonnut- Sunnuntaina t. k. 16 p:nä ilta...
4,0,Kertomus Lllppccnilln- Nllu MMachtoiscstll Pal...


In [7]:
df = df[['ocr_text','pred_class']]

In [8]:
print(df.shape)
df.head()

(415, 2)


Unnamed: 0,ocr_text,pred_class
0,Säten tallan laillifen ebesroasta= uksen uhall...,0
1,Käärmeen pistolta pelastunut. Eräs kaupnntimme...,0
2,"Kadonnut. Talollisen lesken Oma Tirkkosen, Lcp...",0
3,Lapsi kadonnut- Sunnuntaina t. k. 16 p:nä ilta...,0
4,Kertomus Lllppccnilln- Nllu MMachtoiscstll Pal...,0


In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_df, test_df = train_test_split(df, test_size=0.10) # 90% training and 10% testing
print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)

train shape:  (373, 2)
test shape:  (42, 2)
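
The split above is random and changes on every run, and with only 42 test rows the class balance can drift. A minimal sketch of a repeatable, stratified split follows; the toy DataFrame and the seed value 42 are assumptions made for illustration, while `stratify` and `random_state` are standard parameters of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the berry corpus: 30 descriptive (0) and 10 economic (1) rows
toy_df = pd.DataFrame({
    "ocr_text": [f"text {i}" for i in range(40)],
    "pred_class": [0] * 30 + [1] * 10,
})

train_toy, test_toy = train_test_split(
    toy_df,
    test_size=0.10,
    stratify=toy_df["pred_class"],  # keep the class ratio in both parts
    random_state=42,                # fixed seed so the split is repeatable
)

print(train_toy["pred_class"].value_counts().to_dict())
print(test_toy["pred_class"].value_counts().to_dict())
```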


# Load pre-trained model

Create a ClassificationModel instance with parameters:<br>

Architecture (e.g. "bert") <br>
Pre-trained model ("TurkuNLP/bert-base-finnish-cased-v1")<br>
No. of class labels (2)<br>
Hyperparameters for training (train_args)

In [11]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "fp16":False,
             "num_train_epochs": 4}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "TurkuNLP/bert-base-finnish-cased-v1",
    num_labels=2,
    use_cuda=False, #cpu only
    args=train_args
)

Some weights of the model checkpoint at TurkuNLP/bert-base-finnish-cased-v1 were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the m

# Model training/fine-tuning

In [13]:
model.train_model(train_df)



  0%|          | 0/373 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
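
The warning is harmless here, and it names its own remedy: set the environment variable explicitly. A minimal sketch, assuming it runs in the first notebook cell, before the tokenizer is used:

```python
import os

# Set before the tokenizer is first used so that forked workers
# do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```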




Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Running Epoch 0 of 4:   0%|          | 0/47 [00:00<?, ?it/s]

Running Epoch 1 of 4:   0%|          | 0/47 [00:00<?, ?it/s]

Running Epoch 2 of 4:   0%|          | 0/47 [00:00<?, ?it/s]

Running Epoch 3 of 4:   0%|          | 0/47 [00:00<?, ?it/s]

(188, 0.153899671698108)
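
The returned tuple appears to hold the total number of training steps and the average training loss. The step count can be reproduced from the numbers shown in the progress bars. The batch size of 8 is an assumption: it is not set in train_args, but it is consistent with the 47 batches per epoch reported above.

```python
import math

n_train = 373      # rows in train_df
n_test = 42        # rows in test_df
batch_size = 8     # assumption: library default, not set in train_args
num_epochs = 4     # "num_train_epochs" in train_args

batches_per_epoch = math.ceil(n_train / batch_size)
total_steps = batches_per_epoch * num_epochs
eval_batches = math.ceil(n_test / batch_size)

print(batches_per_epoch, total_steps, eval_batches)  # 47 188 6
```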

# Evaluate the results of training

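The helper function f1_multiclass() computes the micro-averaged F1 score. For single-label classification, micro-averaged F1 equals accuracy, which is why 'f1' and 'acc' are identical in the results. <br/>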
Derived from https://www.philschmid.de/bert-text-classification-in-a-different-language

In [14]:
from sklearn.metrics import f1_score, accuracy_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)
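
To see what `average='micro'` does, the following sketch re-implements micro-averaged F1 in plain Python and compares it with accuracy. The functions `micro_f1` and `accuracy` and the label lists are made up for illustration; they are not part of scikit-learn or of the original notebook.

```python
def micro_f1(labels, preds):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    classes = set(labels) | set(preds)
    tp = fp = fn = 0
    for c in classes:
        for y, p in zip(labels, preds):
            if p == c and y == c:
                tp += 1      # predicted c, truly c
            elif p == c and y != c:
                fp += 1      # predicted c, truly another class
            elif p != c and y == c:
                fn += 1      # truly c, predicted another class
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(labels, preds):
    """Share of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)

labels = [0, 0, 1, 1, 0, 1, 0, 0]  # made-up true labels
preds = [0, 1, 1, 1, 0, 0, 0, 0]   # made-up predictions
print(micro_f1(labels, preds), accuracy(labels, preds))
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled score collapses to accuracy. A macro-averaged F1 would be more informative when the classes are imbalanced.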



  0%|          | 0/42 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/6 [00:00<?, ?it/s]

# Results!

In [30]:
result

{'mcc': 0.9468641529479986,
 'tp': 13,
 'tn': 28,
 'fp': 1,
 'fn': 0,
 'auroc': 0.9973474801061009,
 'auprc': 0.9945054945054943,
 'f1': 0.9761904761904762,
 'acc': 0.9761904761904762,
 'eval_loss': 0.07071196141381127}
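
The summary metrics can be checked by hand from the four confusion-matrix counts in the dictionary above. This is a plain-Python sketch using the standard formulas; precision and recall are not reported by the library here and are added only for illustration.

```python
import math

tp, tn, fp, fn = 13, 28, 1, 0  # counts from the result dictionary

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted ECONOMIC that is correct
recall = tp / (tp + fn)      # share of true ECONOMIC that was found
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} mcc={mcc:.4f}")
```

The single error is a false positive: one descriptive article was classified as economic.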

In [31]:
print("The accuracy is",result["acc"])

The accuracy is 0.9761904761904762


In [27]:
#print(wrong_predictions)

# Save the trained model

The model is automatically saved in the default directory "outputs" after every 2000 steps and at the end of training. This run has only 188 steps, so no intermediate 2000-step checkpoint is written.
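
The saving behaviour can be adjusted through the training arguments. The sketch below extends the train_args dictionary used earlier with saving-related options; the values are illustrative assumptions, not settings from the original run.

```python
# Sketch of saving-related options; values are illustrative assumptions.
train_args = {
    "reprocess_input_data": True,
    "fp16": False,
    "num_train_epochs": 4,
    "output_dir": "outputs/",      # where checkpoints and the final model go
    "overwrite_output_dir": True,  # allow training again into the same directory
    "save_steps": 2000,            # checkpoint interval in training steps
}
```

Setting "overwrite_output_dir" is likely to matter when the training cell is re-run, since the library typically refuses to write into a non-empty output directory otherwise.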

# Predict!

Use an unseen OCR-ed text from the berry corpus to predict the category (descriptive or economic) that it belongs to.

In [15]:
from simpletransformers.classification import ClassificationModel

# define hyperparameter
train_args ={"reprocess_input_data": True,
             "fp16":False,
             "num_train_epochs": 4}

# Create a ClassificationModel with our trained model
model = ClassificationModel(
    "bert", 'outputs/',
    num_labels=2,
    use_cuda=False, # cpu only
    args=train_args
)

In [34]:
class_list = ['0','1']
#class_list = ['DESCRIPTIVE','ECONOMIC']

In [33]:
# test_ocr1 is taken from excel file, last row
test_ocr1 = "Ilmoitus. Kaikenlainen metsästäminen, kalastaminen, marjain keruu ja metsien raihnaaminen falon< uhalla kielletään omistamiltani metsiltä ja rantamilta, Ilmarila, Rauhala ja loukola (Pört« sykänlahti, Pörtsykän-niemi ja Kolusalmi.) Samallainen kielto koskee entistä Kapakka» saarta nykyistä Walamoa. Käkisalmella 21 päiwänä heiniik. 1905. A. W. Mausnerus. Antti Ahtiainen."

predictions, raw_outputs = model.predict([test_ocr1])

print(class_list[predictions[0]])

  0%|          | 0/1 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

0
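
Besides the predicted label, `model.predict` also returns `raw_outputs`. Assuming these are unnormalised per-class scores (logits), one row per input text, a softmax turns them into values that can be read as rough confidence estimates. The logits below are made up and are not output from the trained model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.3, -1.9]  # made-up scores for classes 0 and 1
probs = softmax(example_logits)
print(probs)
```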


# Expert's reflection on test_ocr1:
Berry picking is forbidden on this person’s land. Should be in category 0 (thus descriptive, non-economic).

In [35]:
# test_ocr2 is a "seen" category 0 used only for verification
test_ocr2 = "Säten tallan laillifen ebesroasta= uksen uhalla kalastamisen, metsästämisen, tuin myös kaikenlaisen metsän haaskuun. etenkin marjan poimimisen Kukkarin saarista Sumasmeden rannalta. @nonlat)besfa Huh- tikuun 29 p. 1885. Aleks. Koponen."
predictions, raw_outputs = model.predict([test_ocr2])

print(class_list[predictions[0]])

  0%|          | 0/1 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

0


In [36]:
# test_ocr3 is a "seen" category 1 used only for verification
test_ocr3 = "Puolukkain wienti. Waasasta on taas hiljan lähetetty ..Iris laimassa Satsaan 1.100 laatikkoa puolukoita, kussakin laatikossa 22 a 23 kappaa. Walmiina lähctettämätsi ..Björn laimalla on miclä 1,900 laatikkoa, sisältäen yhteenjä noin 40,000 tappaa. Kun puolukkain hinta on 20 p. kapalta, omat ne siis samassa hinnassa kuin kaurat. — W. T."
predictions, raw_outputs = model.predict([test_ocr3])

print(class_list[predictions[0]])

  0%|          | 0/1 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

1


In [20]:
test_df.head() #to visualise test samples

Unnamed: 0,ocr_text,pred_class
350,Marjojen säilljttäminen ja (täyttäminen ta-lou...,1
320,Puolukkain wienti. Waasasta on taas hiljan läh...,1
39,Susi on taas tappanut lapsen. Torppari Isat Ha...,0
153,Keisarillisten olosta Barösundissa kirjoitetaa...,0
85,Käärmeenpistoon kuollut. Itäwä tapaus satwi Ka...,0


In [37]:
# test_ocr4 is an unseen test sample no. 39
test_ocr4 = "Susi on taas tappanut lapsen. Torppari Isak Hartman'in Wehmalaisten kylässä Karjalassa !> wuotias poika on joutunut suden saaliiksi wiime kuun 22 p:nä. Onneton oli marjoja poimimassa nuoremman »veljensä kanssa, kun peto hänen kohtasi ja tappoi — niin on taaskin luettaivana Turun sanomissa. Tämä on jo sitte wiime huhtikuun kuudes lapsi, joka meidän pienessä maassamme «n joutunut suden saaliiksi. Woi kauheata!"
predictions, raw_outputs = model.predict([test_ocr4])

print(class_list[predictions[0]])

  0%|          | 0/1 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

0
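
Since `model.predict` takes a list, several texts can be classified in one call, for example `model.predict([test_ocr1, test_ocr2, test_ocr3, test_ocr4])`. The sketch below shows one way to turn the numeric predictions into readable category names; `CLASS_NAMES` and `to_class_names` are hypothetical names introduced for illustration.

```python
# Index 0 and 1 follow the category definitions in the introduction
CLASS_NAMES = ["DESCRIPTIVE", "ECONOMIC"]

def to_class_names(predictions):
    """Map numeric class indices to their category names."""
    return [CLASS_NAMES[int(p)] for p in predictions]

example_predictions = [0, 0, 1, 0]  # the four predictions printed above
names = to_class_names(example_predictions)
print(names)
```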
