Skip to content

gesistsa/grafzahl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

grafzahl

CRAN status

The goal of grafzahl (Gracious R Analytical Framework for Zappy Analysis of Human Languages [1]) is to duct tape the quanteda ecosystem to modern Transformer-based text classification models, e.g. BERT, RoBERTa, etc. The model object looks and feels like the textmodel S3 object from the package quanteda.textmodels.

If you don’t know what I am talking about, don’t worry, this package is gracious. You don’t need to know a lot about Transformers to use this package. See the examples below.

Please cite this software as:

Chan, C., (2023). grafzahl: fine-tuning Transformers for text data from within R. Computational Communication Research 5(1): 76-84. https://doi.org/10.5117/CCR2023.1.003.CHAN

Installation: Local environment

Install the CRAN version

install.packages("grafzahl")

After that, you need to setup your conda environment

require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)

On remote environments, e.g. Google Colab

On Google Colab, you need to enable non-Conda mode

install.packages("grafzahl")
require(grafzahl)
use_nonconda()

Please refer the vignette.

Usage

Suppose you have a bunch of tweets in the quanteda corpus format. And the corpus has exactly one docvar that denotes the labels you want to predict. The data is from this repository (Theocharis et al., 2020).

unciviltweets
#> Corpus consisting of 19,982 documents and 1 docvar.
#> text1 :
#> "@ @ Karma gave you a second chance yesterday.  Start doing m..."
#> 
#> text2 :
#> "@ With people like you, Steve King there's still hope for we..."
#> 
#> text3 :
#> "@ @ You bill is a joke and will sink the GOP. #WEDESERVEBETT..."
#> 
#> text4 :
#> "@ Dream on. The only thing trump understands is how to enric..."
#> 
#> text5 :
#> "@ @ Just like the Democrat taliban party was up front with t..."
#> 
#> text6 :
#> "@ you are going to have more of the same with HRC, and you a..."
#> 
#> [ reached max_ndoc ... 19,976 more documents ]

In order to train a Transfomer model, please select the model_name from Hugging Face’s list. The table below lists some common choices. In most of the time, providing model_name is sufficient, there is no need to provide model_type.

Suppose you want to train a Transformer model using “bertweet” (Nguyen et al., 2020) because it matches your domain of usage. By default, it will save the model in the output directory of the current directory. You can change it to elsewhere using the output_dir parameter.

model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")
### If you are hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
##                                model_type = "bertweet", model_name = "vinai/bertweet-base")

Make prediction

predict(model)

That is it.

Extended examples

Several extended examples are also available.

Examples file
van Atteveldt et al. (2021) paper/vanatteveldt.md
Dobbrick et al. (2021) paper/dobbrick.md
Theocharis et al. (2020) paper/theocharis.md
OffensEval-TR (2020) paper/coltekin.md
Amharic News Text classification Dataset (2021) paper/azime.md

Some common choices of model_name

Your data model_type model_name
English tweets bertweet vinai/bertweet-base
Lightweight mobilebert google/mobilebert-uncased
distilbert distilbert-base-uncased
Long Text longformer allenai/longformer-base-4096
bigbird google/bigbird-roberta-base
English (General) bert bert-base-uncased
bert bert-base-cased
electra google/electra-small-discriminator
roberta roberta-base
Multilingual xlm xlm-mlm-17-1280
xml xlm-mlm-100-1280
bert bert-base-multilingual-cased
xlmroberta xlm-roberta-base
xlmroberta xlm-roberta-large

References

  1. Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.
  2. Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.

  1. Yes, I totally made up the meaningless long name. Actually, it is the German name of the Sesame Street character Count von Count, meaning “Count (the noble title) Number”. And it seems to be so that it is compulsory to name absolutely everything related to Transformers after Seasame Street characters.