golgotha - Contextualised Embeddings and Language Modelling using BERT and Friends using R

  • This R package wraps the Python transformers module using reticulate
  • The objective of the package is to easily obtain sentence embeddings with a BERT-like model in R, for use in downstream modelling (e.g. Support Vector Machines, sentiment labelling, classification, regression, POS tagging, lemmatisation, text similarities)
  • Golgotha: hope for lonely AI pilgrims on their way to losing CPU power: http://costes.org/cdbm20.mp3

Installation

  • To install the development version of this package:
    • Execute in R: devtools::install_github("bnosac/golgotha", INSTALL_opts = "--no-multiarch")
    • Have a look at the documentation of the functions: help(package = "golgotha")
  • The package needs the Python transformers module to be available through reticulate (a quick check is sketched below)
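Since the package wraps the Python transformers module through reticulate, that module has to be importable from the Python environment which reticulate picks up. The lines below are a minimal sketch of such a check; they assume you let reticulate manage the Python side (reticulate::py_install is used here purely as an illustration, adapt it to however you manage your Python environment).

library(reticulate)
## check whether the Python 'transformers' module is importable
if (!py_module_available("transformers")) {
  ## sketch: install it into the Python environment used by reticulate
  py_install("transformers")
}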

Example

  • Download a model (e.g. BERT multilingual uncased)
library(golgotha)
bert_download_model("bert-base-multilingual-uncased")
  • Load the model and get the embeddings of sentences / subword tokens, or just tokenise (a small similarity sketch using these embeddings follows the code below)
model <- BERT("bert-base-multilingual-uncased")
x <- data.frame(doc_id = c("doc_1", "doc_2"),
                text = c("give me back my money or i'll call the police.",
                         "talk to the hand because the face don't want to hear it any more."),
                stringsAsFactors = FALSE)
embedding <- predict(model, x, type = "embed-sentence")
embedding <- predict(model, x, type = "embed-token")
tokens    <- predict(model, x, type = "tokenise")
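As an illustration of downstream use (e.g. text similarities), the sketch below computes the cosine similarity between the sentence embeddings of doc_1 and doc_2. It assumes that type = "embed-sentence" returns a numeric matrix with one row per document; adapt it if the returned structure differs in your version of the package.

## cosine similarity between the sentence embeddings of the 2 documents
## assumption: type = "embed-sentence" gives a numeric matrix with one row per document
emb    <- predict(model, x, type = "embed-sentence")
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
cosine(emb[1, ], emb[2, ])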
  • Same example but now on Dutch / French text (a small clustering sketch on these embeddings follows the code below)
text <- c("vlieg met me mee naar de horizon want ik hou alleen van jou",
          "l'amour n'est qu'un enfant de pute, il agite le bonheur mais il laisse le malheur",
          "http://costes.org/cdso01.mp3", 
          "http://costes.org/mp3.htm")
text <- setNames(text, c("doc_nl", "doc_fr", "le petit boudin", "thebible"))
embedding <- predict(model, text, type = "embed-sentence")
embedding <- predict(model, text, type = "embed-token")
tokens    <- predict(model, text, type = "tokenise")
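These sentence embeddings can likewise feed into clustering. The sketch below groups the 4 documents with hierarchical clustering on the Euclidean distances between their sentence embeddings, again under the assumption that the embeddings come back as a numeric matrix with one row per document.

## hierarchical clustering of the documents on their sentence embeddings
## assumption: one row per document in the embedding matrix
emb <- predict(model, text, type = "embed-sentence")
d   <- dist(emb)
plot(hclust(d))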
  • Some other models (each model is downloaded once before loading, as sketched below)
model <- BERT("bert-base-multilingual-uncased")
model <- BERT("bert-base-multilingual-cased")
model <- BERT("bert-base-dutch-cased")
model <- BERT("bert-base-uncased")
model <- BERT("bert-base-cased")
model <- BERT("bert-base-chinese")