NLPstudio

NLPstudio is an R package for scalable text analysis in research workflows. It is built around quanteda, data.table, and portable parallel backends, with particular attention to reproducible social science research on financial disclosures, regulatory filings, and other structured document collections.

The package has two main workflows:

  • Corpus preparation and document-level text analysis, from SEC-style JSON files to quanteda corpora, tokens, dictionaries, readability, similarity, and export-ready tables.
  • A consistent topic-model API for fitting, adopting, evaluating, selecting, diagnosing, summarizing, and exporting topic models across supported R backends.

The detailed reference manual and vignettes are published at contefranz.github.io/NLPstudio.

Release Status

NLPstudio v1.0.0 is the first stable public release. The package is intended for reproducible social science text-analysis workflows, with stable output schemas for the core corpus and topic-model APIs. Repository archiving and DOI minting through Zenodo are handled from the public GitHub release.

The topic-model output schemas are frozen for v1.0.0: nlp_topic_fit, nlp_k_selection, nlp_k_selection_summary, nlp_topic_stability, topic summaries, STM topic summaries, and STM topic-effect tables. Evaluation and selection outputs retain the standard columns metric, level, topic_id, value, and supported; aggregate rows use topic_id = NA.
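As a rough illustration of how the standard columns might be used downstream, the sketch below builds a small table by hand and separates aggregate rows from per-topic rows. The table is hypothetical: only the column names and the topic_id = NA convention come from the text above, while the values, the level labels ("model", "topic"), and the logical type of supported are assumptions made for this example.

```r
# Hypothetical evaluation table that reuses the documented column names.
# Values, `level` labels, and the type of `supported` are illustrative
# assumptions, not real NLPstudio output.
eval_tbl <- data.frame(
  metric    = c("diversity", "exclusivity", "exclusivity",
                "coherence_umass", "coherence_umass"),
  level     = c("model", "topic", "topic", "topic", "topic"),
  topic_id  = c(NA, 1L, 2L, 1L, 2L),
  value     = c(0.875, 0.61, 0.58, -1.2, -1.5),
  supported = TRUE
)

# Aggregate rows carry topic_id = NA, so is.na() separates the two kinds of row.
aggregate_rows <- eval_tbl[is.na(eval_tbl$topic_id), ]
topic_rows     <- eval_tbl[!is.na(eval_tbl$topic_id), ]

# Reshape the per-topic rows to one row per topic and one column per metric.
topic_wide <- reshape(
  topic_rows[, c("topic_id", "metric", "value")],
  idvar     = "topic_id",
  timevar   = "metric",
  direction = "wide"
)
```

Because the column set is shared across evaluation and selection outputs, the same filtering step should carry over from one table to the other, provided the real output follows the convention described above.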

Installation

Install NLPstudio from GitHub with pak:

install.packages("pak")
pak::pkg_install("contefranz/NLPstudio")

Some modeling backends are optional. Install backend packages only when you need them; for example, STM support requires stm, and embedded topic models require both topicmodels.etm and a working torch backend.
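One way to check which optional backends are present before fitting a model is sketched below. It uses only base R and the package names mentioned above; the object names are illustrative and not part of NLPstudio.

```r
# Packages named above as requirements for optional backends.
optional_backends <- c("stm", "topicmodels.etm", "torch")

# requireNamespace() returns TRUE or FALSE and does not attach the package.
available <- vapply(
  optional_backends,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

# Report anything that still needs to be installed.
missing_pkgs <- names(available)[!available]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

This only confirms that the R packages are installed. For torch, the underlying libraries must also be present; torch::torch_is_installed() reports whether they are.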

Quick Example

library(NLPstudio)
library(quanteda)

docs <- data.frame(
  doc_id = paste0("doc", 1:6),
  text = c(
    "Revenue growth improved after subscription demand increased.",
    "Operating margin expanded as cloud costs declined.",
    "Audit committee oversight focused on internal controls.",
    "Risk disclosures emphasized liquidity and refinancing pressure.",
    "Customer retention supported recurring software revenue.",
    "Debt covenants and interest expense shaped capital allocation."
  )
)

# Build a corpus, then tokenize, lowercase, and drop English stopwords.
corp <- quanteda::corpus(docs, text_field = "text", docid_field = "doc_id")
toks <- quanteda::tokens(corp, remove_punct = TRUE)
toks <- quanteda::tokens_tolower(toks)
toks <- quanteda::tokens_remove(toks, pattern = quanteda::stopwords("en"))

# Document-feature matrix used as the model input.
dfm <- quanteda::dfm(toks)

fit <- fit_topic_model(
  dfm,
  engine = "topicmodels",
  model = "lda",
  method = "Gibbs",
  k = 2,
  control = list(fit = list(seed = 1L, iter = 50L, burnin = 0L, thin = 1L))
)

get_top_terms(fit, n = 4)
evaluate_topic_model(
  fit,
  training = dfm,
  metrics = c("diversity", "exclusivity", "coherence_umass"),
  top_n = 4L
)

For complete workflows, see the vignettes at contefranz.github.io/NLPstudio.

Citation

If you use NLPstudio in academic work, please cite the package. Citation metadata is available from R:

citation("NLPstudio")

Author

Francesco Grossetti
Assistant Professor of Accounting Analytics and Data Science
Department of Accounting, Bocconi University
Fellow at Bocconi Institute for Data Science and Analytics (BIDSA)
Contact: francesco.grossetti@unibocconi.it
