NLPstudio is an R package for scalable text analysis in research workflows. It is built around quanteda, data.table, and portable parallel backends, with particular attention to reproducible social science workflows, including financial disclosures, regulatory filings, and other structured document collections.
The package has two main workflows:
- Corpus preparation and document-level text analysis, from SEC-style JSON files to quanteda corpora, tokens, dictionaries, readability, similarity, and export-ready tables.
- A consistent topic-model API for fitting, adopting, evaluating, selecting, diagnosing, summarizing, and exporting topic models across supported R backends.
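As a rough illustration of the first workflow, the sketch below goes from JSON records to a quanteda corpus, tokens, and a dictionary-count table. It uses only jsonlite and quanteda calls, not NLPstudio's own helpers, and the JSON field names (`accession`, `text`), the toy records, and the dictionary entries are assumptions made for the example.

```r
library(quanteda)
library(jsonlite)

# Hypothetical SEC-style records; the field names are assumptions.
json_txt <- '[
  {"accession": "filing-001", "text": "Liquidity risk increased as refinancing costs rose."},
  {"accession": "filing-002", "text": "Revenue growth was supported by subscription demand."}
]'

# A JSON array of flat objects parses to a data.frame.
filings <- jsonlite::fromJSON(json_txt)

# Build a corpus, then lower-cased tokens without punctuation.
filings_corp <- quanteda::corpus(filings, text_field = "text", docid_field = "accession")
filings_toks <- quanteda::tokens(filings_corp, remove_punct = TRUE)
filings_toks <- quanteda::tokens_tolower(filings_toks)

# Illustrative dictionary; glob patterns such as "refinanc*" are quanteda's default.
dict <- quanteda::dictionary(list(
  risk   = c("risk", "liquidity", "refinanc*"),
  growth = c("revenue", "growth", "demand")
))

# Count dictionary hits per document and convert to an export-ready table.
dict_dfm <- quanteda::dfm(quanteda::tokens_lookup(filings_toks, dictionary = dict))
dict_tbl <- quanteda::convert(dict_dfm, to = "data.frame")
dict_tbl
```

The resulting table has one row per document and one column per dictionary key, which is the kind of shape that can be written out with any standard export function.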
The detailed reference manual and vignettes are published at contefranz.github.io/NLPstudio.
NLPstudio v1.0.0 is the first stable public release. The package is intended
for reproducible social science text-analysis workflows, with stable output
schemas for the core corpus and topic-model APIs. Repository archiving and DOI
minting through Zenodo are handled from the public GitHub release.
The topic-model output schemas are frozen for v1.0.0: nlp_topic_fit,
nlp_k_selection, nlp_k_selection_summary, nlp_topic_stability, topic
summaries, STM topic summaries, and STM topic-effect tables. Evaluation and
selection outputs retain the standard columns metric, level, topic_id,
value, and supported; aggregate rows use topic_id = NA.
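The column convention above can be illustrated with a small mock table. This is a sketch in base R, not output produced by NLPstudio: the `level` labels ("model", "topic") and all numeric values are placeholders assumed for the example. Only the column names and the `topic_id = NA` rule for aggregate rows come from the description above.

```r
# Mock evaluation table with the standard columns.
# The level labels and values are illustrative assumptions.
eval_tbl <- data.frame(
  metric    = c("diversity", "exclusivity", "exclusivity"),
  level     = c("model", "topic", "topic"),
  topic_id  = c(NA, 1L, 2L),
  value     = c(0.75, 0.60, 0.55),
  supported = c(TRUE, TRUE, TRUE)
)

# Aggregate rows are the ones with a missing topic_id.
aggregate_rows <- eval_tbl[is.na(eval_tbl$topic_id), ]

# Topic-level rows carry an explicit topic_id.
topic_rows <- eval_tbl[!is.na(eval_tbl$topic_id), ]

aggregate_rows
topic_rows
```

Splitting on `is.na(topic_id)` rather than on a label keeps downstream code independent of whatever values the `level` column uses.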
Install NLPstudio from GitHub with pak:
install.packages("pak")
pak::pkg_install("contefranz/NLPstudio")
Some modeling backends are optional. Install backend packages only when you need them; for example, STM support requires stm, and embedded topic models require both topicmodels.etm and a working torch backend.
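To see which optional backends are already available before fitting a model, a check along the following lines may help. It is a sketch using base R's `requireNamespace()`, not an NLPstudio function, and it assumes the package names mentioned above (stm, topicmodels.etm, torch) are the ones to look for.

```r
# Optional backend packages named above.
optional_backends <- c("stm", "topicmodels.etm", "torch")

# TRUE for each package that can be loaded, without attaching it.
installed <- vapply(optional_backends, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# torch also needs its native libraries; only check when the package is present.
if (installed[["torch"]]) {
  print(torch::torch_is_installed())
}

# List whatever is still missing.
missing_backends <- optional_backends[!installed]
if (length(missing_backends) > 0) {
  message("Not installed: ", paste(missing_backends, collapse = ", "))
}
```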
library(NLPstudio)
library(quanteda)
docs <- data.frame(
  doc_id = paste0("doc", 1:6),
  text = c(
    "Revenue growth improved after subscription demand increased.",
    "Operating margin expanded as cloud costs declined.",
    "Audit committee oversight focused on internal controls.",
    "Risk disclosures emphasized liquidity and refinancing pressure.",
    "Customer retention supported recurring software revenue.",
    "Debt covenants and interest expense shaped capital allocation."
  )
)
corp <- quanteda::corpus(docs, text_field = "text", docid_field = "doc_id")
toks <- quanteda::tokens(corp, remove_punct = TRUE)
toks <- quanteda::tokens_tolower(toks)
toks <- quanteda::tokens_remove(toks, pattern = quanteda::stopwords("en"))
dfm <- quanteda::dfm(toks)
fit <- fit_topic_model(
  dfm,
  engine = "topicmodels",
  model = "lda",
  method = "Gibbs",
  k = 2,
  control = list(fit = list(seed = 1L, iter = 50L, burnin = 0L, thin = 1L))
)
get_top_terms(fit, n = 4)
evaluate_topic_model(
  fit,
  training = dfm,
  metrics = c("diversity", "exclusivity", "coherence_umass"),
  top_n = 4L
)
For complete workflows, see the vignettes published at contefranz.github.io/NLPstudio.
If you use NLPstudio in academic work, please cite the package. Citation metadata is available from R:
citation("NLPstudio")
Francesco Grossetti
Assistant Professor of Accounting Analytics and Data Science
Department of Accounting, Bocconi University
Fellow at Bocconi Institute for Data Science and Analytics (BIDSA)
Contact: francesco.grossetti@unibocconi.it
