Skip to content

devon7y/semanticfa

Repository files navigation

semanticfa

Response-free semantic analysis of psychometric scales.

semanticfa reads the meaning of a scale's item wording with a language model and recovers, interprets, and refines the scale's latent structure — entirely from the items, with no human response data. Factor analysis on the item embeddings is the centerpiece, but the package is a full toolkit for working with a scale before (or without) collecting data: building semantic similarity matrices, deciding how many factors to keep, reading a semantic "loadings" table, comparing the recovered structure to theory, flagging redundant items, building short forms, vetting brand-new candidate items, detecting jingle/jangle fallacies across scales, and visualizing the item space.

Installation

# from CRAN (once available):
install.packages("semanticfa")

# development version:
# install.packages("remotes")
remotes::install_github("devon7y/semanticfa")

The core of the package is pure R. Turning item text into embeddings on your machine uses Python via reticulate — needed only if you want the package to embed text for you (you can always bring your own embeddings):

sfa_install_python()   # one-time: provisions sentence-transformers

Two-dimensional item maps work out of the box (Rtsne and uwot are bundled). One optional package, EGAnet, powers EGA-based factor retention / dimension selection and the faithful UVA redundancy method — install it only if you use those parts.

Quick start

library(semanticfa)
data(big5)   # 50 IPIP Big-Five items + precomputed sentence-BERT embeddings

# one call: embed -> similarity -> retain -> extract -> diagnose
fit <- sfa(
  data.frame(code = big5$codes, item = big5$items,
             factor = big5$factors, scoring = big5$scoring),
  embeddings = big5$embeddings, nfactors = 5)
fit

# interpret and refine, all from the same fit
plot(fit, type = "scree")          # scree with parallel-analysis overlay
sfa_corplot(fit)                   # item-by-item similarity heatmap, grouped by factor
sfa_anchor(fit)                    # item-by-construct "belonging" (a semantic loadings table)
sfa_congruence(fit, target = big5$factors,    # agreement with theory (partition metrics)
               metrics = c("nmi", "ari"))
sfa_redundancy(fit)                # near-duplicate items

No respondents are involved at any step.

What's in the box

1. Embed text and build a similarity matrix

Function Purpose
sfa_embed() Embed item text — on-device sentence-BERT (Qwen, default), the OpenAI API, or any custom function. Results are cached.
sfa_load_npz() Load pre-generated embeddings (e.g. a GPU job) from a NumPy .npz, no Python needed.
sfa_similarity() Item-by-item similarity matrix with a choice of four encodings (below).
sfa_nli_matrix() Signed, valence-aware similarity from natural-language inference (entailment − contradiction), so reverse-keyed items are handled directly.
sfa_install_python(), sfa_clear_cache() Provision the embedding environment / clear the cache.

2. Recover the factor structure

Function Purpose
sfa() The end-to-end pipeline: embed → similarity → retain → extract → diagnose. Accepts raw text, precomputed embeddings, an sfa_embeddings object, or a precomputed similarity matrix.
sfa_nfactors() How many factors to keep — parallel analysis, Kaiser, TEFI, and EGA in one call.
sfa_parallel() Embedding-adapted parallel analysis (random-unit-vector null; no sample size needed).
sfa_dimselect() Select the informative leading embedding coordinates ("depth") by EGA depth optimization.
as_psych() Hand the solution to psych (factor.congruence(), fa.sort(), …) as a standard fa object.

3. Interpret the structure

Function Purpose
sfa_anchor() An item-by-construct belonging matrix — a semantic loadings table — built from construct centroids and/or construct-name embeddings.
sfa_project() Place items on interpretable bipolar axes (e.g. mildsevere, passiveactive).
sfa_congruence() Compare the recovered structure to an empirical or theoretical one: Tucker φ, NMI, ARI, Frobenius, and disattenuated correlation.
sfa_jinglejangle() Flag jingle (same name, different content) and jangle (different name, same content) fallacies across multiple scales.

4. Refine the scale — before collecting data

Function Purpose
sfa_redundancy() Detect near-duplicate items via faithful Unique Variable Analysis (absolute wTO on an EBICglasso network) or a direct cosine criterion.
sfa_simplify() Build response-free short forms by selecting the most representative items per factor.
sfa_item_fit() Vet a brand-new candidate item: how well does it match the construct name and the other items, and is it redundant with any of them?

5. Visualize

Function Purpose
sfa_corplot() Heatmap of the item-by-item similarity matrix, grouped/ordered by factor (order accepts factor-name abbreviations, e.g. c("D","A","S")).
sfa_itemplot() 2-D item map via t-SNE, UMAP, PCA, or MDS (sfa_tsneplot() is a deprecated alias).
plot(fit, "scree") Scree plot with the parallel-analysis overlay.

Fit diagnostics

Every sfa() fit reports KMO, a real partition-based TEFI (negative; lower is better), RMSR, CAF, McDonald's ω, and — when theoretical factors are supplied — a factor-to-theory alignment matrix (DAAL). summary(fit) adds the full breakdown, and calibrate = TRUE adds a Monte Carlo null reference for the diagnostics.

Encoding methods (sfa_similarity(..., encoding=))

Method Description Keying
"atomic" (default) L2-normalize, cosine similarity uses scoring sign-flip
"atomic_reversed" Sign-flip reverse-keyed items, L2-normalize, cosine uses scoring sign-flip
"squid" Subtract the questionnaire-mean embedding, then cosine keying-free
"mean_centered_pearson" Mean-center → cosine = Pearson correlation keying-free

Bundled data

data(big5) — the 50-item IPIP Big-Five markers (public domain) with precomputed sentence-transformers/all-MiniLM-L6-v2 embeddings, so every example runs without Python or network access.

Learn more

A full guided tour — every function, worked end-to-end on the Depression Anxiety Stress Scales — is in the package vignette (vignette("introduction", package = "semanticfa")).

References

  • Milano, N., Luongo, M., Ponticorvo, M., & Marocco, D. (2025). Semantic analysis of test items through large language model embeddings predicts a-priori factorial structure of personality tests. Current Research in Behavioral Sciences, 8, 100168. doi:10.1016/j.crbeha.2025.100168
  • Casella, M., Luongo, M., Marocco, D., Milano, N., & Ponticorvo, M. (2024). LLM embeddings on test items predict post hoc loadings in personality tests. Ital-IA 2024, CEUR Workshop Proceedings.
  • Guenole, N., D'Urso, E. D., Samo, A., Sun, T., & Haslbeck, J. M. B. (preprint). Enhancing Scale Development: Pseudo Factor Analysis of Language Embedding Similarity Matrices. OSF: https://osf.io/3mpzb/
  • Pellert, M., Lechner, C. M., Sen, I., & Strohmaier, M. (2026). Neural network embeddings recover value dimensions from psychometric survey items on par with human data (SQuID). Findings of the ACL: EACL 2026, 5738–5752.
  • Pokropek, A. (2026). From keyword-based text measures to latent variables: Confirmatory factor analysis with word embeddings. EPJ Data Science. doi:10.1140/epjds/s13688-026-00654-1
  • Kmetty, Z., Koltai, J., & Rudas, T. (2021). The presence of occupational structure in online texts based on word embedding NLP models. EPJ Data Science, 10, 55. doi:10.1140/epjds/s13688-021-00311-9
  • Christensen, A. P., Garrido, L. E., & Golino, H. (2023). Unique Variable Analysis: A network psychometrics method to detect local dependence. Multivariate Behavioral Research, 58(6), 1165–1182. doi:10.1080/00273171.2023.2194606
  • Golino, H. (2026). Optimizing the landscape of LLM embeddings with Dynamic Exploratory Graph Analysis for generative psychometrics. arXiv:2601.17010.
  • Grand, G., Blank, I. A., Pereira, F., & Fedorenko, E. (2022). Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nature Human Behaviour, 6(7), 975–987. doi:10.1038/s41562-022-01316-8
  • Wulff, D. U., & Mata, R. (2025). Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nature Human Behaviour, 9(5), 944–954. doi:10.1038/s41562-024-02089-y
  • Wulff, D. U., & Mata, R. (2026). Escaping the jingle-jangle jungle: Increasing conceptual clarity in psychology using large language models. Current Directions in Psychological Science, 35(2), 59–65. doi:10.1177/09637214251382083
  • Hommel, B. E., & Arslan, R. C. (2025). Language models accurately infer correlations between psychological items and scales from text alone. Advances in Methods and Practices in Psychological Science, 8(4). doi:10.1177/25152459251377093
  • Jung, S.-J., & Seo, J.-W. (2025). A transformer-based embedding approach to developing short-form psychological measures. Frontiers in Psychology, 16, 1640864. doi:10.3389/fpsyg.2025.1640864
  • Wang, B., Zhang, Y., Hu, Y., Hou, H., Peng, K., & Ni, S. (2026). Discovering semantic latent structures in psychological scales: A response-free pathway to efficient simplification. arXiv:2602.12575.

License

GPL (>= 3)

About

Semantic Factor Analysis of Language Model Embeddings

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors