A flexible, corpus-agnostic semantic analysis toolkit that uses transformer-based embeddings to analyze and classify any texts across custom conceptual dimensions.
This toolkit provides a configuration-driven approach to semantic text analysis, allowing you to:
- Analyze any texts (novels, articles, speeches, religious texts, etc.)
- Define custom concepts to measure (political ideology, emotional tone, themes, etc.)
- Compare texts across semantic dimensions
- Visualize semantic clustering with t-SNE
- Generate quantitative similarity metrics
## Features

- Corpus-Agnostic: Works with any text files
- Flexible Concepts: Define your own concept vocabularies via JSON
- Two Chunking Strategies:
  - Fixed-size chunking for uniformity
  - Semantic chunking for concept-preserving segments
- Configuration-Driven: Use JSON configs or command-line arguments
- Visualizations: t-SNE plots showing semantic clusters
- Quantitative Metrics: Mean, median, and distribution statistics
## Requirements

- Python 3.8+
- pip

## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/esteininger/semantic-religions.git
   cd semantic-religions
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Quick Start

### 1. Prepare your texts

Place your text files in the texts/ directory (or anywhere):

```
texts/
└── my_corpus/
    ├── document1.txt
    ├── document2.txt
    └── document3.txt
```
### 2. Define your concepts

Create a JSON file in concepts/ defining the concepts you want to measure:

```json
{
  "mode": "single",
  "concepts": {
    "optimism": [
      "hope", "optimism", "bright future", "positive outlook",
      "enthusiasm", "confidence", "success", "opportunity"
    ]
  }
}
```
For comparative analysis (A vs. B):

```json
{
  "mode": "dual",
  "concepts": {
    "progressive": ["innovation", "change", "progress", "reform", "future"],
    "traditional": ["tradition", "heritage", "preservation", "continuity", "past"]
  }
}
```
### 3. Create a config file

```json
{
  "analysis_name": "My Analysis",
  "texts": [
    "texts/my_corpus/document1.txt",
    "texts/my_corpus/document2.txt"
  ],
  "labels": ["Document 1", "Document 2"],
  "concepts": "concepts/my_concepts.json",
  "output": "output/my_analysis",
  "target_chunks": 100,
  "use_semantic_chunking": true
}
```
### 4. Run the analysis

```bash
python scripts/analyze.py --config my_config.json
```

Or pass everything as command-line arguments:

```bash
python scripts/analyze.py \
  --texts text1.txt text2.txt text3.txt \
  --labels "Text 1" "Text 2" "Text 3" \
  --concepts concepts/my_concepts.json \
  --analysis_name "My Analysis" \
  --output output/my_analysis \
  --target_chunks 100 \
  --use_semantic_chunking
```
## Project Structure

```
semantic-text-analyzer/
├── texts/                    # Your corpus files
│   └── [corpus_name]/        # Organized by corpus
├── concepts/                 # Concept definitions (JSON)
│   ├── conservative.json
│   ├── liberal.json
│   └── [your_concepts].json
├── scripts/                  # Analysis scripts
│   ├── analyze.py            # Main generic analyzer
│   └── semantic_chunker.py   # Semantic chunking utility
├── examples/                 # Example analyses
│   └── religious/            # Religious texts example
│       ├── README.md
│       ├── config.json
│       ├── RESULTS.md
│       └── output/
├── output/                   # Analysis results (generated)
├── README.md                 # This file
└── requirements.txt          # Python dependencies
```
## How It Works

### Embeddings

- Model: sentence-transformers/all-MiniLM-L6-v2 (configurable)
- 384-dimensional dense vectors
- Embeddings are L2-normalized for cosine similarity
### Concept scoring

Each concept is defined by seed phrases. The analyzer:

- Embeds each seed phrase
- Calculates the centroid (mean vector)
- Measures cosine similarity between text chunks and centroids
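A minimal sketch of this scoring step, using the default model named above; the seed phrases, chunks, and variable names are illustrative, not part of the toolkit's API:

```python
# Hedged sketch of concept scoring: embed seeds, average into a
# centroid, score chunks by cosine similarity. Because all vectors
# are unit-normalized, a dot product equals cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

seed_phrases = ["hope", "optimism", "bright future", "positive outlook"]
seed_vecs = model.encode(seed_phrases, normalize_embeddings=True)

# Concept centroid: mean of seed vectors, re-normalized to unit length.
centroid = seed_vecs.mean(axis=0)
centroid /= np.linalg.norm(centroid)

chunks = [
    "We look forward to a brighter tomorrow.",
    "The committee reviewed the quarterly figures.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # shape (n, 384)

scores = chunk_vecs @ centroid  # cosine similarity per chunk
for chunk, score in zip(chunks, scores):
    print(f"{score:+.3f}  {chunk}")
```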
### Chunking strategies

Fixed-size chunking:

- Uniform segments (e.g., 500 words)
- Fast and consistent
- May split coherent concepts

Semantic chunking:

- Variable-size segments based on topic shifts
- Preserves conceptual coherence
- Detects natural boundaries using embedding similarity
- More computationally intensive but more accurate
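A rough sketch of both strategies follows. The toolkit's actual implementation lives in scripts/semantic_chunker.py and likely differs in its boundary heuristics; the 0.45 similarity threshold here is an assumed value:

```python
# Illustrative chunkers, not the toolkit's exact algorithms.
from sentence_transformers import SentenceTransformer


def fixed_chunks(text, size=500):
    """Split text into uniform chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def semantic_chunks(sentences, model, threshold=0.45):
    """Start a new chunk wherever consecutive sentences diverge semantically."""
    if not sentences:
        return []
    vecs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev_vec, cur_vec, sentence in zip(vecs, vecs[1:], sentences[1:]):
        if float(prev_vec @ cur_vec) < threshold:  # likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```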
### Visualization

- t-SNE dimensionality reduction to 2D
- Points color-coded by source text
- Shows semantic clustering patterns
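A minimal sketch of this plotting step, assuming chunk embeddings and per-chunk source labels are already in hand (the random arrays below are placeholders for real data):

```python
# Hedged sketch of the t-SNE plot; real inputs would come from the
# embedding step above rather than these placeholders.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

chunk_vecs = np.random.rand(40, 384)                   # placeholder embeddings
labels = np.array(["Text A"] * 20 + ["Text B"] * 20)   # placeholder labels

coords = TSNE(n_components=2, random_state=42).fit_transform(chunk_vecs)
for text_label in np.unique(labels):
    mask = labels == text_label
    plt.scatter(coords[mask, 0], coords[mask, 1], s=12, label=text_label)
plt.legend()
plt.title("Semantic clusters (t-SNE)")
plt.savefig("tsne_visualization.png", dpi=150)
```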
## Example Use Cases

### Political ideology

Analyze speeches or manifestos for progressive vs. conservative ideology:

```json
{
  "concepts": {
    "progressive": ["change", "reform", "innovation", "equality", "rights"],
    "conservative": ["tradition", "stability", "order", "heritage", "values"]
  }
}
```
### Emotional tone

Compare emotional frameworks in literature:

```json
{
  "concepts": {
    "positive": ["joy", "love", "happiness", "hope", "warmth"],
    "negative": ["sadness", "anger", "fear", "despair", "grief"]
  }
}
```
### Scientific vs. mystical language

Analyze how texts balance empirical and spiritual language:

```json
{
  "concepts": {
    "empirical": ["data", "evidence", "experiment", "observation", "measurement"],
    "mystical": ["transcendent", "spiritual", "divine", "mystical", "sacred"]
  }
}
```
### Environmental themes

Measure environmental consciousness in texts over time:

```json
{
  "concepts": {
    "environmental": ["nature", "ecology", "conservation", "sustainability",
                      "climate", "biodiversity", "planet", "earth"]
  }
}
```
## Included Example

This repository includes a comprehensive example analyzing the Bible, Torah, and Quran across 11 conceptual dimensions:

- Good vs. Evil
- Liberal vs. Conservative
- Hope vs. Despair
- Love vs. Fear
- And more...

See examples/religious/ for details.
## Output

Each analysis generates:

```
output/
└── [analysis_name]/
    ├── chunk_results.csv        # Chunk-level scores
    ├── statistics.json          # Summary statistics
    └── tsne_visualization.png   # t-SNE plot
```

- chunk_results.csv: one row per chunk, with columns for text, chunk_index, and one score column per concept
- statistics.json: mean, median, and standard deviation for each text-concept pair
- tsne_visualization.png: 2D visualization of semantic clusters
## Advanced Usage

### Custom embedding models

Use any sentence-transformers model:

```bash
python scripts/analyze.py \
  --config my_config.json \
  --model "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
```

### Batch analysis

Create multiple configs and run them sequentially:

```bash
for config in configs/*.json; do
  python scripts/analyze.py --config "$config"
done
```
### Integration

The analyzer outputs standard CSV and JSON formats for easy integration with:

- Pandas/Python analysis
- R statistical analysis
- Tableau/Power BI visualization
- Custom dashboards
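For example, loading chunk-level results into pandas for further analysis; the path and grouping column follow the output layout described above, and this is an assumed sketch rather than a bundled script:

```python
# Assumed downstream analysis: mean concept score per source text,
# computed from the chunk-level CSV the analyzer writes.
import pandas as pd

df = pd.read_csv("output/my_analysis/chunk_results.csv")
summary = df.groupby("text").mean(numeric_only=True)
print(summary)
```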
## Limitations

- Language: The default model works best with English
- Context: Chunking loses some broader narrative context
- Concept Bias: Results depend on seed phrase selection
- Semantics vs. Intent: Scores measure linguistic patterns, not author intent
- Translation Effects: Translated texts may not reflect their originals
## Contributing

Contributions welcome! Areas for improvement:

- Additional embedding model support
- Multi-language analysis
- More sophisticated chunking algorithms
- Statistical significance testing
- Interactive visualizations
- Pre-built concept libraries
## Citation

If you use this toolkit in research, please cite:

Semantic Text Analyzer
https://github.com/esteininger/semantic-religions

## License

This project uses public domain texts and open-source libraries. Code is provided for research and educational purposes.
## Acknowledgments

- Models: Sentence-Transformers (Hugging Face)
- Libraries: scikit-learn, pandas, matplotlib, numpy
- Example Texts: Project Gutenberg, Open Siddur Project

For questions or feedback, please open an issue on GitHub.