This project aims to summarize responses to questionnaires in Hebrew, English and Arabic for a language school NGO called מדרסה (Madrase). The project uses NLTK, BERTopic, Dicta 2.0, Llama 3.1 and other natural language processing tools.
Tested on Python 3.11.9.
To set up the project, follow these steps:
- Install Ollama: Download and install Ollama from Ollama's website or use Homebrew:

      brew install ollama

- Pull local LLM: Download a model, e.g. the Hebrew-English LLM Dicta 2.0:

      ollama pull aminadaven/dictalm2.0-instruct:f16
      # or: ollama pull llama3.1

- Set Up Virtual Environment:

      python3 -m venv venv

- Activate Virtual Environment:

      . venv/bin/activate

- Upgrade pip:

      pip install --upgrade pip

- Install Required Packages:

      pip install -r requirements.txt
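
After the setup steps above, a quick sanity check can help confirm that the environment looks as expected. The snippet below is only a sketch and not part of the project: the function name `check_environment` and the minimum version `(3, 11)` are assumptions chosen for illustration (the project reports testing on Python 3.11.9). It uses only the standard library and does not contact Ollama; it merely checks whether an `ollama` executable is on the `PATH`.

```python
import shutil
import sys


def check_environment(min_python=(3, 11)):
    """Return a dict of simple pass/fail checks for the local setup.

    Hypothetical helper for illustration; the minimum version is an assumption.
    """
    return {
        # Compare (major, minor) of the running interpreter to the minimum.
        "python_version_ok": sys.version_info[:2] >= tuple(min_python),
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # True when an `ollama` executable can be found on PATH.
        "ollama_on_path": shutil.which("ollama") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'NOT OK'}")
```

Run it with the virtual environment activated; any `NOT OK` line points to the setup step worth revisiting.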
The processing pipeline consists of the following stages:
- Basic sentence splitting (using the NLTK sentence tokenizer)
- Topic modeling (using BERTopic):
  - Sentence embedding (using the Hugging Face sentence-transformers-alephbert model)
  - Dimensionality reduction (using UMAP)
  - Clustering (using HDBSCAN)
  - Topic representation (using BERTopic normalized TF-IDF, plus an LLM outside of BERTopic)
- Topic summarization (using an LLM):
  - Batch splitting
  - LLM summarization of each batch
  - LLM summarization of the batch summaries
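
The last stage listed above (batch splitting, summarizing each batch, then summarizing the summaries) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names `split_into_batches` and `summarize_topic`, the character-based batch budget, and the default `max_chars=2000` are all hypothetical. The `summarize` argument stands in for whatever function sends text to the local LLM and returns its summary.

```python
from typing import Callable, List


def split_into_batches(sentences: List[str], max_chars: int) -> List[List[str]]:
    """Greedily pack sentences into batches of at most max_chars characters.

    A single sentence longer than max_chars still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    size = 0
    for sentence in sentences:
        # Start a new batch when adding this sentence would exceed the budget.
        if current and size + len(sentence) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(sentence)
        size += len(sentence)
    if current:
        batches.append(current)
    return batches


def summarize_topic(
    sentences: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 2000,  # assumed budget, tune to the model's context size
) -> str:
    """Summarize each batch, then summarize the batch summaries."""
    batches = split_into_batches(sentences, max_chars)
    if not batches:
        return ""
    # First pass: one LLM call per batch.
    partial = [summarize("\n".join(batch)) for batch in batches]
    if len(partial) == 1:
        return partial[0]
    # Second pass: a single LLM call over the batch summaries.
    return summarize("\n".join(partial))
```

Passing the LLM call in as a plain callable keeps the batching logic easy to test with a stub, and makes it simple to switch between models such as Dicta 2.0 and Llama 3.1.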