gitLouis/madarsa-summarization

Madarsa Questionnaire Summarization

Overview

This project summarizes responses to questionnaires in Hebrew, English and Arabic for מדרסה (Madrase), a language-school NGO. It uses NLTK, BERTopic, Dicta 2.0, Llama 3.1 and other natural language processing tools.

Tested on Python 3.11.9

Installation

To set up the project, follow these steps:

  1. Install Ollama: Download and install Ollama from Ollama's website or use Homebrew:

    brew install ollama
  2. Pull a local LLM: Download a model, e.g. the Hebrew-English LLM Dicta 2.0:

    ollama pull aminadaven/dictalm2.0-instruct:f16
    # Alternatively: ollama pull llama3.1
  3. Set Up Virtual Environment:

    python3 -m venv venv
  4. Activate Virtual Environment:

    . venv/bin/activate
  5. Upgrade pip:

    pip install --upgrade pip
  6. Install Required Packages:

    pip install -r requirements.txt
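
Once Ollama is running and a model has been pulled, a quick way to confirm the setup is to send one prompt to the local server. The sketch below is not taken from this repository: it assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`) and uses only the standard library. The helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address and port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object
    # whose "response" field holds the full completion.
    return body["response"]


# Example call (requires a running Ollama server and the pulled model):
# print(generate("aminadaven/dictalm2.0-instruct:f16", "Reply with one word: ready"))
```

If the call raises a connection error, the Ollama server is probably not running; if it reports an unknown model, the `ollama pull` step likely did not complete.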

Process Pipeline

  1. Basic sentence splitting (using NLTK Sentence Tokenizer)
  2. Topic Modeling (using BERTopic):
    • Sentence Embedding (using HF sentence-transformers-alephbert)
    • Dimensionality reduction (using UMAP)
    • Clustering (using HDBSCAN)
    • Topic representation (using BERTopic normalized TF-IDF, plus an LLM applied outside of BERTopic)
  3. Topic Summarizing (using LLM):
    • Batch splitting
    • LLM Summarization for each batch
    • LLM Summarization of summaries
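
The topic summarizing step (batch splitting, a summary per batch, then a summary of the summaries) can be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: the function names, the character-based batch limit and its default of 2000 are hypothetical, and `summarize` stands in for whatever LLM call the project makes.

```python
from typing import Callable, List, Sequence


def split_into_batches(sentences: Sequence[str], max_chars: int) -> List[List[str]]:
    """Greedily pack sentences into batches whose summed sentence length
    stays within max_chars. A sentence longer than max_chars gets its own batch."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        # Start a new batch when adding this sentence would exceed the limit.
        if current and current_len + len(sentence) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence)
    if current:
        batches.append(current)
    return batches


def summarize_topic(
    sentences: Sequence[str],
    summarize: Callable[[str], str],
    max_chars: int = 2000,  # hypothetical limit; tune to the model's context window
) -> str:
    """Summarize each batch, then summarize the batch summaries."""
    batches = split_into_batches(sentences, max_chars)
    if not batches:
        return ""
    partial = [summarize("\n".join(batch)) for batch in batches]
    if len(partial) == 1:
        # A single batch needs no second pass.
        return partial[0]
    return summarize("\n".join(partial))
```

Passing the LLM call in as a plain callable keeps the batching logic testable without a running model, and makes it easy to switch between Dicta 2.0 and Llama 3.1.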

About

Madarse Course questionnaire - LLM summarization
