This RAG (retrieval-augmented generation) pipeline analyzes user inputs for potential hate speech using a combination of the large language model (LLM) Llama and a ChromaDB database. The LLM evaluates the input against a curated database of hate speech entries and determines whether the input should be classified as such. If the input is identified as hate speech and is sufficiently distinct from existing entries in the database, it is stored for future reference. The script also avoids redundant entries, keeping the database efficient and relevant.
Note
This repository was created as part of a master's thesis at IU Internationale Hochschule.
Clone this Git repository and install Ollama.
Python:

```shell
pip install -r requirements/requirements.txt
```

Ollama:

```shell
ollama pull mannix/llama3.1-8b-abliterated
```

Important
Attention: Please use uncensored versions of Llama, such as Mannix Llama 3.1; otherwise, this script will not work properly.
Use one of the following scripts:
- The `initial_load_csv.py` script imports the CSV file stored in `/data` to update the Chroma database.
- The `initial_load_ai.py` script prompts Ollama to generate hate speech examples and updates the Chroma database.
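As a rough illustration of what the CSV import step involves (the column layout and the collection interface are assumptions; the real script writes the rows into the Chroma database):

```python
import csv
import io

def load_examples(csv_text: str) -> list[str]:
    """Read one hate speech example per row from the first CSV column.

    In the real script the parsed rows would then be added to the
    Chroma collection; here we just return them.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0] for row in reader if row]
```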
Example:

```shell
python initial_load_csv.py
```

This Python script, built with LangChain, combines the large language model Llama with a ChromaDB database to review user inputs for group-based hostility (hate speech) and, if applicable, store them. This helps expand the database with new, pertinent examples of hate speech while avoiding redundant or identical entries.
```shell
python main.py
```

Initialization:
- An LLM model (`OllamaLLM`) and a ChromaDB database (`hatespeech`) are initialized. The model is used to analyze user inputs, while the database is used to find similar entries and store new ones.
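A runnable sketch of these two components, with both the Ollama model and the Chroma collection replaced by stubs so no server or database is needed (all names and the canned response are illustrative assumptions, not the script's actual identifiers):

```python
class StubLLM:
    """Stand-in for OllamaLLM: takes a prompt string, returns a text verdict."""
    def invoke(self, prompt: str) -> str:
        # Canned behaviour for illustration only.
        return "hate speech" if "hostility" in prompt else "not hate speech"


class StubCollection:
    """Stand-in for the ChromaDB collection named 'hatespeech'."""
    def __init__(self) -> None:
        self.documents: list[str] = []

    def add(self, document: str) -> None:
        self.documents.append(document)


llm = StubLLM()
db = StubCollection()
```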
Prompt Creation:
- A specific prompt is defined for the LLM, instructing the model on how to evaluate the user input for hate speech. The prompt details how the analysis should be conducted and what information the LLM should consider in its response.
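The wording below is a hypothetical English paraphrase (the actual prompt is in German, and the script builds it with LangChain); it sketches how the user input and the retrieved context might be combined into a single prompt string:

```python
# Hypothetical prompt template, for illustration only.
PROMPT_TEMPLATE = (
    "You are a content moderator. Decide whether the user input below "
    "constitutes group-based hostility (hate speech).\n"
    "Similar known examples from the database:\n{context}\n"
    "User input: {user_input}\n"
    "Answer with a clear classification and a short justification."
)

def build_prompt(user_input: str, context_docs: list[str]) -> str:
    """Fill the template with the input and the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in context_docs) or "(none)"
    return PROMPT_TEMPLATE.format(context=context, user_input=user_input)
```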
Query and Context Matching:
- A function, `retrieve_context_and_distances`, searches the ChromaDB for entries similar to the user input. It returns the matching documents together with their distances to the input.
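Conceptually, the function behaves like the following pure-Python sketch. Chroma computes embedding distances internally; here toy 2-D vectors and Euclidean distance stand in for real embeddings, and the corpus contents are made up for illustration:

```python
import math

# Toy corpus: (document, embedding) pairs standing in for the Chroma collection.
CORPUS = [
    ("example entry A", (0.0, 1.0)),
    ("example entry B", (1.0, 0.0)),
    ("example entry C", (0.9, 0.1)),
]

def retrieve_context_and_distances(query_vec, k=2):
    """Return the k nearest documents and their distances to the query."""
    scored = sorted((math.dist(query_vec, vec), doc) for doc, vec in CORPUS)
    top = scored[:k]
    docs = [doc for _, doc in top]
    distances = [d for d, _ in top]
    return docs, distances
```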
LLM Invocation:
- The user input and relevant context from the database are sent to the LLM. The LLM then responds, determining whether the input should be classified as hate speech.
Database Storage:
- If the LLM identifies the input as hate speech and the distance to similar entries in the database exceeds a certain threshold, the input is stored in the database.
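This storage rule can be expressed as a small predicate (the threshold value here is an assumption; the script's actual cut-off may differ):

```python
DISTANCE_THRESHOLD = 0.3  # assumed value, for illustration only

def should_store(classified_as_hate: bool, distances: list[float]) -> bool:
    """Store only inputs that are hate speech AND sufficiently novel.

    An input counts as novel when even its nearest neighbour in the
    database is farther away than the threshold (or the database is empty).
    """
    if not classified_as_hate:
        return False
    return not distances or min(distances) > DISTANCE_THRESHOLD
```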
User Feedback:
- The LLM's response is printed, and depending on the classification, the input is either stored in the database or ignored.
Note
The prompts in this script are written in German, but the approach works regardless of language. You are welcome to adapt the prompts, user-facing output, and other language-specific elements.
Version v1.0.3 - WebApp Support adds a new WebApp! To use it, follow these steps:
Navigate to the webapp folder where `run.py` is located and start the script:

```shell
python run.py
```

This command starts the web application; keep this terminal window open and running while you interact with the web interface.
Once the run.py script is running, you can access the web application by opening a web browser and navigating to the appropriate URL, typically http://localhost:5000.
Note
To make the WebApp suitable for production (available online), you must meet the usual requirements of a production environment, such as handling high traffic, managing sessions securely, and logging errors appropriately. Consider placing it behind a reverse proxy as well.
Don't forget to have a look at the tools:
- `chroma_query.py` queries the database and retrieves specific information based on user input.
- `chroma_inspector.py` provides an overview of your ChromaDB database, displaying its contents and the total number of stored documents.
These scripts have been updated. I have kept the original versions in the /old folder because they were used during the writing of the thesis.
- `main_v1.py` is the first edition of the script. The new `main.py` is fully updated to include database write-back functionality and integration with Llama 3.1.
- `chroma_query_v1.py` is the initial version of the query script. The latest version in `/tools` includes console input functionality.

