📚 wikichat

🌟 Overview

wikichat ingests Cohere's multilingual Wikipedia embeddings into a Chroma vector database and provides a Chainlit web interface for retrieval-augmented generation (RAG) over the data using gpt-4-1106-preview.

I wanted to explore the idea of maintaining a local copy of Wikipedia, and this seemed like a good entry point. Down the road I might update the code to regularly pull the full Wikipedia dump and generate the embeddings itself, instead of relying on Cohere's prebuilt ones. I went this route as a proof of concept, and as an excuse to try out Chainlit.

Based on Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb

🛠 Installation

  1. Clone the repository:

    git clone https://github.com/deadbits/wikipedia-chat.git
    cd wikipedia-chat

  2. Set up a Python virtual environment:

    python3 -m venv venv
    source venv/bin/activate

  3. Install dependencies:

    pip install -r requirements.txt

📖 Usage

Set your Cohere and OpenAI API keys:

export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."

Ingest Data

Run ingest.py to download the Wikipedia embeddings dataset and load it into ChromaDB:

python ingest.py

The script adds records in batches of 100, so ingestion still takes some time; increasing the batch size would likely speed it up. A sketch of this batched-ingestion pattern follows.
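
This is a minimal sketch only, not the actual ingest.py; the dataset name, storage path, and collection name are assumptions for illustration.

    # Hypothetical sketch of batched ingestion; the real ingest.py may differ.
    import chromadb
    from datasets import load_dataset

    # Assumption: one of Cohere's Wikipedia embeddings datasets on Hugging Face.
    # Streaming avoids downloading the entire dataset before processing.
    dataset = load_dataset(
        "Cohere/wikipedia-22-12-en-embeddings", split="train", streaming=True
    )

    client = chromadb.PersistentClient(path="chroma-db")  # assumed path
    collection = client.get_or_create_collection("wikipedia")  # assumed name

    BATCH_SIZE = 100  # the README notes batches of 100
    ids, documents, embeddings = [], [], []

    for i, row in enumerate(dataset):
        ids.append(str(i))
        documents.append(row["text"])
        embeddings.append(row["emb"])
        if len(ids) == BATCH_SIZE:
            # Records carry precomputed Cohere embeddings, so Chroma
            # needs no embedding function at insert time.
            collection.add(ids=ids, documents=documents, embeddings=embeddings)
            ids, documents, embeddings = [], [], []

    if ids:  # flush the final partial batch
        collection.add(ids=ids, documents=documents, embeddings=embeddings)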

Web Interface

To launch the web interface, run chainlit_ui.py with the Chainlit CLI:

chainlit run chainlit_ui.py
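
At a high level, the handler embeds the user's question with Cohere, queries Chroma for the nearest passages, and asks gpt-4-1106-preview to answer from that context. Below is a minimal sketch of such a handler; the collection name and the multilingual-22-12 embedding model are assumptions, and the real chainlit_ui.py may differ.

    # Hypothetical sketch of a Chainlit RAG handler; chainlit_ui.py may differ.
    import os

    import chainlit as cl
    import chromadb
    import cohere
    from openai import OpenAI

    co = cohere.Client(os.environ["COHERE_API_KEY"])
    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    collection = chromadb.PersistentClient(path="chroma-db").get_collection("wikipedia")

    @cl.on_message
    async def main(message: cl.Message):
        # Embed the query with the same model family as the dataset
        # (assumption: multilingual-22-12, matching Cohere's Wikipedia embeddings).
        query_emb = co.embed(
            texts=[message.content], model="multilingual-22-12"
        ).embeddings[0]

        # Retrieve the most similar passages from Chroma.
        results = collection.query(query_embeddings=[query_emb], n_results=5)
        context = "\n\n".join(results["documents"][0])

        # Generate an answer grounded in the retrieved passages.
        response = openai_client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {message.content}",
                },
            ],
        )
        await cl.Message(content=response.choices[0].message.content).send()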

Chainlit interface

[Screenshot: Chainlit UI]