ML/LLM experiments with LlamaIndex to develop a personal assistant for cognitively impaired patients.
Python 3.11 is unsupported by PyTorch. The application must run with Python 3.10.
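A quick guard at the top of a script can catch the wrong interpreter early. This is only a sketch: the helper name `is_supported` and the constant `REQUIRED` are hypothetical and do not come from the repository.

```python
import sys

# Assumption: only the 3.10 series is accepted, as stated above.
REQUIRED = (3, 10)


def is_supported(version_info=sys.version_info):
    """Return True when the major/minor version matches REQUIRED."""
    return tuple(version_info[:2]) == REQUIRED


if not is_supported():
    # Warn instead of exiting so the check never blocks experimentation.
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        f"this project expects {REQUIRED[0]}.{REQUIRED[1]}."
    )
```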
Install dependencies:
pip install -r requirements.txt
# additional dependencies
pip install pypdf fastembed chromadb accelerate streamlit langchainhub
In some scripts, the value of device must reflect the hardware:
- auto: the default
- cpu: should always work, but processing time will be too long
- mps: to use the GPU of an Apple M1
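One possible way to choose the device value programmatically is sketched below. The helper `pick_device` is hypothetical and is not how the scripts in this repository resolve the value; it assumes that resolving auto to either mps or cpu is acceptable, and it only probes PyTorch when it is installed.

```python
def pick_device(requested="auto"):
    """Resolve a device name; 'auto' is mapped to 'mps' or 'cpu' (assumption)."""
    if requested != "auto":
        # An explicit choice such as "cpu" or "mps" is returned unchanged.
        return requested
    try:
        import torch  # optional; only used to probe the MPS backend

        if torch.backends.mps.is_available():
            return "mps"
    except (ImportError, AttributeError):
        # PyTorch is missing, or too old to expose the MPS backend.
        pass
    return "cpu"
```

Calling `pick_device()` on a machine without a usable MPS backend falls back to cpu, which matches the note above that cpu should always work.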
Some scripts connect to HuggingFace. Set the following environment variables:
- HF_HOME=path: path of the HuggingFace cache
- HUGGING_FACE_HUB_TOKEN=token: HuggingFace token to download the models
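Before launching a script that downloads models, it may help to verify that both variables are set. The following is a minimal sketch using only the standard library; the helper name `missing_hf_vars` is hypothetical.

```python
import os

# The two variables named above.
REQUIRED_VARS = ("HF_HOME", "HUGGING_FACE_HUB_TOKEN")


def missing_hf_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_hf_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```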
Ollama is required for most scripts.
Install Ollama:
brew install ollama
mkdir -p ~/.ollama
# optional to store the models on an external drive
ln -s "/{PATH}/ollama" ~/.ollama/models
ollama serve
ollama pull mistral
ollama pull llama2
Run it:
ollama serve
ollama run llama2
That should open an interactive shell to chat with Llama2.
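Besides the interactive shell, a running `ollama serve` also exposes a local HTTP API. The sketch below queries it from Python with the standard library only. It assumes the default address `http://localhost:11434` and that the llama2 model has already been pulled; the helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama listens on its default local address and port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama2"):
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama2"):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask("Why is the sky blue?")` should return the model's answer as a string. Setting `stream` to false asks for a single JSON reply, which keeps the parsing simple.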
- 01-use-local-knowledge: basic experiment using llama-index and llama to index and query a dataset.
- 02-chat-bot: experiment using ollama/llama2 + streamlit/langchain/chromadb to discuss a PDF with the LLM.
- 03-fine-tuning: experiment fine-tuning bert with a dataset of reviews.
- 04-training-with-colab: same as 03, but using Colab.
- 05-create-a-bio: generate knowledge with LLMs and use the results to build the knowledge base for further iterations.
- 06-sentence-split: evaluates how SentenceSplitter works.
- 07-rag-pipeline: variation of 06-sentence-split.
- 08-query-chroma: test to verify how Chroma retrieves knowledge based on queries and filters.
- 09-refiner: utilisation of LLMs to re-rank results from the vector database.
- 10-keywords-extraction: methods to extract keywords (or key-phrases) from a text.
- 11-query-chroma-with-kw: use keywords to pre-filter the nodes returned by a query.
- 12-faiss: alternative indexing techniques using the FAISS library.
- 13-ingest-ebook: comparison between two extractors in order to parse a medical book in PDF.
- 14-smarter-ingest: extension of the SimpleDirectoryReader with enhanced PDF processing.
- 15-diagnosis: attempts to define the probability of a diagnosis based on dialogs.
- 16-relevant: find relevant questions to diagnose a disease.
- 17-better-dialogs: attempt to improve the dialogs with RAG.
- 18-translate-in-you-form: translate a diagnosis into a dialog directed to the patient.
- 19-elastic: multilevel indexing of PDFs storing the embeddings in Elasticsearch.
- bio: the bio of a fictional woman generated by 05-create-a-bio.
- bio-single-file: like bio but in a single file.
- dementia-wiki-txt: an extract of the Wikipedia page about dementia.
- dementia-wiki-polluted: same as dementia-wiki-txt but polluted by a sentence affirming that there exists a relation between dementia and alien kidnapping (to study hallucinations).
- TwentyThousandLeaguesUnderTheSea: Twenty Thousand Leagues Under the Seas by Jules Verne. Source: https://www.gutenberg.org/
- gutenberg: five books from https://www.gutenberg.org/: On the Origin of Species By Means of Natural Selection by Charles Darwin, Paradise Lost by John Milton, The Fall of the House of Usher by Edgar Allan Poe, The Republic by Plato, and Twenty Thousand Leagues under the Sea by Jules Verne.