TaoGPT-7B is a pioneering project that blends technology with the emerging field of Tao Science. The objective is to fine-tune the Mistral 7B large language model (LLM) on Tao Science data to improve its proficiency in this domain. A retrieval-augmented generation (RAG) pipeline enriches the model's outputs with relevant source material.
To get started with TaoGPT-7B:

- Fine-tune Mistral 7B on Tao Science data.
- Use the Colab notebooks for training and inference:
  - Fine-Tuning Mistral 7B with TaoScience Dataset
  - Instructed Fine-Tuning of Mistral 7B with TaoScience Dataset
  - Inference with TaoGPT-7B on Google Colab
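Before fine-tuning, each Tao Science question/answer pair has to be rendered into Mistral's instruction template (the `[INST] ... [/INST]` wrapping used by Mistral 7B Instruct). A minimal sketch of that formatting step; `format_example` and the sample text are illustrative, not functions from the TaoGPT notebooks:

```python
# Illustrative sketch: build one training string in Mistral's instruction
# format. The function name and sample text are hypothetical.
def format_example(question: str, answer: str) -> str:
    # Mistral 7B Instruct wraps the user prompt in [INST] ... [/INST]
    return f"<s>[INST] {question.strip()} [/INST] {answer.strip()}</s>"

sample = format_example(
    "What is Tao Science?",
    "Tao Science studies soul, mind, and body as information and energy.",
)
```

The fine-tuning notebook would apply a template like this over the whole structured dataset before tokenization.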
The project is structured into several components:

- **Data**: `/unstructured` contains PDFs and unstructured data on Tao Science; `/structured` contains datasets derived from the PDFs.
- **Data Preparation**: `dataprep.ipynb` transforms the unstructured data into a structured format.
- **Fine-Tuning**: `finetuning.ipynb` covers the supervised fine-tuning of Mistral 7B using the Tao Science datasets.
- **Inference**: `inference.ipynb` tests the capabilities of the fine-tuned model using RAG, with Gradio for user interaction.
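The data-preparation step boils down to turning text extracted from the PDFs into one JSON record per line (JSONL), the common format for fine-tuning datasets. A hypothetical sketch, assuming the pairs have already been extracted; the helper name and sample content are illustrative:

```python
import json

# Hypothetical sketch of the dataprep step: serialize (question, answer)
# pairs extracted from the PDFs into JSONL records for fine-tuning.
def to_jsonl_lines(pairs):
    # One JSON object per line; keys here are an assumption, not the
    # notebook's actual schema.
    return [
        json.dumps({"instruction": q, "response": a}, ensure_ascii=False)
        for q, a in pairs
    ]

lines = to_jsonl_lines(
    [("What is Tao?", "Tao is the source of all universes.")]
)
```

Writing `"\n".join(lines)` to a file yields a dataset that standard loaders (e.g. Hugging Face `datasets`) can read directly.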
TaoGPT-7B employs several technologies:

- Mistral 7B (LLM): the central model, fine-tuned for Tao Science.
- LangChain: ties the pipeline components together.
- Transformers library: provides the fine-tuning and inference tooling.
- Weaviate: vector database for efficient data retrieval.
- Gradio: interactive interface for engaging with the model.
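The RAG flow is: retrieve the Tao Science passages most relevant to a query, then feed them to the model as context. As a toy stand-in for Weaviate's vector search (which uses embeddings, not word overlap), here is a sketch of the retrieve-then-prompt idea; the function and sample passages are illustrative only:

```python
# Toy stand-in for the Weaviate retrieval step: rank passages by word
# overlap with the query. The real pipeline uses vector similarity;
# this only illustrates the retrieve-then-prompt pattern.
def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    def overlap(p: str) -> int:
        return len(set(query.lower().split()) & set(p.lower().split()))
    return sorted(passages, key=overlap, reverse=True)[:k]

docs = [
    "Tao Science studies information and energy.",
    "Gradio builds web interfaces for models.",
    "Weaviate is a vector database.",
]
top = retrieve("What does Tao Science study?", docs, k=1)
```

The retrieved passages would then be prepended to the user's question in the prompt sent to the fine-tuned model.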
Contributions to TaoGPT-7B are appreciated. For contribution guidelines, see CONTRIBUTING.md.
For a list of contributors, visit:
TaoGPT-7B is released under the MIT License. See LICENSE.md for details.