# LlamaDocIndexer

## Project Description
LlamaDocIndexer is a tool for indexing, managing, and querying large volumes of documents. Built on the LlamaIndex framework, it serves as a bridge between large language model (LLM) applications and specific, often private, data repositories.

### Purpose:
The primary purpose of LlamaDocIndexer is to simplify the process of indexing documents stored in various formats, such as PDFs, plain text files, and Excel spreadsheets. By automating the indexing process, the tool aims to make information retrieval both efficient and accurate.

### Objectives:
1. Automated Indexing: To recursively scan and index documents in a specified folder, enabling a structured and searchable database.
2. Change Detection and Updating: To automatically detect changes in the documents and update the index accordingly, ensuring that the most current information is always available.
3. Ease of Integration: To provide seamless integration with existing language model applications, enhancing their capability to process and understand domain-specific data.
4. User Accessibility: To create an intuitive interface that allows users to easily query and retrieve information from the indexed data.
5. Scalability and Flexibility: To ensure that the tool can handle a growing volume of data and adapt to various data formats and structures.

## Author Information
- Name: Jiayi Chen
- Contact: chenjiayi_344@hotmail.com


## Requirements and Dependencies
This project relies on several external libraries and dependencies, which are listed in the requirements.txt file. To ensure a smooth setup and functioning of LlamaDocIndexer, follow these steps to install the required dependencies:

1. Ensure Python is Installed: First, make sure you have Python installed on your system. You can download it from python.org.

2. Create a Virtual Environment (optional but recommended): It's good practice to create a virtual environment for your Python projects. This keeps the dependencies required by different projects separate and organized. Use the following commands to create and activate a virtual environment:


* For Windows:
```
python -m venv venv
venv\Scripts\activate
```
* For macOS and Linux:

```
python3 -m venv venv
source venv/bin/activate
```
3. Install Dependencies: Run the following command in your terminal to install all the required dependencies listed in requirements.txt:

```
pip install -r requirements.txt
```
Once these steps are completed, you'll have all the necessary libraries installed, and the LlamaDocIndexer will be ready to run.

## Dataset Description
The LlamaDocIndexer is designed to efficiently handle and index a diverse collection of documents, catering specifically to folders containing a variety of document types and structures. The primary focus of this tool is to work with the following formats:

* Plain Text Files: Simple text documents, typically with a .txt extension. These files are straightforward, containing unformatted text, and are easy to parse and index.

* PDF Files: Portable Document Format files, commonly known as .pdf. These files are widely used for distributing read-only documents that preserve the layout of a page. They are more complex due to their potential for containing a mix of text, images, and other media.

* Excel Files: Spreadsheet files with a .xlsx extension. This is the Microsoft Excel workbook format, used for organizing data in tabular form, which may include numbers, text, and formulas.

The indexer is capable of navigating not just individual documents but also entire directories that may contain sub-folders. This recursive capability ensures that no matter how the documents are organized—whether they are all in a single folder or spread out across multiple sub-folders—the LlamaDocIndexer can systematically scan, index, and update the index for each supported file type found within these directories.
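
The recursive scan described above can be sketched as follows. This is a minimal illustration using Python's standard `pathlib`, not the notebook's actual code; the names `SUPPORTED_EXTENSIONS` and `find_supported_files` are assumptions introduced here for the example.

```python
from pathlib import Path

# Extensions for the three formats listed above. The constant and function
# names are illustrative and are not taken from the notebook.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".xlsx"}


def find_supported_files(root):
    """Return every supported file under root, including sub-folders."""
    root_path = Path(root)
    return sorted(
        path
        for path in root_path.rglob("*")  # rglob walks all nested sub-folders
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_supported_files("docs")` would return `Path` objects for all `.txt`, `.pdf`, and `.xlsx` files under a hypothetical `docs` folder, matching extensions case-insensitively and skipping everything else.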

**Note**: The current version of LlamaDocIndexer supports only the above-mentioned file formats. Future updates may include support for additional document types to broaden the scope of the indexing capabilities.

## Structure of the Notebook
The `main.ipynb` notebook for the LlamaDocIndexer project is organized into several sections, each dedicated to a specific function within the overall indexing process. The notebook comprises the following parts:

1. Introduction and Setup:
- Overview of the project.
- Setting up the environment and importing necessary libraries.

2. Reading the File System:
- Code and methodology for traversing the target directory.
- Methods for identifying supported file formats (plain text, PDF, and XLSX).

3. Checking for Changes:
- Implementing a two-stage change detection system:
  - First, checking the modification time of each file.
  - Then, using hash values for a more thorough comparison.
- Explanation of how these checks contribute to efficient indexing.

4. Generating Unique Identifiers (UIDs):
- Procedure for generating UIDs based on the file path.
- Ensuring that each document in the index is uniquely identifiable.

5. Updating Current States in SQLite Database:
- Integrating with an SQLite database to store and update file states.
- Detailed steps for recording and updating modification times and hash values.

6. Rebuilding Indices:
- Mechanism to trigger index rebuilding when changes are detected.
- Strategies to ensure efficient updating with minimal overhead.

7. Querying the LLMs (e.g., ChatGPT):
- Establishing communication with language model APIs.
- Methods for querying the indexed data using natural language processing.
- Demonstrating how LLMs can extract and interpret information from the indexed documents.

8. Conclusion and Future Work:
- Summarizing the achievements of the notebook.
- Discussing potential improvements and expansions for future iterations.
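
Sections 3 to 5 above (change detection, UIDs, and the SQLite state store) fit together roughly as in the sketch below. It is a hedged illustration using only the Python standard library, not the notebook's implementation: the `file_state` table, the function names, and the choice of SHA-256 for both the UID and the content hash are assumptions made for this example.

```python
import hashlib
import os
import sqlite3


def file_uid(path):
    """Derive a stable UID from the absolute file path (SHA-256 is an assumed choice)."""
    return hashlib.sha256(os.path.abspath(path).encode("utf-8")).hexdigest()


def file_hash(path, chunk_size=65536):
    """Hash the file contents in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_state_db(db_path):
    """Open the state database and create the hypothetical file_state table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_state ("
        "uid TEXT PRIMARY KEY, path TEXT, mtime REAL, hash TEXT)"
    )
    return conn


def has_changed(conn, path):
    """Two-stage check: compare mtime first, then hash only if the mtime differs.

    Records the latest state and returns True when the file is new or its
    content has changed since the last recorded state.
    """
    uid = file_uid(path)
    mtime = os.path.getmtime(path)
    row = conn.execute(
        "SELECT mtime, hash FROM file_state WHERE uid = ?", (uid,)
    ).fetchone()

    # Stage 1: an unchanged modification time lets us skip hashing entirely.
    if row is not None and row[0] == mtime:
        return False

    # Stage 2: the mtime differs (or the file is new), so compare content hashes.
    new_hash = file_hash(path)
    changed = row is None or row[1] != new_hash

    # Store the latest state either way so the next mtime check stays accurate.
    conn.execute(
        "INSERT OR REPLACE INTO file_state (uid, path, mtime, hash) "
        "VALUES (?, ?, ?, ?)",
        (uid, os.path.abspath(path), mtime, new_hash),
    )
    conn.commit()
    return changed
```

Under these assumptions, only files for which `has_changed` returns `True` would need their index entries rebuilt. A file that was merely touched (new modification time, identical content) is hashed once and then skipped.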


## Date and Version
+ Last Updated: 2023-11-14
+ Version: 0.0.1


