Merge branch 'sphinx' of https://github.com/arjbingly/Capstone_5 into sphinx
arjbingly committed Apr 23, 2024
2 parents 7bec7c1 + 6546f52 commit c94ad28
Showing 49 changed files with 497 additions and 116 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
name: documentation

on: [ push, pull_request, workflow_dispatch ]

permissions:
  contents: write

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
      - name: Install dependencies
        run: |
          pip install sphinx sphinx_rtd_theme myst_parser
      - name: Sphinx build
        run: |
          sphinx-build doc _build
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
        with:
          publish_branch: gh-pages
          # github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: _build/
          force_orphan: true
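A hedged aside on the deploy step above: `peaceiris/actions-gh-pages@v3` requires one of `github_token`, `deploy_key`, or `personal_token` to push, so with the `github_token` line commented out the deploy step would be expected to fail. A sketch of the step with the token enabled — an assumption about the intended setup, not part of this commit:

```yaml
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}  # built-in token is sufficient for same-repo gh-pages
          publish_branch: gh-pages
          publish_dir: _build/
          force_orphan: true
```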
Binary file modified src/docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified src/docs/_build/doctrees/get_started.doctree
Binary file not shown.
Binary file modified src/docs/_build/doctrees/get_started.introduction.doctree
Binary file not shown.
Binary file modified src/docs/_build/doctrees/get_started.llms.doctree
Binary file not shown.
Binary file not shown.
Binary file modified src/docs/_build/doctrees/get_started.vectordb.doctree
Binary file not shown.
Binary file modified src/docs/_build/doctrees/grag.components.doctree
Binary file not shown.
Binary file modified src/docs/_build/doctrees/grag.components.vectordb.doctree
Binary file not shown.
Binary file modified src/docs/_build/doctrees/grag.rag.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion src/docs/_build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 33176d1a0fbc2e489b6d5201070d328e
config: 1ced34aae86d195057701cf655c56180
tags: 645f666f9bcd5a90fca523b33c5a78b7
21 changes: 17 additions & 4 deletions src/docs/_build/html/_sources/get_started.introduction.rst.txt
@@ -3,9 +3,22 @@ GRAG Overview

GRAG provides an implementation of Retrieval-Augmented Generation that is completely open-source.
Since it does not use any external services or APIs, this enables a cost-saving solution as well as a solution to data privacy concerns.
For more information, refer to :ref:`Test <Vector Stores>`.
For more information, refer to `our readme <https://github.com/arjbingly/Capstone_5/blob/main/README.md>`_.

Retrieval-Augmented Generation
##############################
Retrieval-Augmented Generation (RAG)
####################################

Re
Retrieval-Augmented Generation (RAG) is a technique in machine learning that enhances large language models (LLMs) by incorporating external data.

In RAG, a model first retrieves relevant documents or data from a large corpus and then uses this information to guide the generation of new text. This approach allows the model to produce more informed, accurate, and contextually appropriate responses.

By leveraging both the retrieval of existing knowledge and the generative capabilities of neural networks, RAG models can improve over traditional generation methods, particularly in tasks requiring deep domain-specific knowledge or factual accuracy.

.. figure:: ../../_static/basic_RAG_pipeline.png
:width: 800
:alt: Basic-RAG Pipeline
:align: center

Illustration of a basic RAG pipeline

Traditionally, RAG uses a vector database/vector store for both the retrieval and generation processes.
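The retrieve-then-generate flow described above can be sketched as a toy example (pure Python, word-overlap scoring standing in for embedding similarity; illustrative only, not the GRAG implementation):

```python
# Toy retrieve-then-generate sketch; word-overlap scoring stands in for
# the embedding similarity a real vector store would compute.
def score(query: str, doc: str) -> float:
    """Crude relevance score: fraction of query words present in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k most relevant documents for the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    """Placeholder for the LLM call: a real RAG system would prompt the
    model with the query plus the retrieved context."""
    return f"Answer to {query!r}, grounded in {len(context)} retrieved document(s)."

corpus = [
    "RAG retrieves relevant documents before generating text.",
    "Vector stores index document embeddings for fast retrieval.",
    "Llamas are domesticated South American camelids.",
]
context = retrieve("how does rag retrieve relevant documents", corpus)
print(generate("how does rag retrieve relevant documents", context))
```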
10 changes: 5 additions & 5 deletions src/docs/_build/html/_sources/get_started.llms.rst.txt
@@ -1,4 +1,4 @@
`LLMs
LLMs
=====

GRAG offers two ways to run LLMs locally:
@@ -17,10 +17,10 @@ provide an auth token*
To run LLMs using LlamaCPP
#############################
LlamaCPP requires models in the form of a `.gguf` file. You can either download these model files online,
or
or **quantize** the model yourself following the instructions below.

How to quantize models.
************************
How to quantize models
***********************
To quantize the model, run:
``python -m grag.quantize.quantize``

@@ -34,4 +34,4 @@ After running the above command, user will be prompted with the following:

* If the user has the model downloaded locally, then the user will be instructed to copy the model and input the name of the model directory.

3.Finally, the user will be prompted to enter **quantization** settings (recommended Q5_K_M or Q4_K_M, etc.). For more details, check `llama.cpp/examples/quantize/quantize.cpp <https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19>`_.
3. Finally, the user will be prompted to enter **quantization** settings (recommended Q5_K_M or Q4_K_M, etc.). For more details, check `llama.cpp/examples/quantize/quantize.cpp <https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19>`_.
61 changes: 61 additions & 0 deletions src/docs/_build/html/_sources/get_started.parse_pdf.rst.txt
@@ -0,0 +1,61 @@
Parse PDF
=========

The parsing and partitioning were primarily done using the unstructured.io library, which is designed for this purpose. However, for PDFs with complex layouts, such as nested tables or tax forms, the pdfplumber and pytesseract libraries were employed to improve the parsing accuracy.

The ParsePDF class has several attributes that control the behavior of the parsing and partitioning process.

Attributes
##########

- single_text_out (bool): If True, all text elements are combined into a single output document. The default value is True.

- strategy (str): The strategy for PDF partitioning. The default is "hi_res" for better accuracy.

- extract_image_block_types (list): A list of element types to be extracted as image blocks. By default, it includes "Image" and "Table".

- infer_table_structure (bool): Whether to extract tables during partitioning. The default value is True.

- extract_images (bool): Whether to extract images. The default value is True.

- image_output_dir (str): The directory to save extracted images, if any.

- add_captions_to_text (bool): Whether to include figure captions in the text output. The default value is True.

- add_captions_to_blocks (bool): Whether to add captions to table and image blocks. The default value is True.

- add_caption_first (bool): Whether to place captions before their corresponding image or table in the output. The default value is True.

- table_as_html (bool): Whether to represent tables as HTML.

Parsing Complex PDF Layouts
###########################

While unstructured.io performed well in parsing PDFs with straightforward layouts, PDFs with complex layouts, such as nested tables or tax forms, were not parsed accurately. To address this issue, the pdfplumber and pytesseract libraries were employed.

Table Parsing Methodology
=========================

For each page in the PDF file, the find_tables method is called with specific table settings to find the tables on that page. The table settings used are:

- ``"vertical_strategy": "text"``: This setting tells the function to detect tables based on the text content.

- ``"horizontal_strategy": "lines"``: This setting tells the function to detect tables based on the horizontal lines.

- ``"min_words_vertical": 3``: This setting specifies the minimum number of words required to consider a row as part of a table.

**For each table found on the page, the following steps are performed:**

1. The table area is cropped from the page using the crop method and the bbox (bounding box) of the table.

2. The text content of the cropped table area is extracted using the `extract_text` method with `layout=True`.

3. A dictionary is created with the `table_number` and `extracted_text` of the table, and it is appended to the `extracted_tables_in_page` list.

After processing all the tables on the page, a dictionary is created with the `page_number` and the list of `extracted_tables_in_page`, and it is appended to the `extracted_tables` list.

Finally, the `extracted_tables` list is returned, which contains all the extracted tables from the PDF file, organized by page and table number.

Limitations
===========

While the table parsing methodology using `pdfplumber` could process most tables, it could not parse every table layout accurately. The table settings need to be adjusted for different types of table layouts. Additionally, pdfplumber could not extract figure captions, whereas `unstructured.io` could.
Future work may involve developing a more robust and flexible table parsing algorithm that can handle a wider range of table layouts and integrate seamlessly with the ParsePDF class to leverage the strengths of both unstructured.io and pdfplumber libraries.
1 change: 1 addition & 0 deletions src/docs/_build/html/_sources/get_started.rst.txt
@@ -5,6 +5,7 @@ Get Started

get_started.introduction
get_started.installation
get_started.parse_pdf
get_started.llms
get_started.vectordb

12 changes: 8 additions & 4 deletions src/docs/_build/html/_sources/get_started.vectordb.rst.txt
@@ -1,5 +1,3 @@
.. _Vector Stores:

Vector Stores
===============

@@ -28,7 +26,14 @@ Since Chroma is a server-client based vector database, make sure to run the serv
* If Chroma is not run locally, change ``host`` and ``port`` under ``chroma`` in `src/config.ini`, or provide the arguments
explicitly.

For non-supported vectorstores, (...)
Once you have Chroma running, just use the Chroma Client class.

DeepLake
*********
Since DeepLake is not a server-based vector store, it is much easier to get started.

Just make sure you have DeepLake installed and use the DeepLake Client class.


Embeddings
###########
@@ -52,4 +57,3 @@ For more details on data ingestion, refer to our `cookbook <https://github.com/a


retriever.ingest(dir_path)

21 changes: 15 additions & 6 deletions src/docs/_build/html/get_started.html
@@ -53,8 +53,10 @@
<li class="toctree-l1 current"><a class="current reference internal" href="#">Get Started</a><ul>
<li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html">GRAG Overview</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.installation.html">Installation</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">LLMs</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a></li>
</ul>
</li>
@@ -91,13 +93,20 @@ <h1>Get Started<a class="headerlink" href="#get-started" title="Link to this hea
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="get_started.introduction.html">GRAG Overview</a><ul>
<li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html#retrieval-augmented-generation">Retrieval-Augmented Generation</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html#retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="get_started.installation.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
<li class="toctree-l1"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a><ul>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#how-to-quantize-models">How to quantize models.</a></li>
<li class="toctree-l1"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a><ul>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#attributes">Attributes</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#parsing-complex-pdf-layouts">Parsing Complex PDF Layouts</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
<li class="toctree-l1"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
<li class="toctree-l1"><a class="reference internal" href="get_started.llms.html">LLMs</a><ul>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-huggingface">To run LLMs using HuggingFace</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a><ul>
10 changes: 6 additions & 4 deletions src/docs/_build/html/get_started.installation.html
@@ -25,7 +25,7 @@
<script src="_static/js/theme.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="To run LLMs using HuggingFace" href="get_started.llms.html" />
<link rel="next" title="Parse PDF" href="get_started.parse_pdf.html" />
<link rel="prev" title="GRAG Overview" href="get_started.introduction.html" />
</head>

@@ -53,8 +53,10 @@
<li class="toctree-l1 current"><a class="reference internal" href="get_started.html">Get Started</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="get_started.introduction.html">GRAG Overview</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">Installation</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">LLMs</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a></li>
</ul>
</li>
@@ -103,7 +105,7 @@ <h1>Installation<a class="headerlink" href="#installation" title="Link to this h
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="get_started.introduction.html" class="btn btn-neutral float-left" title="GRAG Overview" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="get_started.llms.html" class="btn btn-neutral float-right" title="To run LLMs using HuggingFace" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
<a href="get_started.parse_pdf.html" class="btn btn-neutral float-right" title="Parse PDF" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>

<hr/>
25 changes: 18 additions & 7 deletions src/docs/_build/html/get_started.introduction.html
@@ -52,12 +52,14 @@
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="get_started.html">Get Started</a><ul class="current">
<li class="toctree-l2 current"><a class="current reference internal" href="#">GRAG Overview</a><ul>
<li class="toctree-l3"><a class="reference internal" href="#retrieval-augmented-generation">Retrieval-Augmented Generation</a></li>
<li class="toctree-l3"><a class="reference internal" href="#retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="get_started.installation.html">Installation</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">To run LLMs using HuggingFace</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html#to-run-llms-using-llamacpp">To run LLMs using LlamaCPP</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html">Parse PDF</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#table-parsing-methodology">Table Parsing Methodology</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.parse_pdf.html#limitations">Limitations</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.llms.html">LLMs</a></li>
<li class="toctree-l2"><a class="reference internal" href="get_started.vectordb.html">Vector Stores</a></li>
</ul>
</li>
@@ -94,10 +96,19 @@
<h1>GRAG Overview<a class="headerlink" href="#grag-overview" title="Link to this heading"></a></h1>
<p>GRAG provides an implementation of Retrieval-Augmented Generation that is completely open-sourced.
Since it does not use any external services or APIs, this enables a cost-saving solution as well a solution to data privacy concerns.
For more information, refer to <a class="reference internal" href="get_started.vectordb.html#vector-stores"><span class="std std-ref">Test</span></a>.</p>
<section id="retrieval-augmented-generation">
<h2>Retrieval-Augmented Generation<a class="headerlink" href="#retrieval-augmented-generation" title="Link to this heading"></a></h2>
<p>Re</p>
For more information, refer to <a class="reference external" href="https://github.com/arjbingly/Capstone_5/blob/main/README.md">our readme</a>.</p>
<section id="retrieval-augmented-generation-rag">
<h2>Retrieval-Augmented Generation (RAG)<a class="headerlink" href="#retrieval-augmented-generation-rag" title="Link to this heading"></a></h2>
<p>Retrieval-Augmented Generation (RAG) is a technique in machine learning that enhances large language models (LLMs) by incorporating external data.</p>
<p>In RAG, a model first retrieves relevant documents or data from a large corpus and then uses this information to guide the generation of new text. This approach allows the model to produce more informed, accurate, and contextually appropriate responses.</p>
<p>By leveraging both the retrieval of existing knowledge and the generative capabilities of neural networks, RAG models can improve over traditional generation methods, particularly in tasks requiring deep domain-specific knowledge or factual accuracy.</p>
<figure class="align-center" id="id1">
<a class="reference internal image-reference" href="../../_static/basic_RAG_pipeline.png"><img alt="Basic-RAG Pipeline" src="../../_static/basic_RAG_pipeline.png" style="width: 800px;" /></a>
<figcaption>
<p><span class="caption-text">Illustration of a basic RAG pipeline</span><a class="headerlink" href="#id1" title="Link to this image"></a></p>
</figcaption>
</figure>
<p>Traditionally, RAG uses a vector database/vector store for both the retrieval and generation processes.</p>
</section>
</section>
