# Pydoxtools: Automated, LLM-based article writing with information retrieval from a directory of files in fewer than 100 lines of code!

At Xyntopia, we are excited to introduce our latest creation, Pydoxtools - a versatile Python library designed to streamline AI-powered document processing and information retrieval. This library is perfect for users who are new to programming and want to harness the power of AI in their projects.

This article showcases an efficient method to automate article writing for your project using ChatGPT and Pydoxtools in fewer than 100 lines of code. The script demonstrates the following key functionalities:

- Indexing a directory containing files with Pydoxtools
- Employing an agent for information retrieval within those files
- Auto-generating a text based on a set objective

You can execute this notebook or simply refer to our concise script, which executes these steps in fewer than 100 lines of code:

https://github.com/Xyntopia/pydoxtools/blob/main/examples/automatic_project_writing.py

Or open this notebook in Colab: <a target="_blank" href="https://colab.research.google.com/github/Xyntopia/pydoxtools/blob/main/examples/automated_blog_writing.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Costs and API Key

Please note that ChatGPT is a paid service, and running the script once will cost you about 2-5 cents. Pydoxtools automatically caches all calls to ChatGPT, so subsequent runs are usually a little cheaper. To use ChatGPT, you will need to generate an OpenAI API key by registering an account at [https://platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys). Keep your API key secure and do not share it with anyone.

Pydoxtools already includes open-source LLMs which can do the same for free, locally on your computer. As of May 2023, this is still being tested.

### Safeguarding Your API Key in Google Colab

When working with sensitive information like API keys, it's crucial to ensure their security. In Google Colab, you can save your API key in a separate file, allowing you to share the notebook without exposing the key. To do this, follow these simple steps:

1. Execute the cell below to create a new file in your Colab environment. This file will store your API key, and it will be deleted automatically when the Colab runtime is terminated.

In [None]:
!touch /tmp/openai_api_key

2. Click on the following link in Colab to open the newly created file: /tmp/openai_api_key

3. Copy and paste your API key into the file, then save it.

By following these steps, you can ensure the security of your API key while still being able to share your notebook with others. Happy coding!

## Installation

Follow these simple steps to install and configure Pydoxtools for your projects:

1. Install the Pydoxtools library by running the following command:

In [None]:
%%capture
# if we want to install directly from our repository:
#!pip install -U --force-reinstall --no-deps "pydoxtools[etl,inference] @ git+https://github.com/xyntopia/pydoxtools.git"
!pip install -U pydoxtools[etl,inference]==0.6.3

After installation, restart the runtime to load the newly installed libraries into Jupyter.

2. Now we load the OPENAI_API_KEY from our file.

In [None]:
# load the key as an environment variable:
import os

with open('/tmp/openai_api_key') as f:
    # strip the trailing newline that editors usually add when saving the file
    os.environ['OPENAI_API_KEY'] = f.read().strip()
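
A slightly more defensive variant of this loading step can be sketched as follows. The helper name `load_api_key` and its `env_var` parameter are hypothetical and not part of Pydoxtools; the sketch only assumes that the key is stored as plain text in a file.

```python
import os
from pathlib import Path


def load_api_key(path: str, env_var: str = "OPENAI_API_KEY") -> str:
    """Read a key file, strip surrounding whitespace and export it as an environment variable.

    Hypothetical helper for illustration; not part of Pydoxtools.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        # an empty file usually means the key was never pasted in
        raise ValueError(f"{path} is empty; paste your API key into it first")
    os.environ[env_var] = key
    return key
```

Calling `load_api_key("/tmp/openai_api_key")` would then fail early with a readable message if the file is still empty, instead of failing later on the first request to the API.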

3. Now we can initialize Pydoxtools, which will automatically make use of the OPENAI_API_KEY.

In [None]:
try:
  import logging

  import dask
  from chromadb.config import Settings

  import pydoxtools as pdx
  from pydoxtools import agent as ag
  from pydoxtools.settings import settings

  logger = logging.getLogger(__name__)
  logging.basicConfig(level=logging.INFO)
  logging.getLogger("pydoxtools.document").setLevel(logging.INFO)
except RuntimeError:
    print(f"\n\n\n{'!'*70}\n\nPlease restart the notebook. The error was probably caused by not"
          f"\nrestarting the notebook after installing the libraries.\n\n{'!'*70}")

## Configuration

Pydoxtools can be configured in various ways. For this example, we are using two settings:

- Pydoxtools uses dask in the background to handle large indexes and databases. We could even index terabytes of data this way! For this example, though, we are setting the dask scheduler to "synchronous" so that we can see everything that's happening locally and make it easy to debug the script.
- Pydoxtools has a caching mechanism which caches calls to pydoxtools.Document. This makes subsequent runs much faster during development (for example, the vector index creation or the extraction of other information from documents).

In [None]:
# pydoxtools.DocumentBag uses a dask scheduler for parallel computing
# in the background. For easier debugging, we set this to "synchronous"
dask.config.set(scheduler='synchronous')
# dask.config.set(scheduler='multiprocessing') # can also be used...

settings.PDX_ENABLE_DISK_CACHE = True  # turn on caching for pydoxtools
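
To illustrate why a disk cache speeds up subsequent runs, here is a minimal sketch that uses only the Python standard library. It is not how Pydoxtools implements its cache; the function `cached_call` and the directory name `pdx_cache_demo` are made up for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# hypothetical cache location, chosen only for this demonstration
CACHE_DIR = Path(tempfile.gettempdir()) / "pdx_cache_demo"


def cached_call(func, *args, cache_dir: Path = CACHE_DIR):
    """Return func(*args), re-using a JSON file on disk if the same call was made before."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # the cache key is derived from the function name and its (JSON-serializable) arguments
    key = hashlib.sha256(json.dumps([func.__name__, args]).encode()).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = func(*args)  # the expensive step, e.g. a paid request to an LLM
    cache_file.write_text(json.dumps(result))
    return result
```

The first call with a given set of arguments pays the full price; every later call, even in a new Python session, only reads a small file. The same principle explains why repeated runs of this notebook are faster and cheaper.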

## Download our project

In order for our program to work, we need to provide the AI with information. In this case, we are using files in a directory as a source of information! We are simply downloading the "Pydoxtools" project from GitHub. Essentially, Pydoxtools is writing about itself :-). You could also mount a Google Drive here or simply load a folder on your computer if you're running this notebook locally.

In [None]:
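# use the %cd magic: a plain "!cd" runs in a subshell and does not change the notebook's working directory
%cd /content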
!git clone https://github.com/Xyntopia/pydoxtools.git

## Index initialization

In order for an LLM like ChatGPT to retrieve the information, it needs to be saved in a "vector format". This way, we can retrieve relevant information using nearest-neighbour search. We are using ChromaDB here for this purpose, but there are many other choices available.

In [None]:
##### Use chromadb as a vectorstore #####
chroma_settings = Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=str(settings.PDX_CACHE_DIR_BASE / "chromadb"),
    anonymized_telemetry=False
)

# create our source of information. It creates a list of documents
# in pydoxtools called "pydoxtools.DocumentBag" (which itself holds a list of pydoxtools.Document) and
# here we choose to use pydoxtools itself as an information source!
root_dir = "/content/pydoxtools"
ds = pdx.DocumentBag(
    source=root_dir,
    exclude=[  # ignore some files which make the indexing rather inefficient
        '.git/', '.idea/', '/node_modules', '/dist',
        '/__pycache__/', '.pytest_cache/', '.chroma', '.svg', '.lock',
        "/site/"
    ],
    forgiving_extracts=True
)
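
To make the idea of nearest-neighbour search over vectors more concrete, here is a toy sketch in plain Python. It is only an illustration: the "embedding" is a simple bag-of-words count rather than the learned embeddings a real vector store such as ChromaDB works with, and the function names `embed`, `cosine` and `nearest` as well as the example snippets are invented for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets whose vectors are most similar to the query vector."""
    q = embed(query)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]


snippets = [
    "pydoxtools extracts text from pdf files",
    "dask schedules parallel computations",
    "the vector index stores text snippets",
]
print(nearest("how is text extracted from pdf", snippets, k=1))
```

The query shares the words "text", "from" and "pdf" with the first snippet, so that snippet ranks highest. A real vector store does the same kind of ranking, only with dense vectors and an index that scales to many thousands of snippets.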


## Initialize agent, give it a writing objective and compute the index

Now that we have everything set up, we can initialize our LLM agent with the provided information.
For the [pydoxtools](https://github.com/Xyntopia/pydoxtools) project in this example, computing the index will take about 5-10 minutes. Once the computation has finished, the vector index for the project will contain about 4000 text snippets. When using the Pydoxtools cache, subsequent calculations will be much faster (~1 min).

In [None]:
final_result = []

agent = ag.LLMAgent(
    vector_store=chroma_settings,
    objective="Write a blog post, introducing a new library (which was developed by us, "
              "the company 'Xyntopia') to "
              "visitors of our corporate webpage, which might want to use the pydoxtools library but "
              "have no idea about programming. Make sure, the text is about half a page long.",
    data_source=ds
)
agent.pre_compute_index()

## Search for relevant Information

The agent is able to store information as question/answer pairs and makes use of that information when executing tasks. To get the algorithm started more quickly, we answer a basic question manually. In a real app, you could ask questions like this in a user dialog.

In [None]:
agent.add_question(question="Can you please provide the main topic of the project or some primary "
                            "keywords related to the project, "
                            "to help with identifying the relevant files in the directory?",
                    answer="python library, AI, pipelines")

Having this information, we ask the agent to come up with a few more questions that it needs to answer before being able to write the article.

In [None]:
# first, gather some basic information...
questions = agent.execute_task(
  task="What additional information do you need to create a first, very short outline as a draft? " \
        "provide it as a ranked list of questions", save_task=True)

In [None]:
print("\n".join(questions))

Having created this list of questions, we can now ask the agent to research them by itself. It will automatically
use the index we computed above for this task.

In [None]:
# we only use the first 5 provided questions to make it faster ;).
agent.research_questions(questions[:5], allowed_documents=["text/markdown"])

After retrieving this information, we can tell the agent to write the text. We tell it to automatically make use of the information by setting the "context_size" parameter to a value greater than 0. This value represents how many pieces of stored information it will use to fulfill the task.

In [None]:
txt = agent.execute_task(task="Complete the overall objective, formulate the text "
                              "based on answered questions and format it in markdown.",
                          context_size=20, max_tokens=1000, formatting="txt")
final_result.append(txt)  # add a first draft to the result
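
How stored question/answer pairs might end up in a prompt can be sketched in a few lines. This is an assumption-laden illustration, not the actual Pydoxtools implementation: the function `build_prompt`, the prompt layout and the second example answer are invented, and a real agent would typically select the most relevant pieces instead of simply taking the first ones.

```python
def build_prompt(objective: str, task: str, memory: list[tuple[str, str]], context_size: int) -> str:
    """Combine the objective, up to `context_size` stored Q/A pairs and the task into one prompt.

    Hypothetical helper: it takes the first `context_size` pairs, whereas a real
    agent would rank the stored pieces by relevance to the task.
    """
    parts = [f"Objective: {objective}"]
    for question, answer in memory[:context_size]:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


# illustrative memory content; the second pair is made up for this example
memory = [
    ("What is the main topic?", "python library, AI, pipelines"),
    ("Who is the audience?", "visitors without programming experience"),
]
prompt = build_prompt("Write a blog post", "Write the first draft", memory, context_size=1)
print(prompt)
```

With `context_size=0`, no stored information is included at all; a larger value gives the model more material to work with, at the cost of a longer and therefore more expensive prompt.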

Now that we have a first draft of the text, let's critique it to improve the quality! From this
critique, we then create a new list of tasks that we can give to the agent to execute one by one, gradually improving our text.

In [None]:
# note the f-prefix: without it, "{txt}" would be sent to the LLM literally
critique = agent.execute_task(task=f"Given this text:\n\n```markdown\n{txt}\n```"
                                    "\n\nlist 5 points of critique about the text",
                              context_size=0, max_tokens=1000)

tasks = agent.execute_task(
    task=f"Given this text:\n\n```markdown\n{txt}\n```\n\n"
          f"and its critique: {critique}\n\n"
          "Generate instructions that would make it better. "
          "Sort them by importance and return it as a list of tasks",
    context_size=0, max_tokens=1000)

for t in tasks:
    task = "Given this text:\n\n" \
            f"```markdown\n{txt}\n```\n\n" \
            f"Make the text better by executing this task: '{t}' " \
            f"and integrate it into the given text, but keep the overall objective in mind."
    txt = agent.execute_task(task, context_size=10, max_tokens=1000, formatting="markdown")
    final_result.append([task, txt])
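
The control flow of this improvement step (draft, critique, revise) can be tried out offline with a stand-in for the model. The sketch below is a simplified illustration and not the Pydoxtools agent: `refine` and `fake_llm` are hypothetical names, and the critique is applied in one revision per round rather than as a list of separate tasks.

```python
from typing import Callable


def refine(draft: str, llm: Callable[[str], str], rounds: int = 1) -> list[str]:
    """Run critique -> revise cycles and return every version of the text."""
    versions = [draft]
    for _ in range(rounds):
        critique = llm(f"List points of critique about this text:\n\n{versions[-1]}")
        revised = llm(
            f"Given this text:\n\n{versions[-1]}\n\n"
            f"and its critique:\n\n{critique}\n\nRewrite the text to address the critique."
        )
        versions.append(revised)
    return versions


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so that the loop can run without an API key."""
    if prompt.startswith("List points"):
        return "- too short"
    return "A longer, improved draft."


history = refine("A draft.", fake_llm, rounds=2)
print(len(history), history[-1])
```

Keeping every intermediate version, as `final_result` does in the notebook, makes it easy to inspect how the text evolved and to fall back to an earlier version if a revision made things worse.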

In [None]:
print("\n".join(str(t) for t in tasks))

In [None]:
# for debugging, you can see all intermediate results, simply uncomment the variable to check:

#final_result  # for the evolution of the final text
#agent._debug_queue  # in order to check all requests made to llms and vectorstores etc...

## Final text

After all the processing is finally done, here is the final text:

In [None]:
from IPython.display import Markdown

# drop the code fence and its "markdown" language tag that the LLM may wrap around the text
Markdown(txt.strip().strip("`").removeprefix("markdown"))

## Conclusion

Pydoxtools is a powerful and user-friendly Python library that makes it easy to harness the power of AI for document processing and information retrieval. Whether you are new to programming or an experienced developer, Pydoxtools can help you streamline your projects and achieve your goals. Give it a try today and experience the benefits for yourself!

Find more information at the following links:

- [https://pydoxtools.xyntopia.com](https://pydoxtools.xyntopia.com)
- [https://github.com/xyntopia/pydoxtools](https://github.com/xyntopia/pydoxtools)
- [https://www.xyntopia.com](https://www.xyntopia.com)