# Creating a tool

These notes record my experience using AI models and tools. The aim is to build a system that searches Google Docs stored in Google Drive using RAG (Retrieval-Augmented Generation). It could later be extended to search images, PDFs, spreadsheets, and Microsoft Office documents. I also want to generate a summary and keywords for each document, which can then be used to classify it.

The steps to do this are:

* access the Google Drive documents and, for each document:
* generate a summary and a list of keywords for the document using an LLM
* create an embedding for the document with its summary, keywords and metadata
* store this information in a vector database
* create a simple tool that queries the database to answer a user query

As a side effect of the work, create tools that

* check that related documents are located together
* check that the folder containing a document is one of its keywords
* check that duplicate files do not exist
* update the vector database for documents that have changed since they were loaded
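
The last check in the list, finding documents that changed since they were loaded, could be sketched by comparing Drive's `modifiedTime` with the `update_time` stored at indexing time. This is only a minimal sketch, not the project's code: `parse_ts`, `find_stale_ids` and the `indexed` mapping are hypothetical names, and it assumes `update_time` is stored as an ISO 8601 string in UTC.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as Drive's '2025-07-18T21:26:16.372Z'."""
    # Older Python versions reject the 'Z' suffix, so normalize it first.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are in UTC.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def find_stale_ids(drive_files: list[dict], indexed: dict[str, str]) -> list[str]:
    """Return ids of files that are new or were modified after being indexed.

    drive_files: Drive metadata dicts with 'id' and 'modifiedTime'.
    indexed: hypothetical mapping of file id -> stored 'update_time' string.
    """
    stale = []
    for f in drive_files:
        stored = indexed.get(f["id"])
        if stored is None or parse_ts(f["modifiedTime"]) > parse_ts(stored):
            stale.append(f["id"])
    return stale
```

Only the ids returned by `find_stale_ids` would then need to be summarized and embedded again.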


It will use LangGraph for the agentic workflow, ollama:nomic-embed-text for embeddings and ollama:llama3.2 as the tool model. The embeddings will be stored in a Chroma vector database. Everything will be written in Python.


| Model | Source | Description | Usage |
| :----- | :----- | :----- | :------ |
| "sentence-transformers/all-MiniLM-L6-v2" | HuggingFace | Free, popular, lightweight model for embeddings | Embedding |
| "facebook/bart-large-cnn" | HuggingFace | Free. Use a pipeline to summarize | Summarization |
| "ollama:nomic-embed-text" | Ollama | Local Embedding | Embedding |
| "ollama:llama3.2" | Ollama | Local Chat model | Summarization |
| "llama-3.3-70b-versatile" | Groq | Rate limited via API calls. Has tool capability | Tools, Summarization |



**Example of metadata from a folder in Google Drive**

```json
{
    "id": "19..",
    "name": "Coaching",
    "parents": ["14...."]
}
```
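
The `parents` field links each folder to its parent, so a full folder path can be rebuilt by walking up the chain. The sketch below illustrates the idea with plain dictionaries; `folder_path` and the `folders` mapping (with made-up ids) are hypothetical and not part of the Drive API. The resulting names could feed the "folder is one of the keywords" check.

```python
def folder_path(folder_id: str, folders: dict[str, dict]) -> list[str]:
    """Return folder names from the top-most known ancestor down to folder_id.

    folders: hypothetical mapping of folder id -> Drive folder metadata
    (dicts with 'name' and an optional 'parents' list).
    """
    names = []
    seen = set()  # guards against cycles in malformed data
    current = folder_id
    while current in folders and current not in seen:
        seen.add(current)
        meta = folders[current]
        names.append(meta["name"])
        parents = meta.get("parents") or []
        # Assumption: only the first parent is followed.
        current = parents[0] if parents else None
    return list(reversed(names))


# Made-up ids for illustration only.
folders = {
    "root": {"id": "root", "name": "My Drive"},
    "f1": {"id": "f1", "name": "Work", "parents": ["root"]},
    "f2": {"id": "f2", "name": "Coaching", "parents": ["f1"]},
}
print(folder_path("f2", folders))  # ['My Drive', 'Work', 'Coaching']
```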

**Example of metadata from a file in Google Drive**

```json
{
    "createdTime": "2025-07-18T19:50:24.713Z",
    "id": "1...",
    "mimeType": "application/vnd.google-apps.document",
    "modifiedTime": "2025-07-18T21:26:16.372Z",
    "name": "Coaching Notes",
    "owners": [{"displayName": "Phillip Lee",
               "emailAddress": "phlee0@gmail.com",
               "kind": "drive#user",
               "me": true,
               "permissionId": "166...",
               "photoLink": "..."}],
    "size": "267115",
    "webViewLink": "https://docs.google.com/document/d/..."
}
```

## Gdrive/gdrive2.py

This script chunks each Google Drive document and creates embeddings for its summary and for its chunks, each with its own metadata. Documents that already exist in the vector database are skipped. The folder is used as the collection ID in the vector database, which makes it possible to search a single folder only.
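
A minimal sketch of the chunk-and-skip step, using only the standard library. The chunk id format matches the one used in the chunk metadata (`{file id}_chunk_{i}`); everything else is an assumption for illustration: `chunk_words`, `new_chunks`, the word-based splitting, the default sizes, and the idea that existing entries are detected by id.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def new_chunks(file_id: str, text: str, existing_ids: set[str],
               **kwargs) -> list[tuple[str, str]]:
    """Return (id, chunk) pairs whose ids are not already in the database."""
    pairs = []
    for i, chunk in enumerate(chunk_words(text, **kwargs)):
        chunk_id = f"{file_id}_chunk_{i}"
        if chunk_id not in existing_ids:  # skip chunks that were stored earlier
            pairs.append((chunk_id, chunk))
    return pairs
```

In practice `existing_ids` would come from the vector database collection for the folder being processed.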

### Summary Metadata

```python
        page_content=summary,   # text
        id=f"doc_{file['id']}",
        metadata={
            "type": "file",
            "source": file["name"],
            "keywords": ', '.join(keywords),
            "update_time": now_str,
            "num_chunks": len(summary_chunks),
            "id": file.get("id"),
            "name": file.get("name"),
            "mimeType": file.get("mimeType"),
            "webViewLink": file.get("webViewLink"),
            "modifiedTime": file.get("modifiedTime"),
            "createdTime": file.get("createdTime"),
            "size": file.get("size"),
            # only the first owner is recorded
            "owner_email": owners[0].get("emailAddress"),
            "owner_name": owners[0].get("displayName"),
        },
```
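
The fragment above uses `owners` and `now_str` without showing where they come from. The sketch below shows one way they might be derived and how the nested Drive record could be flattened into scalar values. It is an illustration under assumptions: `build_summary_metadata` is a hypothetical helper, `update_time` is assumed to be the indexing time in UTC, and missing values are dropped on the assumption that the vector store prefers plain scalar metadata.

```python
from datetime import datetime, timezone


def build_summary_metadata(file: dict, keywords: list[str], num_chunks: int) -> dict:
    """Flatten a Drive file record into scalar-only metadata for the summary entry."""
    # Assumption: update_time is the indexing time as an ISO 8601 UTC string.
    now_str = datetime.now(timezone.utc).isoformat()
    # 'owners' may be missing or empty, so fall back to a single empty dict.
    owners = file.get("owners") or [{}]
    meta = {
        "type": "file",
        "source": file.get("name"),
        "keywords": ", ".join(keywords),  # a list becomes one string
        "update_time": now_str,
        "num_chunks": num_chunks,
        "id": file.get("id"),
        "mimeType": file.get("mimeType"),
        "modifiedTime": file.get("modifiedTime"),
        "owner_email": owners[0].get("emailAddress"),
        "owner_name": owners[0].get("displayName"),
    }
    # Drop fields that Drive did not return instead of storing None.
    return {k: v for k, v in meta.items() if v is not None}
```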

### Chunk Metadata

```python
            page_content=chunk,  # text
            id=f"{file['id']}_chunk_{i}",
            metadata={
                "type": "chunk",
                "source": file["name"],
                "update_time": now_str,
                "chunk_size": len(chunk.split()),
                "chunk_index": i,
            },
```



A simple RAG query is used to ask questions about the documents. If the information cannot be found, the tool answers that it does not know.
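
The "does not know" behaviour can be sketched as a prompt that restricts the model to the retrieved chunks, plus a short-circuit when nothing is retrieved. This is a sketch rather than the actual query code: `build_prompt`, `answer`, the prompt wording, and the `retrieve` and `llm` callables (stand-ins for the vector search and the chat model) are all assumptions.

```python
from typing import Callable

FALLBACK = "I don't know."


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        f"If the context does not contain the answer, reply exactly: {FALLBACK}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question: str,
           retrieve: Callable[[str], list[str]],
           llm: Callable[[str], str]) -> str:
    """Retrieve chunks, then ask the model; skip the model if nothing was found."""
    chunks = retrieve(question)
    if not chunks:
        return FALLBACK
    return llm(build_prompt(question, chunks)).strip()
```

Passing `retrieve` and `llm` as plain callables keeps the sketch independent of any particular vector store or model client.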


