This notebook shows how to build a simple RAG pipeline with **ragpipe** over Paul Graham's essay 'Founder Mode'.

In [1]:
%reload_ext autoreload
%autoreload 2

Scrape the HTML of Paul Graham's 'Founder Mode' essay and extract its text using BeautifulSoup.

In [3]:

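import requests
from bs4 import BeautifulSoup

def get_doc_text():
    """Download the essay and return its plain text."""
    url = 'https://paulgraham.com/foundermode.html'
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.content, 'html.parser')
    # Turn <br> tags into newlines so that paragraph breaks
    # survive text extraction.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    return soup.get_text()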


Here's the beginning of the parsed document text.

In [6]:
# Keep the full text in dtext; print only the first 500 characters as a preview.
dtext = get_doc_text()
print(dtext[:500], '....')

Founder Mode





September 2024

At a YC event last week Brian Chesky gave a talk that everyone who
was there will remember. Most founders I talked to afterward said
it was the best they'd ever heard. Ron Conway, for the first time
in his life, forgot to take notes. I'm not going to try to reproduce
it here. Instead I want to talk about a question it raised.

The theme of Brian's talk was that the conventional wisdom about
how to run larger companies is mistaken. As Airbnb grew, well-meaning
pe ....


#### Build Data Model

Now let's chunk the document text into paragraphs and store them in a (nested) dictionary, the *data model*.

We will use a simple text splitter here.

In [4]:

import sys
sys.path.insert(0, '..')

from ragpipe.common import DotDict

def build_data_model(text):
    # Split on blank lines; each non-empty paragraph becomes one document.
    paragraphs = text.split('\n\n')
    paragraphs = [dict(text=p.strip()) for p in paragraphs if p.strip()]

    return DotDict(documents=paragraphs)

# We refer to D as the hierarchical document model. 
# It consists of a list of documents, where each document is a dictionary with a 'text' field.

D = build_data_model(dtext) 

#### Create a config yaml with representations and bridges.

To find document chunks *relevant* to the user query, we match the field `query.text` against the `text` field of each document. The *config yaml* specifies how this matching is done.

Docpath notation:
- We write hierarchical document fields in `jq`-style notation and call them *docpaths*.
- For example, in the data dictionary above, the docpath for the `text` fields is `documents[].text` (written as `.documents[].text` in the config below).
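
To make the notation concrete, here is a minimal sketch of how a docpath such as `documents[].text` could be expanded into concrete paths over a nested dictionary. The `resolve_docpath` helper and `D_example` are hypothetical names used only for illustration; this is not ragpipe's actual resolver.

```python
def resolve_docpath(data, docpath):
    """Return (concrete_path, value) pairs for a jq-like docpath such as 'documents[].text'."""
    # Start from the root and narrow the set of matches one path segment at a time.
    matches = [("", data)]
    for part in docpath.strip(".").split("."):
        is_list = part.endswith("[]")
        key = part[:-2] if is_list else part
        next_matches = []
        for path, value in matches:
            child = value[key]
            child_path = f"{path}.{key}" if path else key
            if is_list:
                # '[]' fans out over every element of the list.
                for i, item in enumerate(child):
                    next_matches.append((f"{child_path}.{i}", item))
            else:
                next_matches.append((child_path, child))
        matches = next_matches
    return matches

# Hypothetical two-paragraph data model, shaped like D above.
D_example = {"documents": [{"text": "Founder Mode"}, {"text": "September 2024"}]}
resolved = resolve_docpath(D_example, "documents[].text")
for path, value in resolved:
    print(path, "->", value)
```

Each `[]` segment fans out into one numbered path per list element, which is the same `documents.<number>.text` shape that appears in the retrieval output later in this notebook.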

The config below specifies this matching in the following steps:

1. **Specify *representations* (reps) for both query and document fields**, here using a dense encoder.
    - We define a representation (rep) named `dense` for both `query.text` and all chunk fields `documents[].text`.
    - The reps are denoted as `query.text#dense` and `documents[].text#dense`.
    - Both reps are created using the `BAAI/bge-small-en-v1.5` text encoder.

2. **Specify one or more *bridges*** over the reps.
    - Here, we define the bridge **b_dense**, which matches the `query.text#dense` and `documents[].text#dense` reps to find relevant doc paragraphs.
    - The *output* of evaluating a bridge is a ranked list of documents ([(`documents.<number>.text`, `score`)]), containing up to `limit` documents.

3. **Specify one or more *pipelines* (also called *merges*) over bridges.**
    - Pipelines combine bridges sequentially or in parallel. They are not required for this simple example.
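
As a rough illustration of what evaluating a bridge involves, the sketch below encodes the query and each document, scores them by cosine similarity, and returns up to `limit` ranked `(docpath, score)` pairs. It is not ragpipe's implementation: `toy_encode` is a hypothetical word-count stand-in for a real dense encoder such as `BAAI/bge-small-en-v1.5`, and `eval_bridge` and the sample `docs` are assumed names used only here.

```python
import math
import re
from collections import Counter

def toy_encode(text):
    """Stand-in encoder: lowercase word counts (NOT a real dense embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def eval_bridge(query_text, documents, limit=10):
    """Rank documents by similarity to the query; return up to `limit` (docpath, score) pairs."""
    q = toy_encode(query_text)
    scored = [
        (f"documents.{i}.text", cosine(q, toy_encode(doc["text"])))
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Hypothetical documents, shaped like the entries of D["documents"].
docs = [
    {"text": "Founder Mode"},
    {"text": "September 2024"},
    {"text": "Business schools don't know founder mode exists."},
]
ranked = eval_bridge("What is founder mode?", docs, limit=2)
for path, score in ranked:
    print(f"{score:.3f}  {path}")
```

With a real dense encoder the scores reflect semantic similarity rather than word overlap, but the overall shape (encode, score, sort, truncate to `limit`) stays the same.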



In [8]:

config_yaml = '''
encoders:
  bge_small:  # dense encoder
    name: BAAI/bge-small-en-v1.5
    query_instruction: "Represent this sentence for searching relevant passages:"
  bm25:  # not used in this example
    name: bm25

representations:
  query.text:
    dense: {encoder: bge_small}

  .documents[].text:
    dense: {encoder: bge_small}

bridges:
  b_dense:  # bridge over dense reps
    repnodes: query.text#dense, .documents[].text#dense
    limit: 10
'''



Load the config yaml and view the full config.

In [6]:
from ragpipe.config import load_config

config = load_config(config_yaml, is_file=False, show=True)


{'bridges': {'b_dense': {'enabled': True,
                         'limit': 10,
                         'repnodes': ['query.text#dense',
                                      '.documents[].text#dense']}},
 'config_fname': '37230f31-f99f-5ae9-a53f-0e170b063e54',
 'dbs': {'__default_multi_vector__': {'name': 'tensordb',
                                      'options': {},
                                      'path': '/tmp/ragpipe/'},
         '__default_single_vector__': {'name': 'chromadb',
                                       'options': {},
                                       'path': '/tmp/ragpipe/'}},
 'enabled_merges': ['_rp_m1_'],
 'encoders': {'bge_small': {'dtype': '',
                            'name': 'BAAI/bge-small-en-v1.5',
                            'query_instruction': 'Represent this sentence for '
                                                 'searching relevant passages:',
                            'with_index': False},
              'bm25': {'dtype': '', '

Retrieve docs relevant to the query by evaluating the bridge `b_dense` on the data model `D`.
View the retrieved docs in ranked order.

In [9]:
from ragpipe import Retriever

query_text = 'What is Founder mode?'

docs_retrieved = Retriever(config).eval(query_text, D)

print(f'\nquery: {query_text}')
print(f'\ndocuments retrieved:\n')
for doc in docs_retrieved: doc.show()

Start eval Bridge(b_dense): repnodes=['query.text#dense', '.documents[].text#dense'] limit=10 enabled=True evalfn=None matchfn=None
computing reps for query.text#dense...
computing reps for .documents[].text#dense...
Retrieving docs for Bridge b_dense...
ENCODER =  name='BAAI/bge-small-en-v1.5' mo_loader=<function FastEmbedEncoder.from_config.<locals>.<lambda> at 0x104fd60c0> rep_type='single_vector' config=EncoderConfig(name='BAAI/bge-small-en-v1.5', prompt=None, query_instruction='Represent this sentence for searching relevant passages:', with_index=False, module=None, dtype='', size=None, shape=None)
get_similarity_fn: BAAI/bge-small-en-v1.5
Computing merge _rp_m1_...

query: What is Founder mode?

documents retrieved:

 👉 0.902  (documents.0.text) 👉  Founder Mode
 👉 0.731  (documents.7.text) 👉  There are as far as I know no books specifically about founder mode.
Business schools don't know it exists. All we have so far are the
experiments of individual founders who've b

In [17]:
prompt_templ = '''
    The following are snippets from a document in markdown format.
    # documents

    {{documents}}

    Answer the following query based on the above document snippets.

    {{query}}
    Answer:
'''
from ragpipe.llms import respond_to_contextual_query
resp = respond_to_contextual_query(query_text, docs_retrieved, prompt_templ)
print(resp)


Based on the document snippets provided, "Founder mode" can be inferred to be a unique way of running a company, characterized by the continued involvement and leadership of the original founder(s), even as the company grows. It differs from "manager mode" which is typically associated with traditional business management practices. Founder mode emphasizes a more personal and direct engagement of the founder with various levels of the company, and may involve unconventional methods that may not fit into traditional business norms. However, the specifics of founder mode are not yet well-defined or understood, as it's a relatively new concept and there's little formal literature about it. It's also noted that as the concept becomes more established, there's a risk of it being misused or misunderstood.
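
The prompt template above uses `{{documents}}` and `{{query}}` placeholders, which `respond_to_contextual_query` fills in before calling the LLM. The sketch below shows one way such filling could work, assuming plain string substitution; ragpipe may well use a different mechanism, and `fill_prompt`, `template` and the sample snippets are hypothetical.

```python
def fill_prompt(template, query, snippets):
    """Substitute the {{documents}} and {{query}} placeholders in a prompt template."""
    # Render each retrieved snippet as a markdown bullet, separated by blank lines.
    documents = "\n\n".join(f"- {s}" for s in snippets)
    return template.replace("{{documents}}", documents).replace("{{query}}", query)

# Shortened, hypothetical version of the template used above.
template = (
    "# documents\n\n{{documents}}\n\n"
    "Answer the following query based on the above document snippets.\n\n"
    "{{query}}\nAnswer:"
)
prompt = fill_prompt(
    template,
    "What is Founder mode?",
    ["Founder Mode", "Business schools don't know it exists."],
)
print(prompt)
```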


#### Additional features

That's it! This was a quick demo of how to build a naive RAG pipeline with ragpipe. For additional features, read below.

We do a few more things to structure the code better:
- Specify one or more prompts under the `prompts` field in `config.yml` and use them whenever interacting with an LLM.
- Similarly, specify in `config.yml`:
    - multiple queries under the `queries` field.
    - multiple LLM providers under the `llm_models` field.
- Put all of the above functional units together in a `Workflow` class; see `quickstart/project.py`.

Finally, if you are unhappy with the above answer, you can build upon this notebook in many ways to improve results:
- Add hybrid search: add a sparse rep using bm25, add a bridge `b_sparse`, and merge the results using reciprocal rank fusion. See `examples/startups.yml`.
- Add storage, switch the encoder to other models, or add an eval step to compare results.
- Most practical use cases need at least two bridges, one over `...#dense` and one over `...#sparse` representations, whose outputs are merged to create the final ranked list.
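
For reference, here is a minimal sketch of reciprocal rank fusion, the merge step mentioned above. Each document scores `1 / (k + rank)` in every ranked list it appears in, and the per-list scores are summed. The function name, the two sample rankings and the choice `k=60` (a commonly used default) are assumptions for illustration, not ragpipe's merge implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical outputs of a dense bridge and a sparse bridge.
dense_ranking = ["documents.0.text", "documents.7.text", "documents.3.text"]
sparse_ranking = ["documents.7.text", "documents.5.text", "documents.0.text"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{score:.4f}  {doc_id}")
```

Documents that rank well in both lists rise to the top, even if neither bridge placed them first on its own. Because the fusion uses only ranks, the dense and sparse scores do not need to be on a comparable scale.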

