In [4]:
%load_ext autoreload
%autoreload 2

from typing import List
from preprocessing import FileIO
import tiktoken # tokenizer library for use with OpenAI LLMs



## As a reminder, in practice you would likely create "parent chunks" as part of your preprocessing step and store the larger context directly in your database.  Text is cheap to store.  

### Load data from disk

In [5]:
# each chunk in data must already include a doc_id
data = FileIO().load_parquet('../impact-theory-new-ft-model-256.parquet')

Shape of data: (26448, 12)
Memory Usage: 2.42+ MB


### Regroup Episodes using `video_id`
Groups data into discrete episodes using the video_id key

In [6]:
from itertools import groupby

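def groupby_episode(data: List[dict], key_field: str='video_id') -> List[List[dict]]:
    '''
    Separates entire Impact Theory corpus into individual 
    lists of discrete episodes.  itertools.groupby only groups
    consecutive records, so chunks from the same episode must
    already sit next to each other in data.
    '''
    episodes = []
    for _, group in groupby(data, key=lambda x: x[key_field]):
        # each group is a one-shot iterator, so materialize it as a list
        episodes.append(list(group))
    return episodes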

In [7]:
all_episodes = groupby_episode(data)
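
Because `itertools.groupby` only merges adjacent records, the grouping step depends on the order of the input. The sketch below uses invented toy records (not the Impact Theory data) to show what happens when an episode's chunks are not contiguous, and applies a stable sort as one possible safeguard. The `get_video_id` helper is a hypothetical name used only for illustration.

```python
from itertools import groupby

# Toy records, invented for illustration: episode 'a' is interrupted by episode 'b'
toy = [
    {'video_id': 'a', 'doc_id': 'a_0'},
    {'video_id': 'b', 'doc_id': 'b_0'},
    {'video_id': 'a', 'doc_id': 'a_1'},
]

def get_video_id(chunk: dict) -> str:
    return chunk['video_id']

# Unsorted input: episode 'a' comes back as two separate groups
unsorted_groups = [list(g) for _, g in groupby(toy, key=get_video_id)]

# sorted() is stable, so chunks keep their relative order within each episode
sorted_toy = sorted(toy, key=get_video_id)
sorted_groups = [list(g) for _, g in groupby(sorted_toy, key=get_video_id)]

print(len(unsorted_groups))  # 3
print(len(sorted_groups))    # 2
```

If the parquet file already stores each episode's chunks together, the sort is unnecessary.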

### Combine episode chunks into parent chunks, one for each doc_id

In [8]:
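def create_parent_chunks(episode_list: List[List[dict]], window_size: int=2) -> List[dict]:
    '''
    Creates parent chunks from the original chunks of text, for use with 
    small-to-big retrieval.  Window size sets the number of chunks before
    and after the original chunk.  For example, a window_size of 2 will 
    return up to five joined chunks: 2 chunks before the original, the 
    original, and 2 chunks after the original.  Chunks near the start or 
    end of an episode have fewer neighbors, so their parent chunks are 
    shorter.  Chunks are joined in the order they appear within each 
    episode, which is assumed to match doc_id order.  Returns a list of 
    single-entry dicts of the form {doc_id: parent_chunk}.
    '''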
    parent_chunks = []
    for episode in episode_list:
        contents = [d['content'] for d in episode]
        for i, d in enumerate(episode):
            doc_id = d['doc_id']
            start = max(0, i-window_size)
            end = i+window_size+1
            chunk = ' '.join(contents[start:end])
            parent_chunks.append({doc_id:chunk})
    return parent_chunks

In [10]:
parent_chunks = create_parent_chunks(all_episodes, window_size=1)
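
To make the windowing concrete, here is a minimal sketch of the same slicing logic on a toy episode of four invented chunk texts with `window_size=1`. It shows how the window shrinks at the episode boundaries rather than reaching into a neighboring episode.

```python
contents = ['c0', 'c1', 'c2', 'c3']  # toy chunk texts for a single episode
window_size = 1

parents = []
for i in range(len(contents)):
    start = max(0, i - window_size)  # clamp so a negative index never wraps around
    end = i + window_size + 1        # slicing past the end of a list is safe
    parents.append(' '.join(contents[start:end]))

print(parents)
# ['c0 c1', 'c0 c1 c2', 'c1 c2 c3', 'c2 c3']
```

The first and last chunks get two joined chunks instead of three, because they have a neighbor on only one side.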

### Create in-memory cache and save back to disk (for use with RAG app)

In [15]:
def create_parent_chunk_cache(parent_chunks: List[dict]) -> dict:
    '''
    Creates a simple in-memory cache for quick parent chunk lookup.
    Used for small-to-big retrieval in a RAG system.
    '''
    content_cache = {}
    for chunk in parent_chunks:
        for k,v in chunk.items():
            content_cache[k] = v
    return content_cache

The cache can now be used as a lookup table, with doc_id as the primary key.  Display the smaller content in the UI, but feed the LLM the larger context from the cache.  
Provided every doc_id is unique, the cache should have the same number of keys as there were chunks in the original data, in this case **26,448** keys/doc_ids.
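
As a rough sketch of how the lookup might work at query time: the retrieved hits keep their small `content` for display, while the cache supplies the larger text for the prompt. The `expand_hits` function and the toy hits and cache below are hypothetical, invented for illustration, and are not part of the course code.

```python
from typing import Dict, List

def expand_hits(hits: List[dict], cache: Dict[str, str]) -> List[str]:
    '''
    Returns the parent chunk for each retrieved hit, falling back to
    the hit's own content if its doc_id is missing from the cache.
    '''
    return [cache.get(hit['doc_id'], hit['content']) for hit in hits]

# Toy stand-ins for the real cache and the search results
toy_cache = {'ep1_0': 'first second', 'ep1_1': 'first second third'}
toy_hits = [
    {'doc_id': 'ep1_1', 'content': 'second'},
    {'doc_id': 'ep9_0', 'content': 'orphan'},  # not in the cache
]

ui_snippets = [hit['content'] for hit in toy_hits]  # small text shown in the UI
llm_context = expand_hits(toy_hits, toy_cache)      # large text sent to the LLM

print(ui_snippets)  # ['second', 'orphan']
print(llm_context)  # ['first second third', 'orphan']
```

The fallback to the hit's own content is a design choice for this sketch; raising an error on a missing doc_id may be preferable if the cache is expected to be complete.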

In [19]:
cache = create_parent_chunk_cache(parent_chunks)
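
The heading for this step mentions saving the cache back to disk, but the notebook does not show that cell. One simple option, assuming the cache is a plain dict of strings, is a JSON round trip. The `save_cache` and `load_cache` helpers and the `parent_chunk_cache.json` file name are hypothetical, and the toy cache stands in for the real one.

```python
import json
import os
import tempfile

def save_cache(cache: dict, path: str) -> None:
    '''Writes the cache to disk as JSON.'''
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cache, f, ensure_ascii=False)

def load_cache(path: str) -> dict:
    '''Reads a cache previously written by save_cache.'''
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# Toy stand-in for the real cache; the file name is an assumption
toy_cache = {'ep1_0': 'first second', 'ep1_1': 'first second third'}
path = os.path.join(tempfile.gettempdir(), 'parent_chunk_cache.json')

save_cache(toy_cache, path)
reloaded = load_cache(path)
print(reloaded == toy_cache)  # True
```

In the notebook itself, the call would be `save_cache(cache, ...)` with a path of your choosing. Comparing `len(cache)` with `len(data)` before saving is a quick way to confirm that no doc_id was overwritten.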

### Example: Original "small" data

In [32]:
sample_doc_id = 'nXJBccSwtB8_50'

In [33]:
[d['content'] for d in data if d['doc_id'] == sample_doc_id]

["And that means getting a handle on this technology, and working to help it work for humanity with humanity, as opposed to, you know, not against it, but, you know, kind of irrelevant to it. We don't want technology that does not consider human beings as relevant on the planet. You can reboot your life, your health, even your career, anything you want. All you need is discipline. I can teach you the tactics that I learned while growing a billion dollar business that will allow you to see your goals through. Whether you want better health, stronger relationships, a more successful career, any of that is possible with the mindset and business programs in Impact Theory University. Join the thousands of students who have already accomplished amazing things. Tap now for a free trial and get started today. No, I agree with that. The thing that I think that we're going to have to contend with, though, is what is a governmental response going to be to the potential of their weakened power? So

### Example: Expanded context ("large" data)

In [34]:
cache[sample_doc_id]

"Like, I mean, honestly, I haven't really said this publicly, but we're having a broad enough discussion. Like, I'm, how old are you? 47. Okay, I'm 53. I think that, knock on wood, I don't think that either of us are likely to die of natural causes. I think at our age, we are probably either going to blow ourselves up, you know, as humans, or we're going to have such extraordinary technological advances, that we will be able to dramatically extend lifespans to in ways that are I mean, you know, dealing with with cell death and, and molecular destruction and genetic engineering. And I mean, just looking at what is ahead of us over the next 10-20 years, this does not feel remotely sustainable. But that doesn't mean it's horrible. That means it's one of two tail risks. And I just can't tell if it's the great one or the bad one. But to the extent that I have any role on this planet, I'd like to nudge us, as I know you would do, in the better direction. And that means getting a handle on th

### Example: Highlighting where the original context is

"Like, I mean, honestly, I haven't really said this publicly, but we're having a broad enough discussion. Like, I'm, how old are you? 47. Okay, I'm 53. I think that, knock on wood, I don't think that either of us are likely to die of natural causes. I think at our age, we are probably either going to blow ourselves up, you know, as humans, or we're going to have such extraordinary technological advances, that we will be able to dramatically extend lifespans to in ways that are I mean, you know, dealing with with cell death and, and molecular destruction and genetic engineering. And I mean, just looking at what is ahead of us over the next 10-20 years, this does not feel remotely sustainable. But that doesn't mean it's horrible. That means it's one of two tail risks. And I just can't tell if it's the great one or the bad one. But to the extent that I have any role on this planet, I'd like to nudge us, as I know you would do, in the better direction. **And that means getting a handle on this technology, and working to help it work for humanity with humanity, as opposed to, you know, not against it, but, you know, kind of irrelevant to it. We don't want technology that does not consider human beings as relevant on the planet. You can reboot your life, your health, even your career, anything you want. All you need is discipline. I can teach you the tactics that I learned while growing a billion dollar business that will allow you to see your goals through. Whether you want better health, stronger relationships, a more successful career, any of that is possible with the mindset and business programs in Impact Theory University. Join the thousands of students who have already accomplished amazing things. Tap now for a free trial and get started today. No, I agree with that. The thing that I think that we're going to have to contend with, though, is what is a governmental response going to be to the potential of their weakened power? 
So we know how China is dealing with it. So it was really amazing to watch China open up the capital markets and really just explode. And in your book, you talk about this.** And I found it a really interesting insight that that forced me to reorient my thinking about what China did. And so, you know, if you've read Mao, The Untold Story, it's like it's just devastating to see how much death and destruction came out of an authoritarian government. And then at the same time, you're like, I don't know that America's approach is always the right, the most optimal answer. I forget the exact words you used to every problem. And what you pointed out with China when they opened up, like just the growth rate was pure insanity and is really pretty breathtaking. But they learned from the collapse of Russia exactly what not to do. And now they're clamping back down. Now, as somebody that grew up in the U.S., man, I look at that. I'm just like, dude, that I don't like that. That freaks me out. The thought of always being on that razor's edge of like the individual doesn't matter and we can just completely obliterate you. But then I watch not even the government necessarily in the U.S., but the people in the U.S."
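
The highlighted view above can be produced programmatically. The sketch below wraps the first occurrence of the original chunk inside its parent chunk with a Markdown bold marker. `highlight_original` is a hypothetical helper, and the strings are invented stand-ins for `cache[sample_doc_id]` and the original content. It assumes the original text appears verbatim in the parent, which holds when the parent is built by joining unmodified chunks.

```python
def highlight_original(parent: str, original: str, marker: str = '**') -> str:
    '''
    Wraps the first occurrence of the original chunk inside the parent
    chunk with a marker.  Returns the parent unchanged if not found.
    '''
    start = parent.find(original)
    if start == -1:
        return parent
    end = start + len(original)
    return parent[:start] + marker + original + marker + parent[end:]

# Toy stand-ins for the parent chunk and the original "small" chunk
parent = 'before text. the original chunk. after text.'
original = 'the original chunk.'

print(highlight_original(parent, original))
# before text. **the original chunk.** after text.
```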

#### As a reminder, in practice you would likely create "parent chunks" as part of your preprocessing step and store the larger context directly in your database.  Text is cheap to store.  