# Pre-process data

With permission of the author, we will demonstrate how to query the book [myths of automations](https://anonette.net), written by Denisa Kera.

To achieve this, we will first split the book into chunks, each roughly a page long, then ask the model for a summary of each paragraph and for three suggested questions about it.

In [16]:
import os, json, warnings

## 1. Read in the data

In [17]:
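# Forward slash avoids the invalid "\B" escape; utf-8-sig drops a leading byte-order mark if present
with open("data/BookFinal2023_noref.txt", "r", encoding="utf-8-sig") as f: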
    text = f.read()

### 1.1 Count the tokens in each chunk

In [18]:
import openai
from transformers import GPT2Tokenizer

# The GPT-2 tokenizer matches the one used by the original GPT-3 models.
# Newer models such as gpt-4 use a different encoding, so treat these
# counts as estimates of the number of tokens in the text.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

chunks = text.split('  ')
ntokens = [len(tokenizer.encode(chunk)) for chunk in chunks]
max(ntokens)

335

Split the data into chunks without cutting in the middle of a paragraph, and try to balance the token count against the number of paragraphs per chunk.
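To make the rule concrete, here is a minimal sketch of cutting a text at every n-th delimiter with `str.split`. The helper name `split_every_nth` and the toy string are assumptions made for illustration, not part of the notebook. Because `str.split` matches delimiters without overlap, the result may differ from a character-by-character scan when the text contains runs of more than two spaces.

```python
def split_every_nth(text, delimiter, n):
    """Cut `text` at every n-th occurrence of `delimiter` (hypothetical helper)."""
    pieces = text.split(delimiter)
    # Re-join n pieces at a time, so the delimiters inside a chunk are preserved
    return [delimiter.join(pieces[i:i + n]) for i in range(0, len(pieces), n)]


toy = "p1  p2  p3  p4  p5"
toy_chunks = split_every_nth(toy, "  ", 3)
print(toy_chunks)  # ['p1  p2  p3', 'p4  p5']
```

Each chunk holds up to three delimiter-separated pieces, and the last chunk takes whatever is left over.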

In [19]:
def split_text_keep_format(text, delimiter, split_count):
    chunks = []
    count = 0
    start = 0

    # Iterate over the characters in the text
    for i in range(len(text) - len(delimiter) + 1):
        # If the current slice of the text equals the delimiter
        if text[i:i+len(delimiter)] == delimiter:
            count += 1
            # If the count is equal to the split count
            if count == split_count:
                # Append the current chunk to the list
                chunks.append(text[start:i])
                # Reset the start and count
                start = i + len(delimiter)
                count = 0

    # Append the last chunk to the list
    chunks.append(text[start:])

    return chunks

chunks = split_text_keep_format(text, '  ', 3)

# Recount the tokens for the new, larger chunks
ntokens = [len(tokenizer.encode(chunk)) for chunk in chunks]

max_tokens = max(ntokens)
print(chunks[0])
print(chunks[1])


1. Myth of automation

1.0 Rituals, Instruments, and Prototypes 
   Algorithms and data in models and ledgers enact an old fantasy of a future governed by rituals that become instruments, machines, and infrastructures. It is a fantasy of time control and automation that work as a deus ex machina or devil's bridge, offering miracles that turn into curses. Examples go back to the 'predictive analytics' with olive presses in the 6th century BCE, the complaints about the merciless water clock in the 4th century BCE, and Plautus' famous curse of the sundial (Chapter 2). These classic loci show how the fantasies of automation and control over time quickly turn into anxieties about bias, precarity, loss of agency, and sovereignty. 

 Dreams and fears of automation emerge with every new instrument and infrastructure. From the early calendars and clocks to today's reputation and scoring systems, predictive AI, or smart contracts on trustless blockchain ledgers, automation promises a frictionles

It turns out that the delimiter used above is a good separator in this case, because it does not break the flow of the text. Also, no individual chunk is larger than 1500 tokens. The model called below is gpt-4, which has a context window of 8,192 tokens, so we don't need to worry about breaking the chunks down further.
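A model's token limit covers the prompt and the reply together, and here the reply repeats every paragraph and adds a summary and questions, so it can be longer than the input. The sketch below flags chunks that would not leave enough room for the reply. The function name, the reserved share and all numbers are placeholders chosen for illustration; set `context_limit` to the limit of the model you actually call.

```python
def chunks_over_budget(ntokens, context_limit, reserved_for_output):
    """Return the indices of chunks that leave too little room for the reply."""
    budget = context_limit - reserved_for_output
    return [i for i, n in enumerate(ntokens) if n > budget]


# Placeholder token counts and limits, for illustration only
example_counts = [335, 1200, 5000]
too_long = chunks_over_budget(example_counts, context_limit=8192, reserved_for_output=4096)
print(too_long)  # [2]
```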

We will group the shorter chunks into chunks of up to `max_len` tokens (1777 by default in the function below), to increase the coherence of the text and decrease the frequency of breaks within it.

In [20]:
def group_chunks(chunks, ntokens, max_len=1777, hard_max_len=3000):
    """
    Group very short chunks, to form approximately page long chunks.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0

    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        # discard chunks that exceed hard max length
        if ntoken > hard_max_len:
            print(f"Warning: Chunk discarded for being too long ({ntoken} tokens > {hard_max_len} token limit). Preview: '{chunk[:50]}...'")
            continue

        # start a new batch without a leading separator
        if not cur_batch:
            cur_batch = chunk
            cur_tokens = ntoken
        # if room in current batch, add new chunk
        elif cur_tokens + 1 + ntoken <= max_len:
            cur_batch += "\n\n" + chunk
            cur_tokens += 1 + ntoken  # adds 1 token for the two newlines
        # otherwise, record the batch and start a new one
        else:
            batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken

    if cur_batch:  # add the last batch if it's not empty
        batches.append(cur_batch)

    return batches


chunks = group_chunks(chunks, ntokens)
len(chunks)

68

In [21]:
print(chunks[0])



﻿1. Myth of automation

1.0 Rituals, Instruments, and Prototypes 
   Algorithms and data in models and ledgers enact an old fantasy of a future governed by rituals that become instruments, machines, and infrastructures. It is a fantasy of time control and automation that work as a deus ex machina or devil's bridge, offering miracles that turn into curses. Examples go back to the 'predictive analytics' with olive presses in the 6th century BCE, the complaints about the merciless water clock in the 4th century BCE, and Plautus' famous curse of the sundial (Chapter 2). These classic loci show how the fantasies of automation and control over time quickly turn into anxieties about bias, precarity, loss of agency, and sovereignty. 


 Dreams and fears of automation emerge with every new instrument and infrastructure. From the early calendars and clocks to today's reputation and scoring systems, predictive AI, or smart contracts on trustless blockchain ledgers, automation promises a frictio

In [22]:
prompt = f'''you are a text to json converter. you are given a text with following rules
                * titles start with no space and ends with newline
                * paragraphs are lines starting with two spaces and end with two spaces
                add each paragraph, unmodified and uncut, to a list of paragraphs
                add summary using the words of the author to each paragraph
                add three questions to each paragraph , in the voice of a layman 
                return response as valid JSONL.
                '''


def chat_oracle_json(chunk, prompt=prompt, indexc=0):
    """Ask the model to convert one chunk and save the raw reply under chunks/."""
    # Format the index as a string with 4 digits, adding leading zeros as needed
    index_str = str(indexc).zfill(4)

    # Define the file path, creating the output directory on first use
    os.makedirs('chunks/', exist_ok=True)
    file_path = os.path.join('chunks/', f'chunk_{index_str}.json')

    # Skip chunks that were already processed in an earlier run
    if os.path.isfile(file_path):
        print(f'File {file_path} already exists. Skipping.')
        return

    # Note: openai.ChatCompletion is the interface of the openai package before version 1.0
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.77,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": chunk},
        ],
    )

    # The message holds the role and the generated content
    response = completion.choices[0].message

    # Save the output to a file
    with open(file_path, 'w') as outfile:
        json.dump(response, outfile, indent=2)



In [23]:
# Ask for confirmation before starting the long-running loop of API calls
user_confirmation = input("Do you want to start running the gpt cell? (yes/no): ")

if user_confirmation.lower() == 'yes':
    # The loop below only runs if the user enters 'yes' at the prompt
    print("Running the long cell...")

    for i, chunk in enumerate(chunks):
        print(f"item {i}")
        chat_oracle_json(chunk=chunk, indexc=i)
        
    print("Long cell execution completed.")
else:
    print("Cell execution skipped.")


Running the long cell...
item 0
File chunks/chunk_0000.json already exists. Skipping.
item 1
File chunks/chunk_0001.json already exists. Skipping.
item 2
File chunks/chunk_0002.json already exists. Skipping.
item 3
File chunks/chunk_0003.json already exists. Skipping.
item 4
File chunks/chunk_0004.json already exists. Skipping.
item 5
File chunks/chunk_0005.json already exists. Skipping.
item 6
File chunks/chunk_0006.json already exists. Skipping.
item 7
File chunks/chunk_0007.json already exists. Skipping.
item 8
File chunks/chunk_0008.json already exists. Skipping.
item 9
File chunks/chunk_0009.json already exists. Skipping.
item 10
File chunks/chunk_0010.json already exists. Skipping.
item 11
File chunks/chunk_0011.json already exists. Skipping.
item 12
File chunks/chunk_0012.json already exists. Skipping.
item 13
File chunks/chunk_0013.json already exists. Skipping.
item 14
File chunks/chunk_0014.json already exists. Skipping.
item 15
File chunks/chunk_0015.json already exists. Ski
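The loop above makes one request per chunk, so a single failed request stops the run, although the files already saved let a rerun pick up where it left off. As a possible addition, the sketch below retries a call with exponential backoff. The names `call_with_retries` and `flaky` are hypothetical, and the stand-in function only simulates a failing request; in real use the `except` clause should be narrowed to the error types raised by the client library.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # assumption: narrow this to the client's error types
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            sleep(delay)


# Stand-in for a flaky request: fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, base_delay=0.01)
print(result)
```

In the notebook, `fn` could be something like `lambda: chat_oracle_json(chunk=chunk, indexc=i)`.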

Clean the processed chunks in `chunks/` and write the result to `preprocess/jsons/`.

In [24]:
def convert_content_to_json(input_dir, output_dir):
    """Extract the JSON that the model wrote into each saved reply and store it."""
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Get all files in the input directory
    for root, dirs, files in os.walk(input_dir):
        for file in files:
            # Construct full file path
            file_path = os.path.join(root, file)

            # Load the saved chat message (a dict with 'role' and 'content')
            with open(file_path, "r") as input_file:
                dictionary = json.load(input_file)

            # The 'content' field is a string that should itself contain JSON;
            # raw_decode does not skip leading whitespace, so strip it first
            content = dictionary["content"].lstrip()

            # Parse the first JSON value and ignore anything that follows it
            json_content_inner, end = json.JSONDecoder().raw_decode(content)

            # Warn if there's more data after the first JSON object
            # if end < len(content):
            #     warnings.warn(f"Extra data found in file {file} after the first JSON object, which was ignored.")

            # Construct output file path
            output_file_path = os.path.join(output_dir, file)

            # Write the converted content to the output file
            with open(output_file_path, "w") as output_file:
                json.dump(json_content_inner, output_file, indent=4)  # Using indent for pretty-printing

# Set the input and output directories
input_dir = "chunks/"
output_dir = "preprocess/jsons/"

# Call the function
convert_content_to_json(input_dir, output_dir)
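
The prompt asks for JSONL, so a reply may hold several JSON objects in a row, while the cleaning function above keeps only the first one. If the later objects matter, a loop over `json.JSONDecoder.raw_decode` can collect all of them. This is a minimal sketch: the helper name `decode_all` and the `"paragraph"` key in the sample are hypothetical, since the actual keys depend on what the model returns.

```python
import json


def decode_all(text):
    """Decode every JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    values = []
    idx = 0
    while idx < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        value, idx = decoder.raw_decode(text, idx)
        values.append(value)
    return values


# Hypothetical two-line JSONL reply
sample = '{"paragraph": "first"}\n{"paragraph": "second"}\n'
records = decode_all(sample)
print(len(records))  # 2
```

A reply that is not valid JSON still raises `json.JSONDecodeError`, so malformed chunks show up instead of being silently skipped.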
