# Thought Notebook
New project: a quick proof-of-concept notebook for artificial thought.

The goal here is to turn audio transcriptions into focused summaries. Let's see if I can turn unstructured text into structured text without any added context or input parameters.

THIS NOTEBOOK IN PARTICULAR: a concept for using LCEL (LangChain Expression Language) to process text data.

## Notes
### 20240124:
Working on learning to build chains with LangChain's LCEL. I want to take these to a new level of complexity to allow for data processing, forks, and parallel execution.

What I want to show near term (for this dev notebook) is a linear, serial flow from raw transcript to structured summary.

1) pre-label -> clean up -> summary

Then add complexity:
2.1) pre-label -> clean up -> store
2.2) pre-label -> clean up -> store
2.3) retrieve (2.1 and 2.2) -> label -> summary
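
The two flows above can be sketched without any LLM calls, using plain Python functions as stand-ins for each stage. Everything in this sketch (`pipe`, `pre_label`, `clean_up`, `summarize`, `save`, and the in-memory `store`) is a hypothetical placeholder rather than LangChain API; it only shows the shape of the data passed between stages.

```python
from functools import reduce
from typing import Callable, Dict, List

State = Dict[str, object]


def pipe(*stages: Callable[[State], State]) -> Callable[[State], State]:
    """Compose stages left to right, similar in spirit to `a | b | c` in LCEL."""
    return lambda state: reduce(lambda acc, stage: stage(acc), stages, state)


# Hypothetical stand-ins for the LLM-backed stages.
def pre_label(state: State) -> State:
    return {**state, "labels": ["placeholder-label"]}


def clean_up(state: State) -> State:
    return {**state, "transcription": str(state["transcription"]).strip()}


def summarize(state: State) -> State:
    return {**state, "summary": f"summary of: {state['transcription']}"}


# Flow 1: pre-label -> clean up -> summary
serial_flow = pipe(pre_label, clean_up, summarize)

# Flow 2: two branches store their cleaned output, then a third step retrieves both.
store: List[State] = []


def save(state: State) -> State:
    store.append(state)
    return state


branch = pipe(pre_label, clean_up, save)


def retrieve_and_summarize(items: List[State]) -> State:
    merged = " ".join(str(item["transcription"]) for item in items)
    return pipe(pre_label, summarize)({"transcription": merged})


result_1 = serial_flow({"transcription": "  first ramble  "})
branch({"transcription": " second ramble "})   # 2.1
branch({"transcription": " third ramble "})    # 2.2
result_2 = retrieve_and_summarize(store)       # 2.3
print(result_1["summary"])
print(result_2["summary"])
```

Each stage takes a dict and returns a new dict, which is the same convention the LCEL chains below follow when they pass `transcription` and `labels` along.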

REMEMBER THIS IS AN ART PROJECT THAT NEEDS TO BE FUN AND INTERESTING AND REQUIRE MINIMAL DEVELOPMENT EFFORT.



## Imports

In [2]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    MessagesPlaceholder,
    HumanMessagePromptTemplate,
    PromptTemplate
)

from langchain.output_parsers import (
    StructuredOutputParser,
    CommaSeparatedListOutputParser,
    PydanticOutputParser,
)
from langchain_core.runnables import RunnableLambda
from pydantic import BaseModel, Field
from typing import List


## Initial Label

In [21]:
role = """ 
    You receive transcriptions of audio messages. Your job is to label the subject
    of the message. You tag the message with multiple labels. You are both general and 
    specific. Your tags generally belong to a category, theme, or topic. 

    Messages could be about anything. Messages could be about multiple topics. In the 
    case of multiple topics, tag with labels that fit the main topic. Ensure that a 
    general theme is tagged. All labels should have a general association with each other.
    
    It is likely that there are errors in the transcription, don't let that distract you.

    -- EXAMPLE 1 --
    Message: So I had this idea about a new app. The app will be a social media app that tracks
    users across the platform to determine their interests. Profile information will be used to
    present the user hyper-targeted ads.

    Output: technology, social media, advertising, computer science, software development,
    artificial intelligence, machine learning, data science, data analytics, data engineering

    -- EXAMPLE 2 --
    Message: I have a new idea for a business. I want to start a new business that sells a sweet
    and tangy beverage. The beverage will be made from a fruit that is grown in the tropics called
    lemon. Before starting the business I will need to create a financial model and marketing plan.

    Output: business, entrepreneurship, finance, marketing, economics, accounting, management

    FORMAT INSTRUCTIONS:
    You must follow these instructions for formatting output....

"""

output_parser = CommaSeparatedListOutputParser()

system_prompt = PromptTemplate(
    template=role + "\n{format_instructions}",
    input_variables=[],
    partial_variables={"format_instructions": output_parser.get_format_instructions()}
)

query = "{transcription}"

prompt = ChatPromptTemplate(messages=[
            SystemMessagePromptTemplate(prompt=system_prompt), 
            HumanMessagePromptTemplate.from_template(query)
        ])

model = ChatOpenAI(model="gpt-3.5-turbo")
# model = ChatOpenAI(model="gpt-4")

label_chain = prompt | model | output_parser
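
The last stage of `label_chain` turns the model's comma-separated reply into a Python list. As a rough illustration of that step only (an approximation, not the library's implementation), a hypothetical `parse_labels` helper might look like this:

```python
from typing import List


def parse_labels(raw_reply: str) -> List[str]:
    """Split a comma-separated reply into trimmed, non-empty labels."""
    return [part.strip() for part in raw_reply.split(",") if part.strip()]


# Hypothetical model reply with uneven spacing and a trailing comma.
reply = "technology, social media,  advertising ,"
parsed_labels = parse_labels(reply)
print(parsed_labels)
```

Stripping whitespace and dropping empty entries keeps a stray trailing comma from producing an empty label.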



## Cleanup

In [50]:
###### CLEAN UP ########
from operator import itemgetter
# Prompt Template
# ---------------
class CleanUp():
    def __init__(self):
        role = """ 
            You receive transcriptions of audio messages. It is likely that there are
            errors in the transcription. Your job is to correct these errors. 

            You will do this by identifying the errors and correcting them. There is no 
            need to retype the entire message. You should simply output text that can 
            be used to identify the error and the correction.

            You will be provided with labels that describe the content of the message generally.
            Some labels will be more applicable than others. These labels should help you
            identify the errors by providing context.

            Acronyms and abbreviations are commonly transcribed incorrectly. The labels should 
            be very helpful in identifying these errors. If a label is helpful and an acronym in 
            the raw transcription does not make sense, then use the label to correct the error.

            REMINDER: Use the minimum text needed to uniquely identify the error. Punctuation, grammar,
            spelling, capitalization, and acronyms are all fair game.

            -- EXAMPLE 1 --
            Labels: shopping, produce, fruit, grocery store, supermarket, food, shopping list
            Transcription: I went to the store early and bought fruit. Except I forgot to 
            buy orangutans.

            Error: Except I forgot to buy orangutans.

            Correction: Except I forgot to buy oranges.

            -- EXAMPLE 2 --
            Labels: technology, social media, advertising, computer science, software development
            Transcription: So I had this idea about a new app. The app will be a social media app
            that tracks users across the platform to determine their interests. The app will utilize
            a new type of AI model called LOLms that can be used to interpret users posts. I will use
            one from a company called OpenAI called GBTT-4.

            Error: AI model called LOLms

            Correction: AI model called LLMs

            Error: OpenAI called GBTT-4

            Correction: OpenAI called GPT-4

            FOLLOW OUTPUT INSTRUCTIONS CAREFULLY

        """

        # Define a custom Pydantic model for error correction pairs
        class ErrorCorrectionPair(BaseModel):
            error: str = Field(description="The incorrect part of the transcription")
            correction: str = Field(description="The corrected part of the transcription")

        class ErrorCorrectionContainer(BaseModel):
            error_correction_pairs: List[ErrorCorrectionPair] = Field(description="A list of error correction pairs")

        # Define a PydanticOutputParser with the custom Pydantic model
        correction_parser = PydanticOutputParser(pydantic_object=ErrorCorrectionContainer)

        query = "Labels: {labels}\n Transcription: {transcription}"

        system_prompt = PromptTemplate(
            template=role + "\n{format_instructions}",
            input_variables=[],
            partial_variables={"format_instructions": correction_parser.get_format_instructions()}
        )

        prompt = ChatPromptTemplate(messages=[
                    SystemMessagePromptTemplate(prompt=system_prompt), 
                    HumanMessagePromptTemplate.from_template(query)
                ])


        model = ChatOpenAI(model="gpt-3.5-turbo")
        # model = ChatOpenAI(model="gpt-4")

        # Pass transcription and labels through alongside the parsed corrections,
        # then apply the corrections to the transcription
        self.chain = (
            {
            'transcription' : itemgetter('transcription'),
            'labels' : itemgetter('labels'),
            'error_correction_pairs' : prompt | model | correction_parser
            }
            | RunnableLambda(self.apply_corrections)
        )

    # Define the apply_corrections method inside the CleanUp class
    @staticmethod
    def apply_corrections(input_dict):
        # from pprint import pprint
        # pprint(input_dict)
        transcription = input_dict['transcription']
        error_correction_pairs = input_dict['error_correction_pairs'].error_correction_pairs
        for pair in error_correction_pairs:
            transcription = transcription.replace(pair.error, pair.correction)

        # Return a copy so the caller's dict is not mutated
        output_dict = dict(input_dict)
        output_dict['transcription'] = transcription
        return output_dict
        
    def invoke(self, input_dict):
        return self.chain.invoke(input_dict)
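
The correction step relies on exact substring replacement, so a pair only takes effect when its `error` text appears verbatim in the transcription. A minimal standalone sketch of that behavior, using a hypothetical `Pair` dataclass and `apply_pairs` function in place of the Pydantic model and the class method:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pair:
    error: str
    correction: str


def apply_pairs(transcription: str, pairs: List[Pair]) -> str:
    """Apply each error/correction pair with plain substring replacement."""
    for pair in pairs:
        transcription = transcription.replace(pair.error, pair.correction)
    return transcription


text = "I will use one from OpenAI called GBTT-4. It uses LOLms."
pairs = [
    Pair("OpenAI called GBTT-4", "OpenAI called GPT-4"),
    # This error text does not occur verbatim in `text`, so it is a no-op.
    Pair("AI model called LOLms", "AI model called LLMs"),
]
fixed = apply_pairs(text, pairs)
print(fixed)
```

This is why the prompt asks for just enough text to uniquely identify each error: if the model paraphrases the error snippet, the replacement silently does nothing.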



## Summary

In [55]:

# Prompt Template
# ---------------
role = """ 
    You have mastery of the English language. You specialize in helping writers and speakers better
    communicate their ideas. You are a professional editor. You are a master of grammar and spelling.
    
    You receive transcriptions of audio messages. You will be provided with labels that describe the
    audio message. These labels should make sense in the context of the message. 
    
    Your job is to summarize the transcription. You should summarize the message in the context of one
    or more of the labels. The summary should contain all important details of the original message.
    However, the summary should be concise. Please keep the summary the same length as or shorter
    than the original message.

    Keep in mind that the transcription may not be concise or may struggle to properly articulate an idea. This
    is the crux of your task. Interpret the message's intent without introducing new ideas. Interpret and 
    consolidate.

"""
query = "Labels: {labels}\n Transcription: {transcription}"

prompt = ChatPromptTemplate(messages=[
            SystemMessagePromptTemplate.from_template(role),
            HumanMessagePromptTemplate.from_template(query)
        ])

# model = ChatOpenAI(model="gpt-3.5-turbo")
model = ChatOpenAI(model="gpt-4")

final_chain = prompt | model



## -- Execute Chain --    

In [63]:
import yaml

# Load the thoughts from a YAML file (safe_load only builds plain Python types)
with open('thought.yaml') as file:
    thought_file = yaml.safe_load(file)


In [7]:
transcription = """
OK, so I had this idea where I can take audio to text and store inside a AI brain that will process the text and turn it into more concise summarizations. The idea really stems out of the inability to concisely accurately and effectively describe plants and intention when recording audio off the cuff. For example, this sort of ramble here is a attempted articulation at a more sophisticated idea. it can potentially generally get a point across but will never contain all of the necessary items to complete set project or task contradictory statements. Any who the idea here is to have the LLMBA chain of multiple LM models with specific prompts for handling with The large rambles that come in in firstly, raw audio to text then it will be the LOM, responsibilities to first correct most basic errors, then perform summaries, and then extract useful information. It will also be the LLM or the AI discretion to extract certain concepts. This is going to be a majority automated process, wish more of potentially an art project than an affective tool however, if properly interesting useful results could come out. 
"""

# print(label_chain.invoke({"transcription": transcription}).content)
labels = label_chain.invoke({"transcription": transcription})
print(f'Labels: {labels}')



Labels: ['technology', 'artificial intelligence', 'machine learning', 'data processing', 'language processing', 'automation', 'summarization', 'information extraction', 'audio-to-text conversion', 'natural language processing']


In [65]:

# Invoke the chain with the labels and transcription
c = CleanUp()
# c.invoke({"labels": labels, "transcription": transcription})

# big_chain = label_chain | c.chain
big_chain = (
    {
        "transcription": itemgetter('transcription'), 
        "labels": label_chain
    }
    | c.chain
    | final_chain

)

print(thought_file)

output_file = thought_file.copy()
for key, value in thought_file.items():
    if 'thought' in key.lower():
        summary = big_chain.invoke({"transcription": value}).content

        output_key = key.replace('thought', 'summary')
        output_file[output_key] = summary

        # write to yaml file
        with open('thought.yaml', 'w') as file:
            yaml.dump(output_file, file)

{'thought_0': 'OK, so I had this idea where I can take audio to text and store inside a AI brain that will process the text and turn it into more concise summarizations. The idea really stems out of the inability to concisely accurately and effectively describe plants and intention when recording audio off the cuff. For example, this sort of ramble here is a attempted articulation at a more sophisticated idea. it can potentially generally get a point across but will never contain all of the necessary items to complete set project or task contradictory statements. Any who the idea here is to have the LLMBA chain of multiple LM models with specific prompts for handling with The large rambles that come in in firstly, raw audio to text then it will be the LOM, responsibilities to first correct most basic errors, then perform summaries, and then extract useful information. It will also be the LLM or the AI discretion to extract certain concepts. This is going to be a majority automated proc