<a href="https://colab.research.google.com/github/Zeerroth/transcript-transformer/blob/main/transcription-transformerV3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install openai==0.27.0 gradio

Collecting openai==0.27.0
  Downloading openai-0.27.0-py3-none-any.whl.metadata (13 kB)
Collecting gradio
  Downloading gradio-5.9.1-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.5.2 (from gradio)
  Downloading gradio_client-1.5.2-py3-none-any.whl.metadata (7.1 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.2.2 (from gradio

In [None]:
import os
import openai
import gradio as gr
from getpass import getpass

In [21]:
# Securely input your OpenAI API key
openai.api_key = getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


In [22]:
def validate_file_size_for_single_chunk(file_content, max_input_tokens=119000):
    """
    Validates the input file content and ensures it does not exceed the size limit for single-chunk processing.

    Note: the size is estimated by counting whitespace-separated words, which only approximates
    the model's token count (real token counts are typically higher than word counts).
    """
    if not file_content:
        raise ValueError("File content is empty. Please upload a valid file.")

    # Decode the uploaded bytes and estimate the size by word count
    text = file_content.decode("utf-8")
    token_count = len(text.split())  # word count, used as a rough proxy for tokens

    if token_count > max_input_tokens:
        raise ValueError(f"Input exceeds the maximum token limit of {max_input_tokens}. Please upload a smaller file.")

    print(f"File validated with {token_count} tokens.")
    return text
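
The word count used above is only a proxy for the model's token count. As a minimal sketch, the ratio implied by the prompt in this notebook (8,000 words to roughly 11,000 tokens, about 1.375 tokens per word) can turn a word count into a rough token estimate. The helper name `estimate_tokens` and the `tokens_per_word` parameter are hypothetical, and the ratio is an assumption that varies with the tokenizer and the text.

```python
import math

def estimate_tokens(text, tokens_per_word=1.375):
    """Rough token estimate from a whitespace word count (assumed ratio, not a real tokenizer)."""
    word_count = len(text.split())
    return math.ceil(word_count * tokens_per_word)

sample = "one two three four five six seven eight"
print(estimate_tokens(sample))  # 8 words * 1.375 = 11
```

For an exact count, a real tokenizer for the target model would be needed; this estimate is only meant to make the limit check less optimistic.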

In [23]:
def truncate_context(context, max_tokens=100000):
    """
    Truncate the context while retaining the most relevant input and outputs.
    Prioritizes the system prompt, the current chunk, and the latest response summaries.

    Note: size is measured in characters of message content, a rough stand-in for tokens.
    """
    total_tokens = sum(len(msg["content"]) for msg in context)
    # Stop once only two messages remain; otherwise the loop would never terminate
    while total_tokens > max_tokens and len(context) > 2:
        # Remove the oldest message after the first one, keeping the prompt and the latest context
        context.pop(1)
        total_tokens = sum(len(msg["content"]) for msg in context)
    return context
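
To illustrate the truncation order described in the docstring, the following self-contained sketch trims a small message list against a character budget. The function name `trim_messages` and the sample messages are hypothetical. It works on a copy and stops once only two messages remain, which is one possible design choice rather than the notebook's exact behavior.

```python
def trim_messages(messages, max_chars):
    """Drop the oldest message after the first until the total content length fits max_chars."""
    trimmed = list(messages)  # work on a copy so the caller's list is untouched
    while sum(len(m["content"]) for m in trimmed) > max_chars and len(trimmed) > 2:
        trimmed.pop(1)  # index 0 (the prompt) and the newest message are always kept
    return trimmed

history = [
    {"role": "user", "content": "PROMPT"},        # 6 characters
    {"role": "user", "content": "old chunk"},     # 9 characters
    {"role": "assistant", "content": "summary"},  # 7 characters
    {"role": "user", "content": "new chunk"},     # 9 characters
]
result = trim_messages(history, max_chars=25)
print([m["content"] for m in result])
```

Here the total is 31 characters, so the oldest chunk is dropped and the remaining 22 characters fit the budget.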

In [103]:
def transform_single_chunk(text):
    """
    Processes the entire text as a single chunk with an improved system prompt to minimize the need for refinement.
    """
    SYSTEM_PROMPT = """
Transform the following transcript into a detailed and comprehensive teaching transcript suitable for a university-level lecture. Your output must:

**SPEND AS MUCH TIME AS YOU NEED THINKING ABOUT HOW TO INCREASE THE LENGTH OF THE RESPONSE AND ENSURE THE RESPONSE CONTAINS AT LEAST 8000 WORDS**
**ANSWER WITH MAXIMUM VERBOSITY**

1. **Length and Depth**:
   - Contain a minimum of 8,000 words (~11,000 tokens) and a maximum of 12,000 words (~16,000 tokens).
   - Explore each topic in-depth with detailed explanations, real-world examples, case studies, and actionable insights in each section.
   - Include additional subtopics and tangentially related concepts to enrich the lecture.

2. **Structure**:
   - Divide the content into clearly defined sections and subsections, each with appropriate headings.
   - End each section with a summary of key points, actionable steps, and reflective questions.
   - Add multiple examples at the end of each section to enhance understanding.

3. **Expansion Beyond the Transcript**:
   - Use the transcript content as a foundation and incorporate it throughout the material.
   - Introduce new perspectives, including historical context, interdisciplinary connections, and current trends in each section.
   - Provide hypothetical future scenarios, thought experiments, and potential applications of the concepts discussed in each section.

4. **Audience Engagement**:
   - Pose rhetorical questions and include practical applications to connect theory with practice in each section.
   - Suggest activities or exercises that reinforce the key concepts in each section.
   - Design engaging discussions, such as debates on controversial aspects of the topics in each section.

5. **New and Enriched Content**:
   - Expand sections with examples, analogies, and hypothetical scenarios to illustrate concepts effectively in each section.
   - Add perspectives on challenges, criticisms, or alternative approaches to the topics discussed in each section.
   - Include industry-specific case studies and compare diverse implementations in each section.

6. **Professional Tone and Logical Flow**:
   - Maintain a logical progression of ideas, ensuring smooth transitions between sections.

The lecture should be self-contained, go beyond the input transcript, and provide a rich and engaging learning experience with adequate depth and breadth to fulfill the word count requirement.
**SPEND AS MUCH TIME AS YOU NEED THINKING ABOUT HOW TO INCREASE THE LENGTH OF THE RESPONSE AND ENSURE THE RESPONSE CONTAINS AT LEAST 8000 WORDS**
"""


    context = [{"role": "user", "content": SYSTEM_PROMPT}, {"role": "user", "content": text}]


    try:
        # Call the OpenAI API to generate the initial output
        response = openai.ChatCompletion.create(
            model="o1-mini",
            messages=context,
            temperature=1.0,
        )

        # Extract the response content
        transformed_transcript = response["choices"][0]["message"]["content"]
        word_count = len(transformed_transcript.split())

        print(f"Initial output word count: {word_count}")
        return transformed_transcript

    except Exception as e:
        return f"Error processing single chunk: {e}"

In [104]:
def process_transcript_single_chunk(file_content):
    """
    Processes the transcript as a single chunk for better continuity and relevance.
    The 8,000-word minimum is requested through the prompt; the output length is not verified here.
    """
    try:
        # Step 1: Validate file size
        if not file_content:
            raise ValueError("No file content provided. Please upload a valid file.")

        text = validate_file_size_for_single_chunk(file_content, max_input_tokens=119000)

        # Step 2: Transform the text as a single chunk
        transformed_transcript = transform_single_chunk(text)

        # Step 3: Return the transformed transcript
        return transformed_transcript

    except ValueError as e:
        return str(e)  # Return error message if the file size is invalid
    except Exception as e:
        return f"Error processing transcript: {e}"
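
The prompt asks for 8,000 to 12,000 words, but nothing in the pipeline checks the result. The following is a minimal sketch of such a check, using the same whitespace word count as `transform_single_chunk`. The function name `check_word_count` and its return shape are hypothetical.

```python
def check_word_count(transcript, min_words=8000, max_words=12000):
    """Return (word_count, within_range) using a whitespace split, as the notebook does."""
    word_count = len(transcript.split())
    return word_count, min_words <= word_count <= max_words

# Synthetic stand-in for a generated transcript of 9,225 words
count, ok = check_word_count("word " * 9225)
print(count, ok)
```

A caller could use the boolean to decide whether to show a warning next to the generated transcript or to request another attempt.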

In [105]:
def gradio_single_chunk_interface(file):
    if not file:
        return "No file uploaded. Please upload a valid transcript file."
    return process_transcript_single_chunk(file)


In [None]:
iface = gr.Interface(
    fn=gradio_single_chunk_interface,
    inputs=gr.File(label="Upload Transcript (.txt)", type="binary"),
    outputs=gr.Textbox(label="Generated Teaching Transcript"),
    title="Generative Transcript Transformer (Single Chunk)",
    description="Upload an unstructured transcript to transform it into a structured teaching transcript suitable for a lecture.",
)

iface.launch(debug=True)

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://68deaacf6bd9787f04.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


File validated with 6398 tokens.
Initial output word count: 9225
