# Connecting Ollama using LangChain & cleaning a transcript

This Jupyter notebook demonstrates how to use Ollama, a tool for running large language models locally, together with LangChain to clean and process transcription data. It assumes that the Ollama server is already running in the background (i.e. `ollama serve`) and that the `llama3.1` model is available to it.

In [None]:
%%capture
!pip install langchain
!pip install langchain_community

In [None]:
import io
import pandas as pd
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate  # Importing from the `langchain` root is deprecated

# Stop generating at Llama 3's end-of-turn token
llm = Ollama(model="llama3.1", stop=["<|eot_id|>"])

def get_model_response(user_prompt, system_prompt):
    # NOTE: This is a plain string, not an f-string. PromptTemplate fills the
    # placeholders, so do not put whitespace inside the curly braces.
    template = """
        <|begin_of_text|>
        <|start_header_id|>system<|end_header_id|>
        {system_prompt}
        <|eot_id|>
        <|start_header_id|>user<|end_header_id|>
        {user_prompt}
        <|eot_id|>
        <|start_header_id|>assistant<|end_header_id|>
        """

    # Fill the placeholders with a LangChain prompt template
    prompt = PromptTemplate(
        input_variables=["system_prompt", "user_prompt"],
        template=template
    )

    # Send the formatted prompt to the model and return its text completion
    response = llm.invoke(prompt.format(system_prompt=system_prompt, user_prompt=user_prompt))

    return response
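
Because the template above is an indented triple-quoted string, the indentation of the source code is sent to the model as part of the prompt. The sketch below shows one way to inspect and tidy the rendered prompt. It mirrors the layout used above, `str.format` stands in for `PromptTemplate.format`, and the names `LLAMA3_TEMPLATE` and `render_prompt` are hypothetical.

```python
import textwrap

# Same layout as the template above, dedented so that source-code
# indentation does not end up in the prompt.
LLAMA3_TEMPLATE = textwrap.dedent("""\
    <|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    {system_prompt}
    <|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    {user_prompt}
    <|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    """)


def render_prompt(user_prompt, system_prompt):
    # str.format is used here only to show the rendered text;
    # the notebook itself uses PromptTemplate.format.
    return LLAMA3_TEMPLATE.format(system_prompt=system_prompt, user_prompt=user_prompt)


rendered = render_prompt("Clean this transcript.", "You are a helpful assistant.")
print(rendered)
```

Printing the rendered prompt before calling the model is a cheap way to confirm that both placeholders were filled and that no stray whitespace was introduced.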

In [None]:
def dataframe_to_csv_string(df):
    # Convert DataFrame to CSV string
    csv_buffer = io.StringIO()
    df.to_csv(csv_buffer, index=False)
    csv_string = csv_buffer.getvalue()
    return csv_string


def get_cleanup_prompt(csv_content):
    prompt = f"""
    You are an expert in processing and cleaning transcription data. Your task is to analyze and improve the following CSV content from a Whisper transcription, which contains columns: text, start, stop.

    Here is the CSV content:
    {csv_content}

    Please perform the following tasks:
    1. Clean up the text:
       - Correct any obvious spelling or grammatical errors in both Thai and English.
       - Remove any unnecessary symbols or characters.
       - As a general rule, spell technical terms in English.
       - Ensure proper capitalization and punctuation.

    2. Cluster the text:
       - Group consecutive sentences that belong together semantically in Thai.
       - For Thai text, consider the following:
         * Thai doesn't use spaces between words, so focus on complete ideas rather than word count.
         * Look for natural breaks in speech, such as pauses or topic changes.
       - Aim for shorter clusters:
         * Target 2-5 seconds of speech per cluster.
         * Limit each cluster to a maximum of 50-70 Thai characters.
         * If a single sentence or idea exceeds this limit, consider breaking it at a logical point.
         * The repetition mark ๆ must not appear at the beginning of a sentence.
       - For each cluster, use the 'start' time of its first item and the 'stop' time of its last item.

    3. Handle bilingual content:
       - If a sentence switches between Thai and English, keep them together in the same cluster.
       - Ensure that Thai and English text are properly separated and not mixed within words.

    4. Timing adjustments:
       - Round 'start' and 'stop' times to the nearest tenth of a second.
       - Ensure there are no overlaps or gaps in timing between entries.

    5. Format the output:
       - Maintain the CSV format with columns: text, start, stop.
       - Ensure all text is properly enclosed in quotes if necessary.

    Please provide the cleaned and clustered CSV content, maintaining the original column structure. Do not include any explanations or comments in the output, just the processed CSV data.
    """
    return prompt


def process_transcription(csv_content):
    system_prompt = "You are a helpful assistant designed to process and clean up transcription data."
    user_prompt = get_cleanup_prompt(csv_content)
    response = get_model_response(user_prompt, system_prompt)
    return response
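
The timing rules in the prompt (times rounded to a tenth of a second, no overlaps or gaps) are only instructions to the model, so its output may not follow them. A minimal sketch of a deterministic check is shown below. The function name `check_timing` is hypothetical, and it assumes the response is valid CSV with `text`, `start`, and `stop` columns; a malformed response would raise an exception instead.

```python
import csv
import io


def check_timing(csv_text):
    """Return a list of human-readable problems found in a text,start,stop CSV."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    previous_stop = None
    for i, row in enumerate(rows):
        start, stop = float(row["start"]), float(row["stop"])
        # Times should be rounded to the nearest tenth of a second
        for name, value in (("start", start), ("stop", stop)):
            if abs(value - round(value, 1)) > 1e-9:
                problems.append(f"row {i}: {name}={value} is not rounded to 0.1 s")
        if stop < start:
            problems.append(f"row {i}: stop={stop} is before start={start}")
        # Consecutive entries should neither overlap nor leave a gap
        if previous_stop is not None and abs(start - previous_stop) > 1e-9:
            kind = "overlap" if start < previous_stop else "gap"
            problems.append(f"row {i}: {kind} between stop={previous_stop} and start={start}")
        previous_stop = stop
    return problems
```

An empty list means the response satisfies the timing rules; anything else can be logged or used to decide whether to retry the request.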

In [None]:
# Read the CSV into pandas, clean it with the model, and parse the cleaned transcript.
# Long transcripts may not fit in a single prompt, so you may need to chunk them.
transcripts = pd.read_csv("content.csv")
csv_content = dataframe_to_csv_string(transcripts)
cleaned_csv = process_transcription(csv_content)
transcripts_cleaned = pd.read_csv(io.StringIO(cleaned_csv))
transcripts_cleaned.iloc[:50]
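
One possible way to chunk a long transcript is sketched below: split the CSV string into smaller CSV strings by a character budget, repeating the header row in each chunk. The function name `chunk_csv` and the `max_chars` default are assumptions for illustration, not tuned values, and a character count is only a rough stand-in for the model's token limit.

```python
import csv
import io


def chunk_csv(csv_text, max_chars=2000):
    """Split CSV text into smaller CSV strings, each repeating the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    chunks, current, size = [], [], 0
    for row in reader:
        row_size = sum(len(cell) for cell in row)
        # Start a new chunk once the character budget would be exceeded
        if current and size + row_size > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(row)
        size += row_size
    if current:
        chunks.append(current)

    def to_csv(rows):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        return buffer.getvalue()

    return [to_csv(rows) for rows in chunks]


sample = "text,start,stop\n" + "".join(f"line {i},{i}.0,{i + 1}.0\n" for i in range(5))
for chunk in chunk_csv(sample, max_chars=30):
    print(chunk)
```

Each chunk could then be passed to `process_transcription` on its own and the parsed results combined with `pd.concat`. Note that clusters cannot span a chunk boundary with this approach, so sentences split across two chunks will stay split.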

### TODO: Write a prompt to summarize the transcribed text

- Reference: https://www.reddit.com/r/ChatGPT/comments/11twe7z/prompt_to_summarize/
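
As a starting point for this TODO, a summarization prompt could follow the same pattern as `get_cleanup_prompt`. The sketch below is only an assumption about what such a prompt might look like; the function name `get_summary_prompt`, the `max_bullets` parameter, and the wording of the instructions are all hypothetical and untested against the model.

```python
def get_summary_prompt(transcript_text, max_bullets=5):
    """Build a user prompt asking the model to summarize a cleaned transcript."""
    return (
        "You are an expert at summarizing transcribed speech.\n"
        "Summarize the transcript below.\n\n"
        "Requirements:\n"
        f"- Write at most {max_bullets} bullet points.\n"
        "- Keep technical terms in English.\n"
        "- Do not add information that is not in the transcript.\n\n"
        "Transcript:\n"
        f"{transcript_text}\n"
    )


print(get_summary_prompt("hello world", max_bullets=3))
```

The transcript text could be built by joining the `text` column of `transcripts_cleaned`, and the resulting prompt passed to `get_model_response` together with a system prompt.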