# Sample Minutes Structure

## 1. Opening Details
This section sets the groundwork for the document, providing a snapshot of the meeting’s logistical details. It includes:

- **Meeting Title:** The official name or subject of the meeting.
- **Date and Time:** When the meeting took place.
- **Location:** Where the meeting was held, including virtual meeting links if applicable.
- **Attendees:** A list of individuals present at the meeting, including their titles or roles. Distinguish between members, guests, and absentees.
- **Chairperson:** The individual presiding over the meeting.
- **Secretary:** The person responsible for taking minutes.

## 2. Agenda Items
Following the opening details, this section outlines all the items discussed during the meeting. Each agenda item should be presented as a separate sub-section that includes:

- **Item Title:** A brief title describing the agenda item.
- **Presenter:** The name of the person who presented the item.
- **Discussion Summary:** A concise summary of the discussion points, capturing the essence of what was talked about without delving into excessive detail.
- **Action Items:** Specific actions to be taken, who is responsible for them, and any deadlines.

## 3. Decisions Made
This critical section documents the decisions reached during the meeting. For each decision, include:

- **Decision Title:** A short, descriptive title of the decision.
- **Description:** A brief explanation of the decision made.
- **Responsible Party:** The person(s) or department(s) responsible for implementing the decision.
- **Deadline:** If applicable, the timeline for implementation or review.

## 4. Action Items
Building on the decisions made, this section lists all the action items identified during the meeting, including those already mentioned under agenda items, here described in more detail. Each action item should specify:

- **Action to be Taken:** A clear description of the task.
- **Assigned To:** The individual(s) responsible for completing the action.
- **Deadline:** The date by which the action should be completed.
- **Status:** Initial status (typically "Assigned" or "In Progress").

## 5. Closing Summary
The final section wraps up the meeting minutes, providing a succinct summary of the meeting's outcomes and highlighting any next steps. This section includes:

- **Meeting Adjournment:** The time the meeting concluded.
- **Next Meeting:** Date, time, and location of the next scheduled meeting, if known.
- **Closing Remarks:** Any final thoughts or comments from the chairperson, emphasizing the importance of the decisions made and the next steps.


In [1]:
!pip install openai python-dotenv faster-whisper openai-whisper



In [2]:
import threading
import time
import dotenv
from openai import AzureOpenAI

# Read the Azure OpenAI settings from the nearest .env file
env_path = dotenv.find_dotenv()
DEPLOYMENT = dotenv.get_key(env_path, "DEPLOYMENT")
ENDPOINT = dotenv.get_key(env_path, "ENDPOINT")
KEY = dotenv.get_key(env_path, "KEY")
VERSION = dotenv.get_key(env_path, "VERSION")
# The API key is passed explicitly below rather than read from AZURE_OPENAI_API_KEY
client = AzureOpenAI(
	# https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
	api_version=VERSION,
	# https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
	azure_endpoint=ENDPOINT,
	api_key=KEY

)

In [3]:
# System prompts for each agent

section_1 = \
	"""
	Instructions for Writing Section 1: Opening Details with Detailed Information
	
	1. **Document Title and Header**: Start your document by labeling it as "Meeting Minutes" followed by the specific name of the meeting. This title should be bold and centered at the top of the page. For example:
	- Meeting Minutes: Quarterly Budget Review Meeting
	
	
	2. **Meeting Title**:
	- Below the document title, write "Meeting Title:" and then specify the exact purpose of the meeting. Choose a title that reflects the main objective or the broad area of discussion. This helps in identifying the focus of the meeting at a glance.
	
	3. **Date and Time**:
	- Next line, label "Date and Time:". Here, you will provide when the meeting occurred. Use the format "Month Day, Year, from HH:MM AM/PM to HH:MM AM/PM." For virtual meetings across different time zones, specify the primary time zone and consider noting a few others for reference.
	- Example: "March 12, 2024, from 10:00 AM to 12:00 PM EST (GMT-5)"
	
	4. **Location**:
	- On the following line, write "Location:". Indicate the physical location with a full address if it's an in-person meeting. For virtual meetings, state the platform used (e.g., Zoom, Microsoft Teams) and include the link or meeting ID. It's helpful to mention whether it's a virtual or hybrid meeting.
	- Example for a physical meeting: "City Hall, Room 101, 123 Main St., Springfield"
	- Example for a virtual meeting: "Virtual - Zoom Meeting, ID: 123-456-789, Link: [Zoom Meeting Link]"
	
	5. **Attendees**:
	- Introduce a new section titled "Attendees". Divide this section into three parts: "Members Present," "Guests," and "Absentees."
	- For each person listed, include their full name, title, or role within the organization, and their department if applicable. This ensures clarity on who was involved and their capacity.
	- Example:
	  ```
	  Members Present: John Doe, Finance Director; Jane Smith, Budget Analyst
	  Guests: Alex Johnson, External Auditor
	  Absentees: Mike Ross, Assistant Finance Director
	  ```
	
	6. **Chairperson and Secretary**:
	- Conclude this section by identifying the Chairperson and the Secretary of the meeting. Write "Chairperson:" followed by the name and title of the individual who led the meeting. Then, write "Secretary:" followed by the name and title of the person responsible for recording the minutes.
	- These roles are crucial for accountability and reference, as the Chairperson guides the meeting's flow, and the Secretary ensures all discussions and decisions are accurately documented.
	
	7. **Formatting Tips**:
	- Use bullet points or a numbered list for the Attendees section to enhance readability.
	- Maintain consistency in font and formatting throughout the document. Use a clear, professional font like Times New Roman or Arial, size 12.
	- Ensure all names and titles are accurately spelled, and double-check the date, time, and location for correctness.
	
	By following these detailed instructions, your Opening Details section will provide a comprehensive and clear overview of the foundational aspects of the meeting. This meticulous approach ensures that anyone reading the minutes can immediately understand the essential logistics of the meeting, setting a professional tone for the document.
	
	"""

section_2 = \
	"""
	Instructions for Writing Section 2: Agenda Items with Detailed Information
	
	1. **Section Header**:
	   - Start this section with a header titled "Agenda Items." This indicates you're moving into the core content of the meeting, focusing on the topics discussed.
	
	2. **Itemizing Each Agenda Item**:
	   - List each agenda item as discussed during the meeting. Use a new sub-header for each item, which can be numbered or bulleted, depending on your preference. The title of each agenda item should be concise but descriptive enough to understand the topic at a glance.
	   - Example:
		 ```
		 1. Budget Overview for Q2
		 ```
	
	3. **Presenter**:
	   - For each agenda item, clearly identify the presenter or lead discussant. Write "Presenter:" followed by the individual's name and title. This assigns responsibility and provides a point of contact for follow-up questions.
	   - Example:
		 ```
		 Presenter: Jane Doe, Chief Financial Officer
		 ```
	
	4. **Discussion Summary**:
	   - Beneath the presenter, provide a summary of the discussion for that agenda item. Start with "Discussion Summary:" and then bullet or paragraph your summary. This should capture the key points discussed, significant viewpoints expressed, and any rationale behind decisions or opinions.
	   - Keep it concise but informative. Avoid unnecessary detail but ensure that someone who wasn't at the meeting can grasp what was discussed.
	   - Example:
		 ```
		 Discussion Summary:
		 - Reviewed the Q2 budget projections and identified potential overspending areas.
		 - Discussed reallocating funds from underutilized programs to areas of higher need.
		 ```
	
	5. **Action Items**:
	   - If any action items arise from the discussion, list these under the sub-heading "Action Items:". For each action, specify the task, who is responsible (assignee), and the deadline.
	   - Action items should be clear and actionable, with a specific outcome in mind. This clarity helps in follow-up and accountability.
	   - Example:
		 ```
		 Action Items:
		 - John Smith to prepare a detailed report on potential overspending areas by April 15, 2024.
		 - Finance Department to review and propose reallocation strategies by April 30, 2024.
		 ```
	
	6. **Repeat for Each Agenda Item**:
	   - Repeat steps 2 through 5 for each agenda item discussed during the meeting. Ensure each item is clearly separated and labeled for easy navigation through the document.
	
	7. **Formatting Tips**:
	   - Use headings and subheadings to organize the section and each agenda item. Consider using bold or italicized text to differentiate between titles, names, and the body text.
	   - Bullet points are useful for summarizing discussion points and listing action items, as they enhance readability.
	   - Maintain a consistent structure for each agenda item to help readers quickly find information.
	
	By following these detailed instructions, you’ll be able to accurately document the Agenda Items section of your meeting minutes. This section is vital for capturing the essence of what was discussed and ensuring that all participants and stakeholders are aware of the discussions and decisions.
	
	"""

section_3 = \
	"""
	Instructions for Writing Section 3: Decisions Made with Detailed Information
	
	1. **Section Header**:
	   - Begin this section with a bold header titled "Decisions Made" to clearly indicate that this part of the document will cover the conclusive outcomes of the discussions.
	
	2. **Listing Decisions**:
	   - For each decision made during the meeting, create a new sub-section. You can number these decisions for ease of reference. Each decision should have a brief, descriptive title that encapsulates the outcome.
	
	3. **Decision Title**:
	   - Start with "Decision Title:" followed by a succinct title that captures the essence of the decision. This helps readers quickly identify the decision’s subject matter.
	   - Example:
		 ```
		 Decision Title: Approval of Q2 Budget Reallocation
		 ```
	
	4. **Description of the Decision**:
	   - Under the title, provide a detailed description of the decision. Begin with "Description:" and elaborate on what was decided, including any specifics that give clarity to the decision's intent, scope, and impact.
	   - Be precise in detailing the decision to ensure there's no ambiguity about what was agreed upon.
	   - Example:
		 ```
		 Description: The committee unanimously approved the reallocation of funds from the marketing budget to the research and development budget for Q2, increasing R&D funding by 15%.
		 ```
	
	5. **Responsible Party**:
	   - Identify who is responsible for implementing the decision with "Responsible Party:". List the person or department tasked with carrying out the decision, providing clear accountability.
	   - Example:
		 ```
		 Responsible Party: John Doe, Director of Finance
		 ```
	
	6. **Deadline**:
	   - If applicable, specify a deadline for when the decision needs to be implemented or when a follow-up is required. Use "Deadline:" followed by the specific date or timeframe.
	   - This ensures that there is a clear timeline for action and review.
	   - Example:
		 ```
		 Deadline: Implementation by April 30, 2024
		 ```
	
	7. **Repeat for Each Decision**:
	   - Repeat steps 2 through 6 for each decision that was made during the meeting. Ensure that each decision is clearly delineated and detailed for easy understanding and reference.
	
	8. **Formatting Tips**:
	   - Use consistent formatting for each decision to help the reader navigate through the section easily. Consistent headers for decision title, description, responsible party, and deadline aid in readability.
	   - Keep the language clear and direct to avoid any confusion about what was decided.
	   - Bullet points or numbered lists can be effective for separating different aspects of the decision (e.g., description, responsible party, deadline).
	
	By meticulously following these instructions, you'll create a comprehensive and clear "Decisions Made" section. This part of the meeting minutes is crucial for documenting the outcomes of discussions and ensuring that all attendees and relevant stakeholders are aware of the actions to be taken.
	
	"""

section_4 = \
	"""
	Instructions for Writing Section 4: Action Items with Detailed Information
	
	1. **Section Header**:
	   - Start with a clear header titled "Action Items" to indicate this section will enumerate specific tasks to be completed as a result of the meeting's discussions and decisions.
	
	2. **Itemizing Action Items**:
	   - List each action item that was identified during the meeting. You can use a bullet list or a numbered list for clarity. Each action item should be concise yet descriptive enough to convey the task fully.
	
	3. **Action Description**:
	   - For each action item, begin with "Action:" followed by a detailed description of the task. This description should be clear and specific, outlining what needs to be done. Avoid vague language to ensure that the action can be executed without additional clarification.
	   - Example:
		 ```
		 Action: Prepare a detailed report comparing Q2 budget projections against actual spending, highlighting areas of concern and potential savings.
		 ```
	
	4. **Assigned To**:
	   - Specify who is responsible for completing the action item with "Assigned To:". This should be an individual's name or, if applicable, a department/team name. Assigning responsibility is crucial for accountability and follow-up.
	   - Example:
		 ```
		 Assigned To: Jane Doe, Budget Analyst
		 ```
	
	5. **Deadline**:
	   - Set a clear deadline for the action item with "Deadline:". Provide a specific date to ensure there is a timeframe for completion. Deadlines help in prioritizing tasks and monitoring progress.
	   - Example:
		 ```
		 Deadline: May 15, 2024
		 ```
	
	6. **Status (Optional)**:
	   - You may choose to include a "Status:" field to note the current progress of the action item at the time of writing the minutes. This can be particularly useful for ongoing tasks or for the next meeting’s follow-up.
	   - Example:
		 ```
		 Status: Assigned
		 ```
	
	7. **Repeat for Each Action Item**:
	   - Follow steps 2 through 6 for each action item that arises from the meeting’s agenda. Ensure that each action is clearly defined with a responsible party and a deadline for a structured follow-up process.
	
	8. **Formatting Tips**:
	   - Consistent use of bold or italics for key terms (e.g., Action, Assigned To, Deadline) helps distinguish the essential elements of each action item.
	   - Consider using tables or a structured layout to organize the action items, especially if there are many. This can improve readability and make the document easier to scan.
	   - Maintain a concise, action-oriented language to ensure each task is clearly understood and actionable.
	
	By adhering to these detailed instructions, you will create a comprehensive "Action Items" section in your meeting minutes. This part is pivotal for tracking the progress of tasks, ensuring accountability, and facilitating the execution of decisions made during the meeting.
	
	"""

section_5 = \
	"""
	Instructions for Writing Section 5: Closing Summary with Detailed Information
	
	1. **Section Header**:
	   - Begin with a bold header titled "Closing Summary" to clearly delineate this concluding section from the rest of the document.
	
	2. **Meeting Adjournment**:
	   - Start by noting the time the meeting officially ended with "Meeting Adjourned:". This helps to document the meeting's duration and marks the formal conclusion of the session.
	   - Example:
		 ```
		 Meeting Adjourned: 12:00 PM
		 ```
	
	3. **Summary of Decisions and Action Items**:
	   - Provide a brief overview of the key decisions made and the action items assigned during the meeting. This recap is crucial for reinforcing the outcomes and ensuring that all participants are aligned on the next steps.
	   - Keep this summary concise and focused on the outcomes that have a significant impact or require immediate attention.
	   - Example:
		 ```
		 Summary of Decisions and Action Items:
		 - Approved Q2 budget reallocation, increasing R&D funding by 15%.
		 - Assigned John Doe to prepare a report on budget projections vs. actual spending by May 15, 2024.
		 ```
	
	4. **Next Meeting**:
	   - If the date, time, and location of the next meeting have already been determined, include this information here. This helps in ensuring that participants can schedule accordingly and prepares them for the next session.
	   - Example:
		 ```
		 Next Meeting: June 10, 2024, at 10:00 AM - Virtual via Zoom
		 ```
	
	5. **Closing Remarks**:
	   - Conclude with any final remarks or comments from the chairperson. This might include a thank you to the participants, a brief reflection on the meeting's productivity, or encouragement towards the execution of the decided actions.
	   - Example:
		 ```
		 Closing Remarks: The chairperson thanked all participants for their constructive discussions and emphasized the importance of the agreed-upon actions in achieving the department's objectives. All members were encouraged to prioritize the completion of their assigned tasks before the next meeting.
		 ```
	
	6. **Formatting Tips**:
	   - Use clear, concise language to ensure the closing summary is easily digestible and emphasizes the key takeaways.
	   - Consider bullet points for the summary of decisions and action items to enhance readability.
	   - Maintain consistent formatting with the rest of the document, using similar fonts, headings, and layout styles.
	
	By meticulously following these instructions, you'll create an effective Closing Summary for your meeting minutes. This section not only provides a succinct recap of the meeting's outcomes but also sets the stage for ongoing collaboration and accountability among participants.
	
	"""


# Experiment Setup

## Audio Corpus
We will be using an English excerpt from the IMDA National Speech Corpus (NSC). The aim of this transcription, and subsequent minutes generation, is to test the latest `large-v3` Whisper model on the __Singlish__ Accent.

We will be testing it on sample ID `3030` from the `NSC` dataset. This sample contains a conversation between a Singaporean male and a Singaporean female. The recording includes Singlish particles and phrases, e.g. 'aiya', 'leh', 'lah', and we will be testing the model's ability to transcribe them accurately. NSC provides:
- Speaker 1's Audio
- Speaker 2's Audio
- Overall Audio

## Transcription Corpus
The NSC also provides a transcription of the audio corpus. We will be using this to compare the model's transcription accuracy. NSC provides:
- Speaker 1's Transcription
- Speaker 2's Transcription

This will greatly aid our efforts when comparing the efficacy of the text-level speaker diarization later on. Do note that the transcription is given in the TextGrid format, the time-aligned annotation format used by Praat, rather than as plain text.
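
To make "transcription accuracy" concrete, one common metric is the word error rate (WER). The notebook does not state which metric is used, so the sketch below is an assumption: `word_error_rate` is a hypothetical helper that compares a reference transcript against a model transcript using a word-level edit distance. Both inputs are assumed to be plain text already extracted from the TextGrid files.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A dropped Singlish particle counts as one deletion out of five reference words.
print(word_error_rate("I walk over there lah", "I walk over there"))  # 0.2
```

A metric like this penalises every dropped particle such as 'lah' or 'leh', which is exactly the behaviour we want to measure on this corpus.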

## Model
We will be using the latest `large-v3` Whisper model offered by OpenAI, running it on the CPU (my current system has no CUDA GPU), and analysing the transcription accuracy. If we deem that it is up to standard, we can proceed to __Speaker Diarization__. Otherwise, we may put plans in place to retrain the model on the Singlish accent. As mentioned above, we are using the IMDA NSC as our audio corpus, which is roughly 890 GB of data, so we will use a small subset of it for the initial testing.

## Future Plans
If the model is up to standard, we will proceed to __Speaker Diarization__ and __Minutes Generation__.

In [4]:
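import whisper
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
if device != "cpu":
	# Explicit initialisation raises on machines without CUDA, hence the guard.
	torch.cuda.init()

# Note: this cell loads the "medium" checkpoint (the model printed below).
model = whisper.load_model("medium")
model.to(device)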

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(1024, 1024, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-23): 24 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=1024, out_features=1024, bias=True)
          (key): Linear(in_features=1024, out_features=1024, bias=False)
          (value): Linear(in_features=1024, out_features=1024, bias=True)
          (out): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (attn_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (mlp_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((

## Faster-Whisper
We tried the `large-v3` model in `float16` with Faster-Whisper, but the transcription was both inconsistent and inaccurate, so we are reverting to vanilla Whisper (on CUDA).

In [5]:
# Load the model and the processor
# from faster_whisper import WhisperModel
# import torch
# torch.cuda.init()
# 
# model_size = "large-v3"

# Run on GPU with FP16
# model = WhisperModel(model_size, device="cuda", compute_type="float16")

# segments, info = model.transcribe("./content/atc_train.mp3", beam_size=5)
# 
# print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
# 
# for segment in segments:
#     print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

# Meeting-Specific Prompts and Phrases


In [6]:
general = ['Singaporean Singlish Government Business Meeting Transcription Recording']

singlish_phrases = [
	"ah", "lah", "aiya", "leh", "aiyo", "can or not", "on the ball", "makan session", "pow-wow",
	"kiasu", "bo jio", "sian", "shiok", "jialat", 'saigang',
	"talk cock", "wayang", "kena", "chop-chop", "steady",
	"own time own target (OTOT)", "kopi talk", "catch up", "brainstorm", "align",
	"lobang", "paiseh", "action", "agaration", "angkat bola",
	"bao ga liao", "buay pai", "cheem", "chio", "garang",
	"goondu", "kaypoh", "leh", "lor", "nia",
	"one corner", "open table", "pai seh", "relak one corner", "sabo",
	"sai kang", "shiok", "siam", "sikit-sikit", "suay",
	"tabao", "talk shop", "tan tio", "up lorry", "wa kau"
]

singlish_business_phrases = [
	"lah", "can or not?", "on the ball", "kiasu", "shiok",
	"talk cock", "steady pom pi pi", "own time own target", "bo jio", "catch no ball",
	"chiong", "chop chop", "die die must do", "eat snake", "gostan",
	"jialat", "kaypoh", "leh", "lor", "makan",
	"nabei", "paiseh", "sabo", "sian", "suay",
	"walao eh", "wayang", "win already lor", "yaya papaya", "zi high",
	"send it", "check back next week", "let’s touch base on this", "circle back on that", "park this for now",
	"align our ducks", "low key", "see how", "can make it", "noted with thanks",
	"bo bian", "anyhow", "confirm plus chop", "got chance", "mai tu liao",
	"double confirm", "one shot", "over already", "swee", "talk later"
]

# singlish_more_formal_phrases = [
# 	'appreciate your feedback', 'before we proceed', "let's take this offline", 'moving forward', 'on the same page',
# 	'per your suggestion', 'please advise', 'point of contact', 'scope of work', 'stakeholder engagement',
# 	'strategic priorities', 'target milestones', 'thank you for your patience', 'timeline for completion',
# 	'touch base next week',
# 	"we'll circle back on this", 'action items', 'align our strategies', 'benchmark for success',
# 	'best practices in the industry',
# 	'client satisfaction', 'competitive advantage', 'comprehensive review', 'cost-effective solutions',
# 	'cross-functional collaboration',
# 	'due diligence', 'enhance our capabilities', 'feedback loop', 'forward-thinking approach', 'holistic strategy',
# 	'implementation phase', 'key takeaways', 'leverage our strengths', 'maximize efficiency', 'ongoing support',
# 	'optimize performance', 'proactive measures', 'quality assurance', 'risk management strategies',
# 	'seamless integration',
# 	'stakeholder feedback', 'sustainable growth', 'tailored solutions', 'value proposition', 'win-win situation',
# 	'workflow optimization', 'zero in on the details', 'drive innovation', 'escalate this issue', 'monitor progress'
# ]
# common_words = [
#     'agenda', 'align', 'benchmark', 'best practice', 'bottom line',
#     'brainstorm', 'brand', 'budget', 'buy-in', 'capacity',
#     'capital', 'collaborate', 'competitive', 'compliance', 'deliverable',
#     'disruptive', 'diversify', 'efficiency', 'engagement', 'execution',
#     'forecast', 'growth', 'innovate', 'insight', 'investment',
#     'KPI (Key Performance Indicator)', 'leverage', 'metrics', 'milestone', 'networking',
#     'objective', 'optimize', 'outcome', 'outsourcing', 'performance',
#     'prioritize', 'profitability', 'project', 'ROI (Return on Investment)', 'scalability',
#     'stakeholder', 'strategy', 'synergy', 'target', 'timeline',
#     'traction', 'value', 'vision', 'workflow', 'yield'
# ]
# additional_business_words = [
#     "touch base", "heads up", "debrief", "downtime", "feedback", 
#     "game plan", "goal", "hangout", "initiative", "kickoff", 
#     "loop in", "milestones", "network", "nitty-gritty", "onboard", 
#     "ping", "pivot", "proactive", "ramp up", "reach out", 
#     "recap", "roadmap", "run-through", "scope", "sidebar", 
#     "silo", "sprint", "stakeholders", "stand-up", "startup", 
#     "strategy session", "streamline", "touchpoint", "track", "upskill", 
#     "value-add", "win-win", "workaround", "workshop", "zoom in", 
#     "bandwidth", "deep dive", "ecosystem", "empower", "granular",
#     "holistic", "ideate", "iterate", "low-hanging fruit", "paradigm shift"
# ]
collated_list = general + singlish_phrases + singlish_business_phrases #+ singlish_more_formal_phrases  # +common_words + additional_business_words

collated_list_string = ' '.join(collated_list)
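
The phrase lists above contain repeats (for example "leh" and "shiok" each appear twice in `singlish_phrases`, and several phrases appear in both lists). To our understanding Whisper only conditions on a bounded number of prompt tokens, so repeated phrases may crowd out others. A minimal sketch of de-duplicating before joining, where `dedupe_phrases` is a hypothetical helper and not part of the original notebook:

```python
def dedupe_phrases(phrases: list[str]) -> list[str]:
    """Drop repeated phrases (case-insensitive) while keeping first-seen order."""
    seen = set()
    unique = []
    for phrase in phrases:
        key = phrase.casefold().strip()
        if key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique


sample = ["lah", "leh", "shiok", "leh", "Lah"]
print(" ".join(dedupe_phrases(sample)))  # lah leh shiok
```

In the notebook this would be applied as `' '.join(dedupe_phrases(collated_list))`. Spelling variants such as "paiseh" and "pai seh" are deliberately kept as distinct entries.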

# Performing Transcription

In [7]:
import os
import re
import ast
from datetime import datetime

# Specify the directory path
transcriptions_dir = "./transcriptions"

# Create the directory if it does not exist yet
os.makedirs(transcriptions_dir, exist_ok=True)


def find_latest_transcription(directory):
	# Filenames are written below as transcription_YYMMDD.txt (strftime "%y%m%d")
	pattern = re.compile(r'transcription_(\d{2})(\d{2})(\d{2})\.txt')
	latest_file = None
	latest_date = None

	for filename in os.listdir(directory):
		match = pattern.match(filename)
		if match:
			# Extract year, month, day in the same order they are written
			year, month, day = match.groups()
			file_date = datetime.strptime(f'20{year}{month}{day}', '%Y%m%d')

			# Update the latest file based on date
			if not latest_date or file_date > latest_date:
				latest_date = file_date
				latest_file = filename

	# None when no file matched the pattern
	return latest_file


# Attempt to find the latest transcription file
latest_transcription_file = find_latest_transcription(transcriptions_dir)

if latest_transcription_file:
	# Full path for the latest file including the directory
	file_path = os.path.join(transcriptions_dir, latest_transcription_file)
	try:
		with open(file_path, "r") as stored_result:
			# If the file exists and is opened successfully, read the content
			temp_result = stored_result.read()
			result = ast.literal_eval(temp_result)
			# result = dict(result)
		print('\033[92mTranscription located:\033[0m')
		print(result['text'])

	except FileNotFoundError:
		print("File not found, although it was expected to exist.")
else:
	# No transcription file matching the pattern was found
	print('\033[91mNo matching transcription files found.\033[0m')
	# If the file does not exist, execute the transcription process and create the file
	# transcribe() expects the priming text as `initial_prompt`
	temp_result = str(model.transcribe("./content/singlish_accent/3030_trimmed/3030-combined_trimmed.mp3", verbose=True,
	                              language="en", initial_prompt=collated_list_string))
	file_path = os.path.join(transcriptions_dir, f'transcription_{datetime.now().strftime("%y%m%d")}.txt')
	with open(file_path, "w") as f:
		f.write(temp_result)
	result = ast.literal_eval(temp_result)
	print(f'\033[92mTranscription completed and saved at {file_path}.\033[0m')


No matching transcription files found.
[00:00.000 --> 00:06.000]  or the caribor, when I get you up especially when you look like a piggy
[00:06.000 --> 00:07.000]  Who said so?
[00:07.000 --> 00:09.000]  I said, oh very nice man
[00:09.000 --> 00:11.000]  Oh so do you drive there?
[00:11.000 --> 00:13.000]  No I walk over there
[00:13.000 --> 00:14.000]  Really? Huh?
[00:14.000 --> 00:15.000]  I walk over there
[00:15.000 --> 00:17.000]  How do you walk?
[00:17.000 --> 00:19.000]  I can walk no problem at all
[00:19.000 --> 00:21.000]  How many kilometer you walk?
[00:21.000 --> 00:23.000]  I can't remember
[00:23.000 --> 00:26.000]  Because I walk until my legs don't fit
[00:26.000 --> 00:28.000]  So will you go there?
[00:28.000 --> 00:30.000]  Will you go back again?
[00:30.000 --> 00:32.000]  I don't think so, one season down
[00:32.000 --> 00:33.000]  Why?
[00:33.000 --> 00:35.000]  Because it's too much
[00:35.000 --> 00:38.000]  So which is your next destination?
[00:3
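
The cell above caches the result by writing `str(result)` and reading it back with `ast.literal_eval`. A possible alternative is to store it as JSON. This is only a sketch under the assumption that the transcription result contains nothing but JSON-serialisable values (strings, numbers, lists, dictionaries); `save_transcription` and `load_transcription` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path


def save_transcription(result: dict, path) -> None:
    """Write the transcription result as UTF-8 JSON."""
    Path(path).write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")


def load_transcription(path) -> dict:
    """Read a transcription result previously written by save_transcription."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round trip with a small stand-in result shaped like the segments used later on.
demo = {
    "text": " Who said so? No I walk over there",
    "segments": [
        {"start": 6.0, "end": 7.0, "text": " Who said so?"},
        {"start": 11.0, "end": 13.0, "text": " No I walk over there"},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "transcription_demo.json"
    save_transcription(demo, demo_path)
    restored = load_transcription(demo_path)

print(restored == demo)
```

JSON files can also be opened by other tools, which the `str()` representation does not guarantee.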

# Transcription Cleanup and Diarization

In [13]:
# Collect one cleaned line of text per Whisper segment
per_line = []
for segment in result['segments']:
	# Each segment's text starts with a space; strip leading whitespace
	text_to_append = segment['text'].lstrip()
	per_line.append(text_to_append)

per_line

['or the caribor, when I get you up especially when you look like a piggy',
 'Who said so?',
 'I said, oh very nice man',
 'Oh so do you drive there?',
 'No I walk over there',
 'Really? Huh?',
 'I walk over there',
 'How do you walk?',
 'I can walk no problem at all',
 'How many kilometer you walk?',
 "I can't remember",
 "Because I walk until my legs don't fit",
 'So will you go there?',
 'Will you go back again?',
 "I don't think so, one season down",
 'Why?',
 "Because it's too much",
 'So which is your next destination?',
 'Maybe next one go to another part of Africa',
 'Oh',
 "Western Africa, no it's more challenging",
 'Oh is it?',
 "Because it's more fun than what you expect to be",
 "Then why don't you go to South Australia?",
 'South Australia',
 'Yeah, southern lights',
 'So Australia is nothing compared to Africa',
 'Because Africa is a very wild place to go',
 'Yeah I mean',
 "Until you go until you say, I don't want to go again",
 'Because Australia you go you want to go 

## Naive Diarization

For this, we pass the whole transcript to the model in a single request. The risk is that a long transcript, together with its reply, exceeds the model's context window or the `max_tokens` limit.

In [14]:
diarization = client.chat.completions.create(
	model=DEPLOYMENT,
	messages=[
		{"role": "system",
		 "content": "You are a linguistics expert with 50 years of experience. You will be given a list of sentences, and you are to assign the Speaker label to each sentence PER line. E.g. given ['It wasn't my fault', 'I didn't say it was', 'Don't accuse me'], you will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']"},
		{"role": "user",
		 "content": f"Here is the list of sentences: {per_line}. You will diarize ALL the words/phrases in the list. JUST RETURN ME THE LIST. You WILL ensure that you labeled EVERY line. You will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']"}
	],
	max_tokens=2000,
	stream=False,
	temperature=0.5,
)
end_result = diarization.choices[0].message.content
end_result

"['Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker... 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2']"

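As the output shows, the result is broken and unreliable: the reply degenerates into a long run of `Speaker 2` labels that cannot be trusted to line up with the transcript. We therefore need a more reliable diarization method.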
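One way to make this kind of failure visible in code is to parse the reply and compare it with the number of input lines. The following is a minimal sketch, not part of the original notebook: `diagnose_reply` and `LABEL_PATTERN` are hypothetical names, and the check assumes the model answers with a Python-style list literal of `Speaker N` strings, as the prompt requests.

```python
import ast
import re

# Assumed label format, taken from the example in the prompt ("Speaker 1", ...)
LABEL_PATTERN = re.compile(r"Speaker \d+")


def diagnose_reply(raw_reply: str, expected_count: int) -> list[str]:
    """Return a list of problems found in a diarization reply (empty if none)."""
    try:
        labels = ast.literal_eval(raw_reply.strip())
    except (SyntaxError, ValueError):
        # Truncated or free-text replies end up here
        return ["reply is not a parsable list literal"]
    if not isinstance(labels, list):
        return ["reply is not a list"]

    problems = []
    if len(labels) != expected_count:
        problems.append(f"expected {expected_count} labels, got {len(labels)}")
    unexpected = [
        label for label in labels
        if not (isinstance(label, str) and LABEL_PATTERN.fullmatch(label))
    ]
    if unexpected:
        problems.append(f"unexpected labels: {unexpected}")
    return problems


sentences = ["It wasn't my fault", "I didn't say it was", "Don't accuse me"]
print(diagnose_reply("['Speaker 1', 'Speaker 2', 'Speaker 2']", len(sentences)))
print(diagnose_reply("['Speaker 1', 'Speaker 2']", len(sentences)))
```

An empty list means the reply has one well-formed label per input line. It says nothing about whether the labels are correct.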

We will proceed to a Chunk+Stride diarization method.

## Chunk+Stride Diarization

For the Chunk+Stride Diarization method, we split the list into `x` chunks and extend each chunk with a stride of `a` lines taken from the chunks immediately before and after it. The overlap gives the model the surrounding context of the conversation, which should produce a more accurate diarization.

\begin{equation}
N = \text{length of list}, \quad c = \text{chunk size}, \quad a = \text{stride length}
\end{equation}
\begin{equation}
x = \left\lceil \frac{N}{c} \right\rceil, \, \text{number of chunks}
\end{equation}

Initial chunks definition, where chunk `i` spans indices `S_i` to `E_i` inclusive:

\begin{equation}
S_i = i \cdot c, \, E_i = \min((i+1) \cdot c - 1, N-1), \, 0 \leq i < x
\end{equation}

Stride modifications (elements keep their original order):

- For `i = 1` to `x-2`: prepend the last `a` elements of `chunk_{i-1}` to `chunk_i`, and append the first `a` elements of `chunk_{i+1}` to `chunk_i`.
- For `i = 0`: the chunk remains unchanged.
- For `i = x-1`: only the prepend step applies, because there is no following chunk.
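The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the notebook's implementation: `chunk_with_stride` is a hypothetical name, it takes the chunk size `c` directly, and it assumes `0 <= stride <= chunk_size`. It follows the worked example given later in this notebook, where the first chunk is unchanged and the last chunk receives only a backstride.

```python
def chunk_with_stride(items: list, chunk_size: int, stride: int) -> list[list]:
    """Split `items` into chunks of `chunk_size` and add `stride` elements of
    context from the neighbouring chunks (hypothetical helper).

    Assumes 0 <= stride <= chunk_size.
    """
    n = len(items)
    num_chunks = -(-n // chunk_size)  # x = ceil(N / c)
    strided = []
    for i in range(num_chunks):
        start = i * chunk_size              # S_i
        end = min((i + 1) * chunk_size, n)  # E_i + 1 (exclusive slice end)
        if i > 0:
            # Backstride: last `stride` elements of the previous chunk
            start = max(0, start - stride)
        if 0 < i < num_chunks - 1:
            # Forward stride: first `stride` elements of the next chunk
            end = min(n, end + stride)
        strided.append(items[start:end])
    return strided


for chunk in chunk_with_stride(list(range(22)), chunk_size=6, stride=2):
    print(chunk)
```

Slicing the original list keeps every element in its original position, so no sorting or reordering step is needed.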



In [155]:
from collections import deque
STRIDE = 2
NUMBER_OF_CHUNKS = 8

def chunk_with_stride_and_indices(initial_list: list, stride: int, number_of_chunks: int):
	N = len(initial_list)
	chunk_indices = list(range(0, N))

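	# Create initial chunks. Note that `number_of_chunks` is used as the chunk *size* here,
	# so the list is split into ceil(N / number_of_chunks) chunks of that many elements.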
	initial_chunks = [initial_list[i * number_of_chunks:(i + 1) * number_of_chunks] for i in
	                  range((N + number_of_chunks - 1) // number_of_chunks)]
	initial_chunk_indices = [chunk_indices[i * number_of_chunks:(i + 1) * number_of_chunks] for i in
	                         range((N + number_of_chunks - 1) // number_of_chunks)]

	stride_chunks = []
	stride_chunk_indices = []  # Track indices for each strided chunk

	for chunk_index in range(len(initial_chunks)):
		current_indices = deque(initial_chunk_indices[chunk_index])

		if chunk_index == 0:
			stride_chunks.append(initial_chunks[chunk_index])
			stride_chunk_indices.append(list(current_indices))
		elif 0 < chunk_index < len(initial_chunks) - 1:
			current_chunk = deque(initial_chunks[chunk_index])

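			# Retrieve elements and their indices from the previous chunk to perform backwards stride.
			# Caution: sorting the sentences orders them alphabetically rather than by position,
			# so the backstride text can end up in a different order from its indices.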
			previous_chunk_elements = initial_chunks[chunk_index - 1][-stride:]
			previous_chunk_elements.sort(reverse=True)
			previous_indices = initial_chunk_indices[chunk_index - 1][-stride:]
			previous_indices.sort(reverse=True)

			for past_element, past_index in zip(previous_chunk_elements, previous_indices):
				current_chunk.appendleft(past_element)
				current_indices.appendleft(past_index)

			future_chunk_elements = initial_chunks[chunk_index + 1][:stride]
			future_indices = initial_chunk_indices[chunk_index + 1][:stride]

			for future_element, future_index in zip(future_chunk_elements, future_indices):
				current_chunk.append(future_element)
				current_indices.append(future_index)

			stride_chunks.append(list(current_chunk))
			stride_chunk_indices.append(list(current_indices))

		elif chunk_index == len(initial_chunks) - 1:
			current_chunk = deque(initial_chunks[chunk_index])
			previous_chunk_elements = initial_chunks[chunk_index - 1][-stride:]
			previous_chunk_elements.sort(reverse=True)
			previous_indices = initial_chunk_indices[chunk_index - 1][-stride:]
			previous_indices.sort(reverse=True)

			for past_element, past_index in zip(previous_chunk_elements, previous_indices):
				current_chunk.appendleft(past_element)
				current_indices.appendleft(past_index)

			stride_chunks.append(list(current_chunk))
			stride_chunk_indices.append(list(current_indices))
		else:
			raise ValueError("Invalid chunk index")

	print("Strided Chunks:", stride_chunks)
	print("Strided Chunk Indices:", stride_chunk_indices)
	return stride_chunks, stride_chunk_indices


strided_chunks, strided_chunk_indices = chunk_with_stride_and_indices(per_line, STRIDE,NUMBER_OF_CHUNKS)


Strided Chunks: [['or the caribor, when I get you up especially when you look like a piggy', 'Who said so?', 'I said, oh very nice man', 'Oh so do you drive there?', 'No I walk over there', 'Really? Huh?', 'I walk over there', 'How do you walk?'], ['How do you walk?', 'I walk over there', 'I can walk no problem at all', 'How many kilometer you walk?', "I can't remember", "Because I walk until my legs don't fit", 'So will you go there?', 'Will you go back again?', "I don't think so, one season down", 'Why?', "Because it's too much", 'So which is your next destination?'], ["I don't think so, one season down", 'Why?', "Because it's too much", 'So which is your next destination?', 'Maybe next one go to another part of Africa', 'Oh', "Western Africa, no it's more challenging", 'Oh is it?', "Because it's more fun than what you expect to be", "Then why don't you go to South Australia?", 'South Australia', 'Yeah, southern lights'], ["Because it's more fun than what you expect to be", "Then why

In [156]:
def chunk_with_stride_and_indices(initial_list: list, stride: int, number_of_chunks: int):
	N = len(initial_list)
	chunk_indices = list(range(0, N))

	# Calculate the size of each chunk
	chunk_size = (N + number_of_chunks - 1) // number_of_chunks

	# Create initial chunks
	initial_chunks = [initial_list[i * chunk_size:(i + 1) * chunk_size] for i in range(number_of_chunks)]
	initial_chunk_indices = [chunk_indices[i * chunk_size:(i + 1) * chunk_size] for i in range(number_of_chunks)]

	stride_chunks = []
	stride_chunk_indices = []  # Track indices for each strided chunk

	for chunk_index in range(len(initial_chunks)):
		current_chunk = deque(initial_chunks[chunk_index])
		current_indices = deque(initial_chunk_indices[chunk_index])

		if chunk_index > 0:  # For all chunks except the first, prepend elements from the previous chunk
			previous_chunk_elements = initial_chunks[chunk_index - 1][-stride:]
			previous_indices = initial_chunk_indices[chunk_index - 1][-stride:]

			for past_element, past_index in zip(reversed(previous_chunk_elements), reversed(previous_indices)):
				current_chunk.appendleft(past_element)
				current_indices.appendleft(past_index)

		if chunk_index < len(initial_chunks) - 1:  # For all chunks except the last, append elements from the next chunk
			future_chunk_elements = initial_chunks[chunk_index + 1][:stride]
			future_indices = initial_chunk_indices[chunk_index + 1][:stride]

			for future_element, future_index in zip(future_chunk_elements, future_indices):
				current_chunk.append(future_element)
				current_indices.append(future_index)

		stride_chunks.append(list(current_chunk))
		stride_chunk_indices.append(list(current_indices))

	print("Strided Chunks:", stride_chunks)
	print("Strided Chunk Indices:", stride_chunk_indices)
	return stride_chunks, stride_chunk_indices


strided_chunks, strided_chunk_indices = chunk_with_stride_and_indices(per_line, STRIDE, NUMBER_OF_CHUNKS)

Strided Chunks: [['or the caribor, when I get you up especially when you look like a piggy', 'Who said so?', 'I said, oh very nice man', 'Oh so do you drive there?', 'No I walk over there', 'Really? Huh?', 'I walk over there', 'How do you walk?', 'I can walk no problem at all', 'How many kilometer you walk?', "I can't remember", "Because I walk until my legs don't fit", 'So will you go there?', 'Will you go back again?', "I don't think so, one season down", 'Why?', "Because it's too much"], ['Will you go back again?', "I don't think so, one season down", 'Why?', "Because it's too much", 'So which is your next destination?', 'Maybe next one go to another part of Africa', 'Oh', "Western Africa, no it's more challenging", 'Oh is it?', "Because it's more fun than what you expect to be", "Then why don't you go to South Australia?", 'South Australia', 'Yeah, southern lights', 'So Australia is nothing compared to Africa', 'Because Africa is a very wild place to go', 'Yeah I mean', "Until you 

In [157]:
def chunk_with_stride_and_indices(initial_list: list, stride: int, number_of_chunks: int):
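	# Extend each side of a chunk by (stride - 1) elements, so neighbouring chunks
	# overlap by 2 * (stride - 1) elements (2 elements for stride=2)
	stride -= 1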
	N = len(initial_list)

	# Calculate base chunk size without considering stride for simplicity
	base_chunk_size = (N + number_of_chunks - 1) // number_of_chunks

	# Prepare initial chunks without stride
	initial_chunks = [initial_list[i * base_chunk_size:(i + 1) * base_chunk_size] for i in range(number_of_chunks)]
	initial_chunk_indices = [list(range(i * base_chunk_size, min((i + 1) * base_chunk_size, N))) for i in
	                         range(number_of_chunks)]

	stride_chunks = []
	stride_chunk_indices = []

	for i in range(number_of_chunks):
		# Calculate the effective start and end, incorporating stride where applicable
		start = max(0, i * base_chunk_size - stride)
		end = min(N, (i + 1) * base_chunk_size + stride if i < number_of_chunks - 1 else N)

		# Slice the original list and indices accordingly
		current_chunk = initial_list[start:end]
		current_indices = list(range(start, end))

		stride_chunks.append(current_chunk)
		stride_chunk_indices.append(current_indices)

	return stride_chunks, stride_chunk_indices


strided_chunks, strided_chunk_indices = chunk_with_stride_and_indices(per_line, stride=STRIDE, number_of_chunks=NUMBER_OF_CHUNKS)
# Demonstrating output
for i, (chunk, indices) in enumerate(zip(strided_chunks, strided_chunk_indices), start=1):
	# print(f"Chunk {i}: {chunk}")
	print(f"Indices {i}: {indices}")

Indices 1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Indices 2: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
Indices 3: [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]
Indices 4: [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
Indices 5: [59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]
Indices 6: [74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
Indices 7: [89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105]
Indices 8: [104, 105, 106, 107, 108, 109, 110, 111, 112]


In [158]:
strided_chunks

[['or the caribor, when I get you up especially when you look like a piggy',
  'Who said so?',
  'I said, oh very nice man',
  'Oh so do you drive there?',
  'No I walk over there',
  'Really? Huh?',
  'I walk over there',
  'How do you walk?',
  'I can walk no problem at all',
  'How many kilometer you walk?',
  "I can't remember",
  "Because I walk until my legs don't fit",
  'So will you go there?',
  'Will you go back again?',
  "I don't think so, one season down",
  'Why?'],
 ["I don't think so, one season down",
  'Why?',
  "Because it's too much",
  'So which is your next destination?',
  'Maybe next one go to another part of Africa',
  'Oh',
  "Western Africa, no it's more challenging",
  'Oh is it?',
  "Because it's more fun than what you expect to be",
  "Then why don't you go to South Australia?",
  'South Australia',
  'Yeah, southern lights',
  'So Australia is nothing compared to Africa',
  'Because Africa is a very wild place to go',
  'Yeah I mean',
  "Until you go unti

In [159]:
strided_chunk_indices

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
 [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45],
 [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
 [59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75],
 [74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
 [89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105],
 [104, 105, 106, 107, 108, 109, 110, 111, 112]]

## Naive Strided Chunk Diarization (without speaker label history)
We will now perform diarization on the strided chunks, without incorporating the speaker labels from the previous strides. This will allow us to gauge the accuracy of the model in diarizing the text without any historical context.

Each chunk of the transcript is extended with strides, which add context from both past and future segments to support the diarization process.

- **Backstride**: Elements from preceding segments
- **Forwardstride**: Elements from subsequent segments

The backstride in particular gives each chunk the lines that immediately precede it, so the model sees how the conversation was flowing before the chunk starts.

### Example:
Let's examine how the chunked list is altered with a stride of `a=2`:

- **Initial chunked list**: `[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17], [18, 19, 20, 21]]`
- **Updated with stride**: `[[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [16, 17, 18, 19, 20, 21]]`

In the diarization process, starting with the second chunk `[4, 5, 6, 7, 8, 9, 10, 11, 12, 13]`, the `backstride` adds the sentences that come just before the chunk. For example, if elements `4` and `5` contain wording that ties them to a particular speaker, the LLM can use that wording to label the rest of the chunk more accurately, even though no speaker labels are passed along.
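One consequence of the overlap is that backstride and forwardstride lines are labelled more than once. The notebook does not say how to reconcile those duplicates. A simple option, assumed here for illustration only, is to treat stride lines purely as context and keep labels only for each chunk's own lines. `core_labels` is a hypothetical helper, and the labels in the example are made up.

```python
def core_labels(strided_indices: list[int],
                core_indices: list[int],
                labels: list[str]) -> dict[int, str]:
    """Map line index -> label, keeping only the lines that belong to the
    chunk itself and dropping the stride lines (hypothetical helper)."""
    if len(strided_indices) != len(labels):
        raise ValueError(
            f"{len(strided_indices)} lines but {len(labels)} labels"
        )
    core = set(core_indices)
    return {
        index: label
        for index, label in zip(strided_indices, labels)
        if index in core
    }


# Second chunk of the example: indices 4-5 are backstride, 12-13 forwardstride
strided = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
core = [6, 7, 8, 9, 10, 11]
labels = ["Speaker 1", "Speaker 2"] * 5  # illustrative labels only
print(core_labels(strided, core, labels))
```

This policy only makes sense if label names mean the same thing in every chunk, which per-chunk diarization without history does not guarantee.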


In [160]:
# Strided Chunk-based diarization without speaker labels

speaker_labels = []

for strided_chunk in strided_chunks:
	strided_formatted = '\n'.join(strided_chunk)
	diarization = client.chat.completions.create(
		model=DEPLOYMENT,
		messages=[
			{"role": "system",
			 "content": "You are a linguistics expert with 50 years of experience. You will be given a list of sentences seperated by newlines, and you are to assign the Speaker label to each sentence PER line. I.e. Given the prompt, you will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']. There is a possibility that a speaker may speak for more than 1 line at time."},
			{"role": "user",
			 "content": f"Here is the list of sentences: {strided_formatted}. You will diarize ALL the sentences in the list. JUST RETURN ME THE LIST. You WILL ensure that you labeled EVERY line. You will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']"}
		],
		max_tokens=2000,
		stream=False,
		temperature=0.5,
	)
	end_result = diarization.choices[0].message.content
	speaker_labels.append(end_result)

print(speaker_labels)

["['Speaker 1', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2']", "['Speaker 1', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2', 'Speaker 2']", "['Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 1', 'Speaker 2', '

As you can see, there are occurrences of a `Speaker 3`, which is incorrect. The model cannot diarize the text accurately without historical context, because nothing ties the labels in one chunk to the labels in the chunk before it. In the next step, we therefore incorporate the speaker labels from the previous strides into the diarization context.

## Label-Aware Adaptive Diarization

With the chunks extended by strides (to incorporate past and future context), the next step is to implement a function that iterates through the strided chunk list and diarizes one chunk at a time.

- **Backstride**: Elements from the past stride
- **Forwardstride**: Elements from the future stride

__Additionally__, each chunk is submitted together with the speaker labels already assigned to its preceding `a` elements, which gives the model more context and should improve accuracy.

### Example:
Consider the initial chunked list and its updated version with a stride `a=2`:

- **Initial chunked list**: `[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17], [18, 19, 20, 21]]`
- **Updated with stride**: `[[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [16, 17, 18, 19, 20, 21]]`

For diarization, every chunk after the first (`1 <= i < x`), starting with the second chunk `[4, 5, 6, 7, 8, 9, 10, 11, 12, 13]`, receives the last `a=2` speaker labels from its `backstride` as part of the context.

If, for instance, the speaker labels for elements `4` and `5` are `Speaker 1`, this gives the large language model (LLM) historical context from the previous diarization pass, which should in principle enable more precise diarization results.
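The idea can be sketched as two small helpers: one pairs the last `a` lines of an already-diarized chunk with their labels, and the other formats them into the user prompt for the next chunk. Both names (`labelled_backstride`, `build_user_prompt`) and the prompt wording are hypothetical and not taken from the notebook. The first label in the example is illustrative.

```python
def labelled_backstride(lines: list[str], labels: list[str],
                        stride: int) -> list[tuple[str, str]]:
    """Pair the last `stride` lines of a diarized chunk with their labels."""
    if len(lines) != len(labels):
        raise ValueError(f"{len(lines)} lines but {len(labels)} labels")
    if stride <= 0:
        return []  # lines[-0:] would return the whole list, so guard explicitly
    return list(zip(lines[-stride:], labels[-stride:]))


def build_user_prompt(previous_pairs: list[tuple[str, str]],
                      current_lines: list[str]) -> str:
    """Format the labelled history and the new lines (assumed prompt wording)."""
    history = "\n".join(f"{label}: {line}" for line, label in previous_pairs)
    numbered = "\n".join(
        f"{n}. {line}" for n, line in enumerate(current_lines, start=1)
    )
    return (
        "Previous exchanges and their speakers:\n"
        f"{history}\n\n"
        f"Assign a speaker label to each of these {len(current_lines)} lines:\n"
        f"{numbered}"
    )


previous_lines = ["Will you go back again?", "I don't think so, one season down", "Why?"]
previous_labels = ["Speaker 2", "Speaker 1", "Speaker 2"]
pairs = labelled_backstride(previous_lines, previous_labels, stride=2)
print(build_user_prompt(pairs, ["Because it's too much",
                                "So which is your next destination?"]))
```

A list of `(line, label)` pairs is used instead of a dictionary keyed by sentence because this transcript contains repeated sentences (for example `'I walk over there'`), and duplicate keys would silently collapse into one entry.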


In [164]:
import pandas as pd
import numpy as np

comparison_df = pd.DataFrame({'original': per_line})
# Calculate the min and max to define the range of the DataFrame index
min_val = min(min(sublist) for sublist in strided_chunk_indices)
max_val = max(max(sublist) for sublist in strided_chunk_indices)

# Create a DataFrame with the correct indices
strided_chunk_df = pd.DataFrame(index=np.arange(min_val, max_val + 1))

# Populate the DataFrame with the data
for i, sublist in enumerate(strided_chunk_indices):
	strided_chunk_df[f'strided_chunk_{i}'] = pd.Series(index=sublist, data=strided_chunks[i])

combined_df = pd.concat([comparison_df, strided_chunk_df], axis=1)

combined_df

| | original | strided_chunk_0 | strided_chunk_1 | strided_chunk_2 | strided_chunk_3 | strided_chunk_4 | strided_chunk_5 | strided_chunk_6 | strided_chunk_7 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | or the caribor, when I get you up especially w... | or the caribor, when I get you up especially w... | | | | | | | |
| 1 | Who said so? | Who said so? | | | | | | | |
| 2 | I said, oh very nice man | I said, oh very nice man | | | | | | | |
| 3 | Oh so do you drive there? | Oh so do you drive there? | | | | | | | |
| 4 | No I walk over there | No I walk over there | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 108 | On the bed | | | | | | | | On the bed |
| 109 | Hey | | | | | | | | Hey |
| 110 | You can't say that sir | | | | | | | | You can't say that sir |
| 111 | On the water bed | | | | | | | | On the water bed |


## Performing Label-Aware Adaptive Diarization
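The function below first diarizes the initial chunk (`i=0`) with no label history, then works through the remaining chunks, passing each one a labelled stride alongside its sentences.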

In [165]:
import ast  # required for ast.literal_eval below

# FIRST RUN

# start with first chunk (no frills, no labels)

def label_aware(stride:int, number_of_chunks:int, combined_chunk_df):
	first_chunk_check = True
	total_label_list = []
	speaker_labels = None
	while first_chunk_check:
		first_chunk = combined_chunk_df['strided_chunk_0'].tolist()
		cleaned_first_chunk = [x for x in first_chunk if str(x) != 'nan']
		number_of_lines = len(cleaned_first_chunk)
		# first_chunk_formatted = '\n'.join(cleaned_first_chunk)
		
		diarization = client.chat.completions.create(
			model=DEPLOYMENT,
			messages=[
				{"role": "system",
				 "content": "You are a linguistics expert with 100 years of experience. You will be given a transcription of a CONVERSATION between an unknown number of speakers, and you are to assign the Speaker label to each sentence PER line. I.e. Given the prompt, you will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']. There is a possibility that a speaker may speak for more than 1 line at time. You will DO YOUR JOB WELL."},
				{"role": "user",
				 "content": f"Here is the list of sentences: {cleaned_first_chunk}, where there are {number_of_lines} exchanges. You will diarize ALL the sentences in the list. You WILL ensure that you label ALL {number_of_lines} lines. JUST RETURN ME THE LIST."}
			],
			max_tokens=2000,
			stream=False,
			temperature=0.2,
		)
		speaker_labels = diarization.choices[0].message.content
		first_chunk_check = False
		
	# convert labels into list
	speaker_labels_converted = ast.literal_eval(speaker_labels)
	
	# store labels aside
	total_label_list.append(speaker_labels_converted)
	# print(f'Automated Diarization for Chunk 0: \n{speaker_labels_converted}')
	
	# get stride info (transcription and label)
	def get_stride_info(chunk_number:int,total_label_list:list):
		speaker_labels_converted = total_label_list[-1]
		chunk_string = 'strided_chunk_'+str(chunk_number)
		stride_info_full = pd.DataFrame(combined_chunk_df[chunk_string].copy()).dropna()
		# print(stride_info_full)
		stride_info_full['speaker_labels'] = speaker_labels_converted
		stride_info = stride_info_full.copy().tail(stride).to_dict()
		labelled_stride_info = {stride_info[chunk_string][k]: stride_info['speaker_labels'][k] for k in stride_info[chunk_string]}
		return labelled_stride_info
	
	labelled_stride_info = get_stride_info(0,total_label_list)
	# print(f'Stride for Chunk 0: \n{labelled_stride_info}')
	
	
	
	
	# loop over the remaining chunks (1 to number_of_chunks - 1),
	# passing the labelled stride along with each one
	for j in range(1,number_of_chunks):
		# print(total_label_list)
		# get column string
		column_name = 'strided_chunk_'+str(j)
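		if j>1:
			# TODO: total_label_list[-1] holds the labels for chunk j-1, but the lines are read
			# from chunk j here, so the previous chunk's labels get paired with the tail of the
			# current chunk; get_stride_info(j-1, ...) would pair lines and labels correctly
			labelled_stride_info = get_stride_info(j,total_label_list)
			# print(labelled_stride_info)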
		
		current_chunk = combined_chunk_df[column_name].tolist()
		cleaned_current_chunk = [x for x in current_chunk if str(x) != 'nan']
		# cleaned_current_chunk = cleaned_current_chunk[stride:]
		number_of_lines = len(cleaned_current_chunk)
		
		
		print(number_of_lines)
		print(labelled_stride_info)
		print(cleaned_current_chunk)

		# print(cleaned_current_chunk)
		diarization = client.chat.completions.create(
			model=DEPLOYMENT,
			messages=[
				{"role": "system",
				 "content": "You are a linguistics expert with 100 years of experience. You will be given a transcription of a CONVERSATION between an unknown number of speakers, and you are to assign the Speaker label to each sentence PER line. I.e. Given the prompt, you will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']. There is a possibility that a speaker may speak for more than 1 line at time. You will DO YOUR JOB WELL."},
				{"role": "user",
				 "content": f"Here are the previous exchanges RIGHT before this followed by their respective speaker(s):{labelled_stride_info}. Here is the list of sentences: {cleaned_current_chunk}. I MUST RECEIVE ALL {number_of_lines} exchanges"}
			],
			max_tokens=2500,
			stream=False,
			temperature=0,
		)
		speaker_labels = diarization.choices[0].message.content

		speaker_labels_converted = ast.literal_eval(speaker_labels)
		print(speaker_labels_converted)
		total_label_list.append(speaker_labels_converted)
		
		# print(len(total_label_list))
		# TODO: the model seems to always return a list one element shorter than the number of input lines, which breaks the label assignment in get_stride_info
		# TODO: the basic implementation of the label-aware diarization seems to be working; the algorithm needs some ironing out and the code needs a thorough cleanup
		# TODO: making it OOP might be the way to go (if you have time)
	
	
label_aware(STRIDE,NUMBER_OF_CHUNKS,combined_df)
# speaker_labels

17
{"I don't think so, one season down": 'Speaker 1', 'Why?': 'Speaker 2'}
["I don't think so, one season down", 'Why?', "Because it's too much", 'So which is your next destination?', 'Maybe next one go to another part of Africa', 'Oh', "Western Africa, no it's more challenging", 'Oh is it?', "Because it's more fun than what you expect to be", "Then why don't you go to South Australia?", 'South Australia', 'Yeah, southern lights', 'So Australia is nothing compared to Africa', 'Because Africa is a very wild place to go', 'Yeah I mean', "Until you go until you say, I don't want to go again", 'Because Australia you go you want to go again and again and again']
['Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 1', 'Speaker 2', 'Speaker 2', 'Speaker 1', 'Speaker 2']
17
{'Do you feel like camping over there?': 'Speaker 1', 'Not really': 'Speaker 2'}
["Until you go until you say

ValueError: Length of values (16) does not match length of index (17)
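The `ValueError` arises because the reply contained 16 labels for 17 lines. One possible mitigation, sketched here under stated assumptions, is to check the reply length and ask again before storing the labels. `diarize_with_retry` is a hypothetical wrapper, and `request_labels` stands in for whatever function sends the chat completion request and returns the raw reply string. Neither is part of the notebook.

```python
import ast
from typing import Callable


def diarize_with_retry(lines: list[str],
                       request_labels: Callable[[list[str]], str],
                       max_attempts: int = 3) -> list[str]:
    """Call `request_labels` until it returns exactly one label per line."""
    last_problem = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        raw_reply = request_labels(lines)
        try:
            labels = ast.literal_eval(raw_reply)
        except (SyntaxError, ValueError) as exc:
            last_problem = f"attempt {attempt}: unparsable reply ({exc})"
            continue
        if isinstance(labels, list) and len(labels) == len(lines):
            return labels
        got = len(labels) if isinstance(labels, list) else "a non-list value"
        last_problem = f"attempt {attempt}: expected {len(lines)} labels, got {got}"
    raise RuntimeError(
        f"No valid label list after {max_attempts} attempts ({last_problem})"
    )


# Stand-in for the model: the first reply is one label short, the second is complete
canned_replies = iter(["['Speaker 1']", "['Speaker 1', 'Speaker 2']"])


def fake_request(lines: list[str]) -> str:
    return next(canned_replies)


print(diarize_with_retry(["Why?", "Because it's too much"], fake_request))
```

Retrying only helps if the replies vary between attempts. With `temperature=0`, as used in the loop above, the prompt may need to change between attempts (for example by numbering the lines) for a retry to produce a different result.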

## Analysing Speaker Labels


20


| | strided_chunk_0 | speaker_labels |
|---|---|---|
| 0 | or the caribor, when I get you up especially w... | Speaker 1 |
| 1 | Who said so? | Speaker 2 |
| 2 | I said, oh very nice man | Speaker 1 |
| 3 | Oh so do you drive there? | Speaker 2 |
| 4 | No I walk over there | Speaker 1 |
| 5 | Really? Huh? | Speaker 2 |
| 6 | I walk over there | Speaker 1 |
| 7 | How do you walk? | Speaker 2 |
| 8 | I can walk no problem at all | Speaker 1 |
| 9 | How many kilometer you walk? | Speaker 2 |


18    Maybe next one go to another part of Africa
19                                             Oh
Name: strided_chunk_0, dtype: object