# Concatenating phrase groupings into editable segments

The last step generated a groupings JSON, e.g.:

```json
[
  [0, 1, 3],
  [5],
  [6, 7, 8]
]
```
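Because the groupings come from an LLM, it may be worth sanity-checking them before use. The sketch below is one possible check, not part of the original notebook: `validate_groupings` is a hypothetical helper, and the rules it enforces (non-empty groups, integer IDs, ascending order, no overlap between groups) are assumptions about what a well-formed groupings file looks like.

```python
def validate_groupings(groupings):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    previous_end = -1  # highest phrase ID covered by earlier groups
    for index, group in enumerate(groupings):
        if not group:
            problems.append(f"group {index} is empty")
            continue
        if any(not isinstance(phrase_id, int) for phrase_id in group):
            problems.append(f"group {index} contains a non-integer ID")
            continue
        if group != sorted(group):
            problems.append(f"group {index} is not in ascending order")
        first, last = min(group), max(group)
        if first <= previous_end:
            problems.append(f"group {index} overlaps or precedes an earlier group")
        previous_end = max(previous_end, last)
    return problems


sample_groupings = [[0, 1, 3], [5], [6, 7, 8]]
print(validate_groupings(sample_groupings))  # []
```

Overlap matters here because each group is later expanded to the full range between its first and last ID, so two overlapping groups would repeat the same phrases in two output rows.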

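In this step, we take those groupings back to the source transcript and write a TSV file with one row per group. Each group is expanded to the full range from its first to its last phrase ID (so `[0, 1, 3]` becomes `0-3`), and the phrases in that range are joined into a single line: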

```tsv
0-3	Thank you and good afternoon. It's great to be back in Chicago. Um And thanks for that kind introduction.
5-5	Um, so I'm, I'm looking forward to our conversation, my conversation, uh, with, uh, uh, Professor Rajan Rahu, but first I'll briefly discuss the outlook for the economy and monetary policy.
6-8	So at the Fed, we are always focused on the dual mandate goals that Congress has given us maximum employment and stable prices. Despite heightened uncertainty and downside risks, the US economy is still in a solid position. The labor market is at or near maximum employment. Inflation has come down a great deal, but it's still running a bit above our 2% objective.
```

Each of the lines above should be a complete thought: something that can be cut and edited as needed.
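As a rough illustration of how one group turns into one row, here is a minimal sketch on toy data. The phrases in `toy_transcript` are made up for the example, and `build_segment` is a hypothetical helper name; the real notebook code works on the full transcript file instead.

```python
# Toy stand-in for the transcript's phrases, keyed by phrase ID (made-up data).
toy_transcript = {
    0: "Thank you.",
    1: "Good afternoon.",
    2: "Um.",
    3: "Thanks for that.",
    5: "So.",
}


def build_segment(group, id_to_transcript):
    """Expand a group to first..last and join the phrases in that range."""
    first, last = group[0], group[-1]
    phrases = [
        # Fall back to the ID itself (as text) if a phrase is missing.
        str(id_to_transcript.get(phrase_id, phrase_id))
        for phrase_id in range(first, last + 1)
    ]
    return f"{first}-{last}", " ".join(phrases)


print(build_segment([0, 1, 3], toy_transcript))
# ('0-3', 'Thank you. Good afternoon. Um. Thanks for that.')
print(build_segment([5], toy_transcript))
# ('5-5', 'So.')
```

Note that phrase 2 ends up in the `0-3` row even though the group itself only listed 0, 1, and 3: the range label always covers every phrase between the first and last ID.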

```python
import csv
import json

VIDEO_ID = 'gYXAulePuXY'

transcription_file = f"content/{VIDEO_ID}-transcript.json"
groupings_file = f"content/{VIDEO_ID}-groupings.json"
output_file = f"content/{VIDEO_ID}-segments.tsv"

# Load the transcription from AWS Transcribe
with open(transcription_file, 'r', encoding='utf-8') as f:
    transcription = json.load(f)

# Load the phrase groupings (from an LLM)
with open(groupings_file, 'r', encoding='utf-8') as f:
    groupings = json.load(f)

# Get the phrase-level segments from the transcription
audio_segments = transcription.get('results', {}).get('audio_segments', [])

# Build a lookup from phrase ID to its transcript text
id_to_transcript = {item["id"]: item["transcript"] for item in audio_segments}

# Expand each group to the full range from its first to its last ID,
# skipping empty groups (they have no first or last ID to label)
expanded_selections = [
    list(range(group[0], group[-1] + 1))
    for group in groupings
    if group
]

# Build [label, text] rows. If an ID has no transcript, fall back to the
# ID itself, converted to text so that join() does not raise a TypeError.
editable_segments = [
    [
        f"{group[0]}-{group[-1]}",
        ' '.join(str(id_to_transcript.get(id_, id_)) for id_ in group),
    ]
    for group in expanded_selections
]

# Write the segments to a new file
with open(output_file, 'w', newline='', encoding='utf-8') as tsvfile:
    tsv_writer = csv.writer(tsvfile, delimiter='\t')
    tsv_writer.writerows(editable_segments)

print(f"Data written to {output_file}")
```


Data written to content/gYXAulePuXY-segments.tsv
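
To spot-check the result, the TSV can be read back and each range label split into its start and end IDs. This is a sketch under stated assumptions: `read_segments` is a hypothetical helper, it assumes every row has exactly two tab-separated fields with a `start-end` label, and it reads from an in-memory string rather than the real output file.

```python
import csv
import io


def read_segments(tsvfile):
    """Parse rows of 'start-end<TAB>text' into (start, end, text) tuples."""
    segments = []
    for row in csv.reader(tsvfile, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        label, text = row
        start, end = (int(part) for part in label.split("-"))
        segments.append((start, end, text))
    return segments


# In-memory sample standing in for the segments file.
sample = io.StringIO(
    "0-3\tThank you and good afternoon.\n"
    "5-5\tUm, so I'm looking forward.\n"
)
for start, end, text in read_segments(sample):
    print(start, end, text)
```

For the real file, the same function could be called on `open(output_file, newline='', encoding='utf-8')`.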
