# YouTube Transcript Fetcher with 30-Second Segments - Version 1.3
## Author: Your Name
## Date: 2025-01-06

This script fetches the transcript of a YouTube video (if one is available),
groups it into 30-second segments, and saves the result in three formats:
SubRip Subtitle (SRT) text, JSON, and CSV.

## Dependencies:
 - youtube-transcript-api
 - pytube (used to look up the video title)

## Usage:
 - Run the script and input a YouTube video URL when prompted.
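 - The transcript is displayed in the terminal and saved as SRT-formatted text (a .txt file), a .json file, and a .csv file under ./data/txt, ./data/json, and ./data/csv. These directories must already exist.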

In [32]:
!pip install youtube-transcript-api pytube



# from youtube_transcript_api.formatters import SRTFormatter
Description:
The `SRTFormatter` class converts fetched transcript data into SubRip Subtitle (SRT) text.

# import sys
Description:
The sys module provides access to variables and functions maintained by the interpreter, such as command-line arguments and `sys.exit`. This script uses it to exit when the URL is not a valid YouTube URL.

# import json
Description:
The json module reads and writes JSON data. This script uses it to write the grouped transcript to a JSON file.

# import csv
Description:
The csv module allows you to read and write CSV files in Python. It is helpful for exporting or processing transcript data in tabular format.

In [33]:
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import SRTFormatter
import sys
import json
import csv
from pytube import YouTube  # Import pytube to fetch the video title

# Function Documentation: `group_transcript_into_segments`

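This function groups a YouTube video transcript into segments of a specified duration (30 seconds by default).
It takes the transcript as a list of entries and returns a list of segments. Each segment holds the entries that start
inside the current window. The first window begins at 0 seconds, and each later window begins at the start time of the
first entry that fell outside the previous one. A segment can therefore end slightly past the window when its last entry is still playing.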
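
To make the windowing behavior concrete, here is a minimal sketch on hand-made entries. The sample data and the name `group_into_windows` are invented for illustration; the entry shape (`text`, `start`, `duration`) is the one the code below reads.

```python
def group_into_windows(entries, window=30):
    """Group entries by start time into windows of `window` seconds."""
    groups = []
    current = []
    window_end = window  # the first window covers [0, window)
    for entry in entries:
        if entry["start"] < window_end:
            current.append(entry)
        else:
            # Close the current group (if it has entries) and open a new
            # window anchored at this entry's start time.
            if current:
                groups.append(current)
            current = [entry]
            window_end = entry["start"] + window
    if current:
        groups.append(current)
    return groups


# Invented sample entries
sample = [
    {"text": "a", "start": 0.0, "duration": 10.0},
    {"text": "b", "start": 12.0, "duration": 15.0},
    {"text": "c", "start": 31.0, "duration": 5.0},
    {"text": "d", "start": 45.0, "duration": 5.0},
    {"text": "e", "start": 62.0, "duration": 5.0},
]
groups = group_into_windows(sample)
print([[e["text"] for e in g] for g in groups])
# → [['a', 'b'], ['c', 'd'], ['e']]
```

Entry `c` starts at 31 seconds, so it opens a second window that runs to 61 seconds; entry `e` at 62 seconds opens a third.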

# Function Documentation: `format_transcript_as_srt`

This function formats a YouTube video transcript into the SubRip Subtitle (SRT) format, which is commonly used for video subtitles. It uses the `SRTFormatter` class from the `youtube_transcript_api` library and formats the ungrouped transcript entry by entry. `main` builds the grouped SRT text itself and does not call this function.

# Function Documentation: `create_srt_file`

This function generates an SRT (SubRip Subtitle) file from a grouped transcript. It processes each segment of the grouped transcript, formats the time codes, joins the transcript text, and writes the result to an SRT file. In the code below, this step is performed inline in `main` rather than in a separate function.
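
A standalone version of this step could look like the following sketch. The signature `create_srt_file(grouped_transcript, path)`, the helper `srt_time`, and the sample data are assumptions made for illustration, not the notebook's own code.

```python
import os
import tempfile


def srt_time(seconds):
    """Format seconds as HH:MM:SS,MMM (rounded to whole milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def create_srt_file(grouped_transcript, path):
    """Write one numbered SRT block per segment (hypothetical helper)."""
    blocks = []
    for index, segment in enumerate(grouped_transcript, start=1):
        start = segment[0]["start"]
        end = segment[-1]["start"] + segment[-1]["duration"]
        text = " ".join(entry["text"] for entry in segment)
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    # Blocks are separated by a blank line, as SRT expects
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(blocks))


# Invented sample data: two segments
grouped = [
    [
        {"text": "hello", "start": 0.0, "duration": 2.0},
        {"text": "world", "start": 2.0, "duration": 3.5},
    ],
    [{"text": "again", "start": 31.0, "duration": 1.0}],
]

with tempfile.TemporaryDirectory() as tmp:
    srt_path = os.path.join(tmp, "demo.srt")
    create_srt_file(grouped, srt_path)
    with open(srt_path, encoding="utf-8") as f:
        content = f.read()

print(content)
```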

# Function Documentation: `create_json_file`

This function generates a JSON file from a grouped transcript. It processes each segment of the grouped transcript, compiles the start and end times, joins the transcript text, and writes the resulting data into a JSON file. In the code below, this step is also performed inline in `main`.
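
As a rough sketch, a standalone version might look like this. The signature and the sample data are assumptions for illustration; the record keys (`start_time`, `end_time`, `text`) match the ones used in the code below.

```python
import json
import os
import tempfile


def create_json_file(grouped_transcript, path):
    """Write one JSON record per segment (hypothetical helper)."""
    data = [
        {
            "start_time": segment[0]["start"],
            "end_time": segment[-1]["start"] + segment[-1]["duration"],
            "text": " ".join(entry["text"] for entry in segment),
        }
        for segment in grouped_transcript
    ]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII transcript text readable
        json.dump(data, f, indent=4, ensure_ascii=False)
    return data


# Invented sample data: one segment with two entries
grouped = [
    [
        {"text": "hello", "start": 0.0, "duration": 2.0},
        {"text": "world", "start": 2.0, "duration": 3.5},
    ],
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "demo.json")
    written = create_json_file(grouped, json_path)
    with open(json_path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded)
# → [{'start_time': 0.0, 'end_time': 5.5, 'text': 'hello world'}]
```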

# Function Documentation: `create_csv_file`

This function generates a CSV file from a grouped transcript. It processes each segment of the grouped transcript, extracts the start and end times, combines the transcript text, and writes the resulting data into a CSV file. As with the SRT and JSON output, the code below performs this step inline in `main`.
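
A possible standalone version is sketched below, using `csv.DictWriter` as the code below does. The signature, the `srt_time` helper, and the sample data are assumptions for illustration.

```python
import csv
import os
import tempfile


def srt_time(seconds):
    """Format seconds as HH:MM:SS,MMM (rounded to whole milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def create_csv_file(grouped_transcript, path):
    """Write one CSV row per segment (hypothetical helper)."""
    fieldnames = ["start_time", "end_time", "text"]
    # newline="" lets the csv module control line endings
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for segment in grouped_transcript:
            last = segment[-1]
            writer.writerow({
                "start_time": srt_time(segment[0]["start"]),
                "end_time": srt_time(last["start"] + last["duration"]),
                "text": " ".join(entry["text"] for entry in segment),
            })


# Invented sample data; the comma in the text exercises CSV quoting
grouped = [
    [
        {"text": "hello,", "start": 0.0, "duration": 2.0},
        {"text": "world", "start": 2.0, "duration": 3.5},
    ],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    create_csv_file(grouped, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(rows)
```

Reading the file back with `csv.DictReader` shows that text containing commas survives the round trip, because the writer quotes such fields.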

# Function Documentation: `format_time`

This helper function formats a given time in seconds into the SubRip Subtitle (SRT) time format (`HH:MM:SS,MMM`). It converts the time into hours, minutes, seconds, and milliseconds, and returns it as a string in the required format for subtitles.
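
The conversion can be illustrated with a short worked example. This is a sketch, not the notebook's `format_time`: the name `to_srt_timestamp` is hypothetical, and it rounds to whole milliseconds first, whereas the helper in the code below truncates the fractional part.

```python
def to_srt_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,MMM."""
    total_ms = round(seconds * 1000)            # whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)    # 3,600,000 ms per hour
    minutes, rem = divmod(rem, 60_000)          # 60,000 ms per minute
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


print(to_srt_timestamp(33.84))   # → 00:00:33,840
print(to_srt_timestamp(3725.5))  # → 01:02:05,500
```

For 3725.5 seconds: 3,725,500 ms is 1 hour with 125,500 ms left over, which is 2 minutes with 5,500 ms left over, which is 5 seconds and 500 ms.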

# Function Documentation: `get_transcript`

This function retrieves the transcript for a YouTube video using the `YouTubeTranscriptApi`. If the transcript cannot be fetched, it prints the error and returns `None`; otherwise it returns the transcript.

# Function Documentation: `main`

The `main` function is the entry point of the script. It prompts the user for a YouTube video URL, extracts the video ID, fetches the transcript using the `YouTubeTranscriptApi`, looks up the video title, groups the transcript into 30-second segments, and writes the result in three formats (SRT, JSON, CSV). It accepts only URLs that contain `youtube.com/watch?v=` and exits with an error message otherwise.
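
The video-ID extraction step can also be done with the standard library's URL parser instead of string splitting. The sketch below is an alternative, not the notebook's method; `extract_video_id` is a hypothetical name, and support for `youtu.be` short links is an addition the original check does not have.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None

    # Standard links: https://www.youtube.com/watch?v=<id>&...
    if (host == "youtube.com" or host.endswith(".youtube.com")) and parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    return None


print(extract_video_id("https://www.youtube.com/watch?v=CBqBNCHOe2Q"))
# → CBqBNCHOe2Q
print(extract_video_id("https://www.youtube.com/watch?v=CBqBNCHOe2Q&t=30s"))
# → CBqBNCHOe2Q
print(extract_video_id("https://example.com/watch?v=CBqBNCHOe2Q"))
# → None
```

Parsing the query string means extra parameters such as `&t=30s` are ignored regardless of their order, and the host check rejects look-alike URLs on other domains.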

In [46]:
# Function to get transcript
def get_transcript(video_id):
    try:
        # Fetch transcript for the given video ID
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return transcript
    except Exception as e:
        print(f"Error fetching transcript: {e}")
        return None

# Function to get video title using pytube
def get_video_title(video_url):
    try:
        # Create a YouTube object and fetch the title
        yt = YouTube(video_url)
        return yt.title
    except Exception:
        # pytube has a known issue that can make the title lookup fail,
        # so fall back to asking the user for the title
        title = ""
        while not title:
            title = input("Enter the YouTube video Title: ").strip()
            if not title:
                print("Title cannot be empty. Please try again.")
                
        return title

# Function to group transcript into 30-second intervals
def group_transcript_into_segments(transcript, segment_duration=30):
    segments = []
    current_segment = []
    current_start_time = 0
    current_end_time = segment_duration

    for entry in transcript:
        start_time = entry['start']

        # If the current transcript entry starts inside the current window
        if start_time < current_end_time:
            current_segment.append(entry)
        else:
            # Push the current segment (if it has entries) and start a new window
            if current_segment:
                segments.append(current_segment)
            current_segment = [entry]
            current_start_time = start_time
            current_end_time = current_start_time + segment_duration

    # Add the last segment if any
    if current_segment:
        segments.append(current_segment)

    return segments

# Function to format transcript as SRT (SubRip Subtitle)
def format_transcript_as_srt(transcript):
    formatter = SRTFormatter()
    return formatter.format_transcript(transcript)

# Main function
def main():
    # Input: Provide the YouTube video URL
    video_url = input("Enter the YouTube video URL: ")

    # Extract the video ID from the URL
    if "youtube.com/watch?v=" in video_url:
        video_id = video_url.split("v=")[1]
        if "&" in video_id:
            video_id = video_id.split("&")[0]  # Get only the video ID part
    else:
        print("Invalid YouTube URL")
        sys.exit(1)

    print(f"Fetching transcript for video ID: {video_id}...\n")

    # Get the transcript
    transcript = get_transcript(video_id)

    if transcript:

        # Retrieve video title after fetching the transcript
        title = get_video_title(video_url)

        formatted_title = ''
        if title:
            print(f"Video Title: {title}")
            # Replace spaces and characters that are unsafe in file names with underscores
            formatted_title = "".join(c if c.isalnum() or c in "-_." else "_" for c in title)
        else:
            print("Could not retrieve video title.")
        
        # Group the transcript into 30-second intervals
        grouped_transcript = group_transcript_into_segments(transcript)

        # Format each segment into SRT format
        srt_transcript = ""
        counter = 1
        for segment in grouped_transcript:
            srt_transcript += f"{counter}\n"
            segment_start = segment[0]['start']
            segment_end = segment[-1]['start'] + segment[-1]['duration']
            srt_transcript += f"{format_time(segment_start)} --> {format_time(segment_end)}\n"
            
            # Join all the texts in this segment into one string
            srt_transcript += " ".join([entry['text'] for entry in segment]) + "\n\n"
            counter += 1

        # Print out the formatted transcript
        print("Transcript:\n")
        print(srt_transcript)

        # Save the SRT-formatted transcript as a text file
        txt_path = f"./data/txt/{video_id}_{formatted_title}_transcript.txt"
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(srt_transcript)
        
        print(f"Transcript saved as {txt_path}")

        transcript_data = []
        for segment in grouped_transcript:
            segment_data = {
                "start_time": segment[0]['start'],
                "end_time": segment[-1]['start'] + segment[-1]['duration'],
                "text": " ".join([entry['text'] for entry in segment])
            }
            transcript_data.append(segment_data)
        
        # Write the JSON to a file
        with open(f"./data/json/{video_id}_{formatted_title}_transcript.json", "w") as f:
            json.dump(transcript_data, f, indent=4)
        
        print(f"JSON file saved as ./data/json/{video_id}_{formatted_title}_transcript.json")

        # Open the CSV file for writing
        with open(f"./data/csv/{video_id}_{formatted_title}_transcript.csv", "w", newline='') as csvfile:
            fieldnames = ['start_time', 'end_time', 'text']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
            # Write the header
            writer.writeheader()
    
            # Write the transcript segments
            for segment in grouped_transcript:
                segment_start = segment[0]['start']
                segment_end = segment[-1]['start'] + segment[-1]['duration']
                segment_text = " ".join([entry['text'] for entry in segment])
    
                writer.writerow({
                    'start_time': format_time(segment_start),
                    'end_time': format_time(segment_end),
                    'text': segment_text
                })
        
        print(f"CSV file saved as ./data/csv/{video_id}_{formatted_title}_transcript.csv")
        
    else:
        print("No transcript available for this video.")

def format_time(seconds):
    """ Helper function to format time in SRT format (HH:MM:SS,MMM) """
    millisec = int((seconds % 1) * 1000)
    seconds = int(seconds)
    minutes = (seconds // 60) % 60
    hours = seconds // 3600
    return f"{hours:02}:{minutes:02}:{seconds % 60:02},{millisec:03}"

if __name__ == "__main__":
    main()


Enter the YouTube video URL:  https://www.youtube.com/watch?v=CBqBNCHOe2Q


Fetching transcript for video ID: CBqBNCHOe2Q...



Enter the YouTube video Title:  263 ‒ Concussions and head trauma: symptoms, treatment, and recovery | Micky Collins, Ph.D.


Video Title: 263 ‒ Concussions and head trauma: symptoms, treatment, and recovery | Micky Collins, Ph.D.
Transcript:

1
00:00:00,179 --> 00:00:33,840
there's a lot of misinformation out there about concussion and I think it actually hurts outcomes a lot of times clinicians that aren't aware of the recent advances in knowing how to treat this clinicians that don't know how to do the right evaluation and there's a lot of mismanagement and mistreatment of this injury that leads to very poor outcomes again and you're going to hear it from me over and over again if you bring me a patient with concussion I I can pretty much tell you I can treat that and get get that patient better and get them

2
00:00:31,439 --> 00:01:04,439
back to the sports they love there are highly effective treatments with this injury hey everyone welcome to the drive podcast I'm your host Peter attia well Mickey thanks so much for uh for making time to sit down I know you're particularly busy today so I really apprec