**YOUTUBE TRANSCRIPT GENERATOR**
*By: Allaudin Ansari*

**FEATURES:**


---


1. Transcript Retrieval: Fetches the transcript of a YouTube video using the youtube_transcript_api package, without requiring a YouTube API key.

2. URL Parsing: Extracts the video ID from various YouTube URL formats, ensuring versatility in input.

3. Chapter Parsing: Identifies and organizes chapters from the video description, allowing for structured transcripts that reflect the video’s content hierarchy.

4. Text Cleaning: Cleans the transcript text by removing unwanted elements such as:
   *   Music tags (e.g., “[Music]”).
   *   Escape sequences (e.g., newline characters).
   *   Double greater-than symbols (e.g., “>>”).
   *   Periods that end up directly after Markdown heading hashes (e.g., “#.” or “##.”).

5. Punctuation Restoration: Optionally restores punctuation in the transcript using a pre-trained punctuation model, enhancing readability.

6. Sentence Capitalization: Automatically capitalizes the first letter of each sentence for improved presentation.

7. Flexible Output: Saves the processed transcript to a Markdown file, allowing for easy sharing and formatting. Supports specifying the output directory and filename, making organization simple.

8. Verbose Logging: Provides detailed logging throughout the process, helping users track progress and diagnose potential issues.

9. Customizability: Allows users to customize various parameters, including language for the transcript, punctuation model, and output options.

10. Natural Language Processing Integration: Utilizes the Natural Language Toolkit (nltk) for sentence tokenization.

11. Error Handling: Includes mechanisms to handle errors gracefully, providing feedback for troubleshooting.
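
As an illustration of the URL-parsing feature listed above, here is a minimal sketch that uses the same regular expression as the notebook's parse_youtube_url function. The helper name extract_video_id and the playlist value PL123 are made up for this example.

```python
import re

# Same pattern as the notebook's parse_youtube_url: it accepts
# "youtube.com/...?v=<id>" and "youtu.be/<id>" links.
VIDEO_ID_PATTERN = re.compile(
    r'(?:youtube\.com\/.*?[?&]v=|youtu\.be\/)([^"&?\/\s]{11})'
)

def extract_video_id(url):
    """Return the 11-character video ID, or None if the URL is not recognised."""
    match = VIDEO_ID_PATTERN.search(url)
    return match.group(1) if match else None

sample_urls = [
    "https://www.youtube.com/watch?v=KO1PH1CXb6s",
    "https://youtu.be/KO1PH1CXb6s?si=PPW2XOuwEVYzRYxi",
    "https://www.youtube.com/watch?list=PL123&v=KO1PH1CXb6s",  # hypothetical playlist ID
    "https://example.com/not-a-video",
]
for sample in sample_urls:
    print(sample, "->", extract_video_id(sample))
```

The first three URLs yield KO1PH1CXb6s and the last yields None. Note that this pattern only looks for a v= query parameter or a youtu.be link, so paths such as /shorts/ or /embed/ are not recognised.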


**Runtime**


---


Please go to Runtime > Change runtime type and select T4 GPU. This ensures the
best performance. Without a GPU, punctuation restoration will be very slow and can take minutes.
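
To confirm that the runtime actually has a GPU before running the punctuation step, a quick check along these lines may help. It is only a sketch: it assumes PyTorch is what the punctuation model runs on, and gpu_available is a hypothetical helper, not part of this notebook.

```python
def gpu_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        # Assumption: torch is installed alongside the punctuation model.
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

if gpu_available():
    print("GPU detected: punctuation restoration should run quickly.")
else:
    print("No GPU detected: expect punctuation restoration to be slow.")
```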

**Connect With Me**


---


*   Email: allu456654ansari@gmail.com
*   Github: https://github.com/allu0786ansari


# Let's Begin the Journey of Transcripting....

First, install the libraries required for the project.

In [14]:
pip install nltk youtube-transcript-api deepmultilingualpunctuation




# Set the YouTube Video URL and Other Options as Required
language: the language code of the transcript to fetch. By default, the module always picks a manually created transcript over an automatically generated one when both are available in the requested language.

punctuation_model: supported values can be found at https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large#languages (leave it empty to use the default model).

filename: leave it empty to derive the filename from the video title, or from the video ID if the title is empty.

In [11]:
url = 'https://youtu.be/KO1PH1CXb6s?si=PPW2XOuwEVYzRYxi'
language = 'en'
punctuated = True
output_dir = '.'
filename = ''
punctuation_model = ''
verbose = True

# Libraries or Modules required:

*   re: Regular expressions for string manipulation.
*   math: Mathematical functions (used for time calculations).
*   logging: Logging messages for debugging and information.
*   os: Operating system functionalities (e.g., file paths).
*   nltk: Natural Language Toolkit for text processing.
*   YouTubeTranscriptApi: API to fetch transcripts from YouTube videos.
*   PunctuationModel: A model for restoring punctuation in text.
*   warnings: Used to silence library warnings in the notebook output.

In [7]:
import re
import math
import logging
import os
import nltk
from youtube_transcript_api import YouTubeTranscriptApi
from deepmultilingualpunctuation import PunctuationModel
import warnings

# NLTK Tokenization Setup and Logging Configuration:

In [8]:
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

logging.basicConfig(level=logging.INFO, force=True)
warnings.filterwarnings('ignore')

# Functions for Processing YouTube Video Transcripts


---
1. clean_for_filename(title): Cleans the provided title to create a valid filename by removing unwanted characters.
2. remove_music_tags(text): Strips [music] tags from the transcript to improve readability.
3. remove_period_after_hashes(text): Removes periods that follow heading hashes, ensuring proper formatting.
4. remove_escape_sequences(text): Cleans escape sequences (like \n, \r\n) from the text for a smoother output.
5. remove_double_greater_than(text): Eliminates occurrences of ">>" from the text.
6. add_punctuation(text, punctuation_model): Restores punctuation in the text using a specified punctuation model.
7. capitalize_sentences(sentences): Capitalizes the first letter of each sentence in the provided list.
8. parse_youtube_url(url): Extracts the YouTube video ID from a given URL.
9. parse_chapters(description): Parses chapters from the video description, identifying timestamps and titles.
10. get_transcript(video_id, language, video_info, verbose=True): Fetches the transcript of the YouTube video using the YouTubeTranscriptApi and processes it to include chapter markers and cleaned text.
11. process_and_save_transcript(video_id, video_info, language, generate_punctuated, output_dir, filename, verbose, punctuation_model): Orchestrates the retrieval and saving of the cleaned transcript to a markdown file. Returns the transcript for printing.
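
To make the cleaning helpers in the list above more concrete, the sketch below chains the three per-line clean-ups in the order get_transcript applies them, and shows the heading fix separately. The names clean_caption_line and fix_heading_periods and the sample caption are invented for this illustration; the patterns mirror the ones used in the notebook.

```python
import re

def clean_caption_line(text):
    """Apply the three per-line clean-ups in the order get_transcript uses them."""
    text = re.sub(r'\[music\]', '', text, flags=re.IGNORECASE)  # drop [Music] tags
    text = re.sub(r'\\[nrtb]', '', text)  # drop literal backslash sequences such as \n
    text = re.sub(r'>>', '', text)        # drop ">>" markers
    return text.strip()

def fix_heading_periods(text):
    """Remove a period that ends up directly after '#' or '##'."""
    return re.sub(r'(#\.|##\.)', lambda match: match.group(1)[:-1], text)

# Raw string: the "\n" here is a literal backslash followed by "n".
raw_line = r">> [Music] welcome back\n to the channel"
print(clean_caption_line(raw_line))        # welcome back to the channel
print(fix_heading_periods("##. Intro"))    # ## Intro
```

The escape-sequence pattern targets literal backslash sequences in the caption text. Real newline characters are left untouched by this step.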


In [12]:
# Function to clean filenames
def clean_for_filename(title):
    cleaned_title = re.sub(r'[^\w\s\.\-\(\)\[\]]', '', title)
    return cleaned_title.strip()

# Remove music tags from transcripts
def remove_music_tags(text):
    updated_text = re.sub(r'\[music\]', '', text, flags=re.IGNORECASE)
    return updated_text

# Remove periods after hashtags
def remove_period_after_hashes(text):
    return re.sub(r'(#\.|##\.)', lambda match: match.group(1)[:-1], text)

# Remove literal escape sequences (a backslash followed by n, r, t or b) from text
def remove_escape_sequences(text):
    return re.sub(r'\\[nrtb]', '', text)

# Remove occurrences of ">>"
def remove_double_greater_than(text):
    return re.sub(r'>>', '', text)

# Add punctuation using a punctuation model
def add_punctuation(text, punctuation_model):
    model = PunctuationModel(model=punctuation_model) if punctuation_model != "" else PunctuationModel()
    punctuated_text = model.restore_punctuation(text)
    return punctuated_text

# Capitalize the first letter of sentences (slicing keeps empty strings from raising IndexError)
def capitalize_sentences(sentences):
    capitalized_sentences = [sentence[:1].upper() + sentence[1:] for sentence in sentences]
    return capitalized_sentences

# Extract YouTube video ID from URL
def parse_youtube_url(url):
    video_id_match = re.search(r'(?:youtube\.com\/.*?[?&]v=|youtu\.be\/)([^"&?\/\s]{11})', url)
    if video_id_match:
        return video_id_match.group(1)
    else:
        raise ValueError('Invalid YouTube URL')

# Parse chapters from a video description
def parse_chapters(description):
    lines = description.split("\n")
    regex = re.compile(r"(\d{0,2}:?\d{1,2}:\d{2})")
    chapters = []
    for line in lines:
        matches = regex.findall(line)
        if matches:
            ts = matches[0]
            title = line.replace(ts, "").strip()
            title = re.sub(r'\d{0,2}:?\d{1,2}:\d{2}', '', title).strip().strip('-').strip()
            chapters.append({
                "timestamp": ts,
                "title": title,
            })
    return chapters

# Get transcript from YouTube video using YouTubeTranscriptApi (no API key required)
def get_transcript(video_id, language, video_info, verbose=True):
    transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])

    transcript = f'# {video_info["title"]}\n\n' if video_info["title"] else ""
    current_chapter_index = 0
    chapters = video_info.get("chapters", [])

    for i, line in enumerate(transcript_list):
        start_time = int(math.floor(line['start']))
        if 0 <= current_chapter_index < len(chapters):
            chapter_time = chapters[current_chapter_index]['timestamp']
            try:
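                # Convert "HH:MM:SS" or "MM:SS" to seconds; 'power' avoids reusing the loop variable name 'i'
                chapter_start_seconds = sum(int(x) * 60 ** power for power, x in enumerate(reversed(chapter_time.split(':'))))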
                buffer_time = 2
                if start_time >= chapter_start_seconds - buffer_time:
                    transcript += f'\n\n## {chapters[current_chapter_index]["title"]}\n'
                    current_chapter_index += 1
            except Exception as e:
                logging.error(f"Error processing chapter timestamp: {chapter_time}")
                logging.error(f"Error details: {e}")

        line['text'] = remove_music_tags(line['text'])
        line['text'] = remove_escape_sequences(line['text'])
        line['text'] = remove_double_greater_than(line['text'])
        if line['text']:
            transcript += line['text'].strip() + ' '

        if verbose and i % 100 == 0:
            logging.info(f"Processed {i} lines out of {len(transcript_list)}")

    return transcript

# Process and save transcript to a file
def process_and_save_transcript(video_id, video_info, language, generate_punctuated, output_dir, filename, verbose, punctuation_model):
    try:
        raw_transcript = get_transcript(video_id, language, video_info, verbose)
        if generate_punctuated:
            with_punctuation = add_punctuation(raw_transcript, punctuation_model)
            with_punctuation = remove_period_after_hashes(with_punctuation)
            sentences = nltk.sent_tokenize(with_punctuation)
        else:
            sentences = nltk.sent_tokenize(raw_transcript)

        capitalized_sentences = capitalize_sentences(sentences)
        double_linesep = os.linesep + os.linesep
        capitalized_transcript = double_linesep.join(capitalized_sentences)
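        output_path = os.path.join(output_dir, f'{filename}.md')
        # Create the output directory if it does not exist yet
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)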

        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(capitalized_transcript)

        logging.info(f'Transcript saved to {output_path}')
        return capitalized_transcript  # Return the transcript for printing
    except Exception as e:
        logging.error(f'Error: {e}')
        return ""  # Ensure a return even if an error occurs
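
The chapter handling in get_transcript relies on turning a timestamp such as 1:02:03 into seconds by weighting each field with a power of 60. A small worked sketch of that arithmetic follows; timestamp_to_seconds is a hypothetical standalone helper written for this example.

```python
def timestamp_to_seconds(timestamp):
    """Convert 'MM:SS' or 'HH:MM:SS' into a number of seconds."""
    # Reverse the fields so seconds get 60**0, minutes 60**1, hours 60**2.
    parts = reversed(timestamp.split(':'))
    return sum(int(value) * 60 ** power for power, value in enumerate(parts))

for stamp in ("0:45", "12:30", "1:02:03"):
    print(stamp, "->", timestamp_to_seconds(stamp))
# 0:45 -> 45
# 12:30 -> 750
# 1:02:03 -> 3723
```

In the notebook, a chapter heading is emitted as soon as a caption starts within 2 seconds of the chapter's start time, so a chapter at 12:30 is inserted from second 748 onwards.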




# Transcript Processing Workflow

---
1. Extract Video ID
2. Define a Helper That Returns Video Information (placeholder)
3. Retrieve Video Info
4. Clean and Determine Filename
5. Log Final Filename
6. Process and Save Transcript
7. Print the Transcript
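
The video information step uses a placeholder that returns no chapters. If you have the video description as plain text, one possible way to fill in the chapters is sketched below. It follows the same timestamp pattern as parse_chapters; build_video_info, the title, and the description text are invented for this example, and how you obtain the description is left open because the notebook does not fetch it.

```python
import re

# Same timestamp pattern the notebook's parse_chapters uses.
TIMESTAMP = re.compile(r'(\d{0,2}:?\d{1,2}:\d{2})')

def build_video_info(title, description):
    """Build the {'title', 'chapters'} dict that the notebook passes around."""
    chapters = []
    for line in description.split("\n"):
        match = TIMESTAMP.search(line)
        if match:
            stamp = match.group(1)
            # Drop the timestamp and any leading/trailing dash from the title.
            chapter_title = line.replace(stamp, "").strip().strip('-').strip()
            chapters.append({"timestamp": stamp, "title": chapter_title})
    return {"title": title, "chapters": chapters}

# Hypothetical description text with two chapter lines.
description = "0:00 Intro\n1:30 - Adding the model fields\nThanks for watching!"
info = build_video_info("Demo video", description)
print(info["chapters"])
```

Only lines that contain a timestamp become chapters, so the closing line of the sample description is ignored.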






In [13]:
# Extract video ID from the URL
video_id = parse_youtube_url(url)

# Get video information (title and chapters)
# Assuming you have a function to get video info, like this:
def getVideoInfo(video_id):
    # Placeholder: Replace with actual implementation
    return {
        "title": "Sample Title",
        "chapters": []  # Add chapters if available
    }

video_info = getVideoInfo(video_id)

# Clean and determine filename
filename = filename or clean_for_filename(video_info["title"]) or clean_for_filename(video_id)

# Log the final filename being used
logging.info(f'Using filename: {filename}')

# Process and save the transcript, and print it
transcript = process_and_save_transcript(video_id, video_info, language, punctuated, output_dir, filename, verbose, punctuation_model)

# Print the transcript
print(transcript)


INFO:root:Using filename: Sample Title
INFO:root:Processed 0 lines out of 434
INFO:root:Processed 100 lines out of 434
INFO:root:Processed 200 lines out of 434
INFO:root:Processed 300 lines out of 434
INFO:root:Processed 400 lines out of 434
INFO:root:Transcript saved to ./Sample Title.md


# Sample Title: hi guys, welcome to code Artisan lab.

This is video number on 104 for learning management system in Django with reactjs.

In this video we will add the Social Links to the teacher profile.

Okay, like we have mentioned in our in our feature list, we have mentioned that that the teacher should have the profile.

Teacher should have the profile with the social accounts, right, so let's do this.

So here I have opened the profile setting.

Okay, let's open the ah.

Let us open the DB browser, also right, and then open the database.

Here we are all right.

Now let's open the model, the teacher model, main models- okay, and we will go to the teacher model heavier.

Okay, here what we will do, we will add a Facebook URL, right, Model start URL field, okay, and we will say null, equal to 2, because we already have values, okay, and then we will say Twitter URL: right, and then we will say insta URL, Instagram URL- okay, and if any website, then website you are.

You can add 