# YouTube Audio Download and Transcription Script
This script downloads audio from YouTube videos, transcribes it using OpenAI's Whisper model, and saves the transcriptions to a CSV file. It’s designed for bulk processing, allowing concurrent downloads and transcriptions for faster results.

# User Requirements:

### Input CSV:
Prepare a CSV file (e.g., `video_data.csv`) with one row per video and two columns: `link` (the YouTube video URL) and `title` (the video title). Upload this file to Google Drive and set `csv_input_path` in the code to its path.
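
The main code reads the `link` and `title` columns from this file. As a minimal sketch, such a file could be created with Python's standard `csv` module. The helper name `write_video_csv` and the example rows are placeholders for illustration, not part of the script.

```python
import csv

def write_video_csv(path, rows):
    """Write (link, title) pairs to a CSV with the header the script expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["link", "title"])
        writer.writeheader()
        for link, title in rows:
            writer.writerow({"link": link, "title": title})

# Placeholder rows: replace with your own video URLs and titles.
example_rows = [
    ("https://www.youtube.com/watch?v=VIDEO_ID_1", "First example video"),
    ("https://www.youtube.com/watch?v=VIDEO_ID_2", "Second example video"),
]
write_video_csv("video_data.csv", example_rows)
```

The resulting `video_data.csv` can then be uploaded to Google Drive.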

### Output Location:
Set `output_path` to the full path, in Google Drive, of the CSV file where the transcriptions should be saved.

This script will:

* Download each video’s audio, transcribe it, and remove the audio file after transcription.
* Save the transcription results in a CSV file at your specified Google Drive location.
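
The per-video flow described above (download, transcribe, delete the audio) can be sketched as follows. The function `process_video` and its `download` and `transcribe` parameters are hypothetical stand-ins for the real functions defined later in this notebook. The `try`/`finally` is one possible way to make sure the audio file is removed even when transcription fails; it is an assumption for illustration, not what the main code does.

```python
import os

def process_video(url, title, download, transcribe):
    """Download one video's audio, transcribe it, and always clean up.

    `download(url, title)` must return a file path or None;
    `transcribe(path)` must return the transcript text.
    """
    audio_path = download(url, title)
    if audio_path is None:
        return None
    try:
        return transcribe(audio_path)
    finally:
        # Remove the temporary audio file whether or not transcription succeeded
        if os.path.exists(audio_path):
            os.remove(audio_path)
```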

# Install what we need, just press play

In [None]:
# Only yt-dlp and openai-whisper are imported by the code below (pandas ships with Colab)
!pip install -U yt-dlp openai-whisper

# Connect your Google Drive

This step mounts your Google Drive in the Colab environment so the notebook can read and write files stored in your Drive.
Run the cell and follow the authorization prompts.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Main Code Function
This code downloads audio from YouTube videos, transcribes it using OpenAI's Whisper model, and saves the transcriptions in a CSV file in Google Drive.

### User Instructions
Before running the code, go through each section and check for lines marked `# ===== USER ACTION REQUIRED =====`. These are lines where you need to provide input or make adjustments for the script to run correctly.

### The user is required to:

* Specify the Input CSV Path: Update `csv_input_path` to point to the location of your CSV file in Google Drive. This file should contain the YouTube video URLs and titles.

* Set the Output Path: Modify `output_path` to specify where you want the final CSV with transcriptions saved in your Google Drive.

* Ensure Drive Mounting and Folder Setup: Run the code to mount Google Drive, and confirm that your input file is uploaded and accessible at the specified path.
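
Before starting a long run, it can help to confirm that both paths are usable. A minimal sketch, assuming the same placeholder paths as the main code; the helper `check_paths` is a hypothetical name and not part of the script.

```python
import os

def check_paths(csv_input_path, output_path):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if not os.path.isfile(csv_input_path):
        problems.append(f"Input CSV not found: {csv_input_path}")
    output_dir = os.path.dirname(output_path) or "."
    if not os.path.isdir(output_dir):
        problems.append(f"Output folder does not exist: {output_dir}")
    return problems

# Placeholder paths copied from the main code; replace with your own.
for problem in check_paths(
    "/content/drive/My Drive/your_input_folder/video_data.csv",
    "/content/drive/My Drive/your_output_folder/output_with_transcriptions.csv",
):
    print(problem)
```

If nothing is printed, the input file was found and the output folder exists.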


### Important Reminders
* Run Time: If you have a large number of files or are running on a CPU, this process may take considerable time. Colab’s free tier has limited resources and can disconnect if the session is left idle or if your computer shuts down.
* Upgrade for Better Performance: Consider using Colab Pro for access to faster GPUs, such as the A100, which can significantly reduce processing time for intensive tasks like transcription.
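
To see whether the current runtime actually has a GPU, a quick check along these lines can be run before the main code. This is a sketch: `detect_device` is a hypothetical helper, and it relies on PyTorch, which is installed as a dependency of `openai-whisper`.

```python
def detect_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # Installed as a dependency of openai-whisper
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = detect_device()
print(f"Transcription will run on: {device}")
```

If this reports `cpu` in Colab, a GPU can usually be selected under Runtime > Change runtime type.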

Output: As the code runs, it prints a message for each video when its audio download starts and when its temporary audio file is deleted after transcription, along with any download or transcription errors. A final message confirms where the output CSV was saved.

In [None]:
import yt_dlp  # YouTube downloader
import os
import whisper  # Whisper model for transcription
from concurrent.futures import ThreadPoolExecutor  # For concurrent processing
import warnings
import pandas as pd

# Suppress specific warnings to keep the output clean
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Function to download audio from YouTube using yt_dlp
def download_audio(url, title, output_folder):
    print(f"Downloading audio: {title} from {url}")

    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': f'{output_folder}/{title}.%(ext)s',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
            'preferredquality': '192',
        }],
        'noplaylist': True,
        'quiet': True,
    }

    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])
        return f"{output_folder}/{title}.wav"
    except Exception as e:
        print(f"Error downloading {title}: {e}")
        return None

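# Function to transcribe audio using Whisper
import threading

_whisper_model = None  # Loaded once and shared by all worker threads
_whisper_lock = threading.Lock()  # Serializes model loading and inference

def transcribe_audio(audio_path):
    global _whisper_model
    if audio_path is None:
        return None

    try:
        # Hold the lock so the model is loaded only once and used by one thread at a time
        with _whisper_lock:
            if _whisper_model is None:
                _whisper_model = whisper.load_model("base")
            transcription = _whisper_model.transcribe(audio_path)
        return transcription['text']
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None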

# Function to handle downloading and transcribing a single row (video)
def download_and_transcribe(row):
    url = row['link']
    title = row['title']

    # Download audio
    audio_path = download_audio(url, title, output_folder)

    # Transcribe audio
    transcription = transcribe_audio(audio_path)
    row['transcription'] = transcription

    # Remove the audio file after transcription to save storage
    if audio_path and os.path.exists(audio_path):
        os.remove(audio_path)
        print(f"Deleted audio file: {audio_path}")

    return transcription

# Main script
if __name__ == "__main__":
    # Define paths
    output_folder = "youtube_mp4"  # Temporary folder for downloaded audio files

    # ===== USER ACTION REQUIRED =====
    # Set the path where the final CSV with transcriptions will be saved in Google Drive
    output_path = '/content/drive/My Drive/your_output_folder/output_with_transcriptions.csv'

    # ===== USER ACTION REQUIRED =====
    # Set the path to the CSV file containing the video links and titles
    # Make sure this CSV file (e.g., 'video_data.csv') is uploaded to your Google Drive and replace the path below
    csv_input_path = "/content/drive/My Drive/your_input_folder/video_data.csv"

    # Create the temporary audio folder and the output folder if they do not exist yet
    os.makedirs(output_folder, exist_ok=True)
    os.makedirs(os.path.dirname(output_path), exist_ok=True)

    # Load your CSV file with video data
    try:
        df = pd.read_csv(csv_input_path)  # Load CSV file containing video URLs and titles
    except FileNotFoundError:
        print(f"Error: The file {csv_input_path} was not found. Please check the file path and ensure it is uploaded to Google Drive.")
        raise

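    # Process each video concurrently
    # Each worker receives a copy of its row, so the returned transcription
    # is written back into the DataFrame here, keyed by the row index
    df['transcription'] = None
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(download_and_transcribe, row): idx for idx, row in df.iterrows()}
        for future in futures:
            idx = futures[future]
            try:
                df.at[idx, 'transcription'] = future.result()  # Wait for each future to complete
            except Exception as e:
                print(f"Error processing {df.at[idx, 'title']}: {e}")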

    print("Processing complete.")


    # Save the final DataFrame, including transcriptions, to a CSV file
    df.to_csv(output_path, index=False)  # This saves the output CSV with the transcriptions
    print(f"Final output saved to {output_path}")
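
One thing the code above does not handle: the video title is used directly as a file name, so a title containing `/`, `%`, or similar characters can produce a path that does not match the file actually written to disk. A minimal sketch of a sanitizing helper follows. `safe_filename` is a hypothetical name, and the set of characters it keeps is an assumption chosen for illustration.

```python
import re

def safe_filename(title, max_length=100):
    """Replace characters that are risky in file names with underscores."""
    # Keep letters, digits, whitespace, dots, hyphens and underscores; replace the rest
    cleaned = re.sub(r"[^\w\s.-]", "_", str(title))
    # Collapse runs of whitespace and trim the ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[:max_length] or "untitled"

print(safe_filename("AC/DC: Live?"))
```

One way to use it would be to call `safe_filename(title)` once at the start of `download_audio` and build both the `outtmpl` value and the returned `.wav` path from the cleaned title.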
