<a href="https://colab.research.google.com/github/gisturiz/gpt4-vision/blob/main/GPT4_vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the required dependencies. The pip dependency-resolver errors about cohere and tiktoken can be ignored.

In [1]:
!pip install -q pytube openai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.

Set your OpenAI API key as an environment variable. If you don't have one yet, you can create one at https://platform.openai.com/

In [2]:
%env OPENAI_API_KEY=<your-openai-api-key>

env: OPENAI_API_KEY=<your-openai-api-key>
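
Pasting a key directly into a cell stores it in the notebook and in the cell output, so it is easy to leak when the notebook is shared. As a minimal sketch, the key can instead be read from an environment variable that was set outside the notebook. `require_api_key` is a hypothetical helper written for illustration, not part of the OpenAI library.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    Hypothetical helper for illustration; `env` can be any mapping, which
    makes the function easy to try without touching the real environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return key


# Example with a stand-in mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

`OpenAI()` reads `OPENAI_API_KEY` from the environment by default, so once the variable is set nothing needs to be passed explicitly.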


Import all of the dependencies we will be using

In [3]:
from IPython.display import display, Image, Audio

import cv2
import base64
import time
import os
import re
import requests
from openai import OpenAI
from pytube import YouTube

client = OpenAI()

Helper functions for formatting the video title and downloading the video

In [4]:
def format_video_title(title: str) -> str:
    """
    Formats the video title for use as a filename: converts it to lowercase,
    removes special characters (such as commas and hash symbols), and collapses
    runs of spaces and hyphens into a single hyphen.

    Args:
        title (str): The original video title.

    Returns:
        str: The formatted video title.
    """
    title = title.lower()
    title = re.sub(r'[^\w\s-]', '', title)
    title = re.sub(r'[-\s]+', '-', title)
    return title

def download_youtube_video(url: str, output_path: str = '.'):
    """
    Downloads a YouTube video to a specified output path.

    Args:
        url (str): URL of the YouTube video.
        output_path (str): Path where the video will be saved. Defaults to the current directory.

    Returns:
        str | None: The filename of the downloaded video (relative to output_path),
        or None if no suitable stream was found or an error occurred.
    """
    try:
        yt = YouTube(url)
        # Pick the highest-resolution progressive (audio + video) MP4 stream.
        video_stream = yt.streams.filter(
            progressive=True,
            file_extension='mp4').order_by('resolution').desc().first()

        if video_stream:
            formatted_title = format_video_title(yt.title) + '.mp4'
            video_stream.download(output_path=output_path, filename=formatted_title)
            print(f"Video downloaded successfully: {formatted_title}")
            return formatted_title
        else:
            print("No suitable video stream found.")
    except Exception as e:
        print(f"An error occurred: {e}")
    return None
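
To see what `format_video_title` does, the sketch below repeats its two substitutions on made-up titles (the title strings are assumptions; the real title comes from `yt.title`). It also shows that leading or trailing whitespace survives as a hyphen. Stripping it is an optional extra step, not something the helper above does.

```python
import re


def format_video_title(title: str) -> str:
    # Same steps as the helper above.
    title = title.lower()
    title = re.sub(r'[^\w\s-]', '', title)  # drop punctuation such as , # | '
    title = re.sub(r'[-\s]+', '-', title)   # collapse spaces/hyphens into one hyphen
    return title


# Hypothetical title, chosen for illustration only.
print(format_video_title("Lioness Chases Zebra | Nature's Great Events | BBC One"))
# lioness-chases-zebra-natures-great-events-bbc-one

# Surrounding whitespace becomes a leading/trailing hyphen ...
raw = format_video_title("  Hello, World! #1  ")
print(raw)             # -hello-world-1-
# ... which can be stripped if a cleaner filename is wanted.
print(raw.strip('-'))  # hello-world-1
```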

Insert here the URL of the [YouTube](https://www.youtube.com/) video you'd like to download.

In [5]:
YT_URL='https://www.youtube.com/watch?v=INcW26-iyqU'

Download the video

In [6]:
video_path = download_youtube_video(url=YT_URL)

Video downloaded successfully: lioness-chases-zebra-natures-great-events-bbc-one.mp4


Split the video into frames so that we can pass individual frames (images) to the GPT-4 Vision endpoint.

In [7]:
video = cv2.VideoCapture(video_path)

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

947 frames read.
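
`base64.b64encode` returns `bytes`, which is why each frame is decoded to a UTF-8 string before being stored: strings can be embedded directly in a JSON request. A small round-trip sketch, using a few stand-in bytes rather than a real JPEG buffer:

```python
import base64

# Stand-in for the JPEG bytes produced by cv2.imencode (not a real image).
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(range(8))

encoded = base64.b64encode(fake_jpeg).decode("utf-8")  # str, safe to put in JSON
decoded = base64.b64decode(encoded)                    # back to the original bytes

print(type(encoded).__name__, len(fake_jpeg), len(encoded))  # str 12 16
print(decoded == fake_jpeg)                                  # True
```

Base64 text is about a third larger than the raw bytes, so holding every frame of a long video in memory this way can get expensive.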


*Optional* - you can shorten the video by slicing the base64Frames list. This may be necessary because GPT-4 Vision has a limit of only 10,000 tokens per minute during the preview.

In [8]:
base64Frames = base64Frames[475:775]
print(len(base64Frames), "frames read.")

300 frames read.
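
Slice indices are frame numbers, so choosing them by timestamp requires the frame rate. The sketch below assumes a constant rate of about 25 frames per second, which is what 947 frames over the clip's 37.89-second duration (reported by ffmpeg further down) suggest. `seconds_to_slice` is a hypothetical helper.

```python
def seconds_to_slice(start_s: float, end_s: float, fps: float) -> slice:
    """Convert a time window in seconds into a slice over a list of frames.

    Hypothetical helper; assumes a constant frame rate.
    """
    if end_s < start_s:
        raise ValueError("end_s must not be smaller than start_s")
    return slice(round(start_s * fps), round(end_s * fps))


fps = 25  # assumed frame rate
window = seconds_to_slice(19, 31, fps)
print(window.start, window.stop)  # 475 775

frames = list(range(947))   # stand-in for base64Frames
print(len(frames[window]))  # 300
```

In the notebook, the actual rate can be read with `video.get(cv2.CAP_PROP_FPS)` before calling `video.release()`.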


We can display the frames within our base64Frames list to verify that the video was read correctly.

In [None]:
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.025)

Here we pass our base64Frames list, along with our instructions, to GPT-4 Vision to produce the text we want (narration, description, etc.).
Note that I have included several variables you can set to improve the prompt:


*   *frame_step* - the sampling interval: only every frame_step-th frame in the list is sent to GPT-4 Vision (note that this affects how many tokens count against the tokens-per-minute limit)
*   *video_length* - tells the prompt roughly how long, in seconds, the text should take to read for a given video length (if you don't want narration, you can delete this part of the prompt)
*   *prompt_instructions* - tells GPT-4 Vision what you would like as a response, along with any other instructions for the model. Feel free to tinker here to get the best response.
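
To make the effect of `frame_step` concrete: slicing with `[0::frame_step]` keeps the first frame and then every `frame_step`-th frame after it. A quick sketch with a stand-in list of 300 items, matching the trimmed frame count above:

```python
frame_step = 35
frames = list(range(300))  # stand-in for the 300 base64-encoded frames

sampled = frames[0::frame_step]
print(len(sampled))  # 9
print(sampled)       # [0, 35, 70, 105, 140, 175, 210, 245, 280]
```

A smaller `frame_step` sends more images per request and therefore uses more tokens.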

In [13]:
frame_step = 35
video_length = 30
prompt_instructions = f"These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration. Make the text succinct so that it can be read out in {video_length} seconds."

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [prompt_instructions, *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::frame_step]),
        ],
    },
]
params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

In the great plains of Africa, a drama unfolds. The lioness, the quintessence of predatorial grace, locks onto her target – a zebra, striped sentinel of the savanna. With a burst of speed, the chase is on. Every muscle in the lioness's body propels her forward, the gap closing with each bound. The zebra, in a desperate bid for survival, twists and turns, its striped coat a blur against the green tapestry beneath. But nature's script is often written in the pursuit's final act. The lioness's power is overwhelming, and with a precision born of eons, she brings down her prey. Here, in this ancient dance of predator and prey, life and survival hinge on moments just like this.
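
The request above passes each frame as `{"image": ..., "resize": 768}`. The Chat Completions API also documents a more explicit content format, in which text and images are separate typed parts and each image is given as a URL, which may be a base64 data URL. The sketch below builds a message that way. `build_vision_messages` is a hypothetical helper, and the frames are stand-in strings rather than real images.

```python
def build_vision_messages(instructions: str, b64_frames: list[str]) -> list[dict]:
    """Build a single user message with one text part and one part per image."""
    content = [{"type": "text", "text": instructions}]
    for frame in b64_frames:
        content.append({
            "type": "image_url",
            # JPEG frames are embedded inline as data URLs.
            "image_url": {"url": f"data:image/jpeg;base64,{frame}"},
        })
    return [{"role": "user", "content": content}]


messages = build_vision_messages("Describe these frames.", ["AAAA", "BBBB"])
print(len(messages[0]["content"]))                    # 3
print(messages[0]["content"][1]["image_url"]["url"])  # data:image/jpeg;base64,AAAA
```

The resulting list could be passed as `messages=` to `client.chat.completions.create`. Whether the shorthand used in the cell above is still accepted may depend on the model and API version.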


Generate the voice-over with OpenAI text-to-speech

In [14]:
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input=result.choices[0].message.content,
)


Save the audio to an MP3 file and play it to verify.

In [15]:
response.stream_to_file("output.mp3")
Audio("output.mp3")

Compare the video and audio lengths

In [12]:
!ffmpeg -i {video_path} 2>&1 | grep "Duration"
!ffmpeg -i output.mp3 2>&1 | grep "Duration"

  Duration: 00:00:37.89, start: 0.000000, bitrate: 963 kb/s
  Duration: 00:01:00.72, start: 0.000000, bitrate: 160 kb/s
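
The two durations can also be compared programmatically. As a sketch, the helper below (hypothetical, not part of ffmpeg) parses the `Duration:` field from lines like the ones printed above and converts it to seconds:

```python
import re


def parse_duration(line: str) -> float:
    """Extract HH:MM:SS.ss from an ffmpeg 'Duration:' line and return seconds."""
    match = re.search(r"Duration:\s*(\d+):(\d+):(\d+(?:\.\d+)?)", line)
    if match is None:
        raise ValueError(f"no duration found in: {line!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


video_s = parse_duration("  Duration: 00:00:37.89, start: 0.000000, bitrate: 963 kb/s")
audio_s = parse_duration("  Duration: 00:01:00.72, start: 0.000000, bitrate: 160 kb/s")
print(f"{video_s:.2f} {audio_s:.2f}")  # 37.89 60.72
print(round(audio_s - video_s, 2))     # 22.83
```

Here the narration runs about 23 seconds longer than the full video.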


Combine the video and the generated audio to overlay the newly created narration on the original video. Note: I tweaked some settings for my particular overlay, such as lowering the original video's volume by half so the narration can be heard clearly. See the [ffmpeg documentation](https://ffmpeg.org/documentation.html) should you want to make your own changes.

In [None]:
!ffmpeg -i {video_path} -i output.mp3 -filter_complex "[0:a]volume=0.5[a0];[a0][1:a]amix=inputs=2:duration=longest[a]" -map 0:v -map "[a]" -c:v copy -c:a aac -strict experimental output.mp4
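
Outside a notebook, or when the filename may contain spaces, the same command can be run through `subprocess` with an argument list, which avoids shell quoting issues. The sketch below only reuses the options from the command above. `build_overlay_command` is a hypothetical helper, and the file names in the example are placeholders.

```python
import subprocess


def build_overlay_command(video_path: str, audio_path: str, output_path: str) -> list[str]:
    """Return the ffmpeg argument list that mixes narration over the video's audio."""
    filter_graph = (
        "[0:a]volume=0.5[a0];"                        # halve the original audio volume
        "[a0][1:a]amix=inputs=2:duration=longest[a]"  # mix it with the narration
    )
    return [
        "ffmpeg", "-i", video_path, "-i", audio_path,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[a]",
        "-c:v", "copy", "-c:a", "aac", "-strict", "experimental",
        output_path,
    ]


cmd = build_overlay_command("my video.mp4", "output.mp3", "output.mp4")
print(cmd)
# To actually run it (requires ffmpeg on the PATH and the input files):
# subprocess.run(cmd, check=True)
```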