# Vid2Blog

This notebook was inspired by [a xeet posted by Andrej Karpathy](https://twitter.com/karpathy/status/1760740503614836917). Throughout this notebook, we will attempt to take a long-form video and automatically transform it into a blog post.

## "Business" Strategy
The strategy for enabling this will evolve over time, but here is what I am thinking as of right now:

- Use the OpenAI Whisper API to transcribe the audio from the video.
- Use an LLM to break the transcription into logical segments.
- Determine the timestamp chunks from the previous step's segmented script.
- Use something (there's gotta be something!) to automatically break the video into chunks.
- Extract frames at regular intervals and save them to disk for later use. (This is tough to get right...)
- Use an LLM to summarize the segmented transcript into an outline **in JSON form**. (Important to get it in a structured form for the next steps.)
- In parallel, produce each segment of the blog post by passing in the following info in each parallel call:
    - The full outline (Maybe this needs to be converted back from JSON into plaintext?)
    - The transcribed text from that particular section
    - A GPT-4V description of the extracted images, to give a better sense of what's going on (Again, hard problem)
- Run the full blog post back through the LLM for "final clean up", ensuring cohesion and moving any attributions to the end of the post.
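
To make the outline and parallel-drafting steps above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the outline schema (`title`, `sections`, `heading`, `summary`), the helper names, and especially `draft_section`, which is only a stub standing in for the real per-section LLM call.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical outline, shaped the way we might ask the LLM to return it
outline_json = """
{
  "title": "Example Video",
  "sections": [
    {"heading": "Introduction", "summary": "Why the topic matters."},
    {"heading": "Deep Dive", "summary": "How the core idea works."}
  ]
}
"""

def outline_to_plaintext(outline: dict) -> str:
    """Flatten the JSON outline into a numbered plaintext outline."""
    lines = [outline["title"]]
    for i, section in enumerate(outline["sections"], start=1):
        lines.append(f"{i}. {section['heading']}: {section['summary']}")
    return "\n".join(lines)

def draft_section(outline_text: str, heading: str, transcript_text: str) -> str:
    """Stub for the per-section LLM call (assumption: not a real API)."""
    word_count = len(transcript_text.split())
    return f"## {heading}\n(draft from {word_count} transcript words)"

outline = json.loads(outline_json)
outline_text = outline_to_plaintext(outline)

# Hypothetical per-section transcript text, keyed by section heading
section_transcripts = {
    "Introduction": "hello and welcome to the video",
    "Deep Dive": "now let us look at the details",
}

# Fan out one call per section; map() returns results in outline order
with ThreadPoolExecutor(max_workers=4) as pool:
    drafts = list(pool.map(
        lambda s: draft_section(
            outline_text, s["heading"], section_transcripts[s["heading"]]
        ),
        outline["sections"],
    ))

blog_post = "\n\n".join(drafts)
print(blog_post)
```

Threads seem like a reasonable fit here because each real call would mostly be waiting on the network, and `pool.map` keeps the drafts in outline order so they can be joined directly.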

## Technical Strategy
Let's take the steps that we derived in the strategy above and start to write out a plan for making this happen:

1. Programmatically download the video from YouTube. (Idk how to do this, but I'm sure there's a Python client that'll do the job.)
    - Side step: Check to see if the video is already downloaded. If yes, use "cached" video.
2. Separate the audio from the video and save it as an mp3 (?) file.
3. Pass the audio into OpenAI's Whisper API. (And save the transcript so I'm not breaking my bank by running that thing too much 😂)
4. Use GPT-4 to break the transcript into segments (Will need to do a bit of prompt engineering here)
5. Somehow make that connection to determine how the timestamps align to each segment (tricky tricky...)
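
One possible way to handle step 5 is sketched below. It assumes two things that are not settled yet: that the transcript arrives as a list of entries with `text`, `start`, and `duration` fields (in seconds), and that the LLM can be prompted to return, for each segment, the index of the first transcript entry it covers. The `Segment` class and `align_segments` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    title: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video

def align_segments(entries, segment_starts):
    """Map segments to time ranges.

    entries: list of {"text", "start", "duration"} dicts, in video order.
    segment_starts: list of (title, first_entry_index), sorted by index.
    """
    video_end = entries[-1]["start"] + entries[-1]["duration"]
    segments = []
    for i, (title, first_idx) in enumerate(segment_starts):
        start = entries[first_idx]["start"]
        if i + 1 < len(segment_starts):
            # A segment ends where the next one begins
            end = entries[segment_starts[i + 1][1]]["start"]
        else:
            # The last segment runs to the end of the transcript
            end = video_end
        segments.append(Segment(title, start, end))
    return segments

# Made-up transcript entries for illustration
entries = [
    {"text": "hi everyone", "start": 0.0, "duration": 2.5},
    {"text": "today we cover tokenization", "start": 2.5, "duration": 3.0},
    {"text": "first, what is a token", "start": 5.5, "duration": 4.0},
    {"text": "thanks for watching", "start": 9.5, "duration": 2.0},
]

segments = align_segments(entries, [("Intro", 0), ("Tokens", 2)])
for seg in segments:
    print(f"{seg.title}: {seg.start:.1f}s to {seg.end:.1f}s")
```

Asking the LLM for entry indices rather than raw timestamps keeps the timing data out of the model's hands, so the start and end times always come straight from the transcript.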

## Notebook Setup
Let's do our imports and such!

In [1]:
import os
import json
import yaml
from langchain_openai import ChatOpenAI
from langchain.prompts import HumanMessagePromptTemplate
from langchain_community.document_loaders import YoutubeLoader

In [2]:
# Loading my personal OpenAI API key
with open('../keys/api-keys.yaml', 'r') as f:
    API_KEYS = yaml.safe_load(f)

In [3]:
# Setting some constants
YOUTUBE_URL = 'https://youtu.be/zduSFxRajkE?si=4sptfAH4EQq4_-gW'

## LangChain Setup
Throughout this project, I'm going to attempt to make use of **LangChain**. Because we need to instantiate a bunch of stuff, let's just go ahead and knock that out now.

In [4]:
# Instantiating the OpenAI LLM client with LangChain
llm = ChatOpenAI(
    api_key=API_KEYS['OPENAI_API_KEY'],
    model_name='gpt-4',
)

## Downloading the Video from YouTube
Wouldn't you know, LangChain has an integration to support this! It is using **pytube** behind the scenes, so you will need to install that if you haven't already. (`pip install pytube`) We also need to figure out a way to save the video as a cache, just to save us a bit of a headache as we do our work.

*Actually...*

It looks like YouTube has a transcript API. I still think there's value in trying Whisper, but just for now, I'm curious how far I can get with this transcript.

In [5]:
# Build the loader, then fetch the transcript as a list of LangChain documents
yt_transcript = YoutubeLoader.from_youtube_url(youtube_url=YOUTUBE_URL)
yt_transcript.load()
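
Since the plan calls for caching anything slow or expensive, here is a minimal sketch of a disk cache that could wrap the transcript fetch. The `load_or_compute` helper, the cache location, and `fake_fetch` are all hypothetical; `fake_fetch` just stands in for whatever actually produces the transcript.

```python
import json
import os
import tempfile

def load_or_compute(cache_path, compute):
    """Return cached JSON data if cache_path exists; otherwise call
    compute(), save its result to cache_path, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = compute()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

# Track how often the "expensive" function actually runs
calls = []

def fake_fetch():
    """Stand-in for the real transcript fetch (assumption)."""
    calls.append(1)
    return {"video_id": "demo", "text": "placeholder transcript"}

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "transcripts", "demo.json")

first = load_or_compute(cache_path, fake_fetch)   # runs fake_fetch
second = load_or_compute(cache_path, fake_fetch)  # served from disk
print(first == second, len(calls))
```

In this notebook, `compute` could plausibly be something like `lambda: [doc.page_content for doc in yt_transcript.load()]`, though that is untested here. The same helper would also work for the Whisper output later on, as long as the result is JSON-serializable.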

