### Steps:
* Extract audio from the videos.
* Transcribe the audio to text.
* Split the transcriptions into chunks, embed them, and save the embeddings in a vector database (Vector DB).
* Retrieve the chunks most similar to a query from the Vector DB.
* Use the retrieved chunks as context for Large Language Model (LLM) response generation.

In [42]:
import glob
import os
import pathlib

import torch
from tqdm.notebook import tqdm
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from yt_dlp import YoutubeDL

# Loaders and splitters are imported from langchain_community / langchain_text_splitters;
# the older langchain.document_loaders and langchain.text_splitter paths are deprecated.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [39]:
os.environ['SENTENCE_TRANSFORMERS_HOME'] = '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/huggingface-models'

### Download the videos (audio only)

In [26]:
VIDEO_LINK = 'https://www.youtube.com/playlist?list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6'
AUDIO_INPUTS = '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs'
TRANSCRIPTIONS = '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/transcriptions'

cache_dir = '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/huggingface-models'

In [3]:
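path = pathlib.Path(AUDIO_INPUTS)
path.mkdir(parents=True, exist_ok=True)  # make sure the download directory exists
# format 140: audio-only (m4a); "paths" sets the download directory
with YoutubeDL(params={'format': '140', 'paths': {'home': path.as_posix()}}) as ydl:
    ydl.download([VIDEO_LINK])  # download() takes a list of URLs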

[youtube:tab] Extracting URL: https://www.youtube.com/playlist?list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6
[youtube:tab] PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6: Downloading webpage
[youtube:tab] PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6: Redownloading playlist API JSON with unavailable videos
[download] Downloading playlist: Sequence Models (Course 5 of the Deep Learning Specialization)
[youtube:tab] PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6 page 1: Downloading API JSON
[youtube:tab] Playlist Sequence Models (Course 5 of the Deep Learning Specialization): Downloading 6 items of 6
[download] Downloading item 1 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=_i3aqgKVNQI
[youtube] _i3aqgKVNQI: Downloading webpage
[youtube] _i3aqgKVNQI: Downloading tv client config
[youtube] _i3aqgKVNQI: Downloading player 612f74a3-main
[youtube] _i3aqgKVNQI: Downloading tv player API JSON
[youtube] _i3aqgKVNQI: Downloading ios player API JSON
[youtube] _i3aqgKVNQI: Downloading m3u8 information
[info] _i3aqgKVNQI



[download] Downloading item 2 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=Er2ucMxjdHE
[youtube] Er2ucMxjdHE: Downloading webpage
[youtube] Er2ucMxjdHE: Downloading tv client config
[youtube] Er2ucMxjdHE: Downloading tv player API JSON
[youtube] Er2ucMxjdHE: Downloading ios player API JSON
[youtube] Er2ucMxjdHE: Downloading m3u8 information
[info] Er2ucMxjdHE: Downloading 1 format(s): 140
[download] Destination: /Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L02 Picking the most likely sentence [Er2ucMxjdHE].m4a
[download] 100% of    8.12MiB in 00:00:03 at 2.52MiB/s   




[download] Downloading item 3 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=DejHQYAGb7Q
[youtube] DejHQYAGb7Q: Downloading webpage
[youtube] DejHQYAGb7Q: Downloading tv client config
[youtube] DejHQYAGb7Q: Downloading tv player API JSON
[youtube] DejHQYAGb7Q: Downloading ios player API JSON
[youtube] DejHQYAGb7Q: Downloading m3u8 information
[info] DejHQYAGb7Q: Downloading 1 format(s): 140
[download] Destination: /Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L06 Bleu Score (Optional) [DejHQYAGb7Q].m4a
[download] 100% of   15.22MiB in 00:00:06 at 2.52MiB/s     




[download] Downloading item 4 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=SysgYptB198
[youtube] SysgYptB198: Downloading webpage
[youtube] SysgYptB198: Downloading tv client config
[youtube] SysgYptB198: Downloading tv player API JSON
[youtube] SysgYptB198: Downloading ios player API JSON
[youtube] SysgYptB198: Downloading m3u8 information
[info] SysgYptB198: Downloading 1 format(s): 140
[download] Destination: /Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L07 Attention Model Intuition [SysgYptB198].m4a
[download] 100% of    8.81MiB in 00:00:02 at 3.08MiB/s   




[download] Downloading item 5 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=quoGRI-1l0A
[youtube] quoGRI-1l0A: Downloading webpage
[youtube] quoGRI-1l0A: Downloading tv client config
[youtube] quoGRI-1l0A: Downloading tv player API JSON
[youtube] quoGRI-1l0A: Downloading ios player API JSON
[youtube] quoGRI-1l0A: Downloading m3u8 information
[info] quoGRI-1l0A: Downloading 1 format(s): 140
[download] Destination: /Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L08 Attention Model [quoGRI-1l0A].m4a
[download] 100% of   11.25MiB in 00:00:04 at 2.52MiB/s   




[download] Downloading item 6 of 6
[youtube] Extracting URL: https://www.youtube.com/watch?v=vm2SI8AJY0s
[youtube] vm2SI8AJY0s: Downloading webpage
[youtube] vm2SI8AJY0s: Downloading tv client config
[youtube] vm2SI8AJY0s: Downloading tv player API JSON
[youtube] vm2SI8AJY0s: Downloading ios player API JSON
[youtube] vm2SI8AJY0s: Downloading m3u8 information
[info] vm2SI8AJY0s: Downloading 1 format(s): 140
[download] Destination: /Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L09 SpeechRecog [vm2SI8AJY0s].m4a
[download] 100% of    8.24MiB in 00:00:03 at 2.37MiB/s   




[download] Finished downloading playlist: Sequence Models (Course 5 of the Deep Learning Specialization)


In [24]:
audio_paths = sorted(glob.glob(AUDIO_INPUTS + '/*.m4a'))
audio_paths

['/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L01 Basic Models [_i3aqgKVNQI].m4a',
 '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L02 Picking the most likely sentence [Er2ucMxjdHE].m4a',
 '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L06 Bleu Score (Optional) [DejHQYAGb7Q].m4a',
 '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L07 Attention Model Intuition [SysgYptB198].m4a',
 '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L08 Attention Model [quoGRI-1l0A].m4a',
 '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/audio_inputs/C5W3L09 SpeechRecog [vm2SI8AJY0s].m4a']

### Transcribe the audio

In [22]:
# Default configuration based on the distil-whisper/distil-small.en model card on Hugging Face

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-small.en"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    cache_dir=cache_dir,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Device set to use cpu


In [29]:
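# Transcribe each audio file and save the text as a UTF-8 .txt file
pathlib.Path(TRANSCRIPTIONS).mkdir(parents=True, exist_ok=True)

for audio_path in tqdm(audio_paths):
    # long-form audio is processed in 15-second chunks, 32 chunks per batch
    transcriptions = pipe(audio_path, chunk_length_s=15, batch_size=32, return_timestamps=True)

    text = transcriptions['text']
    video_title = pathlib.Path(audio_path).stem  # file name without the .m4a extension

    file_path = pathlib.Path(TRANSCRIPTIONS) / f'{video_title}.txt'

    with open(file_path, "w", encoding="utf-8") as outfile:
        outfile.write(text)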

  0%|          | 0/6 [00:00<?, ?it/s]
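
The `chunk_length_s=15` argument above makes the pipeline cut each recording into short windows before transcribing them. The sketch below is only a simplified illustration of that idea, not the pipeline's actual implementation: `chunk_windows` is a hypothetical helper, and the 2.5-second overlap on each side is an assumed value chosen for the example.

```python
def chunk_windows(total_s, chunk_s=15.0, stride_s=2.5):
    """Return (start, end) windows in seconds that cover total_s of audio.

    Consecutive windows overlap by 2 * stride_s so that words cut at a
    window boundary appear whole in at least one window.
    stride_s=2.5 is an illustrative assumption, not a library default.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than twice stride_s")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # the last window reached the end of the audio
            break
        start += step
    return windows


# A hypothetical 40-second recording split into 15-second windows
windows = chunk_windows(40.0, chunk_s=15.0, stride_s=2.5)
for start, end in windows:
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Shorter windows keep memory use low and allow batching (here `batch_size=32`), at the cost of more boundaries where words can be split.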



### Split the documents

In [31]:
loader = DirectoryLoader(
    path = TRANSCRIPTIONS + '/',
    glob="./*.txt",
    loader_cls=TextLoader,
    show_progress=True
)

documents = loader.load()
print('Documents: ',len(documents))

100%|██████████| 6/6 [00:00<00:00, 671.41it/s]

Documents:  6





In [32]:
documents[0]

Document(metadata={'source': '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/transcriptions/C5W3L06 Bleu Score (Optional) [DejHQYAGb7Q].txt'}, page_content=" One of the challenges of machine translation is that given a French sentence there could be multiple English translations that are equally good translations in that French sentence. So how do you evaluate a machine translation system if there are multiple equally good answers? Unlike say image recognition where there's one right answer, if you just measure accuracy, If there are multiple great answers how do you measure accuracy? The way this is done conventionally is with something called the blue score. So in this optional video I want to share of you, I want to give you a sense of how the blue score works. Let's say you are given a French sentence, the shyest of the tabby, and you are given a reference human-generated translation of this, which is the cat is on the mat, but there are multiple pretty good translati

In [34]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 0
)
docs = text_splitter.split_documents(documents)

In [36]:
len(docs)

53
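
To give a feel for what `chunk_size = 1000` with `chunk_overlap = 0` does, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` is implemented (the real splitter works through a list of separators); `split_text` is a hypothetical helper that only packs whitespace-separated words into chunks.

```python
def split_text(text, chunk_size=1000):
    """Greedily pack words into chunks of at most chunk_size characters.

    No overlap between chunks. A single word longer than chunk_size is
    kept whole, so such a chunk can exceed the limit.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)  # current chunk is full, start a new one
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "one two three four five six"
chunks = split_text(sample, chunk_size=10)
print(chunks)
```

With no overlap, a sentence that straddles a boundary is divided between two chunks, which is worth keeping in mind when reading the retrieved passages later on.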

### Vector Embedding Retrieval

In [40]:
embeddings = HuggingFaceEmbeddings(
    model_name='BAAI/bge-small-en-v1.5', 
    model_kwargs={'device': 'cpu'},
    show_progress=True
)

In [45]:
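# First run: build the index with FAISS.from_documents(docs, embeddings) and persist it with db.save_local(...).
# Later runs: load the saved index instead of re-embedding every chunk.
# allow_dangerous_deserialization=True unpickles data from disk, so only load an index you created yourself.
db = FAISS.load_local('/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/rag-optimizations/faiss_videos', embeddings, allow_dangerous_deserialization=True)
# db = FAISS.from_documents(docs, embeddings)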

In [44]:
# db.save_local('/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/rag-optimizations/faiss_videos')
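
Instead of commenting lines in and out between runs, the build-once-then-load pattern can be wrapped in a small helper. This is a sketch under assumptions: `load_or_build` and the three callables are hypothetical names, and the demo uses a toy JSON file in place of a real FAISS index so that it runs without any model.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(index_dir, load_fn, build_fn, save_fn):
    """Load an index from index_dir if it exists, otherwise build and save it."""
    index_dir = Path(index_dir)
    if index_dir.exists() and any(index_dir.iterdir()):
        return load_fn(index_dir)
    index = build_fn()
    index_dir.mkdir(parents=True, exist_ok=True)
    save_fn(index, index_dir)
    return index


# Toy stand-ins for the expensive embedding step and the on-disk format
calls = []

def demo_build():
    calls.append("build")
    return {"vectors": [1, 2, 3]}

def demo_save(index, directory):
    (directory / "index.json").write_text(json.dumps(index))

def demo_load(directory):
    calls.append("load")
    return json.loads((directory / "index.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "faiss_videos"
    first = load_or_build(target, demo_load, demo_build, demo_save)   # builds
    second = load_or_build(target, demo_load, demo_build, demo_save)  # loads

print(calls)
```

With the objects from this notebook, the callables would presumably wrap `FAISS.from_documents(docs, embeddings)`, `db.save_local(...)` and `FAISS.load_local(..., embeddings, allow_dangerous_deserialization=True)`.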

In [46]:
vector_res = db.similarity_search("What is sequence to sequence model?", k=10)
vector_res

[Document(id='b93b2ac5-8f49-4408-8d0a-281bade1a19f', metadata={'source': '/Users/wangzeyu/Desktop/Github projects/legalai-chatbot/data/transcriptions/C5W3L01 Basic Models [_i3aqgKVNQI].txt'}, page_content="Although it turns out there are multiple groups coming up with very similar models independently and at about the same time. So two other groups that had done very similar work at about the same time and I think independently of Maudal where all your vinnials exander Tocia of Sammy Benjo and Dimitri Urhan as well as Andrekapathy and Faye Faye. So you've now seen how a basic sequence to sequence model works, how a basic image to sequence or image captioning model works. But there are some differences between how you would run a model like this to generate a sequence compared to how you were synthesizing novel text using a language model. One of the key differences is you don't want to randomly chosen translation. You maybe want the most likely translation. You don't want to randomly c
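
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The brute-force sketch below illustrates that idea with made-up 3-dimensional vectors. It assumes Euclidean (L2) distance, which I believe is the default for the LangChain FAISS wrapper; an index configured for cosine or inner-product similarity could rank unnormalized vectors differently. The names and vectors are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, doc_vecs, k=2):
    """Return the k document keys closest to query_vec, nearest first."""
    ranked = sorted(doc_vecs, key=lambda name: squared_l2(query_vec, doc_vecs[name]))
    return ranked[:k]


# Made-up embeddings; real bge-small-en-v1.5 vectors have 384 dimensions
doc_vecs = {
    "seq2seq": [0.9, 0.1, 0.0],
    "bleu": [0.1, 0.9, 0.0],
    "attention": [0.7, 0.2, 0.3],
}
query_vec = [1.0, 0.0, 0.0]

top = nearest(query_vec, doc_vecs, k=2)
print(top)
```

FAISS produces the same kind of ranking without comparing the query against every vector in pure Python, which matters once the index grows beyond a few dozen chunks.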

### LLM Response

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Keep the answer as concise as possible.

Context: {context}

Question: {question}

Helpful Answer:"""

In [None]:
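from langchain_openai import ChatOpenAI

# model, base_url and api_key are placeholders; set them for your own endpoint
llm = ChatOpenAI(model='', base_url='', api_key='n')  # named llm so it does not overwrite the Whisper `model` above
rag_res = llm.invoke(
    prompt_template.format(
        context='\n\n'.join([i.page_content for i in vector_res]),  # blank line between chunks
        question="What is sequence to sequence model?"
    )
)
print(rag_res.content)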
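
How the retrieved chunks are joined into the `{context}` slot affects what the LLM sees. As one possible variation, the sketch below numbers each chunk, labels it with its source, and stops once a character budget is used up. `Chunk` is a hypothetical stand-in for a LangChain `Document`, and `build_context` and `max_chars` are illustrative names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(chunks, max_chars=4000):
    """Join chunks into one context string, labelled by source.

    The first chunk is always included; later chunks are added until the
    combined length of the labelled blocks would exceed max_chars.
    """
    parts, used = [], 0
    for n, chunk in enumerate(chunks, start=1):
        source = chunk.metadata.get("source", "unknown")
        block = f"[{n}] (source: {source})\n{chunk.page_content.strip()}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


chunks = [Chunk(" alpha ", {"source": "a.txt"}), Chunk("beta")]
context = build_context(chunks)
print(context)
```

Because it only reads `page_content` and `metadata`, the same function should also accept the `vector_res` list returned by `similarity_search`.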

In [None]:
# Refer to https://python.langchain.com/docs/versions/migrating_chains/retrieval_qa/ 
# for building rag chain