---
title: Building a Full-Text Search Engine for the fastai YouTube Channel
author: "Francisco Mussari"  
date: 2022-12-04  
categories: [deeplearning, full-text search, SQLite, pytube, youtube-transcript-api]  
format:
  html:
    toc: true
    toc-depth: 3
    
---


## Introduction

In this post we get the transcriptions of the YouTube videos in one or more playlists. We use the [fastai](https://www.youtube.com/@howardjeremyp/playlists) channel as the example, but the same steps apply to any playlist whose videos have transcriptions.  

After we get the transcriptions, we are going to build a search engine with [SQLite](https://sqlite.org/index.html)'s full-text search functionality provided by its [FTS5](https://sqlite.org/fts5.html) extension.

### References
If you want to go deeper, I encourage you to read these articles:  
- [Useful Full Text Search (FTS) queries with sqlite in Python 3.7](https://saraswatmks.github.io/2020/04/sqlite-fts-search-queries.html)
- [Fast Autocomplete Search for Your Website](https://simonwillison.net/2018/Dec/19/fast-autocomplete-search/)
- [YouTube Transcript/Subtitle API (including automatically generated subtitles and subtitle translations)](https://github.com/jdepoix/youtube-transcript-api#youtube-transcriptsubtitle-api-including-automatically-generated-subtitles-and-subtitle-translations)

## Get YouTube Transcriptions

### Install and Import Libraries

First we install the two libraries we need, **pytube** and **youtube-transcript-api**.  
We can use `pip`:  
```
$ pip install pytube
$ pip install youtube_transcript_api
```
Or `conda`:
```
$ conda install -c conda-forge pytube
$ conda install -c conda-forge youtube-transcript-api
```

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi
from pytube import YouTube, Playlist

import sqlite3

### YouTube Playlists
Let's create a list of YouTube playlist ids. We can find them by browsing the channel's playlists: the **id** is the `list` parameter of the playlist **url**, which has the following format:  
`https://www.youtube.com/playlist?list={PLAYLIST_ID}`
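
If you prefer to copy whole urls from the browser, the id can be pulled out of the query string. This is a minimal sketch using only the standard library; the helper name `playlist_id_from_url` is hypothetical and not part of pytube.

```python
from urllib.parse import urlparse, parse_qs


def playlist_id_from_url(url: str) -> str:
    """Return the value of the `list` query parameter of a YouTube url."""
    query = parse_qs(urlparse(url).query)
    if "list" not in query:
        raise ValueError(f"No playlist id found in {url!r}")
    return query["list"][0]


url = "https://www.youtube.com/playlist?list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU"
print(playlist_id_from_url(url))
```

Parsing the query string rather than slicing the url keeps the helper working when the url carries extra parameters.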

In [None]:
base_pl = 'https://www.youtube.com/playlist?list='
base_yt = 'https://youtu.be/'

yt_pl_ids = [
    'PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU', # fast.ai APL Study Group #2022
    'PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU', # Practical Deep Learning for Coders 2022
    'PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM', # fast.ai live coding & tutorials #2022
    'PLfYUBJiXbdtRL3FMB3GoWHRI8ieU6FhfM', # Practical Deep Learning for Coders (2020)
    'PLfYUBJiXbdtTIdtE1U8qgyxo4Jy2Y91uj', # Deep Learning from the Foundations #2019
    'PLfYUBJiXbdtSWRCYUHh-ThVCC39bp5yiq', # fastai v2 code walk-thrus #2019
    'PLfYUBJiXbdtSIJb-Qd3pw0cqCbkGeS0xn', # Practical Deep Learning for Coders 2019
    'PLfYUBJiXbdtSyktd8A_x0JNd6lxDcZE96', # Introduction to Machine Learning for Coders
    'PLfYUBJiXbdtTttBGq-u2zeY1OTjs5e-Ia', # Cutting Edge Deep Learning for Coders 2 #2018
    'PLfYUBJiXbdtS2UQRzyrxmyVHoGW0gmLSM', # Practical Deep Learning For Coders 2018
]

### Get Transcriptions

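Before downloading everything, let's explore the methods we need on a single playlist. `Playlist` gives us the video urls, `YouTube` gives us a video's title, and `YouTubeTranscriptApi.get_transcript` returns the captions as a list of dictionaries: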

In [None]:
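playlist = Playlist('https://www.youtube.com/playlist?list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU')
print(playlist.title)
# Indexing a Playlist returns the url of a video
video = YouTube(playlist[0])
print(video.title)
print(playlist[0])
# pytube exposes the id it parsed from the url
video_id = video.video_id
# Each transcript entry is a dictionary with 'text', 'start' and 'duration'
script = YouTubeTranscriptApi.get_transcript(video_id, languages=('en',))
print(script[0])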
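
To make the shape of the data concrete, here is a small offline sketch. The helper names `video_id_from_url` and `format_start` are hypothetical, and the `entry` dictionary is made-up sample data shaped like one element of `script`, not real output from the API.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url: str) -> str:
    """Return the `v` query parameter of a YouTube watch url."""
    return parse_qs(urlparse(url).query)["v"][0]


def format_start(seconds: float) -> str:
    """Format a caption start time given in seconds as H:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Hypothetical entry with the same keys as the dictionaries in `script`
entry = {"text": "hello and welcome", "start": 75.4, "duration": 3.2}

print(video_id_from_url("https://www.youtube.com/watch?v=8SF_h3xF3cE&list=PL123"))
print(format_start(entry["start"]), entry["text"])
```

Reading the id from the query string still works when the url contains more than one parameter.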

#### Download all transcriptions
Now we are going to download all the transcriptions. Let's create three dictionaries to store the data: 
- `playlists` to store each playlist as `{playlist_id: playlist_name}`
- `videos` to store videos as `{video_id: video_name}`
- `database` to store all captions as `{playlist_id: {video_id: {start: caption}}}`, where `start` is the caption's start time in seconds.

In [None]:
playlists = dict()
videos = dict()
database = dict()

for pl_id in yt_pl_ids:
    playlist = Playlist(base_pl + pl_id)
    print(playlist.title)
    playlists[pl_id] = playlist.title
    database[pl_id] = dict()

    # Iterating over a Playlist yields the url of each video
    for video_url in playlist:
        yt = YouTube(video_url)
        video_id = yt.video_id
        videos[video_id] = yt.title
        database[pl_id][video_id] = dict()
        # Manually created transcripts are returned first
        script = YouTubeTranscriptApi.get_transcript(video_id, languages=('en',))

        # Key each caption by its start time
        for txt in script:
            database[pl_id][video_id][txt['start']] = txt['text']
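
Some videos may not have transcriptions, and then the transcript call raises an exception and stops the whole download. The sketch below shows one way to skip such videos and keep a list of them. It is an assumption-laden illustration: `collect_captions` and `fake_fetch` are hypothetical names, the transcript call is injected as a function so the example runs offline, and the `except` is deliberately broad because the sketch does not name the library's exception types.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def collect_captions(
    video_ids: Iterable[str],
    fetch: Callable[[str], List[dict]],
) -> Tuple[Dict[str, Dict[float, str]], List[str]]:
    """Fetch captions for each video id and record the ids whose fetch fails."""
    captions: Dict[str, Dict[float, str]] = {}
    skipped: List[str] = []
    for video_id in video_ids:
        try:
            script = fetch(video_id)
        except Exception:  # broad on purpose in this sketch
            skipped.append(video_id)
            continue
        # Same layout as the inner dictionaries of `database`
        captions[video_id] = {txt["start"]: txt["text"] for txt in script}
    return captions, skipped


def fake_fetch(video_id: str) -> List[dict]:
    """Stand-in for the real transcript call, so the sketch needs no network."""
    if video_id == "no_captions":
        raise RuntimeError("transcript unavailable")
    return [{"text": "hello", "start": 0.0, "duration": 1.5}]


captions, skipped = collect_captions(["abc", "no_captions"], fake_fetch)
print(captions)
print(skipped)
```

With the real library, `fetch` could be `lambda vid: YouTubeTranscriptApi.get_transcript(vid, languages=('en',))`.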

## Building the Search Engine

### Formatting the data to facilitate insertion into SQLite

In [None]:
# https://stackoverflow.com/a/60932565/10013187
records = [
    (playlist_id, video_id, start, text)
    for playlist_id, playlist_videos in database.items()
    for video_id, captions in playlist_videos.items()
    for start, text in captions.items()
]
print("(playlist_id, video_id, start, text)")
print(records[100])
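
The nested comprehension above turns the three-level dictionary into flat tuples, one per caption. Here is the same idea on a tiny made-up dictionary (the ids and captions in `toy` are illustrative, not real data):

```python
# Toy data with the same nesting as `database`
toy = {
    "pl_1": {
        "vid_a": {0.0: "hello", 2.5: "world"},
        "vid_b": {0.0: "welcome"},
    },
}

# One tuple per caption: (playlist_id, video_id, start, text)
rows = [
    (playlist_id, video_id, start, text)
    for playlist_id, playlist_videos in toy.items()
    for video_id, captions in playlist_videos.items()
    for start, text in captions.items()
]

for row in rows:
    print(row)
```

Each tuple has exactly the four fields that the insert statement in the next section expects.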

### Creating the database

In [None]:
db = sqlite3.connect('fastai_yt.db')
cur = db.cursor()

In [None]:
# virtual table configured to allow full-text search
cur.execute('CREATE VIRTUAL TABLE transcriptions_fts USING fts5(playlist_id, video_id, start, text, tokenize="porter unicode61");')

# dimension-like tables that map ids to human-readable names
cur.execute('CREATE TABLE playlist (playlist_id, playlist_name);')
cur.execute('CREATE TABLE video (video_id, video_name);')
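
The `porter` part of the `tokenize` option applies stemming, so a query for one form of a word can match other forms. The following sketch compares a stemmed and a non-stemmed table in an in-memory database. The table names `stemmed` and `plain` and the sentences are made up for illustration, and the example assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Same tokenizer as the post, next to a table without stemming
con.execute("CREATE VIRTUAL TABLE stemmed USING fts5(text, tokenize='porter unicode61')")
con.execute("CREATE VIRTUAL TABLE plain USING fts5(text, tokenize='unicode61')")

sentences = [
    ("we are training a model",),
    ("the model was trained yesterday",),
    ("nothing relevant here",),
]
con.executemany("INSERT INTO stemmed (text) VALUES (?)", sentences)
con.executemany("INSERT INTO plain (text) VALUES (?)", sentences)

# "training" and "trained" are both reduced to the stem "train"
stemmed_hits = con.execute(
    "SELECT text FROM stemmed WHERE stemmed MATCH ?", ("train",)
).fetchall()
plain_hits = con.execute(
    "SELECT text FROM plain WHERE plain MATCH ?", ("train",)
).fetchall()

print(len(stemmed_hits), len(plain_hits))
```

Stemming trades some precision for recall, which tends to suit spoken-language transcripts where the same word appears in many forms.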

In [None]:
# bulk index records
cur.executemany('INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) values (?,?,?,?);', records)
cur.executemany('INSERT INTO playlist (playlist_id, playlist_name) values (?,?);', playlists.items())
cur.executemany('INSERT INTO video (video_id, video_name) values (?,?);', videos.items())
db.commit()

In [None]:
# Bind the video id as a parameter instead of embedding a double-quoted literal in the SQL
cur.execute('SELECT start, text FROM transcriptions_fts WHERE video_id = ? LIMIT 5', ('8SF_h3xF3cE',)).fetchall()
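
The query above filters on a column value. A full-text query uses the `MATCH` operator instead. Below is a minimal sketch of what a search function could look like, run against an in-memory table with the same columns as `transcriptions_fts`. The `search` helper and the sample rows are hypothetical, and the link format assumes that `https://youtu.be/{video_id}?t={seconds}` opens the video at that second.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE VIRTUAL TABLE transcriptions_fts USING fts5("
    "playlist_id, video_id, start, text, tokenize='porter unicode61')"
)
# Made-up rows in the (playlist_id, video_id, start, text) layout
con.executemany(
    "INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) VALUES (?,?,?,?)",
    [
        ("pl_1", "vid_a", 12.3, "stochastic gradient descent updates the weights"),
        ("pl_1", "vid_a", 40.0, "we plot the loss"),
        ("pl_1", "vid_b", 5.0, "gradient descent again"),
    ],
)


def search(con, query, limit=5):
    """Return (caption, link) pairs for captions matching an FTS5 query, best match first."""
    sql = (
        "SELECT video_id, start, text FROM transcriptions_fts "
        "WHERE transcriptions_fts MATCH ? ORDER BY rank LIMIT ?"
    )
    return [
        (text, f"https://youtu.be/{video_id}?t={int(float(start))}")
        for video_id, start, text in con.execute(sql, (query, limit))
    ]


results = search(con, "gradient descent")
for text, link in results:
    print(link, text)
```

Terms separated by spaces must all appear in the caption. Wrapping them in double quotes inside the query string, as in `'"gradient descent"'`, asks for the exact phrase.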