<img src="https://fsdl.me/logo-720-dark-horizontal">

# Building the FSDL Corpus

This notebook constructs a corpus of documents
from the
[Full Stack Deep Learning](https://fullstackdeeplearning.com)
course and website
and sends it to a [managed MongoDB database](https://www.mongodb.com/docs/atlas/).

These documents are then used to support LLM-powered Q&A.

To achieve higher quality results,
we use specialized parsing for the sources.

Data preparation is less exciting, but often higher yield, than modeling or engineering!

## Target Format

For sourced Q&A, we want to store a collection of documents.

In this context, a document is text plus some optional metadata --
including, ideally, a URL source and an identifier.

A pseudo-schema might look like this:

```text
Docs = {[Document]}

Document = {
    text: string,
    metadata: Metadata?
}

Metadata = {
    source: url?,
    sha256: hash,
    ...
}
```
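
As a rough Python rendering of that pseudo-schema, the sketch below uses `TypedDict`s. The names `Metadata`, `Document`, and `make_document` are illustrative assumptions rather than part of this notebook, the example URL is made up, and for simplicity the sketch always attaches metadata even though the schema marks it optional.

```python
import hashlib
from typing import Optional, TypedDict


class Metadata(TypedDict):
    source: Optional[str]  # URL the text came from, if known
    sha256: str  # hex digest of the text, used as an identifier


class Document(TypedDict):
    text: str
    metadata: Metadata


def make_document(text: str, source: Optional[str] = None) -> Document:
    """Build a document dict, deriving the identifier from the text itself."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"text": text, "metadata": {"source": source, "sha256": digest}}


# hypothetical example document
doc = make_document("Data is the new oil.", source="https://example.com/lecture#data")
print(doc["metadata"]["sha256"][:8])
```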

To start,
we'll just accumulate info in flat `pandas` dataframes.

In [None]:
import pandas as pd


document_df = pd.DataFrame(columns=["text", "source", "sha256"])

document_df

The Q&A will be more useful the more precisely we slice and link the documents,
so we want to split a semantic "document", like a lecture or a video,
up into sub-documents first.

**Note**: we leave it to the `langchain.TextSplitter` to break these sub-documents into smaller chunks when they are upserted into the vector database.
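
To make the idea of chunking concrete, here is a minimal character-based splitter with overlap. It is a stand-in written for illustration, not `langchain`'s implementation; `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names and values.

```python
def split_into_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    """Cut text into windows of at most chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # stop once the remaining tail is already covered by the previous window
    starts = range(0, max(len(text) - overlap, 1), step)
    return [text[start:start + chunk_size] for start in starts]


chunks = split_into_chunks("a" * 250, chunk_size=100, overlap=20)
print([len(chunk) for chunk in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text.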

## Markdown Files

Most pages on the FSDL website
are originally written in Markdown,
which makes it easy to pull out relevant sub-documents.

### Lectures

We first define a `DataFrame` with basic metadata about where the lectures can be found -- on the website and as raw Markdown.

In [None]:
lecture_md_url_base = "https://raw.githubusercontent.com/full-stack-deep-learning/website/main/docs/course/2022"
website_url_base = "https://fullstackdeeplearning.com/course/2022"

In [None]:
lecture_slugs = {
    1: "lecture-1-course-vision-and-when-to-use-ml",
    2: "lecture-2-development-infrastructure-and-tooling",
    3: "lecture-3-troubleshooting-and-testing",
    4: "lecture-4-data-management",
    5: "lecture-5-deployment",
    6: "lecture-6-continual-learning",
    7: "lecture-7-foundation-models",
    8: "lecture-8-teams-and-pm",
    9: "lecture-9-ethics"
}

lecture_df = pd.DataFrame.from_dict(lecture_slugs, orient="index", columns=["url-slug"])
lecture_df

In [None]:
lecture_df["raw-md-url"] = lecture_df["url-slug"].apply(lambda s: f"{lecture_md_url_base}/{s}/index.md")

We then bring in the markdown files from GitHub,
parse them to split out headings as our "sources",
and use `slugify` to create URLs for those heading sources.
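
The step from a heading to a linkable URL can be sketched with the standard library alone. `simple_slug` below is a simplified approximation of what a slugifier does (lowercase, then collapse anything that is not a letter or digit into hyphens); it is not the `slugify` package, and `heading_url` and the example base URL are hypothetical.

```python
import re


def simple_slug(heading: str) -> str:
    """Lowercase the heading and replace runs of non-alphanumerics with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")


def heading_url(base_url: str, page_slug: str, heading: str) -> str:
    """Point at a heading on a page via a URL fragment."""
    return f"{base_url}/{page_slug}#{simple_slug(heading)}"


print(simple_slug("When to Use ML?"))
print(heading_url("https://example.com/course", "lecture-1", "Data Management & Storage"))
```

Whether the resulting fragment actually resolves depends on the site generator producing the same anchors, so a few links are worth spot-checking in a browser.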

In [None]:
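from smart_open import open  # drop-in replacement for the builtin `open` that can also read URLs


def get_text_from(url):
    """Read the contents found at a URL (or local path) into a string."""
    with open(url) as f:
        contents = f.read()
    return contents

lecture_df["raw-text"] = lecture_df["raw-md-url"].apply(get_text_from)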

In [None]:
import mistune
from slugify import slugify


def get_target_headings_and_slugs(text):
    """Return the text of each level-2 heading in a Markdown string, plus a URL slug for each."""
    markdown_parser = mistune.create_markdown(renderer="ast")
    parsed_text = markdown_parser(text)
    
    heading_objects = [obj for obj in parsed_text if obj["type"] == "heading"]
    h2_objects = [obj for obj in heading_objects if obj["level"] == 2]
    
    # skip any h2 whose text starts with "description: " (page metadata rather than a section title)
    targets = [obj for obj in h2_objects if not obj["children"][0]["text"].startswith("description: ")]
    target_headings = [tgt["children"][0]["text"] for tgt in targets]
    
    heading_slugs = [slugify(target_heading) for target_heading in target_headings]
    
    return target_headings, heading_slugs

In [None]:
def split_lecture(row):
    text = row["raw-text"]
    
    headings, slugs = get_target_headings_and_slugs(text)
    
    texts = split_by_headings(text, headings)
    slugs = [""] + slugs
    
    text_rows = []
    for text, slug in zip(texts, slugs):
        text_rows.append({
            "url-slug": row["url-slug"] + "#" + slug,
            "raw-md-url": row["raw-md-url"],
            "text": text,
        })
    
    return pd.DataFrame.from_records(text_rows)

In [None]:
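def split_by_headings(text, headings):
    texts = []
    # work backwards so that each split peels the last remaining section off the end
    for heading in reversed(headings):
        # split once, from the right, on the full h2 marker: this leaves no stray "#"
        # on the preceding section and tolerates the heading text recurring earlier
        text, section = text.rsplit("## " + heading, 1)
        texts.append(f"## {heading}{section}")
    texts.append(text)  # whatever precedes the first heading
    texts = list(reversed(texts))
    return texts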
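
A toy run shows what the heading-based split produces. The block below is a self-contained variant that splits once from the right on the full `## ` marker; `split_on_h2` and the sample Markdown are made up for illustration.

```python
def split_on_h2(text, headings):
    """Split Markdown into a preamble plus one section per listed level-2 heading."""
    sections = []
    # peel sections off the end, last heading first
    for heading in reversed(headings):
        text, body = text.rsplit("## " + heading, 1)
        sections.append(f"## {heading}{body}")
    sections.append(text)  # the preamble before the first heading
    return list(reversed(sections))


sample = "intro\n\n## A\nalpha\n\n## B\nbeta\n"
parts = split_on_h2(sample, ["A", "B"])
for part in parts:
    print(repr(part))
```

Because nothing is dropped, joining the pieces back together reproduces the input, which makes for a cheap sanity check on real lecture files.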

In [None]:
lecture_dfs = []
for idx, row in lecture_df.iterrows():
    single_lecture_df = split_lecture(row)
    single_lecture_df["lecture-idx"] = idx
    lecture_dfs.append(single_lecture_df)
    
split_lecture_df = pd.concat(lecture_dfs, ignore_index=True)

In [None]:
split_lecture_df["source"] = split_lecture_df["url-slug"].apply(lambda s: f"{website_url_base}/{s}")

In [None]:
split_lecture_df

## YouTube Videos

Videos are not text, but transcripts are --
so we can also build a corpus based on videos
from the FSDL YouTube channel.

We first define the video metadata
and use it to build a `DataFrame`.

In [None]:
videos = {
    "id": ["-Iob-FW5jVM",
           "hltjXcaxExY",
           "9w8CVuHUk8U",
           "6fSd8RdtDBs",
           "lsWLgQyaeik",
           "BPYOsDCZbno",
           "NEGDJuINE9E",
           "RLemHNAO5Lw",
           "D65SlCSoS-0",
           "Jlm4oqW41vY",
           "zoS5Fx2Ou1Y",
           "W3hKjXg7fXM",
           "2j6rG-4zS6w",
           "nra0Tt3a-Oc",
           "-mKzxSC0r7w",
           "Rm11UeGwGgk",
           "a54xH6nT4Sw",
           "7FQpbYTqjAA",],
    "title": [
        "Lecture 01: When to Use ML and Course Vision (FSDL 2022)",
        "Lab Intro and Overview",
        "Lab 01: Neural networks in PyTorch",
        "Lab 02: PyTorch Lightning and Convolutional NNs",
        "Lab 03: Transformers and Paragraphs (FSDL 2022)",
        "Lecture 02: Development Infrastructure & Tooling (FSDL 2022)",
        "Lab 04: Experiment Management (FSDL 2022)",
        "Lecture 03: Troubleshooting & Testing (FSDL 2022)",
        "Lab 05: Troubleshooting & Testing (FSDL 2022)",
        "Lecture 04: Data Management (FSDL 2022)",
        "Lab 06: Data Annotation (FSDL 2022)",
        "Lecture 05: Deployment (FSDL 2022)",
        "Lab 07: Web Deployment (FSDL 2022)",
        "Lecture 06: Continual Learning (FSDL 2022)",
        "Lab 08: Monitoring (FSDL 2022)",
        "Lecture 07: Foundation Models (FSDL 2022)",
        "Lecture 08: ML Teams and Project Management (FSDL 2022)",
        "Lecture 09: Ethics (FSDL 2022)",]
}

In [None]:
# baby's first expectation test
assert len(videos["title"]) == len(videos["id"])

In [None]:
videos_df = pd.DataFrame.from_dict(videos)
videos_df.index = videos_df["id"]
videos_df = videos_df.drop("id", axis="columns")
videos_df

We use the `youtube_transcript_api` package
to pull down the transcripts
in a single line of Python.

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi


transcripts = [YouTubeTranscriptApi.get_transcript(video_id) for video_id in videos_df.index]

Conveniently enough, every second of a YouTube video is individually linkable
and the transcripts come with timestamps.

But a second of speech is not a useful source.

And by default, the subtitles come "chunked" in time
at too fine a grain as well:
more like five seconds than the thirty to sixty seconds
that it takes to make a reasonable point.

So now,
we combine the five-second subtitle timestamps
into longer chunks based on character count --
750 seems to generate nicely sized chunks on our corpus.
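
The merging rule can be tried on a handful of made-up cues before running it on real transcripts. This sketch uses plain dictionaries with the `text`, `start`, and `duration` keys that the transcripts carry. The names `merge_cues` and `combine`, the tiny threshold, and the sample cues are assumptions for illustration, and the end time is taken from the last cue rather than from summed durations.

```python
def combine(chunk):
    """Collapse a list of cues into one cue spanning from the first start to the last end."""
    first, last = chunk[0], chunk[-1]
    return {
        "text": " ".join(cue["text"].strip() for cue in chunk),
        "start": first["start"],
        "end": last["start"] + last["duration"],
    }


def merge_cues(cues, trigger_length=30):
    """Accumulate cues until their combined character count reaches trigger_length."""
    merged, chunk, length = [], [], 0
    for cue in cues:
        chunk.append(cue)
        length += len(cue["text"])
        if length >= trigger_length:
            merged.append(combine(chunk))
            chunk, length = [], 0
    if chunk:  # keep the leftover cues as a final, shorter chunk
        merged.append(combine(chunk))
    return merged


cues = [
    {"text": "welcome to the course", "start": 0.0, "duration": 4.0},
    {"text": "today we cover data", "start": 3.5, "duration": 4.5},
    {"text": "and then deployment", "start": 8.0, "duration": 3.0},
]
merged = merge_cues(cues)
print([(m["start"], m["end"]) for m in merged])
```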

In [None]:
from datetime import timedelta

import srt


TRIGGER_LENGTH = 750  # 30-60 seconds

def merge(subtitles, idx):
    new_content = combine_content(subtitles)

    # preserve start as timedelta
    new_start = seconds_float_to_timedelta(subtitles[0]["start"])
    # merge durations as timedelta
    new_duration = seconds_float_to_timedelta(sum(sub["duration"] for sub in subtitles))
    
    # combine
    new_end = new_start + new_duration
    
    return srt.Subtitle(index=idx, start=new_start, end=new_end, content=new_content)


def combine_content(subtitles):
    contents = [subtitle["text"].strip() for subtitle in subtitles]
    return " ".join(contents) + "\n\n"


def get_charcount(subtitle):
    return len(subtitle["text"])


def seconds_float_to_timedelta(x_seconds):
    return timedelta(seconds=x_seconds)


def merge_subtitles(subtitles):
    merged_subtitles = []
    current_chunk, current_length, chunk_idx = [], 0, 1

    for subtitle in subtitles:
        current_chunk.append(subtitle)
        added_length = get_charcount(subtitle)
        new_length = current_length + added_length

        if new_length >= TRIGGER_LENGTH:
            merged_subtitle = merge(current_chunk, chunk_idx)
            merged_subtitles.append(merged_subtitle)
            current_chunk, current_length = [], 0
            chunk_idx += 1
        else:
            current_length = new_length

    if current_chunk:
        merged_subtitle = merge(current_chunk, chunk_idx)
        merged_subtitles.append(merged_subtitle)

    return merged_subtitles


subtitle_collections = [merge_subtitles(transcript) for transcript in transcripts]

# get strings as well for quick checks (and easier to write to files)
subtitle_strings = [srt.compose(merged_subtitles) for merged_subtitles in subtitle_collections]

We then add YouTube URLs
for those longer subtitles as sources
and combine them into a single `DataFrame`.
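
As a small illustration of the timestamped links, the sketch below builds one from a video ID and a start offset using only the standard library. `timestamped_url` is a hypothetical helper; the ID is the first one from the list above and the offset is arbitrary.

```python
from datetime import timedelta
from urllib.parse import urlencode


def timestamped_url(video_id: str, start: timedelta) -> str:
    """Link to a moment in a YouTube video, truncated to whole seconds."""
    query = urlencode({"v": video_id, "t": f"{int(start.total_seconds())}s"})
    return f"https://www.youtube.com/watch?{query}"


url = timestamped_url("-Iob-FW5jVM", timedelta(minutes=1, seconds=35.7))
print(url)
```

Truncating rather than rounding means the link never starts after the first word of the chunk.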

In [None]:
base_url_format = "https://www.youtube.com/watch?v={id}"
query_params_format = "&t={start}s"


def create_split_video_df(subtitles, base_url):
    rows = []
    for subtitle in subtitles:
        raw_text = subtitle.content
        text = raw_text.strip()
        start = timestamp_from_timedelta(subtitle.start)
        url = base_url + query_params_format.format(start=start)

        rows.append({"text": text, "source": url})

    video_df = pd.DataFrame.from_records(rows)
    return video_df


def timestamp_from_timedelta(td):
    return int(td.total_seconds())


split_video_dfs = [
    create_split_video_df(subtitles, base_url_format.format(id=video_id))
    for subtitles, video_id in zip(subtitle_collections, videos_df.index)
]

split_video_df = pd.concat(split_video_dfs, ignore_index=True)

In [None]:
split_video_df

## Combine

Now that we've got all of our texts and sources collated
in separate `DataFrame`s,
let's combine them together.

In [None]:
dfs = [split_lecture_df, split_video_df]
document_formatted_dfs = [df[["text", "source"]] for df in dfs]
document_df = pd.concat(document_formatted_dfs)

Now's a convenient time to add those `sha256` hashes for identification.
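
One property worth keeping in mind: the hash depends only on the text, so two identical texts get the same identifier even when they come from different sources. A minimal sketch of checking for that, with the hypothetical helpers `content_id` and `drop_duplicate_texts` and made-up sample texts:

```python
import hashlib


def content_id(text: str) -> str:
    """Identifier derived purely from the text's UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_duplicate_texts(texts):
    """Keep the first occurrence of each distinct text, paired with its identifier."""
    seen, unique = set(), []
    for text in texts:
        doc_id = content_id(text)
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append((doc_id, text))
    return unique


texts = ["Thanks for watching.", "Let's talk about data.", "Thanks for watching."]
unique = drop_duplicate_texts(texts)
print(len(texts), "texts,", len(unique), "distinct")
```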

In [None]:
import hashlib

doc_ids = []
for _, row in document_df.iterrows():
    m = hashlib.sha256()
    m.update(row["text"].encode("utf-8"))
    doc_ids.append(m.hexdigest())
    
document_df.index = doc_ids

Let's look and see how many "documents" we ended up with,
before we move on:

In [None]:
len(document_df)

## Persist to Disk

As a first step to persisting our corpus,
let's save it to disk and reload it.

The data involved is relatively simple --
basically all strings --
so we don't need to `pickle` the `DataFrame`,
which comes with its own woes.

Instead, we just format it as `JSON` --
the web's favorite serialization format.
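
With `orient="index"`, the JSON is a single object keyed by the index values, where each value maps column names to that row's entries. The standard-library sketch below mimics that layout with a made-up two-row corpus and placeholder identifiers, then checks that it survives a round trip.

```python
import json

# hypothetical corpus in the same shape: {identifier: {column: value}}
corpus = {
    "id-1": {"text": "Intro to the course.", "source": "https://example.com/a"},
    "id-2": {"text": "Why data matters.", "source": "https://example.com/b"},
}

serialized = json.dumps(corpus)
restored = json.loads(serialized)

# strings, dicts, and lists come back unchanged
print(restored == corpus)
```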

In [None]:
documents_json = document_df.to_json(orient="index", index=True)

In [None]:
with open("documents.json", "w") as f:
    f.write(documents_json)

Before moving on,
let's check that we can in fact reload the data.

In [None]:
import json

with open("documents.json") as f:
    s = f.read()
    
key, document = list(json.loads(s).items())[0]

In [None]:
print(document["text"], document["source"])

## Put into MongoDB

But a local filesystem isn't a good method for persistence.

We want these documents to be available via an API,
with the ability to scale reads and writes if needed.

So let's put them in a database.

We choose MongoDB simply for convenience --
we don't want to define a schema just yet,
since these tools are evolving rapidly,
and there are nice free hosting options.

> MongoDB is, in NoSQL terms, a "document database",
but the term document means something different
than it does in "Document Q&A".
In Mongoland, a "document" is just a blob of JSON.
We format our Q&A documents as JSON
and store them in Mongo,
so the distinction is not obvious here.

If you're running this yourself,
you'll need to create a hosted MongoDB instance
and add a database called `fsdl`
with a collection called `ask-fsdl`.

You can find instructions
[here](https://www.mongodb.com/basics/mongodb-atlas-tutorial).

You'll need the URL and password info
from that setup process to connect.

Add them to the `.env` file.
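
One detail to watch when assembling the connection string: MongoDB connection URIs expect the username and password to be percent-encoded, so a password containing characters such as `@` or `/` needs escaping first. The sketch below shows one way to do that with the standard library; `build_connection_string`, the host name, and the password are all hypothetical.

```python
from urllib.parse import quote_plus


def build_connection_string(host: str, password: str, user: str = "fsdl") -> str:
    """Assemble a mongodb+srv URI, escaping the credentials for use inside a URI."""
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"mongodb+srv://{credentials}@{host}/?retryWrites=true&w=majority"


# made-up host and password, for illustration only
uri = build_connection_string("cluster0.example.mongodb.net", "p@ss/word")
print(uri)
```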

In [None]:
import json
import os

from dotenv import load_dotenv
import pymongo
from pymongo import InsertOne

load_dotenv()

mongodb_url = os.environ["MONGODB_URI"]
mongodb_password = os.environ["MONGODB_PASSWORD"]

CONNECTION_STRING = f"mongodb+srv://fsdl:{mongodb_password}@{mongodb_url}/?retryWrites=true&w=majority"

# connect to the database server
client = pymongo.MongoClient(CONNECTION_STRING)
# connect to the database
db = client.get_database("fsdl")
# get a representation of the collection
collection = db.get_collection("ask-fsdl")

collection

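Now that we're connected,
we're ready to insert.

Note that these are plain inserts rather than upserts,
so running the cell twice stores every document twice.

We loop over the documents -- loaded from disk --
and format each one into a Python dictionary
that fits our `Document` pseudo-schema.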

With `pymongo`,
we can just insert that dictionary directly,
using `InsertOne`,
and use `bulk_write` to get batching.
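
The reshaping and batching logic can be exercised without a database. In this sketch, `to_document` and `batched` are hypothetical helpers and the five-row corpus is made up; each yielded batch is what would be wrapped in `InsertOne` requests and handed to `bulk_write`.

```python
from itertools import islice


def to_document(sha_hash, content):
    """Move everything except the text under `metadata`, and record the hash there too."""
    metadata = {key: value for key, value in content.items() if key != "text"}
    metadata["sha256"] = sha_hash
    return {"text": content["text"], "metadata": metadata}


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# made-up corpus in the shape loaded from documents.json
documents = {
    f"hash-{i}": {"text": f"text {i}", "source": f"https://example.com/{i}"}
    for i in range(5)
}

formatted = [to_document(sha_hash, content) for sha_hash, content in documents.items()]
batch_sizes = [len(batch) for batch in batched(formatted, batch_size=2)]
print(batch_sizes)
```

If re-running the load without creating duplicates matters, one option is to replace the inserts with `pymongo.UpdateOne(..., upsert=True)` requests filtered on `metadata.sha256`.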

In [None]:
CHUNK_SIZE = 250
requesting = []

with open("documents.json") as f:
    documents = json.load(f)


for (sha_hash, content) in documents.items():
    metadata = {key: value for key, value in content.items() if key != "text"}
    metadata["sha256"] = sha_hash
    document = {"text": content["text"], "metadata": metadata}
    requesting.append(InsertOne(document))
    
    if len(requesting) >= CHUNK_SIZE:
        collection.bulk_write(requesting)
        requesting = []
        
if requesting:
    collection.bulk_write(requesting)
    requesting = []