An end-to-end data pipeline that scrapes Canvas files and lecture transcripts, processes them, and enables semantic search using sentence embeddings.
Uses:
- Scraping Tools: Selenium, Requests, Concurrency
- Embeddings: Transformers, PyTorch, PostgreSQL
- (hopefully) Web App: Next.js, Tailwind, Transformers.js
Pipeline:
- Canvas Scraper
  - Uses auth cookies with Requests to scrape all Canvas files, then uses Selenium to load the lecture-hosting site and pull transcripts (see the file-listing sketch after this list).
  - The transcript scraper is slow (30-40 minutes on my machine), but a more powerful PC or more optimized code could definitely speed it up.
- Course Downloading
  - Turns the scraped download URLs into local files (see the download sketch below).
- Text Extractor
  - Wrangles resources into raw text files. A lot of work can still be done here (Tesseract OCR, better chunking, etc.); a minimal chunker is sketched below.
- Embeddings Generation
  - Generates embeddings for the text files (see the embedding sketch below). So far I've only experimented with lectures; PDFs are next.
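
A minimal sketch of what the cookie-authenticated file scraping could look like, assuming your institution's Canvas exposes the standard REST API; `CANVAS_HOST`, the cookie value, and `list_course_files` are illustrative placeholders, not this repo's actual code:

```python
# Sketch of cookie-authenticated Canvas file listing. CANVAS_HOST, the cookie
# value, and the course ID are placeholders; adapt to your institution.
import requests

CANVAS_HOST = "https://canvas.example.edu"            # your institution's Canvas URL
COOKIES = {"canvas_session": "<paste-from-browser>"}  # auth cookie from a logged-in session

def list_course_files(course_id: int) -> list[dict]:
    """Page through the Canvas REST API, collecting file metadata for one course."""
    files = []
    url = f"{CANVAS_HOST}/api/v1/courses/{course_id}/files"
    params = {"per_page": 100}
    while url:
        resp = requests.get(url, cookies=COOKIES, params=params, timeout=30)
        resp.raise_for_status()
        files.extend(resp.json())
        # Canvas paginates with a Link header; requests exposes it via resp.links.
        url = resp.links.get("next", {}).get("url")
        params = None  # the next-page URL already carries its query string
    return files
```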
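The downloading step could look roughly like this: a thread pool turning scraped URLs into local files (the concurrency mentioned above). The `url_to_path` mapping is a hypothetical input for illustration:

```python
# Sketch of concurrent downloading: stream each URL to its destination path.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import requests

def download(url: str, dest: Path) -> Path:
    dest.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                f.write(chunk)  # stream to disk so large files don't sit in memory
    return dest

def download_all(url_to_path: dict[str, Path], workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download, u, p) for u, p in url_to_path.items()]
        for fut in as_completed(futures):
            fut.result()  # surface any download error
```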
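As a starting point for the "better chunking" work, here is one simple approach: fixed-size word windows with overlap, so a sentence straddling a boundary appears in both chunks. The window sizes are illustrative, not tuned values from this repo:

```python
# Sketch of overlapping word-window chunking; sizes are illustrative.
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```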
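And a minimal sketch of the embedding step, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (which produces the 384-dimensional vectors used in the tables below):

```python
# Sketch of embedding text chunks with all-MiniLM-L6-v2 (384-dim output).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks: list[str]):
    # Returns a (len(chunks), 384) numpy array; normalized vectors make
    # cosine similarity equivalent to a plain dot product.
    return model.encode(chunks, normalize_embeddings=True)
```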
TODOs:
- Data Exploration
  - Look into fun ways to cluster and visualize lectures (one possible starting point is sketched after this list)
  - Cluster similar classes, similar lectures, semantic variation within lectures, etc.
- UI and UX
  - Turn this into a functional web app: push embeddings to Postgres and allow easy semantic search (a query sketch follows the table definitions below)
  - Think about useful interfaces:
    - Find similar passages in lectures and suggest lecture slides afterwards?
    - Allow filtering for specific topics within classes (e.g., find where dynamic programming was introduced in an Algorithms class or a specific lecture)
  - More!
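
For the data-exploration TODO, one possible starting point (not committed code) is k-means over the per-lecture average embeddings, with a 2-D PCA projection for plotting; `embeddings` and the cluster count are placeholders, and scikit-learn is assumed:

```python
# Sketch for the clustering/visualization TODO, assuming scikit-learn.
# `embeddings` would be the (n_lectures, 384) matrix of avg_embedding rows.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_lectures(embeddings: np.ndarray, n_clusters: int = 8):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    coords = PCA(n_components=2).fit_transform(embeddings)  # 2-D points for a scatter plot
    return labels, coords
```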
Here's the structure of the tables I'm currently using. There's definitely room for expansion or modification; off the top of my head: links to live resources, etc.
Note that I used MiniLM-L6-v2 to generate my text embeddings; if you use a different model, you will likely have to change the vector size to accommodate it.
```sql
-- The VECTOR type comes from the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE lectures (
    lecture_id    SERIAL PRIMARY KEY,
    title         TEXT NOT NULL,
    class         TEXT,                     -- URL/filepath
    avg_embedding VECTOR(384),              -- dimension matches MiniLM-L6-v2
    created_at    TIMESTAMP DEFAULT NOW(),
    metadata      JSONB                     -- author, date, tags, etc.
);

CREATE TABLE chunks (
    chunk_id   SERIAL PRIMARY KEY,
    lecture_id INT NOT NULL REFERENCES lectures(lecture_id) ON DELETE CASCADE,
    content    TEXT NOT NULL,
    embedding  VECTOR(384), -- dimension matches MiniLM-L6-v2
    position   INT,         -- original order in lecture
    metadata   JSONB,       -- page numbers, timestamps, etc.
    created_at TIMESTAMP DEFAULT NOW()
);
```
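
And here's a sketch of the semantic search these tables enable, using psycopg2 with the pgvector Python adapter; the connection string is a placeholder, and the query model must match whatever generated the stored vectors:

```python
# Sketch of semantic search over the chunks table, assuming the psycopg2 and
# pgvector Python packages; connection details are placeholders.
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the stored 384-dim vectors
conn = psycopg2.connect("dbname=postgres")       # adjust to your setup
register_vector(conn)  # lets psycopg2 send numpy arrays as VECTOR values

def search(query: str, limit: int = 5):
    vec = model.encode(query, normalize_embeddings=True)
    with conn.cursor() as cur:
        # <=> is pgvector's cosine-distance operator; smaller means more similar.
        cur.execute(
            """SELECT lecture_id, content, embedding <=> %s AS distance
               FROM chunks
               ORDER BY distance
               LIMIT %s""",
            (vec, limit),
        )
        return cur.fetchall()
```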