
diegotyner/CanvasResourceSemanticSearch



Quick Summary

An end-to-end data pipeline that scrapes Canvas files and lecture transcripts, processes them, and enables semantic search using sentence embeddings.

Uses:

  • Scraping Tools: Selenium, Requests, Concurrency
  • Embeddings: Transformers, PyTorch, PostgreSQL
  • (hopefully) Web App: Next.js, Tailwind, Transformers.js

Pipeline Order

  1. Canvas Scraper
    • Uses auth cookies and Requests to scrape all Canvas files, then uses Selenium to load the lecture-hosting site and access transcripts
    • The lecture scraper runs slowly (30-40 minutes on my machine), but a more powerful PC or more optimized code could definitely speed it up.
  2. Course Downloading
    • Turns the scraped download URLs into local files.
  3. Text Extractor
    • Wrangles resources into raw text files. A lot of work can be done here (Tesseract OCR, better chunking, etc.)
  4. Embeddings Generation
    • Generates embeddings for text files. So far I've only experimented with lectures; I'll experiment with PDFs soon.
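Steps 3 and 4 above can be sketched roughly like this. The chunk size, overlap, and function name are illustrative assumptions, not values from this repo; the embedding step is shown in comments since it needs the model downloaded:

```python
# Hypothetical sketch of the chunking step; the real pipeline may differ.
def chunk_text(text, max_words=200, overlap=50):
    """Split raw text into overlapping word-window chunks."""
    words = text.split()
    chunks = []
    step = max_words - overlap  # assumes max_words > overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break
    return chunks

# Embedding with MiniLM-L6-v2 (384 dimensions, matching the VECTOR(384)
# columns in the Postgres tables below) would then look roughly like:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embeddings = model.encode(chunks)        # shape: (n_chunks, 384)
#   avg_embedding = embeddings.mean(axis=0)  # lectures.avg_embedding
```

The overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.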

TODOs:

  1. Data Exploration
    • Look into fun ways to cluster and visualize lectures
    • Cluster similar classes, similar lectures, semantic variation within lectures, etc.
  2. UI and UX
    • Turn into a functional web app, pushing embeddings to Postgres and allowing easy semantic search
    • Think about useful interfaces
      • Find similar passages in lectures and suggest lecture slides afterwards?
      • Allow filtering for specific topics within classes (e.g., find where dynamic programming was introduced in an Algorithms class or a specific lecture)
    • More!

Setting Up Postgres

Here's the structure of the tables I'm currently using. There's definitely room for expansion or modification; off the top of my head: links to live resources, etc.

Note that I used MiniLM-L6-v2 to generate my text embeddings; if you use a different model, you will likely have to change the vector size to accommodate it.

CREATE TABLE lectures (
    lecture_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    class TEXT,  -- URL/filepath
    avg_embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    created_at TIMESTAMP DEFAULT NOW(),
    metadata JSONB  -- author, date, tags, etc.
);

CREATE TABLE chunks (
    chunk_id SERIAL PRIMARY KEY,
    lecture_id INT NOT NULL REFERENCES lectures(lecture_id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    position INT,  -- Original order in lecture
    metadata JSONB,  -- page numbers, timestamps, etc.
    created_at TIMESTAMP DEFAULT NOW()
);
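The VECTOR type comes from the pgvector extension, which has to be enabled before creating the tables. Once embeddings are inserted, a nearest-neighbor lookup could look like the sketch below; this uses pgvector's `<=>` cosine-distance operator and is an illustration, not code from this repo:

```sql
-- Enable pgvector first (assumes the extension is installed on the server)
CREATE EXTENSION IF NOT EXISTS vector;

-- Sketch: the five chunks closest to a query embedding.
-- '[0.12, -0.03, ...]' is a placeholder; pass the full 384-number vector.
SELECT chunk_id, lecture_id, content
FROM chunks
ORDER BY embedding <=> '[0.12, -0.03, ...]'::vector
LIMIT 5;
```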
