# M2: Ingestion Pipeline

This notebook runs the course website crawler and displays all ingested URLs.

## Prerequisites
- MongoDB running (via docker-compose)
- Network access to https://pantelis.github.io/courses/ai/

In [1]:
# Add src to path
import sys
sys.path.insert(0, '/app')

from src.ingestion import Crawler, Storage, COURSE_ROOT

## 1. Initialize Storage

Connect to MongoDB and check current state.

In [2]:
# Connect to MongoDB
storage = Storage(
    mongo_uri='mongodb://erica:erica_password_123@mongodb:27017/',
    db_name='erica'
)

# Check current stats
stats = storage.get_stats()
print("Current database state:")
print(f"  Pages: {stats['pages']}")
print(f"  Resources: {stats['resources']['total']}")
print(f"  Failures: {stats['failures']}")

Current database state:
  Pages: 184
  Resources: 1317
  Failures: 55
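
The connection string in the cell above embeds the username and password directly in the notebook. A minimal sketch of assembling the same kind of URI from environment variables instead follows; the helper `build_mongo_uri` and the variable names `MONGO_USER`, `MONGO_PASSWORD`, `MONGO_HOST`, and `MONGO_PORT` are hypothetical and not part of `src.ingestion`.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=None):
    """Assemble a MongoDB URI from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters such as '@' or '/' stay valid in a URI
    user = quote_plus(env["MONGO_USER"])
    password = quote_plus(env["MONGO_PASSWORD"])
    host = env.get("MONGO_HOST", "mongodb")
    port = env.get("MONGO_PORT", "27017")
    return f"mongodb://{user}:{password}@{host}:{port}/"


# Explicit mapping so the sketch runs without real secrets
example_uri = build_mongo_uri({"MONGO_USER": "erica", "MONGO_PASSWORD": "p@ss/word"})
print(example_uri)
```

The result could then be passed as `mongo_uri` to `Storage`, assuming the container exposes those variables.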


## 2. Run Crawler (Optional)

⚠️ Run this cell only if you want to (re)crawl the website. A full crawl takes several minutes.

In [3]:
# # Uncomment to clear existing data first
# storage.clear_all()
# print("Cleared existing data")

In [3]:
# Run the crawler
crawler = Crawler(
    storage=storage,
    delay=0.2,  # 0.2 seconds between requests
    progress_interval=10  # Log every 10 pages
)

# Start crawling from course root
crawler.crawl(COURSE_ROOT)

[INFO] Starting crawl from https://pantelis.github.io/courses/ai/
[INFO] [1] https://pantelis.github.io/courses/ai - 17 internal, 0 pdfs, 1 images
[INFO] [2] https://pantelis.github.io/book/foundations/index.html - 26 internal, 0 pdfs, 6 images
[INFO] [3] https://pantelis.github.io/book/dnn/index.html - 26 internal, 0 pdfs, 6 images
[INFO] [4] https://pantelis.github.io/book/2d-perception/index.html - 36 internal, 0 pdfs, 10 images
[INFO] [5] https://pantelis.github.io/book/kinematics/index.html - 22 internal, 3 pdfs, 2 images
[INFO] [6] https://pantelis.github.io/book/state-estimation/index.html - 20 internal, 0 pdfs, 2 images
[INFO] [7] https://pantelis.github.io/book/llm/index.html - 40 internal, 0 pdfs, 10 images
[INFO] [8] https://pantelis.github.io/book/multimodal/index.html - 34 internal, 0 pdfs, 6 images
[INFO] [9] https://pantelis.github.io/book/task-planning/index.html - 19 internal, 0 pdfs, 3 images
[INFO] [10] https://pantelis.github.io/book/global-planning/index.html - 19 
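
The `delay` and `progress_interval` arguments suggest a throttled crawl that logs progress periodically. The sketch below shows one way such a loop could be structured as a breadth-first traversal; it is not the actual `Crawler` implementation, and `fetch_links` is a hypothetical callable standing in for the HTTP fetch and link extraction.

```python
import time
from collections import deque


def crawl(root, fetch_links, delay=0.2, progress_interval=10, sleep=time.sleep):
    """Breadth-first crawl that visits each URL once and pauses between requests.

    fetch_links is a hypothetical callable: url -> iterable of linked URLs.
    """
    seen = {root}
    queue = deque([root])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs that are already queued or visited
                seen.add(link)
                queue.append(link)
        if len(visited) % progress_interval == 0:
            print(f"[INFO] crawled {len(visited)} pages, {len(queue)} queued")
        if queue:
            sleep(delay)  # throttle only when another request will follow
    return visited


# Tiny in-memory link graph standing in for the real site
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}
order = crawl("a", lambda u: graph[u], delay=0, progress_interval=2)
print(order)
```

Tracking `seen` at enqueue time rather than at visit time keeps a page that is linked from many places from being queued more than once.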

## 3. View Ingested URLs

This satisfies the M2 requirement: "Ensure you have a notebook cell / markdown file that prints all the URLs that you have ingested."

In [4]:
# Get all URLs
urls = storage.get_all_urls()

print("=" * 60)
print("ALL INGESTED URLs")
print("=" * 60)

ALL INGESTED URLs


In [5]:
# Pages
print(f"\n📄 WEB PAGES ({len(urls['pages'])})")
print("-" * 40)
for url in sorted(urls['pages']):
    print(url)


📄 WEB PAGES (184)
----------------------------------------
https://pantelis.github.io/aiml-common/assignments/main/ai-fall-2025/assignment-1-grad.html
https://pantelis.github.io/aiml-common/assignments/main/ai-fall-2025/assignment-1-undergrad.html
https://pantelis.github.io/aiml-common/assignments/main/ai-fall-2025/assignment-2.html
https://pantelis.github.io/aiml-common/assignments/main/ai-fall-2025/assignment-3.html
https://pantelis.github.io/aiml-common/assignments/main/ai-fall-2025/assignment-4.html
https://pantelis.github.io/aiml-common/assignments/topics/cnn-explainers/index-2-preview.html
https://pantelis.github.io/aiml-common/assignments/topics/cnn-explainers/index-2.ipynb
https://pantelis.github.io/aiml-common/assignments/topics/devenv/index-preview.html
https://pantelis.github.io/aiml-common/assignments/topics/devenv/index.ipynb
https://pantelis.github.io/aiml-common/assignments/topics/dim-reduction/curse-dimensionality-preview.html
https://pantelis.github.io/aiml-common/

In [6]:
# PDFs
print(f"\n📕 PDF DOCUMENTS ({len(urls['pdfs'])})")
print("-" * 40)
for url in sorted(urls['pdfs']):
    print(url)


📕 PDF DOCUMENTS (42)
----------------------------------------
http://algorithmics.lsi.upc.edu/docs/Dasgupta-Papadimitriou-Vazirani.pdf
http://alvyray.com/CreativeCommons/BizCardUniversalTuringMachine_v2.3.pdf
http://cs231n.stanford.edu/handouts/derivatives.pdf
http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture10.pdf
http://cs231n.stanford.edu/vecDerivs.pdf
http://hosting.astro.cornell.edu/~cordes/A6523/GeneratingCorrelatedRandomVariables.pdf
http://incompleteideas.net/book/RLbook2020.pdf
http://math.mit.edu/~gs/linearalgebra/linearalgebra5_6-1.pdf
http://proceedings.mlr.press/v28/pascanu13.pdf
http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
https://arxiv.org/pdf/1301.3781.pdf
https://arxiv.org/pdf/1312.4400v3.pdf
https://arxiv.org/pdf/1409.1556.pdf
https://arxiv.org/pdf/1412.6806.pdf
https://arxiv.org/pdf/1503.04069v1.pdf
https://arxiv.org/pdf/1506.00019.pdf
https://arxiv.org/pdf/1605.06431.pdf
https://arxiv.o

In [7]:
# Videos
print(f"\n🎥 YOUTUBE VIDEOS ({len(urls['videos'])})")
print("-" * 40)
for url in sorted(urls['videos']):
    print(url)


🎥 YOUTUBE VIDEOS (11)
----------------------------------------
https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
https://www.youtube.com/watch?v=6niqTuYFZLQ&t=521s
https://www.youtube.com/watch?v=Cx5Z-OslNWE&list=PLUl4u3cNGP63oMNUHXqIUcrkS2PivhN3k
https://www.youtube.com/watch?v=DiOxbYTLXX8
https://www.youtube.com/watch?v=MMxMeX5emUA
https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3
https://www.youtube.com/watch?v=WCUNPb-5EYI
https://www.youtube.com/watch?v=hzaG2Uq60uw
https://www.youtube.com/watch?v=x6wsTFnU3eY
https://youtu.be/_NOVa4i7Us8?list=PL1Q0jeuU6XppS_r2Sa9fzVanpbXKqLsYS&t=384
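
The list above mixes `watch?v=` links, a `youtu.be` short link, and a playlist link, some with extra `list`, `index`, or `t` parameters. A minimal sketch of extracting the video ID from these forms with the standard library follows; the helper `youtube_video_id` is hypothetical and is not how `src.ingestion` is known to handle these URLs.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the video ID for watch and youtu.be links, or None for other URLs."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short links carry the ID as the path; query parameters are ignored
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" and parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None  # for example, playlist links carry no single video ID


samples = [
    "https://www.youtube.com/watch?v=6niqTuYFZLQ&t=521s",
    "https://youtu.be/_NOVa4i7Us8?list=PL1Q0jeuU6XppS_r2Sa9fzVanpbXKqLsYS&t=384",
    "https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab",
]
ids = [youtube_video_id(u) for u in samples]
print(ids)
```

Keying on the ID rather than the full URL would let two links to the same video with different timestamps be treated as one resource.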


In [8]:
# Images (count plus the first 10 URLs; expand if needed)
print(f"\n🖼️  IMAGES ({len(urls['images'])})")
print("-" * 40)
print(f"Total images downloaded: {len(urls['images'])}")
print("\nFirst 10 images:")
for url in sorted(urls['images'])[:10]:
    print(f"  {url}")
if len(urls['images']) > 10:
    print(f"  ... and {len(urls['images']) - 10} more")


🖼️  IMAGES (613)
----------------------------------------
Total images downloaded: 613

First 10 images:
  https://colab.research.google.com/assets/colab-badge.svg
  https://dl.fbaipublicfiles.com/detectron2/Detectron2-Logo-Horz.png
  https://drive.google.com/uc?id=12ha59vACcd8eCEPAQdbZ4-axPGPKnn7F
  https://img.youtube.com/vi/x6wsTFnU3eY/0.jpg
  https://pantelis.github.io/aiml-common/assignments/topics/cnn-explainers/workflow.png
  https://pantelis.github.io/aiml-common/assignments/topics/sports-analytics/basketball/images/birdseye.png
  https://pantelis.github.io/aiml-common/lectures/VLM/clip/images/arrowhead_ball.png
  https://pantelis.github.io/aiml-common/lectures/VLM/clip/images/circle_triangle.png
  https://pantelis.github.io/aiml-common/lectures/VLM/clip/images/clip_inference_fig.png
  https://pantelis.github.io/aiml-common/lectures/VLM/clip/images/clip_mapping_diagram_two_branch.png
  ... and 603 more


## 4. View Statistics

In [9]:
stats = storage.get_stats()

print("\n📊 INGESTION STATISTICS")
print("=" * 40)
print(f"Pages ingested:     {stats['pages']}")
print(f"\nResources by type:")
print(f"  PDFs:             {stats['resources']['by_type']['pdf']}")
print(f"  Videos:           {stats['resources']['by_type']['video']}")
print(f"  Images:           {stats['resources']['by_type']['image']}")
print(f"  External links:   {stats['resources']['by_type']['external']}")
print(f"\nResources by status:")
print(f"  Pending:          {stats['resources']['by_status']['pending']}")
print(f"  Ingested:         {stats['resources']['by_status']['ingested']}")
print(f"  Failed:           {stats['resources']['by_status']['failed']}")
print(f"\nFailures logged:    {stats['failures']}")


📊 INGESTION STATISTICS
Pages ingested:     184

Resources by type:
  PDFs:             42
  Videos:           11
  Images:           613
  External links:   651

Resources by status:
  Pending:          53
  Ingested:         613
  Failed:           0

Failures logged:    27
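
In the output above, the by-type counts sum to 1317 (42 + 11 + 613 + 651), which matches the resource total printed in section 1, while the by-status counts sum to 666 (53 + 613 + 0). A small sketch for checking these sums programmatically follows. The helper `check_resource_totals` is hypothetical, the `stats` shape is inferred from the keys used in the cell above, and the notebook does not confirm which resources make up the 651 that carry no listed status.

```python
def check_resource_totals(stats):
    """Compare the by-type and by-status sums against the reported resource total."""
    resources = stats["resources"]
    by_type_sum = sum(resources["by_type"].values())
    by_status_sum = sum(resources["by_status"].values())
    return {
        "by_type_sum": by_type_sum,
        "by_status_sum": by_status_sum,
        "type_matches_total": by_type_sum == resources["total"],
        # Resources counted in the total but not under any listed status
        "not_in_listed_statuses": resources["total"] - by_status_sum,
    }


# Figures copied from the outputs in this notebook
sample_stats = {
    "resources": {
        "total": 1317,
        "by_type": {"pdf": 42, "video": 11, "image": 613, "external": 651},
        "by_status": {"pending": 53, "ingested": 613, "failed": 0},
    }
}
report = check_resource_totals(sample_stats)
print(report)
```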


## 5. Review Failures (if any)

In [10]:
# Check for failures
failures = list(storage.failures.find())

if failures:
    print(f"\n⚠️  FAILURES ({len(failures)})")
    print("=" * 40)
    for f in failures:
        print(f"\nURL: {f['url']}")
        print(f"  Type: {f['failure_type']}")
        print(f"  Error: {f['error_message']}")
        print(f"  Attempts: {f['attempts']}")
else:
    print("\n✅ No failures recorded!")


⚠️  FAILURES (27)

URL: https://pantelis.github.io/courses/images/llama.jpeg
  Type: download_error
  Error: HTTP 404
  Attempts: 2

URL: https://pantelis.github.io/aiml-common/lectures/classification/perceptron/index.qmd
  Type: http_error
  Error: Page not found
  Attempts: 1

URL: https://pantelis.github.io/aiml-common/lectures/cnn/cnn-intro/index.md
  Type: http_error
  Error: Page not found
  Attempts: 1

URL: https://pantelis.github.io/aiml-common/lectures/figs/pitchyawroll.svg
  Type: download_error
  Error: HTTP 404
  Attempts: 1

URL: https://raw.githubusercontent.com/NxRLab/ModernRobotics/master/packages/python/doc/images/Chapter3/3.5_rodrigues.png
  Type: download_error
  Error: HTTP 404
  Attempts: 1

URL: https://pantelis.github.io/aiml-common/lectures/kinematics/motion-representations/.png
  Type: download_error
  Error: HTTP 404
  Attempts: 1

URL: https://pantelis.github.io/aiml-common/lectures/kinematics/wheeled-robots/index.qmd
  Type: http_error
  Error: Page no
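
With 27 failure records, a per-type summary can be easier to scan than the full listing. The sketch below groups records with `collections.Counter`; the helper `summarize_failures` is hypothetical, and the sample records are the first six complete entries from the output above.

```python
from collections import Counter


def summarize_failures(failures):
    """Count failure records by type and by (type, error message) pair."""
    by_type = Counter(f["failure_type"] for f in failures)
    by_error = Counter((f["failure_type"], f["error_message"]) for f in failures)
    return by_type, by_error


# First six complete records from the output above (URLs omitted for brevity)
sample_failures = [
    {"failure_type": "download_error", "error_message": "HTTP 404"},
    {"failure_type": "http_error", "error_message": "Page not found"},
    {"failure_type": "http_error", "error_message": "Page not found"},
    {"failure_type": "download_error", "error_message": "HTTP 404"},
    {"failure_type": "download_error", "error_message": "HTTP 404"},
    {"failure_type": "download_error", "error_message": "HTTP 404"},
]
by_type, by_error = summarize_failures(sample_failures)
for failure_type, count in by_type.most_common():
    print(f"{failure_type}: {count}")
```

In the notebook, the same function could be applied to the `failures` list built in the cell above.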

## 6. Sample Page Content

View a sample of what was extracted from a page.

In [2]:
# Get a sample page
sample = storage.pages.find_one()

if sample:
    print(f"URL: {sample['url']}")
    print(f"Title: {sample['title']}")
    print(f"\nLinks found:")
    print(f"  Internal: {len(sample['links_found']['internal'])}")
    print(f"  PDFs: {len(sample['links_found']['pdf'])}")
    print(f"  Videos: {len(sample['links_found']['video'])}")
    print(f"  Images: {len(sample['links_found']['image'])}")
    print(f"\nContent preview (first 500 chars):")
    print("-" * 40)
    print(sample['content'][:500])
else:
    print("No pages ingested yet. Run the crawler first.")

NameError: name 'storage' is not defined

## 7. MongoDB Queries (Direct Access)

For more advanced queries, you can use the collections directly.

In [12]:
# Example: Find all pages that link to PDFs
pages_with_pdfs = storage.pages.find(
    {"links_found.pdf": {"$ne": []}},
    {"url": 1, "title": 1, "links_found.pdf": 1}
)

print("Pages that link to PDFs:")
for page in pages_with_pdfs:
    print(f"\n{page['title']}")
    print(f"  URL: {page['url']}")
    print(f"  PDFs: {len(page['links_found']['pdf'])}")

Pages that link to PDFs:

Kinematics – Engineering AI Agents
  URL: https://pantelis.github.io/book/kinematics/index.html
  PDFs: 3

Maximum Likelihood (ML) Estimation of conditional models – Engineering AI Agents
  URL: https://pantelis.github.io/aiml-common/lectures/optimization/maximum-likelihood/conditional_maximum_likelihood.html
  PDFs: 1

Introduction to Backpropagation – Engineering AI Agents
  URL: https://pantelis.github.io/aiml-common/lectures/dnn/backprop-intro/index.html
  PDFs: 1

Backpropagation in Deep Neural Networks – Engineering AI Agents
  URL: https://pantelis.github.io/aiml-common/lectures/dnn/backprop-dnn/index.html
  PDFs: 2

Data Preprocessing – Engineering AI Agents
  URL: https://pantelis.github.io/aiml-common/lectures/optimization/whitening/index.html
  PDFs: 1

Covariance and Correlation Matrices – Engineering AI Agents
  URL: https://pantelis.github.io/aiml-common/lectures/optimization/whitening/corr-cov-matrix.html
  PDFs: 2

Regularization in
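
The query above counts PDFs per page in Python after fetching each document. The same ranking could be computed server-side with an aggregation pipeline, sketched below. The helper `pdf_count_pipeline` is hypothetical, and running it assumes `storage.pages` is a PyMongo collection, which the `find` and `find_one` calls in this notebook suggest but do not confirm.

```python
def pdf_count_pipeline(limit=10):
    """Build an aggregation pipeline ranking pages by number of linked PDFs."""
    return [
        # Keep only pages whose links_found.pdf array has at least one element
        {"$match": {"links_found.pdf.0": {"$exists": True}}},
        {
            "$project": {
                "_id": 0,
                "url": 1,
                "title": 1,
                "pdf_count": {"$size": "$links_found.pdf"},
            }
        },
        {"$sort": {"pdf_count": -1}},
        {"$limit": limit},
    ]


pipeline = pdf_count_pipeline(limit=5)
print(pipeline)

# Hypothetical usage against the notebook's collection:
# for row in storage.pages.aggregate(pipeline):
#     print(row["pdf_count"], row["url"])
```

Matching on `links_found.pdf.0` rather than `{"$ne": []}` also excludes documents where the field is missing, which would otherwise make `$size` fail.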

In [3]:
storage = Storage(
    mongo_uri='mongodb://erica:erica_password_123@mongodb:27017/',
    db_name='erica'
)

# Run this to see what's pending
stats = storage.get_stats()
print(f"PDFs to process: {stats['resources']['by_type']['pdf']}")
print(f"Videos to process: {stats['resources']['by_type']['video']}")

# Preview some
pdfs = list(storage.resources.find({"resource_type": "pdf"}, {"url": 1}).limit(5))
videos = list(storage.resources.find({"resource_type": "video"}, {"url": 1}).limit(5))

print("\nSample PDFs:")
for p in pdfs:
    print(f"  {p['url']}")

print("\nSample Videos:")
for v in videos:
    print(f"  {v['url']}")

PDFs to process: 42
Videos to process: 11

Sample PDFs:
  https://pantelis.github.io/book/kinematics/assets/pdf/lynch-config-space.pdf
  https://pantelis.github.io/assets/pdf/lynch-body-motions.pdf
  https://pantelis.github.io/assets/pdf/lynch-wheeled-robots.pdf
  https://pantelis.github.io/aiml-common/lectures/optimization/maximum-likelihood/notescs6140linearregression.pdf
  http://cs231n.stanford.edu/handouts/derivatives.pdf

Sample Videos:
  https://www.youtube.com/watch?v=WCUNPb-5EYI
  https://www.youtube.com/watch?v=6niqTuYFZLQ&t=521s
  https://youtu.be/_NOVa4i7Us8?list=PL1Q0jeuU6XppS_r2Sa9fzVanpbXKqLsYS&t=384
  https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
  https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3


In [7]:
from src.ingestion import PDFProcessor, Storage

storage = Storage(mongo_uri='mongodb://erica:erica_password_123@mongodb:27017/')
pdf_processor = PDFProcessor(storage=storage)
pdf_processor.process_all_pending()

[INFO] Processing 0 PDFs...
[INFO] PDF processing complete. Success: 0, Failed: 0


In [4]:
# Reset failed videos to try again
storage.resources.update_many(
    {"resource_type": "video", "status": "failed"},
    {"$set": {"status": "pending"}}
)

# Then run the processor again
from src.ingestion import YouTubeProcessor
yt_processor = YouTubeProcessor(storage=storage)
yt_processor.process_all_pending()

[INFO] Processing 11 YouTube videos...
[INFO] [1/11] https://www.youtube.com/watch?v=WCUNPb-5EYI
[INFO]   Extracted 21179 chars (en language)
[INFO] [2/11] https://www.youtube.com/watch?v=6niqTuYFZLQ&t=521s
[INFO]   Extracted 78759 chars (en language)
[INFO] [3/11] https://youtu.be/_NOVa4i7Us8?list=PL1Q0jeuU6XppS_r2Sa9fzVanpbXKqLsYS&t=384
[INFO]   Extracted 12351 chars (en language)
[INFO] [4/11] https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
[INFO]   Extracted 95211 chars (en language)
[INFO] [5/11] https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3
[INFO]   Extracted 101324 chars (en language)
[INFO] [6/11] https://www.youtube.com/watch?v=x6wsTFnU3eY
[INFO]   Extracted 45175 chars (en language)
[INFO] [7/11] https://www.youtube.com/watch?v=MMxMeX5emUA
[INFO]   Extracted 5560 chars (en language)
[INFO] [8/11] https://www.youtube.com/watch?v=DiOxbYTLXX8
[INFO]   Extracted 11799 chars (en language)
[INFO] [9/1

In [5]:
# Check a processed PDF (find_one returns None when nothing matches)
pdf = storage.resources.find_one({"resource_type": "pdf", "status": "ingested"})
if pdf:
    print(f"PDF: {pdf['url']}")
    print(f"Pages: {pdf.get('metadata', {}).get('page_count')}")
    print(f"Content preview: {pdf.get('content', '')[:500]}")
else:
    print("No ingested PDFs found.")

# Check a processed video
video = storage.resources.find_one({"resource_type": "video", "status": "ingested"})
if video:
    print(f"Video: {video['url']}")
    print(f"Content preview: {video.get('content', '')[:500]}")
else:
    print("No ingested videos found.")

PDF: https://pantelis.github.io/assets/pdf/lynch-body-motions.pdf
Pages: 78
Content preview: Chapter 3
Rigid-Body Motions
In the previous chapter, we saw that a minimum of six numbers is needed
to specify the position and orientation of a rigid body in three-dimensional
physical space. In this chapter we develop a systematic way to describe a rigid
body’s position and orientation which relies on attaching a reference frame to
the body. The configuration of this frame with respect to a fixed reference frame
is then represented as a 4 × 4 matrix. This matrix is an example of an implicit
rep
Video: https://www.youtube.com/watch?v=WCUNPb-5EYI
Content preview: applications of machine learning have gotten a lot of traction in the last few years there's a couple of big categories that have had wins one is identifying pictures the equivalent of finding cats on the internet and any problem that can be made to look like that and the other is sequence to sequence translation this can be spee

In [6]:
# Cleanup
storage.close()
print("Storage connection closed.")

Storage connection closed.
