# Mirror UB CSE public pages (engineering.buffalo.edu)

This notebook documents a **safe, reproducible workflow** to download (mirror) the publicly accessible pages starting from:

- `https://engineering.buffalo.edu/computer-science-engineering.html`

Goal:
- **Include** pages on `engineering.buffalo.edu` that belong to the CSE section (primarily under `/computer-science-engineering/`).
- **Exclude** external domains (e.g., `cse.buffalo.edu`, `www.buffalo.edu`, YouTube, etc.).
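
The include/exclude goal above can be sketched as a small predicate. This is only an approximation of the host and directory filters that the `wget` commands later in this notebook apply, not a reimplementation of `wget`; the names `in_scope`, `ALLOWED_HOST`, and `INCLUDED_DIR` are hypothetical and chosen for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed values, taken from the goal stated above.
ALLOWED_HOST = "engineering.buffalo.edu"
INCLUDED_DIR = "/computer-science-engineering"


def in_scope(url, host=ALLOWED_HOST, included_dir=INCLUDED_DIR):
    """Return True if url is on the allowed host and its directory is
    included_dir or one of its subdirectories (an approximation)."""
    parts = urlsplit(url)
    if parts.hostname != host:
        return False
    directory = posixpath.dirname(parts.path)
    return directory == included_dir or directory.startswith(included_dir + "/")


for url in [
    "https://engineering.buffalo.edu/computer-science-engineering/sitemap.html",
    "https://engineering.buffalo.edu/computer-science-engineering.html",
    "https://cse.buffalo.edu/",
]:
    print(in_scope(url), url)
```

Under this approximation the sitemap URL is in scope and the other two are not. The landing page `computer-science-engineering.html` sits in the site root rather than inside `/computer-science-engineering/`, so a directory-based filter alone would likely skip it.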

We use `wget` (installed via Homebrew on macOS) and start from the section sitemap for coverage.


## 0) Install prerequisites (macOS)

### Install Homebrew (if you don't have it)

To install Homebrew, uncomment and run the command in the first code cell below.

> **Note**: If you already have Homebrew installed, you can skip this step.

### Install wget

Run the commands below to install wget and verify the installation.

> Tip: if `brew` isn't found after install, restart Terminal, or follow the Homebrew post-install instructions it prints.


In [None]:
# Install Homebrew (if you don't have it)
# Uncomment and run if you need to install Homebrew:
# !/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

In [None]:
# Install wget
!brew install wget
!wget --version
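
If you prefer to check from Python rather than reading the shell output, a minimal sketch along these lines may help. It assumes only the standard library; the helper name `wget_version_line` is hypothetical.

```python
import shutil
import subprocess


def wget_version_line():
    """Return the first line of `wget --version`, or None if wget is not on PATH."""
    wget_path = shutil.which("wget")
    if wget_path is None:
        return None
    result = subprocess.run(
        [wget_path, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


version = wget_version_line()
print(version if version is not None else "wget not found on PATH; install it first")
```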

## 1) Choose a destination folder

Pick (or create) a folder where you want the mirrored site to live.

Run the cell below to create the destination folder and make it the working directory, so that the `wget` commands in later steps write into it.


In [2]:
# Create and navigate to destination folder
import os
destination = os.path.expanduser("~/Documents/ub_cse_mirror")
os.makedirs(destination, exist_ok=True)
os.chdir(destination)
print(f"Working directory: {os.getcwd()}")

Working directory: /Users/carbonjo/Documents/UB_CSE_mirror


## 2) Dry-run: see what would be crawled (spider)

This checks URLs without saving them to disk. Start from the sitemap:

- `https://engineering.buffalo.edu/computer-science-engineering/sitemap.html`

Run the command below to perform the dry-run and list the URLs a real crawl would visit.

If you want a deeper dry-run, change `-l 3` to `-l 4` or `-l 5` in the command.


In [None]:
# Dry-run: spider mode (no downloads)
!wget --spider -r -l 3 -nd -nv \
  --domains=engineering.buffalo.edu \
  --include-directories=/computer-science-engineering \
  "https://engineering.buffalo.edu/computer-science-engineering/sitemap.html"

## 3) Mirror the CSE section (recommended default)

This mirrors HTML plus required assets (CSS/JS/images) and rewrites links so you can browse offline.

**Recommended default**: restrict to the CSE section path.

Run the command below to start the mirroring process. This may take a while depending on the size of the site.

After completion, open the mirrored sitemap locally (example path):
```text
engineering.buffalo.edu/computer-science-engineering/sitemap.html
```
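
Once the mirror has finished, a quick inventory of what landed on disk can confirm that HTML and assets were both captured. The sketch below assumes the mirror was written to `engineering.buffalo.edu/` inside the current working directory; `summarize_mirror` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def summarize_mirror(root):
    """Count files under root, grouped by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


# Assumed location: wget creates a folder named after the host.
mirror_root = Path("engineering.buffalo.edu")
if mirror_root.is_dir():
    for suffix, count in summarize_mirror(mirror_root).most_common():
        print(f"{count:6d}  {suffix}")
else:
    print(f"{mirror_root} not found; run the mirror step from the destination folder first")
```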


In [None]:
# Mirror the CSE section (recommended default)
!wget \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --domains=engineering.buffalo.edu \
  --include-directories=/computer-science-engineering \
  --wait=1 --random-wait \
  --execute robots=on \
  -o wget_mirror.log \
  "https://engineering.buffalo.edu/computer-science-engineering/sitemap.html"

## 4) Optional: include selected shared CMS pages under `/content/shared/...`

The spider output from step 2 may include URLs like:

- `.../news.host.html/content/shared/engineering/...`
- `.../faces-voices.host.html/content/shared/engineering/...`

These are on the same host but may fall **outside** `/computer-science-engineering/` in terms of directory structure.
If you notice missing articles/profiles, use this **more permissive** rule that still tries to avoid drifting into unrelated `university` shared content.

Run the command below to mirror with shared content included.

> If this gets too big, tighten the regex to only the specific `.host.html` sections you care about (e.g., `news.host.html`, `faces-voices.host.html`, `faculty-directory.host.html`).
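
Before starting a long crawl, it can be worth trying the accept/reject patterns from this step against a few URLs. The sketch below uses Python's `re` module as a stand-in for `wget`'s regex matching, on the assumption that both behave the same for these simple patterns. The URLs containing `example` are hypothetical and only illustrate the shape of the rules.

```python
import re

# Patterns copied from the wget command in this step.
ACCEPT = re.compile(
    r"engineering\.buffalo\.edu/(computer-science-engineering(\.html)?"
    r"|computer-science-engineering/.*"
    r"|.*\.host\.html/content/shared/engineering/)"
)
REJECT = re.compile(r"engineering\.buffalo\.edu/.*/content/shared/university/")


def would_fetch(url):
    """Approximate the rule: the URL must match ACCEPT and must not match REJECT."""
    return bool(ACCEPT.search(url)) and not REJECT.search(url)


samples = [
    "https://engineering.buffalo.edu/computer-science-engineering/sitemap.html",
    # Hypothetical URLs, for illustration only:
    "https://engineering.buffalo.edu/example/news.host.html/content/shared/engineering/example.html",
    "https://engineering.buffalo.edu/computer-science-engineering/news.host.html/content/shared/university/example.html",
    "https://www.buffalo.edu/",
]
for url in samples:
    print(would_fetch(url), url)
```

The first two samples pass, the third is caught by the reject pattern, and the last never matches the accept pattern. Note that the first alternative in the accept pattern is unanchored on the right, so it also matches any path that merely begins with `computer-science-engineering`.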


In [None]:
# Optional: Mirror with shared CMS pages included
!wget \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --domains=engineering.buffalo.edu \
  --accept-regex='engineering\.buffalo\.edu/(computer-science-engineering(\.html)?|computer-science-engineering/.*|.*\.host\.html/content/shared/engineering/)' \
  --reject-regex='engineering\.buffalo\.edu/.*/content/shared/university/' \
  --wait=1 --random-wait \
  --execute robots=on \
  -o wget_mirror_with_shared.log \
  "https://engineering.buffalo.edu/computer-science-engineering/sitemap.html"

## 5) Optional: avoid large downloads (videos/archives)

If the crawl starts pulling large media or archives you don't want, add the `--reject` flag to exclude specific file types.

Use the command below as an example. You can add `--reject=mp4,mov,zip,rar,7z,iso` to any of the wget commands above.
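
To decide which extensions are worth rejecting, it may help to look at the largest files from an earlier mirror run. This is a minimal sketch that assumes the mirror lives in `engineering.buffalo.edu/` under the current directory; `largest_files` is a hypothetical helper name.

```python
from pathlib import Path


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    files = [(p.stat().st_size, p) for p in Path(root).rglob("*") if p.is_file()]
    files.sort(key=lambda item: item[0], reverse=True)
    return files[:limit]


# Assumed location of a previous mirror run.
mirror_root = Path("engineering.buffalo.edu")
if mirror_root.is_dir():
    for size, path in largest_files(mirror_root):
        print(f"{size / 1_000_000:8.2f} MB  {path}")
else:
    print(f"{mirror_root} not found; nothing to inspect yet")
```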


In [3]:
# Example: mirror with large file types rejected
# Adjust the --reject list as needed. Note that -o overwrites any existing wget_mirror.log.
!wget \
  --mirror \
  --convert-links \
  --adjust-extension \
  --page-requisites \
  --domains=engineering.buffalo.edu \
  --include-directories=/computer-science-engineering \
  --reject=mp4,mov,zip,rar,7z,iso \
  --wait=1 --random-wait \
  --execute robots=on \
  -o wget_mirror.log \
  "https://engineering.buffalo.edu/computer-science-engineering/sitemap.html"

## 6) Notes & troubleshooting

- **"Remote file does not exist"** during spider/mirror usually means the site links to a page that no longer exists (404). The rest of the mirror is unaffected.
- If pages look broken offline, ensure you used `--page-requisites` and `--convert-links`.
- If content appears missing because it's rendered by JavaScript after load, `wget` may not capture it fully. In that case, consider a headless browser crawler (Playwright/Puppeteer) to render pages before saving.
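
To get a rough sense of how many broken links a run hit, you can count matching lines in the log file. The sketch below assumes the log is `wget_mirror.log` in the current directory and that failed fetches are logged with the text `ERROR 404`. Check a few lines of your own log to confirm the wording before relying on the count. `count_lines_containing` is a hypothetical helper name.

```python
from pathlib import Path


def count_lines_containing(log_path, needle="ERROR 404"):
    """Count the lines in log_path that contain `needle`."""
    count = 0
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if needle in line:
                count += 1
    return count


# Assumed log name, matching the -o option used in the mirror commands.
log_path = Path("wget_mirror.log")
if log_path.is_file():
    print(count_lines_containing(log_path), "log lines mention ERROR 404")
else:
    print(f"{log_path} not found; run a mirror step first")
```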
