Add LXX Morphology downloader and README chat-question template #1
editor77 wants to merge 1 commit into
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d26e317163
    for link in links:
        if link.endswith("/"):
Prevent crawling outside the requested base URL
If the directory listing includes a parent link such as ../ (common in Apache-style indexes), parse_links will return it and the recursion will follow it because there is no guard that link stays under the original base_url. That means a run starting at .../lxxmorph/ can walk up to .../biblical/ and begin downloading unrelated directories (still matching *.mlxx), which is both unexpected and potentially huge. Consider rejecting links that are not prefixed by the initial base or explicitly skipping .. links before recursing.
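One way the suggested guard could look is sketched below. It is not the PR's code: is_within_base and filter_links are hypothetical helper names, the example.org URL is a placeholder, and the sketch assumes parse_links returns raw href strings that still need resolving against the page they came from.

```python
from urllib.parse import urljoin, urlsplit


def is_within_base(base_url: str, link: str) -> bool:
    """Return True if link, resolved against base_url, stays under base_url."""
    base = urlsplit(base_url)
    # urljoin resolves relative links and collapses "../" segments.
    target = urlsplit(urljoin(base_url, link))
    # Compare against a directory-style prefix so "/lxxmorph-old/" does not
    # pass as a child of "/lxxmorph".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    return (
        target.scheme == base.scheme
        and target.netloc == base.netloc
        and target.path.startswith(base_path)
    )


def filter_links(root_url: str, current_url: str, links: list[str]) -> list[str]:
    """Resolve links found on current_url and keep those under root_url."""
    kept = []
    for link in links:
        absolute = urljoin(current_url, link)
        if is_within_base(root_url, absolute):
            kept.append(absolute)
    return kept


if __name__ == "__main__":
    root = "https://example.org/biblical/lxxmorph/"  # placeholder URL
    print(filter_links(root, root, ["../", "01.Gen.1.mlxx", "sub/"]))
```

A parent link found inside a subdirectory can still resolve to a URL under the root (for example back to the root itself), so a crawler using this check would probably also want a set of already-visited URLs to avoid looping.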
Motivation
- […] *.mlxx corpus.
- […] User-Agent to increase compatibility with servers.
- […] README.md to standardize requests for analysis runs against the local *.mlxx files.
Description
- […] scripts/download_lxxmorph.py, a recursive downloader that parses HTML/gopher-style listings, follows links, creates output directories, and saves files idempotently.
- […] --include argument (default \.mlxx$) to filter filenames and a --user-agent argument (default Mozilla/5.0 (compatible; LXXMorphDownloader/1.0)) which is threaded into requests via fetch_bytes.
- […] urllib.parse.urljoin, and directories are created with ensure_directory.
- […] README.md with usage notes and add a Chat question template example for asking analysis requests against the local .mlxx corpus and associated manual.
Testing