Add LXX Morphology downloader and README chat-question template #1

Open
editor77 wants to merge 1 commit into main from codex/access-gopher-text-resource

Conversation

@editor77
Owner

Motivation

  • Provide a small utility to fetch the LXX Morphology corpus from the UPenn mirror so the repository can populate a local *.mlxx corpus.
  • Restrict downloads to MLXX files by default to avoid pulling unrelated mirror content, and add a configurable User-Agent to improve compatibility with servers.
  • Add a short chat-style question template to README.md to standardize requests for analysis runs against the local *.mlxx files.

Description

  • Add scripts/download_lxxmorph.py, a recursive downloader that parses HTML/gopher-style listings, follows links, creates output directories, and saves files idempotently.
  • Implement an --include argument (default \.mlxx$) to filter filenames and a --user-agent argument (default Mozilla/5.0 (compatible; LXXMorphDownloader/1.0)) which is threaded into requests via fetch_bytes.
  • Ensure existing files are skipped, links are normalized with urllib.parse.urljoin, and directories are created with ensure_directory.
  • Update README.md with usage notes and add a chat question template for requesting analysis of the local .mlxx corpus and its associated manual.
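
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/download_lxxmorph.py: the names fetch_bytes and ensure_directory and the two defaults come from the description, but their signatures, the positional base_url and output_dir arguments, and the helpers should_download and save_if_missing are assumptions made for the example.

```python
import argparse
import re
import urllib.request
from pathlib import Path

# Defaults quoted from the PR description.
DEFAULT_INCLUDE = r"\.mlxx$"
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; LXXMorphDownloader/1.0)"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Download LXX Morphology files.")
    # Positional names are assumptions; the PR text does not list them.
    parser.add_argument("base_url")
    parser.add_argument("output_dir")
    parser.add_argument("--include", default=DEFAULT_INCLUDE,
                        help="regex a filename must match to be downloaded")
    parser.add_argument("--user-agent", default=DEFAULT_USER_AGENT,
                        help="User-Agent header sent with every request")
    return parser


def fetch_bytes(url: str, user_agent: str) -> bytes:
    # Assumed signature: the User-Agent is attached to each request.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()


def ensure_directory(path: Path) -> None:
    # Create the directory and any missing parents; no error if it exists.
    path.mkdir(parents=True, exist_ok=True)


def should_download(filename: str, include: str) -> bool:
    # Hypothetical helper: apply the --include filter to a filename.
    return re.search(include, filename) is not None


def save_if_missing(target: Path, url: str, user_agent: str,
                    fetch=fetch_bytes) -> bool:
    # Hypothetical helper: skip files that already exist so reruns are idempotent.
    if target.exists():
        return False
    ensure_directory(target.parent)
    target.write_bytes(fetch(url, user_agent))
    return True
```

Passing the fetch function as a parameter is a choice made for this sketch only; it lets the skip-existing logic be checked without touching the network, which is relevant given that no network validation was run for this PR.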

Testing

  • No automated tests were run for this change.
  • No network-based validation was run either, so remote download behavior has not been exercised.

Codex Task


@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d26e317163

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +85 to +86
    for link in links:
        if link.endswith("/"):
P2: Prevent crawling outside the requested base URL

If the directory listing includes a parent link such as ../ (common in Apache-style indexes), parse_links will return it and the recursion will follow it, because nothing checks that the link stays under the original base_url. A run starting at .../lxxmorph/ can therefore walk up to .../biblical/ and begin downloading from unrelated directories (any file matching *.mlxx), which is both unexpected and potentially huge. Consider rejecting links that are not prefixed by the initial base, or explicitly skipping .. links before recursing.
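
One way to implement the suggested guard is sketched below. It is illustrative only: is_within_base and filter_links are hypothetical names, not functions in the PR, and the sketch assumes links are resolved with urllib.parse.urljoin (as the description states) before being compared against the starting URL.

```python
from typing import List
from urllib.parse import urljoin, urlsplit


def is_within_base(base_url: str, candidate_url: str) -> bool:
    # Hypothetical helper: True only if candidate_url is on the same host
    # and its path sits at or below the base directory.
    base = urlsplit(base_url)
    candidate = urlsplit(candidate_url)
    if (candidate.scheme, candidate.netloc) != (base.scheme, base.netloc):
        return False
    # Force a trailing slash so ".../lxxmorph-old/" does not match ".../lxxmorph".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    return candidate.path.startswith(base_path)


def filter_links(base_url: str, page_url: str, hrefs: List[str]) -> List[str]:
    # Resolve each href against the page it came from, then drop anything
    # that escapes the original base (parent links, absolute paths, other hosts).
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # urljoin collapses "../" segments
        if is_within_base(base_url, absolute):
            kept.append(absolute)
    return kept
```

Comparing resolved absolute URLs rather than matching on the literal string ".." also covers absolute-path links such as /other/ and links to a different host, which a plain ".." check would miss.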

