Add Ingestion for Cosmos Docs #290

Merged
merged 10 commits into from
Feb 22, 2024

davidgxue (Collaborator) commented Feb 1, 2024

Description

  • Ingest the Cosmos docs.
  • The website is small (fewer than 50 pages of content), so minimal to no disruption to existing retrieval quality is expected.

Technical Details

  • airflow/dags/ingestion/ask-astro-load-cosmos-docs.py: incremental ingestion that runs periodically
  • airflow/dags/ingestion/ask-astro-load.py: adds extraction and ingestion of the Cosmos docs to the bulk load
  • airflow/include/tasks/extract/cosmos_docs.py: main file that handles extraction from the Cosmos website
    • Note: only the main body of each page is extracted, to minimize noise, so the parsing logic is tailored to this data source (e.g. it looks for the article tag in the HTML body)
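
The "main body only" idea from the note above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in cosmos_docs.py; the class name ArticleTextParser, the function extract_article_text, and the sample HTML are all hypothetical, and the actual implementation may use a different parser.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Hypothetical helper: collect text found inside <article> elements only."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # number of currently open <article> tags
        self._chunks = []  # text fragments seen inside an <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while inside an <article>.
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_article_text(html: str) -> str:
    """Return the text of the page's <article> body, dropping nav/footer noise."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


# Made-up sample page, for illustration only.
sample = """
<html><body>
<nav>Home | Docs</nav>
<article><h1>Getting Started</h1><p>Install Cosmos with pip.</p></article>
<footer>Copyright</footer>
</body></html>
"""
result = extract_article_text(sample)
print(result)
```

Navigation and footer text never reach the output because they sit outside the article tag, which is the noise reduction the note describes. A fuller version would probably also skip script and style content nested inside the article.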

Tests

  • This is what the dataframe looks like after extraction (the website is very small, so there is not much content overall)
    df_dump.csv
  • The Airflow UI confirms that ingestion works for the bulk load
    image
  • Incremental ingestion runs successfully in the Airflow UI
    image

Retrieval Quality Evaluation

  • Existing questions show no quality degradation
  • New questions specific to Cosmos are answered correctly (see the CSV below; note that the ordering of the references in this CSV is not the same as in the actual results)
    cosmos_ingest_1.csv

closes #277

davidgxue self-assigned this Feb 1, 2024

cloudflare-pages bot commented Feb 1, 2024

Deploying with Cloudflare Pages

Latest commit: 1b8037f
Status: ✅  Deploy successful!
Preview URL: https://6d425522.ask-astro.pages.dev
Branch Preview URL: https://ingest-cosmos-docs.ask-astro.pages.dev

davidgxue marked this pull request as ready for review February 1, 2024 06:27

josh-fell (Contributor) left a comment

Looks great. I left some drive-by, possibly naïve comments (I personally don't have much context on the implementation details), so take this review with a grain of salt.

airflow/dags/ingestion/ask-astro-load-cosmos-docs.py (outdated, resolved)
airflow/dags/ingestion/ask-astro-load-cosmos-docs.py (outdated, resolved)
airflow/dags/ingestion/ask-astro-load-cosmos-docs.py (outdated, resolved)
airflow/dags/ingestion/ask-astro-load-cosmos-docs.py (outdated, resolved)
airflow/dags/ingestion/ask-astro-load-cosmos-docs.py (outdated, resolved)
airflow/dags/ingestion/ask-astro-load.py (outdated, resolved)
airflow/dags/ingestion/ask-astro-load.py (resolved)
airflow/include/tasks/extract/cosmos_docs.py (resolved)
davidgxue and others added 2 commits February 20, 2024 15:05
Co-authored-by: Josh Fell <48934154+josh-fell@users.noreply.github.com>

sunank200 (Collaborator) left a comment

LGTM. I also suggest updating the documentation in the README to cover this change.

davidgxue added this to the 0.3.0 milestone Feb 22, 2024
davidgxue merged commit d6bc44b into main Feb 22, 2024
8 checks passed
davidgxue deleted the ingest_cosmos_docs branch February 22, 2024 09:04
Linked issue: Ingest Documents from Astronomer's Cosmos Website