Improve HTML Splitter, URL Fetching Logic & Astro Docs Ingestion #293

davidgxue · 2024-02-02T23:48:49Z

Description

This is the first part of a two-part effort to improve the scraping, extraction, chunking, and tokenizing logic for Ask Astro's data ingestion process (for details, see the issue Research & Implement: Data Ingestion Related Improvements #258).
This PR mainly focuses on reducing noise in the ingestion of the Astro Docs data source, along with related changes such as scraping only the latest doc versions and adding automatic exponential backoff to the HTML fetch function.

Closes the Following Issues

Partially Completes Issues

Research & Implement: Data Ingestion Related Improvements #258 (2 part effort, only 1 PR completed)
Handle big text ingestion #221 (tackles the token limit in the HTML splitting logic; other parts still need to be addressed)

Technical Details

airflow/include/tasks/extract/astro_docs.py
- Add function process_astro_doc_page_content, which strips noisy, unhelpful content such as the nav bar, footer, and header, and extracts only the main article content of the page.
- Remove the previous function scrape_page (which scraped the HTML content AND found and scraped all of its sub-pages by following the links it contained). It was removed because: 1. there is already a centralized util function, fetch_page_content(), that fetches each page's HTML elements; 2. there is already a centralized util function, get_internal_links, that finds all links in a page; 3. the old scraping process did not exclude noisy, unrelated content, which is now handled by process_astro_doc_page_content from the previous bullet point.
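To illustrate the kind of noise removal described for process_astro_doc_page_content, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the class name MainContentExtractor, the function extract_main_content, and the exact set of tags treated as noise are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: tags treated as noise. The PR names nav bar, footer and header;
# script and style are added here only for illustration.
NOISE_TAGS = {"nav", "footer", "header", "script", "style"}
CONTENT_TAGS = {"article", "main"}


class MainContentExtractor(HTMLParser):
    """Collect text found inside <article>/<main>, skipping noisy containers."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # > 0 while inside <article> or <main>
        self.noise_depth = 0    # > 0 while inside a noise tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in CONTENT_TAGS:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth:
            self.noise_depth -= 1
        elif tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside the main content and outside any noise tag.
        if text and self.content_depth and not self.noise_depth:
            self.parts.append(text)


def extract_main_content(html: str) -> str:
    """Return the article text of a page, one text node per line."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Calling extract_main_content on a page whose article is wrapped in a nav bar, header, and footer returns only the article text, which is the behavior the bullet above describes.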
airflow/include/tasks/split.py
- Modify function split_html: it previously split on specific HTML tags using HTMLHeaderTextSplitter, which is not ideal because we do not want to split that often and there is no guarantee that splitting on such tags retains semantic meaning. It now uses RecursiveCharacterTextSplitter with a token limit. A chunk is split ONLY when it exceeds the specified number of tokens. If a piece still exceeds the limit, the splitter goes down the separator list and splits further, ultimately splitting by space and by character until the piece fits within the token limit. This retains better semantic meaning in each chunk and enforces the token limit.
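The recursive, token-limited splitting described above can be sketched in plain Python. This is a standalone illustration of the idea, not LangChain's RecursiveCharacterTextSplitter and not the code in split.py: count_tokens is a stand-in that counts whitespace-separated words (a real tokenizer would be used in practice), and the names recursive_split and max_tokens are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def recursive_split(text, max_tokens, separators=("\n\n", "\n", " ")):
    """Split text only where needed so every chunk fits within max_tokens."""
    # Text that already fits is returned untouched (no unnecessary splits).
    if count_tokens(text) <= max_tokens or not separators:
        return [text] if text.strip() else []

    sep, finer_separators = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        # Greedily merge neighbouring pieces while they still fit the limit.
        candidate = f"{current}{sep}{piece}" if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Still too large: go down the separator list and split further.
            chunks.extend(recursive_split(piece, max_tokens, finer_separators))
    if current:
        chunks.append(current)
    return chunks
```

With this approach, paragraphs that fit together stay together, and only an oversized paragraph is broken down by finer separators.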
airflow/include/tasks/extract/utils/html_utils.py
- Change function fetch_page_content to add automatic retries with exponential backoff using tenacity.
- Change function get_page_links so that it traverses a given page recursively and finds all links belonging to the website. This ensures that no page is traversed twice and no page is missed. Previously, the logic missed some links, potentially because it used a for loop rather than traversing recursively until all links were exhausted.
- Note: This makes a huge difference in the URLs collected. Previously, many links looked like https://abc.com/abc#XXX and https://abc.com/abc#YYY, where the part after the hash is just a section of the same page, but the logic was not able to recognize them as the same page.
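The two html_utils.py changes can be sketched together as follows. This is a hedged illustration rather than the PR's code: the PR uses tenacity for retries, while with_backoff below hand-rolls the same idea with the standard library; the traversal uses an explicit queue instead of recursion, which visits the same set of pages; and the names with_backoff, normalize, crawl, and the fetch_links callback are hypothetical.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def with_backoff(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Wrap fetch(url) so failures are retried after base_delay * 2**n seconds."""
    def wrapped(url):
        for attempt in range(attempts):
            try:
                return fetch(url)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the last error
                sleep(base_delay * 2 ** attempt)
    return wrapped


def normalize(href, base):
    """Resolve a (possibly relative) link and drop its #fragment."""
    return urldefrag(urljoin(base, href)).url


def crawl(start_url, fetch_links):
    """Return every same-site page reachable from start_url, each visited once.

    fetch_links(url) is a hypothetical callback returning the hrefs on a page.
    """
    site = urlparse(start_url).netloc
    seen = {normalize(start_url, start_url)}
    queue = deque(seen)
    while queue:  # keep going until no unvisited links remain
        page = queue.popleft()
        for href in fetch_links(page):
            link = normalize(href, page)
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Because fragments are stripped before the seen check, https://abc.com/abc#XXX and https://abc.com/abc#YYY both collapse to https://abc.com/abc and the page is fetched only once.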
airflow/requirements.txt: add required packages
api/ask_astro/settings.py: remove unused variables
All other DAGs: change the batch size, as batches of 1k were hitting the OpenAI rate limit
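As a rough illustration of the batching change, documents can be sent in smaller groups with a helper like the one below. The helper name batched is hypothetical, and the new batch size is a placeholder because the PR description does not state the value chosen.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Each yielded batch can then be embedded and ingested in its own request, keeping individual calls below the rate limit.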

Results

Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement

Example of formatting and chunking

Previously (near unreadable)
Now (cleaned!)

Example of URLs difference

Previously
- Around 1000 links fetched. Many have DUPLICATE content since they point to the same page.
- XML files and other non-HTML (non-website) content were fetched
  See old links: astro_docs_links_old.txt
Now
- No more duplicate pages or unreleased pages
- No older versions of software docs; only the latest docs are ingested (e.g., the .../0.31... links are gone)
  new_astro_docs_links.txt

Evaluation

General improvement in response quality and in the quality of the documents retrieved. Better quoting from the docs.
See a subset of the evaluation results:
evaluation_result_improve_1.csv

cloudflare-pages · 2024-02-13T20:01:56Z

Deploying with Cloudflare Pages

Latest commit:	`bbfa510`
Status:	✅ Deploy successful!
Preview URL:	https://3291af56.ask-astro.pages.dev
Branch Preview URL:	https://improve-html-splitter-url-fe.ask-astro.pages.dev

davidgxue added 2 commits February 2, 2024 15:47

Improve HTML Splitter, URL Fetch Logic & Astro Docs Ingestion

6743f4c

delete unused env vars

c259298

davidgxue self-assigned this Feb 13, 2024

davidgxue commented Feb 21, 2024

davidgxue marked this pull request as ready for review February 21, 2024 06:37

davidgxue requested review from Lee-W, pankajastro and sunank200 as code owners February 21, 2024 06:37

Nit change

27e501a

sunank200 approved these changes Feb 21, 2024

davidgxue added 3 commits February 22, 2024 14:10

Change Batch Size Due to Rate Limit & Fix Issues of Split Func Used

0404f36

Clean requirements.txt

1476738

Change batch size and fix minor typo from previous commit

bbfa510

davidgxue merged commit c43ffc1 into main Feb 23, 2024
8 checks passed

davidgxue deleted the improve_html_splitter_url_fetch_and_astro_docs_ingestion branch February 23, 2024 00:27

This was referenced Feb 23, 2024

Add Exponential Backoff and Auto Retry in Fetch Page Content When Crawling #292

Closed

Improve Response Quality for Latest/Version Number Related Query #270

Closed

Change Ingestion for Astro SDK Doc to Only Ingest the Latest Version #209

Closed
