Question about the way to extract text from CC HTML #18

voladorlu · 2024-04-16T08:27:17Z

Hi @DeepSeekPH team, thanks so much for sharing such excellent work. I noticed that OpenWebMath uses a specialized pipeline to extract content from HTML instead of directly using the WET files from Common Crawl. How did you handle this problem? Did you also follow OpenWebMath and process the HTML with your own private pipeline? I look forward to your feedback.
