Open
Description
Steps:
- pseudo-crawl ~10% of C4 web pages from Common Crawl @tianjianjiang
- import the pseudo-crawled dataset on JZ @SaulLu
- run 1st step of extraction:
- run 2nd step of extraction:
- Extract Website descriptions @shanyas10 @SaulLu
- run 3rd step of extraction:
- run 4th step of extraction:
- Extract Paragraph @tianjianjiang @SaulLu
  - How do we define a paragraph? #114
  - feat: HTML scanner for text content & content sectioning elements → segment paragraphs #125
- annotator (preprocessor) of the metadata
- Modify entities metadata with paragraph information @manandey @SaulLu
- Modify generation length with paragraph information @chkla @SaulLu
- Extract Paragraph @tianjianjiang @SaulLu
- (optional) clean final dataset:
  - Remove empty lines @SaulLu
  - Remove "errors" columns @SaulLu
- (optional) Gather all metadata into same column @cccntu @timoschick @SaulLu
- push dataset to Hub @SaulLu
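
The optional cleaning step above could look roughly like the sketch below. It is only an illustration under stated assumptions: rows are plain dictionaries, "empty lines" means rows whose text field is missing or blank, and "errors" columns are any columns whose name contains `errors`. The column names `text` and `fetch_errors` and the helper names are hypothetical and are not taken from the actual dataset.

```python
from typing import Dict, List

Row = Dict[str, object]


def drop_error_columns(rows: List[Row], marker: str = "errors") -> List[Row]:
    """Return copies of the rows without columns whose name contains `marker`."""
    return [{k: v for k, v in row.items() if marker not in k} for row in rows]


def drop_empty_rows(rows: List[Row], text_column: str = "text") -> List[Row]:
    """Keep only rows whose `text_column` is a non-blank string."""
    return [
        row
        for row in rows
        if isinstance(row.get(text_column), str) and row[text_column].strip()
    ]


def clean(rows: List[Row]) -> List[Row]:
    """Apply both cleaning steps: drop error columns, then drop empty rows."""
    return drop_empty_rows(drop_error_columns(rows))


# Hypothetical sample rows; the real schema may differ.
sample = [
    {"text": "<html>hello</html>", "url": "https://example.com", "fetch_errors": None},
    {"text": "   ", "url": "https://example.org", "fetch_errors": "timeout"},
]
cleaned = clean(sample)
```

With a Hugging Face `datasets.Dataset`, the same logic would probably map onto `filter` and `remove_columns`, but the exact column names would need to be checked against the imported dataset first.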
Status
In Progress