added dataset validator #269

Merged 5 commits on May 9, 2024

@KennethEnevoldsen (Contributor) commented May 2, 2024

Full print from the validator:

INFO:__main__:Checking datasets in /work/dfm-data/pre-training against datasheets in /work/danish-foundation-models/docs/datasheets.
Checking dataset: lexdk:   0%|                                                                                                                         | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset lexdk failed validation ------------
ERROR:__main__:Datasheet lexdk does not exist.
Error reading datasheet lexdk: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/lexdk'
Error in document file lexdk_articles.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2021-01-20T13:16:40+01:00', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: hplt_mini:   0%|                                                                                                                     | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset hplt_mini failed validation ------------
ERROR:__main__:Datasheet hplt_mini does not exist.
Error reading datasheet hplt_mini: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/hplt_mini'
Error in document file da_1_32.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'id': '32000', 'text': '..., 'collection': 'cc40'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
Error in document file da_1_70.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'id': '70000', 'text': "..., 'collection': 'cc40'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
Error in document file da_1_10.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'id': '10000', 'text': '..., 'collection': 'cc40'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing

Checking dataset: eur-lex-sum-da:   0%|                                                                                                                | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset eur-lex-sum-da failed validation ------------
ERROR:__main__:Datasheet eur-lex-sum-da does not exist.
Error reading datasheet eur-lex-sum-da: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/eur-lex-sum-da'
Error in document file eur-lex-sum-da.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='1993-11-01T00:00:00.000Z...024-03-18T11:54:14.000Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: swedish_gigaword:   0%|                                                                                                              | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset swedish_gigaword failed validation ------------
ERROR:__main__:Datasheet swedish_gigaword does not exist.
Error reading datasheet swedish_gigaword: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/swedish_gigaword'
Error in document file swedish_gigaword_new.jsonl.gz: Source should be swedish_gigaword, but is Swedish gigaword
Error in document file swedish_gigaword.jsonl.gz: Source should be swedish_gigaword, but is Swedish gigaword

Checking dataset: hplt:   0%|                                                                                                                          | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset hplt failed validation ------------
ERROR:__main__:Datasheet hplt does not exist.
Error reading datasheet hplt: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/hplt'
Error in document file sv_2_12.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'id': '25910686', 'text'...'collection': 'wide15'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
Error in document file nb_1_2.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'id': '2000000', 'text':..., 'collection': 'cc40'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
Error in document file nb_1_4.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'id': '4000000', 'text':..., 'collection': 'cc40'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing

Checking dataset: scrape_hovedstaden:   0%|                                                                                                            | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset scrape_hovedstaden failed validation ------------
ERROR:__main__:Datasheet scrape_hovedstaden does not exist.
Error reading datasheet scrape_hovedstaden: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/scrape_hovedstaden'
Error in document file scrape_hovedstaden.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2023-11-16T13:44:00+01:0...24-04-04T09:09:00+02:00', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: ftspeech:   0%|                                                                                                                      | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset ftspeech failed validation ------------
ERROR:__main__:Datasheet ftspeech does not exist.
Error reading datasheet ftspeech: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/ftspeech'
Error in document file ft_lm_train_data.jsonl.gz: 1 validation error for Document
added
  Input should be a valid string [type=string_type, input_value=['2024-04-02T13:54:52.000Z'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.7/v/string_type

Checking dataset: danews2.0:   0%|                                                                                                                     | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset danews2.0 failed validation ------------
ERROR:__main__:Datasheet danews2.0 does not exist.
Error reading datasheet danews2.0: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/danews2.0'
Error in document file filtered_articles.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2011-04-12T00:00:00Z, 2012-04-11T00:00:00Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file articles.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2011-04-12T00:00:00Z, 2012-04-11T00:00:00Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: scandi-wiki:   0%|                                                                                                                   | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset scandi-wiki failed validation ------------
ERROR:__main__:Datasheet scandi-wiki does not exist.
Error reading datasheet scandi-wiki: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/scandi-wiki'
Error in document file sv.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2001-01-15T00:00:00.000Z...024-03-12T15:39:25.000Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file is.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2001-01-15T00:00:00.000Z...024-03-12T15:39:25.000Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file nb.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2001-01-15T00:00:00.000Z...024-03-12T15:39:25.000Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: scandi-reddit:   0%|                                                                                                                 | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset scandi-reddit failed validation ------------
ERROR:__main__:Datasheet scandi-reddit does not exist.
Error reading datasheet scandi-reddit: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/scandi-reddit'

Checking dataset: domsdatabasen:   0%|                                                                                                                 | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset domsdatabasen failed validation ------------
ERROR:__main__:Datasheet domsdatabasen does not exist.
Error reading datasheet domsdatabasen: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/domsdatabasen'
Error in document file domsdatabasen.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='1855-02-28T00:00:00.000Z...024-03-22T12:42:46.000Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: ai_aktindsigt:   0%|                                                                                                                 | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset ai_aktindsigt failed validation ------------
ERROR:__main__:Datasheet ai_aktindsigt does not exist.
Error reading datasheet ai_aktindsigt: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/ai_aktindsigt'
Error in document file ai_aktindsigt.jsonl.gz: 1 validation error for Document
created
  Field required [type=missing, input_value={'text': 'Vallensbæk Sta...be80948f767eb5fa04645'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing

Checking dataset: colossal_oscar_1_0:   0%|                                                                                                            | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset colossal_oscar_1_0 failed validation ------------
ERROR:__main__:Datasheet colossal_oscar_1_0 does not exist.
Error reading datasheet colossal_oscar_1_0: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/colossal_oscar_1_0'
Error in document file 05-19__sv_meta__sv_meta_part_2_0.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2019-05-20T05:24:02Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file 11-17__sv_meta__sv_meta_part_3_1.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2017-10-20T01:22:21Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file 06-07-22__sv_meta__sv_meta_part_1_0.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2022-07-01T10:12:52Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: dr-facebook:   0%|                                                                                                                   | 0/22 [00:00<?, ?it/s]ERROR:__main__:--- Dataset dr-facebook failed validation ------------
ERROR:__main__:Datasheet dr-facebook does not exist.
Error reading datasheet dr-facebook: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/dr-facebook'
Error in document file dr-facebook-data.jsonl.gz: 2 validation errors for Document
added
  Input should be a valid string [type=string_type, input_value=['2024-04-02T14:15:04.000Z'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.7/v/string_type
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='2021-03-06T00:00:00.000CET', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: mC4_da_cleaned:  64%|████████████████████████████████████████████████████████████████▉                                     | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset mC4_da_cleaned failed validation ------------
ERROR:__main__:Datasheet mC4_da_cleaned does not exist.
Error reading datasheet mC4_da_cleaned: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/mC4_da_cleaned'
Folder 'documents' does not contain any document files in dataset mC4_da_cleaned

Checking dataset: augmented_dagw:  64%|████████████████████████████████████████████████████████████████▉                                     | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset augmented_dagw failed validation ------------
ERROR:__main__:Datasheet augmented_dagw does not exist.
Error reading datasheet augmented_dagw: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/augmented_dagw'
File merged.jsonl.gz is not allowed in dataset augmented_dagw
File filtered.jsonl.gz is not allowed in dataset augmented_dagw
File new_merged.jsonl.gz is not allowed in dataset augmented_dagw
Error in document file augmented_dagw.jsonl.gz: 1 validation error for Document
added
  Value error, Timestamp 'added' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='Fri Jun 26 13:06:11 2020 CEST +0200', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: ncc:  64%|███████████████████████████████████████████████████████████████████████▉                                         | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset ncc failed validation ------------
ERROR:__main__:Datasheet ncc does not exist.
Error reading datasheet ncc: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/ncc'
Error in document file ncc.jsonl.gz: 1 validation error for Document
created
  Value error, Timestamp 'created' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE, YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='1968-01-01T00:00:00.000Z', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: mC4:  64%|███████████████████████████████████████████████████████████████████████▉                                         | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset mC4 failed validation ------------
ERROR:__main__:Datasheet mC4 does not exist.
Error reading datasheet mC4: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/mC4'
Folder 'documents' does not contain any document files in dataset mC4

Checking dataset: mC4_da:  64%|██████████████████████████████████████████████████████████████████████                                        | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset mC4_da failed validation ------------
ERROR:__main__:Datasheet mC4_da does not exist.
Error reading datasheet mC4_da: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/mC4_da'
Folder 'documents' does not contain any document files in dataset mC4_da

Checking dataset: dagw:  64%|███████████████████████████████████████████████████████████████████████▎                                        | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset dagw failed validation ------------
ERROR:__main__:Datasheet dagw does not exist.
Error reading datasheet dagw: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/dagw'
File whole_dataset.jsonl.gz is not allowed in dataset dagw
Error in document file dagw-ep.jsonl.gz: 1 validation error for Document
added
  Value error, Timestamp 'added' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='Wed Nov 20 10:15:08 2019 CET +0100', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file dagw-synne.jsonl.gz: 1 validation error for Document
added
  Value error, Timestamp 'added' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='Fri Jun 26 10:36:35 2020 CEST +0200', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error
Error in document file dagw-jvj.jsonl.gz: 1 validation error for Document
added
  Value error, Timestamp 'added' should be in the format 'YYYY-MM-DDTHH:MM:SS.TIMEZONE'. [type=value_error, input_value='Fri Jun 26 13:06:11 2020 CEST +0200', input_type=str]
    For further information visit https://errors.pydantic.dev/2.7/v/value_error

Checking dataset: memo:  64%|███████████████████████████████████████████████████████████████████████▎                                        | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset memo failed validation ------------
ERROR:__main__:Datasheet memo does not exist.
Error reading datasheet memo: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/memo'
Error in document file normalized_memo.jsonl.gz: Source should be memo, but is KB

Checking dataset: nordjylland_news:  64%|███████████████████████████████████████████████████████████████▋                                    | 14/22 [00:00<00:00, 135.26it/s]ERROR:__main__:--- Dataset nordjylland_news failed validation ------------
ERROR:__main__:Datasheet nordjylland_news does not exist.
Error reading datasheet nordjylland_news: [Errno 2] No such file or directory: '/work/danish-foundation-models/docs/datasheets/nordjylland_news'
Error in document file converted_train.jsonl.gz: Source should be nordjylland_news, but is TV2 Nord

Checking dataset: nordjylland_news: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [00:00<00:00, 102.29it/s]
ERROR:__main__:The following datasets failed validation:
ERROR:__main__:lexdk
 - hplt_mini
 - eur-lex-sum-da
 - swedish_gigaword
 - hplt
 - scrape_hovedstaden
 - ftspeech
 - danews2.0
 - scandi-wiki
 - scandi-reddit
 - domsdatabasen
 - ai_aktindsigt
 - colossal_oscar_1_0
 - dr-facebook
 - mC4_da_cleaned
 - augmented_dagw
 - ncc
 - mC4
 - mC4_da
 - dagw
 - memo
 - nordjylland_news
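
Most of the document-level failures above concern the 'added' and 'created' timestamps. The following is a minimal sketch of that kind of check, not the code in dataset_validator.py. It assumes the format string 'YYYY-MM-DDTHH:MM:SS.TIMEZONE' means millisecond precision followed by a literal 'Z', which is only one possible reading of the error message, and it may not reproduce every failure in the log (several input values are truncated there). The names TIMESTAMP_RE, is_valid_added, is_valid_created and normalise_legacy_added are hypothetical.

```python
import re
from datetime import datetime, timezone

# ASSUMPTION: the validator's real pattern is not shown in the log.
# This sketch reads 'YYYY-MM-DDTHH:MM:SS.TIMEZONE' as milliseconds plus 'Z'.
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")


def is_valid_added(value) -> bool:
    """Check a single timestamp, as the 'added' field is expected to hold."""
    # A list such as ['2024-04-02T13:54:52.000Z'] is rejected outright.
    return isinstance(value, str) and TIMESTAMP_RE.match(value) is not None


def is_valid_created(value) -> bool:
    """Check a 'start, end' pair, as the 'created' field is expected to hold."""
    if not isinstance(value, str):
        return False
    parts = value.split(", ")
    return len(parts) == 2 and all(TIMESTAMP_RE.match(p) for p in parts)


def normalise_legacy_added(value: str) -> str:
    """Convert e.g. 'Fri Jun 26 13:06:11 2020 CEST +0200' to the assumed format."""
    tokens = value.split()
    # Drop the zone abbreviation (CET/CEST); the numeric offset is sufficient.
    without_abbrev = " ".join(tokens[:5] + tokens[6:])
    parsed = datetime.strptime(without_abbrev, "%a %b %d %H:%M:%S %Y %z")
    # Express the instant in UTC so the trailing 'Z' is accurate.
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
```

Under these assumptions, a single value such as '2021-01-20T13:16:40+01:00' is rejected for 'created' because it is not a pair, and the dagw-style 'added' values can be rewritten with normalise_legacy_added before re-running the validator. The day and month names are parsed with %a and %b, so the conversion relies on an English (C) locale.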

@peterbjorgensen (Contributor) left a comment

Looks good. I just have a few specific comments.

data-processing/scripts/dataset_validator.py (outdated, resolved)
data-processing/scripts/dataset_validator.py (resolved)
@KennethEnevoldsen (Contributor, Author) commented

I have now made the corrections and will merge this.
