[Question]: Best way to save crawled articles #529

Closed
stefan-it opened this issue May 31, 2024 · 4 comments · Fixed by #530
Labels
question Further information is requested

Comments

@stefan-it
Member

Question

Hi,

many thanks for releasing this great crawler! Particularly, the number of supported German publishers is amazing - I am planning to collect some articles for LM pretraining.

I opened this issue because I couldn't find an example in the docs: what is the best and recommended way to export articles into e.g. a jsonl file? I could think of adding a to_json function to an Article object and then writing the result to a file 🤔

But it would be great if the documentation could also cover exporting articles :)
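
To make the idea concrete, here is a minimal sketch of what such an export could look like. It does not use the Fundus API: `ArticleRecord`, its fields, `to_json`, and `write_jsonl` are all hypothetical names chosen for illustration, and the stand-in class only assumes that an article can be reduced to a few plain fields.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Iterable, List


@dataclass
class ArticleRecord:
    """Hypothetical stand-in for a crawled article (not the Fundus Article class)."""

    title: str
    body: str
    publisher: str
    topics: List[str] = field(default_factory=list)


def to_json(article: ArticleRecord) -> str:
    """Serialize one article to a single-line JSON string."""
    # ensure_ascii=False keeps umlauts and other non-ASCII text readable.
    return json.dumps(asdict(article), ensure_ascii=False)


def write_jsonl(articles: Iterable[ArticleRecord], path: Path) -> int:
    """Write one JSON object per line and return the number of articles written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for article in articles:
            handle.write(to_json(article) + "\n")
            count += 1
    return count
```

Because `write_jsonl` accepts any iterable, it could be fed directly from a generator of crawled articles, so the whole crawl would not need to be held in memory before writing.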

Many thanks in advance!

@stefan-it stefan-it added the question Further information is requested label May 31, 2024
@stefan-it
Member Author

Pinging @MaxDall for help :)

@stefan-it
Member Author

For now I came up with the following solution:

image

@alanakbik
Contributor

Thanks @stefan-it for pointing this out!

I think it would be good for Fundus to offer support for serializing articles. We'd need some helper methods to serialize/deserialize articles. JSON seems like a good fit since it is human-readable. @addie9800 what do you think?
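
As a rough illustration of what paired serialize/deserialize helpers might look like, the sketch below round-trips a record through JSON. It is not the Fundus implementation: `ArticleRecord`, `to_json`, `from_json`, and `read_jsonl` are hypothetical names, and the two-field record is an assumption made to keep the example small.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class ArticleRecord:
    """Hypothetical minimal article; the real Article class has more attributes."""

    title: str
    body: str

    def to_json(self) -> str:
        # One compact JSON object per article, with non-ASCII text kept readable.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, line: str) -> "ArticleRecord":
        # Raises KeyError if a required field is missing from the JSON object.
        data = json.loads(line)
        return cls(title=data["title"], body=data["body"])


def read_jsonl(lines: Iterable[str]) -> Iterator[ArticleRecord]:
    """Yield one record per non-empty line of JSONL input."""
    for line in lines:
        if line.strip():
            yield ArticleRecord.from_json(line)
```

Keeping `to_json` and `from_json` side by side makes the round trip easy to test: a record serialized and parsed again should compare equal to the original.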

@addie9800
Collaborator

I definitely agree, especially since we already use JSON to represent the parsed articles in our tests. @MaxDall has also already started working on an implementation.
