Provide classes to use FastWARC to read WARC/WAT/WET files #37

Closed
sebastian-nagel opened this issue Sep 20, 2022 · 0 comments

FastWARC (see also the FastWARC API docs) is a Python WARC parsing library that is

  • written in C++ for high performance
  • inspired by warcio, but not API-compatible with it
  • missing some less frequently used features, e.g. reading ARC files or (as of now) chunked transfer encoding

Ideally, the API differences between FastWARC and warcio should be hidden in methods of CCSparkJob or a derived class, so that users do not need to care about them except in very specific cases. Because of these differences, and because the C++ components must be compiled, using FastWARC should be optional.
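One way this could look is sketched below. It is only an illustration, not the implementation in this repository: the class names `WarcioAccess` and `FastWarcAccess` and the helper `select_access` are hypothetical, and the record attributes used (`rec_headers.get_header()` and `content_stream()` for warcio, `headers.get()` and `reader` for FastWARC) are assumed from the two libraries' documentation. Neither library is imported at module level, so FastWARC stays optional.

```python
import importlib.util


class WarcioAccess:
    """Hypothetical accessors for records yielded by warcio's ArchiveIterator."""

    @staticmethod
    def get_warc_header(record, name, default=None):
        # warcio keeps WARC headers in `rec_headers` (a StatusAndHeaders object)
        return record.rec_headers.get_header(name, default)

    @staticmethod
    def get_payload_stream(record):
        # warcio exposes the decoded payload through a method call
        return record.content_stream()


class FastWarcAccess:
    """Hypothetical accessors for records yielded by FastWARC's ArchiveIterator."""

    @staticmethod
    def get_warc_header(record, name, default=None):
        # assumption: FastWARC exposes WARC headers as a dict-like `headers` map
        return record.headers.get(name, default)

    @staticmethod
    def get_payload_stream(record):
        # assumption: FastWARC exposes the payload as a `reader` attribute
        return record.reader


def select_access(prefer_fastwarc=True):
    """Return the FastWARC accessors only if requested and installed."""
    if prefer_fastwarc and importlib.util.find_spec("fastwarc") is not None:
        return FastWarcAccess
    return WarcioAccess
```

Job code would then call, for example, `access.get_warc_header(record, 'WARC-Target-URI')` and never touch the record attributes directly, so switching parsers only means switching the accessor class.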
