Large refacto #12

Closed
wants to merge 84 commits into from

percevalw (Member) commented Feb 24, 2023

Description

Complete refactoring and many new features in this PR 🎉, already tested internally at APHP.

Core features

  • new pipeline system whose API is inspired by spaCy
  • first-class support for PyTorch
  • hybrid model inference and training (rules + deep learning)
  • moved from pandas DataFrame to dataclasses (attrs) for representing PDF documents
  • new configuration system, whose API is inspired by thinc/confection, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...
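To make the pipeline idea above concrete, here is a minimal, self-contained sketch of a spaCy-style pipeline operating on dataclass documents. Everything in it (`TextBox`, `PDFDoc`, `Pipeline`, `add_pipe`, the two components) is a hypothetical illustration written for this note and is not the edspdf API. It also uses the standard-library `dataclasses` module where the PR uses attrs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TextBox:
    """One text box on a page (hypothetical structure, not the edspdf class)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    label: str = ""


@dataclass
class PDFDoc:
    """A parsed PDF held as plain objects rather than DataFrame rows."""
    id: str
    lines: List[TextBox] = field(default_factory=list)


# A component is any callable that takes a document and returns a document.
Component = Callable[[PDFDoc], PDFDoc]


class Pipeline:
    """Applies named components to a document, in the order they were added."""

    def __init__(self) -> None:
        self._pipes: List[Tuple[str, Component]] = []

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self._pipes]

    def add_pipe(self, name: str, component: Component) -> None:
        if name in self.pipe_names:
            raise ValueError(f"Pipe {name!r} already exists")
        self._pipes.append((name, component))

    def __call__(self, doc: PDFDoc) -> PDFDoc:
        for _, component in self._pipes:
            doc = component(doc)
        return doc


def drop_empty_boxes(doc: PDFDoc) -> PDFDoc:
    """Remove boxes whose text is empty or whitespace only."""
    doc.lines = [box for box in doc.lines if box.text.strip()]
    return doc


def rule_based_classifier(doc: PDFDoc) -> PDFDoc:
    """Toy rule: boxes ending in the top 10% of the page are headers."""
    for box in doc.lines:
        box.label = "header" if box.y1 <= 0.1 else "body"
    return doc


pipeline = Pipeline()
pipeline.add_pipe("cleaner", drop_empty_boxes)
pipeline.add_pipe("classifier", rule_based_classifier)

doc = PDFDoc(
    id="example",
    lines=[
        TextBox("Hospital report", 0.1, 0.02, 0.9, 0.06),
        TextBox("   ", 0.1, 0.20, 0.9, 0.22),
        TextBox("The patient was admitted...", 0.1, 0.30, 0.9, 0.34),
    ],
)
doc = pipeline(doc)
labels = [box.label for box in doc.lines]
```

Because each component only has to accept and return a document, a rule-based step and a trained model could in principle sit in the same pipeline, which is one way to read the "hybrid" item above.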

Functional features

  • new extractors: pymupdf and poppler (separate packages for licensing reasons)
  • many deep learning layers (box-transformer, 2d attention with relative position information, ...)
  • deep learning classifier
  • training recipes for deep learning models
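The coverage report on this PR lists a `edspdf/layers/sinusoidal_embedding.py` module among the new layers. As a rough illustration of the general technique only, the sketch below computes the standard sinusoidal embedding of a scalar position in pure Python. The function name, the `max_period` parameter and the interleaved sin/cos layout are assumptions made for this example and may differ from what the PR implements.

```python
import math
from typing import List


def sinusoidal_embedding(
    position: float, dim: int, max_period: float = 10000.0
) -> List[float]:
    """Embed a scalar position as interleaved sin/cos values.

    Pair i uses the frequency max_period ** (-2 * i / dim), so early pairs
    vary quickly with the position and later pairs vary slowly.
    """
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    embedding: List[float] = []
    for i in range(dim // 2):
        freq = max_period ** (-2 * i / dim)
        embedding.append(math.sin(position * freq))
        embedding.append(math.cos(position * freq))
    return embedding


emb_at_0 = sinusoidal_embedding(0.0, 8)
emb_at_1 = sinusoidal_embedding(1.0, 8)
```

For layout models, the position could be a normalized box coordinate rather than a token index, with one such embedding per coordinate; that use is also an assumption here.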

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation.

codecov-commenter commented Mar 2, 2023

Codecov Report

Patch coverage: 94.98% and project coverage change: -4.95 ⚠️

Comparison is base (2efafd3) 100.00% compared to head (bc9e10e) 95.05%.

Additional details and impacted files
@@             Coverage Diff             @@
##            master      #12      +/-   ##
===========================================
- Coverage   100.00%   95.05%   -4.95%     
===========================================
  Files           33       32       -1     
  Lines          594     1900    +1306     
===========================================
+ Hits           594     1806    +1212     
- Misses           0       94      +94     
Impacted Files                                          Coverage Δ
edspdf/utils/torch.py                                   53.84% <53.84%> (ø)
...f/components/embeddings/box_layout_preprocessor.py   86.66% <86.66%> (ø)
edspdf/layers/relative_attention.py                     90.62% <90.62%> (ø)
edspdf/visualization/annotations.py                     96.87% <90.90%> (-3.13%) ⬇️
edspdf/pipeline.py                                      94.30% <94.30%> (ø)
edspdf/layers/sinusoidal_embedding.py                   95.00% <95.00%> (ø)
edspdf/utils/collections.py                             95.29% <95.29%> (ø)
edspdf/utils/optimization.py                            95.83% <95.83%> (ø)
edspdf/layers/vocabulary.py                             96.00% <96.00%> (ø)
...pdf/components/embeddings/simple_text_embedding.py   96.18% <96.18%> (ø)
... and 22 more
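The project-level figures in the coverage diff follow directly from the Hits and Lines rows. The short check below recomputes them; the helper name `coverage_percent` is made up for this example, and rounding to two decimals is an assumption about how the report formats its numbers.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, rounded to two decimals."""
    return round(100.0 * hits / lines, 2)


# Values taken from the Lines and Hits rows of the coverage diff.
base = coverage_percent(594, 594)     # master
head = coverage_percent(1806, 1900)   # this PR
change = round(head - base, 2)
misses = 1900 - 1806
```

With these inputs, `base` is 100.0, `head` is 95.05, `change` is -4.95 and `misses` is 94, matching the diff.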

3 participants