What's Changed
- Update Slack Link by @karanataryn in #1400
- add initial extract transform + interfaces by @HenryL27 in #1396
- Add aggregation interface by @HenryL27 in #1391
- Add code for suggest properties. by @akarshgupta7 in #1401
- [extract] various improvements by @HenryL27 in #1403
- Initial implementation of SchemaV2. by @bsowell in #1404
- Use SchemaV2 for suggest schema/properties. by @akarshgupta7 in #1406
- adding property derivation by @Soeb-aryn in #1358
- Fix pagination; use Aryn SDK for storage operations by @austin-aryn-ai in #1407
- [Extract] Move to SchemaV2 by @HenryL27 in #1408
- Update Ray version. by @akarshgupta7 in #1409
- Add integration test for schema extract. by @akarshgupta7 in #1410
- Handle null values better. by @akarshgupta7 in #1411
- Add init files to fix integ tests for schema extract. by @akarshgupta7 in #1412
- Fix Unit Tests to Mock out External Calls by @karanataryn in #1413
- Remove Image Padding by @karanataryn in #1414
- Migrate to logger.warning usage by @emmanuel-ferdman in #1289
- Update Sycamore to leverage new Schema model. by @bsowell in #1415
- [extract] add text-search based attribution by @HenryL27 in #1416
- make pinecone wait 30s test not wait 30s by @HenryL27 in #1417
- [extract] put richproperties in entity_metadata and plain python props in entity by @HenryL27 in #1418
- [extract] default to datatype.string if prediction is none by @HenryL27 in #1419
- Update Table to HTML Function by @karanataryn in #1405
- Add support for document filtering when reading a DocSet by @austin-aryn-ai in #1420
- Add basic gpt-5 support. by @bsowell in #1422
- redact potential openai keys in opensearch requests by @HenryL27 in #1423
- [extract] misc fixes by @HenryL27 in #1424
- Add support for updating properties only by @austin-aryn-ai in #1425
- Bump Torch by @karanataryn in #1426
- Add reduce function as a parameter to suggest_schema. by @akarshgupta7 in #1427
- [extract] more fixes by @HenryL27 in #1429
- Convert "str" to "string" when handling old schema format. by @bsowell in #1430
- Allow property_type in addition to field_type when deserializing sche… by @bsowell in #1431
- Add frequency filter reduce method to suggest properties. by @akarshgupta7 in #1433
- Change Schema serialization to support backwards compatibility. by @bsowell in #1434
- Remove fitting of table row and column bounding boxes to the ocr tokens contained within them. by @vikram-ak in #1432
- Fix one additional test case with schema serialization. by @bsowell in #1435
- Add DataType aliases for backward compatibility. by @bsowell in #1437
- Upgrade pypdf to 6.0.0. by @bsowell in #1438
- Rewrite union_dropped_token_with_cells and update tests. by @vikram-ak in #1436
- Remove put_in_properties_dot_entity flag by @HenryL27 in #1439
- Default array item_type to string. by @bsowell in #1440
- Add code to support nested properties in suggest schema. by @akarshgupta7 in #1441
- [Extract] Add validators by @HenryL27 in #1442
- Fix merge logic in suggest schema. by @akarshgupta7 in #1443
- add threshold for dropped token assignment to an intersecting cell by @vikram-ak in #1444
- Move pillow import into TYPE_CHECKING by @vikram-ak in #1449
- Enforce minimum overlap threshold for token-to-cell assignment (>0) and related bug fixes by @vikram-ak in #1447
- Handle alias values better. by @akarshgupta7 in #1450
- Add Boolean Validator by @karanataryn in #1448
- faster import by @HenryL27 in #1453
- Fix new cell text and spans creation for include_additional_text. by @vikram-ak in #1451
- Remove a verbose print statement in the Aryn writer. by @bsowell in #1454
- Add existing schema to suggest schema method. by @akarshgupta7 in #1455
- Fix local mode for Aryn reader by @austin-aryn-ai in #1458
- [extract] add ziptraverse and move some implementations to it by @HenryL27 in #1452
- [extract] Add retries for validators by @HenryL27 in #1456
- Fix regression for linear plan rewriting by @bohou-aryn in #1459
- Improve extract_json and rendered prompts by @eric-anderson in #1460
- remove dependency on ziptraverse in schema.py by @HenryL27 in #1462
- If llm generation fails, stash the failing prompt into /tmp by @eric-anderson in #1463
- Update dependencies. by @bsowell in #1464
- [extract] fill in nulls where appropriate post-extraction by @HenryL27 in #1465
- Handle a case where large table headers cause max depth recursion by @austin-aryn-ai in #1466
- Make file extension comparison case insensitive in local mode. by @bsowell in #1468
- Ensure BoundingBox to_dict contains native floats by @austin-aryn-ai in #1469
- [Dependencies] Upgrade authlib. by @bsowell in #1473
- Torch speedups from benchmarking: low hanging fruit. by @alexaryn in #1472
- Add a hack so we can disable the automatic detr retry. Add a hack to help debug LLM prompts and results. by @eric-anderson in #1471
- Add support for model override at invocation of llm.generate by @austin-aryn-ai in #1359
- Add 'latest' models for Gemini by @austin-aryn-ai in #1477
- add dedent that strips leading newlines by @HenryL27 in #1476
- Bump Pip for Security Update by @karanataryn in #1478
- Fix resource leak in requests sessions by @austin-aryn-ai in #1479
- Descriptive names for PDF-to-PPM threads. by @alexaryn in #1480
- Reduce imports to reduce import time. by @alexaryn in #1483
- Change HybridTableStructureExtractor to default to deformable_detr instead of table_transformer by @MarkLindblad in #1474
- Bump Jupyter to Fix Security Vulnerability by @karanataryn in #1481
- extract bugfix by @HenryL27 in #1484
- Claude 4.5 Sonnet Support. by @bsowell in #1485
- Add tree display of import time. by @alexaryn in #1482
- [extract] fix bug where object property attributes would be dropped by @HenryL27 in #1486
- Serialize set value attributes as a list. by @bsowell in #1487
- Make datetime objects JSON serializable by @austin-aryn-ai in #1489
- PDF-to-image iterator that cleans up if interrupted. by @alexaryn in #1488
- Bump Langchain by @karanataryn in #1491
- Add a new unit test to verify correctness of get_llm by @austin-aryn-ai in #1492
- Sycamore: disable caching via NullCache. by @alexaryn in #1493
- Bump Authlib by @karanataryn in #1494
- Update schema deserialization fallback behavior. by @bsowell in #1497
- Upgrade Ray version. by @bsowell in #1495
- Handle a case where finishreason is not STOP and content is None by @austin-aryn-ai in #1498
- Update claude models with 4.1-opus, 4.5-haiku by @eric-anderson in #1500
- Attempt to fix flaky unit test due to Ray Dataset ordering by @bsowell in #1501
- Initial Iceberg writer. by @bsowell in #1499
- add union operator by @HenryL27 in #1496
- add apply docset method by @HenryL27 in #1490
- Upgrade pypdf to ^6.1.3. by @bsowell in #1502
- Bump Pip by @karanataryn in #1503
- Fix in and out token order by @austin-aryn-ai in #1504
- Fix Cache Serialization by @karanataryn in #1506
- Fix table missing cell hallucination by @bohou-aryn in #1507
- Make SchemaExtract customizable by @austin-aryn-ai in #1505
- Allow customization of suggest property user prompt by @austin-aryn-ai in #1508
- Allow JsonWriter to write MetadataDocument by @austin-aryn-ai in #1510
- Add helicone to gemini by @HenryL27 in #1512
- Add retries on schema extraction LLM calls by @austin-aryn-ai in #1509
- Fix metadata doc write by @austin-aryn-ai in #1513
- Add support for GPT 5.1. by @bsowell in #1514
- Support 'boolean' as an alias for 'bool' in SchemaV2. by @bsowell in #1515
- Update PdfMiner due to CVE-2025-64512. by @alexaryn in #1516
- Bump Scrapy by @karanataryn in #1518
- Bump Scrapy #2 by @karanataryn in #1519
- Add support for Gemini 3.0 Pro Preview. by @bsowell in #1520
- Perform property extraction using parallel calls to LLMs by @austin-aryn-ai in #1517
- Update Langchain by @karanataryn in #1523
- Add infer_schema that returns a DocSet by @austin-aryn-ai in #1511
- Add support for LLM-based attribution by @bsowell in #1522
- Make Aryn reader work for local exec mode by @austin-aryn-ai in #1524
- Fix a small bug in which attribution is unset. by @bsowell in #1525
- Handle bbox as an empty array by @austin-aryn-ai in #1526
- Add support for Gemini3 thinking_level parameter. by @eric-anderson in #1527
- Redo Langchain Bump by @karanataryn in #1532
- Don't default to all elements for LLM-based attribution. by @bsowell in #1533
- Upgrade to Ray 2.52.1. by @bsowell in #1534
- Add support for using original elements (pre-chunking) via Aryn reader by @austin-aryn-ai in #1535
- Ensure element indexes are there when reading by @austin-aryn-ai in #1537
- Update pypdf and fonttools. by @bsowell in #1536
- Poetry lock root. by @bsowell in #1538
- Cleanup Gemini LLM code, add a little more debugging on failures. by @eric-anderson in #1530
- Bump urllib3 from 2.5.0 to 2.6.1 due to CVEs. by @alexaryn in #1539
- Add exclusion support for DocFilter; extend it to OpenSearch reader by @austin-aryn-ai in #1529
- Force urllib3 version past CVE. by @alexaryn in #1542
- Make LLM.generate_metadata an official API by @eric-anderson in #1541
- Add extract_metadata to VLM table extractor for LLM stat extraction by @austin-aryn-ai in #1543
- Update langchain-core due to CVE-2025-68664. by @alexaryn in #1544
- Fix split element logic to move past header by @austin-aryn-ai in #1545
- Improve logic to add _header and column headers to split elements; abort splitting if max depth exceeded by @austin-aryn-ai in #1546
- Sycamore: upgrade pdfminer.six to 20251230 for CVE. by @alexaryn in #1547
- add async llm response checker chaining by @HenryL27 in #1549
- Update dependencies to address dependabot alerts. by @bsowell in #1548
- Bump pdfminer to latest to pick up cmap type fix by @austin-aryn-ai in #1551
- Sycamore: Upgrade aiohttp to 3.13.3 for CVEs by @alexaryn in #1550
- Sycamore: upgrade Authlib, urllib3, filelock by @alexaryn in #1556
- Speed up unit tests with fake ML models using pre-recorded ground truth by @bsowell in #1552
- More dependency updates: pypdf and additional re-locking. by @bsowell in #1557
- Bump the virtualenv dependency. by @bsowell in #1558
- Add a simple parser for evaluating a limited set of expressions for b… by @austin-aryn-ai in #1553
- Sycamore: fastnanoid is 3-4x faster than nanoid. by @alexaryn in #1561
- fix anthropic with images by @HenryL27 in #1560
- Update pyasn1 and re-lock. by @bsowell in #1562
- Support writing tables as HTML in Markdown output. by @bsowell in #1563
- Sycamore: update setuptools to 80.10.2 for CVEs. by @alexaryn in #1564
- Sycamore: revive poetry-lock-all; address CVEs by @alexaryn in #1566
- Handle boolean property extraction to continue extracting until first … by @austin-aryn-ai in #1565
- Introduce prediction mode to allow more flexibility in prediction usi… by @austin-aryn-ai in #1569
- Sycamore: Remove specifications of protobuf version. by @alexaryn in #1570
- Update unstructured and python-multipart. by @bsowell in #1571
- Update bbox_sort and xycut to support an optional reading order. by @bsowell in #1475
- Sycamore: upgrade paddleocr to 3.3 by @alexaryn in #1567
- Install paddlepaddle 3.3.x for test actions. by @alexaryn in #1572
- Add client args support for Anthropic by @eric-anderson in #1575
- Dependency Upgrades: scrapy, protobuf, nbconvert, pip by @bsowell in #1576
- Dependency upgrades: nbconvert and cryptography by @bsowell in #1578
- Sycamore: Upgrade pillow version due to CVE. by @alexaryn in #1579
- Sycamore: Stop using diskcache, which is vulnerable. by @alexaryn in #1580
- Sycamore: upgrade paddleocr to 3.4 for VL 1.5 by @alexaryn in #1577
- Handle import failures in schema.py. by @bsowell in #1581
- Handle import errors in one more place. by @bsowell in #1582
- Remove last remnants of guidance. by @bsowell in #1583
- Update pypdf to address CVEs. by @bsowell in #1584
- Sycamore: Upgrade Ray to 2.54.0 by @alexaryn in #1585
- Remove langchain as an optional dependency. by @bsowell in #1586
- Remove unstructured dependency and legacy-partitioners. by @bsowell in #1587
- Upgrade nltk and pypdf by @bsowell in #1588
- relock the root. by @bsowell in #1589
- Final attempt to get nltk updated by @bsowell in #1590
- Sycamore: push torch to 2.9 by @alexaryn in #1592
- Upgrade pypdf and authlib for CVEs. by @bsowell in #1593
- Dependency Upgrades. tornado, pypdf, and black. by @bsowell in #1594
- Sycamore: update packages due to CVEs. by @alexaryn in #1595
- Sycamore: Update requests, cbor2 for CVEs by @alexaryn in #1596
- Upgrade pypdf to 6.9.1 due to CVE. by @alexaryn in #1597
- Sycamore: Upgrade aiohttp due to CVEs. by @alexaryn in #1598
- Sycamore: Update transformers due to CVE. by @alexaryn in #1599
- add image_format llm kwarg to let me change the image format by @HenryL27 in #1600
- Sycamore: CVE updates and related changes by @alexaryn in #1601
- Switch LLM cache from pickle to JSON. by @bsowell in #1603
- Bump sycamore version to 0.1.34 by @bsowell in #1604
New Contributors
- @emmanuel-ferdman made their first contribution in #1289
Full Changelog: v0.1.33...v0.1.34