more text stats, consistent doc extensions, better packaging

@bdewilde bdewilde released this 06 Dec 14:59
· 102 commits to main since this release

New and Changed

  • Refactored and extended text statistics functionality (PR #350)
    • Added functions for computing measures of lexical diversity, such as the classic Type-Token Ratio and the more modern Hypergeometric Distribution Diversity
    • Added functions for counting token-level attributes, including morphological features and parts-of-speech, in a convenient form
    • Refactored all text stats functions to accept a Doc as their first positional arg, suitable for use as custom doc extensions (see below)
    • Deprecated the TextStats class: the other ways of accessing the underlying functionality are now more accessible and convenient, so a third is no longer needed.
  • Standardized functionality for getting/setting/removing doc extensions (PR #352)
    • Now, custom extensions are accessed by name, and users have more control over the process:

      >>> import textacy
      >>> from textacy import extract, text_stats
      >>> textacy.set_doc_extensions("extract")
      >>> textacy.set_doc_extensions("text_stats.readability")
      >>> textacy.remove_doc_extensions("extract.matches")
      >>> textacy.make_spacy_doc("This is a test.", "en_core_web_sm")._.flesch_reading_ease()
      118.17500000000001
    • Moved top-level extensions into spacier.core and extract.bags
    • Standardized extract and text_stats subpackage extensions to use the new setup, and made them more customizable

  • Improved package code, tests, and docs
    • Fixed outdated code and comments in the "Quickstart" guide, then renamed it "Walkthrough" since it wasn't actually quick; added a new and, yes, quick "Quickstart" guide to fill the gap (PR #353)
    • Added a pytest conftest file to improve maintainability and consistency of unit test suite (PR #353)
    • Improved quality and consistency of type annotations, everywhere (PR #349)
    • Note: Bumped Python version support from 3.7–3.9 to 3.8–3.10 to take advantage of new typing features in Python 3.8 and to formally support the current major version (PR #348)
    • Modernized and streamlined package builds and configuration (PR #347)
      • Removed deprecated setup.py and switched from setuptools to build for builds
      • Consolidated tool configuration in pyproject.toml
      • Extended and tidied up dev-oriented Makefile
      • Addressed some CI/CD issues
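
For readers unfamiliar with the Type-Token Ratio mentioned under the text statistics changes, here is a minimal sketch of the idea in plain Python. It is not textacy's implementation: the function name, the lowercasing step, and the whitespace tokenization are all assumptions made for illustration, whereas the library's functions take a spaCy Doc as their first argument.

```python
def type_token_ratio(tokens):
    """Return the number of unique tokens (types) divided by the total token count."""
    # Lowercasing so that "The" and "the" count as one type is an assumption
    # of this sketch, not necessarily how textacy normalizes tokens.
    normalized = [tok.lower() for tok in tokens]
    if not normalized:
        raise ValueError("tokens must be non-empty")
    return len(set(normalized)) / len(normalized)


# Hypothetical input: whitespace-split words stand in for a tokenized Doc.
tokens = "the cat sat on the mat".split()
ttr = type_token_ratio(tokens)  # 5 unique types out of 6 tokens
print(ttr)
```

A higher ratio suggests a more varied vocabulary, though the measure is known to be sensitive to text length.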

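The readability score in the doc extensions example above can be checked by hand with the standard Flesch Reading Ease formula. The sketch below is an illustration only, not textacy's code: the function name and parameters are hypothetical, and the counts (4 words, 1 sentence, 4 syllables for "This is a test.") are assumed by inspection rather than produced by the library.

```python
def flesch_reading_ease_from_counts(n_words, n_sents, n_syllables):
    """Standard Flesch Reading Ease formula, computed from raw counts."""
    words_per_sentence = n_words / n_sents
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Counts assumed by hand for "This is a test.": 4 words, 1 sentence, 4 syllables.
score = flesch_reading_ease_from_counts(n_words=4, n_sents=1, n_syllables=4)
print(round(score, 3))
```

Under those assumed counts the result is approximately 118.175, consistent with the value shown in the example session.
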
Fixed

  • Added missing import and args in TextStats docs (PR #331, Issue #334)
  • Fixed normalization in YAKE keyword extraction (PR #332)
  • Fixed text encoding issue when loading ConceptNet data on Windows systems (Issue #345)

Contributors

Thanks to @austinjp, @scarroll32, and @mirkolenz for their help!