Merge pull request #19 from eriknovak/feature/pattern-extractor

Adds a new Pattern Extractor and updates the package documentation
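
For orientation, the idea behind a pattern-based extractor can be sketched with Python's standard `re` module. This is a minimal illustration only, not the anonipy implementation: the `Match` tuple, the `extract_patterns` function, and the two regular expressions are hypothetical names and values chosen for the example, and the actual `PatternExtractor` API may differ.

```python
import re
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    """A hypothetical container for one extracted span."""

    label: str
    text: str
    start: int
    end: int


def extract_patterns(text: str, patterns: Dict[str, str]) -> List[Match]:
    """Return every regex match in `text`, tagged with its label.

    `patterns` maps a label to a regular expression (an assumption made
    for this sketch, not the library's label format).
    """
    found = []
    for label, regex in patterns.items():
        for m in re.finditer(regex, text):
            found.append(Match(label, m.group(0), m.start(), m.end()))
    # order the matches by their position in the text
    return sorted(found, key=lambda match: match.start)


# illustrative patterns; real-world formats need more careful expressions
patterns = {
    "date": r"\b\d{2}-\d{2}-\d{4}\b",
    "social security number": r"\b\d{3}-\d{2}-\d{4}\b",
}
sample = "Date of visit: 10-05-2024, SSN: 123-45-6789"
matches = extract_patterns(sample, patterns)

for match in matches:
    print(match.label, match.text, match.start, match.end)
```

Keeping the character offsets alongside each match is what lets a later step replace the spans in the original text without searching for them again.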
eriknovak · Jul 16, 2024 · 5510030
2 parents d0af713 + 56f700a
commit 5510030

Showing 67 changed files with 3,746 additions and 4,386 deletions.
diff --git a/.gitignore b/.gitignore
@@ -139,6 +139,6 @@ data/**/
 !data/README.md
 
 notebooks
-!docs/documentation/notebooks
+!docs/how-to-guides/notebooks
 
 scripts
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,30 +1,30 @@
-anonipy-0.0.8 (2024-06-17)
+### anonipy-0.0.8 (2024-06-17)
 
 - Add automatic date format detection support to DateGenerator
 
-anonipy-0.0.7 (2024-06-06)
+### anonipy-0.0.7 (2024-06-06)
 
 - Upgrade gliner-spacy to have cleaner code
 - Add function to help manual post-anonymization replacement fixing
 
-anonipy-0.0.6 (2024-05-31)
+### anonipy-0.0.6 (2024-05-31)
 
 - Add GPU support and entity scores to EntityExtractor
 - Standardize the function naming in strategies
 
-anonipy-0.0.5 (2024-05-29)
+### anonipy-0.0.5 (2024-05-29)
 
 - Re-implement file reading methods + add unit tests
 - Expland the test environment on all OS
 
-anonipy-0.0.4 (2024-05-27)
+### anonipy-0.0.4 (2024-05-27)
 
 - Add unit tests
 - Fix the LANGUAGES constant
 - Refine the Entity implementation
 - Update documentation
 
-anonipy-0.0.3 (2024-05-22)
+### anonipy-0.0.3 (2024-05-22)
 
 - Add read_json function
 - Add write_json function
@@ -33,11 +33,11 @@ anonipy-0.0.3 (2024-05-22)
 - Reduce the number of viable suggestions used to create a substitute in MaskLabelGenerator
 - Add the entity label to the replacements in strategies
 
-anonipy-0.0.2 (2024-05-22)
+### anonipy-0.0.2 (2024-05-22)
 
 - Add write_file function
 - Add blog to the documentation
 
-anonipy-0.0.1 (2024-05-21)
+### anonipy-0.0.1 (2024-05-21)
 
 - Initial release
diff --git a/README.md b/README.md
@@ -30,26 +30,24 @@
 
 The anonipy package is a python package for data anonymization. It is designed to be simple to use and highly customizable, supporting different anonymization strategies. Powered by LLMs.
 
-## ✅ Requirements
+## Requirements
 Before starting the project make sure these requirements are available:
 
 - [python]. The python programming language (v3.8, v3.9, v3.10, v3.11).
 
-## 💾 Install
+## Install
 
 ```bash
 pip install anonipy
 ```
 
-## ⬆️ Upgrade
+## Upgrade
 
 ```bash
 pip install anonipy --upgrade
 ```
 
-## 🔎 Example
-
-The details of the example can be found in the [Overview](https://eriknovak.github.io/anonipy/documentation/notebooks/00-overview.ipynb).
+## Example
 
 ```python
 original_text = """\
@@ -77,14 +75,14 @@ Use the language detector to detect the language of the text:
 ```python
 from anonipy.utils.language_detector import LanguageDetector
 
-lang_detector = LanguageDetector()
-language = lang_detector(original_text)
+language_detector = LanguageDetector()
+language = language_detector(original_text)
 ```
 
 Prepare the entity extractor and extract the personal infomation from the original text:
 
 ```python
-from anonipy.anonymize.extractors import EntityExtractor
+from anonipy.anonymize.extractors import NERExtractor
 
 # define the labels to be extracted and anonymized
 labels = [
@@ -94,14 +92,14 @@ labels = [
     {"label": "date", "type": "date"},
 ]
 
-# language taken from the language detector
-entity_extractor = EntityExtractor(labels, lang=language, score_th=0.5)
+# initialize the NER extractor for the language and labels
+extractor = NERExtractor(labels, lang=language, score_th=0.5)
 
 # extract the entities from the original text
-doc, entities = entity_extractor(original_text)
+doc, entities = extractor(original_text)
 
 # display the entities in the original text
-entity_extractor.display(doc)
+extractor.display(doc)
 ```
 
 Use generators to create substitutes for the entities:
@@ -123,9 +121,9 @@ def anonymization_mapping(text, entity):
     if entity.type == "string":
         return llm_generator.generate(entity, temperature=0.7)
     if entity.label == "date":
-        return date_generator.generate(entity, output_gen="middle_of_the_month")
+        return date_generator.generate(entity, output_gen="MIDDLE_OF_THE_MONTH")
     if entity.label == "date of birth":
-        return date_generator.generate(entity, output_gen="middle_of_the_year")
+        return date_generator.generate(entity, output_gen="MIDDLE_OF_THE_YEAR")
     if entity.label == "social security number":
         return number_generator.generate(entity)
     return "[REDACTED]"
@@ -143,7 +141,7 @@ pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)
 anonymized_text, replacements = pseudo_strategy.anonymize(original_text, entities)
 ```
 
-## 📖 Acknowledgements
+## Acknowledgements
 
 [Anonipy](https://eriknovak.github.io/anonipy/) is developed by the
 [Department for Artificial Intelligence](http://ailab.ijs.si/) at the

diff --git a/anonipy/__init__.py b/anonipy/__init__.py
@@ -1,25 +1,15 @@
-"""
-anonipy
-
-The anonipy package provides utilities for data anonymization.
-
-Submodules
-----------
-anonymize :
-    The package containing anonymization classes and functions.
-utils :
-    The package containing utility classes and functions.
-definitions :
-    The object definitions used within the package.
-constants :
-    The constant values used to help with data anonymization.
+"""`Anonipy` is a text anonymization package.
 
+The `anonipy` package provides utilities for data anonymization. It provides
+a set of modules and utilities for (1) identifying relevant information
+that needs to be anonymized, (2) generating substitutes for the identified
+information, and (3) strategies for anonymizing the identified information.
 
-How to use the documentation
-----------------------------
-Documentation is available in two forms: docstrings provided
-with the code and a loose standing reference guide, available
-from `the anonipy homepage <https://eriknovak.github.io/anonipy>`.
+Modules:
+    anonymize: The module containing the anonymization submodules and utility.
+    utils: The module containing utility classes and functions.
+    definitions: The module containing predefined types used across the package.
+    constants: The module containing the predefined constants used across the package.
 
 """
 

diff --git a/anonipy/anonymize/__init__.py b/anonipy/anonymize/__init__.py
@@ -1,29 +1,23 @@
-"""
-anonymize
+"""Module containing the anonymization modules and utility.
 
-The module provides a set of anonymization utilities.
+The `anonymize` module provides a set of anonymization modules and utility,
+including `extractors`, `generators`, and `strategies`. In addition, it provides
+methods for anonymizing text based on a list of replacements.
 
-Submodules
-----------
-extractors :
-    The module containing the extractor classes
-generators :
-    The module containing the generator classes
-strategies :
-    The module containing the strategy classes
-regex :
-    The module containing the regex patterns
+Modules:
+    extractors: The module containing the extractor classes.
+    generators: The module containing the generator classes.
+    strategies: The module containing the strategy classes.
 
-Methods
--------
-anonymize()
+Methods:
+    anonymize(text, replacements):
+        Anonymize the text based on the replacements.
 
 """
 
 from . import extractors
 from . import generators
 from . import strategies
-from . import regex
 from .helpers import anonymize
 
-__all__ = ["extractors", "generators", "strategies", "regex", "anonymize"]
+__all__ = ["extractors", "generators", "strategies", "anonymize"]
diff --git a/anonipy/anonymize/extractors/__init__.py b/anonipy/anonymize/extractors/__init__.py
@@ -1,18 +1,19 @@
-"""
-extractors
+"""Module containing the `extractors`.
 
-The module provides a set of extractors used in the library.
+The `extractors` module provides a set of extractors used to identify relevant
+information within a document.
 
-Classes
--------
-ExtractorInterface :
-    The class representing the extractor interface
-EntityExtractor :
-    The class representing the entity extractor
+Classes:
+    NERExtractor: The class representing the named entity recognition (NER) extractor.
+    PatternExtractor: The class representing the pattern extractor.
+    MultiExtractor: The class representing the multi extractor.
 
 """
 
 from .interface import ExtractorInterface
-from .entity_extractor import EntityExtractor
+from .multi_extractor import MultiExtractor
+from .ner_extractor import NERExtractor
+from .pattern_extractor import PatternExtractor
+
 
-__all__ = ["ExtractorInterface", "EntityExtractor"]
+__all__ = ["ExtractorInterface", "MultiExtractor", "NERExtractor", "PatternExtractor"]