davidprush/keycollator

┬┌─┌─┐┬ ┬┌─┐┌─┐┬  ┬  ┌─┐┌┬┐┌─┐┬─┐
├┴┐├┤ └┬┘│  │ ││  │  ├─┤ │ │ │├┬┘
┴ ┴└─┘ ┴ └─┘└─┘┴─┘┴─┘┴ ┴ ┴ └─┘┴└─

Compares text in a file to a reference/glossary/key-items/dictionary file.[1][2]

🧱 Built by David Rush fueled by β˜•οΈ ℹ️ info

keycollator #.#.# Pypi Project Description


👇 Table of Contents

  1. Structure
  2. Features
  3. Installation
    1. Install from Pypi using pip3
  4. Documentation
  5. Supported File Formats
  6. Usage
    1. Import keycollator into Python Projects
    2. Requirements
    3. CLI
    4. Turn on verbose output
    5. Apply fuzzy matching
    6. Set the key file
    7. Set the text file
    8. Specify the output file
    9. Set limit results for console and output file
    10. Set upper bound limit
    11. Turn on logging:
    12. Create a log file
  7. Example Output
  8. Todo
  9. Project Resource Acknowledgements
  10. Deployment Features
  11. Releases
    1. Pypi Versions
  12. License
  13. Citation
  14. Additional Information

🗂️ Structure

.
│
├── assets
│   └── images
│       └── coverage.svg
│
├── docs
│   ├── cli.md
│   └── index.md
│
├── src
│   ├── __init__.py
│   ├── cli.py
│   ├── keycollator.py
│   ├── extractfile.py
│   ├── threadanalysis.py
│   ├── extractonator.py
│   ├── requirements.txt
│   └── data
│       ├── (placeholder)
│       └── (placeholder)
│
├── tests
│   └── test_keycollator
│       ├── __init__.py
│       └── test_keycollator.py
│
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── pyproject.toml
├── README.README
├── README.rst
├── setup.cfg
└── setup.py

🚀 Features

──> Extract text from file to dictionary
    └──> Extract keys from file to dictionary
          └──> Find matches of keys in text file
                └──> Apply fuzzy matching
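
The pipeline above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not keycollator's actual implementation: the names `load_lines` and `count_key_matches` are hypothetical, and matching here is a simple case-insensitive substring test with no fuzzy step.

```python
from collections import Counter
from pathlib import Path


def load_lines(path):
    """Read a file into a dict of {line number: stripped, lowercased text}."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {i: line.strip().lower() for i, line in enumerate(lines, 1) if line.strip()}


def count_key_matches(keys, text):
    """Count how many text lines contain each key as a substring."""
    counts = Counter()
    for key in keys.values():
        hits = sum(1 for line in text.values() if key in line)
        if hits:
            counts[key] = hits
    return counts


# Hypothetical in-memory data standing in for a key file and a text file.
keys = {1: "manage", 2: "python"}
text = {1: "manage a team", 2: "manage budgets", 3: "write python"}
print(count_key_matches(keys, text).most_common())
```

Keeping both inputs as dictionaries mirrors the "file to dictionary" steps listed above; a real run would build them with `load_lines` from the key and text files.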

🧰 Installation

🖥️ Install from Pypi using pip3

📦 https://pypi.org/project/keycollator/

python3 -m pip install --upgrade keycollator

📄 Documentation

Official documentation can be found here:

https://github.com/davidprush/keycollator/tree/main/docs

💪 Supported File Formats

  • TXT/CSV files (Mac/Linux/Win)
  • Plans to add PDF and JSON

📝 Usage

🖥️ Import keycollator into Python Projects

from keycollator.customlogger import CustomLogger as cl
from keycollator.proceduretimer import ProcedureTimer as pt

🖥️ Requirements

click >= 8.0.2
datetime >= 4.7
fuzzywuzzy >= 0.18.0
halo >= 0.0.31
nltk >= 3.7
pytest >= 7.1.3
python-Levenshtein >= 0.12.2
termtables >= 0.2.4
joblib >= 1.2.0
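
To see which of these requirements are already present in an environment, something like the following could be used. It is a sketch built on the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the check only reports versions without comparing them to the minimums listed above.

```python
from importlib import metadata

# Distribution names copied from the requirements list above.
REQUIRED = ["click", "fuzzywuzzy", "halo", "nltk", "termtables", "joblib"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```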

🖥️ CLI

keycollator uses command-line options to override its default parameters and behavior.

Usage: keycollator.py [OPTIONS] COMMAND [ARGS]...

  keycollator is an app that finds keys in a text file.

Options:
  -t, --text-file PATH          Path/file name of the text to be searched for
                                against items in the key file
  -k, --key-file PATH           Path/file name of the key file containing a
                                dictionary, key items, glossary, or reference
                                list used to search the text file
  -r, --result-file PATH        Path/file name of the output file that
                                will contain the results (CSV or TXT)
  --limit-result TEXT           Limit the number of results
  --abreviate INTEGER           Limit the text length of the results
                                (default=32)
  --fuzz-ratio INTEGER RANGE    Set the level of fuzzy matching (default=99)
                                to validate matches using approximations/edit
                                distances, uses acceptance ratios with integer
                                values from 0 to 99, where 99 is nearly
                                identical and 0 is not similar  [0<=x<=99]
  --ubound-limit INTEGER RANGE  Ignores items from the results with matches
                                greater than the upper boundary (upper-limit);
                                reduce eroneous matches  [1<=x<=99999]
  --lbound-limit INTEGER RANGE  Ignores items from the results with matches
                                less than the lower boundary (lower-limit);
                                reduce eroneous matches  [0<=x<=99999]
  -v, --verbose                 Turn on verbose
  -l, --logging                 Turn on logging
  -L, --log-file PATH           Path/file name to be used for the log file
  --help                        Show this message and exit.
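
When driving the tool from a script, the option names in the help text can be assembled into an argument list. The sketch below assumes a `keycollator` executable is available on the PATH and that no subcommand is required; `build_command` is a hypothetical helper, and only options shown in the help text are used.

```python
import shlex


def build_command(text_file, key_file, result_file=None, fuzz_ratio=None, verbose=False):
    """Assemble an argument list using option names from the help text."""
    args = ["keycollator", "--text-file", text_file, "--key-file", key_file]
    if result_file is not None:
        args += ["--result-file", result_file]
    if fuzz_ratio is not None:
        # The help text documents the accepted range as 0<=x<=99.
        if not 0 <= fuzz_ratio <= 99:
            raise ValueError("fuzz_ratio must be between 0 and 99")
        args += ["--fuzz-ratio", str(fuzz_ratio)]
    if verbose:
        args.append("--verbose")
    return args


command = build_command("text.txt", "keys.txt", fuzz_ratio=90, verbose=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.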

🖥️ Turn on verbose output

Currently provides a single verbosity level; future versions will implement multiple levels (DEBUG, INFO, WARN, etc.).

keycollator --verbose

🖥️ Apply fuzzy matching

Fuzzy matching accepts approximate matches based on edit distance. The ratio ranges from 0 to 99: 0 is the least strict and accepts nearly anything as a match, while 99 accepts only nearly identical matches. By default, the app applies level 99, and only when regular matching finds no matches.

keycollator --fuzz-ratio=[0-99]
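
The acceptance-ratio idea can be illustrated with the standard library. Note the substitution: this sketch uses `difflib.SequenceMatcher` rather than fuzzywuzzy (which keycollator lists as a requirement), so the scores are comparable in spirit but not guaranteed to be identical. The function names `similarity` and `is_fuzzy_match` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def is_fuzzy_match(key, candidate, fuzz_ratio=99):
    """Accept the candidate only if its score reaches the threshold."""
    return similarity(key, candidate) >= fuzz_ratio


for threshold in (99, 90, 0):
    print(threshold, is_fuzzy_match("manage", "managed", threshold))
```

With a high threshold only near-identical strings pass, while a threshold of 0 accepts any candidate, which matches the strictness scale described above.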

🖥️ Set the key file

Each line of the key file is one key, which is matched against the items in the text file.

keycollator --key-file="/path/to/key/file/keys.txt"

🖥️ Set the text file

Each line of the text file is one item, which is compared with the keys in the key file.

keycollator --text-file="/path/to/text/file/text.txt"

🖥️ Specify the output file

Currently writes CSV; additional file formats (PDF/JSON/DOCX) are planned for future releases.

keycollator --result-file="/path/to/results/result.csv"
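
As an illustration of what a CSV result file might contain, the sketch below writes key/count pairs with the standard `csv` module. The column names `key` and `count`, the sort order, and the helper `write_results` are assumptions made for this example; the layout keycollator actually writes is not documented here.

```python
import csv
import io


def write_results(counts, stream):
    """Write (key, count) rows to a text stream, highest count first."""
    writer = csv.writer(stream)
    writer.writerow(["key", "count"])  # hypothetical header names
    for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([key, count])


# Demonstrate with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_results({"develop": 62, "manage": 73}, buffer)
print(buffer.getvalue())
```

To write to disk, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass the file object as `stream`.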

🖥️ Set limit results for console and output file

Limits the number of results shown in the console and written to the output file.

keycollator --limit-result=30
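
Limiting results amounts to keeping the top N entries by match count. A minimal sketch, assuming results are held as a key-to-count mapping (`top_results` is a hypothetical name):

```python
from collections import Counter


def top_results(counts, limit=None):
    """Return (key, count) pairs sorted by count, truncated to `limit` entries."""
    return Counter(counts).most_common(limit)


print(top_results({"manage": 73, "develop": 62, "report": 58}, limit=2))
```

Passing `limit=None` returns every entry, still sorted from most to least frequent.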

🖥️ Set upper bound limit

Rejects items with more matches than the given integer value, which helps filter out erroneous matches when using fuzzy matching.

keycollator --ubound-limit=[1-99999]
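
The effect of the bounds can be sketched as a filter over match counts. The helper `apply_bounds` is hypothetical; its defaults follow the ranges shown for `--lbound-limit` and `--ubound-limit` in the help text, and it assumes both bounds are inclusive.

```python
def apply_bounds(counts, lbound=0, ubound=99999):
    """Keep only keys whose match count lies within [lbound, ubound]."""
    return {key: n for key, n in counts.items() if lbound <= n <= ubound}


counts = {"r": 536, "manage": 73, "flexible": 1}
print(apply_bounds(counts, ubound=100))  # drops the suspiciously frequent "r"
print(apply_bounds(counts, lbound=2))    # drops single-occurrence keys
```

A very short key such as "r" can match inside many unrelated words, which is the kind of inflated count an upper bound is meant to exclude.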

🖥️ Turn on logging:

Turns on logging; if the user supplies no log file, one is created with the default name log.log.

keycollator --logging

🖥️ Create a log file

Sets the name of the log file used by logging.

keycollator --log-file="/path/to/log/file/log.log"
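
The interaction between the `-l/--logging` switch and the `-L/--log-file` path can be sketched with the standard `logging` module. This is an assumption-laden illustration, not keycollator's own logger: `resolve_log_path`, `configure_logging`, and the logger name are hypothetical, and only the default file name log.log comes from the text above.

```python
import logging


def resolve_log_path(enabled, log_file=None, default_name="log.log"):
    """Return the log file path to use, or None when logging is off."""
    if not enabled:
        return None
    return log_file or default_name


def configure_logging(enabled=False, log_file=None):
    """Build a logger that writes to a file only when logging is enabled."""
    logger = logging.getLogger("keycollator_sketch")  # hypothetical logger name
    # Remove handlers from earlier calls so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    logger.setLevel(logging.INFO)
    path = resolve_log_path(enabled, log_file)
    if path is None:
        logger.addHandler(logging.NullHandler())
    else:
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


print(resolve_log_path(True))             # falls back to the default name
print(resolve_log_path(True, "run.log"))  # an explicit path takes precedence
```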

Example Output

python3 src/keycollator.py
Analyzing text for keys...
100%|██████████████████████████████████████████████████████████████| 679/679 [00:51<00:00, 13.31it/s]
1.r              [536]   51.conduct        [7]   101.connect       [3]   151.assist develo*[1]
2.manage          [73]   52.establish      [7]   102.determine     [3]   152.assist tracki*[1]
3.develop         [62]   53.execute        [7]   103.facilitate    [3]   153.capture speci*[1]
4.report          [58]   54.follow         [7]   104.foster        [3]   154.conduct code *[1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
47.finance        [8]    97.business admin*[3]   147.advise sponso*[1]   197.flexible      [1]
48.powerpoint     [8]    98.attention deta*[3]   148.advocate      [1]   198.creative      [1]
49.build          [7]    99.python         [3]   149.align documen*[1]   199.selfmotivated [1]
50.complete       [7]    100.collaborate   [3]   150.analyze under*[1]   200.difference la*[1]
[0.00]seconds
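
Long entries in the output above are cut short and marked with `*`. The sketch below reproduces that style of truncation; the helper `abbreviate` is hypothetical, the width of 18 characters is inferred from the sample lines rather than taken from the source code, and the full input string is invented for the example.

```python
def abbreviate(label, width=18):
    """Truncate `label` to `width` characters, marking cut-offs with '*'."""
    if len(label) <= width:
        return label
    return label[: width - 1] + "*"


print(abbreviate("151.assist development"))  # hypothetical full label
print(abbreviate("2.manage"))
```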

🎯 Todo 📌

    ❌ Fix pylint errors
    ❌ Add command line option to add a stopwords file
    ❌ Fix all cli options
    ❌ Add comments
    ❌ Refactor code and remove redundancies
    ❌ Add proper error handling
    ❌ Add CHANGELOG.md
    ❌ Add a method to KeyKrawler to select and create missing files
    ❌ Update CODE_OF_CONDUCT.md
    ❌ Update CONTRIBUTING.md
    ❌ GitHub: issue and PR templates
    ❌ Workflow Automation
    ❌ Makefile Usage
    ❌ Dockerfile
    ❌ @dependabot configuration
    ❌ Release Drafter (release-drafter.yml)

👔 Project Resource Acknowledgements

  1. Creating a Python Package
  2. javiertejero

💼 Deployment Features (Not yet implemented)

Feature          Notes
GitHub           issue and pr templates
Workflows        Automate your workflow from idea to production
Makefile-usage   Makefile Usage
Dockerfile       Docker Library: Python
@dependabot      Configuring Dependabot version updates
Release Drafter  release-drafter.yml

📈 Releases

Release   Version  Status
Current:  0.0.5    Working

📦 Pypi Versions

Version  Notes
0.0.1    Initial prototype
0.0.2    Bug fixes
0.0.4    Fixed functions/methods
0.0.5    Fixed functions/methods

🛑 License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

📄 Citation

@misc{keycollator,
  author = {David Rush},
  title = {Compares text in a file to reference/glossary/key-items/dictionary file.},
  year = {2022},
  publisher = {Rush Solutions, LLC},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davidprush/keycollator}}
}

Additional Information

  1. The latest version of this document can be found here; if you are viewing it there (via HTTPS), you can download the Markdown/reStructuredText source here.
  2. You can contact the author via e-mail.
