Showing 16 changed files with 3,967 additions and 962 deletions.
@@ -1,2 +1,59 @@
# cophi-toolbox
A repository containing general functions for processing and accessing text corpora
# A library for preprocessing
`cophi` is a Python library for handling, modeling and processing text corpora. You
can easily pipe a collection of text files using the high-level API:

```python
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
```

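To get a feel for what `lowercase` and `token_pattern` do, here is a minimal standard-library sketch. It is not cophi's implementation: the helper `tokenize` is hypothetical, and because Python's built-in `re` module does not support `\p{L}` or `\p{P}`, the pattern below only approximates the one above (letters via `[^\W\d_]`, punctuation limited to apostrophes and hyphens).

```python
import re

# Stand-in for r"\p{L}+\p{P}?\p{L}+" (an approximation, see the note above):
# one or more letters, at most one inner punctuation mark, one or more letters.
TOKEN = re.compile(r"[^\W\d_]+['’\-]?[^\W\d_]+")


def tokenize(text, lowercase=True):
    """Return all tokens in `text` (hypothetical helper, not cophi's API)."""
    if lowercase:
        text = text.lower()
    return TOKEN.findall(text)


print(tokenize("Don't panic, Mr. Darcy!"))
```

Because the pattern needs at least one letter on each side of the optional punctuation mark, single-letter words such as "a" are not matched, while contractions such as "don't" stay in one piece.
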
## Requirements
This library is tested on Python 3.4 and higher. Some additional packages (pandas, numpy, lxml, regex) are required.

## Getting started
To install the latest **stable** version:
```
$ pip install git+https://github.com/cophi-wue/cophi-toolbox.git
```

To install the latest **development** version:
```
$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing
```

## Contents
- [`api`](src/cophi_toolbox/api.py): High-level API.
- [`model`](src/cophi_toolbox/model.py): Low-level model classes.
- [`complexity`](src/cophi_toolbox/complexity.py): Measures that assess the linguistic and stylistic complexity of (literary) texts.
- [`utils`](src/cophi_toolbox/utils.py): Low-level helper functions.


## Available complexity measures
Measures that use sample size and vocabulary size:
* Type-Token Ratio TTR
* Guiraud’s R
* Herdan’s C
* Dugast’s k
* Maas’ a<sup>2</sup>
* Dugast’s U
* Tuldava’s LN
* Brunet’s W
* Carroll’s CTTR
* Summer’s S

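The measures in this group depend only on the sample size N (number of tokens) and the vocabulary size V (number of types). The following is a minimal sketch of four of them, using their commonly cited definitions and natural logarithms. The function name `size_based_measures` is hypothetical and not part of cophi's API.

```python
import math


def size_based_measures(tokens):
    """Compute a few measures from sample size N and vocabulary size V.

    Hypothetical helper for illustration only, not cophi's implementation.
    """
    n = len(tokens)       # sample size N
    v = len(set(tokens))  # vocabulary size V
    if n < 2:
        raise ValueError("need at least two tokens (log N must be non-zero)")
    return {
        "ttr": v / n,                          # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),         # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n), # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),  # Carroll's CTTR: V / sqrt(2N)
    }


print(size_based_measures("a b a c".split()))
```

All four values shrink or grow with text length to some degree, so they are best compared across samples of similar size.
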
Measures that use part of the frequency spectrum:
* Honoré’s H
* Sichel’s S
* Michéa’s M

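These three measures look only at the types that occur once (V<sub>1</sub>, hapax legomena) or twice (V<sub>2</sub>, dis legomena). The sketch below uses their commonly cited definitions with natural logarithms. The names `frequency_spectrum` and `partial_spectrum_measures` are hypothetical, and returning infinity for undefined cases is a choice made for this example, not cophi's documented behavior.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types that occur exactly i times."""
    return Counter(Counter(tokens).values())


def partial_spectrum_measures(tokens):
    """Hypothetical helper for illustration only, not cophi's implementation."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    v = sum(spectrum.values())  # vocabulary size V
    v1 = spectrum.get(1, 0)     # types occurring once
    v2 = spectrum.get(2, 0)     # types occurring twice
    return {
        # Honoré's H: 100 * log N / (1 - V1 / V); undefined if every type is a hapax
        "honore_h": 100 * math.log(n) / (1 - v1 / v) if v1 < v else math.inf,
        # Sichel's S: V2 / V
        "sichel_s": v2 / v,
        # Michéa's M: V / V2; undefined if no type occurs exactly twice
        "michea_m": v / v2 if v2 else math.inf,
    }


print(partial_spectrum_measures("a a b b c d d d".split()))
```
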
Measures that use the whole frequency spectrum:
* Entropy S
* Yule’s K
* Simpson’s D
* Herdan’s V<sub>m</sub>

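Measures in this group sum over every frequency class i, weighted by the number of types V<sub>i</sub> that occur i times. A minimal sketch of entropy, Yule's K and Simpson's D follows, again using commonly cited definitions and natural logarithms. The function name `whole_spectrum_measures` is hypothetical and not part of cophi's API.

```python
import math
from collections import Counter


def whole_spectrum_measures(tokens):
    """Hypothetical helper for illustration only, not cophi's implementation."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    # spectrum[i] = number of types that occur exactly i times
    spectrum = Counter(Counter(tokens).values())
    return {
        # Entropy: -sum over i of V_i * (i / N) * log(i / N)
        "entropy": -sum(vi * (i / n) * math.log(i / n)
                        for i, vi in spectrum.items()),
        # Yule's K: 10^4 * (sum over i of i^2 * V_i - N) / N^2
        "yule_k": 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n)
                  / (n * n),
        # Simpson's D: sum over i of V_i * (i / N) * ((i - 1) / (N - 1))
        "simpson_d": sum(vi * (i / n) * ((i - 1) / (n - 1))
                         for i, vi in spectrum.items()),
    }


print(whole_spectrum_measures("a a b b c d d d".split()))
```

Yule's K and Simpson's D both grow when a few types dominate the text, whereas entropy grows when frequencies are spread evenly.
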
Parameters of probabilistic models:
* Orlov’s Z
4 changes: 2 additions & 2 deletions
docs/reference/cophi_toolbox.rst → docs/reference/cophi-toolbox.rst
@@ -1,8 +1,8 @@
cophi_toolbox
cophi-toolbox

.. testsetup::

   from cophi_toolbox import *
   from cophi import *

.. automodule:: cophi_toolbox.preprocessing
   :members:
File renamed without changes.