release: v1.0.1
severinsimmler committed Aug 27, 2018
2 parents 34cee02 + 2653179 commit a26d839
Showing 16 changed files with 3,967 additions and 962 deletions.
61 changes: 59 additions & 2 deletions README.md
@@ -1,2 +1,59 @@
# cophi-toolbox
A repository containing general functions for processing and accessing text corpora
# A library for preprocessing
`cophi` is a Python library for handling, modeling and processing text corpora. You
can easily process a collection of text files using the high-level API:

```python
import cophi

corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
```
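
To give a rough idea of what such a call involves, here is a minimal standard-library sketch. It is not `cophi`'s implementation: the function names `tokenize` and `count_tokens` are hypothetical, and the letters-only pattern is a simplified stand-in for `\p{L}+\p{P}?\p{L}+`, which needs the third-party `regex` package.

```python
import re
from collections import Counter
from pathlib import Path

# Simplified stand-in for the README's token pattern: runs of Unicode
# letters only (no inner punctuation), so that the stdlib `re` suffices.
LETTERS = re.compile(r"[^\W\d_]+")


def tokenize(text, lowercase=True):
    """Split a string into letter-only tokens (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    return LETTERS.findall(text)


def count_tokens(directory, filepath_pattern="**/*.txt", encoding="utf-8",
                 lowercase=True):
    """Map each matching file's stem to a Counter of its tokens."""
    counts = {}
    # "**/" matches the directory itself and all of its subdirectories.
    for path in sorted(Path(directory).glob(filepath_pattern)):
        text = path.read_text(encoding=encoding)
        counts[path.stem] = Counter(tokenize(text, lowercase))
    return counts
```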

## Requirements
This library is tested with Python 3.4 and higher. It requires some additional packages: pandas, numpy, lxml and regex.

## Getting started
To install the latest **stable** version:
```
$ pip install git+https://github.com/cophi-wue/cophi-toolbox.git
```

To install the latest **development** version:
```
$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing
```

## Contents
- [`api`](src/cophi_toolbox/api.py): High-level API.
- [`model`](src/cophi_toolbox/model.py): Low-level model classes.
- [`complexity`](src/cophi_toolbox/complexity.py): Measures that assess the linguistic and stylistic complexity of (literary) texts.
- [`utils`](src/cophi_toolbox/utils.py): Low-level helper functions.


## Available complexity measures
Measures that use sample size and vocabulary size:
* Type-Token Ratio (TTR)
* Guiraud’s R
* Herdan’s C
* Dugast’s k
* Maas’ a<sup>2</sup>
* Dugast’s U
* Tuldava’s LN
* Brunet’s W
* Carroll’s CTTR
* Summer’s S
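
As an illustration of this group, the following sketch computes a few of the measures from the sample size N (number of tokens) and the vocabulary size V (number of types), using their common textbook definitions. The function name `basic_measures` is hypothetical and does not reflect the API of the `complexity` module.

```python
import math


def basic_measures(tokens):
    """Compute some N/V-based measures (needs at least two tokens)."""
    n = len(tokens)        # sample size N
    v = len(set(tokens))   # vocabulary size V
    log_n, log_v = math.log(n), math.log(v)
    return {
        "ttr": v / n,                             # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),            # Guiraud's R: V / sqrt(N)
        "herdan_c": log_v / log_n,                # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),     # Carroll's CTTR: V / sqrt(2N)
        "maas_a2": (log_n - log_v) / log_n ** 2,  # Maas' a^2
    }
```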

Measures that use part of the frequency spectrum:
* Honoré’s H
* Sichel’s S
* Michéa’s M
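
These three measures only need the number of types that occur once (V1, the hapax legomena) or twice (V2, the dis legomena). The sketch below uses the usual definitions; `spectrum_part_measures` is a hypothetical name, not a function of this library.

```python
import math
from collections import Counter


def spectrum_part_measures(tokens):
    """Compute Honoré's H, Sichel's S and Michéa's M for a token list.

    Raises ZeroDivisionError if every type is a hapax (H undefined)
    or if no type occurs exactly twice (M undefined).
    """
    freqs = Counter(tokens)             # type -> frequency
    n = sum(freqs.values())             # sample size N
    v = len(freqs)                      # vocabulary size V
    spectrum = Counter(freqs.values())  # frequency i -> number of types V_i
    v1, v2 = spectrum[1], spectrum[2]
    return {
        "honore_h": 100 * math.log(n) / (1 - v1 / v),
        "sichel_s": v2 / v,
        "michea_m": v / v2,
    }
```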

Measures that use the whole frequency spectrum:
* Entropy S
* Yule’s K
* Simpson’s D
* Herdan’s V<sub>m</sub>
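
Measures in this group sum over the whole frequency spectrum, that is, over the number of types V_i occurring exactly i times. The sketch below covers entropy, Yule's K and Simpson's D with their common definitions. The function name is hypothetical, and the base-2 logarithm for entropy is an assumption, since the README does not state a base.

```python
import math
from collections import Counter


def whole_spectrum_measures(tokens):
    """Compute entropy, Yule's K and Simpson's D (needs at least two tokens)."""
    freqs = Counter(tokens)             # type -> frequency
    n = sum(freqs.values())             # sample size N
    spectrum = Counter(freqs.values())  # frequency i -> number of types V_i
    # Entropy: -sum of p * log2(p) over all types (base 2 is an assumption).
    entropy = -sum((f / n) * math.log2(f / n) for f in freqs.values())
    # Yule's K: 10^4 * (sum of i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i ** 2 * vi for i, vi in spectrum.items()) - n) / n ** 2
    # Simpson's D: sum of V_i * (i / N) * ((i - 1) / (N - 1))
    simpson_d = sum(vi * (i / n) * ((i - 1) / (n - 1))
                    for i, vi in spectrum.items())
    return {"entropy": entropy, "yule_k": yule_k, "simpson_d": simpson_d}
```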

Parameters of probabilistic models:
* Orlov’s Z
@@ -1,8 +1,8 @@
cophi_toolbox
cophi-toolbox

.. testsetup::

    from cophi_toolbox import *
    from cophi import *

.. automodule:: cophi_toolbox.preprocessing
    :members:
File renamed without changes.
