r"""
**cophi** is a Python library for handling, modeling and processing text
corpora. You can easily pipe a collection of text files using the
high-level API:

.. code-block:: python

    corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                    pathname_pattern="**/*.txt",
                                    encoding="utf-8",
                                    lowercase=True,
                                    n=1,
                                    token_pattern=r"\p{L}+\p{P}?\p{L}+")

There are also plenty of complexity metrics for measuring the lexical
richness of (literary) texts.

Measures that use sample size and vocabulary size:

* Type-Token Ratio :math:`TTR`
* Guiraud’s :math:`R`
* Herdan’s :math:`C`
* Dugast’s :math:`k`
* Maas’ :math:`a^2`
* Dugast’s :math:`U`
* Tuldava’s :math:`LN`
* Brunet’s :math:`W`
* Carroll’s :math:`CTTR`
* Summer’s :math:`S`

Measures that use part of the frequency spectrum:

* Honoré’s :math:`H`
* Sichel’s :math:`S`
* Michéa’s :math:`M`

Measures that use the whole frequency spectrum:

* Entropy :math:`S`
* Yule’s :math:`K`
* Simpson’s :math:`D`
* Herdan’s :math:`V_m`

Parameters of probabilistic models:

* Orlov’s :math:`Z`

For a more detailed description and the formulas used, have a look at the
:mod:`complexity` module.
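
As a rough illustration of the first group of measures, the sketch below
computes three of them from a plain list of tokens using only the standard
library. It assumes the textbook definitions (TTR = V / N, Guiraud's
R = V / sqrt(N), Herdan's C = log V / log N); the helper name
``basic_richness`` is hypothetical and not part of cophi, whose own
implementation may differ.

```python
import math
from collections import Counter


def basic_richness(tokens):
    # Hypothetical helper, not part of cophi: textbook definitions only.
    # N is the sample size (number of tokens),
    # V is the vocabulary size (number of distinct types).
    counts = Counter(tokens)
    n = sum(counts.values())
    v = len(counts)
    if n < 2:
        # log(N) is zero for N == 1, so Herdan's C would be undefined.
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,                           # Type-Token Ratio
        "guiraud_r": v / math.sqrt(n),          # Guiraud's R
        "herdan_c": math.log(v) / math.log(n),  # Herdan's C
    }


tokens = "the cat sat on the mat".split()
measures = basic_richness(tokens)
```

Here the sample has six tokens and five types, so all three values follow
directly from N = 6 and V = 5.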
"""
from cophi.api import document, corpus