zipfanalysis

Tools in python for analysing Zipf's law from text samples.

This can be installed as a package from the python3 package library using the terminal command:

>>> pip install zipfanalysis

WARNING: This tool is still in development and should not be relied upon wholly for academic research.

Usage

The package can be used from within python scripts to estimate Zipf exponents, assuming a simple power law model for word frequencies and ranks. To use the pacakge import it using

import zipfanalysis

Simple Method

The easiest way to carry out an analysis on a book or text file, using different estimators, is:

alpha_clauset = zipfanalysis.clauset("path_to_book.txt")

alpha_pdf = zipfanalysis.ols_pdf("path_to_book.txt", min_frequency=3)

alpha_cdf = zipfanalysis.ols_cdf("path_to_book.txt", min_frequency=3)

alpha_abc = zipfanalysis.abc("path_to_book.txt")

In Depth Method

Convert a book or text file to the frequency of words, ranked from highest to lowest:

word_counts = zipfanalysis.preprocessing.preprocessing.get_rank_frequency_from_text("path_to_book.txt")

Carry out different types of analysis to fit a power law to the data:

# Clauset et al estimator
alpha_clauset = zipfanalysis.estimators.clauset.clauset_estimator(word_counts)

# Ordinary Least Squares regression on log(rank) ~ log(frequency)
# Optional low frequency cut-off
alpha_pdf = zipfanalysis.estimators.ols_regression_pdf.ols_regression_pdf_estimator(word_counts, min_frequency=2)

# Ordinary least squares regression on the complemantary cumulative distribution function of ranks
# OLS on log(P(R>rank)) ~ log(rank)
# Optional low frequency cut-off
alpha_cdf = zipfanalysis.estimators.ols_regression_cdf.ols_regression_cdf_estimator(word_counts)

# Approximate Bayesian computation (regression method)
# Assumes model of p(rank) = C prob_rank^(-alpha)
# prob_rank is a word's rank in an underlying probability distribution
alpha_abc = zipfanalysis.estimators.approximate_bayesian_computation.abc_estimator(word_counts)

Development Notes

General workflow to use should be:

1. Import data to n vector. E.g. n = zipfanalysis.import_book("filename.txt") n = zipfanlysis.import_list([list of words]) n = zipfanlysis.import_counter(counter_of_words)

2. Carry out analsyis on data e.g. zipfanalysis.n_pdf_regression(n)

3. Also convert to different representations zipfanalysis.convert_to_f(n)

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
zipfanalysis		zipfanalysis
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.rst		README.rst
setup.py		setup.py
setup_notes.txt		setup_notes.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

zipfanalysis

Usage

Simple Method

In Depth Method

Development Notes

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

zipfanalysis

Usage

Simple Method

In Depth Method

Development Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages