amira-khan/zipf

Zipf's Law

The pyzipf package tallies the occurrences of words in text files and plots each word's rank against its frequency, alongside a line showing the theoretical distribution predicted by Zipf's Law.

Motivation

Zipf's Law is often stated as an observational pattern seen in the relationship between the frequency and rank of words in a text:

"…the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc." (Wikipedia)

Many books are available to download in plain text format from sites such as Project Gutenberg, so we created this package to qualitatively explore how well different books align with the word frequencies predicted by Zipf's Law.
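
As a rough illustration of the relationship quoted above, the idealized counts can be computed by dividing the count of the most frequent word by each rank. This is a minimal sketch and not part of the pyzipf package; the function name `zipf_expected_counts` and the example count of 600 are assumptions made for illustration.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Return idealized counts for ranks 1..n_ranks, assuming count = top_count / rank."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


# If the most frequent word appeared 600 times, the idealized law predicts
# 300 occurrences for rank 2, 200 for rank 3, and 150 for rank 4.
expected = zipf_expected_counts(600, 4)
print(expected)  # [600.0, 300.0, 200.0, 150.0]
```

Real texts only approximate these values, which is why the package compares observed counts against the theoretical line rather than assuming an exact fit.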

Installation

pip install pyzipf

Usage

After installing this package, the following three commands will be available from the command line:

  • countwords for counting the occurrences of words in a text.
  • collate for collating multiple word count files together.
  • plotcounts for visualizing the word counts.
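
To give a feel for what counting and collating involve, here is a minimal sketch using only the Python standard library. It is not the package's implementation: the function names `count_words` and `collate_counts` are hypothetical, and the tokenization rule (lowercase letters and apostrophes) is an assumption that may differ from what the countwords command does.

```python
import re
from collections import Counter


def count_words(text):
    """Tally lowercase word occurrences; the tokenization rule here is an assumption."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def collate_counts(*counters):
    """Sum several word tallies into one combined tally."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


first = count_words("The whale. The sea.")
second = count_words("The night, the castle.")
combined = collate_counts(first, second)
print(combined.most_common(1))  # [('the', 4)]
```

Using `Counter` keeps collation simple, because tallies from separate texts can be added together without any special handling for words that appear in only one of them.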

A typical usage scenario would include running the following from your terminal:

countwords dracula.txt > dracula.csv
countwords moby_dick.txt > moby_dick.csv
collate dracula.csv moby_dick.csv > collated.csv
plotcounts collated.csv --outfile zipf-drac-moby.jpg

Additional information on each command can be found in its docstring or by appending the -h flag, e.g. countwords -h.
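
The comparison that the plot makes, observed counts against the theoretical line, can be sketched without any plotting library. The example below assumes the count files contain plain `word,count` rows with no header; that format is an assumption for illustration, so check the actual output of countwords before relying on it. The function name `rank_counts` is hypothetical, and the sample data is made up.

```python
import csv
import io


def rank_counts(csv_text):
    """Parse assumed `word,count` rows and return (rank, word, count, predicted) tuples."""
    rows = [(word, int(count)) for word, count in csv.reader(io.StringIO(csv_text))]
    # Rank 1 is the most frequent word.
    rows.sort(key=lambda row: row[1], reverse=True)
    top_count = rows[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(rows, start=1)
    ]


# Made-up sample data in the assumed format.
sample = "the,120\nand,58\nof,41\n"
for rank, word, count, predicted in rank_counts(sample):
    print(rank, word, count, round(predicted, 1))
```

Each tuple pairs an observed count with the value the idealized law would predict for that rank, which is the same comparison the plot presents visually.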

Contributing

Interested in contributing? Check out the CONTRIBUTING.md file for guidelines on how to contribute. Please note that this project is released with a Contributor Code of Conduct (CONDUCT.md). By contributing to this project, you agree to abide by its terms. Both of these files can be found in our GitHub repository.