Zipf's Law in the Greek language

Zipf

A script that fetches text data, processes it, and builds word lists. It then computes word frequencies, sorts the words by rank, calculates the values needed to test whether Zipf's Law holds for the Greek language, and creates the related graph plot.
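
The rank and frequency step described above can be sketched as a short pipeline. This is a minimal illustration, not the code used by `retrieve`: the function name `rank_freq` is hypothetical, tokens are simply whitespace-separated strings with ASCII punctuation removed, and no case folding is attempted.

```shell
# rank_freq (hypothetical helper): read plain text on stdin and print
# one "rank frequency word" line per distinct token, most frequent first.
# Assumption: a token is a whitespace-separated string; only ASCII
# punctuation is stripped, so UTF-8 Greek letters pass through untouched.
rank_freq() {
    LC_ALL=C tr -s '[:space:]' '\n' |   # one token per line
        LC_ALL=C tr -d '[:punct:]' |    # drop ASCII punctuation
        grep -v '^$' |                  # discard empty tokens
        LC_ALL=C sort |                 # group identical tokens together
        uniq -c |                       # count each distinct token
        LC_ALL=C sort -k1,1nr -k2 |     # highest count first, ties by word
        awk '{ print NR, $1, $2 }'      # the output line number is the rank
}
```

For example, `rank_freq < some_page.txt` would print the ranked list for a single text file, where `some_page.txt` stands for any plain-text input.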

Info

  • Total documents: 10,000
  • Total words: 4,984,085
  • Vocabulary size: 174,258 (unique words)
  • Words occurring more than 10 times: 31,133
  • Words occurring once: 70,247
  • Final b for all words: -1.06015791025300522471
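
To give a feel for how an exponent such as b can be estimated, here is a rough sketch. It assumes the usual power-law form of Zipf's Law, f(r) ≈ f(1) · r^b, which gives a per-rank estimate b_r = ln(f(r) / f(1)) / ln(r) for every rank r > 1, and then averages those estimates. Both the formula and the function name `mean_b` are assumptions made for illustration; the exact calculation performed by `retrieve -b` may differ.

```shell
# mean_b (hypothetical helper): read "rank frequency" lines on stdin,
# sorted by rank, and print the mean of the per-rank estimates
#   b_r = ln(f_r / f_1) / ln(r)   for r > 1.
# Assumption: the first input line belongs to rank 1.
mean_b() {
    awk '
        NR == 1 { f1 = $2; next }            # frequency of the top-ranked word
        $1 > 1 && $2 > 0 {
            sum += log($2 / f1) / log($1)    # per-rank estimate of b
            n++
        }
        END { if (n > 0) printf "%.4f\n", sum / n }
    '
}
```

A value close to -1 indicates that frequency falls off roughly in inverse proportion to rank, which is what Zipf's Law predicts.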

Review Paper

Grab the paper here or read it online here

Plot Graph

gnuplot-graph

Usage

usage: retrieve options

	Options are:
		-a	all, same as -t -m -b -g [Note: no fetch]
		-f	fetch files
		-t	tokenize files
		-m	sort and map tokens - get rank and freq
		-b	calculate b
		-g	create graph plot
		-h	help, print this help message

Results

Results are placed in /tmp/zipf/results

  • rank.freq.map: holds all words and their frequencies, sorted by rank.
  • bvalues.map: a report-like file. Holds all words, their frequency, their relative frequency, and each word's b-value (also referred to as 'a'), sorted by the word's rank, plus the average b-value.
  • rank.freq.plot: includes just the values (rank and frequency) fed to the graph.
  • zipf_plot_greek.png: the graph image.
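
As an illustration of how a rank and frequency file can be turned into an image, the sketch below prints a small gnuplot script for a log-log plot. The function name `plot_script`, the image size, and the assumption that the data file has rank in column 1 and frequency in column 2 are all illustrative; the plot settings used by `retrieve -g` are not documented here.

```shell
# plot_script (hypothetical helper): print gnuplot commands that draw a
# log-log rank/frequency plot.
#   $1  data file, assumed to hold "rank frequency" per line
#   $2  PNG image to write
plot_script() {
    data=$1
    image=$2
    cat <<EOF
set terminal png size 800,600
set output "$image"
set logscale xy
set xlabel "rank"
set ylabel "frequency"
plot "$data" using 1:2 with lines title "rank vs. frequency"
EOF
}
```

The generated commands can be piped straight into gnuplot, for example `plot_script /tmp/zipf/results/rank.freq.plot /tmp/zipf/results/zipf_plot_greek.png | gnuplot`. On logarithmic axes a Zipfian distribution appears as a roughly straight line with a slope near b.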

Data

Collected data are placed in /tmp/zipf/dumpfiles
Processed data are placed in /tmp/zipf/tokens
Data are collected from the Greek Wikipedia using its random page generator. By default, the script collects 10,000 random pages.
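
A collection loop along these lines could look like the sketch below. It is only an illustration: the function name `fetch_random_pages`, the `Special:Random` URL, the output file naming, and the one-second pause are assumptions, and the actual fetch logic in `retrieve -f` may work differently.

```shell
# fetch_random_pages (hypothetical helper): dump N random Greek Wikipedia
# pages as plain text, one file per page.
#   $1  number of pages to fetch
#   $2  output directory (created if missing)
# Assumption: Special:Random redirects to a random article.
fetch_random_pages() {
    count=$1
    outdir=$2
    url='https://el.wikipedia.org/wiki/Special:Random'
    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        # -dump renders the page as text; the other two flags leave out
        # the link reference list and the link numbers.
        elinks -dump -no-references -no-numbering "$url" \
            > "$outdir/page.$i.txt" || return 1
        i=$((i + 1))
        sleep 1    # pause between requests to stay polite to the server
    done
}
```

Calling `fetch_random_pages 10 /tmp/zipf/dumpfiles` would fetch a small sample of ten pages into the dump directory.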

Dependencies

Dependencies include

  • elinks: an advanced and well-established feature-rich text mode web browser.
  • gnuplot: a portable command-line driven graphing utility for Linux, OS/2, MS Windows, OS X, VMS, and many other platforms.

License

Zipf Law for the Greek Language by Ivan Kanakarakis is licensed under the GNU GPLv3 license.