A script that fetches data, processes it, and builds word lists. It manipulates the lists to find word frequencies and sorts the words by rank, then calculates the related data to show that Zipf's law holds for the Greek language and creates the corresponding plot.
- Total documents: 10,000
- Total words: 4,984,085
- Vocabulary size: 174,258 (unique words)
- Words occurring more than 10 times: 31,133
- Words occurring once: 70,247
- Final b for all words is -1.06015791025300522471
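
In the usual formulation of Zipf's law, a word's frequency is proportional to its rank raised to an exponent b close to -1. As a rough illustration, such an exponent can be estimated as the least-squares slope of log(frequency) against log(rank). This is only a sketch and not necessarily how this script computes b; the function name `fit_exponent` and the sample frequencies are hypothetical.

```python
import math


def fit_exponent(freqs):
    """Least-squares slope of log(freq) against log(rank).

    `freqs` must be sorted in descending order, so freqs[0] is rank 1.
    Hypothetical helper, not taken from the script itself.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up frequencies that follow freq = 12 / rank exactly,
# so the fitted slope should come out very close to -1.
sample_freqs = [12, 6, 4, 3]
print(fit_exponent(sample_freqs))
```

On real text the fit is sensitive to the long tail of words that occur only once, so restricting it to a range of ranks is a common choice.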
    usage: retrieve options

    Options are:
      -a  all, same as -t -m -b -g [Note: no fetch]
      -f  fetch files
      -t  tokenize files
      -m  sort and map tokens - get rank and freq
      -b  calculate b
      -g  create graph plot
      -h  help, print this help message
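
The tokenize (`-t`) and map (`-m`) steps can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the tokenization rule (lowercased runs of letters, digits dropped) and the names `tokenize` and `rank_frequency` are illustrative and not taken from the script.

```python
import re
from collections import Counter

# Assumed rule: a word is a run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and return its words in order of appearance."""
    return WORD_RE.findall(text.lower())


def rank_frequency(tokens):
    """Return (rank, word, freq) tuples; rank 1 is the most frequent word.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


sample = "Και το και, το 42 ΚΑΙ."
for rank, word, freq in rank_frequency(tokenize(sample)):
    print(rank, word, freq)
```

For the sample sentence this lists `και` at rank 1 with frequency 3, followed by `το` with frequency 2.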
Results are placed on
- rank.freq.map: Holds all words and their frequencies, sorted by rank.
- bvalues.map: Report-like file. Holds all words, their frequency, their relative frequency, and each word's b-value (also referred to as 'a'), sorted by the word's rank, together with the average b-value.
- rank.freq.plot: Includes just the values (rank and frequency) fed to the graph.
- zipf_plot_greek.png: The graph image.
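
One way to picture a per-word b-value and its average is sketched below. It assumes the model freq(rank) = freq(1) * rank^b, which gives b = log(freq / freq(1)) / log(rank) for every rank above 1. Both the formula and the names `per_rank_b` and `average_b` are assumptions made for illustration; the script may define its b-values differently.

```python
import math


def per_rank_b(freqs):
    """Return (rank, freq, relative_freq, b) rows.

    `freqs` must be sorted in descending order, so freqs[0] is rank 1.
    b is None at rank 1, where log(rank) is 0 and b is undefined.
    """
    total = sum(freqs)
    top = freqs[0]
    rows = []
    for rank, freq in enumerate(freqs, start=1):
        relative = freq / total
        b = math.log(freq / top) / math.log(rank) if rank > 1 else None
        rows.append((rank, freq, relative, b))
    return rows


def average_b(rows):
    """Mean of the defined b-values; needs at least two ranks."""
    values = [b for _, _, _, b in rows if b is not None]
    if not values:
        raise ValueError("need at least two ranks to average b")
    return sum(values) / len(values)


# Made-up frequencies following freq = 12 / rank, so every b is about -1.
rows = per_rank_b([12, 6, 4, 3])
print(average_b(rows))
```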
Collected data are placed on
Processed data are placed on
Data are collected from the Greek Wikipedia using its random page generator. By default, the script collects 10,000 random pages.
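
A collection loop of this kind might look like the sketch below, which asks elinks for a plain-text dump of a random page and saves it to a file. The URL (`Special:Random` on el.wikipedia.org), the output file naming, and the helper names `dump_command` and `fetch_random_pages` are assumptions for illustration; the script's actual fetch step may differ.

```python
import subprocess
from pathlib import Path

# Assumed address of the Greek Wikipedia random page generator.
RANDOM_PAGE_URL = "https://el.wikipedia.org/wiki/Special:Random"


def dump_command(url=RANDOM_PAGE_URL):
    """Build an elinks command that prints the page as plain text,
    without link numbers or the trailing list of references."""
    return ["elinks", "-dump", "-no-numbering", "-no-references", url]


def fetch_random_pages(count, out_dir):
    """Save `count` random pages as text files under `out_dir`.

    Requires elinks on PATH and network access.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index in range(count):
        result = subprocess.run(
            dump_command(),
            capture_output=True,
            encoding="utf-8",
            check=True,
        )
        # Hypothetical naming scheme: page_00000.txt, page_00001.txt, ...
        (out_dir / f"page_{index:05d}.txt").write_text(
            result.stdout, encoding="utf-8"
        )
```

Because the pages are random, the same article can be fetched more than once; removing duplicates is left out of this sketch.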
- elinks: An advanced, well-established, feature-rich text-mode web browser.
- gnuplot: A portable, command-line-driven graphing utility for Linux, OS/2, MS Windows, OSX, VMS, and many other platforms.