This is a set of scripts that help download Google NGram files from the web.
-
Install the mechanize package. I used the command:
sudo easy_install mechanize
-
Create a file listing the URLs to download. For example, to get all English 5-gram files, run:
mkdir eng-5gram; python collect_file_urls.py googlebooks-eng-all-5gram-20120701-\(.*\).gz eng-5gram/urls.txt
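The first argument in the command above looks like a regular expression over file names. As a rough illustration only (this is not the code in collect_file_urls.py; the function name filter_urls and the example.org URLs are made up for the example), selecting the matching URLs could be sketched in Python 3 like this:

```python
import re


def filter_urls(urls, pattern):
    """Return the URLs whose file name matches the regular expression.

    Hypothetical helper for illustration; not part of these scripts.
    """
    regex = re.compile(pattern)
    matched = []
    for url in urls:
        # Compare only the last path component (the file name).
        filename = url.rsplit("/", 1)[-1]
        if regex.fullmatch(filename):
            matched.append(url)
    return matched


if __name__ == "__main__":
    # Made-up URLs on a placeholder host, used only to exercise the filter.
    sample = [
        "http://example.org/ngrams/googlebooks-eng-all-5gram-20120701-aa.gz",
        "http://example.org/ngrams/googlebooks-eng-all-1gram-20120701-a.gz",
    ]
    for url in filter_urls(sample, r"googlebooks-eng-all-5gram-20120701-(.*)\.gz"):
        print(url)
```

Only the 5-gram URL is selected here. Quoting the pattern on the shell command line (single quotes) is a safe way to keep the shell from interpreting the parentheses and the asterisk.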
-
Download the files at the URLs using wget -i:
wget -i eng-5gram/urls.txt -P eng-5gram/
-
Merge the n-gram counts from the downloaded files into a single processed file:
python merge_files.py eng-5gram eng-5gram.txt
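For readers curious what such a merge involves, here is a minimal Python 3 sketch. It is an assumption about the general idea, not the actual logic of merge_files.py. It assumes each line of a downloaded .gz file has the tab-separated fields ngram, year, match_count, volume_count (the layout of the 20120701 files), and it sums match_count over all years for each n-gram. The names merge_counts and write_counts are hypothetical.

```python
import glob
import gzip
import os
from collections import Counter


def merge_counts(input_dir):
    """Sum match counts over all years for each n-gram in input_dir/*.gz.

    Assumed line layout: ngram TAB year TAB match_count TAB volume_count.
    """
    totals = Counter()
    for path in sorted(glob.glob(os.path.join(input_dir, "*.gz"))):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 3:
                    continue  # skip lines that do not fit the assumed layout
                totals[fields[0]] += int(fields[2])
    return totals


def write_counts(totals, output_path):
    """Write one 'ngram TAB total' line per n-gram, sorted by n-gram."""
    with open(output_path, "w", encoding="utf-8") as out:
        for ngram, count in sorted(totals.items()):
            out.write("%s\t%d\n" % (ngram, count))
```

A call such as write_counts(merge_counts("eng-5gram"), "eng-5gram.txt") would mirror the command above. Note that this sketch keeps every n-gram in memory, which is unlikely to be practical for the full 5-gram data; it is meant to show the shape of the computation on a small subset.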