Open Library dump scripts
Python Java
GetMergeURL.java
author2csv.py
countclasses.py
countclassifications.py
countformats.py
countkeys.py
dump2csv.py
dump2mysql.py
edition2-json.txt
edition2csv.py
exportclasses.py
exportcsv.py
findoddauthorkeys.py
findoddclassifications.py
findoddeditionkeys.py
findoddids.py
findoddworkkeys.py
getconfusedrecords.py
geterrors.py
readme.txt
stats.py
work2csv.py

readme.txt

Open Library dump scripts

A collection of scripts to extract statistics from Open Library dump files.

Stats.py: produce statistics of a dump file
stats.py reads standard input line by line. It expects each line to be a complete JSON record, so before feeding it a dump file you should remove everything that precedes the JSON record on each line. For example:
    sed -nre "s/^[^{]*//p" <ol_dump_file> | python stats.py output.json
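
The same preprocessing can be sketched in Python. This is an illustration only, not part of the repository: the function name strip_prefix and the sample line are hypothetical, and the sketch assumes each dump line consists of some leading columns followed by a single JSON object. Unlike the sed command, it skips lines that contain no "{".

```python
import json

def strip_prefix(line):
    """Return the part of the line starting at the first '{', or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    return line[start:].rstrip("\n")

# Hypothetical dump line: tab-separated columns followed by the JSON record.
sample = ('/type/edition\t/books/OL1M\t3\t2010-01-01T00:00:00\t'
          '{"key": "/books/OL1M", "type": {"key": "/type/edition"}}\n')

record = json.loads(strip_prefix(sample))
print(record["key"])
```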

During execution, it keeps the statistics in a dict. Each type found in the dump, except for records with confused identities, gets a key in this dict. The value for each type is itself a dict with the following keys:
countr - an int count of the records of this type,
keys - a dict mapping each key found in the records to a two-item list: the number of records the key is found in, followed by the number of values. If the key has a list value in a record, the length of that list is added to the number of values; otherwise the record adds one, so for keys that never hold a list the two numbers are equal,
identifiers - a dict mapping each key found in the identifiers object to the number of records and the number of instances of that key,
si - a dict mapping each identifier found in the record object itself to a list of the number of records and the number of instances of that key,
classifications - same as identifiers, but for the classifications object,
sc - same as si, but for classifications.
The keys and types of records with confused identities are stored in a list under the key confused. If an exception is caught while processing a record, a 2-tuple containing the complete record and the exception message is appended to the list under the key error.
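
A minimal sketch of how the countr and keys entries could be accumulated is shown here. It is not the code from stats.py: the function name update_stats and the sample records are hypothetical, the record type is assumed to be stored as {"type": {"key": ...}}, and the handling of identifiers, classifications and confused identities is left out.

```python
import json

def update_stats(stats, record):
    """Fold one record into stats (simplified: only countr and keys)."""
    rtype = record["type"]["key"]  # assumption: type is an object with a "key" field
    entry = stats.setdefault(rtype, {"countr": 0, "keys": {}})
    entry["countr"] += 1
    for key, value in record.items():
        counts = entry["keys"].setdefault(key, [0, 0])
        counts[0] += 1  # number of records that contain this key
        # number of values: list length for list values, otherwise one per record
        counts[1] += len(value) if isinstance(value, list) else 1

stats = {"confused": [], "error": []}
lines = [
    '{"key": "/books/OL1M", "type": {"key": "/type/edition"}, "publishers": ["A", "B"]}',
    '{"key": "/books/OL2M", "type": {"key": "/type/edition"}, "publishers": ["C"]}',
    'not a JSON record',
]
for line in lines:
    try:
        update_stats(stats, json.loads(line))
    except Exception as exc:
        # keep the complete record together with the exception message
        stats["error"].append((line, str(exc)))

print(json.dumps(stats["/type/edition"], indent=2))
```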

Exportcsv.py: export data from JSON stats file to separate CSV files
Expects a file generated by stats.py.
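
As an illustration of the kind of export involved, the sketch below writes the keys statistics of one type as CSV. It is not the code from exportcsv.py: the function name write_keys_csv, the column names and the sample data are assumptions, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io

def write_keys_csv(type_stats, out):
    """Write one row per key: key name, number of records, number of values."""
    writer = csv.writer(out)
    writer.writerow(["key", "records", "values"])
    for key in sorted(type_stats["keys"]):
        records, values = type_stats["keys"][key]
        writer.writerow([key, records, values])

# Hypothetical stats in the layout described for stats.py.
stats = {"/type/edition": {"countr": 2, "keys": {"publishers": [2, 3], "key": [2, 2]}}}

buf = io.StringIO()
write_keys_csv(stats["/type/edition"], buf)
print(buf.getvalue())
```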

Countformats.py: count the values in the physical_format field
Expects Edition JSON records and outputs a tab-separated UTF-8 file.
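
Counting the physical_format values can be sketched as follows. This is not the code from countformats.py: the function names count_formats and to_tsv and the sample records are hypothetical, and the sketch assumes one Edition JSON record per line, with records lacking the field simply skipped.

```python
import json
from collections import Counter

def count_formats(lines):
    """Count how often each physical_format value occurs."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        fmt = record.get("physical_format")
        if fmt is not None:
            counts[fmt] += 1
    return counts

def to_tsv(counts):
    """Render the counts as tab-separated lines, most frequent value first."""
    return "".join("%s\t%d\n" % (fmt, n) for fmt, n in counts.most_common())

# Hypothetical Edition records.
lines = [
    '{"key": "/books/OL1M", "physical_format": "Paperback"}',
    '{"key": "/books/OL2M", "physical_format": "Hardcover"}',
    '{"key": "/books/OL3M", "physical_format": "Paperback"}',
    '{"key": "/books/OL4M"}',
]
counts = count_formats(lines)
tsv = to_tsv(counts)
print(tsv)
```

To produce a UTF-8 file, the resulting text can be written with open(path, "w", encoding="utf-8") in Python 3.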