GitHub - bencomp/oldumpscripts: Open Library dump scripts

GetMergeURL.java
author2csv.py
countclasses.py
countclassifications.py
countformats.py
countkeys.py
dump2csv.py
dump2mysql.py
edition2-json.txt
edition2csv.py
exportclasses.py
exportcsv.py
findoddauthorkeys.py
findoddclassifications.py
findoddeditionkeys.py
findoddids.py
findoddworkkeys.py
getconfusedrecords.py
geterrors.py
readme.txt
stats.py
work2csv.py

Open Library dump scripts

A collection of scripts to extract statistics from Open Library dump files.

Stats.py: produce statistics of a dump file
stats.py reads standard input line by line. It expects each line to be a complete JSON record, so before feeding it a dump file, remove everything that precedes the JSON record on each line. For example:

    sed -nre "s/^[^{]*//p" <ol_dump_file> | python stats.py output.json
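
The preprocessing step above can also be done in Python. The following is a minimal sketch and not part of this repository: the function name strip_to_json and the sample dump line are hypothetical, and it assumes each dump line consists of some leading columns followed by a single JSON object.

```python
import json

def strip_to_json(line):
    """Drop everything before the first '{', as the sed expression does.

    Returns an empty string if the line contains no '{'.
    """
    start = line.find("{")
    if start == -1:
        return ""
    return line[start:].rstrip("\n")

# Hypothetical dump line: tab-separated columns followed by the JSON record.
sample = '/type/edition\t/books/OL1M\t3\t2010-01-01\t{"key": "/books/OL1M", "title": "Example"}\n'
record = json.loads(strip_to_json(sample))
print(record["key"])
```

Lines that come back empty hold no JSON record and would have to be skipped before parsing.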

During execution, it keeps the statistics in a dict. Each type found in the dump, except those with confused identities, gets a key in this dict. The value for each of these keys is itself a dict with the following keys:
countr - an int count of the records of this type,
keys - a dict that maps each key found in the records to a two-item list: the number of records the key is found in, followed by the number of values. If the key has a list value in a record, the length of that list is added to the number of values; otherwise the record adds one, so for keys that never hold a list the two numbers are equal,
identifiers - a dict that maps each key found in the identifiers object to a list of the number of records and the number of instances of that key,
si - a dict that maps each identifier found directly in the record object (rather than in the identifiers object) to a list of the number of records and the number of instances of that key,
classifications - same as for identifiers, but for the classifications object,
sc - same as for si, but for classifications found directly in the record object.
Keys and types of records with confused identities are kept in a list under the key confused. If an exception is caught while processing a record, a 2-tuple containing the complete record and the exception message is appended to the list under the key error.
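
The layout described above can be illustrated with a short sketch. This is not the code in stats.py: the function name update_stats and the sample records are hypothetical, the record type is assumed to be stored as record["type"]["key"], and only the record count and the keys dict are shown.

```python
def update_stats(stats, record):
    """Fold one parsed record into the stats dict (record count and keys only)."""
    # Assumption: the type is stored as {"type": {"key": "/type/edition"}}.
    rtype = record["type"]["key"]
    # "countr" is the key name as given in the description above.
    entry = stats.setdefault(rtype, {"countr": 0, "keys": {}})
    entry["countr"] += 1
    for key, value in record.items():
        # Per key: [number of records containing it, number of values].
        counts = entry["keys"].setdefault(key, [0, 0])
        counts[0] += 1
        counts[1] += len(value) if isinstance(value, list) else 1
    return stats

stats = {}
update_stats(stats, {"type": {"key": "/type/edition"}, "title": "A", "authors": ["x", "y"]})
update_stats(stats, {"type": {"key": "/type/edition"}, "title": "B"})
print(stats["/type/edition"]["keys"]["authors"])  # [1, 2]
print(stats["/type/edition"]["keys"]["title"])    # [2, 2]
```

Here authors appears in one record with two values, while title appears in two records with one value each.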

Exportcsv.py: export data from JSON stats file to separate CSV files
Expects a file generated by stats.py.
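
As an illustration of this kind of export, the sketch below writes one CSV file per record type from a stats dict. It is not the code in exportcsv.py: the function name export_key_counts, the file naming scheme and the column headers are all assumptions, and only the keys dict of each type is exported.

```python
import csv
import json
import os
import tempfile

def export_key_counts(stats, out_dir):
    """Write one CSV per record type listing each key with its two counts."""
    paths = []
    for rtype, entry in stats.items():
        if not isinstance(entry, dict) or "keys" not in entry:
            continue  # skip the lists stored under "confused" and "error"
        # Hypothetical naming scheme: "/type/edition" -> "type_edition_keys.csv"
        name = rtype.strip("/").replace("/", "_") + "_keys.csv"
        path = os.path.join(out_dir, name)
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["key", "records", "values"])
            for key, (records, values) in sorted(entry["keys"].items()):
                writer.writerow([key, records, values])
        paths.append(path)
    return paths

# Stand-in for the JSON file produced by stats.py.
stats = json.loads(
    '{"/type/edition": {"keys": {"title": [2, 2], "authors": [1, 2]}},'
    ' "confused": [], "error": []}'
)
out_dir = tempfile.mkdtemp()
paths = export_key_counts(stats, out_dir)
print(paths)
```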

Countformats.py: count the values in the physical_format field
Expects Edition JSON records and outputs a tab-separated UTF-8 file.
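
The counting itself can be sketched in a few lines. This is an illustration and not the code in countformats.py: the function names count_formats and to_tsv are hypothetical, and it assumes one JSON record per line, a string-valued physical_format field, and an output with the format in the first column and its count in the second.

```python
import json
from collections import Counter

def count_formats(lines):
    """Count the physical_format values in an iterable of JSON record lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        fmt = record.get("physical_format")
        if fmt is not None:
            counts[fmt] += 1
    return counts

def to_tsv(counts):
    """Render the counts as tab-separated text, most frequent format first."""
    return "".join(f"{fmt}\t{n}\n" for fmt, n in counts.most_common())

# Hypothetical Edition records; the last one has no physical_format.
sample = [
    '{"key": "/books/OL1M", "physical_format": "Paperback"}',
    '{"key": "/books/OL2M", "physical_format": "Hardcover"}',
    '{"key": "/books/OL3M", "physical_format": "Paperback"}',
    '{"key": "/books/OL4M"}',
]
counts = count_formats(sample)
print(to_tsv(counts), end="")
```

To produce a UTF-8 file, the returned text would be written to a file opened with encoding="utf-8".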