This repository contains a collection of reference datasets that can be used for different data cleaning tasks. The repository is part of the openclean project. In openclean, the datasets can easily be downloaded and accessed using the openclean.data.refdata.RefStore.
The following datasets are currently contained in the repository:
- encyclopaedia_britannica:us_cities: Names of cities in the U.S. together with the name of the state parsed from the Encyclopaedia Britannica US Cities web site.
- restcountries.eu: Information about the countries of the world, obtained from the restcountries.eu project.
- usps:street_abbrev: Mapping of common street type abbreviations to their standard form, parsed from the USPS C1 Street Suffix Abbreviations web site.
- usps:secondary_unit_designators: Secondary unit designators, parsed from the USPS C2 Secondary Unit Designators web site.
- wikipedia:us_states: List of the 50 states in the United States, with their current capital, largest city, the date they ratified the U.S. Constitution or were admitted to the Union, population and area data, and number of representative(s) in the U.S. House of Representatives (from Wikipedia).
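As a rough illustration of how a mapping such as usps:street_abbrev might be applied in a cleaning task, the sketch below standardizes the trailing street type of an address. The dictionary is a small hand-written sample, not the dataset itself, and the names `STREET_ABBREV` and `standardize_street` are hypothetical; the actual column layout of the dataset is not assumed here.

```python
# Illustrative sample only: a handful of street type variants and their
# standard abbreviations, written by hand for this sketch. The real
# usps:street_abbrev dataset is larger and should be loaded from the repository.
STREET_ABBREV = {
    "AVENUE": "AVE",
    "AVE": "AVE",
    "AV": "AVE",
    "STREET": "ST",
    "STR": "ST",
    "ST": "ST",
    "BOULEVARD": "BLVD",
    "BLVD": "BLVD",
}


def standardize_street(address, mapping=STREET_ABBREV):
    """Replace the last token of an address with its standard abbreviation.

    Hypothetical helper: upper-cases the address, drops periods, and looks up
    only the final token. Tokens without a mapping entry are left unchanged.
    """
    tokens = address.upper().replace(".", "").split()
    if not tokens:
        return ""
    tokens[-1] = mapping.get(tokens[-1], tokens[-1])
    return " ".join(tokens)


if __name__ == "__main__":
    for raw in ["5th Avenue", "Main Str.", "Broadway"]:
        print(raw, "->", standardize_street(raw))
```

Only the final token is looked up, which keeps the sketch simple but would miss street types that are followed by a directional or a unit number; a real cleaning pipeline would likely need a more careful tokenization.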
This repository also contains a collection of Python scripts that were used to download and parse the different datasets.