VIDA-NYU/openclean-reference-data

openclean - Reference Data Repository

About

This repository contains a collection of reference datasets for data cleaning tasks. It is part of the openclean project. In openclean, the datasets can be downloaded and accessed using openclean.data.refdata.RefStore.

Datasets

The following datasets are currently contained in the repository:

  • encyclopaedia_britannica:us_cities: Names of U.S. cities together with the name of their state, parsed from the Encyclopaedia Britannica US Cities web site.
  • restcountries.eu: Information about countries in the world available from the restcountries.eu project.
  • usps:street_abbrev: Mapping of common street type abbreviations to a standard format parsed from the C1 Street Suffix Abbreviations web site.
  • usps:secondary_unit_designators: C2 Secondary Unit Designators.
  • wikipedia:us_states: List of the 50 states in the United States, with their current capital, largest city, the date they ratified the U.S. Constitution or were admitted to the Union, population and area data, and number of representative(s) in the U.S. House of Representatives (from Wikipedia).
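
As an illustration of how such a reference dataset might be used in a cleaning task, here is a minimal sketch that standardizes the trailing street type of an address. It assumes a mapping like the one in usps:street_abbrev has already been loaded into a plain Python dict. The `STREET_ABBREV` excerpt and the `standardize_street` helper are hypothetical names for this example and are not part of openclean or this repository.

```python
# Hypothetical excerpt of a street-type mapping (variant -> standard form).
# In practice the full mapping would come from the usps:street_abbrev dataset.
STREET_ABBREV = {
    "AVENUE": "AVE",
    "AV": "AVE",
    "STREET": "ST",
    "STR": "ST",
    "BOULEVARD": "BLVD",
    "BOULV": "BLVD",
}


def standardize_street(address, mapping=STREET_ABBREV):
    """Replace the last token of an address with its standard abbreviation.

    Tokens that are not found in the mapping are left unchanged.
    """
    tokens = address.split()
    if not tokens:
        return address
    # Normalize case and drop a trailing period before the lookup.
    key = tokens[-1].upper().rstrip(".")
    tokens[-1] = mapping.get(key, tokens[-1])
    return " ".join(tokens)


print(standardize_street("123 Main Street"))
print(standardize_street("5th Av."))
```

The sketch only inspects the last token, which is a simplification. Addresses with a secondary unit after the street type (see usps:secondary_unit_designators) would need additional handling.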

Parser and Downloader

This repository also contains a collection of Python scripts that were used to download and parse the different datasets.
