GitHub - billmcmillin/refine: Garbage Reconciliation: Data Cleaning and Linking. Documents for the presentation on OpenRefine at Ohio IR day.

#Garbage Reconciliation: Data Cleaning and Linking

##Documents for the presentation on OpenRefine at Ohio IR day

###Acknowledgements

Major thanks to Christina Harlow for not only creating LC-Reconcile, but also for creating and sharing this excellent documentation.
Data Carpentry has some great materials for getting started with OpenRefine.

###The setup - OpenRefine

Ubuntu running in VirtualBox
Installation instructions
Requires a Java Runtime Environment (JRE)

###The goals

Perform authority control on all names and subject headings
Save a copy of controlled entities as URLs for future updates
Apply a list of new subject headings to existing records

###The setup - LC Reconcile

git clone https://github.com/cmh2166/lc-reconcile.git
cd lc-reconcile
sudo pip install -r requirements.txt
python reconcile.py
The service should be running on http://0.0.0.0:5000
Leave that terminal open and open OpenRefine.

###Starting OR

Switch to the directory where OpenRefine (OR) is installed and run ./refine
A browser should open and go to http://127.0.0.1:3333/

###The data

In a DRC repository, go to a collection and click 'Export metadata.'
Save in your working directory as 'data/original_data.csv'
The data must be saved in csv format with UTF-8 encoding.
To convert a csv to UTF-8 encoding, you can use the encode-utf8.rb in this repository with ./encode-utf8.rb infile.csv outfile.csv
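
As a rough illustration of what such a conversion involves, here is a minimal Ruby sketch. It is not the repository's encode-utf8.rb; the helper name `to_utf8` is hypothetical, and the source encoding (Windows-1252) is an assumption that should be checked against your own export.

```ruby
# Re-encode raw bytes from an assumed source encoding to UTF-8.
# ASSUMPTION: the export is Windows-1252; pass another encoding name if not.
def to_utf8(bytes, source_encoding = "Windows-1252")
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# "\xE9" is the single byte for an accented e in Windows-1252.
raw = "Jos\xE9".b
clean = to_utf8(raw)
puts clean.encoding          # UTF-8
puts clean.valid_encoding?   # true
```

For a whole file, the same helper could be applied as `File.write(outfile, to_utf8(File.binread(infile)))`, where `infile` and `outfile` are your own paths.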

###Import

In OR, browse to your file and click 'next.'
Make sure the encoding is UTF-8.
Columns are separated by commas (make sure nothing else is checked).
Check 'parse next 1 line(s) as column headers.'
Check 'quotation marks are used to enclose cells containing column separators.'
Create project
Make sure the number of rows matches the number of records in your collection.

###Letting OR know where to find LC Reconcile

In OR, under dc.contributor.author, go to Reconcile > Start reconciling
Add Standard Service
Enter http://0.0.0.0:5000/
For now, cancel the reconciliation because we haven't looked at the data yet.

###Identifying data to clean

####Names

dc.contributor.author
dc.subject
Notice that there is more than one entry per column: Ashbery, John, 1927-||Lehman, David, 1948-

####Splitting values

We'll need to separate these so we can work on each piece of data.
Go to the arrow next to dc.contributor.author
Edit column > Split into several columns...
Separator = ||
Now we have over a dozen columns with dc.contributor.author and it will be hard to work on these. Another solution is to separate the authors by creating a new row for each one. For more on rows/records, see the Programming Historian's post on Cleaning Data with OpenRefine.
In the undo/redo pane, click on 0 to undo the changes.
On the dc.contributor.author column, click on Edit Cells > Split multi-valued cells
Additional authors are now placed in their own row, so we can work on all authors in one column.
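
To make the row-per-value idea concrete, here is a small Ruby sketch of the same transformation outside OpenRefine. The helper name `split_multivalue` and the `id` column are hypothetical; the sketch assumes each row is a hash and that the other columns stay on the first row only, which is how OR displays a record.

```ruby
# Split a multi-valued cell into one row per value.
# Other columns are kept on the first row only; extra rows hold just the value.
def split_multivalue(rows, column, separator = "||")
  rows.flat_map do |row|
    values = row[column].to_s.split(separator)
    values = [""] if values.empty?
    values.each_with_index.map do |value, i|
      i.zero? ? row.merge(column => value) : { column => value }
    end
  end
end

rows = [
  { "id" => "1", "dc.contributor.author" => "Ashbery, John, 1927-||Lehman, David, 1948-" }
]
split_multivalue(rows, "dc.contributor.author").each do |row|
  puts row["dc.contributor.author"]
end
```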

####Cleaning the values

We will want to get the data as close to the format with which it will be reconciled as possible.
Any time we use GREL, it can be applied via the column header triangle > Edit Cells > Transform...
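Remove the trailing years from the names with GREL value.replace(/[0-9]/,""). This leaves the hyphen of an open date range such as 1927- in place; value.replace(/[0-9-]+$/,"") removes the trailing digits and hyphen together.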
Remove the extra whitespace with GREL trim(value)
Remove the trailing comma with value.replace(/,$/,"")
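
The three transforms can also be tried outside OpenRefine. The Ruby sketch below mirrors the GREL steps in order, with one assumption added: the date pattern also strips the hyphen that a range such as 1927- would otherwise leave behind. The helper name `clean_name` is hypothetical.

```ruby
# Ruby analogue of the three GREL cleaning steps, applied in order.
def clean_name(value)
  value
    .gsub(/[0-9]+-?/, "")  # drop years; the optional hyphen is an added assumption
    .strip                 # same role as GREL trim(value)
    .sub(/,\z/, "")        # drop a comma left at the very end of the value
end

["Ashbery, John, 1927-", "Lehman, David, 1948-"].each do |name|
  puts clean_name(name)
end
# prints:
# Ashbery, John
# Lehman, David
```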

###Reconciliation

Now we can reconcile the names following the steps outlined above.
When reconciliation is complete, we may still need to select the best match. Look at Claudia Emerson. There are 3 possible matches for her name. Clicking on an item will take you to its page at id.loc.gov. After determining the correct match, click on the double check box to apply that heading to all identical cells.
We can view only those that need attention by selecting 'none' under the judgment facet.

###Find and replace

There are likely still values that will need to be updated that the reconciliation service missed.

On the column arrow, select Edit Cells > Transform
In expression, you can use GREL
For find and replace: value.replace(/Giovanni.*/,"Giovanni, Nikki")
We may want to reconcile again if values are more likely to be recognized.
Removing values: value.replace("The Elliston Project: ","")
Replace any value starting with the word 'Dwelling' with 'Dwelling, a Poetry Podcast': value.replace(/^Dwelling.*/,"Dwelling, a Poetry Podcast")
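
A regex replace swaps out only the text the pattern matches, so whether the pattern ends in .* decides whether the rest of the cell survives. Ruby's gsub behaves the same way and serves here as an analogy for the GREL expressions; the cell value is a hypothetical example.

```ruby
value = "Dwelling: Episode 12" # hypothetical cell value

# Without .*, only the matched word is replaced and the rest of the cell stays.
partial = value.gsub(/^Dwelling/, "Dwelling, a Poetry Podcast")
puts partial # Dwelling, a Poetry Podcast: Episode 12

# With .*, the match runs to the end of the value, so the whole cell is replaced.
whole = value.gsub(/^Dwelling.*/, "Dwelling, a Poetry Podcast")
puts whole # Dwelling, a Poetry Podcast
```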

###Bringing them back together

Wait to do this if we're assigning additional subjects to the records.

Are we done working on this column? If so, let's bring the values back together.
In the column triangle, select Edit Cells > Join multi-valued cells
Choose a separator that won't be found in the data such as ||.
Export as a CSV with a name like 'subj_clean.csv'
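
For reference, the join can be sketched in Ruby as follows. The sketch assumes rows are hashes and that a row with an empty key column (here a hypothetical `id`) continues the record above it; `join_multivalue` is a hypothetical helper, not part of OpenRefine or this repository.

```ruby
# Fold continuation rows (those with an empty key) back into the record above,
# joining the values of one column with a separator.
def join_multivalue(rows, key, column, separator = "||")
  records = []
  rows.each do |row|
    if row[key].to_s.empty? && !records.empty?
      merged = [records.last[column], row[column]].reject { |v| v.to_s.empty? }
      records.last[column] = merged.join(separator)
    else
      records << row.dup
    end
  end
  records
end

rows = [
  { "id" => "1", "dc.contributor.author" => "Ashbery, John" },
  { "dc.contributor.author" => "Lehman, David" }
]
puts join_multivalue(rows, "id", "dc.contributor.author").first["dc.contributor.author"]
# prints: Ashbery, John||Lehman, David
```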

###Adding data

We have a list of subject headings and the authors to which those headings are to be applied. Do we just search for names and paste in the headings?
For large datasets, it's better to compare each record to a key-value structure that maps authors to subjects.
E.g. {"Levertov, Denise" => ["Frost Medal", "New American Poetry", "Feminism"]}
If we're going to match the data to the keys in our subjects, we'll need to reconcile both.
The author names we were given had names in direct order. Use invert_order.rb to put the last name first. It's not perfect, but gets us closer to inverse order.
In the file with our records, make sure the multi-value cells have been split into multiple rows.
Reconcile the subject dictionary file just like we did for the records.
Export the dictionary file as 'dictionary.csv'.
With the reconciled subject dictionary, run the script apply_mapping.rb with the command ruby apply_mapping.rb data/subj_clean.csv data/mapping_output.csv 2 31 data/dictionary.csv
Here 2 is the column in the input data (subj_clean.csv) that holds the values we want to compare against the subject dictionary, and 31 is the column in the input data to which the subjects are appended when a match is found.
Import mapping_output.csv into OR and join multi-valued cells as we did above.
Export as CSV and upload to the repository. Be sure to save a copy of the OR project as well.
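
The inversion and matching steps described above can be sketched in Ruby as follows. This is an illustration only and not the repository's invert_order.rb or apply_mapping.rb; the helper names, the hash-based rows, and the existing 'Poetry' subject are all assumptions made for the example.

```ruby
# Naive inversion of "First Last" to "Last, First". Like any simple approach,
# it mishandles multi-word surnames, suffixes, and corporate names.
def invert_name(name)
  parts = name.strip.split(/\s+/)
  return name.strip if parts.length < 2 || name.include?(",")
  "#{parts.last}, #{parts[0..-2].join(' ')}"
end

# Append mapped subjects to each row whose author is a key in the dictionary.
def apply_subjects(rows, dictionary, author_col, subject_col, separator = "||")
  rows.map do |row|
    extra = dictionary.fetch(row[author_col], [])
    existing = row[subject_col].to_s.split(separator)
    row.merge(subject_col => (existing + extra).uniq.join(separator))
  end
end

dictionary = {
  invert_name("Denise Levertov") => ["Frost Medal", "New American Poetry", "Feminism"]
}
rows = [
  { "dc.contributor.author" => "Levertov, Denise", "dc.subject" => "Poetry" },
  { "dc.contributor.author" => "Ashbery, John", "dc.subject" => "Poetry" }
]
result = apply_subjects(rows, dictionary, "dc.contributor.author", "dc.subject")
result.each { |row| puts row["dc.subject"] }
# prints:
# Poetry||Frost Medal||New American Poetry||Feminism
# Poetry
```

An exact key lookup like this only finds authors whose names are identical in both files, which is why both the records and the dictionary are reconciled first.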

###The result

####We now have

A set of records on which we've performed cleaning and authority control.
An OR Project with URIs pointing to id.loc.gov that can be used for updating headings or beginning a linked data project.
A subject > author dictionary that has been reconciled with the LCNAF that can be applied to future records.

###The takeaway

Learning regular expressions (regex) is the first and most important step in mastering data wrangling. All of these tools are at their core ways to help you apply regex to your data.
