Materials for the useR!2021 tutorial on data validation

useR2021_tutorial


Watch the video of this tutorial here.

Preparations

  • Make sure you have a recent version of R (>= 4.1.0) installed.
  • During the tutorial we will use the RStudio IDE, but this is not mandatory for participants. Please note that you need RStudio version 1.4.1717 or higher to work with R >= 4.1.0.

Install the following packages by copying and running the code below.

install.packages(c("validate","validatetools","validatedb"
                  ,"RSQLite","lumberjack"))
install.packages("simputation", dependencies=TRUE)
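To confirm that the setup worked, something like the following sketch can be run after installation. It uses only base R; the variable names `required` and `missing_pkgs` are illustrative and not part of the tutorial materials.

```r
# Check the R version against the minimum stated above (4.1.0).
if (getRversion() < "4.1.0") {
  warning("R >= 4.1.0 is recommended for this tutorial")
}

# Packages listed in the installation instructions above.
required <- c("validate", "validatetools", "validatedb",
              "RSQLite", "lumberjack", "simputation")

# requireNamespace() returns FALSE (without erroring) for packages
# that cannot be loaded.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All tutorial packages are installed.")
}
```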

Required knowledge

We expect participants to have some basic knowledge of (base) R. No knowledge of particular packages is required. You should be familiar with data frames, reading from and writing to CSV files, selecting columns and rows, and working with R scripts.

If you want to brush up your R knowledge, you can follow the excellent free online tutorial by Norm Matloff.

If you work with RStudio, we strongly advise you to work in an RStudio Project so data and scripts are found within the local project path.

Time outline

Part 1: Introduction to validate and data validation (60 min)

  • Opening and hands-on introduction to 'validate' workflow (20 min)
  • Presentation: theory of data validation + Q&A (20 min)
  • Breakout assignment & discussions in groups (20 min)
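For orientation, the basic 'validate' workflow (define rules, confront data with them, inspect the results) can be sketched as follows. This is a minimal illustration on R's built-in `cars` data set, not the exercise used in the tutorial; the three rules are arbitrary examples.

```r
library(validate)

# 1. Define validation rules as ordinary R expressions.
rules <- validator(
  speed >= 0,
  dist >= 0,
  speed / dist <= 1.5
)

# 2. Confront the data with the rules.
out <- confront(cars, rules)

# 3. Inspect the results: one row per rule, with counts of
#    passing, failing and missing (NA) outcomes.
res <- summary(out)
print(res[, c("name", "items", "passes", "fails", "nNA")])
```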

Break (10 min)

Part 2: Expressing data checks (50 min)

  • Feedback on results of the breakout groups (10 min)
  • Focus on different validation tasks. Small programming assignments, from simple to complex tasks. (40 min)
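As a rough idea of what "from simple to complex" can look like, the sketch below combines a completeness check, a uniqueness check, a range check and a conditional rule. The `people` data frame and the rule names are made up for illustration and do not come from the tutorial assignments.

```r
library(validate)

# Hypothetical example data with a few deliberate problems.
people <- data.frame(
  id     = c(1, 2, 2, NA),
  age    = c(34, 16, 150, 42),
  income = c(2500, 0, 1800, 3100)
)

rules <- validator(
  id_present      = is_complete(id),                   # no missing ids
  id_unique       = is_unique(id),                     # no duplicate ids
  age_range       = in_range(age, min = 0, max = 120), # plausible ages
  minor_no_income = if (age < 18) income == 0          # conditional rule
)

out <- confront(people, rules)
res <- summary(out)
print(res[, c("name", "items", "passes", "fails", "nNA")])
```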

Break (10 min)

Part 3: Automating data quality measurement with lumberjack (50 min)

  • Hands-on introduction to lumberjack (15 min)
  • Presentation: monitoring data in R and Q&A (15 min)
  • Hands-on: lumberjack and validate (20 min)
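To give a flavour of the logging approach, here is a minimal sketch using lumberjack's `simple` logger and the `%L>%` pipe on the built-in `women` data set. The processing steps are arbitrary, and the tutorial itself may use other loggers, including ones that track validation results.

```r
library(lumberjack)

# The log is written to a temporary file to avoid cluttering the project.
logfile <- tempfile(fileext = ".csv")

out <- women %L>%
  start_log(simple$new()) %L>%                # attach a logger to the data
  transform(height = height * 2.54) %L>%      # inches to centimetres
  transform(weight = weight * 0.4536) %L>%    # pounds to kilograms
  dump_log(file = logfile)                    # write the log and stop logging

# The simple logger records, per step, whether the data changed.
steps <- read.csv(logfile)
print(steps)
```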

Break (10 min)

Part 4: Managing data validation rules (50 min)

  • Hands-on introduction to the {validatetools} package (20 min)
  • Presentation and Q&A: validation rule management (20 min)
  • Closing and Q&A (10 min)
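Two typical rule-management tasks are spotting contradictory rule sets and dropping rules that are implied by others. The sketch below shows both with {validatetools}, assuming toy rule sets on a single hypothetical variable `x`; it is not taken from the tutorial slides.

```r
library(validate)
library(validatetools)

# A contradictory rule set: no value of x can satisfy both rules.
contradictory <- validator(
  r1 = x > 1,
  r2 = x < 0
)
infeasible <- is_infeasible(contradictory)
print(infeasible)

# A rule set with a redundant rule: x > 2 already implies x > 1.
overlapping <- validator(
  r1 = x > 1,
  r2 = x > 2
)
reduced <- remove_redundancy(overlapping)
print(reduced)
```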

Materials for further reading
