---
title: "IRBs & Data Sharing"
author: "Rick Gilmore & Gustav Nilsonne"
date: "`r Sys.time()`"
output:
  ioslides_presentation:
    incremental: true
    widescreen: true
    smaller: true
  github_document:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

IRBs & Data Sharing

2017-07-30 3:30-4:15 pm




  • Make data from psychological research as widely available as possible
    • Increase reuse potential
    • Reduce bias
    • Make published analyses as transparent as possible
  • Avoid harming research participants


  • Ethical challenges in sharing data
  • Sharing de-identified data
  • Sharing identifiable data

Ethical challenges in sharing data

Belmont principles

  • Beneficence
    • Data sharing increases value (good)
    • Data sharing may pose risk of loss of privacy & confidentiality (bad)
  • Autonomy
    • Data sharing may pose risk of unintended use of data
    • Participants should participate in decision-making

  • Justice
    • Benefits (and costs) of research participation should be equitable

Meeting the challenges

  • Tension between protecting participants and advancing discovery
  • Tension between requirements, expectations, and desires to share and practical, regulatory/legal, and ethical constraints

What data are you collecting?

  • Personally identifying or sensitive data?
  • What risks does data sharing pose?
  • How should data be protected?

Who will (& should) have access?

  • Public
  • Community of authorized individuals (researchers)
  • Individuals selected by data owner or repository

What have participants been told, what have they approved, and what do they understand?

  • What data collected, what will be shared
  • Who will have access
  • Where stored, how accessed
  • Purposes of use, types of questions

Are your data subject to statutory, regulatory, or contractual restrictions?

Sharing de-identified data

What is personally identifying information (PII)?

  • PII definitions vary by use case, context
  • Likelihood of identification depends on uniqueness in target and reference populations
  • Health Insurance Portability and Accountability Act (HIPAA) identifiers
In the U.S., much behavioral research is funded by the NIH and has at least a nominal relationship to health. So, the HIPAA identifiers serve as a guideline for PII.

HIPAA identifiers

  • Name
  • Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code)
  • All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)

  • Telephone
  • Fax numbers
  • Email address
  • Social Security Number

  • Medical record number
  • Health plan beneficiary number
  • Account number
  • Certificate or license number
  • Any vehicle or other device serial number
  • Web URL
  • Internet Protocol (IP) Address

  • Finger or voice print
  • Photographic image (not limited to images of the face)
  • Any other characteristic that could uniquely identify the individual
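In practice, a first de-identification pass often simply drops columns matching these identifiers and coarsens dates to years. A minimal R sketch, with hypothetical column names and data:

```r
# Hypothetical raw data with direct identifiers mixed in with study variables
raw <- data.frame(
  name  = c("A. Smith", "B. Jones"),
  email = c("a@example.com", "b@example.com"),
  dob   = as.Date(c("1980-02-01", "1975-06-15")),
  score = c(12, 17)
)

# Drop direct identifiers
identifiers <- c("name", "email")
deid <- raw[, setdiff(names(raw), identifiers), drop = FALSE]

# Keep only the year from dates, per the HIPAA guideline above
deid$birth_year <- as.integer(format(deid$dob, "%Y"))
deid$dob <- NULL
```

Checklists like this catch only direct identifiers; quasi-identifiers (the "any other characteristic" category) still need the risk review described below.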

Other potentially identifiable information

  • Structural MRI
    • Emerging standard: deface
  • Genetic profiles

Examples of possibly sensitive data

  • Health-related information
    • Medical history
    • Medical risk factors including genetic data

  • Information about other potentially stigmatizing characteristics (situation-dependent)
    • Religious/philosophical convictions
    • Sexual identity and preferences
    • Political affiliation, trade union membership
    • Ethnicity, nationality, citizenship status

Weighing benefits (of sharing) vs. risks

  • How useful are data?
  • How sensitive are data?
  • How likely is it that reidentification could be achieved, and by whom?

Risk scenarios

  • Reidentification by participants themselves
    • Can be harmful, e.g., if the dataset contains uncommunicated health-risk information
  • Reidentification by insider
  • Reidentification by targeted search (nemesis scenario)
  • Reidentification by mass matching (dystopian AI scenario)

Ways to mitigate risk

  • Aggregate or censor sensitive variables
  • Aggregate or censor secondary identifying variables
  • Perturb or add noise to variables
  • Review data for disclosure risk

  • Stepped or restricted access
    • Data enclaves (e.g., Census data)
    • Virtual data enclaves
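The first three mitigation techniques can be illustrated in a few lines of R. This is a rough sketch with hypothetical variables, not a substitute for a formal disclosure-risk review:

```r
# Hypothetical participant data: exact age in years and income in dollars
df <- data.frame(age = c(23, 37, 91, 45),
                 income = c(31000, 52000, 78000, 250000))

# Aggregate: replace exact age with 10-year bins; ages 90+ fall in one
# top band, consistent with the HIPAA age-over-89 guideline above
df$age_band <- cut(df$age, breaks = c(seq(0, 90, by = 10), Inf),
                   right = FALSE)

# Censor: top-code extreme incomes so outliers are less identifying
df$income_tc <- pmin(df$income, 150000)

# Perturb: add small random noise to the top-coded income
set.seed(1)
df$income_noisy <- df$income_tc + rnorm(nrow(df), mean = 0, sd = 1000)

# Share only the protected versions, not the raw columns
shared <- df[, c("age_band", "income_noisy")]
```

Each step trades analytic precision for protection, which is why the final step, reviewing the result for disclosure risk, still matters.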

Example language for consent forms

Case study:

Sharing identifiable data

Canadian Policy

  • Researchers must obtain consent for secondary use of identifiable data unless
    • identifiable information is essential to the research
    • use of identifiable information without consent is unlikely to adversely affect participants
    • researchers take appropriate measures to protect privacy of individuals and safeguard identifiable information

  • Researchers must obtain consent for secondary use of identifiable data unless
    • researchers comply with any known preferences previously expressed by individuals about any use of their information
    • it is impossible or impracticable to seek consent
    • researchers have obtained any other necessary permission for secondary use of information for research purposes.

Case studies in sharing identifiable data

  • Databrary specializes in storing and sharing video
  • Video captures behavior more fully than other methods, but is identifiable
  • Policy framework for sharing identifiable data
    • Permission to share -> builds on informed consent
    • Restricted access for (institutionally) authorized researchers

Seeking permission to share


Every researcher who wants access to Databrary must have formal written approval from their institution.

Lessons learned

  • Research consent ≠ permission to share
    • Seek permission to share after data collection.
  • "Cloud" storage vs. institutionally housed
  • Comfort with data sharing varies among IRBs
  • Laws differ among countries
In securing agreements with more than 330 institutions, we've learned some valuable lessons. One is to keep the consent to participate in research separate from the permission to share data: the risks and benefits differ, and participants are better informed about what they're sharing *after* a session has ended. A second is that some institutions distinguish between data stored in the cloud (e.g., on OSF or Databrary) and data stored on servers the institution controls. A third is that IRBs, like the local communities they are intended to reflect, differ; some aren't comfortable with Databrary's distributed model of responsibility and won't let their researchers participate. And finally, national laws differ: some researchers can't store identifiable data on U.S. servers.

Open Humans

Public sharing of identifiable data

Risks of public sharing

  • Identity theft
  • Embarrassment
  • Discrimination
  • Data may later become sensitive
  • Can withdraw, but can't "unshare"

Open Humans: Public Data Sharing Consent

Specific risks for sharing these data types

  • Demographic data
  • Genetic data
  • Location data

Open Humans: Public Data Sharing Consent

Benefits of public sharing

  • Public data as a public resource
  • Serves diverse individuals not part of standard research groups
  • Participants can advance their own understanding

Open Humans: Public Data Sharing Consent


Prepare for sharing

  • Get IRB/ethics board approval
  • Get participant approval (even if planning to anonymize)

Alert participants

  • Where data will be stored
    • e.g. in "cloud" servers (e.g., SurveyMonkey, Qualtrics, Databrary, OpenNeuro, OSF, etc.)
    • Be explicit, but not specific

  • Who will have access
    • Public/anyone
    • Researchers
  • And for how long
    • Indefinitely
    • Stopping future sharing is possible; unsharing is not

  • Why:
    • Give motivation for recording sensitive variables (beneficence)
  • Consult data repository experts (e.g., ICPSR, Dataverse, Databrary)

  • Avoid making promises you cannot keep:
    • "No one except the researchers in the project will ever see the data."
  • Avoid data destruction clauses:
    • "Your data will be stored for X years then destroyed."
    • NOT REQUIRED by U.S. or Canadian law

  • Avoid describing overly specific use cases for data:
    • "Your data will be used to study the relationship between X and Y."

Share as openly as practicable

  • Consider approved, authorized, trusted data repositories for sensitive data
  • Share as much individual-level, item-specific data as practicable
    • Finest-grained data have the highest value for reuse and new discovery


How anonymous is 'anonymous' data?

Sex + DOB + ZIP uniquely identifies most Americans (Sweeney, 2000)
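One way to gauge this risk before sharing is to count how many records share each combination of quasi-identifiers, i.e., a k-anonymity check. A minimal R sketch, with hypothetical column names and data:

```r
# Hypothetical de-identified data: sex, year of birth, 3-digit ZIP prefix
df <- data.frame(
  sex  = c("F", "F", "M", "M", "F"),
  yob  = c(1980, 1980, 1975, 1975, 1992),
  zip3 = c("100", "100", "162", "162", "100")
)

# Size of each equivalence class of quasi-identifier combinations
k <- aggregate(list(n = rep(1, nrow(df))),
               df[c("sex", "yob", "zip3")], sum)

# Records with n = 1 are unique on these variables: easiest to reidentify
min(k$n)        # the dataset's k-anonymity (1 here: at least one unique record)
k[k$n == 1, ]   # the unique, highest-risk combinations
```

If the minimum class size is small, the mitigation steps above (aggregating, censoring, or perturbing the quasi-identifiers) can raise it before release.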

Other issues

  • Self-reported vs. medical records
  • Sponsor requirements (or constraints) vs. open sharing
  • Do IRBs overstep regulatory boundaries when considering risks and benefits outside an approved study (Burnam, 2014)?
  • Policies for restricting access but promoting openness
  • Who owns data?

DataTags initiative

  • From Dataverse @ Harvard
  • Checklist/workflow for 'tagging' data based on risk



This talk was produced on `r Sys.Date()` in RStudio 1.0.143 using R Markdown. The code and materials used to generate the slides are available online, along with an OSF wiki. Information about the R session that produced the code follows: