ura provides a set of tools for calculating inter-rater reliability
(IRR) statistics by rater, allowing for real-time monitoring of rater
reliability. While not the first package to give users access to IRR
diagnostics (e.g., irr), ura aims to provide a simple set of tools for
quickly monitoring rater progress and precision. You can use ura to, for
instance, find the percentage agreement or Krippendorff’s Alpha across
all of the subjects coded by your raters. Another helpful use is to
calculate percentage agreement values by rater, providing an efficient
way to monitor the relative reliability of your raters.
This package complements a paper published in PS: Political Science &
Politics, entitled “Improving Content Analysis: Tools for Working with
Undergraduate Research Assistants.” Please refer to this paper for a
more general discussion about training and monitoring student raters.
Also, be sure to check out the paper for more information about how to
use the tools in ura to monitor progress without compromising
reproducibility.
ura is available on CRAN and can be installed using:
install.packages("ura")
You can install the most recent development version of ura using the
devtools package. First, you have to install devtools using the
following code. Note that you only have to do this once:
if(!require(devtools)) install.packages("devtools")
Then, load devtools and use the function install_github() to install
ura:
library(devtools)
install_github("bengoehring/ura", dependencies = TRUE)
ura can be used to calculate key IRR statistics, such as percentage
agreement and Krippendorff’s Alpha, via the irr_stats() function. This
function largely serves as a wrapper around irr::agree() and
irr::kripp.alpha() but aims to simplify users’ lives by only requiring
the user to provide a dataframe and specify key columns.
For instance, below I calculate the percentage agreement and
Krippendorff’s Alpha of the diagnoses dataset, which notes the
psychiatric evaluations of 30 patients from 6 raters. The diagnoses
dataset is included with the ura package and is simply a reshaped
version of the dataset with the same name in the irr package.
library(ura)
irr_stats(diagnoses,
          rater_column = 'rater_id',
          subject_column = 'patient_id',
          coding_column = 'diagnosis')
#> # A tibble: 2 × 3
#>   statistic            value n_subjects
#>   <chr>                <dbl>      <int>
#> 1 Percentage agreement 16.7          30
#> 2 Krippendorf's Alpha   0.43         30
A few things to note here. First, the unit of analysis in diagnoses is
rater-subject; that is, each row provides the coding decision of rater
i for subject j. All data passed to a ura function should be long by
rater-subject. Second, you will see that the dataframe returned by
irr_stats() notes the number of subjects used to calculate the given
IRR statistic. In the case of diagnoses, this value is equal to the
number of unique subjects in the dataframe:
length(unique(diagnoses$patient_id))
#> [1] 30
This is not always the case. If your dataframe includes both subjects
coded by more than one rater and subjects coded by a single rater (a
common approach for balancing efficiency with the need for IRR
statistics), ura will automatically use only the subjects coded by more
than one rater. The resulting number of subjects will then appear in the
n_subjects column.
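To make the filtering step concrete, here is a minimal base-R sketch of the idea. It is not ura's actual implementation: the data frame toy and its column names are hypothetical, and the agreement rule (a subject counts as agreed upon only when all of its codings are identical) is an assumption chosen to mirror simple percentage agreement.

```r
# Hypothetical long-format data: one row per rater-subject pair.
toy <- data.frame(
  rater   = c(1, 1, 1, 2, 2, 2, 3),
  subject = c(1, 2, 3, 1, 2, 4, 5),
  coding  = c(1, 0, 1, 1, 1, 0, 1)
)

# Count the distinct raters who coded each subject.
raters_per_subject <- tapply(toy$rater, toy$subject,
                             function(x) length(unique(x)))

# Keep only subjects coded by more than one rater.
multi_subjects <- names(raters_per_subject)[raters_per_subject > 1]
multi_coded <- toy[as.character(toy$subject) %in% multi_subjects, ]

# Assumption: a subject is "agreed upon" when all of its codings match.
agreed <- tapply(multi_coded$coding, multi_coded$subject,
                 function(x) length(unique(x)) == 1)

# Number of subjects that enter the statistic, and the pooled agreement.
n_subjects <- length(agreed)
percent_agreement <- 100 * mean(agreed)
```

In this toy data only subjects 1 and 2 are coded by two raters, so n_subjects is 2 even though the data frame contains five unique subjects.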
The rater_agreement() function is the key method for monitoring rater
reliability. While irr_stats() provides pooled IRR statistics across
all raters, rater_agreement() provides the percent share of a given
rater’s codings that agree with other raters’ codings. In other words,
it offers supervisors a method for checking the relative precision of
each rater in real time. Since interventions in coding procedures should
be used sparingly, I suggest taking a look at the paper linked above for
more information about when and why to intervene based on information
gleaned from rater_agreement().
In the snippet below, all raters have the same percent agreement: 17%. That is because, as implied by the n_multi_coded column, every rater codes every subject in the diagnoses dataset.
rater_agreement(diagnoses,
                rater_column = 'rater_id',
                subject_column = 'patient_id',
                coding_column = 'diagnosis')
#> # A tibble: 6 × 3
#>   rater percent_agree n_multi_coded
#>   <dbl>         <dbl>         <int>
#> 1     1            17            30
#> 2     2            17            30
#> 3     3            17            30
#> 4     4            17            30
#> 5     5            17            30
#> 6     6            17            30
A more helpful use case is when you only have your raters multi-code a subset of subjects. Take this hypothetical dataset, for instance:
example_data <- tibble::tribble(
  ~rater, ~subject, ~coding,
  1, 1, 1,
  1, 2, 0,
  1, 3, 1,
  1, 4, 0,
  2, 3, 1,
  2, 9, 0,
  2, 10, 1,
  2, 4, 1,
  2, 5, 1,
  2, 6, 1,
  3, 5, 1,
  3, 6, 1,
  3, 7, 1,
  3, 8, 1,
)
Here, some subjects are coded by multiple raters while others are coded by a single rater. As a result:
rater_agreement(example_data,
                rater_column = 'rater',
                subject_column = 'subject',
                coding_column = 'coding')
#> # A tibble: 3 × 3
#>   rater percent_agree n_multi_coded
#>   <dbl>         <dbl>         <int>
#> 1     3           100             2
#> 2     2            75             4
#> 3     1            50             2
In terms of interpretation, row 3 shows that of the 2 subjects coded by rater 1 that were also coded by another rater, rater 1 agrees with the other rater(s) 50% of the time. Looking back at example_data, it appears that rater 1 agreed with rater 2 on the coding of subject 3 but not on subject 4.
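For readers who want to trace the arithmetic by hand, the following base-R sketch recomputes the per-rater figures for the same hypothetical data. It is an illustration only and is not how ura is implemented: the name example_df is made up, the sketch assumes each rater codes a subject at most once, and it treats a rater as agreeing on a subject only when every coding of that subject is identical.

```r
# The same values as example_data above, rebuilt as a base-R data frame.
example_df <- data.frame(
  rater   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  subject = c(1, 2, 3, 4, 3, 9, 10, 4, 5, 6, 5, 6, 7, 8),
  coding  = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1)
)

# For every row, count the distinct raters and distinct codings of its subject.
n_raters  <- ave(example_df$rater, example_df$subject,
                 FUN = function(x) length(unique(x)))
n_codings <- ave(example_df$coding, example_df$subject,
                 FUN = function(x) length(unique(x)))

# Keep rows for multi-coded subjects; flag subjects where all codings match.
multi <- example_df[n_raters > 1, ]
multi$agree <- n_codings[n_raters > 1] == 1

# Per rater: percent of multi-coded subjects in agreement, and their count.
by_rater <- data.frame(
  rater         = sort(unique(multi$rater)),
  percent_agree = as.vector(100 * tapply(multi$agree, multi$rater, mean)),
  n_multi_coded = as.vector(tapply(multi$agree, multi$rater, length))
)
by_rater
```

Under these assumptions the sketch gives 50, 75, and 100 percent for raters 1, 2, and 3, matching the output above, although the rows here are sorted by rater rather than by percent_agree.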