## Inter-Rater Reliability

- In statistics, inter-rater reliability (IRR) is the degree of agreement among independent raters or observers who rate, code, or assess the same phenomenon.
- It is known by several names, including **inter-rater agreement**, **inter-rater concordance**, **inter-observer reliability**, and **inter-coder reliability**.
- It measures how consistently different raters or observers score the same study or set of items.
- A high **IRR** value indicates consistent measurement across different observers.

#### Determining IRR

Several statistics can be used to determine inter-rater reliability:
1) **Cohen's Kappa**:
	- Denoted by lowercase Greek kappa, $κ$ 
	- Measures inter-rater reliability (also intra-rater reliability) for categorical data.
	- Takes into account the **possibility of agreement occurring by chance**.
	- It measures the agreement between two raters who each classify $N$ items into $C$  mutually exclusive categories.
	- The definition of $κ$ is:
	$${\displaystyle \kappa \equiv {\frac {p_{o}-p_{e}}{1-p_{e}}}=1-{\frac {1-p_{o}}{1-p_{e}}}}$$

		where, <br />
		$p_{o}$ = the relative observed agreement among raters<br />
		$p_{e}$ = the hypothetical probability of chance agreement, using the observed data to calculate the probability of each rater randomly assigning each category.

		**Note**: If the raters are in complete agreement, $κ = 1$. If there is no agreement beyond what chance would produce, $κ = 0$.

##### Example:
To compute Cohen's kappa in Python for a dataset in which two readers (A and B) evaluated grant proposals with "Yes" or "No" decisions, we can follow these steps:

- Create a **confusion matrix** (contingency table) to represent the agreement and disagreement counts.

  What is a **confusion matrix**?<br />
  In the field of machine learning and specifically the problem of statistical classification, a confusion matrix (also known as error matrix) is a specific table layout that allows visualization of the performance of an algorithm. The name (confusion) stems from the fact that it makes it easy to see whether the system or model is confusing two classes (i.e. commonly mislabeling one as another).

- Compute Cohen's Kappa score based on this confusion matrix.

  Assume the following confusion matrix where:

    - a is the count of "Yes-Yes" agreements.
    - b is the count of "Yes-No" disagreements.
    - c is the count of "No-Yes" disagreements.
    - d is the count of "No-No" agreements.
 
|                 | Reader B: Yes | Reader B: No |
|-----------------|---------------|--------------|
| Reader A: Yes   | a = 20        | b = 5        |
| Reader A: No    | c = 10        | d = 15       |
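
As a cross-check on the definition, $κ$ can be computed directly from the four cell counts in the table. The sketch below is a minimal illustration in plain Python; the helper name `cohen_kappa_from_counts` is hypothetical and not part of any library.

```python
def cohen_kappa_from_counts(a, b, c, d):
    """Cohen's kappa for a 2x2 table: a = Yes-Yes, b = Yes-No, c = No-Yes, d = No-No."""
    n = a + b + c + d
    # Observed agreement: share of items on the diagonal of the table.
    p_o = (a + d) / n
    # Marginal "Yes" proportions for Reader A (rows) and Reader B (columns).
    a_yes = (a + b) / n
    b_yes = (a + c) / n
    # Chance agreement: both say "Yes" by chance plus both say "No" by chance.
    p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_o - p_e) / (1 - p_e)


kappa = cohen_kappa_from_counts(20, 5, 10, 15)
print(f"kappa = {kappa:.3f}")
```

With the counts from the table, $p_o = 35/50 = 0.70$ and $p_e = 0.5 \times 0.6 + 0.5 \times 0.4 = 0.50$, so $κ = 0.20 / 0.50 = 0.40$.
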

In [1]:
import numpy as np
from sklearn.metrics import cohen_kappa_score

In [21]:
"""
Step 1: Create the confusion matrix
We'll first convert this matrix into individual ratings for each proposal
Suppose we have 20 "Yes-Yes", 5 "Yes-No", 10 "No-Yes", and 15 "No-No"
"""
reader_a = np.array([1]*20 + [1]*5 + [0]*10 + [0]*15) # 1 = Yes and 0 = No
reader_b = np.array([1]*20 + [0]*5 + [1]*10 + [0]*15)

In [18]:
# Display Reader A's ratings
reader_a

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

In [19]:
# Calculate the cohen kappa score between reader_a and reader_b
kappa_score = cohen_kappa_score(reader_a, reader_b)

In [20]:
f"Cohen's Kappa: {kappa_score:.5f}"

"Cohen's Kappa: 0.40000"

The kappa score measures the degree of agreement between the two readers beyond chance. On the commonly used Landis and Koch scale, a value of 0.40 sits at the boundary between fair (0.21 to 0.40) and moderate (0.41 to 0.60) agreement.
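
Verbal labels such as "fair" or "moderate" are conventions rather than part of the statistic. The sketch below encodes the Landis and Koch benchmarks as a small lookup; the function name `interpret_kappa` is hypothetical, and treating these cut-offs as the intended scale is an assumption.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a Landis and Koch style label (assumed scale)."""
    if kappa < 0:
        return "poor"
    # Upper bounds are inclusive, so exactly 0.40 is still labelled "fair".
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


for value in (0.05, 0.40, 0.55, 0.95):
    print(value, interpret_kappa(value))
```

Because 0.40 is a band edge, floating-point round-off in a computed score can tip the label either way; rounding the score before the lookup avoids that.
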

2) **Fleiss' Kappa**:
	- An extension of Cohen's kappa to multiple raters.
	- It can be used to assess the reliability of agreement between a fixed number of raters when classifying items.
	- It measures the degree of agreement in classification over that which would be expected by chance.
	- It can be thought of as follows: if a fixed number of people assign categorical ratings to a number of items, then the kappa gives a measure of how consistent the ratings are.
	- The definition of $\kappa$ is:
	$${\displaystyle \kappa ={\frac {{\bar {P}}-{\bar {P_{e}}}}{1-{\bar {P_{e}}}}}}$$

		where, <br />
		${\displaystyle 1-{\bar {P_{e}}}}$ = the factor that gives the degree of agreement attainable above chance <br />
		${\displaystyle {\bar {P}}-{\bar {P_{e}}}}$ = the degree of agreement actually achieved above chance

		**Note**: If raters are in complete agreement, $\kappa = 1$. If there is no agreement among the raters (other than what would be expected by chance), then ${\displaystyle \kappa \leq 0}$.
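
For reference, the two averaged quantities are usually computed as follows. The symbols are introduced here for illustration: $N$ items, $n$ ratings per item, $k$ categories, and $n_{ij}$ the number of raters who assigned item $i$ to category $j$.

$$
\begin{aligned}
p_j &= \frac{1}{N n} \sum_{i=1}^{N} n_{ij}, &
P_i &= \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^{2} - n \right), \\
\bar{P} &= \frac{1}{N} \sum_{i=1}^{N} P_i, &
\bar{P_e} &= \sum_{j=1}^{k} p_j^{2}
\end{aligned}
$$

Here $P_i$ is the fraction of rater pairs that agree on item $i$, and $p_j$ is the share of all ratings that fall in category $j$.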

#### Example Data:

Let's assume we have 10 proposals rated by 3 raters (A, B, C) with "Yes" (1) or "No" (0). How do we calculate Fleiss' kappa?

Steps:

- Create a rating matrix where rows represent proposals and columns represent raters.
- Use the `statsmodels` library to compute Fleiss' kappa.

In [25]:
from statsmodels.stats.inter_rater import fleiss_kappa

In [41]:
"""
Example data: 10 proposals rated by 3 raters: A, B, C
Each row represents a proposal, each column a rater
1 for "Yes", 0 for "No"
"""
ratings = np.array([
    [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0], 
    [1, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 1],
    # ... (we can add more rows)
])

In [43]:
# Print the ratings matrix
ratings

array([[1, 1, 0],
       [1, 0, 1],
       [0, 1, 1],
       [1, 1, 1],
       [0, 0, 0],
       [1, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [1, 1, 1]])

The cell below counts the occurrences of each rating per proposal and creates an array in which each row holds the counts of "No" and "Yes" ratings for that proposal.

In [46]:
rating_counts = np.apply_along_axis(lambda x: np.bincount(x, minlength=2), axis=1, arr=ratings)

In [47]:
rating_counts

array([[1, 2],
       [1, 2],
       [1, 2],
       [0, 3],
       [3, 0],
       [1, 2],
       [2, 1],
       [2, 1],
       [2, 1],
       [0, 3]])

As we can see, the first row of the array is [1, 2], which means that proposal received one 0 ("No") and two 1s ("Yes"). Similarly, the fourth row is [0, 3], which means three 1s and no 0s.

In [48]:
# Calculate Fleiss' kappa from the per-proposal category counts.
fleiss_score = fleiss_kappa(rating_counts)

In [49]:
f"Fleiss' Kappa: {fleiss_score:.3f}"

"Fleiss' Kappa: 0.050"

This result falls in the 0.00 to 0.20 range, indicating only **slight agreement** between the raters. Their ratings are barely more consistent than chance alone would produce.
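
The score can be cross-checked without any library. The sketch below is a minimal plain-Python version of the standard Fleiss' kappa computation applied to the count table shown above; the helper name `fleiss_kappa_from_counts` is hypothetical, and it assumes every proposal received the same number of ratings.

```python
def fleiss_kappa_from_counts(counts):
    """Fleiss' kappa from an items x categories table of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumes the same number of ratings per item
    n_cats = len(counts[0])
    total = n_items * n_raters
    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    p_e_bar = sum(p * p for p in p_j)
    return (p_bar - p_e_bar) / (1 - p_e_bar)


# [No, Yes] counts for the 10 proposals in the example.
counts = [[1, 2], [1, 2], [1, 2], [0, 3], [3, 0],
          [1, 2], [2, 1], [2, 1], [2, 1], [0, 3]]
kappa = fleiss_kappa_from_counts(counts)
print(f"Fleiss' kappa: {kappa:.3f}")
```

For these counts, $\bar{P} = 16/30$ and $\bar{P_e} = 458/900$, so $\kappa = 22/442 \approx 0.050$.
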

3)  **Intra-Class Correlation (ICC)**:
	- It is commonly used to quantify the degree to which individuals with a fixed degree of relatedness (e.g., full siblings) resemble each other in terms of a quantitative trait ([[heritability]]).
	- It describes how strongly units in the same group resemble each other.
	- It is useful when ratings are numerical (continuous) rather than categorical.
	- Example: In a psychological study, multiple raters might score the same set of participants on a psychological scale. ICC would measure the consistency of these scores.
	- Underlying random-effects model:
	$$
	{\displaystyle Y_{ij}=\mu +\alpha _{j}+\varepsilon _{ij},}
	$$
		where, <br>
		$Y_{ij}$ = the $i^{th}$ observation in the $j^{th}$ group <br>
		$μ$ = an unobserved overall mean <br>
		$α_{j}$ = an unobserved random effect shared by all values in group $j$ <br>
		$ε_{ij}$ = an unobserved noise term

		**Note**: For the model to be identified, the $α_{j}$ and $ε_{ij}$ are assumed to have expected value zero and to be uncorrelated with each other. Also, the $α_{j}$ are assumed to be identically distributed, and the $ε_{ij}$ are assumed to be identically distributed. The variance of $α_{j}$ is denoted $σ_{α}^2$ and the variance of $ε_{ij}$ is denoted $σ_{ε}^2$.
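
Under this model, the population ICC is the share of the total variance that comes from the shared group effect:

$$
\text{ICC} = \frac{\sigma_{\alpha}^{2}}{\sigma_{\alpha}^{2} + \sigma_{\varepsilon}^{2}}
$$

It approaches 1 when the noise variance $σ_{ε}^2$ is small relative to the between-group variance $σ_{α}^2$, and approaches 0 when noise dominates.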

#### Example Data:

Let's assume we have continuous ratings by 3 raters for 10 proposals. So let's calculate the intra-class correlation!

Steps:

- Create a (dummy) dataset in long format, where each row holds one rating of one proposal by one rater.
- Compute the intra-class correlation using the `pingouin` library.

In [51]:
import pandas as pd
from pingouin import intraclass_corr

In [59]:
# Creating a dataset
data = {
    'proposal': np.tile(np.arange(10), 3),
    'rater': np.repeat(['Rater1', 'Rater2', 'Rater3'], 10),
    'rating': [4.2, 3.8, 5.0, 4.5, 4.1, 4.7, 3.9, 4.3, 4.4, 4.8,
               4.1, 3.7, 4.8, 4.6, 4.0, 4.8, 3.8, 4.2, 4.3, 4.7,
               4.3, 3.9, 4.9, 4.4, 4.2, 4.6, 4.0, 4.4, 4.5, 4.9]
}

df = pd.DataFrame(data)

In [65]:
# Print the data
df.head()

|   | proposal | rater  | rating |
|---|----------|--------|--------|
| 0 | 0        | Rater1 | 4.2    |
| 1 | 1        | Rater1 | 3.8    |
| 2 | 2        | Rater1 | 5.0    |
| 3 | 3        | Rater1 | 4.5    |
| 4 | 4        | Rater1 | 4.1    |


As the table above shows, the data are in long format: each row holds a single rating, identified by its `proposal` and `rater` columns.

In [62]:
# Calculate ICC
icc = intraclass_corr(data=df, targets='proposal', raters='rater', ratings='rating')
icc_df = pd.DataFrame(icc)

In [63]:
icc_df

|   | Type  | Description             | ICC      | F         | df1 | df2 | pval         | CI95%        |
|---|-------|-------------------------|----------|-----------|-----|-----|--------------|--------------|
| 0 | ICC1  | Single raters absolute  | 0.930982 | 41.466667 | 9   | 20  | 5.374008e-11 | [0.82, 0.98] |
| 1 | ICC2  | Single random raters    | 0.93135  | 54.086957 | 9   | 18  | 3.370376e-11 | [0.8, 0.98]  |
| 2 | ICC3  | Single fixed raters     | 0.946512 | 54.086957 | 9   | 18  | 3.370376e-11 | [0.85, 0.99] |
| 3 | ICC1k | Average raters absolute | 0.975884 | 41.466667 | 9   | 20  | 5.374008e-11 | [0.93, 0.99] |
| 4 | ICC2k | Average random raters   | 0.976019 | 54.086957 | 9   | 18  | 3.370376e-11 | [0.92, 0.99] |
| 5 | ICC3k | Average fixed raters    | 0.981511 | 54.086957 | 9   | 18  | 3.370376e-11 | [0.95, 1.0]  |
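
As a sanity check on the ICC1 row, the one-way estimate can be reproduced from the between-proposal and within-proposal mean squares. This is a minimal sketch in plain Python using the standard one-way ANOVA estimator; the helper name `icc1` is hypothetical, and the sketch is not a description of how `pingouin` computes the value internally.

```python
def icc1(ratings_by_target):
    """One-way random-effects ICC(1) for a list of per-target rating lists."""
    n = len(ratings_by_target)      # number of targets (proposals)
    k = len(ratings_by_target[0])   # ratings per target (assumed equal)
    grand_mean = sum(sum(r) for r in ratings_by_target) / (n * k)
    target_means = [sum(r) / k for r in ratings_by_target]
    # Between-target and within-target sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in target_means)
    ss_within = sum(
        (x - m) ** 2
        for r, m in zip(ratings_by_target, target_means)
        for x in r
    )
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Same ratings as the example dataset, one list per rater.
rater1 = [4.2, 3.8, 5.0, 4.5, 4.1, 4.7, 3.9, 4.3, 4.4, 4.8]
rater2 = [4.1, 3.7, 4.8, 4.6, 4.0, 4.8, 3.8, 4.2, 4.3, 4.7]
rater3 = [4.3, 3.9, 4.9, 4.4, 4.2, 4.6, 4.0, 4.4, 4.5, 4.9]
ratings_by_target = [list(t) for t in zip(rater1, rater2, rater3)]

icc1_value = icc1(ratings_by_target)
print(f"ICC1: {icc1_value:.4f}")
```

The ratio of the two mean squares is the F statistic reported in the same row (about 41.47), and the resulting estimate agrees with the ICC1 value in the table.
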
