### Reliability Score in Statistics

In statistics, **reliability** refers to the **consistency or repeatability** of a measurement or test. A **Reliability Score** quantifies how much you can trust that a given measurement or test will produce the **same result under consistent conditions**.

---


> "If I repeat the test under the same conditions, will I get a similar result?"

---

###  Common Reliability Scores:

* **Cronbach’s Alpha**: For internal consistency (e.g. surveys).
* **Intraclass Correlation Coefficient (ICC)**: For ratings across observers or sessions.
* **Test-Retest Reliability**: Correlation of scores over time.
* **Split-Half Reliability**: Split test into two parts, compare consistency.

---

###  Example: Measuring Basketball Free Throw Skill

Let’s say you want to measure how good a basketball player is at free throws.

#### Step 1: You design a test

* Player takes **10 shots** per session.
* You do this **for 5 days**.

#### Step 2: Record scores over 5 days

| Day | Free Throws Made (out of 10) |
| --- | ---------------------------- |
| 1   | 7                            |
| 2   | 6                            |
| 3   | 7                            |
| 4   | 6                            |
| 5   | 7                            |

#### Step 3: Analyze for **Test-Retest Reliability**

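Test-retest reliability is a correlation, so it needs more than one player: you run the same test on several players in at least two sessions and correlate their scores across sessions. For the single player above, the small day-to-day spread (6 or 7 made every day) already points toward consistency. If the correlation across sessions is high (say **r = 0.95**), this means: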

>  The player's performance is consistent → **High Reliability Score**

If the scores had bounced around like this:

| Day | Free Throws Made |
| --- | ---------------- |
| 1   | 3                |
| 2   | 8                |
| 3   | 2                |
| 4   | 9                |
| 5   | 4                |

and the correlation across sessions was low (say **r = 0.35**), then:

>  The test has **low reliability** → Not a stable skill measurement.
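
As a rough illustration of the difference between the two tables, the sketch below compares the day-to-day spread of each series. This is only a spread comparison for one player, not a reliability coefficient; a test-retest correlation would need scores from several players. The variable names are illustrative.

```python
import numpy as np

# Scores copied from the two tables above (free throws made out of 10).
stable = np.array([7, 6, 7, 6, 7])
erratic = np.array([3, 8, 2, 9, 4])

# Sample standard deviation (ddof=1) as a rough measure of day-to-day spread.
stable_sd = stable.std(ddof=1)
erratic_sd = erratic.std(ddof=1)

print(f"Stable series SD:  {stable_sd:.2f}")
print(f"Erratic series SD: {erratic_sd:.2f}")
```

The second series varies far more from day to day, which is what an unreliable measurement looks like for a skill that should be stable over a week.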

---

###  Why It Matters

If your test is unreliable, your conclusions (about player skill, patient condition, student ability, etc.) may be flawed or random.

---

###  Summary

| Term              | Meaning                                                                |
| ----------------- | ---------------------------------------------------------------------- |
| Reliability Score | A number showing **how consistent** a test or measurement is           |
| Range             | 0 (not reliable) → 1 (perfectly reliable)                              |
| Interpretation    | High score = repeatable results under same conditions                  |
| Example           | Free throw success over time → stable scores = reliable skill estimate |

## **Four Common Types of Reliability Scores** 

###  1. **Cronbach’s Alpha (Internal Consistency Reliability)**

What it Measures:

* **How well the items in a test measure the same underlying concept** (e.g. depression questionnaire).
* Used for surveys or psychological tests with multiple questions targeting the same construct.

###  Equation:

$$
\alpha = \frac{K}{K - 1} \left(1 - \frac{\sum_{i=1}^{K} \sigma_i^2}{\sigma_T^2}\right)
$$

Where:

* $K$: number of items
* $\sigma_i^2$: variance of each item
* $\sigma_T^2$: variance of total score (sum of items)

---

### Example 1: Questionnaire

A questionnaire has 4 questions (each scored from 1 to 5). 5 people answer them.

| Person | Q1 | Q2 | Q3 | Q4 |
| ------ | -- | -- | -- | -- |
| A      | 4  | 4  | 5  | 5  |
| B      | 3  | 3  | 4  | 4  |
| C      | 2  | 2  | 3  | 3  |
| D      | 4  | 3  | 4  | 5  |
| E      | 1  | 2  | 2  | 2  |

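* Calculate the sample variance of each question and of the total score.
* Plug them into the formula. For this data the item variances sum to 5.4 and the total-score variance is 20, so $\alpha = \frac{4}{3}\left(1 - \frac{5.4}{20}\right) \approx 0.97$.

**Interpretation:** High internal consistency; the questions appear to measure the same concept.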
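
The calculation for this table can be sketched with NumPy as follows. It applies the variance form of the formula directly to the five responses; sample variances (`ddof=1`) are an assumption here, although the choice does not change alpha because it scales numerator and denominator equally. For this particular data the result is about 0.97.

```python
import numpy as np

# Rows are persons A-E, columns are Q1-Q4 (values from the table above).
scores = np.array([
    [4, 4, 5, 5],
    [3, 3, 4, 4],
    [2, 2, 3, 3],
    [4, 3, 4, 5],
    [1, 2, 2, 2],
])

k = scores.shape[1]                          # number of items
item_var = scores.var(axis=0, ddof=1)        # variance of each question
total_var = scores.sum(axis=1).var(ddof=1)   # variance of each person's total

alpha = k / (k - 1) * (1 - item_var.sum() / total_var)
print(f"Item variances: {item_var}")
print(f"Total variance: {total_var:.1f}")
print(f"Cronbach's alpha: {alpha:.3f}")
```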

---

### Example 2: Extroversion (Personality Traits) and Income

An example involving:

* A **latent variable**: Extroversion
* **Observable items**: 4 personality traits
* A **dependent variable**: Salary

You want to test the hypothesis:

> **"Extroverted people earn more money than introverted people."**

* **Salary** is easy to measure — you can ask directly.
* But **extroversion** is a **latent variable** — you can't measure it directly.

So, you build a **scale** to measure it.

---

### What is a Scale?

A **scale** is a group of related survey questions (called **items**) designed to collectively measure a **latent variable**.

Here, we measure **extroversion** using **4 items** from the Big Five personality traits:

| Item # | Question (Trait)          |
| ------ | ------------------------- |
| Q1     | I am outgoing             |
| Q2     | I am talkative            |
| Q3     | I am sociable             |
| Q4     | I enjoy social situations |

Responses range from **"Applies" (e.g. 5)** to **"Does not apply" (e.g. 1)**.

---

###  Step 1: Collect Data

Suppose 5 participants filled out the survey:

| Person | Q1 (Outgoing) | Q2 (Talkative) | Q3 (Sociable) | Q4 (Enjoying Social Situations) |
| ------ | ------------- | -------------- | ------------- | ------------------------------- |
| A      | 5             | 4              | 5             | 5                               |
| B      | 4             | 3              | 4             | 4                               |
| C      | 2             | 3              | 2             | 2                               |
| D      | 3             | 2              | 3             | 3                               |
| E      | 1             | 2              | 1             | 2                               |

---
### Step 2: What Does Cronbach’s Alpha Do?

Cronbach’s Alpha estimates the **internal consistency** of these 4 items.

It tells us **how well these items measure the same underlying construct** (extroversion).

---

###  Step 3: Cronbach’s Alpha Formula

$$
\alpha = \frac{N \cdot \bar{c}}{\bar{v} + (N - 1) \cdot \bar{c}}
$$

Where:

* $N$ = number of items (the same quantity written as $K$ in the variance form of the formula)
* $\bar{v}$ = average of item variances
* $\bar{c}$ = average of inter-item covariances
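
This covariance form is algebraically the same as the variance form given earlier, because the variance of the total score equals the sum of every entry in the item covariance matrix. The sketch below checks that numerically on the Step 1 responses; the variable names are illustrative.

```python
import numpy as np

# Step 1 responses: rows are persons A-E, columns are Q1-Q4.
items = np.array([
    [5, 4, 5, 5],
    [4, 3, 4, 4],
    [2, 3, 2, 2],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
])

n_items = items.shape[1]
cov = np.cov(items, rowvar=False)   # 4x4 sample covariance matrix (ddof=1)

v_bar = np.diag(cov).mean()                          # average item variance
c_bar = cov[~np.eye(n_items, dtype=bool)].mean()     # average inter-item covariance

# Covariance form.
alpha_cov = n_items * c_bar / (v_bar + (n_items - 1) * c_bar)

# Variance form: cov.sum() is the variance of the total score.
alpha_var = n_items / (n_items - 1) * (1 - np.diag(cov).sum() / cov.sum())

print(f"Covariance form: {alpha_cov:.3f}")
print(f"Variance form:   {alpha_var:.3f}")
```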

---

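In the video this example is based on, Cronbach's Alpha was calculated for the 4 items using the video's own survey data (not the five illustrative responses in Step 1, which give a much higher value):

*  **Alpha = 0.71** → **Acceptable internal consistency**

The analysis then removed one item at a time and recalculated alpha: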

| Removed Item             | New Alpha |
| ------------------------ | --------- |
| Q1: "Outgoing"           | 0.66      |
| Q2: "Talkative"          | 0.48      |
| Q4: "Enjoying Social..." | **0.79**  |

Removing Q4 improved the alpha, meaning **this item might not align well** with the others.
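
The "alpha if item deleted" procedure can be sketched as below. It runs on the five illustrative responses from Step 1, not on the video's data, so the numbers it prints will differ from the table above; the column names and the helper function are illustrative.

```python
import pandas as pd

def cronbach_alpha(frame: pd.DataFrame) -> float:
    """Variance form of Cronbach's alpha for a persons-by-items table."""
    k = frame.shape[1]
    item_variances = frame.var(ddof=1)
    total_variance = frame.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative Step 1 responses (persons A-E).
responses = pd.DataFrame({
    "Q1_Outgoing": [5, 4, 2, 3, 1],
    "Q2_Talkative": [4, 3, 3, 2, 2],
    "Q3_Sociable": [5, 4, 2, 3, 1],
    "Q4_Enjoy_Social": [5, 4, 2, 3, 2],
})

# Recompute alpha with each item left out in turn.
alpha_if_deleted = {
    col: cronbach_alpha(responses.drop(columns=col)) for col in responses.columns
}

for col, value in alpha_if_deleted.items():
    print(f"Without {col}: alpha = {value:.3f}")
```

An item whose removal raises alpha noticeably is a candidate for rewording or removal, although with only five respondents any such pattern is very unstable.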

---

###  Interpretation

| Cronbach’s Alpha Value | Interpretation |
| ---------------------- | -------------- |
| ≥ 0.9                  | Excellent      |
| 0.8 – 0.9              | Good           |
| 0.7 – 0.8              | Acceptable     |
| 0.6 – 0.7              | Questionable   |
| 0.5 – 0.6              | Poor           |
| < 0.5                  | Unacceptable   |

So, **0.71** means the scale is **acceptable**, but there’s room to improve — possibly by refining or removing one question.

---

### Summary

* Cronbach’s Alpha measures **how closely related a set of items are**.
* It helps assess the **reliability of a scale** that aims to measure a **latent variable**.
* A low alpha could mean your questions are not measuring the same thing.
* In the video, removing one item increased alpha — showing how important good item design is.

---





```python
import pandas as pd
import numpy as np

# Step 1: Create the data
data = {
    "Q1_Outgoing": [5, 4, 2, 3, 1],
    "Q2_Talkative": [4, 3, 3, 2, 2],
    "Q3_Sociable": [5, 4, 2, 3, 1],
    "Q4_Enjoy_Social": [5, 4, 2, 3, 2],
}

df = pd.DataFrame(data)

# Step 2: Cronbach's Alpha calculation
def cronbach_alpha(df):
    # Number of items
    k = df.shape[1]

    # Variance of each item
    item_variances = df.var(ddof=1)

    # Total score variance
    total_score = df.sum(axis=1)
    total_variance = total_score.var(ddof=1)

    # Cronbach's alpha formula
    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
    return alpha

# Compute and display result
alpha = cronbach_alpha(df)
print(f"Cronbach's Alpha: {alpha:.3f}")
```

```
Cronbach's Alpha: 0.954
```


##  2. **Test-Retest Reliability**

What it Measures:

* **Stability over time**. Give the same test to the same people twice → are results similar?

###  Equation:

Pearson correlation coefficient between the two sets of scores.

$$
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}
$$

Where:

* $X_i$: score at Time 1
* $Y_i$: score at Time 2

---

###  Example:

Test five students' math scores at two time points.

| Student | Test 1 | Test 2 |
| ------- | ------ | ------ |
| A       | 85     | 87     |
| B       | 78     | 80     |
| C       | 90     | 91     |
| D       | 70     | 72     |
| E       | 88     | 85     |

Correlation $r \approx 0.97$ → **Excellent reliability over time**
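
The correlation for this table can be checked with a short sketch. It computes Pearson's r once from the formula above and once with `np.corrcoef`; for these five students both give roughly 0.97.

```python
import numpy as np

# Scores from the table above (students A-E).
test_1 = np.array([85, 78, 90, 70, 88])
test_2 = np.array([87, 80, 91, 72, 85])

# Pearson's r written out from the formula.
dx = test_1 - test_1.mean()
dy = test_2 - test_2.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from NumPy's correlation matrix.
r_numpy = np.corrcoef(test_1, test_2)[0, 1]

print(f"r (formula): {r_manual:.3f}")
print(f"r (NumPy):   {r_numpy:.3f}")
```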

---

##  3. **Split-Half Reliability**


What it Measures:

* **Internal consistency**, by splitting the test into two halves (e.g. odd vs even items).
* Whether both halves give similar results.

###  Equation:

Use **Spearman-Brown formula**:

$$
r_{SB} = \frac{2r_{12}}{1 + r_{12}}
$$

Where:

* $r_{12}$: correlation between the two halves

---

###  Example:

A 10-question test is split into two sets (Q1–5 and Q6–10).

* Total score of Q1–5 and Q6–10 is computed for each student.
* Pearson correlation between these halves = 0.8
* Plug into formula:

$$
r_{SB} = \frac{2 \times 0.8}{1 + 0.8} = \frac{1.6}{1.8} \approx 0.89
$$

**Interpretation:** Test has good internal consistency.
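
The Spearman-Brown step is a one-line correction, sketched here as a small helper. The function name is illustrative; the input is the half-test correlation from the example.

```python
def spearman_brown(r_half: float) -> float:
    """Step a half-test correlation up to an estimate of full-test reliability."""
    return 2 * r_half / (1 + r_half)

r_sb = spearman_brown(0.8)
print(f"Split-half reliability: {r_sb:.2f}")
```

The correction is needed because each half has only half as many items as the full test, and shorter tests are less reliable; the raw correlation between halves would understate the reliability of the whole test.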

---



##  4. **Intraclass Correlation Coefficient (ICC)**


What it Measures:

* **Consistency among raters** or **repeated measures**.
* More general than Pearson correlation. Works when:

  * Multiple raters score the same thing (e.g. doctors rating X-rays)
  * Measurements on the same subject across conditions

### Equation:

There are many forms of ICC. One common form, the one-way random-effects ICC for a single rating with $k$ raters and $n$ subjects, is:

$$
\text{ICC} = \frac{MS_B - MS_W}{MS_B + (k - 1)MS_W}
$$

Where:

* $MS_B$: between-subjects mean square
* $MS_W$: within-subjects mean square
* $k$: number of raters

---

### Example:

Three doctors rate pain levels of 5 patients (scale 0–10):

| Patient | Doctor 1 | Doctor 2 | Doctor 3 |
| ------- | -------- | -------- | -------- |
| A       | 6        | 7        | 6        |
| B       | 8        | 9        | 8        |
| C       | 3        | 4        | 3        |
| D       | 5        | 5        | 6        |
| E       | 7        | 8        | 7        |

Applying the one-way ANOVA formula above to this table gives ICC ≈ 0.92 → **Excellent agreement among raters**
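
The one-way ANOVA calculation for this table can be sketched as follows. It computes the between-subjects and within-subjects mean squares by hand and plugs them into the formula above; other ICC forms (for example two-way models that separate out rater effects) would give somewhat different values on the same data. Variable names are illustrative.

```python
import numpy as np

# Rows are patients A-E, columns are Doctors 1-3 (values from the table above).
ratings = np.array([
    [6, 7, 6],
    [8, 9, 8],
    [3, 4, 3],
    [5, 5, 6],
    [7, 8, 7],
], dtype=float)

n, k = ratings.shape                 # n subjects, k raters
grand_mean = ratings.mean()
subject_means = ratings.mean(axis=1)

# Between-subjects mean square: spread of patient means around the grand mean.
ss_between = k * ((subject_means - grand_mean) ** 2).sum()
ms_between = ss_between / (n - 1)

# Within-subjects mean square: spread of ratings around each patient's mean.
ss_within = ((ratings - subject_means[:, None]) ** 2).sum()
ms_within = ss_within / (n * (k - 1))

icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"MS_B = {ms_between:.2f}, MS_W = {ms_within:.3f}, ICC = {icc:.3f}")
```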

---

##  Summary Table

| Method               | Measures                      | Use Case                        | Typical Range |
| -------------------- | ----------------------------- | ------------------------------- | ------- |
| **Cronbach’s Alpha** | Internal consistency          | Survey/questionnaire            | 0 to 1  |
| **Test-Retest**      | Stability over time           | Skill measurement, scores       | -1 to 1 |
| **Split-Half**       | Internal consistency          | Exams, questionnaires           | 0 to 1  |
| **ICC**              | Rater/measurement consistency | Medical ratings, repeated tests | 0 to 1  |