###  What Is a Percentile?

A **percentile** is a measure used in statistics to understand the **relative standing** of a value within a dataset.

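> The **Pth percentile** is a value below which **P%** of the data falls. (Some conventions count values "at or below" instead; the difference matters mainly for small datasets.)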

* The **50th percentile** is the **median** — half of the values are below it.
* The **25th percentile** is the **first quartile (Q1)**.
* The **75th percentile** is the **third quartile (Q3)**.
* The **100th percentile** is the maximum value.

---

###  Simple Intuition

If you're in the **90th percentile** on an exam, you scored better than **90% of the students**.

---

###  Example 1: Small Dataset

Let’s say we have these **sorted test scores** of 10 students:

```
[45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
```

#### 1. What is the **50th percentile** (Median)?

* 10 values → median = average of 5th and 6th:
  $(65 + 70)/2 = 67.5$

#### 2. What is the **25th percentile** (Q1)?

* 25% of 10 = 2.5 → round up to the 3rd value (the nearest-rank rule):
  **Q1 ≈ 55**

#### 3. What is the **90th percentile**?

* 90% of 10 = 9 → 9th value is **85**

So:

* **25th percentile (Q1) ≈ 55**
* **50th percentile (Median) = 67.5**
* **75th percentile (Q3) ≈ 80** (75% of 10 = 7.5 → round up to the 8th value)
* **90th percentile ≈ 85**
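
The rounding rule used in this example is often called the nearest-rank method. A minimal Python sketch of it follows. The function name `nearest_rank_percentile` is made up for this illustration, and other percentile conventions (such as interpolation) can return different values on small datasets. The median is computed separately because it averages the two middle values instead of rounding up.

```python
import math
import statistics

def nearest_rank_percentile(sorted_values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank rule."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    # Rank is 1-based: round p% of n up to the next whole position.
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]

scores = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90]

q1 = nearest_rank_percentile(scores, 25)   # 2.5 rounds up to the 3rd value
q3 = nearest_rank_percentile(scores, 75)   # 7.5 rounds up to the 8th value
p90 = nearest_rank_percentile(scores, 90)  # exactly the 9th value
median = statistics.median(scores)         # averages the 5th and 6th values

print(q1, median, q3, p90)
```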

---

###  Example 2: Heights of 5 students (in cm)

```
[150, 160, 165, 170, 180]
```

#### 1. 50th percentile:

* 3rd value (middle) = **165**

#### 2. 25th percentile:

* 25% of 5 = 1.25 → round up to 2nd value = **160**

#### 3. 75th percentile:

* 75% of 5 = 3.75 → round up to 4th value = **170**

---

###  Notes

Percentiles help in:

* **Understanding distribution**
* **Identifying outliers**
* **Comparing scores or performance**

---



## **Percentile and Percentage**

## 1. **Definition**

| Term           | Definition                                                                         |
| -------------- | ---------------------------------------------------------------------------------- |
| **Percentage** | A **part of 100**; used to express proportions, e.g., “45% of people passed.”      |
| **Percentile** | A **position** in a sorted dataset below which a certain percentage of data falls. |

---

##  2. **Key Difference**

* **Percentage** answers:
  → "What **portion** of the total is this?"

* **Percentile** answers:
  → "Where does this value **stand** relative to others?"

---

##  3. Example: Test Scores

Let’s say 100 students take a math test.

###  Percentage

* You scored **85 out of 100**
  → That’s **85%**, a direct measure of how much of the test you got right.

###  Percentile

* If you scored **85** and **90 students scored lower** than you
  → You are in the **90th percentile**
  (you did better than 90% of students, even though your raw score is 85%).

So:

* **Percentage**: "You got 85% on the test."
* **Percentile**: "You are in the 90th percentile (better than 90 students)."

---

##  4. Another Example

### Heights of 10 people (in cm):

```
[150, 152, 155, 158, 160, 162, 165, 168, 170, 180]
```

* If your height is **160 cm**:

  * **Percentage of maximum** = 160 / 180 × 100 ≈ **88.9%**
  * **Percentile** = You are the 5th of 10 sorted values → 5/10 = **50th percentile** (counting values at or below yours)

So:

* **88.9%** → is a **percentage** of max height.
* **50th percentile** → means 50% of people are at or below your height (4 of the 10 are strictly shorter).
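
The two quantities can be computed side by side. The sketch below is only an illustration (the variable names are invented here). It reports both the "strictly below" and the "at or below" counts, since percentile conventions differ on which one to use.

```python
heights = [150, 152, 155, 158, 160, 162, 165, 168, 170, 180]
my_height = 160

# Percentage: a ratio of one value to a reference value (here, the maximum).
pct_of_max = 100 * my_height / max(heights)

# Percentile rank: a position among the other values.
n_below = sum(h < my_height for h in heights)
n_at_or_below = sum(h <= my_height for h in heights)
rank_strict = 100 * n_below / len(heights)           # share strictly shorter
rank_inclusive = 100 * n_at_or_below / len(heights)  # share at or below

print(f"{pct_of_max:.1f}% of max height")
print(f"{rank_strict:.0f}% strictly shorter, {rank_inclusive:.0f}% at or below")
```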

---

##  Summary

| Feature     | **Percentage**                 | **Percentile**                                  |
| ----------- | ------------------------------ | ----------------------------------------------- |
| Type        | Proportion or ratio            | Rank or position                                |
| Use Case    | Grades, discounts, probability | Scores, health stats, test ranking              |
| Example     | "You got 80% correct"          | "You’re in the 80th percentile"                 |
| Computation | (Part / Whole) × 100           | Position in sorted data relative to total count |




## Percentiles and the Normal Distribution

* A **percentile** tells you where a value lies relative to the rest of the data.
* A **normal distribution** is a bell-shaped curve where:

  * The **mean (μ)** is the center.
  * The **standard deviation (σ)** tells you how spread out the data is.
* In a **standard normal distribution**, we can **map percentiles to σ values**.

---

##  Key Fact: Percentiles ↔ Z-scores (σ)

In a standard normal distribution:

* The **Z-score** is the number of standard deviations a value is from the mean.
* Every **Z-score corresponds to a percentile**.

Here’s how they map:

| Z-score (σ) | Percentile (approx) | Meaning           |
| ----------- | ------------------- | ----------------- |
| -3σ         | 0.13%               | Way below average |
| -2σ         | 2.3%                | Bottom 2.3%       |
| -1σ         | 15.9%               | Below average     |
| **0σ**      | **50%**             | **Mean/Median**   |
| +1σ         | 84.1%               | Above average     |
| +2σ         | 97.7%               | Only 2.3% above   |
| +3σ         | 99.87%              | Very high value   |

> **Example**: If your test score is **+1σ**, you are at the **84th percentile** — better than 84% of people.
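
The table can be reproduced with the cumulative distribution function (CDF) of the standard normal. This sketch uses `statistics.NormalDist` from Python's standard library (available from Python 3.8), chosen here only to avoid third-party dependencies.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Percentile = area under the curve to the left of z, expressed in percent.
percentiles = {z: standard_normal.cdf(z) * 100 for z in (-3, -2, -1, 0, 1, 2, 3)}

for z, percentile in percentiles.items():
    print(f"{z:+d}σ -> {percentile:.2f}%")
```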

---

## 📊 Visual Explanation

In a normal distribution:

```
               -3σ   -2σ   -1σ    0σ   +1σ   +2σ   +3σ
                |     |     |     |     |     |     |
  Percentiles: 0.1%  2.3%  16%   50%   84%   98%  99.9%
```

Most of the data (~68%) lies between **-1σ and +1σ**.

---

##  Concrete Example

Let’s say the heights of adult men follow a **normal distribution**:

* Mean (μ) = 175 cm
* Standard deviation (σ) = 7 cm

**Q: What is the height at the 84th percentile?**

* 84th percentile ≈ **+1σ**
* So:
  **Height = μ + 1σ = 175 + 7 = 182 cm**

→ A man who is 182 cm tall is taller than about **84%** of men.
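
The same lookup works in both directions. As a hedged sketch, again using the standard-library `statistics.NormalDist` with the μ and σ from this example: the inverse CDF turns a percentile into a height, and the CDF turns a height back into a percentile. The exact result for the 84th percentile is slightly under 182 cm because +1σ corresponds to about the 84.1st percentile, not exactly the 84th.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=7)  # parameters from the example

height_at_84th = heights.inv_cdf(0.84)      # percentile -> value
percentile_of_182 = heights.cdf(182) * 100  # value -> percentile

print(f"84th percentile height: {height_at_84th:.2f} cm")
print(f"182 cm is at the {percentile_of_182:.2f}th percentile")
```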

---

##  Why This Matters

* **Percentiles** are useful when you want **relative performance**: e.g., “top 10%”.
* **Z-scores and σ** help you standardize and **compare across distributions**.
* This is critical in **standardized testing**, **machine learning**, **outlier detection**, etc.

---

##  What Does "20th Percentile" Really Mean?

Whether your data follows a **normal distribution** or **any other shape**, the **20th percentile** always means:

> **20% of the data lies below this value**.

But how we **find** that value depends on the **distribution** of the data.

---

##  Two Ways to Think About Percentiles

| Type of Data                               | How 20th Percentile Is Found                                                                                |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------------------- |
| **Raw data (empirical)**                   | Sort the data and find the value below which 20% of values lie.                                             |
| **Probability distribution (like normal)** | Compute the value **x** where the **area under the curve** from left to **x** = 20% (i.e., integral = 0.2). |

So yes — in **distributions** (continuous case), percentiles correspond to the **area under the curve** (integral of the PDF up to that point).
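
Written out, with $f$ the probability density function and $F$ the cumulative distribution function (notation introduced here for illustration), the $P$th percentile $x_P$ satisfies:

$$
F(x_P) = \int_{-\infty}^{x_P} f(t)\,dt = \frac{P}{100}
\quad\Longrightarrow\quad
x_P = F^{-1}\!\left(\frac{P}{100}\right)
$$

For the 20th percentile, $x_{20}$ is the point where the accumulated area reaches 0.2.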

---

##  EXAMPLE 1: Non-Normal (Empirical) Data

Say we have:

```
Data = [12, 15, 20, 25, 26, 27, 30, 35, 40, 100]
```

The data is already sorted.

To get the **20th percentile**:

* We use a formula like the following, which gives a zero-based index:
  `index = (P/100) × (n - 1)`
  `= 0.2 × (10 - 1) = 1.8`

* Index 1.8 falls between the 2nd value (index 1) and the 3rd value (index 2), so we interpolate:
  15 + 0.8 × (20 - 15) = **19**

**20th percentile ≈ 19**

This value comes from the sorted data itself, not from a fitted curve.
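
The interpolation steps can be sketched in a few lines of Python. The helper name `interpolated_percentile` is hypothetical. The rule it implements (a fractional zero-based index over `n - 1` gaps) is, as far as this sketch assumes, the same one NumPy's `numpy.percentile` applies by default, but other tools may use other conventions.

```python
import math

def interpolated_percentile(sorted_values, p):
    """Percentile by linear interpolation between neighboring sorted values."""
    if not 0 <= p <= 100:
        raise ValueError("p must be in [0, 100]")
    index = p / 100 * (len(sorted_values) - 1)  # fractional zero-based index
    lower = math.floor(index)
    upper = math.ceil(index)
    fraction = index - lower                    # how far past the lower neighbor
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

data = [12, 15, 20, 25, 26, 27, 30, 35, 40, 100]
p20 = interpolated_percentile(data, 20)
print(round(p20, 6))
```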

---

##  EXAMPLE 2: Normal Distribution (Theoretical Curve)

Let’s say:

* Mean (μ) = 0
* Std dev (σ) = 1 → **standard normal distribution**

To find the **20th percentile**:

* Find the **Z-score** such that the area under the curve from `-∞` to `z` is 0.2
* Use a statistical table or `scipy.stats.norm.ppf(0.2)` in Python

👉 `Z ≈ -0.84`

That means:

* The 20th percentile is **0.84 standard deviations below the mean**
* For any normal distribution:

  * **Value = μ + Z × σ**

Example:

* If μ = 100, σ = 15
  → 20th percentile = 100 + (-0.84) × 15 ≈ **87.4**

 Here, the percentile is computed as the **area under the bell curve up to that point**.
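
As a minimal sketch of this calculation, the standard-library `statistics.NormalDist` offers `inv_cdf`, which plays the same role as the `scipy.stats.norm.ppf` call mentioned in this section (the standard library is used here only so the example runs without extra installs). It also checks that scaling the Z-score by σ and shifting by μ agrees with asking the non-standard distribution directly.

```python
from statistics import NormalDist

# Standard normal (mu=0, sigma=1): Z-score with 20% of the area to its left.
z = NormalDist().inv_cdf(0.20)

mu, sigma = 100, 15
value_from_z = mu + z * sigma                       # Value = mu + Z * sigma
value_direct = NormalDist(mu, sigma).inv_cdf(0.20)  # same result, computed directly

print(f"z = {z:.4f}")
print(f"20th percentile = {value_from_z:.2f}")
```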

---

##  Key Takeaway

| Type                     | 20th Percentile Means                   | How It's Computed                              |
| ------------------------ | --------------------------------------- | ---------------------------------------------- |
| Empirical data           | 20% of sorted data is below this        | Sort & interpolate                             |
| Probability distribution | Area under PDF is 0.20 up to this point | Integrate PDF or use inverse CDF (e.g., `ppf`) |

---




### Interquartile Range (IQR) Explained

The **Interquartile Range (IQR)** is a measure of **statistical dispersion** — it tells us how spread out the **middle 50%** of values are in a dataset. It is resistant to outliers and gives a good idea of the variability in a distribution.

---

### **Definition:**

$$
\text{IQR} = Q_3 - Q_1
$$

* $Q_1$ (First Quartile) is the value below which 25% of the data lies.
* $Q_3$ (Third Quartile) is the value below which 75% of the data lies.

So, the **IQR** captures the range between the 25th percentile and 75th percentile — the middle half of the data.

---

### **Example:**

Suppose we have this dataset of 11 values (already sorted):

$$
\text{Data} = \{5, 7, 8, 9, 10, 12, 13, 14, 17, 18, 20\}
$$

#### Step 1: Find the Median (Q2)

There are 11 values → the median is the 6th value:

$$
Q_2 = 12
$$

#### Step 2: Find Q1 (Median of the lower half)

Lower half (before the median): {5, 7, 8, 9, 10} → median is the 3rd value:

$$
Q_1 = 8
$$

#### Step 3: Find Q3 (Median of the upper half)

Upper half (after the median): {13, 14, 17, 18, 20} → median is the 3rd value:

$$
Q_3 = 17
$$

#### Step 4: Calculate IQR

$$
\text{IQR} = Q_3 - Q_1 = 17 - 8 = 9
$$

---

### **Interpretation:**

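The middle 50% of the data lies between 8 and 17, with an interquartile range of **9 units**. Other quartile conventions (for example, the interpolation-based ones used by some software) can give slightly different values for the same data.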
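
The four steps of this example can be sketched as follows. The function name `quartiles_median_of_halves` is made up for this illustration, and it implements only the "median of each half" convention used here, where the overall median is left out of both halves when the count is odd.

```python
import statistics

def quartiles_median_of_halves(sorted_values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    n = len(sorted_values)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = sorted_values[:half]
    # With an odd count, skip the middle element so it belongs to neither half.
    upper = sorted_values[half + 1:] if n % 2 else sorted_values[half:]
    return (statistics.median(lower),
            statistics.median(sorted_values),
            statistics.median(upper))

data = [5, 7, 8, 9, 10, 12, 13, 14, 17, 18, 20]
q1, q2, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```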

---

### **Uses of IQR:**

* Identifying **outliers** (anything below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$)
* Understanding **spread** of central data without influence from extreme values
* Useful in **boxplots**
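
The outlier rule in the first bullet can be sketched like this. The helper `iqr_outliers` and the sample list are illustrative assumptions, and the quartiles come from the standard library's `statistics.quantiles` with `method="exclusive"`, which is one of several quartile conventions and may not match other tools exactly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Illustrative sample with one extreme value.
sample = [12, 15, 20, 25, 26, 27, 30, 35, 40, 100]
outliers = iqr_outliers(sample)
print(outliers)
```

Because the fences are built from the quartiles, the extreme value itself has little influence on where they fall, which is why the IQR rule is considered resistant to outliers.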