# Descriptive Statistics
## These functions summarize the central tendency, dispersion, and shape of a dataset.

### Mean
#### Calculates the average of the array elements.

In [1]:
import numpy as np

In [3]:
data = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(data))

Mean: 30.0


### Median
#### Finds the middle value in a sorted array.

In [13]:
data = [10, 15, 7, 9, 21]

median_value = np.median(data) # Calculate the median

print("Median:", median_value)

Median: 10.0
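
The example above has an odd number of values, so the median is a single middle element. As a small sketch (the `odd` and `even` lists are illustrative names, not part of the original notebook), this shows what `np.median` does when the count is even: it averages the two middle values of the sorted data.

```python
import numpy as np

odd = [10, 15, 7, 9, 21]        # sorted: 7, 9, 10, 15, 21
even = [10, 15, 7, 9, 21, 30]   # sorted: 7, 9, 10, 15, 21, 30

median_odd = np.median(odd)     # single middle value
median_even = np.median(even)   # mean of the two middle values, 10 and 15

print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
```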


### Percentile
#### A measure that tells what percentage of values fall below a certain value in a dataset.

> The **P-th percentile** is the value below which **P percent** of the data falls.

For example:

* The **25th percentile (Q1)** is the value below which 25% of the data lies.
* The **50th percentile (Q2)** is the **median** — 50% of the data is below this point.
* The **75th percentile (Q3)** is the value below which 75% of the data lies.

In [2]:
# Sample dataset: exam scores
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# Calculate specific percentiles
p25 = np.percentile(scores, 25)   # 25th percentile (Q1)
p50 = np.percentile(scores, 50)   # 50th percentile (Median / Q2)
p75 = np.percentile(scores, 75)   # 75th percentile (Q3)
p90 = np.percentile(scores, 90)   # 90th percentile

# Print the results
print("25th Percentile (Q1):", p25)
print("50th Percentile (Median):", p50)
print("75th Percentile (Q3):", p75)
print("90th Percentile:", p90)

25th Percentile (Q1): 66.25
50th Percentile (Median): 77.5
75th Percentile (Q3): 88.75
90th Percentile: 95.5
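
Values such as 66.25 do not appear in `scores` because `np.percentile` interpolates linearly between neighbouring sorted values by default. The helper below, `linear_percentile`, is a hypothetical name written for illustration only; it is a minimal sketch of that default rule, not NumPy's actual implementation.

```python
import numpy as np

def linear_percentile(values, p):
    """Percentile by linear interpolation between the two nearest sorted values."""
    ordered = np.sort(np.asarray(values, dtype=float))
    position = (p / 100) * (len(ordered) - 1)   # fractional index into the sorted data
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower                  # how far we are between the neighbours
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

for p in (25, 50, 75, 90):
    manual = linear_percentile(scores, p)
    builtin = np.percentile(scores, p)
    print(p, manual, builtin, np.isclose(manual, builtin))
```

For the 25th percentile the fractional index is `0.25 * 9 = 2.25`, so the result lies a quarter of the way from 65 to 70.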


### Mode
#### Finds the most frequent value in a dataset.

In [14]:
from scipy import stats

data = [1, 2, 2, 3, 3, 3, 4]
result = stats.mode(data, keepdims=False)  # keepdims=False returns the mode and count as scalars

print("Mode:", result.mode)
print("Count:", result.count)

Mode: 3
Count: 3
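
When several values tie for the highest frequency, a single reported mode can hide the others. The sketch below uses only NumPy to list every tied value; the dataset and the name `all_modes` are assumptions chosen for illustration.

```python
import numpy as np

data = [1, 1, 2, 2, 3]  # 1 and 2 both appear twice

# Count how often each distinct value occurs
values, counts = np.unique(data, return_counts=True)

# Keep every value whose count equals the maximum count
all_modes = values[counts == counts.max()]

print("All modes:", all_modes)
print("Frequency:", counts.max())
```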


### Standard Deviation
#### Measures the spread of the dataset

In [19]:
data = [10, 12, 23, 23, 16, 23, 21, 16]

std_dev = np.std(data)  # Calculate the standard deviation

print("Standard Deviation:", std_dev)

Standard Deviation: 4.898979485566356


#### `std_dev = np.std(data)`

* This line calculates the **standard deviation** of the `data` using NumPy’s `std()` function.
* By default, `np.std()` computes the **population standard deviation**, which uses:

$$
\sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }
$$

Where:

* $N$ is the total number of values (8 in this case),
* $\mu$ is the **mean** of the dataset,
* $x_i$ are the individual values.

### 🧮 Example Calculation:

#### 1. Compute the Mean:

$$
\mu = \frac{10 + 12 + 23 + 23 + 16 + 23 + 21 + 16}{8} = \frac{144}{8} = 18
$$

#### 2. Compute Squared Differences from the Mean:

$$
(10 - 18)^2 = 64\\
(12 - 18)^2 = 36\\
(23 - 18)^2 = 25\\
(23 - 18)^2 = 25\\
(16 - 18)^2 = 4\\
(23 - 18)^2 = 25\\
(21 - 18)^2 = 9\\
(16 - 18)^2 = 4
$$

#### 3. Sum of Squared Differences:

$$
64 + 36 + 25 + 25 + 4 + 25 + 9 + 4 = 192
$$

#### 4. Compute Variance (Population):

$$
\sigma^2 = \frac{192}{8} = 24
$$

#### 5. Standard Deviation:

$$
\sigma = \sqrt{24} \approx 4.898979
$$
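
The five steps above can be reproduced in code as a check. This is a sketch on the same data; the variable names are illustrative, and the last line uses NumPy's `ddof=1` argument to show the sample standard deviation (dividing by $N - 1$) for comparison.

```python
import numpy as np

data = np.array([10, 12, 23, 23, 16, 23, 21, 16])

mu = data.sum() / data.size                        # step 1: mean
squared_diffs = (data - mu) ** 2                   # step 2: squared differences
total = squared_diffs.sum()                        # step 3: sum of squared differences
population_var = total / data.size                 # step 4: population variance
population_std = np.sqrt(population_var)           # step 5: standard deviation

print("Mean:", mu)
print("Sum of squared differences:", total)
print("Population variance:", population_var)
print("Matches np.std:", np.isclose(population_std, np.std(data)))

# Sample standard deviation divides by N - 1 instead of N
sample_std = np.std(data, ddof=1)
print("Sample standard deviation:", sample_std)
```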

### Variance
#### The average squared deviation from the mean (the square of the standard deviation).

In [17]:
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = np.var(data) # Calculate the variance

print("Variance:", variance)

Variance: 4.0


### 🧠 Explanation:

* The **mean** of the data is `5.0`.
* Variance = average of squared differences from the mean:

  $$
  \frac{(2-5)^2 + (4-5)^2 + (4-5)^2 + \dots + (9-5)^2}{N}
  = 4.0
  $$
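
As a quick check of the formula above, the following sketch computes the variance by hand, confirms that it equals the squared standard deviation, and shows the sample variance via `ddof=1`. Variable names are illustrative.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()
manual_var = ((data - mean) ** 2).sum() / data.size  # population variance: divide by N

print("Manual variance:", manual_var)
print("Matches np.var:", np.isclose(manual_var, np.var(data)))
print("Equals std squared:", np.isclose(manual_var, np.std(data) ** 2))

# Sample variance divides by N - 1 instead of N
sample_var = np.var(data, ddof=1)
print("Sample variance:", sample_var)
```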

### Range
#### The difference between the maximum and minimum values.

In [20]:
data = [3, 7, 2, 9, 4]

range_value = np.ptp(data) # Calculate the range (peak-to-peak)

print("Peak-to-peak (range):", range_value)

Peak-to-peak (range): 7


### Sum
#### Computes the total sum of the array elements

In [24]:
data = np.array([3, 5, 7, 2])

total = np.sum(data) # Calculate the sum of all elements

print("Sum:", total)
print("Count: ", len(data))  # len() returns the number of elements in the array

Sum: 17
Count:  4


### Correlation Coefficient
#### Measures the linear relationship between two datasets

In [25]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
print("Correlation Coefficient:\n", np.corrcoef(x, y)) # computes the Pearson correlation coefficient matrix between x and y.

Correlation Coefficient:
 [[1. 1.]
 [1. 1.]]


### 📊 Pearson Correlation Coefficient

* Measures the **linear relationship** between two variables.
* Ranges from `-1` to `1`:

  * `1`: Perfect positive correlation
  * `0`: No correlation
  * `-1`: Perfect negative correlation

* This is a **2x2 matrix**:

  ```
  [[corr(x, x), corr(x, y)],
   [corr(y, x), corr(y, y)]]
  ```
* Because `x` and `y` are perfectly correlated (one is exactly twice the other), all values are `1`.

### ✅ Interpretation:

* `np.corrcoef(x, y)[0, 1]` or `np.corrcoef(x, y)[1, 0]` gives the actual **correlation between `x` and `y`**, which is `1.0` in this case.
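
To make the link between correlation and covariance concrete, the sketch below rebuilds the Pearson coefficient as the covariance divided by the product of the two standard deviations, and adds a perfectly negative case by negating `y`. The names `r_manual` and `r_negative` are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Sample covariance and sample standard deviations (both divide by n - 1)
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print("Manual r:", r_manual)
print("Matches np.corrcoef:", np.isclose(r_manual, r_numpy))

# Negating y flips the direction of the relationship
r_negative = np.corrcoef(x, -y)[0, 1]
print("r for x and -y:", r_negative)
```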


### Covariance
#### Measures how two datasets vary together.

In [27]:
# Two variables (e.g., height and weight of people)
x = [65, 66, 67, 68, 69]  # Height
y = [120, 121, 123, 126, 129]  # Weight

# Compute the covariance matrix
cov_matrix = np.cov(x, y)

print("Covariance matrix:\n", cov_matrix)

Covariance matrix:
 [[ 2.5   5.75]
 [ 5.75 13.7 ]]


### 🔍 Step-by-Step: What Is Covariance?

Covariance measures **how two variables change together**:

* If both increase together → covariance is **positive**.
* If one increases while the other decreases → covariance is **negative**.
* If they vary independently → covariance is **around 0**.

### 🧠 How to Interpret the Output:

The matrix returned by `np.cov()` is:

```
[[Var(x), Cov(x, y)],
 [Cov(y, x), Var(y)]]
```

* `Var(x)` = variance of x (height)
* `Var(y)` = variance of y (weight)
* `Cov(x, y)` = `Cov(y, x)` = covariance between height and weight

So in this case:

* Variance of `x` = 2.5
* Variance of `y` = 13.7
* Covariance between `x` and `y` = 5.75 (positive → they increase together)


#### 📐 Step 1: Find the means

```
mean_x = (65 + 66 + 67 + 68 + 69) / 5 = 67
mean_y = (120 + 121 + 123 + 126 + 129) / 5 = 123.8
```

---

#### 📐 Step 2: Compute each pair’s deviation from mean and multiply

| x  | y   | x−mean\_x | y−mean\_y | (x−mean\_x)(y−mean\_y) |
| -- | --- | --------- | --------- | ---------------------- |
| 65 | 120 | -2        | -3.8      | 7.6                    |
| 66 | 121 | -1        | -2.8      | 2.8                    |
| 67 | 123 | 0         | -0.8      | 0                      |
| 68 | 126 | 1         | 2.2       | 2.2                    |
| 69 | 129 | 2         | 5.2       | 10.4                   |

Sum of products:
`7.6 + 2.8 + 0 + 2.2 + 10.4 = 23`

---

#### 📐 Step 3: Divide by `n - 1` (Bessel's correction)

```
Cov(x, y) = sum / (n - 1) = 23 / 4 = 5.75
```
---
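
The three steps above can be sketched in code as follows. It uses the same height and weight data; the intermediate names (`dx`, `dy`, `products`) are illustrative.

```python
import numpy as np

x = np.array([65, 66, 67, 68, 69])       # height
y = np.array([120, 121, 123, 126, 129])  # weight

# Step 1: deviations from each mean
dx = x - x.mean()
dy = y - y.mean()

# Step 2: multiply the paired deviations and sum them
products = dx * dy
total = products.sum()

# Step 3: divide by n - 1 (Bessel's correction)
cov_xy = total / (len(x) - 1)
var_y = (dy ** 2).sum() / (len(y) - 1)

print("Sum of products:", total)
print("Cov(x, y):", cov_xy)
print("Var(y):", var_y)
print("Matches np.cov:", np.isclose(cov_xy, np.cov(x, y)[0, 1]))
```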

### ✅ Final Insight

* **Positive covariance (e.g., 5.75)** means:

  > As **x increases**, **y tends to increase** too.

  In our case: taller people tend to weigh more.

* If covariance were **negative**, that would mean: as one increases, the other tends to **decrease**.

### Skewness
#### Measures the asymmetry of the dataset.

In [30]:
from scipy.stats import skew

data = [1, 2, 2, 3, 4, 10]  # Positively skewed due to the high value 10

skewness = skew(data)  # Calculate skewness

print("Skewness:", skewness)

skewness = skew(data, bias=False)  # bias=False applies a sample-size bias correction to the estimate
print("Unbiased Skewness:", skewness)

Skewness: 1.418505623226429
Unbiased Skewness: 1.942368819472598


### 📊 Interpretation:

* **Skewness > 0**: Right-skewed (tail on the right)
* **Skewness < 0**: Left-skewed (tail on the left)
* **Skewness ≈ 0**: Symmetrical distribution (like a normal distribution)
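
The three cases above can be seen on small datasets. In this sketch the arrays are assumptions chosen for illustration: negating the right-skewed data mirrors it, which flips the sign of the skewness, and an evenly spaced array is symmetric.

```python
import numpy as np
from scipy.stats import skew

right_tailed = np.array([1, 2, 2, 3, 4, 10])  # long tail on the right
left_tailed = -right_tailed                   # mirrored: long tail on the left
symmetric = np.array([1, 2, 3, 4, 5])         # evenly spaced around the mean

skew_right = skew(right_tailed)
skew_left = skew(left_tailed)
skew_sym = skew(symmetric)

print("Right-tailed:", skew_right)
print("Left-tailed:", skew_left)
print("Symmetric:", skew_sym)
```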

### Kurtosis
#### Measures the "tailedness" of the dataset.

In [32]:
from scipy.stats import kurtosis

data = [2, 4, 4, 4, 5, 5, 7, 9]

kurt = kurtosis(data)  # Calculate kurtosis

print("Kurtosis:", kurt)

Kurtosis: -0.21875


### 🧠 What It Means:

* A **negative kurtosis** means the distribution is **platykurtic** — flatter than a normal distribution (lighter tails).
* A **positive kurtosis** means **leptokurtic** — heavier tails, more outliers.
* A kurtosis of **0** (when using `fisher=True`, the default) matches the tail weight of a **normal distribution** (mesokurtic).

### 🔧 Optional Parameters:

```python
kurtosis(data, fisher=True, bias=True, nan_policy='propagate')
```

* **fisher=True** (default): Returns **excess kurtosis** (normal dist = 0).

  * Set `fisher=False` for **regular kurtosis** (normal dist = 3).
* **bias=True** (default): Uses the biased estimator. Set `bias=False` to apply the correction for statistical bias.
* **nan\_policy='propagate'**: How to handle NaNs (`'omit'`, `'raise'`, `'propagate'`).
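
A short sketch of how these parameters change the result on the same data. The variable names are illustrative; the point is that `fisher=False` shifts the value by exactly 3, while `bias=False` applies a sample-size correction.

```python
from scipy.stats import kurtosis

data = [2, 4, 4, 4, 5, 5, 7, 9]

excess = kurtosis(data)                  # Fisher definition: normal distribution = 0
pearson = kurtosis(data, fisher=False)   # Pearson definition: normal distribution = 3
corrected = kurtosis(data, bias=False)   # excess kurtosis with bias correction

print("Excess kurtosis:", excess)
print("Pearson kurtosis:", pearson)
print("Difference:", pearson - excess)
print("Bias-corrected excess kurtosis:", corrected)
```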

### Working with Multidimensional Arrays

In [34]:
matrix = np.array([[10, 20, 30], [40, 50, 60]])

# Mean along columns
print("Column-wise Mean:", np.mean(matrix, axis=0))  # Output: [25. 35. 45.]

# Mean along rows
print("Row-wise Mean:", np.mean(matrix, axis=1)) # Output: [20. 50.]

Column-wise Mean: [25. 35. 45.]
Row-wise Mean: [20. 50.]
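
A common follow-up is to subtract these means from the matrix. As a sketch (the names `row_means` and `centered` are illustrative), passing `keepdims=True` keeps the reduced axis with length 1, so the row means broadcast cleanly against the original rows.

```python
import numpy as np

matrix = np.array([[10, 20, 30], [40, 50, 60]])

# keepdims=True gives shape (2, 1) instead of (2,), so it lines up with the rows
row_means = np.mean(matrix, axis=1, keepdims=True)
centered = matrix - row_means   # each row now has mean 0

print("Row means shape:", row_means.shape)
print("Row-centered matrix:\n", centered)
```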


### Practical Example

In [36]:
# Generate 100 random integers between 1 and 99
data = np.random.randint(1, 100, size=100)

# Descriptive statistics
print("Mean:", np.mean(data))                      # Calculate mean
print("Median:", np.median(data))                  # Calculate median
print("Standard Deviation:", np.std(data))         # Calculate standard deviation

# Percentiles
print("90th Percentile:", np.percentile(data, 90)) # Calculate 90th percentile

# Compare two variables
x = np.random.random(100) * 10                     # Generate 100 random float values between 0 and 10
y = x + np.random.normal(0, 1, 100)                # Add Gaussian noise to x to create y

# Correlation and covariance
print("Correlation Coefficient:\n", np.corrcoef(x, y)) # Correlation matrix between x and y
print("Covariance Matrix:\n", np.cov(x, y))            # Covariance matrix between x and y

# Data transformation (standardization)
data = np.random.random((10, 3))                      # Generate 10x3 array of random values
standardized = (data - np.mean(data, axis=0)) / np.std(data, axis=0) # Standardize each column

print("Original Data:\n", data)
print("Standardized Data:\n", standardized)

Mean: 48.84
Median: 50.5
Standard Deviation: 29.260116199359153
90th Percentile: 86.40000000000003
Correlation Coefficient:
 [[1.         0.93757185]
 [0.93757185 1.        ]]
Covariance Matrix:
 [[8.44634719 8.03553704]
 [8.03553704 8.69664541]]
Original Data:
 [[0.86288238 0.19619927 0.37434338]
 [0.27776956 0.9993055  0.96311429]
 [0.7282521  0.65568695 0.95532326]
 [0.26126675 0.15616517 0.87614895]
 [0.46255776 0.23470976 0.51695414]
 [0.05627009 0.52479825 0.48432393]
 [0.63925236 0.26413424 0.35641493]
 [0.42120411 0.1998025  0.2981779 ]
 [0.04818422 0.36447112 0.93200033]
 [0.15177201 0.09723376 0.49222742]]
Standardized Data:
 [[ 1.7581832  -0.64924841 -0.96681948]
 [-0.42161257  2.3638188   1.30503709]
 [ 1.25662778  1.07464217  1.27497429]
 [-0.4830926  -0.799447    0.96946889]
 [ 0.26680258 -0.50476604 -0.4165355 ]
 [-1.24679293  0.58357831 -0.54244384]
 [ 0.92506563 -0.39437226 -1.03599894]
 [ 0.11274254 -0.63572994 -1.26071483]
 [-1.27691627 -0.0179317   1.18497942]
 [-0.