In [1]:
import numpy as np
from scipy import stats

In [2]:
rng = np.random.default_rng(2022)
arr = rng.integers(low=0, high=10, size=10)
arr

array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# [Descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics)

## Center of Distribution

Mean and median both measure the "central tendency" of a data set: each gives an idea of a "typical" value. The mean is the more common choice, but the median is often preferred when the data are skewed or contain outliers, because it is less sensitive to extreme values.

### [Mean](https://en.wikipedia.org/wiki/Mean)

For a [data set](https://en.wikipedia.org/wiki/Data_set "Data set"), the _[arithmetic mean](https://en.wikipedia.org/wiki/Arithmetic_mean "Arithmetic mean")_, also known as arithmetic average, is a central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.

If the data set is based on a series of observations obtained by [sampling](https://en.wikipedia.org/wiki/Sampling_(statistics) "Sampling (statistics)") from a [statistical population](https://en.wikipedia.org/wiki/Statistical_population "Statistical population"), the arithmetic mean is called the _[sample mean](https://en.wikipedia.org/wiki/Sample_mean "Sample mean")_ (denoted $\bar{x}$) to distinguish it from the mean, or [expected value](https://en.wikipedia.org/wiki/Expected_value "Expected value"), of the underlying distribution, the _[population mean](https://en.wikipedia.org/wiki/Population_mean "Population mean")_ (denoted $\mu$ or $\mu_x$).

- Population mean: $\mu$ or $\mu_x$
- Sample mean: $\bar x$

In [3]:
mu = np.mean(arr)
mu

3.8
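
As a quick cross-check of the definition above, the mean can be computed directly as the sum divided by the count. This is only a sketch; the array literal repeats the values printed for `arr` at the top of the notebook, and `manual_mean` is a name introduced here for illustration.

```python
import numpy as np

# Same values as the `arr` shown at the top of the notebook
arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Arithmetic mean: sum of the values divided by the number of values
manual_mean = arr.sum() / arr.size

print(manual_mean)
print(np.isclose(manual_mean, np.mean(arr)))
```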

### [Median](https://en.wikipedia.org/wiki/Median)

In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics") and [probability theory](https://en.wikipedia.org/wiki/Probability_theory "Probability theory"), the **median** is the value separating the higher half from the lower half of a [data sample](https://en.wikipedia.org/wiki/Sample_(statistics) "Sample (statistics)"), a [population](https://en.wikipedia.org/wiki/Statistical_population "Statistical population"), or a [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution "Probability distribution"). For a [data set](https://en.wikipedia.org/wiki/Data_set "Data set"), it may be thought of as "the middle" value.

In [4]:
median = np.median(arr)
median

4.0
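
The "middle value" idea can be sketched by hand: sort the data, then take the single middle element for an odd count or the average of the two middle elements for an even count. `manual_median` is a hypothetical helper written for this illustration, not part of NumPy.

```python
import numpy as np

def manual_median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = np.sort(values)  # sorted copy, the input is not modified
    n = ordered.size
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

print(manual_median(arr))      # even count: averages the two middle values, 2 and 6
print(manual_median(arr[:9]))  # odd count: takes the single middle value
```

Both calls should agree with `np.median` on the same input.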

### [Mode](https://en.wikipedia.org/wiki/Mode_(statistics))

The **mode** is the value that appears most often in a set of data values.[[1]](https://en.wikipedia.org/wiki/Mode_(statistics)#cite_note-1) If **X** is a discrete random variable, the mode is the value _x_ (i.e., _**X**_ = _x_) at which the [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function "Probability mass function") takes its maximum value. In other words, it is the value that is most likely to be sampled.

In [5]:
mode = stats.mode(arr)[0]
mode

array([0])
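
A data set can have more than one mode when several values tie for the highest count. The following sketch, which uses only NumPy, counts each distinct value and keeps every value that reaches the maximum count; the variable names are chosen here for illustration.

```python
import numpy as np

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Count how often each distinct value occurs
values, counts = np.unique(arr, return_counts=True)

# Keep every value that reaches the highest count, so ties are all reported
modes = values[counts == counts.max()]
print(modes)
```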

---

## Data Patterns

### [Percentile](https://en.wikipedia.org/wiki/Percentile)

In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), a **_k_-th** **percentile** (**percentile score** or **centile**) is a [score](https://en.wikipedia.org/wiki/Raw_score "Raw score") _below which_ a given [percentage](https://en.wikipedia.org/wiki/Percentage "Percentage") _k_ of scores in its [frequency distribution](https://en.wikipedia.org/wiki/Frequency_distribution "Frequency distribution") falls (exclusive definition) or a score _at or below which_ a given percentage falls (inclusive definition). For example, the 50th percentile (the [median](https://en.wikipedia.org/wiki/Median "Median")) is the score below which (exclusive) or at or below which (inclusive) 50% of the scores in the distribution may be found. Percentiles are expressed in the same [unit of measurement](https://en.wikipedia.org/wiki/Unit_of_measurement "Unit of measurement") as the input scores; for example, if the scores refer to [human weight](https://en.wikipedia.org/wiki/Human_weight "Human weight"), the corresponding percentiles will be expressed in kilograms or pounds.

The 25th percentile is also known as the first [quartile](https://en.wikipedia.org/wiki/Quartile "Quartile") ($Q_1$), the 50th percentile as the [median](https://en.wikipedia.org/wiki/Median "Median") or second quartile ($Q_2$), and the 75th percentile as the third quartile ($Q_3$).

In [6]:
np.percentile(arr, 25, method="lower")

0

In [7]:
np.percentile(arr, 25, method="midpoint")

0.5

In [8]:
np.percentile(arr, 25, method="higher")

1

In [9]:
np.percentile(arr, 75, method="lower")

6

In [10]:
np.percentile(arr, 75, method="midpoint")

6.5

In [11]:
np.percentile(arr, 75, method="higher")

7
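
The `lower`, `midpoint`, and `higher` results above differ because the requested percentile falls between two sorted data points. The sketch below shows one way to compute the interpolated value by hand, assuming the rule `rank = q / 100 * (n - 1)`, which is how NumPy's default linear method is understood here. `percentile_linear` is a hypothetical helper, and its results are cross-checked against `np.percentile` rather than taken on trust.

```python
import numpy as np

def percentile_linear(values, q):
    """Percentile by linear interpolation between the two nearest sorted values."""
    ordered = np.sort(values)
    rank = q / 100 * (ordered.size - 1)  # fractional position in the sorted data
    lo = int(np.floor(rank))             # index used by the "lower" choice
    hi = int(np.ceil(rank))              # index used by the "higher" choice
    frac = rank - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

for q in (25, 50, 75):
    print(q, percentile_linear(arr, q), np.percentile(arr, q))
```

For the 25th percentile the rank is 2.25, so the value lies a quarter of the way between the sorted elements at indices 2 and 3.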

### [IQR](https://en.wikipedia.org/wiki/Interquartile_range)

In [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics "Descriptive statistics"), the **interquartile range** (**IQR**) is a measure of [statistical dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion "Statistical dispersion"), that is, the spread of the data or observations.[[1]](https://en.wikipedia.org/wiki/Interquartile_range#cite_note-:1-1) The IQR may also be called the **midspread**, **middle 50%**, or **H‑spread**. It is defined as the difference between the 75th and 25th [percentiles](https://en.wikipedia.org/wiki/Percentiles "Percentiles") of the data.[[2]](https://en.wikipedia.org/wiki/Interquartile_range#cite_note-Upton-2)[[3]](https://en.wikipedia.org/wiki/Interquartile_range#cite_note-ZK-3)[[4]](https://en.wikipedia.org/wiki/Interquartile_range#cite_note-4) To calculate the IQR, the data set is divided into [quartiles](https://en.wikipedia.org/wiki/Quartile "Quartile"), or four rank-ordered parts of equal size, via linear interpolation.[[1]](https://en.wikipedia.org/wiki/Interquartile_range#cite_note-:1-1) These quartiles are denoted by $Q_1$ (also called the lower quartile), $Q_2$ (the [median](https://en.wikipedia.org/wiki/Median "Median")), and $Q_3$ (also called the upper quartile). The lower quartile corresponds to the 25th percentile and the upper quartile corresponds to the 75th percentile, so $\mathrm{IQR} = Q_3 - Q_1$.

In [12]:
stats.iqr(arr)

6.5

In [13]:
stats.iqr(arr, interpolation="midpoint")

6.0
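
Since the IQR is just $Q_3 - Q_1$, it can be reproduced from two percentiles. A minimal sketch with NumPy's default (linear) interpolation, which should give the same value as the first `stats.iqr` call above; the variable names are illustrative.

```python
import numpy as np

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Lower and upper quartiles with the default linear interpolation
q1, q3 = np.percentile(arr, [25, 75])
iqr_linear = q3 - q1

print(q1, q3, iqr_linear)
```

The choice of interpolation matters for small samples: the same data give a different IQR when the quartiles are computed with the midpoint rule or as medians of the two halves.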

In [14]:
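# Median-of-halves quartiles: Q1, Q3, and IQR can differ from np.percentile and scipy.stats.iqr
def iqr(arr):
    arr = np.sort(arr)  # sorted copy, so the caller's array is not modified
    pivot = len(arr) // 2
    # For an odd length, leave the middle element out of both halves
    upper_start = pivot + 1 if len(arr) % 2 == 1 else pivot
    Q1, Q3 = np.median(arr[:pivot]), np.median(arr[upper_start:])
    IQR = Q3 - Q1
    return Q1, Q3, IQR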

iqr(arr)

(0.0, 7.0, 7.0)

## Data Variability

### [Variance](https://en.wikipedia.org/wiki/Variance)

In [probability theory](https://en.wikipedia.org/wiki/Probability_theory "Probability theory") and [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), **variance** is the [expectation](https://en.wikipedia.org/wiki/Expected_value "Expected value") of the squared [deviation](https://en.wikipedia.org/wiki/Deviation_(statistics) "Deviation (statistics)") of a [random variable](https://en.wikipedia.org/wiki/Random_variable "Random variable") from its [population mean](https://en.wikipedia.org/wiki/Population_mean "Population mean") or [sample mean](https://en.wikipedia.org/wiki/Sample_mean "Sample mean"). Variance is a measure of [dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion "Statistical dispersion"), meaning it is a measure of how far a set of numbers is spread out from their average value. Variance has a central role in statistics, where some ideas that use it include [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics "Descriptive statistics"), [statistical inference](https://en.wikipedia.org/wiki/Statistical_inference "Statistical inference"), [hypothesis testing](https://en.wikipedia.org/wiki/Hypothesis_testing "Hypothesis testing"), [goodness of fit](https://en.wikipedia.org/wiki/Goodness_of_fit "Goodness of fit"), and [Monte Carlo sampling](https://en.wikipedia.org/wiki/Monte_Carlo_method "Monte Carlo method"). Variance is an important tool in the sciences, where statistical analysis of data is common. 
The variance is the square of the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation "Standard deviation"), the second [central moment](https://en.wikipedia.org/wiki/Central_moment "Central moment") of a [distribution](https://en.wikipedia.org/wiki/Probability_distribution "Probability distribution"), and the [covariance](https://en.wikipedia.org/wiki/Covariance "Covariance") of the random variable with itself, and it is often represented by $\sigma^2$, $s^2$, $\operatorname{Var}(X)$, $V(X)$, or $\mathbb{V}(X)$.[[1]](https://en.wikipedia.org/wiki/Variance#cite_note-1)

- Population Variance: $\sigma^2$
- Sample Variance: $s^2$

In [15]:
# Population Variance
variance = np.var(arr)
variance

11.16

In [16]:
# Sample Variance
variance = np.var(arr, ddof=1)
variance

12.4
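
The only difference between the two results above is the divisor: the population variance divides the sum of squared deviations by $N$, while the sample variance divides by $N - 1$ (Bessel's correction, selected with `ddof=1`). A sketch of both computed by hand, with names chosen here for illustration:

```python
import numpy as np

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Squared deviations from the mean, summed
deviations = arr - arr.mean()
sum_sq = np.sum(deviations ** 2)

pop_var = sum_sq / arr.size            # divide by N
sample_var = sum_sq / (arr.size - 1)   # divide by N - 1

print(pop_var, np.isclose(pop_var, np.var(arr)))
print(sample_var, np.isclose(sample_var, np.var(arr, ddof=1)))
```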

### [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation)

In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), the **standard deviation** is a measure of the amount of variation or [dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion "Statistical dispersion") of a set of values.[[1]](https://en.wikipedia.org/wiki/Standard_deviation#cite_note-StatNotes-1) A low standard deviation indicates that the values tend to be close to the [mean](https://en.wikipedia.org/wiki/Mean "Mean") (also called the [expected value](https://en.wikipedia.org/wiki/Expected_value "Expected value")) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Standard deviation may be abbreviated **SD**, and is most commonly represented in mathematical texts and equations by the lower case [Greek letter](https://en.wikipedia.org/wiki/Greek_alphabet "Greek alphabet") sigma **[σ](https://en.wikipedia.org/wiki/Sigma "Sigma")**, for the population standard deviation, or the [Latin letter](https://en.wikipedia.org/wiki/Latin_alphabet "Latin alphabet") [s](https://en.wikipedia.org/wiki/S "S"), for the sample standard deviation.

- Population SD: $\sigma$
- Sample SD: $s$

In [17]:
# Population SD
sd = np.std(arr)
sd

3.340658617698013

In [18]:
# Sample SD
sd = np.std(arr, ddof=1)
sd

3.521363372331802
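
Because the standard deviation is the square root of the variance, it is expressed in the same units as the data. The following sketch checks that relationship for both the population and sample versions; the variable names are illustrative.

```python
import numpy as np

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Standard deviation as the square root of the matching variance
pop_sd = np.sqrt(np.var(arr))
sample_sd = np.sqrt(np.var(arr, ddof=1))

print(pop_sd, np.isclose(pop_sd, np.std(arr)))
print(sample_sd, np.isclose(sample_sd, np.std(arr, ddof=1)))
```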

### [Mean Absolute Deviation](https://en.wikipedia.org/wiki/Average_absolute_deviation)

The **average absolute deviation** (**AAD**) of a data set is the [average](https://en.wikipedia.org/wiki/Average "Average") of the [absolute](https://en.wikipedia.org/wiki/Absolute_value "Absolute value") [deviations](https://en.wikipedia.org/wiki/Deviation_(statistics) "Deviation (statistics)") from a [central point](https://en.wikipedia.org/wiki/Central_tendency "Central tendency"). It is a [summary statistic](https://en.wikipedia.org/wiki/Summary_statistics "Summary statistics") of [statistical dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion "Statistical dispersion") or variability. In the general form, the central point can be a [mean](https://en.wikipedia.org/wiki/Arithmetic_mean "Arithmetic mean"), [median](https://en.wikipedia.org/wiki/Median "Median"), [mode](https://en.wikipedia.org/wiki/Mode_(statistics) "Mode (statistics)"), or the result of any other measure of central tendency or any reference value related to the given data set. AAD includes the **mean absolute deviation** and the **median absolute deviation** (both abbreviated as **MAD**).

In [19]:
def mad(arr):
    # Mean absolute deviation around the mean
    mu = np.mean(arr)
    return sum(abs(x - mu) for x in arr) / len(arr)

mad(arr)

3.1999999999999997
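
The function above computes the mean absolute deviation around the mean. Since "MAD" can also refer to the median absolute deviation, here is a sketch of both side by side, written with vectorized NumPy operations; the variable names are introduced here for illustration.

```python
import numpy as np

arr = np.array([7, 2, 7, 0, 1, 6, 9, 0, 0, 6])

# Mean absolute deviation: average distance from the mean
mean_abs_dev = np.mean(np.abs(arr - np.mean(arr)))

# Median absolute deviation: median distance from the median
median_abs_dev = np.median(np.abs(arr - np.median(arr)))

print(mean_abs_dev, median_abs_dev)
```

The two statistics generally give different values for the same data, so it is worth stating which one is meant whenever "MAD" is reported.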

---

## [Outlier](https://en.wikipedia.org/wiki/Outlier)

In [statistics](https://en.wikipedia.org/wiki/Statistics "Statistics"), an **outlier** is a [data point](https://en.wikipedia.org/wiki/Data_point "Data point") that differs significantly from other observations.[[1]](https://en.wikipedia.org/wiki/Outlier#cite_note-1)[[2]](https://en.wikipedia.org/wiki/Outlier#cite_note-2) An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the [data set](https://en.wikipedia.org/wiki/Data_set "Data set").[[3]](https://en.wikipedia.org/wiki/Outlier#cite_note-3) An outlier can cause serious problems in statistical analyses.

**Tukey's fences**

Some methods flag observations based on measures such as the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range "Interquartile range"). For example, if $Q_1$ and $Q_3$ are the lower and upper [quartiles](https://en.wikipedia.org/wiki/Quartile "Quartile") respectively, then one could define an outlier to be any observation outside the range:

$\big[Q_1 - k(Q_3 - Q_1),\ Q_3 + k(Q_3 - Q_1)\big]$

for some nonnegative constant $k$. [John Tukey](https://en.wikipedia.org/wiki/John_Tukey "John Tukey") proposed this test, where $k = 1.5$ indicates an "outlier", and $k = 3$ indicates data that is "far out".[[16]](https://en.wikipedia.org/wiki/Outlier#cite_note-16)

In [20]:
def tukey_outlier(arr):
    # Returns Tukey's fences for k = 1.5; values outside them are flagged as outliers
    Q1, Q3, IQR = iqr(arr)
    lower_fence = Q1 - 1.5 * IQR
    upper_fence = Q3 + 1.5 * IQR
    return lower_fence, upper_fence

tukey_outlier(arr)

(-10.5, 17.5)
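
Every value in `arr` lies inside these fences, so nothing is flagged. To see the rule catch something, the sketch below applies the same idea to a small made-up sample containing one extreme value. The helper `tukey_outliers`, the `data` array, and the use of `np.percentile` for the quartiles (rather than the median-of-halves `iqr` function above) are all assumptions made for this illustration.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return the values that fall outside Tukey's fences for the given k."""
    q1, q3 = np.percentile(values, [25, 75])  # quartiles via linear interpolation
    spread = q3 - q1
    lower_fence = q1 - k * spread
    upper_fence = q3 + k * spread
    mask = (values < lower_fence) | (values > upper_fence)
    return values[mask]

# Hypothetical sample with one value far from the rest
data = np.array([2, 3, 3, 4, 4, 5, 5, 6, 30])

outliers = tukey_outliers(data)
print(outliers)
```

Here the quartiles are 3 and 5, so the fences are 0 and 8 and only the value 30 falls outside them.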

## Describe

In [21]:
def describe(arr):
    print("mean:", np.mean(arr))
    print("Population Variance:", np.var(arr))
    print("Sample Variance:", np.var(arr, ddof=1))
    print("Population SD:", np.std(arr))
    print("Sample SD:", np.std(arr, ddof=1))
    print("MAD:", mad(arr))
    print("median:", np.median(arr))
    print("IQR:", iqr(arr))
    print("outlier:", tukey_outlier(arr))
    
describe(arr)

mean: 3.8
Population Variance: 11.16
Sample Variance: 12.4
Population SD: 3.340658617698013
Sample SD: 3.521363372331802
MAD: 3.1999999999999997
median: 4.0
IQR: (0.0, 7.0, 7.0)
outlier: (-10.5, 17.5)
