# Central tendency

Central tendency refers to the central or typical value of a data set or distribution. In statistics, the three main measures of central tendency are the mean, the median, and the mode.
<br />

- Mean: The mean is the sum of all the values in a data set, divided by the number of values. It gives us the average value of the data.
It's given by:
$$ \mu = \frac{1}{n} \sum\limits_{i=1}^n x_i $$
    - Suitable for: Roughly normally distributed data
    - Suitable data types: Interval, ratio
<br />

- Median: The median is the middle value in a data set when the values are ordered from least to greatest (for an even number of values, it is the average of the two middle values). It is a robust measure of central tendency because extreme values have little effect on it.
<br />

- Mode: The mode is the most frequently occurring value in a data set. If there are multiple values with the same highest frequency, the data set is said to have multiple modes.
<br />
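
As a minimal sketch of the three measures above, the following computes each one for a small made-up sample. The `data` list is hypothetical and includes one extreme value to show how the mean reacts to it while the median does not. The mode is collected with `collections.Counter` so that ties (multiple modes) are all reported.

```python
import numpy as np
from collections import Counter

# Hypothetical sample with one extreme value (100)
data = [2, 3, 3, 4, 5, 100]

mean = np.mean(data)        # pulled upward by the extreme value
median = np.median(data)    # average of the two middle values, 3 and 4

# Mode: collect every value that reaches the highest count
counts = Counter(data)
top = max(counts.values())
modes = [value for value, count in counts.items() if count == top]

print(mean)    # 19.5
print(median)  # 3.5
print(modes)   # [3]
```

Dropping the value 100 moves the mean to 3.4 but leaves the median at 3, which is the sense in which the median is robust.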

In the context of Artificial Neural Networks (ANNs), central tendency is often used in pre-processing to standardize the input data, as well as in performance evaluation to summarize the outputs generated by the network.
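
One common way to standardize inputs is to subtract each feature's mean and divide by its standard deviation (a dispersion measure covered in the next section). The sketch below assumes this z-score approach and uses a made-up feature matrix `X`; it is an illustration, not necessarily the pre-processing used in any particular network.

```python
import numpy as np

# Hypothetical input features: 4 samples (rows) x 2 features (columns)
X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 260.0],
              [4.0, 320.0]])

# Per-feature mean and standard deviation (axis=0 works column by column)
feature_mean = X.mean(axis=0)
feature_std = X.std(axis=0)

# Standardize: each column ends up with mean ~0 and standard deviation ~1
X_standardized = (X - feature_mean) / feature_std

print(X_standardized.mean(axis=0))
print(X_standardized.std(axis=0))
```

After this step both features are on a comparable scale, even though the raw columns differ by two orders of magnitude.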

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

## Concept of Dispersion

- Dispersion refers to the extent to which a set of values or observations spreads out from its central value.
- In statistics, measures of dispersion quantify how far a data set spreads from its central tendency. Several common measures follow:
<br />


- Range: The range is the difference between the largest and smallest values in a data set.
<br />

- Standard Deviation: The standard deviation is the square root of the variance. It is a more interpretable measure of dispersion as it is in the same units as the original data.
  Given by:
  $$ \sigma = \sqrt{\frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \mu)^2} $$
<br />

- Variance: The variance is the average of the squared deviations from the mean. It is calculated by summing the squared differences between each value and the mean, then dividing the sum by the number of values minus one (the sample variance).
  Given by:

  $$ \sigma^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \mu)^2 $$
    - Suitable for: Any distribution
    - Suitable data types: Numerical, ordinal (but requires the mean)

- Variance indicates the dispersion around the average. Two different data sets can have the same variance.
<br />
 These measures of dispersion give us a sense of how much the data deviates from the central tendency, which can be useful in understanding the distribution of the data and making predictions.
<br /><br />
In the context of ANNs, dispersion is important for pre-processing the input data and for understanding the performance of the network.
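
To make the measures above concrete, here is a small sketch that computes the range, sample variance, and sample standard deviation for two made-up arrays. The arrays `a` and `b` are hypothetical; `b` is simply `a` shifted by 10, which illustrates how two different data sets can share the same variance.

```python
import numpy as np

# Two hypothetical data sets: b is a shifted by 10
a = np.array([1, 2, 3, 4, 5])
b = a + 10

for name, values in (("a", a), ("b", b)):
    value_range = values.max() - values.min()
    variance = np.var(values, ddof=1)   # sample variance (divides by n-1)
    std = np.std(values, ddof=1)        # square root of the sample variance
    print(name, values.mean(), value_range, variance, std)
```

Both rows report a range of 4 and a variance of 2.5, while the means (3.0 and 13.0) differ: dispersion describes spread, not location.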

### Why we square the (x - μ) term in variance
We want to add up the distances from the mean. Without squaring, the positive and negative deviations cancel and the sum is always zero.
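
A short derivation, using the definition of the mean given earlier, shows why the unsquared deviations always cancel:

$$ \sum\limits_{i=1}^n (x_i - \mu) = \sum\limits_{i=1}^n x_i - n\mu = n\mu - n\mu = 0 $$

For example, the values 1, 2, 3 have mean 2 and deviations -1, 0, 1, which sum to zero, while their squares 1, 0, 1 sum to 2.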

### Why not take the absolute value ("mean absolute deviation")?

#### This is because squaring:
- emphasizes large deviations
- is better for optimization (differentiable everywhere, whereas the absolute value has a kink at zero)
- is closer to Euclidean distance
- gives the second central "moment" of the distribution
- links more directly to least-squares regression
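
The first point, that squaring emphasizes large values, can be illustrated with a rough comparison. The two arrays and the helper names `mean_abs_dev` and `mean_sq_dev` below are hypothetical, chosen only for this sketch; the arrays are identical except for the last value.

```python
import numpy as np

# Hypothetical data: identical except for the last value
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 15.0])

def mean_abs_dev(values):
    """Average absolute deviation from the mean."""
    return np.mean(np.abs(values - values.mean()))

def mean_sq_dev(values):
    """Average squared deviation from the mean (variance with ddof=0)."""
    return np.mean((values - values.mean()) ** 2)

for name, values in (("base", base), ("with_outlier", with_outlier)):
    print(name, mean_abs_dev(values), mean_sq_dev(values))
```

Here the absolute measure goes from 1.2 to 4.0, while the squared measure goes from 2.0 to 26.0, so the single large value dominates the squared version far more strongly.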

In [None]:
# import libraries
import numpy as np

# create a list of numbers to compute the mean and variance of
x=[1,2,3,4,5,4,0]
n=len(x)

# compute mean

mean1=np.mean(x)
mean2=np.sum(x)/n

# print them
print(mean1)
print(mean2)

2.7142857142857144
2.7142857142857144


In [None]:
# variance

var1=np.var(x) # NumPy default is ddof=0: divides by n (population variance, a biased estimate for a sample)
var2=(1/(n-1))*np.sum((x-mean1)**2) # sample variance: divides by n-1, so it differs from var1

var3=(1/(n))*np.sum((x-mean1)**2) # divides by n, so it matches var1 up to floating-point rounding

print(var1)
print(var2)
print(var3)

2.7755102040816326
3.2380952380952377
2.775510204081632


#### To get the sample variance from NumPy, pass `ddof=1`

In [None]:
var1=np.var(x,ddof=1)
print(var1)
print(var2)

3.2380952380952377
3.2380952380952377
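
Why dividing by n-1 is called unbiased can be checked with a small simulation. The sketch below assumes many small samples drawn from a standard normal distribution (true variance 1); the sample size of 5, the number of samples, and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumed setup: 100,000 small samples (n=5) from a standard normal,
# whose true variance is 1
samples = rng.standard_normal(size=(100_000, 5))

# Variance of each sample (row), with and without the n-1 correction
biased = np.var(samples, axis=1, ddof=0)
unbiased = np.var(samples, axis=1, ddof=1)

print(biased.mean())    # close to (n-1)/n = 0.8
print(unbiased.mean())  # close to 1.0
```

On average the `ddof=0` estimate falls short of the true variance by a factor of (n-1)/n, while the `ddof=1` estimate centers on it.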


### The difference is negligible for large N

In [None]:
N=10000
x=np.random.randint(0,high=20,size=N)

var0=np.var(x,ddof=0) # default (biased variance)
var1=np.var(x,ddof=1) # unbiased variance

print(var0)
print(var1)

33.04489244
33.04819725972597
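
The reason the two values converge is that they differ only by the factor N/(N-1), so the relative gap is 1/(N-1). The following sketch checks this for a few sample sizes; the sizes and the seed are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

for N in (5, 100, 10_000):
    values = rng.integers(0, 20, size=N)
    var0 = np.var(values, ddof=0)  # divides by N
    var1 = np.var(values, ddof=1)  # divides by N-1
    # Relative gap between the two estimates; algebraically 1/(N-1)
    print(N, (var1 - var0) / var0, 1 / (N - 1))
```

For N=5 the gap is 25%, while for N=10,000 it is about 0.01%, which matches the small difference in the output above.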
