## Mean, median, mode

- *mean* - the average: the sum of the values divided by how many there are
- *median* - the middle value once the data is sorted
- *mode* - the value that occurs most often

A few examples of calculating these with NumPy:

In [4]:
import numpy as np
s = np.random.normal(100, 20, 50)

np.mean(s)
np.median(s)

100.40610892626047
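
The cell above covers the mean and the median. NumPy has no dedicated mode function, so here is a minimal sketch that counts occurrences with `np.unique`. The `ages` array is made-up illustration data, not part of the notes above.

```python
import numpy as np

# Hypothetical sample data, chosen so one value clearly repeats most often
ages = np.array([18, 21, 21, 25, 30, 21, 18])

# np.unique returns the sorted distinct values and how often each one occurs
values, counts = np.unique(ages, return_counts=True)

# The mode is the value with the highest count (the first one wins on ties)
mode = values[np.argmax(counts)]
print(mode)
```

This works best for discrete data. For continuous values such as the `np.random.normal` samples above, almost every value is unique, so the mode is not very meaningful.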

## Variance

> Variance tells how "spread-out" the data is

1. find the mean
2. subtract the mean from each value to get its difference from the mean
3. square each difference
4. find the average of the squared differences
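
The four steps above can be sketched by hand and compared with `np.var`. The `data` array is made up for illustration.

```python
import numpy as np

# Made-up data for illustration
data = np.array([1, 4, 5, 4, 8])

mean = data.mean()           # step 1: find the mean (4.4)
differences = data - mean    # step 2: difference of each value from the mean
squared = differences ** 2   # step 3: square each difference
variance = squared.mean()    # step 4: average of the squared differences

print(variance)
print(np.isclose(variance, np.var(data)))  # same result as NumPy's function
```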

In [6]:
mean = 100
std_dev = 10
size = 20  # how many values to generate
s = np.random.normal(mean, std_dev, size)

# by object method
s.var()

# by numpy function
np.var(s)

98.30815242523478

## Standard deviation

**The square root of the variance.**

**Example:** 
Two classes, A and B, each have 5 students. Class A scored (20, 80, 10, 90, 50) and class B scored (50, 49, 51, 52, 48). Both classes average 50, but their standard deviations are very far apart.
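
A quick check of the class example, using the scores given above. The variable names `class_a` and `class_b` are only for illustration.

```python
import numpy as np

class_a = np.array([20, 80, 10, 90, 50])
class_b = np.array([50, 49, 51, 52, 48])

print(class_a.mean(), class_b.mean())  # both means are 50
print(class_a.std())                   # about 31.62
print(class_b.std())                   # about 1.41
```

The means are identical, but class A's scores are spread far more widely around that mean.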

- It's often used as a way to identify outliers. A data point more than 1 standard deviation away from the mean can be considered unusual; some use a threshold of 1.5 or more.
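
A minimal sketch of that idea, assuming a threshold of 1.5 standard deviations. The scores are made up (class B's scores plus one extra score of 90), and the right threshold depends on the data.

```python
import numpy as np

# Made-up scores: mostly close to 50, plus one value far away
scores = np.array([50, 49, 51, 52, 48, 90])

threshold = 1.5  # assumed cutoff, in standard deviations

# Distance of each score from the mean
distance = np.abs(scores - scores.mean())

# Keep only the scores that lie further out than the cutoff
outliers = scores[distance > threshold * scores.std()]
print(outliers)
```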

In [8]:
s = np.random.normal(100, 10, 20)
d = s.std()

# or by numpy function
np.std(s)

8.869444775629821

### Sample variance, divide by `N-1`

For a complete dataset (the whole population), compute the variance by dividing by the number of records, *N*. For example, with 7 records, divide by 7.

However, if the data is only a sample of a larger population, divide by *N-1* instead. In the above example, divide by `7-1 = 6`.
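
In NumPy the divisor is controlled by the `ddof` ("delta degrees of freedom") argument of `np.var` and `np.std`, which divide by `N - ddof`. A short sketch with made-up data:

```python
import numpy as np

# Made-up data, treated here as a sample of a larger population
sample = np.array([1, 4, 5, 4, 8])

population_variance = np.var(sample)        # divides by N (ddof=0 is the default)
sample_variance = np.var(sample, ddof=1)    # divides by N - 1

print(population_variance)
print(sample_variance)
```

The sample variance is always slightly larger, and the gap shrinks as the number of values grows.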

## Percentile

> The point below which x% of the values fall

More generally speaking: up to this point lies x% of the data.

For example, if the 30th percentile of income is `$20000`, then 30% of people earn less than `$20000`.

In [9]:
mu = 0
std = 1
size = 100_000
x = np.random.normal(mu, std, size)

np.percentile(x, 20)

-0.8399415333957866
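
To check the definition, here is a small sketch that counts how much of the data actually lies below a percentile. The data (the integers 1 to 100) is chosen only to make the result easy to verify.

```python
import numpy as np

# The integers 1..100, so the answer is easy to check by hand
data = np.arange(1, 101)

p30 = np.percentile(data, 30)

# Fraction of the values that fall below the 30th percentile
fraction_below = np.mean(data < p30)

print(p30)
print(fraction_below)
```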

### Quartile

Quartile 1 (Q1) is the 25th percentile and quartile 3 (Q3) is the 75th percentile. Together they enclose the middle 50% of the data: 25% lies between Q1 and the median, and 25% lies between the median and Q3.

The distance between Q1 and Q3 is called the *interquartile range (IQR)*.
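
The quartiles can be read off with `np.percentile` by passing several percentiles at once. A minimal sketch, again using the integers 1 to 100 as illustration data:

```python
import numpy as np

data = np.arange(1, 101)

# 25th, 50th and 75th percentiles in a single call
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range: the width of the middle 50% of the data
iqr = q3 - q1

print(q1, median, q3)
print(iqr)
```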

## Moments

There are 4 moments commonly used in statistics. They are quantitative measures of the shape of a probability density function.

1. mean - `np.mean(data)`
2. variance - `np.var(data)`
3. skew - `scipy.stats.skew(data)`
4. kurtosis - `scipy.stats.kurtosis(data)`

### Skew

> Skew is associated with the shape of the distribution, not its actual offset in X.

If the distribution has a longer tail on the left, that is a negative skew. If it has a longer tail on the right, that's a positive skew.

- negative skew
- positive skew

In the figure below, the dotted lines show what the shape of a normal distribution would look like without skew. When the tail stretches out on the left side of it, the skew is negative; when it stretches out on the right side, the skew is positive.

![skew](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Negative_and_positive_skew_diagrams_%28English%29.svg/446px-Negative_and_positive_skew_diagrams_%28English%29.svg.png) 

In [10]:
import scipy.stats as sp

nums = np.random.normal(100, 2, 300)

sp.skew(nums)

-0.09576016940475295
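
The normal sample above has a skew close to 0 because it is roughly symmetric. To see the sign change, here is a sketch with made-up data and a hand-written `skewness` helper (a hypothetical name). It computes the third standardized moment without bias correction, which is assumed to match what `scipy.stats.skew` returns with its default settings.

```python
import numpy as np

def skewness(data):
    """Third standardized moment (population form, no bias correction)."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()  # standardize the values
    return np.mean(z ** 3)

right_tail = [1, 2, 2, 3, 10]      # one large value stretches the right tail
left_tail = [-10, -3, -2, -2, -1]  # mirror image, so the tail is on the left

print(skewness(right_tail))  # positive
print(skewness(left_tail))   # negative
```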

### Kurtosis

Kurtosis describes how thick the tails are and how sharp the peak is, compared to a normal distribution.

![kurtosis](https://www.daytrading.com/wp-content/uploads/2021/09/Kurtosis.jpg) 

In [11]:
import scipy.stats as sp
sp.kurtosis(nums)

-0.10957535296411258
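
A value near 0, as above, means the tails are about as heavy as a normal distribution's. The sketch below uses made-up data and a hand-written `excess_kurtosis` helper (a hypothetical name). It computes the fourth standardized moment minus 3, which is assumed to match the default behaviour of `scipy.stats.kurtosis`.

```python
import numpy as np

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()  # standardize the values
    return np.mean(z ** 4) - 3

flat = [1, 2, 3, 4, 5]                         # evenly spread, no real tails
heavy_tails = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # sharp peak, two extreme values

print(excess_kurtosis(flat))         # negative: thinner tails than normal
print(excess_kurtosis(heavy_tails))  # positive: heavier tails than normal
```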