# Python Statistics Libraries

### statistics
- Python's statistics is a built-in Python library for descriptive statistics. You can use it if your datasets are not too large or if you can't rely on importing other libraries.

### NumPy
- NumPy is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis.

### SciPy
- SciPy is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis.

### Pandas
- Pandas is a third-party library for numerical computing based on NumPy. It excels in handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.

### Matplotlib
- Matplotlib is a third-party library for data visualization. It works well in combination with NumPy, SciPy, and Pandas.

#### Outliers :
An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start you off:

- Natural variation in data
- Change in the behavior of the observed system
- Errors in data collection


### Note that you should always be aware of whether you're working with a sample or the entire population whenever you're calculating the variance!

In [4]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [5]:
x = [8.0, 1, 2.5, 4, 28.0]
xWithNan = [8.0, 1, 2.5, math.nan, 4, 28.0]
print("X : ",x)
print("\n")
print("X with NaN value : ",xWithNan)

X :  [8.0, 1, 2.5, 4, 28.0]


X with NaN value :  [8.0, 1, 2.5, nan, 4, 28.0]


In [6]:
y = np.array(x)
yWithNan = np.array(xWithNan)
z = pd.Series(x)
zWithNan = pd.Series(xWithNan)
print("Y : ",y)
print("\n")
print("Y with NaN value : ",yWithNan)
print("\n")
print("Z : \n",z)
print("\n")
print("Z with NaN value :\n",zWithNan)

Y :  [ 8.   1.   2.5  4.  28. ]


Y with NaN value :  [ 8.   1.   2.5  nan  4.  28. ]


Z : 
 0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64


Z with NaN value :
 0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


In [7]:
# How do you check for a nan value? Use math.isnan() or np.isnan():
print(math.isnan(np.nan))
print(np.isnan(math.nan))
print(math.isnan(yWithNan[3]))
print(np.isnan(yWithNan[3]))
print(math.isnan(yWithNan[2]))
print(np.isnan(yWithNan[2]))

True
True
True
True
False
False


# Measures of Central Tendency
The measures of central tendency show the central or middle values of datasets. There are several definitions of what's considered to be the center of a dataset. In this tutorial, you'll learn how to identify and calculate these measures of central tendency:

- Arithmetic mean (Mean)
- Weighted mean
- Geometric mean
- Harmonic mean
- Median
- Mode

## Arithmetic mean (AM)

- The arithmetic mean (or simply mean) of a list of numbers is the sum of all the numbers divided by how many numbers there are.

<img src="images/arithmeticMean.png">

In [29]:
mean_ = sum(x) / len(x)
print("Mean Calculation Without Library : ", mean_)

# mean() and fmean() from the built-in Python statistics library
# fmean() is introduced in Python 3.8 as a faster alternative to mean(). 
# It always returns a floating-point number.

mean_ = statistics.mean(x)
print("Mean Calculation With 'statistics.mean' : ", mean_)

mean_ = statistics.fmean(x)
print("Mean Calculation With 'statistics.fmean' : ", mean_)

mean_ = np.mean(y)
print("Mean Calculation With 'numpy.mean' : ", mean_)



# However, if there are nan values among your data, then statistics.mean() and statistics.fmean()
# will return nan as the output:

mean_ = statistics.mean(xWithNan)
print("Mean Calculation, 'nan' value included: ", mean_)

mean_ = statistics.fmean(xWithNan)
print("Mean Calculation, 'nan' value included: ", mean_)

# The NumPy method .mean() also returns nan if the array contains nan
mean_ = yWithNan.mean()
print("Mean Calculation With 'nan' value included : ", mean_)

# np.nanmean() ignores nan values
mean_ = np.nanmean(yWithNan)
print("Mean Calculation With 'nan' value included : ", mean_)

# mean calculation with pandas library
mean_ = z.mean()
print("Mean Calculation With 'pandas.mean' : ", mean_)
# Pandas ignores nan values by default
mean_ = zWithNan.mean()
print("Mean Calculation With 'nan' value included : ", mean_)

Mean Calculation Without Library :  8.7
Mean Calculation With 'statistics.mean' :  8.7
Mean Calculation With 'statistics.fmean' :  8.7
Mean Calculation With 'numpy.mean' :  8.7
Mean Calculation, 'nan' value included:  nan
Mean Calculation, 'nan' value included:  nan
Mean Calculation With 'nan' value included :  nan
Mean Calculation With 'nan' value included :  8.7
Mean Calculation With 'pandas.mean' :  8.7
Mean Calculation With 'nan' value included :  8.7


## Weighted Mean (WM)

- The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points contribute more than others.

<img src="images/weightedMean.png">

In [9]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
wmean_ = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
print(" Weighted mean with 'range()' : ", wmean_)
print("\n")
wmean_ = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
print(" Weighted mean with 'zip()': ", wmean_)


# With NumPy arrays or Pandas Series, use np.average() with the weights argument
y, z, w = np.array(x), pd.Series(x), np.array(w)

wmean_ = np.average(y, weights=w)
wmean_

wmean_ = np.average(z, weights=w)
wmean_

# Equivalent calculation with the element-wise product
(w * y).sum() / w.sum()


# If the data contain nan values, all of the following return nan,
# even when the weight of the nan value is 0.0
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])

(w * yWithNan).sum() / w.sum()

np.average(yWithNan, weights=w)

np.average(zWithNan, weights=w)

 Weighted mean with 'range()' :  6.95


 Weighted mean with 'zip()':  6.95


nan
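
The calls above propagate nan. One way to get a weighted mean that skips nan values is to mask them out before averaging. The following is a minimal sketch under that assumption; the names `values`, `weights`, and `mask` are chosen here for illustration and do not appear in the notebook:

```python
import numpy as np

values = np.array([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])
weights = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])

# Boolean mask that is True wherever the value is not nan
mask = ~np.isnan(values)

# Weighted mean over the remaining values only;
# np.average() divides by the sum of the remaining weights
wmean_no_nan = np.average(values[mask], weights=weights[mask])
print(wmean_no_nan)
```

Note that the weights of the dropped positions are dropped as well, so the result is normalized by the remaining weights rather than by the original total.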

## Geometric Mean (GM)

- The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product (as is the case with rates of growth) rather than their sum.

<img src="images/geometricMean.png">
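
To see why the product matters for growth rates, here is a small sketch with made-up yearly growth factors (the numbers and the names `growth_factors`, `avg_factor`, and `total_growth` are assumptions for illustration). Compounding the geometric mean over the same number of periods reproduces the total growth, which the arithmetic mean would not do:

```python
import statistics

# Hypothetical yearly growth factors: +10%, +20%, -10%
growth_factors = [1.10, 1.20, 0.90]

# Average growth factor per year, taken as the geometric mean
avg_factor = statistics.geometric_mean(growth_factors)

# Total growth over the three years is the product of the factors
total_growth = 1.10 * 1.20 * 0.90

# Compounding the geometric mean three times matches the total growth
# (up to floating-point rounding)
print(avg_factor ** 3, total_growth)

# The arithmetic mean is larger and would overstate the growth
print(statistics.mean(growth_factors))
```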

In [11]:
gmean = 1
for item in x:
    gmean *= item

gmean **= 1 / len(x)
gmean


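# statistics.geometric_mean() is available in Python 3.8 and later
gmean = statistics.geometric_mean(x)
gmean

# If the data contain nan values, the result is nan
gmean = statistics.geometric_mean(xWithNan)
gmean

# scipy.stats.gmean() accepts NumPy arrays and Pandas Series
scipy.stats.gmean(y)

scipy.stats.gmean(z)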

4.67788567485604

## Harmonic Mean (HM)

- The harmonic mean is an average which is useful for sets of numbers which are defined in relation to some unit, as in the case of speed (i.e., distance per unit of time).

<img src="images/harmonicMean.png">
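
The speed case can be checked with a small worked example. This sketch assumes a hypothetical round trip (the distance and speeds are made up for illustration): the same distance is driven once at 40 km/h and once at 60 km/h, and the true average speed equals the harmonic mean of the two speeds, not their arithmetic mean.

```python
import statistics

# Hypothetical trip: the same distance at 40 km/h and then at 60 km/h
speeds = [40, 60]
distance = 120  # km per leg, assumed for illustration

# Average speed = total distance / total time
total_time = sum(distance / s for s in speeds)  # 3 h + 2 h
avg_speed = len(speeds) * distance / total_time

# The harmonic mean of the speeds gives the same value
hmean_speed = statistics.harmonic_mean(speeds)
print(avg_speed, hmean_speed)
```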

In [34]:
hmean = len(x) / sum(1 / item for item in x)
hmean

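hmean = statistics.harmonic_mean(x)
hmean

# If the data contain nan values, the result is nan
statistics.harmonic_mean(xWithNan)

# If at least one value is zero, the result is 0
statistics.harmonic_mean([1, 0, 2])

# Negative values are not allowed
#statistics.harmonic_mean([1, 2, -2])  # Raises StatisticsError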

#scipy.stats.hmean(y)

#scipy.stats.hmean(z)

0

## Median

- In statistics and probability theory, a median is a value separating the higher half from the lower half of a data sample, a population or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic advantage of the median in describing data compared to the mean (often simply described as the "average") is that it is not skewed so much by a small proportion of extremely large or small values, and so it may give a better idea of a "typical" value.

<img src="images/median.png">

- If the number of elements is odd, then there's a single middle value, so median_low() and median_high() behave just like median().

- If the number of elements is even, then there are two middle values. In this case, median() returns their average, median_low() returns the lower middle value, and median_high() the higher one.
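
The two cases can be sketched with the standard library as follows; the lists `data_even` and `data_odd` are illustrative and not part of the notebook's data:

```python
import statistics

data_even = [1, 2.5, 4, 8]  # even number of elements
med = statistics.median(data_even)            # average of the two middle values
med_low = statistics.median_low(data_even)    # lower middle value
med_high = statistics.median_high(data_even)  # higher middle value
print(med, med_low, med_high)

data_odd = [1, 2.5, 4, 8, 28.0]  # odd number of elements
# With a single middle value, all three functions agree
odd_results = (
    statistics.median(data_odd),
    statistics.median_low(data_odd),
    statistics.median_high(data_odd),
)
print(odd_results)
```

Unlike median(), median_low() and median_high() always return a value that is actually present in the data.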

In [13]:
n = len(x)
if n % 2:
    median_ = sorted(x)[round(0.5*(n-1))]
else:
    x_ord, index = sorted(x), round(0.5 * n)
    median_ = 0.5 * (x_ord[index-1] + x_ord[index])

median_

median_ = statistics.median(x)
median_

median_ = statistics.median(x[:-1])
median_

median_ = np.median(y)
median_

median_ = np.median(y[:-1])
median_

# np.nanmedian() ignores nan values
np.nanmedian(yWithNan)

np.nanmedian(yWithNan[:-1])


# The Pandas method .median() ignores nan values by default
z.median()

zWithNan.median()


4.0

## Mode

- The mode is the value that appears most often in a set of data values. If X is a discrete random variable, the mode is the value x (i.e., X = x) at which the probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled.

- Like the statistical mean and median, the mode is a way of expressing, in a (usually) single number, important information about a random variable or a population. The numerical value of the mode is the same as that of the mean and median in a normal distribution, and it may be very different in highly skewed distributions.

In [14]:
u = [2, 3, 2, 8, 12]
mode_ = max((u.count(item), item) for item in set(u))[1]
mode_


mode_ = statistics.mode(u)
mode_
mode_ = statistics.multimode(u)
mode_

# If there's more than one modal value, then mode() returns the first mode encountered
# (before Python 3.8 it raised StatisticsError), while multimode() returns the list with all modes.
# statistics.multimode() is introduced in Python 3.8.
v = [12, 15, 12, 15, 21, 15, 12]
statistics.mode(v)  # Returns 12 in Python 3.8+, the first mode encountered
statistics.multimode(v)



# nan is counted like any other value, so it can be returned as a mode
statistics.mode([2, math.nan, 2])

statistics.multimode([2, math.nan, 2])

statistics.mode([2, math.nan, 0, math.nan, 5])

statistics.multimode([2, math.nan, 0, math.nan, 5])



u, v = np.array(u), np.array(v)
mode_ = scipy.stats.mode(u)
mode_

mode_ = scipy.stats.mode(v)
mode_


#numpy array
mode_.mode

mode_.count


#pandas series
u, v, w = pd.Series(u), pd.Series(v), pd.Series([2, 2, math.nan])
u.mode()


v.mode()



w.mode()


0    2.0
dtype: float64

# Measures of Variability

The measures of central tendency aren't sufficient to describe data. You'll also need the measures of variability that quantify the spread of data points. In this section, you'll learn how to identify and calculate the following variability measures:

- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges

## Variance

- In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value.

<br>

- In statistics, Bessel's correction is the use of n − 1 instead of n in the formula for the sample variance and sample standard deviation, where n is the number of observations in a sample. This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation. However, the correction often increases the mean squared error in these estimations.

<img src="images/variance.png">

<br>

- You calculate the population variance similarly to the sample variance. However, you have to use n in the denominator instead of n − 1. In this case, n is the number of items in the entire population. Compared with the sample variance calculation, the differences are:

1- Replace (n - 1) with n in the pure Python implementation. <br>
2- Use statistics.pvariance() instead of statistics.variance(). <br>
3- Specify the parameter ddof=0 if you use NumPy or Pandas. In NumPy, you can omit ddof because its default value is 0.
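
The three differences above can be sketched side by side. This is an illustrative example rather than part of the original notebook cells; it reuses the five-element dataset from earlier, and the variable names `pvar_pure`, `pvar_stats`, `pvar_np`, and `pvar_pd` are assumptions:

```python
import statistics
import numpy as np
import pandas as pd

x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n

# 1- Pure Python: n in the denominator instead of (n - 1)
pvar_pure = sum((item - mean_) ** 2 for item in x) / n

# 2- statistics.pvariance() instead of statistics.variance()
pvar_stats = statistics.pvariance(x)

# 3- NumPy: ddof defaults to 0, so it can be omitted
pvar_np = np.var(np.array(x))

# 3- Pandas: ddof defaults to 1, so ddof=0 must be given explicitly
pvar_pd = pd.Series(x).var(ddof=0)

print(pvar_pure, pvar_stats, pvar_np, pvar_pd)
```

All four values agree up to floating-point rounding and are smaller than the sample variance of the same data, because the sum of squared deviations is divided by n rather than n − 1.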


In [17]:
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
var_


var_ = statistics.variance(x)
var_


statistics.variance(xWithNan)

# You can also calculate the sample variance with NumPy. 
# You should use the function np.var() or the corresponding method .var():
var_ = np.var(y, ddof=1)
var_


# It's very important to specify the parameter ddof=1.
# That's how you set the delta degrees of freedom to 1.
# This parameter allows the proper calculation of s², with (n − 1) in the denominator instead of n.
var_ = y.var(ddof=1)
var_

# If you have nan values in the dataset, then np.var() and .var() will return nan
np.var(yWithNan, ddof=1)

yWithNan.var(ddof=1)

# If you want to skip nan values, then you should use np.nanvar():
np.nanvar(yWithNan, ddof=1)


# pd.Series objects have the method .var()
z.var(ddof=1)

zWithNan.var(ddof=1)

123.19999999999999

## Standard Deviation

- In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

<img src="images/std1.png">

<br>

- The population standard deviation refers to the entire population. It's the positive square root of the population variance.

<img src="images/std2.png">
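
As a brief sketch of the population case (not part of the original notebook cells; the variable names are illustrative), the population standard deviation can be obtained with statistics.pstdev() or with np.std() using its default ddof=0:

```python
import statistics
import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]

# Population standard deviation from the standard library
pstd_stats = statistics.pstdev(x)

# Equivalent: positive square root of the population variance
pstd_from_var = statistics.pvariance(x) ** 0.5

# NumPy: ddof defaults to 0, which gives the population value
pstd_np = np.std(np.array(x))

# Sample standard deviation for comparison (n - 1 in the denominator)
sample_std = statistics.stdev(x)

print(pstd_stats, pstd_from_var, pstd_np, sample_std)
```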



In [37]:
# standard deviation with pure Python:
std_ = var_ ** 0.5
std_

# we can use statistics.stdev()
std_ = statistics.stdev(x)
std_

# You can get the standard deviation with NumPy using np.std() or the method .std().
# If there are nan values in the dataset, then they'll return nan.
np.std(y, ddof=1)

y.std(ddof=1)

np.std(yWithNan, ddof=1)

yWithNan.std(ddof=1)


# To ignore nan values, you should use np.nanstd().
# Don't forget to set the delta degrees of freedom to 1!
np.nanstd(yWithNan, ddof=1)


# pd.Series objects also have the method .std() that skips nan by default
# The parameter ddof defaults to 1
z.std(ddof=1)

zWithNan.std(ddof=1)


15.12454774346805

## Skewness

- In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

- The sample skewness calculated in the code below is called the adjusted Fisher-Pearson standardized moment coefficient.

<img src="images/skewnessGraphs.png">


- The skewness of a random variable X is the third standardized moment, defined as:

<img src="images/skewness.png">
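
For a sample, the adjusted Fisher-Pearson coefficient can be written out as follows. This is a reconstruction from the pure-Python calculation in this section rather than a formula quoted from the notebook's images; here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation (with $n - 1$ in the denominator), and $n$ is the number of observations:

$$
\text{skew} = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3},
\qquad
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

The factor $(n-2)$ in the denominator means the formula needs at least three observations.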

In [20]:
x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
std_ = var_ ** 0.5
skew_ = (sum((item - mean_)**3 for item in x)
         * n / ((n - 1) * (n - 2) * std_**3))
skew_
# The skewness is positive, so x has a right-side tail.


# You can also calculate the sample skewness with scipy.stats.skew()
# The optional parameter nan_policy can take the values 'propagate', 'raise', or 'omit'.
# It allows you to control how you'll handle nan values.
y, yWithNan = np.array(x), np.array(xWithNan)
scipy.stats.skew(y, bias=False)

scipy.stats.skew(yWithNan, bias=False)

# Pandas Series objects have the method .skew() that 
# also returns the skewness of a dataset:
# .skew() ignores nan values by default, 
# because of the default value of the optional parameter skipna.
z, zWithNan = pd.Series(x), pd.Series(xWithNan)
z.skew()

zWithNan.skew()

1.9470432273905924

## Percentiles

- A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found. Equivalently, 80% of the observations are found above the 20th percentile.

<br>

- Each dataset has three quartiles, which are the percentiles that divide the dataset into four parts: 

#### The first quartile :
It is the sample 25th percentile. It divides roughly 25% of the smallest items from the rest of the dataset.

#### The second quartile :
It is the sample 50th percentile or the median. Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles.

#### The third quartile :
It is the sample 75th percentile. It divides roughly 25% of the largest items from the rest of the dataset.

<img src="images/percentiles.png">
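
The quartiles depend on the interpolation method used. As a hedged sketch with the standard library: statistics.quantiles() defaults to method='exclusive', which treats the data as a sample from a population that may contain more extreme values, while method='inclusive' treats the minimum and maximum of the data as the 0th and 100th percentiles. The names `q_exclusive` and `q_inclusive` are chosen here for illustration:

```python
import statistics

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

# Default method='exclusive': the sample may not contain the population extremes
q_exclusive = statistics.quantiles(x, n=4)

# method='inclusive': min and max are treated as the 0th and 100th percentiles
q_inclusive = statistics.quantiles(x, n=4, method='inclusive')

print(q_exclusive)
print(q_inclusive)

# The second quartile is the median under both methods for this dataset
print(statistics.median(x))
```

For this dataset the first and third quartiles differ between the two methods, so it is worth stating which one was used when reporting quartiles.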

In [23]:
# If you want to divide your data into several intervals,
# use statistics.quantiles():

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n=2)  # one cut point: the median

statistics.quantiles(x, n=4, method='inclusive')  # three cut points: the quartiles

# statistics.quantiles() is introduced in Python 3.8.

# You can also use np.percentile() 
# to determine any sample percentile in your dataset. 
# percentile() takes several arguments. 
# You have to provide the dataset as the first argument and 
# the percentile value as the second. 

# find the 5th and 95th percentiles:
y = np.array(x)
np.percentile(y, 5)

np.percentile(y, 95)


np.percentile(y, [25, 50, 75])

np.median(y)

# If you want to ignore nan values, then use np.nanpercentile() instead
yWithNan = np.insert(y, 2, np.nan)
yWithNan

np.nanpercentile(yWithNan, [25, 50, 75])

# NumPy also offers you very similar functionality in quantile() and nanquantile().
np.quantile(y, 0.05)

np.quantile(y, 0.95)

np.quantile(y, [0.25, 0.5, 0.75])

np.nanquantile(yWithNan, [0.25, 0.5, 0.75])

# pd.Series objects have the method .quantile():
# .quantile() also needs you to provide the quantile value as the argument. 
# This value can be a number between 0 and 1 or a sequence of numbers
z, zWithNan = pd.Series(y), pd.Series(yWithNan)
z.quantile(0.05)

z.quantile(0.95)

z.quantile([0.25, 0.5, 0.75])


zWithNan.quantile([0.25, 0.5, 0.75])

0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64

## Ranges

- In statistics, the range of a set of data is the difference between the largest and smallest values. It can give you a rough idea of how widely the data are spread before you look at them in detail. "Difference" here is specific: the range is the result of subtracting the smallest value from the largest value.

<br>

You can get the maximum and minimum with any of the following:

- max() and min() from the Python standard library
- amax() and amin() from NumPy
- nanmax() and nanmin() from NumPy to ignore nan values
- the methods .max() and .min() of NumPy arrays
- the methods .max() and .min() of Pandas Series, which ignore nan values by default

In [24]:
# The range is the difference between the maximum and minimum element.
# You can get it with the function np.ptp():

np.ptp(y)

np.ptp(z)

np.ptp(yWithNan)

np.ptp(zWithNan)

# some examples of how you would use these routines
np.amax(y) - np.amin(y)

np.nanmax(yWithNan) - np.nanmin(yWithNan)

y.max() - y.min()

z.max() - z.min()

zWithNan.max() - zWithNan.min()

# The interquartile range is the difference between the first and third quartile. 
# Once you calculate the quartiles, you can take their difference
quartiles = np.quantile(y, [0.25, 0.75])
quartiles[1] - quartiles[0]

quartiles = z.quantile([0.25, 0.75])
quartiles[0.75] - quartiles[0.25]

20.9

# Simple Summary

SciPy and Pandas offer useful routines to quickly get descriptive statistics with a single function or method call. With SciPy, you can use scipy.stats.describe().

The optional parameter nan_policy can take the values 'propagate' (default), 'raise' (an error), or 'omit'. This parameter allows you to control what happens when there are nan values.

describe() returns an object that holds the following descriptive statistics:

- nobs: the number of observations or elements in your dataset
- minmax: the tuple with the minimum and maximum values of your dataset
- mean: the mean of your dataset
- variance: the variance of your dataset
- skewness: the skewness of your dataset
- kurtosis: the kurtosis of your dataset
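
The effect of nan_policy can be sketched as follows. This example is an illustration under the assumption that the installed SciPy version supports nan_policy='omit' for describe(); the array `data` and the result names are not part of the notebook:

```python
import numpy as np
import scipy.stats

data = np.array([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])

# Default nan_policy='propagate': the nan value propagates into the mean
res_propagate = scipy.stats.describe(data, ddof=1, bias=False)
print(res_propagate.mean)

# nan_policy='omit': nan values are left out of the calculation
res_omit = scipy.stats.describe(data, ddof=1, bias=False, nan_policy='omit')
print(res_omit.nobs, res_omit.mean)
```

With 'omit', nobs counts only the non-nan observations, so it is a quick way to see how many values were actually used.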

Pandas has similar, if not better, functionality. Series objects have the method .describe().

It returns a new Series that holds the following:

- count: the number of elements in your dataset
- mean: the mean of your dataset
- std: the standard deviation of your dataset
- min and max: the minimum and maximum values of your dataset
- 25%, 50%, and 75%: the quartiles of your dataset

In [28]:
# SciPy .describe()
result = scipy.stats.describe(y, ddof=1, bias=False)
result

result.nobs

result.minmax[0]  # Min

result.minmax[1]  # Max

result.mean

result.variance

result.skewness

result.kurtosis

# Pandas .describe()

result = z.describe()
result

result['mean']

result['std']

result['min']

result['max']

result['25%']

result['50%']

result['75%']



21.0