<a href="https://colab.research.google.com/github/cagBRT/Statistics-with-Python/blob/main/Python_and_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Statistics-with-Python.git cloned-repo
%cd cloned-repo

In [None]:
from IPython.display import Image

There are many Python statistics libraries out there for you to work with, but in this tutorial, you’ll be learning about some of the most popular and widely used ones:<br>

**Python’s statistics** is a built-in Python library for descriptive statistics. You can use it if your datasets are not too large or if you can’t rely on importing other libraries.<br>

**NumPy** is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis.<br>

**SciPy** is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis.<br>

**pandas** is a third-party library for numerical computing based on NumPy. It excels in handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.<br>

**Matplotlib** is a third-party library for data visualization. It works well in combination with NumPy, SciPy, and pandas.<br>

**Import the necessary libraries**
<br>
The math and statistics modules are part of Python’s standard library.<br>

In [None]:
import math
import statistics

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats

**Create some data**

In [None]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
x

x_with_nan

To create a NaN value: <br>

>float('nan')<br>
math.nan<br>
np.nan<br>

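You can use all of these values interchangeably, and both math.isnan() and np.isnan() recognize any of them:<br>

math.isnan(np.nan), np.isnan(math.nan)<br>

math.isnan(x_with_nan[3]), np.isnan(x_with_nan[3])<br>

Each of these expressions evaluates to True, so the two functions are equivalent for detecting nan.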

However, comparing two nan values for equality returns False. <br>

In other words, math.nan == math.nan is False!
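A minimal sketch of these nan rules, using only the standard library (the variable names are illustrative and not part of the notebook):

```python
import math

nan_a = float("nan")
nan_b = math.nan

# Equality comparisons that involve nan are always False,
# even when nan is compared against itself.
equal_to_other = nan_a == nan_b
equal_to_self = nan_b == nan_b
print(equal_to_other, equal_to_self)

# math.isnan() is the reliable way to detect nan.
detected = [math.isnan(value) for value in (nan_a, nan_b)]
print(detected)
```

Because of this, use isnan() rather than == when you need to find nan values in a dataset.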

In [None]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
new_line='\n'
print(f"y={y}{new_line}")
print(f"y_with_nan: {y_with_nan}{new_line}")
print(f"z: {z}{new_line}")
print(f"z_with_nan: {z_with_nan}")

## Measures of Central Tendency<br>
The measures of central tendency show the central or middle values of datasets.

There are several definitions of what’s considered to be the center of a dataset. In this notebook, you’ll learn how to identify and calculate these measures of central tendency:

>Mean<br>
Weighted mean<br>
Geometric mean<br>
Harmonic mean<br>
Median<br>
Mode

**Mean**

In the code below, the trailing underscore in mean_ distinguishes our variable from the mean() function in the statistics library.

In [None]:
print(f"{x}")
mean_ = sum(x) / len(x)
print(f"mean_ {mean_}")

#or
mean_x = sum(x) / len(x)
print(f"mean_x {mean_x}")

Using the mean() function from the statistics library<br>
fmean() was introduced in Python 3.8 as a faster alternative to mean(). It always returns a floating-point number.

In [None]:
mean_ = statistics.mean(x)
print(f"mean_ {mean_}")

fmean_ = statistics.fmean(x)
print(f"fmean_ {fmean_}")


If there are nan values among your data, then statistics.mean() and statistics.fmean() will return nan as the output.

In [None]:
mean_x_with_nan= statistics.mean(x_with_nan)
print(f"mean_x_with_nan: {mean_x_with_nan}")

fmean_x_with_nan = statistics.fmean(x_with_nan)
print(f"fmean_x_with_nan: {fmean_x_with_nan}")


**Numpy mean**

In [None]:
#y is a numpy array
print(f"np.mean(y): {np.mean(y)}")

#or
print(f"y.mean(): {y.mean()}")

**Numpy mean with Nan values**

NumPy can handle nan values: np.nanmean() ignores them, whereas np.mean() returns nan.

In [None]:
print(f"np.nanmean(y_with_nan): {np.nanmean(y_with_nan)}")

**Pandas mean with Nan values**

In [None]:
# z_with_nan is a pandas Series; .mean() skips nan values by default
print(z_with_nan.mean())

## Weighted Mean<br>
The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.<br>

It’s convenient (and usually the case) that all weights are nonnegative, 𝑤ᵢ ≥ 0, and that their sum is equal to one, or Σᵢ𝑤ᵢ = 1.
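A minimal sketch of turning raw counts into weights that are nonnegative and sum to one. The values and counts below are hypothetical and only serve as an illustration:

```python
import math

values = [2, 4, 8]
raw_counts = [2, 5, 3]  # hypothetical item counts

# Divide each count by the total so that the weights sum to one.
total = sum(raw_counts)
weights = [count / total for count in raw_counts]
print(weights)
print(math.isclose(sum(weights), 1.0))

# With normalized weights, the weighted mean is a plain sum of products.
weighted_mean = sum(w_i * v_i for w_i, v_i in zip(weights, values))
print(weighted_mean)
```

If the weights are not normalized, dividing the sum of products by the sum of the weights gives the same result.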

The weighted mean is very handy when you need the mean of a dataset containing items that occur with given relative frequencies.<br>

For example, say that you have a dataset:<br>
 - 20% of all items are equal to 2,
 - 50% of the items are equal to 4,
 - the remaining 30% of the items are equal to 8.

In [None]:
# The weighted mean of the above dataset:
weighted_mean = 0.2 * 2 + 0.5 * 4 + 0.3 * 8
print(weighted_mean)

**Pure Python Weighted mean**

In [None]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
print(wmean)

wmean_zip = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
print(wmean_zip)


**Numpy Weighted Mean**<br>


In [None]:
y, z, w = np.array(x), pd.Series(x), np.array(w)
wmean_y = np.average(y, weights=w)
print(wmean_y)

wmean_z = np.average(z, weights=w)
print(wmean_z)

## Harmonic Mean<br>

The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset:<br>
> 𝑛 / Σᵢ(1/𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in the dataset 𝑥.

In [None]:
#x is a list
hmean = len(x) / sum(1 / item for item in x)
print(hmean)

**The harmonic_mean() function**<br>
If you have a nan value in a dataset, then it’ll return nan.

If there’s at least one 0, then it’ll return 0.

If there’s at least one negative number, then you’ll get a statistics.StatisticsError.

In [None]:
#using the harmonic_mean function
hmean = statistics.harmonic_mean(x)
print(hmean)

In [None]:
statistics.harmonic_mean(x_with_nan)

statistics.harmonic_mean([1, 0, 2])

In [None]:
statistics.harmonic_mean([1, 2, -2])  # Raises StatisticsError

**Scipy Harmonic Mean**<br>
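If your dataset contains nan, 0, a negative number, or anything but positive numbers, then older SciPy releases raise a ValueError. Newer releases may return 0 or nan instead, so check the documentation for your installed version.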


In [None]:
print(scipy.stats.hmean(y))
print(scipy.stats.hmean(z))


## Geometric Mean<br>
The geometric mean is the 𝑛-th root of the product of all 𝑛 elements 𝑥ᵢ in a dataset 𝑥: ⁿ√(Πᵢ𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛.

In [None]:
Image("means.png")

The value of the geometric mean, in this case, differs significantly from the values of the arithmetic (8.7) and harmonic (2.76) means for the same dataset x.
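As a quick check of these numbers, the sketch below computes all three means with the standard library's statistics module (Python 3.8 or later). The list repeats the values of x from this notebook under a different name:

```python
import statistics

data = [8.0, 1, 2.5, 4, 28.0]  # same values as x in this notebook

arithmetic = statistics.mean(data)
geometric = statistics.geometric_mean(data)
harmonic = statistics.harmonic_mean(data)

print(f"arithmetic: {arithmetic:.2f}")
print(f"geometric:  {geometric:.2f}")
print(f"harmonic:   {harmonic:.2f}")
```

For positive data, the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean.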

In [None]:
gmean = 1
for item in x:
    gmean *= item

gmean **= 1 / len(x)
print(gmean)

**The statistics.geometric_mean**<br>

This is the same result as in the previous example, but with a minimal rounding error.

In [None]:
gmean = statistics.geometric_mean(x)
print(gmean)

Like statistics.mean(), statistics.fmean(), and statistics.harmonic_mean(), statistics.geometric_mean() returns nan when the data contains nan values. If there’s a zero or negative number among your data, then statistics.geometric_mean() will raise the statistics.StatisticsError.
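A minimal sketch of the error case for a negative value. The input list is hypothetical, and the exact error message depends on the Python version:

```python
import statistics

try:
    statistics.geometric_mean([1, -2, 3])  # hypothetical data with a negative value
    outcome = "no error"
except statistics.StatisticsError as exc:
    outcome = f"StatisticsError: {exc}"

print(outcome)
```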



In [None]:
gmean = statistics.geometric_mean(x_with_nan)
print(gmean)

**Scipy Stats.gmean**

In [None]:
print(scipy.stats.gmean(y))

print(scipy.stats.gmean(z))

## Median<br>

The sample median is the middle element of a sorted dataset. <br>
The dataset can be sorted in increasing or decreasing order. <br>
>If the number of elements 𝑛 of the dataset is odd, then the median is the value at the middle position: 0.5(𝑛 + 1).

>If 𝑛 is even, then the median is the arithmetic mean of the two values in the middle, that is, the items at the positions 0.5𝑛 and 0.5𝑛 + 1.

The main difference between the behavior of the mean and median is related to dataset outliers or extremes.<br>

The mean is heavily affected by outliers, but the median only depends on outliers either slightly or not at all.


In the figure below, the upper dataset again has the items 1, 2.5, 4, 8, and 28. Its mean is 8.7, and its median is 4. <br>

The lower dataset shows what happens when you move the rightmost point with the value 28:

>**If you increase its value** (move it to the right), then the mean will rise, but the median value won’t ever change.


>**If you decrease its value** (move it to the left), then the mean will drop, but the median will remain the same as long as the value of the moving point is greater than or equal to 4.

You can compare the mean and median as one way to detect outliers and asymmetry in your data.
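The effect of moving the largest value can be sketched with the statistics library. The replacement values 100, 4, and 3 are hypothetical choices for illustration:

```python
import statistics

base = [1, 2.5, 4, 8]

means = {}
medians = {}
for last in (28, 100, 4, 3):
    data = base + [last]
    means[last] = statistics.mean(data)
    medians[last] = statistics.median(data)
    print(f"last={last}: mean={means[last]}, median={medians[last]}")
```

The mean follows the moving point in every case, while the median only changes once the point drops below 4.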

In [None]:
Image("median&mean.png")

The two most important steps of a pure Python implementation of the median are as follows:<br>

1. Sorting the elements of the dataset
2. Finding the middle element(s) in the sorted dataset

In [None]:
n = len(x)
if n % 2:
    median_ = sorted(x)[round(0.5*(n-1))]
else:
    x_ord, index = sorted(x), round(0.5 * n)
    median_ = 0.5 * (x_ord[index-1] + x_ord[index])

print(median_)

**statistics.median()**<br>


In [None]:
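# Odd number of elements: the median is the middle value of the sorted data
median_ = statistics.median(x)
print(median_)

# Drop the last element (28.0): with an even count, the median is the mean of the two middle values
median_ = statistics.median(x[:-1])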
print(median_)


**median_low() and median_high()** are two more functions related to the median in the Python statistics library.

They always return an element from the dataset:

- **If the number of elements is odd**, then there’s a single middle value, so these functions behave just like median().
- **If the number of elements is even**, then there are two middle values. In this case, median_low() returns the lower and median_high() the higher middle value.

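These two functions don’t return nan when there are nan values among the data points. Because they rely on sorting, and nan values don’t sort consistently, the result can be misleading.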


In [None]:
print(statistics.median_low(x[:-1]))

print(statistics.median_high(x[:-1]))

**Numpy median**<br>

If there’s a nan value in your dataset, then np.median() returns nan (older NumPy versions also issue a RuntimeWarning).<br>

If this behavior is not what you want, then you can use np.nanmedian() to ignore all nan values.

In [None]:
median_ = np.median(y)
print(median_)

median_ = np.median(y[:-1])
print(median_)

median_nan=np.nanmedian(y_with_nan)
print(median_nan)

median_nan=np.nanmedian(y_with_nan[:-1])
print(median_nan)

**Pandas Median**<br>

A pandas Series has a built-in .median() method. It ignores nan values by default (skipna=True).

In [None]:
median_=z.median()
print(median_)

median_=z.median(skipna=True)
print(median_)

median_nan=z_with_nan.median()
print(median_nan)

## Mode<br>

The sample mode is the value in the dataset that occurs most frequently. If there isn’t a single such value, then the set is multimodal since it has multiple modal values.

Use u.count() to get the number of occurrences of each item in u. The item with the maximal number of occurrences is the mode. Note that you don’t have to use set(u). Instead, you might replace it with just u and iterate over the entire list.

set(u) returns a Python set with all unique items in u. You can use this trick to optimize working with larger data, especially when you expect to see a lot of duplicates, because each unique item is then counted only once.
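As an alternative sketch (not the approach used in this notebook), collections.Counter from the standard library counts every item in a single pass, which avoids calling count() once per unique item. The list repeats the values used for u under a different name:

```python
from collections import Counter

items = [2, 3, 2, 8, 3, 2, 3, 12]

# Count all occurrences in one pass over the data.
counts = Counter(items)
highest = max(counts.values())

# Collect every value that reaches the highest count (handles multimodal data).
modes = [value for value, count in counts.items() if count == highest]
print(modes)
```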

In [None]:
u = [2, 3, 2, 8, 12]
mode_ = max((u.count(item), item) for item in set(u))[1]
mode_

In [None]:
u = [2, 3, 2, 8, 3, 2, 3, 12]
mode_ = max((u.count(item), item) for item in set(u))[1]
mode_

**Using statistics.mode()**<br>

- mode() returns a single value <br>
- multimode() returns a list of all modal values

In [None]:
mode_ = statistics.mode(u)
print(mode_)
mode_ = statistics.multimode(u)
print(mode_)

Notice that mode() returns only one value, even when the data has several modes.

In [None]:
v = [12, 15, 12, 15, 21, 15, 12]
mode_=statistics.mode(v)
print(mode_)

mode_=statistics.multimode(v)
print(mode_)

**Statistics mode with Nan**<br>

statistics.mode() and statistics.multimode() treat nan as an ordinary value, so they still work when the data contains nan and can even return nan as the mode.

In [None]:
mode_nan=statistics.mode([2, math.nan, 2])
print(mode_nan)

mode_nan=statistics.mode([2, math.nan, 0, math.nan, 5])
print(mode_nan)

mode_nan=statistics.multimode([2, math.nan, 2])
print(mode_nan)

mode_nan=statistics.multimode([2, math.nan, 0, math.nan, 5, 2, 5])
print(mode_nan)

**Scipy.stats.mode()**<br>

In [None]:
u = [2, 3, 2, 8, 3, 2, 3, 12]
v = [12, 15, 12, 15, 21, 15, 12]

u, v = np.array(u), np.array(v)
mode_ = scipy.stats.mode(u)
print(mode_)

mode_ = scipy.stats.mode(v)
print(mode_)


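**Scipy mode with Numpy arrays**<br>
scipy.stats.mode() returns an object with two attributes: .mode holds the smallest mode of the array, and .count holds the number of times it occurs. The code below prints .mode for u and .count for v. scipy.stats.mode() is also flexible with nan values through its nan_policy parameter.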

In [None]:
u = [2, 3, 2, 8, 3, 2, 3, 12, 12, 12, 12, 12]
v = [12, 15, 12, 15, 21, 15, 12, 12, 12, math.nan]
u, v = np.array(u), np.array(v)

mode_ = scipy.stats.mode(u)
mode_np=mode_.mode
print(mode_np)

mode_ = scipy.stats.mode(v)
mode_np=mode_.count
print(mode_np)

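**Pandas mode**<br>
Series.mode() returns a Series that contains all modes and ignores nan values by default (dropna=True).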

In [None]:
u = [2, 3, 2, 8, 3, 2, 3, 2, 2, 2, 12, 12]
v = [12, 15, 12, 15, 21, 15, 12, 12, 12, math.nan]
w=pd.Series([2, 2, math.nan])
u, v = pd.Series(u), pd.Series(v)


mode_u=u.mode()
print(mode_u)

mode_v=v.mode()
print(mode_v)

mode_w=w.mode()
print(mode_w)

## Variability<br>

The measures of central tendency aren’t sufficient to describe data. <br>

You’ll also need the measures of variability that quantify the spread of data points. <br>

In this section, you’ll learn how to identify and calculate the following variability measures:<br>

- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges

**Variance**<br>
The sample variance quantifies the spread of the data. It shows numerically how far the data points are from the mean. <br>

Note the two datasets shown below have the same mean and median, even though they appear to differ significantly. Neither the mean nor the median can describe this difference.
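The same point can be sketched numerically. The two small datasets below are hypothetical and are not the ones plotted in the figure:

```python
import statistics

narrow = [4, 5, 5, 5, 6]  # hypothetical data clustered around 5
wide = [1, 3, 5, 7, 9]    # hypothetical data spread around 5

for name, data in (("narrow", narrow), ("wide", wide)):
    print(
        name,
        statistics.mean(data),
        statistics.median(data),
        statistics.variance(data),
    )

var_narrow = statistics.variance(narrow)
var_wide = statistics.variance(wide)
```

Both datasets share the same mean and median, and only the variance tells them apart.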

In [None]:
Image("VARIABLE.png")

**Python Variance**

In [None]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
var_

In [None]:
var_ = statistics.variance(x)
var_

statistics.variance() can avoid calculating the mean if you provide the mean explicitly as the second argument: statistics.variance(x, mean_).
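A minimal sketch of passing a precomputed mean. The list repeats the values of x from this notebook under a different name:

```python
import math
import statistics

data = [8.0, 1, 2.5, 4, 28.0]  # same values as x in this notebook

# Compute the mean once and reuse it for the variance.
data_mean = statistics.mean(data)
var_with_mean = statistics.variance(data, data_mean)
var_default = statistics.variance(data)

print(var_with_mean, var_default)
print(math.isclose(var_with_mean, var_default))
```

The second argument must be the actual mean of the data; passing a different value produces a meaningless result.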

If you have nan values in your data, then statistics.variance() returns nan rather than raising an error.

In [None]:
statistics.variance(x_with_nan)