# Averages and measures of central tendency
link: https://en.wikipedia.org/wiki/Central_tendency

The most common measures of central tendency are the **arithmetic mean, the median and the mode**. A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution. Occasionally authors use central tendency to denote "the tendency of quantitative data to cluster around some central value."

In [23]:
# Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [24]:
# Create an array of 20 random integers from 7 to 9 (the upper bound 10 is exclusive)
data_array = np.random.randint(7, 10, 20)
print(data_array)

[9 9 7 9 7 8 9 7 8 7 7 9 9 9 7 7 7 8 9 7]


## Mean: 

link: https://en.wikipedia.org/wiki/Arithmetic_mean

For a data set, the arithmetic mean, also called the mathematical expectation or average, is the **central value of a discrete set of numbers:** specifically, the sum of the values divided by the number of values. 

<img src="images/static/Comparison_mean_median_mode.png" width="350" height="350">

In [25]:
# Calculate the mean 
print(np.mean(data_array))
print(f'The MEAN for the provided ORIGINAL DATA is: {np.mean(data_array)}')

7.95
The MEAN for the provided ORIGINAL DATA is: 7.95
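To make the definition concrete, the sketch below recomputes the mean by hand as the sum of the values divided by their count. The `values` list is an assumption made for reproducibility: it copies the numbers printed above, because `data_array` is random and changes on every run.

```python
import numpy as np

# Values copied from the printed output above (the original array is random)
values = [9, 9, 7, 9, 7, 8, 9, 7, 8, 7, 7, 9, 9, 9, 7, 7, 7, 8, 9, 7]

# Arithmetic mean: sum of the values divided by the number of values
manual_mean = sum(values) / len(values)

print(manual_mean)
# Check that the hand calculation agrees with NumPy
print(np.isclose(manual_mean, np.mean(values)))
```

For these values the sum is 159 and the count is 20, which gives the same 7.95 reported by `np.mean`.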


## Median: 
link: https://en.wikipedia.org/wiki/Median

The median is the **value separating the higher half from the lower half of a data sample**, a population or a probability distribution. For a data set, it may be thought of as the "middle" value.

<img src="images/static/Finding_the_median.png" width="350" height="350">

In [26]:
# Calculate the median
print(np.median(data_array))
print(f'The MEDIAN for the provided ORIGINAL DATA is: {np.median(data_array)}')

8.0
The MEDIAN for the provided ORIGINAL DATA is: 8.0
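The "middle value" idea can be sketched by sorting the data and picking the centre. The helper `manual_median` below is a hypothetical name used only for illustration; with an even number of values it averages the two middle ones, which is also what `np.median` does.

```python
import numpy as np

def manual_median(values):
    """Return the middle value of the sorted data (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return float(ordered[mid])
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Values copied from the printed output of data_array above
values = [9, 9, 7, 9, 7, 8, 9, 7, 8, 7, 7, 9, 9, 9, 7, 7, 7, 8, 9, 7]

print(manual_median(values))      # even count (20 values)
print(manual_median([1, 3, 2]))   # odd count (3 values)
```

With 20 values, the 10th and 11th sorted values are both 8, so the result agrees with the 8.0 printed above.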


## Mode: 
link: https://en.wikipedia.org/wiki/Mode_(statistics)

**Most frequent value** in the dataset. 

<img src="images/static/Visualisation_mode_median_mean.png" width="200" height="200">

In [27]:
# Calculate the mode 
# Import additional library for mode 
from statistics import mode

In [28]:
print(mode(data_array))
print(f'The MODE for the provided ORIGINAL DATA is: {mode(data_array)}')

7
The MODE for the provided ORIGINAL DATA is: 7
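A data set can have more than one most frequent value. As a minimal sketch, the standard-library `collections.Counter` shows how often each value occurs, and `statistics.multimode` (available from Python 3.8) returns every value that ties for the highest count. The `values` and `tied` lists are illustrative assumptions; `values` copies the array printed earlier.

```python
from collections import Counter
from statistics import multimode

# Values copied from the printed output of data_array above
values = [9, 9, 7, 9, 7, 8, 9, 7, 8, 7, 7, 9, 9, 9, 7, 7, 7, 8, 9, 7]

# Count occurrences of each value, most frequent first
counts = Counter(values)
print(counts.most_common())

# A small made-up list in which 1 and 2 tie for most frequent
tied = [1, 1, 2, 2, 3]
print(multimode(tied))
```

In `values`, 7 occurs nine times, 9 occurs eight times and 8 occurs three times, so the mode is 7. For `tied`, `multimode` returns both 1 and 2.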


## Population: 
The **total** set of observations in the data.
    
## Sample: 
A **subset** of the population, used to describe the population and draw conclusions about it.

<img src="images/static/pop-sample.png" width="350" height="350">

In [29]:
# Defining data for population 
population = np.random.randint(10,20,100)

In [30]:
# Calculate and print mean, median and mode for the whole population
print(np.mean(population))
print(np.median(population))
# print(mode(population))

print(f'The MEAN for the provided POPULATION DATA is: {np.mean(population)}')
print(f'The MEDIAN for the provided POPULATION DATA is: {np.median(population)}')
# print(f'The MODE for the provided POPULATION DATA is: {mode(population)}')

14.6
14.0
The MEAN for the provided POPULATION DATA is: 14.6
The MEDIAN for the provided POPULATION DATA is: 14.0


In [31]:
# Draw a random sample of 30 values from the population
# (np.random.choice samples with replacement by default)
sample = np.random.choice(population, 30)

In [32]:
# Calculate and print mean, median and mode for the subset of data
print(np.mean(sample))
print(np.median(sample))
# print(mode(sample))

print(f'The MEAN for the provided SAMPLE DATA is: {np.mean(sample)}')
print(f'The MEDIAN for the provided SAMPLE DATA is: {np.median(sample)}')
# print(f'The MODE for the provided SAMPLE DATA is: {mode(sample)}')

13.733333333333333
13.5
The MEAN for the provided SAMPLE DATA is: 13.733333333333333
The MEDIAN for the provided SAMPLE DATA is: 13.5
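Because `np.random.choice` draws with replacement by default, the same observation can appear in a sample more than once. The sketch below shows one way to draw a sample in which each observation is used at most once, by sampling positions with `replace=False`. The names `rng`, `population_demo` and `sample_demo` and the seed are assumptions made for this illustration, not part of the notebook above.

```python
import numpy as np

# Seeded generator so the illustration is reproducible (seed is arbitrary)
rng = np.random.default_rng(seed=0)

# Stand-in population: 100 integers from 10 to 19 (upper bound 20 is exclusive)
population_demo = rng.integers(10, 20, size=100)

# Sample 30 positions without replacement, so no observation is picked twice
idx = rng.choice(population_demo.size, size=30, replace=False)
sample_demo = population_demo[idx]

print(len(set(idx.tolist())))   # number of distinct positions drawn
print(sample_demo.mean(), population_demo.mean())
```

Sampling positions rather than values makes it easy to confirm that all 30 draws are distinct observations, even though the values themselves repeat.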


In [33]:
# Creating several subsets of data (random samples)
sample_1 = np.random.choice(population, 20)
sample_2 = np.random.choice(population, 20)
sample_3 = np.random.choice(population, 20)
sample_4 = np.random.choice(population, 20)
sample_5 = np.random.choice(population, 20)
sample_6 = np.random.choice(population, 20)
sample_7 = np.random.choice(population, 20)
sample_8 = np.random.choice(population, 20)

# Collecting all samples in a list
all_samples = [sample, sample_1, sample_2, sample_3, sample_4, sample_5, sample_6, sample_7, sample_8]

# Creating an empty list to store the mean of each sample
sample_mean = []

In [34]:
# For loop to iterate and get all mean values 
for i in all_samples:
    sample_mean.append(np.mean(i))
    print(np.mean(i))

13.733333333333333
15.55
13.8
14.35
14.4
14.0
14.3
15.4
13.8


In [35]:
# Find the mean of the sample means
print(np.mean(sample_mean))
print(f'The MEAN of the SAMPLE MEANS is: {np.mean(sample_mean)}')

14.370370370370372
The MEAN of the SAMPLE MEANS is: 14.370370370370372


In [36]:
# Find the mean of the original data called population 
print(np.mean(population))
print(f'The MEAN for the provided ORIGINAL DATA is: {np.mean(population)}')


# Conclusions:
print("*************")
print("*************")
print("CONCLUSIONS:") 
print("We can conclude that the mean of the sample means is close to the population mean.")
print("*************")
print("*************")

14.6
The MEAN for the provided ORIGINAL DATA is: 14.6
*************
*************
CONCLUSIONS:
We can conclude that the mean of the sample means is close to the population mean.
*************
*************
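The same comparison can be repeated with many more samples to see how closely the average of the sample means tracks the population mean. This is a minimal sketch under assumed settings (a seeded generator, 1000 samples of 20 values each); `population_demo` and `sample_means` are illustrative names.

```python
import numpy as np

# Seeded generator so the illustration is reproducible (seed is arbitrary)
rng = np.random.default_rng(seed=42)

# Stand-in population: 100 integers from 10 to 19
population_demo = rng.integers(10, 20, size=100)

# Draw 1000 samples of 20 values each (with replacement) and average each one
sample_means = [rng.choice(population_demo, size=20).mean() for _ in range(1000)]

print(population_demo.mean())   # population mean
print(np.mean(sample_means))    # mean of the 1000 sample means
print(np.std(sample_means))     # spread of the individual sample means
```

Individual sample means still vary from draw to draw, as the nine values printed earlier show, but their average settles near the population mean as the number of samples grows.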


## Measures of spread: 
The variability of the data, that is, **how the data is distributed**.
<br><br>The main measures of spread are:

1. **Range**
1. **Quartile**
1. **Variance**
1. **Standard deviation**

## Range: 
The **difference between the highest and lowest values** in the data.
<img src="images/static/range.svg" width="350" height="350">

In [37]:
# Let's make an example of Range
n = np.random.randn(9)
print(n)

[-1.28879876 -0.25844767 -0.414183    1.39946516  1.15136839 -0.57942381
  0.40796286 -1.07471273  0.48853291]


In [38]:
# Calculating Range for provided data
print(np.max(n)-np.min(n))
print(f'The RANGE for the provided data is: {np.max(n)-np.min(n)}')

2.688263924257339
The RANGE for the provided data is: 2.688263924257339
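NumPy also provides `np.ptp` ("peak to peak"), which computes the same maximum-minus-minimum difference in one call. The array `n_demo` below is an assumption for reproducibility: it copies the values printed above, rounded to the eight decimals shown.

```python
import numpy as np

# Values copied from the printed output of n above
n_demo = np.array([-1.28879876, -0.25844767, -0.414183, 1.39946516, 1.15136839,
                   -0.57942381, 0.40796286, -1.07471273, 0.48853291])

# Range = maximum - minimum, computed in a single call
data_range = np.ptp(n_demo)

print(data_range)
# Check that it agrees with the explicit max - min calculation
print(np.isclose(data_range, n_demo.max() - n_demo.min()))
```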


## Quartiles and Percentiles: 
link: https://en.wikipedia.org/wiki/Quartile

The quartiles of a ranked set of data values are the **four subsets whose boundaries are the three quartile points.** Thus an individual item might be described as being "on the upper quartile".
    
A quartile is a type of quantile:<br> 
- The **first quartile (Q1)** is defined as the middle number between the smallest number and the median of the data set.
- The **second quartile (Q2)** is the median of the data.
- The **third quartile (Q3)** is the middle value between the median and the highest value of the data set.

<img src="images/static/Boxplot_vs_PDF.png" width="350" height="350">

In [39]:
# Q1: first quartile (lower quartile, 25th percentile);
# splits off the lowest 25% of data from the highest 75%
Q1 = np.percentile(n, 25)

# Q2: second quartile (median, 50th percentile);
# cuts the data set in half
Q2 = np.percentile(n, 50)

# Q3: third quartile (upper quartile, 75th percentile);
# splits off the highest 25% of data from the lowest 75%
Q3 = np.percentile(n, 75)

In [40]:
# IQR
# In descriptive statistics, the interquartile range (IQR), 
# also called the midspread or middle 50%, or technically H-spread, 
# is a measure of statistical dispersion, being equal to the difference 
# between 75th and 25th percentiles, or between upper and lower quartiles:
# IQR = Q3 −  Q1. 

IQR = Q3 - Q1

In [41]:
print(IQR)

1.0679567204298175


## Variance: 
Informally, it measures **how far a set of numbers is spread out from its average value (mean)**.

<img src="images/static/Comparison_standard_deviations.png" width="350" height="350">

In [42]:
# Getting variance from population and sample from before
print(np.var(population))
print(np.var(sample))

# Both variances are different

7.800000000000001
5.262222222222222
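By default `np.var` divides the sum of squared deviations by N (`ddof=0`), which is the population variance. When the data is a sample, a common choice is to divide by N - 1 instead, which `np.var` does with `ddof=1`. The sketch below computes both by hand on a small made-up array (`values` is an assumption for illustration) and compares them with NumPy.

```python
import numpy as np

# Small made-up data set with mean 5
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Squared distance of each value from the mean
squared_dev = (values - values.mean()) ** 2

population_var = squared_dev.sum() / len(values)      # divide by N
sample_var = squared_dev.sum() / (len(values) - 1)    # divide by N - 1

print(population_var, np.var(values))
print(sample_var, np.var(values, ddof=1))
```

The squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7. Note that `np.var(sample)` in the cell above uses the default `ddof=0`.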


## Standard deviation: 
link: https://en.wikipedia.org/wiki/Standard_deviation

In statistics, the standard deviation (SD, also represented by the lower case Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values.

- A **low standard deviation** indicates that the data points tend to be close to the mean (also called the expected value) of the set.
- A **high standard deviation** indicates that the data points are spread out over a wider range of values.

<img src="images/static/Standard_deviation_diagram.png" width="350" height="350">

In [43]:
# Getting standard deviation from population and sample from before
print(np.std(population))
print(np.std(sample))

2.7928480087537886
2.2939534045447005


In [44]:
# Conclusions:
print("*************")
print("*************")
print("CONCLUSIONS:") 
print("Both standard deviations are pretty close in population and sample.")
print("*************")
print("*************")

*************
*************
CONCLUSIONS:
Both standard deviations are pretty close in population and sample.
*************
*************
