In [1]:
import numpy as np
from scipy.stats import mode

# Central Tendency

#### Measures of central tendency are measures of location within a distribution. They summarize, in a single value, the one score that best describes the centrality of the data. Of course, there are lots of scores in any data set. Nevertheless, one score is most representative of the entire set of scores.

## Mode

Because of its simplicity, the mode is an adequate measure of central tendency to report if you need a summary statistic in a hurry. For most purposes, however, the mode is not the best measure of central tendency to report. It is simply too subject to the vagaries of the cases that happen to fall in a particular sample. Also, for very small samples, the mode may have a frequency only one or two higher than the other scores—not very informative. Finally, no additional statistics are based on the mode. For these reasons, it is not as useful as the median or the mean.

In [4]:
successful_hunts = [1,2,3,4,8,2,3,2,2,2,2]
mode(successful_hunts)  # mode() from scipy.stats returns the most frequent value and its count

ModeResult(mode=array([2]), count=array([6]))

__The mode is simply the most frequent element. Here 2 is clearly the most frequent value, with six occurrences, but the most frequent value need not lie near the center of the data, so the mode can be a poor measure of central tendency.__
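As a cross-check on the count above, the mode can also be found by tallying values with the standard library. This is a minimal sketch using `collections.Counter`; the names `counts`, `mode_value` and `mode_count` are illustrative and not part of the original notebook.

```python
from collections import Counter

successful_hunts = [1, 2, 3, 4, 8, 2, 3, 2, 2, 2, 2]

# Count how often each value occurs.
counts = Counter(successful_hunts)

# most_common(1) returns a one-element list holding the (value, count)
# pair with the highest count.
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 2 6
```

If several values tie for the highest count, this sketch reports only one of them, so it is worth inspecting `counts` directly for small samples.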

## Median

The median, symbolized Mdn, is the middle score. It cuts the distribution in half, so that there are the same number of scores above the median as there are below the median. Because it is the middle score, the median is the 50th percentile. Here’s an example. Seven basketball players shoot 30 free throws during a practice session. The numbers of baskets they make are listed below. What is the median number of baskets made?

__22, 23, 11, 18, 22, 20, 15__


To find the median, use the following steps:
1. Put the scores in ascending or descending order. If you do not first do this, the median will merely reflect the arrangement of the numbers rather than the actual number of baskets made. Here are the scores in ascending order.

__11, 15, 18, 20, 22, 22, 23__

2. Count in from the lowest and highest scores until you find the middle score.

What is the median number of baskets? __The median number of baskets is 20__ because there are three scores above 20 and three scores below 20.

Here’s another example. Twelve members of a gym class, some in good physical condition and some in not-so-good physical condition, see how many sit-ups they can complete
in a minute. Here are their scores.

__2, 3, 6, 10, 12, 12, 14, 15, 15, 15, 24, 25__

What is the median number of sit-ups? Is it 12? 14? The median is 13, because there are six scores below 13 and six scores above 13. Note that the median does not necessarily have to be an existing score. In this case, no one completed exactly 13 sit-ups. Here is the rule: __With an odd number of scores, the median will be an actual score. With an even number of scores, the median is the value midway between the two centermost scores, which is an actual score only when those two scores are equal. To get the midpoint, simply average the two centermost scores. In our example, this is (12 + 14)/2, which is 26/2, which is 13.__
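The sort-then-pick-the-middle procedure described above can be sketched in a few lines of plain Python. The helper name `median_by_hand` is hypothetical and exists only for illustration; in practice `np.median` does the same job.

```python
def median_by_hand(scores):
    """Return the median by sorting and taking the middle (illustrative helper)."""
    ordered = sorted(scores)      # step 1: put the scores in order
    n = len(ordered)
    middle = n // 2               # step 2: locate the middle position
    if n % 2 == 1:
        # Odd count: the single middle score is the median.
        return ordered[middle]
    # Even count: average the two centermost scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


baskets = [22, 23, 11, 18, 22, 20, 15]
sit_ups = [2, 3, 6, 10, 12, 12, 14, 15, 15, 15, 24, 25]

print(median_by_hand(baskets))  # 20
print(median_by_hand(sit_ups))  # 13.0
```

The sketch assumes a non-empty list; an empty input would raise an `IndexError`.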

In [7]:
race_turtles = np.arange(1,100,1)
np.median(race_turtles)  # 99 values (1..99): 49 above and 49 below the median

50.0

In [9]:
race_turtles = np.arange(1,101,1)
np.median(race_turtles)  # 100 values (1..100): 50 above and 50 below; median is (50 + 51) / 2

50.5

## Mean

In mathematics and statistics, the arithmetic mean or simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the count of numbers in the collection.

__While the arithmetic mean is often used to report central tendencies, it is not a robust statistic, meaning that it is greatly influenced by outliers (values that are very much larger or smaller than most of the values). Notably, for skewed distributions, such as the distribution of income for which a few people's incomes are substantially greater than most people's, the arithmetic mean may not coincide with one's notion of "middle", and robust statistics, such as the median, may be a better description of central tendency.__


### Now the question is: what is an outlier?

__A score that is way out of line with the rest of the data is called an outlier.__ _Sometimes outliers are legitimate—one person in the sample is simply much faster, smarter, or better along whatever scale is being measured. Other times an outlier represents a clerical error—the person was measured incorrectly or the score was entered into the data set incorrectly. Because outliers markedly affect the mean, researchers need to be especially alert for them so that they can determine whether the score legitimately belongs in the data set. Simply knowing the value of the mean does not, in itself, tell us that there is an outlier. Only visual inspection of the data tells us that. This is another reason why competent researchers always look at the data before calculating any statistic._

In [12]:
race_time_finish = np.arange(1000, 1050)  # finish times 1000..1049

In [13]:
race_time_finish

array([1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010,
       1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021,
       1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032,
       1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043,
       1044, 1045, 1046, 1047, 1048, 1049])

In [14]:
np.mean(race_time_finish)

1024.5

In [16]:
my_class_marks = np.array([1,2,3,3,1,2,3,4,5,2,3,4,6])
np.mean(my_class_marks)

3.0

In [19]:
my_class_marks = np.array([1,2,3,3,1,2,3,4,5,2,3,4,6,300])
np.mean(my_class_marks)  # See how a single outlier can drastically affect the mean

24.214285714285715
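To make the contrast between a robust and a non-robust statistic concrete, the following sketch computes both the mean and the median for the same marks with and without the outlier. The variable names `marks` and `marks_with_outlier` are illustrative and not part of the original cells.

```python
import numpy as np

marks = np.array([1, 2, 3, 3, 1, 2, 3, 4, 5, 2, 3, 4, 6])
marks_with_outlier = np.append(marks, 300)  # add one extreme score

# The mean is pulled strongly toward the outlier...
print(np.mean(marks), np.mean(marks_with_outlier))

# ...while the median stays at the middle of the ordered scores.
print(np.median(marks), np.median(marks_with_outlier))
```

For this particular data set the median is 3.0 in both cases, which is why the median is often preferred when outliers or strong skew are suspected.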

### __External Resources__
 * [Measures of Central Tendency Visually Explained](https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php)
 * [Skewness: When Mean and Median are not the same thing](https://web.ma.utexas.edu/users/mks/statmistakes/skeweddistributions.html)

## Variance

__Variance ($\sigma^2$) is a measurement of the spread between numbers in a data set. It measures how far each number in the set is from the mean and is calculated by taking the differences between each number in the set and the mean, squaring the differences (to make them positive) and dividing the sum of the squares by the number of values in the set.__

\begin{align}
\sigma^2  = \frac{\sum_{i=1}^n(x_i-\mu)^2}{n}\\
\end{align}

__where__<br>
$\mu$ = mean of the observations<br>
$n$ = number of observations<br>
$x_i$ = $i^{th}$ observation

### Why do we need variance?

_A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another. Variance is the average of the squared distances from each point to the mean._


### Advantages and Disadvantages

__The advantage of variance is that it treats all deviations from the mean the same regardless of direction; as a result, the squared deviations cannot sum to zero and give the appearance of no variability at all in the data. The drawback of variance is that it is not easily interpreted, and the square root of its value is usually taken to get the standard deviation of the data set in question.__
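The point about deviations cancelling can be checked numerically. This is a small sketch on an arbitrary sample array (the name `deviations` is illustrative): the raw deviations from the mean sum to zero, up to floating-point rounding, whereas the squared deviations do not.

```python
import numpy as np

x = np.array([2, 3, 4, 5, 7, 8, 4, 12, 5, 7, 16, 19])

# Signed distance of each point from the mean.
deviations = x - x.mean()

# Positive and negative deviations cancel, so the sum is (numerically) zero.
print(np.isclose(deviations.sum(), 0.0))

# Squaring removes the sign, so the sum reflects the actual spread.
print((deviations ** 2).sum())
```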

In [2]:
x = np.array([2,3,4,5,7,8,4,12,5,7,16,19])

In [4]:
var_x = x.var()
var_x

26.055555555555554
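As a sanity check, the variance formula can be applied term by term and compared with `x.var()`. This is a sketch, with `mu`, `n`, `manual_var` and `sample_var` as illustrative names. Note that NumPy's `var` divides by $n$ by default (`ddof=0`), which matches the population formula used here; passing `ddof=1` divides by $n - 1$ instead and gives the sample variance.

```python
import numpy as np

x = np.array([2, 3, 4, 5, 7, 8, 4, 12, 5, 7, 16, 19])

mu = x.mean()   # mean of the observations
n = x.size      # number of observations

# Average of the squared distances from the mean (divide by n).
manual_var = np.sum((x - mu) ** 2) / n
print(manual_var, x.var())

# Sample variance divides by n - 1 and is therefore slightly larger.
sample_var = x.var(ddof=1)
print(sample_var)
```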

## Standard Deviation

Standard deviation shows how much variation (dispersion, spread, scatter) from the mean exists. It represents a "typical" deviation from the mean. It is a popular measure of variability because __it returns to the original units of measure of the data set.__

A low standard deviation indicates that the data points tend to be very close to the mean. A high standard deviation indicates that the data points are spread out over a large range of values.
The standard deviation can be thought of as a "standard" way of knowing what is normal (typical), what is very large, and what is very small in the data set.

\begin{align}
\sigma  = \sqrt{\frac{\sum_{i=1}^n(x_i-\mu)^2}{n}}\\
\end{align}

__Because the standard deviation is expressed in the original units, data containing lengths measured in feet has a standard deviation also measured in feet, while its variance is in $\text{feet}^2$.__

In [6]:
std_x = x.std()
std_x

5.104464277037851
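The relationship between the two measures can be verified directly: the standard deviation is the square root of the variance. The sketch below recomputes it from the formula (the name `manual_std` is illustrative) and compares it with `x.std()`, which, like `x.var()`, divides by $n$ by default.

```python
import numpy as np

x = np.array([2, 3, 4, 5, 7, 8, 4, 12, 5, 7, 16, 19])

# Square root of the mean squared deviation from the mean.
manual_std = np.sqrt(np.mean((x - x.mean()) ** 2))

print(manual_std, x.std())
print(np.isclose(manual_std ** 2, x.var()))
```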

### Additional Resources:
[Khan Academy - Tendency](https://www.youtube.com/watch?v=E4HAYd0QnRc)<br>
[Khan Academy - Variability](https://www.youtube.com/watch?v=Cx2tGUze60s)