# Measures of Central Tendency

This Jupyter Notebook covers the measures of central tendency: mean, median, and mode. These concepts pave the way for more advanced statistics.

# Mean

The mean is the sum of n observations divided by n, the total number of observations. It describes the central value of a set of numbers. One practical use of the mean is filling in missing values in a dataset. Keep in mind that the observations should be measured on a comparable scale. Outliers can skew the mean, in which case an unrepresentative value would be filled into the missing observations.

In [1]:
# Declare and store some set of numbers in a list
N = [10, 20, 30, 40, 50]

# Find the sum of these numbers
Sum_N = sum(N)

# Divide the sum by total number of observations
Mean_N = Sum_N / len(N)
print("Mean of N is:" + str(Mean_N))

Mean of N is:30.0
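
The imputation use case and the outlier caveat described above can be sketched as follows. This is a minimal illustration, not a recommended imputation pipeline: the list `readings` and the use of `None` to mark a missing observation are assumptions made for the example.

```python
from statistics import mean

# Hypothetical readings; None marks a missing observation (an assumption for this sketch)
readings = [10, 20, None, 40, 50]

# Compute the mean from the observed values only
observed = [x for x in readings if x is not None]
fill_value = mean(observed)

# Replace each missing observation with that mean
imputed = [fill_value if x is None else x for x in readings]
print("Imputed readings: " + str(imputed))

# A single outlier drags the mean, and therefore the fill value, away from the typical reading
observed_with_outlier = observed + [1000]
mean_with_outlier = mean(observed_with_outlier)
print("Fill value without outlier: " + str(fill_value))
print("Fill value with outlier: " + str(mean_with_outlier))
```

With the outlier present, the fill value no longer resembles any typical reading, which is why it is worth checking for outliers before imputing with the mean.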


## Weighted Mean

Suppose we have two columns: a salary and the number of employees who earn that salary. A simple mean of the salary column is not reliable here, because it treats every salary level equally and ignores how many employees belong to each group. To solve this, we use the weighted mean: each value is multiplied by its corresponding weight (w), the products are summed, and the total is divided by the sum of the weights.

In [2]:
# Importing Pandas library and creating Employee dataframe

import pandas as pd

Data = {'Salary':[10000, 20000, 30000, 40000, 50000],'Employee_Number':[20, 30, 50, 10, 40]}
Employee_Data = pd.DataFrame(Data,columns = ['Salary','Employee_Number'])
Employee_Data

   Salary  Employee_Number
0   10000               20
1   20000               30
2   30000               50
3   40000               10
4   50000               40


In [3]:
# Calculating the weighted mean using NumPy's average function.
# Weighted_Mean = (10000*20 + 20000*30 + 30000*50 + 40000*10 + 50000*40) / (20 + 30 + 50 + 10 + 40)

import numpy as np

Weighted_Mean = round(np.average(Employee_Data['Salary'], weights = Employee_Data['Employee_Number']),2)

print("Weighted Mean is:" + str(Weighted_Mean))

Weighted Mean is:31333.33
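
To make the arithmetic behind the weighted mean explicit, here is a small sketch that computes it by hand with plain Python lists. The variable names are chosen for this illustration only; the values are the same as in the dataframe above.

```python
# Same values as the Employee_Data dataframe, as plain lists
salaries = [10000, 20000, 30000, 40000, 50000]
employee_counts = [20, 30, 50, 10, 40]

# Multiply each salary by its weight and add up the products
weighted_sum = sum(s * w for s, w in zip(salaries, employee_counts))

# Divide by the sum of the weights, not by the number of rows
total_weight = sum(employee_counts)
weighted_mean = weighted_sum / total_weight

# For comparison: the simple mean ignores how many employees are in each group
simple_mean = sum(salaries) / len(salaries)

print("Weighted mean is:" + str(round(weighted_mean, 2)))
print("Simple mean is:" + str(simple_mean))
```

The key point is the denominator: the weighted sum is divided by the total weight (150 employees here), not by the number of salary levels.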


## Trimmed Mean 
The trimmed mean is a variant of the mean in which the values are sorted, a fixed number of values is removed from each end, and the mean is calculated from what remains. This reduces the effect of outliers, if any.

In [4]:
# As mentioned in the explanation, we first sort the list and then trim the values at either end
# (in this case 1 and 1000). Had the outlier 1000 been kept in the list while calculating the mean,
# the result would have been drastically different.

N = [40, 50, 60, 1, 1000, 30, 20, 10]

Sorted_N = sorted(N)[1:-1]

Trimmed_Mean = sum(Sorted_N) / len(Sorted_N)

print("Trimmed Mean is:" + str(Trimmed_Mean))

Trimmed Mean is:35.0
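
The cell above trims exactly one value from each end. As a sketch of how this could be generalized, the helper below drops `k` values from each end; the function name `trimmed_mean` and the parameter `k` are hypothetical names introduced for this example.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (hypothetical helper)."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    # Slice with an explicit end index so that k = 0 keeps the whole list
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

N = [40, 50, 60, 1, 1000, 30, 20, 10]

plain_mean = trimmed_mean(N, 0)     # nothing trimmed, so this is the ordinary mean
trimmed_once = trimmed_mean(N, 1)   # drops 1 and 1000
trimmed_twice = trimmed_mean(N, 2)  # drops 1, 10, 60 and 1000

print("Mean with no trimming:" + str(plain_mean))
print("Mean with 1 value trimmed per end:" + str(trimmed_once))
print("Mean with 2 values trimmed per end:" + str(trimmed_twice))
```

Comparing the untrimmed and trimmed results shows how much a single extreme value contributes to the ordinary mean.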


# Median 
The median is the midpoint of a sorted list of numbers: an equal number of values lie below it and above it. Unlike the mean, the median is largely unaffected by outliers. Always sort the numbers before finding the median.

In [5]:
# The first step in calculating the median is to sort the values. If the list has an odd number of values, finding the
# median is straightforward: it is the middle value. If the list has an even number of values, the two central values
# are identified and their mean is calculated.

N = [40, 50, 60, 1, 1000, 30, 20, 10]

Sorted_N = sorted(N)

if len(Sorted_N) % 2 == 0:
   first_median = Sorted_N[len(Sorted_N) // 2]
   second_median = Sorted_N[len(Sorted_N) // 2 - 1]
   median = (first_median + second_median) / 2
else:
   median = Sorted_N[len(Sorted_N) // 2]

print("List after sorting:" + str(Sorted_N))
print("Median is: " + str(median))

List after sorting:[1, 10, 20, 30, 40, 50, 60, 1000]
Median is: 35.0
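
As a cross-check on the manual calculation, and to illustrate the point about outliers, the sketch below uses `statistics.median` and `statistics.mean` from the Python standard library on the same list, with and without the value 1000. The variable names are chosen for this example.

```python
import statistics

N = [40, 50, 60, 1, 1000, 30, 20, 10]

# statistics.median sorts internally and averages the two middle values for even-length input
median_N = statistics.median(N)
mean_N = statistics.mean(N)

# Remove the outlier and recompute both measures
N_without_outlier = [x for x in N if x != 1000]
median_without = statistics.median(N_without_outlier)
mean_without = statistics.mean(N_without_outlier)

print("With outlier    -> median: " + str(median_N) + ", mean: " + str(mean_N))
print("Without outlier -> median: " + str(median_without) + ", mean: " + str(round(mean_without, 2)))
```

Removing the outlier shifts the median only slightly, while the mean changes by a large amount.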


# Mode 
The mode is the value that occurs most frequently in a list. Counting how often each value occurs can also help identify duplicate values in your dataset.

In [6]:
# The most frequently occurring item in the list is 10. Thus, 10 will be printed.

import statistics

N = [10, 20, 30, 40, 10, 50, 60, 10]

Mode = statistics.mode(N)

print("Mode is:" + str(Mode))

Mode is:10


In [7]:
# When several items are tied for the highest count, the list has multiple modes.
# statistics.multimode (available from Python 3.8) returns all of them as a list.

N = [10, 20, 30, 40, 10, 50, 60, 10, 20, 70, 20]

MultiMode = statistics.multimode(N)

print("Multiple Mode is:" + str(MultiMode))

Multiple Mode is:[10, 20]
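
To show what the mode calculation is counting, and how the same counts can flag duplicates, here is a minimal sketch using `collections.Counter` from the standard library. It is an alternative illustration, not how the `statistics` module is necessarily implemented; the variable names are chosen for this example.

```python
from collections import Counter

N = [10, 20, 30, 40, 10, 50, 60, 10, 20, 70, 20]

# Count how many times each value occurs
counts = Counter(N)

# The modes are the values whose count equals the highest count
highest_count = max(counts.values())
modes = [value for value, count in counts.items() if count == highest_count]

# Any value that occurs more than once is a duplicate
duplicates = [value for value, count in counts.items() if count > 1]

print("Counts: " + str(dict(counts)))
print("Modes: " + str(modes))
print("Duplicated values: " + str(duplicates))
```

In this list the modes and the duplicated values happen to coincide; in general the duplicates are every value with a count above one, while the modes are only those tied for the highest count.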
