## Measures of Central Tendency

This notebook discusses ways to summarize a set of data using a single number. The goal is to capture information about the distribution of the data.

## Arithmetic mean

The `arithmetic mean` is used very frequently to summarize numerical data, and is usually what is meant by the word 'average'. It is defined as the sum of the observations divided by the number of observations.

In [3]:
# two useful statistical libraries:
import scipy.stats as stats
import numpy as np

# these two data sets are used as examples throughout;
# x2 is x1 plus one extreme value
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print('Mean of x1: ', sum(x1), '/', len(x1), '=', np.mean(x1))
print('Mean of x2: ', sum(x2), '/', len(x2), '=', np.mean(x2))

Mean of x1:  29 / 8 = 3.625
Mean of x2:  129 / 9 = 14.333333333333334


We can also define a `weighted arithmetic mean`, which is useful for explicitly specifying how many times each observation should be counted. For example, when computing the average value of a portfolio, it is more convenient to say that 70% of the shares are of ticker XYZ than to list every share individually.
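As a minimal sketch of the weighted mean, the example below uses `np.average` with a `weights` argument. The prices and weights are made-up illustrative numbers, not data from this notebook.

```python
import numpy as np

# Hypothetical portfolio: price per share and the fraction of shares in each ticker.
prices = np.array([10.0, 20.0, 50.0])
weights = np.array([0.7, 0.2, 0.1])  # fractions of shares; np.average divides by their sum

# Weighted mean: sum(w_i * x_i) / sum(w_i)
weighted_mean = np.average(prices, weights=weights)

# The same quantity computed directly from the definition
manual = np.sum(weights * prices) / np.sum(weights)

print('Weighted mean of prices:', weighted_mean)
print('Unweighted mean of prices:', np.mean(prices))
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean.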

## Median

The median of a set of data is the number that appears in the middle of the list when it is sorted in increasing or decreasing order. With an odd number of data points n, the median is the value in position (n + 1)/2. With an even number of data points, the median is the average of the values in positions n/2 and n/2 + 1.


The median is less affected by extreme values in the data than the arithmetic mean. It gives the value that splits the data in half, but says nothing about how much larger or smaller the other values are.

In [4]:
print('Median of x1: ', np.median(x1))
print('Median of x2: ', np.median(x2))

Median of x1:  3.5
Median of x2:  4.0
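The positional definition of the median can be sketched directly. The helper name `median_by_position` is hypothetical and only illustrates the rule; in practice `np.median` does the same job.

```python
import numpy as np

def median_by_position(values):
    """Median found by sorting and picking the middle position(s)."""
    s = sorted(values)
    n = len(s)
    if n == 0:
        raise ValueError('median of empty data is undefined')
    mid = n // 2
    if n % 2 == 1:
        # odd n: 1-based position (n + 1) / 2 is 0-based index n // 2
        return s[mid]
    # even n: average the values in 1-based positions n/2 and n/2 + 1
    return (s[mid - 1] + s[mid]) / 2

x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print('Median of x1 by position:', median_by_position(x1))
print('Median of x2 by position:', median_by_position(x2))
```

Adding the extreme value 100 moves the median only from the midpoint of 3 and 4 to the value 4, while the mean shown earlier jumps from 3.625 to about 14.3.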


## Mode

The mode is the most frequently occurring value in a data set. Unlike the mean or median, it can be applied to non-numerical data. One situation in which it is useful is for data whose possible values are independent of one another, e.g. the outcomes of rolling a die.

In [6]:
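# stats.mode returns an array in older SciPy versions and a scalar in newer ones;
# np.atleast_1d handles both cases
print('One mode of x1:', np.atleast_1d(stats.mode(x1)[0])[0])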

One mode of x1: 2


In [7]:
# stats.mode reports only one mode, so write a function that returns all of them
def mode(l):
    # count the number of times each element appears in the list
    counts = {}
    for e in l:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1

    # collect the elements that appear the most times
    maxcount = 0
    modes = set()
    for (key, value) in counts.items():
        if value > maxcount:
            maxcount = value
            modes = {key}
        elif value == maxcount:
            modes.add(key)

    # if every value appears exactly once, there is no mode
    if maxcount > 1 or len(l) == 1:
        return list(modes)
    return 'No Mode'

print('All of the modes of x1: ', mode(x1))
            

All of the modes of x1:  [2, 5]
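Because the mode only relies on counting, the same idea carries over to non-numerical data. The sketch below uses `collections.Counter` from the standard library; the helper name `modes_of` and the `colors` list are made up for illustration. Unlike the `mode` function above, it returns every value when all values are distinct rather than the string 'No Mode'.

```python
from collections import Counter

def modes_of(items):
    """Return all most-frequent items, sorted for a stable order."""
    counts = Counter(items)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(k for k, c in counts.items() if c == top)

# Hypothetical categorical data
colors = ['red', 'blue', 'red', 'green', 'blue', 'red']

print('Modes of colors:', modes_of(colors))
print('Modes of x1:', modes_of([1, 2, 2, 3, 4, 5, 5, 7]))
```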


For data that can take on many different values, such as returns data, there may not be a mode. In this case, we can bin the values, as when constructing a histogram, and then find the mode of the data set in which each value is replaced by the name of its bin. In other words, we find the bin that the values fall into most often.

In [8]:
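import pandas as pd

start = '2020-01-01'
end = '2022-07-30'
# load ADBE prices from a local CSV file
ADBE_df = pd.read_csv('/Users/brendan/Desktop/Python/BoostedAI/prices/ADBE.csv')
# percentage change between consecutive rows; drop the first entry, which is NaN
returns = ADBE_df['adjClose'].pct_change()[1:]
print('Mode of returns: ', mode(returns))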

Mode of returns:  No Mode


In [9]:
# since all of the returns are distinct, use a frequency distribution instead
# np.histogram returns the count in each bin and the bin edges

hist, bins = np.histogram(returns, 20) # use 20 bins
maxfreq = max(hist)

# find all the bins that are hit with frequency maxfreq
print('Mode of bins: ', [(bins[i], bins[i+1]) for i, j in enumerate(hist) if j == maxfreq])

Mode of bins:  [(-0.00495763656569323, 0.011216138209007853)]


## Geometric mean

While the arithmetic mean uses addition, the geometric mean uses multiplication: it is the nth root of the product of the n observations. For nonnegative data, the geometric mean is always less than or equal to the arithmetic mean, and the two are equal only when all values are the same.

In [10]:
print('Geometric mean of x1:', stats.gmean(x1))
print('Geometric mean of x2:', stats.gmean(x2))

Geometric mean of x1: 3.0941040249774403
Geometric mean of x2: 4.552534587620071
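To make the definition concrete, here is a minimal sketch that computes the geometric mean as the exponential of the mean of the logarithms, which equals the nth root of the product for strictly positive data. The helper name `geometric_mean` is hypothetical; `stats.gmean` above is the library routine.

```python
import numpy as np

def geometric_mean(values):
    """nth root of the product, computed through logs to avoid overflow."""
    arr = np.asarray(values, dtype=float)
    if np.any(arr <= 0):
        raise ValueError('this sketch requires strictly positive values')
    return float(np.exp(np.mean(np.log(arr))))

x1 = [1, 2, 2, 3, 4, 5, 5, 7]

g = geometric_mean(x1)
a = float(np.mean(x1))
print('Geometric mean of x1:', g)
print('Arithmetic mean of x1:', a)
print('Geometric <= arithmetic:', g <= a)
```

When all values are equal, for example `[4, 4, 4]`, both means coincide.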


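The geometric mean is not defined for negative observations. For returns R(t), which can be negative, add 1 to each observation to get the growth ratio 1 + R(t). This quantity is always nonnegative, since a return cannot be lower than -100%. We then take the geometric mean of the ratios and subtract 1.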

In [11]:
ratios = returns + np.ones(len(returns))
R_G = stats.gmean(ratios) - 1
print("Geometric mean of returns: ", R_G)

Geometric mean of returns:  -0.0008170849531605739


In [14]:
T = len(returns)
initial_price = ADBE_df['adjClose'][0]
final_price = ADBE_df['adjClose'][T]

print('Initial price: ', initial_price)
print("Final price: ", final_price)
print("Final price with Geo mean of returns: ", initial_price*(1+R_G)**T)


Initial price:  411.09
Final price:  147.13
Final price with Geo mean of returns:  147.12999999999113
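The agreement between the last two numbers is not a coincidence. Writing $P_0$ for the initial price, $P_T$ for the final price, and $R_t$ for the return in period $t$ (notation introduced here for illustration), the definition of the geometric mean return $R_G$ gives:

$$
P_T = P_0 \prod_{t=1}^{T} (1 + R_t), \qquad
1 + R_G = \left( \prod_{t=1}^{T} (1 + R_t) \right)^{1/T}
\quad\Longrightarrow\quad
P_0 \, (1 + R_G)^T = P_T .
$$

So the geometric mean is the constant per-period return that would produce the same final price; the small difference in the printed values is floating-point rounding.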


## Point Estimates can be deceiving

Means by nature hide a lot of information, as they collapse an entire distribution into one number. As a result, 'point estimates', or metrics that use a single number, can disguise large problems in the data. Be careful to ensure that you are not losing key information by summarizing the data, and rarely use a mean without also quantifying the spread.
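A small sketch of this point: the two made-up data sets below share the same mean and median but differ widely in spread. Standard deviation is used here as one possible measure of spread; that choice is an assumption for illustration.

```python
import numpy as np

# Two hypothetical data sets with identical mean and median
a = [4, 5, 5, 5, 6]
b = [-15, 0, 5, 10, 25]

for name, data in [('a', a), ('b', b)]:
    print(name,
          'mean:', np.mean(data),
          'median:', np.median(data),
          'std:', np.std(data))
```

Reporting only the mean would make `a` and `b` look identical, even though the values in `b` range far more widely.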