## This is an example of how to calculate measures for continuous quantitative data, including position, dispersion and shape measures, using Python, Pandas, NumPy and Matplotlib

* The formulas shown in this notebook are taken from the following reference:<br>
FÁVERO, L. P.; BELFIORE, P. **Manual de Análise de Dados: Estatística e Machine Learning com Excel®, SPSS®, Stata®, R® e Python®**. 2ª edição, 1288 p. Brasil: ccGEN LTC, 2024.<br>
Available in Brazil at:<br>
https://www.amazon.com.br/Manual-An-C3-A1lise-Dados-Estat-C3-ADstica-Learning-dp-8595159920/dp/8595159920

In [371]:
# importing libs and setting default plot style
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("dark_background")
import pandas as pd
import numpy as np

In [372]:
# importing data into a Pandas Dataframe
item_price_dataframe = pd.read_csv("item-price.csv")
# displaying head() and tail() altogether
from pandas import option_context
with option_context('display.max_rows', 10):
    print(item_price_dataframe)

    ITEM_ID  PRICE (US$)
0         1        189.0
1         2        195.0
2         3        199.0
3         4        189.0
4         5        197.0
..      ...          ...
95       96        189.0
96       97        179.0
97       98        189.0
98       99        199.0
99      100        195.0

[100 rows x 2 columns]


### Position measures

#### - Arithmetic Mean

![mean](mean.png)

In [373]:
# copying the data from the "PRICE (US$)" column of the item_price_dataframe to an ndarray of 1 dimension (1D) and transforming the data 
# from float to int type (to facilitate calculation)
item_price_ndarray = item_price_dataframe["PRICE (US$)"].to_numpy('int64')
print(f"item_price_ndarray =\n{item_price_ndarray}")
print(f"dtype = '{item_price_ndarray.dtype}'")
print(f"shape = {item_price_ndarray.shape} => d0 = {item_price_ndarray.shape[0]}")
print(f"dimensions = {item_price_ndarray.ndim}")
print(f"size = {item_price_ndarray.size}")

item_price_ndarray =
[189 195 199 189 197 189 199 202 199 209 189 179 175 199 205 219 229 205
 190 179 199 189 183 199 206 215 149 189 169 179 159 199 195 189 209 196
 189 165 170 179 170 175 169 189 195 199 199 199 189 182 199 209 229 199
 195 199 179 169 189 205 199 189 189 199 179 189 239 215 199 179 195 199
 209 205 179 185 179 169 179 189 199 209 169 159 179 185 189 179 199 199
 189 169 159 169 209 189 179 189 199 195]
dtype = 'int64'
shape = (100,) => d0 = 100
dimensions = 1
size = 100


In [374]:
# as it is a 1-dimension ndarray, specifying the axis for operations is optional - the axis of 1D ndarrays is always axis=0

In [375]:
# calculating the mean value
arithmetic_mean = np.sum(a=item_price_ndarray, axis=0)/np.size(a=item_price_ndarray)
arithmetic_mean

190.77

In [376]:
# or:

In [377]:
# calculating the mean value
arithmetic_mean = np.mean(a=item_price_ndarray, axis=0)
arithmetic_mean

190.77
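As a cross-check of the mean formula, the sketch below computes the arithmetic mean in three ways on a tiny series. The five values are just the first rows printed above, used here as an illustrative subset, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# illustrative subset: the first five prices shown in the DataFrame printout
prices = pd.Series([189.0, 195.0, 199.0, 189.0, 197.0], name="PRICE (US$)")

# sum of the values divided by the number of values (the formula itself)
manual_mean = prices.sum() / prices.size
# the same measure straight from pandas and from NumPy
pandas_mean = prices.mean()
numpy_mean = np.mean(prices.to_numpy())

print(f"{manual_mean:.2f} {pandas_mean:.2f} {numpy_mean:.2f}")
```

All three approaches should agree up to floating-point rounding.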

#### - Median

![median](median.png)

In [378]:
# sorting the item_price_ndarray for calculating the median
sorted_item_price_ndarray = np.sort(a=item_price_ndarray)
sorted_item_price_ndarray

array([149, 159, 159, 159, 165, 169, 169, 169, 169, 169, 169, 169, 170,
       170, 175, 175, 179, 179, 179, 179, 179, 179, 179, 179, 179, 179,
       179, 179, 179, 182, 183, 185, 185, 189, 189, 189, 189, 189, 189,
       189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 189,
       190, 195, 195, 195, 195, 195, 195, 196, 197, 199, 199, 199, 199,
       199, 199, 199, 199, 199, 199, 199, 199, 199, 199, 199, 199, 199,
       199, 199, 199, 199, 202, 205, 205, 205, 205, 206, 209, 209, 209,
       209, 209, 209, 215, 215, 219, 229, 229, 239])

In [379]:
# resetting variable for avoiding cache issues
median = np.nan
# calculating the median
# the book formula counts positions from 1, while ndarray indexes start at 0, so 1 is subtracted from each position
# when number of elements is odd
if(sorted_item_price_ndarray.size % 2 != 0):
    position_index_of_median_value = (sorted_item_price_ndarray.size+1)//2 - 1
    median = sorted_item_price_ndarray[position_index_of_median_value]
    # print(f"pos = ({sorted_item_price_ndarray.size}+1)/2={position_index_of_median_value}")
# when number of elements is even
else:
    position_index_of_median_value_1 = sorted_item_price_ndarray.size//2 - 1
    position_index_of_median_value_2 = sorted_item_price_ndarray.size//2
    median = (sorted_item_price_ndarray[position_index_of_median_value_1]+sorted_item_price_ndarray[position_index_of_median_value_2])/2
    # print(f"({sorted_item_price_ndarray[position_index_of_median_value_1]}+{sorted_item_price_ndarray[position_index_of_median_value_2]})/2=")
median

189.0

In [380]:
# or:

In [381]:
# if using the median method of NumPy, no need to sort the ndarray (it is sorted automatically, inside the method)
# calculating the median
np.median(a=item_price_ndarray)

189.0
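Textbook median formulas count positions from 1, while Python sequences are indexed from 0, which makes off-by-one mistakes easy. The sketch below is a minimal 0-based version checked against `np.median` on two tiny hypothetical samples; the function name `median_zero_based` is an assumption made for this illustration.

```python
import numpy as np

def median_zero_based(values):
    # sorting a copy of the input so the caller's data is left untouched
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    # odd n: the single central element sits at index n // 2
    if n % 2 != 0:
        return float(ordered[middle])
    # even n: average of the two central elements, at indexes n//2 - 1 and n//2
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_sample = [7, 1, 5]        # sorted: [1, 5, 7]    -> median 5.0
even_sample = [7, 1, 5, 3]    # sorted: [1, 3, 5, 7] -> median 4.0

print(median_zero_based(odd_sample), median_zero_based(even_sample))
```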

#### - Mode

In [382]:
# calculating the single modal number
def get_single_mode(number_list):
    # sorting the received number list (in case it is not yet sorted)
    sorted_number_list = sorted(number_list)
    # creating a dict structure for temporarily storing each unique number along with its occurrences at the received number_list
    number_count_dictionary={}
    for number in sorted_number_list:
        # if the unique number is not yet present at the number_count_dictionary, add it and set occurrences count to 1
        if number not in number_count_dictionary:
            number_count_dictionary[number]=1
        # if the unique number is already present, increment the existing occurrences count by 1
        else:
            number_count_dictionary[number]+=1
    # print(f"number_count_dictionary =\n{number_count_dictionary}\n")
    # print(f"{number_count_dictionary.items()}\n")
    # print(f"{number_count_dictionary.keys()}\n")
    # print(f"{number_count_dictionary.values()}\n")
    # print(f"{len(number_count_dictionary.items())} unique numbers\n")
    # # inversely sorting number_count_dictionary according to items values (occurrences count), so that the first item will contain the
    # # number which occurs the most, along with its occurrences count
    # number_count_tuple_list_by_count_reversed = sorted(number_count_dictionary.items(), key=lambda x:x[1])[::-1]
    # print(f"number_count_tuple_list_by_count_reversed =\n{number_count_tuple_list_by_count_reversed}\n")
    # # getting the modal tuple: (modal number, number of occurrences) - the first tuple in the sorted list above
    # single_mode_tuple = number_count_tuple_list_by_count_reversed[0]
    # print(f"single_mode_tuple = {single_mode_tuple}\n")
    # # getting the single modal number and its occurrences count from the single_mode_tuple
    # mode = single_mode_tuple[0]
    # count = single_mode_tuple[1]
    # print(f"single_mode is {mode} with {count} occurrences")
    # getting the max occurrences count once, then using a list comprehension to collect every number whose occurrences count
    # equals that max, and returning the first of them (the single modal number)
    max_occurrences = max(number_count_dictionary.values())
    return [number for number,occurrences in number_count_dictionary.items() if occurrences==max_occurrences][0]

get_single_mode(item_price_ndarray)

199

In [383]:
# or :

In [384]:
# calculating the single modal number
def get_single_mode_2(number_list):
    # print(f"original number_list\t\t= {number_list}\n")
    # using numpy unique() method to return a tuple with two arrays, the first being the sorted unique numbers and the other
    # being the respective counts of occurrences for each of these numbers. Each number in one array and its occurrences
    # count in the other share the same index, so the index of the max occurrences count in the 2nd array is also the
    # index of the modal number in the 1st array
    sorted_unique_numbers_list, occurences_of_each_number_list = np.unique(number_list, return_counts=True)
    # print(f"sorted_unique_numbers_list\t= {sorted_unique_numbers_list}\n")
    # print(f"occurences_of_each_number_list\t= {occurences_of_each_number_list}\n")
    # getting the index for the max occurrences count value at occurences_of_each_number_list
    max_occurence_index = np.argmax(occurences_of_each_number_list)
    # print(f"max_occurence_index = {max_occurence_index}\n")
    # with that index, it's then possible to get both the modal number from one array and its occurrences count from
    # the other array
    mode = sorted_unique_numbers_list[max_occurence_index]
    mode_occurrences = occurences_of_each_number_list[max_occurence_index]
    # print(f"single_mode is {mode} with {mode_occurrences} occurrences\n")
    return mode

get_single_mode_2(item_price_ndarray)

199

In [385]:
# or:

In [386]:
# calculating the single modal number
# using module stats from package scipy to get the mode result and extract both the modal number and its occurrences count from it
# (depending on the SciPy version, these fields may be scalars or one-element arrays)
from scipy import stats
mode_tuple = stats.mode(item_price_ndarray)
# print(f"single_mode is {mode_tuple[0]} with {mode_tuple[1]} occurrences\n")
mode_tuple[0]

199

In [387]:
# or:

In [388]:
# using module statistics to get the modal number and then count its occurrences at the item_price_ndarray
import statistics
mode = statistics.mode(item_price_ndarray)
count = item_price_ndarray.tolist().count(mode)
# print(f"single_mode is {mode} with {count} occurrences\n")
mode

199

In [389]:
# or:

In [390]:
# using class Counter from module collections to get a dictionary having as keys each unique number and as paired values the count of 
# occurrences of that number at the item_price_ndarray
from collections import Counter
counting_dictionary = Counter(item_price_ndarray)
# getting the max occurrences count from the list of values from the counting_dictionary
count = np.max(list(counting_dictionary.values()))
# getting the corresponding key (modal number) for that max value (occurrences count) - when both keys and values of the 
# counting_dictionary are converted to lists, each key and its value share the same index.
# In this case, the common index is found from the max occurrences count and then used to get the modal number from the keys list
mode = list(counting_dictionary.keys())[list(counting_dictionary.values()).index(count)]
# print(f"single_mode is {mode} with {count} occurrences\n")
mode

199
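All the approaches above return a single modal number, so when two or more values tie for the highest count only one of them is reported. The sketch below is one possible way to list every mode; `get_all_modes` and the sample data are hypothetical names and values introduced for this illustration, while `statistics.multimode` is part of the standard library (Python 3.8+).

```python
import statistics
import numpy as np

def get_all_modes(number_list):
    # unique values (sorted) and how many times each one occurs
    unique_numbers, counts = np.unique(number_list, return_counts=True)
    # keeping every value whose count equals the highest count
    return unique_numbers[counts == counts.max()].tolist()

# hypothetical sample where 189 and 199 both occur twice
bimodal_sample = [179, 189, 189, 199, 199, 205]

print(get_all_modes(bimodal_sample))          # [189, 199]
print(statistics.multimode(bimodal_sample))   # [189, 199]
```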

#### - Quartiles

In [391]:
# use percentiles formula.

#### - Deciles

In [392]:
# use percentiles formula.

#### - Percentiles

![percentile](percentile.png)

In [393]:
# creating a percentile function for getting any kind of percentile, including median, quartile, decile, etc.
# k_score is the k-th score or centile, e.g., for median it's 50 (from 50th percentile), for 1st Quartile it's 25 (from 25th percentile),
# for 3rd decile it's 30 (from 30th percentile), etc.
import math
def get_percentile(k_score, number_list):
    # sorting the input number_list
    sorted_number_list = sorted(number_list)
    # calculating n
    n = len(sorted_number_list)
    # calculating the position index, which the percentile result corresponds to, at the sorted_number_list (as arrays start at 0, 1 is
    # subtracted at the end to get the right position at ndarrays, when compared to the book formulas, whose positions start at 1)
    pos = ((n-1)*k_score/100)+1-1
    # splitting the pos index into its decimal and integer parts (pos is always a float here, as it results from a true division,
    # so an isinstance(pos, int) check would never succeed)
    decimal_part, integer_part = math.modf(pos)
    integer_part = int(integer_part)
    # if the pos index has a nonzero decimal part, the percentile result is a value between two numbers at the list. In this case, a 
    # weighted average (linear interpolation) is done: the integer part of pos is the index on the left and the subsequent index is 
    # the index on the right. The value at the left index is multiplied by the complement of the decimal part and added to the value 
    # at the right index multiplied by the decimal part.
    if(decimal_part != 0):
        return (sorted_number_list[integer_part]*(1-decimal_part)+sorted_number_list[integer_part+1]*(decimal_part))
    # whereas, if the pos index is a whole number, then it is the only index needed to fetch the percentile number value directly at the 
    # sorted_number_list. No need for any weighting or additional calculations (this also avoids reading past the end for k_score=100)
    else:
        return sorted_number_list[integer_part]

# getting the 1st quartile, which is the same as 25th percentile
# get_percentile(25,item_price_ndarray)
# getting the 2nd quartile, which is the same as 50th percentile or median
# get_percentile(50,item_price_ndarray)
# getting the 3rd quartile, which is the same as 75th percentile
get_percentile(75,item_price_ndarray)

199.0
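To make the interpolation step concrete, here is a small worked example on a hypothetical four-value list, where the 25th percentile falls between the first two elements. The variable names are assumptions for this sketch, and the result is compared with `np.percentile`, which uses linear interpolation by default.

```python
import numpy as np

sorted_sample = [10, 20, 30, 40]   # hypothetical, already sorted
k_score = 25

n = len(sorted_sample)
pos = (n - 1) * k_score / 100      # 3 * 0.25 = 0.75, between indexes 0 and 1
left_index = int(pos)              # 0
fraction = pos - left_index        # 0.75

# weighted average of the two neighbours: 10 * 0.25 + 20 * 0.75 = 17.5
percentile_25 = (sorted_sample[left_index] * (1 - fraction)
                 + sorted_sample[left_index + 1] * fraction)

print(percentile_25)
```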

In [394]:
# or:

In [395]:
# getting the 1st quartile, which is the same as 25th percentile
# np.percentile(a=item_price_ndarray, q=25)
# getting the 2nd quartile, which is the same as 50th percentile or median
# np.percentile(a=item_price_ndarray, q=50)
# getting the 3rd quartile, which is the same as 75th percentile
np.percentile(a=item_price_ndarray, q=75)

199.0
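Since quartiles and deciles are just particular percentiles, `np.percentile` can return all of them in one call by passing a sequence to `q`. The sketch below uses a hypothetical sample of the integers 1 to 11 so the results are easy to verify by hand.

```python
import numpy as np

# hypothetical sample: 1, 2, ..., 11
sample = np.arange(1, 12)

# the three quartiles are the 25th, 50th and 75th percentiles
quartiles = np.percentile(sample, [25, 50, 75])
# the nine deciles are the 10th, 20th, ..., 90th percentiles
deciles = np.percentile(sample, np.arange(10, 100, 10))

print(f"quartiles = {quartiles.tolist()}")
print(f"deciles = {deciles.tolist()}")
```

With 11 equally spaced values, the quartiles come out as 3.5, 6 and 8.5, and the deciles as the whole numbers 2 to 10.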

### Dispersion measures

#### - Range

In [396]:
def get_range(number_list):
    return np.max(number_list) - np.min(number_list)

get_range(item_price_ndarray)

90
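NumPy also offers `np.ptp` ("peak to peak"), which returns the same maximum-minus-minimum difference. A minimal check on a hypothetical four-value sample:

```python
import numpy as np

# hypothetical sample spanning the same extremes as the price data
sample = np.array([149, 189, 199, 239])

manual_range = sample.max() - sample.min()
ptp_range = np.ptp(sample)

print(manual_range, ptp_range)
```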

#### - Variance

![sample-variance](sample-variance.png)

In [397]:
# In short, variance is a dispersion measure that reflects how far the values of a distribution are from the arithmetic mean of that
# distribution. A population variance is calculated by dividing the sum of the squared differences between each 
# value of the analysed population and the arithmetic mean by N, N being the TOTAL number of individuals in that population. 
# A sample variance, on the other hand, is an estimate of the population variance, used when you don't have all data 
# available from the considered population, but only part of it. It is calculated the same way but, instead of N, the
# denominator is (n-1), where n is the sample size. The equation above is that of a sample variance.
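The difference between the two denominators is easiest to see on a small hand-checkable case. The sketch below uses a hypothetical eight-value sample with mean 5 and a sum of squared deviations of 32, so the population variance is 32/8 = 4 and the sample variance is 32/7 ≈ 4.57.

```python
import numpy as np

# hypothetical sample: mean is 40 / 8 = 5
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# squared deviations from the mean: 9, 1, 1, 1, 0, 0, 4, 16 -> sum = 32
sum_of_squares = ((sample - sample.mean()) ** 2).sum()

population_variance = sum_of_squares / sample.size      # divides by N
sample_variance = sum_of_squares / (sample.size - 1)    # divides by n - 1

print(f"population = {population_variance:.2f}, sample = {sample_variance:.2f}")
```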

In [398]:
# calculating both population and sample variances for the data in item_price_ndarray
my_mean = np.mean(item_price_ndarray)
my_n = len(item_price_ndarray)

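def calculate_variance(number_list, sample=False):
    # computing the mean and n from the received number_list itself, so the function also works for data other than item_price_ndarray
    mean = np.mean(number_list)
    n = len(number_list)
    accumulator = 0
    for number in number_list:
        accumulator += np.square(number-mean)
    if(sample):
        return accumulator/(n-1)
    else:
        return accumulator/n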

print(f"sample_variance = {round(calculate_variance(item_price_ndarray, True),2)} --- population_variance = {round(calculate_variance(item_price_ndarray, False),2)}")

round(calculate_variance(item_price_ndarray, sample=True),2)

sample_variance = 244.02 --- population_variance = 241.58


244.02

In [399]:
# or:

In [400]:
# calculating both population and sample variances for the data in item_price_ndarray
# the same np.var() method computes both variances; the difference is the parameter "ddof" (Delta Degrees of Freedom), which is the value 
# subtracted from N at the denominator of the equation. With ddof=1 the denominator becomes "n-1", for estimating the variance 
# from a sample rather than from the whole considered population. By default ddof=0 and the denominator is N, assuming the array holds all
# elements of the analysed population - population variance.
# Below both variances are calculated and rounded to 2 decimals.
population_variance = np.round(np.var(a=item_price_ndarray), 2)
sample_variance = np.round(np.var(a=item_price_ndarray, ddof=1), 2)
print(f"sample_variance = {sample_variance} --- population_variance = {population_variance}")
sample_variance

sample_variance = 244.02 --- population_variance = 241.58


244.02

#### - Standard Deviation

![standard deviation](standard_deviation.png)

In [401]:
# standard deviation is another dispersion measure and is simply the square root of the variance of a certain data distribution. It is 
# easier to interpret because, unlike variance, the standard deviation has the same units as the original data of the distribution.
# Therefore, we can express the typical difference between the values of a distribution and its mean using the 
# same units

In [402]:
# calculating both population and sample standard_deviation (sd) for the data in item_price_ndarray
population_standard_deviation = np.round(np.sqrt(calculate_variance(item_price_ndarray)), 2)
sample_standard_deviation = np.round(np.sqrt(calculate_variance(item_price_ndarray, True)), 2)
print(f"sample_standard_deviation = {sample_standard_deviation} --- population_standard_deviation = {population_standard_deviation}")
sample_standard_deviation

sample_standard_deviation = 15.62 --- population_standard_deviation = 15.54


15.62

In [403]:
# or:

In [404]:
# calculating both population and sample standard_deviation (sd) for the data in item_price_ndarray
population_standard_deviation = np.round(np.std(a=item_price_ndarray), 2)
sample_standard_deviation = np.round(np.std(a=item_price_ndarray, ddof=1), 2)
print(f"sample_standard_deviation = {sample_standard_deviation} --- population_standard_deviation = {population_standard_deviation}")
sample_standard_deviation

sample_standard_deviation = 15.62 --- population_standard_deviation = 15.54


15.62
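One detail worth keeping in mind when mixing libraries: `np.std` defaults to `ddof=0` (population), while pandas `Series.std` defaults to `ddof=1` (sample). The sketch below shows the difference on a hypothetical eight-value sample whose population standard deviation is exactly 2.

```python
import numpy as np
import pandas as pd

# hypothetical sample with population variance 4, so population sd = 2
values = [2, 4, 4, 4, 5, 5, 7, 9]

numpy_default = np.std(values)                     # ddof=0 -> population sd
pandas_default = pd.Series(values).std()           # ddof=1 -> sample sd
pandas_population = pd.Series(values).std(ddof=0)  # matching NumPy's default

print(f"numpy default = {numpy_default:.4f}")
print(f"pandas default = {pandas_default:.4f}")
print(f"pandas ddof=0 = {pandas_population:.4f}")
```

Passing `ddof` explicitly in both libraries avoids silently comparing a sample figure with a population one.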

#### - Standard Error

![sample standard error](sample_standard_error.png)

In [405]:
# standard error is yet another dispersion measure: the standard deviation divided by the square root of the number of 
# elements n. As with variance and standard deviation above, it can be computed from the population standard deviation 
# or estimated from the sample standard deviation. The standard error is the standard deviation of the mean, that is, 
# how much the mean is expected to vary when calculated on many samples of the same size from the same population. It 
# can be reported right beside the mean (e.g. mean = 550 ± 12.8 (SE)) or used to build a confidence interval (e.g. for
# a 95% confidence interval, CI = x̄ ± (1.96 × SE) => if x̄ = sample mean = 550 and SE = standard error = 12.8 => 
# 95% CI [525, 575]). Loosely speaking, this interval is the range in which the population mean is expected to lie 
# with 95% confidence, though the current sample mean is 550. Reporting the confidence interval gives less work to the 
# reader, who gets the limits already calculated, while with only the standard error beside the mean the reader still 
# has to multiply, add and subtract to get them. Either way, the usefulness of the standard error is to tell how 
# trustworthy the arithmetic mean is and the limits within which newly calculated means should be expected. The higher 
# the "n", the lower the standard error tends to be.
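The confidence interval arithmetic from the example above (mean 550, SE 12.8) can be reproduced in a few lines. The value 1.96 is the usual normal-approximation multiplier for 95% confidence, and the variable names are illustrative.

```python
# figures taken from the worked example in the text
sample_mean = 550.0
standard_error = 12.8
z_95 = 1.96  # normal-approximation multiplier for a 95% confidence level

margin = z_95 * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(f"95% CI = [{lower:.1f}, {upper:.1f}]")   # 95% CI = [524.9, 575.1]
```

Rounded to whole units, this gives the [525, 575] interval quoted in the text.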

In [406]:
# calculating both population and sample standard errors (SE) for the data in item_price_ndarray
population_standard_error = round(population_standard_deviation/np.sqrt(my_n),2)
sample_standard_error = round(sample_standard_deviation/np.sqrt(my_n),2)
print(f"sample_standard_error = {sample_standard_error} --- population_standard_error = {population_standard_error}")
sample_standard_error

sample_standard_error = 1.56 --- population_standard_error = 1.55


1.56

In [407]:
# or:

In [408]:
# calculating both population and sample standard errors (SE) for the data in item_price_ndarray
# the ddof (Delta Degrees of Freedom) parameter of stats.sem() defaults to 1 (sample). If population SE is desired, set ddof=0
population_standard_error = round(stats.sem(a=item_price_ndarray, ddof=0),2)
sample_standard_error = round(stats.sem(a=item_price_ndarray),2)
print(f"sample_standard_error = {sample_standard_error} --- population_standard_error = {population_standard_error}")
sample_standard_error

sample_standard_error = 1.56 --- population_standard_error = 1.55


1.56

#### - Coefficient of Variation

![coefficient of variation](coefficient-of-variation.png)

In [409]:
# the coefficient of variation is a dispersion measure calculated by dividing the standard deviation by the mean of a series of 
# continuous numerical data, either from the whole population or from a sample of it. As such, this
# measure reflects how disperse values are relative to the mean, and it has no unit: it's a dimensionless number. Therefore, it is more 
# useful than the standard deviation for comparing dispersions of different distributions, with different units or widely different means. It
# can be presented as a decimal number or as a percentage, and the lower its value, the closer the values of the 
# distribution are to their mean.

In [410]:
# calculating both population and sample coefficients of variation (CV) for the data in item_price_ndarray
population_coefficient_of_variation = population_standard_deviation/my_mean
sample_coefficient_of_variation = sample_standard_deviation/my_mean
print(f"sample_coefficient_of_variation = {round(sample_coefficient_of_variation,4)} --- population_coefficient_of_variation = {round(population_coefficient_of_variation,4)}")
sample_coefficient_of_variation_in_percent = f"{round(sample_coefficient_of_variation*100, 2)}%"
sample_coefficient_of_variation_in_percent

sample_coefficient_of_variation = 0.0819 --- population_coefficient_of_variation = 0.0815


'8.19%'
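Because the coefficient of variation is dimensionless, rescaling the data (for instance, expressing prices in cents instead of dollars) should leave it unchanged, even though the standard deviation grows 100 times. The sketch below illustrates this with a hypothetical five-value sample and a helper name, `coefficient_of_variation`, introduced only for this example; `scipy.stats.variation` is used as a cross-check of the population version.

```python
import numpy as np
from scipy import stats

def coefficient_of_variation(values, ddof=0):
    # standard deviation divided by the mean; ddof=1 gives the sample version
    return np.std(values, ddof=ddof) / np.mean(values)

# hypothetical prices, first in dollars and then in cents
prices_usd = np.array([149.0, 179.0, 189.0, 199.0, 239.0])
prices_cents = prices_usd * 100

cv_usd = coefficient_of_variation(prices_usd)
cv_cents = coefficient_of_variation(prices_cents)

print(f"sd (US$) = {np.std(prices_usd):.2f}, sd (cents) = {np.std(prices_cents):.2f}")
print(f"cv (US$) = {cv_usd:.4f}, cv (cents) = {cv_cents:.4f}")
```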

### Shape measures

#### - Coefficient of Asymmetry (Skewness)

![coefficient of asymmetry - skewness](fisher-asymmetry.png)

In [411]:
# Fisher's coefficient of asymmetry, also known as the skewness of a distribution curve, is a shape measure that tells us how far the 
# top of the curve is shifted from where it would be if the distribution were symmetrical. In a symmetrical curve, the three central position 
# measures are equal and perfectly centralized: mean, median and mode. In a curve skewed to the left, that is, with g1 < 0, there are 
# some negative outliers compared to the mean. These very low values pull the mean of the curve to the left, leaving the 
# median and mode to its right (typically mean<median<mode), so that, graphically, the tail of the curve is more prominent to the left while 
# the top is shifted to the right. The top always follows the mode, the tail follows the mean, and the median usually lies between both. On 
# the other hand, in a curve skewed to the right, that is, with g1 > 0, there are some positive outliers compared to the mean. These very 
# high values pull the mean of the curve to the right (typically mode<median<mean), so that, graphically, the tail of the curve is more 
# prominent to the right, while the top is shifted to the left, along with the mode and median. If the curve has no outliers and all positive 
# and negative deviations are equally distributed around the center, making the mean, median and mode have the same value, the curve is 
# centered and the skewness is zero => g1 = 0.

In [412]:
# calculating the skewness (coefficient of asymmetry) for the data distribution at item_price_ndarray
my_skewness = round(stats.skew(a=item_price_ndarray, bias=False), 4)
my_skewness

0.0899
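For readers who want to see what lies behind `stats.skew(..., bias=False)`, the sketch below computes a bias-corrected skewness by hand from the second and third central moments and compares it with SciPy. The correction factor sqrt(n(n-1))/(n-2) is one common form, assumed here to match SciPy's; the function name and sample data are hypothetical.

```python
import numpy as np
from scipy import stats

def sample_skewness(values):
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    # second and third central moments (both divided by n)
    m2 = np.mean(deviations ** 2)
    m3 = np.mean(deviations ** 3)
    # biased skewness, then the small-sample correction factor
    g1_biased = m3 / m2 ** 1.5
    return np.sqrt(n * (n - 1)) / (n - 2) * g1_biased

# hypothetical sample with one low and one high extreme value
sample = [149, 179, 189, 189, 199, 199, 199, 239]

print(f"manual = {sample_skewness(sample):.4f}")
print(f"scipy  = {stats.skew(sample, bias=False):.4f}")
```

A perfectly symmetrical sample, such as `[1, 2, 3, 4, 5]`, should give a skewness of zero with either approach.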

#### - Coefficient of Kurtosis

![coefficient of kurtosis](fisher-kurtosis.png)

In [413]:
# Fisher's coefficient of kurtosis is another shape measure, which refers to the flattening of the distribution curve. The more the 
# curve is flattened compared to the normal curve, the more disperse its values are regarding the mean, while, on the contrary, when
# the curve is narrowed (more peaked) compared to the normal curve, its values tend to be closer to the mean and less disperse.
# Kurtosis is driven mainly by the outliers (extremes) of a distribution and is little affected by central values. Distribution
# curves with zero excess kurtosis (g2=0) are called mesokurtic; distribution curves with positive excess kurtosis are called leptokurtic 
# (g2>0); while distribution curves with negative excess kurtosis (g2<0) are called platykurtic.

In [414]:
# calculating the coefficient of kurtosis for the data distribution at item_price_ndarray
my_kurtosis = round(stats.kurtosis(a=item_price_ndarray, bias=False), 4)
my_kurtosis

0.6695
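In the same spirit, the bias-corrected excess kurtosis can be computed by hand from the second and fourth central moments. The sketch below assumes the common small-sample correction (n-1)/((n-2)(n-3)) · ((n+1)·g2 + 6) and checks it against `stats.kurtosis(..., bias=False)`; the function name and sample data are hypothetical.

```python
import numpy as np
from scipy import stats

def sample_excess_kurtosis(values):
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    # second and fourth central moments (both divided by n)
    m2 = np.mean(deviations ** 2)
    m4 = np.mean(deviations ** 4)
    # biased excess kurtosis: 0 for a normal distribution
    g2_biased = m4 / m2 ** 2 - 3
    # small-sample correction (requires n > 3)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2_biased + 6)

# hypothetical sample with one low and one high extreme value
sample = [149, 179, 189, 189, 199, 199, 199, 239]

print(f"manual = {sample_excess_kurtosis(sample):.4f}")
print(f"scipy  = {stats.kurtosis(sample, bias=False):.4f}")
```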