## Functional Programming: Rudimentary Statistics

return obj (from function): A function may return an object. To save that object, assign the result of the function call to a variable, e.g., var1 = function(obj1, obj2, . . .)
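As a minimal sketch of what returning an object means in practice, the example below contrasts a function that returns its result with one that only prints it. The names `add` and `show_sum` are hypothetical and used only for illustration.

```python
def add(a, b):
    # return hands the result back to the caller
    return a + b

def show_sum(a, b):
    # no return statement, so the call evaluates to None
    print(a + b)

var1 = add(2, 3)       # var1 refers to the returned object
var2 = show_sum(2, 3)  # prints the sum, but nothing is returned
print(var1, var2)
```

Only `var1` holds a usable value afterwards; `var2` is `None` because `show_sum` never returns anything.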

# Building a Function

In [1]:
# def function_name(object1, object2, ..., objectn):
   # <operations> 


# Total Function:

$\sum_{i=0}^{n-1} x_{i}$

In [2]:
n = 0
total = 0
values = [i for i in range(10)]

print("total\t", "value")
for value in values:
    total += value
    print(total,"\t", value)
    
print("final total:", total)

total	 value
0 	 0
1 	 1
3 	 2
6 	 3
10 	 4
15 	 5
21 	 6
28 	 7
36 	 8
45 	 9
final total: 45


In [3]:
# This is a bad idea!! don't keep copying and pasting old code!
print("total\t", "value")
total = 0
values = [i for i in range(0,1000,2)]
for value in values:
    total += value
    print(total, "\t", value)

total	 value
0 	 0
2 	 2
6 	 4
12 	 6
20 	 8
30 	 10
42 	 12
56 	 14
72 	 16
90 	 18
110 	 20
132 	 22
156 	 24
182 	 26
210 	 28
240 	 30
272 	 32
306 	 34
342 	 36
380 	 38
420 	 40
462 	 42
506 	 44
552 	 46
600 	 48
650 	 50
702 	 52
756 	 54
812 	 56
870 	 58
930 	 60
992 	 62
1056 	 64
1122 	 66
1190 	 68
1260 	 70
1332 	 72
1406 	 74
1482 	 76
1560 	 78
1640 	 80
1722 	 82
1806 	 84
1892 	 86
1980 	 88
2070 	 90
2162 	 92
2256 	 94
2352 	 96
2450 	 98
2550 	 100
2652 	 102
2756 	 104
2862 	 106
2970 	 108
3080 	 110
3192 	 112
3306 	 114
3422 	 116
3540 	 118
3660 	 120
3782 	 122
3906 	 124
4032 	 126
4160 	 128
4290 	 130
4422 	 132
4556 	 134
4692 	 136
4830 	 138
4970 	 140
5112 	 142
5256 	 144
5402 	 146
5550 	 148
5700 	 150
5852 	 152
6006 	 154
6162 	 156
6320 	 158
6480 	 160
6642 	 162
6806 	 164
6972 	 166
7140 	 168
7310 	 170
7482 	 172
7656 	 174
7832 	 176
8010 	 178
8190 	 180
8372 	 182
8556 	 184
8742 	 186
8930 	 188
9120 	 190
9312 	 192
9506 	 194
9702 	 196

In [4]:
# build a function:
def total(lst):
    total_ = 0
    # in original I used the index of the list
    # ... 
    # n = len(lst)
    # for i in range(n)
    for val in lst:
        total_ += val
    return total_
total(values)


249500

In [5]:
total([i for i in range(-1000, 10000, 53)])

932984

In [6]:
# now you never have to build this code again, you just have to call it

In [7]:
import random
X1 = [3, 6, 9, 12, 15,18,21,24,27,30]
X2 = [random.randint(0,100) for i in range(10)]
total(X1), total(X2)

(165, 520)

## Statistical Functions
| New Concepts | Description |
| --- | --- |
| Operators e.g., !=, %, +=, \*\* | The operator != tests whether the values on either side of the operator are not equal; _a % b_ returns the remainder of $a / b$; _a += b_ sets a equal to $a + b$; _a ** b_ raises a to the b power ($a^b$). |
| Dictionary | A dictionary is a data structure that uses keys instead of index values. Each unique key references an object linked to that key. |
| Dictionary Methods e.g., _dct.values()_ | _dct.values()_ returns a view of the objects referenced by the dictionary's keys; wrap it in _list()_ to get a list. |
| Default Function Values | A function may assign default values to its parameters, which are used when the caller does not pass those arguments, e.g., _def function(val1 = 0, val2 = 2, …)_ |
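
The concepts in the table can be tried out in a few lines. This is only an illustrative sketch: the dictionary `counts` and the function `raise_to` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Operators from the table above
print(7 != 3)    # True: the two values differ
print(7 % 3)     # 1: remainder of 7 / 3
a = 2
a += 5           # a is now 7
print(a ** 2)    # 49

# Dictionary: keys reference objects instead of positional indices
counts = {"apple": 2, "pear": 5}
counts["apple"] += 1
print(list(counts.values()))   # [3, 5]

# Default function value: power is 2 unless the caller overrides it
def raise_to(base, power=2):
    return base ** power

print(raise_to(3), raise_to(3, 3))  # 9 27
```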

## Mean Function

# Let $X_1, X_2,...,X_n$ represent $n$ random variables. For a given dataset, useful descriptive statistics of central tendency include mean, median, and mode, which we built as functions in a previous chapter. 

# We define the mean of a set of $n$ numbers as:
# $\bar{X} = \frac{\sum_{i=0}^{n-1} x_{i}} {n}$





In [8]:
def mean(lst):
    n = len(lst)
    mean_ = total(lst) / n
    return mean_
mean(X1), mean(X2)
    

(16.5, 52.0)

In [9]:
# Now let's build the rest of the summary statistical functions

# 1. median
# 2. mode
# 3. variance
# 4. standard deviation
# 5. covariance
# 6. correlation

# Median: the middle value of a sorted list. It is less sensitive to outliers than the mean.
 



# For a sorted series of odd length $n$ with indices starting at $i = 0$, the median is the value at index $\frac{n - 1}{2}$.

# For a sorted series of even length $n$, the median is the mean of the two values in the middle of the list. The indices of these values are:

# $$i_1 = \frac{n}{2} - 1; \quad i_2 = \frac{n}{2}$$

In [10]:
def median(lst):
    n = len(lst)
    lst = sorted(lst)
    # two cases:
    # 1. list of odd length
    # i % j gives the remainder upon dividing i by j
    if n % 2 != 0:
        # list length is odd: take the single middle value
        middle_index = (n - 1) // 2
        median_ = lst[middle_index]
    # 2. list of even length
    else:
        upper_middle_index = n // 2
        lower_middle_index = upper_middle_index - 1
        # pass slice with two middle values to mean()
        median_ = mean(lst[lower_middle_index : upper_middle_index + 1])
        
    return median_
# . . . 
median1 = median(X1)
median2 = median(X2)
print("median of X1:", median1)
print("median of X2:", median2)

median of X1: 16.5
median of X2: 46.5


In [11]:
# make X1 odd in length by dropping its last element:
# this tests the first (odd-length) case in the median() function
median(X1[:-1])

15

In [12]:
sorted(X2)

[25, 36, 37, 41, 43, 50, 52, 55, 89, 92]
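
One way to sanity-check the hand-built `median()` is to compare it with the standard library's `statistics.median`. The sketch below hard-codes `X1` and the sorted `X2` values printed above so that it runs on its own; `X2_sorted` is a name introduced here only for the check.

```python
import statistics

X1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
# sorted X2 values copied from the output above (X2 itself is random)
X2_sorted = [25, 36, 37, 41, 43, 50, 52, 55, 89, 92]

median_x1 = statistics.median(X1)            # even length: mean of 15 and 18
median_x2 = statistics.median(X2_sorted)     # even length: mean of 43 and 50
median_x1_odd = statistics.median(X1[:-1])   # odd length: the middle value
print(median_x1, median_x2, median_x1_odd)
```

The three values should agree with the outputs of `median(X1)`, `median(X2)`, and `median(X1[:-1])` shown above.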

## Mode: the most frequently occurring value(s)

In [13]:
def mode(lst):
    count_dct = {}
    # create an entry for each value with a count of 0
    for key in lst:
        count_dct[key] = 0
    # add up each occurrence
    for key in lst:
        count_dct[key] += 1
    # calculate max_count up front
    max_count = max(count_dct.values())
    # now we can compare each count to the max count
    mode_ = []
    
    # call the key and value it is paired to:
    for key, count in count_dct.items():
        if count == max_count:
            mode_.append(key)
    
    return mode_

lst = [1,1,1,1,1,2,3,4,5,5,5,5,5,1000,1000]
mode(lst)

[1, 5]
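
The same counting logic can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative illustration, not a replacement for the function above; `mode_with_counter` is a hypothetical name.

```python
from collections import Counter

def mode_with_counter(lst):
    # Counter tallies the occurrences of each value in one pass
    counts = Counter(lst)
    max_count = max(counts.values())
    # keep every value tied for the highest count
    return [key for key, count in counts.items() if count == max_count]

lst = [1, 1, 1, 1, 1, 2, 3, 4, 5, 5, 5, 5, 5, 1000, 1000]
print(mode_with_counter(lst))
```

Like `mode()`, this returns a list so that ties (here 1 and 5, with five occurrences each) are all reported.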

## Variance

An average alone does not give a complete description of the data: it does not tell us the shape of a distribution. In this section, we will build functions to calculate statistics describing the distribution of variables and their relationships. The first of these is the variance of a list of numbers.


# We define population variance as:

# $$ \sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n}$$

# For a sample, we divide by the degrees of freedom instead of $n$:
# $$DoF = n - 1$$

# We define sample variance as:

# $$ S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$

In [14]:
def variance(lst, sample = True):
    # save mean value of list:
    list_mean = mean(lst)
    # Use n to calculate average of sum squared diffs
    n = len(lst)
    DoF = n - 1
    # create value we can add squared diffs to: 
    sum_sq_diff = 0
    
    for val in lst:
        diff = val - list_mean
        sum_sq_diff += (diff) ** 2
        # print(val, list_mean, sum_sq_diff)
    if not sample:
        # normalize result by dividing by n:
        variance_ = sum_sq_diff / n
    else: 
        # normalize by dividing by (n-1) for samples:
        variance_ = sum_sq_diff / DoF
    return variance_
variance(X1, sample = True), variance(X1, sample = False)
    

(82.5, 74.25)

In [15]:
variance(X2, sample = True), variance(X2, sample = False)

(488.22222222222223, 439.4)
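
As a cross-check, the standard library's `statistics.variance` (sample, divides by $n - 1$) and `statistics.pvariance` (population, divides by $n$) should reproduce the `X1` results above. The sketch redefines `X1` so it runs independently of the notebook.

```python
import statistics

X1 = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]

sample_var = statistics.variance(X1)       # divides by n - 1
population_var = statistics.pvariance(X1)  # divides by n
print(sample_var, population_var)

# the two differ only by the factor n / (n - 1)
n = len(X1)
rescaled = population_var * n / (n - 1)
print(rescaled)
```

Rescaling the population variance by $n / (n - 1)$ recovers the sample variance, which is why the gap between the two shrinks as $n$ grows.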

# Standard Deviation: the square root of the variance, which measures the typical distance of values from the mean in the data's own units
# $sd = \sqrt{S^2}$

In [23]:
# calculate standard deviation
def SD(lst, sample = True):
    SD_ = variance(lst, sample) ** (1/2)
    return SD_
SD(X1, sample = True), SD(X1, sample = False)

(9.082951062292475, 8.616843969807043)

In [24]:
SD(X2, sample = True), SD(X2, sample = False)

(22.095751225568733, 20.961870145576228)

# Standard Error: 
The standard deviation of the distribution from which the mean of your sample is drawn

This describes how much a given random sample mean $\bar{x}_i$ is expected to deviate from the population mean $\mu$. It is the standard deviation of the probability distribution of the random variable $\bar{X}$, which represents the means of all possible samples of a single given sample size $n$. As $n$ increases, $\bar{X}$ can be expected to deviate less from $\mu$, so standard error decreases. Because the population standard deviation $\sigma$ is rarely given, we again use an _estimator_ for standard error, denoted $s_{\bar{x}}$, which divides the sample standard deviation $s$ by $\sqrt{n}$:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Population data has no standard error, as $\mu$ can only take on a single value.
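
The claim that the spread of sample means shrinks as $n$ grows can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be Uniform(0, 1), whose standard deviation is $\sqrt{1/12}$, and the helper `spread_of_sample_means` is a hypothetical name introduced only for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def spread_of_sample_means(n, trials=2000):
    # draw `trials` samples of size n from Uniform(0, 1) and
    # return the standard deviation of their means
    means = [
        sum(random.random() for _ in range(n)) / n
        for _ in range(trials)
    ]
    return statistics.stdev(means)

sigma = (1 / 12) ** 0.5  # population SD of Uniform(0, 1)
results = {}
for n in (4, 25, 100):
    results[n] = spread_of_sample_means(n)
    predicted = sigma / n ** 0.5
    print(n, round(results[n], 4), round(predicted, 4))
```

For each sample size, the observed spread of the sample means should land close to $\sigma / \sqrt{n}$ and fall as $n$ increases.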

In [27]:
# Build out a standard error:
def STE(lst, sample = True):
    n = len(lst)
    se = SD(lst, sample) / n ** (1/2)
    
    return se

In [28]:
SD(X1, sample = True), STE(X1, sample = True)

(9.082951062292475, 2.872281323269014)

In [29]:
SD(X2, sample = True), STE(X2, sample = True)

(22.095751225568733, 6.987290048525409)

In [30]:
# standard error is smaller than standard deviation by a factor of the square root of n (about 3.16 for n = 10)

## Covariance

To calculate covariance, we sum the products of the differences between each observed value and the mean of its list, for _i = 0_ through _n - 1_ where _n_ is the number of observations, and divide the sum by _n_:

# $cov_{pop}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n}$

We pass two lists through the _covariance()_ function. As with the _variance()_ and _SD()_ functions, we can also compute the sample covariance, which divides by $n - 1$ instead of $n$.

# $cov_{sample}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n - 1}$
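
A tiny worked check of the two formulas above, using lists small enough to verify by hand. `covariance_sketch` is a hypothetical helper written for this illustration, separate from the notebook's own `covariance()` function, and raising an error for lists of unequal length is an assumption made here.

```python
def covariance_sketch(xs, ys, sample=False):
    # hypothetical helper, not the notebook's covariance() function
    if len(xs) != len(ys):
        # assumption: treat mismatched lengths as an error
        raise ValueError("lists must be the same length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # sum of the products of paired deviations from each mean
    sum_products = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sum_products / (n - 1 if sample else n)

xs = [1, 2, 3]
ys = [2, 4, 6]
cov_pop = covariance_sketch(xs, ys)                  # 4 / 3
cov_sample = covariance_sketch(xs, ys, sample=True)  # 4 / 2
print(cov_pop, cov_sample)
```

Here the means are 2 and 4, the paired products of deviations are 2, 0, and 2, and their sum of 4 is divided by 3 for the population covariance or by 2 for the sample covariance.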


In [None]:
def covariance(lst1, lst2, sample = False):
    mean1 = mean(lst1)
    mean2 = mean(lst2)
    # prepare covariance of zero
    cov = 0
    # get length of each list:
    n1 = len(lst1)
    n2 = len(lst2)
    # check if lists are the same length:
    if n1 == n2:
        n = n1
        # sum the product of the dif