# Sort a List of 100 Numbers by Divisibility, Check Multiplicity, and Extract Summary Statistics

In this notebook, we manipulate a list of 100 natural numbers that I pulled off the internet. The numbers have no significance; I just wanted to test out various tricks, such as 

1. using modular arithmetic to separate evens from odds, and to list those divisible by 3 or divisible by 5
2. sorting them by hand
3. counting their multiplicities
4. treating them as a sample and finding summary statistics (min, max, mean, the three quartiles, and standard deviation)

We attempt point 4 in two different ways: directly with NumPy, formatting the printout ourselves, and with the ```.describe()``` function in Pandas.

## Import Libraries

In [1]:
import numpy as np
import math
import pandas as pd

## The Data: a List of 100 Numbers

Here's the number list we will manipulate below.  I couldn't think of a better name for it than "numbers".

In [2]:
numbers = [951, 402, 984, 651, 360, 69, 408, 319, 601, 485, 980,
           507, 725, 547, 544, 615, 83, 165, 141, 501, 263, 617,
           865, 575, 219, 390, 984, 592, 236, 105, 942, 941, 386,
           462, 47, 418, 907, 344, 236, 375, 823, 566, 597, 978,
           328, 615, 953, 345, 399, 162, 758, 219, 918, 237, 412,
           566, 826, 248, 866, 950, 626, 949, 687, 217, 815, 67,
           104, 58, 512, 24, 892, 894, 767, 553, 81, 379, 843, 831,
           445, 742, 717, 958, 609, 842, 451, 688, 753, 854, 685, 93,
           857, 440, 380, 126, 721, 328, 753, 470, 743, 527]

## Output Formatting

We first define a **printout formatting function**, ```print_rows()```, to display the list in even rows of 10 numbers each, every cell left-aligned with width 4.  For later use, I also define a second function, ```print_rows2()```, with rows of 23 numbers and cell width 3, to format a longer example list below.

In [3]:
def print_rows(x,modulus=10):
    for i,a in enumerate(x,start=1):
        print(f"{a: <4} ", end="")
        if i % modulus == 0:
            print()
    print()

def print_rows2(x,modulus=23):
    for i,a in enumerate(x,start=1):
        print(f"{a: <3} ", end="")
        if i % modulus == 0:
            print()
    print()
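
The two functions above differ only in their defaults, so they could be folded into one. Here is a minimal sketch, assuming a hypothetical helper `format_rows()` with `per_row` and `width` parameters (names of my own choosing, not part of the notebook), which returns the text instead of printing it:

```python
def format_rows(values, per_row=10, width=4):
    """Lay out values in left-aligned cells, per_row cells per line."""
    lines = []
    for start in range(0, len(values), per_row):
        chunk = values[start:start + per_row]
        # a nested replacement field lets the cell width be a parameter
        lines.append("".join(f"{v:<{width}} " for v in chunk))
    return "\n".join(lines)


sample = [951, 402, 984, 651, 360, 69]      # first few entries of "numbers"
print(format_rows(sample, per_row=3))
```

Returning a string rather than printing makes the layout easy to check; `format_rows(x, per_row=23, width=3)` would play the role of the second function.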

Let's test the formatting function on the "numbers" list.

In [4]:
print()
print('{:^48s}'.format("Consider the Following List of 100 Numbers"))
print("------------------------------------------------")

print_rows(numbers)


   Consider the Following List of 100 Numbers   
------------------------------------------------
951  402  984  651  360  69   408  319  601  485  
980  507  725  547  544  615  83   165  141  501  
263  617  865  575  219  390  984  592  236  105  
942  941  386  462  47   418  907  344  236  375  
823  566  597  978  328  615  953  345  399  162  
758  219  918  237  412  566  826  248  866  950  
626  949  687  217  815  67   104  58   512  24   
892  894  767  553  81   379  843  831  445  742  
717  958  609  842  451  688  753  854  685  93   
857  440  380  126  721  328  753  470  743  527  



## Task 1:  Sort Evens from Odds

Our first task is to separate **evens** from **odds** in our "numbers" list using the % operator.

In [5]:
# evens
#######
evens = []

for i in range(len(numbers)):
    if numbers[i] % 2 == 0:
        evens.append(numbers[i])

# odds
#######
odds = []

for i in range(len(numbers)):
    if numbers[i] % 2 == 1:
        odds.append(numbers[i])

Print out the evens and odds using our print_rows() function.

In [6]:
n1 = len(evens)
n2 = len(odds)

# print evens and odds in this format
#####################################
text1 = "We Separate Out the {0} Evens".format(n1)
print()
print('{:^48s}'.format(text1))
print("------------------------------------------------")

print_rows(evens)

text2 = "From the {0} Odds".format(n2)
print()
print('{:^48s}'.format(text2))
print("------------------------------------------------")

print_rows(odds)


          We Separate Out the 45 Evens          
------------------------------------------------
402  984  360  408  980  544  390  984  592  236  
942  386  462  418  344  236  566  978  328  162  
758  918  412  566  826  248  866  950  626  104  
58   512  24   892  894  742  958  842  688  854  
440  380  126  328  470  

                From the 55 Odds                
------------------------------------------------
951  651  69   319  601  485  507  725  547  615  
83   165  141  501  263  617  865  575  219  105  
941  47   907  375  823  597  615  953  345  399  
219  237  949  687  217  815  67   767  553  81   
379  843  831  445  717  609  451  753  685  93   
857  721  753  743  527  
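
The same separation can be written with list comprehensions instead of index loops. This is only a sketch on a short sample taken from the start of the list, using the same `%` test as above:

```python
sample = [951, 402, 984, 651, 360, 69]

evens = [n for n in sample if n % 2 == 0]
odds = [n for n in sample if n % 2 == 1]

# every number lands in exactly one of the two lists
assert len(evens) + len(odds) == len(sample)

print(evens)
print(odds)
```

Iterating over the values directly avoids the `numbers[i]` indexing, which is the main gain in readability here.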


## Task 2: Sift Out All Numbers Divisible by 3, then All Divisible by 5

We generate two more sub-lists by the same method, the numbers **divisible by 3** and the numbers **divisible by 5**, giving us a total of **5 lists**:
1. all
2. evens
3. odds
4. $3|n$
5. $5|n$

which we will then treat as **five datasets** in order to extract summary statistics and format their display in columns.  

In [7]:
# divisible by 3
################
div_by_three = []

for i in range(len(numbers)):
     if numbers[i] % 3 == 0:
         div_by_three.append(numbers[i])

n3 = len(div_by_three)
text3 = "There are Also {0} Divisible by 3".format(n3)

print()
print('{:^48s}'.format(text3))
print("------------------------------------------------")

print_rows(div_by_three)


        There are Also 40 Divisible by 3        
------------------------------------------------
951  402  984  651  360  69   408  507  615  165  
141  501  219  390  984  105  942  462  375  597  
978  615  345  399  162  219  918  237  687  24   
894  81   843  831  717  609  753  93   126  753  



...and those **divisible by 5**

In [8]:
# divisible by 5
################
div_by_five = []

for i in range(len(numbers)):
     if numbers[i] % 5 == 0:   
         div_by_five.append(numbers[i])

n4 = len(div_by_five)
text4 = "And there are {0} Divisible by 5".format(n4)

print()
print('{:^48s}'.format(text4))
print("------------------------------------------------")

print_rows(div_by_five)


        And there are 20 Divisible by 5         
------------------------------------------------
360  485  980  725  615  165  865  575  390  105  
375  615  345  950  815  445  685  440  380  470  
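
Since the loops for 2, 3, and 5 differ only in the modulus, they can share one helper. A minimal sketch, assuming a hypothetical function `divisible_by()` (my own name, not from the notebook) and a short sample of the list:

```python
def divisible_by(values, d):
    """Return the entries of values that d divides, in their original order."""
    return [n for n in values if n % d == 0]


sample = [360, 485, 615, 165, 24, 47]

print(divisible_by(sample, 3))
print(divisible_by(sample, 5))
# numbers divisible by both 3 and 5 are exactly those divisible by 15
print(divisible_by(sample, 15))
```

The last call shows how the two sub-lists overlap: an entry appears in both exactly when 15 divides it.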



## Task 3: Sort the Original List of Numbers by Hand

Python has a built-in **sort function**, ```sorted()```, but I preferred to **sort the list myself** to understand how such an algorithm works.  The simplest approach, known as bubble sort, uses two nested for loops: the outer loop runs the passes, and on each pass the inner loop walks the index from 0 to the penultimate position, swapping any adjacent pair ..., n, m, ... in the list that satisfies n > m.  Swapping when n > m sorts in increasing order; for decreasing order we'd swap when n < m instead.  For example,  

start A = [12, 7, 9, 1, 5]      (swap 12 and 7)\
--->   A = [7, 12, 9, 1, 5]      (swap 12 and 9)\
--->   A = [7, 9, 12, 1, 5]      (swap 12 and 1)\
--->   A = [7, 9, 1, 12, 5]      (swap 12 and 5, end pass 1)\
--->   A = [7, 9, 1, 5, 12]      (begin pass 2, swap 9 and 1)\
--->   A = [7, 1, 9, 5, 12]      (swap 9 and 5, end pass 2)\
--->   A = [7, 1, 5, 9, 12]      (begin pass 3, swap 1 and 7)\
--->   A = [1, 7, 5, 9, 12]      (swap 7 and 5, end pass 3)\
--->   A = [1, 5, 7, 9, 12]      A is sorted

Three passes finish the job here.  In general, at most 4 = 5 - 1 passes are required for a list of length 5, each making at most 4 adjacent comparisons; running a fifth pass, as the code below does, is harmless. 

In [9]:
# sorted numbers
####################

snumbers = numbers.copy()

for i in range(len(snumbers)):          # n passes
    for j in range(len(snumbers)-1):    # swap adjacents only on each pass
        if snumbers[j] > snumbers[j+1]:
            snumbers[j], snumbers[j+1] = snumbers[j+1], snumbers[j]

text9 = "The Sorted Numbers in Increasing Order"
print()
print('{:^48s}'.format(text9))
print("------------------------------------------------")

print_rows(snumbers)


     The Sorted Numbers in Increasing Order     
------------------------------------------------
24   47   58   67   69   81   83   93   104  105  
126  141  162  165  217  219  219  236  236  237  
248  263  319  328  328  344  345  360  375  379  
380  386  390  399  402  408  412  418  440  445  
451  462  470  485  501  507  512  527  544  547  
553  566  566  575  592  597  601  609  615  615  
617  626  651  685  687  688  717  721  725  742  
743  753  753  758  767  815  823  826  831  842  
843  854  857  865  866  892  894  907  918  941  
942  949  950  951  953  958  978  980  984  984  
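
The passes can be cut short once the list is already in order. The following is a sketch of that variant, not the cell above: it stops after the first pass that makes no swap, and it shortens the inner loop because the largest remaining number has already reached the end after each pass. The function name `bubble_sort()` is my own.

```python
def bubble_sort(values):
    """Return a sorted copy of values using adjacent swaps."""
    result = list(values)
    n = len(result)
    for done in range(n - 1):                 # at most n - 1 passes
        swapped = False
        # the last `done` entries are already in their final places
        for j in range(n - 1 - done):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:                       # a pass without swaps: sorted
            break
    return result


A = [12, 7, 9, 1, 5]
print(bubble_sort(A))
assert bubble_sort(A) == sorted(A)            # cross-check with the built-in
```

Comparing against `sorted()` is a cheap way to verify a hand-written sort on any test list.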



## Task 4:  Count Multiplicities in the List of Numbers

Nice! Well, that was easy.  Let's now take a short detour into **counting the multiplicities** of our "numbers" before we go on to treat their statistical properties.  That list happens to be poverty-stricken for repeated values, so afterwards we'll concoct another list and try our hand at *that* one.

In [10]:
# count any multiples
#####################
counts = []
multiples = []

for i in numbers:
    n = numbers.count(i)
    mult = i
    if n > 1:
        counts.append(n)
        multiples.append(mult)
        
pairs = zip(multiples,counts) # create list of 2-tuples of (i,#i)

pairs_c = []

for j in pairs:           # to weed out the duplicates, use `not in'
    if j not in pairs_c:
        pairs_c.append(j)

multiples_b = [pairs_c[i][0] for i in range(len(pairs_c))]  # extract each number
counts_b = [pairs_c[i][1] for i in range(len(pairs_c))] # extract the corresponding multiplicity
multiples_b.insert(0,'n ')  # add line titles for each
counts_b.insert(0,'#n')


text10 = "Some Numbers Occur Multiple Times"
text11 = "We List Each With Its Corresponding Multiplicity"
print()
print('{:^48s}'.format(text10))
print('{:^48s}'.format(text11))
print("------------------------------------------------")

print_rows(multiples_b)
print_rows(counts_b)


       Some Numbers Occur Multiple Times        
We List Each With Its Corresponding Multiplicity
------------------------------------------------
n    984  615  219  236  566  328  753  
#n   2    2    2    2    2    2    2    


### Count Multiplicities in a List With More Duplicates

Let's try this with a smaller example list containing more repeated values.  We sort the list for the reader's convenience, to allow verification of the multiplicity count.

In [11]:
A = [12, 12, 7, 2, 2, 3, 7, 9, 1, 12, 7, 5, 1, 1, 4, 1, 11, 5, 12, 9, 7, 9, 5]
B = A.copy()

text1 = "Initial list A"
print()
print('{:^90s}'.format(text1))
print("------------------------------------------------------------------------------------------")

print_rows2(A)
print()

# sort B (a copy of A), leaving A itself unchanged
for i in range(len(B)):          # n passes
    for j in range(len(B)-1):     # swap adjacents only on each pass
        if B[j] > B[j+1]:
            B[j], B[j+1] = B[j+1], B[j]

text2 = "Sorted list A"
print()
print('{:^90s}'.format(text2))
print("------------------------------------------------------------------------------------------")
print_rows2(B)
print()
            
# count any multiples
################################################################
counts = []
multiples = []

for i in B:
    n = B.count(i)
    mult = i
    counts.append(n)
    multiples.append(mult)
        
pairs = list(zip(multiples,counts)) # create list of 2-tuples of (i,#i)

pairs_c = []

for j in pairs:           # to weed out the duplicates, use `not in'
    if j not in pairs_c:
        pairs_c.append(j)

multiples_b = [pairs_c[i][0] for i in range(len(pairs_c))]  # extract each number
counts_b = [pairs_c[i][1] for i in range(len(pairs_c))] # extract the corresponding multiplicity
multiples_b.insert(0,'n ')  # add line titles for each
counts_b.insert(0,'#n')

text10 = "Some Numbers Occur Multiple Times"
text11 = "We List Each With its Multiplicity"
print()
print('{:^48s}'.format(text10))
print('{:^48s}'.format(text11))
print("------------------------------------------------")

print_rows2(multiples_b)
print_rows2(counts_b)
print()


                                      Initial list A                                      
------------------------------------------------------------------------------------------
12  12  7   2   2   3   7   9   1   12  7   5   1   1   4   1   11  5   12  9   7   9   5   



                                      Sorted list A                                       
------------------------------------------------------------------------------------------
1   1   1   1   2   2   3   4   5   5   5   7   7   7   7   9   9   9   11  12  12  12  12  



       Some Numbers Occur Multiple Times        
       We List Each With its Multiplicity       
------------------------------------------------
n   1   2   3   4   5   7   9   11  12  
#n  4   2   1   1   3   4   3   1   4   
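
For comparison, the standard library's `collections.Counter` does the same tally in a single pass over the list. This is a sketch of an alternative, not the method used above:

```python
from collections import Counter

A = [12, 12, 7, 2, 2, 3, 7, 9, 1, 12, 7, 5, 1, 1, 4, 1, 11, 5, 12, 9, 7, 9, 5]

tally = Counter(A)                       # maps each number to its count
pairs = sorted(tally.items())            # (n, #n) pairs in increasing n
repeated = [(n, c) for n, c in pairs if c > 1]

print(pairs)
print(repeated)
```

Calling `list.count()` inside a loop rescans the whole list for every element, so `Counter` scales better on long lists; the `repeated` filter corresponds to the `if n > 1` test used on "numbers".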



## Task 5: Treat Our 5 Number Lists as Datasets and Extract Their Summary Statistics (With NumPy as Well as Pandas)

Next, we treat the list "numbers" as a sample data set, and extract the summary statistics (min, max, mean, 1st quartile, 2nd quartile = median, 3rd quartile, and standard deviation).  We first do this directly, with NumPy, before looking at Pandas' ```.describe()```.  In fact, I gather the summary statistics on all five lists:
* numbers
* evens
* odds
* 3|n
* 5|n

In [12]:
# summary statistics
####################

# title 
text5 = "Min, Max, Mean, Three Quartiles, and"
text5b = "Standard Deviation For Each Class of Numbers"
text6 = "Class\tMin\tMax\tMean\t1st Q\tMedian\t3rd Q\tSD"
print()
print()
print('{:^64s}'.format(text5))
print('{:^64s}'.format(text5b))
print("----------------------------------------------------------------")
print('{:<}'.format(text6))
print("----------------------------------------------------------------")

# printout
cases = [numbers, evens, odds, div_by_three, div_by_five]
Cases = ['All', 'Evens', 'Odds', '3|n', '5|n']

for i in range(len(cases)):
    mmax = max(cases[i])
    mmin = min(cases[i])
    mmean = np.mean(cases[i])
    quartile1 = np.quantile(cases[i],0.25)
    mmedian = np.median(cases[i])
    quartile3 = np.quantile(cases[i],0.75)
    ssd = np.std(cases[i])
    text7 = f"{Cases[i]:<5s}\t{mmin:<d}\t{mmax:<d}\t{mmean:<0.2f}\t{quartile1:<0.2f}\t{mmedian:<0.1f}\t{quartile3:<0.0f}\t{ssd:<0.2f}"
    print('{:<}'.format(text7))



              Min, Max, Mean, Three Quartiles, and              
          Standard Deviation For Each Class of Numbers          
----------------------------------------------------------------
Class	Min	Max	Mean	1st Q	Median	3rd Q	SD
----------------------------------------------------------------
All  	24	984	541.89	340.00	550.0	779	281.72
Evens	24	984	559.64	360.00	512.0	854	288.81
Odds 	47	953	527.36	291.00	575.0	748	274.94
3|n  	24	984	502.80	219.00	481.5	753	302.75
5|n  	105	980	539.25	378.75	477.5	695	235.93
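
The fractional quartile 378.75 in the last row comes from interpolating between two neighbouring values of the sorted list. The sketch below reproduces that by hand, assuming the linear interpolation rule that, to my understanding, is the default of `np.quantile()`; the helper name `quantile_linear()` is my own.

```python
import math

def quantile_linear(values, q):
    """q-quantile by linear interpolation between the two nearest ranks."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)              # fractional index into the sorted list
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


div_by_five = [360, 485, 980, 725, 615, 165, 865, 575, 390, 105,
               375, 615, 345, 950, 815, 445, 685, 440, 380, 470]

for q in (0.25, 0.5, 0.75):
    print(q, quantile_linear(div_by_five, q))
```

For the 1st quartile the fractional index is 0.25 × 19 = 4.75, so the result lies three quarters of the way from 375 to 380.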


Pandas has the pd.DataFrame() function, which produces a table out of a list.  Running the .describe() function on it extracts the summary statistics and displays them in a column.  

This is easy if you use it on a single list. But I wanted to get summary statistics for **all** lists
* numbers
* evens
* odds
* 3|n
* 5|n

so I had to pass a **dictionary** into ```pd.DataFrame()```, with the keys becoming the titles above the corresponding columns of list values.  The irritating problem is the difference in the lists' lengths, which causes ```pd.DataFrame()``` to raise an error.  A Stack Exchange answer gave me the well-known, but less well understood, fix: the ```.from_dict( , orient='index')``` constructor.  It pads the shorter lists with missing values, but it lays each list out as a row instead of a column, which is easy to fix with a simple **transpose** of the data frame before running ```.describe()```.

In [13]:
Kases = {'all':numbers,'evens':evens,'odds':odds,'3|n':div_by_three, '5|n':div_by_five}

print()
print()
text8 = f"Let's Compare With Pandas' .describe() Function"
print('{:^48s}'.format(text8))
print("------------------------------------------------")

numbers_df = pd.DataFrame.from_dict(Kases, orient='index')
numbers_df = numbers_df.T
# note: applymap is deprecated since pandas 2.1; newer versions use DataFrame.map
print(numbers_df.describe().applymap('{:.2f}'.format))



Let's Compare With Pandas' .describe() Function 
------------------------------------------------
          all   evens    odds     3|n     5|n
count  100.00   45.00   55.00   40.00   20.00
mean   541.89  559.64  527.36  502.80  539.25
std    283.14  292.07  277.47  306.60  242.05
min     24.00   24.00   47.00   24.00  105.00
25%    340.00  360.00  291.00  219.00  378.75
50%    550.00  512.00  575.00  481.50  477.50
75%    779.00  854.00  748.00  753.00  695.00
max    984.00  984.00  953.00  984.00  980.00
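
To see what the `from_dict` and transpose steps do, here is a small sketch on made-up lists of unequal length. The second construction, via `pd.Series`, is an alternative I am suggesting rather than something the notebook uses:

```python
import pandas as pd

ragged = {"all": [1, 2, 3, 4], "evens": [2, 4], "odds": [1, 3]}

# one row per key; shorter rows are padded with missing values
by_row = pd.DataFrame.from_dict(ragged, orient="index")
by_col = by_row.T                        # transpose: one column per key

# alternative: wrapping each list in a Series pads the columns directly
alt = pd.DataFrame({name: pd.Series(vals) for name, vals in ragged.items()})

print(by_col)
print(by_col.count())                    # non-missing entries per column
```

The `count` row of `.describe()` reports the non-missing entries per column, which is why the padded columns still show their true lengths (45, 55, 40, and 20 above).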


### Remark About Population Versus Sample Standard Deviation, and the NumPy Fix

We notice that NumPy's standard deviation (SD) is different from Pandas'.  This is because NumPy by default computes the *population* SD, 

$$
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2}
$$

while Pandas computes the *sample* SD, 
 
$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}
$$

dividing the sum by $n-1$ instead of $n$.  Passing ```ddof=1``` to ```np.std()```, as in ```np.std( ,ddof=1)```, overrides the default ```ddof=0``` and gives the sample SD.  

In [14]:
text5c = "Min, Max, Mean, Three Quartiles,"
text5bc = "and *Sample* Standard Deviation For Each Class"
text6c = "Class\tMin\tMax\tMean\t1st Q\tMedian\t3rd Q\tSD"
print()
print('{:^64s}'.format(text5c))
print('{:^64s}'.format(text5bc))
print("----------------------------------------------------------------")
print('{:<}'.format(text6c))
print("----------------------------------------------------------------")

cases2 = [numbers, evens, odds, div_by_three, div_by_five]
Cases2 = ['All', 'Evens', 'Odds', '3|n', '5|n']

for i in range(len(cases2)):
    mmax = max(cases2[i])
    mmin = min(cases2[i])
    mmean = np.mean(cases2[i])
    quartile1 = np.quantile(cases2[i],0.25)
    mmedian = np.median(cases2[i])
    quartile3 = np.quantile(cases2[i],0.75)
    ssd = np.std(cases2[i],ddof=1)
    text7 = f"{Cases2[i]:<5s}\t{mmin:<d}\t{mmax:<d}\t{mmean:<0.2f}\t{quartile1:<0.2f}\t{mmedian:<0.1f}\t{quartile3:<0.0f}\t{ssd:<0.2f}"
    print('{:<}'.format(text7))

print()


                Min, Max, Mean, Three Quartiles,                
         and *Sample* Standard Deviation For Each Class         
----------------------------------------------------------------
Class	Min	Max	Mean	1st Q	Median	3rd Q	SD
----------------------------------------------------------------
All  	24	984	541.89	340.00	550.0	779	283.14
Evens	24	984	559.64	360.00	512.0	854	292.07
Odds 	47	953	527.36	291.00	575.0	748	277.47
3|n  	24	984	502.80	219.00	481.5	753	306.60
5|n  	105	980	539.25	378.75	477.5	695	242.05
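
The two formulas differ only in the divisor, which a few lines of plain Python make explicit. This is a sketch on a made-up data set; the function `std()` with its `ddof` parameter is my own imitation of the NumPy option, not NumPy's implementation:

```python
import math

def std(values, ddof=0):
    """Standard deviation dividing the squared deviations by n - ddof."""
    n = len(values)
    mean = sum(values) / n
    squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squares / (n - ddof))


data = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up sample with mean 5
sigma = std(data)                        # population SD, divides by n
s = std(data, ddof=1)                    # sample SD, divides by n - 1

print(sigma, s)
# the two differ only by the factor sqrt(n / (n - 1))
assert math.isclose(s, sigma * math.sqrt(len(data) / (len(data) - 1)))
```

The same factor links the two tables: for the full list, 281.72 × sqrt(100/99) ≈ 283.14.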

