# Gini Index
### <I>Exploring different implementations of the Gini Index</I> 
There is no standardized implementation in Python.  
Therefore, three methods were compared:  
- Standard: the formula given in the book reference and in most other sources, e.g. https://en.wikipedia.org/wiki/Gini_coefficient#cite_note-23
- Code simplification: https://www.statology.org/gini-coefficient-python/
- Alternative equation: https://en.wikipedia.org/wiki/Gini_coefficient#cite_note-23

In [1]:
import numpy as np
import time

In [103]:
x = [1000,2000,3000,4000,5000,6000,7000,8000,9000,10000]
x2 = [100,110,120,200,210,230,300,330,360,500]
unequal = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
equal = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
large_x = list(range(1, 10001))
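
For the small test vectors above, exact reference values can be worked out directly from the double-sum definition of the Gini index. The sketch below does this with exact rational arithmetic from the standard library, so no floating-point rounding is involved. The name `gini_exact` is a hypothetical helper introduced here for illustration; it is not one of the three functions compared in this notebook, and it assumes non-negative integer inputs that are not all zero.

```python
from fractions import Fraction

def gini_exact(values):
    """Exact Gini index of a list of non-negative integers (not all zero)."""
    n = len(values)
    total = sum(values)
    # Sum of |a - b| over all ordered pairs (a, b), including a == b.
    pairwise = sum(abs(a - b) for a in values for b in values)
    # The denominator 2 * n^2 * mean simplifies to 2 * n * total.
    return Fraction(pairwise, 2 * n * total)

x = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
unequal = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
equal = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(gini_exact(x))        # 3/10
print(gini_exact(unequal))  # 9/10
print(gini_exact(equal))    # 0
```

These values give fixed points against which any floating-point implementation can be compared: perfect equality yields 0, and a single holder of everything among `n` members yields `(n - 1) / n`.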

## Conclusion
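The function <code>gini_alternative</code> is the best of the three.  
It is by far the fastest on <code>large_x</code> (about 0.001 s, versus 0.13 s for the Statology version and 149.62 s for the book version), and its results are consistent with:  
https://goodcalculators.com/gini-coefficient-calculator/  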

## Standard Eq

> ***Note***:
The standard equation is:
$$\frac{\sum_{i=1}^n \sum_{j=1}^n\left|x_i-x_j\right|}{2 n^2 \bar{x}}$$  
where <code>n</code> is the number of values and $\bar{x}$ is their mean.  
This is the form given in the book reference.  
Applied directly in Python, it becomes:

In [106]:
def gini_book(x):
    # Sorting is not needed: the double sum covers every ordered pair (i, j).
    x = np.asarray(x, dtype=np.float64)
    total = 0.0
    # Both loops must run over all n elements so that every pair is counted.
    for xi in x:
        for xj in x:
            total += np.abs(xi - xj)

    return total / (2 * len(x)**2 * np.mean(x))

In [107]:
start_time = time.time()
print(f'gini Book: {gini_book(large_x)}')
print(f'Book Time: {np.round(time.time() - start_time,2)}[s]')  

gini Book: 0.3333
Book Time: 149.62[s]
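
Most of the 149.62 s above is spent in the Python-level double loop. As a minimal sketch, the same double sum can be handed to NumPy broadcasting instead. The name `gini_pairwise` is an assumption made for this example. The approach builds an n-by-n array of differences (about 800 MB of float64 per array for `large_x`), so it only suits small to moderate inputs.

```python
import numpy as np

def gini_pairwise(x):
    """Standard double-sum formula via NumPy broadcasting (O(n^2) memory)."""
    x = np.asarray(x, dtype=np.float64)
    # x[:, None] - x[None, :] is the n-by-n matrix of differences x_i - x_j.
    diff_sum = np.abs(x[:, None] - x[None, :]).sum()
    return diff_sum / (2 * len(x)**2 * x.mean())

unequal = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
equal = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(f'gini pairwise (unequal): {gini_pairwise(unequal)}')
print(f'gini pairwise (equal): {gini_pairwise(equal)}')
```

The trade-off is speed for memory: no Python loop remains, but the intermediate array grows quadratically with the sample size.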


## Statology Python simplification

> ***Note***:
This implements the standard formula,  
but uses <code>numpy</code> vectorisation to speed up the calculation.

In [108]:
# From:
# https://www.statology.org/gini-coefficient-python/
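def gini(x):
    x = np.asarray(x, dtype=np.float64)
    # Sorting is not needed: absolute differences do not depend on order.
    total = 0.0
    # Compare each value only with the values that follow it,
    # so each unordered pair is counted once.
    for i, xi in enumerate(x[:-1], 1):
        total += np.sum(np.abs(xi - x[i:]))
    return total / (len(x)**2 * np.mean(x))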

In [109]:
start_time = time.time()
print(f'gini Statology: {gini(large_x)}')
print(f'Statology Time: {np.round(time.time() - start_time,2)}[s]')  

gini Statology: 0.3333
Statology Time: 0.13[s]
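
The denominator of <code>gini</code> lacks the factor 2 of the standard equation. A short derivation, sketched here with the same symbols as the standard equation, shows why the two forms agree: the terms with i = j are zero, and every remaining pair appears twice in the full double sum.

$$\sum_{i=1}^n \sum_{j=1}^n\left|x_i-x_j\right| = 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n}\left|x_i-x_j\right| \quad\Longrightarrow\quad \frac{\sum_{i=1}^n \sum_{j=1}^n\left|x_i-x_j\right|}{2 n^2 \bar{x}} = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}\left|x_i-x_j\right|}{n^2 \bar{x}}$$

The right-hand side is what the loop computes: for each <code>xi</code>, the slice <code>x[i:]</code> holds exactly the values with a larger index.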


## Alternative Formula

> ***Note***:
Adding the assumption that <code>x</code> is sorted in ascending order  
allows the following simplification:  
$$\frac{2 \sum_{i=1}^n i x_i}{n \sum_{i=1}^n x_i}-\frac{n+1}{n}$$  
This is already a simplification of the formulas in:  
https://www.statsdirect.com/help/nonparametric_methods/gini_coefficient.htm  
https://goodcalculators.com/gini-coefficient-calculator/  
which is derived on the Wikipedia page:  
https://en.wikipedia.org/wiki/Gini_coefficient#cite_note-23  

*Note* that the function also contains a further simplification, left commented out.  
This estimated Gini should be even faster, but it is not a good approximation for small sample sizes <code>n</code>: it equals the value of the formula above multiplied by <code>n/(n-1)</code>.
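
To make the small-sample difference concrete, the sketch below evaluates both expressions side by side. The names `gini_sorted` and `gini_estimated` are hypothetical helpers written for this comparison; the second one follows the commented-out `estimated_total` line in the function. Algebraically the two differ by the factor `n / (n - 1)`, which is about 1.11 for `n = 10` and about 1.0001 for `n = 10000`.

```python
import numpy as np

def gini_sorted(x):
    """Formula above: 2*sum(i*x_i) / (n*sum(x)) - (n+1)/n on sorted x."""
    x = np.sort(np.asarray(x, dtype=np.float64))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

def gini_estimated(x):
    """Estimated variant: 1 - 2/(n-1) * (n - sum(i*x_i)/sum(x))."""
    x = np.sort(np.asarray(x, dtype=np.float64))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 1 - (2 / (n - 1)) * (n - np.sum(ranks * x) / x.sum())

x2 = [100, 110, 120, 200, 210, 230, 300, 330, 360, 500]
large_x = list(range(1, 10001))

for name, data in (("x2", x2), ("large_x", large_x)):
    n = len(data)
    ratio = gini_estimated(data) / gini_sorted(data)
    print(f'{name}: ratio = {ratio:.6f}, n/(n-1) = {n / (n - 1):.6f}')
```

For samples of a few thousand values the choice barely matters; for the ten-element test vectors it changes the result by roughly 11 %.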

In [110]:
def gini_alternative(x):
    # https://en.wikipedia.org/wiki/Gini_coefficient#cite_note-23
    # last in: Alternative expressions
    x = np.asarray(x, dtype=np.float64)
    x = np.sort(x)  # the formula requires ascending order
    n = len(x)
    sum_x = x.sum()

    # Weight each sorted value by its 1-based rank i
    sum_ix = np.sum(x * np.arange(1, n + 1))

    # estimated_total = 1 - (2/(n - 1)) * (n - (sum_ix / sum_x))

    total = ((2 * sum_ix) / (n * sum_x)) - (n + 1) / n

    # print(f'Diff: {abs(total-estimated_total)}')

    return total

In [113]:
start_time = time.time()
print(f'gini Wiki: {gini_alternative(large_x)}')
print(f'Wiki Time: {np.round(time.time() - start_time,10)}[s]')  

gini Wiki: 0.33329999999999993
Wiki Time: 0.0009973049[s]
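
A single `time.time()` measurement of a call that lasts about a millisecond is sensitive to timer resolution and background load. As a hedged sketch, the standard-library `timeit` module can repeat the call and report the best run. The helper `gini_sorted` below is a compact stand-in for `gini_alternative`, and the repeat counts are arbitrary choices for illustration.

```python
import timeit
import numpy as np

def gini_sorted(x):
    """Compact stand-in for gini_alternative (sorted-rank formula)."""
    x = np.sort(np.asarray(x, dtype=np.float64))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

large_x = list(range(1, 10001))

# Run the call 100 times per measurement and repeat the measurement 5 times;
# the minimum is the run least disturbed by other processes.
runs = timeit.repeat(lambda: gini_sorted(large_x), number=100, repeat=5)
best_per_call = min(runs) / 100
print(f'gini Wiki: {gini_sorted(large_x)}')
print(f'Wiki Time (best of 5): {best_per_call:.6f}[s] per call')
```

The absolute numbers depend on the machine, so only the relative ordering of the three implementations should be read from such timings.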
