## Hellinger transformation 

The Hellinger transformation is recommended for count data as a means of standardization and scaling<sup>1,2</sup>. It allows the data to be passed to PCA and other methods that are otherwise unsuitable for count data, where observations within a row are not independent. It is not perfect, but it is preferred to column-based scaling techniques because the structure of each row is important for count data<sup>3</sup>. Essentially, each count is divided by its row total and the square root is taken; this downweights abundant species and gives better representation to the tail of less dominant species. Results are identical to the `decostand` implementation (`method = "hellinger"`) in vegan (R package).
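
The transformation described above can be written compactly. The symbols are chosen here for illustration: $y_{ij}$ is the raw count of species $j$ in sample (row) $i$, $p$ is the number of species, and $y_{i+}$ is the row total.

$$
y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}}, \qquad y_{i+} = \sum_{j=1}^{p} y_{ij}
$$

For example, a row with counts (10, 5, 1) has a total of 16, so the transformed values are $\sqrt{10/16} \approx 0.7906$, $\sqrt{5/16} \approx 0.5590$ and $\sqrt{1/16} = 0.25$.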

**Requirements**
A dataframe of raw count data, with one observation (sample) per row and one species per column. The dataframe is assumed to have been processed beforehand to remove N/A values and other non-numeric flags from the body of the data.
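
The kind of prior processing assumed here might look like the sketch below. `clean_counts` is a hypothetical helper, not part of this notebook or of pandas, and treating missing values as zero counts is an assumption that will not suit every dataset (dropping the affected rows is another option).

```python
import pandas as pd

def clean_counts(raw, id_cols):
    """Hypothetical helper: split row labels from counts and validate the counts."""
    ids = raw[id_cols]
    counts = raw.drop(columns=id_cols)
    # Turn non-numeric flags (e.g. 'N/A') into NaN ...
    counts = counts.apply(pd.to_numeric, errors="coerce")
    # ... and, as an assumption for this sketch, treat missing values as zero counts.
    counts = counts.fillna(0)
    if (counts < 0).any().any():
        raise ValueError("count data must be non-negative")
    return ids, counts

# Small made-up example with one 'N/A' flag and one count stored as a string.
raw = pd.DataFrame({
    "Sample": ["LM 10", "LM 20"],
    "taxa_1": [10, "N/A"],
    "Taxa_2": ["5", 3],
})
ids, counts = clean_counts(raw, ["Sample"])
print(counts)
```

The returned `ids` and `counts` play the same roles as the row labels and count data that are split apart before calling `hellinger`.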


<sup>1</sup>Legendre, P. & Legendre, L. (2012) *Numerical Ecology (3rd edition)*. Developments in Environmental Modelling, vol. **24**, pp. 1-990. Elsevier, Amsterdam.  
<sup>2</sup>Legendre, P. & Gallagher, E. D. (2001) Ecologically meaningful transformations for ordination of species data. *Oecologia*, **129**:271-280.  
<sup>3</sup>Borcard, D., Gillet, F., & Legendre, P. (2011) *Numerical Ecology with R*. pp. 1-306, Springer, Dordrecht.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('taxa_matrix.csv')
df.head()

Unnamed: 0,Sample,taxa_1,Taxa_2,Taxa_3
0,LM 10,10,5,1
1,LM 20,18,3,3
2,LM 30,10,5,0


In [3]:
## Split the row labels (sample IDs) off from the count data
taxa = df.iloc[:, 0:1]  # row labels (the 'Sample' column), kept as a one-column dataframe
data = df.iloc[:, 1:]   # the count data, one column per taxon

In [4]:
## HELLINGER transformation:
'''
To call this function from another notebook, place transformation.py in the working
folder, import transformation as tm, and then call
tm.hellinger(df, df_type='full', ids=row_labels).
'''
import numpy as np
import pandas as pd

def hellinger(df, **kwargs):
    '''
    df --> the cleaned data matrix: the count columns (with their column headers) containing
    just raw count frequencies and zeros (strings need removing first).

    ids --> the first column (or columns) with row labels (e.g. sample numbers, depth etc.),
    passed as a separate df.

    df_type --> if left blank, just the transformed matrix is returned; if 'full', the
    transformed matrix is returned with the row labels attached.
    '''
    ids = kwargs.get('ids', None)
    df_type = kwargs.get('df_type', None)

    # Divide each count by its row total, then take the square root.
    # Rows whose total is zero come out as NaN.
    tm = np.sqrt(df.div(df.sum(axis=1), axis=0))

    if df_type == 'full':
        return pd.concat([ids, tm], axis=1)
    return tm

In [6]:
## Returns the whole dataframe again with row labels 

new = hellinger(data, df_type='full', ids = taxa)
new.head()

Unnamed: 0,Sample,taxa_1,Taxa_2,Taxa_3
0,LM 10,0.790569,0.559017,0.25
1,LM 20,0.866025,0.353553,0.353553
2,LM 30,0.816497,0.57735,0.0


In [7]:
## Returns just the transformed matrix

new_matrix = hellinger(data)
new_matrix.head()

Unnamed: 0,taxa_1,Taxa_2,Taxa_3
0,0.790569,0.559017,0.25
1,0.866025,0.353553,0.353553
2,0.816497,0.57735,0.0
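
As a quick check on the transformed matrix, and to illustrate passing it on to a PCA, here is a minimal sketch. It rebuilds the three example samples by hand rather than reading `taxa_matrix.csv`, and it runs the PCA with a plain NumPy SVD of the column-centred matrix; this is one common way to compute a PCA, not necessarily the workflow used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Same three samples as in the example above (raw counts).
counts = pd.DataFrame(
    {"taxa_1": [10, 18, 10], "Taxa_2": [5, 3, 5], "Taxa_3": [1, 3, 0]},
    index=["LM 10", "LM 20", "LM 30"],
)

# Hellinger transformation: square root of the row-wise relative abundances.
hel = np.sqrt(counts.div(counts.sum(axis=1), axis=0))

# Sanity check: the squared values in every transformed row sum to 1.
row_norms_sq = (hel ** 2).sum(axis=1)

# PCA via SVD of the column-centred matrix (columns are not rescaled).
centred = hel - hel.mean(axis=0)
u, s, vt = np.linalg.svd(centred.to_numpy(), full_matrices=False)
scores = pd.DataFrame(u * s, index=hel.index)  # sample coordinates on each axis
explained = s ** 2 / np.sum(s ** 2)            # proportion of variance per axis

print(row_norms_sq.round(6).tolist())
print(explained.round(3))
```

Because each transformed row has unit length, a row whose squared values do not sum to 1 usually points to missing or non-numeric values left in the input.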


## To import the function from the transformation.py file

Place the transformation.py file in the working directory of the Jupyter notebook.

In [9]:
import transformation as tm

In [11]:
new_full_imported = tm.hellinger(data, df_type='full', ids=taxa)
new_full_imported.head()

Unnamed: 0,Sample,taxa_1,Taxa_2,Taxa_3
0,LM 10,0.790569,0.559017,0.25
1,LM 20,0.866025,0.353553,0.353553
2,LM 30,0.816497,0.57735,0.0
