# Analyse Shannon diversity index and Faith's phylogenetic diversity index characteristics

This Jupyter Notebook contains the main functions used to analyse the dynamic behavior of alpha diversity.

### RANDOM WALK TESTS

In a random walk time series, the value at each time point is equal to the value at the previous time point plus a random shock or disturbance term. The shock term is usually assumed to be normally distributed and uncorrelated with the previous value, which means that it is completely unpredictable.

Random walk time series are interesting because they are non-stationary: their variance grows with time, and their mean drifts if the shocks have a non-zero mean. Instead of reverting to a fixed level, they wander, sometimes moving up and sometimes moving down. This makes them difficult to forecast (the best prediction of the next value is simply the current value), and they are often used as a benchmark for testing other time series models.
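
The definition above can be illustrated with a small simulation. This is a minimal sketch using only the standard library; the function name `simulate_random_walk`, the seed, the number of paths and the path length are illustrative assumptions, not part of the analysis in this notebook. It shows that the spread of many independent random walks keeps growing with time.

```python
import random
import statistics


def simulate_random_walk(n_steps, rng, start=0.0, shock_sd=1.0):
    """Return one random walk path: y[t] = y[t-1] + shock[t]."""
    path = [start]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, shock_sd))
    return path


rng = random.Random(0)  # fixed seed so the sketch is reproducible
paths = [simulate_random_walk(400, rng) for _ in range(500)]

# Variance across the 500 paths at two time points. With unit-variance shocks
# the variance at time t is roughly t, so it never settles to a fixed value.
var_at_100 = statistics.pvariance([p[100] for p in paths])
var_at_400 = statistics.pvariance([p[400] for p in paths])
print(f"variance at t=100: {var_at_100:.1f}")
print(f"variance at t=400: {var_at_400:.1f}")
```

The second variance should come out at roughly four times the first, which is the behavior the unit root tests below are designed to detect.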

1. The ADF (Augmented Dickey-Fuller) test is a statistical test used to determine if a time series is stationary or non-stationary. Stationarity is an important property of a time series, as it means that the statistical properties of the series do not change over time.

    The ADF test works by estimating the relationship between each observation in a time series and the previous observation. Specifically, the series is regressed on its lagged value and the test checks whether the coefficient on the lagged value is significantly less than 1 (equivalently, the differenced series is regressed on the lagged level and the coefficient is tested against 0). If the coefficient is significantly less than 1, the series is taken to be stationary; if it is not significantly different from 1, the series has a unit root and is non-stationary.

    The "augmented" part of the test adds lagged differences of the series to the regression. These extra terms absorb serial correlation in the residuals, which would otherwise distort the test statistic. Deterministic components such as a constant or a linear trend are specified separately, through the choice of regression.

    In summary, the ADF test checks for a unit root by estimating how strongly each observation depends on the previous one. If the p-value from the ADF test is less than 0.05, the null hypothesis that the time series has a unit root is rejected at the 5% level, which is evidence that the series is stationary.
        
2. The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test is a statistical test used to determine if a time series is stationary or non-stationary. Stationarity is an important property of a time series, as it means that the statistical properties of the series do not change over time.

    The KPSS test works by measuring how far the series wanders away from a deterministic component. The series is first regressed on a constant (level stationarity) or on a constant and a linear trend (trend stationarity). The residuals are then cumulated into partial sums, and the test statistic is the sum of the squared partial sums, scaled by the squared sample size and by an estimate of the long-run variance of the residuals. The null hypothesis is that the series is stationary.

    If the statistic is greater than the critical value, the null hypothesis is rejected and the series is considered to be non-stationary: the partial sums drift too far, which suggests a unit root.

    Conversely, if the statistic is less than the critical value, the null hypothesis is not rejected and there is no evidence against stationarity. Failing to reject does not prove that the series is stationary.

    In summary, the KPSS test checks stationarity by comparing the cumulative deviations of the series from its level or trend with the overall variability of the series. Because its null hypothesis is the opposite of the ADF null, the two tests are usually read together.
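
The two test statistics described in the list above can be sketched in plain Python. This is a simplified illustration, not what `statsmodels` does internally: the Dickey-Fuller regression here has a constant and no extra lagged differences, the KPSS version tests level stationarity only, and the helper names `dickey_fuller_stat` and `kpss_level_stat`, the seed and the lag choice of 6 are assumptions made for the example. For reference, the asymptotic 5% critical values are about -2.86 for the Dickey-Fuller statistic with a constant and 0.463 for the KPSS level statistic.

```python
import math
import random


def dickey_fuller_stat(y):
    """t-statistic of rho in dy[t] = a + rho * y[t-1] + e[t] (no extra lags)."""
    x = y[:-1]                                   # lagged levels y[t-1]
    dy = [b - a for a, b in zip(y[:-1], y[1:])]  # first differences
    n = len(dy)
    x_mean, dy_mean = sum(x) / n, sum(dy) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (di - dy_mean) for xi, di in zip(x, dy))
    rho = sxy / sxx
    intercept = dy_mean - rho * x_mean
    rss = sum((di - intercept - rho * xi) ** 2 for xi, di in zip(x, dy))
    se_rho = math.sqrt(rss / (n - 2) / sxx)
    return rho / se_rho  # strongly negative values speak against a unit root


def kpss_level_stat(y, n_lags):
    """KPSS statistic for level stationarity with a Bartlett long-run variance."""
    n = len(y)
    mean = sum(y) / n
    resid = [v - mean for v in y]
    # Partial sums of the residuals: they stay small if the series is stationary.
    partial, running = [], 0.0
    for r in resid:
        running += r
        partial.append(running)
    # Newey-West (Bartlett kernel) estimate of the long-run variance.
    long_run_var = sum(r * r for r in resid) / n
    for lag in range(1, n_lags + 1):
        weight = 1.0 - lag / (n_lags + 1.0)
        gamma = sum(resid[t] * resid[t - lag] for t in range(lag, n)) / n
        long_run_var += 2.0 * weight * gamma
    return sum(s * s for s in partial) / (n ** 2 * long_run_var)


rng = random.Random(1)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
random_walk, level = [], 0.0
for shock in white_noise:  # cumulate the same shocks into a random walk
    level += shock
    random_walk.append(level)

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    df_stat = dickey_fuller_stat(series)
    kpss_stat = kpss_level_stat(series, 6)
    print(f"{name}: DF t-stat = {df_stat:.2f}, KPSS stat = {kpss_stat:.3f}")
```

For the white noise series the Dickey-Fuller statistic should fall far below its critical value, while for the random walk the KPSS statistic should be much larger than for the white noise. In practice `adfuller` and `kpss` should be used, since they also provide p-values and lag selection.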
        
        
#### Null hypotheses

* ADF (Augmented Dickey-Fuller) Test: The null hypothesis is that the time series has a unit root and is non-stationary.

* KPSS (Kwiatkowski-Phillips-Schmidt-Shin) Test: The null hypothesis is that the time series is stationary.
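
Because the two null hypotheses point in opposite directions, the p-values are easiest to read as a pair. The helper below is a hypothetical sketch of that reading (the name `classify_stationarity` and the returned labels are assumptions for illustration, and 0.05 is simply the conventional significance level):

```python
def classify_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verbal conclusion."""
    adf_rejects_unit_root = adf_pvalue < alpha        # ADF null: unit root
    kpss_rejects_stationarity = kpss_pvalue < alpha   # KPSS null: stationary

    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "stationary"
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "unit root (non-stationary)"
    if adf_rejects_unit_root and kpss_rejects_stationarity:
        return "conflicting: both nulls rejected"
    return "inconclusive: neither null rejected"


print(classify_stationarity(0.001, 0.1))  # stationary
```

A conflicting or inconclusive result usually means the series needs a closer look, for example for a deterministic trend or a structural break, rather than a mechanical verdict.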

In [2]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller, kpss

from sklearn.linear_model import LinearRegression
from statsmodels.stats.diagnostic import acorr_ljungbox

import scipy.stats as stats
import matplotlib as mpl

# I. SHANNON DIVERSITY INDEX

In [4]:
wd =  './data/alpha_diversity/shannon/'

male_alpha_df = pd.read_csv(wd + 'male_shannon_entropy.csv')#.iloc[:150]
female_alpha_df = pd.read_csv(wd + 'female_shannon_entropy.csv')#.iloc[:150]
donorA_alpha_df = pd.read_csv(wd + 'donorA_shannon_entropy.csv')#.iloc[:150]
donorB_alpha_df = pd.read_csv(wd + 'donorB_shannon_entropy.csv')#.iloc[:150]

datasets = [male_alpha_df, female_alpha_df, donorA_alpha_df, donorB_alpha_df.iloc[:150]]
subjects = ['male', 'female', 'donorA', 'donorB']

### Ljung-Box test for serial correlation

In [8]:
def remove_trend(ts):
    """Subtract a least-squares linear trend (fitted against the index) from ts."""
    lr = LinearRegression()
    X = ts.index.values.reshape(len(ts), 1)  # time index as the single regressor
    lr.fit(X, ts.values)
    trend = lr.predict(X)

    return ts.values - trend


def autocorrelation_presence(ts, alpha=0.05):
    """Print whether the detrended series shows serial correlation (Ljung-Box)."""
    detrended_ts = remove_trend(ts)

    # Ljung-Box test for white noise: one p-value per cumulative lag 1..30
    pvalues = acorr_ljungbox(detrended_ts, lags=30)['lb_pvalue']

    if (pvalues <= alpha).all():
        print('series is autocorrelated')
    elif (pvalues >= alpha).all():
        print('series is not autocorrelated')
    else:
        # Mixed result: the null is rejected at some lags but not at others
        print('series is autocorrelated at some lags only')

In [9]:
for dataset in datasets:
    autocorrelation_presence(dataset)

series is autocorrelated
series is autocorrelated
series is autocorrelated
series is autocorrelated
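
To make the Ljung-Box check used above more concrete, the sketch below computes the Q statistic by hand with the standard library. It is an illustration under assumptions: the helper names, the seed, the AR(1) coefficient of 0.8 and the choice of 10 lags are not taken from this notebook, and instead of a p-value the statistic is compared with the 95% quantile of the chi-square distribution with 10 degrees of freedom (18.307).

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(x, n_lags):
    """Q = n (n + 2) * sum over k = 1..n_lags of r_k^2 / (n - k)."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorrelation(x, k) ** 2 / (n - k) for k in range(1, n_lags + 1)
    )


CHI2_95_DF10 = 18.307  # 95% quantile of chi-square with 10 degrees of freedom

rng = random.Random(2)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(300)]
ar1 = [0.0]
for _ in range(299):  # AR(1) series: each value keeps 80% of the previous one
    ar1.append(0.8 * ar1[-1] + rng.gauss(0.0, 1.0))

for name, series in [("white noise", white_noise), ("AR(1), phi=0.8", ar1)]:
    q = ljung_box_q(series, 10)
    print(f"{name}: Q = {q:.1f}, exceeds 5% critical value: {q > CHI2_95_DF10}")
```

`acorr_ljungbox(..., lags=30)` reports this cumulative statistic and its p-value for every lag from 1 to 30, which is why `autocorrelation_presence` inspects a whole column of p-values.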


In [10]:
def test_unit_root(ts, subject):
    """Return the ADF and KPSS p-values for one subject as a two-row DataFrame."""
    # ADF is run on the raw series; null hypothesis: unit root (non-stationary)
    result_ADF = adfuller(ts, maxlag=30)
    # KPSS is run on the log-transformed series; null hypothesis: stationary
    result_KPSS = kpss(np.log(ts), nlags=30)

    # Element 1 of each result tuple is the p-value
    unit_root_df = pd.DataFrame({
        'pvalue': np.round([result_ADF[1], result_KPSS[1]], 3),
        'test': ['ADF', 'KPSS'],
        'subject': subject,
    })

    return unit_root_df


DF = []
for dataset, subject in zip(datasets, subjects):
    res_df = test_unit_root(dataset, subject)
    DF.append(res_df)
    
UNIT_ROOT_RESULTS_DF = pd.concat(DF)

InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. (raised by `kpss` four times, once per subject)



In [11]:
UNIT_ROOT_RESULTS_DF

| | pvalue | test | subject |
|---|---|---|---|
| 0 | 0.0 | ADF | male |
| 1 | 0.1 | KPSS | male |
| 0 | 0.001 | ADF | female |
| 1 | 0.1 | KPSS | female |
| 0 | 0.0 | ADF | donorA |
| 1 | 0.1 | KPSS | donorA |
| 0 | 0.0 | ADF | donorB |
| 1 | 0.1 | KPSS | donorB |


# II. FAITH'S PD INDEX

In [5]:
#faiths
wd =  './data/alpha_diversity/faiths_pd/'

male_alpha_df = pd.read_csv(wd + 'male_faiths_pd.csv')
female_alpha_df = pd.read_csv(wd + 'female_faiths_pd.tsv', sep='\t', index_col = [0])#.iloc[40:].reset_index(drop=True)
donorA_alpha_df = pd.read_csv(wd + 'donorA_faiths_pd.tsv', sep='\t', index_col = [0])
donorB_alpha_df = pd.read_csv(wd + 'donorB_faiths_pd.tsv', sep='\t', index_col = [0])

datasets = [male_alpha_df, female_alpha_df, donorA_alpha_df, donorB_alpha_df]
subjects = ['male', 'female', 'donorA', 'donorB']

In [13]:
for dataset in datasets:
    autocorrelation_presence(dataset)

series is autocorrelated
series is autocorrelated
series is autocorrelated
series is autocorrelated


In [15]:
DF = []
for dataset, subject in zip(datasets, subjects):
    res_df = test_unit_root(dataset, subject)
    DF.append(res_df)
    
UNIT_ROOT_RESULTS_DF = pd.concat(DF)
UNIT_ROOT_RESULTS_DF

InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. (raised by `kpss` twice)



| | pvalue | test | subject |
|---|---|---|---|
| 0 | 0.0 | ADF | male |
| 1 | 0.019 | KPSS | male |
| 0 | 0.002 | ADF | female |
| 1 | 0.075 | KPSS | female |
| 0 | 0.0 | ADF | donorA |
| 1 | 0.1 | KPSS | donorA |
| 0 | 0.0 | ADF | donorB |
| 1 | 0.1 | KPSS | donorB |
