### Project: Jellyfish

Welcome to the Jellyfish Python module - your one-stop solution for automatic statistical testing! Designed to streamline your data analysis process, Jellyfish takes the hassle out of selecting the appropriate statistical test by intelligently identifying the best method based on your data's characteristics.

Performing statistical tests can be a daunting task, especially when dealing with diverse datasets and various test assumptions. However, with Jellyfish, you can say goodbye to manual test selection and confidently make data-driven decisions. Our module automatically detects the nature of your data, checks for any deviations from assumptions, and seamlessly applies the most suitable statistical test.

One of the key features of Jellyfish is its adaptive approach to hypothesis testing. For instance, when two groups do not have equal variances, the module is designed to switch to Welch's t-test, so the results stay reliable without you having to check the assumption yourself.

Whether you're dealing with simple two-sample tests or more complex ANOVA scenarios, Jellyfish aims to have you covered. It is meant to handle a range of test scenarios and present you with meaningful results to support your research, experiments, or decision-making processes.

By using Jellyfish, you not only save time and effort but also enhance the rigor of your statistical analyses. Our goal is to empower researchers, data analysts, and scientists to focus on their core work while knowing that they can rely on Jellyfish for robust statistical testing.

In this documentation, we'll walk you through the easy-to-use functions and demonstrate how Jellyfish automatically adapts to different situations, providing you with a deeper understanding of your data with minimal coding.

Let Jellyfish be your trusted companion in statistical analysis, and let's embark on a journey of effortless and accurate data exploration together! Happy testing!




# Version 0.0.1
**Implemented**: 
- project name and intro
- chi-square test of independence or Fisher's exact test, chosen according to cell sizes (the exact test is meant for small samples; for large samples it currently fails)

**To Do**: 
- test the chi-square function with a sample dataset from seaborn
- chi-square: fix the OverflowError "integer division result too large for a float"
- reformat script according to PEP 8 guidelines

**Planned in next step**: 
- t-tests for independent samples (with Levene test and normality/outlier checks to choose Welch's t-test or Mann-Whitney)
- t-tests for dependent samples (with normality/outlier checks to choose Wilcoxon)
- ANOVA (with Box's M and normality checks to choose Welch's ANOVA or Kruskal-Wallis)
- linear regression (Gauss-Markov assumptions and robust testing / bootstrap)
- correlation (with linearity check to choose Spearman or log-transformations)

### Sample Data

In [44]:
#Sample Data
import seaborn as sns
import pandas as pd

df = pd.read_excel("TestData.xlsx")

df2 = sns.load_dataset("penguins")
df2.head(5)

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### Documentation for perfect_chi

This code defines a function perfect_chi that takes a pandas DataFrame table and two strings, column1 and column2, which name two columns of that DataFrame. The function performs a chi-square test of independence or Fisher's exact test, depending on the expected and observed cell counts of the cross table. It prints a summary of the results; in the current version it does not return a value.

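The first few lines of the function import the necessary packages: pandas, numpy, scipy.stats, math, warnings, and deepcopy from copy. The warnings package is used to suppress warning messages that might be produced during the calculations. Note that warnings.filterwarnings("ignore") changes the warning filter for the whole session, not only inside the function.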
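If warnings should be silenced only while the test runs, a scoped alternative may be preferable. The sketch below uses the standard library's warnings.catch_warnings context manager; noisy_calculation and quiet_calculation are hypothetical stand-ins for the statistical calls and are not part of Jellyfish.

```python
import warnings


def noisy_calculation():
    """Hypothetical stand-in for a computation that emits a warning."""
    warnings.warn("expected frequency is small", RuntimeWarning)
    return 42


def quiet_calculation():
    # Suppress warnings only inside this block; the previous filters
    # are restored automatically when the block exits.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return noisy_calculation()


result = quiet_calculation()
print(result)
```

With this pattern, warnings raised elsewhere in the session remain visible.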

The function then uses the pd.crosstab function to create a contingency table called beobachtet (German for "observed"), which shows the number of occurrences of each unique combination of values in column1 and column2. Because margins=True is set, the table also includes a row and a column of totals, both labelled total.
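As a minimal illustration of this step, the following sketch builds such a table from a small made-up DataFrame. The column names colour and size and their values are assumptions for the example only.

```python
import pandas as pd

# Made-up example data; not taken from TestData.xlsx.
df_example = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "blue", "green"],
    "size": ["S", "L", "S", "S", "L", "L"],
})

# margins=True appends a row and a column of totals,
# margins_name sets the label used for both.
observed = pd.crosstab(df_example["colour"], df_example["size"],
                       margins=True, margins_name="total")
print(observed)
```

The cell in the total row and total column holds the number of rows that have a value in both columns.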

Next, the function creates a few variables that will be used later in the calculations. These include chi_square, which accumulates the chi-square statistic, alpha, which is the significance level for the test (0.05), and rows and columns, which hold the categories of column1 and column2, respectively. The function also stores the smallest observed cell count in min_cell and creates a zero-filled DataFrame called erwartet (German for "expected") to hold the expected values for the chi-square test.

The function then uses a nested loop over every combination of values in column1 and column2. The observed value O of each cell is read from the beobachtet table, while the expected value is calculated using the formula E = (row total * column total) / grand total and stored in the erwartet DataFrame. The chi-square statistic, the sum of (O - E)^2 / E over all cells, is accumulated in the same loop.
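The same quantities can be computed without an explicit loop. The sketch below is an alternative formulation, not the code used in perfect_chi. It uses the V1 x V2 counts from the test output further down and cross-checks the result against scipy.stats.chi2_contingency.

```python
import numpy as np
from scipy import stats

# Observed counts of the V1 x V2 cross table (without totals).
observed = np.array([[59, 63, 61, 53],
                     [61, 35, 49, 53],
                     [57, 54, 48, 42]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# E = (row total * column total) / grand total for every cell at once.
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = ((observed - expected) ** 2 / expected).sum()

# Cross-check with SciPy. correction=False switches off Yates'
# continuity correction, which SciPy only applies to 2x2 tables anyway.
scipy_stat, scipy_p, dof, scipy_expected = stats.chi2_contingency(
    observed, correction=False)

print(round(expected.min(), 3), round(chi_square, 3), round(scipy_p, 3))
```

The outer product of the row and column totals yields all expected counts in one step, which avoids cell-by-cell assignment into a DataFrame.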

The function then drops the last column and last row of the erwartet DataFrame, since they contain the total values that were used to calculate the expected values.

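The function then checks the minimum expected value in the erwartet DataFrame. If this value is less than 5, or if any observed cell count is 0, the function uses Fisher's exact test instead of the chi-square test. Otherwise it uses the chi-square test. When the minimum expected value is below 5, the function also reports the size of the table and recommends collapsing tables larger than 5x5, because the exact calculation becomes slow for large tables.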
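A compact sketch of the selection rule as the code implements it: a minimum expected count below 5, or any empty observed cell, leads to Fisher's exact test. choose_test is a hypothetical helper introduced for illustration. expected_freq from scipy.stats.contingency is used here instead of the hand-written loop.

```python
import numpy as np
from scipy.stats.contingency import expected_freq


def choose_test(observed, min_expected=5):
    """Return "fisher" or "chi-square" for a table of observed counts."""
    observed = np.asarray(observed)
    expected = expected_freq(observed)
    if expected.min() < min_expected or observed.min() == 0:
        return "fisher"
    return "chi-square"


# Counts from the two TestData cross tables shown in the test section.
v1_v2 = [[59, 63, 61, 53], [61, 35, 49, 53], [57, 54, 48, 42]]
v1_v3 = [[0, 0, 1], [1, 0, 1], [2, 3, 1]]

print(choose_test(v1_v2))
print(choose_test(v1_v3))
```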

If Fisher's exact test is used, the function defines an internal helper function called _dfs, which recursively enumerates every table that has the same row and column totals as the observed one. The helper does not return a value. Instead, it adds the probability of each table that is no more likely than the observed table to a running total, and that total becomes the p-value.

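Finally, a second internal helper, fisher_exact, computes the probability of the observed table, calls _dfs, and returns the p-value, which the function prints together with the cross table. If the chi-square test is used instead, the function calculates the p-value from the chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom, compares it with alpha, and prints the results of the test.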
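For a 2x2 table the enumeration reduces to a single free cell, which makes the idea easy to see. The sketch below is a simplified illustration, not the recursive helper used in perfect_chi. It sums the hypergeometric probabilities of all tables that are no more likely than the observed one and compares the result with scipy.stats.fisher_exact. The function name fisher_2x2 and the example counts are made up.

```python
from scipy.stats import fisher_exact, hypergeom


def fisher_2x2(table):
    """Two-sided Fisher p-value for a 2x2 table by direct enumeration."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    # With fixed margins, the top-left cell follows a hypergeometric
    # distribution: population n, row1 "successes", col1 draws.
    dist = hypergeom(n, row1, col1)
    p_observed = dist.pmf(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is at most as probable as the observed one
    # (a small relative tolerance guards against floating-point ties).
    return sum(dist.pmf(k) for k in range(lowest, highest + 1)
               if dist.pmf(k) <= p_observed * (1 + 1e-7))


example = [[3, 1], [1, 3]]  # made-up counts
p_enumerated = fisher_2x2(example)
_, p_scipy = fisher_exact(example, alternative="two-sided")
print(round(p_enumerated, 4), round(p_scipy, 4))
```

For larger tables more than one cell is free, which is why perfect_chi needs a recursive search and why the run time grows quickly with table size.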

In summary, the perfect_chi function performs a chi-square test or Fisher's exact test on the contingency table of two columns and prints the test decision, the test result, and the cross table.

### Code for perfect_chi


In [39]:
def perfect_chi(table, column1, column2):
    '''Check expected cell counts, then run a chi-square test or Fisher's exact test accordingly.'''
    #import necessary packages
    import pandas as pd
    import numpy as np
    import scipy.stats as stats
    import math
    from copy import deepcopy
    import warnings
    warnings.filterwarnings("ignore")
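    #contingency table of observed counts, including row and column totals
    beobachtet = pd.crosstab(table[column1], table[column2], margins=True, margins_name="total")
    #groundwork
    chi_square = 0
    alpha = 0.05
    #take the categories from the cross table, so values that only occur
    #together with missing data cannot cause a KeyError in the loop below
    rows = beobachtet.index.drop("total")
    columns = beobachtet.columns.drop("total")
    #float table for the expected counts (an integer table could truncate them)
    erwartet = pd.DataFrame(0.0, index=beobachtet.index, columns=beobachtet.columns)
    min_cell = beobachtet.min().min()
    #calculate observed and expected values and fill the table of expected values
    grand_total = beobachtet.loc["total", "total"]
    for i in columns:
        for j in rows:
            O = beobachtet.loc[j, i]
            E = beobachtet.loc["total", i] * beobachtet.loc[j, "total"] / grand_total
            #.loc avoids chained assignment, which newer pandas versions do not write back
            erwartet.loc[j, i] = E
            chi_square += (O-E)**2/E

    #drop the total column and total row again
    erwartet = erwartet.drop(erwartet.columns[-1],axis=1)
    erwartet = erwartet.drop(erwartet.index[-1])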
    #check for min erwartet
    print("the minimum expected value is:",np.round(erwartet.min().min(),3),". The minimum absolut value is: ",min_cell)
    if erwartet.min().min() < 5 or min_cell == 0: 
        print("therefore use fishers exact test")
    else: 
        print("therefore use chi square test")
    minerwartet=np.round(erwartet.min().min(),3)
    #check table size, since only up to 5x5 is feasible with exact
    if minerwartet<5:
        eshape = [erwartet.shape[0], erwartet.shape[1]]
        print("the maximum length of the shape of the table is",max(eshape))
        if max(eshape) > 5:
            print("please consider collapsing table, only a max of 5x5 should be calculated")
            print("the larger the table, the longer it takes to calculate the fishers exact test")
        else:
            print("shape is small enough for quick usage of fishers exact test")
    #calculate test statistics
    #chi-square
    if minerwartet>=5 and min_cell>0:
        print("p-value calculation of chi-square starting")
        #survival function (1 - cdf) with (rows-1)*(columns-1) degrees of freedom
        p_value = stats.chi2.sf(chi_square, (len(rows)-1)*(len(columns)-1))
        conclusion = "failed to reject the null hypothesis."
        if p_value <= alpha:
            conclusion = "null hypothesis is rejected."
                
        print("chi square is:", round(chi_square,3), " and p value is:", round(p_value,3))
        print(conclusion)
        print("the cross table looks like this:")
        print(beobachtet)
    #fishers exact test
    else:
        def _dfs(mat, pos, r_sum, c_sum, p_0, p):
            (xx, yy) = pos
            r, c = len(r_sum), len(c_sum)
            mat_new = deepcopy(mat)

            if xx == -1 and yy == -1:
                for i in range(r-1):
                    mat_new[i][c-1] = r_sum[i] - sum(mat_new[i][:c-1])
                for j in range(c-1):
                    mat_new[r-1][j] = c_sum[j] - sum([mat_new[i][j] for i in range(r-1)])
                temp = r_sum[r-1] - sum(mat_new[r-1][:c-1])
                if temp < 0:
                    return
                mat_new[r-1][c-1] = temp

                p_1 = math.prod([math.factorial(x) for x in r_sum+c_sum])
                n = sum(r_sum)
                p_1 /= math.factorial(n)

                for row in mat_new:
                    for x in row:
                        p_1 /= math.factorial(x)
                if p_1 <= p_0 + 0.00000001:
                    p[0] += p_1
            else:
                max_1 = r_sum[xx] - sum(mat_new[xx])
                max_2 = c_sum[yy] - sum([mat_new[i][yy] for i in range(r)])
                for k in range(min(max_1,max_2)+1):
                    mat_new[xx][yy] = k
                    if xx == r-2 and yy == c-2:
                        pos_new = (-1, -1)
                    elif xx == r-2:
                        pos_new = (0, yy+1)
                    else:
                        pos_new = (xx+1, yy)
                    _dfs(mat_new, pos_new, r_sum, c_sum, p_0, p)

        def fisher_exact(table):
            row_sum = [sum(row) for row in table]
            col_sum = [sum([table[i][j] for i in range(len(table))]) for j in range(len(table[0]))]
            mat = [[0] * len(col_sum) for _ in range(len(row_sum))]
            pos = (0, 0)
            p_0 = math.prod([math.factorial(x) for x in row_sum+col_sum])
            n = sum(row_sum)
            p_0 /= math.factorial(n)
            for row in table:
                for x in row:
                    p_0 /= math.factorial(x)
            p = [0]
            _dfs(mat, pos, row_sum, col_sum, p_0, p)
            return p[0]

        beobachtet = beobachtet.drop(beobachtet.columns[-1],axis=1)
        beobachtet = beobachtet.drop(beobachtet.index[-1])
        print("p-value calculation of fishers exact test starting")
        print("the p-value of fishers exact test is:",round(fisher_exact(beobachtet.values.tolist()),3))
        print("the cross table looks like this:")
        print(beobachtet)

### Testing for perfect_chi()

In [47]:
perfect_chi(df, "V1", "V2")

the minimum expected value is: 46.148 . The minimum absolut value is:  35
therefore use chi square test
p-value calculation of chi-square starting
chi square is: 7.802  and p value is: 0.253
failed to reject the null hypothesis.
the cross table looks like this:
V2       1    2    3    4  total
V1                              
1       59   63   61   53    236
2       61   35   49   53    198
3       57   54   48   42    201
total  177  152  158  148    635


In [48]:
perfect_chi(df, "V1", "V3")

the minimum expected value is: 0.333 . The minimum absolut value is:  0
therefore use fishers exact test
the maximum length of the shape of the table is 3
shape is small enough for quick usage of fishers exact test
p-value calculation of fishers exact test starting
the p-value of fishers exact test is: 0.679
the cross table looks like this:
V3  1.0  2.0  3.0
V1               
1     0    0    1
2     1    0    1
3     2    3    1


In [45]:
perfect_chi(df2, "species", "island")

the minimum expected value is: 10.279 . The minimum absolut value is:  0
therefore use fishers exact test
p-value calculation of fishers exact test starting


OverflowError: integer division result too large for a float
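
This error is raised by the true division of two very large integers: the product of the factorials of the row and column totals, divided by the factorial of n, no longer fits into a float. One possible remedy, sketched here under the assumption that ordinary float precision is acceptable, is to work with log-factorials via math.lgamma. The helper names and the large table are made up for illustration.

```python
import math


def table_probability_naive(table):
    """Probability of a table given its margins, as in fisher_exact above."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    p = math.prod(math.factorial(x) for x in row_sums + col_sums)
    p /= math.factorial(sum(row_sums))  # may raise OverflowError
    for row in table:
        for x in row:
            p /= math.factorial(x)
    return p


def table_log_probability(table):
    """Same quantity on the log scale; lgamma(x + 1) equals log(x!)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    log_p = sum(math.lgamma(x + 1) for x in row_sums + col_sums)
    log_p -= math.lgamma(sum(row_sums) + 1)
    log_p -= sum(math.lgamma(x + 1) for row in table for x in row)
    return log_p


small = [[0, 0, 1], [1, 0, 1], [2, 3, 1]]        # V1 x V3 counts from above
large = [[100, 0, 50], [0, 80, 0], [60, 0, 70]]  # made-up counts

print(table_probability_naive(small), math.exp(table_log_probability(small)))
print(table_log_probability(large))
```

Avoiding the overflow does not make the enumeration itself faster, so large tables may still take a long time.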

In [46]:
perfect_chi(df2, "sex", "island")

the minimum expected value is: 23.288 . The minimum absolut value is:  23
therefore use chi square test
p-value calculation of chi-square starting
chi square is: 0.058  and p value is: 0.972
failed to reject the null hypothesis.
the cross table looks like this:
island  Biscoe  Dream  Torgersen  total
sex                                    
Female      80     61         24    165
Male        83     62         23    168
total      163    123         47    333


# t-test for independent samples

### Sample dataset for perfect_t_ind

In [79]:
#Sample Data
import seaborn as sns
df = sns.load_dataset("penguins")
df.head(5)

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### Documentation for perfect_t_ind


Documentation still to be written; perfect_t_ind is not implemented yet.

### Code for perfect_t_ind


In [81]:
def perfect_t_ind(table, mean_column, group_column):
    #drop rows without a group label on a copy instead of modifying the caller's DataFrame
    table = table.dropna(subset=[group_column])
    groups = table[group_column].unique()
    if len(groups) != 2:
        print("Please reconsider test-decision. The t-test for independent samples can only be calculated for 2 groups")
    else:
        print("The two compared groups are: ", groups[0], "and: ", groups[1])



Procedure: 
- Check the number of groups
- Compute and print descriptive statistics for each group
- Run Levene's test as an assumption check
- Run the Shapiro-Wilk test as an assumption check
- Compute the appropriate test
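
The steps listed above could be wired together roughly as follows. This is only a sketch of one possible decision order, not the planned implementation. The function name choose_two_sample_test, the alpha threshold, and the synthetic data are assumptions. The SciPy functions shapiro, levene, ttest_ind and mannwhitneyu are used with their default settings, and the sketch starts from two groups that have already been separated.

```python
import numpy as np
from scipy import stats


def choose_two_sample_test(a, b, alpha=0.05):
    """Pick and run a two-sample test based on assumption checks."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    # Descriptive statistics per group.
    for name, x in (("group 1", a), ("group 2", b)):
        print(name, "n =", x.size, "mean =", round(x.mean(), 3),
              "sd =", round(x.std(ddof=1), 3))

    # Assumption checks: normality per group, then equal variances.
    normal = all(stats.shapiro(x).pvalue > alpha for x in (a, b))
    equal_var = stats.levene(a, b).pvalue > alpha

    if not normal:
        return "mann-whitney", stats.mannwhitneyu(a, b).pvalue
    if not equal_var:
        return "welch", stats.ttest_ind(a, b, equal_var=False).pvalue
    return "student", stats.ttest_ind(a, b, equal_var=True).pvalue


# Synthetic data: same mean, different spread.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=1, size=40)
group_b = rng.normal(loc=10, scale=4, size=40)

test_name, p_value = choose_two_sample_test(group_a, group_b)
print(test_name, round(p_value, 3))
```

Returning the name of the chosen test together with the p-value keeps the decision visible to the caller, which may be useful once outlier checks are added.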