Here I have collected two scripts written in Python and SQL, designed for analyzing physiological parameters derived from experimental measurements. I created these tools to speed up the statistical analysis in my own case studies by extracting and sorting data from tabular datasets.
I hope they can be useful to you as examples from which to draw inspiration for your own cases.
*** DISCLAIMER *** These scripts were written to help me in my laboratory data analysis work. The examples I used to explain how they work show entirely random values; in any case, all data contained here should be treated as confidential. Thank you for the support.
Both scripts are designed to collect specific data from my dataset (namely, haematological_dataset.db). They differ in where the data are stored: in the haematological_analysis_local_folder case, the data are in a local database located in the project folder (not present in the repository); in the haematological_analysis_host case, the data are in a database located on a host (in my case, localhost).
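As a rough illustration of the local-folder case, the sketch below opens a database file with Python's built-in sqlite3 module and lists its tables. It assumes the .db file is an SQLite database, which is not stated here; the helper names `open_local_db` and `list_tables` are hypothetical and are not taken from the scripts. The host case would use the driver of whichever database server is running and is not shown.

```python
import sqlite3
from pathlib import Path


def open_local_db(db_path):
    """Open an existing SQLite file (assumed format), failing early if it is missing."""
    path = Path(db_path)
    if not path.is_file():
        # sqlite3.connect() would otherwise create a new, empty database file.
        raise FileNotFoundError(f"database not found: {path}")
    return sqlite3.connect(str(path))


def list_tables(conn):
    """Return the names of the tables stored in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

Calling `open_local_db("haematological_dataset.db")` from the project folder and then `list_tables(conn)` is a quick way to check that the expected tables are present before running any analysis.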
The type of dataset I needed to analyze was relatively simple.
In my case, the dataset consisted of haematological analyses performed on blood samples from different subjects (both male and female) at different timepoints (6, 12, 18 and 24 months). These subjects were grouped by genotype (knock-out, KO; heterozygous, HE; and wild-type, WT).
The haematological parameters measured are: red blood cell count, rcb; haemoglobin level, hgb; haematocrit, hct; mean corpuscular volume, mcv; mean corpuscular haemoglobin, mch; mean corpuscular haemoglobin concentration, mchc; red cell distribution width - standard deviation, rdw_sd; reticulocyte number, ret_num; reticulocyte percentage, ret_perc; platelet count, plt; white blood cell count, wbc; reticulocyte haemoglobin content, ret_he.
An example of such a dataset is shown in Figure 1, left panel.
My goal was to group the data at each timepoint by parameter (see Figure 1, right panels), in order to perform analysis of variance and post hoc tests on the three genotypes using GraphPad Prism.
Given the number of haematological parameters measured, the timepoints considered and the potentially significant differences between males and females, this process used to be time-consuming (approximately two hours per analysis).
Figure 1
These scripts were created to automate the data collection process from my dataset, covering all the steps I previously performed manually, from the Excel file to the GraphPad Prism analysis.
Data are collected at the timepoint entered as input when the script starts and grouped by haematological parameter (in my case, all parameters).
Next, the algorithm runs the Shapiro-Wilk test to check whether the data follow a Gaussian distribution and so determine the appropriate type of test (parametric vs. non-parametric); it then calculates the p-value with ANOVA or the Kruskal-Wallis test and runs post hoc tests (Tukey’s or Dunn’s, depending on the data distribution) for multiple comparisons.
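The collection step described above could look roughly like the sketch below, which pulls one parameter at one timepoint and groups the values by genotype. The table name `measurements`, the column names `sex`, `genotype` and `timepoint`, and the function names are assumptions made for illustration, as are the rows in the demo database, which are made-up numbers. The real schema may differ.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema: one row per blood sample, one column per parameter.
ALLOWED_PARAMETERS = {"hgb", "hct"}  # shortened list for this sketch


def collect_by_genotype(conn, timepoint, parameter, sex=None):
    """Return {genotype: [values]} for one parameter at one timepoint."""
    # Column names cannot be bound as SQL parameters, so validate them first.
    if parameter not in ALLOWED_PARAMETERS:
        raise ValueError(f"unknown parameter: {parameter}")
    query = (
        f"SELECT genotype, {parameter} FROM measurements "
        f"WHERE timepoint = ? AND {parameter} IS NOT NULL"
    )
    args = [timepoint]
    if sex is not None:
        # Optional filter, since males and females may need separate analyses.
        query += " AND sex = ?"
        args.append(sex)
    query += " ORDER BY genotype, rowid"
    groups = defaultdict(list)
    for genotype, value in conn.execute(query, args):
        groups[genotype].append(value)
    return dict(groups)


def build_demo_connection():
    """In-memory database filled with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements ("
        "sex TEXT, genotype TEXT, timepoint INTEGER, hgb REAL, hct REAL)"
    )
    rows = [
        ("M", "KO", 6, 13.1, 40.2),
        ("F", "KO", 6, 12.8, 39.5),
        ("M", "WT", 6, 14.0, 42.1),
        ("F", "HE", 12, 13.5, 41.0),
        ("F", "WT", 6, None, 41.7),  # missing hgb value is skipped by the query
    ]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?)", rows)
    return conn
```

For example, `collect_by_genotype(build_demo_connection(), 6, "hgb")` returns one list of values per genotype, which is the shape the statistical tests expect as input.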
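A minimal sketch of the test-selection logic, using SciPy, is given below. It assumes that Shapiro-Wilk is applied to each genotype group separately and that a 0.05 threshold is used; neither detail is specified here, and the function name `compare_genotypes` is hypothetical. The post hoc step is left out: recent SciPy versions provide Tukey's test as `scipy.stats.tukey_hsd`, while Dunn's test needs a separate package (scikit-posthocs is one option).

```python
from scipy import stats

ALPHA = 0.05  # assumed significance threshold


def compare_genotypes(groups, alpha=ALPHA):
    """Choose ANOVA or Kruskal-Wallis based on Shapiro-Wilk normality checks.

    `groups` maps a genotype label to a list of measurements
    (at least three values per group, as required by Shapiro-Wilk).
    """
    samples = list(groups.values())
    # Treat the data as Gaussian only if no group rejects normality.
    normal = all(stats.shapiro(sample).pvalue > alpha for sample in samples)
    if normal:
        test_name = "ANOVA"
        p_value = stats.f_oneway(*samples).pvalue
    else:
        test_name = "Kruskal-Wallis"
        p_value = stats.kruskal(*samples).pvalue
    return {"normal": normal, "test": test_name, "p_value": float(p_value)}
```

The returned dictionary records which test was used, so a report can state the test alongside the p-value for each parameter.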
At this stage of the algorithm, a report displaying p-values and post hoc test results for each parameter is printed in the terminal (see Figure 2).
Figure 2
In addition, a quick preview of the graph for each parameter is displayed to give an overview of the data distribution in each case (see Figure 3).
Figure 3
Using these scripts, I was able to save several hours of unproductive work. 🥰