## Project 1 - Simulated Data Set Analysis 

***

Author: Emma Dunleavy 

Student Number: g00425660

Module: Programming for Data Analysis Winter 2023/24  

Lecturer: Brian McGinley
***

### Required Libraries
***

In [1]:
import pandas as pd # Data frames
import numpy as np # Numerical arrays
import seaborn as sns # Data visualization
import matplotlib.pyplot as plt # Data visualization
import scipy.stats as ss # Probabilistic distributions and statistical operations
from scipy.stats import f_oneway # One-way ANOVA
import statistics as s # for alternative method for generating data

<span style='color:orange'> Question Part 1 - Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables. </span>

Using the results of the Mashhad stroke and heart atherosclerotic disorders (MASHAD) study (2010-2020) [1], I have selected the variables in Table 1 below for analysis, with a focus on systolic blood pressure for people categorised as "Normal weight", "Underweight" and "Obese". In this report the "Normal weight" category is reported as "Healthy weight". The other measured variables are "Age", "Weight", "Height" and "Total energy expenditure". Table 1 summarises the study's statistical findings for these variables. It is a subset of the overall results; follow the link below for a comprehensive listing of the study findings.

[Dietary Intake and Its Relationship to Different Body Mass Index Categories: A Population-Based Study ](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6941633/)

|Variable| 	Healthy weight| 	Underweight |	 Obese|
|:-----|-----|:-----:|-----:|
||**Mean / 	SD**| **Mean 	/ SD** |**Mean	/  SD**|	
|<span style='color:grey'>**Age (yr)**</span> |	47.9 / 8.5|	47.6	/ 8.1|		48.4	/ 7.9	|
|<span style='color:grey'>**Weight(kg)**</span> |	47.9 /	6.5|	48.2 /	8.6|		81.6 /	14.8|
|<span style='color: grey'>**Height (meter)**</span> |	1.6	/0.1	|1.6	/0.1		|1.5 /0.1	|
|<span style='color:grey'>**Total energy expenditure**</span> |	2362.0	/ 341.2	|2380.6	 / 384.6	|  2344.0 /	269.9|
|<span style='color:grey'>**Systolic blood pressure (mmHg)**</span> | 	116.3 /	19.1 |	111.3	/ 20.0	| 	122.6	/ 26.2 |

**<i>Table 1 - Statistical findings of continuous data</i>**

<span style='color:orange'>Part 2: Investigate the types of variables involved, their likely distributions, and their
relationships with each other.</span>

 |Variable|Age (yr) |Weight(kg) | Total energy expenditure   |Systolic blood pressure (mmHg) |Weight Category|
 |---|:---|---|---|---|---|
 |<span style='color:grey'>**Variable type level 1**</span>|numerical|numerical|numerical|numerical|categorical|
 |<span style='color:grey'>**Variable type level 2**</span>|continuous - ratio|continuous - ratio|continuous - ratio|continuous - ratio|nominal|

 **<i>Table 2 - Variable Types</i>**

The study stated *"Descriptive statistics including <kbd>mean ± standard deviation (SD)</kbd> &nbsp; were determined for variables with normal distribution"*. All variables in Table 1 are reported as mean and SD, so for this report they are all assumed to be normally distributed.
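The normality assumption can only be tested directly on raw data, which this report does not have. As a minimal sketch, the check below shows what such a test could look like on a simulated sample. The seed, the sample size of 1000 and the choice of the Shapiro-Wilk test are assumptions made for illustration; the mean and SD are the healthy weight systolic blood pressure values from Table 1.

```python
import numpy as np
import scipy.stats as ss

# Assumption: seed and sample size are illustrative only.
rng = np.random.default_rng(42)
sample = rng.normal(loc=116.3, scale=19.1, size=1000)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed.
w_stat, p_value = ss.shapiro(sample)
print(f"W = {w_stat:.4f}, p = {p_value:.4f}")

# Skewness and excess kurtosis are both expected to be close to 0 for normal data.
sample_skew = ss.skew(sample)
sample_kurtosis = ss.kurtosis(sample)
print(f"skew = {sample_skew:.3f}, excess kurtosis = {sample_kurtosis:.3f}")
```

A large p-value means the test found no evidence against normality; it does not prove the data is normal.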


<span style='color:orange'>Part 3: Synthesise/simulate a data set as closely matching their properties as possible.</span>

## Healthy Weight Data Generation
***

In [2]:
# Define the means and standard deviations for the "healthy weight" variables
hw_age_mean = 47.9
hw_age_sd = 8.5
  
hw_weight_mean = 47.9
hw_weight_sd = 6.5

hw_height_mean = 1.6
hw_height_sd = 0.1

hw_tee_mean = 2362.0
hw_tee_sd = 341.2

hw_bp_mean = 116.3
hw_bp_sd = 19.1

In [3]:
# Generate random data for each variable with the specified standard deviations
hw_age = np.random.normal(hw_age_mean, hw_age_sd, 1000)
hw_weight = np.random.normal(hw_weight_mean, hw_weight_sd, 1000)
hw_height = np.random.normal(hw_height_mean, hw_height_sd, 1000)
hw_tee = np.random.normal(hw_tee_mean, hw_tee_sd, 1000)
hw_bp = np.random.normal(hw_bp_mean, hw_bp_sd, 1000)

In [4]:
print(hw_age.mean())
print(hw_age.std())
print(hw_weight.mean())
print(hw_weight.std())
print(hw_height.mean())
print(hw_height.std())
print(hw_tee.mean())
print(hw_tee.std())
print(hw_bp.mean())
print(hw_bp.std())

47.89518723659546
8.767928094537897
47.57744988762421
6.5878037313254785
1.5998206952145102
0.09786040696820468
2366.6668462667103
336.0559996806164
116.78096590583527
19.17436193948595


|Supplied Mean and Sigma|Output from 1,000 samples|Output from 10,000 samples|Output from 100,000 samples|
|-----|-----|-----|-----|
|hw_age_mean = 47.9|47.81299247251516|47.96596301558934|47.86359295967902|
|hw_age_sd = 8.5|8.600455933348217|8.48339578136973|8.499437770973914|
|hw_weight_mean = 47.9|47.724202758509485|48.016540015644225|47.91576967623339|
|hw_weight_sd = 6.5|6.476830442812602|6.533759642617733|6.492628439093849|
|hw_height_mean = 1.6|1.59992325926602|1.600029710705768|1.6005796952056446|
|hw_height_sd = 0.1|0.09742172592076556|0.10008765585016649|0.09949277850531468|
|hw_tee_mean = 2362.0|2350.9717066226685|2366.4390138020744|2362.369793639186|
|hw_tee_sd = 341.2|340.4809631864144|336.8788866518161|340.5291459830427|
|hw_bp_mean = 116.3|116.39811688031763|116.12013277464749|116.26332160499516|
|hw_bp_sd = 19.1|19.658444735911882|19.055677026091875|19.115876399830324|


![image-2.png](attachment:image-2.png)

In [5]:
# Test healthy weight standard deviation for randomly generated sample compared to provided standard deviation
hw_age.std()

8.767928094537897

In [6]:
# Create a dataset by combining the variables
hw_dataset = np.column_stack((hw_age, hw_height, hw_weight, hw_tee, hw_bp))
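Because each variable is drawn separately with `np.random.normal`, the simulated columns are independent of each other, so any relationship between variables in the real data is not reproduced. The sketch below illustrates this with a correlation matrix. The seed and the short column names are assumptions for illustration; the means and SDs are the healthy weight values from Table 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, for repeatable output
n = 1000

# Each column is drawn independently of the others.
sim = pd.DataFrame({
    "age": rng.normal(47.9, 8.5, n),
    "weight": rng.normal(47.9, 6.5, n),
    "height": rng.normal(1.6, 0.1, n),
    "tee": rng.normal(2362.0, 341.2, n),
    "bp": rng.normal(116.3, 19.1, n),
})

# Pairwise Pearson correlations; off-diagonal values should be close to 0.
corr = sim.corr()
print(corr.round(2))
```

If correlations between variables were known, they would have to be built into the simulation explicitly; they are not given in Table 1.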

In [7]:
# Confirmatory calculation. Show the healthy weight age std dev.
hw_age.std()

8.767928094537897

The smaller the sample size, the further the mean and standard deviation of the randomly generated sample tend to fall from the supplied values (hw_age_mean = 47.9, hw_age_sd = 8.5).

For example, for the age variable:

- a sample size of 100 gave a mean of 48.408 (0.5 away) and a standard deviation of 8.566
- a sample size of 1,000 gave a mean of 47.515 (0.4 away) and a standard deviation of 8.389
- a sample size of 10,000 gave a mean of 47.743 (0.2 away) and a standard deviation of 8.394
- a sample size of 100,000 gave a mean of 47.913 (on target) and a standard deviation of 8.477

Even at 100,000 samples the result is not an exact match, because the numbers are randomly generated using the supplied mean and standard deviation as parameters. By testing different sample sizes I found that increasing the sample size brought the output closer to the supplied values, with 100,000 samples giving the closest result.
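The effect of sample size can be sketched in a short loop. This is an illustration rather than the exact runs reported above: the seed is an assumption, so the printed values will differ from those figures. The expected error of the sample mean is the standard error, `sd / sqrt(n)`.

```python
import numpy as np

rng = np.random.default_rng(2023)  # assumed seed, for repeatable output
target_mean, target_sd = 47.9, 8.5  # healthy weight age values from Table 1

results = []
for n in (100, 1_000, 10_000, 100_000):
    sample = rng.normal(target_mean, target_sd, n)
    standard_error = target_sd / np.sqrt(n)  # typical distance of the sample mean from the target
    results.append((n, sample.mean(), sample.std(ddof=1), standard_error))
    print(f"n={n:>7}: mean={sample.mean():.3f}, sd={sample.std(ddof=1):.3f}, "
          f"standard error={standard_error:.3f}")
```

Note that NumPy's `.std()` defaults to `ddof=0` (population formula) while pandas defaults to `ddof=1` (sample formula), so the two can differ slightly on the same data.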

### Convert Healthy Weight data to a DataFrame

In [8]:
hw_dataframe = pd.DataFrame(hw_dataset, columns = ['hw_age', 'hw_height', 'hw_weight', 'hw_tee', 'hw_bp'])
hw_dataframe

Unnamed: 0,hw_age,hw_height,hw_weight,hw_tee,hw_bp
0,55.507218,1.558117,59.631831,2152.878630,107.431542
1,54.439967,1.682875,40.095944,2534.513766,124.247989
2,42.326951,1.571694,62.062871,2204.233616,86.221905
3,68.173877,1.616693,57.312303,2360.353194,116.942958
4,68.456679,1.741797,37.726058,2516.122577,167.757269
...,...,...,...,...,...
995,49.074771,1.635220,38.590092,2967.346850,115.520704
996,45.429705,1.536528,42.738572,2085.039147,136.278952
997,35.330695,1.615715,39.745991,1880.953789,93.400637
998,43.763956,1.593803,37.344637,2611.318137,78.247543


In [9]:
# Generate the Healthy Weight categorical DataFrame.
hw_cat_df = pd.DataFrame (['healty weight'] * 1000,  columns=['Weight Category'])
hw_dataframe1 = pd.concat([hw_dataframe, hw_cat_df], axis = 1)


In [10]:
# Show the healthy weight dataframe with the categorical column added. 
hw_dataframe1

Unnamed: 0,hw_age,hw_height,hw_weight,hw_tee,hw_bp,Weight Category
0,55.507218,1.558117,59.631831,2152.878630,107.431542,healty weight
1,54.439967,1.682875,40.095944,2534.513766,124.247989,healty weight
2,42.326951,1.571694,62.062871,2204.233616,86.221905,healty weight
3,68.173877,1.616693,57.312303,2360.353194,116.942958,healty weight
4,68.456679,1.741797,37.726058,2516.122577,167.757269,healty weight
...,...,...,...,...,...,...
995,49.074771,1.635220,38.590092,2967.346850,115.520704,healty weight
996,45.429705,1.536528,42.738572,2085.039147,136.278952,healty weight
997,35.330695,1.615715,39.745991,1880.953789,93.400637,healty weight
998,43.763956,1.593803,37.344637,2611.318137,78.247543,healty weight


## Under Weight Data Generation
***

In [11]:
# Define the means and standard deviations for the "underweight" variables
uw_age_mean = 47.6
uw_age_sd = 8.1
  
uw_weight_mean = 48.2
uw_weight_sd = 8.6

uw_height_mean = 1.6
uw_height_sd = 0.1

uw_tee_mean = 2380.6
uw_tee_sd = 384.6

uw_bp_mean = 111.3
uw_bp_sd = 20.0

In [12]:
# Generate random data for each variable with the specified standard deviations
uw_age = np.random.normal(uw_age_mean, uw_age_sd, 1000)
uw_weight = np.random.normal(uw_weight_mean, uw_weight_sd, 1000)
uw_height = np.random.normal(uw_height_mean, uw_height_sd, 1000)
uw_tee = np.random.normal(uw_tee_mean, uw_tee_sd, 1000)
uw_bp = np.random.normal(uw_bp_mean, uw_bp_sd, 1000)

# Create a dataset by combining the variables
uw_dataset = np.column_stack((uw_age, uw_height, uw_weight, uw_tee, uw_bp))


In [13]:
uw_dataframe = pd.DataFrame(uw_dataset, columns = ['uw_age', 'uw_height', 'uw_weight', 'uw_tee', 'uw_bp'])
uw_dataframe

Unnamed: 0,uw_age,uw_height,uw_weight,uw_tee,uw_bp
0,51.948600,1.687361,45.534344,2556.497687,130.489214
1,32.957981,1.673214,39.379667,1881.699514,110.424254
2,54.289372,1.631749,59.976774,3011.871942,122.557381
3,42.060036,1.630132,55.133113,2829.458739,100.873121
4,42.981583,1.587104,40.649827,2540.190061,131.142629
...,...,...,...,...,...
995,49.253628,1.646491,41.221928,2316.930287,66.880357
996,47.605880,1.647830,49.471134,2200.494065,105.467536
997,59.426531,1.606970,52.467595,2449.758454,89.851178
998,43.819987,1.407701,56.633505,2615.095064,118.370160


In [14]:
uw_cat_df = pd.DataFrame (['underweight'] * 1000,  columns=['Weight Category'])
uw_dataframe1 = pd.concat([ uw_dataframe, uw_cat_df], axis = 1)
uw_dataframe1

Unnamed: 0,uw_age,uw_height,uw_weight,uw_tee,uw_bp,Weight Category
0,51.948600,1.687361,45.534344,2556.497687,130.489214,underweight
1,32.957981,1.673214,39.379667,1881.699514,110.424254,underweight
2,54.289372,1.631749,59.976774,3011.871942,122.557381,underweight
3,42.060036,1.630132,55.133113,2829.458739,100.873121,underweight
4,42.981583,1.587104,40.649827,2540.190061,131.142629,underweight
...,...,...,...,...,...,...
995,49.253628,1.646491,41.221928,2316.930287,66.880357,underweight
996,47.605880,1.647830,49.471134,2200.494065,105.467536,underweight
997,59.426531,1.606970,52.467595,2449.758454,89.851178,underweight
998,43.819987,1.407701,56.633505,2615.095064,118.370160,underweight


## Obese Weight Data Generation
***

In [15]:
# Define the means and standard deviations for the "obese" variables
obese_age_mean = 48.4
obese_age_sd = 7.9

obese_weight_mean = 81.6
obese_weight_sd = 14.8

obese_height_mean = 1.5
obese_height_sd = 0.1

obese_tee_mean = 2344.0
obese_tee_sd = 269.9

obese_bp_mean = 122.6
obese_bp_sd = 26.2

# Generate random data for each variable with the specified standard deviations
# Assumption: the data being reverse-engineered is normally distributed
obese_age = np.random.normal(obese_age_mean, obese_age_sd, 1000)
obese_weight = np.random.normal(obese_weight_mean, obese_weight_sd, 1000)
obese_height = np.random.normal(obese_height_mean, obese_height_sd, 1000)
obese_tee = np.random.normal(obese_tee_mean, obese_tee_sd, 1000)
obese_bp = np.random.normal(obese_bp_mean, obese_bp_sd, 1000)

# Create a dataset by combining the variables
obese_dataset = np.column_stack((obese_age, obese_height, obese_weight, obese_tee, obese_bp))
#obese_dataset [:5]
obese_tee.mean()

2346.0948477922675

In [16]:
ob_dataframe = pd.DataFrame(obese_dataset, columns = ['obese_age', 'obese_height', 'obese_weight', 'obese_tee', 'obese_bp'])
ob_dataframe

Unnamed: 0,obese_age,obese_height,obese_weight,obese_tee,obese_bp
0,75.446950,1.382127,80.691533,2188.239274,175.087954
1,48.458515,1.399742,83.378472,2318.505489,154.904479
2,35.348020,1.593376,94.260294,2123.822951,130.700043
3,46.379392,1.540773,69.035306,2432.158722,135.440027
4,54.845051,1.418612,70.366558,1748.616994,137.247756
...,...,...,...,...,...
995,43.638775,1.621331,73.080586,2796.138162,126.973027
996,56.985942,1.508745,72.377856,2229.800657,121.775737
997,53.410072,1.616814,82.582091,2047.314318,110.130099
998,53.031552,1.497449,75.109161,2487.557943,90.416072


In [17]:
obese_cat_df = pd.DataFrame (['obese weight'] * 1000,  columns=['Weight Category'])
obese_dataframe1 = pd.concat([ob_dataframe, obese_cat_df], axis = 1)
obese_dataframe1

Unnamed: 0,obese_age,obese_height,obese_weight,obese_tee,obese_bp,Weight Category
0,75.446950,1.382127,80.691533,2188.239274,175.087954,obese weight
1,48.458515,1.399742,83.378472,2318.505489,154.904479,obese weight
2,35.348020,1.593376,94.260294,2123.822951,130.700043,obese weight
3,46.379392,1.540773,69.035306,2432.158722,135.440027,obese weight
4,54.845051,1.418612,70.366558,1748.616994,137.247756,obese weight
...,...,...,...,...,...,...
995,43.638775,1.621331,73.080586,2796.138162,126.973027,obese weight
996,56.985942,1.508745,72.377856,2229.800657,121.775737,obese weight
997,53.410072,1.616814,82.582091,2047.314318,110.130099,obese weight
998,53.031552,1.497449,75.109161,2487.557943,90.416072,obese weight


### Concatenating data 

In [18]:
hw_age = hw_dataframe['hw_age']
uw_age = uw_dataframe['uw_age']
obese_age = ob_dataframe['obese_age']
age = pd.concat([hw_age, uw_age, obese_age], ignore_index=True)
age_df = pd.DataFrame(age)
type (age_df)

pandas.core.frame.DataFrame

In [19]:
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3000 non-null   float64
dtypes: float64(1)
memory usage: 23.6 KB


In [20]:
hw_weight = hw_dataframe['hw_weight']
uw_weight = uw_dataframe['uw_weight']
obese_weight = ob_dataframe['obese_weight']
weight = pd.concat([hw_weight, uw_weight, obese_weight], ignore_index=True)
weight_df = pd.DataFrame(weight)

In [21]:
hw_height = hw_dataframe['hw_height']
uw_height = uw_dataframe['uw_height']
obese_height = ob_dataframe['obese_height']
height = pd.concat([hw_height, uw_height, obese_height], ignore_index=True)
height_df = pd.DataFrame(height)

In [22]:
hw_tee = hw_dataframe['hw_tee']
uw_tee = uw_dataframe['uw_tee']
obese_tee = ob_dataframe['obese_tee']
tee = pd.concat([hw_tee, uw_tee, obese_tee], ignore_index=True)
tee_df = pd.DataFrame(tee)

In [23]:
hw_bp = hw_dataframe['hw_bp']
uw_bp = uw_dataframe['uw_bp']
obese_bp = ob_dataframe['obese_bp']
bp = pd.concat([hw_bp, uw_bp, obese_bp], ignore_index=True)
bp_df = pd.DataFrame(bp)

In [24]:
combined_continuous = pd.concat([age_df , weight_df,height_df, tee_df, bp_df], axis=1)
combined_continuous.columns = ['Age' , 'Weight', 'Height', 'Total Energy Ex', 'Blood Pressure' ]
combined_continuous

Unnamed: 0,Age,Weight,Height,Total Energy Ex,Blood Pressure
0,55.507218,59.631831,1.558117,2152.878630,107.431542
1,54.439967,40.095944,1.682875,2534.513766,124.247989
2,42.326951,62.062871,1.571694,2204.233616,86.221905
3,68.173877,57.312303,1.616693,2360.353194,116.942958
4,68.456679,37.726058,1.741797,2516.122577,167.757269
...,...,...,...,...,...
2995,43.638775,73.080586,1.621331,2796.138162,126.973027
2996,56.985942,72.377856,1.508745,2229.800657,121.775737
2997,53.410072,82.582091,1.616814,2047.314318,110.130099
2998,53.031552,75.109161,1.497449,2487.557943,90.416072
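The per-group generation and concatenation steps above could also be written more compactly. The sketch below is one possible alternative, not the method used in this notebook: `simulate_group` and `PARAMS` are hypothetical names, the seed is an assumption, and the category labels are illustrative. The (mean, SD) pairs are copied from Table 1.

```python
import numpy as np
import pandas as pd

# (mean, sd) pairs from Table 1, keyed by weight category and variable.
PARAMS = {
    "healthy weight": {"Age": (47.9, 8.5), "Weight": (47.9, 6.5), "Height": (1.6, 0.1),
                       "Total Energy Ex": (2362.0, 341.2), "Blood Pressure": (116.3, 19.1)},
    "underweight": {"Age": (47.6, 8.1), "Weight": (48.2, 8.6), "Height": (1.6, 0.1),
                    "Total Energy Ex": (2380.6, 384.6), "Blood Pressure": (111.3, 20.0)},
    "obese": {"Age": (48.4, 7.9), "Weight": (81.6, 14.8), "Height": (1.5, 0.1),
              "Total Energy Ex": (2344.0, 269.9), "Blood Pressure": (122.6, 26.2)},
}


def simulate_group(label, params, n, rng):
    """Hypothetical helper: draw n independent normal values per variable."""
    frame = pd.DataFrame({name: rng.normal(mean, sd, n) for name, (mean, sd) in params.items()})
    frame["Weight Category"] = label
    return frame


rng = np.random.default_rng(1)  # assumed seed
groups = [simulate_group(label, params, 1000, rng) for label, params in PARAMS.items()]

# ignore_index=True gives one unique 0..2999 index for the stacked rows.
simulated = pd.concat(groups, ignore_index=True)
print(simulated.shape)
print(simulated.groupby("Weight Category")["Blood Pressure"].agg(["mean", "std"]).round(1))
```

Keeping the category label in the same DataFrame as the numeric columns avoids a separate column-wise concatenation later.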


### Generating the Categorical Variable 'Weight Category'

In [25]:
hw = hw_dataframe1['Weight Category']
uw = uw_dataframe1['Weight Category']
obese = obese_dataframe1['Weight Category']
# ignore_index=True gives a unique 0..2999 index so the axis=1 concat in the next cell can align rows
weightcat = pd.concat([hw, uw, obese], ignore_index=True)
weightcat_df = pd.DataFrame(weightcat)
weightcat_df

Unnamed: 0,Weight Category
0,healty weight
1,healty weight
2,healty weight
3,healty weight
4,healty weight
...,...
2995,obese weight
2996,obese weight
2997,obese weight
2998,obese weight


In [26]:
combined_all = pd.concat([combined_continuous, weightcat_df], axis=1)
combined_all
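`pd.concat(..., axis=1)` aligns rows by index label, so it needs each input to have unique labels. Stacking three Series that each run from 0 to 999 without `ignore_index=True` repeats every label three times, which is what triggers an `InvalidIndexError` in a later column-wise concat. A minimal sketch with small hypothetical Series:

```python
import pandas as pd

left = pd.Series(["a"] * 3, name="Weight Category")
right = pd.Series(["b"] * 3, name="Weight Category")

# Row-wise concat keeps the original labels: 0, 1, 2, 0, 1, 2.
stacked = pd.concat([left, right])
print(stacked.index.is_unique)  # False

# ignore_index=True relabels the rows 0..5.
stacked_fixed = pd.concat([left, right], ignore_index=True)
print(stacked_fixed.index.is_unique)  # True

# With unique labels on both sides, the column-wise concat lines up row by row.
values = pd.DataFrame({"x": range(6)})
combined = pd.concat([values, stacked_fixed.to_frame()], axis=1)
print(combined.shape)  # (6, 2)
```

Calling `reset_index(drop=True)` on an already-stacked object has the same effect as `ignore_index=True`.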

## Histogram of continuous variables
***

In [None]:
combined_all.hist(alpha=0.8, bins=100, figsize=(12,8))

## Scatterplots
***

In [None]:
sns.set(style="ticks", palette="deep")
f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))
sns.scatterplot(x = "Blood Pressure", y = "Age", hue = 'Weight Category', data = combined_all, ax=axes[0,1])
sns.scatterplot(x = "Blood Pressure" , y = "Weight"  , hue = 'Weight Category' , data = combined_all, ax=axes[1,1])
sns.scatterplot(x= "Blood Pressure"  , y= "Height", hue = 'Weight Category', data = combined_all, ax=axes[0,0])
sns.scatterplot(x= "Blood Pressure" , y= "Total Energy Ex", hue = 'Weight Category', data = combined_all, ax=axes[1,0])
f.suptitle("Scatterplots of Systolic Blood Pressure against the Other Variables")
plt.show()

## Boxplots
***

In [None]:
sns.set(style="ticks", palette="pastel")
f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))

sns.boxplot(x= 'Weight Category', y= "Age", data = combined_all, palette = "ch:s=.25,rot=-.25",  ax=axes[0,1])
sns.boxplot(x= 'Weight Category', y= "Weight", data = combined_all,  palette = "ch:s=.25,rot=-.25",  ax=axes[1,1])
sns.boxplot(x= 'Weight Category', y= "Height", data = combined_all,  palette = "ch:s=.25,rot=-.25", ax=axes[1,0])
sns.boxplot(x= 'Weight Category', y= "Blood Pressure", data = combined_all, palette = "husl" ,ax=axes[0,0])
f.suptitle("Boxplots of the Simulated Variables by Weight Category")
plt.show()

## Alternative method for generating categorical data

|Variable| |	Healthy weight| 	Underweight |	 Obese| Ratio (for generation of data)|
|-----|-----|-----|-----|---|---|
|<span style='color:green'>**Employment Status**</span>| **Class**||||
||Student             |	2	    |1|	                4|0.002|
||Employed            |	1194    |70|              	721| 0.356|
||Unemployed          |	1064	|51 |                	1991|0.557|
||Retired             |	251	    |13  |            	212	| 0.085|

**<i>Table 3 - Counts of categorical data</i>**
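The ratio column can be derived from the counts by summing each employment class across the three weight categories and dividing by the grand total. A minimal sketch using the counts in the table above:

```python
import pandas as pd

counts = pd.DataFrame(
    {
        "Healthy weight": [2, 1194, 1064, 251],
        "Underweight": [1, 70, 51, 13],
        "Obese": [4, 721, 1991, 212],
    },
    index=["Student", "Employed", "Unemployed", "Retired"],
)

# Total per employment class, then each class as a share of all participants counted.
row_totals = counts.sum(axis=1)
ratios = row_totals / row_totals.sum()
print(ratios.round(3))
```

Computed this way the Student share rounds to 0.001 rather than the 0.002 listed in the table. Presumably the listed value was adjusted so that the four ratios sum to exactly 1, which weighted sampling functions generally expect.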

In [None]:
import random
categories = ['Student', 'Employed', 'Unemployed','Retired']
weights = [0.002, 0.356, 0.557, 0.085]  # Adjust weights based on desired probabilities

# Generate a list of random categories with specified weights
num_samples = 1000
random_employment_list = random.choices(categories, weights=weights, k=num_samples)
employment_list = pd.Series(random_employment_list, dtype="category")
employment_list.cat.categories
employment_list

employment_df = pd.DataFrame(random_employment_list)
employment_df.index.is_unique
employment_df.columns = ['Employment Status']

employment_df
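As a hedged alternative to `random.choices`, the same weighted draw can be made with a seeded NumPy generator, and the result checked against the target ratios with `value_counts(normalize=True)`. The seed is an assumption for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # assumed seed, for repeatable output
categories = ["Student", "Employed", "Unemployed", "Retired"]
weights = [0.002, 0.356, 0.557, 0.085]  # must sum to 1 for rng.choice

# Draw 1000 labels with the given probabilities.
employment = pd.Series(rng.choice(categories, size=1000, p=weights), name="Employment Status")

# Observed share of each category; a category that was never drawn gets 0.
observed = employment.value_counts(normalize=True).reindex(categories, fill_value=0.0)
comparison = pd.DataFrame({"target": weights, "observed": observed.values}, index=categories)
print(comparison)
```

With a probability of only 0.002, the Student category may appear just a handful of times, or not at all, in 1000 draws.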

In [None]:
names = ['Student', 'Employed', 'Unemployed','Retired']
values = [0.1, 4.9, 4.0, 1.0]

plt.figure(figsize=(15, 3))

plt.subplot(131)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.plot(names, values)
plt.suptitle('Count of ')
plt.show()

In [None]:
print(combined_continuous.index.duplicated().any())
print(employment_df.index.duplicated().any())

In [None]:
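combined_continuous.reset_index(drop=True, inplace=True)
weightcat_df.reset_index(drop=True, inplace=True)  # make sure the category index is unique (0..2999) too
employment_df.reset_index(drop=True, inplace=True)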


In [None]:
weightcat_df

In [None]:
employment_df

In [None]:
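# Include employment_df so the 'Employment Status' column used below exists;
# rows without an employment label are left as NaN.
combined_all1 = pd.concat([combined_continuous, weightcat_df.reset_index(drop=True), employment_df], axis=1)
combined_all1

In [None]:
# A categorical x-axis is needed for a boxen plot
sns.catplot(data=combined_all1, x= "Weight Category", y = "Blood Pressure", hue = "Employment Status", kind="boxen")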

Using ```obese_tee.mean()``` to compare the output with the inputted <kbd>mean</kbd> shows a difference between the two values. The main factors that could contribute to the discrepancy are:

1. Sample size: the mean calculated from a sample of 1000 data points will not exactly match the specified population mean (obese_tee_mean). The larger the sample, the closer the sample mean is expected to be to the population mean, but some variation remains.

2. Randomness: even when data is generated from a normal distribution with a specified mean and standard deviation, the process is inherently random. Each run of the code gives a slightly different mean unless the random seed is fixed.

3. Parameter accuracy: if obese_tee_mean or obese_tee_sd were set incorrectly, the calculated mean would differ from the intended one.

4. Calculation error: obese_tee.mean() itself is reliable, but a coding error, or data in obese_tee that is not as expected, would lead to a different result.
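The size of the discrepancy expected from sample size and randomness alone can be estimated with the standard error of the mean, `sd / sqrt(n)`. The sketch below applies it to the obese total energy expenditure values, using the observed mean printed earlier in the notebook (2346.0948...).

```python
import math

specified_mean = 2344.0
specified_sd = 269.9
n = 1000
observed_mean = 2346.0948477922675  # output of obese_tee.mean() shown earlier

# Typical distance between a sample mean and the population mean for this n.
standard_error = specified_sd / math.sqrt(n)

# How many standard errors the observed mean lies from the specified mean.
z = (observed_mean - specified_mean) / standard_error

print(f"standard error = {standard_error:.3f}")
print(f"difference = {observed_mean - specified_mean:.3f} ({z:.2f} standard errors)")
```

A difference of well under one standard error is consistent with ordinary sampling variation, so there is no sign of a parameter or calculation error here.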

## REFERENCES NEEDED!!!!

In [None]:
obese_tee.std()

### t-Test comparing Systolic Blood Pressure for Healthy Weight and Obese Groups
***

In [None]:
ss.ttest_ind(hw_bp, obese_bp)
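The result of `ss.ttest_ind` is a test statistic and a p-value. The sketch below shows one way to read them, under stated assumptions: the data is re-simulated here with an assumed seed so the block runs on its own, a significance level of 0.05 is assumed, and `equal_var=False` (Welch's t-test) is used because the two groups have different standard deviations in Table 1.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(11)  # assumed seed
hw_bp_sim = rng.normal(116.3, 19.1, 1000)     # healthy weight systolic BP
obese_bp_sim = rng.normal(122.6, 26.2, 1000)  # obese systolic BP

# Welch's t-test does not assume the two groups share the same variance.
result = ss.ttest_ind(hw_bp_sim, obese_bp_sim, equal_var=False)

alpha = 0.05  # assumed significance level
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
print("reject H0: the means differ" if result.pvalue < alpha
      else "fail to reject H0: no evidence the means differ")
```

Because the samples are generated from different means, a significant result mainly confirms that the simulation reproduces the difference reported in the study.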

## One-way ANOVA comparing Blood Pressure Variables 

In [None]:
f_oneway(uw_bp, hw_bp, obese_bp)

The null hypothesis is that there is no relationship between weight category and systolic blood pressure, i.e. the mean blood pressure is the same in all three groups.
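A minimal sketch of how the one-way ANOVA result could be compared with this null hypothesis. The data is re-simulated with an assumed seed so the block is self-contained, and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(5)  # assumed seed
uw_bp_sim = rng.normal(111.3, 20.0, 1000)     # underweight
hw_bp_sim = rng.normal(116.3, 19.1, 1000)     # healthy weight
obese_bp_sim = rng.normal(122.6, 26.2, 1000)  # obese

# F statistic: variation between group means relative to variation within groups.
f_stat, p_value = f_oneway(uw_bp_sim, hw_bp_sim, obese_bp_sim)

alpha = 0.05  # assumed significance level
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
print("reject H0: at least one group mean differs" if p_value < alpha
      else "fail to reject H0")
```

A significant ANOVA only says that at least one group mean differs; identifying which pairs differ would need a follow-up comparison such as pairwise t-tests.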

Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

### Alternative method for generating data

In [None]:
n = s.NormalDist(mu=47.9, sigma=8.5)
samples = n.samples(100000, seed=42)  
print(s.mean(samples))  
print(s.stdev(samples)) 

### References

[1] Rashidi AA, Heidari Bakavoli AR, Avan A, Aghasizade M, Ghazizadeh H, Tayefi M, Khayyatzadeh SS, Ebrahimi M, Moohebati M, Safarian M, Nematy M, Sadr-Bazzaz M, Ferns GA, Ghayour Mobarhan M. Dietary Intake and Its Relationship to Different Body Mass Index Categories: A Population-Based Study. J Res Health Sci. 2018 Sep 8;18(4):e00426. PMID: 30728312; PMCID: PMC6941633, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6941633/ accessed on 06 NOV 2023

https://datatofish.com/numpy-array-to-pandas-dataframe/ (visited 08 Nov 2023)

https://www.geeksforgeeks.org/how-to-perform-a-one-way-anova-in-python/ (visited 08 Nov 2023)

https://pandas.pydata.org/docs/user_guide/merging.html (visited 08 Nov 2023)

https://www.sciencedirect.com/science/article/pii/S1047279796001275 (visited 08 Nov 2023)

https://www.geeksforgeeks.org/how-to-combine-two-dataframe-in-python-pandas/ (visited 12 Nov 2023)

https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ (visited 12 Nov 2023)

https://matplotlib.org/stable/tutorials/pyplot.html (visited 12 Nov 2023)

https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette (visited 13 Nov 2023)