# Playground (Sandbox)

This is a safe space to **experiment**. It generates a small synthetic dataset so you can practise plotting and simple analyses used in FB2NEP.

- Run the setup cells from top to bottom.  
- Change the parameters (like sample size) and re-run to see what happens.

> **Note:** This Sandbox uses **small synthetic data**. Avoid uploading large files (over roughly 50–100 MB) to Colab; for those, mount Google Drive and read the file in chunks.
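
Reading in chunks can look like the sketch below. It is a minimal illustration, not part of the FB2NEP materials: the file name `big_file.csv`, the chunk size of 250 rows, and the simulated `sbp` column are all assumptions made so the example runs by itself.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for a large file: a small CSV written to a temp folder.
rng = np.random.default_rng(11088)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "big_file.csv")
pd.DataFrame({"sbp": rng.normal(125, 15, 1000).round(1)}).to_csv(csv_path, index=False)

# Read 250 rows at a time and keep only running totals in memory.
n_rows = 0
sbp_sum = 0.0
for chunk in pd.read_csv(csv_path, chunksize=250):
    n_rows += len(chunk)
    sbp_sum += chunk["sbp"].sum()

chunked_mean = sbp_sum / n_rows
print("Rows read:", n_rows, "| mean SBP:", round(chunked_mean, 2))
```

With a real file on Drive, `csv_path` would point at the mounted folder instead, and a larger `chunksize` is usually sensible.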

In [None]:
# Reset & verify environment (run first if anything is odd)
import sys, numpy as np, pandas as pd, matplotlib
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__, "| pandas:", pd.__version__, "| Matplotlib:", matplotlib.__version__)
SEED = 11088  # FB2NEP reproducibility convention
np.random.seed(SEED)

In [None]:
# Setup: import required libraries for the sandbox
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

pd.set_option("display.max_columns", 50)

In [None]:
# If a package is missing in Colab, uncomment and run, then Runtime → Restart runtime
# %pip -q install statsmodels scipy

## 1) Parameters — change these and re-run

In [None]:
# ▶ Try changing N or the effect sizes and re-run the next cells
N = 400          # sample size
RATIO_F = 0.6    # fraction female
GROUP_EFFECT = -3.5  # mean SBP difference (B minus A), in mmHg

np.random.seed(SEED)

## 2) Generate a simple dataset

In [None]:
ages = np.random.normal(45, 12, N).round(1)
bmi  = np.random.normal(26, 4, N).round(1)

sex = np.where(np.random.rand(N) < RATIO_F, "F", "M")
group = np.where(np.random.rand(N) < 0.5, "A", "B")

# SBP depends on age, BMI, sex, and group (group B's mean is shifted by GROUP_EFFECT; negative = lower)
base = 110 + 0.35*ages + 0.9*bmi + np.where(sex=="M", 4.0, 0.0)
sbp = base + np.where(group=="B", GROUP_EFFECT, 0.0) + np.random.normal(0, 8, N)

# Total cholesterol (mmol/L) loosely related to age/BMI
chol = 3.8 + 0.015*ages + 0.05*bmi + np.random.normal(0, 0.4, N)

# Binary outcome: high SBP (≥140 mmHg), computed on the rounded value stored in df
high_sbp = (sbp.round(1) >= 140).astype(int)

df = pd.DataFrame({
    "age": ages,
    "bmi": bmi,
    "sex": sex,
    "group": group,
    "sbp": sbp.round(1),
    "chol": chol.round(2),
    "high_sbp": high_sbp
})
df.head()

## 3) Quick exploration

Run the following cells; then try changing parameters above (e.g. `N`, `GROUP_EFFECT`) and re-run.

In [None]:
print("Shape:", df.shape)
df.info()

In [None]:
df.describe(include="all")

In [None]:
df['sex'].value_counts(), df['group'].value_counts()

In [None]:
df.isna().mean()  # missingness per column

## 4) Plots

In [None]:
# Histogram of SBP
df['sbp'].hist(bins=25)
plt.xlabel("SBP (mmHg)"); plt.ylabel("Count"); plt.title("SBP distribution")
plt.tight_layout()
plt.show()

In [None]:
# Boxplot by group
df.boxplot(column="sbp", by="group")
plt.suptitle("")
plt.title("SBP by group"); plt.xlabel("group"); plt.ylabel("SBP (mmHg)")
plt.tight_layout()
plt.show()

In [None]:
# Scatter: BMI vs SBP
plt.scatter(df['bmi'], df['sbp'])
plt.xlabel("BMI"); plt.ylabel("SBP (mmHg)"); plt.title("BMI vs SBP")
plt.tight_layout()
plt.show()

## 5) Table 1-style summary

In [None]:
# Continuous variables by group
cont = df.groupby('group')[['age','bmi','sbp','chol']].agg(['mean','std','count'])
cont

In [None]:
# Categorical variables by group (row proportions; the 'All' row is the overall split)
pd.crosstab(df['group'], df['sex'], margins=True, normalize='index')

## 6) Basic hypothesis tests

In [None]:
# Welch's two-sample t-test (equal_var=False): SBP between A and B
a = df.loc[df['group']=='A','sbp']
b = df.loc[df['group']=='B','sbp']
stats.ttest_ind(a, b, equal_var=False)

In [None]:
# Chi-square test: sex distribution by group
tab = pd.crosstab(df['group'], df['sex'])
tab, stats.chi2_contingency(tab)[:2]  # (table, (chi2, p))

## 7) Simple models

We’ll use `statsmodels` with formula syntax. `C(var)` treats a variable as categorical.
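
To see what `C(var)` changes, the sketch below fits the same toy model twice, once with a numerically coded variable left as a number and once wrapped in `C()`. The dataset and the column name `arm` are invented for illustration and are not part of the sandbox data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny hypothetical dataset: 'arm' is a category stored as the numbers 1 and 2.
toy = pd.DataFrame({
    "y":   [120.0, 124.0, 118.0, 121.0, 130.0, 127.0, 125.0, 122.0],
    "arm": [1, 1, 2, 2, 1, 1, 2, 2],
    "age": [40, 45, 40, 45, 60, 55, 60, 55],
})

# Without C(): 'arm' enters as a numeric slope.
fit_numeric = smf.ols("y ~ age + arm", data=toy).fit()

# With C(): 'arm' is dummy-coded, with level 1 as the reference.
fit_categorical = smf.ols("y ~ age + C(arm)", data=toy).fit()

print(list(fit_numeric.params.index))
print(list(fit_categorical.params.index))
```

Text columns such as `sex` and `group` are treated as categorical by the formula interface anyway, so `C()` matters most when categories are stored as numbers; writing it explicitly keeps the intent clear.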

In [None]:
# OLS: SBP ~ age + BMI + sex + group
ols = smf.ols("sbp ~ age + bmi + C(sex) + C(group)", data=df).fit()
ols.summary()

In [None]:
# Logistic regression: high_sbp (0/1) ~ predictors
logit = smf.logit("high_sbp ~ age + bmi + C(sex) + C(group)", data=df).fit(disp=False)
logit.summary()
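
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below shows this on a separate simulated dataset; the names `toy`, `event`, and `toy_logit`, and the assumed effect of 0.05 log-odds per year of age, are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11088)
n = 500
toy = pd.DataFrame({"age": rng.normal(45, 12, n)})

# Assumed true model: log-odds of the event rise by 0.05 per year of age.
p = 1 / (1 + np.exp(-(-2.5 + 0.05 * toy["age"])))
toy["event"] = (rng.random(n) < p).astype(int)

toy_logit = smf.logit("event ~ age", data=toy).fit(disp=False)

# Exponentiate coefficients and their confidence limits to get odds ratios.
odds_ratios = np.exp(toy_logit.params)
or_ci = np.exp(toy_logit.conf_int())
or_ci.columns = ["2.5%", "97.5%"]
print(pd.concat([odds_ratios.rename("OR"), or_ci], axis=1).round(3))
```

The same two `np.exp(...)` lines can be applied to the `logit` model fitted above.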

## 8) Data I/O helpers (optional, Colab)

In [None]:
# Upload a CSV from your computer (Colab only). Note: this overwrites the synthetic df.
# from google.colab import files
# up = files.upload()  # pick a file
# import pandas as pd
# df = pd.read_csv(next(iter(up.keys())))
# df.head()

In [None]:
# Mount your Google Drive (persistent files across sessions)
# from google.colab import drive
# drive.mount('/content/drive')
# Example save path (the fb2nep folder must already exist in your Drive):
# df.to_csv('/content/drive/MyDrive/fb2nep/sandbox_output.csv', index=False)

## 9) Export your work

In [None]:
# Save your sandbox dataset/results locally in this session
df.to_csv("sandbox_output.csv", index=False)
print("Saved as sandbox_output.csv. In Colab, open the Files panel (left) → three dots → Download.")

## 10) Your turn — try these

- Change `GROUP_EFFECT` to `+2.0` (so group B has **higher** SBP) and re-generate. What happens to the t-test and model coefficients?  
- Increase `N` to 2000. Do p-values change? Why?  
- Add a `C(group):C(sex)` interaction to the OLS formula. Is its coefficient distinguishable from zero, given that the simulated SBP has no interaction built in?  
- Create a new variable `waist = 2.5*bmi + noise` and see how it relates to SBP.
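
As a starting point for the last exercise, here is one possible sketch. It builds its own stand-in data so it runs by itself; in the notebook you would add the column to the existing `df` instead. The noise standard deviation of 5 and the name `demo` are assumptions, not values from the course.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11088)

# Stand-in data so the sketch is self-contained; in the notebook, use df.
n = 400
demo = pd.DataFrame({"bmi": rng.normal(26, 4, n).round(1)})
demo["sbp"] = 110 + 0.9 * demo["bmi"] + rng.normal(0, 8, n)

# waist = 2.5*bmi + noise; the noise SD of 5 is an assumed value.
demo["waist"] = 2.5 * demo["bmi"] + rng.normal(0, 5, n)

# Pairwise correlations show how waist tracks BMI and, through it, SBP.
print(demo[["bmi", "waist", "sbp"]].corr().round(2))
```

Because `waist` is built from `bmi`, the two are strongly correlated; try putting both in the OLS formula and watch what happens to their standard errors.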