### **Method: Stratified Sampling for Population Mean BMI Estimation**

**Sampling Technique:**
Stratified sampling was utilized to estimate the population mean BMI from a dataset pertaining to diabetes health indicators. This method partitions the population into distinct strata based on specific characteristics, enhancing the precision of parameter estimation.

**Sample Size and Selection:**
A sample size of \( n = 50 \) was chosen. This exceeds the common rule-of-thumb minimum of 30 observations, so the normal approximation from the Central Limit Theorem can reasonably be applied to the sample mean.

**Incorporating Finite Population Correction (FPC):**
The Finite Population Correction (FPC), \( 1 - \frac{n}{N} \), was included in the standard error. Because the sample constitutes less than 10% of the overall population, the factor is close to 1 and changes the result only slightly. It matters most when the sample is a large fraction of the population, where omitting it would overstate the standard error.

**Estimation for Continuous Variable - BMI:**
- **Sample Mean $( \bar{x} )$:** The computed sample mean for BMI was 28.1, derived from the selected sample.

**Stratified Sampling Process:**
The population was divided into strata using HighBP as the stratification variable, because it exhibited the largest between-strata variation in BMI among the candidate variables examined.

**Standard Error (SE) of Sample Mean:**
The standard error for the sample mean was determined using the formula:
\[
SE(\bar{x}) = \sqrt{(1 - \frac{n}{N}) \cdot \frac{s^2}{n}}
\]
where \( s \) is the sample standard deviation, \( n \) is the sample size, and \( N \) is the population size.
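The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `se_mean_fpc` and the inputs `s = 6.0` and `N = 100_000` are hypothetical, chosen only to show how little the correction changes the result when \( n / N \) is small.

```python
import math

def se_mean_fpc(s, n, N):
    """Standard error of a sample mean with the finite population correction."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return math.sqrt((1 - n / N) * s**2 / n)

# Hypothetical inputs: s and N are illustrative, not values from the study.
se_with_fpc = se_mean_fpc(s=6.0, n=50, N=100_000)
se_without_fpc = 6.0 / math.sqrt(50)

# With n/N = 0.0005 the two values differ only in the fourth decimal place.
print(round(se_with_fpc, 4), round(se_without_fpc, 4))
```

When the whole population is sampled (\( n = N \)), the function returns 0, which matches the intuition that a census has no sampling error.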

**Confidence Interval for BMI:**
The 95% confidence interval for the population mean BMI was determined using the formula:
\[
CI = \bar{x} \pm z \cdot SE(\bar{x})
\]
The calculated confidence interval was found to be [26.41, 29.79].
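As a quick consistency check, the interval can be reproduced from the reported mean. The sketch below assumes a normal critical value (about 1.96 at the 95% level) and backs out the standard error from the reported interval, since the standard error itself is not stated here. The helper names `z_interval` and `implied_se` are illustrative.

```python
from statistics import NormalDist

def z_interval(estimate, se, level=0.95):
    """Two-sided normal confidence interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se

def implied_se(lower, upper, level=0.95):
    """Standard error implied by a symmetric normal interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)

se = implied_se(26.41, 29.79)      # roughly 0.86
low, high = z_interval(28.1, se)   # recovers the reported interval
print(round(se, 2), round(low, 2), round(high, 2))
```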

**Interpretation of Results:**
With 95% confidence, the true population mean BMI lies between 26.41 and 29.79. This inference assumes random sampling and an approximately normal sampling distribution for the estimated mean.

**Advantages and Limitations:**
Stratified sampling yields unbiased estimates and a structured, methodical design. Its gain in efficiency over simple random sampling shrinks, however, when the strata do not correspond to clearly distinct subgroups. It also requires a complete listing of the population, which is not always practical.

---

The Finite Population Correction (FPC) was included in the methodology so that the standard error reflects the finite size of the population relative to the sample. Since the sample constitutes less than 10% of the total population, the correction is small here; its effect grows as the sample becomes a larger share of the population.

This report describes the stratified sampling method used to estimate the population mean of the BMI variable and explains how the FPC enters the standard error calculation.




## Note: there may be a problem here. These two quantities are proportions for the entire dataset, and I used them to compute the between-strata variation; please check how this should best be written up.
We computed the population proportion of smokers in the dataset, $p_{\text{smoker}} = 0.518$. The corresponding population variance of the smoker indicator, $p_{\text{smoker}}(1 - p_{\text{smoker}})$, is $0.250$.

#### Calculating Between Strata Variation

##### Explanation of Calculating Between Strata Variation

Between-strata variation measures how much the proportion of the variable of interest (here, 'Smoker') differs across the subgroups defined by each candidate stratification variable ('HighBP', 'HighChol', 'CholCheck', 'Stroke', 'HeartDiseaseorAttack', 'HvyAlcoholConsump'). Comparing these values identifies the stratification variable that separates the proportions most, which makes the stratified estimate more efficient.

##### Formula for Between Strata Variation (using Notation)
The formula for between-strata variation ($s^2_{pb}$) is represented as:
$$
s^2_{pb} = \sum_{h=1}^{H} \frac{N_h}{N} \times (p_{h} - \overline{p})^2
$$

##### Interpretation
The formula takes the squared difference between the proportion $p_h$ in each stratum and the overall population proportion $\overline{p}$, weighted by the stratum's share of the population, $N_h / N$. A larger value means that the strata defined by that variable differ more in the variable's proportion.
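The calculation can be sketched in plain Python as follows. The function name `between_strata_variation` and the toy records are illustrative assumptions; only the column names come from the dataset. Because a proportion is the mean of a 0/1 indicator, the same function also works for a continuous variable.

```python
from collections import defaultdict

def between_strata_variation(strata, values):
    """Sum over strata of (N_h / N) * (stratum mean - overall mean)^2."""
    groups = defaultdict(list)
    for h, y in zip(strata, values):
        groups[h].append(y)
    n_total = len(values)
    overall = sum(values) / n_total
    return sum(
        len(ys) / n_total * (sum(ys) / len(ys) - overall) ** 2
        for ys in groups.values()
    )

# Toy data for illustration only (six made-up records).
smoker = [0, 0, 1, 0, 1, 1]
candidates = {
    "HeartDiseaseorAttack": [0, 0, 0, 0, 1, 1],
    "HighBP": [0, 1, 0, 1, 0, 1],
}
scores = {name: between_strata_variation(col, smoker)
          for name, col in candidates.items()}
best = max(scores, key=scores.get)  # candidate with the largest variation
print(scores, best)
```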

Choosing the stratification variable with the largest between-strata variation leaves each stratum more homogeneous internally, which improves the precision of the stratified estimate.

'HeartDiseaseorAttack' was selected as the stratification variable for the 'Smoker' estimate because it exhibited the highest between-strata variation among the candidates.


### Sample Size Allocation
Two approaches were used to allocate sample sizes: the theorem-based (optimal allocation) method and the proportional allocation method. Both resulted in similar allocations: $n_1 = 39$ and $n_2 = 11$ for the two 'HeartDiseaseorAttack' strata.


Both allocation methods aim to represent each health subgroup adequately. The theorem-based method sets sample sizes from the guessed standard deviations and per-unit sampling costs within each stratum, while the proportional allocation method samples each stratum in proportion to its share of the population.

#### Importance of Sample Size Allocation

The allocation of sample sizes serves a dual purpose:

1. **Enhancing Precision:** Assigning larger sample sizes to health strata with higher variability in smoking habits enables a more precise estimation within those specific subgroups. This approach allows us to capture the nuanced differences in smoking prevalence across diverse health conditions.

2. **Efficient Resource Utilization:** Balancing sample sizes based on health conditions and their respective smoking habits ensures an optimal use of resources while maintaining the accuracy of our estimations. This method aims to strike a balance between precision and cost efficiency.

#### Formulas for Sample Size Allocation Methods

##### 1. Theorem-Based Allocation:
The theorem-based allocation formula used is:

$$
\frac{n_{h}}{n} = \frac{N_{h} \times \left(\frac{s_{h,\text{guess}}}{\sqrt{c_{h}}}\right)}{\sum_{k=1}^{H} N_{k} \times \left(\frac{s_{k,\text{guess}}}{\sqrt{c_{k}}}\right)}
$$

Here $s_{h,\text{guess}}$ is the guessed standard deviation in stratum $h$ and $c_{h}$ is the cost of sampling one unit from it.

##### 2. Proportional Allocation Method:
The formula for the proportional allocation method is:

$$
n_{h} = \frac{N_{h}}{N} \times n
$$
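Both allocation rules can be sketched as short functions. The stratum sizes, guessed standard deviations, and unit costs below are hypothetical, chosen only so that the arithmetic is easy to follow; they are not the values used in the study. The example also shows why the two methods can agree: when the guessed standard deviations and costs are equal across strata, the theorem-based formula reduces to proportional allocation.

```python
import math

def proportional_allocation(N_h, n):
    """n_h = (N_h / N) * n for each stratum."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def optimal_allocation(N_h, s_h, c_h, n):
    """n_h proportional to N_h * s_h / sqrt(c_h)."""
    scores = [Nh * sh / math.sqrt(ch) for Nh, sh, ch in zip(N_h, s_h, c_h)]
    total = sum(scores)
    return [n * sc / total for sc in scores]

# Hypothetical inputs: two strata with equal guessed SDs and equal unit costs.
N_h = [7800, 2200]
s_h = [0.5, 0.5]
c_h = [1.0, 1.0]

prop = proportional_allocation(N_h, 50)
opt = optimal_allocation(N_h, s_h, c_h, 50)
print(prop, opt)
```

In practice the results are rarely whole numbers, so they need to be rounded while keeping the total equal to $n$.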



### Estimation
Using the sampled data, we estimated the population proportion of smokers with 'HeartDiseaseorAttack' as the stratification variable. The estimate was $0.537$ with a standard error of $0.120$.

The estimate of the population mean, denoted by $\bar{y}_{str}$, is calculated as follows (for a 0/1 variable such as 'Smoker', the mean is the proportion):

$$
\bar{y}_{str} = \sum_{h=1}^{H} \left( \frac{N_h}{N} \times \bar{y}_h \right)
$$

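The Standard Error (SE) is calculated as:

$$
SE = \sqrt{\sum_{h=1}^{H} \left( \left(\frac{N_h}{N}\right)^2 \times \left(1 - \frac{n_h}{N_h}\right) \times \frac{s_h^2}{n_h} \right)}
$$

where $s_h^2$ is the sample variance and $n_h$ the sample size in stratum $h$.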
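A minimal sketch of the stratified estimate and its standard error is given below. It uses the textbook form in which each stratum's variance term is weighted by $(N_h / N)^2$ and includes the per-stratum finite population correction. The function name `stratified_mean_and_se`, the tiny samples, and the stratum sizes are made up for illustration.

```python
import math
from statistics import mean, variance

def stratified_mean_and_se(samples, N_h):
    """Stratified mean and its SE from per-stratum samples and stratum sizes."""
    N = sum(N_h)
    estimate = 0.0
    var = 0.0
    for y, Nh in zip(samples, N_h):
        n_h = len(y)
        W = Nh / N                      # stratum weight N_h / N
        estimate += W * mean(y)
        # variance(y) is the sample variance s_h^2 (divisor n_h - 1)
        var += W**2 * (1 - n_h / Nh) * variance(y) / n_h
    return estimate, math.sqrt(var)

# Toy 0/1 samples (1 = smoker) from two strata of hypothetical sizes 60 and 40.
samples = [[0, 1, 0, 1], [1, 1, 0, 1]]
est, se = stratified_mean_and_se(samples, [60, 40])
print(round(est, 3), round(se, 3))
```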

### Confidence Interval
A 95% confidence interval was calculated using the z-score method, resulting in an interval of $(0.302, 0.773)$ for the population proportion of smokers, estimated with 'HeartDiseaseorAttack' as the stratification variable.

The Confidence Interval (CI) is derived using the formula:

$$
\text{CI} = \bar{y}_{str} \pm z \times SE
$$
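As a check on the arithmetic, plugging the rounded estimate and standard error into the formula with $z = 1.96$ gives:

$$
0.537 \pm 1.96 \times 0.120 = 0.537 \pm 0.2352 \approx (0.302,\ 0.772)
$$

This agrees with the reported interval up to rounding; the small difference in the third decimal of the upper bound is consistent with the report having used unrounded values of the estimate and standard error.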

### Conclusion
Stratified sampling on 'HeartDiseaseorAttack' gives an estimated smoking prevalence of approximately 53.74% for the population.

The 95% Confidence Interval (CI), ranging from approximately 30.18% to 77.29%, expresses the uncertainty around this estimate: intervals constructed this way are expected to cover the true population prevalence 95% of the time.

In conclusion, the stratified sampling approach provides an estimate of smoking prevalence together with a measure of its precision, which can inform policy considerations and targeted interventions. The interval is wide, however, which reflects the small sample of $n = 50$ and underscores the need for further research before drawing conclusions about smoking behaviors within each specific health subgroup.


#### Calculating Between Strata Variation

##### Explanation of Calculating Between Strata Variation

Between-strata variation measures how much the mean of the variable of interest (here, 'BMI') differs across the subgroups defined by each candidate stratification variable ('HighBP', 'HighChol', 'CholCheck', 'Stroke', 'HeartDiseaseorAttack', 'HvyAlcoholConsump'). Comparing these values identifies the stratification variable that separates the means most, which makes the stratified estimate more efficient.

##### Formula for Between Strata Variation (using Notation)
The formula for between-strata variation ($s^2_{pb}$) is represented as:
$$
s^2_{pb} = \sum_{h=1}^{H} \frac{N_h}{N} \times (\overline{y_{P_{h}}} - \overline{y_{p}})^2
$$

##### Interpretation
The between-strata variation formula assesses the squared differences between the mean of the variable of interest in each stratum and the overall population mean, weighted by the population size in each stratum. This calculation identifies which strata exhibit the most significant differences in the variable's mean.
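The reason a large between-strata variation is desirable follows from the standard decomposition of the population variance. Writing $\sigma^2$ for the population variance of BMI and $\sigma_h^2$ for the variance within stratum $h$ (both with divisor equal to the number of units, a notational assumption made here), the total splits into a within-strata and a between-strata part:

$$
\sigma^2 = \sum_{h=1}^{H} \frac{N_h}{N} \sigma_h^2 + \sum_{h=1}^{H} \frac{N_h}{N} \times (\overline{y_{P_{h}}} - \overline{y_{p}})^2
$$

The second term is $s^2_{pb}$. Since the total is fixed, a stratification variable with a larger $s^2_{pb}$ leaves less variance within the strata, and only the within-strata variances enter the standard error of the stratified estimate.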

The stratification variable with the largest between-strata variation is preferred, because it leaves the least variation in BMI within each stratum.

'HighBP' was selected as the stratification variable for the 'BMI' estimate because it exhibited the highest between-strata variation among the candidates.


### Sample Size Allocation
Two approaches were used to allocate sample sizes: the theorem-based (optimal allocation) method and the proportional allocation method. Both resulted in similar allocations: $n_1 = 12$ and $n_2 = 38$ for the two 'HighBP' strata.


Both allocation methods aim to represent each health subgroup adequately, supporting a more accurate estimate of mean BMI. The theorem-based method sets sample sizes from the guessed standard deviations and per-unit sampling costs within each stratum, while the proportional allocation method samples each stratum in proportion to its share of the population.

#### Importance of Sample Size Allocation

The allocation of sample sizes serves a dual purpose:

1. **Enhancing Precision:** Assigning larger sample sizes to health strata with higher variability in BMI enables a more precise estimation within those specific subgroups. This approach allows us to capture differences in mean BMI across health conditions.

2. **Efficient Resource Utilization:** Balancing sample sizes based on health conditions and their respective BMI ensures an optimal use of resources while maintaining the accuracy of our estimations. This method aims to strike a balance between precision and cost efficiency.

#### Formulas for Sample Size Allocation Methods

##### 1. Theorem-Based Allocation:
The theorem-based allocation formula used is:

$$
\frac{n_{h}}{n} = \frac{N_{h} \times \left(\frac{s_{h,\text{guess}}}{\sqrt{c_{h}}}\right)}{\sum_{k=1}^{H} N_{k} \times \left(\frac{s_{k,\text{guess}}}{\sqrt{c_{k}}}\right)}
$$

##### 2. Proportional Allocation Method:
The formula for the proportional allocation method is:

$$
n_{h} = \frac{N_{h}}{N} \times n
$$



### Estimation
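Using the sampled data, we estimated the population mean BMI with 'HighBP' as the stratification variable. The estimated population mean was $28.1$; the 95% confidence interval of $[26.41, 29.79]$ reported for this estimate corresponds to a standard error of about $0.86$ at $z = 1.96$.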

The estimation of the population mean, denoted by $\bar{y}_{str}$, is calculated as:

$$
\bar{y}_{str} = \sum_{h=1}^{H} \left( \frac{N_h}{N} \times \bar{y}_h \right)
$$

The Standard Error (SE) is calculated as:

$$
SE = \sqrt{\sum_{h=1}^{H} \left( \left(\frac{N_h}{N}\right)^2 \times \left(1 - \frac{n_h}{N_h}\right) \times \frac{s_h^2}{n_h} \right)}
$$

### Confidence Interval
A 95% confidence interval was calculated using the z-score method, resulting in an interval of $[26.41, 29.79]$ for the population mean BMI, estimated with 'HighBP' as the stratification variable.

The Confidence Interval (CI) is derived using the formula:

$$
\text{CI} = \bar{y}_{str} \pm z \times SE
$$

### Conclusion
Stratified sampling on 'HighBP' gives an estimated population mean BMI of approximately 28.1.

The 95% Confidence Interval (CI), ranging from 26.41 to 29.79, expresses the uncertainty around this estimate.

In conclusion, the stratified sampling approach provides an estimate of mean BMI together with a measure of its precision. A larger sample would narrow the interval and allow a more precise assessment of BMI within each specific health subgroup.