# **Method: Stratified Sampling for Estimating the Population Mean BMI and the Population Proportion of Smokers**

Stratified sampling divides the population into distinct subgroups, or strata, and draws a sample from each one. In this analysis the candidate stratification variables are the health indicators 'HighBP', 'HighChol', 'CholCheck', 'Stroke', 'HeartDiseaseorAttack', and 'HvyAlcoholConsump'. The rationale is that these segments of the population may differ in the variables of interest: when the strata differ, stratifying removes the between-strata component from the sampling variance, which gives a more accurate estimate of the mean of 'BMI' and the proportion of 'Smoker' than a simple random sample of the same size.



## BMI

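## tip: There are probably some problems here. The two quantities below are the mean and variance of the whole dataset, which I used to calculate the between-strata variation. Please take a look and decide how this should be written.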
We computed the population mean of BMI in the dataset, denoted by $\overline{y_{p}} = 31.944$. Additionally, the population variance of BMI was calculated, resulting in a value of $54.22$.

#### Choosing the partition for the population by Calculating Between Strata Variation

##### Explanation of Calculating Between Strata Variation

Calculating between-strata variation is crucial in stratified sampling because it measures how much the mean of the variable of interest (in this case, 'BMI') differs across the strata defined by each candidate variable ('HighBP', 'HighChol', 'CholCheck', 'Stroke', 'HeartDiseaseorAttack', 'HvyAlcoholConsump'). The candidate whose strata differ most in mean gives the most efficient estimation, so it is chosen as the stratification variable.

##### Formula for Between Strata Variation (using Notation)
The formula for between-strata variation ($s^2_{pb}$) is represented as:
$$
s^2_{pb} = \sum_{h=1}^{H} \frac{N_h}{N} \times (\overline{y_{P_{h}}} - \overline{y_{p}})^2
$$

##### Interpretation
The between-strata variation formula sums the squared differences between the mean of the variable of interest in each stratum and the overall population mean, weighting each term by that stratum's share of the population ($N_h/N$). The result is a single number for each candidate partition: the larger it is, the more the stratum means differ from one another.
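As a minimal sketch of the formula above, the function below computes the between-strata variation from stratum sizes and stratum means. The function name and the example numbers are hypothetical and are not taken from the dataset.

```python
def between_strata_variation(sizes, means):
    """Return sum over strata of (N_h / N) * (stratum mean - overall mean)^2."""
    total = sum(sizes)
    # The overall mean is the size-weighted average of the stratum means.
    overall = sum(n_h * m_h for n_h, m_h in zip(sizes, means)) / total
    return sum((n_h / total) * (m_h - overall) ** 2
               for n_h, m_h in zip(sizes, means))


# Hypothetical two-stratum partition (illustrative values only).
sizes = [600, 400]
means = [30.0, 35.0]
s2_pb = between_strata_variation(sizes, means)
print(s2_pb)
```

Running the same function once per candidate variable and keeping the largest value is one way to carry out the comparison described here.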

Comparing this quantity across the candidate variables identifies the partition whose strata differ most in mean BMI. Because the total variance is fixed, a larger between-strata variation leaves a smaller within-strata variation, which makes the stratified estimate more precise.

The 'HighBP' variable was selected for stratification because it exhibited the highest between-strata variation for 'BMI' among the candidates.

This step ensures that the selected strata are as distinct as possible with respect to BMI, which supports an accurate estimation of the variable's mean.


### Sample Size Allocation
Two approaches were used to allocate sample sizes - the theorem-based method and the proportion allocation method. Both methods resulted in similar sample size allocations: $n_1 = 10$ and $n_2 = 32$ for the 'HighBP' strata.


The sample size allocation methods employed in our research aim to ensure that each 'HighBP' stratum is adequately represented, facilitating a more accurate estimation of mean BMI. The theorem-based method optimizes sample sizes based on the estimated standard deviation and sampling cost within each stratum, while the proportion allocation method gives each stratum a share of the sample equal to its share of the population.

#### Importance of Sample Size Allocation

The allocation of sample sizes serves a dual purpose:

1. **Enhancing Precision:** Assigning larger sample sizes to health strata with higher variability in BMI enables a more precise estimation within those specific subgroups. This approach allows us to capture the differences in BMI across diverse health conditions.

2. **Efficient Resource Utilization:** Balancing sample sizes based on the strata and their respective BMI variability ensures an optimal use of resources while maintaining the accuracy of our estimations. This method aims to strike a balance between precision and cost efficiency.

#### Formulas for Sample Size Allocation Methods

##### 1. Theorem-Based Allocation:
The theorem-based allocation formula used is:

$$
\frac{n_{h}}{n} = \frac{N_{h} \times \left(\frac{s_{h,\text{guess}}}{\sqrt{c_{h}}}\right)}{\sum_{k=1}^{H} N_{k} \times \left(\frac{s_{k,\text{guess}}}{\sqrt{c_{k}}}\right)}
$$

where $s_{h,\text{guess}}$ is the guessed standard deviation in stratum $h$ and $c_{h}$ is the cost of sampling one unit from stratum $h$.

##### 2. Proportion Allocation Method:
The formula for the proportion allocation method is:

$$
n_{h} = \frac{N_{h}}{N} \times n
$$
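The two allocation rules can be sketched as follows. This is an illustration under stated assumptions: the function names, stratum sizes, guessed standard deviations, and costs are all hypothetical, and rounding to whole sample sizes is left out.

```python
import math


def proportional_allocation(sizes, n):
    """n_h = (N_h / N) * n for each stratum."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]


def optimal_allocation(sizes, sd_guesses, costs, n):
    """n_h proportional to N_h * s_h / sqrt(c_h), using each stratum's own cost."""
    weights = [N_h * s_h / math.sqrt(c_h)
               for N_h, s_h, c_h in zip(sizes, sd_guesses, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical inputs (not from the dataset).
sizes = [1000, 3000]
prop = proportional_allocation(sizes, 40)
same = optimal_allocation(sizes, [5.0, 5.0], [1.0, 1.0], 40)
diff = optimal_allocation(sizes, [2.0, 1.0], [1.0, 1.0], 40)
print(prop, same, diff)
```

When the guessed standard deviations and the costs are equal across strata, the theorem-based rule reduces to proportion allocation, which is one situation in which the two methods give the same sample sizes.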



### Estimation
Using the sampled data, we estimated the population mean of BMI from the two 'HighBP' strata. The estimated population mean was $33.48$ with a standard error of $0.92$.

The estimate of the population mean, denoted by $\bar{y}_{str}$, is calculated as:

$$
\bar{y}_{str} = \sum_{h=1}^{H} \left( \frac{N_h}{N} \times \bar{y}_h \right)
$$

where $\bar{y}_h$ is the sample mean in stratum $h$. The Standard Error (SE) is calculated as:

$$
SE = \sqrt{\sum_{h=1}^{H} \left( \left(1 - \frac{n_h}{N_h}\right) \times \left(\frac{N_h}{N}\right)^2 \times \frac{s_h^2}{n_h} \right)}
$$

where $s_h^2$ is the sample variance in stratum $h$.
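A minimal sketch of the estimator, assuming each stratum's sample is available as a list of values. It uses the standard stratified variance, in which each stratum's term carries the squared weight $(N_h/N)^2$ together with the finite population correction. The function name and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def stratified_mean_and_se(samples, sizes):
    """Return (estimate, standard error) for a stratified sample.

    samples: one list of sampled values per stratum
    sizes:   population size N_h of each stratum
    """
    total = sum(sizes)
    est = 0.0
    var = 0.0
    for y, N_h in zip(samples, sizes):
        n_h = len(y)
        w = N_h / total                  # stratum weight N_h / N
        est += w * mean(y)
        # (1 - n_h/N_h) is the finite population correction.
        var += (1 - n_h / N_h) * w ** 2 * variance(y) / n_h
    return est, math.sqrt(var)


# Toy data (illustrative only).
est, se = stratified_mean_and_se([[1, 2, 3], [4, 6]], [30, 20])
print(est, se)
```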

### Confidence Interval
A 95% confidence interval was calculated using the z-score method, resulting in a confidence interval of $(31.68, 35.28)$ for the population mean BMI, based on the 'HighBP' strata.

The Confidence Interval (CI) is derived using the formula:

$$
\text{CI} = \bar{y}_{str} \pm z \times SE
$$
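The interval can be checked with a short sketch that plugs the reported estimate ($33.48$) and standard error ($0.92$) into the formula. The helper name `z_interval` is hypothetical; `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist


def z_interval(estimate, se, level=0.95):
    """Return (lower, upper) bounds of a normal-approximation interval."""
    # Two-sided critical value, roughly 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


low, high = z_interval(33.48, 0.92)
print(round(low, 2), round(high, 2))
```

Rounded to two decimals, the bounds agree with the reported interval of $(31.68, 35.28)$.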

### Conclusion

Stratifying by 'HighBP' gives an estimated population mean BMI of 33.48, with a standard error of 0.92 and a 95% confidence interval ranging from 31.68 to 35.28. Because the stratum means are combined with the weights $N_h/N$, this is an estimate for the whole population, not only for individuals with high blood pressure.

The standard error of 0.92 is under 3% of the estimate, which indicates a fair degree of precision. The standard error measures how much the estimate would vary from sample to sample, so a small value means that repeated samples drawn with the same design would give similar estimates.

The 95% confidence interval of 31.68 to 35.28 should be read as a statement about the procedure: if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true population mean BMI.

Therefore, based on our estimates, we can reasonably conclude that the average BMI in the population lies between 31.68 and 35.28. This information can be valuable for healthcare professionals, policymakers, and researchers who study BMI and its association with high blood pressure.

## Smoker


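## tips: Same problem. The two quantities below are the proportion and variance for the whole dataset, which I used to calculate the between-strata variation. Please take a look and decide how this should be written.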
We computed the population proportion of smokers in the dataset, denoted by $p_{\text{smoker}} = 0.518$. Additionally, the population variance of smokers was calculated, resulting in a value of $0.250$.

#### Choosing the partition for the population by Calculating Between Strata Variation

##### Explanation of Calculating Between Strata Variation

Calculating between-strata variation is crucial in stratified sampling because it measures how much the proportion of the variable of interest (in this case, 'Smoker') differs across the strata defined by each candidate variable ('HighBP', 'HighChol', 'CholCheck', 'Stroke', 'HeartDiseaseorAttack', 'HvyAlcoholConsump'). The candidate whose strata differ most in proportion gives the most efficient estimation, so it is chosen as the stratification variable.

##### Formula for Between Strata Variation (using Notation)
The formula for between-strata variation ($s^2_{pb}$) is represented as:
$$
s^2_{pb} = \sum_{h=1}^{H} \frac{N_h}{N} \times (p_{P_{h}} - p)^2
$$

##### Interpretation
The between-strata variation formula sums the squared differences between the proportion of the variable of interest in each stratum and the overall population proportion, weighting each term by that stratum's share of the population ($N_h/N$). The result is a single number for each candidate partition: the larger it is, the more the stratum proportions differ from one another.
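For a 0/1 variable such as 'Smoker', the comparison across candidate variables can be sketched directly from record-level data. The column names come from the text, but the four rows and the function name are made up for illustration.

```python
from collections import defaultdict


def between_strata_variation_binary(records, strat_key, outcome_key):
    """Between-strata variation of a 0/1 outcome for one candidate variable."""
    counts = defaultdict(int)      # N_h: records in each stratum
    positives = defaultdict(int)   # records with outcome == 1 in each stratum
    for row in records:
        counts[row[strat_key]] += 1
        positives[row[strat_key]] += row[outcome_key]
    n = len(records)
    p = sum(positives.values()) / n  # overall proportion
    return sum((counts[h] / n) * (positives[h] / counts[h] - p) ** 2
               for h in counts)


# Made-up rows, for illustration only.
records = [
    {"HighBP": 1, "Stroke": 0, "Smoker": 1},
    {"HighBP": 1, "Stroke": 1, "Smoker": 1},
    {"HighBP": 0, "Stroke": 0, "Smoker": 0},
    {"HighBP": 0, "Stroke": 1, "Smoker": 0},
]
scores = {key: between_strata_variation_binary(records, key, "Smoker")
          for key in ("HighBP", "Stroke")}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy data the 'Stroke' strata have identical smoking proportions, so their between-strata variation is zero and 'HighBP' is preferred; the real data led to a different choice.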

Comparing this quantity across the candidate variables identifies the partition whose strata differ most in the proportion of smokers. Because the total variance is fixed, a larger between-strata variation leaves a smaller within-strata variation, which makes the stratified estimate more precise.

The 'HeartDiseaseorAttack' variable was selected for stratification because it exhibited the highest between-strata variation for 'Smoker' among the candidates.

This step ensures that the selected strata are as distinct as possible with respect to smoking, which supports an accurate estimation of the variable's proportion.


### Sample Size Allocation
Two approaches were used to allocate sample sizes - the theorem-based method and the proportion allocation method. Both methods resulted in similar sample size allocations: $n_1 = 297$ and $n_2 = 82$ for the 'HeartDiseaseorAttack' strata.


The sample size allocation methods employed in our research aim to ensure that each health subgroup is adequately represented, facilitating a more accurate estimation of smoking prevalence across different health conditions. The theorem-based method optimizes sample sizes based on estimated variances and costs within each health stratum, while the proportion allocation method ensures proportional representation of each health subgroup, contributing to a more comprehensive understanding of smoking habits and their association with diverse health conditions.

#### Importance of Sample Size Allocation

The allocation of sample sizes serves a dual purpose:

1. **Enhancing Precision:** Assigning larger sample sizes to health strata with higher variability in smoking habits enables a more precise estimation within those specific subgroups. This approach allows us to capture the nuanced differences in smoking prevalence across diverse health conditions.

2. **Efficient Resource Utilization:** Balancing sample sizes based on health conditions and their respective smoking habits ensures an optimal use of resources while maintaining the accuracy of our estimations. This method aims to strike a balance between precision and cost efficiency.

#### Formulas for Sample Size Allocation Methods

##### 1. Theorem-Based Allocation:
The theorem-based allocation formula used is:

$$
\frac{n_{h}}{n} = \frac{N_{h} \times \left(\frac{s_{h,\text{guess}}}{\sqrt{c_{h}}}\right)}{\sum_{k=1}^{H} N_{k} \times \left(\frac{s_{k,\text{guess}}}{\sqrt{c_{k}}}\right)}
$$

##### 2. Proportion Allocation Method:
The formula for the proportion allocation method is:

$$
n_{h} = \frac{N_{h}}{N} \times n
$$



### Estimation
Using the sampled data, we estimated the population proportion of smokers from the two 'HeartDiseaseorAttack' strata. The estimated population proportion was $0.5747$ with a standard error of $0.04$.

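The estimate of the population proportion, denoted by $\hat{p}_{str}$, is calculated as:

$$
\hat{p}_{str} = \sum_{h=1}^{H} \left( \frac{N_h}{N} \times \hat{p}_{S_{h}} \right)
$$

where $\hat{p}_{S_{h}}$ is the sample proportion in stratum $h$. The Standard Error (SE) is calculated as:

$$
SE = \sqrt{\sum_{h=1}^{H} \left( \left(1 - \frac{n_h}{N_h}\right) \times \left(\frac{N_h}{N}\right)^2 \times \frac{s_h^2}{n_h} \right)}
$$

where, for a proportion, $s_h^2 = \frac{n_h}{n_h - 1} \times \hat{p}_{S_{h}} \left(1 - \hat{p}_{S_{h}}\right)$.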
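For a proportion, the same calculation can be sketched from counts alone, using the standard stratified variance with squared weights $(N_h/N)^2$ and the sample variance of a 0/1 variable, $s_h^2 = \frac{n_h}{n_h - 1}\hat{p}_h(1 - \hat{p}_h)$. The function name and the counts below are hypothetical and are not the study's data.

```python
import math


def stratified_proportion_and_se(successes, sample_sizes, sizes):
    """Return (estimate, standard error) for a stratified proportion.

    successes:    number of sampled units with the attribute, per stratum
    sample_sizes: n_h, sampled units per stratum
    sizes:        N_h, population size per stratum
    """
    total = sum(sizes)
    est = 0.0
    var = 0.0
    for x_h, n_h, N_h in zip(successes, sample_sizes, sizes):
        p_h = x_h / n_h
        w = N_h / total
        # Sample variance of a 0/1 variable.
        s2_h = n_h / (n_h - 1) * p_h * (1 - p_h)
        est += w * p_h
        var += (1 - n_h / N_h) * w ** 2 * s2_h / n_h
    return est, math.sqrt(var)


# Hypothetical counts (illustrative only).
p_str, se_p = stratified_proportion_and_se([50, 30], [100, 50], [8000, 2000])
print(p_str, se_p)
```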

### Confidence Interval
A 95% confidence interval was calculated using the z-score method, resulting in a confidence interval of $(0.4928, 0.6565)$ for the population proportion of smokers, based on the 'HeartDiseaseorAttack' strata.

The Confidence Interval (CI) is derived using the formula:

$$
\text{CI} = \hat{p}_{str} \pm z \times SE
$$

### Conclusion
Using stratified sampling with the 'HeartDiseaseorAttack' strata, we estimated the population smoking prevalence at approximately 57.47%. This estimate serves as a benchmark for how common smoking is in the population covered by the dataset.

The 95% Confidence Interval (CI), ranging from approximately 49% to 66%, expresses the uncertainty around this estimate: intervals constructed this way would contain the true population prevalence in about 95% of repeated samples.

In conclusion, the estimated prevalence, along with the confidence interval, forms a foundation for policy considerations and targeted interventions. However, the interval is relatively wide (about 16 percentage points), which underscores the need for a larger sample or further research to obtain a more precise estimate, particularly within each specific health subgroup.