In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla05.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 05: Data Analysis

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Understand how to use the `datascience` library to manipulate and summarize health-related survey data. 
2. Learn to compute and interpret prevalence rates for health conditions. 
3. Explore data visualization techniques to compare hypertension prevalence across demographic groups.  
4. Practice using `.group()`, `.pivot()`, and `.join()` table methods to organize and analyze data.  
5. Gain insights into how national survey data informs public health policies and medical research.  

---

## National Health and Nutrition Examination Survey

<img src="./NHANES-Trademark.avif" width=200px alt="NHANES logo">

* The [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/about/index.html) (NHANES), conducted by the National Center for Health Statistics, collects data on the health, diet, and nutrition of U.S. adults and children.
* It is the only national survey that includes health exams, lab tests, and dietary interviews for all ages.
* The study aims to represent the civilian, non-institutionalized U.S. population, excluding those in supervised care, active-duty military personnel and families overseas, and citizens outside the 50 states and D.C.
* It includes non-institutional group quarters like college dorms.
* NHANES data inform medical practices and public health policies to improve overall health in the U.S.
* In this activity, you will utilize data from the [2021-2023 NHANES cycle](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2021-2023).

---

### Hypertension

Hypertension, commonly known as high blood pressure, is a chronic medical condition defined as **systolic blood pressure at or above 130 mmHg and/or diastolic blood pressure at or above 80 mmHg**. As a leading modifiable risk factor for cardiovascular diseases, stroke, kidney failure, and dementia, hypertension affects over 1.3 billion adults globally according to the [World Health Organization (WHO)](https://www.who.int/thailand/news/detail/19-09-2023-first-who-report-details-devastating-impact-of-hypertension-and-ways-to-stop-it). Its societal importance stems from its pervasive prevalence, "silent" asymptomatic progression, and disproportionate impact on various communities due to socioeconomic and environmental factors. Systematic monitoring through surveys like NHANES enables early detection of population-level trends, evaluation of public health interventions, and identification of health disparities, all of which are critical for reducing cardiovascular mortality and containing associated economic burdens. In this activity, you will focus on recreating results from the NCHS Data Brief No. 511 from October 2024 titled [_Hypertension Prevalence, Awareness, Treatment, and Control Among Adults Age 18 and Older: United States, August 2021–August 2023_](https://www.cdc.gov/nchs/products/databriefs/db511.htm).

---

### Blood Pressure

According to the [American Heart Association](https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings), your blood pressure is recorded as two numbers:

* Systolic blood pressure is the first number. It measures the pressure your blood is pushing against your artery walls when the heart beats.
* Diastolic blood pressure is the second number. It measures the pressure your blood is pushing against your artery walls while the heart muscle rests between beats.

The following table from the same source categorizes various blood pressure measurement intervals for adults (18+).

| **Blood Pressure Category**                       | **Systolic (mm Hg)**         | **Diastolic (mm Hg)**         |
|--------------------------------------------------|------------------------------|-------------------------------|
| **Normal**                                       | Less than 120                | Less than 80                  |
| **Elevated**                                     | 120 – 129                    | Less than 80                  |
| **High Blood Pressure (Hypertension) Stage 1**  | 130 – 139                    | 80 – 89                       |
| **High Blood Pressure (Hypertension) Stage 2**  | 140 or higher                | 90 or higher                  |
| **Hypertensive Crisis** (consult your doctor)    | Higher than 180              | Higher than 120               |

For children, hypertension is defined differently. 

---

### Pandas

<a href="https://pandas.pydata.org/about/citing.html"><img src="./pandas.svg" width=200px alt="Pandas logo"></a>

* [Pandas](https://pandas.pydata.org/) is a powerful open-source Python library for data manipulation and analysis, built on NumPy.
* Designed for working with structured/relational data, it excels at cleaning, transforming, and analyzing datasets through intuitive functions for filtering, grouping, merging, and handling missing values.
* The `datascience` library used in MATH 108 was created to provide a softer entry point to more professional libraries like Pandas, but it is somewhat limited in its functionality.
* You can convert between datascience Tables and Pandas DataFrames using the [`to_df`](https://datascience.readthedocs.io/en/master/_autosummary/datascience.tables.Table.to_df.html#datascience.tables.Table.to_df) and [`from_df`](https://datascience.readthedocs.io/en/master/_autosummary/datascience.tables.Table.from_df.html#datascience.tables.Table.from_df) functions from the `datascience` library.
* We will only use Pandas in this activity to extract NHANES data.
    * The data is stored in `XPT` (`.xpt`) files, a common statistical file type developed by the [SAS Institute](https://www.sas.com/).
    * The Pandas function [`read_sas`](https://pandas.pydata.org/docs/reference/api/pandas.read_sas.html) can read `XPT` files.
* Run the following command to import Pandas as `pd` along with the other common tools and configurations from MATH 108.

In [None]:
import pandas as pd
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

### The Data

In order to recreate the results found in [_Hypertension Prevalence, Awareness, Treatment, and Control Among Adults Age 18 and Older: United States, August 2021–August 2023_](https://www.cdc.gov/nchs/products/databriefs/db511.htm), you will need to access demographics and physical examination data sourced from the [2021 - 2023 NHANES website](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2021-2023). 

---

#### Demographics Data

* The [demographics data](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&Cycle=2021-2023) is stored in the file `DEMO_L.xpt`.
* The <a href="https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.htm">demographics documentation file</a> contains information about the data, such as what the variable codes represent.

---

#### Task 01 📍

Use the `pd.read_sas` Pandas function and the `Table.from_df` `datascience` function to create a table called `demographics` from `DEMO_L.xpt`.

In [None]:
demographics = ...
demographics

In [None]:
grader.check("task_01")

---

#### Examination Data

* The [examination data on blood pressure](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&Cycle=2021-2023) is stored in the file `BPXO_L.xpt`.
* The <a href="https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/BPXO_L.htm">examination documentation file</a> contains information about the data, such as what the variable codes represent.

---

#### Task 02 📍

Use the `pd.read_sas` Pandas function and the `Table.from_df` `datascience` function to create a table called `examination` from `BPXO_L.xpt`.

In [None]:
examination = ...
examination

In [None]:
grader.check("task_02")

---

#### Questionnaire Data

* The [questionnaire data on blood pressure and cholesterol](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&Cycle=2021-2023) is stored in the file `BPQ_L.xpt`.
* The <a href="https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/BPQ_L.htm">questionnaire documentation file</a> contains information about the data, such as what the variable codes represent.

---

#### Task 03 📍

Use the `pd.read_sas` Pandas function and the `Table.from_df` `datascience` function to create a table called `questionnaire` from `BPQ_L.xpt`.

In [None]:
questionnaire = ...
questionnaire

In [None]:
grader.check("task_03")

---

### Defining Hypertension

According to the NHANES data report:
> An average of up to three brachial systolic and diastolic blood pressure readings were taken using an oscillometric device. All blood pressure readings were obtained during a participant's health examination in the mobile examination center by trained staff following a standard protocol.

So, an individual (identified by their `SEQN` number) will be defined as having hypertension if their average systolic blood pressure is at or above 130 mmHg and/or their average diastolic blood pressure is at or above 80 mmHg, or if they are currently taking medication to lower blood pressure.
* The systolic blood pressure values are associated with column labels starting with `'BPXOSY'` in `examination`.
* The diastolic blood pressure values are associated with column labels starting with `'BPXODI'` in `examination`.
* The results of the individuals reporting if they are taking blood pressure medication (`1`) are found in the `'BPQ150'` variable in `questionnaire`.
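
The definition above can be sketched with NumPy on a few made-up readings. Everything in this block is illustrative: the values are invented, the variable names are hypothetical, and the use of `np.nanmean` to average whatever readings are available is an assumption about how to treat missing readings, not a method confirmed by the report or this activity.

```python
import numpy as np

# Made-up readings for four hypothetical participants (rows), with up to
# three readings each (columns). np.nan marks a missing reading.
systolic = np.array([
    [118.0, 122.0, 120.0],
    [134.0, 128.0, np.nan],
    [126.0, 124.0, 125.0],
    [112.0, 110.0, 114.0],
])
diastolic = np.array([
    [76.0, 78.0, 74.0],
    [79.0, 77.0, np.nan],
    [82.0, 84.0, 80.0],
    [70.0, 72.0, 68.0],
])
# Made-up medication flags (True means currently taking medication).
on_medication = np.array([False, False, False, True])

# Average the available readings for each participant, ignoring np.nan.
# (Assumption: missing readings are skipped rather than propagated.)
systolic_avg = np.nanmean(systolic, axis=1)
diastolic_avg = np.nanmean(diastolic, axis=1)

# Hypertension: either average at or above its threshold, or on medication.
hypertension = (systolic_avg >= 130) | (diastolic_avg >= 80) | on_medication
print(hypertension)
```

Note that `np.average`, unlike `np.nanmean`, returns `nan` when any of its inputs is `nan`, and comparing `nan` to a threshold evaluates to `False`.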

---

### Task 04 📍

Create a table called `blood_pressure` that contains the same information from `examination` along with an additional column at the right end called `'BPQ150'` that includes the same information from `'BPQ150'` in the `questionnaire` table where there is a match in the two tables based on the `'SEQN'` values.
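
To make the idea of matching rows on a shared key concrete, here is a minimal plain-Python sketch using made-up rows. It is not the `datascience` method this task expects; the row values and the helper name `bpq150_by_seqn` are hypothetical, and the sketch assumes that only rows whose `'SEQN'` appears in both tables are kept.

```python
# Made-up rows mimicking two small tables that share a 'SEQN' key.
examination_rows = [
    {"SEQN": 1, "BPXOSY1": 118.0},
    {"SEQN": 2, "BPXOSY1": 134.0},
    {"SEQN": 3, "BPXOSY1": 126.0},
]
questionnaire_rows = [
    {"SEQN": 1, "BPQ150": 2.0},
    {"SEQN": 3, "BPQ150": 1.0},
    {"SEQN": 4, "BPQ150": 1.0},
]

# Index the second table by its key so each lookup is a dictionary access.
bpq150_by_seqn = {row["SEQN"]: row["BPQ150"] for row in questionnaire_rows}

# Keep only rows whose key appears in both tables, adding the new column last.
joined = [
    {**row, "BPQ150": bpq150_by_seqn[row["SEQN"]]}
    for row in examination_rows
    if row["SEQN"] in bpq150_by_seqn
]
print(joined)
```

In this sketch, participant `2` is dropped because there is no questionnaire row for them, and participant `4` is dropped because there is no examination row.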

In [None]:
blood_pressure = ...
blood_pressure

In [None]:
""" # BEGIN TEST CONFIG
hidden: false
success_message: "✅ blood_pressure is a Table."
failure_message: "❌ blood_pressure is not a Table."
""" # END TEST CONFIG

isinstance(blood_pressure, Table)

In [None]:
""" # BEGIN TEST CONFIG
hidden: false
success_message: "✅ blood_pressure has the correct number of rows."
failure_message: "❌ blood_pressure does not have the correct number of rows."
""" # END TEST CONFIG

blood_pressure.num_rows

In [None]:
""" # BEGIN TEST CONFIG
hidden: false
success_message: "✅ blood_pressure has the correct column labels."
failure_message: "❌ blood_pressure does not have the correct column labels."
""" # END TEST CONFIG

blood_pressure.labels == ('SEQN',
 'BPAOARM',
 'BPAOCSZ',
 'BPXOSY1',
 'BPXODI1',
 'BPXOSY2',
 'BPXODI2',
 'BPXOSY3',
 'BPXODI3',
 'BPXOPLS1',
 'BPXOPLS2',
 'BPXOPLS3',
 'BPQ150')

In [None]:
""" # BEGIN TEST CONFIG
hidden: false
success_message: "✅ blood_pressure seems like the correct table."
failure_message: "❌ blood_pressure does not seem like the correct table."
""" # END TEST CONFIG

blood_pressure

---

### Task 05 📍

Create a table called `blood_pressure_with_hypertension` that contains the same information from `blood_pressure` with an extra column called `'HYPERTENSION'` of `bool` values to the far right-end of the `blood_pressure` table, indicating whether or not the individual should be labeled as having hypertension based on the above definition. We've provided a few functions and a template to help with this one!

In [None]:
def average_bp(measurement1, measurement2, measurement3):
    return np.average([measurement1, measurement2, measurement3])

def bp_medication(bpq150):
    # A 'BPQ150' value of 1 means the individual reports taking blood pressure medication.
    if bpq150 == 1:
        return True
    else:
        return False
    
BPXOSY_AVE_arr = ...
BPXODI_AVE_arr = ...
bp_medication_arr = ...
hypertension_arr = ...
blood_pressure_with_hypertension = ...
blood_pressure_with_hypertension

In [None]:
grader.check("task_05")

---

### Including Demographics

* Based on the data and definition we have, the calculated prevalence of hypertension in the data set is around 32\%.
* The NHANES report indicates a higher prevalence of hypertension of 47.7\%.
* The report only considers those 18+ in age.
* The report indicates that blood pressure analysis "excluded pregnant women."
* The report also indicates that 
* You should filter the blood pressure data based on the age of the individual and their pregnancy status.
* Additionally, the report showcases the prevalence of hypertension for various adult age groups and by gender.

---

### Task 06 📍

Join the `demographics` table with the `blood_pressure_with_hypertension` table into one table called `nhanes`, keeping only individuals who are 18 or older and who were not labeled as being pregnant (a `'RIDEXPRG'` value of `1` indicates pregnancy). The table should contain the columns:
* `'AGE'`: The age (`int`) of the individual (exclude those individuals under 18 in age)
* `'GENDER'`: The self-reported sex (`int`) of the individual assigned at birth, given two choices:
    * `1`: Male
    * `2`: Female
* `'HYPERTENSION'`: The classification of hypertension (`bool`) as defined above

In [None]:
...

In [None]:
grader.check("task_06")

---

### Task 07 📍

What is the prevalence of hypertension according to this data? That is, what percentage of the individuals in the `nhanes` table are labeled as having hypertension? Assign that percentage value (`float`) to `hypertension_prevalence_2021_2023`.
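
As a small worked example of the arithmetic, a prevalence is the proportion of `True` labels expressed as a percentage. The labels below are made up and the variable names are hypothetical; this is a sketch of the calculation, not the task's solution.

```python
import numpy as np

# Made-up hypertension labels for eight hypothetical individuals.
labels = np.array([True, False, False, True, False, True, False, False])

# True counts as 1 and False as 0, so the mean is the proportion of True values.
proportion = np.mean(labels)

# Express the proportion as a percentage.
prevalence_percent = proportion * 100
print(prevalence_percent)
```

Here 3 of the 8 labels are `True`, so the proportion is 0.375 and the prevalence is 37.5 percent.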

In [None]:
...

In [None]:
grader.check("task_07")

---

### Task 08 📍

The NHANES report breaks down the prevalence of hypertension based on the age categories: `'18 - 39'`, `'40 - 59'`, and `'60 and older'`. Add a column to `nhanes` called `'AGE CATEGORY'` where each individual is assigned to the relevant age category based on their age.
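
One way to map an age to its category is a small function with chained comparisons. The function name `age_category` is hypothetical, and the sketch assumes that individuals under 18 have already been filtered out.

```python
def age_category(age):
    """Return the report's age category label for an adult's age in years.

    Assumes age is 18 or older (younger individuals were already excluded).
    """
    if age < 40:
        return '18 - 39'
    elif age < 60:
        return '40 - 59'
    else:
        return '60 and older'

# Check the boundaries between categories.
for example_age in [18, 39, 40, 59, 60, 80]:
    print(example_age, age_category(example_age))
```

A function like this can be applied to each value in a column to produce the new column's values.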

In [None]:
...

In [None]:
grader.check("task_08")

---

### Visualizing Hypertension Prevalence

To wrap up this activity, your goal is to partially recreate the following prevalence graphic from the report:

<a href="https://www.cdc.gov/nchs/products/databriefs/db511.htm"><img src="./db511-fig1.avif" width=600px alt="NHANES hypertension prevalence graphic"></a>

For brevity, you won't include the leftmost group of bars, which combines men and women, or the 18 and older age category.

---

### Task 09 📍

To create this graphic, you'll need to restructure the data in the following format:

<img src="./graph_table.png" width=300px alt="Data table for creating the graphic.">

Create the table `nhanes_for_graphic` that includes 2 rows (one for each gender label, `'Men'` and `'Women'`, from the graphic) and 4 columns: `'GENDER'` and one column for each of the 3 age categories. The values in the table should be the prevalence of hypertension for each gender and age category combination.

**Note:** This will require several steps, so stay focused and check your work as you go!
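
The restructuring amounts to computing one prevalence per gender and age category combination. The plain-Python sketch below shows that idea on made-up records; it is not the `datascience` approach the task expects, and the names `records`, `cells`, and `prevalence` are hypothetical.

```python
from collections import defaultdict

# Made-up records: (gender label, age category, hypertension flag).
records = [
    ("Men", "18 - 39", False),
    ("Men", "18 - 39", True),
    ("Men", "60 and older", True),
    ("Women", "18 - 39", False),
    ("Women", "60 and older", True),
    ("Women", "60 and older", False),
]

# Collect the flags that fall into each (gender, age category) cell.
cells = defaultdict(list)
for gender, category, flag in records:
    cells[(gender, category)].append(flag)

# Prevalence per cell: percentage of True flags among that cell's records.
prevalence = {
    key: 100 * sum(flags) / len(flags)
    for key, flags in cells.items()
}
for key, value in prevalence.items():
    print(key, value)
```

Each key in `prevalence` corresponds to one cell of the target table: the gender label picks the row and the age category picks the column.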

In [None]:
nhanes_for_graphic = ...
nhanes_for_graphic

In [None]:
grader.check("task_09")

---

### Task 10 📍🔎

<!-- BEGIN QUESTION -->

Finally, using the `nhanes_for_graphic` table, create a bar chart with a group of bars for each gender label, where each group contains one bar for each of the 3 age categories. Make sure to reflect on the findings.

**Note:** You will notice some discrepancies between your values and those in the report, since the NHANES report includes a few other adjustments that we did not consider here.

In [None]:
...

plt.title('Hypertension Prevalence')
plt.show()

In [None]:
grader.check("task_09")

<!-- END QUESTION -->

---

## Reflection

In this activity, you worked with real-world health data from NHANES to analyze hypertension prevalence across different demographic groups. You applied data wrangling techniques, such as grouping, pivoting, and computing new columns, to uncover meaningful patterns. By visualizing the results, you gained insights into the disparities in hypertension prevalence and reinforced key programming concepts. More importantly, you saw how large-scale survey data plays a crucial role in public health decision-making. As you reflect on this work, consider how similar techniques could be used to investigate other health conditions or societal trends.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>