# Downloading, Extracting, and Converting BRFSS Data (2022)

This notebook retrieves the 2022 Behavioral Risk Factor Surveillance System (BRFSS) dataset from the CDC. The dataset is provided in SAS Transport Format (`.XPT`), which we convert to CSV format for further analysis.

The process includes:
- Downloading the `.zip` archive from the CDC website (skipped when the converted `brfss_2022.csv` already exists)
- Extracting the `.XPT` file from the archive
  - If the filename contains leading or trailing spaces, it is automatically renamed
- Converting the `.XPT` file to a `.csv` file (`brfss_2022.csv`) using fallback encodings if necessary
- Cleaning up intermediate files (`.zip` and `.XPT`) after successful conversion

This step prepares the raw BRFSS data for exploration and analysis in later notebooks.

**Source**: [CDC BRFSS Annual Data 2022](https://www.cdc.gov/brfss/annual_data/2022/files/LLCP2022XPT.zip)


In [1]:
import os
import zipfile
import requests
import pandas as pd
import pyreadstat  # Required for converting .XPT to .CSV

### Download Setup

The following cell defines the source URL for the 2022 BRFSS dataset and sets up the local paths for storing the downloaded ZIP file, extracted data, and converted CSV. It also ensures that the output directory exists before any files are saved.

In [2]:
# Define paths
url = "https://www.cdc.gov/brfss/annual_data/2022/files/LLCP2022XPT.zip"
zip_path = "data/LLCP2022XPT.zip"
extract_dir = "data"
csv_file = os.path.join(extract_dir, "brfss_2022.csv")

# Ensure 'data/' directory exists
os.makedirs(extract_dir, exist_ok=True)

In [3]:
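# Function to download the ZIP file
def download_file(url, filename):
    print("Downloading BRFSS 2022 data... This may take a while.")
    # Stream the response so the whole archive is never held in memory at once
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
        with open(filename, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                file.write(chunk)
    print("Download complete.")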
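
Before extraction, it can help to confirm that the downloaded file really is a ZIP archive, since an interrupted download or an HTML error page would otherwise only fail later. The following is a minimal sketch using only the standard library; the helper name `is_valid_zip` is hypothetical and not part of the original notebook.

```python
import zipfile


def is_valid_zip(path):
    """Return True if `path` is a readable ZIP archive whose members pass their CRC checks."""
    # is_zipfile() returns False for missing files and for files without a ZIP directory
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        return archive.testzip() is None
```

One possible use is to call `is_valid_zip(zip_path)` right after `download_file(url, zip_path)` and stop with an error if it returns `False`.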

In [4]:
# Function to extract ZIP file and find the .XPT file
def extract_xpt_from_zip(zip_path, extract_dir):
    """
    Extracts an .XPT file from a ZIP archive while handling spaces in filenames,
    renaming files if necessary, and ensuring proper file management.

    Args:
        zip_path (str): Path to the ZIP file.
        extract_dir (str): Directory where files will be extracted.

    Returns:
        str: Path to the extracted .XPT file, or None if extraction fails.
    """
    print("Extracting dataset...")

    # Ensure the extraction directory exists; create it if necessary
    os.makedirs(extract_dir, exist_ok=True)

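    # Extract all files from the ZIP archive and record the (stripped) names it contained
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
        archive_names = {os.path.basename(name).strip() for name in zip_ref.namelist()}

    # Only consider files that came from the archive, so unrelated files
    # already present in extract_dir are never renamed or mistaken for the dataset
    extracted_files = [f for f in os.listdir(extract_dir) if f.strip() in archive_names]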
    xpt_file_path = None  # Variable to store the final .XPT file path

    for file in extracted_files:
        old_path = os.path.join(extract_dir, file)  # Original extracted file path
        new_path = os.path.join(extract_dir, file.strip())  # Trim spaces from filename

        # Rename the file if it had leading/trailing spaces
        if old_path != new_path:
            os.rename(old_path, new_path)

        # Identify and store the path of the .XPT file
        if new_path.lower().endswith(".xpt"):
            xpt_file_path = new_path

    # Check if the .XPT file was found and exists
    if not xpt_file_path or not os.path.exists(xpt_file_path):
        print("Error: XPT file extraction failed.")
        return None

    return xpt_file_path  # Return the cleaned-up .XPT file path

This function extracts the ZIP archive, strips leading or trailing spaces from the extracted filenames, and returns the path to the `.XPT` file (or `None` if no such file is found).
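
As an alternative to extracting everything and renaming afterwards, the `.XPT` member can be located in the archive's own listing and written directly under its stripped name. This is a hedged sketch rather than the notebook's method; `extract_single_xpt` is a hypothetical helper, and it assumes the archive contains at most one `.XPT` file of interest.

```python
import os
import shutil
import zipfile


def extract_single_xpt(zip_path, extract_dir):
    """Extract the first .XPT member of a ZIP archive under a whitespace-stripped name.

    Returns the path of the written file, or None if the archive has no .XPT member.
    """
    os.makedirs(extract_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Drop surrounding spaces and any directory prefix stored in the archive
            clean_name = os.path.basename(member.strip())
            if clean_name.lower().endswith(".xpt"):
                target = os.path.join(extract_dir, clean_name)
                # Stream the member to disk so no rename step is needed afterwards
                with archive.open(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    return None
```

Because only the matching member is written, other files in `extract_dir` are left untouched.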

In [5]:
# Function to convert XPT to CSV with error handling
def convert_xpt_to_csv(xpt_file, csv_file):
    """Convert an .XPT file to CSV, trying Latin-1 first and Windows-1252 as a fallback."""
    if not xpt_file or not os.path.exists(xpt_file):
        print("Error: XPT file not found or extraction failed.")
        return False

    print("Converting .XPT file to .CSV...")
    for encoding in ("latin1", "windows-1252"):
        try:
            df, meta = pyreadstat.read_xport(xpt_file, encoding=encoding)
        except UnicodeDecodeError:
            # Decoding failed with this encoding; move on to the next candidate
            print(f"Error: UnicodeDecodeError occurred with {encoding}. Trying an alternative encoding...")
            continue
        except Exception as e:
            print(f"Critical Error: Could not convert XPT file. Error: {e}")
            return False
        df.to_csv(csv_file, index=False)
        print(f"Conversion complete. CSV saved as: {csv_file}")
        return True

    print("Critical Error: Could not convert XPT file with any of the attempted encodings.")
    return False

### Run Download and Conversion Pipeline

This block checks whether the CSV file already exists. If not, it downloads the ZIP file, extracts the `.XPT`, converts it to CSV, and removes intermediate files.


In [6]:
# Download the file if it doesn't exist
if not os.path.exists(csv_file):  # Only download if CSV doesn't exist
    download_file(url, zip_path)

    # Extract XPT file
    xpt_file = extract_xpt_from_zip(zip_path, extract_dir)

    if xpt_file is None:
        # Raise instead of calling exit(), which is meant for interactive shells
        # and can shut down the notebook kernel
        raise RuntimeError("Extraction failed. Please check the extracted files in the 'data' directory.")

    # Convert to CSV
    conversion_success = convert_xpt_to_csv(xpt_file, csv_file)

    if conversion_success:
        # Remove only the downloaded ZIP and extracted XPT file, keep other files in `data/`
        os.remove(zip_path)
        os.remove(xpt_file)
        print("Clean-up complete: ZIP and XPT files removed.")

else:
    print("CSV file already exists. No action needed.")

Downloading BRFSS 2022 data... This may take a while.
Download complete.
Extracting dataset...
Converting .XPT file to .CSV...
Conversion complete. CSV saved as: data/brfss_2022.csv
Clean-up complete: ZIP and XPT files removed.
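
Once the pipeline has finished, a quick look at the header and first few rows can confirm that the CSV was written as expected without loading the whole file. The sketch below uses only the standard library; `peek_csv` is a hypothetical helper, and it assumes the CSV is UTF-8 encoded (the default for `DataFrame.to_csv`).

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5):
    """Return the header and the first `n_rows` data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])            # Empty list if the file has no rows at all
        rows = list(islice(reader, n_rows))  # Read only the rows we need
    return header, rows
```

For example, `header, rows = peek_csv(csv_file)` followed by `print(len(header), header[:5])` shows how many columns were exported and the names of the first few.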


## Record Dependencies

In [7]:
%load_ext watermark
%watermark
%watermark --iversions

Last updated: 2025-02-17T02:27:29.648436+00:00

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 8.17.2

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 64
Architecture: 64bit

requests  : 2.31.0
pandas    : 2.0.2
pyreadstat: 1.2.8

