adds script to upload pre-processed data (SatcherInstitute#3277)
# Description and Motivation

- Adds `scripts/update_cdc_restricted`, a bash script that clones (or pulls) the CDC `covid_case_restricted_detailed` repo, unzips the most recent data drop, runs the local preprocessing module, and uploads the resulting pre-processed CSVs to the test and prod manual-data buckets

## Has this been tested? How?

Ran the script locally (invocation sketch below) and:
- [x] observed fresh start repo being cloned
- [x] observed existing repo being pulled
- [x] observed files being processed
- [x] observed pre-processed files being uploaded into test buckets
- [x] same for prod buckets
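
For reference, a minimal invocation sketch (this assumes a local health-equity-tracker checkout with `gsutil` authenticated; the script takes no arguments and resolves paths from its own location):

```bash
# from the health-equity-tracker repo root
chmod +x scripts/update_cdc_restricted   # first run only
./scripts/update_cdc_restricted
```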

## Screenshots (if appropriate)

## Types of changes

(leave all that apply)

- Bug fix
- New content or feature
- Refactor / chore

## New frontend preview link is below in the Netlify comment 😎
benhammondmusic committed May 14, 2024
1 parent 68a1491 commit 7439a79
Showing 1 changed file with 126 additions and 0 deletions.
126 changes: 126 additions & 0 deletions scripts/update_cdc_restricted
@@ -0,0 +1,126 @@
#!/bin/bash

# Define the bucket destinations
test_bucket="gs://msm-test-manual-data-bucket"
prod_bucket="gs://prod-manual-data-bucket"


# Define the name of the repository directory
cdc_repo_name="covid_case_restricted_detailed"

# Find the parent directory of the health-equity-tracker directory
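# (realpath "$0" resolves to .../health-equity-tracker/scripts/update_cdc_restricted,
# so three dirname calls yield the directory that contains health-equity-tracker)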
parent_dir=$(dirname "$(dirname "$(dirname "$(realpath "$0")")")")

# Define the path to the covid_case_restricted_detailed repository
repo_path="$parent_dir/$cdc_repo_name"

# Clone or pull the repository
if [ ! -d "$repo_path" ]; then
echo "Repository directory not found. Cloning the repository..."
git clone https://github.com/cdc-data/covid_case_restricted_detailed.git "$repo_path" || {
echo "Failed to clone repository"
exit 1
}
else
echo "Repository directory found. Pulling latest changes..."
cd "$repo_path" || {
echo "Failed to navigate to repository directory"
exit 1
}
git pull origin master || {
echo "Failed to pull latest changes"
exit 1
}
fi

# Define the path to the data directory
data_dir="$repo_path/data"

# Check if the data directory exists
if [ ! -d "$data_dir" ]; then
echo "Data directory not found"
exit 1
fi

# Find the most recently modified directory within the data_dir
# (ls -t sorts by modification time, newest first)
# shellcheck disable=SC2012
most_recent_dir=$(ls -td "$data_dir"/*/ | head -n 1)
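# (a parse-safe alternative, assuming GNU find, would be:
#   find "$data_dir" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %p/\n' | sort -rn | head -n 1 | cut -d' ' -f2-)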

if [ -z "$most_recent_dir" ]; then
    echo "No directories found in ${data_dir}"
    exit 1
fi

# Navigate to the most recently modified directory within the data_dir
cd "$most_recent_dir" || {
    echo "Failed to navigate to most recent directory"
    exit 1
}

# Unzip all zip files in the most recent directory
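# (-n: never overwrite existing files, so re-runs skip already-extracted archives)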
unzip -n '*.zip' || {
    echo "Failed to unzip files"
    exit 1
}

# Navigate to the health-equity-tracker directory
cd ../../../health-equity-tracker || {
    echo "Failed to navigate to health-equity-tracker directory"
    exit 1
}

# shellcheck disable=SC1091
source .venv/bin/activate || {
    echo "Failed to activate virtual environment"
    exit 1
}

# Run the local module to generate non-restricted files
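# ('spark_part' is presumably the filename prefix of the part files inside the data drop)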
python python/datasources/cdc_restricted_local.py -dir "$most_recent_dir" -prefix spark_part || {
    echo "Failed to run cdc_restricted_local.py"
    exit 1
}

# Upload CSV files to the test bucket
for csv_file in "$most_recent_dir"/cdc_restricted_by_*.csv; do
    gsutil cp "$csv_file" "$test_bucket/" || {
        echo "Failed to upload $csv_file to $test_bucket"
        exit 1
    }
    echo "Uploaded $csv_file to $test_bucket"
done

# If the upload to the test bucket succeeded, upload to the prod bucket
for csv_file in "$most_recent_dir"/cdc_restricted_by_*.csv; do
    gsutil cp "$csv_file" "$prod_bucket/" || {
        echo "Failed to upload $csv_file to $prod_bucket"
        exit 1
    }
    echo "Uploaded $csv_file to $prod_bucket"
done
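
# To spot-check the uploads manually afterwards (not run by this script), one could use:
#   gsutil ls "$prod_bucket"/cdc_restricted_by_*.csv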

echo "🙌 Done!
Next steps for you:
1. Make a new PR updating 'original_data_sourced' fields entries starting with 'cdc_restricted_data-' in frontend/src/data/config/DatasetMetadata.ts
- NOTE: the most recent data is from the previous month, so if you're updating from the CDC's May release, the new data will be sourced from April.
2. Run the cdc_restricted DAG from the HET Infra TEST Composer https://console.cloud.google.com/composer/environments?project=het-infra-test-05
- If you see Composer exists at that link, click Airflow, and then the play button for the cdc_restricted DAG.
- If Composer doesn't exist, you can restart it by adding an empty comment to any file inside python/ to the PR you just made. Then, merging that PR to main will automatically rebuild Composer on the test environment allowing Airflow usage. You can also force push your updated, local main branch to 'infra-test' to trigger.
- Once the DAG completes successfully, you should be able to view the updated data pipeline output in the test GCP project's BigQuery tables and also the exported .json files found in the GCP Buckets. Once you merge the PR from Step 1, the updated data should show on the dev site: https://dev.healthequitytracker.org/exploredata?mls=1.covid-3.00&group1=All#rates-over-time
3. If all looks good on staging, cut a new release: https://www.notion.so/healthequitytracker/Cut-and-Deploy-New-Release-to-Production-18f7e04e42f444ad83a5c857d4007090?pvs=4
- Make sure you select 'True' when asked if the release includes changes to the data pipeline and require Composer.
4. Run the cdc_restricted DAG from the HET Prod TEST Composer https://console.cloud.google.com/composer/environments?project=het-infra-prod-f6
5. Once DAG completes successfully, you should be able to see the results on the real production site: https://healthequitytracker.org/exploredata?mls=1.covid-3.00&group1=All#rates-over-time
6. Don't forget to turn off both Composers once you're done using them.
- https://github.com/SatcherInstitute/setup-cloud-platform/actions/workflows/cronDeleteComposer.yml
- https://github.com/SatcherInstitute/health-equity-tracker/actions/workflows/cronDeleteComposer.yml
"

exit 0
