adds script to upload pre-processed data (SatcherInstitute#3277)
# Description and Motivation

- Adds `scripts/update_cdc_restricted`, a bash script that clones (or pulls) the CDC `covid_case_restricted_detailed` repo, unzips the most recent data drop, runs the local preprocessing module, and uploads the resulting pre-processed CSVs to the test and prod manual-data buckets

## Has this been tested? How?

Ran the script locally (invocation sketch below) and:
- [x] observed fresh start repo being cloned
- [x] observed existing repo being pulled
- [x] observed files being processed
- [x] observed pre-processed files being uploaded into test buckets
- [x] same for prod buckets
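
For reference, a minimal invocation sketch (this assumes a local health-equity-tracker checkout with `gsutil` authenticated; the script takes no arguments and resolves paths from its own location):

```bash
# from the health-equity-tracker repo root
chmod +x scripts/update_cdc_restricted   # first run only
./scripts/update_cdc_restricted
```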

## Screenshots (if appropriate)

## Types of changes

(leave all that apply)

- Bug fix
- New content or feature
- Refactor / chore

## New frontend preview link is below in the Netlify comment 😎
benhammondmusic committed May 14, 2024
1 parent 68a1491 commit 7439a79
Showing 1 changed file with 126 additions and 0 deletions.
126 changes: 126 additions & 0 deletions scripts/update_cdc_restricted
@@ -0,0 +1,126 @@
#!/bin/bash

# Define the bucket destinations
test_bucket="gs://msm-test-manual-data-bucket"
prod_bucket="gs://prod-manual-data-bucket"


# Define the name of the repository directory
cdc_repo_name="covid_case_restricted_detailed"

# Find the parent directory of the health-equity-tracker directory
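# (realpath "$0" resolves to .../health-equity-tracker/scripts/update_cdc_restricted,
# so three dirname calls yield the directory that contains health-equity-tracker)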
parent_dir=$(dirname "$(dirname "$(dirname "$(realpath "$0")")")")

# Define the path to the covid_case_restricted_detailed repository
repo_path="$parent_dir/$cdc_repo_name"

# Clone or pull the repository
if [ ! -d "$repo_path" ]; then
echo "Repository directory not found. Cloning the repository..."
git clone https://github.com/cdc-data/covid_case_restricted_detailed.git "$repo_path" || {
echo "Failed to clone repository"
exit 1
}
else
echo "Repository directory found. Pulling latest changes..."
cd "$repo_path" || {
echo "Failed to navigate to repository directory"
exit 1
}
git pull origin master || {
echo "Failed to pull latest changes"
exit 1
}
fi

# Define the path to the data directory
data_dir="$repo_path/data"

# Check if the data directory exists
if [ ! -d "$data_dir" ]; then
echo "Data directory not found"
exit 1
fi

# Find the most recently modified directory within the data_dir
# (ls -t sorts by modification time, newest first)
# shellcheck disable=SC2012
most_recent_dir=$(ls -td "$data_dir"/*/ | head -n 1)
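# (a parse-safe alternative, assuming GNU find, would be:
#   find "$data_dir" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %p/\n' | sort -rn | head -n 1 | cut -d' ' -f2-)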

if [ -z "$most_recent_dir" ]; then
    echo "No directories found in ${data_dir}"
    exit 1
fi

# Navigate to the most recently modified directory within the data_dir
cd "$most_recent_dir" || {
    echo "Failed to navigate to most recent directory"
    exit 1
}

# Unzip all zip files in the most recent directory
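# (-n: never overwrite existing files, so re-runs skip already-extracted archives)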
unzip -n '*.zip' || {
    echo "Failed to unzip files"
    exit 1
}

# Navigate to the health-equity-tracker directory
cd ../../../health-equity-tracker || {
    echo "Failed to navigate to health-equity-tracker directory"
    exit 1
}

# shellcheck disable=SC1091
source .venv/bin/activate || {
    echo "Failed to activate virtual environment"
    exit 1
}

# Run the local module to generate non-restricted files
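# ('spark_part' is presumably the filename prefix of the part files inside the data drop)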
python python/datasources/cdc_restricted_local.py -dir "$most_recent_dir" -prefix spark_part || {
    echo "Failed to run cdc_restricted_local.py"
    exit 1
}

# Upload CSV files to the test bucket
for csv_file in "$most_recent_dir"/cdc_restricted_by_*.csv; do
    gsutil cp "$csv_file" "$test_bucket/" || {
        echo "Failed to upload $csv_file to $test_bucket"
        exit 1
    }
    echo "Uploaded $csv_file to $test_bucket"
done

# If the upload to the test bucket succeeded, upload to the prod bucket
for csv_file in "$most_recent_dir"/cdc_restricted_by_*.csv; do
    gsutil cp "$csv_file" "$prod_bucket/" || {
        echo "Failed to upload $csv_file to $prod_bucket"
        exit 1
    }
    echo "Uploaded $csv_file to $prod_bucket"
done
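
# To spot-check the uploads manually afterwards (not run by this script), one could use:
#   gsutil ls "$prod_bucket"/cdc_restricted_by_*.csv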

echo "🙌 Done!
Next steps for you:
1. Make a new PR updating 'original_data_sourced' fields entries starting with 'cdc_restricted_data-' in frontend/src/data/config/DatasetMetadata.ts
- NOTE: the most recent data is from the previous month, so if you're updating from the CDC's May release, the new data will be sourced from April.
2. Run the cdc_restricted DAG from the HET Infra TEST Composer https://console.cloud.google.com/composer/environments?project=het-infra-test-05
- If you see Composer exists at that link, click Airflow, and then the play button for the cdc_restricted DAG.
- If Composer doesn't exist, you can restart it by adding an empty comment to any file inside python/ to the PR you just made. Then, merging that PR to main will automatically rebuild Composer on the test environment allowing Airflow usage. You can also force push your updated, local main branch to 'infra-test' to trigger.
- Once the DAG completes successfully, you should be able to view the updated data pipeline output in the test GCP project's BigQuery tables and also the exported .json files found in the GCP Buckets. Once you merge the PR from Step 1, the updated data should show on the dev site: https://dev.healthequitytracker.org/exploredata?mls=1.covid-3.00&group1=All#rates-over-time
3. If all looks good on staging, cut a new release: https://www.notion.so/healthequitytracker/Cut-and-Deploy-New-Release-to-Production-18f7e04e42f444ad83a5c857d4007090?pvs=4
- Make sure you select 'True' when asked if the release includes changes to the data pipeline and require Composer.
4. Run the cdc_restricted DAG from the HET Prod TEST Composer https://console.cloud.google.com/composer/environments?project=het-infra-prod-f6
5. Once DAG completes successfully, you should be able to see the results on the real production site: https://healthequitytracker.org/exploredata?mls=1.covid-3.00&group1=All#rates-over-time
6. Don't forget to turn off both Composers once you're done using them.
- https://github.com/SatcherInstitute/setup-cloud-platform/actions/workflows/cronDeleteComposer.yml
- https://github.com/SatcherInstitute/health-equity-tracker/actions/workflows/cronDeleteComposer.yml
"

exit 0
