<a href="https://colab.research.google.com/github/ajitonelsonn/H_ArngoDB/blob/main/H_ArangoDB_Download_Extract_and_Merge_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏥 SyntheticMass Health Data Processing Pipeline

## A Comprehensive Data Engineering Project for Healthcare Analytics

---

### 📊 Project Components:
1. **Data Download**: Box dataset retrieval
2. **Archive Extraction**: Processing nested archives
3. **CSV Management**: Merging and organizing data
4. **Data Analysis**: Sample visualization and statistics
5. **Distribution**: ZIP creation and Hugging Face upload

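**⚠ Important:** This notebook was run on Google Colab with the **High RAM** runtime. The downloaded archive alone is over 21,000 MB, and the merge step loads entire CSV files into memory.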

# 1. Dataset Download Code

## **Code Explanation**
## Import Section
Imports `requests` for the download, `gzip` and `tarfile` for decompression, and `pathlib` and `shutil` for file system operations.

## Main Function Overview
The `download_box_data()` function is designed to:
- Download a compressed dataset from Box
- Extract the contents
- Manage the files locally
- Handle any potential errors

## URL and Directory Setup
Creates a new directory for downloaded data and defines the Box URL where the dataset is stored.

## Download Process
Downloads the file in chunks to manage memory efficiently, especially for large files. The streaming approach prevents loading the entire file into memory at once.

## File Saving
Saves the downloaded compressed file to disk, writing it chunk by chunk to ensure stable handling of large files.
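
The chunked write described above can be illustrated without a network call. This is a minimal sketch, assuming a hypothetical helper named `copy_in_chunks` and an in-memory stream standing in for the HTTP response body; it is not the notebook's own function.

```python
import io

def copy_in_chunks(src, dst, chunk_size=8192):
    """Copy a binary stream to another in fixed-size chunks; return bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# In-memory stand-in for a large download (20,000 bytes, illustrative only)
source = io.BytesIO(b"x" * 20_000)
target = io.BytesIO()
written = copy_in_chunks(source, target)
print(f"Wrote {written} bytes")
```

At most one chunk is held in memory at a time, which is why the same pattern scales to a multi-gigabyte archive.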

## Extraction Logic
Chooses between two extraction methods, then reports the result:
- Uses the `tarfile` module if the download is a tar archive
- Falls back to plain gzip decompression otherwise
- Prints the size of each extracted file
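
The tar-versus-gzip check can be tried on small synthetic files. The sketch below assumes a hypothetical helper `describe_archive` and builds two throwaway files in a temporary directory; the file names are illustrative and not part of the dataset.

```python
import gzip
import io
import tarfile
import tempfile
from pathlib import Path

def describe_archive(path):
    """Return 'tar' if tarfile can open the file, otherwise 'gzip'."""
    return "tar" if tarfile.is_tarfile(path) else "gzip"

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # A gzip-compressed tar archive containing one small member
    tar_path = tmp / "bundle.gz"
    payload = b"id,name\n1,a\n"
    with tarfile.open(tar_path, "w:gz") as tar:
        info = tarfile.TarInfo(name="sample.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    # A plain gzip file that is not a tar archive
    gz_path = tmp / "plain.gz"
    with gzip.open(gz_path, "wb") as f:
        f.write(b"just some text\n")

    tar_kind = describe_archive(tar_path)
    gz_kind = describe_archive(gz_path)
    print(tar_kind, gz_kind)
```

Because `tarfile.is_tarfile` inspects the content rather than the extension, a `.gz` URL can still resolve to a tar archive, as it does for this dataset.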

## File Cleanup
Prompts the user (y/n) to keep or delete the compressed file once extraction is complete.

## Error Handling
Catches errors in two tiers and returns `None` on failure:
- `requests.exceptions.RequestException` for download failures
- Any other `Exception` for extraction and file processing issues

## Execution
Final step that runs the download function and stores the returned output directory (or `None` on failure) in `downloaded_path`.

## Code

In [None]:
# Download Box Dataset
import requests
import gzip
import os
from pathlib import Path
import shutil
import tarfile

print("Downloading Box dataset...")

def download_box_data():
    """Downloads and extracts gzipped data from Box"""
    url = "https://mitre.box.com/shared/static/3bo45m48ocpzp8fc0tp005vax7l93xji.gz"
    output_dir = Path('downloaded_data')
    output_dir.mkdir(exist_ok=True)

    try:
        # Download the file
        print("Downloading file...")
        response = requests.get(url, stream=True)
        response.raise_for_status()

        # Save the compressed file
        gz_path = output_dir / 'data.tar.gz'
        with open(gz_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

        print(f"Downloaded file size: {gz_path.stat().st_size / (1024*1024):.2f} MB")

        # Check if it's a tar.gz file
        if tarfile.is_tarfile(gz_path):
            print("Detected tar.gz format, extracting...")
            with tarfile.open(gz_path, 'r:gz') as tar:
                tar.extractall(path=output_dir)
                extracted_files = tar.getnames()
        else:
            # Try regular gzip extraction
            print("Attempting gzip extraction...")
            output_path = output_dir / 'extracted_data'
            with gzip.open(gz_path, 'rb') as f_in:
                with open(output_path, 'wb') as f_out:
                    shutil.copyfileobj(f_in, f_out)
            # Store the name relative to output_dir, matching tar.getnames()
            extracted_files = [output_path.name]

        print("\nExtracted files:")
        for file in extracted_files:
            file_path = output_dir / Path(file)
            if file_path.exists():
                print(f"- {file}: {file_path.stat().st_size / (1024*1024):.2f} MB")

        # Ask before cleaning up
        keep_gz = input("\nKeep compressed file? (y/n): ").lower().strip() == 'y'
        if not keep_gz:
            gz_path.unlink()
            print("Compressed file removed")

        return output_dir

    except requests.exceptions.RequestException as e:
        print(f"Download failed: {str(e)}")
    except Exception as e:
        print(f"Error processing file: {e}")
        print("Full error details:", repr(e))

    return None

# Execute download
downloaded_path = download_box_data()

Downloading Box dataset...
Downloading file...
Downloaded file size: 21304.18 MB
Detected tar.gz format, extracting...

Extracted files:
- synthea_1m_fhir_3_0_May_24: 0.00 MB
- synthea_1m_fhir_3_0_May_24/output_10_20170528T030916.tar.gz: 1816.68 MB
- synthea_1m_fhir_3_0_May_24/output_11_20170528T113605.tar.gz: 1815.63 MB
- synthea_1m_fhir_3_0_May_24/output_12_20170528T195303.tar.gz: 1814.30 MB
- synthea_1m_fhir_3_0_May_24/output_1_20170524T232103.tar.gz: 1815.34 MB
- synthea_1m_fhir_3_0_May_24/output_2_20170525T073836.tar.gz: 1817.33 MB
- synthea_1m_fhir_3_0_May_24/output_3_20170525T161555.tar.gz: 1811.70 MB
- synthea_1m_fhir_3_0_May_24/output_4_20170526T004637.tar.gz: 1817.53 MB
- synthea_1m_fhir_3_0_May_24/output_5_20170526T091439.tar.gz: 1814.23 MB
- synthea_1m_fhir_3_0_May_24/output_6_20170526T173337.tar.gz: 1814.72 MB
- synthea_1m_fhir_3_0_May_24/output_7_20170527T015508.tar.gz: 1810.01 MB
- synthea_1m_fhir_3_0_May_24/output_8_20170527T102552.tar.gz: 1813.36 MB
- synthea_1m_fhir_3

# 2. Nested Archive Extraction Code

## **Code Explanation**
## Function Purpose
The `extract_nested_archives` function searches through a directory for tar.gz files, extracts them, and collects all CSV files into a separate directory.

## Input Parameters
- `base_dir`: Source directory containing the archives (default path for downloaded Synthea data)
- `output_dir`: Destination directory for extracted CSV files

## Directory Setup
Creates an output directory to store all discovered CSV files in an organized structure.

## Main Process Flow
1. **Archive Discovery**: Searches for all tar.gz files in the base directory
2. **Temporary Processing**: Creates a temp directory for each archive during extraction
3. **File Extraction**: Extracts archives one by one using tarfile module
4. **CSV Collection**:
   - Finds all CSV files in extracted content
   - Maintains original directory structure
   - Copies a file only if the destination is missing or its size differs

## File Organization
- Preserves relative paths from source
- Creates necessary subdirectories automatically
- Skips files whose destination already exists with the same size
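
The copy decision in the code below compares file sizes only, so two different files of equal size would be treated as identical. A stricter check is sketched here as an alternative, not as the notebook's method; `file_digest` and `needs_copy` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_copy(src, dest):
    """True if dest is missing or its content differs from src."""
    src, dest = Path(src), Path(dest)
    if not dest.exists():
        return True
    if src.stat().st_size != dest.stat().st_size:
        return True  # cheap size check first
    return file_digest(src) != file_digest(dest)

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    a.write_text("id\n1\n")
    b.write_text("id\n2\n")  # same size as a, different content
    same_size_differs = needs_copy(a, b)
    missing_dest = needs_copy(a, Path(tmp) / "missing.csv")
    identical = needs_copy(a, a)
    print(same_size_differs, missing_dest, identical)
```

Hashing reads every byte, so for multi-gigabyte CSV files the size-only check is much cheaper. Which trade-off fits depends on how likely same-size collisions are.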

## Cleanup Process
- Removes temporary directories after processing
- Maintains clean workspace during extraction

## Summary Generation
- Groups CSV files by type (based on filename)
- Shows first 3 examples of each type
- Provides count of remaining files
- Creates organized overview of extracted data
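
The grouping step can be sketched in isolation. This example assumes a hypothetical helper `group_by_type` and a few illustrative paths modeled on the extraction output; the key is the part of the file stem before the first underscore, as in the code below.

```python
from collections import defaultdict
from pathlib import Path

def group_by_type(paths):
    """Group paths by the part of the file stem before the first underscore."""
    groups = defaultdict(list)
    for path in paths:
        path = Path(path)
        groups[path.stem.split("_")[0]].append(path)
    return dict(groups)

# Illustrative paths only
paths = [
    "output_1/csv/patients.csv",
    "output_2/csv/patients.csv",
    "output_1/csv/allergies.csv",
]
groups = group_by_type(paths)
for file_type, files in groups.items():
    print(f"{file_type.capitalize()} files ({len(files)})")
```

For names without an underscore, such as `patients.csv`, the key is simply the whole stem.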

## Error Handling
Includes `try`/`except` blocks to:
- Handle extraction errors gracefully
- Continue processing remaining files if one fails
- Provide error feedback for troubleshooting

## Code

In [None]:
# Extract Nested Archives and Find CSVs
import tarfile
import os
from pathlib import Path
import shutil

def extract_nested_archives(base_dir='downloaded_data/synthea_1m_fhir_3_0_May_24',
                          output_dir='extracted_csvs'):
    """
    Extract nested tar.gz files and collect CSV files
    """
    base_path = Path(base_dir)
    output_path = Path(output_dir)

    # Create output directory
    output_path.mkdir(exist_ok=True)

    print("Starting extraction of nested archives...")

    # Track discovered CSV files
    csv_files = []

    # Process each tar.gz file in the base directory
    for tar_file in base_path.glob('*.tar.gz'):
        print(f"\nProcessing {tar_file.name}...")

        # Create temporary directory for this archive
        temp_dir = base_path / f"temp_{tar_file.stem}"
        temp_dir.mkdir(exist_ok=True)

        try:
            # Extract the tar.gz file
            with tarfile.open(tar_file, 'r:gz') as tar:
                tar.extractall(path=temp_dir)

            # Find and copy CSV files
            for csv_file in temp_dir.rglob('*.csv'):
                # Get relative path components
                rel_path = csv_file.relative_to(temp_dir)

                # Create destination path
                dest_path = output_path / rel_path
                dest_path.parent.mkdir(parents=True, exist_ok=True)

                # Copy file if it doesn't exist or is different
                if not dest_path.exists() or csv_file.stat().st_size != dest_path.stat().st_size:
                    shutil.copy2(csv_file, dest_path)
                    print(f"Copied: {rel_path}")
                    csv_files.append(dest_path)

        except Exception as e:
            print(f"Error processing {tar_file.name}: {str(e)}")
        finally:
            # Clean up the temporary directory even if extraction failed
            shutil.rmtree(temp_dir, ignore_errors=True)

    # Print summary
    print("\nExtraction complete!")
    print(f"Found {len(csv_files)} unique CSV files:")

    # Group and display CSV files by type
    csv_by_type = {}
    for csv_file in csv_files:
        file_type = csv_file.stem.split('_')[0]
        if file_type not in csv_by_type:
            csv_by_type[file_type] = []
        csv_by_type[file_type].append(csv_file)

    for file_type, files in csv_by_type.items():
        print(f"\n{file_type.capitalize()} files ({len(files)}):")
        for file in files[:3]:  # Show first 3 examples
            print(f"- {file.name}")
        if len(files) > 3:
            print(f"  ... and {len(files) - 3} more")

    return csv_files

# Execute extraction
csv_files = extract_nested_archives()

Starting extraction of nested archives...

Processing output_8_20170527T102552.tar.gz...
Copied: output_8/csv/allergies.csv
Copied: output_8/csv/careplans.csv
Copied: output_8/csv/immunizations.csv
Copied: output_8/csv/observations.csv
Copied: output_8/csv/encounters.csv
Copied: output_8/csv/medications.csv
Copied: output_8/csv/procedures.csv
Copied: output_8/csv/patients.csv
Copied: output_8/csv/conditions.csv

Processing output_10_20170528T030916.tar.gz...
Copied: output_10/csv/allergies.csv
Copied: output_10/csv/careplans.csv
Copied: output_10/csv/immunizations.csv
Copied: output_10/csv/observations.csv
Copied: output_10/csv/encounters.csv
Copied: output_10/csv/medications.csv
Copied: output_10/csv/procedures.csv
Copied: output_10/csv/patients.csv
Copied: output_10/csv/conditions.csv

Processing output_2_20170525T073836.tar.gz...
Copied: output_2/csv/allergies.csv
Copied: output_2/csv/careplans.csv
Copied: output_2/csv/immunizations.csv
Copied: output_2/csv/observations.csv
Copied: 

# 3. CSV File Merging Code

## **Code Explanation**
## Function Purpose
The `merge_csv_files` function combines multiple CSV files with the same name from different directories into single consolidated files.

## Input Parameters
- `source_dir`: Directory containing the CSV files (default: 'extracted_csvs')
- `output_dir`: Directory where merged files will be saved (default: 'final_merge')

## Initial Setup
- Creates output directory
- Finds all unique CSV filenames across all subdirectories

## CSV Processing Flow
1. **File Discovery**:
   - Identifies all instances of each unique CSV filename
   - Creates consistent column structure from first file

2. **Data Reading**:
   - Reads each CSV file maintaining original column structure
   - Tracks total row count
   - Reports successful reads and any errors

3. **Merging Process**:
   - Combines all dataframes using pandas concat
   - Passes `ignore_index=True` so the merged frame gets a fresh, contiguous row index
   - Maintains original column structure
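
The header-first read and the effect of `ignore_index` can be seen on two tiny in-memory CSVs. This is a sketch under stated assumptions: the column names `PATIENT` and `CODE` are borrowed from the sample output later in this notebook, while the `EXTRA` column and all values are hypothetical.

```python
import io
import pandas as pd

part_a = "PATIENT,CODE\np1,1\np2,2\n"
part_b = "PATIENT,CODE,EXTRA\np3,3,x\n"  # hypothetical extra column

# Header-only read: nrows=0 parses column names without loading data rows
columns = pd.read_csv(io.StringIO(part_a), nrows=0).columns.tolist()

# usecols keeps only the reference columns, so EXTRA is dropped from part_b
frames = [pd.read_csv(io.StringIO(text), usecols=columns) for text in (part_a, part_b)]

kept = pd.concat(frames)                       # each frame keeps its own row labels
merged = pd.concat(frames, ignore_index=True)  # row labels are renumbered from 0

print(list(kept.index))
print(list(merged.index))
```

Without `ignore_index=True`, row labels repeat across the source files, which can make label-based lookups on the merged frame ambiguous.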

## File Output
- Saves merged files to output directory
- Preserves original filename
- Reports file statistics:
  - Number of files merged
  - Total row count
  - Final file size in MB

## Error Handling
- Per-file error catching
- Continues processing if individual files fail
- Reports specific errors for troubleshooting
- Ensures partial success if some files can't be processed

## Progress Reporting
- Shows current file being processed
- Reports successful file reads
- Provides merge statistics
- Shows final file sizes

## Code

In [None]:
import pandas as pd
import os
from pathlib import Path
import glob

def merge_csv_files(source_dir='extracted_csvs', output_dir='final_merge'):
    """
    Merge CSV files with the same name from different directories and save to output directory
    """
    # Create output directory if it doesn't exist
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    # Get unique CSV file names
    unique_csv_names = set()
    for csv_file in Path(source_dir).rglob('*.csv'):
        unique_csv_names.add(csv_file.name)

    print("Starting CSV merging process...")

    # Process each unique CSV name
    for csv_name in unique_csv_names:
        print(f"\nProcessing {csv_name}...")

        try:
            # Find all files with this name
            # rglob already searches recursively, so no '**/' prefix is needed
            csv_files = list(Path(source_dir).rglob(csv_name))

            # First read the header of the first file to get column structure
            first_df = pd.read_csv(csv_files[0], nrows=0)
            columns = first_df.columns.tolist()

            # Read and concatenate all matching CSV files
            dfs = []
            total_rows = 0

            for file in csv_files:
                try:
                    # Read CSV with only the columns from the first file
                    df = pd.read_csv(file, usecols=columns)
                    total_rows += len(df)
                    dfs.append(df)
                    print(f"Successfully read {file}")
                except Exception as e:
                    print(f"Error reading {file}: {str(e)}")
                    continue

            if dfs:
                # Merge all dataframes
                merged_df = pd.concat(dfs, ignore_index=True)

                # Save merged file
                output_file = output_path / csv_name
                merged_df.to_csv(output_file, index=False)

                # Get file size in MB
                file_size = output_file.stat().st_size / (1024 * 1024)  # Convert bytes to MB

                print(f"Merged {len(dfs)} files into {csv_name}")
                print(f"Total rows: {total_rows:,}")
                print(f"Final file size: {file_size:.2f} MB")
            else:
                print(f"No files were successfully processed for {csv_name}")

        except Exception as e:
            print(f"Error processing {csv_name}: {str(e)}")
            continue

# Execute the merge
merge_csv_files()

Starting CSV merging process...

Processing medications.csv...
Successfully read extracted_csvs/output_6/csv/medications.csv
Successfully read extracted_csvs/output_11/csv/medications.csv
Successfully read extracted_csvs/output_9/csv/medications.csv
Successfully read extracted_csvs/output_7/csv/medications.csv
Successfully read extracted_csvs/output_4/csv/medications.csv
Successfully read extracted_csvs/output_10/csv/medications.csv
Successfully read extracted_csvs/output_8/csv/medications.csv
Successfully read extracted_csvs/output_12/csv/medications.csv
Successfully read extracted_csvs/output_2/csv/medications.csv
Successfully read extracted_csvs/output_5/csv/medications.csv
Successfully read extracted_csvs/output_1/csv/medications.csv
Successfully read extracted_csvs/output_3/csv/medications.csv
Merged 12 files into medications.csv
Total rows: 4,781,956
Final file size: 747.25 MB

Processing immunizations.csv...
Successfully read extracted_csvs/output_6/csv/immunizations.csv
Success

# 4. Data Sample Display Code

## **Code Explanation**

## Function Purpose
The `show_merged_data_samples` function provides a comprehensive overview of each merged CSV file, showing key statistics and sample data.

## Input Parameters
- `merged_dir`: Directory containing merged CSV files (default: 'final_merge')
- `num_samples`: Number of sample rows to display (default: 5)

## Data Overview Process
1. **File Discovery**:
   - Locates all CSV files in merged directory
   - Processes each file individually

2. **Basic Information Display**:
   - Shows filename
   - Reports total row count
   - Lists all column names
   - Displays column count

3. **Sample Data Display**:
   - Shows first few rows of data
   - Uses specified sample size
   - Maintains readable format

4. **Statistical Analysis**:
   - Generates descriptive statistics
   - Focuses on numeric columns
   - Rounds values for readability
   - Includes:
     - Count
     - Mean
     - Standard deviation
     - Min/Max values
     - Quartile information

## Output Formatting
- Uses clear section separators
- Organizes information logically
- Makes output easily readable
- Provides consistent structure across files

## Visual Organization
- Uses separator lines for clarity
- Groups related information together
- Maintains consistent spacing
- Creates clear visual hierarchy

## Code

In [None]:
import pandas as pd
from pathlib import Path

def show_merged_data_samples(merged_dir='final_merge', num_samples=5):
    """
    Display sample data from each merged CSV file
    """
    merged_path = Path(merged_dir)

    # Get all CSV files in the merged directory
    merged_files = list(merged_path.glob('*.csv'))

    for file in merged_files:
        print(f"\n{'='*80}")
        print(f"File: {file.name}")
        print(f"{'='*80}")

        # Read the CSV file
        df = pd.read_csv(file)

        # Display basic information
        print("\nDataset Info:")
        print(f"Total Rows: {len(df):,}")
        print(f"Total Columns: {len(df.columns):,}")
        print("\nColumns:", ', '.join(df.columns))

        # Display sample data
        print(f"\nFirst {num_samples} rows:")
        print(df.head(num_samples))

        # Display basic statistics for numeric columns
        print("\nNumeric Columns Statistics:")
        print(df.describe().round(2))

# Execute the function
show_merged_data_samples()


File: allergies.csv

Dataset Info:
Total Rows: 624,611
Total Columns: 6

Columns: START, STOP, PATIENT, ENCOUNTER, CODE, DESCRIPTION

First 5 rows:
        START STOP                               PATIENT  \
0  1990-10-01  NaN  6d1aa8c5-c16e-488c-9542-c08b016a069a   
1  1990-10-01  NaN  6d1aa8c5-c16e-488c-9542-c08b016a069a   
2  1963-01-05  NaN  aca17de8-b17d-4db0-91c5-8dc3211df286   
3  1981-09-04  NaN  058388d9-e0c2-49ae-8994-a9db47205c8b   
4  1981-09-04  NaN  058388d9-e0c2-49ae-8994-a9db47205c8b   

                              ENCOUNTER       CODE              DESCRIPTION  
0  338c5a79-5ca2-40e8-8503-c511b9a9315f  417532002          Allergy to fish  
1  338c5a79-5ca2-40e8-8503-c511b9a9315f  232347008  Dander (animal) allergy  
2  9b65787f-cf76-4299-84ac-815e03b1ba06  300913006        Shellfish allergy  
3  e1029f45-c057-42c6-a393-20cbc2726620  232350006  House dust mite allergy  
4  e1029f45-c057-42c6-a393-20cbc2726620  300916003            Latex allergy  

Numeric Columns Stati

# 5. ZIP Archive Creation Code

## **Code Explanation**

## Function Purpose
The `create_zip_archive` function creates a ZIP file from the merged CSV data directory, making it easy to share or store the processed data.

## Input Parameters
- `source_dir`: Directory to be zipped (default: 'final_merge')
- `zip_name`: Name for the output ZIP file (default: 'final_merge_data')

## Process Flow
1. **Path Setup**:
   - Creates Path objects for source directory
   - Sets up ZIP file destination path

2. **Archive Creation**:
   - Uses shutil.make_archive for compression
   - Creates ZIP format archive
   - Includes all files from source directory

3. **Size Reporting**:
   - Calculates final archive size
   - Converts size to megabytes
   - Displays formatted size
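
The archive step can be reproduced on a throwaway directory. The sketch below assumes illustrative names (`demo_archive`, a one-line `patients.csv`) and writes everything to a temporary directory, so it does not touch the real `final_merge` folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / "final_merge"
    source.mkdir()
    (source / "patients.csv").write_text("ID\n1\n")

    # make_archive appends the extension and returns the archive path as a string
    archive = Path(shutil.make_archive(str(tmp / "demo_archive"), "zip", root_dir=str(source)))
    size_mb = archive.stat().st_size / (1024 * 1024)

    # List the members to confirm paths are stored relative to the source directory
    with zipfile.ZipFile(archive) as zf:
        members = zf.namelist()

    archive_name = archive.name
    print(f"{archive_name}: {size_mb:.4f} MB, members={members}")
```

Passing the base name without `.zip` matters: `shutil.make_archive` adds the extension itself, which is why the function below builds `zip_path` separately for the size check.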

## Error Handling
- Catches potential compression errors
- Provides error feedback
- Ensures graceful failure handling

## Output Information
- Confirms successful archive creation
- Shows final archive path
- Reports compressed file size
- Uses readable size format (MB)

## Example Usage
Shows how to create a ZIP archive named 'SyntheticMass_Data_Hack_ArangoDB' from the processed data.

## Code

In [None]:
import shutil
import os
from pathlib import Path

def create_zip_archive(source_dir='final_merge', zip_name='final_merge_data'):
    """
    Create a zip archive of the final_merge folder
    """
    try:
        # Get full paths
        source_path = Path(source_dir)
        zip_path = Path(f"{zip_name}.zip")

        # Create zip archive
        print(f"Creating zip archive of {source_dir}...")
        shutil.make_archive(zip_name, 'zip', source_path)

        # Get zip file size
        zip_size = zip_path.stat().st_size / (1024 * 1024)  # Convert to MB

        print(f"\nZip archive created successfully: {zip_path}")
        print(f"Archive size: {zip_size:.2f} MB")

    except Exception as e:
        print(f"Error creating zip archive: {str(e)}")

# Create the zip archive
create_zip_archive(zip_name='SyntheticMass_Data_Hack_ArangoDB')

Creating zip archive of final_merge...

Zip archive created successfully: SyntheticMass_Data_Hack_ArangoDB.zip
Archive size: 2303.02 MB


# 6. Hugging Face Dataset Upload Code


## **Code Explanation**

## Function Purpose
This code uploads a processed dataset ZIP file to Hugging Face's dataset repository system.

## Component Setup
1. **Library Requirements**:
   - Imports HfApi for Hugging Face interaction
   - Uses getpass for secure token input

2. **Authentication**:
   - Prompts for the Hugging Face token with hidden input
   - Creates an `HfApi` client and passes the token to the upload call
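
For repeated runs, the prompt can be skipped when a token is already available in the environment. This is a minimal sketch, assuming the token is exported as an environment variable named `HF_TOKEN`; `resolve_token` is a hypothetical helper, and the prompt function is injectable so the example runs without user input.

```python
import os
from getpass import getpass

def resolve_token(env=None, prompt=getpass):
    """Return the token from HF_TOKEN if set, otherwise ask for it without echo."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    return prompt("Enter your Hugging Face token: ").strip()

# Demonstration with injected values so nothing is read from the real environment
from_env = resolve_token(env={"HF_TOKEN": "hf_example"})
from_prompt = resolve_token(env={}, prompt=lambda message: " typed-token ")
print(from_env, from_prompt)
```

In the notebook itself, calling `resolve_token()` with no arguments would fall back to the same hidden `getpass` prompt used below.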

## Upload Configuration
- **Source File**: Local ZIP archive 'SyntheticMass_Data_Hack_ArangoDB.zip'
- **Destination**: Specified repository path
- **Repository Details**:
  - ID: "ajitonelson/synthetic-mass-health-data"
  - Type: Dataset repository
  - File Path: Maintains original ZIP name

## Upload Process
1. Takes local ZIP file
2. Authenticates with provided token
3. Uploads to specified repository
4. Stores the file in the repository under its original name
5. Uses dataset-specific repository type

## Security Features
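- Uses `getpass` so the token is not echoed while typing
- Keeps the token out of the notebook source, since it is entered at run time
- Uploads to the `https://huggingface.co` endpoint over HTTPS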

## Code

In [None]:
from huggingface_hub import HfApi
from getpass import getpass

# Get token
hf_token = getpass('Enter your Hugging Face token: ')

# Initialize API
api = HfApi()

# Upload files to existing repository
api.upload_file(
    path_or_fileobj="/content/SyntheticMass_Data_Hack_ArangoDB.zip",
    path_in_repo="SyntheticMass_Data_Hack_ArangoDB.zip",
    repo_id="ajitonelson/synthetic-mass-health-data",
    repo_type="dataset",
    token=hf_token
)

Enter your Hugging Face token: ··········


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


CommitInfo(commit_url='https://huggingface.co/datasets/ajitonelson/synthetic-mass-health-data/commit/ae5eff0194827991afd72e9075fe18310fa7acfe', commit_message='Upload SyntheticMass_Data_Hack_ArangoDB.zip with huggingface_hub', commit_description='', oid='ae5eff0194827991afd72e9075fe18310fa7acfe', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/ajitonelson/synthetic-mass-health-data', endpoint='https://huggingface.co', repo_type='dataset', repo_id='ajitonelson/synthetic-mass-health-data'), pr_revision=None, pr_num=None)

---


# 📚 Access the Dataset

Visit Hugging Face to explore and download the Synthetic Mass Health Dataset:  
[ajitonelson/synthetic-mass-health-data](https://huggingface.co/datasets/ajitonelson/synthetic-mass-health-data)

---

## 📋 Dataset Overview
- Full synthetic patient health records
- Multiple CSV files with related health data
- Processed and merged for easy analysis
- Available as a compressed ZIP archive

---

## 🤝 Contributing
Feel free to open issues or submit pull requests to improve the dataset.

---

## 📄 License
This dataset is available under standard Synthea licensing terms.

---

## 🙏 Acknowledgments
- Synthea™ Project for synthetic data generation
- Hugging Face for dataset infrastructure

---

*Made with* ❤️ *in Timor-Leste*

*© 2025 All rights reserved*