# Local and Cloud Storage with fsspec (s3fs/gcsfs/adlfs), Streaming I/O

In medical data integration, data often lives in several places: local file systems and cloud object stores such as AWS S3, Google Cloud Storage, and Azure Data Lake Storage. The `fsspec` library provides a single file-system interface across these backends, so the same code can read and write large medical datasets wherever they are stored.

## Installing Required Libraries

First, let's install the necessary libraries for working with different storage systems.

In [None]:
!pip install fsspec s3fs gcsfs adlfs pandas pyarrow

## Understanding fsspec

Let's import the required libraries and explore the basic functionality of fsspec.

In [None]:
import fsspec
import pandas as pd
import json
import os
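As a quick orientation, the sketch below lists some of the protocol names fsspec recognizes and shows that the same `fsspec.filesystem(...)` call returns a different backend class depending on the name. A protocol appearing in the list does not guarantee that its backing package (for example `s3fs`, `gcsfs`, or `adlfs`) is installed; only the built-in `file` and `memory` backends are instantiated here.

```python
import fsspec

# Protocol names fsspec knows about (built-in plus registered plugins).
protocols = fsspec.available_protocols()
print(sorted(protocols)[:10])

# The same constructor works for every backend; only the name changes.
backend_names = {}
for name in ("file", "memory"):
    backend = fsspec.filesystem(name)
    backend_names[name] = type(backend).__name__
    print(name, "->", backend_names[name])
```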

## Working with Local File System

Let's start by creating a sample medical dataset and saving it locally to demonstrate basic file operations.

In [None]:
# Create a sample medical dataset
medical_data = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'age': [45, 32, 67, 29, 55],
    'blood_pressure_systolic': [120, 135, 145, 118, 130],
    'blood_pressure_diastolic': [80, 85, 90, 75, 82],
    'glucose_level': [95, 110, 125, 88, 105]
})

medical_data.head()

Now let's save this data to a local file using fsspec's file system interface.

In [None]:
# Create a local file system instance
fs = fsspec.filesystem('file')

# Save the data
with fs.open('medical_data.csv', 'w') as f:
    medical_data.to_csv(f, index=False)

print("Data saved to medical_data.csv")

Let's read the data back using fsspec to verify it was saved correctly.

In [None]:
# Read the data back
with fs.open('medical_data.csv', 'r') as f:
    df_loaded = pd.read_csv(f)

df_loaded.head()
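Besides `open`, a file system object offers the usual metadata and housekeeping calls. The following is a minimal sketch using a temporary scratch directory and a made-up file name (`vitals.csv`), both of which are illustrative assumptions rather than part of the dataset above.

```python
import os
import tempfile

import fsspec

fs = fsspec.filesystem("file")

# Hypothetical scratch location so the example does not touch real data.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "vitals.csv")

with fs.open(path, "w") as f:
    f.write("patient_id,age\nP001,45\n")

exists_before = fs.exists(path)          # True once the file is written
size = fs.size(path)                     # size in bytes
listing = fs.ls(workdir, detail=False)   # paths inside the directory
print(exists_before, size, listing)

# Clean up the scratch file.
fs.rm(path)
exists_after = fs.exists(path)
```

The same method names (`exists`, `size`, `ls`, `rm`) are available on the cloud backends, which is what makes code written against the local file system portable.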

## Using fsspec with URLs

`fsspec.open` infers the backend from the URL's protocol prefix (`https://`, `s3://`, `gcs://`, and so on), so no file system object has to be created explicitly. Let's demonstrate this by reading data from a public URL.

In [None]:
# Read a CSV file directly from a URL
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/healthexp.csv'

with fsspec.open(url, 'r') as f:
    health_exp_data = pd.read_csv(f)

health_exp_data.head()
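To see which backend a given URL resolves to without opening anything, `fsspec.core.url_to_fs` returns the file system instance together with the path inside it. The sketch below uses two illustrative paths (a `memory://` URL and a bare relative file name); neither needs network access.

```python
from fsspec.core import url_to_fs

# Illustrative URLs: one with an explicit protocol, one plain local path.
candidates = ["memory://cloud/patients.csv", "patients.csv"]

results = []
for url in candidates:
    resolved_fs, resolved_path = url_to_fs(url)
    results.append((type(resolved_fs).__name__, resolved_path))
    print(f"{url} -> {type(resolved_fs).__name__}, path={resolved_path}")
```

A path with no protocol prefix falls back to the local file system.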

## Working with AWS S3

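Now let's explore how to work with AWS S3 storage. The example below uses anonymous access, which works only for publicly readable buckets; private buckets require AWS credentials (for example, from environment variables or `~/.aws/credentials`). If the bucket is unavailable or not publicly readable, the `except` branch reports the error.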

In [None]:
# Create an S3 file system instance
# anon=True skips credential lookup and works only for public buckets;
# omit it to use your configured AWS credentials for private buckets
s3_fs = fsspec.filesystem('s3', anon=True)

# Example: List files in a public S3 bucket
try:
    files = s3_fs.ls('s3://nyc-tlc/trip data/')
    print(f"Found {len(files)} files")
    print("First 3 files:", files[:3])
except Exception as e:
    print(f"Error accessing S3: {e}")

## Creating a Mock Cloud Storage Example

Since actual cloud storage requires credentials, let's use fsspec's in-memory file system as a stand-in. It exposes the same interface as the cloud backends, so the access pattern shown here carries over unchanged.

In [None]:
# Create a memory file system to simulate cloud storage
memory_fs = fsspec.filesystem('memory')

# Save medical data to "cloud"
with memory_fs.open('cloud/medical/patients.json', 'w') as f:
    medical_data.to_json(f, orient='records')

print("Data saved to mock cloud storage")

Let's read the data back from our mock cloud storage.

In [None]:
# Read from "cloud"
with memory_fs.open('cloud/medical/patients.json', 'r') as f:
    cloud_data = pd.read_json(f)

cloud_data.head()

## Streaming I/O for Large Medical Files

When working with large medical imaging files or genomic data, streaming I/O becomes crucial. Let's demonstrate how to read data in chunks.

In [None]:
# Create a larger dataset to simulate streaming
import numpy as np

np.random.seed(42)  # Fixed seed so the generated data is reproducible

large_medical_data = pd.DataFrame({
    'patient_id': [f'P{i:04d}' for i in range(10000)],
    'measurement_1': np.random.normal(100, 15, 10000),
    'measurement_2': np.random.normal(75, 10, 10000),
    'measurement_3': np.random.normal(120, 20, 10000)
})

# Save to CSV
large_medical_data.to_csv('large_medical_data.csv', index=False)
print(f"Created dataset with {len(large_medical_data)} records")

Now let's read this large file in chunks using streaming I/O to process data efficiently without loading everything into memory.

In [None]:
# Stream data in chunks
chunk_size = 1000
total_rows = 0
n_chunks = 0
running_sum = 0.0

with fs.open('large_medical_data.csv', 'r') as f:
    for chunk in pd.read_csv(f, chunksize=chunk_size):
        n_chunks += 1
        total_rows += len(chunk)
        # Accumulate the sum rather than per-chunk means, so the result
        # stays correct even when the last chunk is smaller than the rest
        running_sum += chunk['measurement_1'].sum()

print(f"Processed {total_rows} rows in {n_chunks} chunks")
print(f"Overall mean of measurement_1: {running_sum / total_rows:.2f}")
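Chunked CSV parsing covers tabular data. For binary payloads such as imaging files, the same idea applies at the byte level: read a fixed-size block, write it out, and repeat. The sketch below copies a synthetic 1 MiB blob between two locations in the in-memory file system; the paths, the `stream_copy` helper, and the 64 KiB block size are illustrative assumptions.

```python
import fsspec

# Synthetic 1 MiB payload standing in for a large binary file.
payload = bytes(range(256)) * 4096
with fsspec.open("memory://scans/image.bin", "wb") as f:
    f.write(payload)


def stream_copy(src_url, dst_url, block_size=64 * 1024):
    """Copy src_url to dst_url one block at a time; return bytes copied."""
    copied = 0
    with fsspec.open(src_url, "rb") as src, fsspec.open(dst_url, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # empty bytes object signals end of file
                break
            dst.write(block)
            copied += len(block)
    return copied


bytes_copied = stream_copy("memory://scans/image.bin",
                           "memory://archive/image.bin")
print(f"Copied {bytes_copied} bytes")
```

Because both ends are addressed by URL, the same helper could in principle copy between a local path and a cloud bucket, with at most one block held in the loop at any time.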

## Working with Multiple Storage Systems

In medical data integration, you often need to combine data from multiple sources. Because every fsspec file system exposes the same `open` interface, the reading code is identical regardless of where each dataset lives.

In [None]:
# Create data in different storage systems
local_fs = fsspec.filesystem('file')
memory_fs = fsspec.filesystem('memory')

# Patient demographics in local storage
demographics = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003'],
    'age': [45, 32, 67],
    'gender': ['M', 'F', 'M']
})

with local_fs.open('demographics.csv', 'w') as f:
    demographics.to_csv(f, index=False)

In [None]:
# Lab results in "cloud" storage
lab_results = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003'],
    'test_date': ['2024-01-15', '2024-01-16', '2024-01-15'],
    'hemoglobin': [14.5, 13.2, 15.1]
})

with memory_fs.open('cloud/lab_results.csv', 'w') as f:
    lab_results.to_csv(f, index=False)

Now let's read from both storage systems and merge the data.

In [None]:
# Read from local storage
with local_fs.open('demographics.csv', 'r') as f:
    demo_df = pd.read_csv(f)

# Read from "cloud" storage
with memory_fs.open('cloud/lab_results.csv', 'r') as f:
    lab_df = pd.read_csv(f)

# Merge the data
integrated_data = pd.merge(demo_df, lab_df, on='patient_id')
integrated_data
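The two reads above differ only in which file system object is used, so they can be folded into one URL-driven helper. The following sketch assumes a hypothetical `read_csv_from` function and made-up file locations; `validate="one_to_one"` asks pandas to raise an error if a `patient_id` appears more than once on either side of the merge.

```python
import os
import tempfile

import fsspec
import pandas as pd


def read_csv_from(url, **storage_options):
    """Read a CSV from any fsspec-supported URL or local path."""
    with fsspec.open(url, "r", **storage_options) as f:
        return pd.read_csv(f)


# Demographics in a local scratch file (illustrative location).
local_path = os.path.join(tempfile.mkdtemp(), "demographics.csv")
pd.DataFrame({"patient_id": ["P001", "P002"],
              "age": [45, 32]}).to_csv(local_path, index=False)

# Lab results in the in-memory "cloud".
with fsspec.open("memory://cloud/labs.csv", "w") as f:
    pd.DataFrame({"patient_id": ["P001", "P002"],
                  "hemoglobin": [14.5, 13.2]}).to_csv(f, index=False)

demo, labs = (read_csv_from(u) for u in (local_path, "memory://cloud/labs.csv"))
merged = demo.merge(labs, on="patient_id", validate="one_to_one")
print(merged)
```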

## Using fsspec with Context Managers

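`fsspec.open` returns an `OpenFile` object rather than an open handle. The underlying file is opened when the `with` block is entered and closed when it exits, which ensures files are properly closed (and remote writes flushed) even if an error occurs.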

In [None]:
# Using fsspec.open directly with any URL or path
medical_notes = {
    'patient_id': 'P001',
    'notes': 'Patient presents with mild hypertension. Recommended lifestyle changes.'
}

# Write JSON data
with fsspec.open('medical_notes.json', 'w') as f:
    json.dump(medical_notes, f)

# Read JSON data
with fsspec.open('medical_notes.json', 'r') as f:
    loaded_notes = json.load(f)

print(loaded_notes)
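`fsspec.open` also accepts a `compression` argument, which is convenient for text-heavy records. The sketch below writes and reads a gzip-compressed JSON file in the in-memory file system; the path and the note text are illustrative.

```python
import json

import fsspec

notes = {"patient_id": "P001", "notes": "Mild hypertension."}
url = "memory://notes/P001.json.gz"  # hypothetical location

# Compression is applied transparently on write...
with fsspec.open(url, "wt", compression="gzip") as f:
    json.dump(notes, f)

# ...and undone transparently on read.
with fsspec.open(url, "rt", compression="gzip") as f:
    restored = json.load(f)

# The stored bytes are a real gzip stream (magic number 0x1f 0x8b).
raw = fsspec.filesystem("memory").cat_file("/notes/P001.json.gz")
print(restored == notes, raw[:2] == b"\x1f\x8b")
```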

## Caching Remote Files

When working with remote medical data, caching can significantly improve performance. fsspec's `filecache` protocol downloads the whole file to a local directory on first access and serves later reads from that local copy.

In [None]:
# Create a cached file system
cached_fs = fsspec.filesystem('filecache', 
                             target_protocol='https',
                             cache_storage='./cache')

# This will cache the file locally on first access
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/healthexp.csv'

with cached_fs.open(url, 'r') as f:
    cached_data = pd.read_csv(f)

print("Data cached and loaded")
print(f"Cache directory created: {os.path.exists('./cache')}")
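Caching can also be requested inline by chaining protocols in the URL, with per-protocol options passed as keyword arguments. The sketch below uses the in-memory file system as a stand-in for a remote store so that it runs without network access; the file path and the temporary cache directory are assumptions made for illustration.

```python
import os
import tempfile

import fsspec

# Put a small file in the in-memory store to play the role of remote data.
with fsspec.open("memory://remote/labs.csv", "w") as f:
    f.write("patient_id,hemoglobin\nP001,14.5\n")

cache_dir = tempfile.mkdtemp()  # hypothetical cache location

# "filecache::" wraps the target protocol; its options go under `filecache=`.
with fsspec.open("filecache::memory://remote/labs.csv", "r",
                 filecache={"cache_storage": cache_dir}) as f:
    text = f.read()

print(text)
print("Files in cache directory:", os.listdir(cache_dir))
```

For a real remote source, the target part of the URL would be replaced (for example with an `s3://` or `https://` address), and that backend's options would be passed under its own keyword.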

## Exercise

Create a medical data integration pipeline that:

1. Creates three different datasets:
   - Patient vital signs (store locally)
   - Laboratory results (store in memory to simulate cloud)
   - Medication history (store as JSON locally)

2. Implements a function that:
   - Reads all three datasets using appropriate fsspec file systems
   - Merges them based on patient_id
   - Calculates risk scores based on the combined data
   - Saves the final integrated dataset with risk scores

3. Demonstrates streaming processing by:
   - Creating a large dataset (>5000 records)
   - Processing it in chunks to calculate statistics
   - Identifying high-risk patients without loading the entire dataset into memory

Your solution should showcase the flexibility of fsspec in handling different storage backends and efficient data processing techniques suitable for large medical datasets.