# 📊 Exploratory Data Analysis - Microbiome

## Objective
Understand the structure of the microbiome data before building the API.

### Files we will explore:
1. `biorun-metadata.csv.gz` - Experiment metadata
2. `summary.phylum.csv.gz` - Phylum-level composition
3. `gtdb_taxonomy.tsv.gz` - Complete taxonomy
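
Before loading any of these files in full, it can help to look at just the first few rows. The helper below is a minimal sketch: the name `peek` and the rule of picking the delimiter from the file name are assumptions for illustration, and it assumes each file has a header row.

```python
import pandas as pd


def peek(path, n=5):
    """Read only the first n rows of a (possibly gzipped) CSV/TSV file."""
    # Assumption: paths containing ".tsv" are tab-separated,
    # everything else is treated as comma-separated.
    sep = "\t" if ".tsv" in str(path) else ","
    # pandas infers gzip compression from the ".gz" suffix by default.
    return pd.read_csv(path, sep=sep, nrows=n)
```

Calling `peek(path)` with the location of any of the three files returns a small DataFrame whose `.columns` and `.dtypes` show the layout without reading the whole file.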


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

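# Configure plotting defaults
plt.style.use('default')
sns.set_palette("husl")

# Silence all warnings to keep the output readable
# (note: this also hides pandas warnings such as DtypeWarning)
warnings.filterwarnings('ignore')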

# Show more columns and rows
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 100)

print("✅ Libraries loaded successfully")


## 🔍 Step 1: Load and Explore Metadata

The metadata describes each experiment (biorun), including its environment type.


In [None]:
# Load biorun metadata
print("📂 Loading metadata...")
metadata_path = '../../Microbe-vis-data/sandpiper1.0.0.condensed.biorun-metadata.csv.gz'
metadata = pd.read_csv(metadata_path)

print(f"📊 Data shape: {metadata.shape}")
print(f"📋 Columns: {list(metadata.columns[:10])}...")  # Only first 10
print(f"💾 Memory used: {metadata.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
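
Shape and memory use give only a coarse picture of the table. A per-column summary is one possible next step; the sketch below uses a hypothetical helper, `summarize_columns`, that is not part of the original notebook and relies only on standard pandas calls.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, percent missing, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(dropna=True),
    })
```

For example, `summarize_columns(metadata).sort_values("n_unique").head(20)` lists the columns with the fewest distinct values, which are likely candidates for categorical fields such as the environment type.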
