## Prerequisites

Before starting, ensure you have:

1. ✅ **Docker environment running**:
   ```bash
   docker-compose up -d
   ```

2. ✅ **Configuration file** at `~/.webhdfsmagic/config.json`:
   ```json
   {
     "knox_url": "http://localhost:8080/gateway/default",
     "webhdfs_api": "/webhdfs/v1",
     "username": "hdfs",
     "password": "password",
     "verify_ssl": false
   }
   ```

3. ✅ **webhdfsmagic installed**:
   ```bash
   pip install webhdfsmagic
   ```

## Step 1: Load Extension and Verify Configuration

First, we load the webhdfsmagic extension and verify our connection settings.

In [None]:
# Load the webhdfsmagic extension
%load_ext webhdfsmagic

In [None]:
# Display help to see all available commands
%hdfs help

In [None]:
# Verify configuration
import json
import os

config_path = os.path.expanduser('~/.webhdfsmagic/config.json')
with open(config_path) as f:
    config = json.load(f)
    
print("✓ Configuration loaded successfully!")
print(f"  Gateway URL: {config['knox_url']}")
print(f"  WebHDFS API: {config['webhdfs_api']}")
print(f"  Username: {config['username']}")
print(f"  SSL Verification: {config['verify_ssl']}")
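
A typo in a key name would surface above as a bare `KeyError`. As a minimal sketch, the check below reports which expected keys are missing and builds the full WebHDFS base URL. The helper names are hypothetical, and treating every key from the example config as required is an assumption, not something webhdfsmagic is known to enforce.

```python
# Keys taken from the example config in Prerequisites; treating all of them
# as required is an assumption made for this sketch.
REQUIRED_KEYS = ("knox_url", "webhdfs_api", "username", "password", "verify_ssl")


def missing_config_keys(config):
    """Return the expected keys that are absent from the config, in order."""
    return [key for key in REQUIRED_KEYS if key not in config]


def webhdfs_base_url(config):
    """Join the gateway URL and the API path without doubling the slash."""
    return config["knox_url"].rstrip("/") + "/" + config["webhdfs_api"].lstrip("/")


# Same values as the example config file.
example = {
    "knox_url": "http://localhost:8080/gateway/default",
    "webhdfs_api": "/webhdfs/v1",
    "username": "hdfs",
    "password": "password",
    "verify_ssl": False,
}

print(missing_config_keys(example))
print(webhdfs_base_url(example))
```

Passing the `config` dictionary loaded in the cell above instead of `example` runs the same check against your own file.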

## Step 2: Directory Operations

### User Story
*As a data engineer, I need to organize my data in HDFS by creating a logical directory structure for my project.*

Let's explore basic directory operations: listing directories and creating a nested structure.

In [None]:
# List root directory to see what's already there
print("📂 Root directory contents:")
%hdfs ls /

In [None]:
# Create a project directory
print("Creating /demo directory...")
%hdfs mkdir /demo

In [None]:
# Create nested directories for organizing data
print("Creating nested structure...")
%hdfs mkdir /demo/data
%hdfs mkdir /demo/results

In [None]:
# Verify our directory structure
print("📂 Project structure:")
%hdfs ls /demo
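
For orientation, the public WebHDFS REST API exposes listing and directory creation as the `LISTSTATUS` (HTTP GET) and `MKDIRS` (HTTP PUT) operations. The sketch below only builds the corresponding URLs against the gateway address from the example config. It sends no request, and how webhdfsmagic issues its calls internally is not shown in this notebook, so treat the `webhdfs_url` helper as hypothetical.

```python
from urllib.parse import quote, urlencode

# Gateway URL plus API path from the example config.
BASE = "http://localhost:8080/gateway/default/webhdfs/v1"

# HTTP method used by each WebHDFS operation shown here.
HTTP_METHOD = {"LISTSTATUS": "GET", "MKDIRS": "PUT"}


def webhdfs_url(path, op, **params):
    """Build the REST URL for a WebHDFS operation on an absolute HDFS path."""
    query = urlencode({"op": op, **params})
    return f"{BASE}{quote(path)}?{query}"


for op, path in [("LISTSTATUS", "/demo"), ("MKDIRS", "/demo/data")]:
    print(HTTP_METHOD[op], webhdfs_url(path, op))
```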

## Step 3: Uploading Files

### User Story
*As a data analyst, I have local CSV files that I need to upload to HDFS for distributed processing.*

Let's create a sample dataset and upload it to HDFS.

In [None]:
# Create a sample customer dataset
import pandas as pd

customers_df = pd.DataFrame({
    'customer_id': range(1, 21),
    'name': [f'Customer {i}' for i in range(1, 21)],
    'email': [f'customer{i}@example.com' for i in range(1, 21)],
    'total_purchases': [round(100.5 * i, 2) for i in range(1, 21)],
    'loyalty_tier': ['Gold' if i > 15 else 'Silver' if i > 10 else 'Bronze' for i in range(1, 21)]
})

# Save locally
customers_df.to_csv('customers.csv', index=False)

print("✓ Sample dataset created!")
print(f"  Records: {len(customers_df)}")
print("\nFirst 5 records:")
print(customers_df.head())

In [None]:
# Upload to HDFS
print("📤 Uploading customers.csv to HDFS...")
%hdfs put customers.csv /demo/data/customers.csv
print("✓ Upload complete!")

In [None]:
# Verify the file was uploaded
print("📂 Files in /demo/data:")
%hdfs ls /demo/data

## Step 4: Reading Files from HDFS

### User Story
*As a data scientist, I need to quickly preview HDFS files without downloading them to verify content and structure.*

The `cat` command allows you to read files directly from HDFS.

In [None]:
# Read the entire file
print("📄 Full file content:")
%hdfs cat /demo/data/customers.csv

In [None]:
# Preview just the first 5 lines (header + 4 records)
print("👀 Quick preview (first 5 lines):")
%hdfs cat -n 5 /demo/data/customers.csv
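
Conceptually, a line-limited preview stops reading once it has enough lines. The sketch below illustrates that idea on an in-memory stream with `itertools.islice`. It is not webhdfsmagic's implementation, and the `head` helper is a hypothetical name.

```python
import io
from itertools import islice


def head(stream, n):
    """Return up to n lines from a text stream, reading no further than needed."""
    return [line.rstrip("\n") for line in islice(stream, n)]


# Stand-in for a remote file: a header line followed by three records.
sample = io.StringIO("customer_id,name\n1,Customer 1\n2,Customer 2\n3,Customer 3\n")
print(head(sample, 2))
```

Because `islice` pulls lines lazily, the rest of the stream is never consumed, which is what makes previews cheap on large files.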

## Step 5: Downloading Files

### User Story
*As a business analyst, I need to download processed data from HDFS to create reports in Excel.*

Let's download our file and work with it locally.

In [None]:
# Download file from HDFS
print("📥 Downloading from HDFS...")
%hdfs get /demo/data/customers.csv ./downloaded_customers.csv
print("✓ Download complete!")

In [None]:
# Verify downloaded file
df_downloaded = pd.read_csv('downloaded_customers.csv')

print("✓ File downloaded successfully!")
print(f"  Records: {len(df_downloaded)}")
print("\nData summary:")
print(df_downloaded.describe())
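
Matching record counts do not prove the bytes survived the round trip. One possible stricter check, sketched here with only the standard library, compares SHA-256 digests of the original and the downloaded file. The helper names are hypothetical, and the demonstration uses throwaway files so that it runs anywhere.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files contain identical bytes."""
    return sha256_of(path_a) == sha256_of(path_b)


# Self-contained demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.csv")
    copy = os.path.join(tmp, "copy.csv")
    altered = os.path.join(tmp, "altered.csv")
    contents = [
        (original, b"id,name\n1,A\n"),
        (copy, b"id,name\n1,A\n"),
        (altered, b"id,name\n1,B\n"),
    ]
    for path, data in contents:
        with open(path, "wb") as f:
            f.write(data)
    identical = same_content(original, copy)
    different = same_content(original, altered)

print(identical, different)
```

In this notebook the equivalent call would be `same_content('customers.csv', 'downloaded_customers.csv')`.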

## Step 6: Batch Operations with Wildcards

### User Story
*As a data engineer processing daily sales data, I receive multiple files that need to be uploaded to HDFS efficiently.*

webhdfsmagic supports wildcards for batch operations, making it easy to handle multiple files.

In [None]:
# Generate multiple daily sales files
from datetime import datetime, timedelta

print("📊 Generating daily sales data...\n")

for i in range(3):
    date = datetime.now() - timedelta(days=i)
    date_str = date.strftime('%Y%m%d')
    
    # Generate sales data
    sales_df = pd.DataFrame({
        'date': [date.strftime('%Y-%m-%d')] * 15,
        'product_id': [f'PROD{j:03d}' for j in range(1, 16)],
        'quantity': [10 + i*5 + j for j in range(15)],
        'unit_price': [50.0 + j*10 for j in range(15)],
        'total': [(50.0 + j*10) * (10 + i*5 + j) for j in range(15)]
    })
    
    filename = f'sales_{date_str}.csv'
    sales_df.to_csv(filename, index=False)
    
    print(f"  ✓ {filename}: {len(sales_df)} transactions, ${sales_df['total'].sum():,.2f}")

print("\n✓ All sales files generated!")

In [None]:
# Create sales directory
%hdfs mkdir /demo/sales

In [None]:
# Upload all sales files at once using wildcards
print("📤 Uploading all sales_*.csv files...")
%hdfs put sales_*.csv /demo/sales/
print("✓ Batch upload complete!")

In [None]:
# Verify all files were uploaded
print("📂 Files in /demo/sales:")
%hdfs ls /demo/sales
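
To preview which files a pattern would pick up before uploading, the sketch below expands a wildcard against a list of names and pairs each match with its HDFS destination. How webhdfsmagic expands wildcards internally is not documented here, so this is an illustration of the concept only. The `plan_uploads` helper and the dated file names are made up.

```python
import fnmatch
import posixpath


def plan_uploads(local_names, pattern, hdfs_dir):
    """Pair every local name matching the pattern with its HDFS target path."""
    matches = sorted(fnmatch.filter(local_names, pattern))
    # HDFS paths always use forward slashes, hence posixpath rather than os.path.
    return [(name, posixpath.join(hdfs_dir, name)) for name in matches]


# Hypothetical directory listing; only the sales files should match.
names = ["customers.csv", "sales_20240102.csv", "sales_20240101.csv", "notes.txt"]
for local, remote in plan_uploads(names, "sales_*.csv", "/demo/sales/"):
    print(local, "->", remote)
```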

## Step 7: Data Validation Workflow

### User Story
*As a data quality analyst, I need to verify that uploaded files are complete and readable before proceeding with processing.*

In [None]:
# Quick validation: preview each sales file
import glob

print("🔍 Validating uploaded sales files...\n")

for local_file in sorted(glob.glob('sales_*.csv')):
    hdfs_file = f"/demo/sales/{local_file}"
    print(f"File: {local_file}")
    print("Preview (first 3 lines):")
    result = %hdfs cat -n 3 {hdfs_file}
    print(result)
    print("-" * 60)
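
Eyeballing previews works for three files but scales poorly. As a rough sketch, the check below parses a preview and reports a wrong header or rows with the wrong number of fields. It assumes the preview is available as plain CSV text, which may not match what `%hdfs cat` actually returns. The `check_preview` helper is hypothetical, and the column list mirrors the sales files generated in Step 6.

```python
import csv
import io

# Column names of the generated sales files.
EXPECTED_COLUMNS = ["date", "product_id", "quantity", "unit_price", "total"]


def check_preview(text, expected_columns):
    """Return a list of problems found in a CSV preview; empty means none found."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["preview is empty"]
    problems = []
    if rows[0] != expected_columns:
        problems.append(f"unexpected header: {rows[0]}")
    # Data rows start on line 2 of the file, after the header.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_columns):
            problems.append(
                f"line {number}: expected {len(expected_columns)} fields, got {len(row)}"
            )
    return problems


preview = "date,product_id,quantity,unit_price,total\n2024-01-01,PROD001,10,50.0,500.0\n"
print(check_preview(preview, EXPECTED_COLUMNS))
```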

## Step 8: Cleanup Operations

### User Story
*As a storage administrator, I need to remove obsolete files and directories to free up space.*

Let's clean up our demo data.

In [None]:
# Delete a single file
print("🗑️ Deleting single file...")
%hdfs rm /demo/data/customers.csv
print("✓ File deleted")

In [None]:
# Delete entire directory recursively
print("🗑️ Deleting /demo/sales directory (recursive)...")
%hdfs rm -r /demo/sales
print("✓ Directory deleted")

In [None]:
# Verify cleanup
print("📂 Remaining contents in /demo:")
%hdfs ls /demo

In [None]:
# Final cleanup: remove demo directory
print("🗑️ Final cleanup...")
%hdfs rm -r /demo
print("✓ All demo data cleaned up!")
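
Recursive deletes are unforgiving, so scripted cleanups may benefit from a guard that refuses paths outside a known sandbox. The sketch below is one possible check. It is not a webhdfsmagic feature, and both the `is_safe_to_delete` helper and the `/demo` sandbox root are assumptions made for illustration.

```python
import posixpath


def is_safe_to_delete(path, allowed_root="/demo"):
    """True only if the normalized path is the sandbox root or lies beneath it."""
    normalized = posixpath.normpath(path)
    return normalized == allowed_root or normalized.startswith(allowed_root + "/")


for candidate in ["/demo/sales/", "/demo", "/", "/demo/../etc", "/demolition"]:
    print(candidate, is_safe_to_delete(candidate))
```

Normalizing first means that `..` segments and trailing slashes cannot sneak a path past the prefix check.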

## 🎉 Summary & Key Takeaways

### What We Accomplished

In this demo, we successfully:

1. ✅ **Configured** webhdfsmagic to connect to HDFS via Knox Gateway
2. ✅ **Created** organized directory structures
3. ✅ **Uploaded** single files and batch files with wildcards
4. ✅ **Read** files directly from HDFS with preview options
5. ✅ **Downloaded** files for local analysis
6. ✅ **Validated** data quality through quick previews
7. ✅ **Cleaned up** obsolete data efficiently

### Commands Demonstrated

| Command | Purpose | Example |
|---------|---------|--------|
| `%hdfs ls <path>` | List directory contents | `%hdfs ls /demo` |
| `%hdfs mkdir <path>` | Create directory | `%hdfs mkdir /demo/data` |
| `%hdfs put <local> <hdfs>` | Upload file(s) | `%hdfs put *.csv /demo/` |
| `%hdfs get <hdfs> <local>` | Download file(s) | `%hdfs get /demo/file.csv .` |
| `%hdfs cat <path>` | Read file content | `%hdfs cat /demo/data.csv` |
| `%hdfs cat -n N <path>` | Read first N lines | `%hdfs cat -n 10 /demo/data.csv` |
| `%hdfs rm <path>` | Delete file | `%hdfs rm /demo/old.csv` |
| `%hdfs rm -r <path>` | Delete directory | `%hdfs rm -r /demo/old/` |

### Advantages Over Traditional Methods

1. **93% Less Code**: No verbose client initialization
2. **Intuitive Syntax**: Magic commands feel natural in notebooks
3. **Streaming Support**: Efficient handling of large files
4. **Wildcard Support**: Batch operations made simple
5. **Knox Gateway Ready**: Enterprise security built-in
6. **Better Debugging**: Clear error messages and feedback

### Useful Resources

- **HDFS NameNode UI**: http://localhost:9870
- **WebHDFS Gateway**: http://localhost:8080/gateway/default/webhdfs/v1/
- **PyPI Package**: https://pypi.org/project/webhdfsmagic/
- **GitHub Repository**: https://github.com/ab2dridi/webhdfsmagic

### Next Steps

Now that you've mastered the basics, try:
- Integrating webhdfsmagic into your data pipelines
- Processing large datasets with pandas + HDFS
- Automating file uploads/downloads in workflows
- Combining with Spark for distributed processing

### Stop the Demo Environment

When done, stop the Docker containers:

```bash
# Stop but keep data
docker-compose stop

# Stop and remove everything
docker-compose down -v
```

---

**Thank you for trying webhdfsmagic!** 🚀

Questions or feedback? Open an issue on [GitHub](https://github.com/ab2dridi/webhdfsmagic/issues)!