## Prerequisites

Before starting, ensure you have:

1. ✅ **Docker environment running**:
   ```bash
   docker-compose up -d
   ```

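2. ✅ **Configuration file** at `~/.webhdfsmagic/config.json` (example values; the outputs recorded in this walkthrough come from a session whose `username` was `testuser`):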
   ```json
   {
     "knox_url": "http://localhost:8080/gateway/default",
     "webhdfs_api": "/webhdfs/v1",
     "username": "hdfs",
     "password": "password",
     "verify_ssl": false
   }
   ```

3. ✅ **webhdfsmagic installed**:
   ```bash
   pip install webhdfsmagic
   ```
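
If the configuration file does not exist yet, it can be generated from Python. The following is a minimal sketch, not part of webhdfsmagic: `write_config` is a hypothetical helper, and the values simply mirror the example above.

```python
import json
from pathlib import Path


def write_config(config: dict, path) -> Path:
    """Write a config dict to `path` as JSON (hypothetical helper)."""
    path = Path(path).expanduser()
    # Create the parent directory (e.g. ~/.webhdfsmagic) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path


# Values copied from the example configuration above.
example_config = {
    "knox_url": "http://localhost:8080/gateway/default",
    "webhdfs_api": "/webhdfs/v1",
    "username": "hdfs",
    "password": "password",
    "verify_ssl": False,
}

# Usage: write_config(example_config, "~/.webhdfsmagic/config.json")
```

Because the file holds a plain-text password, it is worth keeping it out of version control.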

## Step 1: Load Extension and Verify Configuration

First, we load the webhdfsmagic extension and verify our connection settings.

In [1]:
# Load the webhdfsmagic extension
%load_ext webhdfsmagic

The webhdfsmagic extension is already loaded. To reload it, use:
  %reload_ext webhdfsmagic


In [2]:
# Display help to see all available commands
%hdfs help

| Command | Description |
|---------|-------------|
| `%hdfs help` | Display this help |
| `%hdfs setconfig {"knox_url": "...", "webhdfs_api": "...", "username": "...", "password": "...", "verify_ssl": false}` | Set configuration and credentials directly in the notebook |
| `%hdfs ls [path]` | List files on HDFS |
| `%hdfs mkdir <path>` | Create a directory on HDFS |
| `%hdfs rm <path or pattern> [-r]` | Delete a file/directory. Supports wildcards. Example: `%hdfs rm /user/files* [-r]` |
| `%hdfs put <local_file_or_pattern> <hdfs_destination>` | Upload one or more local files (wildcards allowed) to HDFS. If the HDFS path ends with '/' or '.', the original file name is preserved. |
| `%hdfs get <hdfs_file_or_pattern> <local_destination>` | Download one or more files from HDFS. If the local destination is a directory (or "."/~), the original file name is appended. |
| `%hdfs cat <file> [-n <number_of_lines>]` | Display file content. Default is 100 lines. Use "-n -1" to display the full file. |
| `%hdfs chmod [-R] <permission> <path>` | Set permissions (SETPERMISSION). The "-R" option applies recursively. |
| `%hdfs chown [-R] <user:group> <path>` | Set owner and group (SETOWNER). The "-R" option applies recursively. |


In [3]:
# Verify configuration
import json
import os

config_path = os.path.expanduser('~/.webhdfsmagic/config.json')
with open(config_path) as f:
    config = json.load(f)

print("✓ Configuration loaded successfully!")
print(f"  Gateway URL: {config['knox_url']}")
print(f"  WebHDFS API: {config['webhdfs_api']}")
print(f"  Username: {config['username']}")
print(f"  SSL Verification: {config['verify_ssl']}")

✓ Configuration loaded successfully!
  Gateway URL: http://localhost:8080/gateway/default
  WebHDFS API: /webhdfs/v1
  Username: testuser
  SSL Verification: False
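
Printing the values confirms that the file parses, but not that it is complete. The sketch below checks for the five keys used in the example configuration and derives the full WebHDFS endpoint. Both helpers are hypothetical; whether webhdfsmagic requires every key, and how it joins the two URL parts internally, are assumptions.

```python
# Keys taken from the example config in the prerequisites (assumed to be required).
EXPECTED_KEYS = ("knox_url", "webhdfs_api", "username", "password", "verify_ssl")


def missing_keys(config: dict) -> list:
    """Return the expected keys that are absent from `config`."""
    return [key for key in EXPECTED_KEYS if key not in config]


def webhdfs_base_url(config: dict) -> str:
    """Join the gateway URL and the API prefix without doubling the slash."""
    return config["knox_url"].rstrip("/") + "/" + config["webhdfs_api"].strip("/")


# Usage in the notebook, after `config` has been loaded:
# assert not missing_keys(config), missing_keys(config)
# print(webhdfs_base_url(config))
```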


## Step 2: Directory Operations

### User Story
*As a data engineer, I need to organize my data in HDFS by creating a logical directory structure for my project.*

Let's explore basic directory operations: listing existing directories and creating new ones.

In [4]:
# List root directory to see what's already there
print("📂 Root directory contents:")
%hdfs ls /

📂 Root directory contents:


|   | name | type | size | owner | group | permissions | block_size | modified | replication |
|---|------|------|------|-------|-------|-------------|------------|----------|-------------|
| 0 | data | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 12:10:49.489 | 0 |
| 1 | demo | DIR | 0 | root | supergroup | rwxr-xr-x | 0 | 2025-12-04 13:30:32.910 | 0 |
| 2 | test_mkdir_direct | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 10:57:06.101 | 0 |
| 3 | test_via_magic | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 10:57:16.778 | 0 |
| 4 | test_webhdfs | DIR | 0 | root | supergroup | rwxr-xr-x | 0 | 2025-12-04 10:49:59.125 | 0 |
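
For readers curious where these columns come from: the WebHDFS REST API answers a `LISTSTATUS` request with a JSON document containing one `FileStatus` entry per item. The sketch below converts such a payload into rows resembling the listing above. The JSON field names come from the Hadoop WebHDFS API; the mapping onto webhdfsmagic's column names is an assumption based on the output shown, not the package's actual code.

```python
from datetime import datetime, timezone


def octal_to_symbolic(octal: str) -> str:
    """Convert an octal permission string such as '755' to 'rwxr-xr-x'."""
    parts = []
    for digit in octal[-3:]:
        value = int(digit, 8)
        # Bit 4 is read, bit 2 is write, bit 1 is execute.
        parts.append("".join(
            flag if value & (4 >> i) else "-" for i, flag in enumerate("rwx")
        ))
    return "".join(parts)


def liststatus_rows(payload: dict) -> list:
    """Flatten a LISTSTATUS response into one dict per file or directory."""
    rows = []
    for status in payload["FileStatuses"]["FileStatus"]:
        rows.append({
            "name": status["pathSuffix"],
            "type": "DIR" if status["type"] == "DIRECTORY" else status["type"],
            "size": status["length"],
            "owner": status["owner"],
            "group": status["group"],
            "permissions": octal_to_symbolic(status["permission"]),
            # modificationTime is expressed in milliseconds since the epoch.
            "modified": datetime.fromtimestamp(
                status["modificationTime"] / 1000, tz=timezone.utc
            ),
        })
    return rows
```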


In [5]:
# Create a project directory
print("Creating /demo directory...")
%hdfs mkdir /demo

Creating /demo directory...


{'boolean': True}

In [6]:
# Create nested directories for organizing data
print("Creating nested structure...")
%hdfs mkdir /demo/data
%hdfs mkdir /demo/results

Creating nested structure...


{'boolean': True}

In [7]:
# Verify our directory structure
print("📂 Project structure:")
%hdfs ls /demo

📂 Project structure:


|   | name | type | size | owner | group | permissions | block_size | modified | replication |
|---|------|------|------|-------|-------|-------------|------------|----------|-------------|
| 0 | data | DIR | 0 | root | supergroup | rwxr-xr-x | 0 | 2025-12-04 12:56:01.974 | 0 |
| 1 | results | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 13:34:28.368 | 0 |
| 2 | sales | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 12:56:02.484 | 0 |
| 3 | test123 | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 10:56:37.099 | 0 |
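
The `{'boolean': True}` results above match the shape of a WebHDFS `MKDIRS` response. As a rough illustration of the request behind a `mkdir`, the sketch below builds the URL and interprets the reply. `webhdfs_url` and `mkdir_succeeded` are hypothetical helpers, and it is an assumption that webhdfsmagic issues the request in exactly this form.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(base_url: str, hdfs_path: str, op: str, **params) -> str:
    """Build a WebHDFS URL: <base>/<path>?op=<OP>&<extra params>."""
    query = urlencode({"op": op, **params})
    return f"{base_url.rstrip('/')}/{quote(hdfs_path.lstrip('/'))}?{query}"


def mkdir_succeeded(response_json: dict) -> bool:
    """MKDIRS replies with {"boolean": true} when the directory exists afterwards."""
    return response_json.get("boolean") is True


base = "http://localhost:8080/gateway/default/webhdfs/v1"
# MKDIRS is sent as an HTTP PUT to this URL.
print(webhdfs_url(base, "/demo/data", "MKDIRS"))
```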


## Step 3: Uploading Files

### User Story
*As a data analyst, I have local CSV files that I need to upload to HDFS for distributed processing.*

Let's create a sample dataset and upload it to HDFS.

In [8]:
# Create a sample customer dataset
import pandas as pd

customers_df = pd.DataFrame({
    'customer_id': range(1, 21),
    'name': [f'Customer {i}' for i in range(1, 21)],
    'email': [f'customer{i}@example.com' for i in range(1, 21)],
    'total_purchases': [round(100.5 * i, 2) for i in range(1, 21)],
    'loyalty_tier': ['Gold' if i > 15 else 'Silver' if i > 10 else 'Bronze' for i in range(1, 21)]
})

# Save locally
customers_df.to_csv('customers.csv', index=False)

print("✓ Sample dataset created!")
print(f"  Records: {len(customers_df)}")
print("\nFirst 5 records:")
print(customers_df.head())

✓ Sample dataset created!
  Records: 20

First 5 records:
   customer_id        name                  email  total_purchases  \
0            1  Customer 1  customer1@example.com            100.5   
1            2  Customer 2  customer2@example.com            201.0   
2            3  Customer 3  customer3@example.com            301.5   
3            4  Customer 4  customer4@example.com            402.0   
4            5  Customer 5  customer5@example.com            502.5   

  loyalty_tier  
0       Bronze  
1       Bronze  
2       Bronze  
3       Bronze  
4       Bronze  


In [9]:
# Upload to HDFS
print("📤 Uploading customers.csv to HDFS...")
%hdfs put customers.csv /demo/data/customers.csv
print("✓ Upload complete!")

📤 Uploading customers.csv to HDFS...
✓ Upload complete!


In [10]:
# Verify the file was uploaded
print("📂 Files in /demo/data:")
%hdfs ls /demo/data

📂 Files in /demo/data:


|   | name | type | size | owner | group | permissions | block_size | modified | replication |
|---|------|------|------|-------|-------|-------------|------------|----------|-------------|
| 0 | 2024 | DIR | 0 | root | supergroup | rwxr-xr-x | 0 | 2025-12-04 10:56:15.400 | 0 |
| 1 | clients.csv | FILE | 178 | testuser | supergroup | rw-r--r-- | 134217728 | 2025-12-04 12:56:02.030 | 3 |
| 2 | customers.csv | FILE | 1046 | testuser | supergroup | rw-r--r-- | 134217728 | 2025-12-04 13:34:28.699 | 3 |
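
Seeing the file name in the listing shows the upload happened; comparing sizes gives a little more confidence that it finished. The sketch below is a hypothetical helper that compares a local file against a name-to-size mapping. Building that mapping from the `name` and `size` columns of the listing is left to the reader, since the exact return type of `%hdfs ls` is not shown here.

```python
import os
from typing import Mapping, Optional


def upload_problem(local_path: str, hdfs_sizes: Mapping[str, int]) -> Optional[str]:
    """Return a description of the mismatch, or None when the sizes agree."""
    name = os.path.basename(local_path)
    local_size = os.path.getsize(local_path)
    if name not in hdfs_sizes:
        return f"{name}: not found in the HDFS listing"
    if hdfs_sizes[name] != local_size:
        return f"{name}: local size {local_size} B, HDFS size {hdfs_sizes[name]} B"
    return None


# Usage, with the size reported in the listing above:
# upload_problem("customers.csv", {"customers.csv": 1046})
```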


## Step 4: Reading Files from HDFS

### User Story
*As a data scientist, I need to quickly preview HDFS files without downloading them to verify content and structure.*

The `cat` command allows you to read files directly from HDFS.

In [11]:
# Read the file (cat shows up to 100 lines by default; use "-n -1" for the full content)
print("📄 Full file content:")
%hdfs cat /demo/data/customers.csv

📄 Full file content:


'customer_id,name,email,total_purchases,loyalty_tier\n1,Customer 1,customer1@example.com,100.5,Bronze\n2,Customer 2,customer2@example.com,201.0,Bronze\n3,Customer 3,customer3@example.com,301.5,Bronze\n4,Customer 4,customer4@example.com,402.0,Bronze\n5,Customer 5,customer5@example.com,502.5,Bronze\n6,Customer 6,customer6@example.com,603.0,Bronze\n7,Customer 7,customer7@example.com,703.5,Bronze\n8,Customer 8,customer8@example.com,804.0,Bronze\n9,Customer 9,customer9@example.com,904.5,Bronze\n10,Customer 10,customer10@example.com,1005.0,Bronze\n11,Customer 11,customer11@example.com,1105.5,Silver\n12,Customer 12,customer12@example.com,1206.0,Silver\n13,Customer 13,customer13@example.com,1306.5,Silver\n14,Customer 14,customer14@example.com,1407.0,Silver\n15,Customer 15,customer15@example.com,1507.5,Silver\n16,Customer 16,customer16@example.com,1608.0,Gold\n17,Customer 17,customer17@example.com,1708.5,Gold\n18,Customer 18,customer18@example.com,1809.0,Gold\n19,Customer 19,customer19@example.

In [12]:
# Preview just the first 5 lines (header + 4 records)
print("👀 Quick preview (first 5 lines):")
%hdfs cat -n 5 /demo/data/customers.csv

👀 Quick preview (first 5 lines):


'customer_id,name,email,total_purchases,loyalty_tier\n1,Customer 1,customer1@example.com,100.5,Bronze\n2,Customer 2,customer2@example.com,201.0,Bronze\n3,Customer 3,customer3@example.com,301.5,Bronze\n4,Customer 4,customer4@example.com,402.0,Bronze'
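
The quoted output suggests that `cat` hands back the content as a single string with `\n` separators. Assuming that is the case, the preview can be turned into records without touching the local disk. This sketch uses only the standard library; `parse_csv_text` is a hypothetical helper name.

```python
import csv
import io


def parse_csv_text(text: str) -> list:
    """Parse CSV text (header line first) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# The first lines of the preview shown above.
preview = (
    "customer_id,name,email,total_purchases,loyalty_tier\n"
    "1,Customer 1,customer1@example.com,100.5,Bronze\n"
    "2,Customer 2,customer2@example.com,201.0,Bronze"
)
records = parse_csv_text(preview)
print(records[0]["name"], records[0]["loyalty_tier"])
```

With pandas available, `pd.read_csv(io.StringIO(text))` gives a DataFrame from the same string.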

## Step 5: Downloading Files

### User Story
*As a business analyst, I need to download processed data from HDFS to create reports in Excel.*

Let's download our file and work with it locally.

In [13]:
# Download file from HDFS
print("📥 Downloading from HDFS...")
%hdfs get /demo/data/customers.csv ./downloaded_customers.csv
print("✓ Download complete!")

📥 Downloading from HDFS...
✓ Download complete!


In [14]:
# Verify downloaded file
df_downloaded = pd.read_csv('downloaded_customers.csv')

print("✓ File downloaded successfully!")
print(f"  Records: {len(df_downloaded)}")
print("\nData summary:")
print(df_downloaded.describe())

✓ File downloaded successfully!
  Records: 20

Data summary:
       customer_id  total_purchases
count     20.00000        20.000000
mean      10.50000      1055.250000
std        5.91608       594.566018
min        1.00000       100.500000
25%        5.75000       577.875000
50%       10.50000      1055.250000
75%       15.25000      1532.625000
max       20.00000      2010.000000
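
Matching record counts are a good sign, but a byte-level comparison is stricter. Since the original `customers.csv` is still on disk, one option is to compare checksums of the two local files. This is a minimal sketch with hypothetical helper names, independent of webhdfsmagic.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files have identical bytes."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage: same_content("customers.csv", "downloaded_customers.csv")
```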


## Step 6: Batch Operations with Wildcards

### User Story
*As a data engineer processing daily sales data, I receive multiple files that need to be uploaded to HDFS efficiently.*

webhdfsmagic supports wildcards for batch operations, making it easy to handle multiple files.

In [15]:
# Generate multiple daily sales files
from datetime import datetime, timedelta

print("📊 Generating daily sales data...\n")

for i in range(3):
    date = datetime.now() - timedelta(days=i)
    date_str = date.strftime('%Y%m%d')

    # Generate sales data
    sales_df = pd.DataFrame({
        'date': [date.strftime('%Y-%m-%d')] * 15,
        'product_id': [f'PROD{j:03d}' for j in range(1, 16)],
        'quantity': [10 + i*5 + j for j in range(15)],
        'unit_price': [50.0 + j*10 for j in range(15)],
        'total': [(50.0 + j*10) * (10 + i*5 + j) for j in range(15)]
    })

    filename = f'sales_{date_str}.csv'
    sales_df.to_csv(filename, index=False)

    print(f"  ✓ {filename}: {len(sales_df)} transactions, ${sales_df['total'].sum():,.2f}")

print("\n✓ All sales files generated!")

📊 Generating daily sales data...

  ✓ sales_20251204.csv: 15 transactions, $33,400.00
  ✓ sales_20251203.csv: 15 transactions, $42,400.00
  ✓ sales_20251202.csv: 15 transactions, $51,400.00

✓ All sales files generated!


In [16]:
# Create sales directory
%hdfs mkdir /demo/sales

{'boolean': True}

In [17]:
# Upload all sales files at once using wildcards
print("📤 Uploading all sales_*.csv files...")
%hdfs put sales_*.csv /demo/sales/
print("✓ Batch upload complete!")

📤 Uploading all sales_*.csv files...
✓ Batch upload complete!


In [18]:
# Verify all files were uploaded
print("📂 Files in /demo/sales:")
%hdfs ls /demo/sales

📂 Files in /demo/sales:


|   | name | type | size | owner | group | permissions | block_size | modified | replication |
|---|------|------|------|-------|-------|-------------|------------|----------|-------------|
| 0 | raw | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 12:56:03.360 | 0 |
| 1 | sales_20251202.csv | FILE | 562 | testuser | supergroup | rw-r--r-- | 134217728 | 2025-12-04 13:34:29.696 | 3 |
| 2 | sales_20251203.csv | FILE | 560 | testuser | supergroup | rw-r--r-- | 134217728 | 2025-12-04 13:34:29.240 | 3 |
| 3 | sales_20251204.csv | FILE | 559 | testuser | supergroup | rw-r--r-- | 134217728 | 2025-12-04 13:34:29.755 | 3 |
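
Before running a wildcard upload, it can help to preview which local files a pattern matches and where each would land. The sketch below models only the rule quoted in the help text for destinations ending in `/` (the original file name is kept). `plan_uploads` is a hypothetical helper and an approximation, not the package's implementation.

```python
import glob
import os
import posixpath


def plan_uploads(local_pattern: str, hdfs_dest: str) -> list:
    """Pair each matching local file with the HDFS path it is expected to get."""
    plan = []
    for local_path in sorted(glob.glob(local_pattern)):
        if hdfs_dest.endswith("/"):
            # Directory-style destination: keep the original file name.
            target = posixpath.join(hdfs_dest, os.path.basename(local_path))
        else:
            # Anything else is treated as the full target path.
            target = hdfs_dest
        plan.append((local_path, target))
    return plan


# Usage: for src, dst in plan_uploads("sales_*.csv", "/demo/sales/"): print(src, "->", dst)
```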


## Step 7: Data Validation Workflow

### User Story
*As a data quality analyst, I need to verify that uploaded files are complete and readable before proceeding with processing.*

In [19]:
# Quick validation: preview each sales file
import glob

print("🔍 Validating uploaded sales files...\n")

for local_file in sorted(glob.glob('sales_*.csv')):
    hdfs_file = f"/demo/sales/{local_file}"
    print(f"File: {local_file}")
    print("Preview (first 3 lines):")
    result = %hdfs cat -n 3 {hdfs_file}
    print(result)
    print("-" * 60)

🔍 Validating uploaded sales files...

File: sales_20251202.csv
Preview (first 3 lines):
date,product_id,quantity,unit_price,total
2025-12-02,PROD001,20,50.0,1000.0
2025-12-02,PROD002,21,60.0,1260.0
------------------------------------------------------------
File: sales_20251203.csv
Preview (first 3 lines):
date,product_id,quantity,unit_price,total
2025-12-03,PROD001,15,50.0,750.0
2025-12-03,PROD002,16,60.0,960.0
------------------------------------------------------------
File: sales_20251204.csv
Preview (first 3 lines):
date,product_id,quantity,unit_price,total
2025-12-04,PROD001,10,50.0,500.0
2025-12-04,PROD002,11,60.0,660.0
------------------------------------------------------------
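
Eyeballing previews works for three files; for more, the same check can be automated. The sketch below inspects preview text such as the `result` string in the loop above and reports structural problems. `preview_problems` is a hypothetical helper, and it splits on plain commas, so it assumes no quoted fields.

```python
def preview_problems(text: str, expected_header: list, min_rows: int = 1) -> list:
    """Return a list of human-readable problems found in CSV preview text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return ["file is empty"]

    problems = []
    header = lines[0].split(",")
    if header != expected_header:
        problems.append(f"unexpected header: {header}")
    if len(lines) - 1 < min_rows:
        problems.append(f"expected at least {min_rows} data row(s), found {len(lines) - 1}")
    # Every data line should have as many fields as the expected header.
    for number, line in enumerate(lines[1:], start=2):
        if len(line.split(",")) != len(expected_header):
            problems.append(f"line {number}: wrong number of fields")
    return problems


SALES_HEADER = ["date", "product_id", "quantity", "unit_price", "total"]
# Usage inside the loop above: print(preview_problems(result, SALES_HEADER) or "OK")
```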


## Step 8: Cleanup Operations

### User Story
*As a storage administrator, I need to remove obsolete files and directories to free up space.*

Let's clean up our demo data.

In [20]:
# Delete a single file
print("🗑️ Deleting single file...")
%hdfs rm /demo/data/customers.csv
print("✓ File deleted")

🗑️ Deleting single file...
✓ File deleted


In [21]:
# Delete entire directory recursively
print("🗑️ Deleting /demo/sales directory (recursive)...")
%hdfs rm -r /demo/sales
print("✓ Directory deleted")

🗑️ Deleting /demo/sales directory (recursive)...
✓ Directory deleted


In [22]:
# Verify cleanup
print("📂 Remaining contents in /demo:")
%hdfs ls /demo

📂 Remaining contents in /demo:


|   | name | type | size | owner | group | permissions | block_size | modified | replication |
|---|------|------|------|-------|-------|-------------|------------|----------|-------------|
| 0 | data | DIR | 0 | root | supergroup | rwxr-xr-x | 0 | 2025-12-04 13:34:30.088 | 0 |
| 1 | results | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 13:34:28.368 | 0 |
| 2 | test123 | DIR | 0 | testuser | supergroup | rwxr-xr-x | 0 | 2025-12-04 10:56:37.099 | 0 |


In [23]:
# Final cleanup: remove demo directory
print("🗑️ Final cleanup...")
%hdfs rm -r /demo
print("✓ All demo data cleaned up!")

🗑️ Final cleanup...
✓ All demo data cleaned up!
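
The cells above clear the HDFS side only. The notebook also wrote `customers.csv`, `downloaded_customers.csv`, and the `sales_*.csv` files to the local working directory. A small sketch for removing them, using a hypothetical helper and plain standard-library calls:

```python
import glob
import os


def remove_local_files(patterns, directory: str = ".") -> list:
    """Delete files matching any of `patterns` in `directory`; return their names."""
    removed = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(directory, pattern))):
            os.remove(path)
            removed.append(os.path.basename(path))
    return removed


# Usage:
# remove_local_files(["customers.csv", "downloaded_customers.csv", "sales_*.csv"])
```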


## 🎉 Summary & Key Takeaways

### What We Accomplished

In this demo, we successfully:

1. ✅ **Configured** webhdfsmagic to connect to HDFS via Knox Gateway
2. ✅ **Created** organized directory structures
3. ✅ **Uploaded** single files and batch files with wildcards
4. ✅ **Read** files directly from HDFS with preview options
5. ✅ **Downloaded** files for local analysis
6. ✅ **Validated** data quality through quick previews
7. ✅ **Cleaned up** obsolete data efficiently

### Commands Demonstrated

| Command | Purpose | Example |
|---------|---------|--------|
| `%hdfs ls <path>` | List directory contents | `%hdfs ls /demo` |
| `%hdfs mkdir <path>` | Create directory | `%hdfs mkdir /demo/data` |
| `%hdfs put <local> <hdfs>` | Upload file(s) | `%hdfs put *.csv /demo/` |
| `%hdfs get <hdfs> <local>` | Download file(s) | `%hdfs get /demo/file.csv .` |
| `%hdfs cat <path>` | Read file content | `%hdfs cat /demo/data.csv` |
| `%hdfs cat -n N <path>` | Read first N lines | `%hdfs cat -n 10 /demo/data.csv` |
| `%hdfs rm <path>` | Delete file | `%hdfs rm /demo/old.csv` |
| `%hdfs rm -r <path>` | Delete directory | `%hdfs rm -r /demo/old/` |

### Advantages Over Traditional Methods

1. **93% Less Code**: No verbose client initialization
2. **Intuitive Syntax**: Magic commands feel natural in notebooks
3. **Streaming Support**: Efficient handling of large files
4. **Wildcard Support**: Batch operations made simple
5. **Knox Gateway Ready**: Enterprise security built-in
6. **Better Debugging**: Clear error messages and feedback

### Useful Resources

- **HDFS NameNode UI**: http://localhost:9870
- **WebHDFS Gateway**: http://localhost:8080/gateway/default/webhdfs/v1/
- **PyPI Package**: https://pypi.org/project/webhdfsmagic/
- **GitHub Repository**: https://github.com/ab2dridi/webhdfsmagic

### Next Steps

Now that you've mastered the basics, try:
- Integrating webhdfsmagic into your data pipelines
- Processing large datasets with pandas + HDFS
- Automating file uploads/downloads in workflows
- Combining with Spark for distributed processing

### Stop the Demo Environment

When done, stop the Docker containers:

```bash
# Stop but keep data
docker-compose stop

# Stop and remove everything
docker-compose down -v
```

---

**Thank you for trying webhdfsmagic!** 🚀

Questions or feedback? Open an issue on [GitHub](https://github.com/ab2dridi/webhdfsmagic/issues)!