## Setup

Before running this demo, make sure that:
1. Docker and docker-compose are installed
2. The HDFS environment is running: `docker-compose up -d`
3. A configuration file exists at `~/.webhdfsmagic/config.json`
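
If the configuration file does not exist yet, a minimal sketch for creating it is shown below. The key names are the ones listed by `%hdfs help` for `setconfig`; the way the URL is split between `knox_url` and `webhdfs_api`, the `write_config` helper, and the placeholder password are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_config(config_dir: Path, config: dict) -> Path:
    """Write config.json into config_dir, creating the directory if needed."""
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Values for the local demo environment. The knox_url / webhdfs_api split
# and the password are assumptions: adjust them to your own setup.
demo_config = {
    "knox_url": "http://localhost:8080/gateway/default",
    "webhdfs_api": "/webhdfs/v1",
    "username": "testuser",
    "password": "CHANGE_ME",
    "verify_ssl": False,
}

# Written to a temporary directory so the sketch has no side effects.
# For real use, pass Path.home() / ".webhdfsmagic" instead.
target_dir = Path(tempfile.mkdtemp()) / ".webhdfsmagic"
config_path = write_config(target_dir, demo_config)
print(config_path.name)
```

Keeping the write in a small helper makes it easy to point the same code at the real home directory once the values are correct.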

In [1]:
# Load the extension
%load_ext webhdfsmagic

The webhdfsmagic extension is already loaded. To reload it, use:
  %reload_ext webhdfsmagic


In [2]:
# View help and available commands
%hdfs help

Command,Description
%hdfs help,Display this help
"%hdfs setconfig {""knox_url"": ""..."", ""webhdfs_api"": ""..."",  ""username"": ""..."", ""password"": ""..."", ""verify_ssl"": false}",Set configuration and credentials directly in the notebook
%hdfs ls [path],List files on HDFS
%hdfs mkdir <path>,Create a directory on HDFS
%hdfs rm <path or pattern> [-r],Delete a file/directory. Supports wildcards.  Example: %hdfs rm /user/files* [-r]
%hdfs put <local_file_or_pattern> <hdfs_destination>,"Upload one or more local files (wildcards allowed) to HDFS.  If the HDFS path ends with '/' or '.', the original file name is preserved."
%hdfs get <hdfs_file_or_pattern> <local_destination>,"Download one or more files from HDFS.  If the local destination is a directory (or "".""/~),  the original file name is appended."
%hdfs cat <file> [-n <number_of_lines>],"Display file content. Default is 100 lines.  Use ""-n -1"" to display the full file."
%hdfs chmod [-R] <permission> <path>,"Set permissions (SETPERMISSION).  The ""-R"" option applies recursively."
%hdfs chown [-R] <user:group> <path>,"Set owner and group (SETOWNER).  The ""-R"" option applies recursively."
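
The help text for `put` says the original file name is preserved when the HDFS path ends with `/` or `.`. The sketch below illustrates that rule only; `resolve_hdfs_destination` is a hypothetical helper written for this explanation, not the extension's actual implementation.

```python
import posixpath


def resolve_hdfs_destination(local_file: str, hdfs_dest: str) -> str:
    """Return the HDFS path a local file would land at under the documented rule."""
    name = posixpath.basename(local_file)
    if hdfs_dest.endswith("/"):
        # Destination is a directory: keep the original file name.
        return hdfs_dest + name
    if hdfs_dest.endswith("."):
        # Assumption: a trailing "." means "this directory".
        return posixpath.join(hdfs_dest[:-1], name)
    # Otherwise the destination is taken as the full target file path.
    return hdfs_dest


print(resolve_hdfs_destination("reports/sales.csv", "/demo/sales/"))
print(resolve_hdfs_destination("reports/sales.csv", "/demo/sales/renamed.csv"))
```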


In [3]:
# Check current configuration
import json
import os

config_path = os.path.expanduser('~/.webhdfsmagic/config.json')
with open(config_path) as f:
    config = json.load(f)
    
print("Current configuration:")
print(f"  URL: {config['knox_url']}{config['webhdfs_api']}")
print(f"  User: {config['username']}")
print(f"  SSL: {config['verify_ssl']}")

Current configuration:
  URL: http://localhost:8080/gateway/default/webhdfs/v1
  User: testuser
  SSL: False
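
The URL printed above is the prefix for WebHDFS REST calls, which take the form `<base><path>?op=<OPERATION>`. As a rough sketch, and assuming the magic commands map onto the standard WebHDFS operations (the help text names `SETPERMISSION` and `SETOWNER`), the request URLs could be built like this. The `webhdfs_url` helper is hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(base_url: str, hdfs_path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL of the form <base><path>?op=<OP>&key=value."""
    query = urlencode({"op": op, **params})
    return f"{base_url.rstrip('/')}{quote(hdfs_path)}?{query}"


base = "http://localhost:8080/gateway/default/webhdfs/v1"

# Standard WebHDFS operation names; which ones the extension issues
# for each command is an assumption.
print(webhdfs_url(base, "/", "LISTSTATUS"))
print(webhdfs_url(base, "/demo/data", "MKDIRS"))
print(webhdfs_url(base, "/demo/sales", "DELETE", recursive="true"))
```

No request is sent here; the sketch only shows how the configured base URL, the HDFS path, and the operation fit together.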


## 1️⃣ Directory Listing

In [4]:
# List root directory
%hdfs ls /

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,data,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 12:10:49.489,0
1,demo,DIR,0,root,supergroup,rwxr-xr-x,0,2025-12-04 12:17:39.846,0
2,test_mkdir_direct,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 10:57:06.101,0
3,test_via_magic,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 10:57:16.778,0
4,test_webhdfs,DIR,0,root,supergroup,rwxr-xr-x,0,2025-12-04 10:49:59.125,0
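
WebHDFS reports permissions as an octal string such as `755`, while the listing above shows them in `rwx` form. The conversion can be sketched as follows; `octal_to_rwx` is a hypothetical helper, and whether the extension performs the conversion exactly this way is an assumption.

```python
def octal_to_rwx(permission: str) -> str:
    """Convert an octal permission string such as "755" to "rwxr-xr-x"."""
    parts = []
    # Only the last three digits are used; a leading sticky-bit digit is ignored.
    for digit in permission[-3:]:
        value = int(digit, 8)
        parts.append(
            "".join(flag if value & (4 >> i) else "-" for i, flag in enumerate("rwx"))
        )
    return "".join(parts)


print(octal_to_rwx("755"))  # directories in the listing: rwxr-xr-x
print(octal_to_rwx("644"))  # typical file permissions: rw-r--r--
```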


## 2️⃣ Creating Directories

In [5]:
# Create a test directory
%hdfs mkdir /demo

{'boolean': True}

In [6]:
# Create nested directories
%hdfs mkdir /demo/data

{'boolean': True}

In [7]:
# Verify directory creation
%hdfs ls /

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,data,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 12:10:49.489,0
1,demo,DIR,0,root,supergroup,rwxr-xr-x,0,2025-12-04 12:17:39.846,0
2,test_mkdir_direct,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 10:57:06.101,0
3,test_via_magic,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 10:57:16.778,0
4,test_webhdfs,DIR,0,root,supergroup,rwxr-xr-x,0,2025-12-04 10:49:59.125,0


In [8]:
# List contents of demo directory
%hdfs ls /demo

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,data,DIR,0,root,supergroup,rwxr-xr-x,0,2025-12-04 12:43:18.629,0
1,sales,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 12:45:51.299,0
2,test123,DIR,0,testuser,supergroup,rwxr-xr-x,0,2025-12-04 10:56:37.099,0


## 3️⃣ Uploading Files

In [9]:
# Create a local test file
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'id': range(1, 11),
    'customer': [f'Customer{i}' for i in range(1, 11)],
    'amount': [100.5 * i for i in range(1, 11)]
})

# Save locally
df.to_csv('test_data.csv', index=False)
print("File test_data.csv created:")
print(df.head())

File test_data.csv created:
   id   customer  amount
0   1  Customer1   100.5
1   2  Customer2   201.0
2   3  Customer3   301.5
3   4  Customer4   402.0
4   5  Customer5   502.5


In [10]:
# Upload to HDFS
%hdfs put test_data.csv /demo/data/customers.csv

'/workspaces/webhdfsmagic/examples/test_data.csv uploaded successfully to /demo/data/customers.csv'

In [11]:
# Verify file exists
%hdfs ls /demo/data

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,2024,DIR,0,root,supergroup,rwxr-xr-x,0,2025-12-04 10:56:15.400,0
1,clients.csv,FILE,178,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:19:46.787,3
2,customers.csv,FILE,202,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:14.974,3
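
The `size` column gives a quick integrity check: `customers.csv` is listed at 202 bytes, which should equal the size of the local `test_data.csv`. The sketch below rebuilds the same CSV text in plain Python, assuming `\n` line endings and a trailing newline as pandas writes on Linux, and counts its bytes.

```python
# Rebuild the CSV content produced by df.to_csv('test_data.csv', index=False).
rows = ["id,customer,amount"] + [
    f"{i},Customer{i},{100.5 * i}" for i in range(1, 11)
]
csv_text = "\n".join(rows) + "\n"  # assumption: "\n" endings, trailing newline

expected_size = len(csv_text.encode("utf-8"))
print(expected_size)  # 202, matching the size reported by %hdfs ls
```

In the notebook itself, `os.path.getsize('test_data.csv')` gives the same number without rebuilding the content.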


## 4️⃣ Reading Files

In [12]:
# Read file content
%hdfs cat /demo/data/customers.csv

'id,customer,amount\n1,Customer1,100.5\n2,Customer2,201.0\n3,Customer3,301.5\n4,Customer4,402.0\n5,Customer5,502.5\n6,Customer6,603.0\n7,Customer7,703.5\n8,Customer8,804.0\n9,Customer9,904.5\n10,Customer10,1005.0'

In [13]:
# Read only first 5 lines
%hdfs cat -n 5 /demo/data/customers.csv

'id,customer,amount\n1,Customer1,100.5\n2,Customer2,201.0\n3,Customer3,301.5\n4,Customer4,402.0'
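
Note that `-n 5` counts the header as a line, so the output above holds the header plus four data rows. A minimal sketch of that line-limiting behaviour, including the `-n -1` case from the help text, is given below; `head_lines` is a hypothetical helper and not the extension's own code.

```python
def head_lines(text: str, n: int = 100) -> str:
    """Return the first n lines of text; a negative n returns the full text."""
    if n < 0:
        return text
    return "\n".join(text.splitlines()[:n])


content = "id,customer,amount\n" + "\n".join(
    f"{i},Customer{i},{100.5 * i}" for i in range(1, 11)
)

preview = head_lines(content, 5)
print(preview.count("\n") + 1)  # number of lines kept, header included
```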

## 5️⃣ Downloading Files

In [14]:
# Download from HDFS
%hdfs get /demo/data/customers.csv ./downloaded_customers.csv

'/demo/data/customers.csv downloaded to ./downloaded_customers.csv'

In [15]:
# Verify downloaded file
df_downloaded = pd.read_csv('downloaded_customers.csv')
print("File downloaded from HDFS:")
print(df_downloaded)

File downloaded from HDFS:
   id    customer  amount
0   1   Customer1   100.5
1   2   Customer2   201.0
2   3   Customer3   301.5
3   4   Customer4   402.0
4   5   Customer5   502.5
5   6   Customer6   603.0
6   7   Customer7   703.5
7   8   Customer8   804.0
8   9   Customer9   904.5
9  10  Customer10  1005.0
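
Printing the DataFrame confirms the content by eye. For a stricter round-trip check, the uploaded and downloaded files can be compared byte for byte with a checksum. The sketch below uses temporary stand-in files so that it runs on its own; in the notebook you would pass `test_data.csv` and `downloaded_customers.csv` instead.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in files for the demo (hypothetical content).
workdir = Path(tempfile.mkdtemp())
original = workdir / "test_data.csv"
downloaded = workdir / "downloaded_customers.csv"
original.write_bytes(b"id,customer,amount\n1,Customer1,100.5\n")
downloaded.write_bytes(original.read_bytes())

print(sha256_of(original) == sha256_of(downloaded))
```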


## 6️⃣ Complete Workflow Example

In [16]:
# Generate multiple sales data files
from datetime import datetime, timedelta

print("📊 Generating sales data...")

for i in range(3):
    date = datetime.now() - timedelta(days=i)
    date_str = date.strftime('%Y%m%d')
    
    # Generate data
    df_sales = pd.DataFrame({
        'date': [date.strftime('%Y-%m-%d')] * 10,
        'product_id': range(1, 11),
        'quantity': [10 + i*5 + j for j in range(10)],
        'price': [50.0 + j*10 for j in range(10)]
    })
    
    filename = f'sales_{date_str}.csv'
    df_sales.to_csv(filename, index=False)
    
    print(f"  Created: {filename} ({len(df_sales)} rows)")

print("\n✓ Data generated")

📊 Generating sales data...
  Created: sales_20251204.csv (10 rows)
  Created: sales_20251203.csv (10 rows)
  Created: sales_20251202.csv (10 rows)

✓ Data generated


In [17]:
# Create destination directory
%hdfs mkdir /demo/sales

{'boolean': True}

In [18]:
# Upload all files using wildcards
%hdfs put sales_*.csv /demo/sales/

'/workspaces/webhdfsmagic/examples/sales_20251203.csv uploaded successfully to /demo/sales/sales_20251203.csv\n/workspaces/webhdfsmagic/examples/sales_20251202.csv uploaded successfully to /demo/sales/sales_20251202.csv\n/workspaces/webhdfsmagic/examples/sales_20251204.csv uploaded successfully to /demo/sales/sales_20251204.csv'
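
The upload messages above are not in date order, which is what you would expect if the pattern is expanded with an unsorted glob. As a sketch of how a wildcard upload can be planned deterministically, the hypothetical `plan_uploads` helper below pairs each matching local file with its target path and sorts the matches first. It runs against temporary stand-in files and does not contact HDFS.

```python
import glob
import os
import tempfile


def plan_uploads(pattern: str, hdfs_dir: str) -> list[tuple[str, str]]:
    """Pair each local file matching pattern with its target path under hdfs_dir."""
    target = hdfs_dir.rstrip("/")
    return [
        (path, f"{target}/{os.path.basename(path)}")
        for path in sorted(glob.glob(pattern))  # sorted for a stable order
    ]


# Stand-in files so the sketch is self-contained.
workdir = tempfile.mkdtemp()
for day in ("20251204", "20251202", "20251203"):
    with open(os.path.join(workdir, f"sales_{day}.csv"), "w") as f:
        f.write("date,product_id,quantity,price\n")

uploads = plan_uploads(os.path.join(workdir, "sales_*.csv"), "/demo/sales/")
for local_path, hdfs_path in uploads:
    print(os.path.basename(local_path), "->", hdfs_path)
```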

In [24]:
# Verify uploaded files
print("📁 Files in HDFS:\n")
%hdfs ls /demo/sales

📁 Files in HDFS:



Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,sales_20251202.csv,FILE,247,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:16.113,3
1,sales_20251203.csv,FILE,247,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:15.683,3
2,sales_20251204.csv,FILE,247,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:16.166,3


## 7️⃣ Cleanup

In [20]:
# Delete a file
%hdfs rm /demo/data/customers.csv

{'boolean': True}

In [26]:
# List the directory before deleting anything (a recursive delete with -r cannot be undone, so be careful!)
%hdfs ls /demo/sales/

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,sales_20251202.csv,FILE,247,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:16.113,3
1,sales_20251203.csv,FILE,247,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:15.683,3
2,sales_20251204.csv,FILE,247,testuser,supergroup,rw-r--r--,134217728,2025-12-04 12:47:16.166,3


In [22]:
# Try to delete a path that does not exist: rm reports {'boolean': False}
%hdfs rm /demo/sales/raw

{'boolean': False}
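
Because `rm` reports failure through `{'boolean': False}` rather than an error, a script that depends on the deletion may want to check the result explicitly. The sketch below assumes the magic's return value can be captured as a dict (for example with `result = %hdfs rm <path>`), which is not confirmed here; `ensure_deleted` is a hypothetical helper.

```python
def ensure_deleted(response: dict, path: str) -> None:
    """Raise if a delete response shows that nothing was removed."""
    if not response.get("boolean", False):
        raise FileNotFoundError(f"nothing was deleted at {path}")


# The two responses seen in the cleanup cells above.
ensure_deleted({"boolean": True}, "/demo/data/customers.csv")

try:
    ensure_deleted({"boolean": False}, "/demo/sales/raw")
except FileNotFoundError as exc:
    print(f"delete failed: {exc}")
```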

## ✅ Summary

If all cells above executed successfully, webhdfsmagic is working correctly with your HDFS cluster!

### Features demonstrated:

- ✅ Configuration and connection through Knox Gateway
- ✅ Directory listing (`ls`)
- ✅ Directory creation (`mkdir`)
- ✅ File upload (`put`) with streaming support
- ✅ File reading (`cat`) with line limit option
- ✅ File download (`get`) with streaming support
- ✅ Wildcard support for batch operations
- ✅ File deletion (`rm`) with recursive option
- ✅ Complete data workflow

### Useful URLs:

- **HDFS NameNode UI**: http://localhost:9870
- **WebHDFS Gateway**: http://localhost:8080/gateway/default/webhdfs/v1/

### To stop the environment:

```bash
docker-compose down
# or to also remove data:
docker-compose down -v
```
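
The demo also leaves files in the local working directory: `test_data.csv`, `downloaded_customers.csv`, and the generated `sales_*.csv` files. A small sketch for removing them is shown below; `remove_local_demo_files` is a hypothetical helper, and it is run against a temporary directory here so that nothing real is deleted.

```python
import tempfile
from pathlib import Path


def remove_local_demo_files(directory: Path) -> list[str]:
    """Delete the files this demo creates locally and return their names."""
    patterns = ["test_data.csv", "downloaded_customers.csv", "sales_*.csv"]
    removed = []
    for pattern in patterns:
        for path in sorted(directory.glob(pattern)):
            path.unlink()
            removed.append(path.name)
    return removed


# Stand-in directory with demo files plus one unrelated file.
workdir = Path(tempfile.mkdtemp())
for name in ("test_data.csv", "downloaded_customers.csv", "sales_20251202.csv", "notes.txt"):
    (workdir / name).write_text("placeholder\n")

removed = remove_local_demo_files(workdir)
print(removed)
```

In the notebook, calling the helper with `Path('.')` would target the directory the cells were run from.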

### Advantages of webhdfsmagic:

1. **Simpler syntax**: Magic commands vs Python API calls
2. **Less boilerplate**: No client initialization code needed
3. **Better integration**: Works naturally in Jupyter notebooks
4. **Streaming support**: Efficient for large files
5. **Wildcard support**: Batch operations made easy
6. **Knox Gateway ready**: Built-in support for enterprise security