# 🚀 webhdfsmagic - Complete Tutorial & Showcase

Welcome! This notebook demonstrates the **main features** of `webhdfsmagic`, an IPython/Jupyter magic extension for interacting with HDFS clusters via the WebHDFS REST API.

## 🎯 What You'll Learn

This comprehensive tutorial covers:
- ✅ **Setup & Configuration** - Connect to HDFS via Knox Gateway
- ✅ **Directory Operations** - Create, list, and navigate directories
- ✅ **File Management** - Upload, download, and delete files
- ✅ **Smart Preview** - Intelligent data visualization for CSV/TSV/Parquet
- ✅ **Permissions** - Manage file permissions (chmod)
- ✅ **Advanced Features** - Format options, batch operations, and more

## 📋 Prerequisites

Before starting, ensure you have:

1. **Docker environment running**:
   ```bash
   cd demo && docker-compose up -d
   ```
   This starts a local HDFS cluster with Knox Gateway for testing.

2. **webhdfsmagic installed**:
   ```bash
   pip install webhdfsmagic
   ```

3. **Services available**:
   - NameNode UI: http://localhost:9870 (HDFS web interface)
   - Knox Gateway: http://localhost:8080 (REST API gateway)

💡 **Note**: This demo uses a local HDFS cluster, but webhdfsmagic works with any WebHDFS-compatible cluster.
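
Before loading the extension, it can help to confirm that both services accept connections. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper (not part of webhdfsmagic), and the ports are the ones listed above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Ports taken from the "Services available" list above
for name, port in [("NameNode UI", 9870), ("Knox Gateway", 8080)]:
    status = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} (localhost:{port}): {status}")
```

An open port only shows that something is listening; it does not prove that HDFS itself is healthy.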

## Step 1: Load the Extension

First, we load the webhdfsmagic extension into Jupyter. This registers the `%hdfs` magic command that we'll use throughout this tutorial.

In [2]:
%load_ext webhdfsmagic

The webhdfsmagic extension is already loaded. To reload it, use:
  %reload_ext webhdfsmagic


## 📖 Step 2: View Available Commands

Let's explore what commands are available. The `help` command shows all webhdfsmagic operations with detailed documentation.

**💡 Tip**: If you've just updated webhdfsmagic, restart the kernel and re-run this cell to see the latest help.

In [3]:
%hdfs help

Command,Description
%hdfs help,Display this help
%hdfs setconfig {...},Set configuration (JSON format)
%hdfs ls [path],List files and directories
%hdfs mkdir <path>,Create directory
%hdfs rm <path> [-r],Delete file/directory  -r : recursive deletion
%hdfs put <local> <hdfs>,"Upload files (supports wildcards)  -t, --threads <N> :  use N parallel threads for multi-file uploads"
%hdfs get <hdfs> <local>,"Download files (supports wildcards)  -t, --threads <N> :  use N parallel threads for multi-file downloads"
%hdfs cat <file> [options],"Smart file preview (CSV/TSV/Parquet)  -n <lines> :  limit to N rows (default: 100)  --format <type> :  force format (csv, parquet, pandas, polars, raw)  --raw :  display raw content without formatting  Auto-detects: file format, delimiter,  data types  Formats:  pandas (classic),  polars (with schema),  grid (default table)"
%hdfs chmod [-R] <mode> <path>,"Change permissions (e.g., 644, 755)  -R : recursive"
%hdfs chown [-R] <user:group> <path>,Change owner and group  -R : recursive


## Step 3: Configure HDFS Connection

Now we'll create a configuration file that tells webhdfsmagic how to connect to your HDFS cluster.

⚠️ **Important**: Knox Gateway requires the `/gateway/default` path in the URL.

📝 **What this does**:
- Creates `~/.webhdfsmagic/config.json` with connection settings
- Specifies Knox Gateway URL, credentials, and SSL settings

⚠️ **After running this cell**: You MUST restart the kernel (Kernel → Restart) and reload the extension (re-run the first two code cells).

**Why?** The extension reads its configuration when it is loaded. If you create the config file after loading the extension, it keeps using the default (incorrect) settings.

In [4]:
import json
import os

# Create configuration directory
config_dir = os.path.expanduser('~/.webhdfsmagic')
config_path = os.path.join(config_dir, 'config.json')
os.makedirs(config_dir, exist_ok=True)

# Configuration settings
config = {
    "knox_url": "http://localhost:8080/gateway/default",  # Knox Gateway endpoint
    "webhdfs_api": "/webhdfs/v1",                         # WebHDFS API path
    "username": "hdfs",                                    # HDFS username
    "password": "password",                                # HDFS password
    "verify_ssl": False                                    # Disable SSL verification (dev only!)
}

# Write configuration file
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("✅ Configuration file created successfully!")
print("⚠️  IMPORTANT: Restart the Jupyter kernel now and reload the extension!")
print(f"📁 Config saved to: {config_path}")
config

✅ Configuration file created successfully!
⚠️  IMPORTANT: Restart the Jupyter kernel now and reload the extension!
📁 Config saved to: /home/codespace/.webhdfsmagic/config.json


{'knox_url': 'http://localhost:8080/gateway/default',
 'webhdfs_api': '/webhdfs/v1',
 'username': 'hdfs',
 'password': 'password',
 'verify_ssl': False}
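
To make the role of `/gateway/default` concrete, here is a small sketch of how these settings could combine into a WebHDFS request URL. The helper `build_webhdfs_url` is hypothetical and is not webhdfsmagic's own code; the URL shape follows the request URLs that appear in the error logs later in this notebook.

```python
from urllib.parse import urlencode


def build_webhdfs_url(config: dict, hdfs_path: str, op: str, **params: str) -> str:
    """Join gateway URL, API prefix, HDFS path, and query string (illustrative only)."""
    base = config["knox_url"].rstrip("/") + config["webhdfs_api"]
    query = urlencode({"op": op, **params})
    return f"{base}/{hdfs_path.lstrip('/')}?{query}"


demo_config = {
    "knox_url": "http://localhost:8080/gateway/default",
    "webhdfs_api": "/webhdfs/v1",
}

# "user.name" is not a valid Python identifier, so it is passed via ** unpacking
print(build_webhdfs_url(demo_config, "/demo", "LISTSTATUS", **{"user.name": "hdfs"}))
# http://localhost:8080/gateway/default/webhdfs/v1/demo?op=LISTSTATUS&user.name=hdfs
```

If `knox_url` lacked the `/gateway/default` suffix, every request would be sent to a path the gateway does not route to WebHDFS.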

## ⚠️ RESTART KERNEL NOW

**Required actions**:
1. Click **Kernel → Restart** (or the ⟳ button in the toolbar)
2. After restart, **re-execute cells 1 and 2** to reload the extension with the correct configuration

**Why is this necessary?**

The webhdfsmagic extension loads its configuration when `%load_ext webhdfsmagic` is first executed. If you create the config file *after* loading the extension, it will continue using default (incorrect) settings.

By restarting the kernel and re-loading the extension, we ensure it picks up the new configuration.

✅ **After restarting**: Continue with the next section to start working with HDFS!
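
Because the extension reads the file at load time, a quick sanity check before re-running `%load_ext` can save a second restart. This is a minimal sketch, assuming the five keys written in Step 3 are the ones that matter; `missing_config_keys` is a hypothetical helper, not a webhdfsmagic function.

```python
import json
import os

# Keys written by the configuration cell in Step 3
REQUIRED_KEYS = ("knox_url", "webhdfs_api", "username", "password", "verify_ssl")


def missing_config_keys(path: str) -> list:
    """Return the required keys absent from the JSON config (all of them if unreadable)."""
    try:
        with open(os.path.expanduser(path)) as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError):
        return list(REQUIRED_KEYS)
    if not isinstance(config, dict):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if key not in config]


missing = missing_config_keys("~/.webhdfsmagic/config.json")
print("Config looks complete" if not missing else f"Missing keys: {missing}")
```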

## Step 4: Create Directory Structure

Let's create a demo directory structure in HDFS. We'll use the `mkdir` command to create directories.

**Directory structure we're creating**:

```
/demo
├── /data      (for storing our data files)
└── /archive   (for temporary/archived files)
```

In [5]:
# Create the root demo directory
%hdfs mkdir /demo

'Directory /demo created.'

In [6]:
# Create a subdirectory for data files
%hdfs mkdir /demo/data

'Directory /demo/data created.'

In [7]:
# Create a subdirectory for archives
%hdfs mkdir /demo/archive

'Directory /demo/archive created.'

In [8]:
# List contents to verify directories were created
%hdfs ls /demo

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,archive,DIR,0,hdfs,supergroup,rwxr-xr-x,0,2025-12-21 15:05:39.680,0
1,data,DIR,0,hdfs,supergroup,rwxr-xr-x,0,2025-12-21 15:05:39.647,0
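
Behind `ls`, WebHDFS answers a `LISTSTATUS` request with JSON. The sketch below shows how such a response could be flattened into rows like the table above. The field names (`pathSuffix`, `length`, `modificationTime`, and so on) come from the WebHDFS REST API; the sample payload and the `rows_from_liststatus` helper are illustrative assumptions, and webhdfsmagic's real implementation may differ (for instance, it renders permissions as `rwxr-xr-x`).

```python
from datetime import datetime, timezone


def rows_from_liststatus(payload: dict) -> list:
    """Flatten a WebHDFS LISTSTATUS response into a list of row dicts (illustrative)."""
    rows = []
    for status in payload["FileStatuses"]["FileStatus"]:
        rows.append({
            "name": status["pathSuffix"],
            # WebHDFS reports "DIRECTORY"; shortened here to match the table above
            "type": "DIR" if status["type"] == "DIRECTORY" else status["type"],
            "size": status["length"],
            "owner": status["owner"],
            "group": status["group"],
            "permissions": status["permission"],  # octal string such as "755"
            # modificationTime is milliseconds since the Unix epoch
            "modified": datetime.fromtimestamp(
                status["modificationTime"] / 1000, tz=timezone.utc
            ),
        })
    return rows


# Hypothetical payload, shaped like a real LISTSTATUS response
sample = {"FileStatuses": {"FileStatus": [
    {"pathSuffix": "archive", "type": "DIRECTORY", "length": 0, "owner": "hdfs",
     "group": "supergroup", "permission": "755", "modificationTime": 1735689600000},
]}}

for row in rows_from_liststatus(sample):
    print(row)
```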


## Step 5: Create Sample Data Files

Now let's create sample data files locally to demonstrate webhdfsmagic's file upload and preview capabilities.

We'll create three different file formats:
- **CSV** (Comma-Separated Values) - sales data
- **TSV** (Tab-Separated Values) - customer data
- **Parquet** (columnar format) - product catalog

This demonstrates webhdfsmagic's ability to work with multiple data formats.

In [9]:
import pandas as pd

# Create CSV file: Sales data with dates, products, quantities, and prices
sales = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=5),
    'product': ['Laptop', 'Monitor', 'Keyboard', 'Laptop', 'Monitor'],
    'quantity': [1, 2, 3, 1, 1],
    'price': [1000.0, 300.0, 80.0, 1000.0, 300.0]
})
sales.to_csv('sales.csv', index=False)
print("✅ sales.csv created")

# Create TSV file: Customer data with tab separator
customers = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'email': ['alice@example.com', 'bob@example.com', 'carol@example.com']
})
customers.to_csv('customers.tsv', sep='\t', index=False)
print("✅ customers.tsv created")

# Create Parquet file: Product catalog (binary columnar format)
products = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Laptop', 'Monitor', 'Keyboard'],
    'stock': [50, 120, 500]
})
products.to_parquet('products.parquet')
print("✅ products.parquet created")

✅ sales.csv created
✅ customers.tsv created
✅ products.parquet created


## Step 6: Upload Files to HDFS (PUT)

Now let's upload our local files to HDFS using the `put` command.

This transfers files from your local filesystem to HDFS.

**Syntax**: `%hdfs put <local_path> <hdfs_path>`

In [10]:
# Upload CSV file
%hdfs put sales.csv /demo/data/sales.csv


sales.csv uploaded to /demo/data/sales.csv



In [11]:
# Upload TSV file
%hdfs put customers.tsv /demo/data/customers.tsv


customers.tsv uploaded to /demo/data/customers.tsv



In [12]:
# Upload Parquet file
%hdfs put products.parquet /demo/data/products.parquet


products.parquet uploaded to /demo/data/products.parquet



In [13]:
# Verify files were uploaded successfully
%hdfs ls /demo/data

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,customers.tsv,FILE,88,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.250,3
1,products.parquet,FILE,2699,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.682,3
2,sales.csv,FILE,163,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.219,3


## Step 7: Preview Files with Smart Cat

The `cat` command displays file contents with **intelligent format detection**!

🧠 **Smart Cat Features**:

- ✅ Auto-detects file format (CSV, TSV, Parquet)
- ✅ Auto-detects delimiters (`,` `;` `|` `\t`)
- ✅ Formats output as readable tables
- ✅ Handles all Parquet data types (int, float, bool, datetime, etc.)
- ⚡ **Ultra-fast Parquet processing** with Polars (3.7x faster than PyArrow)

Let's see it in action!
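
To give a feel for what delimiter auto-detection involves, here is a deliberately simple sketch. It is not webhdfsmagic's actual algorithm, which this notebook does not show; `guess_delimiter` is a hypothetical helper that picks the candidate appearing the same non-zero number of times on every sampled line. It ignores quoted fields, so treat it as an illustration only.

```python
def guess_delimiter(text: str, candidates: str = ",;|\t", max_lines: int = 20) -> str:
    """Return the candidate that splits every sampled line into the same number of fields."""
    lines = [line for line in text.splitlines() if line.strip()][:max_lines]
    best, best_count = ",", 0  # fall back to a comma when nothing is consistent
    for delim in candidates:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1:  # same count on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


csv_sample = "date,product,quantity,price\n2025-01-01,Laptop,1,1000.0\n"
tsv_sample = "id\tname\temail\n1\tAlice\talice@example.com\n"

print(repr(guess_delimiter(csv_sample)))  # ','
print(repr(guess_delimiter(tsv_sample)))  # '\t'
```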

In [14]:
# Preview CSV file - automatically formatted as table
%hdfs cat /demo/data/sales.csv

+------------+-----------+------------+---------+
| date       | product   |   quantity |   price |
| 2025-01-01 | Laptop    |          1 |    1000 |
+------------+-----------+------------+---------+
| 2025-01-02 | Monitor   |          2 |     300 |
+------------+-----------+------------+---------+
| 2025-01-03 | Keyboard  |          3 |      80 |
+------------+-----------+------------+---------+
| 2025-01-04 | Laptop    |          1 |    1000 |
+------------+-----------+------------+---------+
| 2025-01-05 | Monitor   |          1 |     300 |
+------------+-----------+------------+---------+


In [15]:
# Preview TSV file - tab delimiter auto-detected
%hdfs cat /demo/data/customers.tsv

+------+--------+-------------------+
|   id | name   | email             |
|    1 | Alice  | alice@example.com |
+------+--------+-------------------+
|    2 | Bob    | bob@example.com   |
+------+--------+-------------------+
|    3 | Carol  | carol@example.com |
+------+--------+-------------------+


In [16]:
# Preview Parquet file - binary format decoded automatically
%hdfs cat /demo/data/products.parquet

+------+----------+---------+
|   id | name     |   stock |
|    1 | Laptop   |      50 |
+------+----------+---------+
|    2 | Monitor  |     120 |
+------+----------+---------+
|    3 | Keyboard |     500 |
+------+----------+---------+


In [17]:
# Preview with limit: show only first 2 rows
%hdfs cat -n 2 /demo/data/sales.csv

+------------+-----------+------------+---------+
| date       | product   |   quantity |   price |
| 2025-01-01 | Laptop    |          1 |    1000 |
+------------+-----------+------------+---------+
| 2025-01-02 | Monitor   |          2 |     300 |
+------------+-----------+------------+---------+

... (showing first 2 of 6 rows)


### 🎯 Advanced CAT Options

Let's explore more `cat` command features:
- Custom row limits with `-n`
- Format comparisons
- Delimiter detection demos

In [18]:
# Custom preview: show the first 5 rows instead of the default limit of 100
print("📄 First 5 rows of CSV file:")
%hdfs cat -n 5 /demo/data/sales.csv

📄 First 5 rows of CSV file:
+------------+-----------+------------+---------+
| date       | product   |   quantity |   price |
| 2025-01-01 | Laptop    |          1 |    1000 |
+------------+-----------+------------+---------+
| 2025-01-02 | Monitor   |          2 |     300 |
+------------+-----------+------------+---------+
| 2025-01-03 | Keyboard  |          3 |      80 |
+------------+-----------+------------+---------+
| 2025-01-04 | Laptop    |          1 |    1000 |
+------------+-----------+------------+---------+
| 2025-01-05 | Monitor   |          1 |     300 |
+------------+-----------+------------+---------+


In [19]:
# Parquet format (optimized columnar storage)
print("📊 Parquet format (optimized read):")
%hdfs cat /demo/data/products.parquet

📊 Parquet format (optimized read):
+------+----------+---------+
|   id | name     |   stock |
|    1 | Laptop   |      50 |
+------+----------+---------+
|    2 | Monitor  |     120 |
+------+----------+---------+
|    3 | Keyboard |     500 |
+------+----------+---------+


In [20]:
# TSV format - tab delimiter automatically detected
print("📋 TSV format (tab delimiter auto-detected):")
%hdfs cat /demo/data/customers.tsv

📋 TSV format (tab delimiter auto-detected):
+------+--------+-------------------+
|   id | name   | email             |
|    1 | Alice  | alice@example.com |
+------+--------+-------------------+
|    2 | Bob    | bob@example.com   |
+------+--------+-------------------+
|    3 | Carol  | carol@example.com |
+------+--------+-------------------+


### 🐼 Format Option: --format pandas

**What's the difference?**

By default, `cat` displays data in a formatted grid (using tabulate). With `--format pandas`, you get the standard pandas DataFrame text representation instead.

**When to use it?**

- Familiar format for pandas users
- Simpler, more compact output
- Copying/pasting data into terminals or text reports

In [21]:
# Default format: GRID (tabulate)
print("📊 GRID format (default):")
%hdfs cat -n 3 /demo/data/sales.csv

📊 GRID format (default):
+------------+-----------+------------+---------+
| date       | product   |   quantity |   price |
| 2025-01-01 | Laptop    |          1 |    1000 |
+------------+-----------+------------+---------+
| 2025-01-02 | Monitor   |          2 |     300 |
+------------+-----------+------------+---------+
| 2025-01-03 | Keyboard  |          3 |      80 |
+------------+-----------+------------+---------+

... (showing first 3 of 6 rows)


In [22]:
# Pandas format: DataFrame style
print("🐼 PANDAS format (--format pandas):")
%hdfs cat -n 3 /demo/data/sales.csv --format pandas

🐼 PANDAS format (--format pandas):
         date   product  quantity   price
0  2025-01-01    Laptop         1  1000.0
1  2025-01-02   Monitor         2   300.0
2  2025-01-03  Keyboard         3    80.0


In [23]:
# Works with Parquet too!
print("🐼 PANDAS format with Parquet:")
%hdfs cat -n 3 /demo/data/products.parquet --format pandas

🐼 PANDAS format with Parquet:
   id      name  stock
0   1    Laptop     50
1   2   Monitor    120
2   3  Keyboard    500


💡 **Pro tip**: Use `--format pandas` when you need plain text output for copying into reports, emails, or terminals.

### ⚡ Format Option: --format polars

**What's Polars?**

Polars is a **blazingly fast** DataFrame library written in Rust. When you use `--format polars`, you get:
- **3.7x faster** processing than pandas for Parquet files
- **Explicit data types** (str, i64, f64, bool) shown in the preview
- **Schema information** for better data validation
- **Memory efficient** - uses lazy evaluation where possible

**When to use it?**
- When you need to validate data types in large files
- For performance-critical workflows with Parquet files
- When you want to see the exact schema at a glance



In [24]:
# Polars format: Shows schema with explicit types
print("⚡ POLARS format (--format polars) - CSV with schema:")
%hdfs cat -n 3 /demo/data/sales.csv --format polars


⚡ POLARS format (--format polars) - CSV with schema:
shape: (3, 4)
┌────────────┬──────────┬──────────┬────────┐
│ date       ┆ product  ┆ quantity ┆ price  │
│ ---        ┆ ---      ┆ ---      ┆ ---    │
│ str        ┆ str      ┆ i64      ┆ f64    │
╞════════════╪══════════╪══════════╪════════╡
│ 2025-01-01 ┆ Laptop   ┆ 1        ┆ 1000.0 │
│ 2025-01-02 ┆ Monitor  ┆ 2        ┆ 300.0  │
│ 2025-01-03 ┆ Keyboard ┆ 3        ┆ 80.0   │
└────────────┴──────────┴──────────┴────────┘

... (showing first 3 of 6 rows)


In [25]:
# Polars format with Parquet - much faster than pandas!
print("⚡ POLARS format (--format polars) - Parquet (3.7x faster!):")
%hdfs cat -n 3 /demo/data/products.parquet --format polars


⚡ POLARS format (--format polars) - Parquet (3.7x faster!):
shape: (3, 3)
┌─────┬──────────┬───────┐
│ id  ┆ name     ┆ stock │
│ --- ┆ ---      ┆ ---   │
│ i64 ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ 1   ┆ Laptop   ┆ 50    │
│ 2   ┆ Monitor  ┆ 120   │
│ 3   ┆ Keyboard ┆ 500   │
└─────┴──────────┴───────┘


### 📝 Format Option: --raw

**Raw format** displays the **unformatted, raw content** of the file - exactly as it is stored.

- No table formatting, no DataFrames
- Shows the original file content (CSV text, Parquet header bytes, etc.)
- Useful for debugging or when you want to see the file "as-is"

**When to use it?**
- Debugging file encoding or format issues
- Examining raw file content for inspection
- When you need the exact bytes/text without any processing
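
As a rough picture of what a raw preview involves, the sketch below takes the first lines of a byte string without parsing them, and checks for the Parquet magic number (Parquet files begin and end with the four bytes `PAR1`). Both helpers are hypothetical illustrations; how webhdfsmagic implements `--raw` internally is not shown in this notebook.

```python
def raw_head(data: bytes, n: int = 5) -> str:
    """Return the first n lines of data as text, replacing undecodable bytes."""
    lines = data.splitlines(keepends=True)[:n]
    return b"".join(lines).decode("utf-8", errors="replace")


def looks_like_parquet(data: bytes) -> bool:
    """Check for the 4-byte 'PAR1' marker at both ends of the file."""
    return len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"


csv_bytes = b"date,product,quantity,price\n2025-01-01,Laptop,1,1000.0\n2025-01-02,Monitor,2,300.0\n"

print(raw_head(csv_bytes, n=2))
print(looks_like_parquet(csv_bytes))  # False
```

Note that the limit counts lines, so the header row takes up one of them.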



In [26]:
# Raw format: Unformatted content
print("📝 RAW format (--raw) - First 5 lines of CSV:")
%hdfs cat -n 5 /demo/data/sales.csv --raw


📝 RAW format (--raw) - First 5 lines of CSV:
date,product,quantity,price
2025-01-01,Laptop,1,1000.0
2025-01-02,Monitor,2,300.0
2025-01-03,Keyboard,3,80.0
2025-01-04,Laptop,1,1000.0


## Step 8: Manage Permissions (CHMOD)

Just like Unix/Linux, HDFS has file permissions! Use `chmod` to control access.

**Permission format**: `chmod <mode> <path>`
- **644**: Read/write for owner, read-only for others
- **755**: Read/write/execute for owner, read/execute for others
- **-R**: Recursive (applies to all files in directory)

Let's see current permissions, then modify them.
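
The octal modes passed to `chmod` map directly onto the `rwx` strings shown by `ls`. The following sketch makes that mapping explicit; `mode_to_rwx` is a hypothetical helper written for this tutorial.

```python
def mode_to_rwx(mode: str) -> str:
    """Convert an octal mode such as '644' into an 'rw-r--r--' style string."""
    parts = []
    for digit in mode[-3:]:  # owner, group, others
        value = int(digit, 8)
        # Bit 4 = read, bit 2 = write, bit 1 = execute
        parts.append("".join(
            flag if value & (4 >> i) else "-" for i, flag in enumerate("rwx")
        ))
    return "".join(parts)


print(mode_to_rwx("644"))  # rw-r--r--
print(mode_to_rwx("755"))  # rwxr-xr-x
```

The uploaded files already show `rw-r--r--`, which corresponds to `644`, so setting `644` on them leaves the listing unchanged.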


In [27]:
%hdfs ls /demo/data

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,customers.tsv,FILE,88,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.250,3
1,products.parquet,FILE,2699,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.682,3
2,sales.csv,FILE,163,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.219,3


In [28]:
%hdfs chmod 644 /demo/data/sales.csv

'Permission 644 set for /demo/data/sales.csv'

In [29]:
%hdfs chmod 644 /demo/data/customers.tsv

'Permission 644 set for /demo/data/customers.tsv'

In [30]:
%hdfs chmod 644 /demo/data/products.parquet

'Permission 644 set for /demo/data/products.parquet'

In [31]:
%hdfs chmod -R 755 /demo/archive

'Recursive chmod 755 applied on /demo/archive'

In [32]:
%hdfs ls /demo/data

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,customers.tsv,FILE,88,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.250,3
1,products.parquet,FILE,2699,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.682,3
2,sales.csv,FILE,163,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:40.219,3


## Step 9: Download Files from HDFS (GET)

Need to download files from HDFS to your local machine? Use the `get` command!

**Syntax**: `%hdfs get <hdfs_path> <local_path>`

This is the reverse of `put` - it transfers files from HDFS to your local filesystem.


In [33]:
%hdfs get /demo/data/sales.csv ./local_sales.csv

'/demo/data/sales.csv downloaded to ./local_sales.csv'

In [34]:
import os
import pandas as pd

# Confirm the download by reading the local copy back with pandas
if os.path.exists('./local_sales.csv'):
    df = pd.read_csv('./local_sales.csv')
    print(f"✅ File downloaded successfully: {len(df)} rows")

✅ File downloaded successfully: 5 rows
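
A row count is a quick check; comparing checksums is stricter. The sketch below hashes two local files and compares the digests. The helper names are hypothetical, and the file names assume the `sales.csv` created in Step 5 and the `local_sales.csv` downloaded above.

```python
import hashlib
import os


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files have byte-identical content."""
    return file_sha256(path_a) == file_sha256(path_b)


# Only compare when both files are present in the working directory
if os.path.exists("sales.csv") and os.path.exists("local_sales.csv"):
    print("Identical:", same_content("sales.csv", "local_sales.csv"))
```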


## Step 10: Delete Files and Directories (RM)

Clean up your HDFS workspace with the `rm` command.

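**⚠️ Warning**: Treat deletion as **permanent**! Do not count on recovering files from an HDFS trash folder after `rm`.

**Options**:
- `-r` or `-R`: Recursive (required for non-empty directories)
- `-skipTrash`: Bypass trash (immediate permanent deletion); the `%hdfs help` output in Step 2 lists only `-r`, so check that your version accepts this flag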

Let's create temporary files, then delete them to demonstrate.


In [35]:
# Create temporary files locally
for i in range(1, 3):
    temp = pd.DataFrame({'id': [i], 'value': [f'temp_{i}']})
    temp.to_csv(f'temp_{i}.csv', index=False)

In [36]:
%hdfs put temp_*.csv /demo/archive/


temp_1.csv uploaded to /demo/archive/
temp_2.csv uploaded to /demo/archive/
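
Because `put` accepts wildcards, it can be worth previewing which local files a pattern matches before uploading. Here is a minimal sketch using the standard `glob` module; whether webhdfsmagic expands patterns exactly like `glob` is an assumption, and `preview_matches` is a hypothetical helper.

```python
import glob


def preview_matches(pattern: str) -> list:
    """Return the local paths matching a wildcard pattern, sorted for readability."""
    return sorted(glob.glob(pattern))


# Same pattern as the upload cell above
print(preview_matches("temp_*.csv"))
```

This matters most for broad patterns such as `*.csv`, which pick up every CSV file in the working directory, including ones left over from earlier runs.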



In [37]:
%hdfs ls /demo/archive

Unnamed: 0,name,type,size,owner,group,permissions,block_size,modified,replication
0,temp_1.csv,FILE,18,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:41.669,3
1,temp_2.csv,FILE,18,hdfs,supergroup,rw-r--r--,134217728,2025-12-21 15:05:42.083,3


In [38]:
%hdfs rm /demo/archive/temp_1.csv

'/demo/archive/temp_1.csv deleted'

In [39]:
%hdfs rm /demo/archive/temp_2.csv

'/demo/archive/temp_2.csv deleted'

In [40]:
%hdfs ls /demo/archive

{'empty_dir': True, 'path': '/demo/archive'}

## Step 11: Advanced Parquet Features

Let's test webhdfsmagic's ability to handle complex Parquet files with multiple data types.

We'll create a Parquet file with:

- **Integers** (id)
- **Strings** (name)
- **Floats** (score)
- **Booleans** (active)
- **Categories** (category)
- **Timestamps** (timestamp)

This demonstrates Smart Cat's sophisticated type handling!

In [41]:
# Create complex Parquet file with multiple data types
import numpy as np

complex_data = pd.DataFrame({
    'id': range(1, 11),                                      # Integer
    'name': [f'User_{i}' for i in range(1, 11)],           # String
    'score': np.random.uniform(0, 100, 10).round(2),       # Float
    'active': np.random.choice([True, False], 10),         # Boolean
    'category': np.random.choice(['A', 'B', 'C'], 10),     # Category
    'timestamp': pd.date_range('2025-01-01', periods=10, freq='D')  # Datetime
})

complex_data.to_parquet('complex_data.parquet')
print("✅ Complex Parquet file created with multiple data types!")
complex_data

✅ Complex Parquet file created with multiple data types!


Unnamed: 0,id,name,score,active,category,timestamp
0,1,User_1,91.52,False,B,2025-01-01
1,2,User_2,9.07,False,A,2025-01-02
2,3,User_3,1.43,False,C,2025-01-03
3,4,User_4,23.72,False,B,2025-01-04
4,5,User_5,94.83,True,B,2025-01-05
5,6,User_6,43.15,False,A,2025-01-06
6,7,User_7,77.44,False,B,2025-01-07
7,8,User_8,47.48,True,C,2025-01-08
8,9,User_9,12.1,False,B,2025-01-09
9,10,User_10,14.69,False,B,2025-01-10


In [42]:
# Upload to HDFS
%hdfs put complex_data.parquet /demo/data/complex_data.parquet


complex_data.parquet uploaded to /demo/data/complex_data.parquet



In [43]:
# Smart Cat handles all column types automatically!
print("📊 Preview complex Parquet file:")
print("   (types: int, str, float, bool, category, datetime)\n")
%hdfs cat /demo/data/complex_data.parquet

📊 Preview complex Parquet file:
   (types: int, str, float, bool, category, datetime)

+------+---------+---------+----------+------------+---------------------+
|   id | name    |   score | active   | category   | timestamp           |
|    1 | User_1  |   91.52 | False    | B          | 2025-01-01 00:00:00 |
+------+---------+---------+----------+------------+---------------------+
|    2 | User_2  |    9.07 | False    | A          | 2025-01-02 00:00:00 |
+------+---------+---------+----------+------------+---------------------+
|    3 | User_3  |    1.43 | False    | C          | 2025-01-03 00:00:00 |
+------+---------+---------+----------+------------+---------------------+
|    4 | User_4  |   23.72 | False    | B          | 2025-01-04 00:00:00 |
+------+---------+---------+----------+------------+---------------------+
|    5 | User_5  |   94.83 | True     | B          | 2025-01-05 00:00:00 |
+------+---------+---------+----------+------------+---------------------+
|    6 | U

In [44]:
# Preview first 3 rows only
print("📄 Preview first 3 rows:")
%hdfs cat -n 3 /demo/data/complex_data.parquet

📄 Preview first 3 rows:
+------+--------+---------+----------+------------+---------------------+
|   id | name   |   score | active   | category   | timestamp           |
|    1 | User_1 |   91.52 | False    | B          | 2025-01-01 00:00:00 |
+------+--------+---------+----------+------------+---------------------+
|    2 | User_2 |    9.07 | False    | A          | 2025-01-02 00:00:00 |
+------+--------+---------+----------+------------+---------------------+
|    3 | User_3 |    1.43 | False    | C          | 2025-01-03 00:00:00 |
+------+--------+---------+----------+------------+---------------------+

... (showing first 3 of 3 rows)


## Error Handling: Directory Not Found

webhdfsmagic returns a **user-friendly error message** when you access a non-existent directory.

Instead of returning a long HTTP 404 traceback, the command's result is simply:
```
Directory not found: <path>
```

In the run below, the full traceback still appears in the `ERROR:` log output. The short message is the value the command returns, which makes it much easier to understand what went wrong.
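
The idea of turning an HTTP status into a short message can be sketched as follows. Only the 404 wording matches what this notebook shows; the other messages, and both helper names, are hypothetical and do not describe webhdfsmagic's internals.

```python
def describe_http_error(status_code: int, path: str) -> str:
    """Translate an HTTP status into a short message (hypothetical mapping)."""
    if status_code == 404:
        return f"Directory not found: {path}"
    if status_code in (401, 403):
        return f"Access denied: {path}"  # assumed wording, not from the library
    return f"HTTP error {status_code} for {path}"


def is_not_found(result) -> bool:
    """Detect the friendly 'not found' message in a command's return value."""
    return isinstance(result, str) and result.startswith("Directory not found:")


message = describe_http_error(404, "/nonexistent_path")
print(message)                # Directory not found: /nonexistent_path
print(is_not_found(message))  # True
```

A check like `is_not_found` lets a notebook branch on the result of `%hdfs ls` without parsing log output.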


In [45]:
# Test 1: List a non-existent directory
print("Test 1: Attempting to list a non-existent directory...")
result = %hdfs ls /nonexistent_path
print(f"Result: {result}")
print()

# Test 2: List another non-existent path
print("Test 2: Attempting to list another non-existent directory...")
result2 = %hdfs ls /user/notfound/data
print(f"Result: {result2}")


ERROR: ERROR in GET LISTSTATUS: HTTPError: 404 Client Error: Not Found for url: http://localhost:8080/gateway/default/webhdfs/v1/nonexistent_path?op=LISTSTATUS&user.name=hdfs
ERROR:     url: http://localhost:8080/gateway/default/webhdfs/v1/nonexistent_path
ERROR:     status_code: None
ERROR:     response_text: None
ERROR: Full traceback:
Traceback (most recent call last):
  File "/workspaces/webhdfsmagic/webhdfsmagic/client.py", line 112, in execute
    response.raise_for_status()
  File "/home/codespace/.local/lib/python3.12/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:8080/gateway/default/webhdfs/v1/nonexistent_path?op=LISTSTATUS&user.name=hdfs
ERROR: ERROR in GET LISTSTATUS: HTTPError: 404 Client Error: Not Found for url: http://localhost:8080/gateway/default/webhdfs/v1/user/notfound/data?op=LISTSTATUS&user.name=hdfs
ERROR:     

Test 1: Attempting to list a non-existent directory...
Result: Directory not found: /nonexistent_path

Test 2: Attempting to list another non-existent directory...
Result: Directory not found: /user/notfound/data


## 🚀 Step 12: Parallel Uploads & Downloads (Multi-threaded PUT/GET)

Starting from version 0.0.4, webhdfsmagic supports **parallel file transfers** using the `--threads` (or `-t`) option for both `put` and `get` commands.

This allows you to upload or download multiple files simultaneously, greatly speeding up operations on large datasets or many files.

**Key features:**
- Multi-threaded transfers for PUT and GET
- Syntax: `%hdfs put --threads N <local_files> <hdfs_dir>`
- Syntax: `%hdfs get --threads N <hdfs_files> <local_dir>`
- N = number of threads (e.g. 4, 8, 16)

Below, we demonstrate parallel upload and download with example commands and explanations.
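
For readers curious how multi-threaded transfers work in general, here is a minimal sketch built on `concurrent.futures`. It is an illustration of the technique, not webhdfsmagic's implementation; `transfer_all` and `fake_upload` are hypothetical names, and `fake_upload` only formats a message instead of contacting HDFS.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def transfer_all(paths, transfer_one, threads: int = 4) -> dict:
    """Run transfer_one(path) on a thread pool; map each path to its result or exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(transfer_one, path): path for path in paths}
        # Futures are collected in completion order, not submission order
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One failed file should not abort the remaining transfers
                results[path] = exc
    return results


def fake_upload(path: str) -> str:
    """Stand-in for a real upload; returns a message like the ones printed by put."""
    return f"{path} uploaded to /demo/data/"


print(transfer_all(["a.csv", "b.csv", "c.csv"], fake_upload, threads=2))
```

Collecting results as they complete is one reason the output of a parallel transfer is usually not in alphabetical order.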

In [46]:
# Parallel upload: PUT multiple files to HDFS using 4 threads
# This will upload all CSV files in the current directory to /demo/data in parallel
%hdfs put --threads 4 *.csv /demo/data/


downloaded_customers.csv uploaded to /demo/data/
old_data_1.csv uploaded to /demo/data/
customers.csv uploaded to /demo/data/
old_data_2.csv uploaded to /demo/data/
test_file_4.csv uploaded to /demo/data/
local_sales.csv uploaded to /demo/data/
sales.csv uploaded to /demo/data/
test_file_1.csv uploaded to /demo/data/
sales_20251220.csv uploaded to /demo/data/
sales_20251221.csv uploaded to /demo/data/
sales_20251219.csv uploaded to /demo/data/
test_file_3.csv uploaded to /demo/data/
old_data_3.csv uploaded to /demo/data/
test_file_5.csv uploaded to /demo/data/
test_file_2.csv uploaded to /demo/data/
temp_1.csv uploaded to /demo/data/
temp_2.csv uploaded to /demo/data/



In [47]:
# Parallel download: GET multiple files from HDFS using 4 threads
# This will download all files from /demo/data to the local ./downloads directory in parallel
%hdfs get --threads 4 /demo/data/* ./downloads/

# You can also use the short option -t
# Example: %hdfs put -t 8 *.tsv /demo/data/


complex_data.parquet downloaded to downloads/complex_data.parquet
customers.tsv downloaded to downloads/customers.tsv
downloaded_customers.csv downloaded to downloads/downloaded_customers.csv
customers.csv downloaded to downloads/customers.csv
old_data_2.csv downloaded to downloads/old_data_2.csv
old_data_3.csv downloaded to downloads/old_data_3.csv
old_data_1.csv downloaded to downloads/old_data_1.csv
local_sales.csv downloaded to downloads/local_sales.csv
sales_20251219.csv downloaded to downloads/sales_20251219.csv
products.parquet downloaded to downloads/products.parquet
sales_20251220.csv downloaded to downloads/sales_20251220.csv
sales.csv downloaded to downloads/sales.csv
sales_20251221.csv downloaded to downloads/sales_20251221.csv
temp_2.csv downloaded to downloads/temp_2.csv
temp_1.csv downloaded to downloads/temp_1.csv
test_file_2.csv downloaded to downloads/test_file_2.csv
test_file_1.csv downloaded to downloads/test_file_1.csv
test_file_3.csv downloaded to downloads/test_

## Cleanup (Optional)

⚠️ **Warning**: This will permanently delete all demo files and directories!

Run the cells below only if you want to clean up the demo workspace.

In [48]:
%hdfs rm -R /demo

'/demo deleted'

In [49]:
%hdfs ls  /demo

ERROR: ERROR in GET LISTSTATUS: HTTPError: 404 Client Error: Not Found for url: http://localhost:8080/gateway/default/webhdfs/v1/demo?op=LISTSTATUS&user.name=hdfs
ERROR:     url: http://localhost:8080/gateway/default/webhdfs/v1/demo
ERROR:     status_code: None
ERROR:     response_text: None
ERROR: Full traceback:
Traceback (most recent call last):
  File "/workspaces/webhdfsmagic/webhdfsmagic/client.py", line 112, in execute
    response.raise_for_status()
  File "/home/codespace/.local/lib/python3.12/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:8080/gateway/default/webhdfs/v1/demo?op=LISTSTATUS&user.name=hdfs


'Directory not found: /demo'