# Part 1: Working with HDFS from the command line #

In [1]:
# Report the overall HDFS capacity and the status of each datanode
!hdfs dfsadmin -report

Configured Capacity: 42241163264 (39.34 GB)
Present Capacity: 37587231919 (35.01 GB)
DFS Remaining: 37397528576 (34.83 GB)
DFS Used: 189703343 (180.92 MB)
DFS Used%: 0.50%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: sparkbox
Decommission Status : Normal
Configured Capacity: 42241163264 (39.34 GB)
DFS Used: 189703343 (180.92 MB)
Non DFS Used: 4653931345 (4.33 GB)
DFS Remaining: 37397528576 (34.83 GB)
DFS Used%: 0.45%
DFS Remaining%: 88.53%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 06 16:40:53 UTC 2016
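
The report shows two different `DFS Used%` values: 0.50% in the summary and 0.45% for the datanode. The figures are consistent with the summary dividing by the present capacity and the datanode line dividing by the configured capacity. The sketch below only redoes that arithmetic on the numbers printed above; it is not taken from Hadoop's source, and the variable and function names are illustrative.

```python
# Values copied from the dfsadmin report above (all in bytes).
configured_capacity = 42241163264
present_capacity = 37587231919
dfs_used = 189703343
dfs_remaining = 37397528576
non_dfs_used = 4653931345


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# In this report, present capacity equals used + remaining, and the
# non-DFS usage equals configured capacity minus present capacity.
assert dfs_used + dfs_remaining == present_capacity
assert configured_capacity - present_capacity == non_dfs_used

summary_used_pct = percent(dfs_used, present_capacity)        # summary line
datanode_used_pct = percent(dfs_used, configured_capacity)    # datanode line
datanode_remaining_pct = percent(dfs_remaining, configured_capacity)

print(summary_used_pct)
print(datanode_used_pct)
print(datanode_remaining_pct)
```

The three values correspond to the 0.50%, 0.45% and 88.53% shown in the report.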




In [2]:
# list the content of the root directory
!hdfs dfs -ls /

Found 2 items
drwxr-xr-x   - vagrant supergroup          0 2016-05-11 19:52 /spark
drwxr-xr-x   - vagrant supergroup          0 2016-10-06 16:40 /user


In [1]:
# -df: amount of available disk space in HDFS
!hdfs dfs -df -h /

Filesystem               Size     Used  Available  Use%
hdfs://localhost:9000  39.3 G  180.9 M     34.8 G    0%


In [2]:
# -du: size of each folder (du stands for disk usage)
!hdfs dfs -du -h /

179.0 M  /spark
473.4 K  /user
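
The `-h` flag used with `-df` and `-du` prints sizes with binary prefixes (1 K = 1024 bytes). As a rough illustration, the following sketch approximates that formatting in Python. `human_size` is a hypothetical helper written for this example, not Hadoop's own implementation, and edge cases may be formatted differently.

```python
def human_size(num_bytes):
    """Format a byte count with binary prefixes, e.g. 484810 -> '473.4 K'."""
    units = ["", "K", "M", "G", "T", "P"]
    size = float(num_bytes)
    unit = units[0]
    for unit in units:
        # Stop once the value fits under 1024 or we run out of prefixes.
        if size < 1024 or unit == units[-1]:
            break
        size /= 1024
    if unit == "":
        return "%d" % num_bytes
    return "%.1f %s" % (size, unit)


# Byte counts taken from the dfsadmin report shown earlier.
print(human_size(42241163264))  # configured capacity
print(human_size(189703343))    # DFS used
print(human_size(37397528576))  # DFS remaining
```

For the three byte counts above, this gives the same strings as the `-df -h` output: `39.3 G`, `180.9 M` and `34.8 G`.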


In [3]:
# Make a datasets directory
!hdfs dfs -mkdir /datasets

In [4]:
# Download the complete works of Shakespeare from Project Gutenberg to the local disk
!wget -q http://www.gutenberg.org/cache/epub/100/pg100.txt \
    -O ../datasets/shakespeare_all.txt

In [5]:
# Copy the local files into the /datasets directory on HDFS using -put
!hdfs dfs -put ../datasets/shakespeare_all.txt \
    /datasets/shakespeare_all.txt

!hdfs dfs -put ../datasets/hadoop_git_readme.txt \
    /datasets/hadoop_git_readme.txt

In [6]:
!hdfs dfs -ls /datasets

Found 2 items
-rw-r--r--   1 vagrant supergroup       1365 2016-10-06 16:58 /datasets/hadoop_git_readme.txt
-rw-r--r--   1 vagrant supergroup    5589889 2016-10-06 16:58 /datasets/shakespeare_all.txt


In [7]:
# Print the file to standard output using -cat
# and count its lines with wc -l
!hdfs dfs -cat /datasets/hadoop_git_readme.txt | wc -l

30


In [8]:
# Concatenate files that are on HDFS and our local file system
!hdfs dfs -cat \
    hdfs:///datasets/hadoop_git_readme.txt \
    file:///home/vagrant/datasets/hadoop_git_readme.txt | wc -l

60


In [9]:
# Copy in HDFS using -cp
!hdfs dfs -cp /datasets/hadoop_git_readme.txt \
    /datasets/copy_hadoop_git_readme.txt

In [10]:
# Remove a file using -rm
!hdfs dfs -rm /datasets/copy_hadoop_git_readme.txt

16/10/06 17:18:58 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /datasets/copy_hadoop_git_readme.txt


In [11]:
# Empty the trash using -expunge
!hdfs dfs -expunge

16/10/06 17:19:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.


In [12]:
# Get a file from HDFS to the local machine
!hdfs dfs -get /datasets/hadoop_git_readme.txt \
    /tmp/hadoop_git_readme.txt

In [13]:
# Display the last KB of data using -tail
!hdfs dfs -tail /datasets/hadoop_git_readme.txt

ntry, of 
encryption software.  BEFORE using any encryption software, please 
check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to 
see if this is permitted.  See <http://www.wassenaar.org/> for more
information.

The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity 
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms.  The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception
ENC Technology Software Unrestricted (TSU) exception (see the BIS 
Export Administration Regulations, Section 740.13) for both object 
code and source code.

The following provides more details on the included cryptographic
software:
  Hadoop Core uses the SSL libraries from

# Part 2: Working with HDFS using Snakebite #

In [14]:
# pip install snakebite
from snakebite.client import Client
# Instantiate the client with the host and port (9000) of the NameNode.
# NameNode: the master node; it holds the metadata of the files in the
# distributed filesystem and does not store any actual data.
# DataNodes: store the data as blocks (128 MB by default in Hadoop 2.x,
# 64 MB in Hadoop 1.x).
client = Client("localhost", 9000)

In [15]:
# Get the server defaults (block size, replication factor, trash interval, ...)
client.serverdefaults()

{'blockSize': 134217728L,
 'bytesPerChecksum': 512,
 'checksumType': 2,
 'encryptDataTransfer': False,
 'fileBufferSize': 4096,
 'replication': 1,
 'trashInterval': 0L,
 'writePacketSize': 65536}
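
To make the `blockSize` and `replication` values concrete, here is a minimal sketch of how many blocks a file is split into and how many bytes of file data end up on the datanodes. It ignores checksum and metadata overhead, and the helper names are hypothetical. The file size used is that of `shakespeare_all.txt` from the `-ls` listing in Part 1.

```python
BLOCK_SIZE = 134217728  # 'blockSize' from serverdefaults() (128 MB)
REPLICATION = 1         # 'replication' from serverdefaults()


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes is split into."""
    # Integer ceiling division; an empty file needs no blocks.
    return (file_size + block_size - 1) // block_size


def replicated_bytes(file_size, replication=REPLICATION):
    """Bytes of file data stored across all replicas (overhead ignored)."""
    return file_size * replication


shakespeare_size = 5589889  # bytes, from the -ls listing in Part 1

print(blocks_needed(shakespeare_size))
print(replicated_bytes(shakespeare_size))
```

With a 128 MB block size the whole file fits in a single block, so small files like the ones used here never get split.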

In [16]:
# List the files and directories in root
for x in client.ls(['/']):
    print(x['path'])

/datasets
/spark
/user


In [17]:
# df (disk free): capacity and usage of the whole filesystem
client.df()

{'capacity': 42241163264L,
 'corrupt_blocks': 0L,
 'filesystem': 'hdfs://localhost:9000',
 'missing_blocks': 0L,
 'remaining': 37384859648L,
 'under_replicated': 0L,
 'used': 195366912L}

In [18]:
# du (disk usage): size in bytes of each entry under the given paths
list(client.du(["/"]))

[{'length': 5591254L, 'path': '/datasets'},
 {'length': 187698038L, 'path': '/spark'},
 {'length': 484810L, 'path': '/user'}]
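
As a quick sanity check on the `du` output, the length reported for `/datasets` should equal the combined size of the two files uploaded in Part 1. The sketch below reuses the numbers shown in this notebook (with the Python 2 `L` suffix dropped so that it also runs on Python 3); the variable names are illustrative.

```python
# Result of client.du(["/"]) shown above.
du_output = [
    {'length': 5591254, 'path': '/datasets'},
    {'length': 187698038, 'path': '/spark'},
    {'length': 484810, 'path': '/user'},
]

# File sizes from the `hdfs dfs -ls /datasets` listing in Part 1.
file_sizes = {
    '/datasets/hadoop_git_readme.txt': 1365,
    '/datasets/shakespeare_all.txt': 5589889,
}

# Index the du entries by path for easy lookup.
length_by_path = dict((entry['path'], entry['length']) for entry in du_output)

datasets_bytes = sum(file_sizes.values())
total_bytes = sum(entry['length'] for entry in du_output)

print(datasets_bytes == length_by_path['/datasets'])
print(total_bytes)
```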

In [19]:
# Delete a file from HDFS.
# delete() returns a generator, so nothing happens until it is consumed;
# the outcome of each deletion is reported in the 'result' field.
next(client.delete(['/datasets/shakespeare_all.txt']))

{'path': '/datasets/shakespeare_all.txt', 'result': True}

In [20]:
# Copy a file from HDFS to the local filesystem. The call returns a generator,
# so consume it and check the 'result' field of the returned dict for success.
next(client.copyToLocal(['/datasets/hadoop_git_readme.txt'],
                        '/tmp/hadoop_git_readme_2.txt'))


{'error': '',
 'path': '/tmp/hadoop_git_readme_2.txt',
 'result': True,
 'source_path': '/datasets/hadoop_git_readme.txt'}
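
Because these calls yield one result dict per path, it can be handy to consume the generator in one place and separate successes from failures. The following is a minimal sketch under that assumption: `collect_results` and `fake_results` are hypothetical helpers, and `fake_results` stands in for a real snakebite generator so the example runs without a cluster.

```python
def collect_results(results):
    """Consume an iterable of result dicts.

    Returns (succeeded, failed): a list of paths whose 'result' is truthy,
    and a list of (path, error) tuples for the others.
    """
    succeeded, failed = [], []
    for item in results:
        if item.get('result'):
            succeeded.append(item['path'])
        else:
            failed.append((item['path'], item.get('error', '')))
    return succeeded, failed


def fake_results():
    """Stand-in generator yielding dicts shaped like the outputs above."""
    yield {'path': '/tmp/hadoop_git_readme_2.txt', 'result': True, 'error': ''}
    # Hypothetical failure entry, invented for illustration only.
    yield {'path': '/tmp/other.txt', 'result': False, 'error': 'example error'}


succeeded, failed = collect_results(fake_results())
print(succeeded)
print(failed)
```

With a real client, the same helper could be called as `collect_results(client.delete([...]))`.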

In [21]:
# Create a directory; list() consumes the generator so the call is executed
list(client.mkdir(['/datasets_2']))

[{'path': '/datasets_2', 'result': True}]

In [22]:
# Delete every path matching a glob pattern; recurse=True also removes
# directories together with their contents
list(client.delete(['/datasets*'], recurse=True))

[{'path': '/datasets', 'result': True},
 {'path': '/datasets_2', 'result': True}]
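
To preview which paths a pattern such as `/datasets*` would match before deleting anything, the pattern can be tested against a listing first. The sketch below uses Python's standard `fnmatch` module as a stand-in. This is an assumption made for illustration: snakebite performs its own glob expansion, which may differ in corner cases.

```python
import fnmatch

# Root-level paths as listed by client.ls(['/']), plus /datasets_2 created above.
paths = ['/datasets', '/datasets_2', '/spark', '/user']
pattern = '/datasets*'

# fnmatchcase does a case-sensitive match regardless of the operating system.
matched = [p for p in paths if fnmatch.fnmatchcase(p, pattern)]
print(matched)
```

For this listing the pattern selects `/datasets` and `/datasets_2`, the same two paths reported by the delete call above.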