---
layout: page
title: Introduction to Hadoop
subtitle: Understanding HDFS Files and Directories
minutes: 10
---
> ## Learning Objectives {.objectives}
>
> *   Understand how files and directories in HDFS relate to
>     files and directories in a Linux file system.

More than just a file storage and management system, HDFS provides the
infrastructure that enables parallel processing of massive amounts of
data.

<img src="fig/HDFSBlockView.png"
     alt="HDFSBlockView"
     style="height:500px">

To enable large-scale processing of big data, Hadoop takes a straightforward
approach in HDFS: it divides a very large data file into smaller blocks and
distributes these blocks across a cluster of computers (the Hadoop cluster).
Each block is replicated, so that if any individual computer fails, enough
copies of the data remain on the other computers for operations to continue
uninterrupted.
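
The block-and-replica idea described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function names, the node names, the 128 MiB block size, and the round-robin placement are all assumptions made for the example. HDFS's actual placement policy is more sophisticated.

```python
# Illustrative sketch only; names and policy here are hypothetical.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)
REPLICATION = 2                 # assumed number of copies kept per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block of a file."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def blocks_still_available(placement, failed_node):
    """True if every block keeps at least one copy after one node fails."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


nodes = ["node-a", "node-b", "node-c"]          # hypothetical cluster
blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MiB file
placement = place_replicas(len(blocks), nodes)
print(len(blocks), "blocks:", placement)
print("survives loss of node-b:", blocks_still_available(placement, "node-b"))
```

Under these assumptions a 300 MiB file becomes three blocks, while a file smaller than the block size occupies a single block.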

Checking the block status of the files under **/user/lngo/intro-to-hadoop**:
    

    !ssh dsciu001 hdfs fsck /user/lngo/intro-to-hadoop -files -blocks -locations

    Connecting to namenode via http://dscim002.palmetto.clemson.edu:50070/fsck?ugi=lngo&files=1&blocks=1&locations=1&path=%2Fuser%2Flngo%2Fintro-to-hadoop
    FSCK started by lngo (auth:KERBEROS_SSL) from /10.125.8.212 for path /user/lngo/intro-to-hadoop at Fri Jul 22 15:04:10 EDT 2016
    /user/lngo/intro-to-hadoop <dir>
    /user/lngo/intro-to-hadoop/gutenberg-shakespeare.txt 5447744 bytes, 1 block(s):  OK
    0. BP-1143747467-10.125.40.142-1413584797204:blk_1099953286_26224275 len=5447744 repl=2 [DatanodeInfoWithStorage[10.125.8.197:1019,DS-9d2f85fa-af96-41a0-9a7a-362e07fb0721,DISK], DatanodeInfoWithStorage[10.125.8.196:1019,DS-af361ac7-1213-46f8-8147-2c1356a2315e,DISK]]

    /user/lngo/intro-to-hadoop/ml-10M100K <dir>
    /user/lngo/intro-to-hadoop/ml-10M100K/README.html 11563 bytes, 1 block(s):  OK
    0. BP-1143747467-10.125.40.142-1413584797204:blk_1099955769_26226793 len=11563 repl=2 [DatanodeInfoWithStorage[10.125.8.206:1019,DS-acc010e2-e3c6-48ae-88f4-48e48e071dfa,DISK], DatanodeInfoWithStorage[10.125.8.224:

To take advantage of data locality in this distributed, block-based
approach, it is critical to minimize the need to transfer data between the
computers storing these blocks: the computation should run where the data
already resides. A programming approach called ***mapreduce***, popularized
by Google, makes this possible.
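
As a rough illustration of the mapreduce idea, the sketch below counts words across two text "blocks" in plain Python. It is a single-process toy under stated assumptions, not Apache MapReduce code: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample strings are made up for this example. In a real cluster, the map step would run on the computers that already hold each block.

```python
from collections import defaultdict
from functools import reduce


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced separately."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the list of values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Two hypothetical blocks of the same file, each mapped independently.
blocks = ["to be or not", "to be that is"]
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Only the small (word, count) pairs need to move between steps; the blocks themselves never leave the place where they are mapped.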


> ## mapreduce vs Apache MapReduce {.callout}
>
> It is important to distinguish between the mapreduce programming
> paradigm and the Apache MapReduce implementation. The mapreduce programming
> paradigm covers any implementation approach that ***maps*** the same
> operation onto the individual elements of a data collection and then
> ***reduces*** the resulting data to a final, simplified result. For example,
> Apache Spark, the highly touted "MapReduce killer", uses in-memory
> operations to implement its mapping and reducing capabilities. Apache
> MapReduce is the default implementation of the mapreduce
> paradigm for Hadoop.