-
Notifications
You must be signed in to change notification settings - Fork 1
/
HDFS
79 lines (62 loc) · 4.77 KB
/
HDFS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
1. What is name node?
Name Node in Hadoop is the master node, which holds metadata of all files in cluster. Manages the data node and take care of replication factor.
2. Data Node
Data Nodes are the slave nodes in HDFS which stores actual data.
3. Secondary Name Node/ check point node
It is helper deamon for name node. It constantly read all meta data from name node. It downloads the EditLogs from the NameNode at regular intervals and applies to FsImage. The new FsImage is copied back to the NameNode.
4. Job Tracker/ Resource Manager
Job Tracker is the master daemon for both Job resource management and scheduling/monitoring of jobs
JobTracker receives the requests for MapReduce execution from the client, talks to the NameNode to determine the location of the data and finds the best TaskTracker nodes to execute tasks.
5. TaskTracker/Node Manager
It runs on DataNode and executes Mapper and Reducer tasks.
6. basic parameters of a Mapper: LongWritable and Text Text and IntWritable
7. common input formats defined in Hadoop: TextInputFormat KeyValueInputFormat SequenceFileInputFormat
8. sequence file in Hadoop:
To store binary key/value pairs, sequence file is used. Unlike regular compressed file, sequence file support splitting even when the data inside the file is compressed.
9. Hadoop’s three configuration files: core-site.xml mapred-site.xml hdfs-site.xml
10. Parquet vs ORC
i. Parquet compress 62% where ORC compress 78%
ii. Parquet default compression is SNAPPY
iii. ORC+Zlib has better performance than Paqruet + Snappy
iv. Hive has a vectorized ORC reader but no vectorized parquet reader.
v. Spark has a vectorized parquet reader and no vectorized ORC reader.
vi. Spark performs best with parquet, hive performs best with ORC.
11. Compression Techniques:
i. GZIP: Compressed data is not splittable and hence not suitable for MapReduce jobs
Good choice for Cold data which is infrequently accessed
ii. BZIP2: Compressed data is splittable.
Even though the compressed data is splittable, it is generally not suited for MR jobs because of high compression/decompression time.
iii. LZO: Compressed data is splittable if an appropriate indexing algorithm is used.
Due to compressed data is splittable and low ompression/decompression time, it is best suited for MR jobs
iv. Snappy: Compressed data is not splittable if used with normal file like .txt, but splittable with Container file formats like Avro and SequenceFile.
12. Kafka version used: 2.11 /0.8.2
13. cluster node 30
cluster size 60 TB
14. What is rack?
The Rack is the collection of around 40-50 DataNodes connected using the same network switch. If the network goes down, the whole rack will be unavailable.
A large Hadoop cluster is deployed in multiple racks. Rack awareness policies which says:
--> Not more than one replica be placed on one node.
--> Not more than two replicas are placed on the same rack.
--> Also, the number of racks used for block replication should always be smaller than the number of replicas.
15. How can we restart name node and demons in hdfs?
First stop
$HADOOP_HOME/bin/stop.dfs.sh
$HADOOP_HOME/bin/hadoop-daemon.sh stop namenode
Then start again
$HADOOP_HOME/bin/start.dfs.sh
$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
To restart all daemons use the following command which deprecated
$HADOOP_HOME/bin/stop.all.sh
$HADOOP_HOME/bin/start.all.sh
16. how to check health of any file system in hdfs?
HDFS fsck is used to check the health of the file system, to find missing files, over replicated, under replicated and corrupted blocks.
hdfs fsck /
17. What is Xms and Xmx memory in JVM?
-xmx and -xms are the parameters used to adjust the heap size.
-Xms: It is used for setting the initial and minimum heap size. This means that your JVM will be started with Xms amount of memory. It is recommended to set the minimum heap size equivalent to the maximum heap size in order to minimize the garbage collection.
-Xmx: It is used for setting the maximum heap size. This means that your JVM will be able to use a maximum of Xmx amount of memory. The performance will decrease if the max heap value is set lower than the amount of live data. It will force frequent garbage collections in order to free up space.
So it is like when memory usage exceeds beyond Xmx we get jvm out of memory exception. When it tries to exceed that, although it may collect garbage to try to free up enough memory. If there still isn't enough memory to satisfy the request and the heap has already reached the maximum size, an OutOfMemoryError will occur.
The Xms flag has no default value, and Xmx typically has a default value of 256 MB. A common use for these flags is when you encounter a java.lang.OutOfMemoryError.
We can set them as below:
-Xms64m or -Xms64M
-Xmx1g or -Xmx1G