# Hands-On Session 2: Installing and Configuring Hadoop and the Structure of HDFS

## Objectives:
- Understand the steps for installing and configuring Hadoop.
- Practice using Hadoop and understand the structure of HDFS.
- Explore the Hadoop command line interface (CLI) and perform basic operations on HDFS.

### 1. Installing Hadoop in Standalone Mode
1. **Download Hadoop**: Visit [Apache Hadoop](https://hadoop.apache.org/releases.html) to download the latest release.
2. **Extract and Set Up**: Extract the archive, set the environment variables, and add Hadoop to your `$PATH`.
   ```bash
   export HADOOP_HOME=/path/to/hadoop
   export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
   export JAVA_HOME=/path/to/java
   ```
3. **Format HDFS**: Format the Hadoop file system once, before the first start (reformatting later erases the existing HDFS metadata):
   ```bash
   hdfs namenode -format
   ```
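4. **Start Hadoop**: Start the HDFS daemons (NameNode and DataNode) with the command below; on Windows the equivalent script is `start-dfs.cmd`. Strictly speaking, running these daemons on a single machine is what the Hadoop documentation calls pseudo-distributed mode, whereas pure standalone mode runs no daemons and uses the local file system.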
   ```bash
   start-dfs.sh
   ```
- **Task 1**: Format HDFS and run Hadoop in standalone mode. Verify the installation by running `hadoop version`.

In [None]:
C:\Users\byu>hadoop version
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c
Compiled by ubuntu on 2023-06-18T08:22Z
Compiled on platform linux-x86_64
Compiled with protoc 3.7.1
From source with checksum 5652179ad55f76cb287d9c633bb53bbd
This command was run using /C:/hadoop/hadoop-3.3.6/share/hadoop/common/hadoop-common-3.3.6.jar

C:\Users\byu>start-dfs.cmd
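
Before formatting or starting HDFS, it can help to confirm that the environment from step 2 is actually in place. The following is a minimal Python sketch, not part of the original exercise: the helper names `missing_hadoop_vars` and `hadoop_on_path` are hypothetical, and it assumes only that `HADOOP_HOME` and `JAVA_HOME` must be set and that the `hadoop` launcher should be reachable through `PATH`.

```python
import os
import shutil

# Variables exported in step 2 of the setup (assumed to be the required set).
REQUIRED_VARS = ("HADOOP_HOME", "JAVA_HOME")


def missing_hadoop_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def hadoop_on_path():
    """Return True if an executable named 'hadoop' can be found on PATH."""
    return shutil.which("hadoop") is not None


if __name__ == "__main__":
    print("Missing variables:", missing_hadoop_vars() or "none")
    print("hadoop found on PATH:", hadoop_on_path())
```

If the first line reports a missing variable, or the second prints `False`, revisit step 2 before running `hdfs namenode -format`.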


### 2. HDFS Structure and Basic Operations
HDFS is a distributed file system that stores large data sets across multiple machines so that they can be processed in parallel.
- **Basic HDFS Operations**:
   - Create a new directory in HDFS:
   ```bash
   hdfs dfs -mkdir -p /user/student
   ```
   - Upload a file to HDFS:
   ```bash
   hdfs dfs -put input.txt /user/student/
   ```
   - List the uploaded files:
   ```bash
   hdfs dfs -ls /user/student/
   ```
- **Task 2**: Create a directory in HDFS, upload a text file, display the file's contents, and delete the file.

In [None]:
C:\Users\byu>hdfs dfs -mkdir /user

C:\Users\byu>hdfs dfs -mkdir /user/mibyu

C:\Users\byu>echo "Djancuk" > byu.txt

C:\Users\byu>hdfs dfs -put byu.txt /user/mibyu/

C:\Users\byu>hdfs dfs -ls /user/mibyu/
Found 1 items
-rw-r--r--   1 byu supergroup         12 2024-09-03 21:23 /user/mibyu/byu.txt
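
The same directory, upload, and listing steps can be scripted. The sketch below is one possible way to wrap the `hdfs dfs` CLI from Python. The helper names `hdfs_dfs_args` and `run_hdfs_dfs` are hypothetical, and the sketch assumes the `hdfs` launcher is on `PATH` and that HDFS is already running.

```python
import shutil
import subprocess


def hdfs_dfs_args(operation, *operands):
    """Build the argument list for `hdfs dfs -<operation> <operands...>`."""
    return ["hdfs", "dfs", f"-{operation}", *operands]


def run_hdfs_dfs(operation, *operands):
    """Run an `hdfs dfs` command and return its standard output.

    Raises FileNotFoundError if the launcher is not on PATH and
    subprocess.CalledProcessError if the command exits with an error.
    """
    # shutil.which also resolves `hdfs.cmd` on Windows via PATHEXT.
    launcher = shutil.which("hdfs")
    if launcher is None:
        raise FileNotFoundError("the 'hdfs' launcher was not found on PATH")
    args = [launcher, *hdfs_dfs_args(operation, *operands)[1:]]
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Example usage (requires a running HDFS, so it is left commented out):
# run_hdfs_dfs("mkdir", "-p", "/user/student")
# run_hdfs_dfs("put", "input.txt", "/user/student/")
# print(run_hdfs_dfs("ls", "/user/student/"))
```

Passing `-p` to `-mkdir` creates missing parent directories, which avoids creating `/user` in a separate step as the transcript above does.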


### 3. File Operations in HDFS
Perform the following operations on the uploaded file:
1. **View the File's Contents**:
   ```bash
   hdfs dfs -cat /user/student/input.txt
   ```
2. **Copy the File**:
   ```bash
   hdfs dfs -cp /user/student/input.txt /user/student/input_copy.txt
   ```
3. **Delete the File from HDFS**:
   ```bash
   hdfs dfs -rm /user/student/input_copy.txt
   ```
- **Task 3**: Display the contents of a file, copy it, and delete the copy in HDFS.

In [None]:

C:\Users\byu>hdfs dfs -cat byu.txt
cat: `byu.txt': No such file or directory

C:\Users\byu>hdfs dfs -cat /user/mibyu/byu.txt
"Djancuk"

C:\Users\byu>hdfs dfs -cp /user/mibyu/byu.txt /user/mibyu/byu01.txt

C:\Users\byu>hdfs dfs -cat /user/mibyu/byu01.txt
"Djancuk"

C:\Users\byu>hdfs dfs -rm /user/mibyu/byu01.txt
Deleted /user/mibyu/byu01.txt
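
The first `-cat` in the transcript fails because the HDFS shell resolves a relative path against the current user's HDFS home directory, which by default is `/user/<username>`. The sketch below mimics that rule in Python. The function `resolve_hdfs_path` is a hypothetical illustration, assumes the default `/user` home prefix, and does not handle fully qualified `hdfs://` URIs.

```python
import posixpath


def resolve_hdfs_path(path, user):
    """Resolve a path roughly as the HDFS shell would.

    Absolute paths are kept; relative paths are joined onto the user's
    home directory, assumed here to be /user/<user> (the default).
    """
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", user, path))


# The transcript's user is `byu`, but the file was uploaded to /user/mibyu.
print(resolve_hdfs_path("byu.txt", "byu"))
print(resolve_hdfs_path("/user/mibyu/byu.txt", "byu"))
```

Under these assumptions, `byu.txt` resolves to a location inside `/user/byu`, a directory that was never created, so the absolute path `/user/mibyu/byu.txt` is needed.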

### 4. Analyzing the Storage Structure of HDFS
To understand how HDFS manages storage, use the following commands:
- **Show HDFS storage information**:
   ```bash
   hdfs dfsadmin -report
   ```
- **Show block status**:
   ```bash
   hdfs fsck / -files -blocks -locations
   ```
- **Task 4**: Analyze the storage structure of HDFS and write a report based on the output of `hdfs dfsadmin -report`.

In [None]:
C:\Users\byu>hdfs dfsadmin -report
Configured Capacity: 510938574848 (475.85 GB)
Present Capacity: 9155469655 (8.53 GB)
DFS Remaining: 9155469312 (8.53 GB)
DFS Used: 343 (343 B)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
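
For the Task 4 report, the capacity figures can be pulled out of the `hdfs dfsadmin -report` text and compared. The sketch below is a hypothetical helper, not an official parser: `parse_report_summary` reads only top-level `Name: <bytes>` lines, and the sample text is copied from the transcript above.

```python
import re

# Matches top-level lines such as "DFS Used: 343 (343 B)"; indented lines
# and percentage lines such as "DFS Used%: 0.00%" are deliberately skipped.
SUMMARY_LINE = re.compile(r"^([A-Za-z][\w %]*):[ \t]+(\d+)(?:[ \t]+\(.*\))?[ \t]*$")


def parse_report_summary(text):
    """Return a dict mapping each top-level field name to its byte count."""
    summary = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            summary[match.group(1)] = int(match.group(2))
    return summary


SAMPLE = """\
Configured Capacity: 510938574848 (475.85 GB)
Present Capacity: 9155469655 (8.53 GB)
DFS Remaining: 9155469312 (8.53 GB)
DFS Used: 343 (343 B)
DFS Used%: 0.00%
"""

summary = parse_report_summary(SAMPLE)
# Capacity that is configured but not currently available to HDFS.
unavailable = summary["Configured Capacity"] - summary["Present Capacity"]
print(f"Configured but unavailable to HDFS: {unavailable / 1024**3:.2f} GiB")
```

In this transcript, present capacity minus remaining capacity equals the 343 bytes reported as `DFS Used`. The large gap between configured and present capacity most likely reflects disk space already taken by non-HDFS files on the same drive.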

In [None]:
C:\Users\byu>hdfs fsck / -files -blocks -locations

Status: HEALTHY
 Number of data-nodes:  1
 Number of racks:               1
 Total dirs:                    3
 Total symlinks:                0

Replicated Blocks:
 Total size:    12 B
 Total files:   1
 Total blocks (validated):      1 (avg. block size 12 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     1.0
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Blocks queued for replication: 0

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:   0
 Total block groups (validated):        0
 Minimally erasure-coded block groups:  0
 Over-erasure-coded block groups:       0
 Under-erasure-coded block groups:      0
 Unsatisfactory placement block groups: 0
 Average block group size:      0.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal blocks:       0
 Blocks queued for replication: 0
FSCK ended at Tue Sep 03 21:39:21 WIB 2024 in 23 milliseconds
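
The `fsck` summary reports one block for the single 12-byte file. HDFS splits a file into blocks of at most the configured block size, so the block count can be estimated with ceiling division. The sketch below is a minimal illustration that assumes the Hadoop 3.x default `dfs.blocksize` of 128 MB; `block_count` is a hypothetical helper name.

```python
# Default dfs.blocksize in Hadoop 3.x (assumed not overridden in hdfs-site.xml).
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def block_count(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return how many HDFS blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    # Ceiling division using integers only; an empty file has no blocks.
    return (file_size + block_size - 1) // block_size


print(block_count(12))                      # the 12-byte file from the transcript
print(block_count(300 * 1024 * 1024))       # a hypothetical 300 MB file
```

A block holds only as many bytes as the file provides, which is why `fsck` shows an average block size of 12 B rather than 128 MB.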

### 5. Additional Task: Integrating Hadoop with Spark
- Install Spark and configure it to work with Hadoop. Then run a simple operation in Spark that processes data stored in HDFS.

In [None]:
# Import SparkSession
from pyspark.sql import SparkSession

# Create the Spark session
spark = SparkSession.builder \
    .appName("Tampilkan Isi File HDFS") \
    .getOrCreate()

# Read a text file from HDFS, addressing the NameNode by IP address and port
data = spark.read.text("hdfs://127.0.0.1:9000/user/mibyu/byu.txt")

# Show the file's contents without truncating long lines
data.show(truncate=False)

# Stop the Spark session
spark.stop()
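
The hard-coded address in the cell above must match the NameNode address configured as `fs.defaultFS` in `core-site.xml`. The sketch below builds that URI from its parts so the host and port are easy to change. The helper `hdfs_uri` is hypothetical, and its defaults of `127.0.0.1` and `9000` are taken from the cell above rather than from any Hadoop default.

```python
def hdfs_uri(path, host="127.0.0.1", port=9000):
    """Build an hdfs:// URI for an absolute HDFS path.

    host and port are assumptions; they must match fs.defaultFS
    in the cluster's core-site.xml.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, e.g. /user/mibyu/byu.txt")
    return f"hdfs://{host}:{port}{path}"


print(hdfs_uri("/user/mibyu/byu.txt"))

# With an active SparkSession named `spark`, the file could then be read as:
# data = spark.read.text(hdfs_uri("/user/mibyu/byu.txt"))
```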