# Installing OpenJDK and Spark

Install OpenJDK first: Spark is written in Scala and runs on the Java Virtual Machine (JVM).

In [1]:
# OpenJDK
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Download the Spark build that bundles Hadoop, then extract it.

In [2]:
# Spark
!wget -q https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
!tar xf spark-3.2.1-bin-hadoop2.7.tgz

# Configuring Spark so it can run

Set the `JAVA_HOME` and `SPARK_HOME` environment variables.

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop2.7"  # Spark root directory, not its bin/ subfolder
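
A wrong `JAVA_HOME` or `SPARK_HOME` usually only surfaces later as an unclear error when the session starts, so a quick check right after setting them can help. The sketch below is not part of the original notebook. `check_spark_env` is a hypothetical helper, and it assumes the usual layout in which `bin/java` sits under `JAVA_HOME` and `bin/spark-submit` sits under `SPARK_HOME`.

```python
import os

def check_spark_env(environ=os.environ):
    """Return a list of problems found in JAVA_HOME / SPARK_HOME (empty if none)."""
    problems = []
    # Assumed layout: each home directory contains the given marker file.
    expected = (
        ("JAVA_HOME", os.path.join("bin", "java")),
        ("SPARK_HOME", os.path.join("bin", "spark-submit")),
    )
    for name, marker in expected:
        home = environ.get(name)
        if not home:
            problems.append(f"{name} is not set")
        elif not os.path.isfile(os.path.join(home, marker)):
            problems.append(f"{name}={home!r} does not contain {marker}")
    return problems

# Print a warning for every problem found in the current environment.
for problem in check_spark_env():
    print("WARNING:", problem)
```

An empty result means both variables point at directories that look like a JDK and a Spark installation. Note that a `SPARK_HOME` ending in `/bin` fails this check, because `spark-submit` would then be looked up under `bin/bin`.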

Install the **findspark** library, which locates the Spark installation and adds PySpark to `sys.path` so that it can be imported like a regular library.

In [4]:
!pip install -q findspark
import findspark
findspark.init("/content/spark-3.2.1-bin-hadoop2.7/", edit_rc=True)
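
To make the role of findspark less opaque, here is a rough sketch of the core idea: put Spark's bundled Python packages on `sys.path`. It is an illustration under assumptions, not findspark's actual source. `add_pyspark_to_path` is a hypothetical name, and the layout (`python/` plus a `py4j-*.zip` under `python/lib/`) is assumed from a standard Spark distribution.

```python
import glob
import os
import sys

def add_pyspark_to_path(spark_home):
    """Put Spark's bundled Python packages on sys.path (illustrative helper)."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip archive under python/lib (assumed layout).
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    new_entries = [spark_python] + py4j_zips
    for entry in new_entries:
        if entry not in sys.path:
            sys.path.insert(0, entry)
    # Record the location so child processes can find Spark as well.
    os.environ["SPARK_HOME"] = spark_home
    return new_entries
```

Calling `add_pyspark_to_path("/content/spark-3.2.1-bin-hadoop2.7")` would then make `import pyspark` resolve against the extracted distribution, provided the archive has that layout.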

Import **SparkSession** from `pyspark.sql` and create a SparkSession.

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

Evaluate `spark` to display information about the Spark session.

In [6]:
spark

# Downloading the `sample-books.json` dataset

In [7]:
# By default the file is saved to the working directory, i.e. /content/sample-books.json
!wget --continue https://raw.githubusercontent.com/rizqinugroho/learning-hadoop/main/learning-spark/sample-books.json

--2022-02-08 13:42:06--  https://raw.githubusercontent.com/rizqinugroho/learning-hadoop/main/learning-spark/sample-books.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1578 (1.5K) [text/plain]
Saving to: ‘sample-books.json’


2022-02-08 13:42:06 (23.3 MB/s) - ‘sample-books.json’ saved [1578/1578]



# Task 1: Load the dataset into a Spark DataFrame and display it

In [8]:
df = spark.read.option("multiline","true").json("/content/sample-books.json")
df.show()

+--------------------+-----------------+-----+--------------------+------------+
|              author|          edition|price|               title|year_written|
+--------------------+-----------------+-----+--------------------+------------+
|        Austen, Jane|          Penguin| 18.2|    Northanger Abbey|        1814|
|        Tolstoy, Leo|          Penguin| 12.7|       War and Peace|        1865|
|        Tolstoy, Leo|          Penguin| 13.5|       Anna Karenina|        1875|
|     Woolf, Virginia|   Harcourt Brace| 25.0|       Mrs. Dalloway|        1925|
|Cunnningham, Michael|   Harcourt Brace|12.35|           The Hours|        1999|
|         Twain, Mark|          Penguin| 5.76|    Huckleberry Finn|        1865|
|    Dickens, Charles|     Random House| 5.75|         Bleak House|        1870|
|         Twain, Mark|     Random House| 7.75|          Tom Sawyer|        1862|
|     Woolf, Virginia|          Penguin| 29.0| A Room of One's Own|        1922|
|       Rowling, J.K.|   Har
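
The `multiline` option in the cell above matters because Spark's JSON reader by default expects one complete JSON record per line. A record that spans several lines, as in a pretty-printed file, cannot be parsed line by line. The sketch below illustrates this with Python's `json` module only. The two records are made up, and it is an assumption that `sample-books.json` is pretty-printed in a similar way.

```python
import json

# Hypothetical sample: a pretty-printed JSON array whose records span several lines.
pretty = """[
  {
    "title": "Example A",
    "year_written": 1900
  },
  {
    "title": "Example B",
    "year_written": 2001
  }
]"""

def parse_line_by_line(text):
    """Mimic a line-delimited reader: each line must be a complete JSON value."""
    records, bad_lines = [], 0
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines += 1
    return records, bad_lines

records, bad_lines = parse_line_by_line(pretty)
print(len(records), "records parsed line by line,", bad_lines, "unparseable lines")

# Parsing the whole text as one document recovers every record.
print(len(json.loads(pretty)), "records when parsed as a single document")
```

No single line of the pretty-printed text is valid JSON on its own, which is why a file in this shape needs to be read as a whole document.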

# Task 2: Show the rows where the year written (`year_written`) is earlier than 2000

In [9]:
df_filtered = df.filter("year_written < 2000")
df_filtered.show()

+--------------------+-----------------+-----+--------------------+------------+
|              author|          edition|price|               title|year_written|
+--------------------+-----------------+-----+--------------------+------------+
|        Austen, Jane|          Penguin| 18.2|    Northanger Abbey|        1814|
|        Tolstoy, Leo|          Penguin| 12.7|       War and Peace|        1865|
|        Tolstoy, Leo|          Penguin| 13.5|       Anna Karenina|        1875|
|     Woolf, Virginia|   Harcourt Brace| 25.0|       Mrs. Dalloway|        1925|
|Cunnningham, Michael|   Harcourt Brace|12.35|           The Hours|        1999|
|         Twain, Mark|          Penguin| 5.76|    Huckleberry Finn|        1865|
|    Dickens, Charles|     Random House| 5.75|         Bleak House|        1870|
|         Twain, Mark|     Random House| 7.75|          Tom Sawyer|        1862|
|     Woolf, Virginia|          Penguin| 29.0| A Room of One's Own|        1922|
|             Marquez|Harper
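
The predicate `year_written < 2000` is a strict comparison, so a row whose year equals the cutoff is dropped. The plain-Python sketch below mirrors that row-level logic on four rows copied from the Task 1 output. `filter_before` is a hypothetical helper, and the cutoff of 1999 is used only to show the boundary case.

```python
# Four rows copied from the Task 1 output (title and year_written only).
rows = [
    {"title": "War and Peace", "year_written": 1865},
    {"title": "Bleak House", "year_written": 1870},
    {"title": "Mrs. Dalloway", "year_written": 1925},
    {"title": "The Hours", "year_written": 1999},
]

def filter_before(rows, cutoff):
    """Keep rows whose year_written is strictly less than cutoff."""
    return [row for row in rows if row["year_written"] < cutoff]

# With the notebook's cutoff of 2000, every sample row is kept.
print([row["title"] for row in filter_before(rows, 2000)])

# With a hypothetical cutoff of 1999, "The Hours" is dropped: equality fails `<`.
print([row["title"] for row in filter_before(rows, 1999)])
```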

# Task 3: Group the data by `edition` and show the total number of books per edition

In [10]:
df_grouped = df.groupby('edition').count()
df_grouped.show()

+-----------------+-----+
|          edition|count|
+-----------------+-----+
| Signet  Classics|    1|
|Harper  Perennial|    1|
|     Random House|    2|
|   Harcourt Brace|    3|
|          Penguin|    6|
+-----------------+-----+
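
As a cross-check on what `groupby('edition').count()` computes, the same tally can be sketched in plain Python with `collections.Counter`. The input is limited to the nine rows fully visible in the truncated Task 1 output, so these counts are lower than the full result above. The second part sums the counts reported by `df_grouped.show()`, which gives the total number of rows in `df`.

```python
from collections import Counter

# Edition values of the nine rows fully visible in the Task 1 output.
visible_editions = [
    "Penguin", "Penguin", "Penguin", "Harcourt Brace", "Harcourt Brace",
    "Penguin", "Random House", "Random House", "Penguin",
]

# Tally rows per edition, like a group-by followed by a count.
counts = Counter(visible_editions)
for edition, count in counts.most_common():
    print(f"{edition}: {count}")

# Counts copied from the df_grouped.show() output (labels exactly as displayed).
reported = {
    "Signet  Classics": 1,
    "Harper  Perennial": 1,
    "Random House": 2,
    "Harcourt Brace": 3,
    "Penguin": 6,
}
total_rows = sum(reported.values())
print("Total rows in df:", total_rows)
```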

