# Handling Volume with Apache Spark

Use Apache Spark to tokenize product names and count how often each word occurs.

## License

MIT License

Copyright (c) 2018 PT Bukalapak.com

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Software Versions

In [1]:
import sys, os
print("Python %s" % sys.version)
import time

Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21) 
[GCC 7.3.0]


In [2]:
import pyspark
print("PySpark %s" % pyspark.__version__)
from pyspark.sql import SparkSession

PySpark 2.4.4


In [3]:
import platform
print("platform %s" % platform.__version__)

platform 1.0.8


In [4]:
print("OS", platform.platform())

OS Linux-4.15.0-65-generic-x86_64-with-debian-buster-sid


In [5]:
import tensorflow as tf
print("TensorFlow %s" % tf.__version__)
from tensorflow.keras.preprocessing.text import text_to_word_sequence

TensorFlow 1.15.0


In [6]:
%%bash
/usr/local/spark/bin/spark-submit --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
                        
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_222
Branch 
Compiled by user  on 2019-08-27T21:21:38Z
Revision 
Url 
Type --help for more information.


## Perform Word Count using Notebook (NB)

Set up Spark.

In [7]:
APP_NAME = "bukalapak-core-ai.big-data-3v.volume-spark"
spark = SparkSession \
    .builder \
    .appName(APP_NAME) \
    .getOrCreate()

In [8]:
sc = spark.sparkContext

In [9]:
sc

Input and output URLs.

In [10]:
product_names_text_filename = \
    "file:/home/jovyan/work/" + \
    "data/product_names_sample/" + \
    "product_names.rdd"
product_names_text_filename

'file:/home/jovyan/work/data/product_names_sample/product_names.rdd'

In [11]:
product_names_word_count_nb_orc_filename = \
    "file:/home/jovyan/work/" + \
    "data/product_names_sample/" + \
    "product_names_word_count_nb.orc"
product_names_word_count_nb_orc_filename

'file:/home/jovyan/work/data/product_names_sample/product_names_word_count_nb.orc'

Read input file.

In [12]:
product_names_df = spark.read.text(product_names_text_filename)
product_names_df

DataFrame[value: string]

In [13]:
product_names_df.head(10)

[Row(value='DAILY LIFE OF SCHOLAR SHINJIRO KATSURAGI 01: LC-MeruyaBookStore'),
 Row(value='Ready Stock Kulot Cantik'),
 Row(value='Sepaket Tas Dompet Sepatu Jam Kacamata Kalung'),
 Row(value='Baterai HP Pavilion DV3 2000 Compaq Presario CQ35   Black'),
 Row(value='RodFord OCEANOS Stage II Overhead Rod RFOB60-4 AHI - PE#4 (1 Sec.)'),
 Row(value='Giant killing 29'),
 Row(value=''),
 Row(value='New Stock sepatu kulit gio feruji sepatu kerja'),
 Row(value='Best Seller Kemeja Polos Pria Lengan Panjang Merah Marun Cpmm'),
 Row(value='kemeja pria bigsize  4L (Best Seller!)')]

In [14]:
product_names_rdd = product_names_df.rdd
product_names_rdd

MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:0

In [15]:
product_names_rdd.getNumPartitions()

2

In [16]:
product_names_rdd.top(10)

[Row(value='kemeja pria bigsize  4L (Best Seller!)'),
 Row(value='Sepaket Tas Dompet Sepatu Jam Kacamata Kalung'),
 Row(value='RodFord OCEANOS Stage II Overhead Rod RFOB60-4 AHI - PE#4 (1 Sec.)'),
 Row(value='Ready Stock Kulot Cantik'),
 Row(value='New Stock sepatu kulit gio feruji sepatu kerja'),
 Row(value='Giant killing 29'),
 Row(value='DAILY LIFE OF SCHOLAR SHINJIRO KATSURAGI 01: LC-MeruyaBookStore'),
 Row(value='Best Seller Kemeja Polos Pria Lengan Panjang Merah Marun Cpmm'),
 Row(value='Baterai HP Pavilion DV3 2000 Compaq Presario CQ35   Black'),
 Row(value='BARU.. Motorola TLKR T80  GO ADVENTURE  WALKIE TALKIES- GARANSI RESMI 1 TAHUN')]
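
Note that `top(10)` is not the same as `head(10)`: `RDD.top(n)` returns the n largest elements in descending order, which is why the rows above are sorted in reverse and start with a lowercase name. The following is a plain-Python sketch of that ordering on a handful of the product names, using `heapq.nlargest`. It is an illustration only and says nothing about how Spark distributes the work across partitions.

```python
import heapq

product_names = [
    "Ready Stock Kulot Cantik",
    "Giant killing 29",
    "",
    "kemeja pria bigsize  4L (Best Seller!)",
    "Sepaket Tas Dompet Sepatu Jam Kacamata Kalung",
]

# The 3 largest strings in descending order. Strings compare by code
# point, so lowercase letters sort after (are "larger" than) uppercase.
top_3 = heapq.nlargest(3, product_names)
print(top_3)
```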

Perform tokenization.

In [17]:
def tokenize(words):
    # `words` is a Row with a single `value` column holding one product name
    return text_to_word_sequence(words['value'])

In [18]:
tokenized_product_names_rdd = \
    product_names_rdd.flatMap(lambda product_name: tokenize(product_name))
tokenized_product_names_rdd

PythonRDD[9] at RDD at PythonRDD.scala:53

In [19]:
tokenized_product_names_rdd.top(20)

['walkie',
 'tlkr',
 'tas',
 'talkies',
 'tahun',
 't80',
 'stock',
 'stock',
 'stage',
 'shinjiro',
 'sepatu',
 'sepatu',
 'sepatu',
 'sepaket',
 'seller',
 'seller',
 'sec',
 'scholar',
 'rodford',
 'rod']
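
The tokens above are lowercased and stripped of punctuation because `text_to_word_sequence` lowercases its input, replaces a set of filter characters with the split character, and drops empty tokens. The following is a rough, dependency-free approximation of that behavior. The name `tokenize_sketch` is hypothetical, and the filter string is an assumption based on the Keras defaults, so verify it against the installed version before relying on it.

```python
# Assumed default filter characters of Keras' text_to_word_sequence.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize_sketch(text, filters=FILTERS):
    """Lowercase, turn filter characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


print(tokenize_sketch("kemeja pria bigsize  4L (Best Seller!)"))
# ['kemeja', 'pria', 'bigsize', '4l', 'best', 'seller']
```

Under these assumptions `RFOB60-4` splits into `rfob60` and `4`, which is consistent with `rfob60` appearing as its own token in the word count.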

Perform word count.

In [20]:
word_count_product_names_rdd = \
    tokenized_product_names_rdd.map(lambda word: (word, 1)) \
                               .reduceByKey(lambda a, b: a + b)

In [21]:
word_count_product_names_rdd

PythonRDD[15] at RDD at PythonRDD.scala:53

In [22]:
word_count_product_names_rdd.top(20)

[('walkie', 1),
 ('tlkr', 1),
 ('tas', 1),
 ('talkies', 1),
 ('tahun', 1),
 ('t80', 1),
 ('stock', 2),
 ('stage', 1),
 ('shinjiro', 1),
 ('sepatu', 3),
 ('sepaket', 1),
 ('seller', 2),
 ('sec', 1),
 ('scholar', 1),
 ('rodford', 1),
 ('rod', 1),
 ('rfob60', 1),
 ('resmi', 1),
 ('ready', 1),
 ('pria', 2)]
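
The `map` and `reduceByKey` steps can be read as: pair every token with a count of 1, then combine all counts that share the same word with `a + b`. The following is a minimal single-machine sketch of the same logic in plain Python. The token list is made up for illustration, and Spark additionally shuffles the pairs between partitions, which this sketch does not model.

```python
from collections import defaultdict
from functools import reduce

tokens = ["sepatu", "stock", "sepatu", "seller", "stock", "sepatu"]

# map step: pair every token with a count of 1
pairs = [(word, 1) for word in tokens]

# reduceByKey step: group the counts by word, then fold each group
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)
word_counts = {
    word: reduce(lambda a, b: a + b, counts)
    for word, counts in grouped.items()
}

print(sorted(word_counts.items(), reverse=True))
# [('stock', 2), ('sepatu', 3), ('seller', 1)]
```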

Save the output. __Note:__ Delete any existing `product_names_word_count_nb.orc` directory in `data/product_names_sample` first. With the default save mode, Spark does not overwrite existing data; it raises an error instead.
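
The manual cleanup mentioned in the note can also be scripted. The following sketch assumes the output lives on the local filesystem and is addressed with a `file:/absolute/path` URL as in this notebook. The helper name `remove_local_output` is hypothetical.

```python
import os
import shutil


def remove_local_output(url):
    """Delete the local directory behind a `file:` URL, if it exists.

    Hypothetical helper; it handles only the `file:/absolute/path`
    form used in this notebook. Returns True if something was deleted.
    """
    prefix = "file:"
    if not url.startswith(prefix):
        raise ValueError("expected a file: URL, got %r" % url)
    path = url[len(prefix):]
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False
```

Calling `remove_local_output(product_names_word_count_nb_orc_filename)` before the save would clear a previous run. Alternatively, setting the writer's save mode with `.write.mode("overwrite")` asks Spark itself to replace existing output.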

In [23]:
word_count_product_names_df = spark.createDataFrame(word_count_product_names_rdd)
word_count_product_names_df.write.save(product_names_word_count_nb_orc_filename,
                                       format="orc")

Read back the word count.

In [24]:
new_word_count_product_names_df = spark.read.orc(product_names_word_count_nb_orc_filename)
new_word_count_product_names_df

DataFrame[_1: string, _2: bigint]

In [25]:
new_word_count_product_names_df.head(20)

[Row(_1='daily', _2=1),
 Row(_1='life', _2=1),
 Row(_1='scholar', _2=1),
 Row(_1='shinjiro', _2=1),
 Row(_1='katsuragi', _2=1),
 Row(_1='01', _2=1),
 Row(_1='meruyabookstore', _2=1),
 Row(_1='stock', _2=2),
 Row(_1='tas', _2=1),
 Row(_1='sepatu', _2=3),
 Row(_1='kacamata', _2=1),
 Row(_1='kalung', _2=1),
 Row(_1='baterai', _2=1),
 Row(_1='pavilion', _2=1),
 Row(_1='dv3', _2=1),
 Row(_1='2000', _2=1),
 Row(_1='cq35', _2=1),
 Row(_1='black', _2=1),
 Row(_1='rodford', _2=1),
 Row(_1='ii', _2=1)]

Stop Spark.

In [26]:
sc.stop()

In [27]:
spark.stop()

## Perform Word Count using Spark Submit (SS)

In [28]:
%%writefile bukalapak-core-ai.big-data-3v.volume-spark.py
# Copyright (c) 2018 PT Bukalapak.com
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

from pyspark.sql import SparkSession


APP_NAME = "bukalapak-core-ai.big-data-3v.volume-spark"


def tokenize(words):
    # `words` is a Row with a single `value` column holding one product name
    from tensorflow.keras.preprocessing.text import text_to_word_sequence
    return text_to_word_sequence(words['value'])


def main(spark):
    # Input
    product_names_text_filename = \
        "file:/home/jovyan/work/" + \
        "data/product_names_sample/" + \
        "product_names.rdd"
    # Output
    product_names_word_count_ss_orc_filename = \
        "file:/home/jovyan/work/" + \
        "data/product_names_sample/" + \
        "product_names_word_count_ss.orc"
    # Read input
    product_names_df = spark.read.text(product_names_text_filename)
    product_names_rdd = product_names_df.rdd
    # Perform tokenization and word count
    tokenized_product_names_rdd = \
        product_names_rdd.flatMap(lambda product_name: tokenize(product_name))
    word_count_product_names_rdd = \
        tokenized_product_names_rdd.map(lambda word: (word, 1)) \
                                   .reduceByKey(lambda a, b: a + b)
    # Write output
    word_count_product_names_df = spark.createDataFrame(word_count_product_names_rdd)
    word_count_product_names_df.write.save(product_names_word_count_ss_orc_filename,
                                           format="orc")


if __name__ == "__main__":
    # Configure Spark
    spark = SparkSession \
        .builder \
        .appName(APP_NAME) \
        .getOrCreate()
    main(spark)
    spark.stop()


Overwriting bukalapak-core-ai.big-data-3v.volume-spark.py


__Note:__ Delete any existing `product_names_word_count_ss.orc` directory in `data/product_names_sample` first. With the default save mode, Spark does not overwrite existing data; it raises an error instead.

In [29]:
%%bash
/usr/local/spark/bin/spark-submit \
    --executor-memory 1g --executor-cores 1 --num-executors 2 \
    bukalapak-core-ai.big-data-3v.volume-spark.py

19/10/18 17:34:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/10/18 17:34:52 INFO SparkContext: Running Spark version 2.4.4
19/10/18 17:34:52 INFO SparkContext: Submitted application: bukalapak-core-ai.big-data-3v.volume-spark
19/10/18 17:34:52 INFO SecurityManager: Changing view acls to: jovyan
19/10/18 17:34:52 INFO SecurityManager: Changing modify acls to: jovyan
19/10/18 17:34:52 INFO SecurityManager: Changing view acls groups to: 
19/10/18 17:34:52 INFO SecurityManager: Changing modify acls groups to: 
19/10/18 17:34:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jovyan); groups with view permissions: Set(); users  with modify permissions: Set(jovyan); groups with modify permissions: Set()
19/10/18 17:34:52 INFO Utils: Successfully started service '

Read back the word count.

In [30]:
APP_NAME = "bukalapak-core-ai.big-data-3v.volume-spark"
spark = SparkSession \
    .builder \
    .appName(APP_NAME) \
    .getOrCreate()

In [31]:
product_names_word_count_ss_orc_filename = \
    "file:/home/jovyan/work/" + \
    "data/product_names_sample/" + \
    "product_names_word_count_ss.orc"
product_names_word_count_ss_orc_filename

'file:/home/jovyan/work/data/product_names_sample/product_names_word_count_ss.orc'

In [32]:
new_word_count_product_names_df = spark.read.orc(product_names_word_count_ss_orc_filename)
new_word_count_product_names_df

DataFrame[_1: string, _2: bigint]

In [33]:
new_word_count_product_names_df.head(20)

[Row(_1='daily', _2=1),
 Row(_1='life', _2=1),
 Row(_1='scholar', _2=1),
 Row(_1='shinjiro', _2=1),
 Row(_1='katsuragi', _2=1),
 Row(_1='01', _2=1),
 Row(_1='meruyabookstore', _2=1),
 Row(_1='stock', _2=2),
 Row(_1='tas', _2=1),
 Row(_1='sepatu', _2=3),
 Row(_1='kacamata', _2=1),
 Row(_1='kalung', _2=1),
 Row(_1='baterai', _2=1),
 Row(_1='pavilion', _2=1),
 Row(_1='dv3', _2=1),
 Row(_1='2000', _2=1),
 Row(_1='cq35', _2=1),
 Row(_1='black', _2=1),
 Row(_1='rodford', _2=1),
 Row(_1='ii', _2=1)]

Stop Spark.

In [34]:
spark.sparkContext.stop()  # `sc` from In [8] belongs to the session stopped earlier

In [35]:
spark.stop()

## Software Versions

In [36]:
%%bash
cat /etc/os-release

NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic


In [37]:
%%bash
pip freeze

absl-py==0.8.1
alembic==1.0.11
asn1crypto==0.24.0
astor==0.8.0
async-generator==1.10
attrs==19.1.0
backcall==0.1.0
beautifulsoup4==4.7.1
bleach==3.1.0
blinker==1.4
bokeh==1.0.4
certifi==2019.9.11
certipy==0.1.3
cffi==1.12.3
chardet==3.0.4
Click==7.0
cloudpickle==0.8.1
conda==4.7.10
conda-package-handling==1.3.11
cryptography==2.7
cycler==0.10.0
Cython==0.29.13
cytoolz==0.10.0
dask==1.1.5
decorator==4.4.0
defusedxml==0.5.0
dill==0.2.9
distributed==1.28.1
entrypoints==0.3
fastcache==1.1.0
gast==0.2.2
gmpy2==2.1.0b1
google-pasta==0.1.7
grpcio==1.24.1
h5py==2.9.0
heapdict==1.0.0
idna==2.8
imageio==2.5.0
ipykernel==5.1.1
ipython==7.7.0
ipython-genutils==0.2.0
ipywidgets==7.5.0
jedi==0.14.1
Jinja2==2.10.1
json5==0.8.5
jsonschema==3.0.1
jupyter-client==5.3.1
jupyter-core==4.4.0
jupyterhub==1.0.0
jupyterlab==1.0.4
jupyterlab-server==1.0.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
libarchive-c==2.8
llvmlite==0.27.1
locket==0.2.0
Mako==1.0.10
Markdown==3.1.1
MarkupSa

In [38]:
%%bash
conda list

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    defaults
absl-py                   0.8.1                    pypi_0    pypi
alembic                   1.0.11                     py_0    conda-forge
arrow-cpp                 0.13.0           py37h246e31e_6    conda-forge
asn1crypto                0.24.0                py37_1003    conda-forge
astor                     0.8.0                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
attrs                     19.1.0                     py_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
beautifulsoup4            4.7.1                 py37_1001    conda-forge
blas                      2.10                   openblas    conda-forge
bleach                    3.1.0                      py_0    conda-forge
blinker                   1.4  