## Query Example - Word Count

Let us see how to perform a word count using Spark SQL. Using word count as the example, we will work out a solution built from the pre-defined functions available in Spark SQL.

Let us start the Spark session for this notebook so that we can execute the code provided.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@19a7b454


org.apache.spark.sql.SparkSession@19a7b454

If you prefer to work from a CLI, you can run Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using PySpark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

The solution involves the following steps:

* Create a table named `lines`.
* Insert data into the table.
* Split each line into an array of words.
* Explode the array of words from each line into individual records.
* Group by word and get the count. We cannot use `GROUP BY` directly on exploded records, so we need a nested subquery.
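
The steps above can be sketched outside Spark with plain Scala collections, which may help in seeing what each SQL step contributes. This is only an illustrative analogue, not how Spark executes the query: `flatMap` stands in for `split` followed by `explode`, and `groupBy` stands in for `GROUP BY`. The names `sampleLines`, `words`, and `wordCounts` are made up for this sketch.

```scala
// Hypothetical in-memory stand-in for the rows of the `lines` table.
val sampleLines = Seq(
  "Hello World",
  "How are you",
  "Let us perform the word count",
  "The definition of word count is",
  "to get the count of each word from this data"
)

// split(s, ' ') followed by explode: flatten every line into individual words.
val words = sampleLines.flatMap(_.split(" "))

// GROUP BY word with count(1): number of occurrences per distinct word.
val wordCounts = words.groupBy(identity).map {
  case (word, occurrences) => (word, occurrences.size)
}

println(s"total words: ${words.size}")
println(s"distinct words: ${wordCounts.size}")
println(s"occurrences of 'count': ${wordCounts("count")}")
```

On this sample the sketch reports 27 words in total and 21 distinct words. The comparison is case-sensitive, so `The` and `the` are counted as different words.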

In [3]:
%%sql

DROP DATABASE IF EXISTS itv002461_demo CASCADE




In [4]:
%%sql

CREATE DATABASE IF NOT EXISTS itv002461_demo




In [6]:
%%sql

USE itv002461_demo




In [7]:
%%sql

CREATE TABLE lines (s STRING)




In [8]:
%%sql

INSERT INTO lines VALUES
  ('Hello World'),
  ('How are you'),
  ('Let us perform the word count'),
  ('The definition of word count is'),
  ('to get the count of each word from this data')




In [9]:
%%sql

SELECT * FROM lines

+--------------------+
|                   s|
+--------------------+
|         Hello World|
|         How are you|
|Let us perform th...|
|The definition of...|
|to get the count ...|
+--------------------+



In [10]:
%%sql

SELECT split(s, ' ') AS word_array FROM lines

+--------------------+
|          word_array|
+--------------------+
|      [Hello, World]|
|     [How, are, you]|
|[Let, us, perform...|
|[The, definition,...|
|[to, get, the, co...|
+--------------------+



In [None]:
// List the pre-defined functions available in Spark SQL (up to 300 rows, untruncated).
spark.sql("SHOW functions").show(300, false)

In [12]:
%%sql

SELECT explode(split(s, ' ')) AS words FROM lines

+-------+
|  words|
+-------+
|  Hello|
|  World|
|    How|
|    are|
|    you|
|    Let|
|     us|
|perform|
|    the|
|   word|
+-------+
only showing top 10 rows



In [13]:
%%sql

SELECT count(1) FROM (SELECT explode(split(s, ' ')) AS words FROM lines)

+--------+
|count(1)|
+--------+
|      27|
+--------+



In [14]:
%%sql

SELECT explode(split(s, ' ')) AS words, count(1) FROM lines
GROUP BY explode(split(s, ' '))

Magic sql failed to execute with error: 
Generators are not supported outside the SELECT clause, but got: 'Aggregate [explode(split(s#52,  ))], [explode(split(s#52,  )) AS words#50, count(1) AS count(1)#53L];

The query fails because a generator such as `explode` is allowed only in the `SELECT` clause. Exploding in a nested subquery and grouping in the outer query avoids the error.

In [15]:
%%sql

SELECT word, count(1) FROM (
  SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word

+-----+--------+
| word|count(1)|
+-----+--------+
|World|       1|
|   us|       1|
|  you|       1|
|count|       3|
|   is|       1|
| each|       1|
| data|       1|
|Hello|       1|
|  How|       1|
|  the|       2|
+-----+--------+
only showing top 10 rows
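
For comparison, the same result can be sketched with the DataFrame API in Scala instead of `%%sql`. This is a minimal sketch under assumptions: it builds a small DataFrame from the sample sentences rather than reading the `lines` table, it uses a `local[*]` master so that it can run outside the cluster (inside this notebook `getOrCreate` would reuse the existing session), and the names `session`, `sampleDf`, and `wordCountsDf` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

// Assumption: a local session for illustration; getOrCreate reuses a running session if one exists.
val session = SparkSession.
    builder.
    master("local[*]").
    appName("Word Count Sketch").
    getOrCreate()
import session.implicits._

// Same sample sentences as the INSERT statement, in a single column named s.
val sampleDf = Seq(
  "Hello World",
  "How are you",
  "Let us perform the word count",
  "The definition of word count is",
  "to get the count of each word from this data"
).toDF("s")

// explode stays inside select; groupBy then runs on the resulting column,
// mirroring the nested subquery followed by GROUP BY.
val wordCountsDf = sampleDf.
    select(explode(split(col("s"), " ")).as("word")).
    groupBy("word").
    count()

wordCountsDf.show(30, false)
```

Keeping `explode` in its own `select` before `groupBy` plays the same role as the nested subquery in the SQL version.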



In [16]:
%%sql

SELECT count(1) FROM
(
    SELECT word, count(1) FROM (
        SELECT explode(split(s, ' ')) AS word FROM lines
    ) q
    GROUP BY word
)

+--------+
|count(1)|
+--------+
|      21|
+--------+

