# 7 - Using Complex Types to Analyse Unstructured or JSON Data
Today's challenge is to go beyond processing well-structured data that complies with a schema and has all values clearly separated into typed columns. We want to analyse the stock descriptions in the retail data set, which come as unstructured text. This is our use case for investigating Spark's complex data types such as arrays and maps. In addition, we want to get familiar with processing semi-structured data such as JSON.

In [None]:
from pyspark_start import *

retailDF = spark.read\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .format("csv")\
   .load("./data/retail-data/by-day/*.csv")

There are two questions we want to investigate regarding the description data:
* What is the average number of words in the Description per StockCode?
* Which are the most frequently used words?

## Data Preparation
The granularity of our analysis is StockCode and not individual invoice items. So to prevent StockCode duplicates, we tailor the data set to get a DataFrame containing distinct StockCodes and their description.

In [None]:
distinctDF = retailDF.select(
        "StockCode",
        "Description").distinct()

distinctDF.orderBy("StockCode").show(10, truncate=False)

Apparently, the null value problem we investigated yesterday occurs again. Rows with null values in any column are useless for our analysis, so we want to remove them.

In [None]:
cleanedDF = distinctDF.dropna(how="any")

cleanedDF.orderBy("StockCode").show(10, truncate=False)

## Arrays
Next, we have to split the text strings into arrays of words. The words in the descriptions are separated by blanks, so we define a blank as the split separator. The result looks like Python lists, but in contrast to lists, all array elements must have the same data type.

In [None]:
from pyspark.sql.functions import split

splittedDF = cleanedDF.select(
        "StockCode",
        split("Description", " ").alias("word_list")
)

splittedDF.show(10, truncate=False)

Like with normal Python lists we can grab specific elements, i.e. words from our word lists, by referencing their index starting with 0 for the first element. So to get the second word in each description, we need to refer to index 1.

In [None]:
from pyspark.sql.functions import col

splittedDF.select("StockCode", col("word_list")[1]).show(10)

It is interesting to note that StockCode 21249 seems to have a double blank after the first word. Maybe a typo in a free-text field? Anyway, we want to count words, not blanks, so we have to remove them later. First, we want to double-check whether this is a more general or a single-case issue.
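
As a rough plain-Python analogy (not Spark code), the following sketch shows why splitting on a single blank produces an empty word when a description contains a double blank. The description string is a made-up example and is not taken from the data set.

```python
# Plain-Python analogy; the description text is a made-up example.
description = "DINOSAUR  PARTY BAG"  # note the double blank after the first word

# Split on exactly one blank, comparable to split("Description", " ")
words = description.split(" ")

print(words)           # ['DINOSAUR', '', 'PARTY', 'BAG']
print(words[1] == "")  # True: index 1 holds an empty string
```

The empty string at index 1 is the "word" between the two blanks, which is what we see in the second array position.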

We can easily check whether or not a word list contains specific keywords by using the `array_contains()` function. For our analysis, we want to identify rows with empty words in the list, which we don't want to count.

In [None]:
from pyspark.sql.functions import array_contains

splittedDF.select(
    "StockCode", 
    "word_list", 
    array_contains("word_list", "").alias("empty strings inside")
).show(10, truncate=False)

So now let's clean up the word lists and remove any empty words.

In [None]:
from pyspark.sql.functions import array_remove

cleanedWordListDF = splittedDF.select(
    "StockCode", 
    array_remove("word_list", "").alias("word_list")
)

cleanedWordListDF.show(10, truncate=False)   

Did it work?

In [None]:
cleanedWordListDF.select(
    "StockCode", 
    "word_list", 
    array_contains("word_list", "").alias("empty strings inside")
).show(10, truncate=False)

Yes, it did.

Back to our questions. Now that the data is cleaned up, the number of words per stock description is simply the array length, which is provided by the `size()` function.

In [None]:
from pyspark.sql.functions import size

cleanedWordListDF.select(
    "StockCode", 
    size("word_list").alias("num_of_words")
).show(10)

In [None]:
from pyspark.sql.functions import avg

avgDF = cleanedWordListDF.select(
    avg(
        size("word_list")
    ).alias("avg_num_of_words")
)

avgDF.show(10)

So the answer to our first question is that stock descriptions are quite short, just about four words on average.
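
To make the three steps (remove empty words, take the array length, average the lengths) easier to follow, here is a minimal plain-Python sketch of the same logic. It is only an analogy to the Spark functions, and the word lists are made up for illustration.

```python
# Plain-Python analogy with made-up word lists; not Spark code.
word_lists = [
    ["PINK", "", "PARTY", "BAG"],
    ["WHITE", "METAL", "LANTERN"],
]

# Drop empty strings, comparable to array_remove("word_list", "")
cleaned = [[w for w in words if w != ""] for words in word_lists]

# List length per row, comparable to size("word_list")
sizes = [len(words) for words in cleaned]

# Mean over all rows, comparable to avg(size("word_list"))
average = sum(sizes) / len(sizes)

print(sizes, average)  # [3, 3] 3.0
```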

The PySpark module [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) provides further array-related functions, which we just list here for later reference:

* **array()** - creates a new array column from a list of columns or column expressions that have the **same data type**
* **array_distinct(col)** - Collection function: removes duplicate values from the array 
* **array_except(col1, col2)** - Collection function: returns an array of the elements in col1 but not in col2, without duplicates
* **array_intersect(col1, col2)** - Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates 
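* **array_join(col, delimiter)** - Collection function: concatenates the elements of the array into a single string using the delimiter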
* **array_max()** - Collection function: returns the maximum value of the array
* **array_min()** - Collection function: returns the minimum value of the array
* **array_position()** - Collection function: Locates the position of the first occurrence of the given value in the given array
* **array_repeat(col, count)** - Collection function: creates an array containing a column repeated count times
* **array_sort(col)** - Collection function: sorts the input array in ascending order
* **array_union(col1, col2)** - Collection function: returns an array of the elements in the union of col1 and col2, without duplicates
* **arrays_overlap(a1, a2)** - Collection function: returns true if the arrays contain any common non-null element
* **arrays_zip()** - Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays

## Explode
To answer our second question, it would be easier to have all words in one column instead of spread across many lists. To turn array elements into rows, we apply the `explode()` function. As the name of the function indicates, this can heavily increase the number of rows, and the values of all remaining columns get duplicated.

In [None]:
from pyspark.sql.functions import explode

explodedDF = cleanedWordListDF.select(
    "StockCode",
    explode("word_list").alias("words")
)

explodedDF.orderBy("StockCode").show(20)

The answer to our second question is simply a count of rows per word, sorted in descending order.

In [None]:
from pyspark.sql.functions import desc, count, lit

explodedDF\
    .groupBy("words")\
    .agg(count(lit(1)).alias("word_count"))\
    .orderBy(desc("word_count"))\
    .show(10)

Pink stocks seem to be quite popular.
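
The combination of exploding and counting can be sketched in plain Python as follows. This is only an analogy to `explode()`, `groupBy()` and `count()`, not Spark code, and the rows are made up for illustration.

```python
from collections import Counter

# Made-up rows of (StockCode, word_list); not taken from the retail data set.
rows = [
    ("A1", ["PINK", "PARTY", "BAG"]),
    ("B2", ["PINK", "METAL", "LANTERN"]),
]

# One output row per array element; the StockCode is repeated for each word.
exploded = [(code, word) for code, words in rows for word in words]

# Count rows per word and sort by count in descending order.
word_counts = Counter(word for _, word in exploded).most_common()

print(len(exploded))   # 6 rows from 2 input rows
print(word_counts[0])  # ('PINK', 2)
```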

## Maps
For handling data in a key:value structure, Spark provides another complex data type: *maps*.

Our test data does not provide key:value structured data. So first, we will transform our existing data into maps, and second, we will investigate how to handle key:value source data as input to our ETL data processing.

### Creating Maps

In [None]:
dfFlight = spark.read\
   .option("inferSchema", "true")\
   .option("header", "true")\
   .csv("./data/flight-data/2015-summary.csv")

from pyspark.sql.functions import lit, struct, array, col
from pyspark.sql.types import StringType

arrDF = dfFlight.select(
    array(
        lit("destination"),
        lit("origin"),
        lit("count")
    ).alias("key"),
    array(
        "DEST_COUNTRY_NAME",
        "ORIGIN_COUNTRY_NAME",
        col("count").cast(StringType())
    ).alias("value")
)

arrDF.show(10, truncate=False)

In [None]:
from pyspark.sql.functions import map_from_arrays

mapDF = arrDF.select(
    map_from_arrays("key", "value").alias("data_map")
)

mapDF.show(10, truncate=False)

In [None]:
mapDF.select(col("data_map")["destination"]).show(10)

In [None]:
mapDF.select(col("data_map")["origin"]).show(10)

In [None]:
from pyspark.sql.functions import map_keys

mapDF.select(
        map_keys("data_map")
).show(10, truncate=False)

In [None]:
from pyspark.sql.functions import map_values

mapDF.select(
        map_values("data_map")
).show(10, truncate=False)

The data we have processed so far looks at least semi-structured because the keys and values all appear in identical order. So there is still an implicit schema because all rows match the same pattern:

destination -> destVal, origin -> origVal, count -> cntVal

What would happen if rows had keys and values in a different order? Because our test data does not provide examples of this, we create a DataFrame manually with synthetic data in multiple orders.

In [None]:

unstructuredDF = spark.createDataFrame(
        [
            (["destination", "origin", "count"], ["United States", "Germany", "10"],), 
            (["count", "origin", "destination"], ["25", "France", "Spain"],),
            (["count", "destination", "origin"], ["75", "Italy", "Spain"],)
        ], 
        ["key", "value"]
)

unstructuredDF.show(truncate=False)

In [None]:
mapDF2 = unstructuredDF.select(
    map_from_arrays("key", "value").alias("data_map")
)

mapDF2.show(truncate=False)

In [None]:
mapDF2.select(col("data_map")["origin"]).show(10)

Luckily, the ordering doesn't matter because we reference the values by key and not by position. Maps are more like dictionaries than lists or arrays.
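
The dictionary comparison can be made concrete with a small plain-Python sketch. It only mimics what `map_from_arrays()` does with the synthetic rows from above and is not Spark code.

```python
# Key and value lists in different orders, mirroring the synthetic rows above.
rows = [
    (["destination", "origin", "count"], ["United States", "Germany", "10"]),
    (["count", "origin", "destination"], ["25", "France", "Spain"]),
    (["count", "destination", "origin"], ["75", "Italy", "Spain"]),
]

# Pair keys with values position by position,
# comparable to map_from_arrays("key", "value").
maps = [dict(zip(keys, values)) for keys, values in rows]

# Lookup by key works regardless of the original ordering.
origins = [m["origin"] for m in maps]

print(origins)  # ['Germany', 'France', 'Spain']
```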

### Turning Maps into DataFrames

With our self-created map, we can now investigate how to handle such data as input for our ETL process, which will finally write data in tabular form into a file or database table. As an intermediate step, we have to align more or less ordered *key:value* pairs with the schema of a `DataFrame`.

Can the `explode()` function help again?

In [None]:
mapDF2.select(explode("data_map")).show(10)

Well, yes and no. Yes, because `explode()` accepts arrays as well as maps as an argument. No, because now we have lost the information about which three rows belong together. Additionally, our intention was to get three columns, one for each key, and not just two. For maps, referencing by key is always a better approach than referencing by position.

In [None]:
mapDF2.select(
    col("data_map")["destination"].alias("destination"),
    col("data_map")["origin"].alias("origin"),
    col("data_map")["count"].alias("count")
).show(10)

So with Spark, handling nearly unstructured data records of key:value pairs in different orders is not a big problem.
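
The difference between the two approaches can also be illustrated with plain Python dictionaries. This is a minimal sketch of the idea, not Spark code, and it uses two of the synthetic records from above.

```python
# Two records with keys in different orders.
maps = [
    {"destination": "United States", "origin": "Germany", "count": "10"},
    {"count": "25", "origin": "France", "destination": "Spain"},
]

# Comparable to explode("data_map"): one (key, value) pair per output row,
# with nothing left that says which pairs came from the same record.
exploded = [(k, v) for m in maps for k, v in m.items()]

# Comparable to selecting data_map["destination"], data_map["origin"]
# and data_map["count"]: one row per record, one column per key.
columns = ["destination", "origin", "count"]
table = [tuple(m[c] for c in columns) for m in maps]

print(len(exploded))  # 6
print(table)          # [('United States', 'Germany', '10'), ('Spain', 'France', '25')]
```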

The PySpark module [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) provides further map-related functions, which we also just list here for later reference:

* **map_concat()** - Returns the union of all the given maps
* **map_from_entries()** - Collection function: Returns a map created from the given array of entries

### Turning Arrays or Maps into JSON
A nice Spark feature is the `to_json()` function, which converts StructType, ArrayType or MapType data into JSON. This can be relevant if we have to call a REST API that expects JSON documents as payload.

In [None]:
from pyspark.sql.functions import to_json

mapDF2.select(to_json("data_map")).show(10, truncate=False)

In [None]:
mapDF2.select(to_json("data_map")).printSchema()

## Processing Semi-structured JSON data
As we learned on day 3, reading data from a JSON file and transforming it into a DataFrame is quite simple. Just as a reminder:

In [None]:
jsonDF = spark.read\
   .option("inferSchema", "true")\
   .format("json")\
   .load("./data/flight-data/2015-summary.json")

jsonDF.printSchema()

In [None]:
jsonDF.show(10)

But what do we have to do if we have tabular data where only one column contains JSON strings? To check this out, we first create some test data.

In [None]:
df = spark.createDataFrame(
        [
            (123, "DUS", '{"destinations" : ["FRA", "MUC", "TXL"], "airlines" : ["LH", "EW", "RY"]}'), 
            (456, "FRA", '{"destinations" : ["CDG", "MUC", "JFK"], "airlines" : ["AF", "LH", "DL"]}'),
            (789, "MUC", '{"destinations" : ["FRA", "ZUC", "DUS"], "airlines" : ["EW", "LH", "EW"]}')
        ], 
        ["key", "airport", "dest"]
)

In [None]:
df.show(truncate=False)

### Navigation along JSON Paths
Each row in the "dest" column contains a valid JSON document. Now we can use the `get_json_object()` function to access the values inside the JSON documents by specifying the path from the root element (represented by `$`) down the nesting hierarchy to the specific JSON object we want to extract.

path: `$.key_level1.key_level_2....key_level_n`

Since the objects "destinations" and "airlines" in our DataFrame hold lists of values, we have to specify the list index to get a single value per row.

In [None]:
from pyspark.sql.functions import get_json_object

df.select(
        "key", 
        "airport",
        get_json_object("dest", '$.destinations[2]').alias("destination"),
        get_json_object("dest", '$.airlines[1]').alias("airline"),
).show(truncate=False)
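
For comparison, the same path navigation can be sketched with Python's standard `json` module. This is only an analogy to the JSON paths used above, not Spark code, and it uses the first document from our test data.

```python
import json

# One JSON document as it appears in the "dest" column of the test data.
dest = '{"destinations" : ["FRA", "MUC", "TXL"], "airlines" : ["LH", "EW", "RY"]}'

doc = json.loads(dest)

# Comparable to the paths '$.destinations[2]' and '$.airlines[1]'
destination = doc["destinations"][2]
airline = doc["airlines"][1]

print(destination, airline)  # TXL EW
```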

If we omit the list index, we get the entire value list in our result DataFrame.

In [None]:
df.select(
        "key", 
        "airport",
        get_json_object("dest", '$.destinations').alias("destination"),
        get_json_object("dest", '$.airlines').alias("airline"),
).show(truncate=False)

There is a similar function, `json_tuple()`, but we are not sure if it provides any benefits to us, because:
1. we cannot use it if the JSON document has more than one level of nesting, and
1. we cannot refer to single list elements

In [None]:
from pyspark.sql.functions import json_tuple
df.select("key", 
          "airport",
          json_tuple("dest", "destinations", "airlines").alias("destination", "airline"),
).show(truncate=False)

### Turning JSON to Map based on Schema
Finally, just as we can read from JSON files using an explicit schema definition, we can also apply `from_json()` with a schema to DataFrame columns containing JSON. Depending on the schema definition, `from_json()` returns a StructType, ArrayType or MapType. In fact, we could perform a conversion round trip: StructType, ArrayType or MapType -> `to_json()` -> JSON -> `from_json()` -> StructType, ArrayType or MapType.
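
The idea of such a round trip can be sketched with Python's standard `json` module. This is a loose analogy only: `json.dumps()` and `json.loads()` stand in for `to_json()` and `from_json()`, and unlike `from_json()` they do not need a schema.

```python
import json

# A map-like record, taken from the synthetic data used earlier.
data_map = {"destination": "Spain", "origin": "France", "count": "25"}

# Map -> JSON string, comparable in spirit to to_json()
json_string = json.dumps(data_map)

# JSON string -> map, comparable in spirit to from_json() with a map schema
restored = json.loads(json_string)

print(json_string)
print(restored == data_map)  # True
```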

In the next step, we convert the JSON strings in the "dest" column into a map.

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import from_json

jsonSchema = MapType(
    StringType(), 
    ArrayType(StringType(), True),
    True
)

mappedDF = df.select("key", 
          "airport",
         from_json("dest", jsonSchema).alias("json_data")
)

mappedDF.show(truncate=False)

Now we can navigate on the Map structure to extract single values similar to navigating the JSON path using `get_json_object()`, e.g. grabbing the third element of the destinations lists.

In [None]:
mappedDF.select(
    "key", 
    "airport",
    col("json_data")["destinations"][2]
).show()

The question is: what is the benefit of making this extra effort of defining a schema and converting JSON to a map? In our opinion, this leads to cleaner code and a better design, because:
1. now the JSON structure we are expecting is explicitly documented in the code by the schema instead of implicitly assumed
1. the Map structure is a unifying abstraction of any key:value data, regardless of the source format, e.g. CSV files, JSON documents or key-value database tables