## Find the user who used the apache-spark tag the most times. Also find out how many times he used each of the other tags.

In this notebook you will use higher-order functions to analyze tag usage by the users who asked questions.

The answer to this question should have this format:
* The user that used the `apache-spark` tag most frequently has id=xxx
* He used the tag xxx times
* Here is the frequency of all other tags he used:
```
{
  'hadoop': x,
  'sql': y,
  'python': z, # similarly for all other tags he used
}
```

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, split, collect_list, lit, 
    concat, flatten, length, aggregate, map_concat, create_map, coalesce, map_contains_key, desc, count
)
from pyspark.sql.types import MapType, StringType, IntegerType

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Higher Order Functions II')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is assumed to sit three directories above the working directory
project_path = ('/').join(base_path.split('/')[0:-3])

questions_json_input_path = os.path.join(project_path, 'data/questions-json')

In [None]:
questionsDF = (
    spark
    .read
    .format('json')
    .option('path', questions_json_input_path)
    .load()
)

### Split the tags into an array and find out what tags were used by each user

Hint:
* check the `tags` column; it has the format `<tag1><tag2><tag3>`
* convert the tags into an array: [tag1, tag2, tag3, ...]
  * there are several ways to do it; you will need split plus some other technique to remove the angle brackets
* group by user and collect all tags into a single array
  * check [flatten](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.flatten.html) to handle nested arrays
* the output DataFrame should contain 2 columns: user_id and tags, where tags is an array of all tags used by the user (some tags will repeat, which is good, we will count them in the next section)
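
The steps in this hint can be previewed in plain Python before writing the Spark version. This is only a sketch of the logic, not the notebook's solution: `parse_tags`, `rows`, and `user_tags_py` are hypothetical names, and the sample rows are made up.

```python
from itertools import chain

def parse_tags(raw):
    """Turn '<tag1><tag2>' into ['tag1', 'tag2']."""
    if not raw:
        return []
    # drop the leading '<' and trailing '>', then split on the inner '><'
    return raw[1:-1].split('><')

# made-up sample rows: (user_id, tags)
rows = [
    (1, '<apache-spark><python>'),
    (2, '<sql>'),
    (1, '<apache-spark><hadoop>'),
]

# group by user and collect one list of tags per question (nested lists)
nested_by_user = {}
for user_id, raw in rows:
    nested_by_user.setdefault(user_id, []).append(parse_tags(raw))

# flatten the nested lists into a single list per user; repeats are kept
user_tags_py = {
    user_id: list(chain.from_iterable(nested))
    for user_id, nested in nested_by_user.items()
}
print(user_tags_py)
```

In Spark, the same three steps correspond to trimming and splitting the column, `collect_list` inside a `groupBy`, and `flatten`.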

In [None]:
user_tags = (
    questionsDF
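    # strip the leading '<' and the trailing '>' (substr positions are 1-based)
    .withColumn('fixed_tags', col('tags').substr(lit(2), length('tags') - 2))
    # split on the remaining '><' separators to get an array of tags
    .withColumn('fixed_tags', split('fixed_tags', '><'))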
    .groupBy('user_id')
    .agg(
        flatten(collect_list('fixed_tags')).alias('tags')
    )
)

In [None]:
user_tags.show(n=5)

### Count the frequency of the tags for each user

Hint (this is the more complex part of the task):
* Use a MapType to store the tags as follows:
  * tag1->4, tag2->1, tag3->10, i.e., the key is the name of the tag and the value is its frequency
  * you will need [aggregate](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.aggregate.html), [create_map](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.create_map.html), [map_concat](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.map_concat.html)
  * you may need to set the [spark.sql.mapKeyDedupPolicy](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L4709) config property to LAST_WIN to deal with duplicated keys; with the default policy, map_concat raises an error when both maps contain the same key
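
The fold described in this hint can be sketched in plain Python with `functools.reduce`, which plays the role of `aggregate` here. The names `add_tag`, `tags`, and `frequency` are hypothetical and the input list is made up; the dictionary merge stands in for `map_concat` under the LAST_WIN policy, where the newer entry replaces the older one with the same key.

```python
from functools import reduce

def add_tag(acc, tag):
    # build a one-entry map {tag: previous count + 1} and merge it into acc;
    # on a duplicate key the right-hand entry wins, as LAST_WIN would
    return {**acc, tag: acc.get(tag, 0) + 1}

# made-up list of tags for one user
tags = ['apache-spark', 'python', 'apache-spark', 'hadoop']

# start from an empty map and fold the tags into it one by one
frequency = reduce(add_tag, tags, {})
print(frequency)
```

Here `acc.get(tag, 0)` corresponds to `coalesce(acc[x], lit(0))`: a tag that is not in the map yet starts from zero.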

In [None]:
spark.conf.set('spark.sql.mapKeyDedupPolicy', 'LAST_WIN')

In [None]:
tags_with_frequency = (
    user_tags
    .withColumn(
        'tags',
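        # fold over the array: start from an empty map and, for each tag x,
        # overwrite its entry with the previous count (0 if absent) plus 1
        aggregate(
            'tags',
            create_map().cast(MapType(StringType(), IntegerType())),
            lambda acc, x: map_concat(acc, create_map(x, coalesce(acc[x], lit(0)) + 1))
        )
    )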
)

tags_with_frequency.show(n=5, truncate=120)

### Find the user who used the apache-spark tag with the highest frequency

Hint:
* filter the DataFrame using map_contains_key to keep only the users that used the particular tag
* add the frequency of the particular tag as a new column and sort the DataFrame by it in descending order to find the user
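
The filter-then-sort idea from this hint looks as follows in plain Python. It is a sketch on made-up data, and `frequencies_by_user`, `with_spark`, and `ranked` are hypothetical names, not part of the notebook.

```python
# made-up per-user frequency maps: user_id -> {tag: count}
frequencies_by_user = {
    10: {'apache-spark': 3, 'python': 1},
    20: {'sql': 5},
    30: {'apache-spark': 7, 'hadoop': 2},
}

# keep only the users whose map contains the key (map_contains_key in Spark)
with_spark = {
    user_id: tags
    for user_id, tags in frequencies_by_user.items()
    if 'apache-spark' in tags
}

# sort by the frequency of the tag, highest first (orderBy(desc(...)) in Spark)
ranked = sorted(
    with_spark.items(),
    key=lambda item: item[1]['apache-spark'],
    reverse=True,
)

top_user_id, top_tags = ranked[0]
print(top_user_id, top_tags['apache-spark'])
```

Filtering first means the lookup of the tag never hits a missing key during the sort.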

In [None]:
users_with_spark_tag = (
    tags_with_frequency
    .filter(map_contains_key('tags', 'apache-spark'))
    .withColumn('spark_frequency_tag', col('tags')['apache-spark'])
    .orderBy(desc('spark_frequency_tag'))
)

users_with_spark_tag.show(n=5, truncate=110)

### Find the frequency of all other tags for the user

Hint:
* just collect the row: it already contains all the tags, and the MapType value becomes a Python dictionary of all tags used by the user
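
Once the row is collected, the dictionary can be shaped into the answer format from the top of the notebook. The sketch below assumes a made-up user id and made-up counts; `summarize` is a hypothetical helper, and its `tags` argument stands in for the dictionary taken from the collected row.

```python
def summarize(user_id, tags, target='apache-spark'):
    # separate the target tag from all the other tags the user used
    others = {tag: n for tag, n in tags.items() if tag != target}
    return {'user_id': user_id, 'target_count': tags[target], 'other_tags': others}

# made-up values for illustration only
summary = summarize(42, {'apache-spark': 7, 'hadoop': 2, 'sql': 1})

print(f"The user that used the `apache-spark` tag most frequently has id={summary['user_id']}")
print(f"He used the tag {summary['target_count']} times")
print("Here is the frequency of all other tags he used:")
print(summary['other_tags'])
```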

In [None]:
(
    users_with_spark_tag
    .limit(1)
    .select('user_id', 'tags')
).collect()[0]['tags']

In [None]:
spark.stop()