In [7]:
import pyspark.sql.functions as sf

from pyspark.sql.window import Window

In [2]:
from pyspark.sql import SparkSession

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

# Twitter Example

In this notebook, we will work with some Twitter data that was downloaded from *The Internet Archive* at https://archive.org/details/twitterstream. To demonstrate a use case for Spark window functions, we want to find the latest tweet for each hashtag.

## 1 Load Twitter Data

As a first step, we load the Twitter data. It is stored as JSON, a format that Spark supports well.

In [None]:
basedir = "s3://dimajix-training/data"

In [3]:
data = spark.read\
    .json(basedir + "/twitter-sample/00.json")

### Inspect Schema

Now let us inspect the schema. As we will see, the metadata for each tweet is massive and the complete data model is quite complex. Fortunately, we are only interested in the tweet itself and the list of hashtags. Note that the hashtags have already been extracted for us, so there is no need for a custom extractor.

In [None]:
data.printSchema()

## 2 Reduce Schema

Since we don't want to work with the whole schema, let us select only the relevant columns. Note that this simplification is purely for the benefit of us human readers. Spark itself would extract only the required columns anyway, so the explicit selection brings no performance improvement (which is a good thing, since it means that Spark optimizes this automatically).

Specifically, we are interested in the following columns:
* `created_at` contains the date and time when the tweet was originally created
* `text` contains the full text of the tweet
* `entities.hashtags.text` contains an array of all hashtags
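
To make the shape of the reduced data concrete, here is a minimal plain-Python sketch (no Spark involved) of what selecting these three columns amounts to. The sample records and the helper `select_columns` are hypothetical and only mimic the relevant part of a tweet. The point is that addressing `text` through the array `entities.hashtags` yields one array of strings per tweet.

```python
# Toy records shaped like the relevant part of a tweet (hypothetical sample data).
tweets = [
    {"created_at": "t1", "text": "hello #a #b",
     "entities": {"hashtags": [{"text": "a"}, {"text": "b"}]}},
    {"created_at": "t2", "text": "no tags",
     "entities": {"hashtags": []}},
]

def select_columns(tweet):
    """Keep the three fields of interest and collect the `text` of every hashtag entry."""
    return {
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        "hashtags_array": [h["text"] for h in tweet["entities"]["hashtags"]],
    }

reduced = [select_columns(t) for t in tweets]
for row in reduced:
    print(row)
```

Each resulting record still describes exactly one tweet; the hashtags remain packed into a single array value.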

In [None]:
hashtags = data.select(
    data["created_at"],
    data["text"],
    data["entities.hashtags.text"].alias("hashtags_array")
)
hashtags.printSchema()

## 3 Unpack Hashtags

Now the schema contains an array column with the list of all hashtags. What we need instead is one record per hashtag, with all other attributes copied into each generated record. This can be done with the Spark function `explode`. So we select again, but this time we generate a new record for every entry in the hashtag array.
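
As an illustration of the semantics only, the following plain-Python sketch models what exploding an array column does: every array element becomes its own record, and the remaining attributes are copied. The helper `explode_rows` and the sample rows are hypothetical and not part of Spark. Rows with an empty array produce no output at all, which matches the behavior of Spark's `explode`.

```python
# Toy rows with an array column (hypothetical sample data).
rows = [
    {"created_at": "t1", "text": "hello #a #b", "hashtags_array": ["a", "b"]},
    {"created_at": "t2", "text": "no tags", "hashtags_array": []},
    {"created_at": "t3", "text": "again #a", "hashtags_array": ["a"]},
]

def explode_rows(rows, array_key, out_key):
    """Yield one record per array element, copying all other attributes."""
    for row in rows:
        for element in row[array_key]:
            new_row = {k: v for k, v in row.items() if k != array_key}
            new_row[out_key] = element
            yield new_row

exploded = list(explode_rows(rows, "hashtags_array", "hashtag"))
for row in exploded:
    print(row)
```

The tweet without hashtags disappears from the result, while the tweet with two hashtags appears twice.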

In [None]:
hashtags = data.select(
    data["created_at"],
    data["text"],
    # YOUR CODE HERE
)
hashtags.printSchema()

## 4 Count Hashtag Frequency

Our primary goal is to find the latest tweet for every hashtag. This only makes sense if individual hashtags occur more than once in our data set. So, as a pre-analysis step, let us count the frequency of each hashtag.
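
The idea behind this frequency count can be sketched in plain Python with `collections.Counter`. The list `hashtag_column` is hypothetical sample data standing in for the exploded hashtag column; in Spark the same result would come from a grouped aggregation rather than from this code.

```python
from collections import Counter

# Hypothetical values of the exploded hashtag column.
hashtag_column = ["spark", "python", "spark", "data", "spark", "python"]

# Count how often each hashtag occurs.
counts = Counter(hashtag_column)

# The most frequent hashtags first, similar to ordering by count descending.
top = counts.most_common(2)
print(top)
```

Hashtags with a count of 1 are the trivial cases, since their only tweet is automatically the latest one.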

In [None]:
result = # YOUR CODE HERE

result.orderBy(result["count"].desc()).limit(20).toPandas()

## 5 Find Latest Tweet per Hashtag

Now we want to find the latest tweet for every hashtag. This could be done using a self join, but using windows is much simpler and more natural. In addition to the latest tweet, we also want to have the count of every hashtag. We already computed these counts before, but combining both data sets would require a join. Instead, we also count using a window function.

In the first step, we simply perform the window aggregation and inspect the intermediate result.
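
Conceptually, a window aggregation keeps every input record and attaches values computed over the record's partition. The plain-Python sketch below models this for the two windows we need: a row number per hashtag, ordered by creation time descending, and a total count per hashtag. The helper `add_rank_and_count`, the sample rows and the timestamp format are assumptions made for illustration; this is not Spark code.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical exploded rows; timestamps are chosen so that string order equals time order.
rows = [
    {"hashtag": "a", "created_at": "2020-01-01 10:00:00", "text": "first a"},
    {"hashtag": "b", "created_at": "2020-01-01 10:05:00", "text": "only b"},
    {"hashtag": "a", "created_at": "2020-01-01 10:10:00", "text": "second a"},
]

def add_rank_and_count(rows):
    """Attach a per-hashtag row number (newest first) and the per-hashtag count."""
    result = []
    by_hashtag = sorted(rows, key=itemgetter("hashtag"))
    for _, group in groupby(by_hashtag, key=itemgetter("hashtag")):
        # One partition per hashtag, ordered by creation time descending.
        partition = sorted(group, key=itemgetter("created_at"), reverse=True)
        for rank, row in enumerate(partition, start=1):
            result.append({**row, "rank": rank, "count": len(partition)})
    return result

ranked = add_rank_and_count(rows)
for row in ranked:
    print(row)
```

Unlike a grouped aggregation, no records are collapsed: the output has as many rows as the input, each carrying its `rank` and `count`.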

In [None]:
# First window for finding the newest tweet per hashtag. We will use the row number within the window to select the newest tweet
rank_window = # YOUR CODE HERE

# Second window for counting the total frequency of every hashtag
count_window = # YOUR CODE HERE

ranked_hashtags = hashtags.select(
    # YOUR CODE HERE
)
ranked_hashtags.printSchema()

### Inspect result

Now let us inspect the intermediate result. Rather than viewing all records, we restrict ourselves to the non-trivial cases where there are multiple tweets for a given hashtag (i.e. `count > 1`).

Moreover, we also want to sort the result:
* First sort by count, descending. This ensures that the most commonly used hashtag comes first
* Then sort by hashtag, in case there are two hashtags with the same count
* Finally sort by rank

This sorting more or less gives us the windows concatenated into a new data frame.
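
The effect of this three-level sort can be sketched in plain Python with a tuple sort key. The records in `ranked` are hypothetical sample data, and negating the count is simply one way to get a descending order for that key.

```python
# Hypothetical ranked records, deliberately in shuffled order.
ranked = [
    {"hashtag": "b", "count": 2, "rank": 2},
    {"hashtag": "a", "count": 3, "rank": 1},
    {"hashtag": "c", "count": 1, "rank": 1},
    {"hashtag": "b", "count": 2, "rank": 1},
    {"hashtag": "a", "count": 3, "rank": 3},
    {"hashtag": "a", "count": 3, "rank": 2},
    {"hashtag": "d", "count": 2, "rank": 1},
    {"hashtag": "d", "count": 2, "rank": 2},
]

# Keep only hashtags that occur more than once.
non_trivial = [r for r in ranked if r["count"] > 1]

# Sort by count descending, then hashtag, then rank ascending.
ordered = sorted(non_trivial, key=lambda r: (-r["count"], r["hashtag"], r["rank"]))
for row in ordered:
    print(row)
```

Sorting by hashtag as the second key keeps the records of `b` and `d` apart even though both have the same count, so each window appears as one contiguous block.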

In [None]:
result = ranked_hashtags.filter(ranked_hashtags["count"] > 1) \
    .orderBy(ranked_hashtags["count"].desc(), ranked_hashtags["hashtag"], ranked_hashtags["rank"].asc())

result.limit(10).toPandas()

### Find latest Tweet

Now we only need to filter the result and select the tweets with `rank == 1`.

In [None]:
result = # YOUR CODE HERE

result.limit(10).toPandas()