Streaming Tweets into Spark
--------------------------

For this exercise, you need to use the Twitter streaming API to answer a question about what's happening on Twitter at the moment you're running your test.

There are a few steps to take before you start the actual analysis. Let's list them so we can tackle this with a plan:

1. Find a question! For example, below we try something simple: find tweets sent from New York City that contain a hashtag, then count the occurrences of each hashtag. With this we can tell which hashtag is the most popular while our server is running.

2. Before we can get started, you need to get credentials to access the Twitter API. This is easy: go to [apps.twitter.com](https://apps.twitter.com/), click on "Create New App", and follow the instructions. Once you've set up your "app," copy the following tokens: Consumer Key, Consumer Secret, Access Token, and Access Secret.
    * A note on the tokens: treat them as you would passwords. This means you should store them in a safe place that is outside of source control. This can be as simple as a text file backed up on Dropbox, or a more secure location like a secure note in LastPass. The easiest way to expose them to your application is to set them as environment variables and reference the variables in Python. You'll see this in action when we set up our server.
    
3. At this point you're ready to get started. You'll be running two containers on a small Docker network. Follow the instructions in the prior steps to get going there.

4. Before running this notebook, go to the `data_server` container (reminder: open the `bash` shell for the terminal with the command `docker attach data_server`) and review the `twitter` directory. There is a single file called `server.py`. Open it with your favorite text editor (remember, because of the way we created this file, it sits on a volume and lives locally on your machine, so you don't have to use `vi` to edit it inside the container). Note a few things:
    * If you haven't yet, from the terminal `bash` shell, create the four variables as follows: `export CONSUMER_KEY=<paste-your-key-without-quotes>`. Repeat that command for each of the keys / secrets.
    * You are authenticating to Twitter with a `requests_oauthlib` object. Using the four keys mentioned above makes this very clean and simple.
    * There is a `get_tweets` function that builds a URL and sends the query to Twitter. Take a look at the various options here. They come from the Twitter Realtime Filter API. You should modify the query to reflect the question you want to answer - check out the [API Documentation](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview) for details on parameters and usage.
    * The API returns a JSON object, which we then use to send the full text of the tweet to Spark. You're not limited to this, though - you can examine the object to see what else is there that you might want to test.
    
5. Once you're happy with the query and the information you want to analyze in Spark, you can start the server: `python3.6 server.py`. The server will wait for your notebook to initiate its connection.
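
The setup described in step 4 can be sketched roughly as follows. This is not the contents of `server.py`, only a minimal illustration under stated assumptions: the text names only `CONSUMER_KEY`, so the other three variable names are guesses, and the endpoint is assumed to be the v1.1 realtime filter that the linked documentation describes. The bounding box is an approximate one for New York City.

```python
import os
from urllib.parse import urlencode

# Only CONSUMER_KEY appears in the text; the other three names are assumed.
CREDENTIAL_NAMES = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# Assumed endpoint: the v1.1 realtime filter from the linked documentation.
FILTER_URL = "https://stream.twitter.com/1.1/statuses/filter.json"


def load_credentials(env=os.environ):
    """Return the four tokens from the environment, failing early if any is missing."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}


def build_filter_url(locations, language="en"):
    """Build a filter query for tweets inside a bounding box.

    locations is (west_longitude, south_latitude, east_longitude, north_latitude).
    """
    params = {
        "language": language,
        "locations": ",".join(str(value) for value in locations),
    }
    return FILTER_URL + "?" + urlencode(params)


# Approximate bounding box around New York City (illustrative values).
NYC_BOX = (-74, 40, -73, 41)
print(build_filter_url(NYC_BOX))
```

The four values returned by `load_credentials()` are the kind of input that `requests_oauthlib.OAuth1(client_key, client_secret, resource_owner_key, resource_owner_secret)` expects, so they could be passed straight through. Failing early on a missing variable tends to be easier to debug than an authentication error from the API.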

----------------------------------------------------------------------------------------------------------------------

Running Your Queries In Spark
---------------------------

Next, take the data from Twitter and run the analysis for the question you posed above.

In our hashtag counting example, we do the following:
1. Explode the tweet into its individual words.
2. Keep only the hashtags, then assign each tag the count 1 with the `map` method.
3. Add the current batch of hashtags to the running totals and update the counts accordingly.
4. Query the temporary table for the current top 10 hashtags and print them to the console.

You can extend this into a web dashboard, or plots inside this notebook, if you choose.
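
To see what the four steps compute before involving Spark, they can be mimicked in plain Python. This is only a sketch of the logic, not the streaming code itself: `count_batch` and `top_tags` are hypothetical helper names, a dictionary stands in for Spark's state, and the sample tweets are made up.

```python
def aggregate_tags_count(new_values, total_sum):
    # Same shape as the helper defined later in this notebook.
    return sum(new_values) + (total_sum or 0)


def count_batch(tweets, totals):
    """Apply steps 1-3 to one batch of tweets, updating `totals` in place."""
    # 1. explode each tweet into its individual words
    words = [word for tweet in tweets for word in tweet.split(" ")]
    # 2. keep only the hashtags and pair each one with the count 1
    pairs = [(word, 1) for word in words if "#" in word]
    # 3. group the new 1s per tag, then fold them into the running total
    new_values = {}
    for tag, one in pairs:
        new_values.setdefault(tag, []).append(one)
    for tag, ones in new_values.items():
        totals[tag] = aggregate_tags_count(ones, totals.get(tag))
    return totals


def top_tags(totals, limit=10):
    # 4. highest counts first, limited to the top entries
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


totals = {}
count_batch(["rain again #nyc", "coffee #nyc #monday"], totals)  # first batch
count_batch(["late train #nyc"], totals)                         # second batch
print(top_tags(totals))  # [('#nyc', 3), ('#monday', 1)]
```

Each call to `count_batch` plays the role of one streaming interval: tags seen before keep their earlier total, and tags that are new start from `None`, which is why the helper uses `total_sum or 0`.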

----------------------------------------------------------------------------------------------------------------------

#### Import the relevant modules

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession
import sys

#### Set some constants, initialize Spark, and then open the socket with the remote host.
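
For context, `socketTextStream` reads UTF-8 text from a TCP socket and treats every newline as the end of one record. The other end of that connection could look roughly like the sketch below. It is not the actual `server.py`: `open_listener` and `send_lines` are hypothetical names, and a list of strings stands in for the live tweet stream.

```python
import socket


def open_listener(host="0.0.0.0", port=9009):
    """Bind a TCP socket and start listening for the notebook's connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def send_lines(listener, texts):
    """Wait for one client, then send each text as a single newline-ended line."""
    conn, _address = listener.accept()  # blocks until the notebook connects
    with conn:
        for text in texts:
            # Collapse internal newlines so one tweet stays one record.
            one_line = " ".join(text.split())
            conn.sendall((one_line + "\n").encode("utf-8"))
```

In a real server the `texts` argument would be fed by the API response rather than a list. Collapsing whitespace matters because a tweet that contains a line break would otherwise arrive in Spark as two separate records.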

In [None]:
TCP_REMOTE_HOST = "data_server"
TCP_PORT = 9009

# create spark configuration
conf = SparkConf()
conf.setAppName("TwitterStreamApp")

# create spark context with the above configuration
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")

# create the Streaming Context from the above spark context with interval size 2 seconds
ssc = StreamingContext(sc, 2)

# setting a checkpoint to allow RDD recovery
ssc.checkpoint("checkpoint_TwitterApp")

# read data from port 9009
dataStream = ssc.socketTextStream(TCP_REMOTE_HOST, TCP_PORT)

#### We now create some helper functions to allow Spark to maintain our running count.

In [None]:
def aggregate_tags_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)

def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

def process_rdd(time, rdd):
    print(f"----------- {str(time)} -----------")
    try:
        # Get spark sql singleton context from the current context
        spark = getSparkSessionInstance(rdd.context.getConf())
        
        # convert the RDD to Row RDD
        row_rdd = rdd.map(lambda w: Row(hashtag=w[0], hashtag_count=w[1]))
        
        # create a DF from the Row RDD
        hashtags_df = spark.createDataFrame(row_rdd)
        
        # Register the DataFrame as a temporary view
        # (createOrReplaceTempView replaces the deprecated registerTempTable)
        hashtags_df.createOrReplaceTempView("hashtags")
        
        # get the top 10 hashtags from the table using SQL and print them
        hashtag_counts_df = spark.sql(
            "select hashtag, hashtag_count from hashtags order by hashtag_count desc limit 10")
        hashtag_counts_df.show()

    except Exception as e:
        # report the problem and keep the stream running
        print(f"Error: {e}")

#### Finally, we assign our primary workflow that will utilize the above functions.

After that's complete, we begin the streaming with `ssc.start()`. The query stays open until we terminate it (`ssc.awaitTermination()`).

We can end the streaming by going to the `data_server` container and pressing `Ctrl-C`. Recall that you can detach from the container shell with the `Ctrl-P, Ctrl-Q` sequence.

If you are using tweets from a geofenced area (as we are here), let the stream run for a while so it builds up enough data to be useful.
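
If you would rather bound the run time from the notebook than stop the server by hand, one option is to wait with a timeout and then stop the stream. The helper below is a sketch, not part of the original workflow: `run_for` is a hypothetical name, and it only relies on the `start`, `awaitTerminationOrTimeout`, and `stop` methods of `StreamingContext`.

```python
def run_for(ssc, seconds):
    """Start a StreamingContext and stop it if it is still running after `seconds`."""
    ssc.start()
    # True if the context stopped on its own, False if the timeout elapsed first
    stopped = ssc.awaitTerminationOrTimeout(seconds)
    if not stopped:
        # Keep the SparkContext alive and let in-flight batches finish.
        ssc.stop(stopSparkContext=False, stopGraceFully=True)
    return stopped
```

A call such as `run_for(ssc, 600)` would take the place of the `ssc.start()` and `ssc.awaitTermination()` calls in the cell below. Note that a stopped `StreamingContext` cannot be started again, so a new one has to be created for another run.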

In [None]:
# split each tweet into words
words = dataStream.flatMap(lambda line: line.split(" "))

# filter the words to get only hashtags, then map each hashtag to be a pair of (hashtag,1)
hashtags = words.filter(lambda w: '#' in w).map(lambda x: (x, 1))

# adding the count of each hashtag to its last count
tags_totals = hashtags.updateStateByKey(aggregate_tags_count)

# do processing for each RDD generated in each interval
tags_totals.foreachRDD(process_rdd)

# start the streaming computation
ssc.start()

# wait for the streaming to finish
ssc.awaitTermination()