# Gain Actionable Insights from Twitter Data

In this capstone project, I will use Structured Streaming to gain insight from streaming Twitter data.

The executive team would like to have access to some key business metrics, such as:
* the 10 most-tweeted hashtags in the last 5-minute window
* a map of where tweets are coming from

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Getting Started</h2>

In [0]:
%run "./Includes/Classroom-Setup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 1</h2>
<h3>Read Streaming Data from Input Source</h3>

The input source is a Kafka feed of Twitter data.

For this step you will need to:
0. Use the `format()` operation to specify "kafka" as the type of the stream
0. Specify the location of the Kafka server by setting the option "kafka.bootstrap.servers" with one of the following values (depending on where you are located): 
 * **server1.databricks.training:9092** (US-Oregon)
 * **server2.databricks.training:9092** (Singapore)
0. Indicate which topics to listen to by setting the option "subscribe" to "tweets"
0. Throttle Kafka's processing of the streams
0. Rewind the stream to the beginning when the notebook is restarted
0. Load the input data stream in as a DataFrame
0. Select the column `value` - cast it to a `STRING`

In [0]:
# TODO
from pyspark.sql.functions import col

spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

# kafkaServer = "server1.databricks.training:9092"   # US (Oregon)
kafkaServer = "server2.databricks.training:9092" # Singapore

# Careful with option names: a misspelling such as "maxOffsetPerTrigger" (missing "s") will make the job fail
rawDF = (spark.readStream
 .format("kafka")                                                   # Specify "kafka" as the type of the stream
 .option("kafka.bootstrap.servers", kafkaServer)                    # Set the location of the kafka server
 .option("subscribe", "tweets")                                     # Indicate which topics to listen to
 .option("maxOffsetsPerTrigger", 1000)                              # Throttle Kafka's processing of the streams
 .option("startingOffsets", "earliest")                             # Rewind stream to beginning when we restart notebook
 .load()                                                            # Load the input data stream in as a DataFrame
 .select(col("value").cast("STRING"))                               # Select the "value" column and cast to a string
)

display(rawDF, streamName = "rawStream")

value
"{""hashTags"":[],""text"":""i mean i know there’s gonna be some heavy drama involved but i still started this drama knowing it will hurt me deeply 😭😭😭😭😭"",""id"":1398893770188345344,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""andie⁷ 🐯""}"
"{""hashTags"":[],""text"":""Green tea 🤢"",""id"":1398893770196795392,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""Dlomo elihle""}"
"{""hashTags"":[],""text"":""@S4R4KIYOT thank you pretty"",""id"":1398893770184159239,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""moon suspended 🌙""}"
"{""hashTags"":[],""text"":""I MEAN i like my friends rn ... it’s not too deep im just commenting on the dynamic"",""id"":1398893770188361733,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""manic picky dweam girlboss bestie""}"
"{""hashTags"":[],""text"":""ok me my whole life"",""id"":1398893770209333248,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""⁴ˣ⁴""}"
"{""hashTags"":[],""text"":""Congratulations on having this pic featured in Your pictures of Scotland on @BBCScotlandNews this week!"",""id"":1398893770192609280,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""artycrafty""}"
"{""hashTags"":[],""text"":""Sleepy Tina is sleepy.\n\nWe successfully built 2 PCs in 12h.\nMy brain is fried!\nSuper exhausted but just as full of… https://t.co/AfE1trBSxh"",""id"":1398893770184265728,""createdAt"":1622357239000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""ReadyToTina🍝""}"
"{""hashTags"":[],""text"":""RT @TheCryptoLark: Be careful of those in panic, like a drowning man they will drown you too. Calm your mind and think rationaly and you wi…"",""id"":1398893774386778117,""createdAt"":1622357240000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""#Dogecoin India 🇮🇳""}"
"{""hashTags"":[],""text"":""RT @Cashbro303: $100 to 1 follower in 24hrs! who retweets and follows @ViruxToken ❤\n\nVirux is a Charity token, launched under 24hrs ago, fi…"",""id"":1398893774399442955,""createdAt"":1622357240000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""मारिया ♡""}"
"{""hashTags"":[],""text"":""RT @neerajarora91: My Brother Vikas Wadhwa is fighting Black fungus. \nHis one eye has already been removed. \nIt's been three days since he…"",""id"":1398893774412017674,""createdAt"":1622357240000,""place"":{""coordinates"":[],""name"":null,""placeType"":null,""fullName"":null,""countryCode"":null},""retweetCount"":0,""lang"":""en"",""favoriteCount"":0,""user"":""jishnu""}"


In [0]:
# TEST - Run this cell to test schema.
schemaStr = str(rawDF.schema)

dbTest("SS-06-schema-value",     True, "(value,StringType,true)" in schemaStr)
dbTest("SS-06-is-streaming",     True, rawDF.isStreaming)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 2</h2>
<h3>A Schema for parsing JSON</h3>

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, ArrayType

twitSchema = StructType([
  StructField("hashTags", ArrayType(StringType(), False), True),
  StructField("text", StringType(), True),   
  StructField("userScreenName", StringType(), True),
  StructField("id", LongType(), True),
  StructField("createdAt", LongType(), True),
  StructField("retweetCount", IntegerType(), True),
  StructField("lang", StringType(), True),
  StructField("favoriteCount", IntegerType(), True),
  StructField("user", StringType(), True),
  StructField("place", StructType([
    StructField("coordinates", StringType(), True), 
    StructField("name", StringType(), True),
    StructField("placeType", StringType(), True),
    StructField("fullName", StringType(), True),
    StructField("countryCode", StringType(), True)]), 
  True)
])
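
A quick way to sanity-check a hand-written schema is to compare its top-level field names with a sample payload. The sketch below does this in plain Python, outside Spark. The `schema_fields` list mirrors `twitSchema`, and the record is one of the values shown in Step 1 with the user name replaced by a placeholder. Fields that the schema names but the payload lacks come back as null from `from_json`.

```python
import json

# Top-level field names, copied from twitSchema.
schema_fields = [
    "hashTags", "text", "userScreenName", "id", "createdAt",
    "retweetCount", "lang", "favoriteCount", "user", "place",
]

# One raw "value" from Step 1 (user name replaced by a placeholder).
sample_value = (
    '{"hashTags":[],"text":"ok me my whole life","id":1398893770209333248,'
    '"createdAt":1622357239000,"place":{"coordinates":[],"name":null,'
    '"placeType":null,"fullName":null,"countryCode":null},"retweetCount":0,'
    '"lang":"en","favoriteCount":0,"user":"example_user"}'
)
record = json.loads(sample_value)

# Fields the schema expects but the payload does not carry.
missing_from_payload = [name for name in schema_fields if name not in record]
# Fields the payload carries but the schema would silently drop.
extra_in_payload = [name for name in record if name not in schema_fields]

print("missing:", missing_from_payload)
print("extra:", extra_in_payload)
```

Here only `userScreenName` is absent from the payload, which is consistent with the empty `userScreenName` column in the Step 4 output.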

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 3</h2>
<h3>Create a JSON DataFrame</h3>

Parse the JSON subfields out of `rawDF` using `from_json`. Create a DataFrame that has:
* a `time` field, derived from `createdAt`
* `json`, a nested field that holds all the rest of the data
* every `json` subfield promoted to a top-level field

In [0]:
# TODO
from pyspark.sql.functions import from_json, expr

cleanDF = (rawDF
  .withColumn("json", from_json(col("value"), twitSchema))                      # Add the column "json" by parsing the column "value" with "from_json"
  .select(
    expr("cast(cast(json.createdAt as double)/1000 as timestamp) as time"),     # Cast "createdAt" column properly, call it "time"
    col("json.hashTags").alias("hashTags"),                                     # Promote subfields of "json" column e.g. "json.field" to "field"
    col("json.text").alias("text"),                                             # Repeat for each subfield of "json"
    col("json.userScreenName").alias("userScreenName"),
    col("json.id").alias("id"), 
    col("json.retweetCount").alias("retweetCount"),
    col("json.lang").alias("lang"),
    col("json.favoriteCount").alias("favoriteCount"),
    col("json.user").alias("user"),
    col("json.place.coordinates").alias("coordinates"),
    col("json.place.name").alias("name"),
    col("json.place.placeType").alias("placeType"),
    col("json.place.fullName").alias("fullName"),
    col("json.place.countryCode").alias("countryCode")   
  )
)
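
To make the transformation concrete, the following plain-Python sketch applies the same steps to a single record outside Spark: parse the JSON string, convert `createdAt` from epoch milliseconds to a timestamp, and promote the nested `place` fields. The helper name `flatten_tweet` and the shortened sample record are illustrative assumptions, not part of the course material.

```python
import json
from datetime import datetime, timezone

# Shortened sample record in the same shape as the Kafka "value" column.
raw_value = json.dumps({
    "hashTags": [],
    "text": "Green tea",
    "id": 1398893770196795392,
    "createdAt": 1622357239000,
    "place": {"coordinates": [], "name": None, "placeType": None,
              "fullName": None, "countryCode": None},
    "retweetCount": 0,
    "lang": "en",
    "favoriteCount": 0,
    "user": "Dlomo elihle",
})

def flatten_tweet(value):
    """Hypothetical helper: mimic from_json plus promotion of subfields."""
    record = json.loads(value)
    place = record.get("place") or {}
    # Keep every top-level field except the nested struct and the raw epoch.
    flat = {k: v for k, v in record.items() if k not in ("place", "createdAt")}
    # createdAt is in epoch milliseconds; divide by 1000 to get seconds.
    flat["time"] = datetime.fromtimestamp(record["createdAt"] / 1000, tz=timezone.utc)
    # Promote coordinates, name, placeType, fullName, countryCode.
    flat.update(place)
    return flat

row = flatten_tweet(raw_value)
print(row["time"].isoformat())
```

The printed time should agree with the `time` column that Step 4 displays for the same `createdAt` value.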

In [0]:
# TEST - Run this cell to test schema.
schemaStr = str(cleanDF.schema)

dbTest("SS-06-schema-hashTag",  True, "hashTags,ArrayType(StringType,true)" in schemaStr)
dbTest("SS-06-schema-text",  True, "(text,StringType,true)" in schemaStr)
dbTest("SS-06-schema-userScreenName",  True, "(userScreenName,StringType,true)" in schemaStr)
dbTest("SS-06-schema-id",  True, "(id,LongType,true)" in schemaStr)
dbTest("SS-06-schema-time",  True, "(time,TimestampType,true)" in schemaStr)
dbTest("SS-06-schema-retweetCount",  True, "(retweetCount,IntegerType,true)" in schemaStr)
dbTest("SS-06-schema-lang",  True, "(lang,StringType,true)" in schemaStr)
dbTest("SS-06-schema-favoriteCount",  True, "(favoriteCount,IntegerType,true)" in schemaStr)
dbTest("SS-06-schema-user",  True, "(user,StringType,true)" in schemaStr)
dbTest("SS-06-schema-coordinates",  True, "(coordinates,StringType,true)" in schemaStr)
dbTest("SS-06-schema-name",  True, "(name,StringType,true)" in schemaStr)
dbTest("SS-06-schema-placeType",  True, "(placeType,StringType,true)" in schemaStr)
dbTest("SS-06-schema-fullName",  True, "(fullName,StringType,true)" in schemaStr)
dbTest("SS-06-schema-countryCode",  True, "(countryCode,StringType,true)" in schemaStr)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 4</h2>
<h3>Display Twitter Data as a Table</h3>

In [0]:
# TODO
display(cleanDF, streamName = "cleanStream")  # display "cleanDF"

time,hashTags,text,userScreenName,id,retweetCount,lang,favoriteCount,user,coordinates,name,placeType,fullName,countryCode
2021-05-30T06:47:19.000+0000,List(),i mean i know there’s gonna be some heavy drama involved but i still started this drama knowing it will hurt me deeply 😭😭😭😭😭,,1398893770188345344,0,en,0,andie⁷ 🐯,[],,,,
2021-05-30T06:47:19.000+0000,List(),Green tea 🤢,,1398893770196795392,0,en,0,Dlomo elihle,[],,,,
2021-05-30T06:47:19.000+0000,List(),@S4R4KIYOT thank you pretty,,1398893770184159239,0,en,0,moon suspended 🌙,[],,,,
2021-05-30T06:47:19.000+0000,List(),I MEAN i like my friends rn ... it’s not too deep im just commenting on the dynamic,,1398893770188361733,0,en,0,manic picky dweam girlboss bestie,[],,,,
2021-05-30T06:47:19.000+0000,List(),ok me my whole life,,1398893770209333248,0,en,0,⁴ˣ⁴,[],,,,
2021-05-30T06:47:19.000+0000,List(),Congratulations on having this pic featured in Your pictures of Scotland on @BBCScotlandNews this week!,,1398893770192609280,0,en,0,artycrafty,[],,,,
2021-05-30T06:47:19.000+0000,List(),Sleepy Tina is sleepy. We successfully built 2 PCs in 12h. My brain is fried! Super exhausted but just as full of… https://t.co/AfE1trBSxh,,1398893770184265728,0,en,0,ReadyToTina🍝,[],,,,
2021-05-30T06:47:20.000+0000,List(),"RT @TheCryptoLark: Be careful of those in panic, like a drowning man they will drown you too. Calm your mind and think rationaly and you wi…",,1398893774386778117,0,en,0,#Dogecoin India 🇮🇳,[],,,,
2021-05-30T06:47:20.000+0000,List(),"RT @Cashbro303: $100 to 1 follower in 24hrs! who retweets and follows @ViruxToken ❤ Virux is a Charity token, launched under 24hrs ago, fi…",,1398893774399442955,0,en,0,मारिया ♡,[],,,,
2021-05-30T06:47:20.000+0000,List(),RT @neerajarora91: My Brother Vikas Wadhwa is fighting Black fungus. His one eye has already been removed. It's been three days since he…,,1398893774412017674,0,en,0,jishnu,[],,,,


In [0]:
# TEST - Run this cell to check whether stream is running.
dbTest("SS-06-numActiveStreams", True, len(spark.streams.active) > 0)
       
print("Tests passed!")

Stop the stream:

In [0]:
# TODO
for streamingQuery in spark.streams.active:
  print("stopping: ", streamingQuery.name)
  streamingQuery.stop()

In [0]:
# TEST - Run this cell to test whether the streams have stopped.
dbTest("SS-06-numActiveStreams1", 0, len(spark.streams.active))

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 5</h2>
<h3>Hashtag Processing</h3>

In this exercise, we do ETL processing on the `hashTags` column.

The goal is to convert all hashtags to lower case, then group the tweets by hashtag and count them.

Note that `hashTags` is an array of hashtags, which you will have to break up using the `explode` function.

The `explode` method allows you to split an array column into multiple rows, copying all the other columns into each new row.
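
As a rough illustration of these semantics outside Spark, this plain-Python sketch turns each element of an array field into its own row while copying the other fields. The function name `explode_rows` and the sample rows are made up for the example. Like Spark's `explode`, it produces no rows for an empty array.

```python
def explode_rows(rows, array_key, new_key):
    """Hypothetical stand-in for explode: one output row per array element."""
    exploded = []
    for row in rows:
        for element in row[array_key]:
            new_row = dict(row)         # copy all existing columns
            new_row[new_key] = element  # add the single exploded value
            exploded.append(new_row)
    return exploded

# Invented sample tweets; the second one has no hashtags at all.
tweets = [
    {"id": 1, "hashTags": ["NCTDream", "KPop"]},
    {"id": 2, "hashTags": []},
    {"id": 3, "hashTags": ["nctdream"]},
]

rows = explode_rows(tweets, "hashTags", "hashTag")
for r in rows:
    print(r["id"], r["hashTag"].lower())
```

Tweet 2 disappears from the result because its array is empty, so tweets without hashtags do not contribute to the counts.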

In [0]:
# TODO
from pyspark.sql.functions import window, explode, lower

# groupBy + count keeps only the grouping columns and the count column
twitCountsDF = (cleanDF                                           # Start with "cleanDF"
  .withColumn("hashTag", explode("hashTags"))                     # Explode the array "hashTags" into "hashTag"
  .withColumn("hashTag", lower(col("hashTag")))                   # Convert "hashTag" to lower case
  .groupBy(col("hashTag"), window(col("time"), "5 minutes"))      # Aggregate by "hashTag" and window 5 minutes
  .count()                                                        # For the aggregate, produce a count  
  .select(col("window.start").alias("start"),                     # Elevate field to column
          col("hashTag"),                                         # Include hashTag
          col("count"))                                           # Include count
  .orderBy(col("start").desc(), col("count").desc())              # Sort by "start", then "count", both descending
  .limit(10)                                                      # Limit the result to 10 records
)
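
The windowed aggregation can be sketched in plain Python as follows. Each event time is floored to the start of its 5-minute tumbling window, and occurrences are counted per (window start, hashtag) pair. The names `window_start` and `top_hashtags` and the sample events are assumptions for illustration; Spark's `window` function additionally tracks the window end and keeps state across micro-batches.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(ts, size=WINDOW):
    """Floor a timezone-aware timestamp to the start of its tumbling window."""
    return ts - ((ts - EPOCH) % size)

def top_hashtags(events, limit=10):
    """Count lower-cased hashtags per window; newest window and largest counts first."""
    counts = Counter((window_start(ts), tag.lower()) for ts, tag in events)
    ordered = sorted(counts.items(), key=lambda kv: (kv[0][0], kv[1]), reverse=True)
    return [(start, tag, n) for (start, tag), n in ordered[:limit]]

# Invented (event time, hashtag) pairs.
events = [
    (datetime(2021, 5, 30, 9, 51, 10, tzinfo=timezone.utc), "NCTDream"),
    (datetime(2021, 5, 30, 9, 52, 45, tzinfo=timezone.utc), "nctdream"),
    (datetime(2021, 5, 30, 9, 54, 59, tzinfo=timezone.utc), "OklahomaCity"),
    (datetime(2021, 5, 30, 9, 47, 0, tzinfo=timezone.utc), "nctdream"),
]

result = top_hashtags(events)
for start, tag, n in result:
    print(start.isoformat(), tag, n)
```

Because the sort key puts the window start first, the 10-row limit applies to the overall ordering, not to each window separately.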

In [0]:
# TEST - Run this cell to test schema.
schemaStr = str(twitCountsDF.schema)

dbTest("SS-06-schema-hashTag", True, "(hashTag,StringType,true)" in schemaStr)
dbTest("SS-06-schema-count",   True, "(count,LongType,false)" in schemaStr)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 6</h2>
<h3>Plot Counts of Top 10 Most Popular Hashtags</h3>

Under <b>Plot Options</b>, use the following:
* <b>Keys:</b> `hashTag`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>Pie Chart</b> and click <b>Apply</b>.

Once you apply the plot options, be prepared to increase the size of the plot graphic using the resize widget in the lower right corner of the graphic area.

In [0]:
# TODO
display(twitCountsDF, streamName = "twitCountStream") # display twitCountsDF

start,hashTag,count
2021-05-30T09:50:00.000+0000,reservation_scam_up69k,6
2021-05-30T09:50:00.000+0000,thebestleader2021,2
2021-05-30T09:50:00.000+0000,westandwithstalin,2
2021-05-30T09:50:00.000+0000,shopeegiveawayalbumkpop,2
2021-05-30T09:50:00.000+0000,whatshappeninginmyanmar,2
2021-05-30T09:50:00.000+0000,nctdream,2
2021-05-30T09:50:00.000+0000,jfcjustinbieber,2
2021-05-30T09:50:00.000+0000,इसलिए_nrc_चाहिए,2
2021-05-30T09:50:00.000+0000,dutchmillxbrightwin,1
2021-05-30T09:50:00.000+0000,oklahomacity,1


In [0]:
# TEST - Run this cell to check whether stream is running.
dbTest("SS-06-numActiveStreams", True, len(spark.streams.active) > 0)
       
print("Tests passed!")

When you are done, stop the stream:

In [0]:
for streamingQuery in spark.streams.active:
  streamingQuery.stop()

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 7</h2>
<h3>Read in File with Two Letter to Three Letter Country Codes</h3>

For this next part, we are going to look at the number of tweets per country.

To get started, we first need a lookup table that will give us the 3-character country code.

0. Read in the file at `/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet`
0. We will be interested in the `alpha2Code` and `alpha3Code` fields later

In [0]:
# TODO
parquetFile = "/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.parquet"
countryCodeDF = spark.read.parquet(parquetFile)
display(countryCodeDF)

EnglishShortName,alpha2Code,alpha3Code,numericCode,ISO31662SubdivisionCode,independentTerritory
Afghanistan,AF,AFG,4,ISO 3166-2:AF,Yes
Åland Islands,AX,ALA,248,ISO 3166-2:AX,No
Albania,AL,ALB,8,ISO 3166-2:AL,Yes
Algeria,DZ,DZA,12,ISO 3166-2:DZ,Yes
American Samoa,AS,ASM,16,ISO 3166-2:AS,No
Andorra,AD,AND,20,ISO 3166-2:AD,Yes
Angola,AO,AGO,24,ISO 3166-2:AO,Yes
Anguilla,AI,AIA,660,ISO 3166-2:AI,No
Antarctica,AQ,ATA,10,ISO 3166-2:AQ,No
Antigua and Barbuda,AG,ATG,28,ISO 3166-2:AG,Yes


In [0]:
# TEST - Run this cell to test your solution.
schemaStr = str(countryCodeDF.schema)

dbTest("SS-06-schema-1", True, "(EnglishShortName,StringType,true)" in schemaStr)
dbTest("SS-06-schema-2", True, "(alpha2Code,StringType,true)" in schemaStr)
dbTest("SS-06-schema-3", True, "(alpha3Code,StringType,true)" in schemaStr)
dbTest("SS-06-schema-4", True, "(numericCode,StringType,true)" in schemaStr)
dbTest("SS-06-schema-5", True, "(ISO31662SubdivisionCode,StringType,true)" in schemaStr)
dbTest("SS-06-schema-6", True, "(independentTerritory,StringType,true)" in schemaStr)

dbTest("SS-06-streaming-7", False, countryCodeDF.isStreaming)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 8</h2>
<h3>Join Tables &amp; Aggregate By Country</h3>

In `cleanDF`, there is a `countryCode` field. However, it is in the form of a two-letter country code.

The `display` map expects a three-letter country code.

To retrieve tweets with three-letter country codes, we have to join `cleanDF` with `countryCodeDF`.

In [0]:
# TODO
mappedDF = (cleanDF
  .filter(col("countryCode").isNotNull())                                        # Filter out any nulls for "countryCode"
  .join(countryCodeDF, cleanDF["countryCode"] == countryCodeDF["alpha2Code"])    # Join the two tables on "countryCode" and "alpha2Code"
  .groupBy(countryCodeDF["alpha3Code"])                                          # Aggregate by country, "alpha3Code"
  .count()                                                                       # Produce a count of each aggregate
)
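
Conceptually, the inner join acts as a lookup from two-letter to three-letter codes, followed by a count per country. A minimal plain-Python sketch, using three rows copied from the lookup table output in Step 7 and invented tweet country codes:

```python
from collections import Counter

# Three rows from the ISO lookup table: alpha2Code -> alpha3Code.
alpha2_to_alpha3 = {"AF": "AFG", "AL": "ALB", "DZ": "DZA"}

# Invented countryCode values for five tweets; None means no place information,
# and "XX" stands for a code that has no match in the lookup table.
tweet_country_codes = ["AF", "DZ", None, "AF", "XX"]

# Inner-join semantics: rows with a null or unmatched code are dropped.
counts = Counter(
    alpha2_to_alpha3[code]
    for code in tweet_country_codes
    if code is not None and code in alpha2_to_alpha3
)
print(dict(counts))
```

An inner join on equality never matches null keys, so the explicit null filter in the cell above mainly documents intent and trims rows before the join.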

In [0]:
# TEST - Run this cell to test your solution.
schemaStr = str(mappedDF.schema)
print(schemaStr)

dbTest("SS-06-schema-1",  True, "alpha3Code,StringType,true" in schemaStr)
dbTest("SS-06-schema-2",  True, "count,LongType,false" in schemaStr)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 9</h2>
<h3>Plot Tweet Counts on a World Map</h3>

Under <b>Plot Options</b>, use the following:
* <b>Keys:</b> `alpha3Code`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>World map</b> and click <b>Apply</b>.

<img src="https://files.training.databricks.com/images/eLearning/Structured-Streaming/plot-options-map-06.png"/>

In [0]:
# TODO
display(mappedDF, streamName = "mappedStream")  # display mappedDF

alpha3Code,count
IND,99
CHL,1
MWI,2
BHR,1
URY,1
NZL,12
KEN,9
TUR,7
PRT,1
LKA,3


In [0]:
# TEST - Run this cell to test your solution.
dbTest("SS-06-numActiveStreams", True, len(spark.streams.active) > 0)
       
print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 10: Write Stream</h2>

Write the stream to an in-memory table:
0. Use the appropriate `format`
0. For this exercise, the query aggregates without a watermark, so use the `complete` output mode, which rewrites the whole results table on each trigger
0. Give the query a name
0. Start the query
0. Assign the query to `mappedQuery`

In [0]:
# TODO
mappedQuery = (mappedDF       
.writeStream                                                       # From the DataFrame get the DataStreamWriter
.format("memory")                                                  # Specify the sink format as "memory"
.outputMode("complete")                                            # Configure the output mode as "complete"
.queryName("mappedTablePython")                                   # Name the query "mappedTablePython"
.start()                                                           # Start the query
)
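
In `complete` output mode, the sink receives the entire aggregated result table after every trigger, rather than only the new rows. The following plain-Python simulation is a loose analogy, not Spark's implementation; `run_complete_mode` and the sample micro-batches are hypothetical.

```python
from collections import Counter

def run_complete_mode(micro_batches):
    """Yield the full aggregated result table after each micro-batch."""
    state = Counter()
    for batch in micro_batches:
        state.update(batch)   # fold the new rows into the running counts
        yield dict(state)     # emit the whole table, not just the changes

# Invented micro-batches of alpha3Code values.
batches = [["IND", "NZL"], ["IND"], ["KEN", "IND"]]

snapshots = list(run_complete_mode(batches))
for i, table in enumerate(snapshots):
    print("trigger", i, table)
```

Each snapshot stands for the full contents of the in-memory table after a trigger, which is why a later `SELECT` sees cumulative counts.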

In [0]:
# TEST  - Run this cell to test your solution.
dbTest("SS-06-isActive", True, mappedQuery.isActive)
dbTest("SS-06-name", "mappedTablePython", mappedQuery.name)

print("Tests passed!")

Wait until the stream is done initializing.

In [0]:
untilStreamIsReady("mappedTablePython")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 11: Use SQL Syntax to Display a Few Rows</h2>

Do a basic SQL query to display all columns and, say, 10 rows.

In [0]:
%sql
--TODO 
SELECT *
FROM mappedTablePython
LIMIT 10

alpha3Code,count
IND,14
NZL,1
KEN,1
LKA,1
AUS,6
NIC,1
NGA,6
SAU,1
PHL,2
BWA,1


In [0]:
# TEST - Run this cell to test your solution.
try: tableExists = (spark.table("mappedTablePython") is not None)
except: tableExists = False
dbTest("SS-06-1", True, tableExists)  

firstRowCol = spark.sql("SELECT * FROM mappedTablePython limit 1").first()[0]
dbTest("SS-06-rowsExist", True, len(firstRowCol) > 0) 

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Step 12: Stop Streaming Jobs</h2>

Before we can conclude, we need to shut down all active streams.

In [0]:
# TODO
for streamingQuery in spark.streams.active:
  streamingQuery.stop()

In [0]:
# TEST - Run this cell to test your solution.
dbTest("SS-06-numActiveStreams", 0, len(spark.streams.active))

print("Tests passed!")

Congratulations: ALL DONE!!