<h1>Twitter Streaming Assignment</h1>
In this assignment, you need to calculate the average sentiment of selected tweets and draw a dynamic graph that shows how this average sentiment is changing over time. Roughly, you need to:

<li>create a twitter stream listener</li>
<li>collect tweets in batches</li>
<li>get the sentiment associated with each tweet</li>
<li>create windows on the stream</li>
<li>calculate the average sentiment within each window</li>
<li>create a dynamic graph that updates every x seconds with the window time on the x axis and the average sentiment in the window on the y-axis</li>

<h2>The graph</h2>
<li>Note that the graph will pop up in a separate window</li>
<li>Which may be behind your browser - so look for it!</li>


<h2>Resources required</h2>
<li><span style="color:blue">streaming twitter</span>: based on the java twitter library <span style="color:blue">twitter4j</span>, this library provides Spark streaming support for twitter</li>
<li><span style="color:blue">JFree Chart</span>: a java library for drawing charts (<a href="https://www.jfree.org/jfreechart/javadoc/index.html">https://www.jfree.org/jfreechart/javadoc/index.html</a></li>
<li><b>Note:</b> I've given examples of what you need from these libraries below and all you'll need to bring to the assignment is your knowledge of Scala and Spark</li>

<h2>Twitter developer account</h2>
<li>Create a twitter developer account by following the instructions at https://developer.twitter.com/en/support/twitter-api/developer-account</li>
<li>Note that you will first need to create a twitter user account</li>

<h2>World cup sentiment</h2>
<li>We'll analyze the changes in sentiment for the ongoing 2022 Football World Cup in Qatar</li>
<li>The basic process:</li>
<ul>
    <li>Save your twitter keys in the various key variables and convert them to Java properties (see below)</li>
    <li>Read in a list of words that will help identify if a tweet is about the world cup (file: fifa2022_words.txt)</li>
    <li>Read in a list of (word, sentiment) pairs (file AFINN-111.txt). Sentiment scores for words vary from -3 to +3</li>
    <li>Initiate a twitter stream</li>
    <li>Filter the stream to include only english tweets</li>
    <li>Extract the text from each tweet and see if ANY of the words in the filter list are present in the tweet. If even one of the words is present, keep the tweet. Otherwise discard it</li>
    <li>Convert each tweet into an array of words and then each rdd into a single array of words (i.e., use flatMap for converting an rdd into words)
    <li>Join each RDD with the sentiment scores. You'll need to count the total number of words, and for the words that have a sentiment, multiply the count by the sentiment. Then, add up all the resulting values and divide by the total number of words</li>
    <li>Accumulate each sentiment in a mutable Array Buffer</li>
    <li>Follow the steps to draw and update a chart as the sentiment changes</li>
    

<h3>Notes on the chart</h3>
<li>We're using an XYLineChart from JFreeChart</li>
<li>Both the x and the y axes are numbers</li>
<li>In our case, the x-axis is Int (unix time values) and the y-axis is Double (sentiment averages)</li>
<li>We'll construct a window on our stream and calculate the  sentiment in that window. The time stamp for the window is the x-axis value and the sentiment is the y-axis value</li>
<li>x and y values are stored in an XYSeries object (https://www.jfree.org/jfreechart/api/javadoc/org/jfree/data/xy/XYSeries.html) and, as each new value arrives, the addOrUpdate function updates the graph with the new value</li>

<h2>Installing packages</h2>
<li>The two packages you need to install are spark-streaming-twitter_2.12:2.4.0 and org.jfree:jfreechart:1.5.3</li>
<li>Run the next cell before you run any code to initialize Spark</li>

In [1]:
%%init_spark
launcher.num_executors = 4
launcher.executor_cores = 2
launcher.driver_memory = '10g'
launcher.packages= ["org.apache.bahir:spark-streaming-twitter_2.12:2.4.0",
                   "org.jfree:jfreechart:1.5.3"]



In [2]:
sc.setLogLevel("ERROR")

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.149:4040
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1670384061109)
SparkSession available as 'spark'


<h2>Outer join of two rdds</h2>
<li>You're going to need this. The next cell gives an example</li>
<li>For joins to work on rdds, both rdds must be Paired RDDs</li>
<li>Note that an RDD of (String, (Option[Int],Option[Int])) is returned. Use match and case to remove the option for each possible combination</li>

In [3]:
import org.apache.spark.rdd.PairRDDFunctions._
val rdd2 = sc.parallelize(Array(("this",1),("is",1),("world",1),("good",1)))
val rdd1 = sc.parallelize(Array(("good",2),("is",-1),("not",2)))
val outer_joined_rdd = rdd1.fullOuterJoin(rdd2)
outer_joined_rdd.collect


import org.apache.spark.rdd.PairRDDFunctions._
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:25
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:26
outer_joined_rdd: org.apache.spark.rdd.RDD[(String, (Option[Int], Option[Int]))] = MapPartitionsRDD[4] at fullOuterJoin at <console>:27
res1: Array[(String, (Option[Int], Option[Int]))] = Array((is,(Some(-1),Some(1))), (world,(None,Some(1))), (not,(Some(2),None)), (good,(Some(2),Some(1))), (this,(None,Some(1))))


<h2>Checking to see if ANY of the strings in an array is in a string</h2>
<li>The function <b>contains</b> returns true if a string is contained in another string</li>
<li>The function <b>exists</b> returns true if any element in a sequence satisfies a condition</li>
<li>Use a combination of exists and contains to see if ANY of the strings in an array a is in a string s</li>

In [4]:
val contains_example = "The world cup is in Qatar this year".contains("Qatar")
val exists_example_true = Array(1,3,5,7,8,13).exists(v => v%2==0)
val exists_example_false = Array(1,3,5,7,9,13).exists(v => v%2==0)

contains_example: Boolean = true
exists_example_true: Boolean = true
exists_example_false: Boolean = false


<h2>Mutable Array Buffer</h2>
<li>Scala Array is mutable in content but not mutable in size</li>
<li>An ArrayBuffer object is mutable in content as well as size</li>
<li>The += operator adds a new element at the back of the array</li>

In [5]:
import scala.collection.mutable.ArrayBuffer
var mutable_array_example = ArrayBuffer[Double]()
mutable_array_example += 5.8
mutable_array_example += -2.4
println(mutable_array_example)

ArrayBuffer(5.8, -2.4)


import scala.collection.mutable.ArrayBuffer
mutable_array_example: scala.collection.mutable.ArrayBuffer[Double] = ArrayBuffer(5.8, -2.4)


<h1>Do the assignment</h1>
<li>Follow the steps below and fill in the pieces of code wherever necessary</li>
<li>If you run into serializability errors, put all the code in a single cell and rerun it</li>

<h2>Set twitter keys</h2>

In [6]:
// val CONSUMER_KEY = "MgWa1tEhKu863diqEdpJhSwhk"
// val CONSUMER_SECRET = "NY1pq2DJ3QKKEGKiRvrNZLAvxV095yh6eaDnQbEazp3TS369QP"
// val ACCESS_TOKEN = "1597350057111584768-AjcWhOV1TgkmrX6wGKRTwV1kFMmasy"
// val ACCESS_TOKEN_SECRET = "EWfirqvdRMObkAJQXjsLIkm92cHttRMAcHi4Nq0MTA9cG"

// //Twitter API keys attached to twitter4j
// System.setProperty("twitter4j.oauth.consumerKey",CONSUMER_KEY)
// System.setProperty("twitter4j.oauth.consumerSecret",CONSUMER_SECRET)
// System.setProperty("twitter4j.oauth.accessToken",ACCESS_TOKEN)
// System.setProperty("twitter4j.oauth.accessTokenSecret",ACCESS_TOKEN_SECRET)

<h2>Data preparation functions and twitter stream setup</h2>

In [9]:
//All the imports you will need

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._
import org.apache.spark.rdd.PairRDDFunctions._
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import org.jfree.data.xy.{XYSeries, XYSeriesCollection} 
import org.jfree.chart.{ChartFactory, ChartFrame, JFreeChart} 
import org.jfree.chart.plot.{PlotOrientation, XYPlot} 
import org.jfree.chart.util.PaintUtils
import java.awt.Paint
import java.awt.Color._

//Read the key words that will identify world cup texts
//(Adjust the path if your file is in a different folder)
//Read these into a Scala Array
val filters = Source.fromFile("fifa2022_words.txt").getLines.toArray


//Read (word,sentiment pairs into an RDD)
val sentimentFilePath = "AFINN-111.txt"

val wordSentiments = sc.textFile(sentimentFilePath).map { line => 
    //FILL THIS IN
    (line.split("\t")(0),line.split("\t")(1).toDouble)
}.cache()

//Write a function getWords that takes a string and breaks it up into an array of words
//Example: "Jack, the man\nWoolly the mammoth!" should generate the array
//   Array(Jack, the, man, Woolly, the, mammoth)
//or: remove punctuation and replace \n by a space. 
//The isLetter function returns true if a character is a letter

def getWords(text: String): Array[String] = {
    //FILL THIS IN
    text.split(" ").map(a=>a.toArray.filter(b=>b.isLetter)).map(c=>c.mkString)
}

//Set up the stream and get the text of each tweet
val ssc = new StreamingContext(sc,Seconds(10.toLong))
ssc.checkpoint("checkpoint")
val text = ssc.socketTextStream("localhost", 4444)
// val stream = TwitterUtils.createStream(ssc, None, Array[String]())
// val tweets = stream.filter(_.getLang == "en")
// val text = tweets.map(status => status.getText)

//filter tweets only keeping world cup related ones
//Use transform to work on each rdd (text is a DStream object, not an RDD)
//filter using exists and contains
//Also, convert the text to lowercase (all the keywords are in lowercase)
val filteredText = text.transform(
                    rdd => rdd.filter( //rdd
                        a => filters.exists( //element
                            key => a.toLowerCase().contains(key))))

//convert all tweets in filteredText into a single array of words (think flatMap)
val words = filteredText.flatMap(getWords(_))

//get sentiments
//we need to convert each word into a pair (word, 1) to count the number of words
//apply transform to join each rdd with the wordSentiment rdd using fullOuterJoin
//Use match to convert Option to a new paired rdd (count, 1*sentiment)

val sentiment = words.map(word => (word, 1.0))
                  .transform{
                      rdd => rdd.fullOuterJoin(wordSentiments)
                          .flatMap(pair => pair match {
                                    case (word,(Some(count),None)) => Some(0.0,0.0)
                                    case (word,(None,senti)) => None
                                    case (word,(Some(count),Some(senti))) => Some(count,1*senti)
                                    }
                              )
                            }


//Define a window of length 120 that slides every 40 seconds
val sentiment_window = sentiment.window(Seconds(120),Seconds(40))

//Create an empty ArrayBuffer all_sentiments that contains sentiments
//And a second array buffer that contains the (timestamp,moving average)
//Because we'll modify them, these need to be var, not val

val MOVING_AVERAGE_LENGTH = 3

var all_sentiments = ArrayBuffer[Double]()
var all_averages = ArrayBuffer[(String,Double)]()

/*
1. apply foreachRDD to each sentiment window

2. Update all_sentiments by the sentiment of the rdd (divide total sentiment by the count
of all words and multiply by 100.0

3. Calculate the total for count and sentiment (sentiment should be (Double,Double) pairs)
Example of all_sentiments:
res6: scala.collection.mutable.ArrayBuffer[Double] = ArrayBuffer(0.17157852240613647, 0.10092344956350609, 
0.10092175200161475, 0.07737334320123797, 
0.19847944560317568, 0.20185029436501262, 
0.1883936080740118, 0.10765711209796798, 
0.04709998654286099, 0.026914278024491995)

4. Compute the moving average. If the number of elements in all_sentiments is less 
than MOVING_AVERAGE_LENGTH, then a simple average works. If greater, then compute
the average of the last MOVING_AVERAGE_LENGTH elements (the scala function slice may help)

5. Uodate all_averages with the timestamp (cleaned) and the moving average. Example:
ArrayBuffer((5820000,0.17157852240613647), 
(5860000,0.13625098598482127), 
(5900000,0.12447457465708577), 
(5940000,0.09307284825545294), 
(5980000,0.12559151360200946), 
(6020000,0.15923436105647543), 
(6060000,0.19624111601406669), 
(6100000,0.16596700484566412), 
(6140000,0.1143835689049469), 
(6180000,0.060557125555107))

6. You also need to clean the timestamp. Convert it into a string, 
drop the "ms" from the end, and then drop everything other than last 7 digits
You might find the function takeRight useful

*/
sentiment_window.foreachRDD((r,t) => {
    val sum = r.map(t => t._2).fold(0.0)((a,b)=>a+b)
    val count = r.count()
    val clean_timestamp = t.toString.dropRight(3).takeRight(9)
    all_sentiments += (sum/count) * 100.0
    //during some window, the input can be empty, the sentiment will be NaN. 
    //NaN will affect all_averages, so I filter NaN first.
    all_sentiments = all_sentiments.filter(x=> !x.isNaN)
    if (all_sentiments.length >= 1) {
        if (all_sentiments.length == 1) all_averages += ((clean_timestamp, all_sentiments(0)))
        else if (all_sentiments.length < MOVING_AVERAGE_LENGTH)
                all_averages += ((clean_timestamp,all_sentiments.sum/all_sentiments.length))
        else {
    //val slice = all_sentiments.takeRight(MOVING_AVERAGE_LENGTH)
            val slice = all_sentiments.slice(all_sentiments.length-3,all_sentiments.length+1)
            all_averages += ((clean_timestamp,slice.sum/MOVING_AVERAGE_LENGTH))
        }

    //Print new values
        println(all_sentiments(all_sentiments.length-1),all_averages(all_averages.length-1))
    }
})
//Configure and show the (initally empty) chart
//I've done all the chart work for you


//Create a new XYSeries object that holds the data for the graph 
//And a dataset that contains this XYSeries object
//The goal is to update xy whenever there is a new average in all_averages

val xy = new XYSeries("") 
val dataset = new XYSeriesCollection(xy)

//Creates the chart object 
val chart = ChartFactory.createXYLineChart( 
  "2022 World Cup Sentiment Chart",  // chart title 
  "Time",               // x axis label 
  "Sentiment",                   // y axis label 
  dataset,                   // data 
  PlotOrientation.VERTICAL, 
  false,                    // include legend 
  true,                     // tooltips 
  false                     // urls 
)

//From the chart, grab the plot so that we can configure formatting info (done for you)

val plot = chart.getXYPlot() 

def configurePlot(plot: XYPlot): Unit = { 
  plot.setBackgroundPaint(WHITE) 
  plot.setDomainGridlinePaint(BLACK) 
  plot.setRangeGridlinePaint(BLACK) 
  plot.setOutlineVisible(false) 
} 

//A function that shows the chart.
def show(chart: JFreeChart) { 
  val frame = new ChartFrame("plot", chart) 
  frame.pack() 
  frame.setVisible(true) 
}

//Call the plot configuration function
//Call the show chart function (now it will actually pop up)
//Note that the chart is in a separate window so you might need to look for it

configurePlot(plot) 
show(chart) 



import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._
import org.apache.spark.rdd.PairRDDFunctions._
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import org.jfree.data.xy.{XYSeries, XYSeriesCollection}
import org.jfree.chart.{ChartFactory, ChartFrame, JFreeChart}
import org.jfree.chart.plot.{PlotOrientation, XYPlot}
import org.jfree.chart.util.PaintUtils
import java.awt.Paint
import java.awt.Color._
filters: Array[String] = Array(fifa, qatar, soccer, football, world cup, ronaldo, cristiano, messi, usa, brazil, france, ecuador, senegal, netherlands, iran, england, wales, argentina, saudi arabia, mexico, poland, australia, denmark, tunisia, spain, costa rica, germany, japan, belgium, canada, morocco, croatia, serbia...


In [10]:
/*
1. Start the stream
2. Inside a while loop, sleep for a bit (Thread.sleep(10000) for 10 seconds)
3. then check if there are new elements in all_averages

4. To check if there are new elements, initialize a variable index to 0 and,
at each interval (after sleep), check if the array length of all_averages is
greater than index. If it is, there are length-index new elements

5. if there are new elements, add them to xy using addOrUpdate (see documentation linked above)
 add the elements in all_averages.length - previous_length to xy ()
Use addOrUpdate (not add) so that the graph updates

6. The while should run as long as the length of all_averages is less than NUM_BATCHES

7. Call ssc.stop(false) after the while loop

8. Note that once the stream stops, DStream elements are no longer accessible but
RDDs are (all_sentiments and all_averages)

9. Enjoy! Do note that for this to make sense, we should run this for a long time and 
take a moving average of a longer period (e.g., several hours). Treat this as a
learning exercise, not a diagnostic one

*/

val NUM_BATCHES = 10 //So that you don't get banned from twitter
var index = 0
ssc.start
while (all_averages.length < NUM_BATCHES ) {
    Thread.sleep(10000);
    var len = all_averages.length
    if (len > index) {
        index = len
        xy.addOrUpdate(all_averages(all_averages.length-1)._1.toDouble, all_averages(all_averages.length-1)._2)
    }
    
}
ssc.stop(false)

(-1.8867924528301887,(384350000,-1.8867924528301887))
(-1.0752688172043012,(384390000,-1.481030635017245))
(-0.9592326139088728,(384430000,-1.3070979613144542))
(-0.6430868167202572,(384470000,-0.8925294159444771))
(-0.6134969325153374,(384510000,-0.7386054543814892))
(0.0,(384550000,-0.4188612497451982))
(-2.4390243902439024,(384590000,-1.0175071075864133))
(-4.166666666666666,(384630000,-2.2018970189701896))
(-4.166666666666666,(384670000,-3.5907859078590785))
(0.0,(384710000,-2.7777777777777772))
22/12/06 22:45:15 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver


NUM_BATCHES: Int = 10
index: Int = 10


In [None]:
// (clean_timestamp, all_sentiments(0))