# Activity: Min and Max Temperatures

Task: find the minimum and maximum temperature recorded at each weather station.

In [23]:
import scala.math.min
import scala.math.max



## Data Import

In [24]:
spark

res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@973714a


In [25]:
val lines = sc.textFile("../../data/1800.csv")

lines: org.apache.spark.rdd.RDD[String] = ../../data/1800.csv MapPartitionsRDD[9] at textFile at <console>:29


Look at what each row looks like. Note that `top(5)` does not sample: it returns the five largest rows in descending string order.

In [26]:
lines.top(5).foreach(println)

ITE00100554,18001231,TMIN,25,,,E,
ITE00100554,18001231,TMAX,50,,,E,
ITE00100554,18001230,TMIN,31,,,E,
ITE00100554,18001230,TMAX,50,,,E,
ITE00100554,18001229,TMIN,16,,,E,


Each row of the CSV has the form `stationID, date, entryType, temperature, ...`.
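To make the field positions concrete, here is a minimal sketch in plain Scala (no Spark needed) that splits one of the sample rows printed above. The `sample...` names are illustrative and not part of the notebook. One detail worth knowing: `String.split` drops trailing empty strings, so this row yields seven fields even though it contains seven commas.

```scala
// One of the sample rows printed above (variable names here are hypothetical).
val sampleRow = "ITE00100554,18001231,TMIN,25,,,E,"

// split(",") discards trailing empty strings, so the final empty field is dropped.
val sampleFields = sampleRow.split(",")

val sampleStation = sampleFields(0)    // station identifier
val sampleDate = sampleFields(1)       // date as yyyyMMdd
val sampleEntryType = sampleFields(2)  // TMIN or TMAX in the rows shown
val sampleReading = sampleFields(3)    // raw temperature reading, still a String

println(s"$sampleStation $sampleDate $sampleEntryType $sampleReading")
```

Only positions 0, 2, and 3 are needed for this activity, so the empty and flag fields can be ignored.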

## Helper Function

Define a method called `parseLine` that takes a row and returns it as a `(stationID, entryType, temperature)` tuple.

In [39]:
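def parseLine(line: String) = {
    val fields = line.split(",")     // Split the row on commas
    val stationID = fields(0)
    val entryType = fields(2)        // e.g. TMIN or TMAX
    // Scale the raw value by 0.1, then convert Celsius to Fahrenheit.
    // 9.0/5.0 is a Double, so the whole expression is a Double.
    val temp = fields(3).toFloat * 0.1f * (9.0 / 5.0) + 32.0f
    (stationID, entryType, temp)
}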

parseLine: (line: String)(String, String, Double)
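As a worked example of the conversion, the sketch below isolates the arithmetic from `parseLine` in a standalone helper. The name `tenthsCelsiusToFahrenheit` is hypothetical, and the assumption that the raw value is in tenths of a degree Celsius is inferred from the `0.1f` factor in `parseLine`. It also shows why the inferred return type is `Double`: the literal `9.0 / 5.0` is a `Double`, which widens the whole expression.

```scala
// Hypothetical helper mirroring the arithmetic in parseLine.
// Assumes the raw reading is in tenths of a degree Celsius.
def tenthsCelsiusToFahrenheit(raw: String): Double =
  raw.toFloat * 0.1f * (9.0 / 5.0) + 32.0f

// Raw reading 25 -> 2.5 degrees Celsius -> about 36.5 degrees Fahrenheit.
val exampleF = tenthsCelsiusToFahrenheit("25")
println(exampleF)
```

Because the result is a `Double`, later cells call `.toFloat` when building the `(stationID, temperature)` pairs.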


## Minimum Temperature

Parse the lines with the `parseLine` helper function.

In [40]:
val parsedLines = lines.map(parseLine(_))

parsedLines: org.apache.spark.rdd.RDD[(String, String, Double)] = MapPartitionsRDD[19] at map at <console>:32


Filter out all but `TMIN` entries.

In [41]:
val minTemps = parsedLines.filter(x=> x._2 == "TMIN")

minTemps: org.apache.spark.rdd.RDD[(String, String, Double)] = MapPartitionsRDD[20] at filter at <console>:30


Convert to `(stationID, temperature)` pairs, casting the temperature back to `Float`.

In [42]:
val stationMinTemps = minTemps.map(x => (x._1, x._3.toFloat))

stationMinTemps: org.apache.spark.rdd.RDD[(String, Float)] = MapPartitionsRDD[21] at map at <console>:30


Reduce by `stationID`, retaining the minimum temperature found for each station.

In [43]:
val minTempsByStation = stationMinTemps.reduceByKey((x,y) => min(x,y))

minTempsByStation: org.apache.spark.rdd.RDD[(String, Float)] = ShuffledRDD[22] at reduceByKey at <console>:30
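To illustrate what the reduction computes, here is a rough local equivalent using ordinary Scala collections instead of an RDD. The station IDs and temperatures are made up for the example, and `groupBy` followed by `reduce` is only a stand-in for `reduceByKey`, not how Spark executes it.

```scala
import scala.math.min

// Hypothetical (stationID, temperature) pairs; not taken from 1800.csv.
val sampleReadings = Seq(
  ("A", 41.0f), ("B", 30.5f), ("A", 37.5f), ("B", 33.0f)
)

// Group the pairs by station, then combine each group's temperatures
// with the same binary function that was passed to reduceByKey.
val sampleMins: Map[String, Float] =
  sampleReadings
    .groupBy(_._1)
    .map { case (station, rows) =>
      station -> rows.map(_._2).reduce((x: Float, y: Float) => min(x, y))
    }

println(sampleMins)
```

The function passed to `reduceByKey` should be associative and commutative, because Spark combines values within each partition before merging the partial results. `min` and `max` both satisfy this.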


Collect the results on the driver and print them.

In [44]:
val results = minTempsByStation.collect()

results: Array[(String, Float)] = Array((EZE00100082,7.7), (ITE00100554,5.3599997))


In [45]:
for (result<- results.sorted){
    val station = result._1
    val temp = result._2
    val formattedTemp = f"$temp%.2f F"
    println(s"$station minimum temperature: $formattedTemp")
}

EZE00100082 minimum temperature: 7.70 F
ITE00100554 minimum temperature: 5.36 F


## Maximum Temperature

In [46]:
val maxTempsByStation = {lines.map(parseLine(_))
                              .filter(_._2 == "TMAX")
                              .map(x=>(x._1, x._3.toFloat))
                              .reduceByKey((x,y) => max(x,y))
                        }

maxTempsByStation: org.apache.spark.rdd.RDD[(String, Float)] = ShuffledRDD[26] at reduceByKey at <console>:35


In [47]:
val maxResults = maxTempsByStation.collect()

maxResults: Array[(String, Float)] = Array((EZE00100082,90.14), (ITE00100554,90.14))


In [49]:
for (result<- maxResults.sorted){
    val station = result._1
    val temp = result._2
    val formattedTemp = f"$temp%.2f F"
    println(s"$station maximum temperature: $formattedTemp")
}

EZE00100082 maximum temperature: 90.14 F
ITE00100554 maximum temperature: 90.14 F
