### Filtering weather data

Filter an RDD to find the minimum temperature recorded at each weather station in historical weather data from the year 1800. Each line of the weather data has the format:

- weather station ID, day, type of reading, reading (tenths of a degree Celsius)
- ITE00100554,18000101,TMAX,-75,,,E,
- ITE00100554,18000101,TMIN,-148,,,E,
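As a quick check of this layout, the two sample rows above can be split in plain Python, without Spark. This is only a sketch: the helper name `split_row` is made up for illustration, and it assumes the reading is always an integer in the fourth field.

```python
def split_row(row):
    """Split one raw CSV row into (station_id, day, reading_type, celsius).

    Hypothetical helper for illustration; assumes the fourth field is an
    integer count of tenths of a degree Celsius.
    """
    fields = row.split(",")
    station_id, day, reading_type = fields[0], fields[1], fields[2]
    celsius = int(fields[3]) / 10  # tenths of a degree -> degrees
    return (station_id, day, reading_type, celsius)


# The two sample rows from the format description
sample_rows = [
    "ITE00100554,18000101,TMAX,-75,,,E,",
    "ITE00100554,18000101,TMIN,-148,,,E,",
]

for row in sample_rows:
    print(split_row(row))
```

The trailing fields after the reading are ignored here, as they are in the Spark code in this notebook.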

In [1]:
#import relevant libraries
import findspark
findspark.init('/home/richard/Documents/spark/spark-2.4.0-bin-hadoop2.7')
from pyspark import SparkConf, SparkContext

In [2]:
#set up as local and name app
conf = SparkConf().setMaster("local").setAppName("MinTemperatures")
sc = SparkContext(conf = conf)

In [5]:
#define function to parse each line of the lines RDD into (stationID, entryType, temperature)
def parseLine(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    #raw reading is in tenths of a degree Celsius, so scale to degrees Celsius
    temperature = float(fields[3]) * 0.1
    #for Fahrenheit instead:
    #temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return (stationID, entryType, temperature)

In [7]:
#load the csv file into the lines RDD
lines = sc.textFile("data/1800.csv")

#apply the function
parsedLines = lines.map(parseLine)

#keep only the 'TMIN' (minimum temperature) readings
minTemps = parsedLines.filter(lambda x: x[1] == "TMIN")

#create stationid and temp key/value pairs
stationTemps = minTemps.map(lambda x: (x[0], x[2]))

#reduce by station id, keeping only the smallest temperature for each station
minTemps = stationTemps.reduceByKey(lambda x, y: min(x, y))

results = minTemps.collect()

#print(results)

for result in results:
    print(result[0] + "\t{:.2f}C".format(result[1]))

ITE00100554	-14.80C
EZE00100082	-13.50C
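
The filter, map and reduceByKey steps can be mimicked in plain Python to check the logic on a handful of rows before running it on an RDD. This is a rough sketch rather than the notebook's method: `min_temp_by_station` is a hypothetical helper, a dict stands in for the keyed reduction, and the third input row is made up for illustration (only the first two come from the format description).

```python
def min_temp_by_station(rows):
    """Return {station_id: minimum TMIN in degrees Celsius} for raw CSV rows.

    Hypothetical plain-Python stand-in for filter + map + reduceByKey.
    """
    minimums = {}
    for row in rows:
        fields = row.split(",")
        station_id, entry_type = fields[0], fields[2]
        if entry_type != "TMIN":
            continue  # same role as the filter step
        temperature = float(fields[3]) * 0.1
        # same role as reduceByKey with min: keep the smaller value per key
        if station_id not in minimums or temperature < minimums[station_id]:
            minimums[station_id] = temperature
    return minimums


rows = [
    "ITE00100554,18000101,TMAX,-75,,,E,",
    "ITE00100554,18000101,TMIN,-148,,,E,",
    "ITE00100554,18000102,TMIN,-125,,,E,",  # made-up row, not from the data set
]

for station, temperature in min_temp_by_station(rows).items():
    print(station + "\t{:.2f}C".format(temperature))
```

Unlike the dict version, `reduceByKey` combines values within each partition before merging across partitions, which is why the function passed to it should be associative and commutative, as `min` is.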
