# Working with meteorological data 2

We will use meteorological data from Meteogalicia containing the measurements taken by a weather station in Santiago during June 2017.

The objective is to **calculate the average temperature per day**.

## Load data

In [1]:
rdd = sc.textFile('datasets/meteogalicia.txt')

In [2]:
rdd.takeSample(withReplacement=True,num=5)

[u'      1          2017-06-16 22:10:00    Chuvia (L/m2)                             0',
 u'      1          2017-06-08 14:40:00    Visibilidade (m)                          20057',
 u'      1          2017-06-24 09:30:00    Visibilidade (m)                          20059',
 u'      1          2017-06-26 10:40:00    Temperatura media (\ufffdC)                    19,06',
 u'      1          2017-06-03 16:50:00    Visibilidade (m)                          20036']

## Extract date and temperature information

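Filter the RDD to keep only the "Temperatura media" lines, then map each line to a (date, temperature) pair. The readings use a decimal comma, so it is replaced by a dot before converting to float.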

In [3]:
# Fields after split(): index 1 is the date, index 6 is the measured value
temperatures = rdd.filter(lambda line: 'Temperatura media' in line) \
    .map(lambda line: (line.split()[1], float(line.split()[6].replace(',', '.'))))

Take 5 elements of the dataset to verify the contents of the RDD:

In [4]:
temperatures.take(5)

[(u'2017-06-01', 13.82),
 (u'2017-06-01', 13.71),
 (u'2017-06-01', 13.61),
 (u'2017-06-01', 13.52),
 (u'2017-06-01', 13.33)]
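
The parsing step can also be checked on a single line without Spark. The following is a minimal sketch in plain Python; the helper name `parse_temperature_line` is hypothetical, and the sample line is copied from the `takeSample` output above.

```python
def parse_temperature_line(line):
    """Return a (date, temperature) pair from one whitespace-separated record."""
    fields = line.split()
    # As in the sample lines: fields[1] is the date, fields[6] is the value
    date = fields[1]
    # The file uses a decimal comma, which float() does not accept
    value = float(fields[6].replace(',', '.'))
    return (date, value)

sample = u'      1          2017-06-26 10:40:00    Temperatura media (\ufffdC)                    19,06'
parsed = parse_temperature_line(sample)
print(parsed)
```

Splitting on whitespace only works here because "Temperatura media" and its unit always produce exactly three tokens; a line with a different measurement name could shift the index of the value.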

## Filter out invalid values

As we saw in part 1, a temperature value of -9999 indicates a missing measurement, so we filter out these values before performing any calculations on the data:

In [None]:
temperatures.min()

In [5]:
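# Index the pair instead of unpacking it in the lambda signature,
# which is not valid syntax in Python 3
temperatures_clean = temperatures.filter(lambda pair: pair[1] != -9999)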

In [6]:
temperatures_clean.min()

(u'2017-06-01', 11.28)
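
Note that `min()` on (date, temperature) pairs compares the date first and only then the temperature, so it returns the lowest reading of the earliest date rather than the lowest reading overall. The sketch below illustrates the difference in plain Python with made-up values (they are not taken from the dataset). `RDD.min` also accepts an optional `key` function, so `temperatures.min(key=lambda pair: pair[1])` would be the Spark counterpart of the second call.

```python
# Made-up pairs for illustration only, including one -9999 placeholder
pairs = [
    ('2017-06-01', 13.82),
    ('2017-06-01', 11.28),
    ('2017-06-02', -9999.0),
    ('2017-06-03', 9.5),
]

# Tuples compare element by element: date first, then temperature
tuple_min = min(pairs)

# Comparing on the temperature alone finds the placeholder value
value_min = min(pairs, key=lambda pair: pair[1])

print(tuple_min)
print(value_min)
```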

## Calculate the average temperature per day

In [7]:
def sum_pairs(a, b):
    # Add two (temperature_sum, count) pairs element by element
    return (a[0]+b[0], a[1]+b[1])

In [8]:
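# Pair each temperature with a count of 1, add up sums and counts per date,
# then divide the sum by the count to get the daily average
averages = temperatures_clean.mapValues(lambda temp: (temp, 1)) \
    .reduceByKey(sum_pairs) \
    .mapValues(lambda total: total[0] / total[1])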

In [None]:
averages.take(5)
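
The sum-and-count pattern used above can be traced in plain Python. This is a minimal sketch that imitates what `reduceByKey(sum_pairs)` does with a dictionary; the records are made-up values chosen so the averages are easy to verify by hand.

```python
def sum_pairs(a, b):
    # Add two (temperature_sum, count) pairs element by element
    return (a[0] + b[0], a[1] + b[1])

# Made-up (date, temperature) records for illustration only
records = [('2017-06-01', 10.0), ('2017-06-01', 14.0), ('2017-06-02', 20.0)]

totals = {}
for date, temp in records:
    pair = (temp, 1)  # each reading contributes its value and a count of 1
    totals[date] = sum_pairs(totals[date], pair) if date in totals else pair

# Divide each accumulated sum by its count to get the average per date
averages_local = {date: total / count for date, (total, count) in totals.items()}
print(averages_local)
```

Carrying the count alongside the sum is what makes the reduction safe: averaging partial averages would give wrong results whenever the groups being merged have different sizes.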

## Show the results sorted by date

In [None]:
averages.sortByKey().collect()
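
The keys here are strings, so `sortByKey()` orders them alphabetically. Because the dates are zero-padded and written as year-month-day, alphabetical order coincides with chronological order. A small plain-Python sketch with made-up averages shows the same behaviour:

```python
# Made-up (date, average) pairs for illustration only
daily = [('2017-06-10', 18.2), ('2017-06-02', 15.1), ('2017-06-01', 13.4)]

# Sort on the date string, as sortByKey() does with the key
ordered = sorted(daily, key=lambda pair: pair[0])
print(ordered)
```

With a format such as day/month/year, the keys would need to be converted to dates before sorting.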