# Working with meteorological data - Structured API

We will use meteorological data from Meteogalicia containing the measurements recorded by a weather station in Santiago de Compostela during June 2017.

The objective is to **calculate the average temperature per day**.

## Load data

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the notebook's session if one already exists

In [6]:
df = spark.read.csv('datasets/meteogalicia.txt')

In [10]:
df.show(20, truncate=False)

+---------------------------------------------------------------------------------------+
|_c0                                                                                    |
+---------------------------------------------------------------------------------------+
|ESTACI�N AUTOM�TICA:Santiago-EOAS                                                      |
|CONCELLO:Santiago de Compostela                                                        |
|PROVINCIA:A Coru�a                                                                     |
|C�DIGOS DE VALIDACI�N DOS DATOS:                                                       |
|0:  Dato sen validar                                                                   |
|1:  Dato v�lido orixinal                                                               |
|2:  Dato sospeitoso                                                                    |
|3:  Dato err�neo                                                                       |
|4:  Dato 
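
The `�` characters in the listing suggest that the file is not UTF-8. A minimal sketch of the effect follows. It assumes the file is encoded as ISO-8859-1 (Latin-1), which the source does not confirm, and the sample string is a reconstruction of the first header line rather than text read from the real file:

```python
# Assumption: the station file is Latin-1 encoded; the text below is a
# reconstructed sample, not read from the real file.
raw = "ESTACIÓN AUTOMÁTICA".encode("latin-1")

# Decoding Latin-1 bytes as UTF-8 turns each accented letter into U+FFFD
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the matching codec recovers the accents
restored = raw.decode("latin-1")

print(garbled)
print(restored)
```

If the assumption holds, passing `encoding='ISO-8859-1'` to `spark.read.csv` should give readable headers. The temperature values themselves are not affected either way.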

In [24]:
# Read the file as an RDD of lines and drop the first 22 (index < 22), which hold the header block
rdd = spark.sparkContext.textFile("datasets/meteogalicia.txt")
rdd_skipped = rdd.zipWithIndex().filter(lambda x: x[1] >= 22).map(lambda x: x[0])

rdd_skipped.take(5)

[u'',
 u'      1          2017-06-01 00:10:00    Temperatura media (\ufffdC)                    13,82',
 u'      1          2017-06-01 00:20:00    Temperatura media (\ufffdC)                    13,71',
 u'      1          2017-06-01 00:30:00    Temperatura media (\ufffdC)                    13,61',
 u'      1          2017-06-01 00:40:00    Temperatura media (\ufffdC)                    13,52']
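
`zipWithIndex` pairs every line with its position, so the filter keeps only the lines whose index is 22 or greater. The same idea in plain Python, on a hypothetical five-line sample with a two-line header:

```python
# Hypothetical miniature file: 2 header lines followed by 3 data lines
lines = ["HEADER A", "HEADER B", "row 1", "row 2", "row 3"]
header_size = 2  # the notebook uses 22 for the real file

# enumerate yields (index, line); zipWithIndex yields (line, index)
indexed = [(line, idx) for idx, line in enumerate(lines)]
skipped = [line for line, idx in indexed if idx >= header_size]

print(skipped)  # ['row 1', 'row 2', 'row 3']
```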

In [34]:
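# Keep only the temperature rows; field 1 is the date and field 6 the value (decimal comma)
temperatures = rdd.filter(lambda line: 'Temperatura media' in line) \
    .map(lambda line: line.split()) \
    .map(lambda fields: (fields[1], float(fields[6].replace(',', '.'))))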
temperatures.take(5)

[(u'2017-06-01', 13.82),
 (u'2017-06-01', 13.71),
 (u'2017-06-01', 13.61),
 (u'2017-06-01', 13.52),
 (u'2017-06-01', 13.33)]
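
The parsing step above raises an exception on any line that passes the filter but has fewer fields or a non-numeric value. A defensive variant might look like the sketch below. `parse_temperature` is a hypothetical helper, and the field positions are assumed from the sample rows shown earlier:

```python
def parse_temperature(line):
    """Return (date, value) for a 'Temperatura media' row, or None if malformed."""
    if 'Temperatura media' not in line:
        return None
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        # The file uses a decimal comma, e.g. '13,82'
        value = float(fields[6].replace(',', '.'))
    except ValueError:
        return None
    return (fields[1], value)

sample = '      1          2017-06-01 00:10:00    Temperatura media (ºC)                    13,82'
print(parse_temperature(sample))  # ('2017-06-01', 13.82)
```

It could stand in for the lambdas with `rdd.map(parse_temperature).filter(lambda pair: pair is not None)`.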

In [38]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Column names are in Galician: "data" is the date, "valor" the measured value
schema = StructType([
    StructField("data", StringType(), True),
    StructField("valor", DoubleType(), True)
])

df_temp = spark.createDataFrame(temperatures, schema=schema)
df_temp.show(5)

+----------+-----+
|      data|valor|
+----------+-----+
|2017-06-01|13.82|
|2017-06-01|13.71|
|2017-06-01|13.61|
|2017-06-01|13.52|
|2017-06-01|13.33|
+----------+-----+
only showing top 5 rows



## Calculate the average temperature per day

In [40]:
df_temp.groupBy('data').avg('valor').show()

+----------+------------------+
|      data|        avg(valor)|
+----------+------------------+
|2017-06-22| 19.56493055555555|
|2017-06-07| 17.76305555555556|
|2017-06-24|           17.6775|
|2017-06-29|13.477083333333331|
|2017-06-19|25.422708333333333|
|2017-06-03|14.511736111111105|
|2017-06-23| 18.57861111111111|
|2017-06-28|15.242361111111105|
|2017-06-12|20.020138888888884|
|2017-06-30|             11.59|
|2017-06-26|18.298125000000002|
|2017-06-04|14.889375000000005|
|2017-06-18|26.350069444444443|
|2017-06-06|14.901041666666666|
|2017-06-09| 17.86694444444445|
|2017-06-21| 23.28430555555555|
|2017-06-25| 19.57138888888889|
|2017-06-14| -51.6271527777778|
|2017-06-16|22.042708333333337|
|2017-06-11|17.806250000000006|
+----------+------------------+
only showing top 20 rows
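
`groupBy('data').avg('valor')` collects the readings of each day and divides their sum by their count. A plain-Python sketch of the same computation on a few made-up pairs (the values are illustrative, not taken from the file):

```python
from collections import defaultdict

# Illustrative (date, temperature) pairs, not real measurements
pairs = [('2017-06-01', 10.0), ('2017-06-01', 14.0),
         ('2017-06-02', 20.0), ('2017-06-02', 21.0), ('2017-06-02', 22.0)]

totals = defaultdict(lambda: [0.0, 0])  # date -> [sum, count]
for date, value in pairs:
    totals[date][0] += value
    totals[date][1] += 1

daily_avg = {date: s / n for date, (s, n) in totals.items()}
print(daily_avg)  # {'2017-06-01': 12.0, '2017-06-02': 21.0}
```

Unlike this single-process loop, a distributed aggregation does not guarantee any order for the groups, which is why the rows in the Spark output are not sorted by date.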



## Show the results sorted by date

In [42]:
df_temp.groupBy('data').avg('valor').orderBy('data').show()

+----------+------------------+
|      data|        avg(valor)|
+----------+------------------+
|2017-06-01|17.179580419580425|
|2017-06-02|16.007500000000004|
|2017-06-03|14.511736111111105|
|2017-06-04|14.889375000000005|
|2017-06-05| 13.67486111111111|
|2017-06-06|14.901041666666666|
|2017-06-07| 17.76305555555556|
|2017-06-08| 17.49979166666667|
|2017-06-09| 17.86694444444445|
|2017-06-10|19.207222222222224|
|2017-06-11|17.806250000000006|
|2017-06-12|20.020138888888884|
|2017-06-13|18.769027777777776|
|2017-06-14| -51.6271527777778|
|2017-06-15|18.135486111111103|
|2017-06-16|22.042708333333337|
|2017-06-17|25.475902777777772|
|2017-06-18|26.350069444444443|
|2017-06-19|25.422708333333333|
|2017-06-20|26.977916666666665|
+----------+------------------+
only showing top 20 rows
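
The average for 2017-06-14 (about -51.6 °C) is not physically plausible for Santiago in June. One possible explanation, not confirmed by the source, is that a missing reading was stored as a large negative sentinel. The sketch below assumes a sentinel of -9999 and made-up readings to show how a single such value distorts a daily mean, and how filtering it out restores a sensible result:

```python
# Assumptions: 144 readings per day, one of them stored as the sentinel -9999
SENTINEL = -9999.0
readings = [18.0] * 143 + [SENTINEL]

# Mean over all readings, sentinel included
raw_mean = sum(readings) / len(readings)

# Mean over the valid readings only
valid = [r for r in readings if r != SENTINEL]
clean_mean = sum(valid) / len(valid)

print(round(raw_mean, 2), round(clean_mean, 2))
```

On the DataFrame, a comparable filter could be something like `df_temp.filter(df_temp.valor > -100)` applied before grouping, with the threshold chosen after inspecting that day's rows.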

