In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","64G") \
        .getOrCreate()

spark

# Get Data from S3

First we load the data source containing raw weather measurements from S3. Since the data doesn't follow any well-known format (like CSV or JSON), we load it as raw text data and extract all required information. 

But first, let's load a single year to get an impression of the data.

In [None]:
storageLocation = "s3://dimajix-training/data/weather"

Read in the year 2003 as `text` using the `spark.read.text` method. The data can be found at `storageLocation + "/2003"` and should be stored in a variable called `raw_weather_2003`. Then use `limit` and `toPandas` to retrieve the first 10 rows and display them as a Pandas DataFrame.

In [None]:
# Read in weather data from 2003
raw_weather_2003 = ...

# Display first 10 records
# YOUR CODE HERE

## Read in all years

Now we read in all years (2003 through 2019) and combine them with a union. We also add the year as a logical partition column, which will be used later.
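
The folding pattern behind this union can be tried without Spark. The following is a minimal sketch in plain Python, where the per-year lists of `(value, year)` tuples are hypothetical stand-ins for the per-year DataFrames and list concatenation stands in for `DataFrame.union`:

```python
from functools import reduce

# range() excludes its upper bound, so range(2003, 2020) covers 2003..2019
years = list(range(2003, 2020))

# Hypothetical stand-ins for the per-year DataFrames:
# each "frame" is a list of (value, year) rows
rows_per_year = [
    [(f"record-{year}-{n}", year) for n in range(2)]
    for year in range(2003, 2006)
]

# reduce folds the list pairwise from left to right:
# ((rows_2003 + rows_2004) + rows_2005)
all_rows = reduce(lambda left, right: left + right, rows_per_year)

print(years[0], years[-1])
print(len(all_rows))
```

Like list concatenation, `union` simply appends rows. Spark matches the columns by position rather than by name, which is safe here because every per-year DataFrame is built the same way.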

In [None]:
from functools import reduce

# Read in all years, store them in a Python list
raw_weather_per_year = [spark.read.text(storageLocation + "/" + str(i)).withColumn("year", lit(i)) for i in range(2003,2020)]

# Union all years together
raw_weather = reduce(lambda l, r: l.union(r), raw_weather_per_year)

# Display first 10 records
raw_weather.limit(10).toPandas()

## Extract Information

The raw data is not exactly nice to work with: every record is a single line of text with fields at fixed character positions. We therefore extract the relevant information with appropriate `substring` operations.
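
Spark's `substring(col, pos, len)` counts positions starting at 1, whereas Python slices start at 0. The sketch below mimics this in plain Python on a synthetic record. The helper names `spark_substring` and `place` and all field values are made up for illustration; only the positions are taken from the extraction code in this notebook.

```python
def spark_substring(value, pos, length):
    """Mimic Spark's substring(col, pos, len) for pos >= 1 (1-based position)."""
    start = pos - 1
    return value[start:start + length]


def place(buf, pos, text):
    """Write text into a character buffer at a 1-based position."""
    buf[pos - 1:pos - 1 + len(text)] = text


# Build a synthetic 105-character record (made-up values, filler is '.')
buf = list("." * 105)
place(buf, 5, "123456")     # usaf
place(buf, 11, "99999")     # wban
place(buf, 16, "20030101")  # date
place(buf, 24, "0600")      # time
place(buf, 66, "0052")      # wind speed, stored in tenths
place(buf, 70, "1")         # wind speed quality flag
place(buf, 88, "-0123")     # air temperature, stored in tenths
place(buf, 93, "1")         # air temperature quality flag
record = "".join(buf)

fields = {
    "usaf": spark_substring(record, 5, 6),
    "wban": spark_substring(record, 11, 5),
    "date": spark_substring(record, 16, 8),
    "time": spark_substring(record, 24, 4),
    "wind_speed": float(spark_substring(record, 66, 4)) / 10.0,
    "wind_speed_qual": spark_substring(record, 70, 1),
    "air_temperature": float(spark_substring(record, 88, 5)) / 10.0,
    "air_temperature_qual": spark_substring(record, 93, 1),
}
print(fields)
```

The division by 10 converts the stored tenths into the actual value, which is what the `/ lit(10.0)` expressions do in the Spark version.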

In [None]:
weather = raw_weather.select(
    col("year"),
    substring(col("value"),5,6).alias("usaf"),
    substring(col("value"),11,5).alias("wban"),
    substring(col("value"),16,8).alias("date"),
    substring(col("value"),24,4).alias("time"),
    substring(col("value"),42,5).alias("report_type"),
    substring(col("value"),61,3).alias("wind_direction"),
    substring(col("value"),64,1).alias("wind_direction_qual"),
    substring(col("value"),65,1).alias("wind_observation"),
    (substring(col("value"),66,4).cast("float") / lit(10.0)).alias("wind_speed"),
    substring(col("value"),70,1).alias("wind_speed_qual"),
    (substring(col("value"),88,5).cast("float") / lit(10.0)).alias("air_temperature"),
    substring(col("value"),93,1).alias("air_temperature_qual")
)
    
weather.limit(10).toPandas()

## Read in Station Metadata

Fortunately, the station metadata is stored as CSV, so we can read it directly using Spark's `spark.read.csv` mechanism. The data can be found at `storageLocation + '/isd-history'`.

You should also set the `DataFrameReader` option `header` to `True` so that the first line of the CSV is used for the column names.

Store the result in a variable called `stations` and again print the first 10 lines using the `toPandas()` method.

In [None]:
# Read in stations metadata
stations = ...

# Display first 10 records    
stations.limit(10).toPandas()

# Process Data

Now we want to perform a simple analysis on the data: Calculate minimum and maximum wind speed and air temperature per country and year. This needs to be performed in three steps:

1. Join weather data and stations on the columns `usaf` and `wban`
2. Group the data by the relevant columns, year and country
3. Perform min/max aggregations. Also pay attention to the fields `air_temperature_qual` and `wind_speed_qual`, where "1" marks a valid value

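**Since processing the full date range may take a considerable amount of time, you might want to start with a single year. This can be done by temporarily replacing `raw_weather` with `raw_weather_2003`. Note that the extraction step selects a `year` column, so `raw_weather_2003` needs one as well, for example via `.withColumn("year", lit(2003))`.**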
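
To make the logic of the three steps concrete before writing the Spark code, here is a minimal sketch in plain Python on a handful of made-up rows. The dictionary keys (including `country`) and all values are hypothetical; the real station metadata may name its columns differently. Only air temperature is shown, and wind speed works the same way.

```python
from collections import defaultdict

# Hypothetical sample data, not taken from the real data set
weather = [
    {"usaf": "100000", "wban": "99999", "year": 2003,
     "air_temperature": 12.5, "air_temperature_qual": "1"},
    {"usaf": "100000", "wban": "99999", "year": 2003,
     "air_temperature": 999.9, "air_temperature_qual": "9"},  # flagged as not valid
    {"usaf": "100000", "wban": "99999", "year": 2003,
     "air_temperature": -3.0, "air_temperature_qual": "1"},
    {"usaf": "200000", "wban": "99999", "year": 2003,
     "air_temperature": 30.1, "air_temperature_qual": "1"},
]
stations = [
    {"usaf": "100000", "wban": "99999", "country": "DE"},
    {"usaf": "200000", "wban": "99999", "country": "US"},
]

# Step 1: join on the composite key (usaf, wban)
country_by_station = {(s["usaf"], s["wban"]): s["country"] for s in stations}

# Steps 2 and 3: group by (country, year), keeping only valid readings
valid_temps = defaultdict(list)
for row in weather:
    country = country_by_station.get((row["usaf"], row["wban"]))
    if country is None:
        continue  # no matching station, as in an inner join
    if row["air_temperature_qual"] == "1":
        valid_temps[(country, row["year"])].append(row["air_temperature"])

result = {key: (min(temps), max(temps)) for key, temps in valid_temps.items()}
print(result)
```

In Spark, the quality check can be expressed inside the aggregation: a `when(...)` without a matching branch yields null, and `min` and `max` skip nulls, so invalid readings drop out of the result without a separate filter.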

In [None]:
joined_weather = ... # Join weather and stations on usaf and wban

# Print Results
pdf = joined_weather.limit(10).toPandas()    
pdf

In [None]:
# Group df by year and country
# Aggregate min/max wind speed and air temperature, using only values whose quality flag is "1".
# Note that you can also use conditional expressions via when(a == v, x).otherwise(y)
result = ...

# Print Results
pdf = result.toPandas()    
pdf