# Generate the daily weather per state
We have all the weather data, but we would like to have it per day and per state. We will use that later on to join the weather data with the sales data, so we know what the weather was like when a sales transaction was made.

This requires us to transform the data we already have in our master store and generate *views* of our data. In this case we will generate the **daily_weather_per_state** view.

## Reading the data
But first let's get started by loading the master weather data. The data is stored in a CSV file, but Spark's RDD API has no built-in handler for reading CSV files directly. This means we need to read the data as a *textFile*.

In [None]:
weather_data = sc.textFile("/data/master/weather")
weather_data.take(1)

Ok, so as the take(1) action shows, we now have a collection of lines, but that is still not easy to work with. We need to split each line on the field delimiter **\t** (a tab).

We can use the **rdd.map(fn)** transformation to apply a function to every element in a collection. The **fn** argument can be any function, and here we pass a lambda expression. Lambda expressions in Python are written as:

**lambda** argument_1, ..., argument_n: *do_something*

With a map transformation, the result returned by *do_something* in a lambda expression replaces the element in the collection.

In [None]:
weather_records = weather_data.map(lambda line: line.split("\t"))
weather_records.take(1)
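
To see what the map transformation does to a single element, the same lambda can be tried on a plain Python list, without Spark. This is only a sketch: the sample line is an assumption, reconstructed from the example record used in the next section, and Python's built-in `map` stands in for `rdd.map`.

```python
# Hypothetical sample line, tab-delimited like the master weather data.
sample_lines = ["2006\t1\t1\tAB\tPRCP\t4.57089552238806"]

# Same lambda as in the cell above; the built-in map() stands in for rdd.map().
split_line = lambda line: line.split("\t")
sample_records = list(map(split_line, sample_lines))

# Each line has been replaced by the list of its fields.
print(sample_records[0])
```

Every element going in is a string, and every element coming out is a list of strings, which is the shape the following steps rely on.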

## Generating a key
The next step is to generate a key for each record. In our case, we want a key that includes the date and the state, since we want to end up with one record per day per state.

The format of the key in our case looks like **yyyymmdd-state**, meaning for a record with the following data:

['2006', '1', '1', 'AB', 'PRCP', '4.57089552238806']

our key would be '20060201-AB'. Keep in mind that the month is zero-based, so you need to add 1 to it to get the real month number.
    
For this we need to execute a block of code, something that is not directly possible within a lambda expression. A lambda expression is allowed to call a function, however, so we will write our code in a function and have the lambda expression call it.

In [None]:
def min_two_digits(data):
    # Left-pad with zeros and keep the last two characters, e.g. 1 -> '01'.
    return ('00' + str(data))[-2:]

def weather_key(record):
    # Build a 'yyyymmdd-state' key; the month is zero-based, so add 1.
    res = str(record[0])
    res += min_two_digits(int(record[1]) + 1)
    res += min_two_digits(record[2])
    res += '-'
    res += str(record[3])
    return res

weather = weather_records.map(lambda record: (weather_key(record), record))
weather.take(1)
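
As a quick sanity check, the key format can be verified locally on the example record shown earlier in this section. The sketch below is not the notebook's implementation: it is an assumed alternative formulation that uses an f-string with zero-padding format specifiers, and `weather_key_fstring` is a hypothetical name.

```python
def weather_key_fstring(record):
    # Hypothetical alternative to weather_key: zero-pad with format specifiers.
    year, state = record[0], record[3]
    month, day = int(record[1]), int(record[2])
    # The month in the source data is zero-based, so add 1.
    return f"{year}{month + 1:02d}{day:02d}-{state}"

# Example record taken from the text above.
sample_record = ['2006', '1', '1', 'AB', 'PRCP', '4.57089552238806']
sample_key = weather_key_fstring(sample_record)
print(sample_key)
```

Either formulation should yield the same key for valid dates; the f-string version simply moves the padding into the format specifier.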

## Taking all weather facts together
We now have a collection of weather facts, but each fact is still an element of its own. We need to group the weather facts by their key so we can build one record out of them.

An RDD has an aggregateByKey(initVal, sequenceFn, combineFn) function we can use for this. The function takes an initial value, a sequence function and a combine function. The sequence function is used to add data to the initial value, one element at a time. Since the aggregation is executed in a distributed manner, several nodes will each start with the initial value and run through a part of the data. This produces several partial results, one for each partition of the data. The combine function is used to merge these partition results together into one value.

We want to create our record in the form of a dictionary, a key/value data structure. An empty dictionary can be created using **{}** and data can be assigned to it using **myDict['field_name'] = fieldValue**. The value of a dictionary element can be retrieved using **myDict['field_name']**.

You can use the **merge_two_dicts(dict1, dict2)** function to combine two dictionaries into a single one.

In [None]:
def merge_two_dicts(x, y):
    '''Given two dicts, merge them into a new dict as a shallow copy.'''
    z = x.copy()
    z.update(y)
    return z
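
A small local check shows how the merge behaves: the result holds the fields of both dictionaries, and the inputs are left untouched because the function works on a copy. The function is repeated here so the snippet runs on its own; the `TMAX` field and its value are made up for illustration and are not taken from the data.

```python
def merge_two_dicts(x, y):
    '''Given two dicts, merge them into a new dict as a shallow copy.'''
    z = x.copy()
    z.update(y)
    return z

# Two partial records for the same day; 'TMAX' is a made-up field.
part1 = {'date': '2006-02-01', 'PRCP': 4.57}
part2 = {'date': '2006-02-01', 'TMAX': 12.0}

merged = merge_two_dicts(part1, part2)
print(merged)
print(part1)  # unchanged, since merge_two_dicts copies x first
```

When both dictionaries contain the same field, the value from the second argument overwrites the one from the first.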

In [None]:
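def append_to_map(m, element):
    # Store the date as 'yyyy-mm-dd' (the month is zero-based, so add 1).
    m['date'] = str(element[0]) + '-' + min_two_digits(int(element[1]) + 1) + '-' + min_two_digits(element[2])
    # Store the measurement under its own name, e.g. m['PRCP'] = 4.57.
    m[element[4]] = float(element[5])
    return m

# Sequence function: adds one weather fact to the record built so far.
seqFunc = append_to_map
# Combine function: merges the partial records of two partitions.
combFunc = merge_two_dicts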
weather = weather.aggregateByKey({}, seqFunc, combFunc)
weather.take(1)
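
To make the roles of the initial value, the sequence function and the combine function more concrete, here is a minimal local sketch that imitates the aggregation on plain Python lists. It is an illustration under assumptions, not how Spark executes the job: `aggregate_by_key_local`, `add_fact` and `merge` are hypothetical helpers, each inner list plays the role of one partition, and the `TMAX` fact is made up.

```python
import copy

def aggregate_by_key_local(partitions, zero_value, seq_func, comb_func):
    """Hypothetical local stand-in for aggregateByKey, for illustration only."""
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            if key not in acc:
                # Start every key from its own copy of the initial value.
                acc[key] = copy.deepcopy(zero_value)
            acc[key] = seq_func(acc[key], value)
        partials.append(acc)
    # Merge the per-partition results with the combine function.
    result = {}
    for acc in partials:
        for key, partial in acc.items():
            result[key] = comb_func(result[key], partial) if key in result else partial
    return result

def add_fact(m, element):
    # Simplified sequence function: store the measurement under its name.
    m[element[4]] = float(element[5])
    return m

def merge(x, y):
    # Combine function: merge two partial records into a new dict.
    z = x.copy()
    z.update(y)
    return z

# Two "partitions", each holding one fact for the same key.
partitions = [
    [('20060201-AB', ['2006', '1', '1', 'AB', 'PRCP', '4.57089552238806'])],
    [('20060201-AB', ['2006', '1', '1', 'AB', 'TMAX', '12.0'])],  # made-up fact
]
daily = aggregate_by_key_local(partitions, {}, add_fact, merge)
print(daily)
```

The sequence function only ever sees facts from one partition, which is why a separate combine function is needed to bring the partial records for the same key together.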

## Storing the result
Well done! We have the result we wanted, and since we do not want to lose it, we will store it on HDFS.

It is good practice to store all your views under the same folder structure. In our case this would be **/data/views** and the name of this view would be **daily_weather_per_state**.

We can choose from several formats in which to store the data, but for now we will go with a sequence file. An RDD has a **saveAsSequenceFile(path)** method we can use for this.

In [None]:
weather.saveAsSequenceFile('/data/views/daily_weather_per_state')