# Combine sales and weather data
So far we have read the weather and retail data from the database, stored it on HDFS, and transformed it into a form that is easier to process in subsequent steps. We generated two views that group the data by date and state, so the logical next step is to combine them into one large view.

## Loading the data
We will need to load the two views we created earlier.

### The weather view
During our previous processing, we stored the weather view in HDFS. Now we need to read it back from **/data/views/daily_weather_per_state**.

In [1]:
weather = sc.sequenceFile("/data/views/daily_weather_per_state")
weather.take(1)

[(u'20060223-AZ',
  {u'PRCP': 0.010526315789473684,
   u'TAVG': 21.19047619047619,
   u'TMAX': 131.49579831932772,
   u'TMIN': -37.9915611814346,
   u'date': u'2006-02-23'})]
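
The key of each record, such as `20060223-AZ`, appears to combine the date (as `YYYYMMDD`) with the two-letter state code. As a small sketch under that assumption, the hypothetical helper `split_key` below (not part of the original notebook) splits such a key back into its parts:

```python
from datetime import datetime

def split_key(key):
    """Split a 'YYYYMMDD-ST' view key into a date and a state code.

    Assumes the key layout seen in the sample output above.
    """
    day, state = key.split('-', 1)
    return datetime.strptime(day, '%Y%m%d').date(), state

day, state = split_key('20060223-AZ')
print('%s %s' % (day.isoformat(), state))  # prints "2006-02-23 AZ"
```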

### The sales view
The same goes for the sales view, which is located at **/data/views/daily_sales_per_state**.

In [2]:
sales = sc.sequenceFile("/data/views/daily_sales_per_state")
sales.take(1)

[(u'20060525-CT',
  {u'customer_age': 33,
   u'customer_gender': u'Male',
   u'customer_key': u'1',
   u'customer_marital_status': u'Divorced',
   u'customer_name': u'Kevin J. Dobisz',
   u'customer_state': u'CO',
   u'date': u'2006-05-25T00:00:00.000000Z',
   u'employee_gender': u'Female',
   u'employee_job_title': u'Cashier',
   u'employee_key': u'7679',
   u'employee_name': u'Samantha Reyes',
   u'employee_state': u'CA',
   u'price': 199.0,
   u'product_category': u'Food',
   u'product_department': u'Dairy',
   u'product_description': u'Brand #56310 butter',
   u'product_key': u'18730',
   u'product_price': 384.0,
   u'product_version': u'1',
   u'quantity': 10.0,
   u'store_key': u'108',
   u'store_name': u'Store108',
   u'store_state': u'CT',
   u'tender_type': u'Credit',
   u'transaction': u'3368440',
   u'transaction_type': u'purchase'})]

## Joining sales and weather into one
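Both views are keyed by date and state, so we can join them directly on that key. Each result pairs a sales record with the weather record for the same date and state.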

In [3]:
joined = sales.join(weather)
joined.take(2)

[(u'20060718-CO',
  ({u'customer_age': 40,
    u'customer_gender': u'Male',
    u'customer_key': u'793',
    u'customer_marital_status': u'Unknown',
    u'customer_name': u'Jose B. Fortin',
    u'customer_state': u'CA',
    u'date': u'2006-07-18T00:00:00.000000Z',
    u'employee_gender': u'Male',
    u'employee_job_title': u'Head of Marketing',
    u'employee_key': u'309',
    u'employee_name': u'Alexander Sanchez',
    u'employee_state': u'TX',
    u'price': 419.0,
    u'product_category': u'Food',
    u'product_department': u'Meat',
    u'product_description': u'Brand #39192 pork',
    u'product_key': u'13069',
    u'product_price': 116.0,
    u'product_version': u'1',
    u'quantity': 8.0,
    u'store_key': u'96',
    u'store_name': u'Store96',
    u'store_state': u'CO',
    u'tender_type': u'Debit',
    u'transaction': u'3534503',
    u'transaction_type': u'purchase'},
   {u'PRCP': 1.0159045725646123,
    u'TAVG': 152.45569620253164,
    u'TMAX': 266.3030303030303,
    u'TMIN': 73.
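
The `join` on a pair RDD is an inner join: it yields one `(key, (left_value, right_value))` tuple for every matching pair of records, and keys that appear on only one side are dropped. To illustrate those semantics without a Spark cluster, here is a plain-Python sketch; `inner_join` and the tiny sample lists are hypothetical stand-ins, not part of the notebook.

```python
from collections import defaultdict

def inner_join(left, right):
    """Mimic a pair-RDD inner join on plain lists of (key, value) pairs."""
    # Index the right-hand side by key so lookups are cheap.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one (key, (left_value, right_value)) tuple per matching pair.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

# Made-up, heavily trimmed sample records for illustration only.
sales_sample = [('20060718-CO', {'price': 419.0}),
                ('20060718-CO', {'price': 12.0}),
                ('20060719-ZZ', {'price': 5.0})]   # no weather for this key
weather_sample = [('20060718-CO', {'TMAX': 266.3})]

joined_sample = inner_join(sales_sample, weather_sample)
for record in joined_sample:
    print(record)
```

Both sales with key `20060718-CO` are paired with the same weather record, while the sale without matching weather disappears from the result. If such sales should be kept, `leftOuterJoin` would be the method to look at instead.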

## Storing the joined result
We will save the result as a CSV file. To do so, we flatten each joined record into a single comma-separated line and write the lines out as a text file.

In [4]:
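def to_line(v):
    # v is one joined record: (key, (sales_data, weather_data))
    rec = [v[0]]

    # Append the sales fields in sorted key order so every line has the same columns
    sales_data = v[1][0]
    for key in sorted(sales_data):
        rec.append(str(sales_data[key]))

    # Append the weather fields, skipping 'date' since the sales record already has one
    weather_data = v[1][1]
    for key in sorted(weather_data):
        if key != 'date':
            rec.append(str(weather_data[key]))

    return ','.join(rec)

formatted = joined.map(to_line)
formatted.take(1)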

[u'20061129-OR,69,Male,7229,Engaged,Ben E. Lang,FL,2006-11-29T00:00:00.000000Z,Male,Shelf Stocker,5056,Daniel Farmer,TN,206.0,Non-food,Cleaning supplies,Brand #56327 rubber gloves,18736,396.0,1,9.0,39,Store39,OR,Credit,3894424,purchase,6.24285714286,42.3631840796,130.357142857,-15.7335164835']
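
Because `to_line` joins the fields with a plain `','`, any field that itself contains a comma would shift the columns of that line. The sample above does not show such a field, so this may never happen in this data set. Should it matter, one option is to let Python's `csv` module do the quoting. The sketch below assumes Python 3 (the notebook output suggests Python 2, where the buffer type would need adjusting), and `to_csv_line` and the sample values are hypothetical.

```python
import csv
import io

def to_csv_line(fields):
    """Format one record as a CSV line, quoting fields where needed."""
    buf = io.StringIO()
    # lineterminator='' because the text-file writer adds the newline itself
    csv.writer(buf, lineterminator='').writerow(fields)
    return buf.getvalue()

# Made-up record with a customer name that contains a comma.
fields = ['20061129-OR', 'Lang, Ben E.', '206.0']

naive = ','.join(fields)       # the comma in the name creates an extra column
safe = to_csv_line(fields)     # the name is wrapped in double quotes

print(naive)
print(safe)
```

A CSV reader recovers the original three fields from the second line, but sees four in the first.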

In [5]:
formatted.saveAsTextFile('/data/views/daily_weather_sales_per_state')
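
The saved lines carry no header, so whoever reads the file needs to know the column order. Since `to_line` emits the key, then the sales fields in sorted order, then the weather fields in sorted order without `date`, the header can be derived the same way. This is a sketch only: `header_for` and the column name `key` are hypothetical, and the sample records are trimmed to a few fields.

```python
def header_for(sales_record, weather_record):
    """Return the column names in the same order that to_line writes values."""
    columns = ['key']  # hypothetical name for the date-state key column
    columns.extend(sorted(sales_record))
    columns.extend(k for k in sorted(weather_record) if k != 'date')
    return columns

# Trimmed sample records; the real ones have many more sales fields.
sales_record = {'price': 206.0, 'customer_age': 69, 'date': '2006-11-29T00:00:00.000000Z'}
weather_record = {'TMIN': -15.7, 'PRCP': 6.2, 'date': '2006-11-29', 'TMAX': 130.4, 'TAVG': 42.4}

header = ','.join(header_for(sales_record, weather_record))
print(header)  # prints "key,customer_age,date,price,PRCP,TAVG,TMAX,TMIN"
```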