# Weather Underground Parser Script

## Data Model

This script takes JSON files as input and extracts .csv files to be inserted into the ERDL database. The JSON format is as follows:

There are 2 keys on the root level: 

 * Response
 * History
 
`Response` contains the status of the request, but it does not indicate, for example, whether any readings are available for the given day. A JSON document is returned even when there are none.
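
Because the response status alone does not reveal an empty day, the readings themselves have to be checked. The snippet below is a minimal sketch of such a check. It assumes the lowercase key names `history` and `observations` that the parsing script in this document uses; the helper name `has_readings` and the sample payloads are hypothetical.

```python
import json

def has_readings(payload):
    """Return True if the parsed JSON holds at least one observation."""
    # A missing or null "history" key is treated the same as an empty one.
    history = payload.get("history") or {}
    return bool(history.get("observations"))

# Hypothetical payloads, for illustration only
empty_day = json.loads('{"history": {"observations": []}}')
busy_day = json.loads('{"history": {"observations": [{"tempm": "31.5"}]}}')

print(has_readings(empty_day))
print(has_readings(busy_day))
```

A file for which the check fails can then be skipped or logged rather than silently producing an empty CSV.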

Within `History` there are 2 other keys:

 * Observations
 * Daily Summary
 
The readings are available under the `Observations` key, which the script reads as `d['history']['observations']`. Its value is a **list** of observations, and each **element of the list** is a dictionary whose keys can take one of 3 forms:
 
 * reading: A single-valued entry of the form `<sensor_type>:<value>`
 * date: A multi-valued dictionary containing the date in a local timezone (it is unclear whether this is based on the request IP or on the location of the station).
 * utcdate: Similar to date, except that the timestamp is guaranteed to be UTC. To avoid confusion, this is the timestamp that will be used.
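
The three kinds of keys can be separated as in the following sketch. It is only an illustration under stated assumptions: the field names inside `utcdate` (`year`, `mon`, `mday`, `hour`, `min`) are assumed and not confirmed by this document, and the helper names and the sample observation are hypothetical.

```python
from datetime import datetime, timezone

# Keys treated as metadata rather than sensor readings
NON_SENSOR_KEYS = {"date", "utcdate", "softwaretype"}

def utc_timestamp(utcdate):
    """Build a timezone-aware datetime from a `utcdate` dictionary.

    ASSUMPTION: the dictionary carries string fields named
    'year', 'mon', 'mday', 'hour' and 'min'.
    """
    return datetime(
        int(utcdate["year"]), int(utcdate["mon"]), int(utcdate["mday"]),
        int(utcdate["hour"]), int(utcdate["min"]),
        tzinfo=timezone.utc,
    )

def sensor_readings(observation):
    """Return only the `<sensor_type>: <value>` entries of an observation."""
    return {k: v for k, v in observation.items() if k not in NON_SENSOR_KEYS}

# Hypothetical observation, for illustration only
observation = {
    "date": {"year": "2016", "mon": "07", "mday": "01", "hour": "00", "min": "05"},
    "utcdate": {"year": "2016", "mon": "07", "mday": "01", "hour": "04", "min": "05"},
    "softwaretype": "example-software",
    "tempm": "31.5",
    "hum": "49",
}

print(utc_timestamp(observation["utcdate"]).isoformat())
print(sensor_readings(observation))
```

Filtering by a fixed set of metadata keys means that any sensor added later is picked up without changing the code.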
  
The following image shows an example of one observation in the **list of observations**. The bottom of the image shows the start of a second observation, which begins with the `date` key again.

<img src="img/wu_json_format.png" width="400" height="400">

## JSON to ERDL Database Parsing Script 

In [None]:
from os import listdir
from os.path import isfile, join
datapath = 'data'

# Obtain the list of all file names, filter out folder names if any
file_names = [f for f in listdir(datapath) if isfile(join(datapath, f))]

# Define the rules to format the JSON into a CSV file that conforms to ERDL database schema
import json
import pandas

def extract_daily_readings(file_path):
    """Return a DataFrame with one row per observation found in the given JSON file."""
    with open(file_path) as json_data:
        d = json.load(json_data)
    observations = d['history']['observations']
    frames = []
    for observation in (observations or []):
        # Only the "date", "utcdate" and "softwaretype" keys are assumed to exist.
        # Every other key is treated as a sensor type, so a newly added sensor is included automatically.

        # Drop the local-time "date" key; only "utcdate" is used, to avoid mixing timezones.
        observation.pop('date', None)
        # Extract the software type value (not used yet)
        software_type = observation.pop('softwaretype', None)
        # TODO: extract the timestamp from "utcdate" before discarding it
        observation.pop('utcdate', None)
        # Each remaining key corresponds to a sensor type that was made available.
        day_df = pandas.DataFrame([observation])
        print(day_df)
        frames.append(day_df)
    # Return the rows instead of discarding them, so the caller does not collect None
    if not frames:
        return pandas.DataFrame()
    return pandas.concat(frames, ignore_index=True)


# Loop through each file, extracting a pandas dataframe that conforms to ERDL database schema for each day
formatted_readings = []
for file_name in file_names:
    path = join(datapath,file_name)
    formatted_readings.append(extract_daily_readings(path))


   UV dewpti dewptm heatindexi heatindexm hum precip_ratei precip_ratem  \
0  10   67.1   19.5       91.9       33.3  49         0.00          0.0   

  precip_totali precip_totalm  ...  tempi tempm wdird wdire wgusti wgustm  \
0          0.00           0.0  ...   88.7  31.5    61   ENE   24.8   39.9   

  windchilli windchillm wspdi wspdm  
0       -999       -999  22.4  36.0  

[1 rows x 23 columns]
   UV dewpti dewptm heatindexi heatindexm hum precip_ratei precip_ratem  \
0  10   67.1   19.5       91.9       33.3  49         0.00          0.0   

  precip_totali precip_totalm  ...  tempi tempm wdird wdire wgusti wgustm  \
0          0.00           0.0  ...   88.7  31.5    85  East   17.5   28.2   

  windchilli windchillm wspdi wspdm  
0       -999       -999  15.2  24.5  

[1 rows x 23 columns]
   UV dewpti dewptm heatindexi heatindexm hum precip_ratei precip_ratem  \
0  10   66.9   19.4       91.5       33.1  49         0.00          0.0   

  precip_totali precip_totalm  ...  tem