__W205, Fall 2016__   
__Final Project:__ Solar Fields and Weather   
__Group:__ Boris Kletser, Maya Miller-Vedam, Geoff Striling, Laura Williams   

# Initial Data Exploration

__OVERVIEW:__ This file contains example API calls to www.eia.gov and www.noaa.gov, along with a brief exploration of the available data, its schema, and its contents.

In [1]:
# imports
import os
import requests
import numpy as np
import pandas as pd
from ftplib import FTP

In [None]:
# global vars -- you will need the following keys in order to run the code below
EIA_API_KEY = ''
NOAA_CDO_TOKEN = ''

### I. Laura's Demo Code

Getting EIA data via API Call:  


__NOTE:__  
* SEGS I -  no power generated in 2016  
* SEGS II - no power generated since 2014  
* SEGS III link: http://www.eia.gov/opendata/qb.cfm?category=4246&sdid=ELEC.PLANT.GEN.10439-SUN-ALL.M  

In [2]:
# Net Generation for specific plants
url = ('http://api.eia.gov/series/?api_key=' + EIA_API_KEY +
       '&series_id=ELEC.PLANT.GEN.10439-SUN-ALL.M')
segs3 = requests.get(url)

In [3]:
# Can check if data downloaded properly by checking the status_code
segs3.status_code

200

In [6]:
# Can turn the data into a json style dictionary
segs3_dict = segs3.json()

In [10]:
# create Dataframe
segs3_df = pd.io.json.json_normalize(segs3_dict['series'])
segs3_df

Unnamed: 0,copyright,data,description,end,f,geography,iso3166,lat,latlon,lon,name,series_id,source,start,units,updated
0,,"[[201608, 6272], [201607, 6351], [201606, 6609...",All solar powered electricity generation (incl...,201608,M,USA-CA,USA-CA,35.00694,"35.00694,-117.555768",-117.555768,Net generation : SEGS III (10439) : solar : al...,ELEC.PLANT.GEN.10439-SUN-ALL.M,"EIA, U.S. Energy Information Administration",200101,megawatthours,2016-10-25T13:28:18-0400


In [12]:
# A closer look at the data: 188 monthly [date, value] pairs; values are net generation in megawatthours (see 'units' above)
data_df = pd.DataFrame(segs3_df['data'][0], columns = ["date", 'megawatts'])
print data_df.describe()
data_df.tail()

         megawatts
count   188.000000
mean   4842.345745
std    2910.088007
min       5.000000
25%    2024.750000
50%    5405.000000
75%    7092.750000
max    9759.000000


Unnamed: 0,date,megawatts
183,200105,8523
184,200104,6181
185,200103,4888
186,200102,1805
187,200101,774
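
The `date` column holds integers in YYYYMM form (201608 is August 2016), and the rows arrive newest first. Below is a minimal sketch, using a few [date, value] pairs copied from the output above, of converting them to `datetime.date` objects and sorting them chronologically. The helper name `parse_yyyymm` is an assumption made for illustration, not part of the EIA API.

```python
from datetime import date

def parse_yyyymm(yyyymm):
    """Hypothetical helper: turn an integer like 201608 into date(2016, 8, 1)."""
    year, month = divmod(int(yyyymm), 100)
    return date(year, month, 1)

# A few [date, value] pairs copied from the SEGS III output above.
rows = [[201608, 6272], [201607, 6351], [200102, 1805], [200101, 774]]

# Sort oldest-first so the series reads chronologically.
series = sorted((parse_yyyymm(d), mwh) for d, mwh in rows)
print(series[0])
```

Using the first day of the month is an arbitrary choice here; the source data only identifies the month.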


### II. Weather Data From NOAA's Online Data Center

First, to get familiar with the API, here are a few queries.

In [14]:
# Defining a Token Authorization Class (needed for noaa API calls)
from requests.auth import AuthBase

class TokenAuth(AuthBase):
    """Attaches Token Authentication to the Request Object"""
    def __init__(self, token):
        self.token = token
        
    def __call__(self, r):
        # modify and return the request
        r.headers['Token'] = self.token
        return r

In [15]:
# Fetch all available datasets
noaa = requests.get('http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets', auth=TokenAuth(NOAA_CDO_TOKEN))
print noaa.status_code

200


In [16]:
# convert them into a dataframe and take a look:
noaa_datasets_df = pd.io.json.json_normalize(noaa.json()['results'])
noaa_datasets_df.head()

Unnamed: 0,datacoverage,id,maxdate,mindate,name,uid
0,1.0,GHCND,2016-11-07,1763-01-01,Daily Summaries,gov.noaa.ncdc:C00861
1,1.0,GSOM,2016-10-01,1763-01-01,Global Summary of the Month,gov.noaa.ncdc:C00946
2,1.0,GSOY,2016-01-01,1763-01-01,Global Summary of the Year,gov.noaa.ncdc:C00947
3,0.95,NEXRAD2,2016-11-07,1991-06-05,Weather Radar (Level II),gov.noaa.ncdc:C00345
4,0.95,NEXRAD3,2016-11-05,1994-05-20,Weather Radar (Level III),gov.noaa.ncdc:C00708


In [17]:
# note that the requests API call above was really slow...
# curl provides a command line alternative that could be piped to a file,
# for example:
! curl -H "token:$NOAA_CDO_TOKEN" "http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets" > test.txt
# WARNING: running this cell will create a file named test.txt
# in the current directory containing the query results

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1625    0  1625    0     0  13233      0 --:--:-- --:--:-- --:--:-- 13319



_For exploratory purposes I used_ [this NOAA tool](http://www.ncdc.noaa.gov/cdo-web/datatools/findstation) _to manually locate a weather station close to the energy plant above._

__Weather Station:__
Bakersfield Airport, CA  
Lat/long: 35.4344, -119.0542  
Station ID: GHCND:USW00023155  

In [18]:
# Fetch data from the Bakersfield station
bakersfield = requests.get('http://www.ncdc.noaa.gov/cdo-web/api/v2/stations/GHCND:USW00023155', auth=TokenAuth(NOAA_CDO_TOKEN))
print bakersfield.status_code

200


In [19]:
# convert it into a dataframe and take a look:
# (the single-station endpoint returns one JSON object rather than a 'results' list)
bakersfield_df = pd.io.json.json_normalize(bakersfield.json())
bakersfield_df.head()


In [20]:
# fetching daily summaries for this station, May 1 2010
url = 'http://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&station=GHCND:USW00023155&startdate=2010-05-01&enddate=2010-05-01'
daily = requests.get(url, auth=TokenAuth(NOAA_CDO_TOKEN))
print daily.status_code

200


In [22]:
# convert it into a dataframe and take a look:
daily_df = pd.io.json.json_normalize(daily.json()['results'])
daily_df

Unnamed: 0,attributes,datatype,date,station,value
0,",,S,",PRCP,2010-05-01T00:00:00,GHCND:AE000041196,0
1,"H,,S,",TAVG,2010-05-01T00:00:00,GHCND:AE000041196,324
2,",,S,",TMAX,2010-05-01T00:00:00,GHCND:AE000041196,397
3,",,S,",TMIN,2010-05-01T00:00:00,GHCND:AE000041196,227
4,",,S,",PRCP,2010-05-01T00:00:00,GHCND:AEM00041194,0
5,"H,,S,",TAVG,2010-05-01T00:00:00,GHCND:AEM00041194,341
6,",,S,",TMAX,2010-05-01T00:00:00,GHCND:AEM00041194,387
7,",,S,",TMIN,2010-05-01T00:00:00,GHCND:AEM00041194,293
8,"H,,S,",TAVG,2010-05-01T00:00:00,GHCND:AEM00041217,327
9,",,S,",TMAX,2010-05-01T00:00:00,GHCND:AEM00041217,383
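
The rows above come from stations such as GHCND:AE000041196 rather than GHCND:USW00023155, which suggests the `station=` filter in the URL was not applied. The CDO v2 documentation names this filter `stationid`. Below is a minimal sketch of building the query string from named parameters so that each one is easy to check. The helper name `build_cdo_url` is hypothetical, and no request is sent.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

CDO_DATA_URL = 'http://www.ncdc.noaa.gov/cdo-web/api/v2/data'

def build_cdo_url(dataset_id, station_id, start, end):
    """Hypothetical helper: assemble a CDO v2 data query URL."""
    params = [
        ('datasetid', dataset_id),
        ('stationid', station_id),  # 'stationid', not 'station'
        ('startdate', start),
        ('enddate', end),
    ]
    # urlencode percent-escapes the ':' in the station ID.
    return CDO_DATA_URL + '?' + urlencode(params)

url = build_cdo_url('GHCND', 'GHCND:USW00023155', '2010-05-01', '2010-05-01')
print(url)
```

Passing this `url` to `requests.get(url, auth=TokenAuth(NOAA_CDO_TOKEN))`, as in the cell above, should then return only the Bakersfield rows.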


### III. Solar Radiation Data

Solar radiation data for dates before 2010 are housed in the National Solar Radiation Database ([docs here](ftp://ftp.ncdc.noaa.gov/pub/data/nsrdb-solar/documentation-2010/NSRDB_UserManual_r20120906.pdf)). More current information (2004 onwards) is available in the monthly, daily and hourly "Quality Controlled Datasets" from NOAA. The code below explores:
*  __Monthly USCRN Data__: [online file system interface](http://www1.ncdc.noaa.gov/pub/data/uscrn/products/monthly01/) & [documentation](http://www1.ncdc.noaa.gov/pub/data/uscrn/products/monthly01/README.txt)

__A.__ Data from a single station (_monthly summaries for 2004-2016 from USCRN station in Merced, CA_)

In [61]:
# getting headers
url = 'http://www1.ncdc.noaa.gov/pub/data/uscrn/products/monthly01/HEADERS.txt'
headers = requests.get(url)
cnames = headers.text.split('\n')[1].split()
print cnames

[u'WBANNO', u'LST_YRMO', u'CRX_VN_MONTHLY', u'PRECISE_LONGITUDE', u'PRECISE_LATITUDE', u'T_MONTHLY_MAX', u'T_MONTHLY_MIN', u'T_MONTHLY_MEAN', u'T_MONTHLY_AVG', u'P_MONTHLY_CALC', u'SOLRAD_MONTHLY_AVG', u'SUR_TEMP_MONTHLY_TYPE', u'SUR_TEMP_MONTHLY_MAX', u'SUR_TEMP_MONTHLY_MIN', u'SUR_TEMP_MONTHLY_AVG']


In [77]:
# pulling monthly data for Merced, CA
url = 'http://www1.ncdc.noaa.gov/pub/data/uscrn/products/monthly01/CRNM0102-CA_Merced_23_WSW.txt'

In [78]:
# OPTION #1 - use pandas
merced_df = pd.read_csv(url, sep = '\s+', header=None, names=cnames)

In [73]:
# OPTION #2 - use requests and parse manually
def parse(string):
    """Helper function to parse text from request; skips blank lines"""
    return [line.split() for line in string.splitlines() if line.strip()]

# pulling data (note: values parsed this way are strings, not numbers)
merced = requests.get(url)
merced_df = pd.DataFrame(parse(merced.text), index=None, columns=cnames)

In [79]:
# take a look
merced_df.head()

Unnamed: 0,WBANNO,LST_YRMO,CRX_VN_MONTHLY,PRECISE_LONGITUDE,PRECISE_LATITUDE,T_MONTHLY_MAX,T_MONTHLY_MIN,T_MONTHLY_MEAN,T_MONTHLY_AVG,P_MONTHLY_CALC,SOLRAD_MONTHLY_AVG,SUR_TEMP_MONTHLY_TYPE,SUR_TEMP_MONTHLY_MAX,SUR_TEMP_MONTHLY_MIN,SUR_TEMP_MONTHLY_AVG
0,93243,200403,1.201,-120.8825,37.2381,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,U,-9999.0,-9999.0,-9999.0
1,93243,200404,1.201,-120.8825,37.2381,25.1,4.9,15.0,15.4,1.2,23.9,R,-9999.0,-9999.0,19.8
2,93243,200405,1.201,-120.8825,37.2381,28.5,6.6,17.6,18.3,3.7,27.9,R,-9999.0,-9999.0,24.6
3,93243,200406,1.201,-120.8825,37.2381,32.6,9.1,20.8,21.4,0.0,29.5,R,-9999.0,-9999.0,24.8
4,93243,200407,1.201,-120.8825,37.2381,34.7,11.8,23.2,23.6,0.0,28.6,R,-9999.0,-9999.0,28.4
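
The `-9999.0` entries in the table above look like missing-value placeholders rather than real measurements. Assuming that reading is correct (the README linked earlier should confirm it), a minimal sketch of having pandas convert them to NaN at load time follows. It uses two rows copied from the Merced output instead of downloading the file, and `sample` is an illustrative name.

```python
import io
import pandas as pd

cnames = ['WBANNO', 'LST_YRMO', 'CRX_VN_MONTHLY', 'PRECISE_LONGITUDE',
          'PRECISE_LATITUDE', 'T_MONTHLY_MAX', 'T_MONTHLY_MIN',
          'T_MONTHLY_MEAN', 'T_MONTHLY_AVG', 'P_MONTHLY_CALC',
          'SOLRAD_MONTHLY_AVG', 'SUR_TEMP_MONTHLY_TYPE',
          'SUR_TEMP_MONTHLY_MAX', 'SUR_TEMP_MONTHLY_MIN',
          'SUR_TEMP_MONTHLY_AVG']

# Two whitespace-delimited rows copied from the Merced output above.
sample = (
    u"93243 200403 1.201 -120.8825 37.2381 -9999.0 -9999.0 -9999.0 "
    u"-9999.0 -9999.0 -9999.0 U -9999.0 -9999.0 -9999.0\n"
    u"93243 200404 1.201 -120.8825 37.2381 25.1 4.9 15.0 "
    u"15.4 1.2 23.9 R -9999.0 -9999.0 19.8\n"
)

# Assumption: -9999 / -9999.0 mark missing data, so map them to NaN.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None,
                 names=cnames, na_values=['-9999', '-9999.0'])
print(df[['LST_YRMO', 'T_MONTHLY_MAX', 'SOLRAD_MONTHLY_AVG']])
```

Without this step, summary statistics such as `describe()` would treat the placeholders as very cold temperatures.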


__B.__ List of Available Stations

In [85]:
# USCRN stations indexed by their WBAN ID numbers
url = 'http://www1.ncdc.noaa.gov/pub/data/uscrn/products/stations.tsv'
stations_df = pd.read_csv(url, sep = '\t', header=0, index_col = 'WBAN')
stations_df.head()

WBAN,COUNTRY,STATE,LOCATION,VECTOR,NAME,LATITUDE,LONGITUDE,ELEVATION,STATUS,COMMISSIONING,CLOSING,OPERATION,PAIRING,NETWORK
3047,US,TX,Monahans,6 ENE,Sandhills State Park,31.62,-102.8,2724,Commissioned,2004-01-11 19:00:00.0,,Operational,,USCRN
3048,US,NM,Socorro,20 N,Sevilleta National Wildlife Refuge (LTER Site),34.35,-106.88,4847,Commissioned,2004-01-11 19:00:00.0,,Operational,,USCRN
3054,US,TX,Muleshoe,19 S,Muleshoe National Wildlife Refuge (Headquarter...,33.95,-102.77,3742,Commissioned,2004-04-22 20:00:00.0,,Operational,,USCRN
3055,US,OK,Goodwell,2 E,OK Panhandle Research & Extn. Center (Native ...,36.59,-101.59,3266,Commissioned,2004-04-22 20:00:00.0,,Operational,,USCRN
3060,US,CO,Montrose,11 ENE,Black Canyon of the Gunnison National Park (Ve...,38.54,-107.69,8402,Commissioned,2004-09-07 20:00:00.0,,Operational,,USCRN


In [86]:
# Looking at California Stations
stations_df[stations_df.STATE == "CA"]

WBAN,COUNTRY,STATE,LOCATION,VECTOR,NAME,LATITUDE,LONGITUDE,ELEVATION,STATUS,COMMISSIONING,CLOSING,OPERATION,PAIRING,NETWORK
4222,US,CA,Redding,12 WNW,Whiskeytown National Recreation Area (RAWS Site),40.65,-122.6,1418,Commissioned,2004-01-25 19:00:00.0,,Operational,,USCRN
53139,US,CA,Stovepipe Wells,1 SW,Death Valley National Park (Stovepipe Wells Site),36.6,-117.14,84,Commissioned,2004-06-07 20:00:00.0,,Operational,,USCRN
53150,US,CA,Yosemite Village,12 W,"Yosemite National Park, (Crane Flat Lookout)",37.75,-119.82,6620,Commissioned,2007-12-18 19:00:00.0,,Operational,,USCRN
53151,US,CA,Fallbrook,5 NE,San Diego State Univ's Santa Margarita Ecologi...,33.43,-117.19,1140,Commissioned,2008-06-08 20:00:00.0,,Operational,,USCRN
53152,US,CA,Santa Barbara,11 W,Univ. of California - Santa Barbara (Coal Oil ...,34.41,-119.87,18,Commissioned,2008-09-20 20:00:00.0,,Operational,,USCRN
93243,US,CA,Merced,23 WSW,Kesterson Reservoir (US Bureau of Reclamation),37.23,-120.88,78,Commissioned,2004-06-07 20:00:00.0,,Operational,,USCRN
93245,US,CA,Bodega,6 WSW,University of California - Davis (Bodega Marin...,38.32,-123.07,63,Commissioned,2008-07-29 20:00:00.0,,Operational,,USCRN


__C.__ Using station information to programmatically identify the URL for each station's monthly summary data in the ncdc.noaa.gov file system.

In [101]:
# helper function
def get_noaa_url(wban, stations_df):
    """ Function to take a wban number and output a url."""
    base = 'http://www1.ncdc.noaa.gov/pub/data/uscrn/products/monthly01/CRNM0102-'
    station = '_'.join(stations_df.loc[str(wban),['STATE', 'LOCATION', 'VECTOR']])
    return base + station.replace(' ','_') + '.txt'
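
The file names in `monthly01/` appear to follow the pattern `CRNM0102-<STATE>_<LOCATION>_<VECTOR>.txt`, with spaces replaced by underscores, which is what the helper relies on. Below is a standalone sketch of that string construction, checked against the Merced URL used earlier. `build_monthly_url` is a hypothetical name, and the pattern is inferred from the examples in this notebook rather than taken from NOAA documentation.

```python
BASE = 'http://www1.ncdc.noaa.gov/pub/data/uscrn/products/monthly01/CRNM0102-'

def build_monthly_url(state, location, vector):
    """Hypothetical helper: join the station fields into a monthly01 file URL."""
    station = '_'.join([state, location, vector])
    # Multi-word fields such as '23 WSW' contain spaces; the file names use '_'.
    return BASE + station.replace(' ', '_') + '.txt'

# Field values taken from the Merced row of the California stations table.
merced_url = build_monthly_url('CA', 'Merced', '23 WSW')
print(merced_url)
```

Because the function takes plain strings, it can be checked without a network connection or a loaded `stations_df`.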

In [104]:
# pulling data for station in Bodega
url = get_noaa_url(93245, stations_df)
bodega_df = pd.read_csv(url, sep = '\s+', header=None, names=cnames)
bodega_df.head()

Unnamed: 0,WBANNO,LST_YRMO,CRX_VN_MONTHLY,PRECISE_LONGITUDE,PRECISE_LATITUDE,T_MONTHLY_MAX,T_MONTHLY_MIN,T_MONTHLY_MEAN,T_MONTHLY_AVG,P_MONTHLY_CALC,SOLRAD_MONTHLY_AVG,SUR_TEMP_MONTHLY_TYPE,SUR_TEMP_MONTHLY_MAX,SUR_TEMP_MONTHLY_MIN,SUR_TEMP_MONTHLY_AVG
0,93245,200806,1.302,-123.0747,38.3208,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,U,-9999.0,-9999.0,-9999.0
1,93245,200807,1.302,-123.0747,38.3208,16.5,10.3,13.4,12.7,0.2,22.0,R,35.4,10.2,18.0
2,93245,200808,1.302,-123.0747,38.3208,17.0,10.9,14.0,13.2,1.5,20.3,R,35.9,10.3,18.3
3,93245,200809,1.302,-123.0747,38.3208,16.7,10.5,13.6,13.2,1.5,15.1,R,31.7,9.0,16.5
4,93245,200810,1.302,-123.0747,38.3208,16.2,9.5,12.8,12.5,24.0,13.5,R,26.9,7.2,14.0
