In [1]:
from __future__ import print_function
import pandas as pd
import sqlalchemy
import config

In [2]:
# Extract username and password from YAML config
credentials_file = '.database_credentials.yml'
cfg = config.YAMLParser(credentials_file).config

We load the weather data below. The fields for this dataset are enumerated [here](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/ghcn-daily-by_year-format.rtf).

In [3]:
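# Note: the by_year files have eight fields per record (ID, date, element,
# data value, M-flag, Q-flag, S-flag, obs-time). Only seven names are given
# here, so the column called 'measurement_flag' actually holds the data value
# and each later name is shifted one field to the left.
weather_data = pd.read_csv('2015.csv',
                           header=None,
                           index_col=False,
                           names=['station_identifier',
                                  'measurement_date',
                                  'measurement_type',
                                  'measurement_flag',
                                  'quality_flag',
                                  'source_flag',
                                  'observation_time'],
                           parse_dates=['measurement_date'])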

In [4]:
weather_data.head()

Unnamed: 0,station_identifier,measurement_date,measurement_type,measurement_flag,quality_flag,source_flag,observation_time
0,US1FLSL0019,2015-01-01,PRCP,173,,,N
1,US1TXTV0133,2015-01-01,PRCP,119,,,N
2,USC00178998,2015-01-01,TMAX,-33,,,7
3,USC00178998,2015-01-01,TMIN,-167,,,7
4,USC00178998,2015-01-01,TOBS,-67,,,7
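
The format document linked above lists eight fields per record. As a minimal sketch, supplying one name per field keeps every column aligned with its documented meaning. The two inline sample rows and the names `data_value`, `m_flag`, `q_flag`, `s_flag`, and `obs_time` are illustrative assumptions, not names used elsewhere in this notebook:

```python
import io
import pandas as pd

# Hypothetical sample in the by_year layout: eight comma-separated fields per row
sample = io.StringIO(
    "US1FLSL0019,20150101,PRCP,173,,,N,\n"
    "USC00178998,20150101,TMAX,-33,,,7,0700\n"
)

# Illustrative names, one per documented field
field_names = ['station_identifier', 'measurement_date', 'measurement_type',
               'data_value', 'm_flag', 'q_flag', 's_flag', 'obs_time']

sample_df = pd.read_csv(sample,
                        header=None,
                        names=field_names,
                        dtype={'obs_time': str},  # Keep leading zeros such as '0700'
                        parse_dates=['measurement_date'])
print(sample_df[['measurement_type', 'data_value', 's_flag', 'obs_time']])
```

With all eight names in place, `index_col=False` is no longer needed to discard the trailing field.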


There are a [large number](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt) of categories of weather data in the dataset. For simplicity, we only want to load the data from the 5 "core" weather categories into the database for further analysis:

* PRCP : Precipitation (tenths of mm)
* SNOW : Snowfall (mm)
* SNWD : Snow depth (mm)
* TMAX : Maximum temperature (tenths of degrees C)
* TMIN : Minimum temperature (tenths of degrees C)
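
Because PRCP, TMAX, and TMIN are stored in tenths of a unit, the raw integers need rescaling before they read as millimeters or degrees Celsius. A minimal sketch, assuming a hypothetical `value` column holding the raw integers and a hypothetical `SCALE` lookup built from the list above:

```python
import pandas as pd

# Scale factors derived from the units listed above (tenths -> whole units)
SCALE = {'PRCP': 0.1, 'SNOW': 1.0, 'SNWD': 1.0, 'TMAX': 0.1, 'TMIN': 0.1}

# Hypothetical sample records; 'value' is an illustrative column name
records = pd.DataFrame({
    'measurement_type': ['PRCP', 'TMAX', 'TMIN', 'SNOW'],
    'value': [173, -33, -167, 25],
})

# Map each measurement type to its factor and rescale the raw value
records['scaled_value'] = records['value'] * records['measurement_type'].map(SCALE)
print(records)
```

Keeping the raw integers in the database and rescaling at analysis time avoids storing floating-point rounding error.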

We also want to cull a few columns from the DataFrame.

In [5]:
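# Keep only the five core measurement types and the four columns we need
core_types = ['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN']
weather_data_subset = weather_data.loc[weather_data.measurement_type.isin(core_types),
                                       ['station_identifier', 'measurement_date', 'measurement_type', 'measurement_flag']]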

In [6]:
weather_data_subset.head()

Unnamed: 0,station_identifier,measurement_date,measurement_type,measurement_flag
0,US1FLSL0019,2015-01-01,PRCP,173
1,US1TXTV0133,2015-01-01,PRCP,119
2,USC00178998,2015-01-01,TMAX,-33
3,USC00178998,2015-01-01,TMIN,-167
5,USC00178998,2015-01-01,PRCP,0


This cuts down the total count of records by about 30%.
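
One way to check a figure like this is to compare row counts before and after filtering. A minimal sketch on a hypothetical six-row frame (the real ratio depends on the contents of `2015.csv`):

```python
import pandas as pd

# Hypothetical stand-in for the full dataset
all_rows = pd.DataFrame({
    'measurement_type': ['PRCP', 'PRCP', 'TMAX', 'TMIN', 'TOBS', 'PRCP'],
})
core_types = ['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN']

# Boolean mask of rows that belong to a core measurement type
is_core = all_rows['measurement_type'].isin(core_types)

# The mean of a boolean mask is the fraction of True values,
# so one minus the mean is the fraction of rows the filter drops
fraction_removed = 1 - is_core.mean()
print('Removed {:.0%} of records'.format(fraction_removed))
```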

Now, let's write the weather data to our DB. If you set up the Postgres DB as described in the README, this should open a connection to the database as your local Unix user.

If you've configured another Postgres user with a username and password, fill in the appropriate credentials in the [SQLAlchemy connection string](http://docs.sqlalchemy.org/en/latest/core/engines.html).
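
As a minimal sketch of a connection string that carries credentials, assuming Python 3 and a hypothetical `credentials` dict (the actual layout of the YAML config is not shown in this notebook), the password can be URL-encoded so that special characters do not break the URL:

```python
from urllib.parse import quote_plus

# Hypothetical credentials; in practice these would come from the YAML config
credentials = {'username': 'weather_user', 'password': 'p@ss/word'}

db_name = 'weather'
# URL-encode the password so characters such as '@' and '/' are not
# mistaken for URL delimiters
connection_string = 'postgresql://{user}:{password}@localhost:5432/{db}'.format(
    user=credentials['username'],
    password=quote_plus(credentials['password']),
    db=db_name,
)
print(connection_string)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as the credential-free version.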

In [9]:
db_name = 'weather'
connection_string = "postgresql://localhost:5432/%s" % (db_name)
conn = sqlalchemy.create_engine(connection_string)

table_name = 'weather_data'
# to_sql defaults to bigint for integer columns here, which is larger than we need.
# This dict overrides the datatype for the columns we want to change.
column_type_dict = {'measurement_flag': sqlalchemy.types.Integer}
# Writing all the data to the DB at once will cause this notebook to crash,
# so we pass chunksize to write the records in batches of 100,000 rows.
weather_data_subset.to_sql(table_name, conn, chunksize=100000, index_label='id', dtype=column_type_dict)
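
To see chunked writing in isolation, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for Postgres. The three sample rows and the name `sqlite_conn` are illustrative assumptions; the point is only that `chunksize` splits the insert into batches and that a row count read back from the table confirms every batch arrived:

```python
import sqlite3
import pandas as pd

# Hypothetical sample rows; an in-memory SQLite database stands in for Postgres
sample = pd.DataFrame({
    'station_identifier': ['US1FLSL0019', 'US1TXTV0133', 'USC00178998'],
    'measurement_type': ['PRCP', 'PRCP', 'TMAX'],
    'measurement_flag': [173, 119, -33],
})

sqlite_conn = sqlite3.connect(':memory:')
# chunksize=2 writes the three rows in two batches instead of one
sample.to_sql('weather_data', sqlite_conn, chunksize=2, index_label='id')

# Read the row count back to confirm that every batch was written
row_count = pd.read_sql('SELECT COUNT(*) AS n FROM weather_data', sqlite_conn)['n'].iloc[0]
print(row_count)
sqlite_conn.close()
```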

Now, let's read in and process the station metadata, which contains the (latitude, longitude) of each weather station.

In [10]:
station_metadata = pd.read_csv('ghcnd-stations.txt',
                               sep=r'\s+',  # Fields are separated by one or more spaces
                               usecols=[0, 1, 2, 3],  # Grab only the first 4 columns
                               na_values=[-999.9],  # Missing elevation is noted as -999.9
                               header=None,
                               names=['station_id', 'latitude', 'longitude', 'elevation'])

In [11]:
station_metadata.head()

Unnamed: 0,station_id,latitude,longitude,elevation
0,ACW00011604,17.1167,-61.7833,10.1
1,ACW00011647,17.1333,-61.7833,19.2
2,AE000041196,25.333,55.517,34.0
3,AEM00041194,25.255,55.364,10.4
4,AEM00041217,24.433,54.651,26.8
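
The station file is laid out in fixed-width columns, so `pd.read_fwf` is an alternative to splitting on whitespace. A minimal sketch, assuming the column positions given in the GHCN-Daily readme (ID in columns 1-11, latitude 13-20, longitude 22-30, elevation 32-37) and two hypothetical sample lines built to those widths:

```python
import io
import pandas as pd

# Build two hypothetical lines that follow the assumed fixed-width layout
line_format = '{:<11} {:>8.4f} {:>9.4f} {:>6.1f} {:<2} {:<30}'
sample_lines = '\n'.join([
    line_format.format('ACW00011604', 17.1167, -61.7833, 10.1, '', 'ST JOHNS COOLIDGE FLD'),
    line_format.format('XXX00000001', 10.0, 20.0, -999.9, '', 'HYPOTHETICAL STATION'),
])

# Zero-based, half-open (start, end) positions for the first four fields
colspecs = [(0, 11), (12, 20), (21, 30), (31, 37)]

stations = pd.read_fwf(io.StringIO(sample_lines),
                       colspecs=colspecs,
                       header=None,
                       names=['station_id', 'latitude', 'longitude', 'elevation'],
                       na_values=[-999.9])  # Missing elevation is noted as -999.9
print(stations)
# Count of stations whose elevation was marked as missing
print(stations['elevation'].isna().sum())
```

Reading by position does not depend on station names being free of extra whitespace, which is why it can be a more robust choice for this file.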


How many stations do we have with missing elevation?

In [12]:
len(station_metadata[station_metadata['elevation'].isnull()])

4623

Finally, write the metadata to the DB.

In [13]:
table_name = 'station_metadata'
station_metadata.to_sql(table_name, conn, index_label='id')