# GHCND Stations Data Prep
#### Description:
This notebook is used to prep the data contained in ghcnd-stations.txt. Use noaa-daily-retrieve-files-ftp.ipynb to download this file among other metadta files used in the NOAA Daily Weather Project. 

System of Origin Data Source: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

#### Data Prep Operations:
- define english column names
- create CountryCode column from first 2 characters of StationID
- merge with ghcnd-countries.txt to get CountryName and ghcnd-states.txt to get StateName
- save to .csv file (tab separated)

#### Created by:
Nate Muth <br>
nmuth87@gmail.com

#### Created on:
7/29/2018

#### Changelog:
7/29/2018 - Initial Create Date<br>
8/5/2018 - Added metadata tables to give aliases for countries and states

In [2]:
import pandas as pd

In [3]:
# read stations.txt into DataFrame
stationsDF = pd.read_fwf('ghcnd-stations.txt', header=None, delimiter=' '
                         , widths=[12,9,10,7,3,31,4,4,6]
                         , names=['StationID', 'Latitude', 'Longitude', 'Elevation',
                                 'State', 'StationName', 'GSN_Flag', 'HCN_CRN_Flag', 'WMO_ID']
                         , dtypes={'WMO_ID':object}
                         )

# set StationID as the index
stationsDF.set_index('StationID', inplace=True)

In [4]:
stationsDF.head()

Unnamed: 0_level_0,Latitude,Longitude,Elevation,State,StationName,GSN_Flag,HCN_CRN_Flag,WMO_ID
StationID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
ACW00011604,17.1167,-61.7833,10.1,,ST JOHNS COOLIDGE FLD,,,
ACW00011647,17.1333,-61.7833,19.2,,ST JOHNS,,,
AE000041196,25.333,55.517,34.0,,SHARJAH INTER. AIRP,GSN,,41196.0
AEM00041194,25.255,55.364,10.4,,DUBAI INTL,,,41194.0
AEM00041217,24.433,54.651,26.8,,ABU DHABI INTL,,,41217.0


In [5]:
stationsDF[stationsDF['StationName'].str.contains('EPPLEY')]

Unnamed: 0_level_0,Latitude,Longitude,Elevation,State,StationName,GSN_Flag,HCN_CRN_Flag,WMO_ID
StationID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
USW00014942,41.3103,-95.8992,299.3,NE,OMAHA EPPLEY AIRFIELD,,,72550.0


In [6]:
def countryCode(row):
    return row['temp_StationID'][:2]

stationsDF['temp_StationID'] = stationsDF.index.values
stationsDF['CountryCode']= stationsDF.apply(countryCode, axis=1)
stationsDF.drop(columns=['temp_StationID'],inplace=True)

In [7]:
stationsDF[stationsDF['StationName'].str.contains('EPPLEY')]

Unnamed: 0_level_0,Latitude,Longitude,Elevation,State,StationName,GSN_Flag,HCN_CRN_Flag,WMO_ID,CountryCode
StationID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
USW00014942,41.3103,-95.8992,299.3,NE,OMAHA EPPLEY AIRFIELD,,,72550.0,US


### 8/5/2018 | Added metadata tables to give aliases for countries and states
Joining to ghcnd-countries.txt on the CountryCode column<br>
Joining to ghcnd-states.txt on the State column

In [10]:
countriesDF = pd.read_fwf('ghcnd-countries.txt',header=None,delimiter=' ',names=['CountryCode','CountryName'])
countriesDF.head()

Unnamed: 0,CountryCode,CountryName
0,AC,Antigua and Barbuda
1,AE,United Arab Emirates
2,AF,Afghanistan
3,AG,Algeria
4,AJ,Azerbaijan


In [13]:
stations_DF2 = pd.merge(stationsDF,countriesDF,on='CountryCode',how='left')
stations_DF2[stations_DF2['StationName'].str.contains('EPPLEY')]

Unnamed: 0,Latitude,Longitude,Elevation,State,StationName,GSN_Flag,HCN_CRN_Flag,WMO_ID,CountryCode,CountryName
104868,41.3103,-95.8992,299.3,NE,OMAHA EPPLEY AIRFIELD,,,72550.0,US,United States


In [16]:
statesDF = pd.read_fwf('ghcnd-states.txt',header=None,delimiter=' ',names=['State','StateName'])
statesDF.head()

Unnamed: 0,State,StateName
0,AB,ALBERTA
1,AK,ALASKA
2,AL,ALABAMA
3,AR,ARKANSAS
4,AS,AMERICAN SAMOA


In [17]:
stations_DF2 = pd.merge(stations_DF2,statesDF,on='State',how='left')
stations_DF2[stations_DF2['StationName'].str.contains('EPPLEY')]

Unnamed: 0,Latitude,Longitude,Elevation,State,StationName,GSN_Flag,HCN_CRN_Flag,WMO_ID,CountryCode,CountryName,StateName
104868,41.3103,-95.8992,299.3,NE,OMAHA EPPLEY AIRFIELD,,,72550.0,US,United States,NEBRASKA


In [18]:
# Write file to local directory
stationsDF.to_csv('ghcnd-stations-cleansed.csv',sep='\t')