# GHCN data
Reading in and working with GHCN (Global Historical Climatology Network) data

See https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn

Reading in stations

In [1]:
from ftplib import FTP
from io import StringIO
from contextlib import closing
import pandas as pd

In [2]:
ghcn_specs = [
    {
        "name": "stations",
        "file_name": "ghcnd-stations.txt",
        "desc": "stations",
        "field_specs": [
            {
                "field_name": "ID",
                "pos_from": 1,
                "pos_to": 11
            },
            {
                "field_name": "LATITUDE",
                "pos_from": 13,
                "pos_to": 20
            },
            {
                "field_name": "LONGITUDE",
                "pos_from": 22,
                "pos_to": 30
            },
            {
                "field_name": "ELEVATION",
                "pos_from": 32,
                "pos_to": 37
            },
            {
                "field_name": "STATE",
                "pos_from": 39,
                "pos_to": 40
            },
            {
                "field_name": "NAME",
                "pos_from": 42,
                "pos_to": 71
            },
            {
                "field_name": "GSN FLAG",
                "pos_from": 73,
                "pos_to": 75
            },
            {
                "field_name": "HCN/CRN FLAG",
                "pos_from": 77,
                "pos_to": 79
            },
            {
                "field_name": "WMO ID",
                "pos_from": 81,
                "pos_to": 85
            },
        ]
    },
    {
        "name": "readme",
        "file_name": "readme.txt",
        "desc": "readme"
    }
]

In [4]:
def get_ghcn_specs(name):
    return [g for g in ghcn_specs if g["name"] == name][0]

In [5]:
def get_ghcn_file(file_name: str = "ghcnd-stations.txt") -> str:
    """Download a text file from the GHCN-Daily FTP directory and return its contents."""
    with FTP("ftp.ncdc.noaa.gov") as ftp:
        ftp.login()  # anonymous login
        with StringIO() as r:
            # retrlines strips line endings, so the callback re-adds a newline
            ftp.retrlines(
                f"RETR /pub/data/ghcn/daily/{file_name}",
                callback=lambda line: r.write(line + "\n"))
            return r.getvalue()
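
In case the FTP server is unavailable, the same files may also be reachable over HTTPS. The sketch below is only an assumption-based alternative: the base URL and the names `BASE_URL`, `ghcn_url` and `get_ghcn_file_https` are not part of the original notebook, and the URL should be verified before relying on it.

```python
from urllib.request import urlopen

# Assumed HTTPS location of the GHCN-Daily directory; verify before relying on it
BASE_URL = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/"


def ghcn_url(file_name="ghcnd-stations.txt"):
    """Build the full URL for a file inside the GHCN-Daily directory."""
    return BASE_URL + file_name


def get_ghcn_file_https(file_name="ghcnd-stations.txt", timeout=60):
    """Fetch a GHCN-Daily text file over HTTPS and return it as a string."""
    with urlopen(ghcn_url(file_name), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

If the URL holds, `get_ghcn_file_https("readme.txt")` would return the same kind of string as `get_ghcn_file("readme.txt")`, so the rest of the notebook would not need to change.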

In [6]:
stations = get_ghcn_file("ghcnd-stations.txt")

In [7]:
stations_field_pos = [(field["pos_from"] - 1, field["pos_to"]) for field in get_ghcn_specs("stations")["field_specs"]]

In [8]:
stations_field_names = [field["field_name"] for field in get_ghcn_specs("stations")["field_specs"]]
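
The field specs use 1-based, inclusive positions, while `colspecs` in `pd.read_fwf` (like Python slices) expects 0-based, half-open intervals. That is why only `pos_from` is shifted by one. A minimal sketch of the same conversion with plain string slicing, using an illustrative record padded to the spec widths (the `fields` list and the sample `line` are constructed here for illustration only):

```python
# (name, pos_from, pos_to): 1-based, inclusive, mirroring the field specs
fields = [
    ("ID", 1, 11), ("LATITUDE", 13, 20), ("LONGITUDE", 22, 30),
    ("ELEVATION", 32, 37), ("STATE", 39, 40), ("NAME", 42, 71),
    ("GSN FLAG", 73, 75), ("HCN/CRN FLAG", 77, 79), ("WMO ID", 81, 85),
]

# Illustrative record, padded so every field sits at its specified position
line = "{:<11} {:>8} {:>9} {:>6} {:<2} {:<30} {:<3} {:<3} {:<5}".format(
    "ACW00011604", "17.1167", "-61.7833", "10.1", "",
    "ST JOHNS COOLIDGE FLD", "", "", "")

# pos_from - 1 is the 0-based start; pos_to already serves as the exclusive end
record = {name: line[start - 1:end].strip() for name, start, end in fields}

print(record["ID"], record["LATITUDE"], record["LONGITUDE"], record["NAME"])
```

Empty fields such as `STATE` come back as empty strings here, whereas `pd.read_fwf` reports them as missing values.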

In [9]:
stations_df = pd.read_fwf(
    StringIO(stations),
    colspecs = stations_field_pos,
    header = None)

stations_df.columns = stations_field_names

In [10]:
stations_df.head()

| | ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACW00011604 | 17.1167 | -61.7833 | 10.1 | | ST JOHNS COOLIDGE FLD | | | |
| 1 | ACW00011647 | 17.1333 | -61.7833 | 19.2 | | ST JOHNS | | | |
| 2 | AE000041196 | 25.333 | 55.517 | 34.0 | | SHARJAH INTER. AIRP | GSN | | 41196.0 |
| 3 | AEM00041194 | 25.255 | 55.364 | 10.4 | | DUBAI INTL | | | 41194.0 |
| 4 | AEM00041217 | 24.433 | 54.651 | 26.8 | | ABU DHABI INTL | | | 41217.0 |


In [11]:
len(stations_df)

115081

In [12]:
stations_df.STATE.value_counts()

TX    5037
CO    4176
CA    2798
NC    2159
NE    2090
      ... 
PW      11
UM      11
MP      11
SA       1
PI       1
Name: STATE, Length: 76, dtype: int64

In [13]:
stations_df[stations_df.STATE == "NY"]

| | ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|---|---|---|---|---|---|---|---|---|---|
| 75552 | US1NYAB0001 | 42.6670 | -74.0509 | 445.0 | NY | ALTAMONT 2.7 SSW | | | |
| 75553 | US1NYAB0006 | 42.7198 | -73.9304 | 88.7 | NY | SCHENECTADY 5.6 SSW | | | |
| 75554 | US1NYAB0010 | 42.5455 | -74.1475 | 488.3 | NY | RENSSELAERVILLE 2.1 NNW | | | |
| 75555 | US1NYAB0016 | 42.7590 | -73.7370 | 104.2 | NY | LATHAM 1.1 NNE | | | |
| 75556 | US1NYAB0017 | 42.6678 | -73.7856 | 65.5 | NY | ALBANY 0.7 E | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114228 | USW00094740 | 44.3853 | -74.2067 | 506.9 | NY | SARANAC RGNL AP | | | |
| 114230 | USW00094745 | 41.0669 | -73.7075 | 115.5 | NY | WESTCHESTER CO AP | | | |
| 114233 | USW00094789 | 40.6386 | -73.7622 | 3.4 | NY | NEW YORK JFK INTL AP | | | 74486.0 |
| 114234 | USW00094790 | 43.9922 | -76.0217 | 96.9 | NY | WATERTOWN INTL AP | | | |


In [22]:
stations_df[(stations_df.STATE == "NY") & (pd.notnull(stations_df["GSN FLAG"]))]

| | ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | WMO ID |
|---|---|---|---|---|---|---|---|---|---|
| 113151 | USW00014771 | 43.1111 | -76.1039 | 125.9 | NY | SYRACUSE HANCOCK INTL AP | GSN | HCN | 72519.0 |


In [14]:
readme = get_ghcn_file("readme.txt")

In [16]:
print(readme)

README FILE FOR DAILY GLOBAL HISTORICAL CLIMATOLOGY NETWORK (GHCN-DAILY) 
Version 3.26

--------------------------------------------------------------------------------
How to cite:

Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier)
so it may be relevant to cite both the methods/overview journal article as well 
as the specific version of the dataset used.

The journal article describing GHCN-Daily is:
Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012:  An overview 
of the Global Historical Climatology Network-Daily Database.  Journal of Atmospheric 
and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.

To acknowledge the specific version of the dataset used, please cite:
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, 
R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - 
Daily (GHCN-Daily), Version 3. [indicate subset used following decimal

In [23]:
test_dly = get_ghcn_file("all/USW00014771.dly")
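
A `.dly` file is also fixed-width, with one record per station, month and element. The sketch below assumes the layout described in the GHCN-Daily readme: ID (1-11), YEAR (12-15), MONTH (16-17), ELEMENT (18-21), followed by 31 repeats of a 5-character VALUE and three 1-character flags, with -9999 marking a missing value. The function name `parse_dly_line` and the sample record are made up for illustration and should be checked against the readme.

```python
# Missing-value marker assumed from the GHCN-Daily readme
MISSING = -9999


def parse_dly_line(line):
    """Return (station_id, year, month, element, values) for one .dly record."""
    station_id = line[0:11]
    year = int(line[11:15])
    month = int(line[15:17])
    element = line[17:21]
    values = []
    for day in range(31):
        # Each day occupies 8 characters: 5 for the value, 3 for the flags
        start = 21 + day * 8
        raw = int(line[start:start + 5])
        values.append(None if raw == MISSING else raw)
    return station_id, year, month, element, values


# Synthetic record (not real data): two observations, then missing values
sample = "USW00014771" + "2020" + "01" + "TMAX" + "".join(
    f"{v:>5}   " for v in [11, -22] + [MISSING] * 29)

station_id, year, month, element, values = parse_dly_line(sample)
print(station_id, year, month, element, values[:3])
```

Applying `parse_dly_line` to each line of `test_dly.splitlines()` would give one tuple per station, month and element; the flags are ignored in this sketch.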

In [24]:
print(test_dly)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
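
Printing the whole file exceeds the notebook's output rate limit. One way around this is to print only the first few lines. A minimal sketch, where `preview` is a hypothetical helper and `sample_text` stands in for a large string such as `test_dly`:

```python
from itertools import islice


def preview(text, n=5):
    """Return the first n lines of text as a single string."""
    return "\n".join(islice(text.splitlines(), n))


# Stand-in for a large downloaded file
sample_text = "\n".join(f"line {i}" for i in range(1000))

print(preview(sample_text, 3))
```

In the notebook, `print(preview(test_dly))` would show the first five records of the station file.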

