## Step 10: Gather Source Datasets

Purpose: Download and stage required datasets into the project **source** geodatabase.  

Current list of resources:

- [CRWU_CREAT_Grid_Projections](https://services.arcgis.com/cJ9YHowT8TU7DUyn/ArcGIS/rest/services/CRWU_CREAT_Grid_Projections/FeatureServer/0) from EPA Geoplatform
- [CRWU_CREAT_Historic_Climate_Stations](https://services.arcgis.com/cJ9YHowT8TU7DUyn/ArcGIS/rest/services/CRWU_CREAT_Historic_Climate_Stations/FeatureServer/0) from EPA Geoplatform
- [COOP_STATIONS_TO_USE](https://github.com/barrc/get_ncei/blob/main/src/coop_stations_to_use.csv) from barrc GitHub
- [ISD_STATIONS_TO_USE](https://github.com/barrc/get_ncei/blob/main/src/isd_stations_to_use.csv) from barrc GitHub
- [Census States](https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/State_County/MapServer/0) from US Census Tigerweb


In [15]:
print("Executing Step 10: Gather Source Datasets");

import arcpy,os,http.client,json,requests;

# Verify or create the source file geodatabase
fgdb = os.getcwd() + os.sep + 'source.gdb';

if not arcpy.Exists(fgdb):
    print("  creating new source workspace");
    arcpy.CreateFileGDB_management(
         os.path.dirname(fgdb)
        ,os.path.basename(fgdb)
    );
else:
    print("  using existing source workspace");


Executing Step 10: Gather Source Datasets
  using existing source workspace


### 10.010: Scrape AGS 

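Reusable function that scrapes an ArcGIS REST service layer into a local file geodatabase.
It reads the layer's maxRecordCount, fetches the full list of object IDs, and then pulls features in batches bounded by object ID range.
Note that some online resources (notably Census) enforce additional download limits below the stated maxRecordCount.
Passing a smaller forcelimit value will usually work around this.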

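The batching arithmetic can be tried without arcpy or a network connection. The sketch below is a simplified stand-in for the loop inside `scrape_ags`, not the function itself: `oid_batches` is a hypothetical helper name, and the `max_record_count` of 2000 in the usage line is only an illustrative value.

```python
def oid_batches(oids, max_record_count, forcelimit=None):
    """Return one (low, high) OBJECTID pair per request.

    Hypothetical helper mirroring the batching idea in scrape_ags:
    the batch size is maxRecordCount unless a smaller forcelimit is given.
    """
    size = max_record_count
    if forcelimit is not None and forcelimit < size:
        size = forcelimit
    ordered = sorted(oids)
    return [
        # the last batch may be shorter than the batch size
        (ordered[i], ordered[min(i + size, len(ordered)) - 1])
        for i in range(0, len(ordered), size)
    ]

# 56 records with OBJECTIDs 1..56, capped at 5 records per request
batches = oid_batches(range(1, 57), max_record_count=2000, forcelimit=5)
print(len(batches), batches[0], batches[-1])  # 12 (1, 5) (56, 56)
```

Each pair maps to a where clause of the form `OBJECTID >= low AND OBJECTID <= high`. Bounding requests by sorted object ID rather than by offset keeps the batches correct even when the IDs have gaps.
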

In [16]:
def scrape_ags(host,path,fgdb,fc,forcelimit=None):
    
    if arcpy.Exists(fgdb + os.sep + fc):
        arcpy.Delete_management(fgdb + os.sep + fc);
        
    headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"};
    conn = http.client.HTTPSConnection(host);
    conn.request("POST",path,"f=json",headers);
    response = conn.getresponse();
    data = response.read();
    json_data = json.loads(data);
    if 'currentVersion' not in json_data:
        raise ValueError("Error, unable to query https://" + host + path);
    extraction_amount = json_data['maxRecordCount'];
    if forcelimit is not None and forcelimit < extraction_amount:
        extraction_amount = forcelimit;
    where = "1=1";
    params = "where={}&returnIdsOnly=true&returnGeometry=false&f=json".format(where);
    conn = http.client.HTTPSConnection(host);
    conn.request("POST",path + "/query",params,headers);
    response = conn.getresponse();
    data = response.read();
    json_data = json.loads(data);
    ary_oid   = sorted(json_data['objectIds']);
    oid_name  = json_data['objectIdFieldName'];
    oid_count = len(ary_oid);
    
    initial_hit = True;
    counter = 0;
    while counter <= oid_count - 1:
        if counter + extraction_amount > oid_count - 1:
            int_max = oid_count - 1;
        else:
            int_max = counter + extraction_amount - 1;
        where = oid_name + ' >= ' + str(ary_oid[counter]) + ' AND ' + oid_name + ' <= ' + str(ary_oid[int_max]);
        print("  pulling records where " + where);
        fields = "*";
        params = "where={}&outFields={}&returnGeometry=true&outSR=4269&f=json".format(where, fields);
        conn = http.client.HTTPSConnection(host);
        conn.request("POST",path + "/query",params,headers);
        response = conn.getresponse();
        data = response.read(); 
        json_data = json.loads(data);
        ef = arcpy.AsShape(json_data,True);
        if initial_hit:
            arcpy.management.CopyFeatures(ef,fgdb + os.sep + fc)
            initial_hit = False;
        else:
            arcpy.Append_management(ef,fgdb + os.sep + fc,"NO_TEST");
        counter += extraction_amount;
        
    conn.close(); 
    del conn;
    print("  Scrape complete.");
    return True;


### 10.020: Download CRWU_CREAT_Grid_Projections from EPA Geoplatform

In [17]:
%%time

host = "services.arcgis.com";
path = "/cJ9YHowT8TU7DUyn/ArcGIS/rest/services/CRWU_CREAT_Grid_Projections/FeatureServer/0";
fc   = "CRWU_CREAT_Grid_Projections";

if arcpy.Exists(fgdb + os.sep + fc):
    arcpy.Delete_management(fgdb + os.sep + fc);

z = scrape_ags(host,path,fgdb,fc);

z = arcpy.management.AddIndex(
     in_table   = fgdb + os.sep + fc
    ,fields     = 'CREAT_ID'
    ,index_name = 'CREAT_ID_IDX'
);

z = arcpy.management.AddIndex(
     in_table   = fgdb + os.sep + fc
    ,fields     = 'GRIDCODE'
    ,index_name = 'GRIDCODE_IDX'
);


  pulling records where OBJECTID >= 1 AND OBJECTID <= 2000
  pulling records where OBJECTID >= 2001 AND OBJECTID <= 4000
  pulling records where OBJECTID >= 4001 AND OBJECTID <= 6000
  pulling records where OBJECTID >= 6001 AND OBJECTID <= 8000
  pulling records where OBJECTID >= 8001 AND OBJECTID <= 10000
  pulling records where OBJECTID >= 10001 AND OBJECTID <= 12000
  pulling records where OBJECTID >= 12001 AND OBJECTID <= 14000
  pulling records where OBJECTID >= 14001 AND OBJECTID <= 16000
  pulling records where OBJECTID >= 16001 AND OBJECTID <= 18000
  pulling records where OBJECTID >= 18001 AND OBJECTID <= 20000
  pulling records where OBJECTID >= 20001 AND OBJECTID <= 22000
  pulling records where OBJECTID >= 22001 AND OBJECTID <= 24000
  pulling records where OBJECTID >= 24001 AND OBJECTID <= 24743
  Scrape complete.
Wall time: 5min 6s


### 10.030: Download CRWU_CREAT_Historic_Climate_Stations from EPA Geoplatform

In [18]:
%%time

host = "services.arcgis.com";
path = "/cJ9YHowT8TU7DUyn/ArcGIS/rest/services/CRWU_CREAT_Historic_Climate_Stations/FeatureServer/0";
fc   = "CRWU_CREAT_Historic_Climate_Stations";

if arcpy.Exists(fgdb + os.sep + fc):
    arcpy.Delete_management(fgdb + os.sep + fc);

z = scrape_ags(host,path,fgdb,fc);

z = arcpy.management.AddIndex(
     in_table   = fgdb + os.sep + fc
    ,fields     = 'CLIMATE_STATION_PK_ID'
    ,index_name = 'CLIMATE_STATION_PK_ID_IDX'
);

z = arcpy.management.AddIndex(
     in_table   = fgdb + os.sep + fc
    ,fields     = 'NOAA_STATION_ID'
    ,index_name = 'NOAA_STATION_ID_IDX'
);


  pulling records where OBJECTID >= 1 AND OBJECTID <= 2000
  pulling records where OBJECTID >= 2001 AND OBJECTID <= 4000
  pulling records where OBJECTID >= 4001 AND OBJECTID <= 6000
  pulling records where OBJECTID >= 6001 AND OBJECTID <= 8000
  pulling records where OBJECTID >= 8001 AND OBJECTID <= 10000
  pulling records where OBJECTID >= 10001 AND OBJECTID <= 11165
  Scrape complete.
Wall time: 11.8 s


### 10.040: Tab Downloaders

Reusable functions to download CSV or tab-delimited text and import it into the local file geodatabase as NAD83 points,
using optional custom field mappings.

In [19]:
def downloadtab(url,filename):
    if arcpy.Exists(filename):
        arcpy.Delete_management(filename);
    print("  downloading file");
    with open(filename,'wb') as f,requests.get(url,stream=True) as r:
        # Stop on HTTP errors rather than saving an error page as data
        r.raise_for_status();
        for line in r.iter_lines():
            f.write(line + '\n'.encode());
    return True;
    
def tab2fc(filename,fgdb,fc,longname,latname,field_mapping=None):
    
    if arcpy.Exists('memory' + os.sep + 'tempTable'):
        arcpy.Delete_management('memory' + os.sep + 'tempTable');
  
    print("  loading to table");
    arcpy.TableToTable_conversion(
         in_rows       = filename
        ,out_path      = 'memory'
        ,out_name      = 'tempTable'
        ,field_mapping = field_mapping
    );
    
    if arcpy.Exists(fgdb + os.sep + fc):
        arcpy.Delete_management(fgdb + os.sep + fc);
        
    print("  converting to NAD83 points");
    arcpy.management.XYTableToPoint(
         in_table          = 'memory' + os.sep + 'tempTable'
        ,out_feature_class = fgdb + os.sep + fc
        ,x_field           = longname
        ,y_field           = latname
        ,coordinate_system = arcpy.SpatialReference(4269)
    );
    
    arcpy.Delete_management('memory' + os.sep + 'tempTable');
    return True;

def fmtext(infc,fieldname,fieldlength):
    fm = arcpy.FieldMap();
    fm.addInputField(infc,fieldname);
    nf = fm.outputField;
    nf.type = 'Text';
    nf.length = fieldlength;
    fm.outputField = nf;
    return fm;

def fmint(infc,fieldname):
    fm = arcpy.FieldMap();
    fm.addInputField(infc,fieldname);
    nf = fm.outputField;
    nf.type = 'Integer';
    fm.outputField = nf;
    return fm;

def fmdouble(infc,fieldname):
    fm = arcpy.FieldMap();
    fm.addInputField(infc,fieldname);
    nf = fm.outputField;
    nf.type = 'Double';
    fm.outputField = nf;
    return fm;
    

### 10.050: Download COOP_STATIONS_TO_USE dataset from barrc GitHub repository

In [20]:
%%time 

url = "https://raw.githubusercontent.com/barrc/get_ncei/master/src/coop_stations_to_use.csv"
fc  = 'COOP_STATIONS_TO_USE';

tmptab = arcpy.env.scratchFolder + os.sep + 'tempTable.csv';
z = downloadtab(url,tmptab);
  
fms = arcpy.FieldMappings();
fms.addFieldMap(fmtext  (tmptab,'station_id',255));
fms.addFieldMap(fmtext  (tmptab,'station_name',255));
fms.addFieldMap(fmtext  (tmptab,'state',255));
fms.addFieldMap(fmtext  (tmptab,'start_date',255));
fms.addFieldMap(fmtext  (tmptab,'end_date',255));
fms.addFieldMap(fmdouble(tmptab,'latitude'));
fms.addFieldMap(fmdouble(tmptab,'longitude'));
fms.addFieldMap(fmtext  (tmptab,'in_basins',255));
fms.addFieldMap(fmtext  (tmptab,'break_with_basins',255));
fms.addFieldMap(fmtext  (tmptab,'network',255));
fms.addFieldMap(fmtext  (tmptab,'start_date_to_use',255));
fms.addFieldMap(fmtext  (tmptab,'end_date_to_use',255));

z = tab2fc(tmptab,fgdb,fc,'longitude','latitude',fms);

print("  add quotes to start and end fields");
cb_cleanDate = """
def cleanDate(pin):
    (mm,dd,yyyy) = pin.split('/');
    if mm in ['1','2','3','4','5','6','7','8','9']:
        mm = '0' + mm;
    if dd in ['1','2','3','4','5','6','7','8','9']:
        dd = '0' + dd;
    return "'" + yyyy + "/" + mm + "/" + dd + "'";
    
""";

z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'start_date_clean'
    ,field_type   = 'Text'
    ,field_length = 255
    ,field_alias  = 'start_date_clean'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'start_date_clean'
    ,expression      = "cleanDate(!start_date!)"
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_cleanDate
);

z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'end_date_clean'
    ,field_type   = 'Text'
    ,field_length = 255
    ,field_alias  = 'end_date_clean'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'end_date_clean'
    ,expression      = "cleanDate(!end_date!)"
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_cleanDate
);

z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'start_date_to_use_clean'
    ,field_type   = 'Text'
    ,field_length = 255
    ,field_alias  = 'start_date_to_use_clean'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'start_date_to_use_clean'
    ,expression      = "cleanDate(!start_date_to_use!)"
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_cleanDate
);

z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'end_date_to_use_clean'
    ,field_type   = 'Text'
    ,field_length = 255
    ,field_alias  = 'end_date_to_use_clean'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'end_date_to_use_clean'
    ,expression      = "cleanDate(!end_date_to_use!)"
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_cleanDate
);

print("  calculating year count");
cb_yearCount = """
import datetime;
def yearCount(pstart,pend):
    d1 = datetime.datetime.strptime(pstart,"%m/%d/%Y");
    d2 = datetime.datetime.strptime(pend  ,"%m/%d/%Y");
    yr = round((d2 - d1).days / 365);
    return yr + 0.0;
    
""";

z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'year_count'
    ,field_type   = 'Double'
    ,field_alias  = 'year_count'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'year_count'
    ,expression      = 'yearCount(!start_date_to_use!,!end_date_to_use!)'
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_yearCount
);

print("  adding indexes");
z = arcpy.management.AddIndex(
     in_table      = fgdb + os.sep + fc
    ,fields        = 'station_id'
    ,index_name    = 'station_id_IDX'
);


  downloading file
  loading to table
  converting to NAD83 points
  add quotes to start and end fields
  calculating year count
  adding indexes
Wall time: 11.1 s

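The two code-block helpers passed to CalculateField above can be checked outside arcpy. The sketch below restates them as plain Python functions under two assumptions: the source dates arrive as `m/d/yyyy` text, and `str.zfill` is swapped in for the explicit single-digit checks. The names `clean_date` and `year_count` and the sample dates are illustrative and are not taken from the dataset.

```python
import datetime

def clean_date(value):
    """Turn 'm/d/yyyy' into a quoted, zero-padded 'yyyy/mm/dd' string."""
    mm, dd, yyyy = value.split("/")
    return "'" + yyyy + "/" + mm.zfill(2) + "/" + dd.zfill(2) + "'"

def year_count(start, end):
    """Whole years between two 'm/d/yyyy' dates, rounded, as a float."""
    d1 = datetime.datetime.strptime(start, "%m/%d/%Y")
    d2 = datetime.datetime.strptime(end, "%m/%d/%Y")
    return float(round((d2 - d1).days / 365))

print(clean_date("1/5/1970"))                # '1970/01/05'
print(year_count("1/1/1970", "12/31/2006"))  # 37.0
```

Dividing by 365 ignores leap days, so the result is an approximation that is then rounded to the nearest whole year.
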

### 10.060: Download ISD_STATIONS_TO_USE dataset from barrc GitHub repository

In [21]:
%%time

url = "https://raw.githubusercontent.com/barrc/get_ncei/master/src/isd_stations_to_use.csv"
fc  = 'ISD_STATIONS_TO_USE';

tmptab = arcpy.env.scratchFolder + os.sep + 'tempTable.csv';
z = downloadtab(url,tmptab);

fms = arcpy.FieldMappings();
fms.addFieldMap(fmtext  (tmptab,'station_id',255));
fms.addFieldMap(fmtext  (tmptab,'station_name',255));
fms.addFieldMap(fmtext  (tmptab,'state',255));
fms.addFieldMap(fmtext  (tmptab,'start_date',255));
fms.addFieldMap(fmtext  (tmptab,'end_date',255));
fms.addFieldMap(fmdouble(tmptab,'latitude'));
fms.addFieldMap(fmdouble(tmptab,'longitude'));
fms.addFieldMap(fmtext  (tmptab,'in_basins',255));
fms.addFieldMap(fmtext  (tmptab,'break_with_basins',255));
fms.addFieldMap(fmtext  (tmptab,'network',255));
    
z = tab2fc(tmptab,fgdb,fc,'longitude','latitude',fms);

print("  add quotes to start and end fields");
z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'start_date_clean'
    ,field_type   = 'Text'
    ,field_length = 255
    ,field_alias  = 'start_date_clean'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'start_date_clean'
    ,expression      = "cleanDate(!start_date!)"
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_cleanDate
);

z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'end_date_clean'
    ,field_type   = 'Text'
    ,field_length = 255
    ,field_alias  = 'end_date_clean'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'end_date_clean'
    ,expression      = "cleanDate(!end_date!)"
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_cleanDate
);

print("  calculating year count");
z = arcpy.management.AddField(
     in_table     = fgdb + os.sep + fc
    ,field_name   = 'year_count'
    ,field_type   = 'Double'
    ,field_alias  = 'year_count'
);

z = arcpy.management.CalculateField(
     in_table        = fgdb + os.sep + fc
    ,field           = 'year_count'
    ,expression      = 'yearCount(!start_date!,!end_date!)'
    ,expression_type = 'PYTHON3'
    ,code_block      = cb_yearCount
);

print("  adding indexes");
z = arcpy.management.AddIndex(
     in_table   = fgdb + os.sep + fc
    ,fields     = 'station_id'
    ,index_name = 'station_id_IDX'
);


  downloading file
  loading to table
  converting to NAD83 points
  add quotes to start and end fields
  calculating year count
  adding indexes
Wall time: 7.67 s


### 10.070: Download US Census Tigerweb 2020 State Coverage

In [22]:
%%time

# Note: Tigerweb will time out if all state-equivalent records are requested in one go.
# Setting the forcelimit value to 5 records per request works around the problem.

host = "tigerweb.geo.census.gov";
path = "/arcgis/rest/services/TIGERweb/State_County/MapServer/0";
fc   = "census_states";

if arcpy.Exists(fgdb + os.sep + fc):
    arcpy.Delete_management(fgdb + os.sep + fc);

z = scrape_ags(host,path,fgdb,fc,5);

z = arcpy.management.AddIndex(
     in_table   = fgdb + os.sep + fc
    ,fields     = 'GEOID'
    ,index_name = 'GEOID_IDX'
);


  pulling records where OBJECTID >= 1 AND OBJECTID <= 5
  pulling records where OBJECTID >= 6 AND OBJECTID <= 10
  pulling records where OBJECTID >= 11 AND OBJECTID <= 15
  pulling records where OBJECTID >= 16 AND OBJECTID <= 20
  pulling records where OBJECTID >= 21 AND OBJECTID <= 25
  pulling records where OBJECTID >= 26 AND OBJECTID <= 30
  pulling records where OBJECTID >= 31 AND OBJECTID <= 35
  pulling records where OBJECTID >= 36 AND OBJECTID <= 40
  pulling records where OBJECTID >= 41 AND OBJECTID <= 45
  pulling records where OBJECTID >= 46 AND OBJECTID <= 50
  pulling records where OBJECTID >= 51 AND OBJECTID <= 55
  pulling records where OBJECTID >= 56 AND OBJECTID <= 56
  Scrape complete.
Wall time: 1min 8s


### 10.080: Review results

In [23]:
grid = fgdb + os.sep + 'CRWU_CREAT_Grid_Projections';
grid_cnt = arcpy.GetCount_management(grid)[0];
hist = fgdb + os.sep + 'CRWU_CREAT_Historic_Climate_Stations';
hist_cnt = arcpy.GetCount_management(hist)[0];
coop = fgdb + os.sep + 'COOP_STATIONS_TO_USE';
coop_cnt = arcpy.GetCount_management(coop)[0];
isd  = fgdb + os.sep + 'ISD_STATIONS_TO_USE';
isd_cnt = arcpy.GetCount_management(isd)[0];
states  = fgdb + os.sep + 'census_states';
states_cnt = arcpy.GetCount_management(states)[0];

print("  Grid Projections : " + str(grid_cnt));
print("  Historic Stations: " + str(hist_cnt));
print("  COOP Stations    : " + str(coop_cnt));
print("  ISD Stations     : " + str(isd_cnt));
print("  Tigerweb States  : " + str(states_cnt));
print(" ");


  Grid Projections : 24743
  Historic Stations: 11165
  COOP Stations    : 1851
  ISD Stations     : 3293
  Tigerweb States  : 56
 
