### Notebook to download data from the Argo DAC

Here we use code written by Ethan Campbell (https://github.com/ethan-campbell/Weddell_polynya_paper) to download files from the Argo DAC that match specific search criteria. 
As an example, below we download all files between 40°S and 70°S. 

In [1]:
# Add the local sogos repository to the module search path so that `import sogos` works
# (index 0 is the script path, or '' in a REPL, so insert at 1)
import sys
sys.path.insert(1, '/Users/dhruvbalwada/work_root/sogos/')

In [2]:
import os
from numpy import *
import pandas as pd
import xarray as xr
import matplotlib as mpl
import matplotlib.pyplot as plt

In [3]:
import sogos.download_product as dlp

In [4]:
# Where the data should be saved

raw_data_dir = '/Users/dhruvbalwada/work_root/sogos/data/raw/'
argo_gdac_dir = raw_data_dir + 'Argo/'

The call below downloads the index of synthetic profiles on the DAC, then downloads all matching files, and is then supposed to create a local index of the downloaded files for future comparison. However, right now that last step is buggy and the local index creation fails. 
The function has an internal check that prevents an existing profile from being overwritten (enabled by setting `overwrite_profs=False` below). 
This is a bit hacky, but it works. It can be quite slow, but it only really needs to be run once (which is also why I didn't bother to debug it). 
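
To make the overwrite check concrete, here is a minimal sketch of a "skip if already on disk" download. It is not the code inside `dlp.argo_gdac`; the function name `download_if_missing` is hypothetical, and `ftp` is assumed to be an already connected `ftplib.FTP` instance (or any object with a compatible `retrbinary` method).

```python
import os

def download_if_missing(ftp, remote_path, filename, local_dir, overwrite=False):
    """Fetch `filename` from `remote_path` unless a local copy already exists.

    Returns True if the file was downloaded and False if it was skipped.
    `ftp` is assumed to be a connected ftplib.FTP instance (assumption).
    """
    local_file = os.path.join(local_dir, filename)
    if os.path.exists(local_file) and not overwrite:
        return False  # leave the current version untouched
    os.makedirs(local_dir, exist_ok=True)
    with open(local_file, 'wb') as fh:
        # retrbinary calls fh.write once per received block
        ftp.retrbinary('RETR ' + remote_path + filename, fh.write)
    return True

# Possible usage (not run here):
# from ftplib import FTP
# ftp = FTP('ftp.ifremer.fr'); ftp.login()
# download_if_missing(ftp, '/ifremer/argo/', 'ar_index_global_prof.txt', 'Meta/')
```

Because the check only looks at whether the file exists, rerunning the notebook after an interruption should skip everything that was already fetched.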

In [8]:
# set function parameters
start_date = (2019, 12, 31)
end_date   = (2021, 12, 31)
lat_range  = [-70, -40]
lon_range  = [-180, 180]
save_to_root = argo_gdac_dir
overwrite_global_index = False
overwrite_profs = False
bypass_download = False
only_download_wmoids = []  # WMO ids of specific floats; an empty list applies no float filter

# Download all matching synthetic ('S') profiles
dlp.argo_gdac(start_date, end_date,
              lat_range, lon_range,
              save_to_root,
              overwrite_global_index=overwrite_global_index,
              overwrite_profs=overwrite_profs,
              only_download_wmoids=only_download_wmoids,
              argo_data_type='S')

Downloading index of floats
>>> File argo_synthetic-profile_index.txt already exists. Leaving current version.
>>> Number of Argo profiles on GDAC meeting criteria =  5608
>>> 2021-05-25 07:53:05.287855 - action is 0.0% complete
>>> 2021-05-25 07:53:44.414919 - action is 10.0% complete
>>> 2021-05-25 07:54:20.194963 - action is 20.0% complete
>>> 2021-05-25 07:54:55.929944 - action is 30.0% complete
>>> 2021-05-25 07:55:31.826936 - action is 40.0% complete
>>> 2021-05-25 07:56:07.452040 - action is 50.0% complete
>>> 2021-05-25 07:56:42.846686 - action is 60.0% complete
>>> 2021-05-25 07:57:18.796045 - action is 70.0% complete


Downloading: SR6902905_288.nc  88% [#####################################     ]

>>> 2021-05-25 08:30:21.140068 - action is 80.0% complete


Downloading: SR6903777_043.nc  90% [######################################    ]

>>> 2021-05-25 09:09:04.459244 - action is 90.0% complete


Downloading: SR7900881_035.nc  69% [#############################             ]

<scipy.io.netcdf.netcdf_file object at 0x7ff20f1d08d0>


KeyError: 'DATA_MODE'
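
The run above ends with `KeyError: 'DATA_MODE'`, which suggests that the file being read had no variable with that name. The cause was not tracked down here, but one possible guard is a lookup that tries several names and falls back to a default. In the sketch below, `get_first_available` is a hypothetical helper, `variables` stands for any dict-like mapping such as the `variables` attribute of a netCDF file object, and the fallback name `PARAMETER_DATA_MODE` is an assumption about what the synthetic files might carry instead.

```python
def get_first_available(variables, names, default=None):
    """Return the first entry of `variables` found under any of `names`.

    `variables` is any mapping (e.g. a netCDF file's `variables` dict).
    Returns `default` when none of the names is present.
    """
    for name in names:
        if name in variables:
            return variables[name]
    return default

# Illustration with a plain dict standing in for a netCDF file
fake_variables = {'PARAMETER_DATA_MODE': 'R', 'PRES': [5.0, 10.0]}
mode = get_first_available(fake_variables, ['DATA_MODE', 'PARAMETER_DATA_MODE'])
```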

## Debugging the `argo_gdac` function

In [48]:
# NOTE: change variable names to something more readable 
save_to_meta = save_to_root + 'Meta/'
save_to_profiles = save_to_root + 'Profiles/'
url_root = '/ifremer/argo/' 
ftp_root = 'ftp.ifremer.fr'
global_index_filename = 'ar_index_global_prof.txt'
local_index_filename = 'ar_index_local_prof.txt'  # index of locally downloaded profiles
url_profiles_root = url_root + 'dac/'

In [8]:
# Download data for our float
dlp.argo_gdac(start_date, end_date,
              lat_range,lon_range,
              argo_gdac_dir, 
              only_download_wmoids)

Downloading index of floats
>>> File ar_index_global_prof.txt already exists. Overwriting with new version.


KeyboardInterrupt: 

In [13]:
# `df` is the download helper module used by dlp (its import is not shown in this notebook)
df.single_file(url_root, global_index_filename,
               save_to_meta, ftp_root=ftp_root,
               overwrite=overwrite_global_index, verbose=True)

>>> File ar_index_global_prof.txt already exists. Leaving current version.


In [14]:
# so single file download seems to be working, and added progressbar 

In [15]:
data_frame = pd.read_csv(save_to_meta + 
                             global_index_filename,header=8,
                             low_memory=False)

In [16]:
data_frame

Unnamed: 0,file,date,latitude,longitude,ocean,profiler_type,institution,date_update
0,aoml/13857/profiles/R13857_001.nc,1.997073e+13,0.267,-16.032,A,845,AO,20181011180520
1,aoml/13857/profiles/R13857_002.nc,1.997081e+13,0.072,-17.659,A,845,AO,20181011180521
2,aoml/13857/profiles/R13857_003.nc,1.997082e+13,0.543,-19.622,A,845,AO,20181011180521
3,aoml/13857/profiles/R13857_004.nc,1.997083e+13,1.256,-20.521,A,845,AO,20181011180521
4,aoml/13857/profiles/R13857_005.nc,1.997091e+13,0.720,-20.768,A,845,AO,20181011180521
...,...,...,...,...,...,...,...,...
2189111,nmdis/2901633/profiles/R2901633_067.nc,2.013050e+13,27.462,139.107,P,841,NM,20130507103443
2189112,nmdis/2901633/profiles/R2901633_068.nc,2.013051e+13,27.432,138.840,P,841,NM,20130511165723
2189113,nmdis/2901633/profiles/R2901633_069.nc,2.013052e+13,27.692,138.677,P,841,NM,20130521170139
2189114,nmdis/2901633/profiles/R2901633_070.nc,2.013053e+13,27.895,138.465,P,841,NM,20130531181516
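
The `date` column above is read as a float (`1.997073e+13`), which is hard to read and awkward to compare. As a sketch of an alternative, the index can be read with the timestamp columns kept as strings and then parsed into datetimes. The two data rows below are copied from the output above; the comment lines are placeholders rather than the real file header, and using `comment='#'` instead of `header=8` assumes that every preamble line starts with `#`.

```python
import io
import pandas as pd

# Tiny stand-in for the index file; the '#' lines are placeholders
sample_index = """\
# placeholder comment line 1
# placeholder comment line 2
file,date,latitude,longitude,ocean,profiler_type,institution,date_update
aoml/13857/profiles/R13857_001.nc,19970729200300,0.267,-16.032,A,845,AO,20181011180520
aoml/13857/profiles/R13857_002.nc,19970809192112,0.072,-17.659,A,845,AO,20181011180521
"""

index = pd.read_csv(io.StringIO(sample_index), comment='#',
                    dtype={'date': str, 'date_update': str})

# Parse the 14-digit timestamps; unparseable or missing entries become NaT
index['date'] = pd.to_datetime(index['date'], format='%Y%m%d%H%M%S', errors='coerce')
```

With datetimes in place, missing timestamps show up as `NaT` and date ranges can be compared directly.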


In [17]:
global_profile_list = data_frame.values

In [18]:
global_profile_list.shape

(2189116, 8)

In [19]:
num_profs = len(global_profile_list)
prof_matches = zeros(num_profs, dtype=bool)

In [20]:
import re

In [21]:
# raw string with an escaped dot, so that '.nc' only matches a literal '.nc'
float_number_regexp = re.compile(r'[a-z]*/[0-9]*/profiles/[A-Z]*([0-9]*)_[0-9]*[A-Z]*\.nc')

In [22]:
global_profile_list

array([['aoml/13857/profiles/R13857_001.nc', 19970729200300.0, 0.267,
        ..., 845, 'AO', 20181011180520],
       ['aoml/13857/profiles/R13857_002.nc', 19970809192112.0,
        0.07200000000000001, ..., 845, 'AO', 20181011180521],
       ['aoml/13857/profiles/R13857_003.nc', 19970820184545.0,
        0.5429999999999999, ..., 845, 'AO', 20181011180521],
       ...,
       ['nmdis/2901633/profiles/R2901633_069.nc', 20130521042631.0,
        27.691999999999997, ..., 841, 'NM', 20130521170139],
       ['nmdis/2901633/profiles/R2901633_070.nc', 20130531044525.0,
        27.895, ..., 841, 'NM', 20130531181516],
       ['nmdis/2901633/profiles/R2901633_071.nc', 20130610043319.0,
        27.930999999999997, ..., 841, 'NM', 20130617181801]], dtype=object)

In [23]:
last_valid_position_float = int(float_number_regexp.findall(global_profile_list[0,0])[0])

In [24]:
float_number_regexp.findall(global_profile_list[0,0])

['13857']
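
The same path carries the DAC name, the profile mode prefix, and the cycle number as well. As a sketch, a single pattern with named groups can pull all of them out at once. The names `ARGO_PATH` and `parse_argo_path` are hypothetical, and the pattern assumes paths shaped like the ones in the index above; here the WMO id is taken from the directory part of the path rather than from the file name.

```python
import re

# dac/wmoid/profiles/<mode><wmoid>_<cycle><optional suffix>.nc
ARGO_PATH = re.compile(
    r'(?P<dac>[a-z]+)/(?P<wmoid>[0-9]+)/profiles/'
    r'(?P<mode>[A-Z]+)[0-9]+_(?P<cycle>[0-9]+)(?P<suffix>[A-Z]*)\.nc$'
)

def parse_argo_path(path):
    """Split an index path into its parts; return None if it does not match."""
    m = ARGO_PATH.match(path)
    if m is None:
        return None
    info = m.groupdict()
    info['wmoid'] = int(info['wmoid'])
    info['cycle'] = int(info['cycle'])
    return info

parts = parse_argo_path('aoml/13857/profiles/R13857_001.nc')
```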

In [25]:
[global_profile_list[0,2],global_profile_list[0,3]]

[0.267, -16.032]

In [29]:
# `tt` and `gt` are the time and geo helper modules used by dlp (their imports are not shown here)
MISSING_POSITION_VALUES = (99999.0, -999.0)  # fill values used for missing lat/lon

for n in range(num_profs):
    current_float = int(float_number_regexp.findall(global_profile_list[n,0])[0])
    lat, lon = global_profile_list[n,2], global_profile_list[n,3]

    # accommodate profiles with missing lat/lon data
    if lat in MISSING_POSITION_VALUES or lon in MISSING_POSITION_VALUES:
        if current_float == last_valid_position_float:
            # reuse the last valid position reported by the same float
            assumed_prof_position = last_valid_position
        else:
            # no prior valid position for this float: leave prof_matches[n] = False
            # (the original code raised an AssertionError here)
            continue
    else:
        assumed_prof_position = [lat, lon]
        last_valid_position = assumed_prof_position
        last_valid_position_float = current_float

    # skip profiles with missing timestamps (leave prof_matches[n] = False)
    if isnan(global_profile_list[n,1]):
        continue

    # profile has a valid position and timestamp, so check it against the search criteria
    if tt.is_time_in_range(start_date,end_date,tt.convert_14_to_tuple(global_profile_list[n,1])):
        if gt.geo_in_range(assumed_prof_position[0],assumed_prof_position[1],lat_range,lon_range):
            prof_matches[n] = True
print('>>> Number of Argo profiles on GDAC meeting criteria = ',sum(prof_matches))

>>> Number of Argo profiles on GDAC meeting criteria =  9172
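
Looping over roughly two million rows in Python is slow. As a rough sketch of the same idea in vectorized form, the matching could be done with pandas. This is not the code in `dlp.argo_gdac`: `match_profiles` is a hypothetical function, it expects a table with `wmoid`, `date` (already parsed to datetimes), `latitude` and `longitude` columns, and it uses plain min/max bounds, which may differ from what `gt.geo_in_range` does (for example for longitude ranges that wrap around).

```python
import numpy as np
import pandas as pd

def match_profiles(index, start, end, lat_range, lon_range):
    """Boolean mask of index rows that fall inside the time/space window."""
    pos = index[['latitude', 'longitude']].astype(float)
    # treat the fill values as missing
    pos = pos.replace([99999.0, -999.0], np.nan)
    # if either coordinate is missing, the whole position is unusable
    pos.loc[pos.isna().any(axis=1), :] = np.nan
    # carry the last valid position forward within each float
    pos = pos.groupby(index['wmoid']).ffill()
    in_time = index['date'].between(start, end)  # NaT compares as False
    in_lat = pos['latitude'].between(lat_range[0], lat_range[1])
    in_lon = pos['longitude'].between(lon_range[0], lon_range[1])
    return (in_time & in_lat & in_lon).to_numpy()

# Made-up rows, for illustration only
demo = pd.DataFrame({
    'wmoid': [1, 1, 1, 2],
    'date': pd.to_datetime(['2020-01-01', '2020-02-01', None, '2020-03-01']),
    'latitude': [-50.0, 99999.0, -50.0, 99999.0],
    'longitude': [10.0, 99999.0, 10.0, 10.0],
})
mask = match_profiles(demo, pd.Timestamp('2019-12-31'), pd.Timestamp('2021-12-31'),
                      [-70, -40], [-180, 180])
```

In the made-up table, the second row borrows the position of the first, the third row is dropped for its missing timestamp, and the fourth is dropped because its float has no earlier valid position.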


In [31]:
matching_profs = where(prof_matches)[0]
local_profile_list = global_profile_list[matching_profs,:]
num_profs = len(local_profile_list)

In [32]:
num_profs

9172

In [50]:
# download necessary profiles to local
if not bypass_download:
    if len(only_download_wmoids) != 0:
        only_download_wmoids = [str(selected_wmoid) for selected_wmoid in only_download_wmoids]
        trim_local_profile_list_indices = []
        starting_dir = os.getcwd()
        os.chdir(save_to_profiles)
        existing_prof_files = os.listdir()
    prof_file_regexp = re.compile(r'[a-z]*/[0-9]*/profiles/([A-Z]*[0-9]*_[0-9]*[A-Z]*\.nc)')
    prof_path_regexp = re.compile(r'([a-z]*/[0-9]*/profiles/)[A-Z]*[0-9]*_[0-9]*[A-Z]*\.nc')
    for i, global_prof_index in enumerate(matching_profs):
        prof_file = prof_file_regexp.findall(global_profile_list[global_prof_index,0])[0]
        prof_path = prof_path_regexp.findall(global_profile_list[global_prof_index,0])[0]
        if len(only_download_wmoids) != 0:
            if all([selected_wmoid not in prof_file for selected_wmoid in only_download_wmoids]):
                # not a requested float: skip the download, but keep it in the local list if already on disk
                if prof_file in existing_prof_files:
                    trim_local_profile_list_indices.append(i)
                continue
            print('dlp.argo_gdac() is downloading ' + prof_file)
            trim_local_profile_list_indices.append(i)
        df.single_file(url_profiles_root + prof_path,prof_file,save_to_profiles,
                       ftp_root=ftp_root,overwrite=overwrite_profs,verbose=False)
        df.how_far(i,matching_profs,0.01)
    if len(only_download_wmoids) != 0:
        matching_profs = matching_profs[trim_local_profile_list_indices]
        local_profile_list = local_profile_list[trim_local_profile_list_indices,:]
        num_profs = len(local_profile_list)
        os.chdir(starting_dir)

dlp.argo_gdac() is downloading R5906030_001.nc


Downloading: R5906030_001.nc  71% [##############################             ]

[... the same two lines of output repeat for R5906030_002.nc through R5906030_025.nc ...]

dlp.argo_gdac() is downloading R5906030_026.nc


Downloading: R5906030_026.nc  92% [#######################################    ]
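
One thing to watch in the cell above: floats are selected with a substring test (`selected_wmoid not in prof_file`), so an ID that happens to be a substring of another float's ID, or of a cycle number, would also match. A sketch of an exact comparison is given below; `FILE_WMOID` and `belongs_to` are hypothetical names, and the pattern assumes file names of the form seen in the output above.

```python
import re

# leading mode letters, then the WMO id, then an underscore
FILE_WMOID = re.compile(r'^[A-Z]+([0-9]+)_')

def belongs_to(prof_file, wanted_wmoids):
    """True if the WMO id embedded in `prof_file` is exactly one of `wanted_wmoids`."""
    m = FILE_WMOID.match(prof_file)
    return m is not None and int(m.group(1)) in wanted_wmoids

exact = belongs_to('R5906030_001.nc', {5906030})
partial = belongs_to('R5906030_001.nc', {590603})
substring_hit = '590603' in 'R5906030_001.nc'
```

Here `exact` is `True` and `partial` is `False`, whereas the substring test (`substring_hit`) accepts the partial ID.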

In [52]:
# re-process local profile index
# raw strings with an escaped dot, so that '.nc' only matches a literal '.nc'
float_wmoid_regexp = re.compile(r'[a-z]*/([0-9]*)/profiles/[A-Z]*[0-9]*_[0-9]*[A-Z]*\.nc')
float_profile_filename_regexp = re.compile(r'[a-z]*/[0-9]*/profiles/([A-Z]*[0-9]*_[0-9]*[A-Z]*\.nc)')
float_profile_mode_regexp = re.compile(r'[a-z]*/[0-9]*/profiles/([A-Z]*)[0-9]*_[0-9]*[A-Z]*\.nc')
float_profile_num_regexp = re.compile(r'[a-z]*/[0-9]*/profiles/[A-Z]*[0-9]*_([0-9]*)[A-Z]*\.nc')
float_wmoids = [int(float_wmoid_regexp.findall(local_profile_list[n,0])[0]) for n in range(num_profs)]
float_profile_filenames = [float_profile_filename_regexp.findall(local_profile_list[n,0])[0] 
                           for n in range(num_profs)]
float_profile_modes = [float_profile_mode_regexp.findall(local_profile_list[n,0])[0] 
                       for n in range(num_profs)]
float_profile_nums = [int(float_profile_num_regexp.findall(local_profile_list[n, 0])[0]) 
                      for n in range(num_profs)]
float_position_flags = [0 for n in range(num_profs)]
local_profile_list = hstack((vstack(float_wmoids),vstack(float_profile_filenames),
                             vstack(float_profile_modes),
                             vstack(float_position_flags),local_profile_list))

In [53]:
# sort profile index by WMOid + profile number (e.g. 7900093 is completely out of order)
sort_param = array(float_wmoids) + array(float_profile_nums) / 10000
local_profile_list = local_profile_list[argsort(sort_param)]

RecursionError: maximum recursion depth exceeded while calling a Python object
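
The cause of the `RecursionError` above was not investigated. Independently of that, the two-key sort can be written without folding the profile number into a decimal fraction, for instance with `numpy.lexsort`. The sketch below uses made-up WMO ids and profile numbers and is not the code in `dlp.argo_gdac`.

```python
import numpy as np

# Made-up values, for illustration only
wmoids = np.array([7900093, 5906030, 5906030])
profile_nums = np.array([1, 2, 1])

# lexsort treats the LAST key as the primary one:
# sort by WMO id first, then by profile number within each float
order = np.lexsort((profile_nums, wmoids))
sorted_pairs = list(zip(wmoids[order].tolist(), profile_nums[order].tolist()))
```

Unlike `wmoid + profile_num / 10000`, this keeps both keys as integers, so the ordering stays correct even if a profile number reaches 10000 or more.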