# Tutorial 8: Remote access of public data

In most of the previous tutorials, we've downloaded data by hand from various repositories by going to the website, finding the data, downloading, and then opening it with a Python package. This is a good process for one file, and it is a helpful step when first exploring data (e.g. do you want to download more files? does the data have the information you need?), but it can be a very tedious process if you have to download more than 2 or 3 files. In this tutorial, we will explore a few ways to open data directly using Python packages without downloading the data by hand. 

By the end of this tutorial, you will be able to:
* find and recognize open-access data
* simplify data access and processing
* create complex loops and functions

In [None]:
# import all your packages - we'll talk about each as we use them
import time
import datetime
import pandas as pd
import numpy as np
import xarray as xr
from urllib.request import urlopen

### What is open-access?

Open-access is a practice in research of sharing data, code, publications, and other research materials online for free. Often it refers to journal articles, but open data (also called open source data) is an important part of open-access and of making research accessible to the public. Open data is available for anyone to access, modify, and share. Government agencies often make their data available to the public for free, for example the satellite data on EarthData from Tutorials 6 and 7 and the EPA AQS data from Tutorial 5. Open data should also be free of barriers like email request requirements, so it is often stored in repositories available remotely over the internet. These repositories make it possible to access the data from anywhere using a URL. In this tutorial, we will rely on these URLs to open and read the data directly from the repository, rather than first downloading a file to a personal computer. 

There are a number of repositories that store data. We've already mentioned EarthData and the EPA AQS website. Often, you can find the repositories for specific data by spending some quality time with a search engine. In this tutorial, we will explore 2D and 3D data that can be opened with different Python packages to help show the range of methods available. First, we will look at two methods to open and read 2D data. Then, we will look at using Xarray for larger dimensional datasets. 

### 2D Method 1: urllib

Let's say you're working on a research project studying air quality in Des Moines, IA. You want to see how weather conditions, like temperature and relative humidity, affect local air quality. To do this, you need to find a local weather station and download some data. After Googling, you find this website: https://mesonet.agron.iastate.edu/request/download.phtml. 

The website shows all the Mesonet weather stations in Iowa, and you can select one and download that data. Let's try that now. On the Iowa State website do the following:
1) Select a random station (it doesn't matter which).   
2) Select some random weather conditions (it doesn't matter which).   
3) Pick a date range (any will do).   
4) Leave steps 4-6 alone.   
5) Click Get Data. What happens?

Rather than downloading a file, the Get Data button opened a new browser window. This window has all the data listed out. If you examine the URL closely, you'll see all the specifications of the data are in the URL. Here is an example (scroll to see all of it):

https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=AIO&data=tmpc&data=dwpf&year1=2022&month1=1&day1=1&year2=2022&month2=9&day2=20&tz=Etc%2FUTC&format=onlycomma&latlon=no&elev=no&missing=M&trace=T&direct=no&report_type=3&report_type=4

The first part of the URL, through `asos.py?`, is the base of the data request. After that, each of the parameters is specified. For example, the station is AIO. The data selected are tmpc (temperature in ˚C) and dwpf (dew point in ˚F). The start and end dates are listed, and the formatting options say to use commas as the separators (onlycomma) and to leave out latitude, longitude, and elevation data.
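
To see this structure without reading the URL by eye, the standard library can split it for us. The sketch below uses `urlsplit` and `parse_qs` from `urllib.parse` on the example URL above; the variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# the example request URL from above, split across lines for readability
example_url = (
    'https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?'
    'station=AIO&data=tmpc&data=dwpf&year1=2022&month1=1&day1=1'
    '&year2=2022&month2=9&day2=20&tz=Etc%2FUTC&format=onlycomma'
    '&latlon=no&elev=no&missing=M&trace=T&direct=no'
    '&report_type=3&report_type=4'
)

parts = urlsplit(example_url)                              # separate the base from the query
base = parts.scheme + '://' + parts.netloc + parts.path   # everything before the '?'
params = parse_qs(parts.query)                             # dictionary of parameter -> list of values

print(base)
for key, values in params.items():
    print(key, values)
```

Notice that `data` and `report_type` each hold two values, because those keys appear twice in the URL, and that `%2F` is decoded to `/` in the time zone.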

We can use this URL to design a function that will make a URL for the specific data we want, and then use the Python package urllib to open the data from these URLs.

In [None]:
# function to build URL
def make_url(station,data,start,end,tz,sep,latlon,elev):
    base = 'http://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?'
    stat = 'station='+station
    data = '&data='+data
    date1 = start.strftime('&year1=%Y&month1=%m&day1=%d')   # no trailing '&', the next piece adds its own
    date2 = end.strftime('&year2=%Y&month2=%m&day2=%d')
    tz = '&tz='+tz
    form = '&format='+sep
    ll = '&latlon='+latlon
    el = '&elev='+elev
    close = '&missing=M&trace=T&direct=no&report_type=1&report_type=2'
    # missing and trace data handling can be changed here by altering the letters M and T. 
    url = base+stat+data+date1+date2+tz+form+ll+el+close
    return url

The function, called `make_url`, will build a URL from the specifications of the user. The last section of the URL, called `close`, specifies how missing data should be handled. The default is to put in "M" for missing and "T" for trace (below detection limits), but this can be changed if desired.

**Knowledge Check:** How would you change the missing and trace data markers in the function?

In [None]:
# your code here


In [None]:
# To change the missing and trace markers, separate out the code and then add additional arguments to the function
# DO NOT run this code cell, this is only for demonstration. You will get an error later if you run it.
def make_url(station,data,start,end,tz,sep,latlon,elev,M,T):
    ...
    ...
    ...
    missing = '&missing='+M
    trace = '&trace='+T
    close = '&direct=no&report_type=1&report_type=2'
    url = base+stat+data+date1+date2+tz+form+ll+el+missing+trace+close
    return url

Let's test out the original function. If you ran the code cell above, you will need to re-run the first function.

In [None]:
# list out the parameters
station = 'BWI'                        # random station name, this one is Baltimore, MD airport BWI
data = 'tmpc'                          # 'all' or a variable name
start = datetime.datetime(2018,1,1)    # date in year, month, day format
end = datetime.datetime(2018,12,31)    # date in year, month, day format
tz = 'Etc%2FUTC'                       # timezone, you might need to test out urls on the website to get the codes
sep = 'onlycomma'                      # format of data separation
latlon = 'no'                          # 'yes' or 'no'
elev = 'no'                            # 'yes' or 'no'

In [None]:
# use the function
url = make_url(station,data,start,end,tz,sep,latlon,elev)
url

Copy and paste this URL into a new browser window without the quotation marks. Does it work?

Let's try again with a more complicated list of variables. What do you think will happen?

In [None]:
# list out the parameters
station = 'BWI'                        
data = 'tmpc&dwpf'                     # two variables now
start = datetime.datetime(2018,1,1)    
end = datetime.datetime(2018,12,31)    
tz = 'Etc%2FUTC'                       
sep = 'onlycomma'                      
latlon = 'no'                          
elev = 'no'                            

In [None]:
url = make_url(station,data,start,end,tz,sep,latlon,elev)
url

Try out this URL. Did it work? Are both data listed out?

No, not all the data is there. The tmpc data is listed, but no dwpf. That is because the URL is missing the key `'&data='` before dwpf. Without this, dwpf is not recognized as a data variable, and so it is skipped over. In order to add more data variables to the list, they each need to have `'&data='` in front of them. 

**Knowledge Check:** Fix the parameters code so that dwpf will be included as a variable, build a new URL, and check that it worked.

In [None]:
# your code here


In [None]:
# example of fixed code
station = 'BWI'                        
data = 'tmpc&data=dwpf'                     # fixed for 2 variables
start = datetime.datetime(2018,1,1)    
end = datetime.datetime(2018,12,31)    
tz = 'Etc%2FUTC'                       
sep = 'onlycomma'                      
latlon = 'no'                          
elev = 'no'  

In [None]:
url = make_url(station,data,start,end,tz,sep,latlon,elev)
url

Hooray!
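
Writing `&data=` by hand works, but it is easy to forget. As an alternative sketch (not the method used in the rest of this tutorial), the standard library function `urlencode` can build the query string from a dictionary, and with `doseq=True` it repeats the key for every item in a list. The function name `make_url_encoded` and its inputs are made up for this example, and the `https` base is copied from the example URL shown earlier.

```python
import datetime
from urllib.parse import urlencode

# hypothetical alternative to make_url: variables is a list of variable names
def make_url_encoded(station, variables, start, end):
    base = 'https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?'
    params = {
        'station': station,
        'data': variables,            # a list becomes data=...&data=...
        'year1': start.year, 'month1': start.month, 'day1': start.day,
        'year2': end.year, 'month2': end.month, 'day2': end.day,
        'tz': 'Etc/UTC',              # urlencode turns '/' into %2F for us
        'format': 'onlycomma',
        'latlon': 'no',
        'elev': 'no',
        'missing': 'M',
        'trace': 'T',
        'direct': 'no',
        'report_type': [1, 2],
    }
    return base + urlencode(params, doseq=True)

test_url = make_url_encoded('BWI', ['tmpc', 'dwpf'],
                            datetime.datetime(2018, 1, 1),
                            datetime.datetime(2018, 12, 31))
print(test_url)
```

With this approach, adding a third variable only means adding another item to the list.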

Ok, by now you're probably wondering why we aren't using Python to read the data. All we've done so far is build a URL and copy and paste. Well, to be honest, that is the first step in making code to access data on the internet. First, you must figure out how to get to the data: what is the URL, how can I make a URL that works for many different situations? Now that we've done that, we can actually move on to the easy part of opening the data with Python.

For this data, we will use the package urllib to open and read the URL. You can read more in the documentation at https://docs.python.org/3/library/urllib.html. You can also learn more about the specific function we will be using, `urlopen`, by running the code cell below. Just hit X in the top right corner of the window when you're done. 

In [None]:
urlopen?

In [None]:
# simple function to read the data from the Iowa State Mesonet website
def get_data(url):
    data = urlopen(url).read().decode("utf-8")
    return data

**Knowledge Check:** Add comments to the above code cell to explain what each line of code does. You might need to do some researching, unless you know what "utf-8" means!

Each command in that line of code is walking through a different step in the process of accessing the data and reading it out for us. First, `urlopen` is opening the data, like how we were pasting the URL in and then the data pops up in the browser window. Next, the `read` command is reading in the data, like how our eyes are taking in the image of the computer screen with data on it. Finally, `decode` is decoding the data using the UTF-8 Unicode encoding system. This is a way for computers to turn binary into readable characters for us. This step is like how our brains understand the image we see. 
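
To make the `decode` step more concrete, here is a small sketch that needs no internet connection. It encodes a made-up string into bytes (the form data takes when it travels over the internet) and then decodes it back into text. The sample string is invented for illustration.

```python
# a made-up piece of text with one non-English character (the degree sign)
text = 'tmpc: 20.5 °C'

raw = text.encode('utf-8')       # text -> bytes, like what read() gives us
print(raw)                       # the degree sign shows up as two escaped bytes
print(len(text), len(raw))       # the byte version is one longer than the text

decoded = raw.decode('utf-8')    # bytes -> text, the same step used in get_data
print(decoded)
```

The degree sign takes two bytes in UTF-8, which is why the byte string is longer than the text. Decoding with the wrong encoding is a common cause of strange characters in downloaded data.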

In [None]:
data = get_data(url)

In [None]:
data

AH! That looks terrible!

Rather than try to parse it out ourselves, let's let Python do the heavy lifting.

In [None]:
def make_df(data):
    outfn = 'data.txt'                    # make a text file
    out = open(outfn, "w")                # open the text file in write mode
    out.write(data)                       # write the data to the text file
    out.close()                           # close the text file 
    df = pd.read_csv(outfn,sep=',')       # open the text file with pandas
    return df

In [None]:
df = make_df(data)

In [None]:
df

What did the `make_df` function do? There are actually a few steps there. We made a text file of the data, then used pandas to open the text file as a DataFrame. The data looks good! But what if we don't want to go through the extra step of making and saving a text file? What if we just want to use pandas?

**Exercise 1:** So far, we have worked on data for consecutive days in a year. What if we want data for a specific month in multiple years? Of course you can change the year manually, but can you make it easier? Think about using a for loop!

In [None]:
# your code here


In [None]:
# potential method
df = []
for year in np.arange(2018,2020):
    station = 'BWI'                        
    data = 'tmpc&data=dwpf'                     # fixed for 2 variables
    start = datetime.datetime(year,6,1)    
    end = datetime.datetime(year,6,30)    
    tz = 'Etc%2FUTC'                       
    sep = 'onlycomma'                      
    latlon = 'no'                          
    elev = 'no' 
    url = make_url(station,data,start,end,tz,sep,latlon,elev)
    data = get_data(url)                        # use the get_data function defined earlier
    df.append(make_df(data))
    
df_tot = pd.concat(df)

In [None]:
df_tot

**Exercise 2:** Remake the `make_df` function to only use pandas to read the data. What happens? As you continue through this tutorial, can you think of ways to solve the error messages?

In [None]:
# your code here


If that didn't work for you, then we need to learn some more about pandas! Let's move on and explore the power of pandas a little more.

### 2D Method 2: pandas

The first method that we used to read in 2D data required the use of urllib to open the URL and read it. Then we used pandas to open the text file we created. This can be a really good method if you want to actually download a file for your data. Perhaps you really want to save the data to your personal computer and not have to go back to the URL site each time, especially if you're working when you don't have internet access. The decode step in urllib is also very useful in situations when you have interesting data formats. Urllib also has some useful functions for dealing with errors. However, the process of using urllib took several steps. If you know your data is readable and doesn't need to be decoded, you can jump right ahead to using pandas. 
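
As one illustration of the error handling mentioned above, the sketch below wraps `urlopen` in a `try`/`except` block using the `HTTPError` and `URLError` classes from `urllib.error`. The function name `get_data_safe` and the 30 second timeout are assumptions made for this example, not part of the earlier code.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# hypothetical variation on get_data that reports problems instead of crashing
def get_data_safe(url, timeout=30):
    try:
        # the with statement closes the connection when we are done
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except HTTPError as err:
        # the server answered, but with an error code such as 404
        print('Server returned an error:', err.code, err.reason)
    except URLError as err:
        # the server could not be reached or the address is wrong
        print('Could not reach the URL:', err.reason)
    return None
```

A function like this is handy inside a loop over many URLs, because one bad URL returns `None` rather than stopping the whole loop.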

Let's say you're interested in researching water quality around Hawaii. You do some Googling and find a series of research cruises have been conducted around the islands for years on this website: https://hahana.soest.hawaii.edu/hot/. However, the data is not stored in a way that makes it convenient for downloading multiple files, so you want to use Python to make your life easier. 

As a first step, you find the data of interest and check out a few URLs to get an idea of the pattern. https://hahana.soest.hawaii.edu/FTP/hot/water/

Click on some of the files to see what the data and URLs look like. 

The URLs seem pretty simple. They are the same long base, plus a number to denote which cruise the data was collected on, and a unique file extension for the type of data (a file extension is the ending of a file name: the text after the final period, like "sea" in hot110.sea). The example function below calls this input `TLD`. It seems pretty easy to come up with a function to make the URLs.

**Knowledge Check:** Make a simple function to make a URL for the HOT data. There should be two inputs for the function.

In [None]:
# your code here


In [None]:
# example simple function to make a URL
def make_url2(number,TLD):
    base = 'https://hahana.soest.hawaii.edu/FTP/hot/water/hot'
    url = base+number+TLD
    return url

Now that we have a function to make a URL, let's try out some random inputs and then test using pandas to open that data. It would be helpful at this point to also open the URL in a new window so that you can check the pandas output with the real data.

In [None]:
url2 = make_url2('110','.sea')
print(url2)
data2 = pd.read_csv(url2)
data2

Ok, that doesn't look great. It seems like pandas assumed the whole first line was one column, and then made each line after that into the data for that row. Not great. Maybe we can tell pandas to skip the first row?

In [None]:
data2 = pd.read_csv(url2,skiprows=1)
data2

Hmm, not much better. Even when a different line is used to establish the column headers, pandas still thinks there should only be one column. Maybe we need to tell pandas what the separators should be to separate the data into multiple columns. Normally, in a CSV file, the separators are commas, but in this file the separators seem to be spaces. We can still skip the first row, because that does not seem to be data or column headers. 

In [None]:
data2 = pd.read_csv(url2,sep=r'\s+',skiprows=1)   # r'...' is a raw string, so the backslash is kept as-is
data2

Yay! That seemed to work. We told pandas to treat any run of whitespace as a separator by passing the regular expression `\s+`. You'll notice that the spaces online are not a uniform size, which is why we needed `\s+` rather than `'\t'` for tabs or `' '` for a single space.
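
To see the effect of the separator without going back to the website, the sketch below reads a small made-up table from a string using `io.StringIO`. The column names and values are invented for illustration and are not from the HOT data.

```python
import io
import pandas as pd

# made-up text with uneven runs of spaces between the values
sample = (
    'depth   temp     sal\n'
    '5.0     24.81  35.02\n'
    '25.5    24.79      35.04\n'
)

# default separator is a comma, so each line lands in a single column
one_col = pd.read_csv(io.StringIO(sample))

# r'\s+' means "one or more whitespace characters"
many_cols = pd.read_csv(io.StringIO(sample), sep=r'\s+')

print(one_col.shape)
print(many_cols.shape)
```

Writing the pattern as a raw string, `r'\s+'`, keeps Python from trying to interpret the backslash itself; newer Python versions print a warning for a plain `'\s+'`.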

One thing you might have noticed is that the first row of data (index 0) is not data, but is instead the units of the data. We don't want this in the pandas DataFrame because it could mess up future calculations. Let's skip more rows in the `read_csv` command.

In [None]:
data2 = pd.read_csv(url2,sep=r'\s+',skiprows=[0,2,3,4])
data2

**Knowledge Check:** Why did the numbers in the `skiprows` argument change? Originally it was `skiprows=1`, but now it is `skiprows=[0,2,3,4]`. If you can't figure out why this is, try running the command again but changing the values for the `skiprows` command, and then compare with the website.
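
If you would like to experiment offline first, the sketch below shows the two forms of `skiprows` on a small made-up table. The text is invented and has a different layout from the HOT files, so the row numbers here will not match the ones used above.

```python
import io
import pandas as pd

# made-up text: a title line, a header line, a units line, then data
sample = (
    'made-up cruise summary line\n'
    'depth temp sal\n'
    'm degC psu\n'
    '5.0 24.81 35.02\n'
    '25.5 24.79 35.04\n'
)

# an integer skips that many lines from the top of the file
skip_int = pd.read_csv(io.StringIO(sample), sep=r'\s+', skiprows=1)

# a list skips the specific line numbers given, counting from 0
skip_list = pd.read_csv(io.StringIO(sample), sep=r'\s+', skiprows=[0, 2])

print(skip_int)
print(skip_list)
```

In the first DataFrame the units line is still there as the first row, which forces every column to be stored as text. In the second, the units line is gone and the columns are numeric.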

**Exercise 3:** Try out code to only skip rows 0, 2, and 3. Does that work? Is there a difference and what is it? Why do you think either code might be working?

In [None]:
# your code here


Now that we have the code working for one URL, we should try it for a different type of data from the same website. 

In [None]:
url3 = make_url2('11','.gof')
print(url3)
data3 = pd.read_csv(url3,sep=r'\s+',skiprows=[0,2,3,4])
data3

That looks like it works! 

Now that we have working code, let's make a loop to read in the data for several cruises. We've used for loops before, but let's try out a new kind of logical function.

In [None]:
cruise = 1

In [None]:
while cruise < 3:
    number = str(cruise)
    TLD = '.gof'
    url2 = make_url2(number,TLD)
    if cruise == 1:
        total = pd.read_csv(url2,sep=r'\s+',skiprows=[0,2,3,4])
    else:
        data2 = pd.read_csv(url2,sep=r'\s+',skiprows=[0,2,3,4])
        total = pd.concat([total,data2],axis=0)
    print(cruise)
    cruise += 1

In [None]:
total

A while loop runs as long as some condition is true. In this case, we said "while the cruise number is less than 3, run the following code." This is useful if you don't necessarily want to specify all the times something is true and only want to specify when it is not true. An analogous loop would be `for x in np.arange(1,3):`, which is saying "run the following code for 1, 2." 

Inside the while loop, we used the `make_url2` function to make a URL for each cruise, and then used an if/else statement to turn the data into one combined DataFrame. For the first cruise, `if cruise == 1`, we just made a new DataFrame. But for every other cruise, `else`, we made a DataFrame and then concatenated this new data with the previous data. Finally, we added 1 to the value of cruise and continued the loop.
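
Another common pattern, sketched below, is to collect each DataFrame in a list and concatenate once at the end, which removes the need for the if/else statement. To keep the example runnable without the internet, the hypothetical function `fake_cruise` stands in for the `pd.read_csv` call and returns made-up values.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for reading one cruise file with pd.read_csv
def fake_cruise(cruise):
    return pd.DataFrame({'cruise': [cruise, cruise], 'depth': [5.0, 25.0]})

# while loop version: keep going as long as the condition is true
frames = []
cruise = 1
while cruise < 3:
    frames.append(fake_cruise(cruise))
    cruise += 1
total_while = pd.concat(frames, axis=0, ignore_index=True)

# for loop version: list out the values to use (1 and 2)
frames = [fake_cruise(int(c)) for c in np.arange(1, 3)]
total_for = pd.concat(frames, axis=0, ignore_index=True)

print(total_while)
print(total_while.equals(total_for))
```

Both loops visit cruises 1 and 2 and give the same combined DataFrame. The `ignore_index=True` option renumbers the rows so the index does not restart for each cruise.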

**Exercise 4:** Now that you know a little more about using pandas to open data, remake the urllib make_df function to read the Iowa ASOS data using only pandas and without making a new text file each time. There is some helpful code already started, so you'll just need to fill it in. Replace the comments on each line to explain what the code is doing.

In [None]:
# your edits here
def make_df2(data):
    lines = data.split('\n')                   # split up the string by the separators
    splits = [x.split() for x in lines]        # what goes in the parentheses?
    columns = splits[]                         # what goes in the brackets?
    df = pd.DataFrame(data = ??, columns = ??) # how do you make a DataFrame?
    return                                     # what gets returned?

In [None]:
# possible answer
def make_df2(data):
    lines = data.split('\n') 
    splits = [x.split(',') for x in lines] 
    columns = splits[0]
    df = pd.DataFrame(data = splits[1:-1], columns = columns)
    return df

In [None]:
# rerun the code just in case
station = 'BWI'                        
data = 'tmpc&data=dwpf'
start = datetime.datetime(2018,1,1)    
end = datetime.datetime(2018,12,31)    
tz = 'Etc%2FUTC'                       
sep = 'onlycomma'                      
latlon = 'no'                          
elev = 'no'  
url = make_url(station,data,start,end,tz,sep,latlon,elev)

In [None]:
# make_df2 expects the text of the data, so read it from the URL first
data = get_data(url)
df = make_df2(data)
df

Ok, but how might you fix the column names?
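
Something else to check in the `make_df2` output is the data type of each column, because splitting strings by hand leaves every value stored as text. As a possible alternative (a sketch, not the only answer), `io.StringIO` lets pandas read a string as if it were a file, so `pd.read_csv` can work out the data types. The function name `make_df3` and the sample text are made up for this example; the sample only imitates the comma-separated format and uses "M" for missing values as described earlier.

```python
import io
import pandas as pd

# made-up text imitating the comma-separated format, with M for missing values
sample = (
    'station,valid,tmpc,dwpf\n'
    'BWI,2018-01-01 00:54,-8.30,1.90\n'
    'BWI,2018-01-01 01:54,M,M\n'
)

# hypothetical variation on make_df2 that lets pandas do the parsing
def make_df3(data):
    return pd.read_csv(io.StringIO(data),       # treat the string like a file
                       sep=',',
                       na_values='M',           # turn the missing marker into NaN
                       parse_dates=['valid'])   # read the time column as datetimes

df3 = make_df3(sample)
print(df3)
print(df3.dtypes)
```

Because the "M" markers become NaN, the temperature columns are read in as numbers and are ready for calculations.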

### Xarray for netCDF files

So far, we've focused on data that is 2D (tables of data). But what if we want to download satellite data? From previous tutorials, we know that pandas does not work with these types of data. Instead, we need to use Xarray to open and read the netCDF files. 

Let's go back to the chlorophyll-a data from Tutorials 6 and 7. This data was kind of hard to find, and matching the data used in the tutorial with the data you could search for was complicated. Unfortunately, there is so much satellite data stored in so many places that it will always involve a lot of searching for the right data. But one thing that can make accessing the data easier is OPeNDAP. OPeNDAP is software that makes it easier to store and access large datasets, like satellite data, over the internet. A decent amount of satellite data is stored through the OPeNDAP framework, and we can use this to access the data online.

We're going to use one of NASA's other data repositories to get started: https://oceandata.sci.gsfc.nasa.gov/. Then click Data -> OPeNDAP. From here, you can pick and choose where to go. This tutorial will use the URL to get the same chlorophyll-a data products as the previous tutorials, but you can change the URLs if you want to try out a different data product.

For chlorophyll-a, go to 'Merged_ATV' -> 'L3SMI' -> '2018'. Then select any day. We just need one URL to get an idea of the pattern, since we've had some practice already.

Once you select a day, there should be at least one file that appears. The first file, the one of interest to us, will say 'X#######.' and then '.L3m_DAY_CHL_chlor_a_4km.nc'. The first part with X in front is the year followed by the numerical day of the year. If you click on this file, you will be given several options to get the data. We don't want to download the file by hand, we want to do it with Python, so the important information for us is the 'Data URL'. We can use that to get the data!

First step, recognize the pattern in the URLs and then build a function (clicking these links will get you to an error message).

The URL to the January 1st file is 'http://oceandata.sci.gsfc.nasa.gov/opendap/Merged_ATV/L3SMI/2018/001/X2018001.L3m_DAY_CHL_chlor_a_4km.nc'. The URL to the January 2nd file is 'http://oceandata.sci.gsfc.nasa.gov/opendap/Merged_ATV/L3SMI/2018/002/X2018002.L3m_DAY_CHL_chlor_a_4km.nc'. Can you figure out the pattern?

The files and URLs are just changing by the day of year number. So we need a function that will take a date, turn that date into a day of year, and then build the URL. Alternatively, we can build a function that takes some numeric value, uses that to build the URL, and then also turns the numeric value into a date and returns that for reference.

In [None]:
# first, let's build another function to make our URLs.
# the only thing that changed about the URLs is the day of year (doy). 
# design a function to turn a date into the URL. You might want to check out the zfill command for strings.
def make_url3(date):
    doy = 
    year = 
    url = 
    return url

In [None]:
# test out your function with this code
date = pd.to_datetime('2018-01-01')
url3 = make_url3(date)
print(url3)
xr.open_dataset(url3)

Try to build your own function before checking with the code cell below.

In [None]:
# one example of a function
def make_url4(date):
    doy = str(date.dayofyear).zfill(3)
    year = str(date.year)
    base = 'http://oceandata.sci.gsfc.nasa.gov/opendap/Merged_ATV/L3SMI/'
    file = '.L3m_DAY_CHL_chlor_a_4km.nc'
    url = base+year+'/'+doy+'/X'+year+doy+file
    return url
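
As another possible approach, the date's `strftime` method can produce the zero-padded day of year directly with the `%j` format code, so `zfill` is not needed. The function name `make_url_doy` is made up for this sketch; the base and file name are the same ones used above.

```python
import pandas as pd

# hypothetical variation on make_url4 using strftime instead of zfill
def make_url_doy(date):
    base = 'http://oceandata.sci.gsfc.nasa.gov/opendap/Merged_ATV/L3SMI/'
    file = '.L3m_DAY_CHL_chlor_a_4km.nc'
    year = date.strftime('%Y')    # four-digit year as a string
    doy = date.strftime('%j')     # day of year, already padded to three digits
    return base + year + '/' + doy + '/X' + year + doy + file

print(make_url_doy(pd.to_datetime('2018-01-02')))
print(pd.to_datetime('2018-02-01').strftime('%j'))
```

The first print should match the January 2nd URL shown earlier, which is a quick way to check the function without opening any data.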

In [None]:
# alternative example of a function
def make_url5(number,year):
    doy = str(number).zfill(3)
    year = str(year)
    date = pd.to_datetime(number,unit='D',origin=pd.to_datetime(year+'-01-01'))
    base = 'http://oceandata.sci.gsfc.nasa.gov/opendap/Merged_ATV/L3SMI/'
    file = '.L3m_DAY_CHL_chlor_a_4km.nc'
    url = base+year+'/'+doy+'/X'+year+doy+file
    return url,date

In [None]:
# different code to test new function
number = 5
year = 2020
url,date = make_url5(number,year)
print(url)
print(date)
xr.open_dataset(url)

<font color=red>**Note:**  </font>Any days in 2022 on the OPeNDAP server are not working through remote download and return a File Not Found error. This is not the code but rather the website. Try another date to see if it works instead.

Look carefully at the example above. Do you notice a discrepancy between the URL date and the pandas datetime?

**Knowledge Check:** Why is there a difference between the date in the URL and the pandas datetime? How can this be fixed?

The reason there is a difference is that `pd.to_datetime` starts at the `origin` date (January 1) and then ADDS the number of days to it. So January 1 plus 5 days becomes January 6. Meanwhile, the code to build the URL just takes the number it is given, so the URL points to day 5 of the year (January 5). To fix this, you could subtract 1 from number before doing the pd.to_datetime command, or you could start a day earlier in the pd.to_datetime command (December 31 of the previous year). Alternatively, you could add 1 to number before turning it into a string for the URL, but then you also need to start counting from 0 rather than 1. 
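
The first of those fixes can be sketched as a small helper. The function name `doy_to_date` is made up for this example, and here `year` is passed as an integer.

```python
import pandas as pd

# hypothetical helper: turn a day of year (starting at 1) into a date
def doy_to_date(number, year):
    start = pd.Timestamp(year=year, month=1, day=1)
    # day 1 is January 1 itself, so only add (number - 1) days
    return start + pd.Timedelta(days=number - 1)

# the original approach adds all 5 days to January 1
shifted = pd.to_datetime(5, unit='D', origin=pd.Timestamp('2020-01-01'))
fixed = doy_to_date(5, 2020)

print(shifted)
print(fixed)
print(fixed.dayofyear)
```

Checking that `dayofyear` of the result gives back the original number is a simple test that the date and the URL agree.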

Now that we have a code to make a URL, we can begin downloading data with Xarray. However, if we want to merge all of the days together into one large Dataset, we will need to assign a new dimension and coordinate to the data with the date. 

In [None]:
# get the chlorophyll-a data and add a time dimension
def align_coords(data,date):
    chlor = data['chlor_a']
    chlor = chlor.assign_coords({'time':date}).expand_dims('time',axis=2)
    return chlor

In [None]:
# make a function to merge the files on the time dimension
def merge_days(file,data):
    file = xr.concat([file,data],dim='time')
    return file

**Knowledge Check:** What are these two functions doing? Add comments to the code to explain each step.

The first function selected the chlorophyll-a data that we wanted, and removed the palette data. This also removed two of the dimensions of the data. We then added a new dimension called 'time' and gave it the coordinate of the file's date. 

The second function merges two DataArrays together along the time dimension. This way, the data will stay in order of date.

We can use the three functions to open multiple days of data and combine them into one DataArray.

In [None]:
# this code will take a while because the Datasets are big. Adding more dates makes it take longer
for date in pd.date_range('2020-01-01','2020-01-05',freq='D'):
    url = make_url4(date)
    if date.dayofyear == 1:
        data_chlor = xr.open_dataset(url)
        file = align_coords(data_chlor,date)
    else:
        data_chlor = xr.open_dataset(url)
        chlor = align_coords(data_chlor,date)
        file = merge_days(file,chlor)
    print(date)

In [None]:
file
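
The loop above depends on the first date being January 1, because of the `date.dayofyear == 1` check. One way around that, sketched below, is to collect each day in a list and concatenate once at the end. To keep the example quick and runnable offline, the hypothetical function `fake_day` builds a tiny made-up DataArray in place of `xr.open_dataset(url)['chlor_a']`; the sizes and values are invented.

```python
import numpy as np
import pandas as pd
import xarray as xr

# hypothetical stand-in for one day of chlorophyll-a data (2 lats x 3 lons)
def fake_day(value):
    return xr.DataArray(np.full((2, 3), value),
                        dims=('lat', 'lon'),
                        coords={'lat': [10.0, 20.0], 'lon': [100.0, 110.0, 120.0]},
                        name='chlor_a')

days = []
for i, date in enumerate(pd.date_range('2020-03-10', '2020-03-12', freq='D')):
    day = fake_day(float(i))
    # label the day with its date, then turn that label into a new dimension
    day = day.assign_coords({'time': date}).expand_dims('time')
    days.append(day)

# one concat at the end joins all the days along time
stack = xr.concat(days, dim='time')
print(stack.dims)
print(stack['time'].values)
```

This version works for any start date, and it only calls `xr.concat` once rather than once per day.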

**Exercise 5:** Make a new function to create URLs and download NCEP NARR data. Make a plot of daily air temperature at 2 m for the years 2019-2020. 

Finding the data:
1) Start with the list of data here https://psl.noaa.gov/data/gridded/data.narr.html   
2) Find the data of interest, then click the book icon. This will take you to the THREDDS catalogue page.    
3) On the THREDDS catalogue page, click any year.   
4) On the info page for that year, click the OPeNDAP option.   
5) On the OPeNDAP Dataset Access Form you will be given a data URL. Use this to create your base URL and function.

In [None]:
# your code here
# function to build URL, maybe merge data?


In [None]:
# your code here
# loop to open and merge data


In [None]:
# your code here
# plot the data
