# Downloading Precipitation Data

In this notebook, we use a **url.txt** file containing a list of precipitation data URLs to download the data and then filter it to a given state.

## Import libraries

We import the `requests` package to access URLs, `netCDF4` to read netCDF files, and other packages for working with data, including `numpy` and `pandas`. We also import `descartes`, `geopandas` and `shapely` to filter the data to the area covered by a given shapefile.

In [None]:
!pip install geopandas netCDF4 shapely rtree pygeos

In [None]:
import os
import requests
import descartes
import numpy as np
import pandas as pd
import geopandas as gpd
from netCDF4 import Dataset
from geopandas.tools import sjoin
from shapely.geometry import Point, Polygon, shape


In [None]:
# Import libraries for handling requests and cookies
from http.cookiejar import CookieJar
from urllib.parse import urlencode
import urllib.request as urllib2

# The user credentials that will be used to authenticate access to the data
 
username = "<username>"
password = "<password>"

# Create a password manager to deal with the 401 response that is returned from
# Earthdata Login
 
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, "https://urs.earthdata.nasa.gov", username, password)

# Create a cookie jar for storing cookies. This is used to store and return
# the session cookie given to us by the data server (otherwise it will just
# keep sending us back to Earthdata Login to authenticate).  Ideally, we
# should use a file based cookie jar to preserve cookies between runs. This
# will make it much more efficient.
 
cookie_jar = CookieJar()

# Install all the handlers.
 
opener = urllib2.build_opener(
    urllib2.HTTPBasicAuthHandler(password_manager),
    #urllib2.HTTPHandler(debuglevel=1),    # Uncomment these two lines to see
    #urllib2.HTTPSHandler(debuglevel=1),   # details of the requests/responses
    urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)
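
The comment above notes that a file-based cookie jar would preserve the session cookie between runs. A minimal sketch of that idea, using the standard library's `MozillaCookieJar`, could look like the following. The helper name `load_cookie_jar` is hypothetical and not part of the notebook.

```python
import os
from http.cookiejar import MozillaCookieJar


def load_cookie_jar(path):
    """Return a MozillaCookieJar bound to `path`, loading saved cookies if any.

    Hypothetical helper; the notebook itself uses an in-memory CookieJar.
    """
    jar = MozillaCookieJar(path)
    if os.path.exists(path):
        # ignore_discard=True keeps session cookies, which load() would
        # otherwise skip.
        jar.load(ignore_discard=True, ignore_expires=True)
    return jar
```

The returned jar can be passed to `HTTPCookieProcessor` in place of `CookieJar()`. Calling `cookie_jar.save(ignore_discard=True, ignore_expires=True)` after the downloads finish writes the cookies back to the same file for the next run.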

## Import data files

We load the shapefile of US states; the data will later be filtered to the state we are looking at.

In [None]:
shape_file = gpd.read_file("cb_2018_us_state_500k.shp")

## Create csv

I have created a function that takes the URLs file and creates a CSV file for the given state.
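
The function reads one URL per line, skips links to PDF documents, and downloads each remaining file. As a minimal sketch of just that download step, the helpers below separate the three concerns and catch the `HTTPError` and `URLError` exceptions that the function's own comments say should be handled. The names `read_data_urls`, `filename_from_url` and `fetch_bytes` are hypothetical, and the sketch assumes the file name is the last path component of the URL.

```python
import posixpath
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def read_data_urls(path):
    """Return the non-empty, non-PDF URLs listed one per line in `path`."""
    urls = []
    with open(path, "r") as url_file:
        for line in url_file:
            url = line.strip()
            # Test the URL path so a query string cannot hide the extension.
            if url and not urlsplit(url).path.lower().endswith(".pdf"):
                urls.append(url)
    return urls


def filename_from_url(url):
    """Return the last path component of `url`, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def fetch_bytes(url, timeout=60):
    """Return the response body, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # HTTPError subclasses URLError, so it has to be caught first.
        print("HTTP error {} for {}".format(err.code, url))
    except urllib.error.URLError as err:
        print("Could not reach {}: {}".format(url, err.reason))
    return None
```

Returning `None` on failure lets the calling loop skip a bad URL and continue with the rest of the list instead of aborting the whole run.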

In [None]:
def create_csv(url_file_name, shape_file, state_code = "WA", year = "2010", month = "01"):
    
    # Filter shape file for the specific state
    shape_file = shape_file[shape_file["STUSPS"] == state_code].reset_index(drop = True)
    
    # Load the URLs file
    url_file = open(url_file_name, "r")
    
    # Create a dataframe which will store all data
    resultant_data = pd.DataFrame({"Latitude": [], 
                                   "Longitude": [], 
                                   "Precipitation": [], 
                                   "geometry": [], 
                                   "State": [],
                                   "StartTime": [],
                                   "EndTime": []})
    
    
    # For each URL, get the data
    for URL in url_file:

        # Make sure the URL is not for a PDF file. Strip the trailing newline
        # first so that a last line without one is handled as well.
        if not URL.strip().endswith("pdf"):

            # Create and submit the request. There are a wide range of exceptions that
            # can be thrown here, including HTTPError and URLError. These should be
            # caught and handled.
            
            request = urllib2.Request(URL.strip())
            response = urllib2.urlopen(request)
            result = response.read()

            # Save the retrieved data to a file
            FILENAME = URL.split("https://gpm1.gesdisc.eosdis.nasa.gov/opendap/GPM_L3/GPM_3IMERGHH.06/")[1].split("/")[2].split("?")[0]
            f = open(FILENAME, 'wb')
            f.write(result)
            f.close()

            # Read data
            data = Dataset(FILENAME)
            
            # Get latitude, longitude and precipitation
            lon_values = list(np.repeat(data['lon'][:], data['lat'][:].shape[0]))
            lat_values = list(np.tile(data['lat'][:], data['lon'][:].shape[0]))
            precp_values = data['precipitationCal'][:][0].flatten()
            temp_df = pd.DataFrame({"Latitude": lat_values, "Longitude": lon_values, "Precipitation": precp_values})

            # Create geodataframe from the points
            geometry = [Point(xy) for xy in zip(temp_df["Longitude"], temp_df["Latitude"])]
            points = gpd.GeoDataFrame(temp_df, crs = "EPSG:4269", geometry = geometry)

            # Select the data that lies within the given state
            # (geopandas releases before 0.10 call this keyword `op`)
            final_df = sjoin(points, shape_file, how = 'inner', predicate = 'intersects')
            
            # Create dataframe for the given file
            final_df = final_df[["Latitude", "Longitude", "Precipitation", "geometry"]].reset_index(drop = True)
            final_df["State"] = state_code
            final_df["StartTime"] = data.__dict__["FileHeader"].split("\nStartGranuleDateTime=")[1].split(";")[0]
            final_df["EndTime"] = data.__dict__["FileHeader"].split("\nStopGranuleDateTime=")[1].split(";")[0]

            # Append to the resultant dataframe
            resultant_data = pd.concat([resultant_data, final_df], ignore_index = True)

            # Print to console that a record has been updated
            print("Retrieved data from {} to {}".format(data.__dict__["FileHeader"].split("\nStartGranuleDateTime=")[1].split(";")[0], 
                                                        data.__dict__["FileHeader"].split("\nStopGranuleDateTime=")[1].split(";")[0]))

            # Remove the file we created as we no longer need it
            os.remove(FILENAME)

    # Close the URLs file and save the final dataframe to the system
    url_file.close()
    resultant_data.to_csv(state_code + "_" + year + "_" + month + ".csv", index = False)
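
The coordinate columns in the function are built with `np.repeat` on the longitudes and `np.tile` on the latitudes. This pairing only lines up with the flattened precipitation values if one time step of the variable is laid out as (lon, lat), which is what the function implies. A small sketch on a made-up 3 x 2 grid shows why the pairing works under that assumption.

```python
import numpy as np

# Hypothetical toy grid: 3 longitudes by 2 latitudes.
lon = np.array([-125.0, -124.9, -124.8])
lat = np.array([45.0, 45.1])

# One time step shaped (lon, lat), mirroring what the function assumes about
# precipitationCal[0]. The values are made up.
precip = np.arange(6, dtype=float).reshape(lon.size, lat.size)

lon_values = np.repeat(lon, lat.size)  # each longitude repeated once per latitude
lat_values = np.tile(lat, lon.size)    # the latitude axis cycled once per longitude
flat = precip.flatten()                # row-major: the last axis (lat) varies fastest

# Every flattened value lines up with the (lon, lat) cell it came from.
for k in range(flat.size):
    i, j = divmod(k, lat.size)
    assert lon_values[k] == lon[i]
    assert lat_values[k] == lat[j]
    assert flat[k] == precip[i, j]
```

If the variable were stored as (lat, lon) instead, the roles of `np.repeat` and `np.tile` would need to be swapped, so it is worth checking the variable's dimensions before relying on this ordering.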

## Retrieve data

I will now load the data for each of the URL files. The sample call below is for **Florida**, **July 2000**.

In [None]:
create_csv("url.txt", shape_file, state_code = "FL", year = "2000", month = "07")

Retrieved data from 2000-07-01T00:00:00.000Z to 2000-07-01T00:29:59.999Z
Retrieved data from 2000-07-01T00:30:00.000Z to 2000-07-01T00:59:59.999Z
Retrieved data from 2000-07-01T01:00:00.000Z to 2000-07-01T01:29:59.999Z
Retrieved data from 2000-07-01T01:30:00.000Z to 2000-07-01T01:59:59.999Z
Retrieved data from 2000-07-01T02:00:00.000Z to 2000-07-01T02:29:59.999Z
Retrieved data from 2000-07-01T02:30:00.000Z to 2000-07-01T02:59:59.999Z
Retrieved data from 2000-07-01T03:00:00.000Z to 2000-07-01T03:29:59.999Z
Retrieved data from 2000-07-01T03:30:00.000Z to 2000-07-01T03:59:59.999Z
Retrieved data from 2000-07-01T04:00:00.000Z to 2000-07-01T04:29:59.999Z
Retrieved data from 2000-07-01T04:30:00.000Z to 2000-07-01T04:59:59.999Z
Retrieved data from 2000-07-01T05:00:00.000Z to 2000-07-01T05:29:59.999Z
Retrieved data from 2000-07-01T05:30:00.000Z to 2000-07-01T05:59:59.999Z
Retrieved data from 2000-07-01T06:00:00.000Z to 2000-07-01T06:29:59.999Z
Retrieved data from 2000-07-01T06:30:00.000Z to 200
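
The start and stop times in this log come from the file's `FileHeader` attribute, which the function splits by hand each time it needs a value. As a minimal sketch, the header could instead be parsed once into a dictionary. The helper name `parse_file_header` is hypothetical, and the sample header is made up in the `Key=Value;` per-line shape that the function's split calls imply.

```python
def parse_file_header(header):
    """Split a header made of 'Key=Value;' lines into a dict of strings."""
    fields = {}
    for line in header.splitlines():
        line = line.strip().rstrip(";")
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key] = value
    return fields


# Made-up header text; only the two granule keys appear in the notebook.
sample_header = (
    "SomeOtherField=value;\n"
    "StartGranuleDateTime=2000-07-01T00:00:00.000Z;\n"
    "StopGranuleDateTime=2000-07-01T00:29:59.999Z;\n"
)

header_fields = parse_file_header(sample_header)
print("Retrieved data from {} to {}".format(
    header_fields["StartGranuleDateTime"],
    header_fields["StopGranuleDateTime"]))
```

Inside the function, the same dictionary could supply both the `StartTime` and `EndTime` columns and the progress message, so the header is only parsed once per file.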

