
# Combining Precipitation and Elevation from Raster

In this notebook, I will create the datasets needed for my analysis: a combined file of all monthly datasets, a combined file of all daily datasets, and an elevation data file.

## Import libraries

We need `ftplib` for managing FTP connections and `rasterio` for working with rasters.

In [None]:
import os
import ssl
import json

import numpy as np
import pandas as pd

!pip install rasterio
!pip install geopandas

import rasterio
import geopandas as gpd
from ftplib import FTP_TLS


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load centers

We load in the centers file so we can use the coordinates for extracting data.

In [None]:
centers_data = pd.read_csv("/content/drive/MyDrive/DA/all_processed_data.csv")
centers_data.head()

| Unnamed: 0 | ID | NAME | CITY | STATE | ZIP | LATITUDE | LONGITUDE |
|---|---|---|---|---|---|---|---|
| 0 | 12142728 | T J HEALTH COLUMBIA | COLUMBIA | KY | 42728 | 37.096642 | -85.294546 |
| 1 | 1042539 | CASEY COUNTY HOSPITAL | LIBERTY | KY | 42539 | 37.317717 | -84.933172 |
| 2 | 5240336 | MARCUM AND WALLACE MEMORIAL HOSPITAL | IRVINE | KY | 40336 | 37.706197 | -83.977277 |
| 3 | 3641653 | HIGHLANDS REGIONAL MEDICAL CENTER | PRESTONSBURG | KY | 41653 | 37.72901 | -82.76732 |
| 4 | 7140207 | NORTON WOMEN'S AND KOSAIR CHILDREN'S HOSPITAL | LOUISVILLE | KY | 40207 | 38.23533 | -85.632937 |


I'll extract the coordinates which will be needed for getting the corresponding data values.

In [None]:
# rasterio expects points as (x, y), i.e. (longitude, latitude)
coords = list(zip(centers_data["LONGITUDE"], centers_data["LATITUDE"]))
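
The order matters because `rasterio`'s `sample` takes each point as `(x, y)`, which for a geographic raster means `(longitude, latitude)`. As a rough illustration of what sampling a point involves, the sketch below maps a point to a grid row and column by hand. It assumes a north-up global grid with 0.1 degree cells whose top-left corner is at (-180, 90); the function name `lonlat_to_rowcol` and those grid parameters are illustrative assumptions, not values read from the actual files.

```python
import math

def lonlat_to_rowcol(lon, lat, west=-180.0, north=90.0, cell_size=0.1):
    """Return the (row, col) of the grid cell containing a lon/lat point.

    Assumes a north-up raster: columns grow eastward from `west`,
    rows grow southward from `north`. All defaults are illustrative.
    """
    col = math.floor((lon - west) / cell_size)
    row = math.floor((north - lat) / cell_size)
    return row, col

# First center from centers_data.head(), in (longitude, latitude) order
row, col = lonlat_to_rowcol(-85.294546, 37.096642)
print(row, col)
```

Swapping the two arguments would silently pick a cell in a different part of the world, which is why `coords` is built with longitude first.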

## Load credentials

Access to the data is through an authenticated FTP connection, so we need to load our credentials. I keep them in the file **eosdis_credentials.json**, in JSON format, and read them in here.

In [None]:
# Open the credentials file; the with block closes it automatically
with open("eosdis_credentials.json") as credentials_file:
    # Load credentials
    credentials = json.load(credentials_file)

username = credentials["username"]
password = credentials["password"]

FileNotFoundError: ignored
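
The error above appears when **eosdis_credentials.json** is not in the working directory. A minimal sketch of a more forgiving loader follows; the helper name `load_credentials` and the environment variable names `EOSDIS_USERNAME` and `EOSDIS_PASSWORD` are hypothetical, chosen only for illustration.

```python
import json
import os

def load_credentials(path="eosdis_credentials.json"):
    """Return (username, password) from a JSON file, else from the environment.

    The environment variable names are assumptions made for this sketch.
    """
    if os.path.exists(path):
        with open(path) as credentials_file:
            credentials = json.load(credentials_file)
        return credentials["username"], credentials["password"]

    username = os.environ.get("EOSDIS_USERNAME")
    password = os.environ.get("EOSDIS_PASSWORD")
    if username is None or password is None:
        raise FileNotFoundError(
            f"{path} not found and EOSDIS_USERNAME/EOSDIS_PASSWORD are not set"
        )
    return username, password
```

Either route keeps the credentials out of the notebook itself.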

## Set up FTP

We will now set up the FTPS connection used to fetch the data from the server. The setup follows the [GPM](https://gpm.nasa.gov/data/directory/imerg-final-run-pps-research-gis) download instructions for FTP.

In [None]:
ftp_site = "arthurhouftps.pps.eosdis.nasa.gov"
FTP_TLS.ssl_version = ssl.PROTOCOL_TLSv1_2
ftps = FTP_TLS()
ftps.connect(ftp_site, 21)
# Log in with the credentials loaded earlier instead of hardcoding them
ftps.login(username, password)
ftps.prot_p()

'200 Protection set to Private'
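
An alternative to setting `FTP_TLS.ssl_version` on the class is to pass an `ssl.SSLContext` to the constructor. The sketch below shows that variant; `make_tls_context` and `connect_ftps` are hypothetical helper names. Note that, unlike the cell above, this context also verifies the server certificate, and I have not checked whether the server's certificate passes that verification.

```python
import ssl
from ftplib import FTP_TLS

def make_tls_context():
    """Build a certificate-verifying context that refuses anything below TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

def connect_ftps(host, username, password, port=21):
    """Open an explicit FTPS session and protect the data channel.

    Hypothetical helper: it needs network access and valid credentials to run.
    """
    ftps = FTP_TLS(context=make_tls_context())
    ftps.connect(host, port)
    ftps.login(username, password)  # login() secures the control channel first
    ftps.prot_p()                   # encrypt the data connection as well
    return ftps
```

It would be called as `ftps = connect_ftps(ftp_site, username, password)`.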

## Get data

We will now use the FTP connection to get the TIF files, extract data for the health centers and create our dataframe.

### Monthly data download

The following function enables us to download monthly precipitation data for a given date range for a given set of coordinates.

In [None]:
def get_monthly_data(coords, ids, start_year = 2010, end_year = 2020):
    
    # If year is less than 2001, it won't work
    if start_year < 2001:
        print("ERROR: Start year must be more than 2000")
        return
     
    # If year is more than 2020, it won't work
    if end_year > 2020:
        print("ERROR: End year must be less than 2021")
        return
    
    # If end year is less than start year, it won't work
    if end_year < start_year:
        print("ERROR: End year must be after start year")
        return
        
    # Start creating a resultant dataframe
    resultant_data = pd.DataFrame({"ID": ids, "temp": range(len(coords))})
    
    # For each year
    for year in range(start_year, end_year + 1):
        
        # For each month in the current year
        for month in ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]:
        
            # Get the TIF file and save to local temporary file
            remote_path = ('pub/gpmdata/' + str(year) + '/' + month + '/01/gis/3B-MO-GIS.MS.MRG.3IMERG.'
                           + str(year) + month + '01-S000000-E235959.' + month + '.V06B.tif')
            with open("temporary_tif.tif", "wb") as f:
                ftps.retrbinary('RETR ' + remote_path, f.write)

            # Open that file and get data from the raster (first band) at each coordinate
            with rasterio.open('temporary_tif.tif') as src:
                resultant_data[month + "_" + str(year)] = [x[0] for x in src.sample(coords)]

            # Scale down by a factor of 1000 (documentation states the values are scaled up)
            resultant_data[month + "_" + str(year)] = resultant_data[month + "_" + str(year)].div(1000)

            # Remove the temporary file (the raster is already closed by the with block)
            os.remove("temporary_tif.tif")
            
        # Print when data is loaded for a given year
        print("Completed extraction for {}".format(year))
        
    # Return final dataframe
    return resultant_data.drop(["temp"], axis = 1)
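
The long concatenation that builds the remote file name is easy to get wrong, so it can help to isolate it. The sketch below reproduces the same path with an f-string; `monthly_tif_path` is a hypothetical helper name, and the path layout is simply the one used inside `get_monthly_data`.

```python
def monthly_tif_path(year, month):
    """Build the remote path of the monthly TIF for a given year and month.

    Mirrors the string concatenated inside get_monthly_data; `month` may be
    an int (3) or a zero-padded string ("03").
    """
    month = f"{int(month):02d}"
    return (
        f"pub/gpmdata/{year}/{month}/01/gis/"
        f"3B-MO-GIS.MS.MRG.3IMERG.{year}{month}01-S000000-E235959.{month}.V06B.tif"
    )

print(monthly_tif_path(2005, 3))
```

With such a helper, the month loop could iterate over `range(1, 13)` instead of a hand-written list of strings.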

I will download the monthly precipitation data for the years 2005 to 2020 and save it.

In [None]:
prec_for_centers = get_monthly_data(coords, centers_data["ID"], 2005, 2020)
prec_for_centers.to_csv("/content/drive/MyDrive/DA/prec_for_centers_monthly.csv", index = False)

Completed extraction for 2005
Completed extraction for 2006
Completed extraction for 2007
Completed extraction for 2008
Completed extraction for 2009
Completed extraction for 2010
Completed extraction for 2011
Completed extraction for 2012
Completed extraction for 2013
Completed extraction for 2014
Completed extraction for 2015
Completed extraction for 2016
Completed extraction for 2017
Completed extraction for 2018
Completed extraction for 2019
Completed extraction for 2020


### Adding elevation data for monthly data

Finally, I'll extract an elevation value for each center from the **World TIF**, which is taken from the [ASTER Global Digital Elevation Map](https://asterweb.jpl.nasa.gov/gdem.asp). The values are in metres.

In [None]:
def get_elevation_data(coords, ids, elevation_file):
      
    # Create resultant dataframe
    resultant_data = pd.DataFrame({"ID": ids})
    
    # Read the TIF file and get elevation data (first band) at each coordinate
    with rasterio.open(elevation_file) as src:
        resultant_data["elevation"] = [x[0] for x in src.sample(coords)]
        
    # Return resultant data
    return resultant_data

In [None]:
elevation_for_centers = get_elevation_data(coords, centers_data["ID"], "/content/drive/MyDrive/DA/GDEM-10km-BW.tif")
elevation_for_centers.to_csv("/content/drive/MyDrive/DA/elevation_for_centers.csv", index = False)
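
Both saved files share the `ID` column, so they can later be joined into a single table. A minimal sketch of that join, using tiny made-up frames in place of the real files (the numbers below are invented for illustration only):

```python
import pandas as pd

# Made-up stand-ins with the same shape as the two saved files
prec = pd.DataFrame({"ID": [1, 2], "01_2005": [0.12, 0.34]})
elev = pd.DataFrame({"ID": [1, 2], "elevation": [250, 310]})

# One row per center: precipitation columns plus elevation.
# validate raises if an ID appears more than once on either side.
combined = prec.merge(elev, on="ID", how="left", validate="one_to_one")
print(combined)
```

With the real data, `prec` and `elev` would be `prec_for_centers` and `elevation_for_centers`.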