"Geo Data Science with Python" - Fall 2021
### Notebook Final Exam

This is a take-home exam. 

If you work in teams, please indicate your collaborators below.

In [None]:
NAME = ""
COLLABORATORS = ""

---
---
# PART A: Theoretical Knowledge

(16 points)

Note that these exam questions are designed to deepen your understanding of Python language and data science topics. I encourage you to try to think about the answers first and formulate how you would formulate them, before you cross-check and complete them by looking up your notes or elsewhere. This is also similar to how we approach scientific writing about theoretical topics, for example for the introduction of a report or scientific paper. First we note down down our thoughts to include what we understand is important and to draft a sketch on a topic, then we reflect our draft by revising the literature, and then we use both to formulate our statements more precise. 

Grading criteria: 
- Relevancy of your statements (2 points, depending on question)
- Precision of your statements (1 point)
- Conciseness of your answer (1 point)


---
#### Question A.1: Explain the difference between a class and a function in Python
(4 points)



A class is an object oriented approach that involves the creation of an instance, which possess attributes and methods. A class may possess several methods, equivalent to having several functions. 
<br>A function is different because it receives inputs and produces outputs, and usually performs a unique set of calculations. 

---
#### Question A.3: Explain the difference between a class object and an instance object in Python.

Can one exist without the other?
Add code example for how to create a class and how to create an instance.

(4 points)


An instance object is an initiation of a class object. Therefore, and instance object cannot exist without a class object, but the reverse is not true. 

```
class Example (object):
    def __init__(self, input):
        self.input = input
    
    def method (self):
        print ('This is the value of the first input:', self.input)
        
instance = Example('Hi')
instance.method()  
```

 From the above Example() is a class object, and instance is an instance object of the Example() class.

In [None]:
# code example 
class Example (object):
    def __init__(self, input):
        self.input = input
    
    def method (self):
        print ('This is the value of the first input:', self.input)
        
instance = Example('Hi')
instance.method()

---
#### Question A.3: Explain the difference between the matrix product and the hadamard product when combining two numpy arrays. 

Provide code example for applying both to the example matrices x and y below. Also, name example cases, for which each of them are applied!

(4 points)



Matrix product is natrix multiplicatin as determined by linear algebra. This method requires that the column in the first matrix and the rows in the second have the same dimension. 
<br>The Hadamard product is an element by element multiplication of  matrix componenets in corresponding positions so long as the dimensions of both matrices are the same. 

In [None]:
import numpy as np
x = np.array([[1,1,1],[1,1,1],[1,1,1]])
y = np.array([[2,2,2],[2,2,2],[2,2,2]])

In [None]:
# code example 
Had = x*y
MP = x@y
print ('Hadamard product:')
print(Had)

print ('\nMatrix product:')
print(MP)

---
#### Question A.4: Which machine learning method(s) would you use to automatically discriminate between sounds of birds from sounds of whizzles? Explain your answer.

(4 points)


Supeervised machine learning with a classification approach would be ideal. Since the data is labeled, and we have an idea of our desired output, we can use specified parameters to produce discrete outputs (bird vs whistle). 

---
---
# PART B: Practical Knowledge

(40 points + 10 extra credit)

The tasks of this part are very similar to previous exercises and intent to allow you to get some routine experience with downloading and handling large science data and analyzing them for specific regions.

---
### B.1: Getting data via OpenDAP Servers

(10 points)

At the following link you can retrieve information about groundwater and soil moisture conditions over the US from a data assimilation of remotely sensed GRACE data into a hydrological modle CLSM.

https://disc.gsfc.nasa.gov/datasets/GRACEDADM_CLSM0125US_7D_4.0/summary?keywords=GRACE%20TWS

Investigate also the readme file of the dataset (available at the same link, under the Documentation tab) and answer the following questions in the markdown cell below:

- What is the content of the variable `gws_inst`
- What is the unit of the variable `gws_inst`
- What is the basic purpose of the dataset?
- What is the meaning of low or high `gws_inst` values?

Then, via OpenDAP, retrieve data over mid-eastern USA (south of 40 degree lat, north of 33 degree lat, and east of -85 degree) during any week in the available time series but only for the variable `gws_inst`, its coordinates and its time variable. Add your code in the code cell below.

### Answer
The variable ```gws_inst``` contains data on Groundwater Percentile as a drought indicator. 
The variable ```gws_inst``` has a unit of %. 
The purpose of the dataset is to provide groundwater and soil moisture drought indicators for North America. 
Low ```gws_inst``` values imply low groundwater content, and increased probability of a drought. The opposite is true for higher values. 

In [None]:
from pydap.client import open_url
from pydap.cas.urs import setup_session
import numpy as np 
import matplotlib.pyplot as plt
from shapely.geometry import Polygon, Point

In [None]:
# still receiving an "expired SSL certificate error"
# code in this cell is to bypass that error
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
# accessing the GES DISC data using a url for Jan 6th and including all coordinates and gw_inst data
url = 'https://hydro1.gesdisc.eosdis.nasa.gov/opendap/hyrax/GRACEDA/GRACEDADM_CLSM0125US_7D.4.0/2020/GRACEDADM_CLSM0125US_7D.A20200106.040.nc4?lat[0:1:223],lon[0:1:463],time[0:1:0],gws_inst[0:1:0][0:1:223][0:1:463]'

# creating a login session and assinging the data to the "data" variable 
username = 'fadams'
password = 'Herewego007'
session = setup_session (username, password, check_url=url)
data = open_url (url, session = session)

In [None]:
# assigning all the data arrays to relevant variable names
lat = data.lat[:].data
lon = data.lon[:].data
time = data.time[:].data
gw = data.gws_inst.array[:].data
lonGrid, latGrid = np.meshgrid(lon,lat)

In [None]:
data.lat[:].data

In [None]:
gw.shape


In [None]:
# This cell calculates the new bounds based on the given conditions
# mid-eastern US bounds
minLat = 33
maxLat = 40
minLon = -85
# maxLon = -67

# indexed boundaries 
minLatBnd = np.argmin(np.abs(lat-minLat))
maxLatBnd = np.argmin(np.abs(lat-maxLat))
minLonBnd = np.argmin(np.abs(lon-minLon))
# maxLonBnd = np.argmin(np.abs(lon-maxLon))

# grid values for the new boundary conditions
latGridBnd = latGrid [minLatBnd:maxLatBnd, minLonBnd:]
lonGridBnd = lonGrid  [minLatBnd:maxLatBnd, minLonBnd:]

# new boundaries for the 1D coordinates 
latBnd = lat [minLatBnd:maxLatBnd]
lonBnd = lon [minLonBnd:]

# groundwater data within the specificed boundary conditions
gwBnd = gw [:, minLatBnd:maxLatBnd, minLonBnd:]


In [None]:
minLonBnd

In [None]:
plt.pcolormesh (lonGridBnd, latGridBnd, gwBnd[0])

---
### B.2: Selecting data within Virginia and North Carolina

(10 points)

For the dataset above, generate two data masks that will allow you to extract only those grid cells that are located inside the state of Virginia and North Carolina. The boundaries for both states are given in the files `boundary_VA.csv` and `boundary_NC.csv`. Then plot the dataset and the boundary for the month you chose above. Describe the results (values and their spatial variablity). We will use the mask in the next step.

In [None]:

VA_file = np.genfromtxt('boundary_VA.csv', delimiter = ',', skip_header=1)
NC_file = np.genfromtxt('boundary_NC.csv', delimiter = ',', skip_header=1)

VA_tupleList = [tuple(i) for i in VA_file]
NC_tupleList = [tuple(i) for i in NC_file]

VA_poly = Polygon (VA_tupleList)
NC_poly = Polygon (NC_tupleList)

VA_mask = np.zeros (gwBnd.shape[1:])
NC_mask = np.zeros (gwBnd.shape[1:])
# combineMask = np.zeros (gwBnd.shape)

lenLat = len (latBnd)
lenLon = len (lonBnd)

for i in range (lenLat):
    for j in range (lenLon):
        
        pt = Point (latBnd[i], lonBnd[j])
        
        VA_mask [i,j] = int (VA_poly.contains(pt))
        NC_mask [i,j] = int (NC_poly.contains(pt))
        combineMask = VA_mask + NC_mask        


# VA_mask.shape

# NC_poly


In [None]:
plt.figure(figsize = (14,6))
plt.pcolormesh (lonGridBnd, latGridBnd, gwBnd[0], cmap = 'nipy_spectral')
plt.plot (VA_file[:,0], VA_file[:,1])
plt.plot (NC_file[:,0], NC_file[:,1], 'g')

ADD YOUR DESCRIPTION OF THE RESULTS HERE

---
### B.3: Analyzing time series for Virginia and North Carolina

(20 points)

Expand your retrieval of the GRACEDADM_CLSM0125US dataset to at least two successive years of your choice (you have to batch process the url-links).  You can either download the list of filenames from the NASA website or generate them automatically by updating the dates in the links.

Estimate a time series for the region average of `gws_inst` for the state of Virginia and the state of North Carolina. Plot the time series, compare and discuss the results.

In [None]:
file = open ('./subset_GRACEDADM.txt', 'r')
filenames = file.readlines()
file.close
# print (filenames)
filenames = [i.rstrip('\n') for i in filenames]

# removing first entry because it is a README file 
filenames = filenames[1:]
filenames[0]
# len(filenames)

In [None]:
from tqdm import tqdm

In [27]:

### ADD YOUR CODE HERE
# obtaining data for 2 consecutive years, from 2019 to 2020 

# using an opendap URL as a reference 
urlRef = 'https://hydro1.gesdisc.eosdis.nasa.gov/opendap/hyrax/GRACEDA/GRACEDADM_CLSM0125US_7D.4.0/2020/GRACEDADM_CLSM0125US_7D.A20200106.040.nc4'

# finding index of the part of the url where the differeb
url_idx = urlRef.index('hyrax')

# customisations to the end of the url for variables including coordinates, time and groundwater percentile 
url_end = '?lat[0:1:223],lon[0:1:463],time[0:1:0],gws_inst[0:1:0][0:1:223][0:1:463]'

data = {}
gws = {}
N = len(filenames)

for i in tqdm(filenames):
    for j in range(N):
        
        fname = i[(url_idx-3):]
    
        url = 'https://hydro1.gesdisc.eosdis.nasa.gov/opendap/hyrax/' + fname + url_end
        # print(url)
        
        data =  open_url (url)

        # dataset = open_url (url, session = session)
        # gws[str(j)] = dataset.gws_inst.array[:].data

        

    
# filenames.index(filenames[0])
# # data[]
# filenames.index(filenames[0])
filenames[0][(url_idx-3):]

  2%|▏         | 2/104 [01:47<1:31:29, 53.82s/it]


KeyboardInterrupt: 

In [None]:
data


ADD YOUR DISCUSSION OF THE RESULTS HERE


---
### B.4: Explorative Exercise: Module Cartopy (extra credit)

(10 extra credit points)

Expand your Python knowledge and investigate the module `cartopy`. 

Study their documentation pages here: https://scitools.org.uk/cartopy/docs/v0.15/matplotlib/intro.html# and here https://scitools.org.uk/cartopy/docs/v0.15/matplotlib/advanced_plotting.html?highlight=pcolormesh
to find out how to plot your map(s) from task B.2 in any preferred projection and adding coastlines to the plot.

To install the package on your computer/webapp, if you haven't yet, use:

```python
conda install cartopy
```


In [None]:

### ADD YOUR CODE HERE
