<img style="float: left;" src="earth-lab-logo-rgb.png" width="150" height="150" />

# Homework Template: Earth Analytics Python Course: Spring 2020

Before submitting this assignment, be sure to restart the kernel and run all cells. To do this, pull down the Kernel drop down at the top of this notebook. Then select **restart and run all**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below.

* IMPORTANT: Before you submit your notebook, restart the kernel and run all! Your first cell in the notebook should be `[1]` and all cells should run in order! You will lose points if your notebook does not run. 

For all plots and code in general:

* Add appropriate titles to your plot that clearly and concisely describe what the plot shows (e.g. time, location, phenomenon).
* Be sure to use the correct bands for each plot.
* Specify the source of the data for each plot using a plot caption created with `ax.text()`.
* Place ONLY the code needed to create a plot in the plot cells. Place additional processing code ABOVE that cell (in a separate code cell).
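
As a rough illustration of the plot guidelines above, the sketch below adds a title, axis labels, formatted dates, and a data-source caption with `ax.text()`. The dates, values, site label, and caption wording are made up for demonstration and are not taken from the assignment data.

```python
from datetime import datetime

import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter

# Made-up values for illustration only (not real NDVI results)
dates = [datetime(2017, 1, 15), datetime(2017, 4, 15), datetime(2017, 7, 15)]
values = [0.2, 0.5, 0.7]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(dates, values, marker="o", label="Example site")
ax.set(title="Mean NDVI at an Example Site, 2017 (Illustrative Values)",
       xlabel="Month",
       ylabel="Mean NDVI")

# Show abbreviated month names instead of full timestamps
ax.xaxis.set_major_formatter(DateFormatter("%b"))
ax.legend()

# Caption placed below the axes, in axes-relative coordinates
caption = ax.text(0.5, -0.15, "Data source: Landsat (illustrative values)",
                  transform=ax.transAxes, ha="center")
```

Using `transform=ax.transAxes` keeps the caption in the same place relative to the axes regardless of the data range.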

Make sure that you:

* **Only include the package imports, code, data, and outputs that are CRUCIAL to your homework assignment.**
* Follow PEP 8 standards. Use the `pep8` tool in Jupyter Notebook to ensure proper formatting (however, note that it does not catch everything!).
* Keep comments concise and strategic. Don't comment every line!
* Organize your code in a way that makes it easy to follow. 
* Write your code so that it can be run on any operating system. This means that:
   1. The data should be downloaded in the notebook to ensure it's reproducible.
   2. All paths should be created dynamically using the `os` package to ensure that they work across operating systems.
* Check for spelling errors in your text and code comments.
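
A minimal sketch of building paths dynamically is shown below. The directory names (`earth-analytics`, `data`, `ndvi-automation`, `sites`) are assumptions about where the data might live on your machine, so adjust them to match your own download location.

```python
import os

# Assumed layout: data stored under "earth-analytics/data" in the home
# directory. These folder names are illustrative, not guaranteed.
data_dir = os.path.join(os.path.expanduser("~"), "earth-analytics", "data")

# os.path.join inserts the correct separator for the current OS,
# so no "/" or "\\" is hard-coded in the path
sites_dir = os.path.join(data_dir, "ndvi-automation", "sites")
print(sites_dir)
```

Because the separators come from `os.path.join`, the same code should work on Windows, macOS, and Linux.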


In [None]:
NAME = ""
COLLABORATORS = ""

![Colored Bar](colored-bar.png)

# Week 07 and 08 Homework - Automate NDVI Workflow

For this assignment, you will write code to generate a plot and an output CSV file of the mean normalized difference vegetation index (NDVI) for two different sites in the United States across one year of data:

* San Joaquin Experimental Range (SJER) in Southern California, United States
* Harvard Forest (HARV) in the Northeastern United States

The data that you will use for this week is available from **earthpy** using the following download: 

`et.data.get_data('ndvi-automation')`

## Assignment Goals

Your goal in this assignment is to create the most efficient and concise workflow that you can. The workflow should:

1. Scale if you add new sites or more time periods to the analysis.
2. Be easy for someone else to understand.
3. Use the LEAST amount of code needed to complete the task, and run as fast as possible.
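
One possible way to make a workflow scale is to discover sites and scenes from the directory structure rather than hard-coding them. The sketch below assumes a hypothetical layout of one folder per site with one subfolder per scene. The function name `list_scene_dirs` and the temporary demo folders are illustrative only.

```python
import os
import tempfile
from glob import glob


def list_scene_dirs(sites_dir):
    """Map each site name to a sorted list of its scene directories.

    Assumes one folder per site, each containing one folder per scene.
    """
    scenes = {}
    for site_path in sorted(glob(os.path.join(sites_dir, "*"))):
        if not os.path.isdir(site_path):
            continue
        site_name = os.path.basename(site_path)
        scenes[site_name] = sorted(
            path for path in glob(os.path.join(site_path, "*"))
            if os.path.isdir(path))
    return scenes


# Demo on a throwaway directory tree with made-up scene names
with tempfile.TemporaryDirectory() as tmp:
    for site, scene in [("HARV", "scene-1"),
                        ("SJER", "scene-1"),
                        ("SJER", "scene-2")]:
        os.makedirs(os.path.join(tmp, site, scene))
    scenes = list_scene_dirs(tmp)
    counts = {site: len(paths) for site, paths in scenes.items()}

print(counts)
```

With this approach, adding a new site folder or more scenes requires no changes to the code.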

### HINTS

* Remove values outside of the Landsat valid range, as specified in the metadata, as needed.
* Keep any output files SEPARATE FROM input files. Outputs should be created in an outputs directory that is created in the code (if needed) and/or tested for.
* It can help to create the plot and CSV first without cleaning the data to deal with clouds, so you can get the hang of the workflow. Then, you can modify your workflow to include the cleaning of the data to deal with clouds. (There are tests throughout the notebook that can help you check the data!)
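
The first hint can be sketched with NumPy as follows. The valid range of 0 to 10000 is an assumption here, so confirm it against the metadata for your data. The tiny arrays are made-up values, and `mask_invalid` is a hypothetical helper name. NDVI is computed with the standard formula (NIR - Red) / (NIR + Red).

```python
import numpy as np

# Made-up 2x2 red and near-infrared bands for illustration only
red = np.array([[500.0, 12000.0], [800.0, 300.0]])
nir = np.array([[3000.0, 4000.0], [-50.0, 2700.0]])

# Assumed valid range; check the metadata for the real limits
VALID_MIN, VALID_MAX = 0, 10000


def mask_invalid(band, valid_min=VALID_MIN, valid_max=VALID_MAX):
    """Return a copy of band with out-of-range values set to NaN."""
    in_range = (band >= valid_min) & (band <= valid_max)
    return np.where(in_range, band, np.nan)


red_clean = mask_invalid(red)
nir_clean = mask_invalid(nir)

# NaN values propagate, so invalid pixels drop out of the mean below
ndvi = (nir_clean - red_clean) / (nir_clean + red_clean)
mean_ndvi = np.nanmean(ndvi)
print(mean_ndvi)
```

`np.nanmean` ignores the masked pixels, so the mean reflects only pixels where both bands are valid.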


## Assignment Requirements

Your submission to the GitHub repository should include:
* This Jupyter Notebook file (.ipynb) with:
    * The code to create a plot of mean NDVI across the year:
        * Formatted dates on the x axis and NDVI on the y axis for both NEON sites on one figure/axis object
    * The **data should be cleaned to remove the influence of clouds**. See the [earthdatascience website for an example of what your plot might look like with and without removal of clouds](https://www.earthdatascience.org/courses/earth-analytics-python/create-efficient-data-workflows/).
* One output .csv file that has 3 columns - NDVI, Date and Site Name - with values for SJER and HARV.
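
A minimal sketch of writing such a file into a separate outputs directory is shown below. It uses the standard library `csv` module so that it runs anywhere. If your results are already in a pandas DataFrame, `DataFrame.to_csv` would be the more typical choice. The function name, file name, column names, and rows are all made up for illustration.

```python
import csv
import os
import tempfile


def write_ndvi_csv(rows, output_dir, filename="ndvi_means.csv"):
    """Write (date, site, mean_ndvi) rows to a CSV and return its path.

    The output directory is created if it does not already exist.
    """
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, filename)
    with open(out_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["date", "site", "mean_ndvi"])
        writer.writerows(rows)
    return out_path


# Made-up rows for illustration only (not real NDVI results)
rows = [("2017-01-01", "SJER", 0.25), ("2017-01-01", "HARV", 0.5)]

# A temporary directory stands in for your project directory here
with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_ndvi_csv(rows, os.path.join(tmp, "outputs"))
    with open(csv_path, newline="") as csv_file:
        written = list(csv.reader(csv_file))

print(written)
```

Passing `exist_ok=True` to `os.makedirs` means the code runs cleanly whether or not the outputs directory already exists.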

Your notebook should:
* Have at least 2 well documented and well named functions with docstrings.
* Include a Markdown cell at the top of the notebook that outlines the overall workflow using pseudocode (i.e. plain language, not code)
* Include additional Markdown cells throughout the notebook to describe: 
    * the data that you used - and where it is from
    * how data are being processed
    * how the code is optimized to run fast and be more concise
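
As one example of a well-named function with a docstring, the sketch below parses an acquisition date from a scene directory name. It assumes the date appears as `YYYYMMDD` at character positions 10 to 17 of the name. The function name and the example scene name are hypothetical, so check them against your own directory names.

```python
from datetime import datetime


def extract_scene_date(scene_name, start=10, end=18):
    """Parse the acquisition date from a scene directory name.

    Parameters
    ----------
    scene_name : str
        Name of the scene directory (not the full path).
    start, end : int
        Slice positions of the YYYYMMDD date within the name.

    Returns
    -------
    datetime.datetime
        Acquisition date of the scene.
    """
    return datetime.strptime(scene_name[start:end], "%Y%m%d")


# Hypothetical scene name, used only to demonstrate the function
example_date = extract_scene_date("LC080110302017031701T1-SC20181023151837")
print(example_date.date())
```

The docstring states what the function does, what each parameter means, and what is returned, which is the level of documentation the requirements above ask for.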

In [None]:
# Autograding imports - do not modify this cell
import matplotcheck.autograde as ag
import matplotcheck.notebook as nb
import matplotcheck.timeseries as ts
from datetime import datetime

In [None]:
# Import needed packages in PEP 8 order
# and no unused imports listed (10 points total)

# YOUR CODE HERE
raise NotImplementedError()

### DO NOT REMOVE THIS LINE ###
start_time = datetime.now()

# Figure 1: Plot 1 - Mean NDVI For Each Site Across the Year (50 points)

Create a plot of the mean normalized difference vegetation index (NDVI) for the two different sites in the United States across the year: 

* Formatted dates on the x axis and NDVI on the y axis for both NEON sites on one figure/axis object.
* Each site should be identified with a different color in the plot and legend.
* The final plot **data should be cleaned to remove the influence of clouds**.
* Be sure to include appropriate title and axes labels.

You may add additional cells as needed for processing data (e.g. defining functions, etc.), but be sure to:
* follow the instructions in the code cells that have been provided to ensure that you are able to use the sanity check tests that are provided. 
* include only the plot code in the cell identified for the final plot code below

In [None]:
# This cell is not required but highly encouraged to break down the workflow.

# Create dataframe of NDVI without cleaning data to deal with clouds

# Important: to use the ungraded tests below as a sanity check, 
# name your dataframe 'ndvi_ts_unclean' and the columns: mean_ndvi and site

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Test your dataframe before cleaning the data to deal with clouds. 
# Ensure that your dataframe is named 'ndvi_ts_unclean'
# and that the columns are called: mean_ndvi and site

# These tests are not graded.
# This is for data that hasn't been cleaned yet to deal 
# with clouds and serves as a half way sanity check.

# Ensure the data is stored in a dataframe.
try:
    assert isinstance(ndvi_ts_unclean, pd.DataFrame)
    print('Your data is stored in a DataFrame!')
except AssertionError:
    print('It appears your data is not stored in a DataFrame. ',
          'To see what type of object your data is stored in, check its type with type(object)')

# Ensure there are the correct amount of total entries in the dataframe.
try:
    assert len(ndvi_ts_unclean) == 46
    print('You have the correct number of data values!')
except AssertionError:
    print('You do not have the correct amount of data stored in your DataFrame.')

# Ensure there are the correct amount of entries for each site.
try:
    assert all(
        [site_count == 23 for site_count in ndvi_ts_unclean['site'].value_counts()])
    print('You have the correct amount of both sites!')
except AssertionError:
    print('One of your sites is either missing data or has extra data.')

# Ensure the minimum and maximum values in the mean_ndvi column are correct.
try:
    ndvi_min, ndvi_max = ndvi_ts_unclean['mean_ndvi'].min(), ndvi_ts_unclean['mean_ndvi'].max()
    assert ndvi_min == -0.0020918187219649553 and ndvi_max == 0.8629496097564697
    print('The minimum and maximum values in your mean_ndvi column are correct!')
except AssertionError:
    print('The minimum and maximum values in your mean_ndvi column are incorrect.')

In [None]:
# Create dataframe of NDVI including the cleaning data to deal with clouds

# Important: to use the ungraded tests below as a sanity check, 
# name your dataframe 'ndvi_ts' and the columns: mean_ndvi and site

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Last sanity check before creating your plot

# Ensure that your dataframe is named 'ndvi_ts'
# and that the columns are called: mean_ndvi and site

# These tests are not graded.

# Ensure the data is stored in a dataframe.
try:
    assert isinstance(ndvi_ts, pd.DataFrame)
    print('Your data is stored in a DataFrame!')
except AssertionError:
    print('It appears your data is not stored in a DataFrame. ',
          'To see what type of object your data is stored in, check its type with type(object)')

# Check that dataframe contains the appropriate number of NAN values
try:
    assert ndvi_ts.isna().sum()['mean_ndvi'] == 15
    print('Correct number of masked data values!')
except AssertionError:
    print('The amount of null data in your dataframe is incorrect.')

In [None]:
# Add only the plot code to this cell

# This is the final figure of mean NDVI 
# for both sites across the year
# with data cleaned to deal with clouds

# YOUR CODE HERE
raise NotImplementedError()

### DO NOT REMOVE LINES BELOW ###
final_masked_solution = nb.convert_axes(plt, which_axes="current")
end_time = datetime.now()
total_time = end_time - start_time

In [None]:
# Ignore this cell for the autograding tests


# Question 1 (10 points)

Imagine that you are planning NEON’s upcoming flight season to capture remote sensing data in these locations and want to ensure that you fly the area when the vegetation is the most green.

When would you recommend the flights take place for each site? 

Answer the question in 2-3 sentences in the Markdown cell below.

YOUR ANSWER HERE

# Question 2 (10 points)

How could you modify your workflow to look at vegetation changes over time in each site? 

Answer the question in 2-3 sentences in the Markdown cell below.

YOUR ANSWER HERE

# Do not edit this cell! (40 points)

The notebook includes:
* at least 2 well documented and well named functions with appropriately formatted docstrings.

# Do not edit this cell! (20 points)

The notebook includes:
* a Markdown cell at the top of the notebook that outlines the overall workflow using pseudocode (i.e. plain language, not code).

# Do not edit this cell! (20 points)

The notebook includes:
* additional Markdown cells throughout the notebook to describe: 
    * the data that you used - and where it is from
    * how data are being processed
    * how the code is optimized to run fast and be more concise

# Do not edit this cell! (20 points)

The notebook will also be checked for overall clean code requirements as specified at the **top** of this notebook. Some of these requirements include (review the top cells for more specifics): 

* Notebook begins at cell [1] and runs on any machine in its entirety.
* PEP 8 format is applied throughout (including lengths of comment and code lines).
* No additional code or imports in the notebook that is not needed for the workflow.
* Notebook is fully reproducible. This means:
   * reproducible paths using the os module.
   * data downloaded using code in the notebook.
   * all imports at top of notebook.

# Do not edit this cell! (20 points)

In addition to this notebook, the submission to the GitHub repository includes:
* One output .csv file that has 3 columns - NDVI, Date and Site Name - with values for SJER and HARV