<img style="float: left;" src="../earth-lab-logo-rgb.png" width="150" height="150">

# Earth Data Science Corps Summer 2020

![Colored Bar](../colored-bar.png)

## What Is a Workflow Diagram?

A workflow diagram
provides a high-level overview of the steps that you will need to complete 
to address the question that you are asking in your project. 

The example diagram below shows a project in which two input 
datasets are used to create an output product. For this example,
let's pretend that the output product is a map of the places that are the most 
vulnerable to drought conditions in your area. 

<img src="images/workflow1.png">

Looking at the diagram above, imagine the following:

1. **Input Data:** Each green box (Branch #1 and Branch #2) represents the 
data that you will use to complete your project. Each data set may need to 
be processed separately at first, prior to beginning your analysis. 

2. **Cleaning / Processing:** Once you have identified the data that you need to 
use, you can then import and clean up the data. Cleaning steps may include things like:

    * Removing NA or missing data values
    * Exploring the data to see if there are any unusual values 
    * Clipping or filtering out parts of the data that you don't need for your analysis

Once you have completed the data cleaning process, you are ready to analyze your data.

3. **Analysis:** In this step, you may manipulate the data to convert it into information 
that helps you answer your question. Examples include:
    * calculating a vegetation index such as NDVI
    * performing classification on your data
    * summarizing and calculating statistics on the data, if your data are time series

4. **Output Data:** Finally, you might use your analysis to produce an output dataset. Following the examples above, outputs may include:
    * a GeoTIFF of NDVI for your region
    * a classified raster
    * a new .csv file containing summarized time series values
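
The four steps above can be sketched for a single branch in a few lines of Python. This is only a minimal illustration using the standard library: the precipitation values, the `-999` "no data" marker, and the function names are all made up for the example, and a real project would read its input from a file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# 1. Input data: hypothetical daily precipitation (mm). "NA" marks a missing
#    value and -999 is an assumed "no data" code.
raw_rows = [
    {"date": "2020-06-01", "precip_mm": "2.0"},
    {"date": "2020-06-02", "precip_mm": "NA"},
    {"date": "2020-06-03", "precip_mm": "4.0"},
    {"date": "2020-07-01", "precip_mm": "-999"},
    {"date": "2020-07-02", "precip_mm": "1.0"},
    {"date": "2020-07-03", "precip_mm": "3.0"},
]


def clean(rows):
    """2. Cleaning: drop missing values and physically impossible readings."""
    cleaned = []
    for row in rows:
        if row["precip_mm"] in ("", "NA"):
            continue
        value = float(row["precip_mm"])
        if value < 0:
            continue
        cleaned.append({"date": row["date"], "precip_mm": value})
    return cleaned


def monthly_means(rows):
    """3. Analysis: summarize the daily time series by month (YYYY-MM)."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["date"][:7]].append(row["precip_mm"])
    return {month: mean(values) for month, values in sorted(by_month.items())}


def to_csv(summary):
    """4. Output: format the monthly summary as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["month", "mean_precip_mm"])
    for month, value in summary.items():
        writer.writerow([month, value])
    return buffer.getvalue()


summary = monthly_means(clean(raw_rows))
print(to_csv(summary))
```

Keeping each step in its own function mirrors the boxes in the diagram, which makes it easier to revisit one step without touching the others.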
  
*****

### Branch 2 - A Second Input Data Set

The steps above outlined the initial processing for a single dataset. But maybe your project will require
more than one dataset. For example, if you are working with remote sensing imagery, maybe there are 
vector boundaries (study area boundaries, county boundaries, etc.) that you need to complete your final analysis. 

You will need to start the same process of opening, exploring and cleaning up that second dataset, prior to 
completing your final analysis. For instance, you may open up the vector data and discover that it is 
in a different Coordinate Reference System than your imagery. You may need to clean up the data and 
reproject the data first, before trying to work with the two datasets together.
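
As a rough sketch of that check, the snippet below compares the CRS labels of two datasets and lists the ones that would need reprojecting. The dictionaries, dataset names, and EPSG codes are hypothetical stand-ins for real metadata, and the code only detects a mismatch. The reprojection itself would be done with a geospatial library (for example, GeoPandas offers a `to_crs` method for vector data).

```python
# Hypothetical metadata for the two branches; the EPSG codes are examples only.
imagery = {"name": "imagery", "crs": "EPSG:32613"}
boundary = {"name": "study_area_boundary", "crs": "EPSG:4326"}


def needs_reprojection(dataset, target_crs):
    """Return True when the dataset's CRS label differs from the target CRS."""
    return dataset["crs"].upper() != target_crs.upper()


# Assumption for this sketch: the imagery's CRS is the one to standardize on.
target_crs = imagery["crs"]
to_reproject = [
    dataset["name"]
    for dataset in (imagery, boundary)
    if needs_reprojection(dataset, target_crs)
]
print(to_reproject)
```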


## An Example Workflow Diagram

<img src="images/workflow1.png">

## Putting It All Together: Using Two Datasets in a Project

Once you have cleaned up all of the data that you need for your project, you are ready to
begin working with the datasets together. This is what the diagram above demonstrates: you
clean and process the combined data, starting a new "branch" of your workflow. 

### Save Intermediate Outputs

In a workflow like the one above, consider saving some of the intermediate outputs. This will 
save you time in the future if you need to redo your analysis. If you are processing 
truly large datasets, this will also save time if your workflow stalls while running, because you can restart from the last saved output. 
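
One common way to do this is a "load it if it exists, otherwise compute and save it" pattern. The sketch below uses only the Python standard library; the function names, the file name, and the values are hypothetical, and JSON is just one possible format for an intermediate file.

```python
import json
import tempfile
from pathlib import Path


def expensive_cleaning_step():
    """Stand-in for a slow processing step; returns made-up values."""
    return {"site_a": [1.5, 2.5], "site_b": [3.0]}


def load_or_compute(cache_path, compute):
    """Reuse a saved intermediate output if present; otherwise compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text()), "loaded from disk"
    result = compute()
    cache_path.write_text(json.dumps(result))
    return result, "computed"


# A temporary directory keeps this example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "cleaned_data.json"
    first, first_status = load_or_compute(cache_file, expensive_cleaning_step)
    second, second_status = load_or_compute(cache_file, expensive_cleaning_step)

print(first_status, "->", second_status)
```

The second call skips the slow step entirely, which is the time saving described above.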

## There Are No Perfectly Linear Workflows

Keep in mind that a workflow is rarely linear. Often you will spend additional time on the data cleanup steps, exploring your data and trying various cleanup approaches. With that said, it is helpful to start with a linear sketch to help you organize your process without getting too overwhelmed by all of the steps involved. 

## Break Down an Earth Data Science Workflow Into Steps

1. Define the question or challenge that your data analysis needs to address.

The first step is to clarify the problem you are addressing with your data analysis. Hopefully at this point, you've already worked through the project message box, which is where you'd address this topic.  

2. Define your project outcomes or end products.

In this step, you will define what you hope to produce in order to address your question or challenge, whether that means explaining, predicting, estimating, or building something.

3. Define the data that you will need to achieve these outcomes. 

Sometimes you know exactly what data you will need, but in other cases you may need to do some searching (see the next step). Regardless, before you begin any actual work on your project, you will need to identify the data that you will use!


4. Has any work like this been done before?

There is no need to reinvent the wheel when working with data. If someone else has a workflow that you can use or modify, consider starting there. Search online for papers, articles, or blog posts that describe a similar goal and outcome and that may help inform your processing steps. 

Through this search you can start to:

5. Identify what is needed to implement those methods and which methods you need to learn more about.

Once you have done the work to address all of the steps above, you likely will have some questions. 
Take some time to identify what you need to learn more about. This is where you may ask 
someone who knows more than you do about this type of workflow for guidance! 


<img src="images/workflow2.png">


### 1. Identify Input Datasets (e.g. what data is needed and where you can find it?)

Searching for datasets can be a time-consuming step of any project, but getting the correct dataset for your analysis is crucial. Here are a few things that are important to keep in mind while searching for datasets: 

* Specify resolutions (e.g. spatial and temporal) and timeframes (e.g. dates needed for analysis). The resolution you need depends on what you are studying. If, for example, you were looking at multi-decade deforestation patterns across the Amazon Rainforest as a whole, a low temporal and spatial resolution could work for your project. But if you were looking at the recovery of a forested area from a small fire that happened last year, you would need a much higher spatial and temporal resolution. Sometimes, even if you find data that is technically correct for your project, it is still not useful due to resolution limitations.
* Specify data sources (e.g. public data portals like Earth Explorer, data from project partners). Once you know more specifically what type of data you are searching for, start identifying data sources that are appropriate for your project. Often, data can be collected from large publicly available datasets, but not always. Because of this, it's important to ask:
    * Are there datasets that you need but don't have yet? If you are collecting more specialized data, make sure you're in contact with the agency you're collecting the data from. This helps ensure that the data still fits your project and will be ready to integrate into it.  
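
Writing your resolution and timeframe needs down explicitly makes it easier to screen candidate datasets. The sketch below shows one way to do that in Python. The requirement values and the candidate datasets (names, pixel sizes, and date ranges) are invented for illustration and do not describe real data products.

```python
from datetime import date

# Hypothetical project needs: pixels of 30 m or finer, covering summer 2019.
needs = {"max_pixel_m": 30, "start": date(2019, 6, 1), "end": date(2019, 8, 31)}

# Hypothetical candidates; names and numbers are illustrative only.
candidates = [
    {"name": "coarse_daily", "pixel_m": 500,
     "start": date(2000, 1, 1), "end": date(2020, 1, 1)},
    {"name": "fine_recent", "pixel_m": 10,
     "start": date(2019, 7, 15), "end": date(2020, 1, 1)},
    {"name": "medium_long", "pixel_m": 30,
     "start": date(1990, 1, 1), "end": date(2020, 1, 1)},
]


def meets_needs(dataset, needs):
    """Check spatial resolution and temporal coverage against the project needs."""
    fine_enough = dataset["pixel_m"] <= needs["max_pixel_m"]
    covers_period = (
        dataset["start"] <= needs["start"] and dataset["end"] >= needs["end"]
    )
    return fine_enough and covers_period


usable = [dataset["name"] for dataset in candidates if meets_needs(dataset, needs)]
print(usable)
```

Here the coarse dataset fails on spatial resolution and the fine one fails on coverage, since it starts after the period of interest begins.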

<img src="images/workflow3.png">

## Time to Get Started On Your Project

After you've done your workflow research, you are ready to begin building out your workflow.
Below, each step of a general workflow is discussed in more detail.

### 2. Clean and Process Data (e.g. standardize, modify to study area or timeframe)

As you've seen, data can be messy when it's distributed. It's important to clean up the data before running analysis on it. Some key tasks might be:
* Data standardization tasks
    * Data clean-up (e.g. data contain site names that are not consistent, such as SJER vs. San Joaquin Experimental Range).
    * Making data collected in different units or coordinate systems compatible with other datasets
      * Pick the most appropriate units for the study area of your project and convert all your data to that standard.
* Ensure that your analysis only covers the spatial and/or temporal extent needed for analysis.
    * To spatially limit your analysis, you can crop your data to whatever your study area is.
    * To temporally limit your analysis, you can ensure that the start and end dates of your data collection only cover the time period you are interested in. You can also summarize your data by the minimal time step needed (e.g. monthly, yearly) if its temporal resolution is too high. 
* Add new attributes to your data. Not all datasets come complete with every piece of information, so make sure to add data to the dataset if you need to. 
    * Categorize or bin data as needed to aid with analysis.
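
The tasks above can be combined into one small cleaning pass. The following is a minimal sketch under stated assumptions: the records, the site-name lookup, the study period, and the 12 m cutoff used for binning are all hypothetical values chosen for the example.

```python
from datetime import date

# Hypothetical field records with inconsistent site names and mixed units.
records = [
    {"site": "SJER", "date": date(2020, 5, 10), "height": 10.0, "unit": "m"},
    {"site": "San Joaquin Experimental Range", "date": date(2020, 6, 15),
     "height": 50.0, "unit": "ft"},
    {"site": "sjer", "date": date(2019, 6, 15), "height": 4.0, "unit": "m"},
]

# Lookup used to standardize site names (lowercased variants -> one code).
SITE_NAMES = {"sjer": "SJER", "san joaquin experimental range": "SJER"}
FEET_TO_METERS = 0.3048
START, END = date(2020, 1, 1), date(2020, 12, 31)  # assumed study period


def height_class(height_m):
    """Bin heights into two categories; the 12 m cutoff is arbitrary."""
    return "tall" if height_m >= 12 else "short"


cleaned = []
for rec in records:
    # Temporal limit: keep only records inside the study period.
    if not (START <= rec["date"] <= END):
        continue
    # Unit standardization: convert everything to meters.
    height_m = rec["height"] * FEET_TO_METERS if rec["unit"] == "ft" else rec["height"]
    cleaned.append({
        "site": SITE_NAMES[rec["site"].lower()],   # standardized site name
        "date": rec["date"],
        "height_m": round(height_m, 2),
        "height_class": height_class(height_m),    # new attribute
    })

for row in cleaned:
    print(row)
```

An unknown site name raises a `KeyError` here, which is a deliberate choice for the sketch: it surfaces names that the lookup does not cover yet instead of silently passing them through.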

<img src="images/workflow4.png">

### 3. Analyze Your Data

* Go back to the question or challenge you are trying to address, and determine the analysis that will turn your data (**input**) into something you can use (**output**). Below are some examples of types of analysis you might run on your data: 
    * Summarization
    * Classification
    * Raster calculation
    * Vector overlay
    * Linear regression
    * Other types of analysis not listed here!
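
To make two of these concrete, the sketch below performs a raster calculation (NDVI, computed as (NIR - Red) / (NIR + Red)) followed by a simple threshold classification. The 2 x 2 grids of reflectance values and the 0.4 threshold are made up for illustration, and plain Python lists stand in for the arrays that a raster library would normally provide.

```python
# Hypothetical 2 x 2 reflectance grids for the red and near-infrared bands.
red = [[0.1, 0.2], [0.3, 0.25]]
nir = [[0.5, 0.6], [0.3, 0.75]]


def ndvi(nir_value, red_value):
    """Raster calculation for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir_value + red_value
    # Avoid dividing by zero where both bands are 0.
    return (nir_value - red_value) / total if total else float("nan")


# Apply the calculation pixel by pixel.
ndvi_grid = [
    [ndvi(n, r) for n, r in zip(nir_row, red_row)]
    for nir_row, red_row in zip(nir, red)
]

# Classification: label each pixel using an arbitrary example threshold.
THRESHOLD = 0.4
classified = [
    ["vegetated" if value >= THRESHOLD else "sparse" for value in row]
    for row in ndvi_grid
]

for row in classified:
    print(row)
```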

<img src="images/workflow5.png">

### 4. Determine Output Data Needs

* What does the output need to look like (e.g. format, structure) to create your desired end product? Below are some possible outputs your analysis may produce:
    * Classified raster image
        * e.g. flooded area analysis, normalized difference vegetation index (NDVI) to look at the vegetation coverage of an area, land cover, etc. 
        * This could also involve analyzing images over time or at multiple study sites.
    * Spreadsheet of summary statistics that can later be turned into a plot.
        * e.g. the mean of a variable over time and/or by study site
* Here is where sketching a workflow really shines!
    * **Map inputs to outputs** through cleaning and processing your data and running it through your chosen analysis.
        
<img src="images/workflow6.png">

### 5. Create Desired End Product

* The end product is usually a visualization summarizing the outputs of the analysis you ran on your input data. Visualizations are powerful tools and can be created in more than one way (e.g. maps, plots)!
  * Think about choosing 3 key visualizations that summarize your results
    * Check out <a href="https://www.edwardtufte.com/tufte/" target="_blank">Edward Tufte's work for inspiration</a>
  * You can also create and/or clean up some of these visuals outside of Jupyter Notebook and use Markdown to display them in Jupyter Notebook
    * Sometimes it's easiest to modify visuals outside of Jupyter Notebook once you've produced them. This could be done to combine maps or plots, add text to images in Adobe or PowerPoint, etc. 
    * Use Google Drive, Dropbox, etc. to provide a URL to any images that were produced outside of your Jupyter Notebook (e.g. all images in this presentation)
        
<img src="images/workflow2.png">

### 6. Iterate Through and Expand Workflow as Needed

* Workflows are rarely linear, and often consist of combining multiple workflows into one. Add branches to create intermediary products as needed for the end product. 
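
A branching workflow can also be written down as a small dependency graph, which makes it easy to see a valid order in which to run the steps. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (available in Python 3.9 and later). The step names are hypothetical and simply mirror the two-branch example used throughout this page.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical step names).
workflow = {
    "clean_imagery": {"open_imagery"},
    "clean_boundary": {"open_boundary"},
    # The two branches join here.
    "crop_imagery_to_boundary": {"clean_imagery", "clean_boundary"},
    "calculate_ndvi": {"crop_imagery_to_boundary"},
    "make_map": {"calculate_ndvi"},
}

# static_order() yields every step after all of the steps it depends on.
order = list(TopologicalSorter(workflow).static_order())
for step_number, step in enumerate(order, start=1):
    print(step_number, step)
```

Adding a new branch or intermediary product is then a matter of adding one more entry to the dictionary.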

<img src="images/workflow1.png">

## Tools for Drawing/Sketching Project Workflows

* [Lucidchart](https://www.lucidchart.com/pages/) (used for the in-class demo)
* [List of other free tools](https://medium.com/pm101/8-flowcharts-and-diagrams-apps-837373859e87)

## Ideas for Getting Started with Sketching

Any of these areas is a great place to start:
* Start with Data:
    * Identify tasks needed to:
        * clean data (e.g. remove null values, add/remove columns)
        * standardize data (e.g. resample to same resolution)
        * analyze data and produce output data
    * Identify which tasks need to be repeated across a list of items (e.g. multiple sites)
* Start with End Product:
    * Identify the final output file(s) and the intermediary files needed to create it
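
Tasks that repeat across a list of items are good candidates for a function that you call once per item. Here is a minimal sketch, assuming made-up measurements for two hypothetical sites, with `None` marking a missing value.

```python
from statistics import mean

# Hypothetical measurements per site; None marks a missing value.
site_data = {
    "site_a": [2.0, None, 4.0],
    "site_b": [10.0, 20.0],
}


def process_site(values):
    """One site's workflow: clean (drop missing values), then analyze (mean)."""
    cleaned = [value for value in values if value is not None]
    return mean(cleaned)


# Repeat the same task for every site in the list.
results = {site: process_site(values) for site, values in site_data.items()}
print(results)
```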

## Other Suggestions For Project Success

* Create a project plan/timeline 
    * Begin with a blank weekly calendar
        * Add weekly activities and milestones for project
* Use <a href="https://www.smartsheet.com/blog/essential-guide-writing-smart-goals" target="_blank">SMART Goals</a> to identify individual tasks
    * Specific
    * Measurable
    * Achievable
    * Relevant
    * Time-bound
* Map/Assign these tasks to the related milestone