# 💻Coding for the Environment Machine Learning Workshop 🥳

## Agenda

### 1. Introductions
### 2. Setup
### 3. Overview
### 4. Clustering
### 5. Supervised Classification
### 6. Practice in Python

# 1. Introductions 👋

# 2. Setup

### 1. Set up container: https://cmgr.oit.duke.edu/

### 2. Clone workshop stuff:
`git clone https://github.com/caseyslaught/duke-c4e-ml-workshop.git`

---

# 3. Overview 🤔

### Machine learning is a BIG field.
- #### Land cover classification (what land cover class is this?)
- #### Computer vision (what's in this photo?)
- #### Audio processing (what animal is in this recording?)
- #### Automated driving, natural language processing, automated robots, etc...

<br/>

![audio](images/spectrogram.jpeg)
![computer vision](images/computer_vision.jpeg)
![lulc](images/lulc.jpeg)

### A brief history of ML
- #### Machine learning has been around for a while!
- #### First ML programs in the 1950s
- #### Deep Blue beats Garry Kasparov in 1997
- #### Deep learning popularized as a term in 2006
- #### Recent explosion due to new computational ability (GPUs), data availability, and new techniques.

---

<br/>

## We're gonna focus on classification in this workshop.

<br/>

---

## Unsupervised
- #### No labeled data given
- #### Goal is to find some structure in the data
- #### ex. clustering

![kmeans sat](images/kmeans_sat.png)

---

## Supervised
- #### We do have labeled data
- #### We train a model using the labeled data
- #### We then predict the output given a new set of inputs
- #### ex. regression, random forest, neural networks


## Clustering/classification problems
- #### Given a point with some attributes, what group does it belong to?
- #### ex. species (size, hair?, scales?, gills?) → taxonomic class (mammal, reptile, fish)

---

# 4. Clustering 🧮

![](images/clusters_init.png)
![](images/clusters_after.png)

## Clustering is all about grouping points using the data in our dataset

### Why would we want to do this?
- #### Understand the data better (maybe we don't know the groups in advance)
- #### Explore potential classes for subsequent supervised analysis
- #### Easily summarize data (kinda like compression)

### What are some potential problems?
- #### Groups not neatly separated (lots of overlap, weird configurations)
- #### Interpreting results takes domain knowledge

![clusters](images/clustering.png)

---

## Questions so far?

---

## Clustering | K-Means

### 1. Initialize
- #### Define K cluster centers randomly (centroids)


### 2. Iterate
- #### For each point, find the closest cluster center
- #### Calculate the mean of each cluster and make that mean the new cluster center  


### 3. Terminate
- #### If no points were reassigned, we're finished
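
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice to start from K randomly picked data points, and the toy points are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # 1. Initialize: pick k distinct data points as the starting centroids
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # 2a. Iterate: assign each point to its closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # 3. Terminate: stop when no point changed cluster
        if new_labels == labels:
            break
        labels = new_labels
        # 2b. Iterate: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical toy data: two loose groups of 2-D points
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Changing `seed` changes the starting centroids, which is an easy way to see how much the result can depend on initialization.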

<br />

![kmeans](images/kmeans.png)
![](images/kmeans_init.png)
![](images/kmeans_after.png)

## Clustering | K-Means (cont.)

- ### Benefits
    - #### Easy(ish) to understand
- ### Downsides
    - #### Very sensitive to initial cluster locations
    - #### May terminate at local minimum (not global)
    - #### May not know number of clusters in advance

## With K-means we might not end up at the optimal solution.
### To get around this we can: 
#### 1. Calculate some metric of error for each K-means run (ex. sum of squared distances to centroid)
#### 2. Run the algorithm a bunch of times and save the results
#### 3. Pick the result with the lowest error  
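
Here is a rough sketch of that restart strategy. It assumes the usual K-means error (sum of squared distances to the assigned centroid); the helper names `kmeans_once`, `total_error`, and `best_of`, the number of runs, and the toy points are made up for illustration.

```python
import math
import random

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

def total_error(points, centroids, labels):
    """1. Error metric: sum of squared distances from each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of(points, k, n_runs=20, seed=0):
    rng = random.Random(seed)
    # 2. Run the algorithm many times and save every result
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    errors = [total_error(points, cents, labs) for cents, labs in runs]
    # 3. Pick the run with the lowest error
    best = min(range(n_runs), key=lambda i: errors[i])
    return runs[best], errors

# Hypothetical toy data: three loose groups of 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
          (1.0, 9.0), (0.5, 8.0), (1.5, 8.5)]
(best_centroids, best_labels), errors = best_of(points, k=3)
print(sorted(set(round(e, 3) for e in errors)))  # distinct errors seen across runs
```

If the printed list has more than one value, some runs stopped at a worse local minimum than others.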


![](images/kmeans_bad.gif)

---

# Clustering | Mixture of Gaussians

### What if our clusters are not symmetrical?
### What if we have some overlap between clusters?

### Here's how Gaussian Mixture Models work:
- #### Each group is represented by a Gaussian (normal) distribution
- #### Each distribution has parameters: mean, covariance, and height (its weight in the mixture)
- #### Goal is to find best set of parameters for the data
- #### Use the expectation-maximization (EM) algorithm to estimate the parameters
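
To make the EM idea concrete, here is a small one-dimensional sketch. It is simplified on purpose: each Gaussian has a single variance instead of a full covariance matrix, the starting parameters are supplied by hand, and the function names and data are invented for this example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights)."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: how responsible is each Gaussian for each point?
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each Gaussian from its soft assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a variance collapsing to zero
            weights[j] = nj / len(data)
    return means, variances, weights

# Hypothetical data: two groups centered near 1 and 5
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
means, variances, weights = fit_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(means, variances, weights)
```

Unlike K-means, every point gets a soft membership in each group (`resp`), which is what lets the model cope with overlapping clusters.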

<br/>

![gaussian mixture](images/mixture.png)

### Let's revisit the previous example

![](images/kmeans_init.png)
![](images/mixture_after.png)

---

## Clustering | Validation

### How do we evaluate if our model is any good?
#### With supervised learning we can withhold some of our labeled data to test our model.
#### However, unsupervised learning doesn't have a source of truth.

#### Internal validation
- #### Here we are asking, how cohesive (similar to each other) are clusters?
- #### Also, how different are different clusters?
- #### A good result (valid) will have high cohesion within clusters and high separation between clusters.

<br/>

![](images/internal_val.png)
![](images/internal_val_score.png)
#### Silhouette coefficient, Calinski-Harabasz index, Dunn index, Xie-Beni score, Hartigan index
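
As one example of the internal metrics listed above, here is a hand-rolled sketch of the silhouette coefficient. The function name `mean_silhouette` and the toy data are assumptions for this example, and it needs at least two clusters to work.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a = cohesion: mean distance to the other points in the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b = separation: mean distance to the points of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
good_score = mean_silhouette(points, [0, 0, 0, 1, 1, 1])  # groups match the data
bad_score = mean_silhouette(points, [0, 1, 0, 1, 0, 1])   # groups cut across the data
print(good_score, bad_score)
```

A score near 1 means tight, well-separated clusters; scores near 0 or below suggest the grouping does not match the structure of the data.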

<br/>

#### External validation 
- #### We can only do this if we have true labels for the clusters
- #### Compare points from generated result to known clusters  
![](images/external_val.png)
#### Jaccard Similarity, Mutual Information, Fowlkes-Mallows Index
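
A pair-counting version of the Jaccard similarity can be sketched as follows. It compares every pair of points and asks whether the known labels and the generated clusters agree on putting them together. The function name and the example labels are hypothetical.

```python
from itertools import combinations

def pair_jaccard(true_labels, pred_labels):
    """Jaccard similarity over pairs of points grouped together."""
    both = only_true = only_pred = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            both += 1       # together in both groupings
        elif same_true:
            only_true += 1  # together only in the known labels
        elif same_pred:
            only_pred += 1  # together only in the generated clusters
    return both / (both + only_true + only_pred)

true_labels = [0, 0, 0, 1, 1, 1]
perfect = pair_jaccard(true_labels, [1, 1, 1, 0, 0, 0])  # same groups, different names
partial = pair_jaccard(true_labels, [0, 0, 1, 1, 1, 1])  # one point in the wrong group
print(perfect, partial)
```

Because only pairs are compared, the cluster numbers themselves do not need to match the true label names.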

---

# 5. Supervised Classification
### We have *labeled* data
### We already know which classes to use

## The general process of supervised classification is:
### 1. Build a model using training data
### 2. Evaluate that model using testing data
### 3. Use that model on new, unseen data
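
The three steps can be sketched end to end with a deliberately simple stand-in model. The nearest-centroid classifier below is not one of the models covered in this workshop; it is only used because it fits in a few lines. The animal measurements are made-up numbers.

```python
import math
import random

def fit_centroids(samples):
    """Build a model: the mean feature vector of each class."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Predict the class whose centroid is closest to the features."""
    return min(model, key=lambda label: math.dist(features, model[label]))

rng = random.Random(42)
# Hypothetical labeled data: (height in m, weight in tonnes) -> species
data = [((rng.gauss(1.7, 0.1), rng.gauss(2.0, 0.3)), "rhino") for _ in range(50)]
data += [((rng.gauss(3.0, 0.2), rng.gauss(5.0, 0.5)), "elephant") for _ in range(50)]
rng.shuffle(data)

train, test = data[:80], data[80:]       # withhold 20% of the labeled data
model = fit_centroids(train)             # 1. build a model using training data
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)  # 2. evaluate
new_animal = predict(model, (2.9, 4.8))  # 3. use the model on new, unseen data
print(accuracy, new_animal)
```

The important habit is that the test rows are never shown to the model during fitting.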

<br/>

![](images/supervisedlearning.png)

---

## Decision Trees

### 1. We start with a *root node*
- #### No incoming links, but two or more outgoing links  

### 2. We then go to an *internal node*, or condition
- #### Conditions have one incoming link, and two or more outgoing links  
- #### Conditions can use binary (yes/no), categorical (red, blue, green), or continuous **attributes** (3.141592...)

### 3. We end with *leaf nodes* (our classes)
- #### Leaf nodes have one incoming link and no outgoing links

![decision tree](images/decisiontree.png)

### **Root and internal nodes 🥕 make decisions on the data (tall or short)**
### **Leaf nodes 🍃 provide the prediction (rhino or elephant)**

## Overview of decision trees

- ### Handles different attribute types well (discrete, continuous)
- ### Fitting and prediction are both fast for a single tree
- ### Handles missing values well
- ### Notorious for overfitting
    - #### **Overfitting** is when our model does well on training data but not on testing or real-world scenarios
    
---

## Random Forest 🌲🌲🌲

### Q: How can we avoid the *overfitting* problem with decision trees?
### A: Let's create a bunch of decision trees and call it a forest!

<br/>

### Q: But how do we train a bunch of very different decision trees with just one training set?
### A: We can **resample** the training set to create a bunch of unique training sets.
- #### We can either resample without replacement or *with* replacement (aka **bootstrapping**)  

<br/>

### Q: How do we combine all of these trees into one decision?
### A: Take a majority vote. What is the most common decision across all trees?  
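
Both ideas, resampling with replacement and majority voting, are short enough to sketch directly. The helper names are made up, the training set is a stand-in list of row numbers, and the votes are hypothetical tree outputs.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Resample with replacement: same size as the original, duplicates allowed."""
    return rng.choices(rows, k=len(rows))

def majority_vote(predictions):
    """Return the most common prediction across all trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
training_set = list(range(10))  # stand-in for 10 labeled rows
samples = [bootstrap_sample(training_set, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))  # some rows repeat, some are missing

votes = ["rhino", "elephant", "rhino", "rhino", "elephant"]  # hypothetical outputs of 5 trees
winner = majority_vote(votes)
print(winner)
```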

<br/>

![](images/randomforest.jpg)

## Here's how Random Forest works 🌳🌳🌳
#### 1. For each tree, generate a training set using resampling with replacement (bootstrapping)
#### 2. The points left out of a tree's bootstrap sample form its *out-of-bag sample*
#### 3. Train one decision tree on each bootstrap sample
#### 4. Use majority vote across all trees to determine the prediction
#### 5. Use the out-of-bag samples to estimate error (similar to cross-validation)
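
The whole loop can be sketched with tiny one-split trees ("stumps") on a single feature. This is a teaching sketch under several assumptions: real Random Forests grow deeper trees and also sample features at each split, and the data, function names, and tree count here are invented.

```python
import random
from collections import Counter

def fit_stump(rows):
    """A one-split 'tree': pick the threshold with the fewest training errors."""
    majority = Counter(y for _, y in rows).most_common(1)[0][0]
    best = (len(rows) + 1, float("inf"), majority, majority)  # fallback: predict majority
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = Counter(y for x, y in rows if x <= threshold).most_common(1)[0]
        right = Counter(y for x, y in rows if x > threshold).most_common(1)[0]
        errors = len(rows) - left[1] - right[1]
        if errors < best[0]:
            best = (errors, threshold, left[0], right[0])
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(rows, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw row indices with replacement
        picked = [rng.randrange(len(rows)) for _ in rows]
        # Rows that were never drawn are this tree's out-of-bag sample
        oob = set(range(len(rows))) - set(picked)
        forest.append((fit_stump([rows[i] for i in picked]), oob))
    return forest

def predict(forest, x):
    """Majority vote across all trees."""
    return Counter(tree(x) for tree, _ in forest).most_common(1)[0][0]

def oob_accuracy(forest, rows):
    """Score each row using only the trees that never saw it."""
    hits = total = 0
    for i, (x, y) in enumerate(rows):
        votes = [tree(x) for tree, oob in forest if i in oob]
        if votes:
            total += 1
            hits += Counter(votes).most_common(1)[0][0] == y
    return hits / total

rng = random.Random(1)
# Hypothetical data: height in metres -> species
rows = [(rng.gauss(1.7, 0.1), "rhino") for _ in range(30)]
rows += [(rng.gauss(3.0, 0.2), "elephant") for _ in range(30)]
forest = fit_forest(rows, n_trees=25, rng=rng)
print(predict(forest, 3.2), predict(forest, 1.5), oob_accuracy(forest, rows))
```

Because each tree skips roughly a third of the rows, the out-of-bag rows act as a free test set without withholding any data up front.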

## 🌳🌳🌳 Notes about Random Forest 🌴🌴🌴

- ### Combines the output of many decision trees to achieve a single result (*ensemble method*)
    - #### Reduces risk of overfitting
- ### Can use the Gini index (how much each feature reduces impurity) to measure feature importance
- ### Very solid algorithm!
- ### More complex and time-consuming than simple decision tree
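
For reference, the Gini index behind feature importance can be computed by hand. The sketch below shows the impurity of a node and how much a split reduces it; summing those reductions per feature across a forest is one common way importance is derived. The function names and labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """How much a split lowers impurity, weighting each side by its size."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["rhino", "rhino", "elephant", "elephant"]
mixed = gini(parent)
pure = gini(["rhino", "rhino"])
gain = impurity_decrease(parent, ["rhino", "rhino"], ["elephant", "elephant"])
print(mixed, pure, gain)
```

A feature whose splits produce large decreases across many trees ends up with a high importance score.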

---

## Support Vector Machine ↗️↘️
- ### Classification technique that creates a *hyperplane* between classes
- ### In simplest form, only solves binary classification (dog or cat)
- ### For multiple classes, we create a collection of many binary classifiers
    - #### *one vs. one*
    - #### *one vs. many* (aka *one vs. rest*)
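
A short sketch of the two ideas above: which side of a hyperplane a point falls on, and how many binary classifiers each multi-class strategy needs. The weights, bias, and land cover classes are hypothetical, and no SVM is actually trained here.

```python
from itertools import combinations

def side_of_hyperplane(w, b, x):
    """Binary decision: +1 on one side of the hyperplane w.x + b = 0, -1 on the other."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical hyperplane (the line x1 = x2 in 2-D)
print(side_of_hyperplane((1.0, -1.0), 0.0, (2.0, 1.0)))
print(side_of_hyperplane((1.0, -1.0), 0.0, (1.0, 2.0)))

classes = ["forest", "water", "urban", "cropland"]  # hypothetical land cover classes

# one vs. one: one binary classifier per pair of classes
one_vs_one = list(combinations(classes, 2))
# one vs. many: one binary classifier per class, against all the others
one_vs_many = [(c, [o for o in classes if o != c]) for c in classes]
print(len(one_vs_one), len(one_vs_many))
```

With K classes, one vs. one needs K(K-1)/2 classifiers while one vs. many needs only K, so the gap grows quickly as classes are added.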
 
![](images/svm.png)

---

## Neural Networks

### Let's look at how the brain works!
### Neurons have many *dendrites* and an *axon*. Connections to other neurons are called *synapses*.
### An activation function controls whether a neuron *fires*

![](images/neuron.jpg)


### Artificial Neural Networks replicate this natural structure (kinda)
- #### ANNs are at the heart of *deep learning*
- #### We start with an input layer (ex. Sentinel-2 band)
- #### Each ANN has one or more hidden layers
- #### We end with an output layer

![](images/neuralnetwork.png)

### Nodes, nodes, nodes
- #### The node is the most basic unit (i.e. neuron)
- #### A node receives one or more inputs (**x**)
- #### It computes an output (**y**) and sends it to another node (or to the output layer)
- #### Each input (**x**) into a node has a *weight* (**w**)
- #### Each node has an activation function that computes the output
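
A single node can be sketched in a few lines. The sigmoid activation and the optional `bias` term are assumptions for this example (the slides do not name a specific activation function), and the input values are arbitrary.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs, passed through the activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

y = node_output([1.0, 2.0], [0.5, -0.25])  # weighted sum is 0.5 - 0.5 = 0
strong = node_output([1.0, 2.0], [3.0, 2.0])  # large positive weighted sum
print(y, strong)
```

A weighted sum of zero lands exactly in the middle of the sigmoid, while a large positive sum pushes the output close to 1.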


![image.png](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-09-at-3-42-21-am.png?w=1136&h=606)

## Weights, weights, weights
### How do we determine what the weights should be?

### Step 1: *feed-forward* (compute the output)
- #### Remember that each neuron has an activation function and some weights
- #### Multiply activations (*a*) by weights (*w*), sum them, and pass the sum through the activation function: new neuron = *f(a1\*w1 + a2\*w2 ...)*
- #### Continue this until the output layer

### Step 2: *backpropagation* (adjust the weights)
- #### Start with the error at the output layer
- #### Going backwards, adjust weights to move the output closer to the desired output
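
Here is a small sketch of a feed-forward pass and a backpropagation update working together on a tiny network (2 inputs, 2 hidden nodes, 1 output). Everything specific is an assumption for illustration: the sigmoid activation, the squared-error loss, the learning rate, the starting weights, and the logical-OR toy data. The last weight in each list acts as a bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Feed-forward: weighted sums pass through the activation, layer by layer."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x + [1.0], w))) for w in w_hidden]
    output = sigmoid(sum(hi * wi for hi, wi in zip(hidden + [1.0], w_out)))
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """Backpropagation: pass the output error backwards and nudge every weight."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node (squared error times sigmoid slope)
    delta_out = (output - target) * output * (1.0 - output)
    # Error signal at each hidden node, passed back through the output weights
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
    # Adjust weights a small step in the direction that reduces the error
    for j, h in enumerate(hidden + [1.0]):
        w_out[j] -= lr * delta_out * h
    for j, w in enumerate(w_hidden):
        for i, xi in enumerate(x + [1.0]):
            w[i] -= lr * delta_hidden[j] * xi

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)

# Toy dataset: logical OR
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w_hidden = [[0.1, -0.2, 0.05], [-0.3, 0.2, 0.1]]  # arbitrary starting weights
w_out = [0.2, -0.1, 0.0]

error_before = mean_squared_error(data, w_hidden, w_out)
for _ in range(2000):
    for x, target in data:
        train_step(x, target, w_hidden, w_out)
error_after = mean_squared_error(data, w_hidden, w_out)
print(error_before, error_after)
```

Each training step runs the feed-forward pass first and then backpropagation, so the two are parts of one loop rather than alternatives.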

## This is a big, big field!

![neuralnets](https://cdn-images-1.medium.com/max/2000/1*cuTSPlTq0a_327iTPJyD-Q.png)

---

## Final questions before diving into Python?

---

## References

Environmental Spatial Data Analysis by Nate Chaney, Duke University  
https://github.com/chaneyn/ESDA_CEE690-02

History of ML  
https://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/?sh=66c6d5af15e7

Awesome Deep Ecology by Patrick Gray, Duke University  
https://github.com/patrickcgray/awesome-deep-ecology

Unsupervised Validation  
https://www.guavus.com/technical-blog/unsupervised-machine-learning-validation-techniques/
