# 💻 Coding for the Environment Machine Learning Workshop 🥳

## Agenda

1. Introductions
2. Motivation
3. ML Overview
4. Clustering
    - K-Means
    - Hierarchical
    - Gaussian Mixture
5. Supervised Classification
    - Support Vector Machines
    - Decision Trees
    - Random Forest
6. Neural Networks
7. scikit-learn

# Introductions 👋

# Setup

### 1. Set up container: https://cmgr.oit.duke.edu/

### 2. Clone workshop stuff:
`git clone https://github.com/caseyslaught/duke-c4e-ml-workshop.git`

---

# Overview 🤔

### Machine learning is a BIG field.
- You're gonna want to focus on one area, for example:
    - Land cover classification (what land cover class is this?)
    - Computer vision (what's in this photo?)
    - Audio processing (what animal is in this recording?)
    - Autonomous driving, natural language processing, robotics, etc.

![audio](images/spectrogram.jpeg)
![computer vision](images/computer_vision.jpeg)
![lulc](images/lulc.jpeg)

### A brief history of ML
- First ML programs in the 1950s
- Deep Blue beats Garry Kasparov in 1997
- The term "deep learning" gains popularity in 2006
- Recent explosion due to new computational power (GPUs), data availability, and new techniques.

---

### We're gonna focus on classification in this workshop.

---

## Unsupervised
- No labeled data given
- Goal is to find some structure in the data
- ex. clustering

![kmeans sat](images/kmeans_sat.png)

---

## Supervised
- We do have labeled data
- We train a model using the labeled data
- We then predict the output given a new set of inputs
- ex. regression, random forest, neural networks
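
The supervised workflow above can be sketched with scikit-learn. This is a minimal example, not the workshop's own code: it assumes synthetic data from `make_classification` stands in for a real labeled dataset, and the choice of `RandomForestClassifier` and its parameter values is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labeled data (assumption: stands in for a real dataset).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Hold out some labeled data so the model can be tested on unseen points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a model using the labeled training data.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Predict the output for new inputs and measure accuracy on the held-out data.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```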


## We'll be considering clustering/classification problems. 
- Given a point with some attributes, what group does it belong to?
- ex. animal (size, hair?, scales?, gills?) → taxonomic class (mammal, reptile, fish, ...)

---

# Clustering 🧮

![](images/clusters_init.png)
![](images/clusters_after.png)

## Clustering is all about grouping points using the data in our dataset

### Why would we want to do this?
- Understand the data better (maybe we don't know the groups in advance)
- Explore potential classes for subsequent supervised analysis
- Easily summarize data (kinda like compression)

### What are some potential problems?
- Groups not neatly separated (lots of overlap, weird configurations)
- Interpreting results takes domain knowledge

![clusters](images/clustering.png)

---

## Questions so far?

---

## Clustering | K-Means

1. Initialize
    - Randomly pick K cluster centers (centroids)


2. Iterate
    - Assign each point to its closest cluster center
    - Calculate the mean of each cluster and make that mean the new cluster center


3. Terminate
    - Stop when no points are reassigned


![kmeans](images/kmeans.png)
![](images/kmeans_init.png)
![](images/kmeans_after.png)
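
The three steps above can be sketched in plain NumPy. This is a minimal sketch under a few assumptions: distance is Euclidean, the "random" initial centroids are drawn from the data points themselves (one common choice among several), and the function name `kmeans` and its parameters are made up for illustration.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means (illustrative helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick K distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # 2a. Assign each point to its closest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # 3. Terminate once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 2b. Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two synthetic blobs (illustrative data, not from the workshop).
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = kmeans(points, k=2)
print(centroids)
```

`max_iter` is only a safety cap; on small datasets like this one the loop normally stops much earlier because no point gets reassigned.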

## Clustering | K-Means (cont.)

- Benefits
    - Easy(ish) to understand
    
- Downsides
    - Very sensitive to initial cluster locations
    - May terminate at a local minimum (not the global one)
    - Need to choose the number of clusters up front, which we may not know in advance

## With K-means we might not end up at the optimal solution.
### To get around this we can: 
1. Pick an error metric for a K-means result (e.g. the sum of squared distances from each point to its cluster center)
2. Run the algorithm a bunch of times with different starting centers and save the results
3. Pick the result with the lowest error

![](images/kmeans_bad.gif)
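
A minimal sketch of this restart strategy using scikit-learn's `KMeans`. The data is synthetic (`make_blobs`), and the number of clusters and the number of runs are arbitrary choices for illustration. The error metric here is `inertia_`, the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption: stands in for a real dataset).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

best_model = None
errors = []
for seed in range(10):
    # init="random" with n_init=1 gives one randomly initialized run per seed.
    model = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed)
    model.fit(X)
    # inertia_ is the sum of squared distances from points to their centroid.
    errors.append(model.inertia_)
    if best_model is None or model.inertia_ < best_model.inertia_:
        best_model = model

print("error per run:", np.round(errors, 1))
print("lowest error:", round(best_model.inertia_, 1))
```

`KMeans` can also run this loop internally: its `n_init` parameter sets how many initializations to try, and the fitted model keeps the run with the lowest inertia.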

---

## Clustering | Mixture of Gaussians

> What if our clusters are not symmetrical?  
> What if we have some overlap between clusters?

- Each group is represented by a Gaussian (normal) distribution
- Each distribution has parameters: mean, covariance, and height (its mixing weight)
- Goal is to find best set of parameters for the data

![gaussian mixture](images/mixture.png)

![](images/kmeans_init.png)
![](images/mixture_after.png)
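
A minimal sketch of fitting a mixture of Gaussians with scikit-learn's `GaussianMixture`. The two elongated, overlapping groups are synthetic and their parameters are made up for illustration. After fitting, `means_`, `covariances_`, and `weights_` hold the mean, covariance, and weight of each Gaussian.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping, elongated synthetic groups (illustrative data).
rng = np.random.default_rng(0)
group_a = rng.multivariate_normal([0, 0], [[3.0, 0.0], [0.0, 0.3]], size=200)
group_b = rng.multivariate_normal([2, 2], [[0.3, 0.0], [0.0, 3.0]], size=100)
X = np.vstack([group_a, group_b])

# covariance_type="full" lets each Gaussian have its own shape and orientation.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print("means:\n", gmm.means_)              # center of each Gaussian
print("covariances:\n", gmm.covariances_)  # shape of each Gaussian
print("weights:", gmm.weights_)            # share of the data in each Gaussian

# Soft assignment: probability that each point belongs to each group.
probabilities = gmm.predict_proba(X)
# Hard assignment: the most probable group for each point.
labels = gmm.predict(X)
```

Unlike K-means, points in the overlap get a probability for each group rather than a single hard label, which is one way to handle the overlap question raised above.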

---

## Clustering | Validation

### How do we evaluate if our model is any good?
#### With supervised learning we can withhold some of our labeled data to test our model.
#### However, unsupervised learning doesn't have a source of truth.

#### Internal validation
- Here we are asking: how cohesive is each cluster (how similar are its points to each other)?
- Also, how well separated are different clusters?
- A good (valid) result has high cohesion within clusters and high separation between clusters.
![](images/internal_val.png)
![](images/internal_val_score.png)
#### Silhouette coefficient, Calinski-Harabasz index, Dunn index, Xie-Beni score, Hartigan index
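
A minimal sketch of internal validation with scikit-learn, covering only the silhouette coefficient and the Calinski-Harabasz index via `sklearn.metrics`; the other indices listed are not shown. The data is synthetic and the range of K values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data (assumption: stands in for a real dataset).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means more cohesive,
    # better-separated clusters.
    # Calinski-Harabasz has no upper bound; higher is also better.
    scores[k] = (silhouette_score(X, labels),
                 calinski_harabasz_score(X, labels))
    print(k, scores[k])
```

Comparing these scores across values of K is one way to approach the "we may not know the number of clusters" problem, since neither score needs true labels.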

#### External validation 
- We can only do this if we have true labels for the clusters
- Compare points from generated result to known clusters  
![](images/external_val.png)
#### Jaccard Similarity, Mutual Information, Fowlkes-Mallows Index
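
A minimal sketch of external validation with scikit-learn, assuming true labels are available (here they come from the synthetic `make_blobs` data). It shows the Fowlkes-Mallows index and a normalized form of mutual information; Jaccard similarity is left out of this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import fowlkes_mallows_score, normalized_mutual_info_score

# make_blobs returns the true group of each point alongside the data.
X, true_labels = make_blobs(n_samples=300, centers=3, cluster_std=0.8,
                            random_state=0)
predicted_labels = KMeans(n_clusters=3, n_init=10,
                          random_state=0).fit_predict(X)

# Both scores compare groupings, not label names, so cluster "0" does not
# need to correspond to true group "0". Both range from 0 to 1 (higher is better).
fmi = fowlkes_mallows_score(true_labels, predicted_labels)
nmi = normalized_mutual_info_score(true_labels, predicted_labels)
print(f"Fowlkes-Mallows: {fmi:.2f}, normalized mutual information: {nmi:.2f}")
```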

# Supervised Learning

## Decision Trees

## Random Forest

## Boosting

## Neural Networks

## References

- History of ML: https://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/?sh=66c6d5af15e7
- Awesome Deep Ecology: https://github.com/patrickcgray/awesome-deep-ecology
- Unsupervised Validation: https://www.guavus.com/technical-blog/unsupervised-machine-learning-validation-techniques/
