<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/6_intro_machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Machine Learning

### Reading
- McKinney, Chapter 13.3, 13.4 
- Molin - Getting Started with Machine Learning (thru Preprocessing Data)
- Géron - Chapter 1: The Machine Learning Landscape
- https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

### Tutorials
- https://www.datacamp.com/community/tutorials/introduction-machine-learning-python
- https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-1-centering-scaling-and-knn
- https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/tree/master/ch_09


### Practice
- https://www.datacamp.com/courses/supervised-learning-with-scikit-learn

### Learning Outcomes
- Machine learning concepts
- Key Python libraries - statsmodels, SciPy, scikit-learn
- Supervised & unsupervised learning
- Training and test datasets
- Data preprocessing
- Introduction to scikit-learn
- Classification with k-Nearest Neighbors


### What is Machine Learning (ML)?

The science (and art) of programming computers to learn from data.

Machine Learning is great for:

- Problems for which existing solutions require a lot of fine-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better than the traditional approach.

- Complex problems for which using a traditional approach yields no good solution: the best Machine Learning techniques can perhaps find a solution.

- Fluctuating environments: a Machine Learning system can adapt to new data.

- Getting insights about complex problems and large amounts of data.

**Common examples**
- fraud detection
- spam filters
- character recognition
- face detection
- recommendations
- speech-to-text, text-to-speech
- anomaly detection
- association rule learning

**Types of ML systems**
- Trained with human supervision (supervised, semi-supervised, reinforcement learning) or unsupervised
- Can learn incrementally (online) or offline (batch learning)
- Compares new data to known data (instance-based learning) or uses a predictive model (model-based learning)



### Supervised learning

Training data includes labels for desired solutions.
- classification - the system is trained to assign data to a class (e.g. spam or not spam)
- regression - the system is trained to predict a target numeric value from a given set of features (predictors)

**Common Supervised Learning Models**

- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks
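
As a rough sketch of the classification case, the cell below trains a k-Nearest Neighbors classifier with scikit-learn. The iris dataset, the 75/25 split, and `n_neighbors=5` are illustrative assumptions, not choices taken from the readings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features (X) and class labels (y) for 150 iris flowers
X, y = load_iris(return_X_y=True)

# Hold out 25% of the labeled data to test the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train on the labeled training set only
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Accuracy on examples the model has not seen during training
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Scoring on the held-out test set, rather than the training set, gives an estimate of how well the model generalizes to new data.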


### Unsupervised Learning

The system learns from unlabeled data, without a teacher.

**Common Unsupervised learning algorithms**

- **Clustering** - identify groups and which group data points belong to
  - K-Means
  - DBSCAN
  - Hierarchical Cluster Analysis (HCA)
- **Anomaly detection** and novelty detection
  - One-class SVM
  - Isolation Forest
- **Visualization** and dimensionality reduction
  - Principal Component Analysis (PCA)
  - Kernel PCA
  - Locally Linear Embedding (LLE)
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)
- **Association rule** learning - discover interesting relations between attributes in large amounts of data 
  - Apriori
  - Eclat
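
A minimal clustering sketch for the K-Means entry in the list above. The two synthetic blobs and the choice of `n_clusters=2` are assumptions made for illustration; real data rarely separates this cleanly.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of unlabeled 2-D points (synthetic data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# No labels are passed in: the algorithm finds the groups itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:5], labels[-5:])
print(kmeans.cluster_centers_)
```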

**dimensionality reduction** describes techniques to simplify data without losing too much information. Can result in faster performance, less disk and memory space, and in some cases better accuracy.

**feature extraction** - a dimensionality reduction technique that merges multiple strongly-correlated features into a single feature
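
The sketch below illustrates feature extraction with PCA. The two height columns are hypothetical, strongly correlated features invented for this example, and the data is left unscaled for brevity (features are usually standardized before PCA).

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated features: the same height in cm and in inches
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
height_in = height_cm / 2.54 + rng.normal(0, 0.2, size=200)
X = np.column_stack([height_cm, height_in])

# Merge the two correlated features into a single extracted feature
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("Variance retained:", pca.explained_variance_ratio_[0])
```

Because the two columns carry nearly the same information, one component keeps almost all of the variance.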

### Semisupervised & Reinforcement Learning

**Semisupervised**

Algorithms that can deal with partially labeled data. Usually combinations of unsupervised and supervised algorithms.

**Reinforcement Learning**

The learning system (agent) can observe the environment, select and perform actions, and get rewards or penalties as a result. Over time it learns the best strategy (policy) for collecting the most reward; the policy defines what action the agent should choose in a given situation.
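
To make the action/reward loop concrete, here is a heavily simplified sketch: an epsilon-greedy agent choosing between two actions. It assumes a single situation (a "bandit" problem), so the learned policy is just "pick the action with the best estimated reward". The function name `run_bandit` and the reward probabilities are hypothetical.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn the average reward of each action by trial and error."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # times each action was tried
    values = [0.0] * len(reward_probs)  # estimated reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: choose the action that currently looks best
            action = max(range(len(values)), key=values.__getitem__)
        # The environment answers with a reward (1) or nothing (0)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Update the running average reward for the chosen action
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = run_bandit([0.2, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print("Estimated rewards:", values)
print("Preferred action:", best_action)
```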

### Batch -v- Online Learning

**Batch**

System is trained with all available data. Training is usually time & computing intensive, so done offline. Learning about new data requires training a new version of the system.

**Online Learning**

The system learns incrementally from data instances fed sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly.

Great for systems that receive data continuously and need to adapt to change rapidly or autonomously. Also a good option if computing resources are limited or the data can't fit into the memory of a single computer.

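**Learning rate** - how fast the system should adapt to changing data. Affects how sensitive the system is to *noise* in the data: a high rate adapts quickly but also forgets old data quickly, while a low rate learns more slowly but is less sensitive to noise and outliers.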
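
A minimal sketch of online learning, assuming a simple linear model `y = w * x + b` updated one instance at a time with stochastic gradient descent. The function name `online_fit`, the synthetic data stream, and the learning rate of 0.1 are all illustrative assumptions.

```python
import random

def online_fit(stream, learning_rate=0.1):
    """Update a linear model y = w * x + b one instance at a time."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y
        # Each step is cheap: nudge the parameters against the error
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b

# Simulated stream: y = 2x + 1 plus a little noise
rng = random.Random(0)
stream = []
for _ in range(5000):
    x = rng.uniform(0, 1)
    stream.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)))

w, b = online_fit(stream)
print(f"w = {w:.2f}, b = {b:.2f}")
```

The model never needs the whole dataset in memory, and a larger `learning_rate` would make the final estimates jump around more in response to the noise.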


### Instance -v- Model-based Learning

**Instance-based learning** - the system learns data examples, then generalizes to new cases by using a **similarity measure** to compare them to the learned examples (or a subset of them).
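
As a sketch of the idea, the function below stores labeled examples and classifies a new case by copying the label of the most similar stored example. Euclidean distance is assumed as the similarity measure, and the example data and the name `predict_nearest` are hypothetical.

```python
import math

def predict_nearest(examples, query):
    """Return the label of the stored example closest to the query."""
    # Similarity measure: Euclidean distance (smaller means more similar)
    _, nearest_label = min(examples, key=lambda ex: math.dist(ex[0], query))
    return nearest_label

# "Learned" examples: (features, label) pairs kept as-is
examples = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

prediction = predict_nearest(examples, (7.5, 8.0))
print(prediction)
```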

**Model-based learning** - a predictive model is developed from data. 

A model requires a **performance measure** - either a **utility function** that measures how good the model is, or a **cost function** that measures how bad it is.
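
One common cost function for numeric predictions is the mean squared error. The sketch below assumes that choice; the helper name `mse_cost` and the sample values are made up for illustration.

```python
def mse_cost(y_true, y_pred):
    """Mean squared error: the average squared gap between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
close_predictions = [3.0, 5.0, 8.0]
poor_predictions = [0.0, 5.0, 10.0]

cost_close = mse_cost(y_true, close_predictions)
cost_poor = mse_cost(y_true, poor_predictions)
print(cost_close, cost_poor)
```

A lower cost means a better fit, so the first set of predictions is preferred.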

Model can refer to a *type of model*, to a *fully specified model architecture*, or to the final *trained model* ready for making predictions.

**Model selection** involves choosing the type of model and fully specifying its architecture. 

**Model training** means running an algorithm to find the model parameters that will make it best fit the training data.
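
A small sketch of selection and training, assuming linear regression as the selected model type and a tiny hand-made dataset that follows `y = 2x + 1` exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data that follows y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Model selection: a linear model with one feature
model = LinearRegression()

# Model training: find the slope and intercept that best fit the data
model.fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# The trained model can now make predictions for new inputs
prediction = model.predict(np.array([[5.0]]))[0]
print(f"prediction for x = 5: {prediction:.2f}")
```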


### ML Challenges

- Insufficient quantity of training data
- Non-representative training data
- Poor-quality training data
- Irrelevant features
- Overfitting the training data
- Underfitting the training data
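
The last two challenges can be sketched by fitting polynomials of different degrees to a few noisy points. Everything here is an assumption chosen for illustration: the sine-shaped target, the noise level, the degrees 1, 3, and 9, and the use of the noise-free curve as test data.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(2 * np.pi * x)

# Ten noisy training points and a dense, noise-free test grid
x_train = np.linspace(0, 1, 10)
y_train = true_function(x_train) + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = true_function(x_test)

train_error, test_error = {}, {}
for degree in (1, 3, 9):
    # Fit a polynomial of the given degree to the training data
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_error[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_error[degree]:.4f}, "
          f"test MSE = {test_error[degree]:.4f}")
```

The degree-1 model is too simple to follow the curve (underfitting), so its training error stays high. The degree-9 model passes through every noisy training point, so its training error is near zero while its test error is not, which is the signature of overfitting.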
