# Categories of ML Algorithms

* Supervised Learning
    
    * Classification: Naive Bayes Classifiers, Decision Trees, Artificial Neural Networks (ANN), Nearest Neighbors, Random Forest, SVM
    * Regression
    
* Unsupervised Learning

    * Clustering Algorithms: K-means, Hierarchical Clustering
    
* Reinforcement Learning

    * Markov Decision Process
    
Gathering high-quality tagged (labeled) training data is a common problem in supervised learning.
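
To make the supervised/unsupervised split above concrete, here is a minimal sketch using only the Python standard library. The toy points, the labels `"small"`/`"large"`, and the function names are assumptions made up for illustration. The first function is a 1-nearest-neighbor classifier that needs tagged data; the second is a basic K-means loop that groups the same points without any tags.

```python
import math

def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: return the label of the training point closest to `query`."""
    distances = [math.dist(p, query) for p in train_points]
    return train_labels[distances.index(min(distances))]

def k_means(points, centroids, iterations=10):
    """Unsupervised: group unlabeled points around the given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["small", "small", "small", "large", "large", "large"]

# Supervised learning uses the tags; clustering ignores them.
predicted = nearest_neighbor_predict(points, labels, (2.0, 1.0))
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

The classifier can only answer with labels it has seen, which is why the quality of the tags matters so much; K-means only reports which points belong together and leaves the naming of the groups to you.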


# Feature Engineering

One of the most important aspects of ML is the process of providing quality features to your algorithm.

* Feature extraction

    * The process of getting useful features from the raw data
    * Instead of feeding the algorithm just the image, feed it higher-level features like "contains a person's face or not" or "the skin color"
    * The higher-level features may themselves be calculated by other Machine Learning algorithms!
    * With better feature extraction, the algorithm learns better from fewer training examples, reducing the time needed to train the model

* Feature selection

    * Feature selection algorithms score each feature and return only the most valuable ones according to that score.
    * Avoid using huge feature sets. As we add more features, we need many more instances to represent a decent share of the possible feature combinations. This is the curse of dimensionality.
    * As the number of features (and with it the complexity of the model) grows, the number of training examples needed can grow exponentially
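
The "score each feature, keep the best" idea from the feature selection notes can be sketched as follows. This is one possible scoring rule, not the only one: it assumes numeric features and scores each column by the absolute Pearson correlation with the target. The housing-style data and the names `select_top_features`, `size_m2`, `rooms`, and `day_of_month` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

def select_top_features(rows, targets, names, k):
    """Score every feature column and return the k best (name, score) pairs."""
    scores = []
    for index, name in enumerate(names):
        column = [row[index] for row in rows]
        scores.append((name, abs(pearson(column, targets))))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Hypothetical toy data: each row is (size_m2, rooms, day_of_month); the target is a price.
rows = [(50, 2, 3), (80, 3, 17), (120, 4, 9), (65, 2, 28), (100, 4, 12)]
prices = [150, 240, 360, 195, 300]

best = select_top_features(rows, prices, ["size_m2", "rooms", "day_of_month"], k=2)
```

Here `day_of_month` is unrelated to the price and gets the lowest score, so it is dropped. Scikit-learn packages the same pattern as `sklearn.feature_selection.SelectKBest`, which takes the scoring function as a parameter.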
    

# Training Examples

* Get quality training data; do not feed a supervised learning algorithm wrong answers
* The more quality training data you can gather, the better results you may get
* Quality training data is expensive; sometimes the only option is to hire people on crowdsourcing platforms to tag the data manually
* Bootstrapping: a way to make tagging more efficient by using your own Machine Learning model to help with the tagging
* Use a validation set to tune the parameters; see [cross-validation on Wikipedia](https://en.wikipedia.org/wiki/Cross-validation)
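
As a rough illustration of the validation idea in the last bullet, the sketch below splits example indices into k folds so that every example is used for validation exactly once. The function name `k_fold_splits` is an assumption for this example. It also assumes the data has already been shuffled, since it cuts the folds in order.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_splits(n_examples=10, k=3))
for train, validation in splits:
    print(f"train on {train}, validate on {validation}")
```

To tune a parameter, you would train once per fold for each candidate value and keep the value with the best average validation score. Scikit-learn offers the same splitting as `sklearn.model_selection.KFold`.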


# Performance Metrics

* Accuracy: If a model returns the correct output on 95% of the testing examples, the accuracy is 95%
* Training and testing sets of instances *must* be disjoint
* Overfitting: the model gets very good results on the training data but poor results on a separate testing set
* To avoid overfitting, use a simpler model with fewer features and a bigger, more representative training set.
* Metrics for classification: 
    * Precision and Recall tell you how well the algorithm performs on each class
    * Confusion matrices show where our classification algorithm is 'confusing' one class with another.
* Metrics for regression and clustering: ???
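
The classification metrics above can be computed by hand from pairs of actual and predicted labels. The following is a small sketch using only the standard library; the spam/ham labels and the function names are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def precision_recall(matrix, label):
    """Precision and recall for one class, computed from the pair counts."""
    true_pos = matrix[(label, label)]
    predicted_as_label = sum(n for (_, p), n in matrix.items() if p == label)
    actually_label = sum(n for (a, _), n in matrix.items() if a == label)
    precision = true_pos / predicted_as_label if predicted_as_label else 0.0
    recall = true_pos / actually_label if actually_label else 0.0
    return precision, recall

# Hypothetical test-set results for a spam filter.
actual    = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham",  "ham", "ham", "ham", "spam", "ham"]

matrix = confusion_matrix(actual, predicted)
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)
spam_precision, spam_recall = precision_recall(matrix, "spam")
```

Accuracy counts all correct answers together, while precision ("of everything flagged as spam, how much really was spam") and recall ("of all real spam, how much was caught") are reported per class. Scikit-learn provides equivalents in `sklearn.metrics`, for example `confusion_matrix`, `precision_score`, and `recall_score`.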


# Computational Resources

* Expect to run multiple training rounds before getting decent results, and to retrain the model to cover new instances and keep improving its accuracy.
* To get fast results when training huge models, you need several GBs of RAM and multi-core machines to parallelize the processing
* Use Python and Scikit-learn


# 5 Steps


* Collecting data: Get raw (relevant) data with good variety, density and volume
    
* Preparing the data: Assess the quality of the data and fix issues such as missing values and outliers. See [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis)

* Training a model: Train the model with training data

* Evaluating the model: Test the accuracy with test data

* Improving the performance: Try different models or introduce different features to improve the results
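
The five steps can be walked through end to end on a toy problem. This is only a sketch under simplifying assumptions: the measurements are invented, missing values are filled with the column mean, the model is a straight line fitted by least squares, and the error measure is the mean absolute error. All function names are hypothetical.

```python
def fill_missing(values):
    """Prepare: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fit_line(xs, ys):
    """Train: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_absolute_error(model, xs, ys):
    """Evaluate: average absolute gap between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# 1. Collect: hypothetical raw measurements, one input value is missing.
raw_x = [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# 2. Prepare: fix the missing value.
xs = fill_missing(raw_x)

# 3. Train: fit on the first six examples only.
train_x, train_y = xs[:6], ys[:6]
test_x, test_y = xs[6:], ys[6:]
model = fit_line(train_x, train_y)

# 4. Evaluate: measure the error on the held-out examples.
test_error = mean_absolute_error(model, test_x, test_y)

# 5. Improve: compare with another candidate, here a baseline that
#    always predicts the mean of the training targets.
baseline = (0.0, sum(train_y) / len(train_y))
baseline_error = mean_absolute_error(baseline, test_x, test_y)
```

In a real project the split would normally be random rather than "first six, last two", and step 5 would loop back to steps 2 to 4 with new features or models until the test error stops improving.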


# Applications

* Image Processing
    
    * Image tagging: Facebook learns from the photos you manually tag.
    * OCR (optical character recognition)
    * Self-driving cars
 
* Text Analysis

    * Spam filtering: classification
    * Sentiment Analysis
    * Information Extraction: extracting addresses, entities, keywords, etc.
 
* Data Mining (mining useful information from a huge table in a database)

    * Anomaly detection: detect outliers, as in credit card fraud detection (which transactions deviate from the usual purchasing pattern?)
    * Association rules: find things that tend to occur together, like beer and diapers (useful for marketing) or diabetes and heart disease
    * Grouping: group users by their behaviour in a SaaS platform
    * Predictions: credit score of new customers

* Retail

    * Which product is fast/slow moving? Which product should be introduced/removed?

* Video Games & Robotics

    * Reinforcement learning
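
As one concrete instance of the anomaly detection application listed under Data Mining, the sketch below flags amounts that lie far from the usual pattern using a z-score (distance from the mean in standard deviations). The transaction amounts, the function name, and the threshold of 2.5 are assumptions for illustration; real fraud detection uses far richer features and models.

```python
import statistics

def find_outliers(amounts, threshold=2.5):
    """Return the amounts whose z-score exceeds `threshold`."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []  # all amounts identical, nothing stands out
    return [a for a in amounts if abs(a - mean) / stdev > threshold]

# Hypothetical card transactions: eleven ordinary purchases and one unusual one.
amounts = [25, 30, 22, 35, 28, 31, 27, 33, 29, 26, 32, 950]
outliers = find_outliers(amounts)
```

A single extreme value also inflates the mean and standard deviation it is compared against, so with very few data points this simple rule can miss outliers; statistics based on the median are a common, more robust alternative.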
    

# Other

* Popular algorithms

    * SVM
    * Probabilistic Models
    
        * Give predictions with a degree of certainty

    * Deep Learning: 

        * Building on ANNs, deep learning uses structures with more layers and improved learning algorithms that not only learn the task but also automatically build representations of the most important features at higher levels of abstraction.


* Difference between ML & Data Mining: Data Mining searches the data for specific information, like the beer-and-diapers association, while Machine Learning concentrates solely on performing a given task.

* Book: Machine Learning by Tom Mitchell

> A computer program is said to learn to perform a task T from experience E, if its performance at task T, as measured by a performance metric P, improves with experience E over time.


* Which type of ML algorithm to use? 
    
    * Supervised or unsupervised? Classification, regression or clustering? Deep Learning, SVM, Naive Bayes… which one is the best? 
    * It all depends on the problem and the data; expect some trial and error
    
* AI companies

    * Clarifai
    * Snips (app)
    * Wade & Wendy
    

# Reference

* [MonkeyLearn Online Text Analysis](http://monkeylearn.com/) 
* [A Gentle Guide to Machine Learning](https://blog.monkeylearn.com/a-gentle-guide-to-machine-learning/)
* [Introduction to Machine Learning for Developers](http://blog.algorithmia.com/introduction-machine-learning-developers/)
* [Machine Learning basics for a newbie](https://www.analyticsvidhya.com/blog/2015/06/machine-learning-basics/)