# Module 1: Fundamentals of Machine Learning - Intro to scikit-learn

## What is Machine Learning (ML)?
1. The study of computer programs (algorithms) that can learn by example

2. ML algorithms can generalize from existing examples of a task
    - e.g. after seeing a training set of labeled images, an image classifier can figure out how to apply labels accurately to new, previously unseen images
    
## Machine Learning models can learn by example
1. Algorithms learn rules from `labelled examples`.

2. A set of labelled examples used for learning is called `training data`.

3. The learned rules should also be able to `generalize` to correctly recognize or predict new examples not in the training set.

## Machine learning models learn from experience
1. Labeled examples
    -  email spam detection
2. User feedback
    - clicks on a search page
3. Surrounding environment 
    - self-driving cars
    
## Machine Learning brings together statistics, computer science, and more
1. Statistical methods
    - infer conclusions from data
    - estimate reliability of predictions
2. Computer science
    - large-scale computing architectures
    - algorithms for capturing, manipulating, indexing, combining, retrieving and performing predictions on data
    - software pipelines that manage the complexity of multiple subtasks
3. Economics, biology, psychology
    - how can an individual or system efficiently improve their performance in a given environment?
    - what is learning and how can it be optimized?
    
## What is `Applied` Machine Learning?
1. Understand basic ML concepts and workflow

2. How to properly apply 'black-box' machine learning components and features

3. Learn how to apply machine learning algorithms in Python using the `scikit-learn` package

### What is not covered in this course:
- underlying theory of statistical machine learning
- lower-level details of how particular ML components work
- in-depth material on more advanced concepts like deep learning
    
`Recommended book: Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller and Sarah Guido, O'Reilly Media`

## Key types of Machine Learning problems

### `Supervised` machine learning: learn to predict `target values` from labelled data
1. classification (target values are discrete classes)

2. regression (target values are continuous values)
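The difference between the two target types can be sketched with a pair of decision trees. The toy data below is invented for illustration, and `DecisionTreeRegressor` is an assumption on my part (the notes only mention `DecisionTreeClassifier`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy data: one feature, six examples.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: the targets are discrete class labels.
y_class = np.array(["small", "small", "small", "large", "large", "large"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_pred = clf.predict([[2.5], [10.5]])

# Regression: the targets are continuous numbers.
y_value = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.3])
reg = DecisionTreeRegressor(random_state=0).fit(X, y_value)
value_pred = reg.predict([[2.5], [10.5]])

print(class_pred)  # an array of class labels
print(value_pred)  # an array of floating-point values
```

The classifier can only return one of the labels it saw during training, while the regressor returns a number.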

<img src="https://img.ceclinux.org/cc/a127b733c7ba8ee0131253295fbb624c895297.png">

<img src="https://img.ceclinux.org/6f/25db672b12450f93747899143a7ed9248dce37.png">


### `Unsupervised` machine learning: find structure in *unlabeled* data
1. find groups of similar instances in the data (clustering)

2. find unusual patterns (outlier detection)
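As a minimal sketch of clustering, the example below groups unlabeled points with `KMeans`. The choice of k-means and the synthetic two-blob data are assumptions made for illustration; the notes do not name a specific clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two well-separated blobs of 2-D points.
rng = np.random.RandomState(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# No labels are passed in: the algorithm only sees the feature values.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:5], cluster_ids[-5:])
```

The cluster ids are arbitrary integers; what matters is which points end up sharing the same id.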

<img src="https://img.ceclinux.org/43/5fd3aa2e6811e38501eaed1fe7247a25f0294b.png">

<img src="https://img.ceclinux.org/a6/e204341334f0dc5266d8a1967fd5175fdd8d87.png">

<img src="https://img.ceclinux.org/45/41d3b9a13c2c8ffd58c94c1f2e335fd75df110.png">


## Python Tools for Machine Learning
1. scikit-learn
    - from sklearn.model_selection import train_test_split; from sklearn.tree import DecisionTreeClassifier
2. scipy

3. numpy

4. pandas

5. matplotlib and others
    - import matplotlib.pyplot as plt; import seaborn as sn; import graphviz


## An Example Machine Learning Problem

<img src="https://img.ceclinux.org/7b/101e665d3bd7a9f7b8338a85ec2f103400b8f3.png">

<img src="https://img.ceclinux.org/90/023537a0aaa4b35c2e1aaeec784a61002a90c1.png">

<img src="https://img.ceclinux.org/5a/8175e6d6cd98104f56cdb3edf3a27804335a22.png">

<img src="https://img.ceclinux.org/3c/d8d3c2152ce82a41b77402b067dfa105d68cd0.png">


Note: never use a training sample as a test sample! Evaluating on examples the model has already seen overstates how well it generalizes to new data.
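One way to keep the two sets separate is `train_test_split`, sketched below. The data is a hypothetical placeholder, and the `indices` array is only there to make the disjointness visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 examples with 2 features each, and binary labels.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2
indices = np.arange(20)  # row ids, tracked only to show which rows go where

# Hold out 25% of the rows for testing; the rest are used for training.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.25, random_state=0
)

# Every row lands in exactly one of the two sets.
print(sorted(idx_train.tolist()))
print(sorted(idx_test.tolist()))
print(set(idx_train.tolist()).isdisjoint(idx_test.tolist()))
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same held-out rows.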

## Examining the Data
### Why looking at the data initially is important
1. inspecting feature values may help identify what cleaning or preprocessing still needs to be done once you can see the range or distribution of values that is typical for each attribute

2. you might notice missing or noisy data, or inconsistencies such as the wrong data type being used for a column, incorrect units of measurement for a particular column, or that there aren't enough examples of a particular class

3. you may realize that your problem is actually solvable without machine learning

Examples:
<img src="https://img.ceclinux.org/dc/fb0a1c4a6906d4cce0b6b08f4a2fdaeee04a55.png">

<img src="https://img.ceclinux.org/23/7b10a1f3866db085044c9311b691fa83b00bf5.png">
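A first look at a dataset along these lines might be sketched with pandas as follows. The table is invented for illustration (column names and values are hypothetical) and deliberately contains a missing value, a suspicious unit, and a numeric column stored as text.

```python
import numpy as np
import pandas as pd

# Hypothetical table of fruit measurements, invented for illustration.
df = pd.DataFrame({
    "label": ["apple", "apple", "lemon", "apple", "orange"],
    "mass_g": [180.0, 175.0, np.nan, 0.19, 160.0],    # one missing value; 0.19 looks like kg
    "width_cm": ["7.4", "7.1", "6.0", "7.3", "7.0"],  # numbers stored as strings
})

print(df.dtypes)      # reveals that width_cm is not a numeric column
print(df.describe())  # range of mass_g exposes the out-of-scale 0.19

missing_per_column = df.isna().sum()         # missing values per column
class_counts = df["label"].value_counts()    # number of examples per class

print(missing_per_column)
print(class_counts)
```

Checks like these are cheap and often reveal problems before any model is trained.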

## K-Nearest Neighbors Classification
1. can be used for both classification and regression

2. is an instance-based (memory-based) supervised learning method, meaning it memorizes labelled examples during training

### The k-Nearest Neighbor (k-NN) Classifier Algorithm
*Given a training set X_train with labels y_train, and given a new instance x_test to be classified:*

1. find the most similar instances (let's call them X_NN) to x_test that are in X_train

2. get the labels y_NN for the instances in X_NN

3. predict the label for x_test by combining the labels y_NN e.g. simple majority vote
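The three steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements it; the function name `knn_predict`, the use of Euclidean distance, and the toy data are assumptions.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by majority vote among its k nearest neighbors."""
    # Step 1: find the k training instances closest to x_test (Euclidean distance).
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nn_idx = np.argsort(distances)[:k]
    # Step 2: get the labels of those neighbors.
    y_nn = y_train[nn_idx]
    # Step 3: combine the labels with a simple majority vote.
    return Counter(y_nn.tolist()).most_common(1)[0][0]


# Hypothetical toy training set: two small groups of 2-D points.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # prints "a"
```

Note that "training" here is just storing `X_train` and `y_train`; all the work happens at prediction time.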

<img src="https://img.ceclinux.org/4f/1065e157a5eee43d00856cfbcf60350d859b7b.png">

## A nearest neighbor algorithm needs four things specified
1. a distance metric
    - typically Euclidean (Minkowski with p=2)
2. how many 'nearest' neighbors to look at?
    - e.g. five
3. optional weighting function on the neighbor points
    - ignored here (all neighbors count equally)
4. how to aggregate the classes of neighbor points
    - simple majority vote (the class with the most representatives among the nearest neighbors)
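In scikit-learn these choices map onto the arguments of `KNeighborsClassifier`, roughly as sketched below. The iris dataset is an assumption used only to make the example self-contained; it is not the dataset from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (an assumption, chosen because it ships with scikit-learn).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(
    n_neighbors=5,        # 2. how many nearest neighbors to look at
    metric="minkowski",   # 1. distance metric ...
    p=2,                  #    ... Minkowski with p=2 is Euclidean distance
    weights="uniform",    # 3. no weighting: every neighbor counts equally
)
# 4. With uniform weights, predict() aggregates by simple majority vote.
knn.fit(X_train, y_train)

test_accuracy = knn.score(X_test, y_test)
print(test_accuracy)
```

These values are also the library defaults, so `KNeighborsClassifier()` with no arguments behaves the same way.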
    

### Bias-variance tradeoff
1. When K is small, the prediction is sensitive to noise, outliers, mislabeled data, and other sources of variation in individual data points (high variance).

2. For larger values of K, the areas assigned to different classes are smoother, less fragmented, and more robust to noise in the individual points, though possibly at the cost of more mistakes on individual points (higher bias).
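The effect of K can be explored with a small experiment like the one below. The noisy two-moons data from `make_moons` and the specific K values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic noisy two-class data (an assumption, not the lecture's dataset).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 15, 75):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each K.
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:2d}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect even though the model has memorized the noise. Comparing the test accuracies across K values is one way to pick a reasonable setting.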