# Introduction to Machine Learning 

By the end of this session you will:
- have a clear understanding of Machine Learning and be able to distinguish it from statistics,
- understand training, validation, and test sets,
- be able to analyze a new dataset and visualize it (if possible),
- know some basic machine learning algorithms (Naive Bayes, tree-based algorithms, linear classifiers, etc.),
- have a clear understanding of the bias-variance tradeoff,
- understand overfitting.

## Machine Learning

![Machine Learning Applications](figures/Slide4.PNG)

![Kaggle](figures/Slide5.PNG)

## Data
The first concern of Machine Learning is **Data** <br>
**No Data, No Machine Learning** (this statement is true to a good extent; but there are ML methods which are not data-driven)

![features](figures/Slide11.PNG)

![features2](figures/Slide12.PNG)

![Training](figures/Slide13.PNG)

![image.png](attachment:image.png)

### Scikit-learn library for Python
We will introduce the scikit-learn library for Python (http://scikit-learn.org), one of the most widely used machine learning libraries.

Scikit-learn ships with small datasets to get researchers acquainted with Machine Learning. The 7 covered here:
- boston --> &nbsp; &nbsp; `load_boston` Load and return the boston house-prices dataset (regression). Note: this loader was deprecated in scikit-learn 1.0 and removed in 1.2.
- iris --> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; `load_iris([return_X_y])` Load and return the iris dataset (classification).
- diabetes -->  &nbsp; `load_diabetes([return_X_y])` Load and return the diabetes dataset (regression).
- digits -->  &nbsp; &nbsp; &nbsp; &nbsp; `load_digits([n_class, return_X_y])` Load and return the digits dataset (classification).
- linnerud --> &nbsp;   `load_linnerud([return_X_y])` Load and return the linnerud dataset (multivariate regression).
- wine --> &nbsp; &nbsp; &nbsp; &nbsp; `load_wine([return_X_y])` Load and return the wine dataset (classification).
- breast_cancer --> `load_breast_cancer([return_X_y])` Load and return the breast cancer wisconsin dataset (classification)

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook
import numpy as np
from sklearn import datasets
import pandas as pd
from IPython.display import display

#### Wine dataset

In [None]:
wine = datasets.load_wine()

![image.png](attachment:image.png)

In [None]:
# analyze data

In [None]:
# create dataframe and describe
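
One possible way to fill in the two cells above is sketched here. It assumes the wine data is wrapped in a pandas `DataFrame`; the variable name `df` and the added `target` column are choices made for this sketch, not part of the dataset.

```python
import pandas as pd
from sklearn import datasets

wine = datasets.load_wine()

# One row per wine sample, one column per chemical measurement.
df = pd.DataFrame(wine.data, columns=wine.feature_names)
# Append the class label (0, 1 or 2) as an extra column (name chosen for this sketch).
df["target"] = wine.target

print(df.shape)                                         # samples x (features + target)
print(df.describe().T[["mean", "std", "min", "max"]])   # per-feature summary
print(df["target"].value_counts().sort_index())         # samples per class
```

The summary shows that the features live on very different scales (compare `proline` with `hue`), which matters for scale-sensitive methods such as PCA.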

### PCA
https://en.wikipedia.org/wiki/Principal_component_analysis
![image.png](attachment:image.png)

In [None]:
from sklearn.decomposition import PCA

# transform to 3D using PCA and visualize
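
A minimal sketch of the projection step follows. Standardizing the features before PCA is an assumption of this sketch (the cell above does not say whether to scale), made because PCA is sensitive to feature scale.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()

# Assumption of this sketch: put all 13 features on a comparable scale first.
X_scaled = StandardScaler().fit_transform(wine.data)

# Keep the 3 directions of largest variance.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_scaled)

print(X_3d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

To visualize, the three columns of `X_3d` can be passed to `scatter` on a 3D matplotlib axes, coloring the points with `c=wine.target`.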

# ML. No. 1- Naïve Bayes 

- Pure statistical approach
- Fast
- Based on the Bayes theorem of probability
- Used for *Classification* (to be explained later)

### Assumption
The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. <br>
In other words, the naive conditional independence assumption considers each feature $x_i$ to be conditionally independent of every other feature $x_{j}$ for $j\neq i$, given the category $y$.
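
Written out, the assumption lets the joint likelihood factor into a product of per-feature terms. The notation below ($n$ features, class $y$) is the usual textbook form and is given here as a sketch:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The predicted class is then the one that maximizes the right-hand side: $\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$.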

conditional probabilities?

![image.png](attachment:image.png)

**Example** <br>
Features: alcohol, hue, color_intensity, magnesium <br> 
Label (target): cheap, fair_price, expensive <br>
![image.png](attachment:image.png)

### Bayes' Theorem (simplified form)

![image.png](attachment:image.png)

 **Posterior probability**: the conditional probability of an observation belonging to a `class (hypothesis)`, **after** its `features (evidence)` and `background knowledge (prior)` are taken into account <br><br>
 **Prior probability**: the probability of a `class (hypothesis)` **before** considering its `features (evidence)` <br><br>
 **Likelihood**: the compatibility of the `features (evidence)` with the given `class (hypothesis)` <br>
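
The three terms can be tied together with a small worked calculation. All numbers below are hypothetical, chosen only to illustrate the arithmetic with the price classes from the example above and a single made-up feature ("high alcohol").

```python
# Hypothetical prior: P(class) before looking at any feature.
prior = {"cheap": 0.5, "fair_price": 0.3, "expensive": 0.2}
# Hypothetical likelihood: P(high alcohol | class).
likelihood = {"cheap": 0.2, "fair_price": 0.5, "expensive": 0.8}

# Evidence: total probability of observing the feature, summed over classes.
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior: P(class | high alcohol) = prior * likelihood / evidence.
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

for c, p in posterior.items():
    print("P({} | high alcohol) = {:.3f}".format(c, p))
```

Although `expensive` has the smallest prior here, its high likelihood makes it the most probable class once the feature is observed.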

![image.png](attachment:image.png)

### Example

![Kaggle](figures/Slide22.PNG)

In [None]:
from sklearn.naive_bayes import GaussianNB
# train and test on the same data
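
A sketch of what this cell could contain, assuming the wine dataset loaded earlier and default `GaussianNB` settings:

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

wine = datasets.load_wine()

# Fit on the full dataset ...
classifier = GaussianNB()
classifier.fit(wine.data, wine.target)

# ... and evaluate on the very same samples (optimistic by construction).
y_pred = classifier.predict(wine.data)
n_correct = (wine.target == y_pred).sum()
print('#corrects: {}/{}'.format(n_correct, wine.target.shape[0]))
print('accuracy: {:0.2f}'.format(n_correct / wine.target.shape[0] * 100))
```

Because the model is scored on data it has already seen, this number says little about how it would do on new wines.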

In [None]:
from sklearn.model_selection import train_test_split
# split data; train and test on separate datasets
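
A sketch of the split-then-evaluate version. The values `test_size=0.5` and `random_state=5` mirror those used later in this notebook; any other values would work equally well for the exercise.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine = datasets.load_wine()

# Hold out half of the samples; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.5, random_state=5)

# Fit only on the training half.
classifier = GaussianNB().fit(X_train, y_train)

# score() returns the fraction of correctly classified samples.
train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
print('accuracy on train: {:0.2f}'.format(train_acc * 100))
print('accuracy on test: {:0.2f}'.format(test_acc * 100))
```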

In [None]:
from sklearn.decomposition import PCA

# transform to 3D and visualize

## Exercise
1- Change the `random_state` in the `train_test_split` function call. Fit naive_bayes again. What do you see? <br>
2- Reduce the `test_size` in the `train_test_split` function call. Now change the random state again. What do you conclude?

In [None]:
# Write down your conclusions as homework and submit them before next session.

# Back to theory

![Approaches to ML](figures/Slide16.PNG)

# ML. No. 2- Decision Tree

![decision tree](https://cdn-images-1.medium.com/max/880/1*xzF10JmR3K0rnZ8jtIHI_g.png)

- Easy to understand and interpret (can be visualized, a white-box model)
- Works with little data
- Fast
- Used for Classification and Regression
- Prone to **overfitting** the training set (will be explained)

In [None]:
from sklearn import tree
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.5, random_state=5)

# train and test with DecisionTree
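
A sketch of the training step, reusing the split from the cell above. The explicit `criterion='gini'` (the scikit-learn default) and `random_state=0` are choices made here for clarity and reproducibility.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.5, random_state=5)

# An unrestricted tree: it keeps splitting until every leaf is pure.
classifier = tree.DecisionTreeClassifier(criterion='gini', random_state=0)
classifier.fit(X_train, y_train)

train_acc = classifier.score(X_train, y_train)
test_acc = classifier.score(X_test, y_test)
print('accuracy on train: {:0.2f}'.format(train_acc * 100))
print('accuracy on test: {:0.2f}'.format(test_acc * 100))
```

Comparing the two printed accuracies gives a first look at the overfitting discussed later in this session.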

In [None]:
tree.export_graphviz(classifier, out_file='tree.dot')

![image.png](attachment:image.png)

Impurity criteria for choosing the data split at each node:
- Gini
- Entropy (to be discussed in the exercise)

##### Less impurity (Gini or entropy) after a split --> More Information Gain

Impurity: how mixed (impure) the samples in a node are.<br> If all the samples in the node are from the same class,
`Gini = 0` and `entropy = 0`.

Gini impurity:<br>
$I_G = 1-\sum_i{{p_i}^2}$ <br>  <br>
$p_i = \frac{\text{number of items in class i}}{\text{total items}}$

Entropy:
![image.png](attachment:image.png)
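
Both measures are easy to compute by hand. The sketch below implements the Gini formula given above and, for entropy, the standard Shannon definition $-\sum_i p_i \log_2 p_i$ (an assumption, since the entropy formula is only shown as an image). The function names are made up for this illustration.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Return p_i for every class present in the node."""
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: sum(-p_i * log2(p_i))."""
    return sum(-p * log2(p) for p in class_proportions(labels))

pure = ["a", "a", "a", "a"]    # all samples from one class
mixed = ["a", "a", "b", "b"]   # evenly split between two classes

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

The pure node scores zero on both measures, while the evenly mixed two-class node reaches the two-class maximum of each (0.5 for Gini, 1 bit for entropy).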

### Exercise
Change the criterion to `entropy` and see how it affects the results. <br>
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

# Back to theory

### Overfitting, Bias, Variance
**Overfitting** 
- A modeling error which occurs when a function is fit too closely to the training set.
- The model has memorized the training set instead of learning its patterns.
- A model that models the training data *too well*.

Bias vs Variance tradeoff:
- **Error due to Bias**: The difference between the average prediction of our model and the correct (target) value.
- **Error due to Variance**: The variability of a model's prediction for a given data point, i.e. its sensitivity to noise in the training set.

A simple intuition: 
- **Bias** reflects your model's **performance** in general (in both train and test scenarios): a high-bias model performs poorly on both.
- **Variance** shows up as the **difference** between training accuracy and test accuracy.

Using strong assumptions generally means you can reduce the variance of your estimator (a good thing) at the cost of risking more model bias (a bad thing), and vice versa. 

![bias variance](https://www.kdnuggets.com/wp-content/uploads/bias-and-variance.jpg)
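
The intuition can be checked empirically. This sketch uses `max_depth` of a decision tree as a stand-in for the "strength of assumptions": a depth-1 tree is a very strong assumption, while `None` lets the tree grow without limit. The chosen depths and the split parameters are arbitrary picks for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.5, random_state=5)

results = {}
for depth in (1, 3, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each depth.
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print('max_depth={}: train={:0.2f}, test={:0.2f}'.format(
        depth, results[depth][0] * 100, results[depth][1] * 100))
```

Low accuracy on both sets points to bias; a large gap between train and test accuracy points to variance.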

# ML. No. 3- Random Forest

![random forest](https://cdn-images-1.medium.com/max/1600/1*xxahsU68wsbXyMYAFTf-Eg.png)

Random Forest: <br>
- An ensemble of Decision Trees
- Reduces overfitting

### Exercise:
- Check the RandomForestClassifier in the scikit-learn library. 
- Complete the code in the next cell
- Which argument determines the number of trees? <br>

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn. ...
classifier = ...

X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.5, random_state=5)
print(classifier.fit(X_train, y_train))

y_pred_train = classifier.predict(X_train)
print('Train #corrects: {}/{}'.format((y_train == y_pred_train).sum(), y_train.shape[0]))
print('accuracy on Train: {:0.2f}\n'.format((y_train == y_pred_train).sum()/ y_train.shape[0]*100))

y_pred = classifier.predict(X_test)
print('Test #corrects: {}/{}'.format((y_test == y_pred).sum(), y_test.shape[0]))
print('accuracy on test: {:0.2f}'.format((y_test == y_pred).sum()/ y_test.shape[0]*100))

# ML. No. 4- Logistic Regression as a Classifier

Logistic Function: <br>
![image.png](attachment:image.png)

## $ 0 \leq h_\theta(x) \leq 1 $ <br>

## $ h_\theta(x) = \frac{1}{1+e^{-\theta^T x}} $

## $ h_\theta(x) = p(y=1|x;\theta) $

![image.png](attachment:image.png)

# $ h_\theta(x) =  \frac{1}{1+e^{-  (\theta_{0} ~+~ \theta_{1} x_1~ +~ \theta_{2} x_2 )   }} $
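
The two-feature hypothesis above can be evaluated directly. In this sketch the parameter values in `theta` are hypothetical, picked only so that the three sample points land below, on, and above the decision boundary.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis for two features: theta = (theta_0, theta_1, theta_2), x = (x_1, x_2)."""
    z = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    return sigmoid(z)

theta = (-3.0, 1.0, 1.0)  # hypothetical parameters

for x in [(0.0, 0.0), (1.5, 1.5), (4.0, 4.0)]:
    print(x, round(h(theta, x), 3))
```

Points where $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ get probability 0.5, which is why that line acts as the decision boundary.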

In the above example (binary classifier), Logistic Regression minimizes the following cost function: <br>

![image.png](attachment:image.png)

![image.png](attachment:image.png)
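
For reference, the usual textbook form of this cost for a binary classifier is the log-loss (cross-entropy). It is given here as an assumption about what the images show, with $m$ training samples and labels $y^{(i)} \in \{0, 1\}$:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

Each sample is penalized by $-\log h_\theta(x)$ when $y = 1$ and by $-\log(1 - h_\theta(x))$ when $y = 0$, so confident wrong predictions cost the most.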

### iris dataset
https://archive.ics.uci.edu/ml/datasets/iris
![image.png](attachment:image.png)

In [None]:
iris = datasets.load_iris()
iris.data = iris.data[:,:2]

from sklearn.linear_model import LogisticRegression

# train and test with LogisticRegression
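
A sketch of the training step on the first two iris features. The split parameters copy the ones used for the wine data, and `max_iter=1000` is an assumption added to give the solver room to converge.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]  # sepal length and sepal width only

X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.5, random_state=5)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

test_acc = classifier.score(X_test, y_test)
print('accuracy on train: {:0.2f}'.format(classifier.score(X_train, y_train) * 100))
print('accuracy on test: {:0.2f}'.format(test_acc * 100))

# Class probabilities for the first test sample (one value per iris species).
print(classifier.predict_proba(X_test[:1]))
```

With the names `X_test`, `y_test` and `classifier` defined this way, the `my_plot` helper in the following cell can be called as shown there.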

In [None]:
# helper plot function
def my_plot(data, target, classifier):
    %matplotlib inline       
    h = .02  # step size in the mesh
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [h_min, h_max] x [v_min, v_max].
    h_min, h_max = data[:, 0].min() - .5, data[:, 0].max() + .5
    v_min, v_max = data[:, 1].min() - .5, data[:, 1].max() + .5
    hh, vv = np.meshgrid(np.arange(h_min, h_max, h), np.arange(v_min, v_max, h))
    Z = classifier.predict(np.concatenate([np.expand_dims(hh.ravel(), axis=1), 
                                           np.expand_dims(vv.ravel(), axis=1)], axis=1))
    # Put the result into a color plot
    Z = Z.reshape(hh.shape)
    plt.figure(1, figsize=(4, 3))
    plt.pcolormesh(hh, vv, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    plt.scatter(data[:, 0], data[:, 1], c=target, edgecolors='k')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

    plt.xlim(hh.min(), hh.max())
    plt.ylim(vv.min(), vv.max())
    plt.xticks(())
    plt.yticks(())

    plt.show()

In [None]:
my_plot(X_test, y_test, classifier)

## Take-home Exercise 
Classify the `iris` dataset (only considering the first two features and ignoring the 3rd and 4th feature) using the following classifiers and visualize the test results:
- Naive Bayes
- Decision Tree
- Random Forest

## Machine Learning vs Statistics

**Machine learning:** To learn from **Data** <br>
**Statistics:** a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of **Data** [Wikipedia] <br/>

Different points of view: <br>
- Machine learning is applied statistics
- Machine learning is glorified statistics
- Machine learning is scaled up statistics (big data)

My preferred answer: <br>
- Machine learning makes *predictions* <br>
- Statistics makes *inferences*

# scikit-learn cheat sheet

![image.png](attachment:image.png)

# Student Feedback
Please take a couple of minutes and give us your feedback to reflect on for the next session.
https://docs.google.com/forms/d/e/1FAIpQLSfESDwWv4r2aN0EREwWLtawED8URWgn1L6EPfM79mM0Ih3YNA/viewform?usp=sf_link