# Lecture 7: Machine Learning Basics

## Agenda: 
1. What is machine learning?
2. How does machine learning work?
3. Scikit-learn
4. NumPy

## 1. What is Machine Learning?
Machine learning builds a model from a set of **n observations (also known as samples, examples, instances, or records)** and then uses that model to predict **properties** of new data.

                                  
|![Figure 1: Machine Learning](ML_training.png)|
|-----------------------------|
|Figure 1. Machine Learning|

## 2. Statistics and ML
1. Statistics is a subfield of mathematics, while ML is a subfield of computer science that grew out of AI and focuses on learning from data.
2. ML started to flourish as a separate field in the 1990s, when its focus shifted to methods borrowed from statistics and probability theory.
3. Statistics and ML share many methodological principles but differ in their primary goals:
    * ML concentrates on prediction, identifying the best course of action with little or no understanding of the underlying mechanism. It is typically used for more complex relationships and large data sets.
    * Statistics focuses on inference, modeling the data-generating process to formalize understanding (although statistical models can also make predictions). It is traditionally used for small data sets.

## 3. Two main categories of ML
1. Supervised learning, in which the data comes with additional ***labels/attributes that we want to predict***. The problem can be either:
    1. Classification: the desired output is one of a finite number of **discrete categories**
        1. Examples: handwritten digit recognition, Iris classification, and spam-versus-ham email classification
    2. Regression: the desired output consists of one or more **continuous variables**
        1. Example: predict a student's final score (0-100) from their homework grades
![Figure 3: Machine Learning](handwritten.png)
2. Unsupervised learning, in which the training data consists of a set of input vectors x **without any corresponding target labels**. 
    1. Clustering: discover groups of similar examples within the data, e.g., group shoppers with similar behavior
![Figure 3: Machine Learning](clusters.png)
    2. Density estimation: determine the distribution of data 
    3. Dimensionality Reduction: project the data from a high-dimensional space down to low dimensions
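
As a rough illustration of the unsupervised tasks listed above, the sketch below clusters the Iris measurements and projects them onto two dimensions without ever looking at the labels. The use of `KMeans` and `PCA`, the choice of three clusters, and `random_state=0` are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = load_iris().data  # inputs only; the labels are deliberately ignored

# Clustering: group the samples into 3 clusters (3 is an assumed choice)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids.shape)  # one cluster id per sample

# Dimensionality reduction: project the 4 features down to 2
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)
```

The cluster ids are arbitrary integers; they are not guaranteed to match the species numbering used by the dataset.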

## 4. How does machine learning work?
Take supervised learning as an example:
1. First, train a machine learning model using labeled data
    1. Labeled data consists of inputs paired with labels (outputs)
    2. The model learns the relationship between the input data and the output (labels)
2. Then, make predictions on new data that was not used in training the model
    1. The primary goal of machine learning is to build a model that generalizes to new data
    
![Figure 2: Machine Learning](ML_tt.png)
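
The two steps above can be sketched with scikit-learn as follows. This is a minimal example, not the only way to do it: the k-nearest-neighbors classifier, the 70/30 split, and `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the labeled data to play the role of "new" data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 1: train the model on labeled data
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Step 2: predict on data that was not used in training
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print('test accuracy:', accuracy)
```

Accuracy on the held-out samples is a rough estimate of how well the model generalizes; accuracy on the training samples alone would be overly optimistic.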

## 5. Scikit-learn

1. Learn the machine learning basics from "An introduction to machine learning with scikit-learn" in the tutorials at https://scikit-learn.org/stable/tutorial/index.html

In [10]:
# import the scikit-learn package
import sklearn as sk # importing the package runs sklearn/__init__.py first
print('sklearn version:', sk.__version__)

# check scikit-learn folder: C:\Program Files\Anaconda3\Lib\site-packages\sklearn 

sklearn version: 0.24.2


In [12]:
# explore iris dataset
import sklearn.datasets as ds
iris = ds.load_iris()
print(iris)
# iris['data']: input data, an array of shape (number of samples, number of features)
# iris['target']: the label of each feature vector (the category it belongs to)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [16]:
# second method to get access to the data and target
# input and labels are numpy arrays
inpt = iris.data
labels = iris.target
print(type(inpt))
print(inpt)

<class 'numpy.ndarray'>
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3

In [17]:
print(inpt.shape) # shape of the array: (rows, columns) = (samples, features)

(150, 4)


In [20]:
# accessing the first feature vector (sample)
print(type(inpt[0]))
print(inpt[0])

# accessing the last feature vector
print(inpt[-1])

<class 'numpy.ndarray'>
[5.1 3.5 1.4 0.2]
[5.9 3.  5.1 1.8]
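
Beyond picking a single row, the same arrays can be sliced by column or filtered by label. The sketch below reloads the data so that it runs on its own; the variable names `first_feature`, `first_five`, and `class0` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
inpt, labels = iris.data, iris.target

first_feature = inpt[:, 0]   # one column: the first feature of every sample
first_five = inpt[:5]        # the first five samples (rows)
class0 = inpt[labels == 0]   # boolean mask: only the rows whose label is 0

print(first_feature.shape, first_five.shape, class0.shape)
print(np.unique(labels))     # the distinct label values
```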


## 6. Practice NumPy arrays after class

1. Learn the basics of NumPy arrays:
    1. https://numpy.org/devdocs/user/quickstart.html
    
2. Functions and Methods: concatenate, diagonal, dsplit, dstack, hsplit, hstack, newaxis, ravel, repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack

3. Ordering: argmax, argmin, argsort, max, min, searchsorted, sort

4. Math and statistics: cov, mean, std, var, all, any, inner, invert, max, maximum, median, min, minimum, nonzero, outer, prod, re, round, sort, sum, trace, transpose
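
A few of the functions listed above, tried on small arrays. The values in `scores` are made-up numbers used only to show the calls.

```python
import numpy as np

a = np.arange(6)              # six consecutive integers starting at 0
m = a.reshape(2, 3)           # view them as 2 rows x 3 columns
t = m.transpose()             # swap the axes: 3 rows x 2 columns
stacked = np.vstack([m, m])   # stack vertically: 4 rows x 3 columns
flat = m.ravel()              # flatten back to one dimension

scores = np.array([70, 95, 82, 60])  # made-up example values
order = np.argsort(scores)    # indices that would sort scores ascending
best = np.argmax(scores)      # index of the largest value
print(order, best, scores.mean(), scores.std())
```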



## 7. Example: Iris Classification
The 'Hello World!' task in machine learning is Iris classification. The Iris dataset can be loaded with scikit-learn:

```python
import sklearn.datasets as ds
iris = ds.load_iris()
```

1. 150 observations: 50 observations for each of 3 species
2. 4 features:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
3. Class labels (species): Iris-Setosa, Iris-Versicolour, and Iris-Virginica
![Figure 1: Machine Learning](Iris1.png)    
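
The properties listed above can be checked directly on the loaded dataset. This is a small sketch; the variable name `counts` is chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)       # (number of observations, number of features)
print(iris.feature_names)    # names of the 4 measurements
print(iris.target_names)     # names of the 3 species

counts = np.bincount(iris.target)  # number of observations per class label
print(counts)
```

Note that scikit-learn stores the species names as `setosa`, `versicolor`, and `virginica`, and encodes the labels as the integers 0, 1, and 2.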