# Chapter 1 Notes

The most successful ML algorithms automate decision-making processes by generalizing from known examples. These methods are known as __supervised learning__ models. The user provides pairs of inputs and desired outputs, and the model learns how to produce the desired output for a given input (this is training the model). The performance of these methods is also easy to measure.

Examples of supervised ML:
* IDing zip codes from handwritten envelopes
* Determining whether a tumor is benign based on an image
* Detecting fraudulent credit behavior
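
A minimal sketch of the supervised workflow, assuming scikit-learn is installed. The toy numbers and the choice of `DecisionTreeClassifier` are illustrative assumptions, not something these notes prescribe; the point is only that the model is trained on input/output pairs and then asked about new inputs.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one input feature per sample, and the desired output for each.
inputs = [[1], [2], [3], [10], [11], [12]]
desired_outputs = [0, 0, 0, 1, 1, 1]

# Training: the model sees the (input, output) pairs.
model = DecisionTreeClassifier(random_state=0)
model.fit(inputs, desired_outputs)

# Generalizing: the model predicts outputs for inputs it was not trained on.
predictions = model.predict([[2.5], [10.5]])
print(predictions)
```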

We can also do __unsupervised learning__. In this case only the inputs are known; no output data is given to the algorithm. These methods are harder to understand, and their performance is harder to evaluate.

Examples of unsupervised ML:
* IDing blog post topics
* Grouping customers with similar preferences
* Detecting abnormal access patterns to a website
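
A rough sketch of the unsupervised case, assuming scikit-learn is installed. The points and the use of `KMeans` are illustrative assumptions; what matters is that only inputs are passed to `fit`, with no desired outputs.

```python
from sklearn.cluster import KMeans

# Hypothetical inputs only: two visually separate groups of 2D points, no labels.
inputs = [
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],
]

# The algorithm has to find structure on its own; here we ask for two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(inputs)

# Cluster ids are arbitrary integers, so only "same group or not" is meaningful.
cluster_ids = kmeans.labels_
print(cluster_ids)
```

Because there are no known outputs to compare against, there is no direct accuracy score here, which is part of why these methods are harder to evaluate.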

Tables are often the best way to represent the data. Each data point (a customer, email, transaction, etc.) is a row in the dataset. Each property describing the data point is a column. Each row is called a __sample__. Each column is called a __feature__.
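
To make the row/column vocabulary concrete, here is a small sketch using NumPy. The customer values and feature names are made up for illustration.

```python
import numpy as np

# Hypothetical customer table: each row is a sample, each column is a feature.
feature_names = ["age", "purchases_per_month", "avg_basket_usd"]
customers = np.array([
    [34, 2, 41.50],
    [51, 7, 18.20],
    [27, 1, 96.00],
])

first_sample = customers[0]    # one row: everything known about one customer
age_feature = customers[:, 0]  # one column: the same property for every customer

print(first_sample)
print(age_feature)
```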

_One of the most important parts of ML is understanding your data and how it relates to your task_.

### Classifying Iris species

_Problem:_ We want to determine the species of an iris from measurements of the flower (length and width of the petals and sepals). We have measurements from irises whose species have already been identified, so this is a supervised learning problem. More specifically, it is a __classification problem__. The possible outputs of the model (the different species) are the __classes__.

The output we want is the species.  For a given data point the species it belongs to is its __label__.
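
As a quick illustration of classes versus labels, the sketch below inspects the iris dataset bundled with scikit-learn. The variable names `first_label` and `first_species` are just for this example.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# The classes are the possible species.
print(iris_dataset["target_names"])

# Each sample's label is stored as an integer that indexes into target_names.
first_label = iris_dataset["target"][0]
first_species = iris_dataset["target_names"][first_label]
print(first_label, first_species)
```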

The __shape__ of your data is the number of samples by the number of features, i.e. the pair (number of samples, number of features).
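
For the iris data this can be checked directly. A short sketch, assuming the bundled scikit-learn copy of the dataset:

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# The data array has one row per flower and one column per measurement.
n_samples, n_features = iris_dataset["data"].shape
print(n_samples, n_features)

# There is exactly one label per sample.
print(iris_dataset["target"].shape)
```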

#### Measuring model success

You cannot evaluate the model on the same data used to train it: the model can simply remember the training set, so that score won't tell us whether it has generalized well. To test performance we need to show the model new data that it hasn't seen before. This is usually done by splitting the labelled data into a __training set__ and a __test set__. `scikit-learn` has a built-in function to do this, `train_test_split`. By default, 75% of the data goes to the training set and 25% goes to the test set.

The convention is that `X` is the input data (capitalized because it is a two-dimensional array, i.e. a matrix) and `y` is the output (lowercase because it is a one-dimensional array, i.e. a vector).

```python
# - train_test_split is in the model_selection module of sklearn
# - iris_dataset['data'] are the inputs
# - iris_dataset['target'] are the labels
# - set random_state so that the same split is produced every time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()  # sklearn includes some sample datasets to get started

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
```
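
To see what the split produces and how the test set is used, here is a hedged sketch that repeats the split and then scores a model on the held-out data. The `KNeighborsClassifier` with `n_neighbors=1` is only a stand-in chosen for this example; any classifier with `fit` and `score` would do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# With the default 75/25 split, 150 samples become 112 training and 38 test samples.
print(X_train.shape, X_test.shape)

# Stand-in model (an assumption for this sketch): fit on the training set only.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# score() reports accuracy on data the model has not seen during training.
test_accuracy = knn.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.2f}")
```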