# Setup

    git clone https://github.com/frnsys/ml101
    cd ml101
    pip install -r requirements.txt
    jupyter notebook
    
---

# What is Machine Learning?

At its most basic:

$$
\text{data} + \text{algorithm} = ???
$$

- About using algorithms to infer things from data, such as hidden structure, patterns, or relationships
- An amalgamation of techniques & concepts from many different fields
    
    
## What can you do with ML?

- Predict things
- Automate things/make decisions
- Gain insight into a system
- Emulate a system

## What we'll cover today

How to use:

1. [`scikit-learn`](http://scikit-learn.org/stable/) for _supervised_ learning (linear regression, logistic regression), _unsupervised_ learning (clustering), and a little bit of natural language processing
2. [`pandas`](http://pandas.pydata.org/) for handling data
3. [`matplotlib`](http://matplotlib.org/) and [`seaborn`](https://web.stanford.edu/~mwaskom/software/seaborn/) for visualizing data

## How things will be structured

1. I'll walk you through a simple example introducing each concept
2. You'll apply what you learned to a short exercise

## Assumptions

- You have some Python experience
- You know a bit of high school math

---

# Two types of problems

Broadly speaking, we deal with two types of data:

- __Continuous__ data - such as temperatures, stock prices, heights, weights. These are __measured__.
- __Discrete__ data - such as gender, yes/no responses, species, dice rolls. These are __countable__.

![](../assets/discrete_vs_continuous.svg)

And these two types of data are typically associated with two types of problems:

- Problems involving continuous data are usually __regression__ problems, where we want to predict a numeric value. For example, what will the temperature be tomorrow?
- Problems involving discrete data are usually __classification__ problems, where we want to identify how to label something. For example, is this email spam or not?
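To make the difference in outputs concrete, here is a minimal sketch in plain Python. The data, the function names, and the "free money" rule are all made up for illustration. Nothing is learned here; the point is only that a regression answer is a number while a classification answer is a label.

```python
# Regression: the answer is a number on a continuous scale.
recent_temps = [18.5, 19.0, 21.5, 20.0]  # made-up temperatures in Celsius

def predict_temperature(history):
    """Naive regression baseline: predict the mean of past values."""
    return sum(history) / len(history)

# Classification: the answer is one label from a fixed set.
def classify_email(text):
    """Hand-written rule, standing in for a learned classifier."""
    return "spam" if "free money" in text.lower() else "ham"

temperature = predict_temperature(recent_temps)       # a float
label = classify_email("Claim your FREE MONEY now")   # "spam" or "ham"
print(temperature, label)
```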

---

# Two types of machine learning

The two primary categories of machine learning are:

- __Supervised__ learning - you have some "ground truth" data, that is, data that you know the "answers" for. For example, the daily temperature for the last year or a set of emails labeled "spam" or "ham" (not spam).
- __Unsupervised__ learning - you don't have ground truth data, but you want to uncover something about the _structure_ of the data. For instance, given the weights and heights of a population of animals, are there any distinct groups in the data?

There are other types of learning as well, such as reinforcement learning, but they are encountered less frequently.
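As a rough sketch of the unsupervised case, the code below looks for two groups in a list of animal weights without being told which animal belongs where. It is a tiny one-dimensional version of k-means with two clusters, written in plain Python. The weights are made up, the function name `two_means` is hypothetical, and the code assumes the list contains at least two distinct values.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two groups with a tiny k-means (k=2)."""
    low, high = min(values), max(values)  # initial guesses for the group centers
    for _ in range(iterations):
        # Assign each value to whichever center is closer.
        group_a = [v for v in values if abs(v - low) <= abs(v - high)]
        group_b = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each center to the mean of its group.
        low = sum(group_a) / len(group_a)
        high = sum(group_b) / len(group_b)
    return sorted(group_a), sorted(group_b)

weights_kg = [4.1, 3.8, 4.5, 30.2, 28.9, 31.5]  # made-up animal weights
small, large = two_means(weights_kg)
print(small, large)
```

No labels were provided; the two groups come purely from the structure of the data.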

---

# Supervised Learning

- Provide the algorithm with input data and known answers (output) for each input
- The algorithm learns the relationship between the input and the output
- It then returns a function that describes this relationship
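The three steps above can be sketched in plain Python. This is only an illustration, not how any particular library does it: the "algorithm" here is a one-nearest-neighbor rule, and the data (exclamation marks per email) and the name `fit_nearest_neighbor` are made up for the example.

```python
def fit_nearest_neighbor(inputs, answers):
    """Take inputs with known answers and return a prediction function."""
    examples = list(zip(inputs, answers))

    def predict(x):
        # Answer with the label of the closest training input.
        _, nearest_answer = min(examples, key=lambda pair: abs(pair[0] - x))
        return nearest_answer

    return predict

# Made-up training data: exclamation marks in an email -> known label
exclamation_counts = [0, 1, 2, 8, 10, 12]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

predict = fit_nearest_neighbor(exclamation_counts, labels)
print(predict(1), predict(9))
```

The returned `predict` function is the learned relationship: it can be applied to inputs that were not in the training data.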


## Describing the world in functions

Phenomena can be described mathematically, i.e. by some function.

For example, there is some relationship:

- between a house's size and its sale price
    - e.g. $\text{sale price} = 200 \times \text{square footage}$
- between a runner's speed and their finishing time
    - e.g. $t = \frac{d}{s}$
- between a deer's weight and its height


Supervised machine learning algorithms try to uncover this function!
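As a minimal sketch of what "uncovering the function" can look like, take the house price example and assume the relationship is $\text{sale price} = w \times \text{square footage}$ for some unknown $w$. Given a few example sales, least squares picks the $w$ that minimizes the squared error, which for a line through the origin works out to $w = \sum x_i y_i / \sum x_i^2$. The sales figures below are made up, and so is the function name.

```python
def fit_price_per_square_foot(sizes, prices):
    """Least-squares slope for a line through the origin: price = w * size."""
    numerator = sum(x * y for x, y in zip(sizes, prices))
    denominator = sum(x * x for x in sizes)
    return numerator / denominator

# Made-up sales: roughly 200 per square foot, with a little noise
square_footage = [1000, 1500, 2000]
sale_prices = [201_000, 298_500, 401_000]

w = fit_price_per_square_foot(square_footage, sale_prices)
print(f"estimated price per square foot: {w:.2f}")
```

The estimate lands close to 200 even though that number was never given to the code; it was inferred from the examples.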