 # <b>1 <span style='color:#F76241'>|</span> What is machine learning?</b>

<font size="9">M</font> achine learning (or `ML` for short) is the art of making programs that can *learn automatically from data*. 

What does `learn` mean in this context, you may be asking? 

> The "learn" in machine learning means a program can **teach itself** how to make decisions by reading vast amounts of data and looking for patterns. Machine learning systems are usually **not told explicitly** how to make decisions.

The process of learning from data is called `training`.

When talking about `ML` systems, people often use the term `pipeline`, which I define as follows:

> **A sequence of steps that are required for an ML system to function**. 

Wow, that's an abstract and hard-to-visualize definition. Here is a diagram that shows it instead:

<img src="assets/images/pipeline.jpg"  width="600" height="200">
<font size="1"> (image <a href="https://valohai.com/machine-learning-pipeline/">credits</a>) </font>

Each step in the `pipeline` serves an **important function** that the machine learning system needs in order to operate. It's aptly named "pipeline" because you can imagine data as water flowing through a series of pipes that, in this case, each have a different function:

<img src="assets/images/real_pipeline.jpg"  width="600" height="200">
<font size="1"> (image <a href="https://www.apollotechnical.com/what-is-pipeline-management-why-it-matters/">credits</a>) </font>


<div class="alert alert-block alert-info"><b>Note:</b> The pipeline visualizations above contain only the <em>beginning</em> of a pipeline because later steps involve more complex things like model deployment and maintenance. These notebooks <b>only</b> cover <b>steps 1-4</b>. Additionally, the beginning of a pipeline may sometimes involve <b>more</b> or <b>fewer</b> steps, depending on what you're working on, but the majority of pipelines will follow the above structure.</div>


Before diving into step 1 of the pipeline, there's some more machine learning background I need to cover. I *promise* we'll get to the fun stuff soon. Trust me!

# <b>2 <span style='color:#F76241'>|</span> Types of machine learning systems</b>

Not all machine learning models are the same. The following two subsections define what **`supervised`** and **`unsupervised`** learning are and how they differ from each other.
 

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">2.1<em> Supervised</em></p>
</div>

**`Supervised learning`** is a learning technique where you provide an ML model two things:

- `Data points` (**X**) - the information you want the model to learn from
- `Labels` (**y**) - the correct, labeled output attached to each instance of **X**.

The data points and labels are represented as (**X**, **y**) pairs.

The **`goal`** of supervised ML systems is to **train** them on training data (X) to predict the **correct labels** (y). We do this by showing them many instances of data in hopes that they will learn what we want them to learn. Each time the model makes a prediction, we **compare** it to the **correct label** to see if the model did well. This is what the `supervised` part refers to.

<div class="alert alert-block alert-info"><b>Note:</b> There are numerous ways to refer to <b>y</b> such as: <b>labels, targets, gold labels, classes, truths,</b> and possibly more. They all refer to the same idea: the correct labels that the model learns how to predict. I will use the terms interchangeably.</div>

<img style="float: left; padding: 0px 10px 0px 0px;" src="assets/images/shapes.jpg"  width="200" height="50">

Let's look at a very simple example. Say we wanted to train a model to predict a shape. The training data would consist of (**X**, **y**) pairs. Our **X** would consist of the shapes we want the model to recognize, and our **y** would be the correct label for each shape instance.

In this example, our **true labels** are _circle_, _square_, and _triangle_. These are the labels we need our model to learn how to predict correctly. To accomplish this, imagine we have 10,000 instances of these (**shape**, **label**) pairs. You could think of the training process as follows:

```
(🟥, square) -> Model predicts "circle" -> Check "circle" == square -> Incorrect
(🟥, square) -> Model predicts "triangle" -> Check "triangle" == square -> Incorrect
(🟥, square) -> Model predicts "square" -> Check "square" == square -> Correct
(🔴, circle) -> Model predicts "circle" -> Check "circle" == circle -> Correct
(🔺, triangle) -> Model predicts "triangle" -> Check "triangle" == triangle -> Correct

```

This process is followed for **every instance in our training data** and over time, as the model sees more examples, it will **gradually get better at predicting**. Thankfully, we don't need to do this manually, as machine learning APIs do this for us. But the intuition is important.
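
To make the predict-then-check loop above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `LookupModel` class is hypothetical, each shape is represented by its number of corners, and the "learning" is simple memorization. Real ML models learn by adjusting numeric parameters rather than keeping a lookup table, but the loop has the same shape.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: X is a shape's number of corners, y is its label.
training_data = [
    (4, "square"), (0, "circle"), (3, "triangle"),
    (4, "square"), (0, "circle"), (3, "triangle"),
]


class LookupModel:
    """Toy 'model' that remembers the label it has seen most often for each input."""

    def __init__(self, default_label="circle"):
        self.default_label = default_label
        self.counts = defaultdict(Counter)

    def predict(self, x):
        seen = self.counts.get(x)
        if not seen:
            # Nothing learned for this input yet, so fall back to a fixed guess.
            return self.default_label
        return seen.most_common(1)[0][0]

    def update(self, x, y):
        # "Learning" here just means counting the correct label for this input.
        self.counts[x][y] += 1


model = LookupModel()
results = []
for x, y in training_data:
    prediction = model.predict(x)   # the model makes a prediction
    is_correct = prediction == y    # we compare it to the correct label
    results.append(is_correct)
    model.update(x, y)              # the model learns from the labeled example
    print(f"({x} corners, {y}) -> predicted {prediction!r} -> "
          f"{'Correct' if is_correct else 'Incorrect'}")
```

The first time the model sees a square or a triangle it can only guess, but once it has seen each shape with its label, its later predictions match.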


<div class="alert alert-block alert-success">You can think of this as being <em>similar</em> to how humans learn; the more we do things, the better we get at it. It's (roughly) the same concept, although humans require <em>significantly</em> less "training data" to learn things.</div>

If this all seems too abstract at the moment, don't fret! When we start constructing machine learning models and actually applying these concepts, it will be easier to visualize.

Next, I will briefly go over `unsupervised learning`, the counterpart to `supervised learning`. While we won't be doing any unsupervised learning in these notebooks, it's still good to know because it underlies common data-related techniques such as clustering and dimensionality reduction.





<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">2.2<em> Unsupervised</em></p>
</div>

**`Unsupervised learning`** is a learning technique where you provide an ML model only one thing:

- `Data points` (**X**)

**Why** is there no **y**, you may be asking? Good question! The **`goal`** of unsupervised ML systems is to take **unlabeled data** and try to **find patterns** by looking at relationships between data points. One class of algorithms that works this way is **clustering algorithms**, whose goal is to group similar points together into separate clusters.

In the following graphic, the left plot (a) would be the input to the system, and the right plot (b) would be the clusters created from it. You can see that it creates three clusters from the original data. 

<img src="assets/images/unsupervised_example.jpg"  width="400" height="100">
<font size="1"> (image <a href="https://www.researchgate.net/figure/Example-of-K-Means-clustering-a-Original-data-b-Data-grouped-into-three-clusters_fig2_333992148">credits</a>) </font>
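
If you're curious what a clustering algorithm looks like in code, here is a rough sketch of the k-means idea in plain Python. It is only an illustration under some assumptions: the points are made-up 1-D numbers (the plot above is 2-D), the starting centroids are handed in by the caller, and `kmeans_1d` is a hypothetical helper name, not a library function.

```python
def kmeans_1d(points, centroids, n_iterations=10):
    """Tiny k-means sketch for 1-D data with caller-supplied starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(n_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up, unlabeled data: note that there is no y anywhere.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 4.0, 10.0])

for centroid, cluster in zip(centroids, clusters):
    print(f"centroid {centroid:.2f}: {cluster}")
```

The algorithm is never told which group a point belongs to; the three groups fall out of the distances between the points alone.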

Again, we won't be doing any unsupervised learning here. But these techniques are very useful. If you have extra time, I highly advise you to give this <a href="https://pythonistaplanet.com/applications-of-unsupervised-learning/" target="_blank">article</a> a read through. It doesn't get too technical and explains common use-cases of unsupervised ML very well.


<div class="alert alert-block alert-info"><b>Note:</b> There are more types of learning, but they are more advanced and require understanding of the two described above. Here are some of them: <a href="https://ai.stackexchange.com/questions/10623/what-is-self-supervised-learning-in-machine-learning" target="_blank">self-supervision</a>, <a href="https://www.altexsoft.com/blog/semi-supervised-learning/" target="_blank">semi-supervision</a>, <a href="https://www.snorkel.org/blog/weak-supervision" target="_blank">weak-supervision</a>, and <a href="https://www.synopsys.com/ai/what-is-reinforcement-learning.html" target="_blank">reinforcement learning</a>. They're not important to know for these notebooks, but feel free to read these links.</div>

 # <b>3 <span style='color:#F76241'>|</span> Regression vs. classification</b>
 
 Yet another distinction to be made is between two types of _supervised_ machine learning tasks: `regression` and `classification`. While both follow the same idea of showing a model examples and checking its predictions, there are key differences between the two.
 
 <div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">3.1<em> Regression</em></p>
</div>

The **`goal`** of regression tasks is to predict a continuous number. **Continuous numbers** can take on any of the infinitely many values within a range. These values are typically obtained by measuring, and thus are often precise and involve fractions/decimal points.

Examples of continuous variables include:

<table>
<tr><th>Student information</th><th>Weather information</th></tr>
<tr><td>

|GPA (1-4)|Height (ft, in)|Weight (lbs)|
|--|--|--|
|2.6| 5'4| 156|
|4.0| 6'1| 210|
|3.4| 4'9| 100|

</td><td>

|Wind speed (mph)|Temperature (F)|
|--|--|
|21|55|
|30|42|
|19|89|

</td></tr> </table>

In the `Student information` table, we could try to predict a student's **height** given their **weight**, or vice versa. In the `Weather information` table, we could try to predict **temperature** given **wind speed** (although that probably wouldn't work well).
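
As a rough sketch of what "predict height given weight" could look like, the snippet below fits a straight line to the three rows of the `Student information` table using ordinary least squares. The conversion of heights to inches, the `fit_line` helper, and the 180 lbs query are my own assumptions for illustration, and three data points are far too few for a trustworthy model.

```python
# Rows from the Student information table, with heights converted to inches.
weights_lbs = [156, 210, 100]
heights_in = [64, 73, 57]  # 5'4, 6'1, 4'9


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(weights_lbs, heights_in)

# The prediction is a continuous number, which is what makes this regression.
predicted_height = slope * 180 + intercept
print(f"Predicted height for a 180 lbs student: {predicted_height:.1f} in")
```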


<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">3.2<em> Classification</em></p>
</div>

The **`goal`** of classification tasks is to predict a **category** or **label** of something. 

The shapes example in `section 2.1` is an example of classification; we are trying to predict one of three possible **labels**: circle, square, triangle. 

Another example would be the positive or negative classification of movie reviews (called sentiment analysis). Our labels would be **0** and **1**, corresponding to **negative sentiment** and **positive sentiment**, respectively:

<table>
<tr><th>Movie reviews</th></tr>
<tr><td>

| Review text                                    | Sentiment |
|------------------------------------------------|-----------|
| This movie was awful!                          | 0         |
| I loved it! The characters are so well written. | 1         |
| It was pretty good, nothing special.           | 1         |
| Literally the worst movie ever.                | 0         |
    
</td></tr> </table>
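
To give a feel for how labels like these can be learned, here is a deliberately naive sketch that counts how often each word appears in positive and negative reviews from the table above, then scores new text with those counts. The `tokenize` and `predict_sentiment` helpers are hypothetical names, and real sentiment classifiers are far more sophisticated; treat this as an illustration of "learn from (X, y) pairs, then predict a category", nothing more.

```python
from collections import Counter

# The (review text, sentiment) pairs from the Movie reviews table.
reviews = [
    ("This movie was awful!", 0),
    ("I loved it! The characters are so well written.", 1),
    ("It was pretty good, nothing special.", 1),
    ("Literally the worst movie ever.", 0),
]


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [word.strip(".,!?").lower() for word in text.split()]


# "Training": count how often each word shows up under each label.
word_counts = {0: Counter(), 1: Counter()}
for text, label in reviews:
    word_counts[label].update(tokenize(text))


def predict_sentiment(text):
    """Predict 1 if the words lean positive in the training counts, else 0."""
    score = sum(word_counts[1][w] - word_counts[0][w] for w in tokenize(text))
    return 1 if score > 0 else 0


print(predict_sentiment("I loved the characters"))  # leans positive
print(predict_sentiment("The worst movie"))         # leans negative
```

Notice that the output is one of a fixed set of labels (0 or 1) rather than a continuous number.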

The learning process for classification is a bit more involved than it is for regression. Thus, this repository (at least for now) will focus solely on classification.


 # <b>4 <span style='color:#F76241'>|</span> Train, dev, test</b>
 
 We can't just throw all of our data into an ML algorithm and tell it to do well. First, we need to partition the data into three splits: **train**, **dev**, and **test**.
 
 
<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">4.1<em> Training</em></p>
</div>

The `train` split is self-explanatory. It is the partition used to train the model. This consists of the (**X**, **y**) pairs where **X** is the data point and **y** is the true value associated with that data point. In general, the **more training data** there is, the **better your model learns how to predict things**. This tends to hold whether you're using a simple linear regression model or a deep **neural network** (in fact, neural networks require _a lot_ more data than standard ML algorithms).

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">4.2<em> Development</em></p>
</div>

There is a **`golden rule`** in machine learning that must _always_ be adhered to: 

> **Never, EVER, evaluate on the training set**. 

The goal of an ML model is to perform well on **unseen** data points. But the training data is nothing _but_ seen data points, so you can't use it for evaluation. If you do, you'll of course get very high scores, but when you go to use this model for something, it'll likely perform very poorly. We need splits **separate from the training data** to evaluate models. 
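
Here is a small sketch of why evaluating on the training set is misleading. The "model" below is an extreme, made-up case that simply memorizes its training pairs and always guesses `1` for anything it hasn't seen; the word lists and helper names are assumptions for illustration.

```python
# Hypothetical (word, sentiment) pairs: one set to train on, one kept separate.
train_pairs = [("awful", 0), ("loved", 1), ("worst", 0), ("good", 1)]
held_out_pairs = [("terrible", 0), ("great", 1), ("boring", 0), ("fantastic", 1)]

# "Training" is pure memorization of the training pairs.
memory = dict(train_pairs)


def predict(x):
    # Return the memorized label, or always guess 1 for unseen inputs.
    return memory.get(x, 1)


def accuracy(pairs):
    """Fraction of pairs where the prediction matches the true label."""
    return sum(predict(x) == y for x, y in pairs) / len(pairs)


print(f"Accuracy on training data: {accuracy(train_pairs):.2f}")
print(f"Accuracy on held-out data: {accuracy(held_out_pairs):.2f}")
```

The score on the training pairs looks perfect, while the score on the held-out pairs is no better than always guessing the same label. Only the second number says anything about how the model handles unseen data.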

The purpose of the `development (dev)` set (also known as the `validation (val)` set) is to evaluate how well your model is doing **during the development of your machine learning model**. So after training, we evaluate it on the `dev` set to see how well the model does. If the performance is not great, we can then tweak the ML model's settings, add more training data, apply more preprocessing techniques, etc. to try to improve the scores. 

If we're satisfied with the scores on the `dev` set, we can move onto the next split: the `test` set.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px;">
    <p style="padding: 8px;color:white;text-align: center;">4.3<em> Testing</em></p>
</div>

After we've tweaked our model and have evaluated it on the `dev` set, we need to do one final test before using this model to accomplish tasks: evaluating on the `test` set.

The purpose of the `testing (test)` set (also known as the `evaluation (eval)` set or `held-out` set) is to be the final test before using the model. In general, we want to avoid exposing our model to this split as much as possible, since tweaking the model to improve its performance on the `test` set defeats the purpose of having a **final evaluation**. It's very easy to accidentally do this, which makes the `dev` split even more important.
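
Putting the three splits together, here is one way the partitioning could be sketched in plain Python. The 80/10/10 ratio, the fixed seed, and the `train_dev_test_split` helper are assumptions for illustration (the right proportions depend on how much data you have), and ML libraries typically provide their own utilities for this.

```python
import random


def train_dev_test_split(pairs, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle (X, y) pairs and partition them into train, dev, and test lists."""
    shuffled = list(pairs)
    # Shuffle first so that each split is a random sample of the data.
    random.Random(seed).shuffle(shuffled)

    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)

    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Made-up dataset of 100 (X, y) pairs.
data = [(i, i % 2) for i in range(100)]
train, dev, test = train_dev_test_split(data)

print(len(train), len(dev), len(test))  # 80 10 10
```

Each pair lands in exactly one split, so the model is never evaluated on a data point it was trained on.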


<div class="alert alert-block alert-success"><b>More info:</b> Here is a <a href="https://www.v7labs.com/blog/train-validation-test-set" target="_blank">link</a> that covers the train, validation, and test splits in more detail.</div>