# LeetCode Machine Learning Course
## Basic Concepts in ML: Machine Learning 101

### **Overview**

In this course, we will learn some of the basic concepts in the domain of machine learning. 

When you complete this course, you will be able to: 
- Identify different types of machine learning. 
- Explain what a machine learning model is.
- Describe the general workflow of building and applying a machine learning model.
- Discuss the advantages and disadvantages of machine learning algorithms.

### **What is a Machine Learning Model?** 

In this section, we will define the notions of Machine Learning (ML) algorithms and models, then we will classify the various ML algorithms.

So what is a machine learning algorithm? A machine learning algorithm is the process that uncovers the underlying relationship within the data. The outcome of a machine learning algorithm is known as the **machine learning model**, which can be considered a function $F$ that outputs certain results when given an input. Rather than being a predefined and fixed function, a machine learning model is derived from historical data. As a result, when the algorithm is fed different data, it produces a different model.
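To make the algorithm/model distinction concrete, here is a minimal sketch. It assumes a toy learning algorithm (ordinary least squares for a straight line) and two made-up datasets; the names `fit_line`, `model_a`, and `model_b` are illustrative and not part of the course material.

```python
def fit_line(xs, ys):
    """Toy learning algorithm: least-squares fit of y = a*x + b.

    Returns the learned model, i.e. a function F that maps an input to an output.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x

    def model(x):
        return a * x + b

    return model


# The same algorithm, fed with different (made-up) data, yields different models.
model_a = fit_line([1, 2, 3, 4], [2, 4, 6, 8])     # data follow y = 2x
model_b = fit_line([1, 2, 3, 4], [5, 8, 11, 14])   # data follow y = 3x + 2

print(model_a(5), model_b(5))
```

The algorithm (`fit_line`) stays fixed, while the model it returns depends entirely on the data it was given.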

The task of machine learning is to *learn* the function from the vast mapping space. A given machine learning problem is either a **supervised learning** or an **unsupervised learning** task. In both cases, we begin with a given dataset, which consists of a group of *samples*. Each sample can be represented as a tuple of *attributes*. 

In a *supervised learning* task, each data sample contains a target attribute $y$, referred to as the *ground truth*. The objective is to learn a function $F$ that takes the non-target attributes $X$ and outputs a value that approximates the target attribute, i.e. $F(X) \approx y$. To summarize, in a supervised learning task, we are working with *labeled* data.
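As an illustration, the sketch below represents each sample as a tuple whose last element is the target attribute, separates $X$ from $y$, and uses a 1-nearest-neighbor rule as a stand-in for $F$. The dataset, the tuple layout, and the choice of nearest-neighbor are all assumptions made for this example.

```python
# Each sample is a tuple of attributes; by assumption the last one is the target y.
samples = [
    (1.0, 1.0, "small"),
    (1.5, 2.0, "small"),
    (8.0, 9.0, "large"),
    (9.0, 8.5, "large"),
]

X = [s[:-1] for s in samples]   # non-target attributes
y = [s[-1] for s in samples]    # target attribute (ground truth)


def F(x):
    """1-nearest-neighbor model: return the label of the closest training sample."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = min(range(len(X)), key=lambda i: squared_distance(X[i], x))
    return y[nearest]


print(F((1.2, 1.1)), F((8.5, 9.0)))
```

On the labeled samples themselves, `F(X[i])` returns `y[i]`, which is the $F(X) \approx y$ relationship in its simplest form.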

On the other hand, with an *unsupervised learning* task we are working with *unlabeled* data. Here are two examples of unsupervised learning tasks: 
- **Clustering**: Given a dataset, one can cluster the samples into groups, based on the similarities among the samples within the dataset. 
- **Association**: Given a dataset, the association task is to uncover the hidden association patterns among the attributes of a sample. 
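
The clustering task above can be sketched with a bare-bones k-means on one-dimensional data. This is only an illustration under simplifying assumptions (made-up values, a deterministic initialization, a fixed number of iterations); `kmeans_1d` is a hypothetical helper, not a library function.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Group 1-D values into k clusters by similarity (distance to a cluster center)."""
    # Deterministic initialization: spread the initial centers over the sorted values.
    ordered = sorted(values)
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]

    for _ in range(iterations):
        # Assignment step: put each value in the group of its closest center.
        groups = [[] for _ in range(k)]
        for v in values:
            closest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[closest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [
            sum(g) / len(g) if g else centers[i]
            for i, g in enumerate(groups)
        ]

    labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
    return labels, centers


# Unlabeled data: no target attribute is given.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
labels, centers = kmeans_1d(values, k=2)
print(labels, centers)
```

Note that the cluster indices are found from the data alone; no ground truth is involved.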

When a dataset is massive but the labeled samples are few, one can combine unsupervised and supervised learning, an approach known as **semi-supervised learning**.

If the output of a machine learning model is a *discrete* value, e.g. a boolean value, then it is a **classification** model. 

If the output of a machine learning model is a *continuous* value, then it is a **regression** model.
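
The difference shows up in the type of value a model returns. In the sketch below both functions are hypothetical and their coefficients are made up; they stand in for models that would normally be learned from data.

```python
def predict_price(area_m2):
    """Regression-style model: the output is a continuous value."""
    return 1500.0 * area_m2 + 20000.0


def is_expensive(area_m2, threshold=200000.0):
    """Classification-style model: the output is a discrete value, here a boolean."""
    return predict_price(area_m2) > threshold


print(predict_price(100), is_expensive(100), is_expensive(150))
```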

### **How does a Machine Learning Model Work?** 

The main objective of a machine learning workflow is to build a machine learning model, and this model is obtained from our given data. As a result, it is the data that determines the *upper bound* of performance that a given ML model can achieve. Multiple different models may fit the given dataset. Hence, the best we can do is to find a model that comes as close as possible to the upper bound set by the data. We should not expect a model to learn something that is not in the data. 

Now, let us discuss the machine learning workflow. It is important to understand that *the workflow to build a machine learning model is centered on the data*. Below is a diagram showing this process in further detail. 

![image.png](attachment:image.png)

Starting from our data, we first determine which type of machine learning problem we would like to solve (supervised or unsupervised). If one of the attributes in the data is the target attribute, then we say that the data is *labeled*, and the problem is a supervised learning problem. Furthermore, following our prior definitions, it may be a classification or regression task. 

Once we determine which type of model we would like to build from the data, we can begin performing *feature engineering*. **Feature Engineering** is the group of activities that transform the data into the desired format. Here are some examples: 
- In most cases, we *split* the data into two groups: training and testing. The *training dataset* is used to train a model, while the *testing dataset* is then used to test or validate whether the model we build is general enough to be applied to unseen data. 
- The raw dataset is often incomplete, with missing values, so we need to fill those missing values using a strategy such as filling with the average value. 
- The dataset often contains categorical attributes. We need to *encode* those categorical string values into numerical ones, because the algorithm may only accept numerical input. 
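
The three activities above can be sketched in plain Python as follows. The dataset, the 80/20 split ratio, the random seed, and the `prepare` helper are assumptions for illustration. The sketch computes the fill value and the category codes from the training portion only, which is one common choice rather than the only one.

```python
import random

# Hypothetical raw dataset of (age, city) samples; None marks a missing value.
raw = [
    (25, "Paris"), (None, "Tokyo"), (40, "Paris"),
    (31, "Lima"), (None, "Tokyo"), (52, "Lima"),
    (38, "Tokyo"), (29, "Paris"), (45, "Lima"), (33, "Tokyo"),
]

# 1. Split the data into training and testing sets (80/20) after a seeded shuffle.
rng = random.Random(0)
shuffled = raw[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train_raw, test_raw = shuffled[:cut], shuffled[cut:]

# 2. Compute the average of the observed ages in the training set;
#    it is used to fill in missing values.
known_ages = [age for age, _ in train_raw if age is not None]
mean_age = sum(known_ages) / len(known_ages)

# 3. Encode each categorical string as an integer code.
cities = sorted({city for _, city in train_raw})
city_codes = {city: code for code, city in enumerate(cities)}


def prepare(rows):
    """Fill missing ages and encode cities; unseen cities get the code -1."""
    return [
        (mean_age if age is None else float(age), city_codes.get(city, -1))
        for age, city in rows
    ]


train, test = prepare(train_raw), prepare(test_raw)
print(train)
print(test)
```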

The feature engineering process is not a one-off step, as one usually needs to come back to it repeatedly later in the workflow. Once the data is ready, we select one of the machine learning algorithms and start to feed the algorithm with the prepared **training data**. This is what we call the **training process**. Once we obtain our model at the end of the training process, we test the model with the reserved **testing data**. This is what we call the **testing process**. It is rarely the case that we are happy with our first trained model. One would then go back to the training process and tune some of the parameters exposed by the selected model. This is what we call **hyper-parameter tuning**. 
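
The train, test, and tune loop can be sketched with a small k-nearest-neighbors classifier, where `k` plays the role of the hyper-parameter. The data are made up (including one deliberately mislabeled training sample), and the helpers `knn_predict` and `accuracy` are hypothetical. In practice, a separate validation split is commonly used for tuning so that the testing data stays untouched; this sketch keeps a single held-out set for brevity.

```python
from collections import Counter

# Made-up labeled data: ((feature_1, feature_2), label).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.3), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.3), "b"),
    ((1.1, 1.05), "b"),   # a noisy, mislabeled training sample
]
test = [
    ((1.1, 1.1), "a"), ((5.1, 5.1), "b"),
    ((0.8, 0.9), "a"), ((4.9, 4.7), "b"),
]


def knn_predict(train_set, x, k):
    """Predict by majority vote among the k nearest training samples."""
    by_distance = sorted(
        train_set,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_set, test_set, k):
    """Testing process: fraction of held-out samples predicted correctly."""
    hits = sum(knn_predict(train_set, x, k) == label for x, label in test_set)
    return hits / len(test_set)


# Hyper-parameter tuning: evaluate several values of k and keep the best one.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With `k = 1` the noisy sample misleads the model on one test point, while larger values of `k` outvote it, which is why tuning changes the result here.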

For supervised learning algorithms, i.e., classification and regression, there are two scenarios where the generated model does not fit the data properly: **underfitting** and **overfitting**. An *underfitting model* is one that does not fit the training data well, i.e. it deviates significantly from the ground truth. An *overfitting model* is one that fits the training data well, i.e. with little or no error, but does not generalize well to unseen data.
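
The sketch below contrasts the two failure modes on made-up data that roughly follow $y = 2x$. The three models are illustrative choices, not prescribed by the course: a constant predictor (too simple, so it underfits), a least-squares line, and a model that memorizes the training samples (so it overfits).

```python
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.5, 1.6, 4.4, 5.7, 8.5, 9.6]   # roughly y = 2x plus noise (made up)
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1.0, 3.0, 5.0, 7.0, 9.0]          # unseen samples from y = 2x


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Underfitting candidate: always predict the average training target.
mean_y = sum(train_y) / len(train_y)

def constant_model(x):
    return mean_y


# Reasonable fit: least-squares straight line.
mean_x = sum(train_x) / len(train_x)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    / sum((x - mean_x) ** 2 for x in train_x)
)
intercept = mean_y - slope * mean_x

def linear_model(x):
    return slope * x + intercept


# Overfitting candidate: memorize the training set and return the target
# of the nearest training input, noise included.
def memorizing_model(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [
    ("constant", constant_model),
    ("linear", linear_model),
    ("memorizing", memorizing_model),
]:
    train_error = mse(model, train_x, train_y)
    test_error = mse(model, test_x, test_y)
    print(f"{name:>10}: train MSE = {train_error:.3f}, test MSE = {test_error:.3f}")
```

The constant model has a large error even on the training data, whereas the memorizing model has zero training error but a larger test error than the straight line.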