# Week_01 - Linear Regression

### What is machine learning?

- Machine learning is a class of algorithms that solve the task of `automatic pattern recognition`.
- There are two main paradigms of programming: `imperative` and `functional`. Machine learning engineering can be regarded as a third paradigm, different from the other two as follows. Let's say you have a particular `task` you want to solve.
  - In imperative and functional development, `you write the code directly`; you tell the machine what it has to do `explicitly`, and you write exactly the code that solves the task by outlining and connecting multiple steps together (i.e. `you` create the algorithm).
  - In machine learning you are **not** explicitly / directly writing the logic that would solve the task. Instead, you build a `machine learning model` that models the task you are trying to solve and `the model itself creates the algorithm` for solving your task.
- A machine learning model is a set of parameters connected in various ways. It solves tasks by finding optimal values of those parameters, i.e. values for which it can solve the task in **most** cases. The word **most** is important: a machine learning model is nothing more than an `automatic mathematical optimization model` for a set of parameters. **`It solves tasks by approximation, not by building explicit logic`**.
- The process during which the model optimizes its parameters is called `training`.
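
To make "optimizing parameters" concrete, here is a minimal sketch of training a linear regression model with two parameters, `w` and `b`. The toy data, the starting values, the value of `learning_rate`, the number of steps and the choice of gradient descent on the mean squared error are all assumptions made for illustration, not a prescribed recipe.

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 (no noise), for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# The model's parameters, starting from arbitrary values.
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed value

for step in range(2000):
    predictions = w * x + b
    errors = predictions - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(errors * x)
    grad_b = 2.0 * np.mean(errors)
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Nobody wrote the rule "multiply by 2 and add 1" in the code; the loop recovers values of `w` and `b` close to it from the data alone.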


---

### Which are the two main jobs connected to machine learning?

##### Two types of engineering roles work with machine learning models:

1. `Data Scientist`: Responsible for creating, training, evaluating and improving machine learning models.
2. `Machine Learning Engineer`: Responsible for taking the model produced by the data scientist and deploying it (integrating it) in existing business applications.

##### Example:

Let's say you are a doctor working in an Excel file. Your task is to `map` measurements (also known as `features`) of patients (`age`, `gender`, `cholesterol_level`, `blood_pressure`, `is_smoking`, etc) to the `amount of risk` (a whole number from `0` to `10`) they have for developing heart problems, so that you can call them to come for a visit. For example, for a patient with the following features:

| age | gender | cholesterol_level | blood_pressure | is_smoking | risk_level |
|--------------|-----------|------------|------------|------------|------------|
| 40 | female | 5 | 6 | yes | |

you might assign `risk_level=8`, thus the end result would be:

| age | gender | cholesterol_level | blood_pressure | is_smoking | risk_level |
|--------------|-----------|------------|------------|------------|------------|
| 40 | female | 5 | 6 | yes | 8 |

Throughout the years of working in the field, you have gathered 500,000 rows of data. Now, you've heard about the hype around AI and you want to use a model instead of manually going through the data yourself. For every patient you want the model to `predict` the amount of risk, so that you can focus only on the ones that have `risk_level > 5`.

- You hire a `data scientist` to create the model using your `training` data.
- You hire a `machine learning engineer` to integrate the created model with your Excel documents.

---

### How are machine learning models built?

Machine learning models are built by codifying the `description` of the desired model behavior. This description of the expected behavior is `implicitly` hidden in your data in the form of `patterns`. Thus, machine learning is uncovering those patterns and using them to solve various problems.

The process is called `training` - you give the untrained model your data (your `description` of desired behavior) and the model "tweaks" its parameters until it fits your description well enough. And there you have it - a machine learning model that does what you want (probably 😄, i.e. with a certain probability, because it's never going to be perfect).

You can think about the "tweaking" process as one in which multiple models are created, each with different values for its parameters. Their `accuracies` (how likely they are to give the correct `risk_level`) are compared, and the model with the highest accuracy is chosen as the final version.

---

### So what can machine learning models model?

Here is the twist - (almost) `everything`! For any task, as long as you have enough data that represents it well, you can attempt to model it.

One of the things you can model is **the probability of the next word in a sentence**. Surprise - the models that solve these types of tasks are called `Large Language Models`! You have a partially written sentence and you can create a mathematical model that predicts how likely every possible word, in the language you're working in, is to be the next word in that sentence. And after you've selected the word, you can repeat the process on this extended sentence - that's how you get `ChatGPT`.
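
The idea of "predict the next word, append it, repeat" can be sketched with a tiny counting model. This is a toy illustration only: the corpus is made up, the model looks at just the previous word, and always picking the most likely word is one assumed selection strategy among several. Real large language models are neural networks trained on far more text.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real language model is trained on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def next_word_probabilities(word):
    """Return P(next word | current word) estimated from the counts."""
    counts = following[word]
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}

def generate(start, length):
    """Repeatedly append the most likely next word to the sentence."""
    words = [start]
    for _ in range(length):
        probabilities = next_word_probabilities(words[-1])
        if not probabilities:
            break  # the last word was never followed by anything in the corpus
        words.append(max(probabilities, key=probabilities.get))
    return " ".join(words)

print(next_word_probabilities("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(generate("the", 4))              # the cat sat on the
```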


---

### Introduce `numpy`.
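
As a possible starting point, here is a short sketch of the `numpy` operations that come up most often in these notes: creating arrays, inspecting their shape, and doing arithmetic on whole arrays without writing loops. The numbers and the names `features` and `weights` are hypothetical.

```python
import numpy as np

# Hypothetical rows: age, cholesterol_level, blood_pressure.
features = np.array([[40.0, 5.0, 6.0],
                     [55.0, 7.0, 8.0]])
# Hypothetical model parameters, one per column.
weights = np.array([0.1, 0.5, 0.2])

print(features.shape)         # (2, 3): 2 observations, 3 features
print(features.mean(axis=0))  # mean of each column
print(features * 2)           # element-wise arithmetic, no explicit loop
print(features @ weights)     # matrix-vector product: one number per row
```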

---

### What is the difference between a `loss function` and a `metric`?

The loss function is used to optimize your model. This is the function that will get `minimized` by the optimizer.

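A metric is used to judge the performance of your model. Whether higher or lower is better depends on the metric: for accuracy, the `higher`, the better; for an error such as `mean absolute error`, the `lower`, the better. The metric is only for you to look at in order to decide which model to choose; the optimizer does not use it.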

---

### Explain the differences and similarities between `mean absolute error` and `mean squared error`.

$$ MAE = \frac{1}{n} \sum_{i=1}^{n}|x_i-y_i| $$
$$ MSE = \frac{1}{n} \sum_{i=1}^{n}(x_i-y_i)^2 $$
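
Both formulas average a non-negative per-sample error and are `0` only when every prediction matches its target. They differ in how they weigh large errors: squaring makes `MSE` grow much faster than `MAE` when a few predictions are far off, and `MAE` stays in the same units as the target. The sketch below implements both formulas directly; the arrays `x` and `y` are made-up values, with one deliberately large miss.

```python
import numpy as np

def mean_absolute_error(x, y):
    """Average of the absolute differences |x_i - y_i|."""
    return np.mean(np.abs(x - y))

def mean_squared_error(x, y):
    """Average of the squared differences (x_i - y_i)^2."""
    return np.mean((x - y) ** 2)

# Hypothetical predictions x and targets y; the last prediction is off by 10.
x = np.array([3.0, 5.0, 2.0, 17.0])
y = np.array([2.0, 5.0, 4.0, 7.0])

print(mean_absolute_error(x, y))  # 3.25
print(mean_squared_error(x, y))   # 26.25
```

The single large miss contributes 10 of the 13 units of total absolute error, but 100 of the 105 units of total squared error.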

---
---

# Week_02 - Exploratory data analysis. Modelling the `XOR` function

Administrative:
- [X] Create a chat in Messenger for easier group communication.
- [X] We have a national holiday on `04.03`. Would you like me to try to book a room for Wednesday `06.03` from 5 pm till 7 pm?

---

### What are `features`?

- the inputs/values that get passed to the model;
- when working with tabular data `features = columns = variables`.

---

### What are `observations`?

- the entities that possess the features;
- when working with tabular data `observations = rows = samples`.

---

### What are `hyperparameters`?

- values controlling the learning process;
- they are used for model training but are not part of the final model;
- Examples: `learning_rate`, `num_epochs`.

---

### Describe the machine learning model creation process

1. Load data
    1. Load as `pandas` dataframe(s)
    2. Target Variable Distribution
        1. Remove rows with missing target
    3. Merge Sources (if multiple sources)
    4. Check Variable Types
    5. Check and remove duplicates where applicable
    6. Capitalize Column Names
    7. Convert date features. Break them down into `year`, `month`, `day`
    8. Train-Validation-Test Split
    9. Save to files
2. Exploratory data analysis
   1. Data audit (univariate analysis)
   2. Multivariate correlation analysis
   3. Plots
   4. Deal with missing values
3. Feature engineering
   1. Apply domain knowledge to create new features (`sum`, `sub`, `mult`, `div`, `log`, etc)
4. Feature selection
   1. Parametric tests
   2. Non-parametric tests
   3. Usually performed when the number of features is really large or there are correlations between them
5. Model selection
   1. Depending on type of task, create an appropriate pipeline for running models
6. Model training
   1. Run multiple models for multiple epochs and create a comparison table
7. Model evaluation
   1. Apply the models on the test set
8. Model fine-tuning
   1. Go through possible values of hyperparameters, loss functions, model sizes, weights initialization distribution, to get the "best" model.

---

### The `Train`-`Validation`-`Test` split

- the model optimizes the values of its parameters on the train set;
- evaluation on unseen data using the validation set (model selection based on validation results);
- after satisfactory results on the validation dataset, apply **once** on the test set to get final metrics for the model.
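
The three bullets above can be sketched as a small helper that shuffles the rows and cuts them into three disjoint subsets. The function name, the 70/15/15 proportions and the fixed `seed` are assumptions chosen for illustration; suitable proportions depend on how much data you have.

```python
import numpy as np

def train_validation_test_split(data, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the rows of `data` and cut them into three disjoint subsets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    num_test = int(len(data) * test_fraction)
    num_validation = int(len(data) * validation_fraction)
    test_indices = indices[:num_test]
    validation_indices = indices[num_test:num_test + num_validation]
    train_indices = indices[num_test + num_validation:]
    return data[train_indices], data[validation_indices], data[test_indices]

# Hypothetical dataset: 50 rows, 2 columns.
data = np.arange(100).reshape(50, 2)
train, validation, test = train_validation_test_split(data)
print(train.shape, validation.shape, test.shape)  # (36, 2) (7, 2) (7, 2)
```

Shuffling before cutting avoids a split that simply follows the order in which the rows were collected.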

<img src="https://www.researchgate.net/profile/Simon-Hoyle/publication/356601174/figure/fig4/AS:1095291080060946@1638149151937/Outline-of-the-split-into-training-validation-val-and-test-data-subsets-undertaken-in.png" alt="visualization" width="700"/>

---

### Define `Epoch`

One complete pass through the training set.

---

### Define `Learning Rate`

A factor that controls how much we change the parameters in each update step. A larger learning rate means bigger changes per step.
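
Assuming the model is trained with gradient descent (the notes do not name a specific optimizer), the role of the learning rate can be written as a single update rule. The symbols are introduced here for illustration: $\theta_t$ are the parameters after $t$ updates, $L$ is the loss function and $\eta$ is the learning rate.

$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t)$$

With $\eta$ too small, training needs many steps to get anywhere; with $\eta$ too large, the updates can overshoot the minimum of $L$.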

---

### How would we model the `XOR` function?

Boolean formula: `(x|y) & ~(x&y)`

Draw the network architecture of the `Twice` model.

Draw the network architecture of the `Gates` model.

Draw the network architecture of the `XOR` model.
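
One possible architecture that mirrors the boolean formula is a network with two hidden neurons (one acting as `OR`, one as `NAND`) and an output neuron acting as `AND`. The sketch below uses hand-picked weights and a step activation, both of which are assumptions for illustration; a trained model would learn its own weights and would typically use a differentiable activation.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return np.where(z > 0, 1, 0)

# Hidden layer with hand-picked weights (not learned).
hidden_weights = np.array([[1.0, 1.0],     # OR:    x + y - 0.5 > 0
                           [-1.0, -1.0]])  # NAND: -x - y + 1.5 > 0
hidden_biases = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden neurons (h1 + h2 - 1.5 > 0).
output_weights = np.array([1.0, 1.0])
output_bias = -1.5

def xor_network(x, y):
    inputs = np.array([x, y])
    hidden = step(hidden_weights @ inputs + hidden_biases)
    return int(step(output_weights @ hidden + output_bias))

# Compare the network with the boolean formula on all four inputs.
for x in (0, 1):
    for y in (0, 1):
        formula = (x | y) & ~(x & y)
        print(x, y, xor_network(x, y), formula)
```

The last two columns of the printed table agree on every row: `0`, `1`, `1`, `0`.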

---

### What are activation functions?

Non-linear functions applied after the linear transformation of a layer. Without them, the composition of linear functions is again a linear function, so no matter how many layers the data goes through, the output is always the result of a single linear function.

$$y = W_1 W_2 W_3x = Wx$$

With the introduction of a non-linear activation unit after every linear transformation, this won't happen anymore. Each layer can now build on the results of the preceding non-linear layer, which leads to a complex non-linear function. With the right weights and enough depth/width, such a function can approximate a very broad class of functions (for example, any continuous function on a closed, bounded input range, to any desired accuracy).

$$y = f_1( W_1 f_2( W_2f_3( W_3x)))$$
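
Both equations can be checked numerically. The sketch below uses small hand-picked matrices and the same activation `f` (ReLU) for all three layers; these concrete values are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical 2x2 weight matrices and input vector.
W1 = np.array([[1.0, 2.0], [0.0, 1.0]])
W2 = np.array([[1.0, 0.0], [-1.0, 1.0]])
W3 = np.array([[2.0, 0.0], [1.0, 1.0]])
x = np.array([1.0, -1.0])

# Without activations, three layers equal one layer with W = W1 W2 W3.
W = W1 @ W2 @ W3
stacked_linear = W1 @ (W2 @ (W3 @ x))
print(stacked_linear, W @ x)  # [-2. -2.] [-2. -2.]

# With a non-linear activation after every layer, the collapse no longer holds.
def f(z):
    return np.maximum(0.0, z)

def network(v):
    return f(W1 @ f(W2 @ f(W3 @ v)))

print(network(x))  # [2. 0.]
```

The three linear layers give exactly the same output as the single matrix `W`, while the network with activations produces a different result that no single matrix `W1 W2 W3` reproduces.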

### tanh

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$

<img src='https://upload.wikimedia.org/wikipedia/commons/8/87/Hyperbolic_Tangent.svg'>

### sigmoid

$$\sigma(z) = \frac{1} {1 + e^{-z}}$$

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/600px-Logistic-curve.svg.png'>

### ReLU

$$\mathrm{ReLU}(z) = \max(0, z)$$

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/ReLU_and_GELU.svg/1024px-ReLU_and_GELU.svg.png' width=500 height=400>

### ReLU is the most popular

* Computational efficiency: ReLU is a simple threshold, so the forward and backward passes are faster.
* Reduced likelihood of vanishing gradients: the gradient of ReLU is 1 for positive inputs and 0 for negative inputs, while the sigmoid saturates (gradient close to 0) as soon as the input moves moderately away from 0 in either direction, which leads to vanishing gradients.
* Sparsity: ReLU outputs exactly 0 whenever its input is negative. This means fewer neurons are firing (sparse activation) and the network is lighter.
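
The saturation and sparsity arguments can be seen directly by evaluating the functions and their gradients at a few points. This is a minimal sketch; the input values are arbitrary, and taking the ReLU gradient at exactly `0` to be `0` is an assumed convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_gradient(z):
    # Convention assumed here: the gradient at exactly z = 0 is taken to be 0.
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(np.tanh(z))           # squashed into (-1, 1)
print(sigmoid_gradient(z))  # close to 0 at both ends: saturation
print(relu_gradient(z))     # exactly 0 or 1
print(relu(z))              # negative inputs become 0: sparse activations
```

At `z = 10` the sigmoid gradient is already below `0.0001`, while the ReLU gradient is still `1`.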