# Master Solution for CHECKPOINT Exercises

## Module 1: Introduction to Machine Learning and Data Processing for Food Sciences

### 1.1 CHECKPOINT 1
**Q:** You have trained a supervised ML model to accurately predict the expiration date of a food product given its chemical composition. What could you do with the trained model if someone gave you an unlabelled set of foods and their compositions?


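**A:** Because the model is already trained, we can apply it directly to the new foods and predict an expiration date for each of them from its chemical composition; labels are only needed for training and evaluation, not for prediction. If we also want to explore the unlabelled dataset itself, we can use **unsupervised learning** to find food categories in the data, using the chemical composition of the foods as features.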
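
As a rough illustration of the unsupervised route, the sketch below groups a handful of foods by composition without using any labels. The feature columns and values are invented for illustration, and `KMeans` with two clusters is only one possible choice of algorithm, not necessarily the one used in the course.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical compositions: columns are [water_g, alcohol_g] per 100 g.
compositions = np.array([
    [92.0, 0.0],   # vegetable-like samples
    [90.0, 0.0],
    [94.0, 0.1],
    [60.0, 38.0],  # spirit-like samples
    [62.0, 40.0],
    [58.0, 37.0],
])

# No labels are passed: samples are grouped by similarity of their features.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(compositions)
print(cluster_ids)
```

The cluster numbers themselves are arbitrary; what matters is which samples end up in the same group.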

### 1.2 CHECKPOINT 3.2

**Q1:** Why is data pre-processing important?
**A:** Data pre-processing is important because model performance depends on the quality of the data. During pre-processing the dataset is cleaned: missing values are handled with different techniques, outliers are detected and dealt with, and so on. All in all, data pre-processing increases data quality and prepares the data to be used by the machine learning model during training.

**Q2:** Why do we need to remove samples/features with a lot of missing values?

**A:** Samples and features with a lot of missing values hold little information that is useful during training. If they were kept, they would have to go through imputation, and the imputed values only repeat information that is already in the dataset (such as the mean of the feature), so they add no value to the learning process.
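
A minimal sketch of this filtering step with pandas is shown below. The column names, the values, and the 50% cutoff (`max_missing`) are assumptions made for illustration; the course notebook may use a different threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical nutrient table with missing entries.
df = pd.DataFrame({
    "protein":   [1.2, 3.4, np.nan, 2.2, 0.9],
    "fat":       [0.3, np.nan, np.nan, 1.1, 0.4],
    "vitamin_x": [np.nan, np.nan, np.nan, np.nan, 0.01],  # mostly missing
})

max_missing = 0.5  # assumed cutoff: tolerate at most 50% missing values

# Drop features (columns) whose fraction of missing values exceeds the cutoff.
col_missing = df.isna().mean()
cleaned = df.loc[:, col_missing <= max_missing]

# Then drop samples (rows) whose fraction of missing values exceeds the cutoff.
row_missing = cleaned.isna().mean(axis=1)
cleaned = cleaned.loc[row_missing <= max_missing]

print(cleaned)
```

Here `vitamin_x` is removed as a feature, and the one sample that is missing both remaining features is removed as a row.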

**Q3:** Why is imputation done per food category?

**A:** In our dataset, it is very important to consider the differences between food categories. For example, consider the categories `alcoholic_beverages` and `vegetables`. The amount of alcohol in alcoholic beverages is much higher than in vegetables, so imputing missing alcohol values for vegetables with a mean that also takes alcoholic beverages into account would simply be wrong, and would confuse models during the learning process.
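
The difference can be sketched with a toy table. The values below are made up, and mean imputation via `groupby(...).transform("mean")` is one way to do this in pandas, assumed here rather than taken from the course code.

```python
import numpy as np
import pandas as pd

# Toy data: alcohol content with one missing value per category.
foods = pd.DataFrame({
    "category": ["alcoholic_beverages"] * 3 + ["vegetables"] * 3,
    "alcohol":  [5.0, np.nan, 7.0, 0.0, 0.0, np.nan],
})

# Global mean imputation: every gap gets the mean over all categories.
global_imputed = foods["alcohol"].fillna(foods["alcohol"].mean())

# Per-category imputation: each gap gets the mean of its own category.
category_means = foods.groupby("category")["alcohol"].transform("mean")
per_category = foods["alcohol"].fillna(category_means)

print(global_imputed.tolist())
print(per_category.tolist())
```

With the global mean, the vegetable with a missing value is assigned an alcohol content of 3.0; with per-category means it is assigned 0.0, which matches the other vegetables.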

### 1.3 CHECKPOINT 3.3

**Q:** Why do we do the train-test split?

**A:** The split is used to evaluate model performance. The model is trained on the train set and its ability to generalize is checked on the test set. This gives an indication of how well the model can predict the target value for a new, unseen sample once it is deployed in real life.
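
A small sketch of the split with scikit-learn's `train_test_split` follows. The toy arrays, the 80/20 ratio, and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and one target value per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Every sample lands in exactly one of the two sets, so the test set stays unseen until evaluation.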

### 1.4 CHECKPOINT 3.4

**Q:** Why is standardization important?

**A:** Standardization brings all features onto the same scale. This matters because otherwise the importance of a feature in the model could be determined by its magnitude rather than by the information it carries.

**Q:** Why is `fit_transform()` applied on the train set and `transform()` on the test set?

**A:** The `fit_transform()` method learns the scaling statistics (mean and standard deviation) from the data it is applied to and transforms that data in the same step. The `transform()` method, on the other hand, reuses what was learned when `fit_transform()` was called: it transforms the data with the learned mean and standard deviation, without recalculating them on the data it is applied to. This way no information from the test set leaks into the training process, so the performance is evaluated more accurately.
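
The behaviour can be checked on a tiny example with `StandardScaler`. The numbers are invented so that the statistics are easy to verify by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# Learns mean and standard deviation from the train set, then scales it.
X_train_scaled = scaler.fit_transform(X_train)

# Reuses the train statistics; nothing is re-estimated from the test set.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```

The test sample is scaled with the train mean of 2.0, so it does not come out as 0 even though it is the only value in the test set.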

## Module 2: Supervised Learning

### 2.1 CHECKPOINT 1

**Q:** 
In the following scenarios, would you use regression or classification algorithms:<br>
    1. We have a dataset that contains student characteristics and their scores in the past 5 exams. We would like to predict their *score* in the next exam. <br>
    **A: REGRESSION**<br>
    2. We have a dataset that contains the ingredient composition of some food samples. We would like to predict what *category* the food belongs to.<br>
    **A: CLASSIFICATION**<br>
    3. We have a dataset that contains different images of animals. We would like to be able to predict the *type* of animal. <br>
    **A: CLASSIFICATION**<br>
    4. We have a dataset of stock prices of a company for the past 10 years. We would like to predict the *stock price* for the upcoming year. <br>
    **A: REGRESSION**<br>

### 2.2 CHECKPOINT 2.1

**Q1:** How do parameters differ from hyperparameters? What role does each of them play in the learning process?

**A:** Parameters are learned by the model during training. Hyperparameters are settings that the data scientist chooses before training. Hyperparameters affect the learning process, while parameters are an outcome of the learning process.

**Q2:** What is tuned by the data analyst and what is learned by the model?

**A:** The data analyst tunes the hyperparameters, while the parameters are learned by the model.

### 2.3 CHECKPOINT 3.2

**Q:** What do you think will happen with the number of predicted positive and negative samples if we increase the classification threshold from 50% to 80%?

**A:** Increasing the classification threshold from 50% to 80% means that the model must assign a sample a positive-class probability of at least 80% before the sample is classified as positive. All samples whose positive-class probability lies between 50% and 80% (non-inclusive) were previously classified as positive and will now be classified as negative, so **fewer samples** than before will be classified as positive and more will be classified as negative. In other words, only samples for which the model is highly confident that they belong to the positive class will be classified as positive. This might increase the number of false negatives while decreasing the number of false positives.
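
The effect can be seen on a few made-up probabilities. The values in `proba_positive` are hypothetical and stand in for the positive-class probabilities a classifier would output.

```python
import numpy as np

# Hypothetical positive-class probabilities for six samples.
proba_positive = np.array([0.95, 0.85, 0.70, 0.55, 0.30, 0.10])

# A sample is classified as positive when its probability reaches the threshold.
pred_at_50 = proba_positive >= 0.5
pred_at_80 = proba_positive >= 0.8

print("positives at 50%:", pred_at_50.sum())
print("positives at 80%:", pred_at_80.sum())
```

The two samples with probabilities 0.70 and 0.55 switch from positive to negative when the threshold is raised.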

### 2.4 CHECKPOINT 3.3

**Q1:** What are the consequences of overfitting and underfitting on model performance?

**A:** In an overfitting scenario, the model performs very well on the training set but poorly on the test set. This means that the model is not able to generalize well to new, unseen data points. In an underfitting scenario, on the other hand, the model is too simple for the data and is not able to learn its underlying patterns. As a result, it performs well on neither the train set nor the test set.

**Q2:** *We have a dataset that contains 1000 points, each of them having a single feature. The function that will approximate these points will have the form $y=ax^2+bx+c$. We do not want the model to overfit the data, so we set a regularization parameter λ=0.1, to be used during the training process. Since we do not want to wait long, we also fix the number of steps in which the loss function is computed and the weights are updated, n=100.* Given this scenario, determine whether a, b, c, λ and n are parameters or hyperparameters.

**A:** a, b and c are *parameters*. λ and n are *hyperparameters*.
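
The distinction can be made concrete with a short gradient descent sketch for this scenario. The synthetic data, the learning rate, and the choice of an L2 penalty on `a` and `b` are assumptions; the exercise only fixes λ=0.1 and n=100.

```python
import numpy as np

# Synthetic dataset: 1000 points with a single feature (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x**2 - 1.0 * x + 0.5 + rng.normal(0, 0.1, size=1000)

# Hyperparameters: chosen by us before training starts.
lam = 0.1            # regularization strength (lambda in the exercise)
n_steps = 100        # number of update steps (n in the exercise)
learning_rate = 0.1  # assumed value, not given in the exercise

def loss(a, b, c):
    """Mean squared error plus an L2 penalty on a and b (assumed form)."""
    residual = a * x**2 + b * x + c - y
    return np.mean(residual**2) + lam * (a**2 + b**2)

# Parameters: start at zero and are learned from the data.
a, b, c = 0.0, 0.0, 0.0
initial_loss = loss(a, b, c)

for _ in range(n_steps):
    residual = a * x**2 + b * x + c - y
    grad_a = 2 * np.mean(residual * x**2) + 2 * lam * a
    grad_b = 2 * np.mean(residual * x) + 2 * lam * b
    grad_c = 2 * np.mean(residual)
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c

final_loss = loss(a, b, c)
print(a, b, c, final_loss)
```

Only `a`, `b` and `c` change inside the loop; `lam` and `n_steps` stay exactly as they were set.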

### 2.5 CHECKPOINT 4.2

**Q1:** In the above code snippet, try different values for `max_depth`. How does increasing the value of this hyperparameter affect model performance in train and test sets? What does that mean?

**A:** As we increase `max_depth`, the hyperparameter that controls the depth of the tree, the difference in performance metrics between the train and test sets becomes larger. For low values of `max_depth`, the MSE is higher in the train set, which might indicate that the model is too simple to learn anything from the data. For high values of `max_depth`, on the other hand, the performance metrics are better on the train set than on the test set. These are indications of overfitting.
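
A sweep over `max_depth` could look like the sketch below. It uses a synthetic one-feature regression problem instead of the course dataset, so the exact numbers will differ from the notebook; the depths tried are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): a noisy sine curve with a single feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (1, 3, 10, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Comparing the two columns for each depth shows where the train error keeps falling while the test error stops improving.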

### 2.6 CHECKPOINT 4.3

**Q1:** In the above code snippet, try different values for `n_estimators` and `max_depth`. How does increasing the values of these hyperparameters affect model performance in train and test sets? What does that mean?

**A:** For low values of both hyperparameters, the test set performance is better than the train set performance, although the overall performance is not good. These are clear signs of underfitting. As we increase the values of these hyperparameters, the performance on the train set improves, but the gap between the train and test set metrics becomes larger, indicating overfitting.

### 2.7 CHECKPOINT 5

**Q:** Compare and contrast the performance of the three types of classifiers. Set `max_depth=4` for both the Decision Tree classifier and the Random Forest classifier. Also, for the Random Forest classifier set `n_estimators=30`. With these hyperparameter settings, compare the performance of the three models. Which performs the best? Where can you see indications of overfitting/underfitting? What are some measures to prevent overfitting/underfitting in our specific case?

**A:** *Logistic regression:* From the performance metrics we see that the model has a decent performance on the train and test sets. In addition, the performance metrics on both sets are similar, meaning that the model has learned the data well and is able to generalize to new, unseen samples. *Decision Tree:* From the performance metrics we see that there is a bigger difference between the train and test set metrics, indicating overfitting. The performance metrics of this model on the train set are better than those of logistic regression; however, logistic regression performs better on the test set. This model should be treated with caution since there are signs of overfitting. *Random Forest:* The performance metrics are a bit better than those of the decision tree model. The big difference in F1-scores between the train and test sets indicates that there might be overfitting, which is supported by the precision-recall curves as well.
 
All in all, with these hyperparameter settings we can conclude that logistic regression seems to be the most reliable model. To reduce the signs of overfitting, we can lower the hyperparameter values (for example `max_depth`) in both the decision tree and random forest models.
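
A comparison of this kind could be set up roughly as follows. The sketch uses synthetic data from `make_classification` rather than the course dataset, so its scores say nothing about the results discussed above; only the hyperparameter values `max_depth=4` and `n_estimators=30` come from the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed stand-in for the course data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(
        n_estimators=30, max_depth=4, random_state=0
    ),
}

# Store (train F1, test F1) per model; a large gap hints at overfitting.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_f1 = f1_score(y_train, model.predict(X_train))
    test_f1 = f1_score(y_test, model.predict(X_test))
    scores[name] = (train_f1, test_f1)
    print(f"{name:20s} train F1={train_f1:.3f}  test F1={test_f1:.3f}")
```

Rerunning the loop with a smaller `max_depth` is a quick way to check whether the gap between train and test F1 narrows.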

## Module 3: Unsupervised Learning