# Overview of Machine Learning for Food Sciences

## What is Machine Learning?

You have probably heard a lot about "Machine Learning" by now, and you may well have asked yourself: what does it mean? Machine Learning lies at the intersection of computer science and statistics/mathematics. It combines algorithmic concepts from computer science with mathematical models to describe data and find meaningful patterns in it. Thus, **the main goal of Machine Learning is to extract "meaning" from data**. In practice, this "meaning" is represented as a **mathematical equation that best describes the data**. From a philosophical perspective, Machine Learning is used to obtain knowledge from data. With this extracted knowledge it is possible to make useful predictions.

Although it might not always be evident, machine learning algorithms are already a large part of our lives. Some simple everyday examples are the face-detection features in our smartphones, the voice assistants in our electronic devices and the spam filters in our email. Another area where the use of machine learning is gathering momentum is medicine: for some tasks, machine learning algorithms can detect diseases with an accuracy similar to that of doctors, and they can also be used to predict the efficacy of different drug combinations, which is normally an extremely time-consuming process.

[Figure 1](#ml-model) shows a high-level view of the process of building a machine learning model. As you can see, there are three main parts: **the input data**, **the machine learning model** and the **predicted output**. The **input data** is fed into the model so that it can learn a good mathematical representation of it. The **process of building the model** consists of several steps. When we build the model, we do not give it all the data that we have. Instead, we split the data into training data and test data. The reasons will become evident in [the train-test split section below](#train-test-split). Then we choose a specific machine learning model. **The model is trained on the training data**. After the model training is done, we can use the model to make predictions on the test data, producing the **predicted outputs**. Given these predictions, we can check how the model has performed. As you can see, there is a cycle between the testing and the learning phase. This is because we train the model iteratively: we check how it performs on unseen data and, if the results are not good, we go back to training and repeat the process until we reach satisfactory results. Once the model reaches a satisfactory performance, we assume it is ready to be used in real-world scenarios.
How we determine whether a model is performing well, and how we measure its performance, depends on the task at hand. We will see some examples later on.
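
The split, train, predict and evaluate steps described above can be sketched in a few lines of code. This is only a minimal illustration under assumptions: the data is synthetic (generated with `make_regression`), and the choice of `LinearRegression` and of the mean squared error as performance measure is ours, not something prescribed by the tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic input data (illustrative only): 200 samples with 3 features each.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 1. Split the data into training data and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Choose a model and train it on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# 3. Predict outputs for the test data, which the model has not seen.
y_pred = model.predict(X_test)

# 4. Check how the model performed (here with the mean squared error).
test_mse = mean_squared_error(y_test, y_pred)
print("Test MSE:", test_mse)
```

If the error on the test data were unsatisfactory, we would go back to step 2, adjust the model and repeat.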


<center>
<a id="ml-model"></a>
<img src="images/ml_model.jpg" alt="Machine Learning Model" width="75%">
<center><figcaption><em>Fig 1: Building a machine learning model</em></figcaption></center>
</center>

## Where is Machine Learning Used in Food Science?

As in many other areas, machine learning is quite useful in Food Science. Applications span diverse areas, including quality control, sensory analysis, food safety and product development.
In quality control, machine learning models can help people identify flaws in the food production process that may be related to contamination or other defects. In sensory analysis, machine learning models can use past data to understand consumer preferences and suggest new products that could be developed. Machine learning can be useful for food safety as well, since models can analyze data from microbiological test results to predict possible contamination scenarios. Lastly, it can help tailor product development, not only to make it more efficient but also to find trends and customer preferences in order to maximize profits.
All in all, machine learning is playing an increasingly important role in food science.

## What is the data?

By **data** we refer to the **collection of samples** obtained through different experimental procedures. Usually, machine learning models work with data in tabular format. In machine learning notation, we denote **the number of samples by m** and **the number of features by n**. The number of samples is the number of data points, while the features describe the characteristics of each of these data points. In a table, each row is typically one sample and each column one feature. Machine learning models use these features to learn insights and to construct a mathematical equation that represents the data.
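
To make the notation concrete, here is a small sketch of a tabular dataset in pandas. The four rows reuse a few values from the food composition table shown later in this tutorial, while the short column names (`energy_kcal`, `fat_g`, `carbohydrates_g`) are hypothetical shorthand introduced only for this example.

```python
import pandas as pd

# Each row is one sample (a food item), each column is one feature.
# Column names are shorthand invented for this example.
data = pd.DataFrame(
    {
        "energy_kcal": [624, 19, 335, 661],
        "fat_g": [52.1, 0.2, 18.7, 59.5],
        "carbohydrates_g": [7.8, 2.0, 34.6, 10.1],
    },
    index=["Almond", "Zucchini, raw", "Cherry pie from Zug", "Hazelnut"],
)

# The shape of the table gives (number of samples m, number of features n).
m, n = data.shape
print("m (samples):", m)
print("n (features):", n)
```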

<center><a id="data"></a>
<img src="images/data.jpg" alt="Data" width="50%">
<center><figcaption><em>Fig 2: Data</em></figcaption></center>
</center>

## Supervised vs Unsupervised Learning

Generally, there are 4 types of machine learning algorithms: *supervised learning*, *unsupervised learning*, *semi-supervised learning* and *reinforcement learning*. In this series of tutorials we will explore supervised and unsupervised learning algorithms.

### Supervised Learning

**Supervised Learning** - in this setting we aim to build a model that learns the data as well as possible and is able to predict future values. This is the same as building a mathematical equation or formula that takes many input variables and derives the desired output variable. The data points that the model learns from already have their corresponding outputs. This is how the model is able to derive a connection between inputs and outputs. There are two types of problems in the supervised setting: *regression* and *classification*. In **regression**, the output that the model learns and then tries to predict is a continuous value (e.g. the age or height of a person). In **classification**, the output that the model learns and then tries to predict is a categorical value (e.g. a class from a finite number of classes, such as whether a tumor cell is benign or malignant).


In the case of regression, once the model has learned from the data, it will output a value similar to those it saw during the training phase. In the case of classification, once the model has learned from the data and is ready to be used, it will take a new, unseen sample and output a class or category from the set of categories that it saw during the training phase. [Figure 3](#sup_lear_reg_clf) illustrates the process.

<center><a id="sup_lear_reg_clf"></a>
<img src="images/supervised_learning__regression_classification.jpg" alt="Supervised learning regression and classification" width="75%">
<center><figcaption><em>Fig 3: Regression and Classification</em></figcaption></center>
</center>

We will study all the steps in this process in future sections.
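
As a rough preview of the difference between regression and classification, the sketch below trains one model of each kind on the same made-up inputs. The numbers, the labels `"low"` and `"high"`, and the choice of `LinearRegression` and `LogisticRegression` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# One input feature for six made-up samples.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: every sample comes with a continuous output value.
y_continuous = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])
regressor = LinearRegression().fit(X, y_continuous)
predicted_value = regressor.predict([[7.0]])[0]

# Classification: every sample comes with one of a finite set of classes.
y_class = np.array(["low", "low", "low", "high", "high", "high"])
classifier = LogisticRegression().fit(X, y_class)
predicted_class = classifier.predict([[7.0]])[0]

print("Regression output (a number):", predicted_value)
print("Classification output (a class):", predicted_class)
```

The regressor returns a number, whereas the classifier can only return one of the classes it saw during training.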

### Unsupervised Learning

**Unsupervised Learning** - in this setting, the data that we have does not come with any values or categories that we can learn and later predict. Here, the models try to find structure in the data, or to learn the patterns present in it. Some use cases of such models are clustering, dimensionality reduction, data generation and anomaly detection. In clustering, we try to find groups within the data, so that similar samples are grouped together. In dimensionality reduction, we move from data with many features to compressed data with very few features. As the name suggests, in data generation we use the unlabelled data to learn its structure or underlying properties, and based on this the model generates similar samples. In anomaly detection, we use machine learning models to find outliers in the data. Outliers are points that do not resemble the majority of the points in the dataset. [Figure 4](#unsup_lear) illustrates the idea. These models still produce an output, which reflects what the model has learned from the data. In the case of clustering, the output is a cluster number that shows which other samples a specific sample is most similar to. In the case of dimensionality reduction, the output is the same sample but with fewer features.

<center><a id="unsup_lear"></a>
<img src="images/unsupervised_learning__clustering_dimred.jpg" alt="Unsupervised learning clustering and dimensionality reduction" width="75%">
<center><figcaption><em>Fig 4: Clustering and Dimensionality Reduction</em></figcaption></center>
</center>

In this series of tutorials we will focus only on clustering and dimensionality reduction.
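
The following sketch shows what the outputs of these two tasks look like. It assumes synthetic, unlabelled data with two well-separated groups, and it uses `KMeans` and `PCA` from scikit-learn as example algorithms; both choices are illustrative and not necessarily the methods covered later.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Unlabelled synthetic data: two groups of 50 samples with 5 features each.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 5))
group_b = rng.normal(loc=8.0, scale=1.0, size=(50, 5))
X = np.vstack([group_a, group_b])

# Clustering: the output is one cluster number per sample.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: the output is each sample with fewer features.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Cluster numbers of the first five samples:", cluster_ids[:5])
print("Shape before reduction:", X.shape)
print("Shape after reduction:", X_reduced.shape)
```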

## Datasets Used during the Tutorials

[To be completed in the very end...]

## Data Processing for Machine Learning Methods

In order to build robust machine learning algorithms, data quality is very important. In most cases the data that we get is raw. If we use it as it is, the model may not be able to learn many useful characteristics. Thus, we need to pre-process the data by applying different techniques, so that it becomes useful to the model. Below we briefly describe some of these techniques.

### Train-Test Split

Although in practice the train-test split happens after the actual data pre-processing, it is sometimes considered part of the data pre-processing phase. As you can see from [Fig. 5](#train_test), before the machine learning model is trained, the data is split into two parts: the **training data** and the **testing data**. Usually we use 70-80% of the whole dataset for training and the remaining 20-30% for testing. We do this in order to quantify how well the model has learned the characteristics of the data. If the performance indicators are good on the test set, then we know that the model has learned the data well. If the performance indicators are poor on the test set, this means that our model might have just "learned by heart" the training data.

<center><a id="train_test"></a>
<img src="images/train_test_split.jpg" alt="Train-test split" width="75%">
<center><figcaption><em>Fig 5: Train-Test Split</em></figcaption></center>
</center>

To better understand this, let us make an analogy to your courses at ETH. When you sit an exam, you rarely or never find exactly the same questions that you have seen during the exercise sessions or classes. Also, you never know in advance which questions will be asked in the exam. The reason for all these constraints is that the professors want to evaluate you based on what you have understood from the course. If you already knew the exam questions, you could easily prepare them beforehand, memorize the answers even without understanding anything, and then get a really good grade in the course. But what happens next is that when you start working outside the doors of the university, you will probably not be able to solve the tasks at hand, because you lack the understanding of certain concepts and the necessary set of skills.

This is exactly what we want to avoid when training machine learning models. In machine learning jargon, we want models that generalize well. This means that they can properly solve tasks which they have not encountered during training. They can do so if the training was successful and they have gained insights into the data. In our analogy from above, if the students understand the concepts and can put them into practice, then they can be flexible in their jobs and solve tasks that they might not have encountered during their studies. All this happens because they will have the necessary set of skills from their training (education) years.
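
The "learning by heart" effect can be made visible by comparing the performance on the training data with the performance on the test data. The sketch below is only an illustration under assumptions: the data is synthetic with deliberately noisy labels (`flip_y=0.2`), and an unrestricted `DecisionTreeClassifier` is used as a stand-in for a model that is able to memorize its training samples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data in which part of the labels is random noise.
X, y = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A decision tree without depth limit can memorize every training sample.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on data seen during training versus accuracy on unseen data.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("Accuracy on the training data:", train_accuracy)
print("Accuracy on the test data:", test_accuracy)
```

A large gap between the two numbers is the signal that the model has memorized the training data instead of generalizing, which is exactly what the test set is meant to reveal.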

Below you can see how a dataset can be split in practice.
We will first read the 'Swiss Food Composition Dataset' into a pandas dataframe using the `read_csv()` function. Then we will split it using the `train_test_split()` function from the `model_selection` module of the `sklearn` library.

First, we start by importing the libraries that we will need.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

Now we need to load the dataset that we want to split.

In [5]:
dataset = pd.read_csv('data/swiss_food_composition_database.csv')
dataset

Unnamed: 0,ID,Name,Category,"Energy, kilocalories (kcal)","Fat, total (g)","Fatty acids, saturated (g)","Fatty acids, monounsaturated (g)","Fatty acids, polyunsaturated (g)",Cholesterol (mg),"Carbohydrates, available (g)",...,Potassium (K) (mg),Sodium (Na) (mg),Chloride (Cl) (mg),Calcium (Ca) (mg),Magnesium (Mg) (mg),Phosphorus (P) (mg),Iron (Fe) (mg),Iodide (I) (µg),Zinc (Zn) (mg),Selenium (Se) (µg)
0,10533,Agar Agar,Various/Gelling and binding agents,160,0.2,n.d.,n.d.,n.d.,n.d.,0,...,52,130,n.d.,660,100,34,4.5,n.d.,1.5,n.d.
1,10536,Agave syrup,Sweets/Sugar and sweeteners,293,0,0,n.d.,n.d.,n.d.,73.1,...,n.d.,4,n.d.,n.d.,n.d.,n.d.,n.d.,n.d.,n.d.,n.d.
2,273,Almond,"Nuts, seeds and oleaginous fruit",624,52.1,4.1,31.4,11.4,0,7.8,...,740,1.1,40,270,240,510,3.3,0.2,3.3,2.2
3,278,"Almond, dry roasted, salted","Savoury snacks/Salted nuts, seeds and kernels",637,52.5,4.1,33.1,13,0,10.1,...,710,230,1190,270,280,470,3.7,2.4,3.3,2
4,269,"Almond, roasted, salted","Savoury snacks/Salted nuts, seeds and kernels",649,55.2,4.2,34.8,13.5,0,7.2,...,670,330,1190,240,270,470,3.3,2.4,3.1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1087,1661,"Zucchini piccata, prepared",Prepared dishes/Other savoury dishes,124,8,2.2,4,1,65,6.6,...,210,93,150,76,24,97,0.9,11,0.7,n.d.
1088,1657,"Zucchini slices, breaded, prepared",Prepared dishes/Other savoury dishes,127,5.5,0.7,3.4,0.9,44,13.5,...,210,89,160,28,24,65,1,7.3,0.5,n.d.
1089,367,"Zucchini, raw",Vegetables/Fresh vegetables,19,0.2,0,0,0.1,0,2,...,230,3,24,19,23,31,0.8,2.3,0.2,n.d.
1090,1654,"Zucchini, steamed (without addition of salt)",Vegetables/Cooked vegetables (incl. cans),20,0.2,0,0,0.1,0,2.2,...,220,2.9,26,21,24,33,0.7,2.5,0.3,n.d.


After having loaded it, we can split it:

In [6]:
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=0)
train_set

Unnamed: 0,ID,Name,Category,"Energy, kilocalories (kcal)","Fat, total (g)","Fatty acids, saturated (g)","Fatty acids, monounsaturated (g)","Fatty acids, polyunsaturated (g)",Cholesterol (mg),"Carbohydrates, available (g)",...,Potassium (K) (mg),Sodium (Na) (mg),Chloride (Cl) (mg),Calcium (Ca) (mg),Magnesium (Mg) (mg),Phosphorus (P) (mg),Iron (Fe) (mg),Iodide (I) (µg),Zinc (Zn) (mg),Selenium (Se) (µg)
862,14107,Sesame oil,Fats and oils/Oils,810,90,13.4,36.2,36,0,0,...,0,0,0,0,0,0,0,0,0,0
422,270,Hazelnut,"Nuts, seeds and oleaginous fruit",661,59.5,4.2,46.6,6.1,0,10.1,...,720,0.9,11,160,160,320,3.6,6.5,2.9,0.7
342,755,"Fish, sole, raw",Fish/Sea fish,85,1.1,0.2,0.2,0.4,50,0,...,350,120,140,29,25,270,0.6,25,0.1,n.d.
188,882,Cherry pie from Zug,Sweets/Cakes and tarts,335,18.7,8.5,7.5,1.6,90,34.6,...,140,62,73,67,25,120,0.8,4.2,0.6,n.d.
933,450,"Swiss chard, raw",Vegetables/Fresh vegetables,23,0.2,tr.,tr.,tr.,0,2.7,...,380,170,100,80,81,43,2.3,1,0.3,n.d.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1033,206,"Wheat flour, wholemeal, type 1700","Cereal products, pulses and potatoes/Flour and...",338,2,0.3,0.2,1.5,0,61,...,390,17,37,26,130,350,5,2.4,3.4,5
763,1514,"Rice pudding, prepared (with full fat milk, wi...",Sweets/Creams and puddings,143,5.1,3.1,1.2,0.3,19,19,...,200,85,180,150,17,130,0.1,12,0.6,3.6
835,647,"Sauerkraut, pickled",Vegetables/Cooked vegetables (incl. cans),19,0.3,0.1,0,0.2,0,1.7,...,220,550,1160,36,11,30,0.5,3.1,0.3,n.d.
559,683,"Mushroom sauce, thickened",Various/Sauces,132,8.9,5.4,2.1,0.5,28,8.5,...,170,250,410,110,11,90,0.1,8.7,0.5,n.d.


We pass three arguments to the function. The first one is the dataset that will be split. The second one, `test_size`, determines the portion of the data that will be used for testing. Here we specify `test_size=0.2`, meaning that 20% of the data will be used for testing and, implicitly, the remaining 80% for training. Depending on whether we pass the whole dataset or the features and the labels separately, the function returns two new dataframes in the first case (one for the train set and one for the test set) or four new objects in the second case (one each for the train features, the test features, the train labels and the test labels). The third argument, `random_state=0`, makes sure that no matter how many times we execute the above code cell, it will produce the same split every time. This means that `train_set` and `test_set` will always contain the same samples. If we omit that argument, `train_set` and `test_set` will contain different samples each time we execute the cell.
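
The second case, where features and labels are passed separately, can be sketched as follows. The small table and its column names (`feature_a`, `feature_b`, `label`) are made up for this example and are not part of the Swiss Food Composition Dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A made-up table with two feature columns and one label column.
toy = pd.DataFrame(
    {
        "feature_a": range(10),
        "feature_b": range(10, 20),
        "label": [0, 1] * 5,
    }
)
features = toy[["feature_a", "feature_b"]]
labels = toy["label"]

# Passing features and labels separately returns four objects.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
X_train_again, X_test_again, y_train_again, y_test_again = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

print("Train features:", X_train.shape, "Test features:", X_test.shape)
print("Same split again:", X_train.index.equals(X_train_again.index))
```

Features and labels are split consistently, so each row of `X_train` keeps the same index as its label in `y_train`.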

In order to make sure that the split is correct let us check the sizes of the new dataframes.

In [7]:
print("Shape of original dataset: ", dataset.shape)
print("Shape of train dataset: ", train_set.shape)
print("Shape of test dataset: ", test_set.shape)

Shape of original dataset:  (1092, 42)
Shape of train dataset:  (873, 42)
Shape of test dataset:  (219, 42)


This confirms our desired proportions for the split.
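
Since 20% of 1092 is 218.4, which is not a whole number, the sizes have to be rounded. The short check below assumes that the test share is rounded up and the training set receives the remaining rows, which matches the shapes printed above. It uses a plain list of row numbers instead of the real dataset.

```python
import math

from sklearn.model_selection import train_test_split

n_samples = 1092  # number of rows reported for the original dataset
test_size = 0.2

# Expected sizes if the test share is rounded up to a whole number of rows.
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

# Split a stand-in list of row numbers in the same way as the dataset.
train_rows, test_rows = train_test_split(
    list(range(n_samples)), test_size=test_size, random_state=0
)

print("Expected train/test sizes:", expected_train, expected_test)
print("Actual train/test sizes:", len(train_rows), len(test_rows))
```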

### Standardization

### Outlier detection

### Data Quality Control