# Overview of Machine Learning for Food Sciences

## What is Machine Learning?

You have probably heard a lot about "Machine Learning" by now, and you may well have asked yourself: what does it mean? Machine Learning lies at the intersection of computer science and statistics/mathematics. It combines algorithmic concepts from computer science with mathematical models to describe data and find the meaningful patterns in it. Thus, **the main goal of Machine Learning is to extract "meaning" from data**. In practice, this "meaning" is represented as a **mathematical equation that best describes the data**. From a philosophical perspective, Machine Learning is a way of gaining knowledge from data, and this extracted knowledge can then be used to make useful predictions.

Although it might not always be evident, machine learning algorithms are already a large part of our lives. Some simple everyday examples are the face-detection features in our smartphones, the voice assistants in our electronic devices and the spam filters in our email. Another area where the use of machine learning is gathering momentum is medicine: machine learning algorithms can detect some diseases with an accuracy similar to that of doctors, and they can also be used to predict the efficacy of different drug combinations, which is normally an extremely time-consuming process.

[Figure 1](#ml-model) shows a high-level view of the process of building a machine learning model. As you can see, there are three main parts: **the input data**, **the machine learning model** and the **predicted output**. The **input data** is fed into the model so that it can learn a good mathematical representation of it. The **process of building the model** consists of several steps. When we build the model, we do not give it all the data that we have. Instead, we split the data into training data and test data; the reasons will become evident in [the train-test split section below](#train-test-split). Then we choose a specific machine learning model, and **the model is trained on the training data**. After training is done, we use the model to make predictions on the test data, producing the **predicted outputs**. Given these predictions, we can check how well the model has performed. As you can see, there is a cycle between the testing and learning phases. This is because we work iteratively: we train the model, check how it performs on unseen data and, if the results are not good enough, go back to training and repeat the process until the results are satisfactory. Once the model reaches a satisfactory performance, we consider it ready to be used in real-world scenarios.
How we decide whether a model is performing well, and how we measure its performance, depends on the task at hand. We will see some examples later on.
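
The split, train and evaluate steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data are synthetic (a hypothetical single feature with a noisy linear target), the model is a simple least-squares line, and the helper names `fit_line` and `mean_squared_error` are made up for this example. It uses only the standard library so that it runs anywhere; in practice a dedicated machine learning library would normally handle these steps.

```python
import random

# Hypothetical toy data: one feature x and a noisy linear target y = 2x + 1.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

# Shuffle the sample indices, then keep 80% for training and 20% for testing.
indices = list(range(len(xs)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Train the model on the training data only.
slope, intercept = fit_line([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])

# Predict on the unseen test data and measure how well the model performs.
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]
predictions = [slope * x + intercept for x in test_x]
test_mse = mean_squared_error(test_y, predictions)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MSE={test_mse:.4f}")
```

If the test error were too high, we would go back, change the model or its settings, and repeat the train and test steps, which is the cycle shown in the figure.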



<a id="ml-model"></a>
<img src="images/ml_model.jpg" alt="Machine Learning Model">
<center><figcaption><em>Fig 1: Building a machine learning model</em></figcaption></center>

## Where is Machine Learning Used in Food Science?

As in many other areas, machine learning is very useful in Food Science. Its applications span diverse areas, including quality control, sensory analysis, food safety and product development.
In quality control, machine learning models can help people identify flaws in the food production process that may be related to contamination or other defects. In sensory analysis, machine learning models can use past data to understand consumer preferences and suggest new products that could be developed. Machine learning is useful in food safety as well, since models can analyze data from microbiological test results to predict possible contamination scenarios. Lastly, it can help tailor product development, not only making it more efficient but also finding trends and customer preferences in order to maximize profits.
All in all, machine learning is playing an ever-increasing role in food science.

## What is the data?

By **data** we refer to the **collection of samples** obtained through different experimental procedures. Usually, machine learning models work with data in tabular format, where each row is a sample and each column is a feature. In machine learning notation, we denote **the number of samples by m** and **the number of features by n**. The number of samples is the number of data points, while the features are the characteristics recorded for each of these data points. Machine learning models use these features to learn insights and to construct a mathematical equation that represents the data.

<a id="data"></a>
<img src="images\data.jpg" alt="Data" style="width:75%">
<center><figcaption><em>Fig 2: Data</em></figcaption></center>
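
To make the notation concrete, here is a small sketch of a tabular dataset stored as a list of rows. The feature names and the values are hypothetical and chosen only for illustration; they do not come from the datasets used in these tutorials.

```python
# Hypothetical measurements: each row is one sample, each column one feature.
feature_names = ["ph", "sugar_g_per_l", "alcohol_percent"]
samples = [
    [3.2, 2.1, 12.5],
    [3.4, 1.8, 13.0],
    [3.1, 6.5, 11.0],
    [3.6, 2.4, 12.8],
]

m = len(samples)      # number of samples (rows)
n = len(samples[0])   # number of features (columns)

# In a well-formed table every sample has a value for every feature.
assert all(len(row) == n for row in samples)
print(f"m = {m} samples, n = {n} features")

# A single feature is one column of the table, taken across all samples.
sugar_column = feature_names.index("sugar_g_per_l")
sugar = [row[sugar_column] for row in samples]
print(sugar)
```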

## Supervised vs Unsupervised Learning

Generally, there are four main types of machine learning algorithms: *supervised learning*, *unsupervised learning*, *semi-supervised learning* and *reinforcement learning*. In this series of tutorials we will explore supervised and unsupervised learning algorithms.

### Supervised Learning

**Supervised Learning** - in this setting we aim to build a model that learns from the data as well as possible and is able to predict future values. This is the same as building a mathematical equation or formula that takes many input variables and derives the desired output variable. The data points that the model learns from already have their corresponding outputs, which is how the model is able to derive a connection between inputs and outputs. There are two types of problems in the supervised setting: *regression* and *classification*. In **regression**, the output that the model learns and then tries to predict is a continuous value (e.g. the age or height of a person). In **classification**, the output that the model learns and then tries to predict is a categorical value (e.g. a class from a finite number of classes, such as whether a tumor cell is benign or malignant).


In the case of regression, after the model has learned from the data, when we give it a new, unseen sample it will output a continuous value, in line with the outputs it saw during the training phase. In the case of classification, after the model has learned from the data and is ready to be used, when we give it a new, unseen sample it will output a class or category from the set of categories that it saw during the training phase. [Figure 3](#sup_lear_reg_clf) illustrates the process.

<a id="sup_lear_reg_clf"></a>
<img src="images\supervised_learning__regression_classification.jpg" alt="Supervised learning regression and classification">
<center><figcaption><em>Fig 3: Regression and Classification</em></figcaption></center>

We will study all the steps in this process in future sections.
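
As a rough illustration of the difference between the two problem types, the sketch below applies the same simple idea, k-nearest neighbours, to both. For regression it averages the outputs of the closest training samples; for classification it takes a majority vote over their classes. The technique is chosen here only because it is short to write, and the data (storage temperature, shelf life in days and a quality class) are hypothetical.

```python
from collections import Counter


def nearest_neighbours(train_x, query, k):
    """Indices of the k training samples closest to the query (one feature)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return order[:k]


def knn_regress(train_x, train_y, query, k=3):
    """Regression: predict a continuous value as the neighbours' average."""
    idx = nearest_neighbours(train_x, query, k)
    return sum(train_y[i] for i in idx) / k


def knn_classify(train_x, train_labels, query, k=3):
    """Classification: predict the most common class among the neighbours."""
    idx = nearest_neighbours(train_x, query, k)
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]


# Hypothetical training data: storage temperature in degrees Celsius.
temperature = [2.0, 4.0, 6.0, 8.0, 20.0, 22.0, 25.0, 30.0]
# Continuous output for regression: shelf life in days.
shelf_life_days = [14.0, 13.0, 12.0, 11.0, 4.0, 3.5, 3.0, 2.0]
# Categorical output for classification: quality class after one week.
quality = ["acceptable", "acceptable", "acceptable", "acceptable",
           "spoiled", "spoiled", "spoiled", "spoiled"]

new_sample = 5.0  # a new, unseen storage temperature
predicted_days = knn_regress(temperature, shelf_life_days, new_sample)
predicted_class = knn_classify(temperature, quality, new_sample)
print(predicted_days, predicted_class)
```

Note that the inputs are identical in both cases; only the kind of output that was provided during training, a number or a class, changes what the model predicts.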

### Unsupervised Learning

**Unsupervised Learning** - in this setting, the data that we have does not come with any values or categories that we can learn and later predict. Instead, the models try to find structure in the data, or to learn the patterns present in it. Some use cases of such models are clustering, dimensionality reduction, data generation and anomaly detection. In clustering, we try to find groups within the data, so that similar samples are grouped together. In dimensionality reduction, we move from data with many features to compressed data with very few features. In data generation, as the name suggests, the model uses the unlabelled data to learn its structure or underlying properties and, based on this, generates similar samples. In anomaly detection, we use machine learning models to find outliers in the data, that is, points that do not resemble the majority of the points in the dataset. These models still produce an output, which reflects what the model has learned from the data. In clustering, the output is a cluster number that shows which other samples a specific sample is most similar to. In dimensionality reduction, the output is the same sample but with fewer features. [Fig 4](#unsup_lear) illustrates the idea.

<a id="unsup_lear"></a>
<img src="images\unsupervised_learning__clustering_dimred.jpg" alt="Unsupervised learning clustering and dimensionality reduction">
<center><figcaption><em>Fig 4: Clustering and Dimensionality Reduction</em></figcaption></center>

In this series of tutorials we will focus only on clustering and dimensionality reduction.
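
To give a feeling for what a clustering model outputs, here is a minimal sketch of the k-means algorithm written with the standard library only. It is a simplified illustration, not a production implementation: the initial centroids are simply the first `k` points (real implementations usually choose them more carefully), the number of iterations is fixed, and the two-feature data points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Very small k-means: returns a cluster number per point and the centroids."""
    # Simplifying assumption: start from the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical samples with two features each, forming two visible groups.
points = [(1.0, 1.1), (5.0, 5.2), (1.2, 0.9), (5.1, 4.9), (0.8, 1.0), (4.8, 5.0)]

labels, centroids = k_means(points, k=2)
print(labels)      # one cluster number per sample
print(centroids)   # the centre of each cluster
```

No outputs were given to the algorithm: the cluster numbers are derived purely from how close the samples are to each other.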

## Datasets Used during the Tutorials

## Data Processing for Machine Learning Methods

### Train-test Split

### Standardization

### Outlier detection

### Data Quality Control