# What is machine learning?

## Simple example
- Jessie wants to buy an apartment and starts to do some research:
    - Apartments of the same size seem to cost more near the city center
    - Within the same region, the price of an apartment seems to be mainly a function of its size


- Jessie **learns from experience** that apartment prices depend mainly on size and location


- **This is what all machine learning is**: learning patterns from data


- However, there is more to apartment prices than just the location and size:
    - How functional is the layout? 
    - Has the plumbing been renovated?
    - Which floor is the apartment on?
    - How expensive is maintenance?
    - And many more factors...


- This is where humans usually give up (and hire a real estate agent): the problem becomes too complex! There is too much data and too many factors.


- Luckily, computers are good at math and they are not afraid of data. Can we just give all the data to a computer and ask it to estimate the price?
    - Yes we can! And this course teaches you the important basics you need in order to start teaching your computer to find patterns in data.
    - See this example on predicting apartment prices in Finland with machine learning: [kannattaakokauppa](http://kannattaakokauppa.fi/#/en/ 'kannattaakokauppa')

# Vocabulary of machine learning

Machine learning has its own vocabulary, and you need to understand a few terms to avoid getting confused. We use these terms all the time in this course, so please pay attention:
### Basic terms in machine learning
- **Features**/**Variables**: A feature (or variable) is an individual measurable or observable property or characteristic of the phenomenon we are interested in. E.g. in the apartment price example, the features might be the location and the size of the apartment. The image below illustrates how features are usually presented in machine learning. Different characteristics can also be modeled as different types. For example, the location of an apartment can be represented as continuous coordinates (latitude + longitude) or as a zip code.
<img src="img/feature-type.jpg" width="600"/>
- **Data set**: A data set consists of data points. Each data point consists of at least one feature. For example, if we collect the sizes and locations of multiple apartments in the apartment price example, we have a data set. In addition to data, a data set *usually* also contains metadata. This metadata usually contains the names of the features and sometimes, for example, the date when the data set was created. The image below illustrates a data set constructed from simple drawings. The data set consists of eight data points, each with three features (color, shape and size). The image has the original drawings on the left and the data set on the right. Usually data sets are organized as matrices: rows represent data points, columns represent features, and cells contain the feature values of the data points.
<img src="img/data.jpg" width="400"/>

- **Algorithm**: Algorithms are the workhorse of machine learning. Different problems require different algorithms, and there is no single (right) algorithm for any problem. However, different algorithms make different assumptions about the problem, so it is important to know what purpose each algorithm was designed for. One point applies to all algorithms: if the data is of low quality, it does not really matter how good an algorithm you choose. This is often referred to as "garbage in, garbage out".
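
The matrix layout of a data set described above (rows are data points, columns are features) can be sketched in plain Python. This is only an illustration: the variable names and the four drawings are made up, and they are not the eight data points of the image.

```python
# Hypothetical data set of simple drawings; the values are made up for illustration.
feature_names = ["color", "shape", "size"]  # metadata: the names of the features

# Each row is one data point, each column is one feature.
data = [
    ["red", "circle", "large"],
    ["blue", "square", "small"],
    ["red", "triangle", "small"],
    ["green", "circle", "large"],
]

n_data_points = len(data)
n_features = len(feature_names)

# Every data point must have a value for every feature.
assert all(len(row) == n_features for row in data)

# A column holds the values of one feature for all data points.
shape_column = [row[feature_names.index("shape")] for row in data]

print(n_data_points, "data points,", n_features, "features")
print("shape:", shape_column)
```

In practice a table library is usually used for this, but the idea stays the same: one row per data point, one column per feature.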

# What kind of problems can we solve with (classical) machine learning?

In order to do classical machine learning, we need two things:
- Enough data that is represented as numbers (e.g. how tall is this tree?) or discrete classes (is this tree a spruce or a pine?) or both (this tree is a 20m tall spruce)
- A clearly defined question: What do we want to learn from the data?


Classical machine learning can be divided into two subcategories based on the question we are interested in:
- Supervised machine learning:
    - We are interested in predicting a class or number from data.
        - **Regression**: Predict a number;
            - E.g. predict the height of a tree in a picture.
        - **Classification**: Predict a category;
            - E.g. predict if there is a spruce or pine in a picture
    - When training the machine, the available data has to contain **labelled** samples.
        - E.g. we have a set of images and know the height/tree species in each of them. 

       
- Unsupervised machine learning
    - We are interested in finding something unknown from the data, without knowing the answer ourselves.
        - **Clustering**: Divide by similarity
            - E.g. divide a set of images of trees into groups of similar trees, maybe the algorithm can find 
        - **Association**: Find relations in data 
            - E.g. a data set might show that people who buy cheese and butter also buy bread
        - **Dimensionality reduction**: Represent data in a more efficient way
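
To make the difference between the two supervised tasks concrete, here is a minimal sketch with labelled samples. Everything in it is an assumption made for illustration: the numbers are made up, trunk diameter stands in for whatever features would be extracted from a picture, and the line fit and nearest-sample rule are just two very simple methods, not the ones this course recommends.

```python
# Made-up labelled samples: (trunk diameter in cm, height in m, species).
samples = [
    (10.0, 8.0, "pine"),
    (20.0, 14.0, "pine"),
    (30.0, 20.0, "spruce"),
    (40.0, 26.0, "spruce"),
]

diameters = [s[0] for s in samples]
heights = [s[1] for s in samples]
species = [s[2] for s in samples]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict_height(diameter):
    """Regression: the prediction is a number."""
    slope, intercept = fit_line(diameters, heights)
    return slope * diameter + intercept


def predict_species(diameter):
    """Classification: the prediction is a category (label of the nearest sample)."""
    nearest = min(range(len(diameters)), key=lambda i: abs(diameters[i] - diameter))
    return species[nearest]


print(predict_height(25.0))
print(predict_species(12.0))
```

Both functions learn from the same labelled data; the only difference is whether the label is a number (height) or a category (species).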

# The boring but important part: preparing the data 
No one wants to spend time manually looking at the data, but please do. Remember: "*Garbage in, garbage out*", and we don't want that.

Assuming that you have a problem you want to solve with a machine learning model, here is what you need to do before getting to the modeling part:

## Acquire data
The steps of acquiring data:
    
1) Decide which features are potentially beneficial for the task
- In classical machine learning, all data points must contain all features. No missing "cells" (features for data points) allowed in data set. It might be a good idea to 

2) Label the data points if the task is supervised
- This might be expensive, as a human often needs to do the work. This step requires time, so please be careful! Even the best model is only as good as the data you have fed to it - you don't want to make mistakes here!

*But how much data is enough?* 

There is no correct answer for this - it depends on the quality of the data, the complexity of the task and the number of features. Usually more is better. As a rough rule of thumb: 100 data points for really simple tasks, 1000 data points for simple tasks and 10000 data points for complex tasks.

*Please note*

Humans get biased really easily. We learn what we see. In order to stay unbiased, we shouldn't filter the data we see. The same applies to algorithms. Pay attention to not being selective or biased when collecting data.

For more on what happens if data is biased, see this post on the [racist Twitter algorithm](https://www.boredpanda.com/twitter-image-centering-algorithm-racist/) that went viral in 2020. Oops! You don't want your machine learning algorithm to be racist, huh?

## Preprocess data

Some algorithms are bad at handling different kinds of data, which is why data is usually preprocessed or "pre-chewed" for the algorithm. This might include one or several of the following:

For more details see [link](https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825)

### Represent your data as numbers
Computers love numbers, but unfortunately data is not always numbers. Categorical variables are especially troublesome and should be represented as numbers.

The easiest ones are nominal or ordinal variables with only two options (yes/no, available/not available, bad/good etc...). They can simply be replaced by 1 and 0 so that e.g. all "yes" are replaced with 1 and all "no" are replaced with 0. Simple! 

Nominal variables with more than two options are usually represented as so-called "one-hot" vectors. For instance, if there are three different categories (A, B and C), we represent the category with three separate binary variables so that exactly one of them is one. The image below clarifies how this is done: 
<img src="img/data-binary.jpg" width="500"/>

For ordinal variables with multiple options, it is also possible to replace the options with integers. This way the order is retained. However, this is not always a good idea, since the options might not be equally far apart. For example, if the options are "bad", "good" and "excellent", the difference between "bad" and "good" may be larger than between "good" and "excellent".

In addition to changing the data, it is usually necessary to also update the metadata related to the features. For binary representations, it should be clear from the name of the feature what 1 and 0 mean. For example, in the data set illustrated in the image above there are two options for size (large and small). Since the option "large" is replaced by ones, the feature is renamed to "large".
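
The three encodings described in this section (binary, one-hot and ordinal) can be sketched in a few lines of plain Python. The function names and example values are made up for illustration; this is a minimal sketch, not a replacement for the encoders that data libraries provide.

```python
def encode_binary(values, positive):
    """Replace a two-option variable with 1 (the positive option) and 0 (the other)."""
    return [1 if value == positive else 0 for value in values]


def encode_one_hot(values):
    """Replace a nominal variable with one binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


def encode_ordinal(values, order):
    """Replace an ordinal variable with integers that keep the given order.

    Note: this assumes the options are equally far apart, which may not be true.
    """
    return [order.index(value) for value in values]


# The binary feature is stored under the name "large", so 1 clearly means large.
large = encode_binary(["large", "small", "small"], positive="large")
categories, one_hot_rows = encode_one_hot(["A", "C", "B"])
ratings = encode_ordinal(["good", "bad", "excellent"], order=["bad", "good", "excellent"])

print(large)
print(categories, one_hot_rows)
print(ratings)
```

In each one-hot row exactly one column is 1, and the list of categories tells which column belongs to which option.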

For more details on the topic see [link](https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63)


### Remove mistakes - if you can
Data is full of mistakes. Mistakes can be caused by humans or machines. 
- Human-made mistakes are usually typos, as it is surprisingly easy to input "0.001" instead of "0.01" or "180m" instead of "180cm". 
- Machine-made mistakes are caused, for example, by faulty sensors.

Some models are bad with mistakes. Humans usually don't get distracted by mistakes. If you see a record of adult heights, you instantly see that 180m must be a mistake, as the rest of the data probably lies between 150cm and 200cm. However, if an algorithm sees the same record, it might get super confused and try to somehow explain the oddity.

Removing mistakes is hard for large data sets. Also, different people usually collect and analyse the data, so the one analysing the data might not even be able to tell if a value is a mistake. The crudest mistakes are usually easy to spot with visualizations. We will soon show how.
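
Even without a plot, a crude check can catch the worst mistakes. The sketch below is not a visualization; it simply flags values outside a plausible range. The heights and the range are assumptions made up for illustration, and a flagged value still needs a human to decide whether it really is a mistake.

```python
# Made-up record of adult heights in cm; one value was entered as 180 m (18000 cm).
heights_cm = [172.0, 165.0, 180.0, 18000.0, 158.0, 191.0]

# Assumed plausible range for adult height in cm (an illustrative choice).
PLAUSIBLE_RANGE_CM = (50.0, 250.0)


def find_suspicious(values, low, high):
    """Return (index, value) pairs for values outside the range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


suspicious = find_suspicious(heights_cm, *PLAUSIBLE_RANGE_CM)
print(suspicious)
```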

### Balance the dataset
Suppose the question you are interested in is predicting some rare event. For example, you want to find an oil spill from satellite data, or you want to classify Finnish meadows based on whether [Melitaea diamina](https://en.wikipedia.org/wiki/Melitaea_diamina) lives there. In such cases the data set is potentially really imbalanced. Imbalanced means that one class (no oil spill, no butterfly) is much, much more common than the other.

In such cases an algorithm can appear to perform well simply by predicting, for each data point (satellite image, meadow), that the rare event does not occur. It is the same as if someone asked you to predict whether Finland will win the football world championship this year. By answering no, the odds are in your favor.

In these cases, to force the algorithm to learn about the rare events, it is beneficial to train the algorithm with a more balanced data set. This can be achieved by removing most of the samples without the rare event, so that each class makes up around 50% of the samples.
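
Removing samples of the common class as described above (often called undersampling) can be sketched as follows. The function name, the fixed random seed and the 95/5 example are assumptions for illustration only.

```python
import random


def undersample(data_points, labels, seed=0):
    """Randomly drop samples so that every class is as large as the rarest class."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Group the data points by their class label.
    by_class = {}
    for point, label in zip(data_points, labels):
        by_class.setdefault(label, []).append(point)

    # Keep as many samples per class as the rarest class has.
    smallest = min(len(points) for points in by_class.values())
    balanced = []
    for label, points in by_class.items():
        for point in rng.sample(points, smallest):
            balanced.append((point, label))

    rng.shuffle(balanced)  # avoid having the classes in blocks
    return balanced


# Made-up imbalanced data: 95 images without an oil spill, 5 with one.
data_points = list(range(100))
labels = ["no spill"] * 95 + ["spill"] * 5

balanced = undersample(data_points, labels)
print(len(balanced), "samples after balancing")
```

Note that this throws data away: here 90 of the 100 samples are dropped, which is the price of the 50% + 50% balance.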


### Scale numeric data to zero mean, unit variance

Some models are bad at understanding scale. Usually algorithms learn faster if the data has been processed so that numeric data of all features has zero mean and unit variance. A good rule of thumb is to always scale numerical data (discrete and continuous) and leave binary data as it is.

The reason why scaling is a good idea lies in how some algorithms (like neural networks) are designed: their output effectively changes only on a particular input range. If the input is not in that range, the output saturates. The saturation of the output causes the gradients of the model parameters to become very small, which makes it hard to train the algorithm.
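
Scaling one numeric feature to zero mean and unit variance means subtracting the mean and dividing by the standard deviation. A minimal sketch, assuming a single feature stored as a list; the function name and the apartment sizes are made up for illustration:

```python
import statistics


def standardize(values):
    """Scale a numeric feature to zero mean and unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Made-up apartment sizes in square meters.
sizes_m2 = [30.0, 45.0, 60.0, 75.0, 90.0]
scaled = standardize(sizes_m2)
print(scaled)
```

Each numeric feature is scaled separately with its own mean and standard deviation; binary features are left as they are.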

## Split the data into training and validation sets
The aim of the 
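
A random split into a training set and a validation set can be sketched as follows. The function name, the 80/20 ratio and the fixed seed are assumptions for illustration; the right ratio depends on how much data you have.

```python
import random


def split_data(data_points, validation_fraction=0.2, seed=0):
    """Shuffle the data points and split them into (training, validation) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(data_points)
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Ten made-up data points, identified here only by their index.
training, validation = split_data(range(10))
print(len(training), "training points,", len(validation), "validation points")
```

Every data point ends up in exactly one of the two sets, so the validation data is never used for training.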


## Train your model

## Validate the performance of your model with the validation data

## Iterate

# Why should you split your data?


![The principle of machine learning](img/classical_ml.jpg)

# What does this course cover?
- Introduction to ML & the basic concepts: 
    * Data preparation 
    * Why you should split your data set: training, testing and validation data sets 
- Regression: Methods for predicting a numeric value 
- Classification: Methods for predicting a qualitative class (species, forest type, …)  
- Clustering: Methods for dividing data into groups  
- Making sure the model does what it is supposed to: Model validation, validation metrics & their interpretation 
- Modern machine learning methods: ANN, deep learning 
- How to handle uncertainty: Bayesian machine learning 
- Exercises that provide hands-on experience and useful scripts to later rely on 