# Data science

Data science is a field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

## Classification

Classification is a family of supervised machine learning tasks in which a model predicts the label (class) that a particular object belongs to.

- **Binary classification**: separates samples into two groups.
- **Multiclass classification**: separates samples into more than two groups.

### Multiclass

Multiclass classification is a group of machine learning approaches that separate data into $n$ groups.

There are two groups of machine learning algorithms for multiclass classification:

- **One-vs-Rest** trains a separate binary classification model for each class to distinguish samples belonging to that class from all the others: $f^i(x)=P(y=i|x), i = \overline{1,n}$.
- **Multinomial** builds a single transformation of the input data that produces a vector of scores, one per class, describing how likely the given $x$ is to belong to each class: $f(x) = \left[P(y=1|x), P(y=2|x), \ldots, P(y=n|x)\right]$.
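
The difference between the two approaches can be illustrated with a minimal sketch. It assumes that each One-vs-Rest model turns its raw score into a probability with a sigmoid, and that the multinomial model uses a softmax; neither choice is stated above, and the raw scores are made up for illustration. Class indices in the code start from 0 rather than 1.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map a list of raw scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_vs_rest_predict(binary_scores):
    """binary_scores[i] is the raw output of the i-th binary model f^i(x).

    Each probability is computed independently, so they need not sum to 1.
    """
    probs = [sigmoid(s) for s in binary_scores]
    return probs, probs.index(max(probs))


def multinomial_predict(class_scores):
    """class_scores is the raw output vector of a single model f(x)."""
    probs = softmax(class_scores)
    return probs, probs.index(max(probs))


# Hypothetical raw scores for one sample and n = 3 classes.
scores = [2.0, 0.5, -1.0]
ovr_probs, ovr_label = one_vs_rest_predict(scores)
multi_probs, multi_label = multinomial_predict(scores)

print("One-vs-Rest:", ovr_probs, "->", ovr_label)
print("Multinomial:", multi_probs, "->", multi_label)
```

In both cases the predicted class is the index of the largest probability; only the multinomial output is guaranteed to form a probability distribution over the classes.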

## DL mechanisms

Deep learning (DL) is an advanced form of machine learning. It expresses the concept of the human brain in mathematical terms: a set of signals (actions) that appears when a specific input is received. In mathematics, this is usually implemented as a deep composition of functions, which is where the term "deep learning" comes from.

DL is typically used in computer vision and complex NLP tasks. Those applications have their own sections; this section covers general deep learning approaches.

Check the details on the [DL mechanisms](dl_mechanisms.ipynb) page.
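
The "deep composition of functions" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `linear`, `relu`, and `compose`, as well as all weights, are hypothetical and hand-picked, whereas a real network would learn its weights from data.

```python
from functools import reduce


def linear(weights, bias):
    """Return a layer x -> W x + b, with W given as a list of rows."""
    def layer(x):
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)
        ]
    return layer


def relu(x):
    """Element-wise non-linearity: negative values become zero."""
    return [max(0.0, v) for v in x]


def compose(*functions):
    """compose(f1, f2, f3)(x) computes f3(f2(f1(x)))."""
    return lambda x: reduce(lambda value, f: f(value), functions, x)


# A tiny "network": two linear layers with a non-linearity in between.
network = compose(
    linear([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    relu,
    linear([[1.0, 2.0]], [-1.0]),
)

output = network([2.0, 3.0])
print(output)
```

Adding more layers to `compose` makes the composition deeper without changing how it is evaluated.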

## Computer Vision

Computer vision tasks are commonly divided into categories. It is important to understand the differences in order to select the correct approach.

- Classification: a model is trained on a set of labeled images. The goal of the model is to predict the label that an image belongs to.
- Object Detection: identifying the location of a specific object in an image. In the most typical case, this means drawing a box around the object in the image.
- Semantic segmentation: assigning to each pixel a class that identifies the object the pixel belongs to.
- Image generation.
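
The first three categories differ mainly in the granularity of their output: one label per image, one box per object, or one class per pixel. The toy sketch below starts from a made-up per-pixel mask and derives the coarser outputs from it, purely to compare their shapes. Real models predict each output directly from the image; the mask values, class names, and helper functions here are hypothetical.

```python
# Hypothetical semantic segmentation output for a 4x6 image:
# one class id per pixel (0 = background, 1 = "cat").
CLASS_NAMES = {0: "background", 1: "cat"}
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def bounding_box(mask, class_id):
    """Object-detection-style output: (x_min, y_min, x_max, y_max) or None."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value == class_id
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def image_label(mask):
    """Classification-style output: the most frequent non-background class."""
    counts = {}
    for row in mask:
        for value in row:
            if value != 0:
                counts[value] = counts.get(value, 0) + 1
    if not counts:
        return CLASS_NAMES[0]
    return CLASS_NAMES[max(counts, key=counts.get)]


box = bounding_box(mask, 1)
label = image_label(mask)
print("segmentation:", len(mask), "x", len(mask[0]), "class ids")
print("detection:", box)
print("classification:", label)
```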

## NLP

Natural Language Processing (NLP) is a set of tasks related to processing text.

This field considers approaches related to:

- Text classification: determining which class, from a given set of classes, a text belongs to.
    - Sentiment analysis: identifying the tone of the text. For example, it can be positive, negative or neutral. **Note:** Some sources treat this as a separate task.
- Named Entity Recognition (NER): identifying specific components (named entities) within texts that carry particular pieces of information.
- Text generation.

For more details, check the [NLP](nlp.ipynb) page.

## ARMA

These models interpret an observed time series as a sum of two components:

- $x_t$: the deterministic part of the $t$-th element of the time series.
- $\varepsilon_t$: the random noise of the $t$-th element of the time series.

Thus, the observed value in the sample is actually $x_t + \varepsilon_t$.


The ARMA model assumes that the $t$-th value of the time series ($x_t$) depends linearly on $p$ previous values of the time series ($x_{t-1}, x_{t-2}, \ldots, x_{t-p}$) and $q$ previous values of the random noise ($\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-q}$).

It is typically written as the following equation:

$$X_t + \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \ldots + \alpha_p X_{t-p} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \ldots + \theta_q \varepsilon_{t-q}.$$

Where

- $\alpha_i, i = \overline{1,p}$: the coefficient that describes how the $t$-th value of the time series depends on the $(t-i)$-th value of the time series.
- $\theta_i, i = \overline{1,q}$: the coefficient that describes how the $t$-th value of the time series depends on the random noise of the $(t-i)$-th observation.
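
Solving the defining equation for $X_t$ gives the recursion $X_t = -\sum_{i=1}^{p} \alpha_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$, which can be simulated directly. The sketch below follows that sign convention. The function name `simulate_arma`, the Gaussian noise, the zero values used for missing history, and the example coefficients are all assumptions made for illustration.

```python
import random


def simulate_arma(alpha, theta, n, sigma=1.0, seed=0):
    """Simulate n values of X_t + sum(alpha_i X_{t-i}) = eps_t + sum(theta_j eps_{t-j}).

    Assumptions: eps_t is Gaussian with zero mean and standard deviation
    sigma, and values before the start of the series are treated as zero.
    """
    rng = random.Random(seed)
    p, q = len(alpha), len(theta)
    x, eps = [], []
    for t in range(n):
        e_t = rng.gauss(0.0, sigma)
        # Autoregressive part: weighted sum of the p previous values.
        ar_part = sum(alpha[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # Moving-average part: weighted sum of the q previous noise terms.
        ma_part = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x.append(-ar_part + e_t + ma_part)
        eps.append(e_t)
    return x, eps


# Hypothetical ARMA(2, 1) coefficients.
series, noise = simulate_arma(alpha=[-0.5, 0.2], theta=[0.4], n=200, seed=42)
print(series[:5])
```

Note that with this convention a negative $\alpha_1$ corresponds to a positive dependence of $X_t$ on $X_{t-1}$.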

The formal definition can be a bit confusing because it does not express a particular value of the time series explicitly. Moving the lagged values to the right-hand side and $\varepsilon_t$ to the left-hand side gives:

$$X_t - \varepsilon_t = - \alpha_1 X_{t-1} - \alpha_2 X_{t-2} - \ldots - \alpha_p X_{t-p} + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \ldots + \theta_q \varepsilon_{t-q}$$

Since $\varepsilon_t$ is just noise, its sign is not important (assuming the noise is distributed symmetrically around zero). Replacing every noise term $\varepsilon_s$ with $-\varepsilon_s$ rewrites the identity as follows:

$$X_t + \varepsilon_t = - \alpha_1 X_{t-1} - \alpha_2 X_{t-2} - \ldots - \alpha_p X_{t-p} - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \ldots - \theta_q \varepsilon_{t-q}$$