# **Introduction**
In the beginning, data were used to carry out day-to-day operations.
Most of the data were used once and then merely stored for archival purposes.
**We can use data to improve our decision processes and learn from them.**

Lately there has been a so-called **data explosion**, driven by mature DBMS (database management system) technology, cheap storage and automatic data collection.

It is much easier to store data than to analyze them, so the distance between data generation and data comprehension keeps increasing (analysis isn't cheap!).

> "We are drowning in data and starved for information."

## How to learn from data
1. **Statistics**: descriptive or inferential, relies on statistical models;
2. **Machine learning**: gives computers the ability to learn without being explicitly programmed. 
Comes mainly in two flavours: learning by being told (from instructions) or learning from examples;
3. **Data mining**: a computational process to discover patterns in data (large datasets, digitally stored).
It uses methods from statistics, machine learning, artificial intelligence and DBMS.
It is a data-driven approach, where data are the main "source of energy".

## A very short terminology
**Business intelligence**: analyze huge amounts of data with some tools for business purposes;

**Analytics**: learn to draw specific conclusions from raw data;

**Data science**: umbrella term for all of the above.

## Discovery process (KDD process)
1. Sources of raw data (DBMS, data sensors, web, ...);
> **Data consolidation**: the process that combines data from wherever they live, removes redundancies and cleans up errors before storing everything in one location, such as a data warehouse or a data lake.

2. Consolidated data;
> **Selection and preprocessing**: not all data are useful.

3. Prepared data;
> **Machine learning**: techniques to extract patterns and models.

4. Patterns and models;
> **Interpretation and evaluation**.

5. Knowledge.

**Data mining**: starts from user-specified objectives (usually not a very precise idea...) and looks for knowledge expressed as patterns and models.
Knowledge is derived from data and must be actionable (otherwise it is pretty useless).
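
The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative toy, not a real pipeline: the source names, the records, the `tolerance` parameter and the "model" (a plain mean) are all assumptions made up for the example.

```python
from statistics import mean

# Hypothetical raw readings from two made-up sources (illustrative only).
raw_sources = {
    "sensor_a": [{"id": 1, "temp": 20.5}, {"id": 2, "temp": None}],
    "sensor_b": [{"id": 2, "temp": 21.0}, {"id": 3, "temp": 35.0},
                 {"id": 1, "temp": 20.5}],
}

def consolidate(sources):
    """Data consolidation: merge all sources, drop duplicates and missing values."""
    seen, merged = set(), []
    for records in sources.values():
        for rec in records:
            key = (rec["id"], rec["temp"])
            if rec["temp"] is not None and key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

def select_and_preprocess(records):
    """Selection and preprocessing: keep only the attribute of interest."""
    return [rec["temp"] for rec in records]

def mine(values):
    """Pattern extraction: here the 'model' is just the typical (mean) value."""
    return {"typical": mean(values)}

def interpret(model, values, tolerance=6.0):
    """Interpretation: readings far from the typical value deserve attention."""
    return [v for v in values if abs(v - model["typical"]) > tolerance]

values = select_and_preprocess(consolidate(raw_sources))
model = mine(values)
knowledge = interpret(model, values)
print(knowledge)  # [35.0]
```

The output of the last stage is actionable: it points at the readings that should be checked.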

![](https://i.ibb.co/cTT4Rd2/photo-2020-12-29-16-32-10.jpg)

## The virtuous loop
* Identify a process to improve: find a problem or an opportunity;
* Apply the KDD process and infer some knowledge;
* Use this knowledge to design data-driven actions, apply them and get some results;
* Measure the effects of these actions and work out a strategy to find the next problem or opportunity.

## A toy example: soybean diseases
Diagnose soybean diseases from a small dataset with the help of domain experts (rule-based system).
* Elicitation of rules is difficult, time consuming and expensive;
* Rules aren't independent and should be carefully checked;
* The accuracy with the expert rules alone is about 72%: learning by being told does not capture all of the experts' knowledge.

Alternative approach with machine learning:
* Machine learning is used to generate classification rules;
* The accuracy with the learned rules is about 97.5%, comparable with that of junior experts (learning from examples).
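
To give an idea of what "generating classification rules from examples" can look like, here is a minimal sketch of a one-rule (1R-style) learner. It is not the method used in the soybean study, and the records, attribute names and disease labels below are hypothetical, not taken from the real soybean dataset.

```python
from collections import Counter, defaultdict

# Hypothetical labelled examples (NOT the real soybean data).
examples = [
    ({"leaf_spots": "yes", "stem_lesions": "no"}, "disease_a"),
    ({"leaf_spots": "yes", "stem_lesions": "no"}, "disease_a"),
    ({"leaf_spots": "no", "stem_lesions": "yes"}, "disease_b"),
    ({"leaf_spots": "no", "stem_lesions": "yes"}, "disease_b"),
    ({"leaf_spots": "yes", "stem_lesions": "yes"}, "disease_b"),
]

def learn_one_rule(examples):
    """Return (attribute, value -> class mapping, training errors) for the
    single attribute whose majority-class rule makes the fewest mistakes."""
    best = None
    for attr in examples[0][0]:
        # Count the classes observed for each value of this attribute.
        counts = defaultdict(Counter)
        for features, label in examples:
            counts[features[attr]][label] += 1
        # Rule: each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[features[attr]] for features, label in examples)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

attribute, rule, errors = learn_one_rule(examples)
print(attribute, rule, errors)
```

On this toy data the learner picks `stem_lesions`, because that attribute alone separates the two classes without errors. No expert had to write the rule down.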

## General application areas
* Decision support (market analysis, risk management, fraud detection);
* Data analysis (text mining, social mining, image analysis);
* Prediction;
* Advanced diagnosis and predictive maintenance.

## From business problem to tasks
Data mining is a process that leads to a software deployment.
This process has several stages, several alternatives and well-defined tasks.
The main tasks are:
* **Classification and class probability estimation**: discrete domain;
* **Regression**: continuous domain, given a set of numeric attribute values for an individual, estimate the value of another numeric attribute;
* **Similarity matching**: identify similar individuals based on data known about them (similarity measure can be complex);
* **Clustering**;
* **Co-occurrence grouping**;
* **Profiling**: describes behaviour; the population is usually divided into groups with similar behaviours. Useful to detect anomalies;
* **Link analysis**: in a domain made of items and connections (a graph), try to infer missing connections;
* **Data reduction**: replace a large dataset with a smaller one that preserves most of the important information. Smaller datasets are easier to manipulate, but the reduction causes some information loss;
* **Causal modelling**: understand which actions or events influence others.
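
As an example of one of these tasks, similarity matching can be sketched with a plain Euclidean distance. The individuals and their two numeric attributes are hypothetical, and a real similarity measure is often more complex (different scales, categorical attributes, weights).

```python
import math

# Hypothetical individuals described by two numeric attributes.
individuals = {
    "ann": (25, 30.0),
    "bob": (27, 32.0),
    "cat": (60, 90.0),
}

def most_similar(name, individuals):
    """Return the other individual closest to `name` (Euclidean distance)."""
    target = individuals[name]
    others = (n for n in individuals if n != name)
    return min(others, key=lambda n: math.dist(target, individuals[n]))

print(most_similar("ann", individuals))  # bob
```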

## Supervised vs. unsupervised methods
* **Unsupervised mining**: does our population naturally fall into different groups? There is no specific purpose or target for grouping: it emerges by observing the characteristics of the individuals;
* **Supervised learning**: a specific target is defined (e.g. churn analysis).

The techniques are substantially different. Being supervised or unsupervised is a characteristic of the problem or the data, not a design choice.
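
A minimal sketch of the unsupervised case: no target is given, and two groups emerge only from the values themselves. The function below is a bare-bones two-means procedure on one numeric attribute. The data are made up, and the initialisation (smallest and largest value) is an assumption chosen to keep the example deterministic.

```python
def two_means(values, iterations=10):
    """Split numeric values into two groups around two moving centers.
    Assumes at least two distinct values."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to its nearest center.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group.
        centers = [sum(g) / len(g) for g in groups]
    return groups

low, high = two_means([1, 2, 3, 10, 11, 12])
print(low, high)  # [1, 2, 3] [10, 11, 12]
```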

There are two main ways to obtain supervised information:
1. Information provided by experts;
1. History: the information is not available at run-time, when we have to decide, but later on history will tell us the value of the unknown attribute that influences our actions. We want to learn how to guess the unknown attribute from the known ones.
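
Learning from history can be sketched as follows: past records carry both a known attribute and the target that was revealed later, and we learn a rule to guess the target for new individuals. The records and the single-threshold rule are assumptions for illustration only; think of the known attribute as a hypothetical "number of complaints" and of the target as "the customer churned".

```python
# Hypothetical history: (known attribute, target revealed later).
history = [(0, False), (1, False), (2, False), (5, True), (6, True), (8, True)]

def learn_threshold(history):
    """Pick the threshold t for which the rule 'target is True when
    value >= t' makes the fewest mistakes on the historical records."""
    candidates = sorted({value for value, _ in history})

    def errors(t):
        return sum((value >= t) != target for value, target in history)

    return min(candidates, key=errors)

threshold = learn_threshold(history)

def predict(value):
    """Guess the unknown target from the known attribute."""
    return value >= threshold

print(threshold, predict(3), predict(7))  # 5 False True
```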

## Reinforcement learning
The goal is to find a sequence of actions that obtains the best result.
Learn a policy with the loop:
> * Try a policy;
> * Get a reward;
> * Change the policy.

The focus is the overall policy rather than the single actions.
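
The try/reward/change loop can be sketched with a deliberately simple policy search (hill climbing), which is not a full reinforcement learning algorithm. The environment is an assumption made up for the example: the policy is a fixed sequence of steps (+1 or -1) on a line, and the reward depends only on how close the final position is to a hypothetical `goal`.

```python
def total_reward(policy, goal=3):
    """Reward for the whole sequence of actions, not for single steps:
    the closer the final position is to the goal, the better."""
    position = sum(policy)  # each action is +1 or -1
    return -abs(position - goal)

def improve(policy, rounds=10):
    """Repeatedly try a changed policy and keep it if the reward improves."""
    policy = list(policy)
    best = total_reward(policy)                # try the initial policy
    for _ in range(rounds):
        changed = False
        for i in range(len(policy)):
            candidate = policy.copy()
            candidate[i] = -candidate[i]       # change the policy
            reward = total_reward(candidate)   # try it and get a reward
            if reward > best:                  # keep the change if it helps
                policy, best, changed = candidate, reward, True
        if not changed:
            break
    return policy, best

policy, reward = improve([-1, -1, -1, -1, -1])
print(policy, reward)  # [1, 1, 1, 1, -1] 0
```

Note that no single action is judged good or bad in isolation: only the reward of the complete sequence drives the changes.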

