# Data basics

## Scientific Type

### Data interpretation: Scientific Types

In [1]:
using Pkg; Pkg.activate("D:/JULIA/6_ML_with_Julia/D0-scitype"); Pkg.instantiate()

[32m[1m  Activating[22m[39m project at `D:\JULIA\6_ML_with_Julia\D0-scitype`


### Machine type vs Scientific Type

#### Why make a distinction?

When analysing data, it is important to distinguish between

- how the data is encoded (e.g. ```Int```), and

- how the data should be interpreted (e.g. a class label, a count, ...)

How the data is encoded will be referred to as the **machine type** whereas how the data should be interpreted will be referred to as the **scientific type** (or ```scitype```).

In some cases this is unambiguous. For instance, a vector of floating point values should usually be interpreted as a continuous feature (e.g. weights, speeds, temperatures, ...).

In many other cases, however, there may be ambiguities. We list a few examples below:

- A vector of ```Int``` e.g. ```[1, 2, ...]``` which should be interpreted as categorical labels,

- A vector of ```Int``` e.g. ```[1, 2, ...]``` which should be interpreted as count data,

- A vector of ```String``` e.g. ```["High", "Low", "High", ...]``` which should be interpreted as ordered categorical labels,

- A vector of ```String``` e.g. ```["John", "Maria", ...]``` which should not be interpreted as informative data,

- A vector of floating points ```[1.5, 1.5, -2.3, -2.3]``` which should be interpreted as categorical data (e.g. the few possible values of some setting), etc.
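The ambiguity listed above can be made concrete with a small sketch. It assumes ScientificTypes.jl is installed; the vectors ```codes``` and ```settings``` are made-up illustrations, not data from this tutorial. The same machine type can carry different interpretations depending on how it is coerced:

```julia
using ScientificTypes

codes = [1, 2, 1, 3, 2]            # machine type: Vector{Int}

# Under the default convention, integers are read as counts.
@show eltype(codes)
@show elscitype(codes)             # Count

# The same numbers, read as unordered class labels instead.
labels = coerce(codes, Multiclass)
@show elscitype(labels)            # Multiclass{3}

# Floats with only a few distinct values can also be treated as categories.
settings = coerce([1.5, 1.5, -2.3, -2.3], Multiclass)
@show elscitype(settings)          # Multiclass{2}
```

Nothing in the raw vectors says which reading is intended; that decision comes from knowledge of the data.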

#### The Scientific Types

The package ```ScientificTypesBase.jl``` defines a bare-bones type hierarchy which can be used to indicate how a particular feature should be interpreted; in particular:

```Julia
Found
├─ Known
│  ├─ Textual
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  └─ Infinite
│     ├─ Continuous
│     └─ Count
└─ Unknown

```
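Since the hierarchy consists of ordinary Julia types, its branches can be checked with the subtype operator. The sketch below assumes that ```using ScientificTypes``` brings every name shown in the tree into scope:

```julia
using ScientificTypes

# Each relation mirrors one branch of the tree.
@show Continuous <: Infinite
@show Count <: Infinite
@show Multiclass <: Finite
@show OrderedFactor <: Finite
@show Infinite <: Known
@show Textual <: Known
@show Unknown <: Found

# Siblings are not subtypes of one another.
@show Count <: Continuous          # false
```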

A scientific type convention is a specific implementation indicating how machine types can be related to scientific types. It may also provide helper functions to convert data to a given scitype.

The convention used in MLJ is implemented in ScientificTypes.jl, which is what we will use throughout. You never need to use ScientificTypesBase.jl directly unless you intend to implement your own scientific type convention.
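As a rough illustration of what a convention does, the sketch below asks for the scitype the default convention assigns to a few plain Julia values. The values are arbitrary examples chosen for this sketch:

```julia
using ScientificTypes

@show scitype(1)                   # Count
@show scitype(1.0)                 # Continuous
@show scitype("text")              # Textual
@show scitype('c')                 # Unknown

# Missing values are reflected in the element scitype.
@show elscitype([1, missing, 3])   # Union{Missing, Count}
```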

#### Inspecting the scitype

The ```schema``` function summarises the column names, scitypes and machine types of a table:

In [2]:
using RDatasets
using ScientificTypes

In [3]:
boston = dataset("MASS", "Boston")
sch = schema(boston)

┌─────────┬────────────┬─────────┐
│ names   │ scitypes   │ types   │
├─────────┼────────────┼─────────┤
│ Crim    │ Continuous │ Float64 │
│ Zn      │ Continuous │ Float64 │
│ Indus   │ Continuous │ Float64 │
│ Chas    │ Count      │ Int64   │
│ NOx     │ Continuous │ Float64 │
│ Rm      │ Continuous │ Float64 │
│ Age     │ Continuous │ Float64 │
│ Dis     │ Continuous │ Float64 │
│ Rad     │ Count      │ Int64   │
│ Tax     │ Count      │ Int64   │
│ PTRatio │ Continuous │ Float64 │
│ Black   │ Continuous │ Float64 │
│ LStat   │ Continuous │ Float64 │
│ MedV    │ Continuous │ Float64 │
└─────────┴────────────┴─────────┘


In this case, most of the variables have the (machine) type ```Float64``` and their default interpretation is ```Continuous```. The columns ```:Chas```, ```:Rad``` and ```:Tax``` have the (machine) type ```Int64``` and their default interpretation is ```Count```.

While the interpretation as ```Continuous``` is usually fine, the interpretation as ```Count``` needs a bit more attention. For instance note that:

In [4]:
unique(boston.Chas)

2-element Vector{Int64}:
 0
 1

so even though it has a machine type of ```Int64``` and consequently a default interpretation of ```Count```, it takes only two values and would be more appropriately interpreted as an ```OrderedFactor```.
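The same check can be run over every ```Count``` column at once. The following is a minimal sketch on a toy table with made-up values (the column names ```flag```, ```visits``` and ```weight``` are hypothetical), assuming the object returned by ```schema``` exposes ```names``` and ```scitypes```:

```julia
using ScientificTypes

# Toy table (hypothetical values), standing in for a larger dataset.
X = (flag   = [0, 1, 0, 0, 1, 1],
     visits = [3, 7, 1, 12, 5, 9],
     weight = [61.2, 70.5, 58.9, 80.1, 66.0, 73.4])

sch = schema(X)

# Number of distinct values for every column currently read as Count.
distinct = Dict(name => length(unique(getproperty(X, name)))
                for (name, st) in zip(sch.names, sch.scitypes) if st <: Count)

for (name, k) in distinct
    println(name, ": ", k, " distinct value(s)")
end
```

Columns with very few distinct values are candidates for a ```Finite``` scitype; the final decision still depends on what the column means.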

#### Changing the scitype

In order to re-specify the scitype(s) of feature(s) in a dataset, you can use the ```coerce``` function and specify pairs of variable name and scientific type:

In [5]:
boston2 = coerce(boston, :Chas => OrderedFactor);

The effect of this is to convert the ```:Chas``` column to an ordered categorical vector:

In [6]:
eltype(boston2.Chas)

CategoricalArrays.CategoricalValue{Int64, UInt32}

corresponding to the ```OrderedFactor``` scitype:

In [7]:
elscitype(boston2.Chas)

OrderedFactor{2}
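For an ```OrderedFactor``` the order of the levels matters. With integer codes the natural order is usually the right one, but string labels are sorted alphabetically by default. The sketch below is an illustration with made-up labels; it assumes CategoricalArrays.jl is installed and loaded explicitly to provide ```levels``` and ```levels!```:

```julia
using ScientificTypes
using CategoricalArrays   # assumed available; provides levels and levels!

rating = coerce(["High", "Low", "Medium", "Low"], OrderedFactor)
@show levels(rating)      # default order is alphabetical

# Impose the intended order, lowest first.
levels!(rating, ["Low", "Medium", "High"])
@show levels(rating)

# Comparisons now follow the imposed order: "Low" < "High".
@show rating[2] < rating[1]
```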

You can also specify multiple pairs in one shot with ```coerce```:

In [8]:
boston3 = coerce(boston, :Chas => OrderedFactor, :Rad => OrderedFactor);

In [9]:
sch3 = schema(boston3)

┌─────────┬──────────────────┬─────────────────────────────────┐
│ names   │ scitypes         │ types                           │
├─────────┼──────────────────┼─────────────────────────────────┤
│ Crim    │ Continuous       │ Float64                         │
│ Zn      │ Continuous       │ Float64                         │
│ Indus   │ Continuous       │ Float64                         │
│ Chas    │ OrderedFactor{2} │ CategoricalValue{Int64, UInt32} │
│ NOx     │ Continuous       │ Float64                         │
│ Rm      │ Continuous       │ Float64                         │
│ Age     │ Continuous       │ Float64                         │
│ Dis     │ Continuous       │ Float64                         │
│ Rad     │ OrderedFactor{9} │ CategoricalValue{Int64, UInt32} │
│ Tax     │ Count            │ Int64                           │
│ PTRatio │ Continuous       │ Float64                         │
│ Black   │ Continuous       │ Float64                         │

### String and Unknown

If a feature in your dataset has String elements, then the default scitype is ```Textual```; you can either choose to drop such columns or to coerce them to categorical:

In [10]:
feature = ["AA", "BB", "AA", "AA", "BB"]

5-element Vector{String}:
 "AA"
 "BB"
 "AA"
 "AA"
 "BB"

In [11]:
elscitype(feature)

Textual

In [12]:
feature2 = coerce(feature, Multiclass)
elscitype(feature2)

Multiclass{2}
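Both options can be combined on a table. The sketch below uses a made-up table (the columns ```name```, ```grade``` and ```score``` are hypothetical): it coerces the informative string column and then drops whatever is still ```Textual```. It assumes the coerced table supports property access to its columns:

```julia
using ScientificTypes

X = (name  = ["John", "Maria", "Ana", "Li"],
     grade = ["A", "B", "A", "C"],
     score = [12.5, 14.0, 9.5, 16.0])

# Coerce the informative string column to a categorical scitype.
Xc = coerce(X, :grade => Multiclass)

# Keep only the columns whose scitype is not Textual.
sch = schema(Xc)
keep = Tuple(n for (n, st) in zip(sch.names, sch.scitypes) if !(st <: Textual))
Xclean = NamedTuple{keep}(Tuple(getproperty(Xc, n) for n in keep))

schema(Xclean)
```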

### Tips and tricks

#### Type to Type coercion

In some cases you will want to reinterpret all features currently interpreted as some scitype ```S1``` as some other scitype ```S2```. For example, some features may currently be interpreted as ```Count``` because their original type was ```Int```, but you want to treat all of them as ```Continuous```:

In [13]:
data = select(boston, [:Rad, :Tax]);

In [14]:
schema(data)

┌───────┬──────────┬───────┐
│ names │ scitypes │ types │
├───────┼──────────┼───────┤
│ Rad   │ Count    │ Int64 │
│ Tax   │ Count    │ Int64 │
└───────┴──────────┴───────┘


Let's coerce from ```Count``` to ```Continuous```:

In [15]:
data2 = coerce(data, Count => Continuous);
schema(data2)

┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ Rad   │ Continuous │ Float64 │
│ Tax   │ Continuous │ Float64 │
└───────┴────────────┴─────────┘
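Type-to-type coercion only touches the columns whose scitype matches the left-hand side of the pair. The following sketch on a made-up mixed table (the column names are hypothetical) illustrates this:

```julia
using ScientificTypes

X = (rooms = [3, 4, 2, 5],
     price = [210.0, 340.5, 150.0, 420.0],
     town  = ["A", "B", "A", "C"])

X2 = coerce(X, Count => Continuous)

# Only `rooms` was Count, so only that column is reinterpreted.
@show schema(X).scitypes           # (Count, Continuous, Textual)
@show schema(X2).scitypes          # (Continuous, Continuous, Textual)
```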


#### Autotype

A last useful tool is ```autotype```, which lets you specify rules that define the interpretation of features automatically. You can code your own rules, but three useful ones are pre-coded:

- the ```:few_to_finite``` rule, which checks how many unique entries are present in a vector and, if there are "few", suggests a categorical type,

- the ```:discrete_to_continuous``` rule, which converts ```Integer``` or ```Count``` to ```Continuous```,

- the ```:string_to_multiclass``` rule, which returns ```Multiclass``` for any string-like column.

For instance:

In [16]:
boston3 = coerce(boston, autotype(boston, :few_to_finite));
schema(boston3)

┌─────────┬───────────────────┬───────────────────────────────────┐
│ names   │ scitypes          │ types                             │
├─────────┼───────────────────┼───────────────────────────────────┤
│ Crim    │ Continuous        │ Float64                           │
│ Zn      │ OrderedFactor{26} │ CategoricalValue{Float64, UInt32} │
│ Indus   │ Continuous        │ Float64                           │
│ Chas    │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ NOx     │ Continuous        │ Float64                           │
│ Rm      │ Continuous        │ Float64                           │
│ Age     │ Continuous        │ Float64                           │
│ Dis     │ Continuous        │ Float64                           │
│ Rad     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Tax     │ Count             │ Int64                             │
│ PTRatio │ OrderedFactor{46} │ CategoricalValue{Float64, UInt32} │
│ Black   │ Continuous        │ Float64                           │

|Rule Symbol    | scitype suggestion  |
|:--------------|:---------------------|
|**```:few_to_finite```**| an appropriate ```Finite``` subtype for columns with few distinct values |
|**```:discrete_to_continuous```** | if not ```Finite```, then ```Continuous``` for any ```Count``` or ```Integer``` scitypes/types|
|**```:string_to_multiclass```**   | ```Multiclass``` for any string-like column|

#### Examples

In [17]:
n = 50
# Comments give the scitype that autotype suggests with the default rule.
X = (a = rand("abc", n),         # 3 values, not numbers --> Multiclass
     b = rand([1,2,3,4], n),     # 4 values, numbers     --> OrderedFactor
     c = rand([true,false], n),  # 2 values, Bool        --> OrderedFactor
     d = randn(n),               # many values           --> unchanged
     e = rand(collect(1:n), n))  # many values           --> unchanged
schema(X)

┌───────┬────────────┬─────────┐
│ names │ scitypes   │ types   │
├───────┼────────────┼─────────┤
│ a     │ Unknown    │ Char    │
│ b     │ Count      │ Int64   │
│ c     │ Count      │ Bool    │
│ d     │ Continuous │ Float64 │
│ e     │ Count      │ Int64   │
└───────┴────────────┴─────────┘


In [18]:
X2 = coerce(X, autotype(X, only_changes=true));

In [19]:
autotype(X, only_changes=true)

Dict{Symbol, Type} with 3 entries:
  :a => Multiclass
  :b => OrderedFactor
  :c => OrderedFactor

In [20]:
schema(X2)

┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ a     │ Multiclass{3}    │ CategoricalValue{Char, UInt32}  │
│ b     │ OrderedFactor{4} │ CategoricalValue{Int64, UInt32} │
│ c     │ OrderedFactor{2} │ CategoricalValue{Bool, UInt32}  │
│ d     │ Continuous       │ Float64                         │
│ e     │ Count            │ Int64                           │
└───────┴──────────────────┴─────────────────────────────────┘


In [21]:
X3 = coerce(X, autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite)));

In [22]:
autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))

Dict{Symbol, Type} with 4 entries:
  :a => Multiclass
  :b => OrderedFactor
  :e => Continuous
  :c => OrderedFactor

In [23]:
schema(X3)

┌───────┬──────────────────┬─────────────────────────────────┐
│ names │ scitypes         │ types                           │
├───────┼──────────────────┼─────────────────────────────────┤
│ a     │ Multiclass{3}    │ CategoricalValue{Char, UInt32}  │
│ b     │ OrderedFactor{4} │ CategoricalValue{Int64, UInt32} │
│ c     │ OrderedFactor{2} │ CategoricalValue{Bool, UInt32}  │
│ d     │ Continuous       │ Float64                         │
│ e     │ Continuous       │ Float64                         │
└───────┴──────────────────┴─────────────────────────────────┘


If the keyword ```only_changes``` is set to ```true```, then only the column names for which the suggested type differs from the one provided by the convention are returned.

To specify which rules are applied, use the ```rules``` keyword with a tuple of symbols referring to specific rules. The default rule is ```:few_to_finite```, which applies a heuristic to columns that have relatively few distinct values; these columns are then encoded with an appropriate ```Finite``` type. Note that the order in which the rules are specified matters: rules are applied in that order.
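Because ```autotype``` returns a dictionary of suggestions, these can be inspected and adjusted before being passed to ```coerce```. The sketch below uses a made-up table (the columns ```flag```, ```ident``` and ```x``` are hypothetical) and overrides one suggestion by hand. Which columns the heuristic flags is an expectation based on the examples above, not a guarantee:

```julia
using ScientificTypes

n = 40
X = (flag  = repeat([0, 1], n ÷ 2),                   # 2 distinct integers
     ident = collect(1:n),                            # all values distinct
     x     = collect(range(0.0, 1.0; length=n)))      # all values distinct

# Suggestions under the default rule; only changed columns are listed.
suggestions = autotype(X; only_changes=true)
@show suggestions

# Override one suggestion by hand before applying them.
suggestions[:flag] = Multiclass
X2 = coerce(X, suggestions)
schema(X2)
```

Editing the dictionary keeps the automatic suggestions for most columns while leaving room for domain knowledge where the heuristic is not appropriate.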