# The Search Space

The [`SearchSpace`](../cpp_api/search_space.html) holds the terminals and operators used to define programs, and includes utilities for creating programs and modifying them. 
It has a few basic components:

- `node_map`: maps function signatures to specific node types. It is a nested map, keyed first by return type and then by full signature. This structure keeps the lookups made during mutation and crossover fast.
- `terminal_map`: same as `node_map` but for terminals. 
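
The nested layout can be pictured with plain Python dictionaries. This is only a sketch: the real maps live in the C++ `SearchSpace`, and the helper `find_ops` is a hypothetical name used here for illustration. The entries mirror the operator names and type names that appear later on this page.

```python
# Illustrative stand-in for the nested lookup; plain dicts, not Brush's C++ types.
# Layout: return type -> argument signature -> {operator name: weight}
node_map = {
    "ArrayF": {
        ("ArrayF", "ArrayF"): {
            "Add": 0.5, "Sub": 0.5, "Mul": 1.0, "Div": 0.1, "SplitBest": 0.2,
        },
    },
    "ArrayI": {
        ("ArrayI", "ArrayI"): {"SplitBest": 0.2},
    },
    "ArrayB": {
        ("ArrayB", "ArrayB"): {"SplitBest": 0.2},
    },
}


def find_ops(node_map, ret_type, arg_types=None):
    """List (signature, operator, weight) entries for a return type.

    When arg_types is given, only that exact signature is searched.
    (Hypothetical helper, not part of pybrush.)
    """
    by_signature = node_map.get(ret_type, {})
    if arg_types is not None:
        key = tuple(arg_types)
        by_signature = {key: by_signature.get(key, {})}
    return [
        (sig, op, weight)
        for sig, ops in by_signature.items()
        for op, weight in ops.items()
    ]


# Any operator returning ArrayF, e.g. when replacing a float-valued node
float_ops = find_ops(node_map, "ArrayF")
# Only operators with the exact signature ArrayI(ArrayI, ArrayI)
int_ops = find_ops(node_map, "ArrayI", ["ArrayI", "ArrayI"])
```

Matching on the return type first narrows the candidates with a single lookup, and the full signature is only consulted when the arguments must match as well.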

Both maps have associated weights that set the probability of each operator or terminal being sampled. 
Users can optionally provide these weights.
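
As a rough illustration of weighted sampling, the sketch below draws operator names with Python's standard `random.choices`. It assumes the weights are treated as relative, so an operator's probability is its weight divided by the sum of all weights. Brush's own sampler is implemented in C++ and may differ in detail. `sample_op` is a hypothetical helper, and the weights are example values.

```python
import random


def sample_op(weights, rng):
    """Draw one operator name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Example weights; only their ratios matter under the assumption above
weights = {"Add": 0.5, "Sub": 0.5, "Mul": 1.0, "Div": 0.1}

rng = random.Random(0)  # fixed seed so the draws are repeatable
draws = [sample_op(weights, rng) for _ in range(10000)]
counts = {name: draws.count(name) for name in weights}
print(counts)
```

With these weights, `Mul` should be drawn about twice as often as `Add` or `Sub`, and `Div` only rarely.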

## Initializing

At a minimum, initializing the search space requires that a `Dataset` is already defined, so that `SearchSpace` knows how to define the terminals. 

In [11]:
import pandas as pd
from pybrush import Dataset, SearchSpace

df = pd.read_csv('../examples/datasets/d_example_patients.csv')
X = df.drop(columns='target')
y = df['target']

df.describe()

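|       | id        | sex      | race     | target    |
|-------|-----------|----------|----------|-----------|
| count | 993.0     | 993.0    | 993.0    | 993.0     |
| mean  | 496.0     | 0.487412 | 2.625378 | 8.219092  |
| std   | 286.79871 | 0.500093 | 1.72524  | 1.101319  |
| min   | 0.0       | 0.0      | 0.0      | 1.33728   |
| 25%   | 248.0     | 0.0      | 1.0      | 7.836757  |
| 50%   | 496.0     | 0.0      | 3.0      | 8.404038  |
| 75%   | 744.0     | 1.0      | 4.0      | 8.81071   |
| max   | 992.0     | 1.0      | 5.0      | 11.410597 |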


In [13]:
data = Dataset(X,y)

search_space = SearchSpace(data)

By default, the search space includes all available operators that have at least one argument type matching a datatype in the `Dataset`. 
This default set can be quite large. 

Instead, the user may specify a subset of operators, each with a weight that determines its probability of being sampled. For example:

In [14]:
user_ops = {
    'Add': 0.5,
    'Sub': 0.5,
    'Mul': 1.0,
    'Div': 0.1,
    'SplitBest':0.2
}

search_space = SearchSpace(data, user_ops)


## Inspecting

We now have a much smaller search space. To view it, call `print()`:

In [15]:
search_space.print()

=== Search space ===
terminal_map: {"ArrayI": ["x_2", "1.00"], "ArrayB": ["x_1", "1.00"], "ArrayF": ["x_0", "1.00", "1.00*MeanLabel"]}
terminal_weights: {"ArrayI": [0.01214596, 0.01214596], "ArrayB": [0.026419641, 0.026419641], "ArrayF": [0.056145623, 0.056145623, 0.056145623]}
node_map[ArrayB][["ArrayB", "ArrayB"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayF][["ArrayF", "ArrayF"]][Div] = Div, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF"]][Mul] = Mul, weight = 1
node_map[ArrayF][["ArrayF", "ArrayF"]][Sub] = Sub, weight = 0.5
node_map[ArrayF][["ArrayF", "ArrayF"]][Add] = Add, weight = 0.5



Note that the `node_map` includes three `SplitBest` operators, with the signatures `ArrayB(ArrayB, ArrayB)`, `ArrayI(ArrayI, ArrayI)`, and `ArrayF(ArrayF, ArrayF)`. 
This is because our dataset contains boolean, integer, and floating point data types. 
Note also that the default behavior is to give each of these nodes the weight specified by the user. 
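
The weight-copying behavior can be sketched in a few lines of plain Python. Both `expand_weights` and the `variants` table are hypothetical names for illustration, not part of the pybrush API. The signatures are taken from the printed search space above.

```python
def expand_weights(user_ops, variants):
    """Copy each user-supplied weight onto every typed variant of that operator.

    Hypothetical helper: `variants` maps an operator name to the
    (return type, argument types) signatures available for it.
    """
    expanded = {}
    for op, weight in user_ops.items():
        for ret_type, arg_types in variants.get(op, []):
            expanded[(op, ret_type, arg_types)] = weight
    return expanded


variants = {
    "SplitBest": [
        ("ArrayB", ("ArrayB", "ArrayB")),
        ("ArrayI", ("ArrayI", "ArrayI")),
        ("ArrayF", ("ArrayF", "ArrayF")),
    ],
    "Div": [("ArrayF", ("ArrayF", "ArrayF"))],
}

expanded = expand_weights({"SplitBest": 0.2, "Div": 0.1}, variants)
for key, weight in expanded.items():
    print(key, weight)
```

Each typed `SplitBest` entry ends up with the single weight the user gave for `SplitBest`, which matches the printed `node_map`.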

## Loading datatypes

If you pass a numpy array, Brush will try to infer the datatype of each feature from its values.
If you pass a pandas DataFrame instead, Brush uses the data types that pandas has already inferred.
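
To give a feel for what value-based inference might look like, here is a minimal sketch in plain Python. The rule and the `max_categories` threshold are assumptions chosen to be consistent with the output on this page, where the integer-valued `id` column ends up as `ArrayF`. They are not taken from the Brush source, and `infer_feature_type` is a hypothetical name. The toy columns only imitate the value ranges in the summary table above.

```python
def infer_feature_type(values, max_categories=10):
    """Guess a Brush-style type name for one column of numeric values.

    Assumed rule (not confirmed against Brush itself):
      - only 0/1 values                      -> "ArrayB"
      - whole numbers, few distinct values   -> "ArrayI"
      - anything else                        -> "ArrayF"
    """
    distinct = set(values)
    if distinct <= {0, 1}:
        return "ArrayB"
    all_integral = all(float(v).is_integer() for v in distinct)
    if all_integral and len(distinct) <= max_categories:
        return "ArrayI"
    return "ArrayF"


# Toy columns shaped like the example dataset's value ranges
columns = {
    "id": list(range(993)),        # whole numbers, but far too many to be categories
    "sex": [0, 1, 1, 0],
    "race": [0, 1, 2, 3, 4, 5],
    "target": [1.33728, 8.404038, 11.410597],
}

inferred = {name: infer_feature_type(vals) for name, vals in columns.items()}
print(inferred)
```
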

In [27]:
data = Dataset(X.values, y.values)

search_space = SearchSpace(data, user_ops)
search_space.print()

=== Search space ===
terminal_map: {"ArrayI": ["x_2", "1.00"], "ArrayB": ["x_1", "1.00"], "ArrayF": ["x_0", "1.00", "1.00*MeanLabel"]}
terminal_weights: {"ArrayI": [0.01214596, 0.01214596], "ArrayB": [0.026419641, 0.026419641], "ArrayF": [0.056145623, 0.056145623, 0.056145623]}
node_map[ArrayB][["ArrayB", "ArrayB"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayI][["ArrayI", "ArrayI"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayF][["ArrayF", "ArrayF"]][SplitBest] = SplitBest, weight = 0.2
node_map[ArrayF][["ArrayF", "ArrayF"]][Div] = Div, weight = 0.1
node_map[ArrayF][["ArrayF", "ArrayF"]][Mul] = Mul, weight = 1
node_map[ArrayF][["ArrayF", "ArrayF"]][Sub] = Sub, weight = 0.5
node_map[ArrayF][["ArrayF", "ArrayF"]][Add] = Add, weight = 0.5



## Sampling

TODO. For now, see the mutation and crossover functions in the [Program](../cpp_api/program.html) class.