# Run MatChain by Configuration File

This notebook illustrates how to match two datasets A and B by using MatChain's configuration file. The notebook uses the same datasets and parameters as the notebook [run_matchain_api.ipynb](https://github.com/ae3000/matchain/blob/main/notebooks/run_matchain_api.ipynb). However, the latter also gives some background information about MatChain and explains some relevant parameters in more detail.

MatChain uses two configuration files in YAML format: The first one defines the entire matching chain, its commands (steps) and the parameters for each step. The second one defines the datasets and matching properties (columns). Since the first file refers to the second file, we only need the path of the first file as input parameter.

Instead of running MatChain from the command line with ```python matchain --config ./config/mccommands.yaml```, we can call the function ```run_config_file``` and pass the path of the configuration file as its argument.

In [6]:
import matchain.chain
config_commands = './config/mccommands.yaml'
boards = matchain.chain.run_config_file(config_commands)


2023-11-02 21:41:32,030 INFO     selected datasets=['da']
2023-11-02 21:41:32,033 INFO     configuration=
{'autocal': {'delta': 0.025, 'threshold': 'estimated'},
 'blocking': {'blocking_threshold': 0.5,
              'name': 'sparsedottopn',
              'njobs': 1,
              'ntop': 10,
              'query_strategy': 'smaller',
              'shingle_size': 3,
              'vector_type': 'shingle_tfidf'},
 'chain': ['prepare', 'blocking', 'similarity', 'autocal', 'evaluate'],
 'dataset': {'blocking_props': ['title', 'authors', 'venue'],
             'data_1': './data/Structured/DBLP-ACM/tableA.csv',
             'data_2': './data/Structured/DBLP-ACM/tableB.csv',
             'dataset_name': 'da',
             'file_matches': './data/Structured/DBLP-ACM',
             'props_sim': {'authors': ['tfidf_sklearn'],
                           'title': ['tfidf_sklearn'],
                           'venue': ['tfidf_sklearn'],
                           'year': 'equal'}},
 'dir_data': '

2023-11-02 21:41:32,054 INFO     size_1=2616, size_2=2294, concat df_data=4910
2023-11-02 21:41:32,056 INFO     finished command=prepare, time=0.022005319595336914
2023-11-02 21:41:32,057 INFO     running command=blocking


matching for:
da        ['prepare', 'blocking', 'similarity', 'autocal', 'evaluate']



2023-11-02 21:41:32,252 INFO     generated vectors=(6567, 8257), time=0.17551732063293457, df_values=(4910, 3), df_index_array=(4910, 3), values=6567
2023-11-02 21:41:32,319 INFO     blocking prop=title, new candidates=3649, all candidates=3649, total time nn search=0.06553459167480469
2023-11-02 21:41:32,457 INFO     blocking prop=authors, new candidates=8067, all candidates=9230, total time nn search=0.20407652854919434
2023-11-02 21:41:32,476 INFO     blocking prop=venue, new candidates=5200, all candidates=14385, total time nn search=0.22260761260986328
2023-11-02 21:41:32,489 INFO     candidate pairs=14385
2023-11-02 21:41:32,490 INFO     finished command=blocking, time=0.4327075481414795
2023-11-02 21:41:32,491 INFO     running command=similarity
2023-11-02 21:41:32,493 INFO     computing vectorized similarity
2023-11-02 21:41:32,496 INFO     computed vectorized similarity=14385, sim columns=['0'], time=0.002004861831665039
2023-11-02 21:41:32,496 INFO     computing similarity tf

The returned list contains one ```PinBoard``` object for each matched dataset pair (or ```None``` if an exception was raised while matching that pair). In our example, we only match the datasets A and B located in ```./data/Structured/DBLP-ACM```. We can access the predicted matches and the evaluation result as follows:

In [7]:
boards[0].predicted_matches


MultiIndex([(   0,  117),
            (   1, 1093),
            (   3, 1125),
            (   4, 1450),
            (   5,   49),
            (   7, 1179),
            (   8, 1759),
            (   9, 1885),
            (  10, 2289),
            (  11, 2010),
            ...
            (2603, 1947),
            (2604, 1334),
            (2606, 1153),
            (2607, 2293),
            (2608,  555),
            (2609, 1343),
            (2610, 1689),
            (2611, 1237),
            (2613,  535),
            (2615, 1406)],
           names=['idx_1', 'idx_2'], length=2192)
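
Because an entry of the returned list can be ```None```, code that processes several dataset pairs may want to guard the access shown above. The following is a minimal sketch of such a guard; the helper ```summarize``` and the stand-in objects are hypothetical and only mimic the ```predicted_matches``` attribute, so the snippet runs without MatChain installed.

```python
from types import SimpleNamespace


def summarize(boards):
    """Return (position, number of predicted matches) for each successful board."""
    summary = []
    for position, board in enumerate(boards):
        if board is None:
            # Matching this dataset pair raised an exception; skip it.
            continue
        summary.append((position, len(board.predicted_matches)))
    return summary


# Hypothetical stand-ins for the list returned by run_config_file:
# one successful board and one failed dataset pair.
fake_boards = [SimpleNamespace(predicted_matches=[(0, 117), (1, 1093)]), None]
summary = summarize(fake_boards)
print(summary)
```

With real results, ```summarize(boards)``` could be called on the list returned by ```run_config_file``` in the same way.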

In [8]:
boards[0].evaluation_metrics['union_set']['estimated']


{'t': 0.425,
 'f1': 0.97416,
 'p': 0.98038,
 'r': 0.96802,
 'tpos': 2149,
 'fpos': 43,
 'fneg': 71}
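
The reported values are consistent with the standard definitions of precision, recall and F1, assuming that ```tpos```, ```fpos``` and ```fneg``` denote the numbers of true positives, false positives and false negatives (an interpretation of the key names, not something stated in the output). The short check below recomputes the three metrics from the counts:

```python
# Counts copied from the evaluation result above.
tpos, fpos, fneg = 2149, 43, 71

precision = tpos / (tpos + fpos)   # share of predicted matches that are correct
recall = tpos / (tpos + fneg)      # share of true matches that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 5), round(recall, 5), round(f1, 5))
# prints: 0.98038 0.96802 0.97416
```

Note that ```tpos + fpos``` equals 2192, the length of ```predicted_matches``` shown earlier.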

## Dataset Configuration File

The config file that defines the mapping columns for the dataset pair of our example is located in ```./config/mcdatasets.yaml```. The snippet in the next cell only shows the definition for the datasets of our example, namely *DBLP* and *ACM*. The configuration file contains similar definitions for further dataset pairs.

Each mapping starts with an arbitrary (but unique) reference name (here, ```da```), followed by 
- the type ```dataset``` 
- the paths to the datasets
- an optional path to the ground truth (for evaluation only)
- the names of the properties that should be used for matching, including a similarity function or a list of similarity functions for each property
- a list of blocking properties

Users may define variables with an arbitrary name (such as ```dir_data```). When reading the config file, MatChain replaces each variable by its value if its name is surrounded by curly brackets and occurs in a string (e.g. ```{dir_data}``` in the strings for ```data_1```, ```data_2``` and ```file_matches```).
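
The substitution described above can be sketched in a few lines of Python. This is an illustration only, not MatChain's implementation: the function ```substitute``` is hypothetical, and both the template strings and the value ```./data``` for ```dir_data``` are assumptions chosen so that the result matches the paths printed in the log output.

```python
def substitute(node, variables):
    """Recursively replace {name} placeholders in all strings of a config tree."""
    if isinstance(node, str):
        for name, value in variables.items():
            node = node.replace("{" + name + "}", str(value))
        return node
    if isinstance(node, dict):
        return {key: substitute(value, variables) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute(value, variables) for value in node]
    return node  # numbers, booleans and None are left unchanged


# Assumed excerpt of a dataset definition before substitution.
dataset = {
    "data_1": "{dir_data}/Structured/DBLP-ACM/tableA.csv",
    "data_2": "{dir_data}/Structured/DBLP-ACM/tableB.csv",
    "file_matches": "{dir_data}/Structured/DBLP-ACM",
    "blocking_props": ["title", "authors", "venue"],
}

resolved = substitute(dataset, {"dir_data": "./data"})  # assumed variable value
print(resolved["data_1"])
```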

## Command Configuration File

The config file that defines the commands (i.e. the matching steps) for our example is located in ```./config/mccommands.yaml```. We will go through the blocks of the config file one by one.

In the first block, the dataset configuration file is included and the datasets that should be matched are selected. In our example, we only select a single pair, namely *DBLP* and *ACM*, denoted by ```da```.

The next line defines the command chain:

For each command, the parameters are specified in a separate YAML block.

The next cell shows the block for the command ```prepare```. It sets the seed and the logging directories.

Once again, MatChain replaces variable names by their values, even in a nested manner. The following variables are pre-defined and have a special meaning:

- ```current_time```: the time at which MatChain was started
- ```seed```: an integer-valued random seed
- ```dataset_name```: the name of the dataset pair (e.g. ```da```)
- ```log_file```: the path of the log file
- ```log_config_file```: the path of the file used to configure logging; if it is null, messages are only logged to the console.

The block in the next cell means that there is a separate log file for each dataset pair. If more than one dataset pair is selected, the pairs are matched one after another and all log files are stored in the same directory. Moreover, MatChain creates a new log directory with the current time stamp each time it is started.

By contrast, ```subdir```, ```tag```, ```dir_experiments```, and ```dir_config``` are user-defined variables that are not referenced by the code and thus can be renamed.
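
Nested replacement means that a variable's value may itself contain placeholders for other variables. A minimal sketch of such a resolution is shown below. It is not MatChain's code: the function ```resolve```, the directory layout and all example values are hypothetical and only reuse the variable names mentioned above.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def resolve(variables, max_passes=10):
    """Expand {name} placeholders repeatedly until no value changes anymore."""
    resolved = dict(variables)
    for _ in range(max_passes):  # the bound guards against cyclic definitions
        changed = False
        for key, value in list(resolved.items()):
            if not isinstance(value, str):
                continue
            # Unknown placeholders are left as they are.
            new_value = PLACEHOLDER.sub(
                lambda m: str(resolved.get(m.group(1), m.group(0))), value
            )
            if new_value != value:
                resolved[key] = new_value
                changed = True
        if not changed:
            break
    return resolved


# Hypothetical values; only the variable names come from the text above.
variables = {
    "current_time": "231102_214132",
    "dataset_name": "da",
    "tag": "demo",
    "dir_experiments": "./experiments",
    "subdir": "{dir_experiments}/{tag}_{current_time}",
    "log_file": "{subdir}/{dataset_name}.log",
}

resolved_vars = resolve(variables)
print(resolved_vars["log_file"])
```

In this layout, each start produces a new time-stamped directory and each dataset pair gets its own log file inside it, which mirrors the behaviour described above.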



The next cell shows the blocks for the commands ```blocking```, ```similarity```, and ```autocal```. Their parameters are described in the notebook [run_matchain_api.ipynb](https://github.com/ae3000/matchain/blob/main/notebooks/run_matchain_api.ipynb).

In our example, the parameters for the command ```blocking``` select sparse shingle vectors and approximate nearest neighbour search for blocking. In this case, the embedding parameters for the command ```similarity``` are ignored; they are only used when ```vector_type: embedding``` (in combination with ```name: faiss```) is configured for blocking.
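
To give an intuition for blocking with sparse shingle vectors, the following pure-Python sketch builds TF-IDF vectors over character shingles and keeps, for each query string, the top-scoring targets above a threshold. It is an illustration under assumptions, not MatChain's implementation: all function names are hypothetical, the search is exhaustive rather than approximate, and the way ```shingle_size```, ```ntop``` and ```blocking_threshold``` are used here is only an interpretation of the values shown in the logged configuration.

```python
import math
from collections import Counter


def shingles(text, size=3):
    """Split a string into overlapping character shingles of the given size."""
    text = text.lower()
    return [text[i:i + size] for i in range(max(len(text) - size + 1, 1))]


def tfidf_vectors(texts, size=3):
    """Return L2-normalized sparse TF-IDF vectors (dicts) with a smoothed idf."""
    counts = [Counter(shingles(t, size)) for t in texts]
    n = len(texts)
    df = Counter(s for c in counts for s in c)  # document frequency per shingle
    vectors = []
    for c in counts:
        vec = {s: tf * (math.log((1 + n) / (1 + df[s])) + 1) for s, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({s: w / norm for s, w in vec.items()})
    return vectors


def dot(u, v):
    """Sparse dot product; equals cosine similarity for normalized vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(s, 0.0) for s, w in u.items())


def block(queries, targets, ntop=10, threshold=0.5):
    """Return candidate pairs (query index, target index)."""
    vectors = tfidf_vectors(queries + targets)
    q_vecs, t_vecs = vectors[:len(queries)], vectors[len(queries):]
    candidates = []
    for i, q in enumerate(q_vecs):
        scored = sorted(((dot(q, t), j) for j, t in enumerate(t_vecs)), reverse=True)
        candidates += [(i, j) for score, j in scored[:ntop] if score >= threshold]
    return candidates


queries = ["efficient query processing in databases", "neural networks for vision"]
targets = ["Efficient Query Processing in Databases.", "graph theory basics"]
candidates = block(queries, targets)
print(candidates)
```

Only the near-duplicate titles end up as a candidate pair; all other combinations share few or no shingles and fall below the threshold.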