# Manual of MetImage (0.3.0)

## Contents
1. Introduction
2. Dataset preparation
3. Image conversion
4. Split datasets
5. Tile split and tile selection
6. Model building
7. Model training
8. To do list

## 1. Introduction
MetImage is a Python-based approach for converting LC–MS-based untargeted metabolomics data into digital images. MetImage encodes the raw LC–MS data into multi-channel images, and each image retains the characteristics of the mass spectra in the raw LC–MS data. MetImage can then build a diagnostic model from these multi-channel images with a deep learning model.

## 2: Dataset preparation
### 1. Raw LC-MS data
MS1 data conversion: Convert raw MS data files (e.g., Agilent .d files, Sciex .wiff files and Thermo Fisher .raw files) to mzXML format using ProteoWizard (version 3.0.6150). Only MS1 peak picking is needed for the subsequent conversion.

### 2. Create dataset
Copy the converted files into a separate directory, for example "Rawdata".

    .\Rawdata
        ├─File1.mzXML
        └─File2.mzXML

If you want to construct a diagnosis model, please create one subfolder per group, named after the group, and copy the .mzXML files into the corresponding subfolders.

    .\Rawdata
        ├─Group1
        │   ├─File1.mzXML
        │   └─File2.mzXML
        └─Group2
            ├─File1.mzXML
            └─File2.mzXML
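
Before conversion, it can help to check that every group folder contains the expected number of files. The following is a minimal sketch using only the Python standard library; the function name `count_files_per_group` is hypothetical and is not part of MetImage.

```python
from pathlib import Path


def count_files_per_group(rawdata_dir, pattern="mzXML"):
    """Return {group_name: number_of_files} for each subfolder of rawdata_dir.

    Hypothetical helper for checking the folder layout; not a MetImage function.
    """
    root = Path(rawdata_dir)
    suffix = "." + pattern.lower()
    counts = {}
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Compare suffixes case-insensitively so .mzxml and .mzXML both match.
        counts[group_dir.name] = sum(
            1
            for f in group_dir.iterdir()
            if f.is_file() and f.suffix.lower() == suffix
        )
    return counts
```

Calling `count_files_per_group("Rawdata")` on the layout above would report two files for each of Group1 and Group2.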

## 3: Image conversion
DataConverter converts raw LC-MS data (.mzXML) into a digital image matrix. To use DataConverter, process the dataset with the code shown below:

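    import metimage
    from metimage.datas.DataConverter import ConvertDataset

    # Paths point to the demo folder inside the installed package; replace them with your own.
    rawdata_dir = metimage.__path__[0] + "/demo/Rawdata"
    save_path = metimage.__path__[0] + "/demo/Convert"

    # Bin m/z from 60 to 1200 in steps of 0.01 Da, using 6 threads.
    ConvertDataset(rawdata_dir, pattern="mzXML", mzmin=60, mzmax=1200, binSize=0.01, Threads=6, save_path=save_path)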

Parameters:
(1) rawdata_dir: the directory of the dataset, containing .mzXML files.
(2) pattern: MS data format. (Default: mzXML)
(3) mzmin: the lower bound of the m/z range to bin.
(4) mzmax: the upper bound of the m/z range to bin.
(5) binSize: the width of each m/z bin, in Da.
(6) Threads: number of threads used for multiprocessing.
(7) save_path: the output directory (whole images, .npz).
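
To make the meaning of mzmin, mzmax and binSize concrete, the sketch below maps an m/z value to a bin index. It only illustrates fixed-width binning and is an assumption about how such parameters are typically used, not MetImage's actual implementation; the function names are hypothetical.

```python
def n_bins(mzmin=60.0, mzmax=1200.0, bin_size=0.01):
    """Number of fixed-width bins covering the range [mzmin, mzmax)."""
    return round((mzmax - mzmin) / bin_size)


def mz_to_bin(mz, mzmin=60.0, mzmax=1200.0, bin_size=0.01):
    """Return the zero-based bin index of an m/z value, or None if out of range."""
    if not (mzmin <= mz < mzmax):
        return None
    # The small epsilon guards against floating-point error at bin edges.
    return int((mz - mzmin) / bin_size + 1e-9)
```

With the values used above (60 to 1200 in 0.01 Da steps), the m/z axis would have 114000 bins under this scheme.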

## 4: Split datasets (optional)

__Note: If you want to predict unlabelled samples with a trained model, please skip chapter 4, chapter 5 and chapter 7.__

In general, model training and testing require a training set, a validation set and a testing set; at minimum, a training set and a validation set are needed for training. Because users may prefer different splitting schemes, MetImage does not split the dataset fully automatically. Instead, it provides a stratified sampling function to split the data into a training set and a validation set. Please refer to the following code to split your dataset.

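    import metimage
    from metimage.utils.SplitDataset import SplitDataset

    # Directory of converted .npz files produced in chapter 3.
    dataset_wd = metimage.__path__[0] + "/demo/Convert"
    # Hold out 30% of the samples; a fixed seed keeps the split reproducible.
    SplitDataset(dataset_wd, test_split=0.3, seed=42)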

Parameters:
(1) dataset_wd: the directory of the __converted__ dataset, containing .npz files.
(2) test_split: the fraction of samples assigned to the held-out (validation) set. test_split = 0.3 means that 30% of the samples are held out.
(3) seed: seed used for sampling. (Use the same seed to keep the split reproducible.)
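
For readers unfamiliar with stratified sampling, the sketch below shows the general idea: samples are shuffled and split within each group, so the held-out set keeps roughly the same group proportions as the full dataset. This is an illustration under stated assumptions, not the code inside SplitDataset; the function `stratified_split` and its input format are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_split=0.3, seed=42):
    """Split sample names into (train, held_out) lists, group by group.

    labels: dict mapping sample name -> group name (assumed input format).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample, group in sorted(labels.items()):
        by_group[group].append(sample)

    train, held_out = [], []
    for group in sorted(by_group):
        samples = by_group[group]
        rng.shuffle(samples)
        # Hold out the same fraction from every group.
        n_held = round(len(samples) * test_split)
        held_out.extend(samples[:n_held])
        train.extend(samples[n_held:])
    return train, held_out
```

Using the same seed returns the same split, which mirrors the role of the seed parameter described above.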

## 5: Tile split and tile selection (optional)

__Note: If you want to use all tiles to construct a deep learning model, please skip this chapter.__

MetImage provides two indicators for selecting information-rich tiles: the 1D image entropy and the pooled intensity of each tile. We recommend using __only__ the training set to select tiles. Please create the list of selected tiles with the following code:

    import metimage
    from metimage.datas.TileSelection import GenerateIndex, SelectTiles

    # Use only the training set to compute the tile indicators.
    dataset_wd = metimage.__path__[0] + "/demo/Convert/train"
    save_path = metimage.__path__[0] + "/demo/Convert"

    # Split each image into 224 x 224 pixel tiles without overlap and compute both indicators.
    GenerateIndex(dataset_wd, cal_mean=True, cal_entropy=True, pixelx=224, pixely=224, overlap_col=0, overlap_row=0, save_path=save_path)

    # Select tiles using the top 1000 pooled intensities and the top 1000 entropies.
    Sampling_list = SelectTiles(dir_mean=save_path + "/mean", dir_entropy=save_path + "/1DEntropy", TopMean=1000, TopEntropy=1000, save_path=save_path)

Parameters:
(1) dataset_wd: the directory of the __converted__ dataset, containing .npz files.
(2) cal_mean, cal_entropy: whether to calculate the pooled intensity and the 1D image entropy.
(3) pixelx, pixely: the width (x) and height (y) of each tile, in pixels.
(4) overlap_col, overlap_row: the overlap between neighbouring tiles along the columns and rows, in pixels.
(5) save_path: the output directory.
(6) dir_mean: path to the calculated pooled intensity. (.etp)
(7) dir_entropy: path to the calculated 1D image entropy. (.etp)
(8) TopMean: select the top N tiles by pooled intensity.
(9) TopEntropy: select the top N tiles by entropy.

Note: If only the pooled intensity or only the 1D image entropy is used for tile selection, please set the other variable to __None__ (e.g. TopMean=None).

__After tile selection, a Samplinglist.lst file is generated. This file contains the indexes of the selected tiles. Please copy the generated Samplinglist.lst file for training or prediction.__
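
The two indicators can be illustrated with a small pure-Python sketch. It assumes that the pooled intensity is the mean pixel value of a tile and that the 1D image entropy is the Shannon entropy of the tile's pixel-value histogram; both definitions, and all function names below, are assumptions made for illustration and may differ from what GenerateIndex computes. How MetImage combines the two rankings is not described here, so the sketch stops at the two ranked lists.

```python
import math
from collections import Counter


def split_tiles(image, pixelx, pixely, overlap_col=0, overlap_row=0):
    """Split image (a list of rows) into tiles, scanning left to right, top to bottom."""
    step_x, step_y = pixelx - overlap_col, pixely - overlap_row
    if step_x <= 0 or step_y <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    n_rows, n_cols = len(image), len(image[0])
    tiles = []
    for top in range(0, n_rows - pixely + 1, step_y):
        for left in range(0, n_cols - pixelx + 1, step_x):
            tiles.append([row[left:left + pixelx] for row in image[top:top + pixely]])
    return tiles


def pooled_intensity(tile):
    """Mean pixel value of a tile (assumed definition)."""
    values = [v for row in tile for v in row]
    return sum(values) / len(values)


def entropy_1d(tile):
    """Shannon entropy, in bits, of the tile's pixel-value histogram (assumed definition)."""
    values = [v for row in tile for v in row]
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in Counter(values).values())


def top_n(scores, n):
    """Indexes of the n highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

A flat tile has zero entropy under this definition, while a tile whose pixels are all different has the maximum entropy, which is why entropy can serve as a proxy for information content.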

## 6: Model building
### 6.1 Load the selected tile index
Please load the list of selected tile indexes before constructing the model.

    import metimage
    from metimage.models.train import ParseIndex

    # Path to the Samplinglist.lst file generated in chapter 5.
    Sampling_list = ParseIndex(lst_dir=metimage.__path__[0] + "/demo/Convert/Samplinglist.lst")

Parameters:
(1) lst_dir: the path to the Samplinglist.lst file.

### 6.2 Construct deep learning model

### 6.2.1 Use the built-in ResNet model
MetImage provides a ResNet (residual neural network) model as an example for building a deep learning model.
To learn more about ResNet, please refer to [Deep Residual Learning for Image Recognition](https://doi.org/10.48550/arXiv.1512.03385).
To call this model, please use the following code:

    from metimage.models.train import MakeModel

    # Sampling_list is the tile index loaded in 6.1.
    model = MakeModel(Tiles_no=len(Sampling_list))

### 6.2.2 Use a customized deep learning model
To use another, customized deep learning model (e.g. VGG-19), please modify the source code:
(1) Copy the model into MetImage/models.
(2) Replace ResNet_v2 in the functions MakeModel and LoadModel with the customized deep learning model.

## 7: Model training
### 7.1 Train deep learning model
Before training the model, please check the following list and make sure every item has been prepared.
- training set
- validation set
- multi-channel model (constructed in chapter 6)
- Sampling_list (selected tiles index, loaded in chapter 6)

Then train the model with the following code:

    import metimage
    from metimage.models.train import train

    train_dir = metimage.__path__[0] + "/demo/Convert/train"
    test_dir = metimage.__path__[0] + "/demo/Convert/validation"  # validation set
    save_path = metimage.__path__[0] + "/demo/Convert"

    # model and Sampling_list come from chapter 6.
    trained_model = train(train_dir=train_dir, test_dir=test_dir, batch_size=4, model=model,
                          Sampling_list=Sampling_list, seed=42, save_dir=save_path, epochs=10, workers=6,
                          Augmentation=False, save_during_training=False, earlystopping=False, save_best_model=True)

Parameters:
(1) train_dir: the directory of the training set.
(2) test_dir: the directory of the validation set.
(3) batch_size: batch size for model training.
(4) model: deep learning model.
(5) Sampling_list: selected tile indexes (list).
(6) seed: seed used for model training.
(7) save_dir: the output directory (training information, model weights).
(8) epochs: maximum number of training epochs. Training stops when this number is reached.
(9) workers: number of workers used for model training.
(10) Augmentation: whether to use data augmentation. (bool, refer to 7.3)
(11) save_during_training: whether to save model weights during training. (bool, refer to 7.4)
(12) earlystopping: whether to apply early stopping. (bool, refer to 7.5)
(13) save_best_model: whether to save the weights of the best model. (bool)

### 7.2 Monitor model training
By default, the training log is written to save_path/log. The training process can be visualized in TensorBoard by running the following command in a terminal:

    tensorboard --logdir='save_path/log'

### 7.3 Data augmentation
MetImage can simulate shifts in retention time, m/z and intensity by applying random disturbances within ranges set by the user. To apply augmentation, set Augmentation=True and add the following parameters to the function train:
(1) RT_shift: maximum RT shift of a tile, in pixels. (Default: 20)
(2) MZ_shift: maximum m/z shift of a tile, in pixels. (Default: 10)
(3) Int_shift: a list giving the fold-change range for the intensity shift. Int_shift = [0.1,10] means that the intensity is scaled by a factor between 0.1 and 10. (Default: [0.1,10])
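
As an illustration of what such an augmentation does to a single tile, the sketch below shifts a tile by a random number of pixels along each axis and rescales its intensities. It is not MetImage's implementation: the function `augment_tile` is hypothetical, the axis orientation (rows as m/z, columns as retention time), the zero padding and the uniform sampling of the scale factor are all assumptions.

```python
import random


def augment_tile(tile, RT_shift=20, MZ_shift=10, Int_shift=(0.1, 10), rng=None):
    """Return a randomly shifted and rescaled copy of tile (a list of rows).

    Assumption: rows are m/z bins and columns are retention-time points.
    """
    rng = rng or random.Random()
    n_rows, n_cols = len(tile), len(tile[0])
    d_rt = rng.randint(-RT_shift, RT_shift)   # column offset in pixels
    d_mz = rng.randint(-MZ_shift, MZ_shift)   # row offset in pixels
    scale = rng.uniform(Int_shift[0], Int_shift[1])

    # Pixels shifted in from outside the tile are filled with zeros.
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            src_r, src_c = r - d_mz, c - d_rt
            if 0 <= src_r < n_rows and 0 <= src_c < n_cols:
                out[r][c] = tile[src_r][src_c] * scale
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.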

### 7.4 Checkpoints
MetImage can save model weights at a fixed interval during training. For example, to save the model weights every 10 epochs, add the following parameters to the function train:

    save_during_training=True, save_epoch=10

Parameters:
(1) save_epoch: the interval, in epochs, between saved model weights. (Default: 10)

### 7.5 Early stopping
To select the best model and avoid overfitting, early stopping is provided. To apply it, set earlystopping=True and add the following parameters to the function train:
(1) min_delta: minimum change in the validation loss that qualifies as an improvement. (Default: 0)
(2) patience: number of epochs with no improvement after which training is stopped.
For details on early stopping, refer to *tf.keras.callbacks.EarlyStopping*.
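
The interplay of min_delta and patience can be sketched in a few lines of plain Python. The function below is a simplified illustration of the general early stopping rule, not the Keras callback or MetImage code; `early_stopping_epoch` is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the zero-based epoch at which training would stop, or None.

    val_losses: validation loss recorded at the end of each epoch.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # The loss improved by more than min_delta: reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None
```

A larger patience tolerates longer plateaus, while a larger min_delta ignores small improvements that may only be noise.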

## 7: Model evaluation
### 7.1 Load trained model
Before evaluating a trained model, please check the following list and make sure every item has been prepared.
- testing set or unlabelled sample.
- multi-channel model (constructed in chapter 6)
- Sampling_list (selected tiles index, loaded in chapter 6)
- weight of model (trained in chapter 7)

Use the following code to load a trained model:

    import metimage
    from metimage.models.train import LoadModel, ParseIndex

    # Load the tile index first; the model needs the number of selected tiles.
    Sampling_list = ParseIndex(lst_dir=metimage.__path__[0] + "/checkpoints/Samplinglist.lst")
    model = LoadModel(Tiles_no=len(Sampling_list), weight_dir=metimage.__path__[0] + "/checkpoints/best_weights.hdf5")

Parameters:
(1) weight_dir: path to the model weights (.ckpt, .h5 or .hdf5)

To evaluate a testing set, refer to 7.2 (Evaluate model performance).
To predict unlabelled samples, refer to 7.3 (Predict unlabelled sample).

### 7.2 Evaluate model performance
Please use the following code to evaluate model performance:

    import metimage
    from metimage.models.train import Model_evaluation

    # model and Sampling_list are loaded in 7.1 (Load trained model).
    Model_evaluation(test_dir=metimage.__path__[0] + "/demo/Convert/test", model=model, Sampling_list=Sampling_list, workers=12)

Parameters:
(1) test_dir: the directory of the testing set.
(2) model: deep learning model.
(3) Sampling_list: selected tile indexes (list).
(4) workers: number of workers used for evaluation.

### 7.3 Predict unlabelled sample
Please use the following code to predict unlabelled samples:

    import metimage
    from metimage.models.train import Model_prediction

    # model and Sampling_list are loaded in 7.1 (Load trained model).
    Model_prediction(pred_dir=metimage.__path__[0] + "/demo/Convert/test", model=model, Sampling_list=Sampling_list, print=True, save_dir=metimage.__path__[0] + "/demo/Convert", workers=12)

Parameters:
(1) pred_dir: the directory of the unlabelled sample(s).
(2) model: deep learning model.
(3) Sampling_list: selected tile indexes (list).
(4) print: whether to write the predicted probabilities to file. (bool)
(5) save_dir: the output directory for prediction results.
(6) workers: number of workers used for prediction.

If print is set to True, the prediction results are written to save_dir/prediction.csv.

## 8: To do list
- Visualization module
- Automated biological interpretation
- Information enrichment mode (without tile selection)
- Inference mode (reduce size of model)