# Converting imzML files with i2nca

With this notebook, imzML files can be converted into other imzML file types.
imzML files are distinguished along two independent properties:

- **processed** or **continuous**:
This parameter describes whether the mass axis is shared by all spectra. In a continuous file, the shared axis is stored only once; in a processed file, each pixel has its own axis.
 add IMS accessions here

- **profile** or **centroid**:
This pair of parameters indicates whether the spectra are stored as profile (line) spectra or whether each signal in the spectrum is reduced to a single centroid. A profile spectrum contains many more data points, but also shows the peak shapes. A centroided spectrum reduces each spectral peak to a single data point.
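
To make the processed/continuous distinction concrete, here is a minimal, dependency-free sketch. The toy dictionaries and the helper `stored_values` are hypothetical illustrations; they do not reproduce the imzML binary layout or any i2nca function.

```python
# Hypothetical toy data: 3 pixels with 4 data points each.
shared_mz = [100.0, 100.1, 100.2, 100.3]

# continuous: the m/z axis is stored once, each pixel stores only intensities
continuous = {
    "mz": shared_mz,
    "intensities": [[0, 5, 9, 1], [0, 4, 8, 2], [1, 6, 7, 0]],
}

# processed: every pixel stores its own m/z axis next to its intensities
processed = [
    {"mz": [100.0, 100.1, 100.2, 100.3], "intensity": [0, 5, 9, 1]},
    {"mz": [100.0, 100.1, 100.2, 100.3], "intensity": [0, 4, 8, 2]},
    {"mz": [100.0, 100.1, 100.2, 100.3], "intensity": [1, 6, 7, 0]},
]


def stored_values(continuous_data, processed_data):
    """Count how many numbers each layout has to store."""
    n_continuous = len(continuous_data["mz"]) + sum(
        len(pixel) for pixel in continuous_data["intensities"]
    )
    n_processed = sum(len(p["mz"]) + len(p["intensity"]) for p in processed_data)
    return n_continuous, n_processed


print(stored_values(continuous, processed))  # (16, 24)
```

The continuous layout stores the axis once (4 + 12 values), while the processed layout repeats it for every pixel (3 × 8 values).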


# A very short introduction to Jupyter
If you are not familiar with Jupyter notebooks, they are a really cool way to make code accessible for people with little to no experience in coding. Please think of this as a nice text document that allows you to send some defined code boxes into a happy place where they get taken care of.
Everything inside this document is organized in cells. Some cells contain a special kind of text (markdown) and some contain Python code.
All you need to do is to "run" a cell to make the code inside do its funky stuff. You can usually do this by hitting a play button in the upper menu (it should say "Run Cell" or "Execute Cell") or by right-clicking a cell and using the "Run / Execute / ..." entry.

# Imports

Before you can start the Conversion tools, load the required tools and libraries.

PLEASE RUN THE CELL BELOW WITHOUT CHANGES

In [None]:
# import statements
import m2aia as m2
from i2nca import report_pp_to_cp, convert_pc_to_cc_imzml
from i2nca import report_prof_to_centroid, write_profile_to_pc_imzml, set_find_peaks

# Processed Profile to Continuous Profile

Some mass spectrometers record the data in a very precise manner. Even though the mass data is recorded as processed spectra, there are only very small technical variations in the mz data points from pixel to pixel. These collections of ever-so-slightly varying data points are referred to as "pseudo-bins" in this notebook. With the following conversion tool, we can test how large the standard deviation within each of these "pseudo-bins" is and how far apart the means of two neighbouring "pseudo-bins" are.
An underlying assumption for this test is that each pixel starts and ends with nearly the same mz value and has the same number of data points.
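
The idea behind the "pseudo-bin" test can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy axes and the helper `pseudo_bin_stats` are hypothetical, and the statistics computed by i2nca's report may differ in detail.

```python
from statistics import mean, pstdev

# Hypothetical m/z axes of three pixels with the same number of data points.
pixel_mz_axes = [
    [100.0000, 100.0100, 100.0200],
    [100.0001, 100.0101, 100.0199],
    [99.9999, 100.0099, 100.0201],
]


def pseudo_bin_stats(axes):
    """Return per-bin means, per-bin standard deviations and the spacing of neighbouring means."""
    columns = list(zip(*axes))  # the i-th data point of every pixel forms one pseudo-bin
    means = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    spacings = [b - a for a, b in zip(means, means[1:])]
    return means, spreads, spacings


means, spreads, spacings = pseudo_bin_stats(pixel_mz_axes)
# A conversion is plausible when every pseudo-bin is much narrower
# than the gap to its neighbouring pseudo-bin.
print(max(spreads) < min(spacings))
```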

With the following code, we will create a pdf report that checks all our assumptions and plots them. This provides human-readable output that shows whether we should continue with the conversion.
After reading the report, you can decide whether you want to continue with the conversion. If the conversion parameters look very bad, you might skip ahead to the centroiding steps.

### Loading a dataset

To start a conversion, you need to first load the *processed profile* imzML dataset.

THE CELL BELOW NEEDS YOUR INPUT

Please change the following variables in order to load the dataset:
- `file_path`: Please provide a path to the imzML file on your machine. It must end with the full filename (`name` and `".imzML"`).\
   Using the `r"..."` notation lets you enter Windows-style paths without escaping each backslash.
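
As a small illustration of the raw-string notation, the sketch below compares an escaped and a raw path and shows how a base name for output files can be derived by removing the `.imzML` extension, as the later cells do with `[:-6]`. The path `D:\data\sample.imzML` is a made-up example, and the `pathlib` variant is only an optional alternative, not something the notebook requires.

```python
from pathlib import PureWindowsPath

# Hypothetical example path; replace it with the location of your own file.
escaped = "D:\\data\\sample.imzML"  # every backslash escaped
raw = r"D:\data\sample.imzML"       # raw string, backslashes are taken literally
print(escaped == raw)               # True

# Removing the 6 characters of ".imzML" gives a base name for output files.
print(raw[:-6])

# A more explicit alternative that does not depend on counting characters:
print(str(PureWindowsPath(raw).with_suffix("")))
```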


In [None]:
file_path = r".../path/to/file/.../file-name.imzML"
I = m2.ImzMLReader(file_path)

### Creating a report

After loading the dataset, we can start the conversion report. It checks whether the assumptions made for our conversion hold and creates a report for us to look at.
In future updates, this evaluation will be performed automatically.


THE FOLLOWING CELL NEEDS YOUR INPUT

Please change the following variables in order to test the imzML file structure:

- `output_filepath`: Please provide a path on your machine. The output pdf file will be saved there. The preset is the location of the input imzML file.

- `coverage`: The coverage is a subsetting method for large datasets. A coverage of 0.3 means that a subset of pixels amounting to 30% of the full measurement is used. This allows faster computation for large datasets. Try a value of 0.05 for very large datasets. For small datasets, the value should be 1.0.
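
What a coverage value means can be sketched as follows. This is a hypothetical illustration: the helper `subsample_pixels` and the use of random sampling are assumptions, and i2nca may select its subset differently.

```python
import random


def subsample_pixels(n_pixels, coverage, seed=0):
    """Pick a random subset of pixel indices whose size is `coverage` times the full dataset."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in the interval (0, 1]")
    n_selected = max(1, round(n_pixels * coverage))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_pixels), n_selected))


subset = subsample_pixels(n_pixels=1000, coverage=0.3)
print(len(subset))  # 300
```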


In [11]:
output_filepath = file_path[:-6]

coverage = 0.1 # value between 0 and 1

reference_mzs = report_pp_to_cp(I, output_filepath, coverage)

report generated at:  D:\data\Jannik\Files_for_minidata\metabo\metabolomics_small_mz_processed_profilecontrol_report_pp_to_cp.pdf


You can now check at the specified file location how the conversion tests look. If you approve of the data, continue with the following paragraph.

For now, let's check the created figures:
The `Number of data points per pixel` should be identical for each pixel.
Both the `minimal m/z value per pixel` and the `maximal m/z value per pixel` should each be nearly constant across pixels. Small variations here are okay.
The `Comparison between aquisition-based binning` should show that the mz deviation between two "pseudo-bins" is larger than the deviation within a "pseudo-bin".
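
The first two checks can also be written down as a short sketch. It assumes that each pixel's mz axis is sorted in ascending order; the helper `axis_checks` and the tolerance of 0.001 are hypothetical choices for illustration and are not taken from i2nca.

```python
# Hypothetical per-pixel m/z axes, each sorted in ascending order.
pixel_mz_axes = [
    [100.0000, 100.0100, 100.0200],
    [100.0001, 100.0101, 100.0199],
    [99.9999, 100.0099, 100.0201],
]


def axis_checks(axes, tolerance=0.001):
    """Check the point counts and the spread of the first and last m/z value across pixels."""
    same_length = len({len(axis) for axis in axes}) == 1
    starts = [axis[0] for axis in axes]   # minimal m/z value per pixel
    ends = [axis[-1] for axis in axes]    # maximal m/z value per pixel
    stable_start = max(starts) - min(starts) <= tolerance
    stable_end = max(ends) - min(ends) <= tolerance
    return same_length and stable_start and stable_end


print(axis_checks(pixel_mz_axes))
```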

<!---
Not yet implemented:
Otherwise, check a sparse data averaging method (like a clustering of the subsample mz values) to get reference mz values.
--> 



### Convert the file to continuous profile

With our reference masses generated and the report checked, we can create a new imzML file as a continuous profile file.
After the cell has finished executing, the newly prepared file is available.
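
Conceptually, this conversion swaps the per-pixel mz axes for one shared reference axis and keeps the intensities. The sketch below illustrates this on toy data. It assumes that the reference axis is the mean of each "pseudo-bin"; the helper `to_continuous` is hypothetical and is not the i2nca writer.

```python
from statistics import mean


def to_continuous(pixels):
    """Replace the per-pixel m/z axes by one shared axis built from the pseudo-bin means."""
    reference_mz = [mean(col) for col in zip(*(p["mz"] for p in pixels))]
    intensities = [list(p["intensity"]) for p in pixels]  # intensities are kept unchanged
    return {"mz": reference_mz, "intensities": intensities}


pixels = [
    {"mz": [100.0, 100.5], "intensity": [3, 7]},
    {"mz": [100.2, 100.7], "intensity": [4, 6]},
]
result = to_continuous(pixels)
print(len(result["mz"]), len(result["intensities"]))  # 2 2
```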

PLEASE RUN THE CELL BELOW WITHOUT CHANGES


In [None]:
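# produce the converted file (assumes i2nca provides the writer write_pp_to_cp_imzml;
# report_pp_to_cp only builds the report)
from i2nca import write_pp_to_cp_imzml
write_pp_to_cp_imzml(I, reference_mzs, output_filepath)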

# Continuous Profile to Processed Centroid

In order to obtain a centroid file, we need to perform peak detection on each pixel to reduce each peak to a single "centroid" signal. This greatly reduces the file size.
This can be performed on either continuous or processed files.

For any peak detection algorithm to run, we need to determine with which parameters this is done. To aid the decision-making, i2nca provides a report function that creates some informative graphics as a pdf.
After checking this report, you can decide how to move forward.

### Loading a dataset

To start a conversion, you need to first load the *profile* imzML dataset (continuous or processed).


THE CELL BELOW NEEDS YOUR INPUT

Please change the following variables in order to load the dataset:
- `file_path`: Please provide a path to the imzML file on your machine. It must end with the full filename (`name` and `".imzML"`).\
   Using the `r"..."` notation lets you enter Windows-style paths without escaping each backslash.
- `output_filepath`: Please provide a path on your machine. The output pdf file will be saved there. The preset is the location of the input imzML file.

In [None]:
file_path = r".../path/to/file/.../file-name.imzML"
I = m2.ImzMLReader(file_path)

output_filepath = file_path[:-6]

### Creating a report

After loading the dataset, we can start the conversion report. In this report, take a look at the graphics created from the mean spectrum to help you decide how to parametrize a peak detection function.
In future updates, this evaluation will be given to you automatically.

PLEASE RUN THE CELL BELOW WITHOUT CHANGES

In [None]:
report_prof_to_centroid(I, output_filepath)

### Convert the file to Processed Centroid

With our report in hand, we can check the parameters there and decide on a detection function and its parameters.
The following detection functions are predefined:

- `set_find_peaks`: A simple implementation of the Scipy.Signal find_peaks function. It detects local maxima. Use `set_find_peaks` to set the parameters according to the procedure shown below.

- `set_find_peaks_cwt`: An implementation of the Scipy.Signal continuous wavelet transformation detection function. It detects peaks by continuous wavelet transformation. This takes some time on larger datasets.

All these functions are wrapped to be used on a per-pixel basis. If you want to use a special detection function, you can make your own function object that takes only the variables (mz, intensity). Due to i2nca's functional style, this can also be applied to each pixel.
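
One way to build such a two-argument function object is `functools.partial`. The sketch below is an assumption-laden illustration: `keep_top_n` is a made-up detector, and the return format (a pair of mz values and intensities) is assumed here, since the notebook does not state what i2nca expects.

```python
from functools import partial


def keep_top_n(mz, intensity, n=3):
    """Hypothetical detector: keep the n most intense data points, sorted by m/z."""
    ranked = sorted(zip(mz, intensity), key=lambda pair: pair[1], reverse=True)[:n]
    ranked.sort(key=lambda pair: pair[0])
    kept_mz = [m for m, _ in ranked]
    kept_intensity = [i for _, i in ranked]
    return kept_mz, kept_intensity


# partial() fixes n, so the resulting callable only needs (mz, intensity)
top_two = partial(keep_top_n, n=2)
print(top_two([100.0, 100.1, 100.2, 100.3], [5, 40, 12, 30]))
# ([100.1, 100.3], [40, 30])
```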

When using one of the predefined functions, you need to set its parameters. Let's assume that the noise estimation in the report showed that 20 is a good intensity cutoff for the peak detection. We then call `set_find_peaks(height=20)`, which returns a detection function with this cutoff already built in.
This might look a bit weird to people who have some experience in Python, because it is a concept rarely used. (Trust me, it works nonetheless :D)
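
The pattern of a function that returns a configured function is called a closure. The sketch below shows it with a dependency-free stand-in: `set_threshold_peaks` is hypothetical and is neither i2nca's `set_find_peaks` nor SciPy's `find_peaks`, it only mimics the "configure once, apply to every pixel" idea.

```python
def set_threshold_peaks(height):
    """Hypothetical stand-in for a `set_...` helper: returns a detector with `height` baked in."""

    def detect(mz, intensity):
        peaks_mz, peaks_intensity = [], []
        # a point counts as a peak if it is a strict local maximum at or above `height`
        for i in range(1, len(intensity) - 1):
            if intensity[i] >= height and intensity[i - 1] < intensity[i] > intensity[i + 1]:
                peaks_mz.append(mz[i])
                peaks_intensity.append(intensity[i])
        return peaks_mz, peaks_intensity

    return detect


detector = set_threshold_peaks(height=20)  # configure once
mz = [100.0, 100.1, 100.2, 100.3, 100.4, 100.5, 100.6]
intensity = [2, 15, 3, 25, 60, 30, 4]
print(detector(mz, intensity))  # ([100.4], [60])
```

The local maximum at 100.1 is dropped because its intensity of 15 lies below the cutoff of 20.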

Running this code will generate a new file with the processed centroid data.

THE CELL BELOW NEEDS YOUR INPUT

Please change the following variables in order to configure the peak detection:
- `detection_function`: Please provide the detection function you want to use (for example `set_find_peaks`).
- detection function parameters: Please provide the parameter values of the detection function you want to use (for example `height=20`).

In [None]:
write_profile_to_pc_imzml(I, output_filepath, detection_function = set_find_peaks(height=20))

# Processed Centroid to Continuous Centroid

In order to obtain continuous centroid files from processed centroid files, we need to manipulate the mz axis so that the pixels no longer each have their own axis but all use a shared one.
This might result in slightly larger files, as sometimes a lot of zeros are written.


### Get the file and output path
Firstly, we need to load a Processed Centroid dataset.

As this dataset is usually very small, we will leave the loading part to the writing function.
This is also possible for all the other writing functions shown before. They are explained in further detail in the automatic_conversion_runner notebook.


THE CELL BELOW NEEDS YOUR INPUT

Please change the following variables in order to load the dataset:
- `input_file_path`: Please provide a path to the imzML file on your machine. It must end with the full filename (`name` and `".imzML"`).\
   Using the `r"..."` notation lets you enter Windows-style paths without escaping each backslash.
- `output_filepath`: Please provide a path on your machine. The output file will be saved there. The preset is the location and name of the input imzML file.

In [None]:
input_file_path = r".../path/to/file/.../file-name.imzML"

output_filepath = input_file_path[:-6]

### Choose a binning strategy

The conversion is achieved by a preprocessing step called binning. The mz values of each pixel are sorted into the bins of a fixed reference mz axis. The corresponding intensities per bin are summed, and empty bins are assigned an intensity value of 0.
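
A minimal sketch of this binning step for a single pixel is shown below. The helper `bin_pixel` and the toy bin edges are hypothetical; i2nca's own implementation may treat bin boundaries differently.

```python
from bisect import bisect_right


def bin_pixel(mz, intensity, bin_edges):
    """Sum the intensities of one pixel into the bins defined by ascending `bin_edges`."""
    binned = [0.0] * (len(bin_edges) - 1)      # empty bins stay at 0
    for m, i in zip(mz, intensity):
        idx = bisect_right(bin_edges, m) - 1   # index of the bin whose left edge is <= m
        if 0 <= idx < len(binned):             # values outside the axis are ignored
            binned[idx] += i
    return binned


edges = [100.0, 100.1, 100.2, 100.3]
print(bin_pixel([100.02, 100.04, 100.25], [5, 7, 3], edges))  # [12.0, 0.0, 3.0]
```

The two signals at 100.02 and 100.04 fall into the same bin and are summed, while the empty middle bin is written as 0.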

There are different strategies available in `i2nca` to perform this:

- `fixed`: By using a fixed binning, the existing mz values are sorted into an axis set by a fixed ppm value (e.g. using the parameter `bin_accuracy=5` would result in each bin being set 5 ppm apart).

- `unique`: Firstly, all occurring mz values are collected over all the pixels. These unique values are used as the reference mz axis. This setting is only recommended for small datasets or those that were previously binned.
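
How the two reference axes could be built is sketched below. Both helpers are hypothetical, and the sketch assumes that "ppm apart" means each axis value lies the given number of parts per million above the previous one; the exact construction used by `i2nca` may differ.

```python
def fixed_ppm_axis(mz_min, mz_max, ppm):
    """Build an axis in which each value lies `ppm` parts per million above the previous one."""
    axis = [mz_min]
    while axis[-1] < mz_max:
        axis.append(axis[-1] * (1 + ppm * 1e-6))
    return axis


def unique_axis(pixel_mz_lists):
    """Collect every m/z value that occurs in any pixel, sorted in ascending order."""
    return sorted({m for mz in pixel_mz_lists for m in mz})


axis = fixed_ppm_axis(100.0, 100.01, ppm=20)
print(len(axis))
print(unique_axis([[100.0, 100.2], [100.0, 100.1]]))  # [100.0, 100.1, 100.2]
```

Because the spacing is relative, the bins of a fixed ppm axis grow wider towards higher mz values. A unique axis grows with every distinct value in the dataset, which is why it suits small or previously binned data.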


THE CELL BELOW NEEDS YOUR INPUT

Please change the following variables in order to configure the binning:
- `bin_strategy`: Either `"fixed"` or `"unique"`, depending on the binning strategy that should be employed.
- `bin_accuracy`: The accuracy of binning in ppm. If `"unique"` is chosen, this parameter can be omitted.

In [None]:
convert_pc_to_cc_imzml(input_file_path, output_filepath, bin_strategy="fixed", bin_accuracy=20)