Input File Format

Our algorithm is powerful, but it does not possess human intelligence. This unfortunately means that input files need to conform to a certain format.

In this section, we take a close look at the read_file function provided in the data_curation module of the pMTnet_Omni_Document package.

File Format

The default format that read_file expects is .csv. However, because it uses the read_csv function from the pandas package under the hood, read_file also accepts other common delimited text formats, as long as the corresponding sep argument is supplied.

.. list-table:: Acceptable File Format
   :header-rows: 1

   * - File Format
     - sep
   * - ``.csv``
     - ``,``
   * - ``.txt``
     - User defined. Usually ``,`` or ``\t``
   * - ``.tsv``
     - ``\t``
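
The role of sep can be illustrated by calling pandas directly. The following is a minimal sketch, not the read_file implementation: it parses the same small table from comma-separated and tab-separated text and checks that both give the same dataframe. The sequence values are made-up placeholders.

```python
import io

import pandas as pd

# The same one-row table serialized with two different delimiters.
# Column names follow the "Column Names" section; the values are
# made-up placeholders, not real data.
csv_text = "cdr3a,cdr3b,peptide\nCAAAF,CASSF,AAAAAAAAA\n"
tsv_text = csv_text.replace(",", "\t")

# pandas.read_csv parses delimited text once `sep` matches the file.
df_csv = pd.read_csv(io.StringIO(csv_text), sep=",")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both delimiters yield an identical dataframe.
print(df_csv.equals(df_tsv))
```

Passing the wrong sep does not necessarily raise an error; pandas may simply read each line as a single column, so it is worth checking the resulting column names.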

Column Names

Because the methods of pMTnet Omni manipulate the input dataframe based on its column names, the column names must be harmonized. Therefore, when reading the user input, read_file first attempts to find the following column names.

.. note::

   Details on the data format are explained in the :ref:`Data Format` section.

.. list-table:: Column Name
   :header-rows: 1

   * - Name
     - Meaning
     - Mandatory
     - Note
   * - va
     - The name of the V segment on the Alpha chain
     - No
     - At least one of va and vaseq needs to be supplied
   * - vaseq
     - The amino acid sequence of va
     - No
     - At least one of va and vaseq needs to be supplied
   * - vb
     - The name of the V segment on the Beta chain
     - No
     - At least one of vb and vbseq needs to be supplied
   * - vbseq
     - The amino acid sequence of vb
     - No
     - At least one of vb and vbseq needs to be supplied
   * - cdr3a
     - The amino acid sequence of the CDR3 region on the Alpha chain
     - Yes
     -
   * - cdr3b
     - The amino acid sequence of the CDR3 region on the Beta chain
     - Yes
     -
   * - peptide
     - The amino acid sequence presented by the MHC
     - Yes
     -
   * - mhc
     - The name(s) of the MHC
     - No
     - At least one of mhc and mhcseq needs to be supplied
   * - mhcseq
     - The amino acid sequence(s) of the corresponding mhc
     - No
     - At least one of mhc and mhcseq needs to be supplied
   * - tcr_species
     - Species of the TCR
     - Yes
     - human or mouse
   * - pmhc_species
     - Species of the peptide-MHC
     - Yes
     - human or mouse
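
The requirements in the table above can be expressed as a small check. The sketch below is illustrative only: missing_requirements is a hypothetical helper, not part of pMTnet Omni, and this page does not state whether read_file performs such a check itself.

```python
# Rules transcribed from the column table above.
ALWAYS_REQUIRED = {"cdr3a", "cdr3b", "peptide", "tcr_species", "pmhc_species"}
AT_LEAST_ONE_OF = [("va", "vaseq"), ("vb", "vbseq"), ("mhc", "mhcseq")]


def missing_requirements(columns):
    """Hypothetical helper: describe the column requirements that are unmet."""
    present = set(columns)
    problems = [
        f"missing column: {name}" for name in sorted(ALWAYS_REQUIRED - present)
    ]
    for name, seq_name in AT_LEAST_ONE_OF:
        if name not in present and seq_name not in present:
            problems.append(f"need at least one of: {name}, {seq_name}")
    return problems


# A complete header: every mandatory column plus one column from each pair.
complete = ["va", "vbseq", "cdr3a", "cdr3b", "peptide", "mhc",
            "tcr_species", "pmhc_species"]
print(missing_requirements(complete))

# An incomplete header: no vb/vbseq, no mhc/mhcseq, no pmhc_species.
incomplete = ["va", "cdr3a", "cdr3b", "peptide", "tcr_species"]
print(missing_requirements(incomplete))
```

Running such a check on the header before any heavier processing makes it easy to report all missing columns at once.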

If read_file cannot match these names exactly, it will try to modify the column names found in the user input so that they match. Specifically, it converts the names to lowercase, strips all whitespace, and removes special characters such as ``*``, ``_``, ``+``, and ``-``. The following table lists some acceptable and unacceptable input examples.

.. list-table:: Example Input
   :header-rows: 1

   * - Input
     - Acceptable
   * - V_A seq
     - Yes
   * - pMHC SPECIES
     - Yes
   * - V Alpha
     - No
   * - Epitope
     - No
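
The matching rules described above can be sketched in a few lines. This is a minimal illustration, not the actual read_file code: normalize and match_column are hypothetical names, the set of removed characters is limited to the four listed above, and the sketch assumes that the expected names are compared in normalized form as well (otherwise "pMHC SPECIES" could not match pmhc_species).

```python
import re

CANONICAL_NAMES = [
    "va", "vaseq", "vb", "vbseq", "cdr3a", "cdr3b",
    "peptide", "mhc", "mhcseq", "tcr_species", "pmhc_species",
]


def normalize(name):
    """Lowercase a name and drop whitespace and the characters * _ + -."""
    return re.sub(r"[\s*_+\-]", "", name.lower())


# Assumption: canonical names are normalized too, so that an input such as
# "pMHC SPECIES" can still be mapped back to "pmhc_species".
LOOKUP = {normalize(name): name for name in CANONICAL_NAMES}


def match_column(raw_name):
    """Return the canonical column name for raw_name, or None if no match."""
    return LOOKUP.get(normalize(raw_name))


for raw in ["V_A seq", "pMHC SPECIES", "V Alpha", "Epitope"]:
    print(repr(raw), "->", match_column(raw))
```

Under these assumptions, the first two inputs resolve to vaseq and pmhc_species, while "V Alpha" and "Epitope" resolve to nothing, which agrees with the table above.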

Data Format

We understand that nomenclature is not always consistent in the field of biology. Although supporting every commonly used naming convention is conceptually viable, doing so would be impractical for us.

This section describes the required format for every field in the user input.

.. toctree::
   :maxdepth: 1
   :caption: Amino Acid Sequences

   amino_acids_seq

.. toctree::
   :maxdepth: 1
   :caption: VA, VB

   va_vb

.. toctree::
   :maxdepth: 1
   :caption: MHC

   mhc