# **Raw Data Preprocessing**

*Because data are reported heterogeneously across publications, some values cannot be stored in their raw publication format. This file documents the decisions made on how data were stored and any preprocessing performed before feature engineering. Parameters not listed below were stored as reported.*

### Geofeature
Different papers often use slightly different names for the same habitat. For instance, a *shallow marine hot spring* may be referred to simply as a hot spring or as a shallow hydrothermal vent. Habitat names were consolidated into as few categories as possible, keeping the original name whenever no more characteristic name was found.
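A minimal sketch of how such a consolidation could be applied, assuming a hand-curated synonym table. `HABITAT_SYNONYMS` and `canonical_geofeature` are hypothetical names, and the single entry is illustrative only; whether a bare "hot spring" maps to the same habitat depends on the paper, so it is deliberately left out.

```python
# Hypothetical synonym table: reported habitat name -> canonical name.
# Illustrative only; the real mapping is a curation decision made per paper.
HABITAT_SYNONYMS = {
    "shallow hydrothermal vent": "shallow marine hot spring",
}


def canonical_geofeature(reported: str) -> str:
    """Return the canonical habitat name, or the reported name if none is known."""
    key = " ".join(reported.lower().split())  # normalise case and whitespace
    return HABITAT_SYNONYMS.get(key, reported.strip())
```

Names without a known synonym pass through unchanged, which mirrors the rule of keeping the original name when no more characteristic one exists.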

### Respiration
Most strains are strict aerobes, and some show fermentative metabolism. Only one paper described its microorganism as a facultative aerobe; it was grouped with the more populated facultative anaerobes, since both terms describe strains that tolerate oxygen but can live without it.

### Min/Opt/Max for Temperature, pH and Salt %
Values were retrieved from the main paper where possible. The order of priority was: conditions that maximize EPS production, then conditions for bacterial growth in the laboratory. If neither was reported, and because these values are important for statistical analysis, they were taken from the description of the subspecies in Bergey’s Handbook. When the subspecies is unknown, values were taken from a *sp. nov.* paper in the International Journal of Systematic and Evolutionary Microbiology, provided one existed that matched at least the geographic location of strain collection. If none of these sources was available, the growth conditions of the bacterium in its natural environment were used instead, when available.
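The priority order above can be sketched as a simple fallback lookup. The source labels in `SOURCE_PRIORITY` and the function `pick_value` are hypothetical names introduced for illustration; they are not columns of the actual dataset.

```python
from typing import Optional, Tuple

# Hypothetical labels for the sources described above, highest priority first.
SOURCE_PRIORITY = [
    "paper_eps_optimum",    # conditions maximizing EPS in the main paper
    "paper_lab_growth",     # laboratory growth conditions in the main paper
    "bergeys_subspecies",   # subspecies description in Bergey's
    "ijsem_sp_nov",         # sp. nov. paper in IJSEM
    "natural_environment",  # conditions in the natural habitat
]


def pick_value(candidates: dict) -> Optional[Tuple[str, float]]:
    """Return (source, value) from the highest-priority source that has a value."""
    for source in SOURCE_PRIORITY:
        value = candidates.get(source)
        if value is not None:
            return source, value
    return None  # nothing available: the field stays empty
```

Returning the source alongside the value keeps track of where each number came from, which may be useful when assessing data quality later.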

### Salt
The substance used to maintain salinity is usually not added on its own but as part of a formulation (e.g. Marine Broth). When the percentage of a specific salt (usually NaCl) is stated, that value is used. Otherwise, the major salt in the formulation and its concentration are entered in the database. This avoids summing all salts (usually 8-10 or more, most in very minor quantities) and entering “various” as the salt used, which would generalize database entries and mask information.
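As a small illustration of the "major salt" rule, the sketch below picks the most concentrated salt from a formulation. The function name `major_salt` and the concentrations in `example_medium` are made up for the example and are not taken from any product sheet.

```python
from typing import Dict, Tuple


def major_salt(formulation: Dict[str, float]) -> Tuple[str, float]:
    """Return (salt, concentration) for the most concentrated salt in a medium."""
    salt = max(formulation, key=formulation.get)
    return salt, formulation[salt]


# Illustrative concentrations in g/L (assumed values).
example_medium = {"NaCl": 19.5, "MgCl2": 5.9, "Na2SO4": 3.2, "KCl": 0.6}
```

Here `major_salt(example_medium)` selects NaCl, which would then be entered as the salt used together with its own concentration rather than the sum of all salts.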

### Multiplicity & Real Monomer Count
Multiplicity is the number of different monomer types in the SRU of a polymer. Real monomer count (RMC) is the total number of monomers in the SRU, repeated or not, so that, in theory, Multiplicity ≤ RMC in all cases. RMC is usually indicated by the number of anomeric signals observed in 1H-NMR. There are cases where RMC < Multiplicity, but these derive from experimental limitations: some papers report very low-abundance monomers that count towards multiplicity but are then not detected in the NMR and GC-MS analyses used to report the SRU, because they are lost during acid hydrolysis or methanolysis. Because it cannot be determined whether low-abundance monomers (i) are media contaminants from polymer production or (ii) are part of the SRU but go undetected or occur only sporadically in the chain, we prefer not to set RMC = Multiplicity and instead report this paradoxical outcome. This gives a measure of how well the HPLC/GC and NMR analyses corroborate each other, as a means to assess purification quality.
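The comparison between the two counts can be sketched as follows, assuming multiplicity is taken from the monomers listed in the composition analysis and RMC from the number of anomeric signals. The function names and the `paradoxical` flag are hypothetical.

```python
from typing import Dict, List, Union


def multiplicity(composition_monomers: List[str]) -> int:
    """Number of distinct monomer types reported in the composition."""
    return len(set(composition_monomers))


def check_corroboration(
    composition_monomers: List[str], anomeric_signal_count: int
) -> Dict[str, Union[int, bool]]:
    """Compare multiplicity with RMC and flag the RMC < Multiplicity case."""
    m = multiplicity(composition_monomers)
    return {
        "multiplicity": m,
        "rmc": anomeric_signal_count,
        # True when more monomer types were reported than NMR can account for
        "paradoxical": anomeric_signal_count < m,
    }
```

A record with four monomer types but only three anomeric signals would be flagged as paradoxical and kept as such, rather than being corrected.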

### Monomer composition
Monomer composition was reported variously as weight percentage, molar percentage, weight ratio or molar ratio. The **unit was standardized to molar ratio**, which gives the relative composition. No weight-percentage information is lost, so absolute content is preserved: absolute weight percentage is also presented as an alternative representation of the data.

The calculation was done as follows:

$INSERT FORMULA HERE$
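One common way to convert weight percentages into a molar ratio, which may differ from the calculation actually used for this dataset, is to divide each weight percentage by the monomer's molar mass and normalise to the least abundant monomer. The sketch below assumes free-monosaccharide molar masses; `MOLAR_MASS` and `weight_pct_to_molar_ratio` are hypothetical names.

```python
from typing import Dict

# Molar masses of free monosaccharides in g/mol. Using free-sugar rather than
# anhydro-residue masses is an assumption of this sketch.
MOLAR_MASS = {
    "Glc": 180.16, "Gal": 180.16, "Man": 180.16,
    "Rha": 164.16, "Fuc": 164.16, "GlcA": 194.14,
}


def weight_pct_to_molar_ratio(
    weight_pct: Dict[str, float], molar_mass: Dict[str, float] = MOLAR_MASS
) -> Dict[str, float]:
    """Convert weight percentages to a molar ratio relative to the rarest monomer."""
    # Moles per 100 g of polymer; monomers with zero weight are skipped.
    moles = {m: w / molar_mass[m] for m, w in weight_pct.items() if w > 0}
    if not moles:
        raise ValueError("no monomer with a positive weight percentage")
    smallest = min(moles.values())
    return {m: n / smallest for m, n in moles.items()}
```

With this normalisation the least abundant monomer has a ratio of 1 and all others are expressed relative to it.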

### Non-osidic Subst
Many papers report non-osidic substituents, such as O-acetyl, N-acetyl and uronic acid moieties, without necessarily identifying the sugar they belong to. This is because these moieties have a characteristic fingerprint in HPLC/GC analysis, so they can be easily pinpointed. Some papers are inconsistent: they report the presence of uronic acid but no specific uronic acids in the quantified composition, or vice versa. While in some studies it is sufficient to know whether a polymer has N-acetyl moieties, this study is based on thorough characterization, so this additional parameter was created. It is not a primary parameter for analysis but an auxiliary aid for fine-tuning it.

### Residuals
Some papers report a **trace** quantity for some monomers. From an analytical standpoint, trace cannot be converted to 0, because the monomer is present, nor to any number, because the value is unknown. This column records the monomers reported as trace, although they are not accounted for in the composition. However, some papers inconsistently report trace monomers in the SRU but not in the composition, which seems nonsensical. This parameter is a fallback for drawing conclusions after the analysis.
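A minimal sketch of how trace entries could be separated from the quantified composition, assuming each paper's values arrive as numbers or strings. The function name `split_trace` and the set of accepted spellings in `TRACE_LABELS` are assumptions.

```python
from typing import Dict, List, Tuple, Union

# Spellings treated as "trace"; this list is an assumption of the sketch.
TRACE_LABELS = {"trace", "traces", "tr"}


def split_trace(
    reported: Dict[str, Union[str, float]]
) -> Tuple[Dict[str, float], List[str]]:
    """Separate quantified monomers from those reported only as trace."""
    composition: Dict[str, float] = {}
    residuals: List[str] = []
    for monomer, value in reported.items():
        if isinstance(value, str) and value.strip().lower() in TRACE_LABELS:
            residuals.append(monomer)  # kept by name, never turned into a number
        else:
            composition[monomer] = float(value)
    return composition, residuals
```

Trace monomers are recorded by name only, so they are neither dropped nor assigned an invented value.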

### α and β linkages
Papers that report an SRU or the linkages between monomers state this information directly, but most report only the NMR J-couplings of the residues, followed by a manno or gluco-galacto configuration designation. In these cases, the type of linkage was assigned as follows:

- For the manno configuration, when J(H1-H2) is 0-1 Hz (SMALL) and J(C1-H1) is about 170 Hz (LARGE), the linkage is α.
- For the gluco-galacto configuration, when J(H1-H2) is 3.7-8 Hz (LARGE) and J(C1-H1) is about 160 Hz (SMALL), the linkage is β.
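The two rules above can be written as a small lookup. This sketch encodes them literally and returns `None` for anything they do not cover; the function name `anomeric_config` and the tolerance `tol` used to interpret "about" are assumptions, not values from the source.

```python
from typing import Optional


def anomeric_config(
    ring_config: str, j_h1h2: float, j_c1h1: float, tol: float = 5.0
) -> Optional[str]:
    """Assign alpha or beta from J-couplings (Hz), or None if no rule applies.

    tol is an assumed tolerance, in Hz, around the nominal J(C1-H1) values.
    """
    config = ring_config.lower()
    if config == "manno" and 0.0 <= j_h1h2 <= 1.0 and abs(j_c1h1 - 170.0) <= tol:
        return "alpha"
    if (
        config in {"gluco", "galacto"}
        and 3.7 <= j_h1h2 <= 8.0
        and abs(j_c1h1 - 160.0) <= tol
    ):
        return "beta"
    return None  # outside the documented rules: leave for manual review
```

Residues whose couplings fall outside both rules are left unassigned rather than guessed.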

### Primary, Secondary and Tertiary Conformations
Inspired by the primary-quaternary levels of protein organization, I created an arbitrary hierarchy for polysaccharide structure classification.

<img src="assets/structureClassification.svg">

### Polarity
About 60% of the data in this column was derived indirectly from the paper. Most papers do not state the polarity of the biopolymer directly, but they do indicate the purification column used and the salt concentration at which the biopolymer eluted. Most columns are cationic DEAE, which means they bind anionic polysaccharides. If the polymer elutes with no salt in the eluent, it is not anionic and most likely neutral, so it elutes promptly. If the polymer elutes from a DEAE column only when salt is added, it is anionic, and its anionicity is proportional to the salt concentration used, due to charge displacement.
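The inference described above can be sketched as a short function. The name `infer_polarity`, its string labels and the convention of passing the elution salt concentration in mol/L are assumptions; columns other than DEAE are left undecided because the text only describes the DEAE case.

```python
from typing import Optional


def infer_polarity(column: str, elution_salt_molar: float) -> Optional[str]:
    """Infer polymer polarity from the purification column and elution salt.

    Returns "neutral", "anionic", or None when the column is not DEAE.
    """
    if "DEAE" not in column.upper():
        return None  # rule only documented for DEAE anion exchangers
    if elution_salt_molar == 0:
        return "neutral"  # not retained: not anionic, most likely neutral
    return "anionic"  # retained until displaced by salt
```

Since anionicity is described as proportional to the salt concentration, the concentration itself could be stored next to the label as a rough ranking of charge between polymers.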