# **Data Collection**

*A database of unique polymers was created, describing their origin, production conditions, structural parameters and functional properties. This document describes the collection procedure used for the systematic review and for the analysis of the database. Most data were collected from the main paper reporting each polymer. Missing data were usually retrieved from secondary, related papers by the same author. Even so, not every parameter could be completed for every entry, and this affects the quality of the investigation, although the extra data collection partly masks it.
Any extra information deemed essential was obtained from external sources, for example handbooks, review papers, or systematic journals.*

*Most parameters are not reported in the original paper; they were searched for in related papers by the same author that belong to an ongoing multi-paper study. For instance, phylum and class were rarely mentioned in the original paper.*

<font color='blue'>*Parameters in blue were not retrieved directly from the paper or related papers; they were either calculated from its raw data or are assumptions.*</font>

## **Codebook**

### Identity
- **<font color='blue'>ID</font>**: unique identifier.
- **Name**: Abbreviated name of the polymer. Usually named after the subspecies that produces it.
- **Strain**: name of the strain that produces the polymer, given as genus, species and subspecies.
- **DOI**: unique identifier of the publication where the information was reported.
- **<font color='blue'>Refs more</font>**: Does this paper report other polysaccharides worth adding to this database? Personal identifier, not relevant to analysis.
- **<font color='blue'>Adequate</font>**: 0 = Not adequate for analysis but still added. 1 = adequate for analysis. 2 = excellent for analysis because also reports a fully characterized SRU. Personal identifier, not relevant to analysis.
- **Year**: year of publication.
- **Journal**: journal where it was published.
- **IF**: impact factor of the journal where it was published, at the year of data collection (2020).
- **ExtremeType**: type of extremophile that produces this polymer.
- **<font color='blue'>Source</font>**: was this information obtained from literature or my experimental data? Personal identifier, not relevant to analysis.
- **<font color='blue'>InputGoal</font>**: intended use of this data, for example systematic analysis of growth conditions or structural insights. Personal identifier, not relevant to analysis.
- **Country**: country where this microorganism was found/inhabits.
- **Region**: specific region within this country. Closely describes its habitat and its local fluctuations in growth conditions (can be a lake region, dry land region).
- **Geofeature**: nature of habitat.
- **Coords**: specific coordinates of this region on the globe.
- **Kingdom**: highest taxonomic rank recorded for the microorganism. Most are bacteria, as this type of microorganism was chosen specifically for analysis.
- **Phylum**: next lower taxonomic rank of the microorganism. Phylum can describe high-level differences in habitat. For instance, Proteobacteria dominate water habitats.
- **Class**: lowest taxonomic rank of the microorganism considered. At this level, differences in respiration, Gram staining and optimal growth conditions start to show.


### Growth & production
- **Gram**: Gram staining result, in +/- notation. Most EPS-producing microorganisms of interest for cryo are Gram-negative.
- **Shape**: shape of the microorganism.
- **Respiration**: whether it requires oxygen to survive, in a strict or facultative manner.
- **MinT**: minimum growth temperature of that microorganism. If no info on the microorganism is found, this reflects temperature ranges in the habitat.
- **OptT**: optimal growth temperature of that microorganism. Sometimes the optimal value reported is for when polymer production is successful.
- **MaxT**: maximum growth temperature of that microorganism. If no info on the microorganism is found, this reflects temperature ranges in the habitat.
- **MinPH**: minimum growth pH of that microorganism. If no info on the microorganism is found, this reflects pH ranges in the habitat.
- **OptPH**: optimal growth pH of that microorganism. Sometimes the optimal value reported is for when polymer production is successful.
- **MaxPH**: maximum growth pH of that microorganism. If no info on the microorganism is found, this reflects pH ranges in the habitat.
- **MinSalt%**: minimum growth saline concentration of that microorganism. If no info on the microorganism is found, this reflects saline concentration ranges in the habitat.
- **OptSalt%**: optimal growth saline concentration of that microorganism. Sometimes the optimal value reported is for when polymer production is successful.
- **MaxSalt%**: maximum growth saline concentration of that microorganism. If no info on the microorganism is found, this reflects saline concentration ranges in the habitat.
- **Salt Media**: the salt formulation used in production, usually a mix of many different components.
- **Salt**: the major salt used, mostly NaCl. The Salt% columns usually record the reported amount of this major salt, not of the whole medium.
- **CarbonSrc**: carbon source used. Mostly influences polymer production.
- **CarbonSrc%**: amount of carbon source used.
- **NitrogenSrc**: nitrogen source used. Mostly influences microorganism growth.
- **NitrogenSrc%**: amount of nitrogen source used.
- **<font color='blue'>C/N ratio</font>**: carbon-to-nitrogen ratio.
- **RPM**: rotations per minute used for agitation in the bioreactor or Erlenmeyer flask. Can influence oxygen flow and nutrient diffusion.
- **Aeration**: aeration rate used. Microaerobic and anaerobic microorganisms have little to no aeration, so reported as 0 unless stated otherwise (after pre-processing).
- **Time**: production time in hours.
- **Productivity**: milligrams of polymer produced per liter of broth per day.
- **Specific yield**: grams of polymer produced per gram of dry cell weight.
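
The two derived quantities in this section can be sketched as follows. This is a minimal illustration, not the procedure used to build the database: it assumes the C/N ratio is simply `CarbonSrc%` divided by `NitrogenSrc%`, and that Productivity is obtained from a final polymer concentration (mg/L) and the `Time` column in hours. The function names are hypothetical.

```python
from typing import Optional


def cn_ratio(carbon_pct: Optional[float], nitrogen_pct: Optional[float]) -> Optional[float]:
    """C/N ratio as CarbonSrc% divided by NitrogenSrc% (assumed definition)."""
    # Leave the field empty when either input is missing or nitrogen is zero.
    if carbon_pct is None or nitrogen_pct is None or nitrogen_pct == 0:
        return None
    return carbon_pct / nitrogen_pct


def productivity_mg_per_l_day(polymer_mg_per_l: float, time_h: float) -> float:
    """Convert a final polymer concentration (mg/L) and Time (h) to mg per liter per day."""
    if time_h <= 0:
        raise ValueError("Time must be a positive number of hours")
    # Time is recorded in hours, Productivity is expressed per day.
    return polymer_mg_per_l / (time_h / 24.0)


# Illustrative values only, not taken from the database.
print(cn_ratio(2.0, 0.5))                       # 4.0
print(productivity_mg_per_l_day(1200.0, 72.0))  # 400.0
```

Returning `None` rather than a placeholder number keeps incomplete entries distinguishable from genuine zeros during pre-processing.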

### Polymer composition
- **MW**: molecular weight of the polymer.
- **Multiplicity**: number of unique monosaccharide units.
- **Real Monomer Count**: total number of monosaccharide units in the SRU, including repeats.


- **Glc**: glucose content, in both relative molar ratio and absolute percentage.
- **GlcN**: glucosamine content, in both relative molar ratio and absolute percentage.
- **GlcA**: glucuronic acid content, in both relative molar ratio and absolute percentage.
- **GlcNAc**: N-acetylglucosamine content, in both relative molar ratio and absolute percentage.
- **Gal**: galactose content, in both relative molar ratio and absolute percentage.
- **GalN**: galactosamine content, in both relative molar ratio and absolute percentage.
- **GalA**: galacturonic acid content, in both relative molar ratio and absolute percentage.
- **GalNAc**: N-acetylgalactosamine content, in both relative molar ratio and absolute percentage.
- **Fru**: fructose content, in both relative molar ratio and absolute percentage.
- **Fuc**: fucose content, in both relative molar ratio and absolute percentage.
- **Man**: mannose content, in both relative molar ratio and absolute percentage.
- **ManN**: mannosamine content, in both relative molar ratio and absolute percentage.
- **Xyl**: xylose content, in both relative molar ratio and absolute percentage.
- **Ara**: arabinose content, in both relative molar ratio and absolute percentage.
- **Rha**: rhamnose content, in both relative molar ratio and absolute percentage.
- **Rib**: ribose content, in both relative molar ratio and absolute percentage.
- **Tre**: trehalose content, in both relative molar ratio and absolute percentage.
- **Alt**: altrose content, in both relative molar ratio and absolute percentage.
- **QuiNAc**: N-acetylquinovosamine content, in both relative molar ratio and absolute percentage.
- **Kdo**: ketodeoxyoctonic acid content, in both relative molar ratio and absolute percentage.


- **Aminoacid Subst**: amino acid substituents present in the polysaccharidic SRU.
- **Non-osidic Subst**: non-sugar derivatives present in the polysaccharidic SRU.
- **Residuals**: monomer units present in trace amounts. Not accounted in total composition percentages or SRU.
- **%Carbohydrate**: percentage of the carbohydrate fraction in the polymer.
- **%UA**: percentage of uronic acids in the polymer.
- **%HA**: percentage of hexosamines in the polymer.
- **%Acetyl**: percentage of acetyl moieties in the polymer.
- **%Phosphate**: percentage of phosphate moieties in the polymer.
- **%Sulfate**: percentage of sulfate moieties in the polymer.
- **%Pyruvate**: percentage of pyruvate moieties in the polymer.
- **%Protein**: percentage of the protein fraction in the polymer.
- **%Lipids**: percentage of the lipid fraction in the polymer.
- **%NucleicAcids**: percentage of nucleic acids in the polymer.
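
The relation between the monosaccharide columns, Multiplicity and Real Monomer Count can be sketched as below. It is a hedged example: it assumes the relative molar ratios of a fully characterized SRU are available as a mapping from monomer code to count, with residuals already excluded, and the function name `composition_summary` is hypothetical.

```python
from typing import Dict


def composition_summary(sru_counts: Dict[str, float]) -> dict:
    """Derive Multiplicity, Real Monomer Count and absolute percentages."""
    # Keep only monomers that are actually present in the SRU.
    present = {name: n for name, n in sru_counts.items() if n > 0}
    total = sum(present.values())
    if total == 0:
        raise ValueError("the SRU must contain at least one monomer")
    # Absolute percentage = relative molar ratio normalized to 100.
    percent = {name: 100.0 * n / total for name, n in present.items()}
    return {
        "multiplicity": len(present),   # unique monosaccharide units
        "real_monomer_count": total,    # all units, repeats included
        "percent": percent,
    }


# Illustrative SRU with molar ratio Glc:Gal:Man = 2:1:1.
summary = composition_summary({"Glc": 2, "Gal": 1, "Man": 1, "Fuc": 0})
print(summary["multiplicity"])        # 3
print(summary["real_monomer_count"])  # 4
print(summary["percent"]["Glc"])      # 50.0
```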

### Polymer structure
- **Linkage**: Defines the nature of monomer linkages in the chain. Can be α, β or both.
- **α/%α**: total number/percentage of α linkages.
- **β/%β**: total number/percentage of β linkages.
- **<font color='blue'>α/β Ratio</font>**: relative proportion of α and β linkages in the SRU.
- **PrimaryCfm**: first-level conformation of the polymer. Can be linear or branched.
- **SecondaryCfm**: second-level conformation of the polymer. How primary elements are organized: linear chains as helices, for instance.
- **TertiaryCfm**: third-level conformation of the polymer. How secondary elements are organized: a bundle of helices can form a rod, for instance.
- **Polarity**: net charge of the polymer.
- **LinkageTypes**: all unique linkages present between monomers. Nomenclature contains the linkage type, and both carbon numeric identifiers of each of the two monomers that form the linkage.
- **SRU**: the structural repeating unit -- a unique, irreducible sequence of the least number of monomers that repeats throughout the chain and defines the polymer as a whole. The SRU is also a unique identifier of each polymer: two equal SRUs indicate the same polymer.
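
The linkage columns can be derived from a list of the linkages in the SRU, as in the sketch below. The label format is an assumption: each label is taken to start with `α` or `β`, followed by the carbon positions, and the list holds every linkage in the SRU with repeats included. The function name `linkage_stats` is hypothetical.

```python
from typing import List


def linkage_stats(linkages: List[str]) -> dict:
    """Count α and β linkages and derive Linkage, %α, %β and the α/β ratio."""
    n_alpha = sum(1 for label in linkages if label.strip().startswith("α"))
    n_beta = sum(1 for label in linkages if label.strip().startswith("β"))
    total = n_alpha + n_beta
    if total == 0:
        raise ValueError("no α or β linkage found")
    if n_alpha and n_beta:
        kind = "both"
    elif n_alpha:
        kind = "α"
    else:
        kind = "β"
    return {
        "linkage": kind,
        "alpha": n_alpha,
        "beta": n_beta,
        "pct_alpha": 100.0 * n_alpha / total,
        "pct_beta": 100.0 * n_beta / total,
        # The ratio is undefined when the SRU has no β linkage.
        "alpha_beta_ratio": n_alpha / n_beta if n_beta else None,
        "linkage_types": sorted(set(label.strip() for label in linkages)),
    }


# Illustrative SRU with three α linkages and one β linkage.
stats = linkage_stats(["α-(1→4)", "α-(1→6)", "β-(1→3)", "α-(1→4)"])
print(stats["linkage"], stats["pct_alpha"], stats["alpha_beta_ratio"])  # both 75.0 3.0
```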

### Polymer properties
- **TDBehavior**: thermodynamic behavior observed in liquid-state differential scanning calorimetry (DSC).
- **Tc**: freezing temperature in liquid-state DSC.
- **Tm1**: first melting temperature in liquid-state DSC.
- **Tm2**: second melting temperature in liquid-state DSC.
- **Tg**: glass transition temperature in liquid-state DSC.
- **Tdecomp**: temperature of polymer decomposition, defines upper threshold of chain stability before degradation occurs.
- **Stress-rheology**: rheological stress-strain behavior of the polymer observed in a rheometer.
- **Time-rheology**: rheological stress-time behavior of the polymer observed in a rheometer.
- **Viscosity**: estimated zero-shear viscosity of the polymer observed in a rheometer. Units are normalized to millipascal-seconds per weight percentage.
- **[α]D25°C**: optical rotation of light induced by the polymer chain at 25 degrees Celsius.
- **SpaceLoc**: spatial location the polymer occupies in a cellular environment during freezing. Can be intracellular, extracellular or transmembrane.
- **HDR**: hydrodynamic radius of the polymer chain when hydrated. Units in nanometers.
- **Zeta**: zeta-potential of the polymer.
- **Osmo**: osmolarity of the polymer.
- **XRD**: X-ray diffraction properties of the polymer.


### Polymer functionality
- **Biocompatible**: is it biocompatible, which means not cytotoxic?
- **Anticytotoxic**: does it nullify the cytotoxicity of other substances in the medium?
- **Antioxidant**: does it actively neutralize reactive oxygen species or work at the expression level of antioxidant molecule production?
- **Antitumor**: does it have antitumoral activity?
- **Antiradiation**: does it protect against radiation effects and is it resistant against photodegradation?
- **Antibiofilm**: does it prevent biofilm formation?
- **Thermostable**: is it structurally stable along wide temperature ranges?
- **HMT**: does it have heavy metal tolerance, also known as a bioflocculant effect?
- **Emulsifying**: is it capable of forming emulsions?
- **Swelling**: does it swell in aqueous environments?
- **Gelling**: is it capable of forming gels?
- **Foaming**: does it form a foam?
- **Immunoregulatory**: is it capable of triggering or regulating immunological responses?
- **Antibacterial**: is it antibacterial?
- **Antiviral**: is it antiviral?
- **IRI**: can it inhibit ice recrystallization?

### Outcome
- **<font color='blue'>AssumedCryoOutcome</font>**: based on extremophilic type, is it assumed to have a cryoprotective outcome?
- **ReportsCryoOutcome**: does the paper actually report a cryoprotective function?
- **RealCryoOutcome**: quantitative (comparable after pre-processing) value for the polymer's cryoprotective function.

### Theoretical calculations from monomer properties
- **<font color='blue'>MolarWeightAVG</font>**: a score for the molecular weight of all monomers in the chain. Represents the "heaviness" of the polymer.
- **<font color='blue'>NumberAVG</font>**: a score for the weight-normalized monomer ratio in the chain. Represents the "monomer balancing" of the polymer.
- **<font color='blue'>Chain Impact Factor</font>**: a ratio of Molar Weight and Number averages. It is a descriptor of chain efficiency per complexity: the more complex the chain, the lower its carbon efficiency and the lower the CIF.
- **<font color='blue'>Global charge</font>**: net charge of the polymer when all polar groups are accounted for in the calculation.
- **<font color='blue'>Global pKa</font>**: net acidity of the polymer when all polar groups are accounted for in the calculation.
- **<font color='blue'>Global pKb</font>**: net basicity of the polymer when all polar groups are accounted for in the calculation.
- **<font color='blue'>H-balance</font>**: amount of protons capable of participating in water hydration. The more negative this is, the more monomers are recruited for hydrogen bonds with water.
- **<font color='blue'>H-available</font>**: average free proton count when all monomers in the chain are accounted for.
- **<font color='blue'>Avg. Surface Area</font>**: average surface area, in $Å^{2}$, if we normalize all local contributions to a global surface area for the polymer.
- **<font color='blue'>Avg. Polarizability</font>**: average polarizability, in $Å^{3}$, if we normalize all local contributions to a global polarizability for the polymer.
- **<font color='blue'>Avg. logS</font>**: solubility of the polymer, on a logarithmic scale.
- **<font color='blue'>Structural Complexity Factor</font>**: ratio of multiplicity and total monomer count.
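
Two of the ratios above can be written out directly. This is a sketch under stated assumptions: the Structural Complexity Factor follows the definition given here (Multiplicity divided by Real Monomer Count), while the direction of the Chain Impact Factor (MolarWeightAVG divided by NumberAVG) is assumed, since the codebook only calls it a ratio of the two. Function names and input values are hypothetical.

```python
def structural_complexity_factor(multiplicity: int, real_monomer_count: int) -> float:
    """Multiplicity divided by Real Monomer Count; 1.0 means no monomer repeats."""
    if real_monomer_count <= 0:
        raise ValueError("real_monomer_count must be positive")
    if multiplicity > real_monomer_count:
        raise ValueError("multiplicity cannot exceed the total monomer count")
    return multiplicity / real_monomer_count


def chain_impact_factor(molar_weight_avg: float, number_avg: float) -> float:
    """MolarWeightAVG divided by NumberAVG (assumed direction of the ratio)."""
    if number_avg == 0:
        raise ValueError("number_avg must be non-zero")
    return molar_weight_avg / number_avg


# Illustrative values only: an SRU with 3 unique monomers out of 4 units.
print(structural_complexity_factor(3, 4))  # 0.75
print(chain_impact_factor(180.0, 90.0))    # 2.0
```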