# 5. Data Types — Expanded with Bioinformatics Examples (EN)

**Goals**
- Understand Python's core data types (`int`, `float`, `bool`, `str`, `bytes`, `None`) and when to use each.
- Practice conversions and precision choices (`decimal`, `fractions`).
- Map bioinformatics data (genomics, transcriptomics, proteomics, metabolomics, microbiome) onto these types.
- Run compact domain examples for each omics area.

## 1) Core built-ins

- **`int`**: arbitrary-precision integers (good for counts, indices, read lengths).
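- **`float`**: IEEE-754 double precision, about 15 to 17 significant decimal digits (good for measurements: mass, intensity, p-values). Beware rounding: most decimal fractions, such as `0.1`, have no exact binary representation.
- **`bool`**: truth values (`True`/`False`) for QC flags.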
- **`str`**: text; in bio, also used to store sequences `"ATG..."` or peptides `"ACDEFG"`.
- **`bytes`**: raw bytes; useful for FASTQ quality strings or binary formats.
- **`None`**: missing value placeholder (e.g., unknown coverage, no hit).

> Tip: Prefer `int` for counts (reads, peptides). Use `float` for measurements; if exact decimal arithmetic is required (e.g., currency-like quantities, or exact mass calculations), consider `decimal.Decimal`.

In [2]:
# Quick sampler
a: int = 42
b: float = 3.14159
c: bool = True
d: str = "ATGC"
e: bytes = b"@ABC"            # arbitrary bytes (FASTQ qualities are ASCII-coded)
n = None

print(type(a), type(b), type(c), type(d), type(e), n is None)

<class 'int'> <class 'float'> <class 'bool'> <class 'str'> <class 'bytes'> True
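
Two properties of these built-ins are easy to miss: `int` never overflows, and `bool` is a subclass of `int`, so summing a list of flags counts the `True` values. The following is a minimal sketch; the `qc_pass` list and the large number are made-up illustrative values.

```python
# int is arbitrary precision: no overflow even beyond 64 bits
big = 3_100_000_000 ** 2               # an illustrative large number
print(big, big > 2**63)

# bool is a subclass of int, so True counts as 1 and False as 0
qc_pass = [True, False, True, True]    # hypothetical per-read QC flags
n_pass = sum(qc_pass)
print("passed:", n_pass, "of", len(qc_pass))

# Indexing bytes gives an int; indexing str gives a one-character str
first_byte = b"ACGT"[0]
first_char = "ACGT"[0]
print(first_byte, first_char)
```

Summing booleans is a common idiom for counting how many records pass a filter.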


## 2) Numeric precision: `float` vs `decimal` vs `fractions`

- **float**: fast, approximate; use for intensities, expression values, probabilities.
- **decimal.Decimal**: exact decimal arithmetic (configurable precision); useful for precise mass sums (didactic).
- **fractions.Fraction**: rational numbers; nice for exact ratios (e.g., coverage fractions in toy demos).

In [3]:
from decimal import Decimal, getcontext
from fractions import Fraction

# Float rounding surprise
x = 0.1 + 0.2
print("float 0.1+0.2 =", x)   # not exactly 0.3

# Decimal: exact decimal arithmetic (28 significant digits is the default precision)
getcontext().prec = 28
xd = Decimal("0.1") + Decimal("0.2")
print("decimal 0.1+0.2 =", xd)

# Rational fraction
f = Fraction(2, 3) + Fraction(1, 6)  # 2/3 + 1/6 = 5/6
print("fraction:", f, "=", float(f))

float 0.1+0.2 = 0.30000000000000004
decimal 0.1+0.2 = 0.3
fraction: 5/6 = 0.8333333333333334
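
Because floats are approximate, equality tests on computed values are fragile. A common workaround, sketched here with the standard library only, is to compare within a tolerance using `math.isclose` and to sum with `math.fsum` when accumulated rounding matters.

```python
import math

x = 0.1 + 0.2
print(x == 0.3)                 # exact comparison fails for this pair
print(math.isclose(x, 0.3))     # comparison within a relative tolerance

# math.fsum tracks the low-order bits that naive accumulation can lose.
# The built-in sum() may or may not give exactly 1.0 here, depending on
# the Python version, so prefer fsum when this matters.
values = [0.1] * 10
total_fsum = math.fsum(values)
print(sum(values), total_fsum)
```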


## 3) `str` for sequences (Genomics & Transcriptomics)

Strings are perfect for DNA/RNA sequences (A/C/G/T/U). Use `.count`, slicing, `.replace`, and membership tests.

In [4]:
dna: str = "ATGCGTACGT"
rna: str = dna.replace("T", "U")
gc = (dna.count("G") + dna.count("C")) / len(dna)
print("DNA:", dna, "| RNA:", rna, "| GC%:", round(gc*100, 1))

# k-mer (k=3) slicing
k = 3
kmers = [dna[i:i+k] for i in range(0, len(dna)-k+1)]
print("3-mers:", kmers[:5], "... total:", len(kmers))

DNA: ATGCGTACGT | RNA: AUGCGUACGU | GC%: 50.0
3-mers: ['ATG', 'TGC', 'GCG', 'CGT', 'GTA'] ... total: 8
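
The membership tests mentioned above are not shown in the cell, so here is one possible sketch: a validator built on set comparison, plus a reverse complement using `str.translate`. The helper names `is_dna` and `reverse_complement` are illustrative, and ambiguity codes such as `N` are deliberately rejected in this toy version.

```python
# Translation table built once with str.maketrans
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_dna(seq: str) -> bool:
    """Return True if seq is non-empty and contains only A, C, G, T."""
    return bool(seq) and set(seq) <= set("ACGT")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string with a slice."""
    if not is_dna(seq):
        raise ValueError(f"not a DNA sequence: {seq!r}")
    return seq.translate(COMPLEMENT)[::-1]

example = "ATGCGTACGT"
print(is_dna(example), is_dna("ATGN"))
print(reverse_complement(example))
```

Applying `reverse_complement` twice returns the original sequence, which makes a handy sanity check.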


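## 4) `bytes` for FASTQ qualities (Genomics/Transcriptomics sequencing)

FASTQ encodes each per-base quality as one ASCII character (Phred+33): the character's code minus 33 is the Phred score. Because every quality character is a single byte, the string can be stored and read compactly as `bytes`.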

In [5]:
# Example: convert a simple quality string to Phred scores
qual_str = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
# In a real FASTQ record the quality string has the same length as the read. Here we only show the mapping:
def phred33_to_scores(q: str) -> list[int]:
    return [ord(ch) - 33 for ch in q]

scores = phred33_to_scores(qual_str)
print("first 10 scores:", scores[:10], "min/max:", min(scores), max(scores))

# As bytes (compact)
qual_bytes: bytes = qual_str.encode("ascii")
print("bytes length:", len(qual_bytes), "first byte:", qual_bytes[0])

first 10 scores: [0, 6, 6, 9, 7, 7, 7, 7, 9, 9] min/max: 0 37
bytes length: 60 first byte: 33
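
When qualities are already held as `bytes`, iterating yields integers directly, so `ord()` is unnecessary. The sketch below also converts a score to an error probability using the Phred definition Q = -10 * log10(p). The function names and the four-character quality string are made up for illustration.

```python
def phred33_bytes_to_scores(q: bytes) -> list[int]:
    """Iterating over bytes yields ints, so no ord() call is needed."""
    return [b - 33 for b in q]

def error_probability(score: int) -> float:
    """Invert the Phred definition: p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)

demo_quals = b"!+5I"                      # hypothetical 4-base quality string
demo_scores = phred33_bytes_to_scores(demo_quals)
print(demo_scores)
print([error_probability(s) for s in demo_scores])
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.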


## 5) Transcriptomics: counts (`int`) and TPM (`float`)

Counts are integers; normalized values (TPM/FPKM) are floats.

In [6]:
# Simple TPM-like normalization (toy)
genes = ["g1","g2","g3"]
counts: dict[str,int] = {"g1": 100, "g2": 400, "g3": 500}
length_kb = {"g1": 1.0, "g2": 2.0, "g3": 1.0}  # kilobases

# RPK = counts / length_kb
rpk = {g: counts[g]/length_kb[g] for g in genes}
scale = sum(rpk.values())/1e6 or 1.0
tpm = {g: rpk[g]/scale for g in genes}
print("counts (int):", counts)
print("tpm (float):", {k: round(v,2) for k,v in tpm.items()})

counts (int): {'g1': 100, 'g2': 400, 'g3': 500}
tpm (float): {'g1': 125000.0, 'g2': 250000.0, 'g3': 625000.0}
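
The same toy normalization can be wrapped in a function, which makes its defining property easy to check: TPM values sum to one million whenever at least one gene has reads. This is a sketch under the same simplifications as the cell above; `tpm_from_counts` is an illustrative name, not a library function.

```python
def tpm_from_counts(counts: dict[str, int],
                    length_kb: dict[str, float]) -> dict[str, float]:
    """Toy TPM: reads per kilobase, rescaled so all genes sum to 1e6."""
    rpk = {g: counts[g] / length_kb[g] for g in counts}
    total_rpk = sum(rpk.values())
    if total_rpk == 0:
        # No reads at all: return zeros instead of dividing by zero
        return {g: 0.0 for g in counts}
    return {g: v / total_rpk * 1e6 for g, v in rpk.items()}

tpm_demo = tpm_from_counts({"g1": 100, "g2": 400, "g3": 500},
                           {"g1": 1.0, "g2": 2.0, "g3": 1.0})
print(tpm_demo)
print("sum:", sum(tpm_demo.values()))
```

Note that the inputs are `int` counts and the outputs are `float`, which mirrors the type change that normalization implies.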


## 6) Proteomics: peptide mass (using `Decimal` for didactic precision)

We'll compute a simple monoisotopic mass sum from a peptide string. The residue masses are standard monoisotopic values rounded to five decimals; the example is illustrative, not a full chemistry engine.

In [7]:
from decimal import Decimal, getcontext
getcontext().prec = 28

AA_MASS = {
    "A": Decimal("71.03711"),  "C": Decimal("103.00919"),
    "D": Decimal("115.02694"), "E": Decimal("129.04259"),
    "F": Decimal("147.06841"), "G": Decimal("57.02146"),
    "H": Decimal("137.05891"), "I": Decimal("113.08406"),
    "K": Decimal("128.09496"), "L": Decimal("113.08406"),
    "M": Decimal("131.04049"), "N": Decimal("114.04293"),
    "P": Decimal("97.05276"),  "Q": Decimal("128.05858"),
    "R": Decimal("156.10111"), "S": Decimal("87.03203"),
    "T": Decimal("101.04768"), "V": Decimal("99.06841"),
    "W": Decimal("186.07931"), "Y": Decimal("163.06333"),
}

WATER = Decimal("18.01056")  # add for peptide (N-terminus H, C-terminus OH)

def peptide_mass(peptide: str) -> Decimal:
    total = WATER
    for aa in peptide:
        if aa not in AA_MASS:
            raise ValueError(f"Unknown residue: {aa}")
        total += AA_MASS[aa]
    return total

pep = "ACDE"
print("peptide:", pep, "mass:", peptide_mass(pep))

peptide: ACDE mass: 436.12639
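
Once a mass is held as a `Decimal`, two follow-up steps come up often: rounding for a report and converting to `float` at the boundary to other tools. The sketch below uses the ACDE mass printed above as a literal; the three-decimal reporting precision is an arbitrary choice for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

mass = Decimal("436.12639")          # ACDE mass, copied as a literal

# quantize rounds to a fixed number of decimals and returns a Decimal
reported = mass.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
print(reported)

# Convert to float only at the boundary (e.g. before plotting)
as_float = float(mass)
print(type(as_float).__name__, as_float)

# Mixing Decimal and float in arithmetic raises TypeError
try:
    mass + 1.5
except TypeError as exc:
    print("TypeError:", exc)
```

The `TypeError` is deliberate on Python's part: it stops an approximate float from silently contaminating an exact calculation.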


## 7) Metabolomics: m/z (float), intensity (float), TIC/base peak

Centroided spectra can be represented as parallel lists of m/z and intensity values (floats).

In [8]:
mz: list[float] = [100.0, 101.1, 150.2, 300.05, 500.3]
intensity: list[float] = [10.0, 250.0, 30.0, 400.0, 120.0]

tic = sum(intensity)                  # total ion current
base_idx = max(range(len(intensity)), key=lambda i: intensity[i])
base_peak = (mz[base_idx], intensity[base_idx])

# Normalize to relative intensity (0..1)
imax = max(intensity) or 1.0
rel = [i/imax for i in intensity]

print("TIC:", tic)
print("base peak:", base_peak)
print("relative intensities:", [round(x,3) for x in rel])

TIC: 810.0
base peak: (300.05, 400.0)
relative intensities: [0.025, 0.625, 0.075, 1.0, 0.3]


## 8) Microbiome: OTU/ASV table (int), relative abundance (float), flags (bool)

An OTU table is a mapping from taxon (str) to counts (int) per sample. We compute relative abundance and simple QC flags.

In [9]:
# Example counts for a single sample
otu_counts: dict[str,int] = {
    "Escherichia coli": 1200,
    "Bacteroides fragilis": 600,
    "Lactobacillus casei": 200,
    "Unassigned": 0,
}

raw_total = sum(otu_counts.values())
total = raw_total or 1   # avoid dividing by zero for an empty sample
rel_abundance: dict[str,float] = {k: v/total for k, v in otu_counts.items()}

# Simple QC flags (bool)
flags: dict[str,bool] = {
    "has_reads": raw_total > 0,   # test the raw sum; `total` is never 0
    "dominant_taxon_over_50pct": max(rel_abundance.values()) > 0.5,
}

print("relative abundance:", {k: round(v,3) for k,v in rel_abundance.items()})
print("flags:", flags)

relative abundance: {'Escherichia coli': 0.6, 'Bacteroides fragilis': 0.3, 'Lactobacillus casei': 0.1, 'Unassigned': 0.0}
flags: {'has_reads': True, 'dominant_taxon_over_50pct': True}


## 9) `None` for missing values (and safe handling)

Use `None` when a value is missing or unknown. Check with `is None` rather than truthiness, since `0` and `0.0` are falsy but are valid values.

In [10]:
coverage: float | None = None
if coverage is None:
    coverage = 0.0   # default
print("coverage:", coverage)

coverage: 0.0


## 10) Conversions & Validation

Use `int()`, `float()`, `str()`, `.encode()/.decode()` and custom validators to sanitize inputs.

In [11]:
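def to_int(x) -> int:
    try:
        return int(x)
    except (TypeError, ValueError, OverflowError) as exc:
        # Chain the original error so the traceback shows the root cause
        raise ValueError(f"not an int: {x!r}") from exc

def to_float(x) -> float:
    try:
        return float(x)
    except (TypeError, ValueError, OverflowError) as exc:
        raise ValueError(f"not a float: {x!r}") from exc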

print(to_int("42"), to_float("3.14"))
print("ATGC".encode("ascii").decode("ascii"))

42 3.14
ATGC
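
A custom validator usually adds domain rules on top of the type conversion. As one possible sketch, the hypothetical `parse_count` below accepts only non-negative integer text. It also shows a conversion pitfall: `int()` truncates a float such as `3.9`, but rejects the string `"3.9"`.

```python
def parse_count(text: str) -> int:
    """Parse a read count from text; reject negatives and non-integers."""
    try:
        value = int(text.strip())
    except ValueError as exc:
        raise ValueError(f"not an integer count: {text!r}") from exc
    if value < 0:
        raise ValueError(f"count must be non-negative: {value}")
    return value

print(parse_count(" 42 "))
print(int(3.9))          # int() on a float truncates toward zero

for bad in ["3.9", "-5", "n/a"]:
    try:
        parse_count(bad)
    except ValueError as exc:
        print("rejected:", exc)
```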


## 11) Exercises

1. **DNA %AT**: Write a function `percent_at(seq: str) -> float` returning the AT fraction as a float in the range 0..1 (multiply by 100 for a percentage).  
2. **FASTQ qualities**: Write a function `mean_phred(qstr: str) -> float` that returns mean Phred score (Phred+33).  
3. **Peptide mass**: Extend `peptide_mass` to support a fixed modification: `+15.9949` Da (oxidation) on every `M` residue.  
4. **Metabolomics scaling**: Normalize intensities to TIC=1.0 and print the new base peak intensity.  
5. **Microbiome RA**: Given a dict of taxon->count, return the top-2 taxa by relative abundance.

In [12]:
# TODO 1) DNA %AT
def percent_at(seq: str) -> float:
    seq = seq.upper()
    if not seq:
        return 0.0
    at = seq.count("A") + seq.count("T")
    return at / len(seq)

print("AT% of ATGCGT =", round(percent_at("ATGCGT")*100,1))

AT% of ATGCGT = 50.0


In [13]:
# TODO 2) mean Phred
def mean_phred(qstr: str) -> float:
    if not qstr:
        return 0.0
    vals = [ord(c) - 33 for c in qstr]
    return sum(vals)/len(vals)

print("mean phred:", round(mean_phred("IIIIIIIIII"),2))  # 'I' is ASCII 73, so 73 - 33 = 40

mean phred: 40.0


In [14]:
# TODO 3) peptide mass + oxidation on M (+15.9949 each)
from decimal import Decimal

OX = Decimal("15.9949")
def peptide_mass_oxidized(peptide: str) -> Decimal:
    total = WATER  # water mass defined in the peptide_mass cell, like AA_MASS
    for aa in peptide:
        if aa not in AA_MASS:
            raise ValueError(f"Unknown residue: {aa}")
        total += AA_MASS[aa]
        if aa == "M":
            total += OX
    return total

print("ACDM mass (oxidized Ms):", peptide_mass_oxidized("ACDM"))

ACDM mass (oxidized Ms): 454.11919


In [15]:
# TODO 4) TIC normalize
def tic_normalize(intensities: list[float]) -> list[float]:
    s = sum(intensities) or 1.0   # guard against an all-zero spectrum
    return [i/s for i in intensities]

norm = tic_normalize([10.0, 250.0, 30.0, 400.0, 120.0])
print("normalized TIC:", round(sum(norm), 6))
print("new base peak intensity:", round(max(norm), 6))

normalized TIC: 1.0
new base peak intensity: 0.493827


In [16]:
# TODO 5) Top-2 taxa by relative abundance
def top2_taxa(counts: dict[str,int]) -> list[tuple[str,float]]:
    total = sum(counts.values()) or 1
    ra = {k: v/total for k, v in counts.items()}
    return sorted(ra.items(), key=lambda kv: kv[1], reverse=True)[:2]

print("top2:", top2_taxa({"A":10, "B":5, "C":1}))

top2: [('A', 0.625), ('B', 0.3125)]
