# 🧬 Genome Guide: Comprehensive Analysis Report
## Chapters 2 & 3: From DNA Foundations to Gene Architecture

This report validates the **Genome Guide** bioinformatics engine against real genomic data (**hg38, Chromosome 22**). Each section pairs the biological background of an analysis with the statistics the engine computed for it.

---

## 1. Formal Architecture Documentation

The system implements a **Separation of Concerns** across four primary layers:

1.  **Raw Data Layer:** Standard bioinformatics formats (`.fa`, `.gtf`, `.txt` from UCSC).
2.  **Orchestration Layer (Snakemake):** Manages the DAG (Directed Acyclic Graph) of tasks. Ensures idempotency and dependency tracking.
3.  **Relational Layer (SQLite/SQLAlchemy):** Stores entities (`Chromosome`, `Gene`, `Exon`, `Utr`, `CpgIsland`) with indexing on genomic coordinates for fast spatial queries.
4.  **Analytics Layer:** Decoupled Python scripts that consume database sessions and compute high-level statistics, storing them as JSON in a specialized `genome_stats` table.
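
The hand-off between the Analytics Layer and the `genome_stats` table can be sketched as a small JSON round trip. This is a minimal illustration, not the engine's actual code: the column names `stat_name` and `stat_value` are the ones this notebook queries, while the table definition, the helper names `store_stat` and `load_stat`, and the sample values are assumptions made for the example.

```python
import json
import sqlite3


def store_stat(conn, name, value):
    """Serialize a statistic to JSON and insert or overwrite it by name."""
    conn.execute(
        "INSERT OR REPLACE INTO genome_stats (stat_name, stat_value) VALUES (?, ?)",
        (name, json.dumps(value)),
    )
    conn.commit()


def load_stat(conn, name):
    """Return the decoded statistic, or None when no row has that name."""
    row = conn.execute(
        "SELECT stat_value FROM genome_stats WHERE stat_name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Assumed schema: one JSON document per statistic name.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE genome_stats (stat_name TEXT PRIMARY KEY, stat_value TEXT NOT NULL)"
)

# Made-up values, used only to show the round trip.
store_stat(demo_conn, "example_stat", {"A": 1, "C": 2})
print(load_stat(demo_conn, "example_stat"))
```

Storing each result as one JSON document keeps the analytics scripts free to change the shape of a statistic without a schema migration, at the cost of not being able to query inside the value with plain SQL.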

### Algorithmic Efficiency Summary

| Component | Algorithm | Complexity | Status |
| :--- | :--- | :--- | :--- |
| **SSR Engine** | Regex Pattern Matching | $O(N_{seq})$ | Optimized |
| **Density Binning** | Linear Partitioning | $O(G_{genes})$ | Optimized |
| **Nested Discovery** | Spatial Pruning | $O(G \log G)$ | Optimized |
| **CpG Association** | Spatial Index Join | $O(I \cdot \log G)$ | Optimized |
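
To make the CpG Association row concrete, the sketch below shows one way an $O(I \cdot \log G)$ lookup can work once genes are sorted: a binary search finds the genes that start before an island ends, and a running maximum of gene ends tells whether any of them reaches the island. This is a hedged illustration on plain tuples, not the engine's implementation. The function name `associate_islands` and the `window` parameter are hypothetical; the result keys mirror the ones this notebook reads from the stored statistic.

```python
import bisect


def associate_islands(islands, genes, window=0):
    """Count islands that overlap (or lie within `window` bp of) any gene.

    islands, genes: lists of (start, end) tuples on a single chromosome.
    """
    genes_sorted = sorted(genes)
    starts = [start for start, _ in genes_sorted]

    # prefix_max_end[k] is the furthest end among the first k + 1 genes.
    prefix_max_end = []
    running = float("-inf")
    for _, end in genes_sorted:
        running = max(running, end)
        prefix_max_end.append(running)

    associated = 0
    for i_start, i_end in islands:
        # Number of genes whose start is at or before the island's end.
        k = bisect.bisect_right(starts, i_end + window)
        # One of those genes overlaps if it extends to the island's start.
        if k and prefix_max_end[k - 1] >= i_start - window:
            associated += 1

    return {
        "total_islands": len(islands),
        "associated_with_genes": associated,
        "non_associated": len(islands) - associated,
    }


# Toy coordinates, chosen only to exercise both outcomes.
toy_genes = [(100, 200), (500, 900)]
toy_islands = [(150, 160), (300, 350), (880, 950)]
print(associate_islands(toy_islands, toy_genes))
```

Sorting the genes costs $O(G \log G)$ once; each island then needs a single binary search.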

In [None]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import json
import os
from IPython.display import display, HTML

# Global Styling
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.figsize'] = (14, 7)
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14

# --- 1. Targeted Database Connection ---
db_paths = ['../genome_guides.db', 'genome_guides.db', '../../genome_guides.db']
db_path = None
for path in db_paths:
    # Require a populated file (> 1 MB) so that an empty stub database is skipped.
    if os.path.exists(path) and os.path.getsize(path) > 1_000_000:
        db_path = path
        break

if not db_path:
    conn = None  # Keep the name defined so later cells fail gracefully.
    print("CRITICAL ERROR: Populated 'genome_guides.db' not found.")
else:
    conn = sqlite3.connect(db_path)
    print(f"Connected to Genome Engine at {os.path.abspath(db_path)}")

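def fetch_stat(name):
    """Return the JSON-decoded value of a stored statistic, or None if unavailable."""
    try:
        # Bind the name as a parameter instead of interpolating it into the SQL.
        query = "SELECT stat_value FROM genome_stats WHERE stat_name = ?"
        row = pd.read_sql(query, conn, params=(name,))
        return json.loads(row.iloc[0, 0]) if not row.empty else None
    except Exception:
        # Missing table, missing connection, or malformed JSON: treat as "no data".
        return None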

## 2. Genomic Inventory (Truth Check)
Validating entity counts against biological expectations.

In [None]:
print('--- GENOMIC ENTITY INVENTORY ---')
tables = ['chromosomes', 'genes', 'exons', 'utrs', 'cpg_islands', 'centromeres', 'telomeres']
counts = {}
for t in tables:
    try:
        counts[t] = pd.read_sql(f'SELECT count(*) FROM {t}', conn).iloc[0,0]
    except Exception:
        # Table absent from this build of the database: report zero rows.
        counts[t] = 0

inv_df = pd.DataFrame(counts.items(), columns=['Entity', 'Value'])
display(inv_df.set_index('Entity').style.format('{:,}'))

print('\n--- DATA SAMPLE (First 10 Genes) ---')
display(pd.read_sql('SELECT gene_id, gene_name, start_pos, end_pos, strand FROM genes LIMIT 10', conn))

## 3. DNA Landscape (Chapter 2)
Analyzing base composition, GC skew, and repeats.
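
For reference, the quantities plotted in this section can be computed from a raw sequence as sketched below. This is a minimal illustration under stated assumptions, not the engine's code: the function names are hypothetical, only unambiguous `A`, `C`, `G`, `T` bases are counted (so `N` runs are ignored), GC content is returned as a percentage, and GC skew uses the common definition $(G - C) / (G + C)$.

```python
from collections import Counter


def base_composition(seq):
    """Count A, C, G and T; any other symbol (such as N) is ignored."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(seq):
    """GC content as a percentage of the unambiguous bases."""
    comp = base_composition(seq)
    total = sum(comp.values())
    return 100.0 * (comp["G"] + comp["C"]) / total if total else 0.0


def gc_skew(seq):
    """(G - C) / (G + C); 0.0 when the sequence has no G or C."""
    comp = base_composition(seq)
    gc = comp["G"] + comp["C"]
    return (comp["G"] - comp["C"]) / gc if gc else 0.0


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides made only of A, C, G, T."""
    seq = seq.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    return Counter(p for p in pairs if set(p) <= set("ACGT"))


toy = "GGGCATAT"  # made-up sequence for illustration
print(gc_content(toy), gc_skew(toy))  # → 50.0 0.5
```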

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))

# 3A. Pie Chart: Base Distribution
nuc = fetch_stat('nuclear_base_composition')
if nuc:
    ax1.pie(list(nuc.values()), labels=list(nuc.keys()), autopct='%1.1f%%', startangle=140,
            colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'], explode=[0.05]*len(nuc))
    ax1.set_title('Genome-wide Nuclear Base Distribution', fontweight='bold')

# 3B. GC Content vs. Average
gc = fetch_stat('gc_content_per_chromosome')
if gc:
    sns.barplot(x=list(gc.keys()), y=list(gc.values()), color='steelblue', ax=ax2)
    ax2.axhline(41, color='red', linestyle='--', label='Human Average (~41%)')
    ax2.set_ylabel('Percentage (%)')
    ax2.legend()

plt.show()

# 3C. Dinucleotide Signature Heatmap
dinu = fetch_stat('dinucleotide_frequency')
if dinu:
    bases = ['A', 'C', 'G', 'T']
    matrix = np.zeros((4, 4))
    for i, b1 in enumerate(bases):
        for j, b2 in enumerate(bases):
            matrix[i, j] = dinu.get(b1+b2, 0)
    plt.figure(figsize=(10, 8))
    sns.heatmap(matrix, annot=True, fmt=',.0f', xticklabels=bases, yticklabels=bases, cmap='YlGnBu')
    plt.title('Dinucleotide Frequency Heatmap', fontweight='bold')
    plt.show()

### Microsatellite Discovery: SSRs
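
A regex-based SSR scan of the kind listed in the efficiency table can be sketched with a back-reference: capture a short motif, then require it to repeat immediately. The example below is a simplified illustration rather than the engine's pattern. The function name `find_ssrs`, the single `min_repeats` threshold applied to every motif length, and the category labels are assumptions; the `motif`, `type`, and `length` fields match the columns displayed in the next cell.

```python
import re

# Assumed category labels, keyed by motif length.
SSR_TYPES = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def find_ssrs(seq, min_repeats=5):
    """Find tandem repeats of 1-6 bp motifs with at least `min_repeats` copies."""
    # The lazy {1,6}? tries the shortest motif first, so "AAAAAA" is reported
    # as a mononucleotide repeat rather than as repeats of "AA" or "AAA".
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append({
            "start": match.start(),
            "motif": motif,
            "type": SSR_TYPES[len(motif)],
            "length": len(match.group(0)),
        })
    return hits


# Made-up sequence: five copies of AC and a run of seven A bases.
toy = "TTG" + "AC" * 5 + "GG" + "A" * 7 + "T"
for hit in find_ssrs(toy):
    print(hit)
```

Real pipelines usually apply a different minimum repeat count per motif length (longer motifs need fewer copies to be interesting), which this sketch leaves out for brevity.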

In [None]:
ssr_stat = fetch_stat('simple_sequence_repeats')
if ssr_stat and 'ssrs' in ssr_stat:
    df_ssrs = pd.DataFrame(ssr_stat['ssrs'])
    display(HTML('<h4>Top 15 Longest Microsatellites Detected</h4>'))
    display(df_ssrs[['chromosome_name', 'motif', 'type', 'length']].sort_values('length', ascending=False).head(15))
    
    plt.figure(figsize=(8, 4))
    df_ssrs['type'].value_counts().plot(kind='barh', color='teal')
    plt.title('SSR Motif Category Distribution')
    plt.show()

## 4. Gene Architecture (Chapter 3)
Exploring density, overlap, and regulatory association.
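
The density profile plotted below is a per-bin gene count. A minimal sketch of linear binning follows, assuming each gene is assigned to the 1 Mb bin that contains its start coordinate and that coordinates are 0-based; the function name `gene_density` and both assumptions are illustrative and may differ from how the engine assigns genes that span a bin boundary.

```python
def gene_density(gene_starts, chrom_length, bin_size=1_000_000):
    """Count genes per fixed-width bin in a single pass over the gene starts."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    bins = [0] * n_bins
    for start in gene_starts:
        # Clamp to the last bin in case a coordinate equals the chromosome length.
        bins[min(start // bin_size, n_bins - 1)] += 1
    return bins


# Toy coordinates on a hypothetical 3 Mb chromosome.
print(gene_density([10, 999_999, 1_000_000, 2_500_000], 3_000_000))  # → [2, 1, 1]
```

Each gene is touched once, which is where the $O(G_{genes})$ cost in the efficiency table comes from.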

In [None]:
# 4A. Gene Density Area Plot
dens = fetch_stat('gene_density_1mb')
if dens:
    for chrom, bins in dens['data'].items():
        plt.figure(figsize=(16, 4))
        plt.fill_between(range(len(bins)), bins, color='forestgreen', alpha=0.3)
        plt.plot(bins, color='forestgreen', lw=2)
        plt.title(f'Gene Density: {chrom} (1Mb Bins)', fontweight='bold')
        plt.ylabel('Genes / Mb')
        plt.show()

# 4B. Spatial Association: CpG Islands
cpg = fetch_stat('cpg_island_gene_association')
if cpg and cpg.get('total_islands', 0) > 0:
    plt.figure(figsize=(7, 7))
    plt.pie([cpg['associated_with_genes'], cpg['non_associated']], 
            labels=['Gene-Proximal', 'Intergenic'], autopct='%1.1f%%', colors=['#ffcc99','#99ff99'])
    plt.title('CpG Island Proximity to Gene Bodies')
    plt.show()

### Hypotheses & Topology
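
Nested gene discovery in $O(G \log G)$ can be sketched as a sort followed by a single sweep. After sorting by start (and by descending end on ties), every earlier gene starts at or before the current one, so the current gene is nested exactly when the furthest end seen so far reaches its own end. The sketch below is an illustration under assumptions: the function name `find_nested` is hypothetical, strand is ignored, and only one host is reported per nested gene, whereas the stored `nested_pairs` may be defined differently.

```python
def find_nested(genes):
    """Return (inner_id, host_id) pairs for genes fully contained in another.

    genes: list of (gene_id, start, end) tuples on a single chromosome.
    """
    # Sort by start ascending, then end descending, so hosts precede their guests.
    ordered = sorted(genes, key=lambda g: (g[1], -g[2]))
    pairs = []
    host = None  # the gene with the furthest end seen so far
    for gene in ordered:
        if host is not None and gene[2] <= host[2]:
            pairs.append((gene[0], host[0]))
        else:
            host = gene
    return pairs


# Toy coordinates: g2 sits inside g1, and g4 sits inside g3.
toy_genes = [
    ("g1", 100, 1000),
    ("g2", 200, 300),
    ("g3", 900, 1200),
    ("g4", 950, 1100),
]
print(find_nested(toy_genes))  # → [('g2', 'g1'), ('g4', 'g3')]
```

Enumerating every host for every nested gene can produce a quadratic number of pairs in the worst case, so a sweep like this one trades completeness for the $O(G \log G)$ bound.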

In [None]:
fig, ax1 = plt.subplots(1, 1, figsize=(10, 7))

# 4C. Density vs. Length Correlation
len_corr = fetch_stat('gene_density_length_correlation')
if len_corr and 'chromosome_data' in len_corr:
    df_lc = pd.DataFrame(len_corr['chromosome_data'])
    sns.regplot(data=df_lc, x='density', y='average_gene_length', ax=ax1, color='orange')
    ax1.set_title(f"Density vs. Avg Length (R={len_corr['correlation_coefficient']:.2f})")
plt.show()

# 4D. Nested Genes Summary
nested = fetch_stat('nested_genes_statistics')
if nested:
    df_n = pd.DataFrame(nested['nested_pairs'])
    display(HTML(f'<h4>Identified {len(df_n)} Nested Gene Pairs</h4>'))
    display(df_n.head(15))

# 4E. UTR Structure Metrics
utr = fetch_stat('utr_transcript_correlation')
if utr:
    display(HTML('<h4>UTR Analysis Summary</h4>'))
    display(pd.Series(utr))