# Problem Set 1: DNA Sequence Analysis

**Author**: Collin McNeil
**Course**: BIN602 - Data Mining for Bioinformatics
**Date**: 12/11/2025

## Overview

This notebook analyzes DNA sequences from mouse chromosome 1 SNP data. It demonstrates:

1. **Codon frequency analysis** - counting non-overlapping 3-nucleotide codons in a specific sequence
2. **GC content calculation** - computing the percentage of G and C nucleotides across multiple sequences

The analysis uses the `dna_ops.py` module, which contains reusable functions for DNA sequence operations.

## Data Source

- **File**: `data/mouse/chr1.Build37.snp`
- **Format**: Space-separated nucleotide sequences (A, T, G, C)
- **Origin**: Mouse genome chromosome 1 SNP data (Build 37)
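
Because each line stores nucleotides separated by spaces, a sequence has to be collapsed into a contiguous string before it can be analyzed. The sketch below shows one way that cleaning step might work. It is not the actual `clean_dna_sequence` from `dna_ops.py`, whose source is not shown in this notebook; the name `sketch_clean_sequence` and the choice to uppercase the input and drop every character other than A, T, G, and C are assumptions made for illustration.

```python
def sketch_clean_sequence(raw: str) -> str:
    """Hypothetical cleaner: uppercase the input and keep only A, T, G, C.

    Assumption: spaces, newlines, and any other symbols are simply dropped.
    The real dna_ops.clean_dna_sequence may behave differently.
    """
    valid = {"A", "T", "G", "C"}
    return "".join(ch for ch in raw.upper() if ch in valid)


# A space-separated line collapses into one contiguous sequence
print(sketch_clean_sequence("a t g c n\n"))  # prints ATGC
```

If the real file contains ambiguity codes such as `N`, silently dropping them shifts the reading frame for codon counting, so it may be worth checking how `clean_dna_sequence` treats them.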

## Setup: Import Libraries and Load Data

In [3]:
# Import functions from dna_ops library
import sys
sys.path.append('../scripts')
from dna_ops import clean_dna_sequence, count_codons, calculate_gc_content

# Read the SNP file
with open('../data/mouse/chr1.Build37.snp', 'r') as f:
    lines = f.readlines()

print(f"Successfully loaded {len(lines)} lines from chr1.Build37.snp")

Successfully loaded 2002 lines from chr1.Build37.snp


## Analysis 1: Codon Counts for Line 3

Count the frequency of all non-overlapping 3-nucleotide codons in the third line of the SNP file.
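
To make "non-overlapping" concrete, the sketch below steps through a cleaned sequence three nucleotides at a time, starting at position 0, and tallies each triplet. It is an illustration only, not the `count_codons` implementation from `dna_ops.py`; the name `sketch_count_codons` and the decision to ignore a trailing partial codon of one or two nucleotides are assumptions.

```python
from collections import Counter


def sketch_count_codons(seq: str) -> dict:
    """Hypothetical codon counter over non-overlapping triplets.

    Assumption: reading starts at index 0 and any leftover 1-2
    nucleotides at the end are ignored.
    """
    usable = len(seq) - len(seq) % 3  # length trimmed to a multiple of 3
    return dict(Counter(seq[i:i + 3] for i in range(0, usable, 3)))


# 10 nucleotides give 3 complete codons; the final "A" is left over
print(sketch_count_codons("ATGATGCCCA"))  # prints {'ATG': 2, 'CCC': 1}
```

Under these assumptions the total codon count equals the cleaned sequence length divided by 3, rounded down.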

In [4]:
# Get the 3rd line (index 2)
line_3 = lines[2].rstrip('\n')

# Clean the sequence
cleaned_line_3 = clean_dna_sequence(line_3)

# Count codons
codon_counts = count_codons(cleaned_line_3)

# Display results
print("Codon Counts for Line 3:")
print("=" * 40)
print(f"Total codons: {sum(codon_counts.values())}") 
print(f"Unique codons: {len(codon_counts)}")
print("\nTop 10 most frequent codons:")
for codon, count in sorted(codon_counts.items(), key=lambda x: -x[1])[:10]:
    print(f"  {codon}: {count}")

Codon Counts for Line 3:
Total codons: 581
Unique codons: 43

Top 10 most frequent codons:
  AAA: 100
  GGG: 67
  GAA: 49
  AGG: 48
  AAG: 40
  GGA: 39
  AGA: 28
  GAG: 22
  CAA: 19
  AAC: 18


## Analysis 2: GC Content for First 10 Lines

Calculate the GC content (percentage of G and C nucleotides) for each of the first 10 lines in the SNP file.
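
GC content is the number of G and C nucleotides divided by the sequence length, expressed as a percentage. The sketch below shows that arithmetic on a cleaned, uppercase sequence. It is not the `calculate_gc_content` function from `dna_ops.py`; the name `sketch_gc_content` and the choice to return 0.0 for an empty sequence are assumptions.

```python
def sketch_gc_content(seq: str) -> float:
    """Hypothetical GC-content calculation, returned as a percentage.

    Assumptions: seq is already cleaned and uppercase, and an empty
    sequence yields 0.0 rather than raising an error.
    """
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


# 2 of the 4 nucleotides are G or C
print(f"{sketch_gc_content('ATGC'):.2f}%")  # prints 50.00%
```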

In [5]:
# Calculate GC content for first 10 lines
print("GC Content for First 10 Lines:")
print("=" * 40)
print(f"{'Line':<6} {'GC Content (%)':<15}")
print("-" * 40)

gc_values = []
for i in range(10):
    # Clean the sequence
    cleaned = clean_dna_sequence(lines[i].rstrip('\n'))

    # Calculate GC content
    gc_content = calculate_gc_content(cleaned)
    gc_values.append(gc_content)

    # Display
    print(f"{i+1:<6} {gc_content:>6.2f}%")

# Summary statistics
print("\n" + "=" * 40)
print(f"Average GC content: {sum(gc_values)/len(gc_values):.2f}%")
print(f"Min GC content: {min(gc_values):.2f}%")
print(f"Max GC content: {max(gc_values):.2f}%")

GC Content for First 10 Lines:
Line   GC Content (%) 
----------------------------------------
1       50.86%
2       48.39%
3       48.85%
4       50.52%
5       48.51%
6       50.46%
7       48.62%
8       48.62%
9       50.11%
10      47.88%

Average GC content: 49.28%
Min GC content: 47.88%
Max GC content: 50.86%
