# DNA Base Sequence Content Analysis

This notebook demonstrates how to analyze and visualize the per-position distribution of nucleotide bases in DNA sequence data using Polars, the `polars_bio` quality-control helpers, and matplotlib.

## 0. Setup Environment

Before starting the analysis, remove any temporary files left behind by DataFusion/Apache Arrow in previous runs. New temporary catalogs are created each time queries run, so they accumulate in the `tmp` folder of the current working directory.

In [None]:
import os
import shutil

def cleanTMP():
    """Delete the 'tmp' folder in the current working directory, if present."""
    tmp_path = os.path.join(os.getcwd(), 'tmp')
    if os.path.exists(tmp_path):
        print(f"Removing temporary folder: {tmp_path}")
        shutil.rmtree(tmp_path, ignore_errors=True)
        print("Temporary folder removed.")
    else:
        print("Temporary folder does not exist - nothing to clean up.")

cleanTMP()
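
The cleanup step above always targets the current working directory. As a minimal sketch, assuming you want to point the cleanup at another location or check whether anything was removed, a parameterized variant could look like this. The name `clean_tmp` and its arguments are hypothetical and not part of `polars_bio`; the demonstration runs in a throwaway directory so no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path


def clean_tmp(base_dir=None, name="tmp"):
    """Remove <base_dir>/<name> if it exists.

    Returns True when a folder was removed, False when there was nothing to do.
    (Hypothetical helper for illustration only.)
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    tmp_path = base / name
    if not tmp_path.is_dir():
        return False
    shutil.rmtree(tmp_path, ignore_errors=True)
    return not tmp_path.exists()


# Demonstrate on a throwaway directory instead of the real working directory
with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "tmp" / "catalog").mkdir(parents=True)
    first = clean_tmp(workdir)   # folder exists, so it is removed
    second = clean_tmp(workdir)  # nothing left to remove

print(first, second)
```

Returning a boolean instead of printing makes the helper easier to reuse in scripts where the messages are not wanted.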

## 1. Create Sample DNA Sequence Data

Let's import the necessary libraries for our analysis.

In [None]:
import polars as pl
import matplotlib.pyplot as plt
from polars_bio.quality_control_op import base_sequence_content
from polars_bio.quality_control_viz import plot_base_content

# Set matplotlib style for better visualizations
plt.style.use('ggplot')
%matplotlib inline

Now we'll create a simple example dataset with DNA sequences.

In [None]:
# Simple example with short sequences
short_sequences = pl.DataFrame({
    "sequence": ["ATGC", "AAGC", "ATTC", "GTCC"]
})

## 2. Analyze Base Sequence Content

Now we'll use the `base_sequence_content` function to analyze the distribution of bases at each position.

In [None]:
# Calculate base content at each position
result = base_sequence_content(short_sequences)
print(result)
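
To make the idea of per-position base content concrete, here is a minimal pure-Python sketch that tallies bases at each position for the same four sequences. It only illustrates the concept: it is not the `polars_bio` implementation, and the helper name `count_bases_by_position` and its output layout are assumptions that may differ from what `base_sequence_content` returns.

```python
from collections import Counter


def count_bases_by_position(sequences):
    """Return a list with one Counter per position, tallying the bases seen there."""
    max_len = max((len(s) for s in sequences), default=0)
    counts = [Counter() for _ in range(max_len)]
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            counts[pos][base] += 1
    return counts


example = ["ATGC", "AAGC", "ATTC", "GTCC"]
position_counts = count_bases_by_position(example)

# Print the tally for every position, including bases that never occur there
for pos, counter in enumerate(position_counts):
    print(pos, {base: counter.get(base, 0) for base in "ACGTN"})
```

For these sequences, position 0 holds A three times and G once, while the last position is C in all four reads.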

## 3. Visualize Base Distribution

Let's visualize the distribution of bases at each position using the `plot_base_content` helper imported earlier.

In [None]:
# Plot the base content distribution for our simple example
plot_base_content(result, figsize=(10, 6))

## 4. Creating More Realistic Data

Let's generate a more realistic dataset with longer sequences to better visualize base content distribution.

In [None]:
import random

# Function to generate a random DNA sequence
def generate_dna(length, n_freq=0.05):
    """Return a random DNA string; each position is 'N' with probability n_freq."""
    bases = ['A', 'C', 'G', 'T']
    sequence = []
    for _ in range(length):
        if random.random() < n_freq:
            sequence.append('N')  # Ambiguous base, added with a small frequency
        else:
            sequence.append(random.choice(bases))
    return ''.join(sequence)

# Generate 100 random sequences of length 100
random.seed(42)  # For reproducibility
num_sequences = 100
seq_length = 100

sequences = [generate_dna(seq_length) for _ in range(num_sequences)]
df_sequences = pl.DataFrame({"sequence": sequences})

# Look at the first few sequences
df_sequences.head()
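
Because each position becomes `N` independently with probability `n_freq`, the observed share of `N` in the generated data should be close to, but not exactly, 5%. A minimal sketch for checking the overall composition of a list of sequences follows; `base_fractions` is a hypothetical helper introduced here for illustration, and the two toy sequences are made up so the numbers are easy to verify by hand.

```python
from collections import Counter


def base_fractions(sequences, alphabet="ACGTN"):
    """Return the overall fraction of each base across all sequences."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return {base: 0.0 for base in alphabet}
    counts = Counter("".join(sequences))
    return {base: counts.get(base, 0) / total for base in alphabet}


# Toy input: 8 bases in total, 3 of them N
toy = ["ATGN", "NNCC"]
fractions = base_fractions(toy)
print(fractions)
```

In the notebook, the same helper could be applied to the `sequences` list to confirm that the `N` fraction is near the chosen `n_freq`.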

In [None]:
# Analyze base content on our larger dataset
result_large = base_sequence_content(df_sequences)
result_large.head()

In [None]:
# Plot the base content distribution for our larger dataset
plot_base_content(result_large, figsize=(12, 7), title='Base Distribution Across Sequence Positions')

## 5. Clean Up

Finally, remove the temporary files created during the analysis.

In [None]:
cleanTMP()