# Exploratory Data Analysis of Biological Sequences

## Introduction

This notebook performs exploratory data analysis (EDA) on the biological sequences, examining their length distribution and nucleotide composition before they are fed into the model.

## Contents

- Loading the data
- Sequence length distribution
- Nucleotide frequency analysis
- Visualization of sample sequences


## Code Snippets

import matplotlib.pyplot as plt
from Bio import SeqIO
import seaborn as sns

# Load sequences
sequences = [str(record.seq) for record in SeqIO.parse('data/sample_sequences.fasta', 'fasta')]

# Sequence lengths
seq_lengths = [len(seq) for seq in sequences]

# Plot sequence length distribution
plt.hist(seq_lengths, bins=10)
plt.title('Sequence Length Distribution')
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.show()

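# Nucleotide frequency, counted across all sequences
from collections import Counter
nucleotide_counts = Counter()
for seq in sequences:
    nucleotide_counts.update(seq)  # count per sequence; avoids building one large joined string

# Plot nucleotide distribution with the symbols in sorted order, so the bars are stable between runs
nucleotides = sorted(nucleotide_counts)
sns.barplot(x=nucleotides, y=[nucleotide_counts[n] for n in nucleotides])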
plt.title('Nucleotide Frequency')
plt.xlabel('Nucleotide')
plt.ylabel('Count')
plt.show()
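
The contents list mentions visualizing sample sequences, but the cell above stops at the aggregate plots. One possible way to inspect individual sequences is sketched below, using only the standard library. The helper names `preview_sequences` and `nucleotide_fractions`, the truncation width, and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
from collections import Counter


def preview_sequences(sequences, n=3, width=60):
    """Return display lines for the first n sequences, truncated to width characters."""
    lines = []
    for i, seq in enumerate(sequences[:n]):
        shown = seq[:width] + ('...' if len(seq) > width else '')
        lines.append(f'seq {i} (len={len(seq)}): {shown}')
    return lines


def nucleotide_fractions(seq):
    """Return each symbol's share of a single sequence, keyed in sorted order."""
    counts = Counter(seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: count / total for base, count in sorted(counts.items())}


# Hypothetical toy data standing in for the `sequences` list loaded from the FASTA file
toy_sequences = ['ACGTACGT', 'AAAACCGG', 'GGGTTT']

for line in preview_sequences(toy_sequences):
    print(line)
print(nucleotide_fractions(toy_sequences[1]))
```

In the notebook itself, passing the `sequences` list in place of `toy_sequences` would give a quick text preview of the first few records alongside their per-sequence composition.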