# NonBDNA Finder - Standalone Jupyter Notebook

This notebook provides a user-friendly interface for detecting non-B DNA structures in genomic sequences.

## Features:
- Upload FASTA files directly in the notebook
- Detect 11 classes of non-B DNA motifs
- Interactive results visualization
- Download results as a zipped package

## Detected Motif Classes:
1. **G-Quadruplex Family** - G-rich four-stranded structures
2. **i-Motif Family** - C-rich quadruplex structures
3. **Z-DNA** - Left-handed helical structures
4. **Triplex** - Three-stranded DNA structures
5. **Cruciform DNA** - Four-way junction structures
6. **Curved DNA** - Bent DNA structures
7. **Slipped DNA** - Repetitive sequence structures
8. **A-philic DNA** - Sequences with a high propensity to adopt the A-form helix
9. **R-loop** - RNA-DNA hybrid structures
10. **Hybrid** - Overlapping motif structures
11. **Non-B DNA cluster regions** - Hotspot regions

---

## Setup and Installation

First, install the required dependencies:

In [None]:
# Install required packages (run this cell first)
!pip install pandas numpy biopython matplotlib openpyxl plotly ipywidgets

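# Try to install hyperscan for optimized performance (optional).
# A failed `!pip install` does not raise a Python exception, so check the exit code instead.
import subprocess
import sys

result = subprocess.run([sys.executable, "-m", "pip", "install", "hyperscan"])
if result.returncode == 0:
    print("✅ Hyperscan installed for optimized performance")
else:
    print("⚠️ Hyperscan not available, using standard regex (still functional)")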

## Import Libraries and Initialize

In [None]:
import os
import io
import base64
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display, HTML, FileLink
import ipywidgets as widgets
from ipywidgets import FileUpload, Button, Output, VBox, HBox, HTML as HTMLWidget

# Import our core NonBDNA finder module
from nbdfinder_core import StandaloneNonBDNAFinder, analyze_fasta_file

print("✅ All libraries loaded successfully!")
print("📊 Ready to analyze non-B DNA structures")

## File Upload Interface

Upload your FASTA file using the widget below:

In [None]:
# Create file upload widget
file_upload = FileUpload(
    accept='.fasta,.fa,.fas,.fna,.txt',
    multiple=False,
    description="Upload FASTA",
    style={'description_width': 'initial'}
)

# Create output areas
upload_output = Output()
analysis_output = Output()
results_output = Output()

# Create analysis button
analyze_button = Button(
    description="🧬 Analyze Sequences",
    button_style='primary',
    disabled=True,
    layout=widgets.Layout(width='200px', height='40px')
)

# Global variables to store data
uploaded_content = None
results_df = None
results_zip = None

def on_upload_change(change):
    global uploaded_content
    
    with upload_output:
        upload_output.clear_output()
        
        if file_upload.value:
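            # Get the uploaded file. ipywidgets 7 exposes `value` as a dict keyed by
            # filename; ipywidgets 8 exposes a tuple of dicts with a `name` key.
            if isinstance(file_upload.value, dict):
                uploaded_file = list(file_upload.value.values())[0]
                filename = uploaded_file['metadata']['name']
            else:
                uploaded_file = file_upload.value[0]
                filename = uploaded_file['name']
            # `content` may be a memoryview; convert so that .decode() works
            content = bytes(uploaded_file['content'])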
            
            try:
                # Decode the content
                uploaded_content = content.decode('utf-8')
                
                # Validate FASTA format
                if not uploaded_content.strip().startswith('>'):
                    print("❌ Error: File does not appear to be in FASTA format")
                    print("   FASTA files should start with '>' followed by sequence identifier")
                    analyze_button.disabled = True
                    return
                
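                # Count sequences (header lines) and residues (non-header lines only)
                lines = uploaded_content.splitlines()
                seq_count = sum(1 for line in lines if line.startswith('>'))
                total_length = sum(len(line.strip()) for line in lines if not line.startswith('>'))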
                
                print(f"✅ Successfully uploaded: {filename}")
                print(f"📊 Contains {seq_count} sequence(s)")
                print(f"📏 Total sequence length: ~{total_length:,} characters")
                print(f"\n📝 Preview (first 200 characters):")
                print(uploaded_content[:200] + "..." if len(uploaded_content) > 200 else uploaded_content)
                
                analyze_button.disabled = False
                
            except UnicodeDecodeError:
                print("❌ Error: Could not decode file. Please ensure it's a text file.")
                analyze_button.disabled = True
            except Exception as e:
                print(f"❌ Error processing file: {str(e)}")
                analyze_button.disabled = True

def on_analyze_click(button):
    global results_df, results_zip
    
    with analysis_output:
        analysis_output.clear_output()
        
        if uploaded_content is None:
            print("❌ No file uploaded yet")
            return
        
        print("🔬 Starting NonBDNA analysis...")
        print("🧬 Scanning for motif patterns...")
        
        try:
            # Run the analysis
            results_df, results_zip = analyze_fasta_file(uploaded_content)
            
            if len(results_df) == 0:
                print("🔍 Analysis completed - no non-B DNA motifs detected")
                print("💡 This could mean:")
                print("   • The sequence doesn't contain detectable motifs")
                print("   • Motifs may be present but below detection thresholds")
                print("   • The sequence may need different analysis parameters")
                return
            
            print(f"✅ Analysis completed successfully!")
            print(f"🎯 Detected {len(results_df)} non-B DNA motifs")
            print(f"🏷️ Found {results_df['Class'].nunique()} different motif classes")
            
            # Show class distribution
            class_counts = results_df['Class'].value_counts()
            print(f"\n📊 Motif class distribution:")
            for class_name, count in class_counts.items():
                print(f"   • {class_name}: {count} motifs")
            
            print(f"\n💾 Results package ready for download!")
            
            # Trigger results display
            display_results()
            
        except Exception as e:
            print(f"❌ Error during analysis: {str(e)}")
            import traceback
            print(f"Debug info: {traceback.format_exc()}")

# Set up event handlers
file_upload.observe(on_upload_change, names='value')
analyze_button.on_click(on_analyze_click)

# Display the interface
display(VBox([
    HTMLWidget("<h3>📁 Step 1: Upload FASTA File</h3>"),
    file_upload,
    upload_output,
    HTMLWidget("<h3>🔬 Step 2: Run Analysis</h3>"),
    analyze_button,
    analysis_output
]))
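
The upload handler only checks that the file starts with `>`. A stricter check, sketched below in plain Python, parses the records and flags empty sequences or unexpected characters. The names `parse_fasta` and `find_problems` and the allowed alphabet are illustrative assumptions and are not part of `nbdfinder_core`.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Hypothetical helper for illustration; not part of nbdfinder_core.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("Sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def find_problems(records, alphabet="ACGTN"):
    """Return human-readable warnings for empty or non-DNA records."""
    problems = []
    allowed = set(alphabet)
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
        unexpected = sorted(set(seq) - allowed)
        if unexpected:
            problems.append(f"{header}: unexpected characters {''.join(unexpected)}")
    return problems


records = parse_fasta(">seq1\nACGT\nacgt\n>seq2\nGGXX\n")
print(records)
print(find_problems(records))
```

A check like this could run inside `on_upload_change` before the analysis button is enabled, so that malformed files are reported early.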

## Results Visualization and Download

Run the cell below before clicking the analysis button, since it defines `display_results()`. Results will appear here after analysis:

In [None]:
def display_results():
    """Display analysis results with visualizations"""
    global results_df, results_zip
    
    with results_output:
        results_output.clear_output()
        
        if results_df is None or len(results_df) == 0:
            print("No results to display yet. Please upload a file and run analysis.")
            return
        
        print("📊 ANALYSIS RESULTS")
        print("=" * 50)
        
        # Summary statistics
        print(f"Total motifs detected: {len(results_df)}")
        print(f"Unique sequences: {results_df['Sequence_Name'].nunique()}")
        print(f"Unique classes: {results_df['Class'].nunique()}")
        print(f"Mean score: {results_df['Normalized_Score'].mean():.3f}")
        print(f"Summed motif length: {results_df['Length'].sum():,} bp (overlapping motifs are counted more than once)")
        
        # Show first few results
        print(f"\n📋 First 10 detected motifs:")
        display_cols = ['Class', 'Subclass', 'Start', 'End', 'Length', 'Normalized_Score', 'Sequence_Name']
        display(results_df[display_cols].head(10))
        
        # Class distribution chart
        print("\n📊 Motif Class Distribution:")
        class_counts = results_df['Class'].value_counts()
        
        fig = px.bar(
            x=class_counts.index,
            y=class_counts.values,
            title="Non-B DNA Motif Classes Detected",
            labels={'x': 'Motif Class', 'y': 'Number of Motifs'},
            color=class_counts.values,
            color_continuous_scale='viridis'
        )
        fig.update_layout(
            xaxis_tickangle=-45,
            height=500,
            showlegend=False
        )
        fig.show()
        
        # Score distribution
        print("\n📈 Score Distribution:")
        fig2 = px.histogram(
            results_df,
            x='Normalized_Score',
            title="Distribution of Normalized Scores",
            nbins=20,
            color_discrete_sequence=['skyblue']
        )
        fig2.update_layout(height=400)
        fig2.show()
        
        # Length distribution by class
        print("\n📏 Motif Length Distribution by Class:")
        fig3 = px.box(
            results_df,
            x='Class',
            y='Length',
            title="Motif Length Distribution by Class",
            color='Class'
        )
        fig3.update_layout(
            xaxis_tickangle=-45,
            height=500,
            showlegend=False
        )
        fig3.show()
        
        # Create download link
        print("\n💾 DOWNLOAD RESULTS")
        print("=" * 30)
        
        # Save zip file temporarily
        zip_filename = "nbdfinder_results.zip"
        with open(zip_filename, 'wb') as f:
            f.write(results_zip)
        
        # Create download link
        download_link = f'<a href="{zip_filename}" download="{zip_filename}">📥 Download Results Package (ZIP)</a>'
        display(HTML(f"""
        <div style="background-color: #e8f5e8; padding: 15px; border-radius: 5px; margin: 10px 0;">
            <h4>🎉 Analysis Complete!</h4>
            <p>Your results package contains:</p>
            <ul>
                <li>📄 <strong>CSV file</strong> - Spreadsheet format for data analysis</li>
                <li>📊 <strong>Excel file</strong> - Multi-sheet workbook with class-specific data</li>
                <li>🧬 <strong>GFF3 file</strong> - Genome browser compatible format</li>
                <li>📋 <strong>Summary report</strong> - Human-readable analysis summary</li>
            </ul>
            <p style="font-size: 16px; margin-top: 15px;">
                {download_link}
            </p>
        </div>
        """))
        
        print(f"📁 Package contains {len(results_df)} motifs across {results_df['Class'].nunique()} classes")
        print(f"📊 Ready for further analysis in your favorite tools!")

# Display results area
display(results_output)
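
The download link above uses a relative `href`, which relies on the notebook front end resolving that path to a file on the server. This may not hold in every environment, for example some hosted notebooks. One alternative is to embed the ZIP bytes directly in the link as a base64 data URI. The helper name `make_download_link` is an assumption for illustration, and very large archives may exceed what a browser accepts in a data URI.

```python
import base64
import html


def make_download_link(data, filename, label="Download Results Package (ZIP)"):
    """Return an HTML anchor that embeds `data` (bytes) as a base64 data URI.

    Hypothetical helper for illustration; not part of nbdfinder_core.
    """
    payload = base64.b64encode(data).decode("ascii")
    safe_name = html.escape(filename, quote=True)
    return (
        f'<a href="data:application/zip;base64,{payload}" '
        f'download="{safe_name}">{html.escape(label)}</a>'
    )


# 22 bytes forming an empty ZIP archive, used here as a stand-in for results_zip
empty_zip = b"PK\x05\x06" + b"\x00" * 18
link = make_download_link(empty_zip, "nbdfinder_results.zip")
print(link[:60] + "...")
```

In the notebook, `display(HTML(make_download_link(results_zip, zip_filename)))` would render the link without writing a temporary file.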

## Example Usage

If you don't have your own FASTA file, you can test with this example sequence:

In [None]:
# Example: Create a test FASTA file with various non-B DNA structures
example_fasta = """>example_sequence_with_multiple_motifs
GGGTTTTGGGTTTTGGGTTTTGGGAAACCCAAACCCAAACCCAAACCCATATATATATATATGCGCGCGCGCGCGC
AAAAAAAAATGCGTAAAAAAAAATGCGTAAAAAAAAATGCGTCGCGCGCGCGCGCGCGCGGGGGGGGGGGGGGGGG
CCCCCCCCCCCCCCCCAGATCTCGAGCTCGAGCTCGAGCTCGAGCTCGAGTTTTTTTTTTTAAAAAAAAAAAGGGG
GGGGGGGTTTTGGGTTTTGGGTTTTGGGCCCCTTTCCCCTTTCCCCTTTCCCATCGATCGATCGATCGATCGATCG
>another_test_sequence
AAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTCGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCC
ATATATATATATATATTTTAAAAAAAAGGGGTTTTGGGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGG
"""

# Save example to file
with open("example_sequences.fasta", "w") as f:
    f.write(example_fasta)

print("📁 Created example FASTA file: example_sequences.fasta")
print("💡 You can download this file and upload it using the interface above")
print("🧬 This example contains various non-B DNA structures for testing")

# Or run directly on the example
example_button = Button(
    description="🧪 Analyze Example",
    button_style='info',
    layout=widgets.Layout(width='200px', height='40px')
)

def analyze_example(button):
    global uploaded_content, results_df, results_zip
    uploaded_content = example_fasta
    
    with analysis_output:
        analysis_output.clear_output()
        print("🧪 Running analysis on example sequences...")
        
        try:
            results_df, results_zip = analyze_fasta_file(example_fasta)
            print(f"✅ Example analysis completed!")
            print(f"🎯 Detected {len(results_df)} non-B DNA motifs")
            display_results()
        except Exception as e:
            print(f"❌ Error: {str(e)}")

example_button.on_click(analyze_example)
display(example_button)

## Advanced Analysis (Optional)

For users who want to explore the results further:

In [None]:
# Advanced analysis functions
def create_motif_position_map(df):
    """Create a position map of motifs along sequences"""
    if df is None or len(df) == 0:
        print("No data available for position mapping")
        return
    
    fig = px.scatter(
        df,
        x='Start',
        y='Sequence_Name',
        color='Class',
        size='Length',
        hover_data=['Subclass', 'Normalized_Score'],
        title="Motif Position Map",
        labels={'Start': 'Position (bp)', 'Sequence_Name': 'Sequence'}
    )
    fig.update_layout(height=max(400, len(df['Sequence_Name'].unique()) * 50))
    fig.show()

def analyze_motif_clustering(df, window_size=1000):
    """Analyze clustering of motifs in windows"""
    if df is None or len(df) == 0:
        print("No data available for clustering analysis")
        return
    
    clustering_data = []
    
    for seq_name in df['Sequence_Name'].unique():
        seq_df = df[df['Sequence_Name'] == seq_name]
        
        if len(seq_df) == 0:
            continue
            
        max_pos = seq_df['End'].max()
        
        for window_start in range(0, max_pos, window_size):
            window_end = window_start + window_size
            window_motifs = seq_df[
                (seq_df['Start'] >= window_start) & 
                (seq_df['Start'] < window_end)
            ]
            
            clustering_data.append({
                'Sequence': seq_name,
                'Window_Start': window_start,
                'Window_End': window_end,
                'Motif_Count': len(window_motifs),
                'Unique_Classes': window_motifs['Class'].nunique()
            })
    
    clustering_df = pd.DataFrame(clustering_data)
    
    if len(clustering_df) > 0:
        fig = px.bar(
            clustering_df,
            x='Window_Start',
            y='Motif_Count',
            color='Unique_Classes',
            facet_col='Sequence',
            title=f"Motif Clustering Analysis (Window size: {window_size} bp)",
            labels={'Window_Start': 'Position (bp)', 'Motif_Count': 'Number of Motifs'}
        )
        fig.show()
        
        return clustering_df
    else:
        print("No clustering data available")

# Interactive widgets for advanced analysis
advanced_button = Button(
    description="🔬 Position Map",
    button_style='success',
    layout=widgets.Layout(width='150px')
)

clustering_button = Button(
    description="📊 Clustering",
    button_style='success',
    layout=widgets.Layout(width='150px')
)

def show_position_map(button):
    create_motif_position_map(results_df)

def show_clustering(button):
    clustering_df = analyze_motif_clustering(results_df)

advanced_button.on_click(show_position_map)
clustering_button.on_click(show_clustering)

print("🔬 Advanced Analysis Tools:")
display(HBox([advanced_button, clustering_button]))
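
The summary statistics sum the `Length` column, which counts a base more than once when motifs overlap. A minimal sketch of computing merged coverage by sorting and merging intervals is shown below. The function name `merged_coverage` is hypothetical, and treating `Start` and `End` as inclusive coordinates is an assumption about the tool's output.

```python
def merged_coverage(intervals):
    """Return the number of positions covered by at least one interval.

    `intervals` is an iterable of (start, end) pairs with inclusive ends.
    Whether the tool's Start/End columns are inclusive is an assumption.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start a new one
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping: extend the running interval
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


motifs = [(1, 10), (5, 20), (30, 35)]
print("Summed lengths:", sum(end - start + 1 for start, end in motifs))
print("Merged coverage:", merged_coverage(motifs))
```

With the results table, passing `zip(seq_df['Start'], seq_df['End'])` for each sequence's rows should give a per-sequence coverage figure.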

## Summary and Next Steps

This notebook provides a complete standalone solution for non-B DNA motif detection. Here's what you can do with your results:

### 📊 **Data Analysis:**
- Import CSV files into R, Python, or Excel for statistical analysis
- Use the Excel file for quick visualization and filtering
- Analyze motif distributions and clustering patterns
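
As a starting point for analysis in Python, the sketch below reads the CSV straight out of the results archive using only the standard library. The CSV's file name inside the package is not stated here, so the code looks for the first member ending in `.csv`. The function name `read_csv_from_zip` and the stand-in file `motifs.csv` are assumptions for illustration.

```python
import csv
import io
import zipfile
from collections import Counter


def read_csv_from_zip(zip_bytes):
    """Return rows (as dicts) from the first .csv member of a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise FileNotFoundError("No CSV file found in the archive")
        with archive.open(csv_names[0]) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny stand-in archive so the sketch runs without real results
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "motifs.csv",
        "Class,Start,End\nZ-DNA,10,25\nTriplex,40,70\nZ-DNA,90,110\n",
    )

rows = read_csv_from_zip(buffer.getvalue())
print(Counter(row["Class"] for row in rows))
```

If pandas is preferred, the same `handle` could be passed to `pd.read_csv` instead of `csv.DictReader`.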

### 🧬 **Genomic Visualization:**
- Load GFF3 files into genome browsers (UCSC, IGV, JBrowse)
- Overlay with other genomic features
- Create publication-ready visualizations
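
Before loading a GFF3 file into a browser, it can help to inspect it programmatically. GFF3 feature lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The reader below is a minimal sketch of that layout. The function name `read_gff3_features` is hypothetical, and the sample line is made up, since the exact feature types and attribute keys this tool writes are not shown here.

```python
def read_gff3_features(text):
    """Parse GFF3 text into a list of dicts, one per feature line."""
    columns = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"Expected 9 tab-separated columns, got {len(fields)}")
        feature = dict(zip(columns, fields))
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        # Attributes are semicolon-separated key=value pairs
        feature["attributes"] = dict(
            pair.split("=", 1)
            for pair in feature["attributes"].split(";")
            if "=" in pair
        )
        features.append(feature)
    return features


# Made-up sample; real output from the tool may use other types and attributes
sample = "##gff-version 3\nseq1\texample\tZ-DNA\t10\t25\t0.8\t+\t.\tID=m1;Note=demo\n"
for feature in read_gff3_features(sample):
    print(feature["seqid"], feature["type"], feature["start"], feature["end"])
```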

### 🔬 **Further Research:**
- Correlate motif locations with gene expression data
- Study evolutionary conservation patterns
- Investigate functional relationships between motif classes

### 💡 **Tips for Best Results:**
- Use high-quality sequence data with few ambiguous bases (N)
- Consider running analysis on both strands
- Validate interesting findings with experimental methods
- Check motif predictions against known structural databases
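
Two of these tips can be checked with a few lines of Python: the share of `N` bases in a sequence, and the reverse complement needed to scan the opposite strand. The helper names below are hypothetical. Whether `analyze_fasta_file` already scans both strands is not stated here, so treat the second helper as optional.

```python
def n_fraction(sequence):
    """Fraction of bases in `sequence` that are N (case-insensitive)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Translation table covering A, C, G, T, and N in both cases
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return sequence.translate(_COMPLEMENT)[::-1]


seq = "GGGTTNNACC"
print(f"N fraction: {n_fraction(seq):.2f}")
print("Reverse complement:", reverse_complement(seq))
```

A reverse-complemented record could be appended to the FASTA text under a new header and analyzed like any other sequence. Note that coordinates reported for it would then refer to the reversed strand.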

---

**Need help?** Check the documentation or contact the development team.

**Citation:** If you use this tool in your research, please cite the original NonBDNAFinder publication.

---

*Happy motif hunting! 🧬*