# 🧬 NonBScanner - Local Usage Notebook

**Comprehensive notebook for running NonBScanner locally**

---

## 📋 Table of Contents
1. [Installation](#installation)
2. [Quick Start](#quickstart)
3. [Analyze Single Sequence](#single-sequence)
4. [Analyze FASTA File](#fasta-file)
5. [Visualize Results](#visualization)
6. [Export Results](#export)

---

## 🔧 Installation <a id="installation"></a>

### Prerequisites
Make sure you have Python 3.8+ installed.

In [None]:
# Install required dependencies (openpyxl is only needed for the Excel export cell)
# Uncomment the line below if packages are not installed
# !pip install numpy pandas matplotlib seaborn plotly biopython scipy scikit-learn openpyxl
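
### Check Your Environment

Before importing the scanner, it can help to confirm the Python version and see which dependencies are importable. The sketch below is only an illustration. The helper name `missing_packages` is made up for this notebook, and `Bio` and `sklearn` are the import names of biopython and scikit-learn.

```python
import importlib.util
import sys

# Import names of the packages used in this notebook
# (biopython is imported as "Bio", scikit-learn as "sklearn")
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "plotly",
            "Bio", "scipy", "sklearn", "openpyxl"]

def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

python_ok = sys.version_info >= (3, 8)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{'OK' if python_ok else 'older than 3.8'}")

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Anything listed as missing can be installed with the commented `pip` line in the install cell.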

## 🚀 Quick Start <a id="quickstart"></a>

### Import NonBScanner

In [None]:
# Import the scanner module
from scanner import analyze_sequence, analyze_multiple_sequences, export_results_to_dataframe
from scanner import get_motif_classification_info
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ NonBScanner loaded successfully!")

### Check Available Motif Classes

In [None]:
# Get information about motif classification
info = get_motif_classification_info()
print(f"NonBScanner Version: {info['version']}")
print(f"Total Classes: {info['total_classes']}")
print(f"Total Subclasses: {info['total_subclasses']}")
print("\nMotif Classes:")
print("="*60)
for class_num, class_info in info['classification'].items():
    print(f"{class_num:2}. {class_info['name']:25} ({len(class_info['subclasses'])} subclasses)")
    for subclass in class_info['subclasses'][:3]:  # Show first 3 subclasses
        print(f"    - {subclass}")
    if len(class_info['subclasses']) > 3:
        print(f"    ... and {len(class_info['subclasses']) - 3} more")

## 🔬 Analyze Single Sequence <a id="single-sequence"></a>

### Example: Detect Non-B DNA Motifs

In [None]:
# Example sequence with multiple motif types
example_sequence = "GGGTTAGGGTTAGGGTTAGGGAAAAATTTTAAAAATTTTCGCGCGCGCGCGCACACACACACACACA"

# Analyze the sequence
print(f"Analyzing sequence ({len(example_sequence)} bp)...")
motifs = analyze_sequence(example_sequence, "example_seq")

print(f"\nFound {len(motifs)} motifs:")
print("="*80)

# Display results
for motif in motifs:
    print(f"Class: {motif['Class']:20} Subclass: {motif['Subclass']:25}")
    print(f"  Position: {motif['Start']:4}-{motif['End']:4}   Length: {motif['Length']:3} bp   Score: {motif['Score']:.3f}")
    print(f"  Sequence: {motif['Sequence'][:50]}..." if len(motif['Sequence']) > 50 else f"  Sequence: {motif['Sequence']}")
    print()

### Custom Sequence Analysis

In [None]:
# Enter your own sequence here
my_sequence = """ATCGATCGATCGGGGTTAGGGTTAGGGTTAGGGCCCCTAACCCCTAACCCCTAACCC
AAAAATTTTAAAAATTTTCGCGCGCGCGCGCACACACACACACACA"""

# Clean sequence (remove whitespace and newlines)
my_sequence = ''.join(my_sequence.split()).upper()

# Analyze
my_motifs = analyze_sequence(my_sequence, "my_sequence")

print(f"Analyzed {len(my_sequence)} bp sequence")
print(f"Found {len(my_motifs)} motifs")

# Count by class
class_counts = {}
for m in my_motifs:
    cls = m['Class']
    class_counts[cls] = class_counts.get(cls, 0) + 1

print("\nMotifs by class:")
for cls, count in sorted(class_counts.items()):
    print(f"  {cls:25} {count:3} motifs")
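
### Validate Input Sequences

Pasted sequences sometimes carry characters other than nucleotides (digits from a GenBank listing, stray punctuation). The cleaning step in the cell above removes whitespace only, so a small check like the one below may be useful. `clean_sequence` and its default alphabet `ACGTN` are assumptions for this sketch and are not part of the NonBScanner API. How the scanner treats `N` or other ambiguity codes is not covered here.

```python
def clean_sequence(raw, allowed="ACGTN"):
    """Strip whitespace, uppercase, and report characters outside `allowed`.

    Hypothetical helper for this notebook, not a NonBScanner function.
    """
    seq = "".join(raw.split()).upper()
    unexpected = sorted(set(seq) - set(allowed))
    return seq, unexpected

seq, unexpected = clean_sequence("atcg atcg\nGGGTTAGGGxx")
print(f"{len(seq)} bp, unexpected characters: {unexpected or 'none'}")
```

If `unexpected` is not empty, decide whether to remove those characters or fix the input before calling `analyze_sequence`.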

## 📄 Analyze FASTA File <a id="fasta-file"></a>

### Read and Analyze FASTA Sequences

In [None]:
def read_fasta(filename):
    """Read sequences from a FASTA file"""
    sequences = {}
    current_name = None
    current_seq = []
    
    with open(filename, 'r') as f:
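        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            if line.startswith('>'):
                # Save previous sequence
                if current_name is not None:
                    sequences[current_name] = ''.join(current_seq)
                # Start new sequence; the name is the first word after '>'
                fields = line[1:].split()
                # A bare '>' header has no name, so fall back to a generated one
                current_name = fields[0] if fields else f"unnamed_{len(sequences) + 1}"
                current_seq = []
            else:
                current_seq.append(line.upper())
        
        # Save last sequence
        if current_name is not None:
            sequences[current_name] = ''.join(current_seq)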
    
    return sequences

# Example: Analyze the example FASTA file
try:
    fasta_file = "example_all_motifs.fasta"
    sequences = read_fasta(fasta_file)
    
    print(f"Loaded {len(sequences)} sequence(s) from {fasta_file}")
    for name, seq in sequences.items():
        print(f"  {name}: {len(seq)} bp")
    
    # Analyze all sequences
    print("\nAnalyzing sequences...")
    all_results = analyze_multiple_sequences(sequences, use_multiprocessing=False)
    
    # Display summary
    print("\nAnalysis Summary:")
    print("="*60)
    for seq_name, motifs in all_results.items():
        print(f"{seq_name}: {len(motifs)} motifs detected")
        
except FileNotFoundError:
    print(f"File '{fasta_file}' not found. Please ensure the file exists in the current directory.")
    print("You can create your own FASTA file or use the example below.")
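
### Create a Test FASTA File

If `example_all_motifs.fasta` is not available, a small FASTA file can be written from sequences already used in this notebook. The sketch below is one way to do that. `write_fasta`, the 60-character line width, and the output file name are choices made for this example.

```python
import os
import tempfile

def write_fasta(records, path, width=60):
    """Write a {name: sequence} dict as FASTA, wrapping sequence lines at `width`."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            # Slice the sequence into fixed-width lines
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")

# Two of the sequences from the batch-processing example in this notebook
records = {
    "seq1": "GGGTTAGGGTTAGGGTTAGGG",
    "seq3": "CGCGCGCGCGCGCG",
}

# Written to the system temp directory so the working directory stays clean
demo_path = os.path.join(tempfile.gettempdir(), "nonbscanner_demo.fasta")
write_fasta(records, demo_path)
print(f"Wrote {len(records)} records to {demo_path}")
```

The resulting path can then be passed to `read_fasta` in place of `example_all_motifs.fasta`.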

### Analyze Your Own FASTA File

In [None]:
# Specify your FASTA file path
your_fasta_file = "your_sequences.fasta"  # Change this to your file

# Uncomment to run analysis on your file
# sequences = read_fasta(your_fasta_file)
# results = analyze_multiple_sequences(sequences, use_multiprocessing=True)
# print(f"Analyzed {len(results)} sequences")

## 📊 Visualize Results <a id="visualization"></a>

### Motif Distribution Plot

In [None]:
# Create a sample analysis for visualization
test_seq = "GGGTTAGGGTTAGGGTTAGGGAAAAATTTTAAAAATTTTCGCGCGCGCGCGCACACACACACACACACCCCTAACCCCTAACCCCTAACCC"
test_motifs = analyze_sequence(test_seq, "test")

if len(test_motifs) > 0:
    # Count motifs by class
    class_counts = {}
    for m in test_motifs:
        cls = m['Class']
        class_counts[cls] = class_counts.get(cls, 0) + 1
    
    # Create bar plot
    plt.figure(figsize=(12, 6))
    classes = list(class_counts.keys())
    counts = list(class_counts.values())
    
    plt.bar(classes, counts, color='steelblue', edgecolor='black')
    plt.xlabel('Motif Class', fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.title('Non-B DNA Motif Distribution', fontsize=14, fontweight='bold')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.grid(axis='y', alpha=0.3)
    plt.show()
else:
    print("No motifs detected for visualization")

### Motif Position Map

In [None]:
if len(test_motifs) > 0:
    # Create position plot
    plt.figure(figsize=(14, 6))
    
    # Get unique classes and assign colors
    unique_classes = sorted(set(m['Class'] for m in test_motifs))  # sorted so colors are stable across runs
    colors = plt.cm.Set3(range(len(unique_classes)))
    class_colors = dict(zip(unique_classes, colors))
    
    # Plot each motif as a horizontal bar
    for i, motif in enumerate(test_motifs):
        start = motif['Start']
        end = motif['End']
        cls = motif['Class']
        plt.barh(i, end - start, left=start, height=0.8, 
                color=class_colors[cls], edgecolor='black', linewidth=0.5,
                label=cls if cls not in [m['Class'] for m in test_motifs[:i]] else "")
    
    plt.xlabel('Position (bp)', fontsize=12)
    plt.ylabel('Motif Index', fontsize=12)
    plt.title('Non-B DNA Motif Positions', fontsize=14, fontweight='bold')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.grid(axis='x', alpha=0.3)
    plt.show()
else:
    print("No motifs detected for visualization")

### Score Distribution

In [None]:
if len(test_motifs) > 0:
    # Extract scores by class
    class_scores = {}
    for m in test_motifs:
        cls = m['Class']
        if cls not in class_scores:
            class_scores[cls] = []
        class_scores[cls].append(m.get('Score', 0))
    
    # Create violin plot
    plt.figure(figsize=(12, 6))
    
    # Prepare data for plotting
    plot_data = []
    plot_labels = []
    for cls, scores in class_scores.items():
        plot_data.append(scores)
        plot_labels.append(cls)
    
    plt.violinplot(plot_data, positions=range(len(plot_data)), showmeans=True)
    plt.xticks(range(len(plot_labels)), plot_labels, rotation=45, ha='right')
    plt.xlabel('Motif Class', fontsize=12)
    plt.ylabel('Score', fontsize=12)
    plt.title('Motif Score Distribution by Class', fontsize=14, fontweight='bold')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("No motifs detected for visualization")

## 💾 Export Results <a id="export"></a>

### Export to CSV

In [None]:
# Convert motifs to DataFrame
if len(test_motifs) > 0:
    df = export_results_to_dataframe(test_motifs)
    
    # Display first few rows
    print("Results DataFrame (first 5 rows):")
    print(df.head())
    
    # Save to CSV
    output_file = "nonbscanner_results.csv"
    df.to_csv(output_file, index=False)
    print(f"\n✓ Results saved to {output_file}")
else:
    print("No motifs to export")

### Export to Excel

In [None]:
# Export to Excel with multiple sheets (requires openpyxl)
if len(test_motifs) > 0:
    try:
        output_excel = "nonbscanner_results.xlsx"
        
        with pd.ExcelWriter(output_excel, engine='openpyxl') as writer:
            # All motifs
            df.to_excel(writer, sheet_name='All_Motifs', index=False)
            
            # Separate sheets by class
            for cls in df['Class'].unique():
                if cls != 'NA':
                    class_df = df[df['Class'] == cls]
                    sheet_name = cls.replace('/', '_').replace(' ', '_')[:31]  # Excel sheet name limit
                    class_df.to_excel(writer, sheet_name=sheet_name, index=False)
        
        print(f"✓ Results saved to {output_excel}")
    except ImportError:
        print("openpyxl not installed. Install with: pip install openpyxl")
else:
    print("No motifs to export")

### Export Summary Statistics

In [None]:
# Generate summary statistics
if len(test_motifs) > 0:
    summary = {
        'Total_Motifs': len(test_motifs),
        'Unique_Classes': len(df['Class'].unique()),
        'Unique_Subclasses': len(df['Subclass'].unique()),
        'Sequence_Length': len(test_seq),
        'Average_Motif_Length': df[df['Length'] != 'NA']['Length'].astype(float).mean(),
        'Average_Score': df[df['Score'] != 'NA']['Score'].astype(float).mean()
    }
    
    summary_df = pd.DataFrame([summary])
    print("Summary Statistics:")
    print(summary_df.T)
    
    # Save summary
    summary_df.to_csv("nonbscanner_summary.csv", index=False)
    print("\n✓ Summary saved to nonbscanner_summary.csv")
else:
    print("No motifs to summarize")
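
### Export to JSON

The notes at the end of this notebook list JSON as an export format, but no cell shows it. A minimal sketch using only the standard library follows. `motifs_to_json` is a hypothetical helper, not a NonBScanner function, and the sample record is a placeholder that reuses the motif keys printed earlier (`Class`, `Subclass`, `Start`, `End`, `Length`, `Score`, `Sequence`) with made-up values.

```python
import json

def motifs_to_json(motifs, path=None, indent=2):
    """Serialize a list of motif dicts to JSON text; also write to `path` if given."""
    # default=str turns values json cannot encode natively
    # (such as NumPy integers) into strings instead of raising
    text = json.dumps(motifs, indent=indent, default=str)
    if path is not None:
        with open(path, "w") as handle:
            handle.write(text)
    return text

# Placeholder record: the keys follow the motif dicts above, the values are made up
sample = [{"Class": "G-Quadruplex", "Subclass": "example", "Start": 1, "End": 21,
           "Length": 21, "Score": 0.5, "Sequence": "GGGTTAGGGTTAGGGTTAGGG"}]
print(motifs_to_json(sample))
```

In this notebook the call would look like `motifs_to_json(test_motifs, "nonbscanner_results.json")`. If the results are already in a DataFrame, `df.to_json("nonbscanner_results.json", orient="records", indent=2)` from pandas is an alternative.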

## 🎓 Advanced Usage

### Batch Processing Multiple Sequences

In [None]:
# Example: Process multiple sequences
sequences_dict = {
    'seq1': 'GGGTTAGGGTTAGGGTTAGGG',
    'seq2': 'AAAAATTTTAAAAATTTT',
    'seq3': 'CGCGCGCGCGCGCG',
    'seq4': 'CACACACACACACACA'
}

# Analyze all sequences
batch_results = analyze_multiple_sequences(sequences_dict, use_multiprocessing=False)

# Combine all results
all_motifs_list = []
for seq_name, motifs in batch_results.items():
    all_motifs_list.extend(motifs)

print(f"Processed {len(sequences_dict)} sequences")
print(f"Total motifs detected: {len(all_motifs_list)}")

# Export combined results
if len(all_motifs_list) > 0:
    combined_df = export_results_to_dataframe(all_motifs_list)
    combined_df.to_csv("batch_results.csv", index=False)
    print("✓ Batch results saved to batch_results.csv")
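
### Keep Track of Source Sequences

Flattening the batch results into one list drops the dictionary keys, so the combined table may no longer say which sequence a motif came from unless each record already stores that. The sketch below tags each record while combining. `combine_with_names` and the `Sequence_Name` key are hypothetical names chosen for this example. If the motif dicts already carry such a field, this step is unnecessary.

```python
def combine_with_names(results, name_key="Sequence_Name"):
    """Flatten {seq_name: [motif, ...]} into one list, tagging each motif with its source.

    `name_key` is a hypothetical field name; an existing value is left untouched.
    """
    combined = []
    for seq_name, motifs in results.items():
        for motif in motifs:
            tagged = dict(motif)  # copy so the original record is not modified
            tagged.setdefault(name_key, seq_name)
            combined.append(tagged)
    return combined

# Placeholder results shaped like the dict returned in the batch cell above
demo = {"seq1": [{"Class": "G-Quadruplex", "Start": 1, "End": 21}], "seq2": []}
print(combine_with_names(demo))
```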

### Filter Results by Class

In [None]:
# Filter motifs by specific class
if len(test_motifs) > 0:
    target_class = 'G-Quadruplex'  # Change this to filter by other classes
    
    filtered_motifs = [m for m in test_motifs if m['Class'] == target_class]
    
    print(f"Found {len(filtered_motifs)} {target_class} motifs:")
    for motif in filtered_motifs:
        print(f"  Position {motif['Start']}-{motif['End']}: {motif['Subclass']} (score: {motif['Score']:.3f})")
else:
    print("No motifs to filter")
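
### Filter by Score and Length

The same list-comprehension idea extends to several criteria at once. The sketch below is an illustration only. `filter_motifs` is a made-up helper, the sample records are placeholders rather than real scanner output, and the thresholds are arbitrary because the score scale depends on the motif class. Non-numeric values such as `'NA'` are treated as failing a numeric threshold.

```python
def _as_number(value):
    """Return value as float, or None when it is missing or non-numeric (e.g. 'NA')."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def filter_motifs(motifs, target_class=None, min_score=None, min_length=None):
    """Keep motifs that satisfy every criterion that is not None."""
    kept = []
    for m in motifs:
        if target_class is not None and m.get("Class") != target_class:
            continue
        score = _as_number(m.get("Score"))
        if min_score is not None and (score is None or score < min_score):
            continue
        length = _as_number(m.get("Length"))
        if min_length is not None and (length is None or length < min_length):
            continue
        kept.append(m)
    return kept

# Placeholder records with made-up values
sample = [
    {"Class": "G-Quadruplex", "Length": 21, "Score": 0.9},
    {"Class": "G-Quadruplex", "Length": 12, "Score": 0.4},
    {"Class": "Z-DNA", "Length": 14, "Score": "NA"},
]
strong_g4 = filter_motifs(sample, target_class="G-Quadruplex", min_score=0.5)
print(f"{len(strong_g4)} of {len(sample)} records kept")
```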

---

## 📚 Additional Resources

- **Documentation**: See README.md for detailed information
- **Web Interface**: Run `streamlit run app.py` for interactive analysis
- **GitHub**: https://github.com/VRYella/NonBScanner

---

## 📝 Notes

- NonBScanner detects 11 major classes with 22+ subclasses
- Results include comprehensive metadata for each motif
- Export formats: CSV, Excel, JSON
- Supports single sequences and batch processing

---

**Author**: Dr. Venkata Rajesh Yella  
**License**: MIT  
**Version**: 2024.1