# Novel Scenification Project - Tag Analysis Runner

This notebook provides an interactive interface for running the tag analysis pipeline, exploring results, and managing the workflow. 

## Project Overview

This repository supports research on how the use of scenes changed in the English novel circa 1800. The project combines computational tools with custom scene annotations to trace the development of narrative techniques.

### Key Components:
- **Input HTML Files**: Located in `data/input/`, these contain literary texts with custom HTML tags for scene types
- **Tag Analysis Script**: `count_tags.py` parses the HTML files and counts, for each tag, how often it occurs and how many words it encloses
- **Summary Excel File**: `data/tag_counts_summary.xlsx` contains aggregated analysis from all HTML files
- **GitHub Actions**: Automatically processes new files uploaded to `data/input/`
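
To make the idea of counting tag occurrences and enclosed words concrete, here is a minimal sketch using only the standard library's `html.parser`. It is not the code in `count_tags.py`. The class name `TagWordCounter`, the sample markup, and the choice to credit words to every enclosing tag are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagWordCounter(HTMLParser):
    """Count how often each tag opens and how many words it encloses."""

    def __init__(self):
        super().__init__()
        self.open_tags = []           # stack of currently open tags
        self.tag_counts = Counter()   # tag -> number of opening tags seen
        self.word_counts = Counter()  # tag -> words found inside that tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.tag_counts[tag] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, tolerating sloppy nesting.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        n_words = len(data.split())
        # Credit the words to every distinct tag that is currently open.
        for tag in set(self.open_tags):
            self.word_counts[tag] += n_words


# Hypothetical annotated snippet; real input files may use other tag names.
sample = "<sceneaction>He ran. <dia>Stop!</dia></sceneaction> <dia>Why?</dia>"
parser = TagWordCounter()
parser.feed(sample)
parser.close()

print(dict(parser.tag_counts))
print(dict(parser.word_counts))
```

One limitation of this sketch: void elements written without a closing slash (for example `<br>`) are never popped from the stack, so a real script would need to handle them separately.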

## 1. Setup Environment

First, import the required libraries and configure plotting. The cell assumes that `pandas`, `beautifulsoup4`, `matplotlib`, and `seaborn` are already installed; it does not install them:

In [None]:
# Import required libraries
import os
import pandas as pd
import glob
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import subprocess
from IPython.display import display, HTML, FileLink, Markdown

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = [12, 8]

# Check if we're in GitPod
in_gitpod = os.environ.get('GITPOD_WORKSPACE_ID') is not None
print(f"Running in GitPod: {in_gitpod}")

## 2. Checking Input Files

Let's examine the HTML files in the `data/input` directory:

In [None]:
# List HTML files in the input directory
input_files = sorted(glob.glob('data/input/*.html'))
file_info = []

for file_path in input_files:
    file_name = os.path.basename(file_path)
    file_size = os.path.getsize(file_path) / 1024  # Size in KB
    
    # Count lines in the file
    with open(file_path, 'r', encoding='utf-8') as f:
        line_count = sum(1 for _ in f)
    
    file_info.append({
        'File': file_name,
        'Size (KB)': round(file_size, 2),  # keep numeric so the column sorts correctly
        'Lines': line_count,
    })

# Display as DataFrame
input_df = pd.DataFrame(file_info)
display(input_df)

print(f"Found {len(input_files)} HTML files in data/input/")

## 3. Run Tag Analysis

Execute the `count_tags.py` script to process all HTML files in the `data/input` directory:

In [None]:
def run_tag_analysis():
    """Run the tag analysis script and capture its output"""
    import sys

    print("Running tag analysis...")
    # Use the notebook's own interpreter so the script sees the same packages
    result = subprocess.run([sys.executable, 'count_tags.py'],
                           capture_output=True, text=True)
    
    # Display the output
    print("\nOutput from count_tags.py:\n")
    print(result.stdout)
    
    if result.stderr:
        print("\nErrors/Warnings:\n")
        print(result.stderr)
    
    return result.returncode == 0  # True if success

# Run the analysis
success = run_tag_analysis()

## 4. Explore Results

After running the tag analysis, we can explore the results in the Excel summary file:

In [None]:
# Check if the Excel summary file exists
excel_path = 'data/tag_counts_summary.xlsx'
if os.path.exists(excel_path):
    # Show Excel file link for download
    display(HTML(f'<a href="{excel_path}" target="_blank">Download Excel Summary File</a>'))
    
    # Load and display summary sheet
    summary_df = pd.read_excel(excel_path, sheet_name='Summary')
    display(summary_df)
    
    # Display available sheets
    sheets = pd.ExcelFile(excel_path).sheet_names
    print(f"\nAvailable sheets in the Excel file: {sheets}")
else:
    print(f"Excel summary file not found at: {excel_path}")

## 5. Data Visualization

Let's create some visualizations from the summary data:

In [None]:
def plot_summary_data():
    """Create visualizations from the summary data"""
    excel_path = 'data/tag_counts_summary.xlsx'
    if not os.path.exists(excel_path):
        print(f"Excel summary file not found at: {excel_path}")
        return
    
    # Load summary data
    summary_df = pd.read_excel(excel_path, sheet_name='Summary')
    
    # Extract sheet names from the HYPERLINK formula
    def extract_sheet_name(link_text):
        if isinstance(link_text, str):
            parts = link_text.split('"')
            if len(parts) > 3:
                return parts[3]  # Display name: second quoted argument of HYPERLINK
        return link_text
    
    summary_df['TextName'] = summary_df['Sheet'].apply(extract_sheet_name)
    
    # Plot total words by text
    plt.figure(figsize=(12, 8))
    ax = sns.barplot(x='TextName', y='Total_Words', data=summary_df)
    plt.title('Total Word Count by Text')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Plot scene type distribution
    plt.figure(figsize=(12, 8))
    plot_data = summary_df.melt(id_vars=['TextName'], 
                             value_vars=['SceneAction_Count', 'SceneDia_Count', 'Dialogue_Count'],
                             var_name='Scene Type', value_name='Count')
    sns.barplot(x='TextName', y='Count', hue='Scene Type', data=plot_data)
    plt.title('Scene Type Distribution by Text')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Calculate percentages of scene types
    summary_df['SceneAction_Pct'] = summary_df['SceneAction_Words'] / summary_df['Total_Words'] * 100
    summary_df['SceneDia_Pct'] = summary_df['SceneDia_Words'] / summary_df['Total_Words'] * 100
    summary_df['Dialogue_Pct'] = summary_df['Dialogue_Words'] / summary_df['Total_Words'] * 100
    
    # Plot percentage distribution
    plt.figure(figsize=(12, 8))
    plot_data = summary_df.melt(id_vars=['TextName'], 
                             value_vars=['SceneAction_Pct', 'SceneDia_Pct', 'Dialogue_Pct'],
                             var_name='Scene Type', value_name='Percentage')
    sns.barplot(x='TextName', y='Percentage', hue='Scene Type', data=plot_data)
    plt.title('Scene Type Percentage Distribution by Text')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Percentage of Total Words')
    plt.tight_layout()
    plt.show()

# Create visualizations
plot_summary_data()
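
The `extract_sheet_name` helper above relies on the position of double quotes in the formula text. As a hedged alternative, the sketch below uses a regular expression and assumes the `Sheet` cells hold text of the form `=HYPERLINK("target", "name")`, which is what indexing the fourth quote-delimited piece implies. The function name `display_name` and the sample values are illustrative only.

```python
import re

# Assumed formula shape: =HYPERLINK("link target", "display name")
HYPERLINK_RE = re.compile(
    r'=HYPERLINK\(\s*"([^"]*)"\s*,\s*"([^"]*)"\s*\)', re.IGNORECASE
)


def display_name(cell):
    """Return the display name of a HYPERLINK formula, else the cell unchanged."""
    if isinstance(cell, str):
        match = HYPERLINK_RE.fullmatch(cell.strip())
        if match:
            return match.group(2)
    return cell


# Hypothetical cell values
print(display_name('=HYPERLINK("#\'Emma\'!A1", "Emma")'))
print(display_name("Plain text"))
print(display_name(None))
```

Cells that are not formulas (plain names, numbers, missing values) pass through unchanged, so the function can be applied to the whole column with `.apply`.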

## 6. Tag Frequency Analysis

Let's examine the frequency data from the "Summary Freq Tags" sheet:

In [None]:
def explore_tag_frequency():
    """Explore tag frequency data from the summary file"""
    excel_path = 'data/tag_counts_summary.xlsx'
    if not os.path.exists(excel_path):
        print(f"Excel summary file not found at: {excel_path}")
        return
    
    # Load frequency data
    try:
        freq_df = pd.read_excel(excel_path, sheet_name='Summary Freq Tags')
        
        # Get all column names
        all_columns = freq_df.columns.tolist()
        
        # Filter for tag count columns (ending with _Count)
        tag_columns = [col for col in all_columns if col.endswith('_Count') and col not in 
                       ['Total_Tags', 'Chapter_Count']]
        
        # Extract tag names from column names
        tag_names = [col.replace('_Count', '') for col in tag_columns]
        
        print(f"Found {len(tag_names)} tags in frequency data. Top 10 most frequent:")
        for i, tag in enumerate(tag_names[:10]):
            print(f"{i+1}. {tag}")
            
        # Create a visualization of the top 10 tags by frequency
        if tag_columns:
            # Sum the counts across all texts for each tag
            tag_totals = {tag: freq_df[f"{tag}_Count"].sum() for tag in tag_names}
            
            # Create DataFrame for plotting
            plot_df = pd.DataFrame(list(tag_totals.items()), columns=['Tag', 'Count'])
            plot_df = plot_df.sort_values('Count', ascending=False).head(10)
            
            plt.figure(figsize=(12, 8))
            sns.barplot(x='Tag', y='Count', data=plot_df)
            plt.title('Top 10 Most Frequent Tags')
            plt.xticks(rotation=45, ha='right')
            plt.tight_layout()
            plt.show()
    except Exception as e:
        print(f"Error exploring tag frequency: {e}")

# Explore tag frequency
explore_tag_frequency()

## 7. View Markdown Summary

Let's display the markdown summary that was generated by the script:

In [None]:
def display_markdown_summary():
    """Display the markdown summary file"""
    summary_path = 'data/SUMMARY.md'
    if os.path.exists(summary_path):
        with open(summary_path, 'r', encoding='utf-8') as f:
            content = f.read()
        display(Markdown(content))
    else:
        print(f"Markdown summary file not found at: {summary_path}")

# Display markdown summary
display_markdown_summary()

## 8. Commit and Push Changes

If you've made changes to the data or generated new analysis, you can commit and push the changes back to GitHub:

In [None]:
def git_commit_push():
    """Commit and push changes to GitHub"""
    # Check if there are changes to commit
    status = subprocess.run(['git', 'status', '--porcelain'], 
                           capture_output=True, text=True)
    
    if not status.stdout.strip():
        print("No changes to commit.")
        return
    
    print("Changes to be committed:")
    print(status.stdout)
    
    # Get input for commit message
    from IPython.display import display, HTML
    from ipywidgets import widgets
    
    message_input = widgets.Textarea(
        value='Update tag analysis results',
        placeholder='Enter commit message',
        description='Commit Message:',
        disabled=False,
        layout=widgets.Layout(width='80%', height='100px')
    )
    
    # Create commit button
    commit_button = widgets.Button(
        description='Commit & Push',
        button_style='success',
        tooltip='Commit and push changes to GitHub'
    )
    
    output = widgets.Output()
    
    def on_button_click(b):
        with output:
            commit_message = message_input.value.strip()
            if not commit_message:
                print("Commit message is empty; nothing was committed.")
                return

            # Run each git step in order and stop at the first failure
            steps = [
                ("Adding changes...", ['git', 'add', '.']),
                ("Committing changes...", ['git', 'commit', '-m', commit_message]),
                ("Pushing to GitHub...", ['git', 'push']),
            ]
            for label, command in steps:
                print(label)
                result = subprocess.run(command, capture_output=True, text=True)
                print(result.stdout)
                if result.stderr:
                    # git writes progress messages to stderr even on success
                    print("Messages (stderr):")
                    print(result.stderr)
                if result.returncode != 0:
                    print(f"'{' '.join(command)}' failed with exit code {result.returncode}; stopping.")
                    return
    
    commit_button.on_click(on_button_click)
    
    # Display the widgets
    display(message_input)
    display(commit_button)
    display(output)

# Show commit interface
git_commit_push()
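
The change check above only tests whether `git status --porcelain` printed anything. If you want to see which files changed and how, the short (v1) porcelain format can be split into a two-character status code and a path. The sketch below is a simplified illustration: the function name `parse_porcelain` and the sample output are assumptions, and it does not handle renames (`old -> new`) or quoted paths.

```python
def parse_porcelain(output):
    """Split `git status --porcelain` (v1) output into (status, path) pairs."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Columns 0-1 hold the status code, column 2 is a space, the rest is the path.
        status, path = line[:2], line[3:]
        entries.append((status, path))
    return entries


# Hypothetical output: one modified tracked file and one untracked file
sample = " M data/tag_counts_summary.xlsx\n?? data/input/new_text.html\n"
for status, path in parse_porcelain(sample):
    print(f"{status!r:6} {path}")
```

Note that the raw output must not be stripped before parsing, because a leading space is part of the status code.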

## 9. Tag Matching Tool

This project matches the tag patterns in `keep_for_summary_tags.tsv` against the tag column headers of the Excel summary. The cell below checks whether the pattern file and the two match-result files exist and previews the first rows of each:

In [None]:
def check_tag_matches():
    """Check if tag matching files exist and display their contents"""
    match_files = [
        'tag_matches_summary.tsv',
        'tag_matches_detailed.tsv',
        'keep_for_summary_tags.tsv'
    ]
    
    for file_name in match_files:
        if os.path.exists(file_name):
            print(f"\n{file_name} exists. Contents:")
            df = pd.read_csv(file_name, sep='\t')
            display(df.head())
            print(f"Total rows: {len(df)}")
        else:
            print(f"\n{file_name} does not exist.")

# Check tag matches
check_tag_matches()

## 10. Generate New Tag Matches

If you need to create or update tag matches based on patterns in `keep_for_summary_tags.tsv`, you can run the following code:

In [None]:
def generate_tag_matches():
    """Generate tag matches based on patterns in keep_for_summary_tags.tsv"""
    if not os.path.exists('keep_for_summary_tags.tsv'):
        print("Error: keep_for_summary_tags.tsv does not exist.")
        return
    
    # Read the pattern file
    patterns_df = pd.read_csv('keep_for_summary_tags.tsv', sep='\t')
    if 'Pattern' not in patterns_df.columns:
        print("Error: keep_for_summary_tags.tsv must have a 'Pattern' column.")
        return
    
    # Drop blank rows and force strings so .lower() works on every pattern
    patterns = patterns_df['Pattern'].dropna().astype(str).tolist()
    print(f"Found {len(patterns)} patterns in keep_for_summary_tags.tsv")
    
    # Get column headers from the Excel file
    excel_path = 'data/tag_counts_summary.xlsx'
    if not os.path.exists(excel_path):
        print(f"Error: Excel summary file not found at: {excel_path}")
        return
    
    # Load and get all column names from the Summary All Tags sheet
    all_tags_df = pd.read_excel(excel_path, sheet_name='Summary All Tags')
    all_columns = all_tags_df.columns.tolist()
    
    # Filter for tag columns (ending with _Count or _Words) and strip only the suffix
    suffixes = ('_Count', '_Words')
    tag_names = set()
    for col in all_columns:
        for suffix in suffixes:
            if col.endswith(suffix):
                tag_names.add(col[:-len(suffix)])
    
    tag_names = sorted(tag_names)
    print(f"Found {len(tag_names)} unique tag names in Excel file")
    
    # Match patterns with column headers
    import fnmatch
    
    # Create summary results
    summary_results = []
    detailed_results = []
    
    for pattern in patterns:
        matching_columns = [col for col in tag_names if fnmatch.fnmatch(col.lower(), pattern.lower())]
        summary_results.append({
            'Pattern': pattern,
            'Match_Count': len(matching_columns),
            'Matches': ', '.join(matching_columns[:5]) + ('...' if len(matching_columns) > 5 else '')
        })
        
        # Add detailed results
        for col in matching_columns:
            detailed_results.append({
                'Pattern': pattern,
                'Matched_Column': col
            })
    
    # Create and save DataFrames
    summary_df = pd.DataFrame(summary_results)
    summary_df.to_csv('tag_matches_summary.tsv', sep='\t', index=False)
    print(f"Created tag_matches_summary.tsv with {len(summary_results)} rows")
    
    detailed_df = pd.DataFrame(detailed_results)
    detailed_df.to_csv('tag_matches_detailed.tsv', sep='\t', index=False)
    print(f"Created tag_matches_detailed.tsv with {len(detailed_results)} rows")
    
    # Display summary results
    display(summary_df)

# Function to create a pattern file if it doesn't exist
def create_pattern_file():
    """Create a pattern file if it doesn't exist"""
    if not os.path.exists('keep_for_summary_tags.tsv'):
        # Create a basic pattern file with some common patterns
        patterns = [
            'sceneaction',
            'scenedia',
            'dia*',
            'chapmarker',
            'description',
            'scene*',
            'ch*'
        ]
        
        patterns_df = pd.DataFrame({'Pattern': patterns})
        patterns_df.to_csv('keep_for_summary_tags.tsv', sep='\t', index=False)
        print("Created keep_for_summary_tags.tsv with default patterns")
        return True
    return False

# Create pattern file if needed
created = create_pattern_file()

# Generate tag matches
generate_tag_matches()
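
The matching step above uses shell-style wildcards from the standard library's `fnmatch` module, where `*` matches any run of characters and the pattern must cover the whole name. The following sketch shows that behaviour in isolation. The helper `match_tags` and the tag names are hypothetical, and `fnmatchcase` is used with explicit lowercasing so the result does not depend on the operating system.

```python
import fnmatch


def match_tags(patterns, tag_names):
    """Map each pattern to the tag names it matches, ignoring case."""
    return {
        pattern: [
            tag for tag in tag_names
            if fnmatch.fnmatchcase(tag.lower(), pattern.lower())
        ]
        for pattern in patterns
    }


# Hypothetical tag names; the real ones come from the Excel column headers
tags = ["ChapMarker", "Dialogue", "SceneAction", "SceneDia"]
matches = match_tags(["scene*", "dia*", "ch*"], tags)

for pattern, hits in matches.items():
    print(f"{pattern:8} -> {hits}")
```

Because the pattern is anchored at both ends, `dia*` matches `Dialogue` but not `SceneDia`; a pattern such as `*dia*` would be needed to match both.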

## 11. View and Edit Pattern File

You can view and edit the pattern file directly in this notebook:

In [None]:
def edit_pattern_file():
    """View and edit the pattern file"""
    # Create file if it doesn't exist
    if not os.path.exists('keep_for_summary_tags.tsv'):
        create_pattern_file()
    
    # Read the current patterns
    patterns_df = pd.read_csv('keep_for_summary_tags.tsv', sep='\t')
    
    # Show current patterns
    print("Current patterns:")
    display(patterns_df)
    
    # Create widgets for editing
    from ipywidgets import widgets
    
    # Convert DataFrame to text
    patterns_text = "Pattern\n" + "\n".join(patterns_df['Pattern'].dropna().astype(str).tolist())
    
    # Create text area for editing
    text_area = widgets.Textarea(
        value=patterns_text,
        description='Patterns:',
        disabled=False,
        layout=widgets.Layout(width='80%', height='200px')
    )
    
    # Create save button
    save_button = widgets.Button(
        description='Save Patterns',
        button_style='success',
        tooltip='Save patterns to file'
    )
    
    output = widgets.Output()
    
    def on_save_click(b):
        with output:
            try:
                # Parse the text area content, trimming whitespace around each line
                lines = [line.strip() for line in text_area.value.strip().splitlines()]
                
                # Skip header if present
                if lines and lines[0].lower() == 'pattern':
                    lines = lines[1:]
                
                # Create new DataFrame from the non-empty lines
                new_patterns = [line for line in lines if line]
                new_df = pd.DataFrame({'Pattern': new_patterns})
                
                # Save to file
                new_df.to_csv('keep_for_summary_tags.tsv', sep='\t', index=False)
                print(f"Saved {len(new_patterns)} patterns to keep_for_summary_tags.tsv")
                
                # Regenerate tag matches
                print("\nRegenerating tag matches...")
                generate_tag_matches()
            except Exception as e:
                print(f"Error saving patterns: {e}")
    
    save_button.on_click(on_save_click)
    
    # Display widgets
    display(text_area)
    display(save_button)
    display(output)

# Show pattern editor
edit_pattern_file()

## 12. Open Excel Summary in Browser

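The cell below links to the Excel summary file so you can open or download it from the browser. When running in GitPod, it also builds a link through the workspace's port 8888 URL: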

In [None]:
def open_excel_summary():
    """Display the Excel summary in the browser"""
    excel_path = 'data/tag_counts_summary.xlsx'
    if not os.path.exists(excel_path):
        print(f"Excel summary file not found at: {excel_path}")
        return
    
    # For GitPod, create a link to the file
    if in_gitpod:
        # Get the GitPod workspace URL
        workspace_url = os.environ.get('GITPOD_WORKSPACE_URL')
        if workspace_url:
            file_url = f"{workspace_url.replace('https://', 'https://8888-')}/files/{excel_path}"
            display(HTML(f'<a href="{file_url}" target="_blank">Open Excel Summary in New Tab</a>'))
    
    # Also provide direct download link
    display(FileLink(excel_path))

# Open Excel summary
open_excel_summary()
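
The GitPod link above is built by prefixing the workspace host with the port number. The sketch below isolates that step as a pure function so it can be checked without a running workspace. It assumes Jupyter is served on port 8888 and exposes files under `/files/`, as the cell above does; the function name `gitpod_file_url` and the example workspace URL are made up for illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def gitpod_file_url(workspace_url, file_path, port=8888):
    """Build a port-prefixed workspace URL pointing at a file served by Jupyter."""
    parts = urlsplit(workspace_url)
    host = f"{port}-{parts.netloc}"       # e.g. 8888-<workspace host>
    path = "/files/" + quote(file_path)   # percent-encode spaces etc., keep slashes
    return urlunsplit((parts.scheme, host, path, "", ""))


# Hypothetical workspace URL
url = gitpod_file_url(
    "https://example-workspace.ws-eu01.gitpod.io",
    "data/tag_counts_summary.xlsx",
)
print(url)
```

Splitting the URL instead of using string replacement keeps the function working if the workspace URL has a trailing slash or a different scheme.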

## Project Documentation

### Summary Tabs in Excel File

The `data/tag_counts_summary.xlsx` file contains several summary tabs:

1. **Summary**: Basic overview showing key metrics for each document
2. **Summary Freq Words**: Tags sorted by total word count (most text content first)
3. **Summary Freq Tags**: Tags sorted by frequency/count (most common tags first)
4. **Summary All Tags**: All tags listed alphabetically
5. **Summary Included Tags**: Tags matched by patterns in `keep_for_summary_tags.tsv`
6. **Summary Excluded Tags**: Tags not matched by patterns

After these tabs, individual document tabs show detailed tag counts for each text.

### Useful Command Line Operations

```bash
# Run tag analysis from command line
python count_tags.py

# Check git status
git status

# Commit and push changes
git add .
git commit -m "Update tag analysis results"
git push
```