# Novel Scenification - Tag Analysis Notebook

This notebook allows you to run the tag analysis script and explore the resulting data.

## Project Overview

This project analyzes scene usage in English novels circa 1800 by processing custom-tagged HTML files to extract metrics on narrative techniques.
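
To make the idea concrete, here is a minimal sketch of how tag and word counts could be extracted from a custom-tagged snippet. It is not the logic of `count_tags.py`. The class name `TagWordCounter`, the tag names `scene` and `dialogue`, and the rule of naming a nested tag `parent_child` are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagWordCounter(HTMLParser):
    """Count how often each tag opens and how many words it encloses.

    Hypothetical illustration: a nested tag is recorded under a compound
    name such as 'scene_dialogue' (parent_child).
    """

    def __init__(self):
        super().__init__()
        self.stack = []               # currently open tags, outermost first
        self.tag_counts = Counter()   # openings per (compound) tag name
        self.word_counts = Counter()  # enclosed words per (compound) tag name

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.tag_counts["_".join(self.stack)] += 1

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        n_words = len(data.split())
        # Credit the words to every open level: 'scene', 'scene_dialogue', ...
        for depth in range(1, len(self.stack) + 1):
            self.word_counts["_".join(self.stack[:depth])] += n_words


sample = "<scene>He entered. <dialogue>Good morning, sir.</dialogue></scene>"
counter = TagWordCounter()
counter.feed(sample)
counter.close()
print(dict(counter.tag_counts))
print(dict(counter.word_counts))
```

For the sample string, `scene` is counted once with five words and `scene_dialogue` once with three, because words inside the nested tag are also credited to its parent in this sketch.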

## 1. Run Tag Analysis

First, let's run the `count_tags.py` script, which will:
- Process HTML files in `data/input/`
- Generate CSV files in `data/counts/`
- Create a summary Excel file at `data/tag_counts_summary.xlsx`
- Generate Markdown summaries in `data/SUMMARY.md` and `data/SAMPLES.md`

Use the cells below to install the dependencies, restart the kernel, and then execute the script:

## Install the package dependencies

In [None]:
%pip install -r requirements.txt

## IMPORTANT: After installing, make sure to RESTART the notebook kernel

## Run the external tag analysis script


In [None]:
!python count_tags.py
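
Once the script has finished, it can be worth confirming that the outputs listed above were actually written. The following is a minimal sketch. The paths come from the description of the script, while the helper name `missing_outputs` and its `base_dir` parameter are hypothetical.

```python
from pathlib import Path

# Output locations taken from the description of count_tags.py above
EXPECTED_OUTPUTS = [
    "data/counts",
    "data/tag_counts_summary.xlsx",
    "data/SUMMARY.md",
    "data/SAMPLES.md",
]


def missing_outputs(base_dir="."):
    """Return the expected output paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [p for p in EXPECTED_OUTPUTS if not (base / p).exists()]


missing = missing_outputs()
if missing:
    print("Missing outputs:", ", ".join(missing))
else:
    print("All expected outputs are present.")
```

If anything is reported as missing, check the script's output in the previous cell before continuing.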

## 2. Import Libraries

Now let's import the necessary libraries for data analysis and visualization:

In [None]:
import os
import glob
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Markdown

# Configure visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = [12, 8]

## 3. Explore Input and Output Files

In [None]:
# List input HTML files
input_files = sorted(glob.glob('data/input/*.html'))
print(f"Found {len(input_files)} HTML files in data/input/")

# Display the first few files and their sizes
print("\nInput HTML files:")
for file in input_files[:5]:
    size_kb = os.path.getsize(file) / 1024
    print(f"- {os.path.basename(file)} ({size_kb:.1f} KB)")
    
if len(input_files) > 5:
    print(f"...and {len(input_files) - 5} more files")

In [None]:
# List output CSV files
output_files = sorted(glob.glob('data/counts/*.csv'))
print(f"Found {len(output_files)} CSV files in data/counts/")

# Display the first few files
print("\nOutput CSV files:")
for file in output_files[:5]:
    print(f"- {os.path.basename(file)}")
    
if len(output_files) > 5:
    print(f"...and {len(output_files) - 5} more files")
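
The two listings above can also be cross-checked to spot inputs that produced no counts. This sketch assumes that each CSV file shares its stem with the HTML file it was generated from, which is a guess about how `count_tags.py` names its outputs. The helper `inputs_without_counts` is hypothetical.

```python
from pathlib import Path


def inputs_without_counts(input_dir="data/input", counts_dir="data/counts"):
    """List HTML file names that have no CSV file with the same stem.

    Assumption: each CSV is named after its source HTML file.
    """
    csv_stems = {p.stem for p in Path(counts_dir).glob("*.csv")}
    return sorted(
        p.name for p in Path(input_dir).glob("*.html") if p.stem not in csv_stems
    )


print(inputs_without_counts())
```

An empty list means every input has a matching CSV under that naming assumption. If the script uses a different naming scheme, every input will be listed and the check should be adapted.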

## 4. Analyze an Individual Text

Let's create a function to analyze tag patterns in a specific text:

In [None]:
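def analyze_text(text_name=None):
    """Analyze the tag distribution in a single text.
    
    Args:
        text_name: Substring of the CSV filename to match. If None, the
            first file found is analyzed.
    
    Returns:
        DataFrame of per-tag counts without the totals row, or None if the
        analysis cannot be run.
    """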
    # Get all CSV files, sorted so that "the first file" is deterministic
    csv_files = sorted(glob.glob('data/counts/*.csv'))
    
    if not csv_files:
        print("No CSV files found. Run the tag analysis first.")
        return None
    
    # If no text specified, use the first one
    if text_name is None:
        csv_file = csv_files[0]
    else:
        # Find files whose name (not directory path) contains text_name
        matches = [f for f in csv_files if text_name in os.path.basename(f)]
        if not matches:
            print(f"No files found matching '{text_name}'")
            return None
        csv_file = matches[0]
    
    print(f"Analyzing: {os.path.basename(csv_file)}")
    
    # Load the CSV
    df = pd.read_csv(csv_file)
    
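    # Get document totals (stored in a special 'totaldoctagswords' row)
    totals_rows = df[df['tag'] == 'totaldoctagswords']
    if totals_rows.empty:
        print("No 'totaldoctagswords' row found in this file.")
        return None
    totals = totals_rows.iloc[0]
    total_tags = totals['tag_count']
    total_words = totals['word_count']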
    
    print(f"Total tags: {total_tags}")
    print(f"Total words: {total_words}")
    
    # Remove totals row
    df = df[df['tag'] != 'totaldoctagswords']
    
    # Show top tags by frequency
    freq_df = df.sort_values('tag_count', ascending=False).head(15)
    print("\nTop 15 most frequent tags:")
    display(freq_df[['tag', 'tag_count', 'word_count']])
    
    # Visualize top tags by frequency
    plt.figure(figsize=(14, 6))
    sns.barplot(x='tag', y='tag_count', data=freq_df)
    plt.title(f'Top Tags by Frequency in {os.path.basename(csv_file)}')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Visualize top tags by word count
    word_df = df.sort_values('word_count', ascending=False).head(15)
    plt.figure(figsize=(14, 6))
    sns.barplot(x='tag', y='word_count', data=word_df)
    plt.title(f'Top Tags by Word Count in {os.path.basename(csv_file)}')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    return df

In [None]:
# Analyze the first text (if available)
output_files = glob.glob('data/counts/*.csv')
if output_files:
    text_df = analyze_text()
else:
    print("No output files found. Please run the tag analysis script first.")
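
Raw word counts are hard to compare between texts of different lengths, so a natural next step is to express each tag's word count as a share of the document total. The sketch below assumes the `word_count` in the `totaldoctagswords` row is the document's total word count. The helper `add_word_share` and the numbers in `toy` are made up for illustration.

```python
import pandas as pd


def add_word_share(counts):
    """Return per-tag rows with each tag's share of the document's words.

    Expects the columns used above ('tag', 'tag_count', 'word_count') and a
    'totaldoctagswords' row holding the document totals.
    """
    is_total = counts["tag"] == "totaldoctagswords"
    if not is_total.any():
        raise ValueError("No 'totaldoctagswords' row found")
    total_words = counts.loc[is_total, "word_count"].iloc[0]
    tags = counts.loc[~is_total].copy()
    tags["word_share"] = tags["word_count"] / total_words
    return tags.sort_values("word_share", ascending=False)


# Made-up numbers for illustration, not results from the corpus
toy = pd.DataFrame({
    "tag": ["totaldoctagswords", "scene", "scene_dialogue"],
    "tag_count": [3, 2, 1],
    "word_count": [1000, 400, 150],
})
print(add_word_share(toy))
```

If the words inside a nested tag are also counted under its parent, the shares will not sum to 1, so they are best read tag by tag.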

## 5. Analyze Nested Tag Combinations

Let's examine compound tags (those with an underscore, indicating nesting) across the corpus:

In [None]:
def analyze_compound_tags():
    """Analyze nested tag combinations across all texts"""
    # Load all CSV files
    csv_files = glob.glob('data/counts/*.csv')
    if not csv_files:
        print("No CSV files found. Run the tag analysis first.")
        return None
    
    # Combine data from all files
    all_data = []
    for csv_file in csv_files:
        df = pd.read_csv(csv_file)
        text_name = os.path.splitext(os.path.basename(csv_file))[0]
        df['text'] = text_name
        all_data.append(df)
    
    combined_df = pd.concat(all_data, ignore_index=True)
    
    # Filter for compound tags (containing a literal underscore; skip missing tags)
    compound_tags = combined_df[combined_df['tag'].str.contains('_', regex=False, na=False)]
    
    # Sum counts across all texts
    tag_totals = compound_tags.groupby('tag').agg({
        'tag_count': 'sum',
        'word_count': 'sum'
    }).reset_index()
    
    # Get top compounds by frequency
    top_by_freq = tag_totals.sort_values('tag_count', ascending=False).head(15)
    print("Top nested tag combinations by frequency:")
    display(top_by_freq)
    
    # Visualize top compounds
    plt.figure(figsize=(14, 6))
    sns.barplot(x='tag', y='tag_count', data=top_by_freq)
    plt.title('Top Nested Tag Combinations by Frequency')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Get top compounds by word count
    top_by_words = tag_totals.sort_values('word_count', ascending=False).head(15)
    print("\nTop nested tag combinations by word count:")
    display(top_by_words)
    
    # Visualize top compounds by word count
    plt.figure(figsize=(14, 6))
    sns.barplot(x='tag', y='word_count', data=top_by_words)
    plt.title('Top Nested Tag Combinations by Word Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    return compound_tags

# Run the analysis
compound_data = analyze_compound_tags()
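
The corpus-wide totals hide how individual texts differ. Since the returned frame keeps a `text` column, it can be pivoted into a text-by-tag table. This is a minimal sketch: the helper `compound_tags_by_text` is hypothetical, and the rows in `toy` are invented stand-ins for `compound_data`.

```python
import pandas as pd


def compound_tags_by_text(compound_tags, value="tag_count"):
    """Pivot long-format compound-tag rows into a text x tag table."""
    return compound_tags.pivot_table(
        index="text", columns="tag", values=value, aggfunc="sum", fill_value=0
    )


# Made-up rows with the same columns as the frame returned above
toy = pd.DataFrame({
    "text": ["novel_a", "novel_a", "novel_b"],
    "tag": ["scene_dialogue", "scene_action", "scene_dialogue"],
    "tag_count": [4, 2, 7],
    "word_count": [300, 120, 510],
})
table = compound_tags_by_text(toy)
print(table)
```

With real data, calling `compound_tags_by_text(compound_data)` should produce one row per text. Passing `value="word_count"` switches the table from tag frequency to enclosed words.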