# **ETL for Unified Topic Modeling Analysis (UTMA) - CDC MMWR Journals and Beyond**

### **Author**: [Your Name]  
### **Date**: [Date of Notebook Creation]

---

## **Notebook Overview**

This notebook is part of the **Unified Topic Modeling Analysis (UTMA)** project, focusing on the **Extract, Transform, Load (ETL)** process for preparing textual data from various sources—including medical journals, novels, and general public health documents—for topic modeling and diachronic analysis.

### **1. Objectives of the ETL Process**
This ETL notebook preprocesses textual data from multiple time periods and sources so that the resulting corpora are varied enough for effective topic modeling and subsequent diachronic analysis. Specifically, this notebook will:
- **Combine Data from Multiple Time Periods**: Expand the temporal scope by merging different corpora (e.g., CDC MMWR publications from 2010-2014, 2014-2019, and 2020-2024) into a unified text to increase diversity.
- **Balanced Sampling and Temporal Tagging**: Ensure that each document is tagged with its respective time period, enabling balanced representation across temporal spans during training.
- **Diversity-Oriented Batch Preparation**: Implement strategies to diversify the data, such as weighted sampling and clustering, to mitigate the limited variance that impacts model quality.
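
As a rough illustration of the first two objectives, the sketch below merges per-period paragraph lists into one tagged corpus. The `corpora_by_period` dictionary, the sample sentences, and the `build_unified_corpus` helper are hypothetical stand-ins, not part of the notebook's actual pipeline.

```python
from collections import Counter

# Hypothetical in-memory corpora: period label -> list of paragraphs.
corpora_by_period = {
    "2010-2014": ["Measles outbreak reported in county schools.",
                  "Influenza activity increased."],
    "2014-2019": ["Ebola response teams were deployed."],
    "2020-2024": ["COVID-19 vaccination coverage rose.",
                  "Mpox cases were identified."],
}


def build_unified_corpus(corpora):
    """Merge per-period paragraph lists into one list of tagged records."""
    unified = []
    for period, paragraphs in corpora.items():
        for text in paragraphs:
            # Each record keeps its period so later sampling can balance on it.
            unified.append({"period": period, "text": text})
    return unified


unified_corpus = build_unified_corpus(corpora_by_period)
period_counts = Counter(record["period"] for record in unified_corpus)
print(period_counts)
```

Keeping the period as a field on every record, rather than in separate files, is one possible design; it lets the same list serve both the unified corpus and the per-period segments.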

### **2. Data Sources**
- **CDC MMWR Journals** (2010-2024): Split into three segments (2010-2014, 2014-2019, 2020-2024) to allow for **temporal analysis** and facilitate the understanding of changes in public health discourse.
- **Additional Texts (Future Expansion)**: General-purpose pipeline to support documents such as novels, newspaper articles, and other public health records.

### **3. Key Preprocessing Steps**
This notebook performs the following steps to prepare the output for topic modeling:
- **Text Extraction**:
  - Extract raw paragraphs from input documents and corpora, ensuring all parts of the data are captured.
  
- **Content Transformation**:
  - **Tokenization**: Utilize spaCy’s NLP pipeline to tokenize the paragraphs.
  - **Stop Word Handling**: Evaluate and experiment with retaining or removing stop words, considering their role in providing context and facilitating diachronic analysis.
  - **POS Filtering and Character Length Adjustment**: Initially process only content-rich words (nouns, adjectives, verbs, adverbs) with 5+ characters, but allow flexibility to include other parts of speech or adjust the threshold based on analysis needs.
  - **Temporal Tagging**: Tag paragraphs with their respective time period to retain temporal information for balanced sampling and diachronic analysis.
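
The POS and length filter described above can be sketched without loading a spaCy model by working on pre-tagged `(text, pos)` pairs. The `filter_tokens` helper, its parameter names, and the sample tags are assumptions made for illustration; in the notebook the tags would come from spaCy's `token.pos_`.

```python
# Universal POS tags for content-rich words (nouns, adjectives, verbs, adverbs).
CONTENT_POS = {"NOUN", "ADJ", "VERB", "ADV"}


def filter_tokens(tagged_tokens, allowed_pos=CONTENT_POS, min_chars=5):
    """Keep tokens with an allowed POS tag and at least min_chars characters."""
    return [
        text
        for text, pos in tagged_tokens
        if pos in allowed_pos and len(text) >= min_chars
    ]


# Hypothetical pre-tagged paragraph standing in for spaCy output.
tagged = [
    ("The", "DET"), ("outbreak", "NOUN"), ("spread", "VERB"),
    ("rapidly", "ADV"), ("in", "ADP"), ("rural", "ADJ"), ("areas", "NOUN"),
]

default_tokens = filter_tokens(tagged)
stricter_tokens = filter_tokens(tagged, min_chars=6)
print(default_tokens)
print(stricter_tokens)
```

Exposing the allowed tags and the length threshold as parameters reflects the flexibility the step calls for: either can be relaxed without touching the filter itself.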

### **4. Strategies for Variability Enhancement**
Given the challenges of **limited variance** within homogeneous corpora (e.g., CDC MMWR journals), this notebook employs multiple strategies to ensure variability:
- **Balanced Sampling Across Time Periods**: Incorporate a balanced mix of paragraphs from all three time segments in each training batch.
- **Diversity-Based Batch Formation and Weighted Sampling**: Use techniques like **TF-IDF-based clustering** and **weighted sampling** to diversify the batches, with the aim of improving differentiation between topics.
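
A minimal sketch of balanced sampling across time periods follows, using only the standard library. The `balanced_batches` function, the record layout (`period` and `text` keys), and the synthetic records are assumptions; the TF-IDF clustering and weighted sampling mentioned above are not shown here.

```python
import random
from collections import defaultdict


def balanced_batches(records, batch_size_per_period, seed=0):
    """Yield batches holding the same number of paragraphs from every period."""
    rng = random.Random(seed)
    by_period = defaultdict(list)
    for record in records:
        by_period[record["period"]].append(record)
    if not by_period:
        return
    for items in by_period.values():
        rng.shuffle(items)
    # The smallest period limits how many full balanced batches can be formed.
    smallest = min(len(items) for items in by_period.values())
    n_batches = smallest // batch_size_per_period
    for i in range(n_batches):
        start = i * batch_size_per_period
        batch = []
        for items in by_period.values():
            batch.extend(items[start:start + batch_size_per_period])
        rng.shuffle(batch)  # Mix periods within the batch.
        yield batch


# Synthetic records with unequal period sizes (6, 4, and 5 paragraphs).
records = [
    {"period": period, "text": f"{period} paragraph {i}"}
    for period, n in [("2010-2014", 6), ("2014-2019", 4), ("2020-2024", 5)]
    for i in range(n)
]
batches = list(balanced_batches(records, batch_size_per_period=2))
print(len(batches), [len(batch) for batch in batches])
```

In this sketch, paragraphs left over from the larger periods are simply dropped; oversampling the smaller periods would be an alternative if discarding data is undesirable.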
  
### **5. Data Outputs**
- **Unified Corpus**: A unified text, comprising paragraphs from all three temporal segments, tagged accordingly.
- **Segmented Corpora**: Individual segments for each time period, maintaining temporal integrity for diachronic analysis.
- **Batch Preparation for Training**: Diverse and balanced batches of paragraphs ready for input into the **Gensim LDA model**, with the aim of improving coherence, perplexity, and topic differentiation.

### **6. Challenges Addressed in ETL**
- **Limited Corpus Variance**: By expanding the data scope and employing balanced sampling and batch diversification, the goal is to address and mitigate the uniformity issues that limit the richness of LDA topics.
- **Preparation for Future Analysis**: This notebook not only sets the foundation for robust LDA modeling but also prepares the data for **dynamic topic modeling (DTM)** and **hierarchical LDA (hLDA)**, to be implemented in subsequent stages for a richer understanding of topic evolution over time.

### **7. Future Considerations**
- **Data Augmentation**: Consider data augmentation techniques such as synonym replacement or back-translation to further enhance variability in future iterations.
- **Advanced Modeling Approaches**: This ETL process will feed into subsequent modeling efforts, including experimenting with **Non-Negative Matrix Factorization (NMF)** and **hierarchical topic modeling**, as well as exploring **adaptive thresholding** for coherence metrics.

---

### **Notebook Flow**
1. **Setup and Initialization**: Import necessary libraries, set up paths, and initialize logging.
2. **Data Extraction**: Load and preview the data from the three corpora (2010-2014, 2014-2019, 2020-2024).
3. **Text Preprocessing**: Tokenize, filter, and tag data for each paragraph.
4. **Batch Formation**: Implement batch formation strategies (balanced sampling, diversity-based clustering).
5. **Output Generation**: Save the processed data for further use in topic modeling.

**Note**: This notebook is designed to be modular and adaptive, supporting different types of textual data as we continue to refine the topic modeling pipeline.

--- 

**Let’s get started by setting up the environment and extracting the data.** 


In [6]:
# Standard Library Imports
import os            # Provides functions for interacting with the operating system, e.g., file handling.
import re            # Supports regular expressions for text manipulation and pattern matching.
import csv           # Facilitates reading from and writing to CSV files.
import json          # Enables reading from and writing to JSON files, often used for structured data.
from time import time  # Allows timing operations for performance measurement.
import logging       # Provides logging functionalities to monitor code execution.
import codecs        # Handles different text encodings, important for text data processing.
import multiprocessing  # Supports parallel processing, enhancing performance on large datasets.
import pprint as pp  # Pretty-prints data structures, useful for debugging and visualization.
from hashlib import md5  # Provides MD5 hashing, useful for generating unique IDs or checking data integrity.

# Encoding and Parsing Imports
import chardet        # Detects character encoding of text files, allowing for accurate reading of various encodings.
import unicodedata    # Handles Unicode character information, useful for identifying and removing non-printable characters.
from bs4 import BeautifulSoup  # Parses HTML content, enabling extraction of specific tags (e.g., <p> tags) for processing.


# NLTK Imports
import nltk                             # Natural Language Toolkit, a suite for text processing.
from nltk.corpus import stopwords       # Provides lists of stop words to remove from text.
from nltk.corpus.reader.api import CorpusReader  # Base class for reading and structuring corpora.
from nltk import sent_tokenize, pos_tag, wordpunct_tokenize  # Tokenizers and POS tagger for sentence processing.
stop_words = stopwords.words('english')  # Initializing English stop words list for filtering out common words.

# Gensim Imports
import gensim                           # Library for topic modeling and word vector creation.
from gensim.models import Word2Vec, ldamulticore  # Word2Vec for word embeddings, ldamulticore for topic modeling.
from gensim.models.phrases import Phrases, Phraser  # Constructs multi-word phrases (e.g., bigrams) from tokenized text.
import gensim.corpora as corpora        # Handles creation of dictionaries and corpora from text data.
from gensim.utils import simple_preprocess  # Preprocesses text into a list of tokens.

# SpaCy Import (specific model)
import en_core_web_lg                   # SpaCy's large English NLP model for advanced text processing.
nlp = en_core_web_lg.load(disable=['parser','ner'])  # Loads model, with parsing and named entity recognition disabled.

# Readability Import
from readability.readability import Unparseable  # Exception handling for parsing errors in HTML.
from readability.readability import Document as Paper  # Extracts readable content from HTML, discarding noise.

# BeautifulSoup and HTML5lib
import bs4                                # BeautifulSoup for parsing HTML and XML documents.
import html5lib                           # Parser for HTML5, used by BeautifulSoup for web scraping.

# Data Processing and Scientific Libraries
import numpy as np                        # Supports efficient numerical operations on large arrays and matrices.
import pandas as pd                       # Data analysis library for handling structured data (e.g., DataFrames).
from sklearn.manifold import TSNE         # Dimensionality reduction for visualizing high-dimensional data.
from matplotlib import pyplot as plt      # Plotting library for creating visualizations.
from tqdm import tqdm                     # Adds progress bars to loops, useful for monitoring lengthy operations.

from collections import defaultdict

import statistics

In [25]:
DOC_ID = r'.*[\d\w\-.]+\.(html|json)$'  # Regular expression pattern to identify document filenames ending in .html or .json.

TAGS = ['p']  # List of HTML tags to extract content from; 'p' is commonly used to denote paragraphs in HTML.
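
To see which filenames the `DOC_ID` pattern accepts, here is a quick check with `re.match`, independent of NLTK. The candidate filenames other than `2024_html.json` are made up for illustration.

```python
import re

DOC_ID = r'.*[\d\w\-.]+\.(html|json)$'

# Hypothetical filenames, plus the one file present in the 2024 corpus folder.
candidates = [
    "2024_html.json",
    "mmwr/2019-week-07.html",
    "notes.txt",
    "archive.json.bak",
]

matched = [name for name in candidates if re.match(DOC_ID, name)]
print(matched)
```

Without the `re.IGNORECASE` flag the pattern is case-sensitive, so an uppercase extension such as `.HTML` would not match in this check.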


In [54]:

class DocumentParser(CorpusReader):
       
    def __init__(self, root, tags=TAGS, fileids=DOC_ID, **kwargs):
        CorpusReader.__init__(self, root, fileids)
        self.tags = tags
        # Statistics tracking
        self.error_character_stats = defaultdict(int)
        self.parsing_errors_stats = defaultdict(int)
        self.token_frequency_stats = defaultdict(int)
        self.foreign_sentence_count = 0
        self.mixed_language_sentence_count = 0
        self.paragraph_language_counts = defaultdict(int)
        self.sentence_length_stats = []
        self.paragraph_length_stats = []
        self.unique_tokens = set()
        self.special_character_usage = defaultdict(int)

    def resolve(self, fileids=None):
        return fileids 

    def detect_encoding(self, file_path):
        with open(file_path, 'rb') as f:
            raw_data = f.read()
        return chardet.detect(raw_data)['encoding']
    
    def process_content(self, content, non_html_log_file):
        # Normalize to NFKC and trim surrounding whitespace.
        content = unicodedata.normalize('NFKC', content.strip())

        try:
            soup = BeautifulSoup(content, 'html.parser')
            paragraphs = soup.find_all('p')

            for p in paragraphs:
                text = p.get_text()
                tokenized_paragraph = re.findall(r'\b\w+\b', text)

                # Track token frequencies and unique tokens
                for token in tokenized_paragraph:
                    self.token_frequency_stats[token] += 1
                    self.unique_tokens.add(token)

                # Track control characters in individual tokens
                for token in tokenized_paragraph:
                    for char in token:
                        if (ord(char) < 32 and char not in '\n\t') or unicodedata.category(char) in ['Cc', 'Cf']:
                            self.error_character_stats[(f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN"))] += 1

                # Count foreign and mixed-language sentences
                contains_foreign = any(any(ord(char) > 127 for char in token) for token in tokenized_paragraph)
                contains_latin = any(any('LATIN' in unicodedata.name(char, '') for char in token if ord(char) > 127) for token in tokenized_paragraph)

                if contains_foreign:
                    self.foreign_sentence_count += 1

                if contains_foreign and contains_latin:
                    self.mixed_language_sentence_count += 1

                # Determine predominant language in paragraphs
                non_latin_count = sum(1 for token in tokenized_paragraph for char in token if ord(char) > 127)
                latin_count = sum(1 for token in tokenized_paragraph for char in token if 'LATIN' in unicodedata.name(char, '') and ord(char) <= 127)

                if non_latin_count > latin_count:
                    self.paragraph_language_counts['foreign'] += 1
                else:
                    self.paragraph_language_counts['latin'] += 1

                # Track paragraph length
                self.paragraph_length_stats.append(len(tokenized_paragraph))

                # Track special character usage
                for char in text:
                    if not char.isalnum() and char not in (' ', '\n', '\t'):
                        self.special_character_usage[char] += 1

                # Keep tab (U+0009), newline (U+000A), printable ASCII, and U+0080-U+FFFF;
                # drop everything else (other C0 controls, DEL, and characters above U+FFFF)
                cleaned_text = re.sub(r'[^\x09\x0A\x20-\x7E\x80-\uFFFF]', '', text)

                # Identify and log control characters found in the original text (excluding newline and tab)
                control_chars = [
                    (char, f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN"))
                    for char in text if ord(char) < 32 and char not in '\n\t'
                ]

                if control_chars:
                    char_details = "; ".join([f"{c} ({code}: {name})" for c, code, name in control_chars])
                    non_html_log_file.write(f"Control characters found in paragraph: {char_details}\n")
                    
                # Track sentence length statistics
                sentences = nltk.sent_tokenize(text)
                for sentence in sentences:
                    self.sentence_length_stats.append(len(re.findall(r'\b\w+\b', sentence)))
                
                yield cleaned_text
        except Exception as e:
            self.parsing_errors_stats[str(e)] += 1
            non_html_log_file.write(f"Error processing as HTML: {str(e)}\n")
            yield None


    def docs(self, fileids=None, pattern=None, non_html_log_file=None):
        fileids = self.resolve(fileids)
        if pattern is not None:
            regex = re.compile(pattern, re.IGNORECASE)
        else:
            regex = re.compile(r'.*([\d\w\-.]+)\.(html|json)$', re.IGNORECASE)
            
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            if regex.search(path):
                encoding = self.detect_encoding(path)
                
                if path.lower().endswith('.json'):
                    with codecs.open(path, 'r', encoding=encoding) as f:
                        try:
                            data = json.load(f)
                            if isinstance(data, list):
                                for item in data:
                                    if isinstance(item, str):
                                        for content in self.process_content(item, non_html_log_file):
                                            if content:
                                                yield self.replace_curly_quotes(content)
                            else:
                                print(f"Error: {path} does not contain a list of HTML strings.")
                        except json.JSONDecodeError:
                            print(f"Error: {path} is not a valid JSON file.")
                elif path.lower().endswith('.html'):
                    with open(path, 'r', encoding='utf-8', errors='replace') as f:
                        doc_content = f.read()
                        for content in self.process_content(doc_content, non_html_log_file):
                            if content:
                                yield self.replace_curly_quotes(content)
                else:
                    print(f"Unsupported file type: {path}. Only JSON and HTML files are allowed.")
                    continue


    def html(self, fileids=None):
        """
        Iterates over HTML documents, yielding content for each document.

        Parameters:
            fileids (str or None): Specific file identifier(s) or None.

        Yields:
            str: Parsed content from each HTML document.
        """
        for doc in self.docs(fileids):
            try:
                yield doc
            except Exception as e:
                print("Could not parse HTML: {}".format(e))
                continue


    def replace_curly_quotes(self, text):
        """
        Replaces curly quotes with straight quotes in the provided text.

        Parameters:
            text (str): The text to process.

        Returns:
            str: The text with curly quotes replaced by straight quotes.
        """
        quote_replacements = {
            u"\u2018": "'",  # Left single quotation mark
            u"\u2019": "'",  # Right single quotation mark
            u"\u201C": '"',  # Left double quotation mark
            u"\u201D": '"',  # Right double quotation mark
        }
        
        for curly_quote, straight_quote in quote_replacements.items():
            text = text.replace(curly_quote, straight_quote)
        
        return text

    def remove_non_printable_chars(self, text):
        """
        Removes non-printable characters from the provided text.

        Parameters:
            text (str): The text to process.

        Returns:
            str: The text with non-printable characters removed.
        """
        non_printable_pattern = re.compile(r'[\x00-\x1F\x7F-\x9F]+')
        cleaned_text = re.sub(non_printable_pattern, '', text)
        return cleaned_text.strip()


    def get_invalid_character_names(self, text):
        """
        Retrieves the names of non-printable characters in the text.

        Parameters:
            text (str): The text to analyze.

        Returns:
            set: A set of names of non-printable characters found in the text.
        """
        char_names = set()
        non_printable_pattern = re.compile(r'[\x00-\x1F\x7F-\x9F]')
        invalid_chars = non_printable_pattern.findall(text)
        for char in invalid_chars:
            try:
                name = unicodedata.name(char)
            except ValueError:
                name = "UNKNOWN CONTROL CHARACTER"
            char_names.add(name)
        return char_names

    def paras(self, parser_type='lxml', fileids=None):
        """
        Extracts paragraphs from HTML content based on specified tags.

        Parameters:
            parser_type (str): Parser type for BeautifulSoup (default is 'lxml').
            fileids (str or None): Specific file identifier(s) or None.

        Yields:
            str: Extracted paragraph text.
        """
        for html in self.html(fileids):
            # Check if html content looks like an HTML string
            if not isinstance(html, str) or "<" not in html:
                print(f"Skipping non-HTML content: {html}")
                continue

            soup = BeautifulSoup(html, parser_type)

            # Join tags into a CSS selector if `self.tags` is a list
            tag_selector = ",".join(self.tags) if isinstance(self.tags, list) else self.tags

            for element in soup.select(tag_selector):
                text = element.text.strip()
                yield text

    def sents(self, fileids=None):
        """
        Splits paragraphs into sentences.

        Parameters:
            fileids (str or None): Specific file identifier(s) or None.

        Yields:
            str: Extracted sentence text.
        """
        for paragraph in self.paras(fileids=fileids):
            for sentence in nltk.sent_tokenize(paragraph):
                yield sentence

    def words(self, fileids=None):
        """
        Splits sentences into individual words.

        Parameters:
            fileids (str or None): Specific file identifier(s) or None.

        Yields:
            str: Extracted word token.
        """
        for sentence in self.sents(fileids=fileids):
            for token in nltk.wordpunct_tokenize(sentence):
                yield token
                    
    def validate_paragraph(self, paragraph):
        """
        Validates a paragraph, checking for non-printable characters and whitespace-only content.

        Parameters:
            paragraph (str): The paragraph to validate.

        Returns:
            bool or str: True if valid, otherwise a string explaining the issues.
        """
        reasons = []
        if not paragraph.strip():
            reasons.append("Only whitespace")

        invalid_char_names = self.get_invalid_character_names(paragraph)
        
        if invalid_char_names:
            reason = f"Contains non-printable characters: {', '.join(invalid_char_names)}"
            reasons.append(reason)

        return True if not reasons else ', '.join(reasons)


    def get_token_frequency(self):
        """Returns token frequency statistics."""
        return {token: count for token, count in self.token_frequency_stats.items()}

    def process_all_and_collect_stats(self, content, non_html_log_file):
        """Process entire content and collect stats without changing the original `CorpusReader` methods."""
        list(self.process_content(content, non_html_log_file))  # Exhaust the generator so statistics are collected
        # Return error statistics after processing the entire content
        return self.get_error_statistics()
    
    def get_error_statistics(self):
        character_stats = {f"{code} ({name})": count for (code, name), count in self.error_character_stats.items()}
        other_error_stats = dict(self.parsing_errors_stats)
        sentence_length_mean = sum(self.sentence_length_stats) / len(self.sentence_length_stats) if self.sentence_length_stats else 0
        paragraph_length_mean = sum(self.paragraph_length_stats) / len(self.paragraph_length_stats) if self.paragraph_length_stats else 0
        sentence_length_median = statistics.median(self.sentence_length_stats) if self.sentence_length_stats else 0
        paragraph_length_median = statistics.median(self.paragraph_length_stats) if self.paragraph_length_stats else 0
        sentence_length_mode = statistics.mode(self.sentence_length_stats) if self.sentence_length_stats else 0
        paragraph_length_mode = statistics.mode(self.paragraph_length_stats) if self.paragraph_length_stats else 0
        sentence_length_stdev = statistics.stdev(self.sentence_length_stats) if len(self.sentence_length_stats) > 1 else 0
        paragraph_length_stdev = statistics.stdev(self.paragraph_length_stats) if len(self.paragraph_length_stats) > 1 else 0

        return {
            "character_errors": character_stats,
            "parsing_errors": other_error_stats,
            "foreign_sentence_count": self.foreign_sentence_count,
            "mixed_language_sentence_count": self.mixed_language_sentence_count,
            "paragraph_language_counts": dict(self.paragraph_language_counts),
            "unique_token_count": len(self.unique_tokens),
            "special_character_usage": dict(self.special_character_usage),
            "sentence_length_mean": sentence_length_mean,
            "paragraph_length_mean": paragraph_length_mean,
            "sentence_length_median": sentence_length_median,
            "paragraph_length_median": paragraph_length_median,
            "sentence_length_mode": sentence_length_mode,
            "paragraph_length_mode": paragraph_length_mode,
            "sentence_length_stdev": sentence_length_stdev,
            "paragraph_length_stdev": paragraph_length_stdev
        }


    def write_statistics(self, output_parquet_path, output_csv_path):
        """Write collected statistics to a Parquet file and a CSV file."""
        # Collect statistics
        error_stats = self.get_error_statistics()
        token_frequency = self.get_token_frequency()

        # Prepare data for writing to Parquet
        data = []

        # Add Token Frequency Statistics
        for token, count in token_frequency.items():
            data.append({'Statistic': f"Token Frequency: {token}", 'Value': count})

        # Add Character Error Statistics
        for char_info, count in error_stats['character_errors'].items():
            data.append({'Statistic': f"Character Error: {char_info}", 'Value': count})

        # Add Parsing Errors
        for error, count in error_stats['parsing_errors'].items():
            data.append({'Statistic': f"Parsing Error: {error}", 'Value': count})

        # Add Foreign and Mixed Language Sentence Counts
        data.append({'Statistic': "Foreign Sentence Count", 'Value': error_stats['foreign_sentence_count']})
        data.append({'Statistic': "Mixed Language Sentence Count", 'Value': error_stats['mixed_language_sentence_count']})

        # Add Paragraph Language Counts
        for lang, count in error_stats['paragraph_language_counts'].items():
            data.append({'Statistic': f"Paragraph Language Count: {lang}", 'Value': count})

        # Convert to DataFrame
        df = pd.DataFrame(data)

        # Write DataFrame to Parquet and CSV files
        df.to_parquet(output_parquet_path, index=False)
        df.to_csv(output_csv_path, index=False)


    def generate(self, fileids=None, log_file_path=None, non_html_log_path=None):
        """
        Processes documents, logging invalid paragraphs and returning valid ones.

        Parameters:
            fileids (str or None): Specific file identifier(s) or None.
            log_file_path (str): Path for logging invalid paragraphs.
            non_html_log_path (str): Path for logging non-HTML content.

        Returns:
            tuple: A tuple containing:
                - doc_dict (list): Valid paragraphs, including those that became valid after cleaning.
                - dict: Statistics collected during processing (see get_error_statistics).
        """
        doc_dict = []
        error_dict = []
        count = 0
        all_paragraph_count = 0 

        # Open two log files: one for invalid paragraphs and one for non-HTML content
        with open(log_file_path, 'a') as invalid_log_file, open(non_html_log_path, 'a') as non_html_log_file:
            for idx, html_content in enumerate(self.docs(fileids=fileids, non_html_log_file=non_html_log_file)):
                html_content = self.replace_curly_quotes(html_content)
                validation_result = self.validate_paragraph(html_content)

                if isinstance(validation_result, bool) and validation_result:  
                    # Valid paragraph
                    all_paragraph_count += 1
                    doc_dict.append(html_content)
                else:
                    # Invalid paragraph; log to the invalid paragraphs file
                    if not isinstance(validation_result, bool):
                        invalid_log_file.write(f"Invalid Paragraph {count}: {validation_result}\n")
                        cleaned_html_content = self.remove_non_printable_chars(html_content)

                        if isinstance(self.validate_paragraph(cleaned_html_content), bool):
                            count += 1
                            doc_dict.append(cleaned_html_content)
                        else:
                            error_dict.append(cleaned_html_content)

        return doc_dict, self.get_error_statistics()
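
The cleaning helpers in the class can be tried in isolation. The sketch below mirrors the curly-quote replacement and the non-printable pattern as standalone functions; the function names `describe_non_printable` and `clean_paragraph` and the sample string are illustrative and not part of `DocumentParser`.

```python
import re
import unicodedata

# Same character ranges as the class's non-printable pattern: C0, DEL, and C1.
NON_PRINTABLE = re.compile(r'[\x00-\x1F\x7F-\x9F]')

QUOTE_REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # Single curly quotes
    "\u201C": '"', "\u201D": '"',   # Double curly quotes
}


def describe_non_printable(text):
    """Return (code point, Unicode name) pairs for control characters in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN CONTROL CHARACTER"))
        for ch in NON_PRINTABLE.findall(text)
    ]


def clean_paragraph(text):
    """Straighten curly quotes, drop control characters, and trim whitespace."""
    for curly, straight in QUOTE_REPLACEMENTS.items():
        text = text.replace(curly, straight)
    return NON_PRINTABLE.sub("", text).strip()


sample = "\u201CCases\u201D rose\x07 in the state\u2019s north.\x00"
found = describe_non_printable(sample)
cleaned = clean_paragraph(sample)
print(found)
print(cleaned)
```

Note that the range `\x00-\x1F` includes tab and newline, so this pattern removes them as well; paragraphs that rely on internal line breaks would need a narrower pattern.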

In [55]:
#corpus_path = os.path.join("topic-modeling", "data", "docs-to-process", "PROJECT_FOLDER")
corpus_path = r"C:\utma\data\docs-to-process\2024"

_corpus = DocumentParser(corpus_path)
# print filenames
_corpus.fileids()

['2024_html.json']

In [57]:
# Define generic paths for log files
base_path = r"C:\utma\data\docs-to-process\2024\log"
os.makedirs(base_path, exist_ok=True)

log_file_path = os.path.join(base_path, "paragraph_error.log")
non_html_log_path = os.path.join(base_path, "non_html_content.log")

# Run the generate function with generic paths
corpus_tuple, get_error_statistics = _corpus.generate(
    log_file_path=log_file_path,
    non_html_log_path=non_html_log_path
)

In [66]:
pp.pprint(get_error_statistics)

{'character_errors': {},
 'foreign_sentence_count': 0,
 'mixed_language_sentence_count': 0,
 'paragraph_language_counts': {'latin': 260},
 'paragraph_length_mean': 50.01538461538462,
 'paragraph_length_median': 22.0,
 'paragraph_length_mode': 1,
 'paragraph_length_stdev': 65.257078151617,
 'parsing_errors': {},
 'sentence_length_mean': 20.906752411575564,
 'sentence_length_median': 16.0,
 'sentence_length_mode': 1,
 'sentence_length_stdev': 28.420598539092097,
 'special_character_usage': {'#': 4,
                             '%': 26,
                             '&': 2,
                             '(': 206,
                             ')': 212,
                             '*': 58,
                             ',': 986,
                             '-': 246,
                             '.': 1008,
                             '/': 250,
                             ':': 94,
                             ';': 310,
                             '<': 4,
                             '=': 38

In [74]:
output_JSON_path = r'C:\utma\data\GrandUnifiedProject\statistics\2024_stats.JSON'

# Write the statistics dictionary, including its nested dictionaries, as JSON.
with open(output_JSON_path, mode="w", encoding="utf-8") as file:
    json.dump(get_error_statistics, file, indent=2, ensure_ascii=False)

print(f"Data written to {output_JSON_path}")

Data written to C:\utma\data\GrandUnifiedProject\statistics\2024_stats.JSON


In [None]:
output_parquet_path = r'C:\utma\data\GrandUnifiedProject\statistics\2010_2014_stats.parquet'
output_csv_path = r'C:\utma\data\GrandUnifiedProject\statistics\2010_2014_stats.csv'


_corpus.write_statistics(output_parquet_path, output_csv_path)

{'character_errors': {},
 'foreign_sentence_count': 0,
 'mixed_language_sentence_count': 0,
 'paragraph_language_counts': {},
 'paragraph_length_mean': 0,
 'paragraph_length_median': 0,
 'paragraph_length_mode': 0,
 'paragraph_length_stdev': 0,
 'parsing_errors': {},
 'sentence_length_mean': 0,
 'sentence_length_median': 0,
 'sentence_length_mode': 0,
 'sentence_length_stdev': 0,
 'special_character_usage': {},
 'unique_token_count': 0}


In [None]:
# Flags controlling preprocessing: whether stop words are kept, and whether tokens are lemmatized (reduced to their base form).
INCLUDE_STOPWORDS = True
LEMMATIZATION = False

In [None]:
texts_out = []       # List to store the processed tokens for each paragraph

# Revised processing to retain all parts of speech and keep stop words
for paras in tqdm(corpus_tuple, total=len(corpus_tuple)):
    doc = nlp(paras)  # Process the paragraph with spaCy NLP pipeline
    inner_text = []   # Tokens for the current paragraph; reset for every paragraph

    for token in doc:
        # Consider all tokens, not just content words, and reduce character length threshold if necessary
        if len(token.text) > 1:  # Example threshold adjustment to keep even short words if relevant
            # Optionally include or remove stop words
            if INCLUDE_STOPWORDS or (token.text.lower() not in stop_words and token.lemma_.lower() not in stop_words):
                inner_text.append(token.text if not LEMMATIZATION else token.lemma_)

    # Append processed tokens of the current paragraph to texts_out if not empty
    if inner_text:
        texts_out.append(inner_text)