# Document Preprocessing Notebook for Unified Topic Modeling and Analysis (UTMA) Pipeline

This notebook preprocesses documents for the Unified Topic Modeling and Analysis (UTMA) pipeline so that `utma.py` and its related scripts can ingest them directly. UTMA combines topic modeling, diachronic analysis, and visualization across diverse textual corpora. This notebook standardizes the input documents so they are compatible with UTMA's topic modeling and evaluation stages.

### Notebook Objectives
- **Load and Transform Documents**: Imports and structures text from HTML and JSON files, preparing them for UTMA's topic analysis.
- **Data Cleansing**: Standardizes text by removing curly quotes, non-printable characters, and other inconsistencies.
- **Content Structuring**: Extracts text using BeautifulSoup and regex, arranging it into paragraphs and sentences.
- **Output Preparation**: Produces a processed corpus in a format optimized for `utma.py` ingestion, supporting downstream analysis and visualization.

### Dependencies and Related Components
This notebook is designed to work alongside the following scripts:
- **`utma.py`**: Coordinates topic modeling, diachronic analysis, and evaluation.
- **`alpha_eta.py`**: Supports hyperparameter tuning for model optimization.
- **`process_futures.py`**: Manages asynchronous processing, essential for handling large datasets efficiently.
- **`topic_model_trainer.py`**: Defines and trains the topic models used in UTMA.
- **`visualization.py`**: Generates visualizations for model insights and evaluation.
- **`write_to_postgres.py`**: Facilitates data persistence into PostgreSQL, supporting structured data retrieval.
- **`utils.py`**: Provides utility functions to enhance efficiency and consistency across the UTMA pipeline.

### Workflow Overview
1. **Document Loading**: Reads HTML or JSON files, detecting encoding where necessary.
2. **Content Extraction**: Extracts structured content, normalizing punctuation and replacing curly quotes for text consistency.
3. **Data Output**: Saves the processed content in a structured format for direct use in `utma.py`.

Running this notebook prepares a clean, standardized corpus compatible with the UTMA pipeline, optimizing input quality for topic modeling and diachronic analysis.
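
The content extraction step can be illustrated without any third-party packages. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, which is what the notebook itself uses; the `ParagraphExtractor` class and the sample markup are hypothetical and only show the idea of collecting the text of each `<p>` element.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text pieces, including text from nested tags such as <b>.
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


# Hypothetical sample document.
sample = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"

parser = ParagraphExtractor()
parser.feed(sample)
parser.close()
print(parser.paragraphs)
```

Text outside `<p>` elements, such as the heading, is ignored. This mirrors the notebook's choice of `TAGS = ['p']`.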


# Environment Setup Instructions

> ⚠️ **Important:** Before running the package installation steps in this notebook, you must first create the Conda environment from the Anaconda Prompt.
>
> 1. **Open Anaconda Prompt**.
> 2. **Create the Environment**: Run the following command in Anaconda Prompt:
>    ```bash
>    conda create --name UTMA python=3.12.0
>    ```
> 3. **Verify Environment Creation**: Check that the `UTMA` environment was created by listing all environments:
>    ```bash
>    conda env list
>    ```
>
> 4. **Restart the Notebook**: Close this notebook and reopen it from the newly created `UTMA` environment. Then continue with the installation cell below, which installs the packages listed in the requirements file into `UTMA`.


In [None]:
# Ensure that you're using the newly created environment
%pip install -r requirements.txt
!python -m spacy download en_core_web_lg
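
Before installing anything, it can help to confirm which environment the kernel is actually running in. The following is a minimal sketch, assuming Conda sets the `CONDA_DEFAULT_ENV` variable for the active environment; the helper names `active_env_name` and `is_expected_env` are hypothetical.

```python
import os
import sys


def active_env_name() -> str:
    """Return the name of the active environment, as best as can be inferred."""
    # Conda normally sets CONDA_DEFAULT_ENV when an environment is activated.
    name = os.environ.get("CONDA_DEFAULT_ENV")
    if name:
        return name
    # Fall back to the last path component of the interpreter prefix.
    return os.path.basename(sys.prefix.rstrip("\\/"))


def is_expected_env(expected: str = "UTMA") -> bool:
    """True when the kernel appears to run inside the expected environment."""
    return active_env_name() == expected


print(f"Active environment: {active_env_name()}")
print(f"Running in UTMA: {is_expected_env()}")
```

If the second line prints `False`, the packages would be installed into a different environment than the one created above.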

In [None]:
# Standard Library Imports
import os            # Provides functions for interacting with the operating system, e.g., file handling.
import re            # Supports regular expressions for text manipulation and pattern matching.
import csv           # Facilitates reading from and writing to CSV files.
import json          # Enables reading from and writing to JSON files, often used for structured data.
from time import time  # Allows timing operations for performance measurement.
import logging       # Provides logging functionalities to monitor code execution.
import codecs        # Handles different text encodings, important for text data processing.
import multiprocessing  # Supports parallel processing, enhancing performance on large datasets.
import pprint as pp  # Pretty-prints data structures, useful for debugging and visualization.
from hashlib import md5  # Provides MD5 hashing, useful for generating unique IDs or checking data integrity.

# Encoding and Parsing Imports
import chardet        # Detects character encoding of text files, allowing for accurate reading of various encodings.
import unicodedata    # Handles Unicode character information, useful for identifying and removing non-printable characters.
from bs4 import BeautifulSoup  # Parses HTML content, enabling extraction of specific tags (e.g., <p> tags) for processing.


# NLTK Imports
import nltk                             # Natural Language Toolkit, a suite for text processing.
from nltk.corpus import stopwords       # Provides lists of stop words to remove from text.
from nltk.corpus.reader.api import CorpusReader  # Base class for reading and structuring corpora.
from nltk import sent_tokenize, pos_tag, wordpunct_tokenize  # Tokenizers and POS tagger for sentence processing.
stop_words = set(stopwords.words('english'))  # English stop words as a set, so custom terms can be added later with update().

# Gensim Imports
import gensim                           # Library for topic modeling and word vector creation.
from gensim.models import Word2Vec, ldamulticore  # Word2Vec for word embeddings, ldamulticore for topic modeling.
from gensim.models.phrases import Phrases, Phraser  # Constructs multi-word phrases (e.g., bigrams) from tokenized text.
import gensim.corpora as corpora        # Handles creation of dictionaries and corpora from text data.
from gensim.utils import simple_preprocess  # Preprocesses text into a list of tokens.

# SpaCy Import (specific model)
import en_core_web_lg                   # SpaCy's large English NLP model for advanced text processing.
nlp = en_core_web_lg.load(disable=['parser','ner'])  # Loads model, with parsing and named entity recognition disabled.

# Readability Import
from readability.readability import Unparseable  # Exception handling for parsing errors in HTML.
from readability.readability import Document as Paper  # Extracts readable content from HTML, discarding noise.

# BeautifulSoup and HTML5lib
import bs4                                # BeautifulSoup for parsing HTML and XML documents.
import html5lib                           # Parser for HTML5, used by BeautifulSoup for web scraping.

# Data Processing and Scientific Libraries
import numpy as np                        # Supports efficient numerical operations on large arrays and matrices.
import pandas as pd                       # Data analysis library for handling structured data (e.g., DataFrames).
from sklearn.manifold import TSNE         # Dimensionality reduction for visualizing high-dimensional data.
from matplotlib import pyplot as plt      # Plotting library for creating visualizations.
from tqdm import tqdm                     # Adds progress bars to loops, useful for monitoring lengthy operations.


In [None]:
# Load custom stop words based on document context
with open('config/custom_stopwords.json', 'r', encoding='utf-8') as file:
    custom_stopwords = json.load(file)

# Add CDC MMWR-specific stop words, or any additional terms the user finds appropriate
# This approach enables the user to include context-specific stop words, not limited to MMWR.
# Users can add any terms they consider irrelevant or noise within the custom_stopwords.json file, 
# making this process adaptable to various document types or specific text analysis needs.
stop_words.update(custom_stopwords.get("cdc_mmwr", []))
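
Because the cell above reads the `cdc_mmwr` key with `.get(..., [])`, the configuration file is presumably a JSON object that maps a context name to a list of terms. The sketch below shows that shape with placeholder terms; the sample contents and the `merge_stopwords` helper are hypothetical.

```python
import json

# Hypothetical contents of config/custom_stopwords.json; the terms are placeholders.
example_config = '{"cdc_mmwr": ["mmwr", "report", "weekly"]}'


def merge_stopwords(base_words, config_text, key):
    """Return a set holding the base stop words plus the list stored under `key`."""
    custom = json.loads(config_text)
    merged = set(base_words)
    # A missing key contributes nothing rather than raising KeyError.
    merged.update(custom.get(key, []))
    return merged


merged = merge_stopwords(["the", "and"], example_config, "cdc_mmwr")
print(sorted(merged))
```

Keeping the stop words in a set makes membership tests fast during the token filtering step and allows new terms to be merged in with `update()`.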

In [None]:
DOC_ID = r'.*[\d\w\-.]+\.(html|json)$'  # Regular expression pattern to identify document filenames ending in .html or .json.

TAGS = ['p']  # List of HTML tags to extract content from; 'p' is commonly used to denote paragraphs in HTML.

# Set flag to determine if lemmatization (reducing words to their base form) will be performed during preprocessing.
LEMMATIZATION = True
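
To see which files the `DOC_ID` pattern selects, it can be tried against a few names. This is a small sketch with hypothetical file names; it uses `re.match`, which anchors at the start of the string.

```python
import re

DOC_ID = r'.*[\d\w\-.]+\.(html|json)$'

# Hypothetical file names, used only to illustrate the pattern.
candidates = ["mm6401a1.html", "2015/archive.json", "notes.txt", "page.html.bak"]

matched = [name for name in candidates if re.match(DOC_ID, name)]
print(matched)
```

The pattern is case-sensitive as written, so a name such as `REPORT.HTML` is not selected unless the pattern is compiled with `re.IGNORECASE`.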

In [None]:
import codecs
import json
from bs4 import BeautifulSoup
import re
import nltk
from time import time
from hashlib import md5
import chardet
import unicodedata
from nltk.corpus.reader.api import CorpusReader  # Only import CorpusReader now

class HTMLReader(CorpusReader):
    
    def __init__(self, root, tags=TAGS, fileids=DOC_ID, **kwargs):
        CorpusReader.__init__(self, root, fileids)
        self.tags = tags

    def resolve(self, fileids=None):
        return fileids  # Simplified as we no longer need categories

    def detect_encoding(self, file_path):
        with open(file_path, 'rb') as f:
            raw_data = f.read()
        return chardet.detect(raw_data)['encoding']
    
    def process_content(self, content, non_html_log_file):
        """
        Process content based on whether it contains HTML tags.
        If HTML tags are detected, extract <p> tags; otherwise, treat it as plain text.
        """
        content = content.replace("\\n", "\n").strip()  # Clean escape sequences

        # Check for specific HTML tags (like <p>) to determine if it's HTML
        if "<p>" in content or "</p>" in content:
            # Process as HTML
            soup = BeautifulSoup(content, 'html.parser')
            paragraphs = soup.find_all('p')
            for p in paragraphs:
                yield p.get_text()
        else:
            # Log non-HTML content when a log file was supplied (html() and paras() call docs() without one)
            if non_html_log_file is not None:
                non_html_log_file.write(f"Skipping non-HTML content: {content[:100]}...\n")  # First 100 chars as a preview
            yield None  # Signal to the caller that there is nothing to keep

    def docs(self, fileids=None, pattern=None, non_html_log_file=None):
        fileids = self.resolve(fileids)
        
        # Compile the regular expression pattern, or use a default if none is provided
        if pattern is not None:
            regex = re.compile(pattern, re.IGNORECASE)
        else:
            regex = re.compile(r'.*([\d\w\-.]+)\.(html|json)$', re.IGNORECASE)
            
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            if regex.search(path):
                encoding = self.detect_encoding(path)
                
                # Check if the file is JSON by extension
                if path.lower().endswith('.json'):
                    with codecs.open(path, 'r', encoding=encoding) as f:
                        try:
                            # Load JSON content
                            data = json.load(f)
                            # Process each HTML snippet in JSON directly for <p> tags
                            if isinstance(data, list):
                                for item in data:
                                    if isinstance(item, str):
                                        # Process each content item with HTML or plain text
                                        for content in self.process_content(item, non_html_log_file):
                                            if content:
                                                yield self.replace_curly_quotes(content)
                            else:
                                print(f"Error: {path} does not contain a list of HTML strings.")
                        except json.JSONDecodeError:
                            print(f"Error: {path} is not a valid JSON file.")
                
                elif path.lower().endswith('.html'):
                    # Process as regular HTML file
                    with codecs.open(path, 'r', encoding=encoding) as f:
                        doc_content = f.read()
                        # Process as HTML or plain text
                        for content in self.process_content(doc_content, non_html_log_file):
                            if content:
                                yield self.replace_curly_quotes(content)
                else:
                    print(f"Unsupported file type: {path}. Only JSON and HTML files are allowed.")
                    continue

    def html(self, fileids=None):
        for doc in self.docs(fileids):
            try:
                yield doc
            except Exception as e:
                print("Could not parse HTML: {}".format(e))
                continue

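    def replace_curly_quotes(self, text):
        quote_replacements = {
            "\u2018": "'",  # Left single quotation mark
            "\u2019": "'",  # Right single quotation mark
            "\u201C": '"',  # Left double quotation mark
            "\u201D": '"',  # Right double quotation mark
        }

        for curly_quote, straight_quote in quote_replacements.items():
            # Replace the character itself...
            text = text.replace(curly_quote, straight_quote)
            # ...and any literal "\uXXXX" escape text left in the source, in either letter case
            escaped = "\\u{:04X}".format(ord(curly_quote))
            text = text.replace(escaped, straight_quote).replace(escaped.lower(), straight_quote)

        return text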

    def remove_non_printable_chars(self, text):
        non_printable_pattern = re.compile(r'[\x00-\x1F\x7F-\x9F]+')  # C0 and C1 control characters
        cleaned_text = re.sub(non_printable_pattern, '', text)
        return cleaned_text.strip()

    def get_invalid_character_names(self, text):
        char_names = set()
        non_printable_pattern = re.compile(r'[\x00-\x1F\x7F-\x9F]')  # C0 and C1 control characters
        invalid_chars = non_printable_pattern.findall(text)
        for char in invalid_chars:
            try:
                name = unicodedata.name(char)
            except ValueError:
                name = "UNKNOWN CONTROL CHARACTER"
            char_names.add(name)
        return char_names
        
    def validate_paragraph(self, paragraph):
        reasons = []
        if not paragraph.strip():
            reasons.append("Only whitespace")

        invalid_char_names = self.get_invalid_character_names(paragraph)
        
        if invalid_char_names:
            reason = f"Contains non-printable characters: {', '.join(invalid_char_names)}"
            reasons.append(reason)

        return True if not reasons else ', '.join(reasons)
    
    para_dict = dict()
    def paras(self, parser_type='lxml', fileids=None):
        for html in self.html(fileids):
            # Check if html content looks like an HTML string
            if not isinstance(html, str) or "<" not in html:
                print(f"Skipping non-HTML content: {html}")
                continue
            
            soup = BeautifulSoup(html, parser_type)
            
            # Join tags into a CSS selector if `self.tags` is a list
            tag_selector = ",".join(self.tags) if isinstance(self.tags, list) else self.tags
            
            for element in soup.select(tag_selector):
                text = element.text.strip()
                yield text

    sent_dict = dict()
    def sents(self, fileids=None):
        for paragraph in self.paras(fileids=fileids):
            for sentence in nltk.sent_tokenize(paragraph): 
                yield sentence

    word_dict = dict()
    def words(self, fileids=None): 
        for sentence in self.sents(fileids=fileids):
            for token in nltk.wordpunct_tokenize(sentence):
                yield token

    def generate(self, fileids=None, log_file_path=None, non_html_log_path=None):
        doc_dict = []
        error_dict = []
        count = 0
        all_paragraph_count = 0 

        # Open two log files: one for invalid paragraphs and one for non-HTML content
        with open(log_file_path, 'a', encoding='utf-8') as invalid_log_file, open(non_html_log_path, 'a', encoding='utf-8') as non_html_log_file:
            for idx, html_content in enumerate(self.docs(fileids=fileids, non_html_log_file=non_html_log_file)):
                html_content = self.replace_curly_quotes(html_content)
                validation_result = self.validate_paragraph(html_content)

                if isinstance(validation_result, bool) and validation_result:  
                    # Valid paragraph
                    all_paragraph_count += 1
                    doc_dict.append(html_content)
                else:
                    # Invalid paragraph; log to the invalid paragraphs file
                    if not isinstance(validation_result, bool):
                        invalid_log_file.write(f"Invalid Paragraph {count}: {validation_result}\n")
                        cleaned_html_content = self.remove_non_printable_chars(html_content)

                        if isinstance(self.validate_paragraph(cleaned_html_content), bool):
                            count += 1
                            doc_dict.append(cleaned_html_content)
                        else:
                            error_dict.append(cleaned_html_content)

        return doc_dict, error_dict, count, all_paragraph_count
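
The two cleaning steps used by `HTMLReader`, quote normalization and control-character removal, can be tried in isolation. The following is a minimal standalone sketch; `normalize_quotes` and `strip_control_chars` are hypothetical names, not methods of the class.

```python
import re

# Each key is a single curly-quote character.
QUOTE_MAP = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
}

# C0 and C1 control characters; note that this range includes tab and newline.
CONTROL_CHARS = re.compile(r"[\x00-\x1F\x7F-\x9F]+")


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with their straight ASCII equivalents."""
    return text.translate(str.maketrans(QUOTE_MAP))


def strip_control_chars(text: str) -> str:
    """Remove control characters and trim surrounding whitespace."""
    return CONTROL_CHARS.sub("", text).strip()


sample = "\u201CIt\u2019s fine\u201D\x00\x07"
print(strip_control_chars(normalize_quotes(sample)))
```

Because tab and newline fall inside the control range, a paragraph with an embedded line break is treated as needing cleanup, and the break is removed rather than replaced with a space.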



> ⚠️ **Important:** Verify File Path

In [None]:
_corpus = HTMLReader(r'C:\topic-modeling\data\docs-to-process\2015')
_corpus.fileids()

> ⚠️ **Important:** Verify file output location and filenames

In [None]:
corpus_tuple, errors, count, all_paragraph_count = _corpus.generate(
    log_file_path=r"C:\topic-modeling\data\docs-to-process\2015\paragraph_error.log",
    non_html_log_path=r"C:\topic-modeling\data\docs-to-process\2015\non_html_content.log"
)

print(count)
print(all_paragraph_count)
print(len(corpus_tuple))
# 2010: 977438, 2011:
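
The four values returned by `generate()` are related: every kept paragraph was either valid as read or valid after cleaning, so the length of the kept list should equal the sum of the two counters. The sketch below checks that relationship; `summarize_generate` and the sample values are hypothetical.

```python
def summarize_generate(kept, errors, cleaned_count, valid_count):
    """Summarize the four values returned by generate() and check they agree."""
    return {
        "kept": len(kept),
        "kept_as_is": valid_count,
        "kept_after_cleaning": cleaned_count,
        "discarded": len(errors),
        "total_seen": len(kept) + len(errors),
        # Every kept paragraph was either valid as-is or valid after cleaning.
        "consistent": len(kept) == valid_count + cleaned_count,
    }


# Hypothetical values standing in for the output of _corpus.generate(...).
summary = summarize_generate(
    kept=["para one", "para two", "para three"],
    errors=[""],
    cleaned_count=1,
    valid_count=2,
)
print(summary)
```

In the notebook this would be called as `summarize_generate(corpus_tuple, errors, count, all_paragraph_count)`.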

In [None]:
for txt in errors:
    print(txt)

In [None]:
from time import time
import spacy

texts_out = []
inner_text = []

# number of stopwords found
stopword_count = nltk.FreqDist()

pp.pprint(f"Executing POS/LEMMATIZATION({LEMMATIZATION})")

t = time()
for paras in tqdm(corpus_tuple): #, total=9168220):
    doc = nlp(paras)
    
    for token in doc:
        # Keep only content-bearing parts of speech with at least five characters.
        if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV'] and len(token.text) >= 5:
            # Use the lemma or the surface form, depending on the LEMMATIZATION flag.
            term = token.lemma_ if LEMMATIZATION else token.text
            if token.text.lower() not in stop_words and token.lemma_.lower() not in stop_words:
                inner_text.append(term)
            else:
                # Tally terms that were dropped as stop words.
                stopword_count[term] += 1

    if len(inner_text) > 0:
        texts_out.append(inner_text)
    inner_text = []

#pp.pprint(texts_out)
pp.pprint('Time to finish spaCy filter: {} mins'.format(round((time() - t) / 60, 2)))
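
The filtering rule in the cell above can be examined without loading the spaCy model. The sketch below applies the same three tests (part of speech, minimum length, stop-word membership) to plain `(text, lemma, pos)` triples that stand in for spaCy tokens; `filter_tokens` and the sample tokens are hypothetical.

```python
from collections import Counter

KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}
MIN_LENGTH = 5


def filter_tokens(tokens, stop_words, lemmatize=True):
    """Filter (text, lemma, pos) triples; return kept terms and dropped stop words."""
    kept, dropped = [], Counter()
    for text, lemma, pos in tokens:
        if pos not in KEEP_POS or len(text) < MIN_LENGTH:
            continue
        term = lemma if lemmatize else text
        if text.lower() in stop_words or lemma.lower() in stop_words:
            dropped[term] += 1
        else:
            kept.append(term)
    return kept, dropped


# Hypothetical tagged tokens; in the notebook these come from nlp(paragraph).
tokens = [
    ("Outbreaks", "outbreak", "NOUN"),
    ("were", "be", "AUX"),
    ("reported", "report", "VERB"),
    ("across", "across", "ADP"),
    ("several", "several", "ADJ"),
    ("counties", "county", "NOUN"),
]
kept, dropped = filter_tokens(tokens, stop_words={"report"})
print(kept)
print(dropped)
```

The length test is applied to the surface text rather than the lemma, so a long word with a short lemma is still kept.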

In [None]:
# Keep only token lists with more than five tokens.
texts_for_processing = [sent for sent in texts_out if len(sent) > 5]
del texts_out

In [None]:
print(len(texts_for_processing))
for text in texts_for_processing:
    print(text)

> ⚠️ **Important:** Verify file output location and filename

In [None]:
import json

def write_tokenized_sentences_to_jsonl(sentences, output_file):
    with open(output_file, 'w', encoding='utf-8') as f:
        for sentence in sentences:
            json_line = json.dumps(sentence, ensure_ascii=False)
            f.write(json_line + '\n')

write_tokenized_sentences_to_jsonl(texts_for_processing, r'C:\topic-modeling\data\processed-docs\2015\2015.jsonl')
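
A quick way to verify the JSONL output is to read it back and compare it with what was written. The following sketch does a round trip through a temporary file; `read_jsonl` and the sample token lists are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one token list per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical token lists standing in for texts_for_processing.
docs = [["vaccine", "coverage", "estimate"], ["outbreak", "investigation"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    # Write one JSON array per line, as the notebook's writer does.
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    restored = list(read_jsonl(path))

print(restored == docs)
```

Reading line by line keeps memory use low, which is the main reason to prefer JSONL over a single large JSON array for big corpora.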

> ⚠️ **Important:** Verify file output location and filename

In [None]:
filename2 = r"C:\topic-modeling\data\processed-docs\2015\2015.json"
with open(filename2, 'w', encoding='utf-8') as jsonfile:
    json.dump(texts_for_processing, jsonfile, indent=1, ensure_ascii=False)

In [None]:
# Compute bigrams.
from gensim.models import Phrases

# Detect bigrams with gensim's default settings (min_count=5, threshold=10.0).
# Pass min_count=20 to keep only pairs that appear 20 times or more.
bigram = Phrases(texts_for_processing)

# freqDist object for bigrams
bigram_freq = nltk.FreqDist()

# print bigrams
for ngrams, _ in bigram.vocab.items():
    #unicode_ngrams = ngrams.decode('utf-8')
    if '_' in ngrams:
        bigram_freq[ngrams]+=1
        print(ngrams)

# Append detected bigrams to each document in texts_for_processing so they are included in the corpus
for idx in range(len(texts_for_processing)):
    for token in bigram[texts_for_processing[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            texts_for_processing[idx].append(token)
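
Whether `Phrases` joins a word pair depends on its `min_count` and `threshold` parameters. The sketch below is a plain-Python rendering of the default scoring formula as described in gensim's documentation; the function names and counts are hypothetical, and gensim's own implementation should be treated as authoritative.

```python
def original_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram; higher means the pair co-occurs more than chance."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


def is_phrase(count_a, count_b, count_ab, vocab_size, min_count=5, threshold=10.0):
    """A pair is joined when its score exceeds the threshold."""
    return original_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold


# Hypothetical counts: word a seen 50 times, word b 40 times, the pair 12 times.
counts = dict(count_a=50, count_b=40, count_ab=12, vocab_size=10000)

print(is_phrase(**counts, min_count=5))
print(is_phrase(**counts, min_count=20))
```

With these counts the pair passes under the default `min_count` but not under `min_count=20`, because the discounted pair count becomes negative.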

> ⚠️ **Important:** Verify file output location and filename

In [None]:
filename3 = r"C:\topic-modeling\data\processed-docs\2015\2015-w-bigrams.json"
with open(filename3, 'w', encoding='utf-8') as jsonfile:
    json.dump(texts_for_processing, jsonfile, ensure_ascii=False)

> ⚠️ **Important:** Verify file output location and filename

In [None]:
write_tokenized_sentences_to_jsonl(texts_for_processing, r"C:\topic-modeling\data\processed-docs\2015\2015-w-bigrams.jsonl")