# Document Preprocessing for Unified Topic Modeling and Analysis (UTMA) Pipeline

This notebook serves as the preprocessing stage for the Unified Topic Modeling and Analysis (UTMA) pipeline. The UTMA pipeline, primarily driven by `utma.py`, is designed to perform topic modeling, diachronic analysis, and visualization on diverse textual corpora. This notebook prepares documents for efficient intake by `utma.py`, ensuring compatibility with the UTMA pipeline's requirements and maintaining consistency across different data sources.

### Notebook Objectives
- **Load and Process Documents**: Reads HTML and JSON documents and standardizes their content.
- **Curly Quote Replacement and Validation**: Cleans up common issues such as curly quotes and non-printable characters to ensure text integrity.
- **Extract and Organize Content**: Utilizes BeautifulSoup and regex to extract relevant content based on tag configurations, organizing text into paragraphs and sentences.
- **Output for UTMA Intake**: Produces a structured, cleaned corpus that the `utma.py` script can ingest directly, enabling downstream analysis and visualization steps in the UTMA pipeline.
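
The cleaning objective above can be illustrated with a minimal, standard-library sketch. The names `QUOTE_MAP`, `CONTROL_CHARS`, and `clean_text` are hypothetical and chosen only for this example; the notebook's own implementation lives in the `HTMLReader` class further down and may differ in detail.

```python
import re

# Hypothetical names for illustration; maps curly quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
})

# C0 and C1 control characters (this range also covers tabs and newlines).
CONTROL_CHARS = re.compile(r"[\x00-\x1F\x7F-\x9F]+")

def clean_text(text: str) -> str:
    """Replace curly quotes with straight ones and drop control characters."""
    return CONTROL_CHARS.sub("", text.translate(QUOTE_MAP)).strip()

sample = "\u201CIt\u2019s fine\u201D\x07"
cleaned = clean_text(sample)
print(cleaned)  # "It's fine"
```

Because the control-character range includes tabs and newlines, this sketch is best applied to single paragraphs rather than to whole multi-line documents.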

### Dependencies and Related Scripts
This notebook works in conjunction with several key components of the UTMA project:
- **utma.py**: The main script for orchestrating topic modeling, diachronic analysis, and evaluation.
- **alpha_eta.py**: Supports tuning of hyperparameters such as alpha and eta for topic modeling.
- **process_futures.py**: Manages asynchronous processing, crucial for efficient handling of large corpora.
- **topic_model_trainer.py**: Defines and trains the topic models.
- **visualization.py**: Generates visualizations for model analysis.
- **write_to_postgres.py**: Handles the persistence of processed data into PostgreSQL databases for structured storage and querying.
- **utils.py**: Provides utility functions used across the UTMA pipeline for consistency and efficiency.

### Workflow Overview
1. **Document Loading**: Imports HTML or JSON files, detecting encoding where necessary.
2. **Content Extraction and Preprocessing**: Extracts paragraphs and sentences, normalizes punctuation, and replaces curly quotes to enhance text consistency.
3. **Output for UTMA Intake**: Saves the preprocessed content in a structured format ready for direct intake by `utma.py`.
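
As a rough illustration of the output step, the sketch below writes tokenized documents as JSON Lines (one JSON list of tokens per line) and reads them back. The helper names `write_jsonl` and `read_jsonl`, the sample tokens, and the temporary file name are assumptions made for this example; how `utma.py` actually reads its input is not shown in this notebook.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(docs, path):
    """Write each tokenized document as one JSON array per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in docs:
            f.write(json.dumps(tokens, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back into a list of token lists, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up sample documents, used only to demonstrate the round trip.
docs = [["vaccine", "coverage", "increase"], ["outbreak", "investigation"]]
out_path = Path(tempfile.gettempdir()) / "utma_example.jsonl"

write_jsonl(docs, out_path)
round_trip = read_jsonl(out_path)
print(round_trip == docs)  # True
```

One line per document keeps the file streamable, so a large corpus can be consumed without loading everything into memory at once.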

By running this notebook, you’ll prepare a clean, standardized corpus compatible with the UTMA pipeline, ensuring optimal input quality for topic modeling and analysis.


In [None]:
# Prompt the user for permission to check and install required packages
response = input("Do you want to check that the required packages, including the correct versions, are installed? (yes/no): ").strip().lower()
check = response in ["yes", "y"]

In [None]:
import importlib
import subprocess
import sys
import nltk
# Required libraries with their respective install names and versions (if needed)
libraries = {
    "nltk": {"install_name": "nltk", "version": "3.6.5", "nltk_data": "stopwords"},
    "bs4": {"install_name": "beautifulsoup4"},
    "gensim": {"install_name": "gensim", "version": "4.0.1"},
    "readability.readability": {"install_name": "readability-lxml"},
    "html5lib": {"install_name": "html5lib"},
    "tqdm": {"install_name": "tqdm", "version": "4.62.3"},
    "sklearn.manifold": {"install_name": "scikit-learn"},
    "matplotlib": {"install_name": "matplotlib", "version": "3.4.3"},
    "numpy": {"install_name": "numpy", "version": "1.21.2"},
    "pandas": {"install_name": "pandas", "version": "1.3.3"},
    "spacy": {"install_name": "spacy", "spacy_model": "en_core_web_lg"},
    "chardet": {"install_name": "chardet"},  # imported below for encoding detection
}

def check_and_install(package_name, install_name=None, spacy_model=None, version=None, nltk_data=None):
    """
    Checks if a package is installed; if not, prompts the user for permission to install it.
    Supports specifying version, spaCy model downloads, and NLTK datasets.
    """
    try:
        importlib.import_module(package_name)
        if spacy_model:  # Special case for spaCy models
            import spacy
            if not spacy.util.is_package(spacy_model):
                raise ImportError(f"{spacy_model} model not found")
        if nltk_data:  # Check for required NLTK data (e.g., stopwords)
            nltk.data.find(f"corpora/{nltk_data}")
    except (ImportError, LookupError):
        response = input(f"'{package_name}' or required data is missing. Would you like to install it? (yes/no): ").strip().lower()
        if response in ["yes", "y"]:
            if install_name:
                install_command = [sys.executable, "-m", "pip", "install"]
                if version:
                    install_command.append(f"{install_name}=={version}")
                else:
                    install_command.append(install_name)
                subprocess.check_call(install_command)
            if spacy_model:
                subprocess.check_call([sys.executable, "-m", "spacy", "download", spacy_model])
            if nltk_data:
                nltk.download(nltk_data)
        else:
            print(f"Skipping installation of '{package_name}' or required data. The program may not run as expected.")

def check_libraries(check=True):
    if not check:
        print("Skipping package check and installation as per user input.")
        return
    
    for package_name, details in libraries.items():
        check_and_install(
            package_name,
            install_name=details.get("install_name"),
            spacy_model=details.get("spacy_model"),
            version=details.get("version"),
            nltk_data=details.get("nltk_data")
        )

# Call to check and install libraries if needed
check_libraries(check=check)

# Import the stopwords corpus now that NLTK and its data are available
from nltk.corpus import stopwords
stop_words = stopwords.words('english')


In [None]:

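# Regex matched against file IDs: any path ending in .html or .json
DOC_ID = r'.*[\d\w\-.]+\.(html|json)$'

# HTML tags whose text is extracted as paragraphs
TAGS = ['p']

# Base English stop words from NLTK (re-created here so that re-running
# this cell does not extend an already extended list)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')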

# Corpus-specific stop words observed in MMWR journal boilerplate
stop_words.extend(['icon', 'website', 'mmwr', 'citation', 'author', 'report', 'formatting', "format",'regarding',
                   'system', 'datum', 'link', 'linking', 'federal', 'data', 'tract', 'census', 'study',"question",
                   'conduct', 'report', 'including', 'top', 'summary', 'however', 'name', 'known', 'figure', 'return', 
                   'page', 'view', 'affiliation', 'pdf', 'law', 'version', 'list', 'endorsement', "review",
                   'article', 'download', 'reference', 'publication', 'discussion', 'table', 'vol', "message",
                   'information', 'web', 'notification', 'policy', 'policie', #spaCy lemmatization can make errors with pluralization(e.g. rabie for rabies)
                   'acknowledgment', 'altmetric', 'health',
                   'abbreviation', 'figure', "service","imply","current","source",
                   "trade","address", "addresses","program","organization" ,"provided", "copyrighted", "copyright",
                   "already", "topic", "art", 'e.g', 'eg',
                   'generated', 'proofs', 'automated', 'process', 'conversion', 'result', 'character', 'translation', 'errors', 'referred', 'electronic', 'original', 'printable', 'official',
                   'Use', 'of', 'trade', 'names', 'and', 'commercial', 'sources', 'is', 'for', 'identification', 'only', 'and', 'does', 'not', 'imply', 'endorsement', 'by', 'the', 'Department', 'of', 
                   'Health', 'and', 'Human', 'Services', 'References', 'to', 'CDC', 'sites', 'on', 'the', 'Internet', 'are', 'provided', 'as', 'a', 'service', 'to', 'MMWR', 'readers', 'and', 'do', 
                   'not', 'constitute', 'or', 'imply', 'endorsement', 'of', 'these', 'organizations', 'or', 'their', 'programs', 'by', 'CDC', 'or', 'the', 'Department', 'of', 'Health', 'and', 
                   'Human', 'Services', 'CDC', 'is', 'not', 'responsible', 'for', 'the', 'content', 'of', 'pages', 'found', 'at', 'these', 'sites', 'URL', 'addresses', 'listed', 'in', 'MMWR', 
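                   'were', 'current', 'as', 'of', 'the', 'date', 'of', 'publication'])

# Lowercase and de-duplicate: the token filter compares lowercased tokens, so capitalized
# entries such as 'Department' would otherwise never match. A set also speeds up lookups.
stop_words = set(word.lower() for word in stop_words)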
# pretrained model for POS tagging/filtering (parser and NER are disabled for speed)
import spacy
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

# set encoding for CorpusReader class
ENCODING = 'latin1'

# SET DIR PATHS
JSON_OUT = r"C:\topic-modeling\data\documents\2015"

# set value to determine if lemmatization will be performed
LEMMATIZATION = True

In [None]:
import codecs
import json
from bs4 import BeautifulSoup
import re
import nltk
from time import time
from hashlib import md5
import chardet
import unicodedata
from nltk.corpus.reader.api import CorpusReader  # Only import CorpusReader now

class HTMLReader(CorpusReader):
    
    def __init__(self, root, tags=TAGS, fileids=DOC_ID, **kwargs):
        CorpusReader.__init__(self, root, fileids)
        self.tags = tags

    def resolve(self, fileids=None):
        return fileids  # Simplified as we no longer need categories

    def detect_encoding(self, file_path):
        with open(file_path, 'rb') as f:
            raw_data = f.read()
        # chardet reports None when it cannot determine an encoding; fall back to UTF-8
        return chardet.detect(raw_data)['encoding'] or 'utf-8'
    
    def docs(self, fileids=None, pattern=None):
        fileids = self.resolve(fileids)
        
        # Compile the regular expression pattern, or use a default if none is provided
        if pattern is not None:
            regex = re.compile(pattern, re.IGNORECASE)
        else:
            regex = re.compile(r'.*([\d\w\-.]+)\.(html|json)$', re.IGNORECASE)
            
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            if regex.search(path):
                encoding = self.detect_encoding(path)
                
                # Check if the file is JSON by extension
                if path.lower().endswith('.json'):
                    with codecs.open(path, 'r', encoding=encoding) as f:
                        try:
                            # Load JSON content
                            data = json.load(f)
                            # If the JSON contains a list of HTML strings or lists of strings, process each one
                            if isinstance(data, list):
                                for item in data:
                                    if isinstance(item, str):  # Single HTML string
                                        #print(f"Yielding HTML from JSON: {item[:30]}...")  # Debug print
                                        yield self.replace_curly_quotes(item)
                                    elif isinstance(item, list):  # Nested list of strings
                                        for html_content in item:
                                            if isinstance(html_content, str):
                                                #print(f"Yielding nested HTML from JSON: {html_content[:30]}...")  # Debug print
                                                yield self.replace_curly_quotes(html_content)
                            else:
                                print(f"Error: {path} does not contain a list of HTML strings.")
                        except json.JSONDecodeError:
                            print(f"Error: {path} is not a valid JSON file.")
                
                elif path.lower().endswith('.html'):
                    # Process as regular HTML file
                    with codecs.open(path, 'r', encoding=encoding) as f:
                        doc_content = f.read()
                        #print(f"Yielding HTML from file: {doc_content[:30]}...")  # Debug print
                        yield self.replace_curly_quotes(doc_content)
                else:
                    print(f"Unsupported file type: {path}. Only JSON and HTML files are allowed.")
                    continue

    def html(self, fileids=None):
        for doc in self.docs(fileids):
            try:
                yield doc
            except Exception as e:
                print("Could not parse HTML: {}".format(e))
                continue

    def replace_curly_quotes(self, text):
        quote_replacements = {
            "\u2018": "'",  # Left single quotation mark
            "\u2019": "'",  # Right single quotation mark
            "\u201C": '"',  # Left double quotation mark
            "\u201D": '"',  # Right double quotation mark
        }
        
        for curly_quote, straight_quote in quote_replacements.items():
            text = text.replace(curly_quote, straight_quote)
        
        return text

    def remove_non_printable_chars(self, text):
        non_printable_pattern = re.compile(r'[\x00-\x1F\x7F-\x9F]+')
        cleaned_text = re.sub(non_printable_pattern, '', text)
        return cleaned_text.strip()

    def get_invalid_character_names(self, text):
        char_names = set()
        non_printable_pattern = re.compile(r'[\x00-\x1F\x7F-\x9F]')
        invalid_chars = non_printable_pattern.findall(text)
        for char in invalid_chars:
            try:
                name = unicodedata.name(char)
            except ValueError:
                name = "UNKNOWN CONTROL CHARACTER"
            char_names.add(name)
        return char_names
        
    def validate_paragraph(self, paragraph):
        reasons = []
        if not paragraph.strip():
            reasons.append("Only whitespace")

        invalid_char_names = self.get_invalid_character_names(paragraph)
        
        if invalid_char_names:
            reason = f"Contains non-printable characters: {', '.join(invalid_char_names)}"
            reasons.append(reason)

        return True if not reasons else ', '.join(reasons)
    
    para_dict = dict()
    def paras(self, parser_type='lxml', fileids=None):
        for html in self.html(fileids):
            # Check if html content looks like an HTML string
            if not isinstance(html, str) or "<" not in html:
                print(f"Skipping non-HTML content: {html}")
                continue
            
            soup = BeautifulSoup(html, parser_type)
            
            # Join tags into a CSS selector if `self.tags` is a list
            tag_selector = ",".join(self.tags) if isinstance(self.tags, list) else self.tags
            
            for element in soup.select(tag_selector):
                text = element.text.strip()
                yield text



    sent_dict = dict()
    def sents(self, fileids=None):
        for paragraph in self.paras(fileids=fileids):
            for sentence in nltk.sent_tokenize(paragraph): 
                yield sentence

    word_dict = dict()
    def words(self, fileids=None): 
        for sentence in self.sents(fileids=fileids):
            for token in nltk.wordpunct_tokenize(sentence):
                yield token

    def generate(self, fileids=None, log_file_path=None):
        doc_dict = []
        error_dict = []
        count = 0
        all_paragraph_count = 0 

        with open(log_file_path, 'a') as log_file:
            for idx, html_content in enumerate(self.paras(fileids=fileids)):
                html_content = self.replace_curly_quotes(html_content)
                validation_result = self.validate_paragraph(html_content)

                if isinstance(validation_result, bool) and validation_result:  
                    all_paragraph_count += 1
                    doc_dict.append(html_content)
                else:
                    if not isinstance(validation_result, bool):  # If we have a string with failure reasons...
                        log_file.write(f"Invalid Paragraph {count}: {validation_result}\n")

                        cleaned_html_content = self.remove_non_printable_chars(html_content)

                        if isinstance(self.validate_paragraph(cleaned_html_content), bool):
                            count += 1
                            doc_dict.append(cleaned_html_content)
                        else:
                            error_dict.append(cleaned_html_content)

        return doc_dict, error_dict, count, all_paragraph_count


In [None]:
_corpus = HTMLReader(r'C:\topic-modeling\data\documents\2015')
#print(_corpus.categories())
_corpus.fileids()

In [None]:
corpus_tuple, errors, count, all_paragraph_count= _corpus.generate(log_file_path=r"C:\topic-modeling\data\documents\2015\paragraph_error.log")
print(count)
print(all_paragraph_count)
print(len(corpus_tuple))
# 2010: 977438, 2011:

In [None]:
for txt in errors:
    print((txt))

In [None]:
from time import time
import pprint as pp
import nltk
import spacy
from tqdm import tqdm

texts_out = []
inner_text = []

# number of stopwords found
stopword_count = nltk.FreqDist()

pp.pprint(f"Executing POS/LEMMATIZATION({LEMMATIZATION})")

t = time()
for paras in tqdm(corpus_tuple): #, total=9168220):
    doc = nlp(paras)
    
    for token in doc:
        if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']:
            if len(token.text) > 5:
                if token.text.lower() not in stop_words and token.lemma_.lower() not in stop_words: 
                    if LEMMATIZATION == False:
                        inner_text.append(token.text) 
                    else:
                        inner_text.append(token.lemma_) 
                else:
                    if LEMMATIZATION == False:
                        stopword_count[token.text] += 1
                    else:
                        stopword_count[token.lemma_] += 1

    if len(inner_text) > 0:
        texts_out.append(inner_text)
    inner_text = []

#pp.pprint(texts_out)
pp.pprint('Time to finish spaCy filter: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
texts_for_processing =[]
for sent in texts_out:
    if len(sent) > 5:
        texts_for_processing.append(sent)
del texts_out

In [None]:
import json

def write_tokenized_sentences_to_jsonl(sentences, output_file):
    with open(output_file, 'w', encoding='utf-8') as f:
        for sentence in sentences:
            json_line = json.dumps(sentence, ensure_ascii=False)
            f.write(json_line + '\n')

write_tokenized_sentences_to_jsonl(texts_for_processing, r'C:\topic-modeling\data\documents\2015\2015.jsonl')

In [None]:
filename2 = r"C:\topic-modeling\data\documents\2015\2015.json"
with open(filename2, 'w', encoding='utf-8') as jsonfile:
    json.dump(texts_for_processing, jsonfile, indent=1, ensure_ascii=False)

In [None]:
# Compute bigrams.
from gensim.models import Phrases

# Detect bigrams (only ones that appear 20 times or more).
bigram = Phrases(texts_for_processing, min_count=20)

# freqDist object for bigrams
bigram_freq = nltk.FreqDist()

# print bigrams
for ngrams, _ in bigram.vocab.items():
    #unicode_ngrams = ngrams.decode('utf-8')
    if '_' in ngrams:
        bigram_freq[ngrams]+=1
        print(ngrams)

# add bigrams to texts_for_processing so they are included in the corpus
for idx in range(len(texts_for_processing)):
    for token in bigram[texts_for_processing[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            texts_for_processing[idx].append(token)

In [None]:
#pp.pprint(texts_out)

In [None]:
filename3 = r"C:\topic-modeling\data\documents\2015\2015-w-bigrams.json"
with open(filename3, 'w', encoding='utf-8') as jsonfile:
    json.dump(texts_for_processing, jsonfile, ensure_ascii=False)

In [None]:
write_tokenized_sentences_to_jsonl(texts_for_processing, r"C:\topic-modeling\data\documents\2015\2015-w-bigrams.jsonl")