In [1]:
"""
This code was written using chatGPT to a limited degree(~20%) -- practically, a Liliputian "Hello, world" script.

Sources: Official Gensim and Gensim communities, Dask documentation and Dask communities were used for review, along with a Brobdingnagian crawl of virtual-space(e.g. blogs, personal sites, stack exchange, etc.) which were 
on more occasions than not *several* years to the n^th degree old.

authors: alan hamm(pqn7)
         bertha(chatCDC)
         
apr 2024
"""

'\nThis code was written using chatGPT to a limited degree(~20%) -- practically, a Liliputian "Hello, world" script.\n\nSources: Official Gensim and Gensim communities, Dask documentation and Dask communities were used for review, along with a Brobdingnagian crawl of virtual-space(e.g. blogs, personal sites, stack exchange, etc.) which were \non more occasions than not *several* years to the n^th degree old.\n\nauthors: alan hamm(pqn7)\n         bertha(chatCDC)\n         \napr 2024\n'

## TopicFutures by Alan Hamm and Bertha

The provided script is a comprehensive Python program that utilizes several libraries to perform \
Latent Dirichlet Allocation (LDA) topic modeling on text data. The script includes functionality for \
data preprocessing, model training, evaluation, and visualization, as well as handling distributed computing with Dask.

At the beginning of the script, various libraries are imported including pyLDAvis.gensim for interactive topic model \
visualization, torch for deep learning operations and GPU acceleration, gensim for LDA modeling and coherence \
computation, and dask.distributed for parallel and distributed computing.\

The script sets up directory paths for logging, models, and visuals based on a given decade (DECADE). It checks if these \
directories exist and contain data; if they do, it archives their contents into a ZIP file before removing the old \
subdirectories. New directories are then created for the current run.

Logging configuration is established to record messages in a log file. Bokeh deprecation warnings are suppressed to avoid \
cluttering the output with irrelevant messages.

Parameters such as alpha (document-topic density) and beta (word-topic density) are defined as lists of possible values \
that will be used during LDA model training. These parameters influence the sparsity or density of topics in documents \
or words associated with topics.

A function named futures_create_lda_datasets() is defined to load data from a JSON file, shuffle it, split it into training \
and evaluation datasets based on a specified ratio (train_ratio), and return them as delayed objects ready for parallel processing with Dask.

Another function called save_model_and_log() takes care of saving trained LDA models along with their metadata into \
specified directories. It also logs this information into CSV files.

The core function train_model() performs the actual training of LDA models using Gensim's LdaModel. It processes text documents \
in batches to create dictionary mappings and trains an LDA model per batch. Model performance metrics like convergence score, \
perplexity score, and coherence score are calculated during this process.

The main execution block initializes a Dask cluster with specified worker configurations such as number of cores (CORES) and \
memory limits (RAM_MEMORY_LIMIT). A Dask client is created to manage tasks across workers. Training and evaluation datasets \
are prepared by calling the aforementioned functions. These datasets are scattered across workers for efficient parallel processing.

A series of nested loops iterate over combinations of topic numbers (n_topics), alpha values (alpha_values), \
and beta values (beta_values) to submit training tasks for both the training and evaluation datasets. These tasks are \
submitted to the Dask client, which distributes them across the available workers.

The script employs a progress bar from tqdm to visualize the progress of model creation and saving. It uses a batch processing \
approach where it waits for a certain number of futures (asynchronous task results) to complete before processing their \
results. If some futures do not complete within a specified timeout (TIMEOUT), they are recorded as failed and an attempt is \
made to retry them with an extended timeout (EXTENDED_TIMEOUT).

Once all models have been trained or reattempted, any remaining incomplete models' parameters are logged for review, indicating \
that these models did not successfully complete even after a second attempt.

Throughout the script, various utility functions such as os, json, random, csv, and others are used for file operations, \
data manipulation, random shuffling of data, and logging results in CSV format.

It's important to note that this script assumes certain global variables like DECADE, DATA_SOURCE, TRAIN_RATIO, CORES,\
THREADS_PER_CORE, etc., are defined elsewhere in the code or environment since they are referenced but not explicitly \
defined within the provided code snippet.

Overall, this script is designed for robust LDA topic modeling with extensive parameter exploration while leveraging distributed \
computing resources efficiently. It includes error handling mechanisms such as retries for failed tasks and comprehensive\
logging which aids in debugging and optimizing model performance.

The script is structured to handle large-scale topic modeling tasks in a distributed computing environment. After setting up\
 the Dask client and workers, it proceeds to create training and evaluation datasets from a specified data source (DATA_SOURCE) \
 using the futures_create_lda_datasets function. The resulting datasets are then scattered across the Dask cluster's workers \
 for parallel processing.

For model training, the script defines a train_model function that takes several parameters including the number of topics, \
alpha and beta values, and the dataset. This function processes text documents in batches, updating a global dictionary with \
each batch and training an LDA model using Gensim's LdaModel. It computes various performance metrics for each batch such as \
convergence score, perplexity score, and coherence score.

The main execution loop iterates over different combinations of model hyperparameters (number of topics, alpha values, beta values) \
and submits two sets of futures to the Dask client: one for training models on the training data (train_futures) and another \
for evaluating models on the evaluation data (eval_futures). These futures are monitored for completion, with progress tracked by a tqdm progress bar.

If any futures do not complete within the given timeout period (TIMEOUT), they are added to a list of failed model parameters \
(failed_model_params) for later analysis. The script includes functionality to retry processing these incomplete futures with \
an extended timeout (EXTENDED_TIMEOUT). Once all models have been processed or reattempted, any remaining incomplete models' \
parameters are logged using both standard output and a performance logger (perf_logger).

Finally, upon completion or failure of all tasks, the script closes the Dask client and provides an overview of which model parameters \
did not complete successfully after retries. This information can be used to diagnose potential issues in model training or resource \
allocation within the distributed computing setup.

In summary, this script is designed as an end-to-end solution for performing LDA topic modeling at scale. It incorporates best \
practices such as error handling through retries, logging important events and metrics for post-analysis, utilizing distributed\
 computing resources effectively via Dask, and providing user feedback through progress bars. Users looking to employ this script \
 should ensure they have set up their environment correctly with all necessary variables defined and have access to sufficient \
 computational resources managed by Dask.

In [2]:
import pyLDAvis.gensim  # Library for interactive topic model visualization
from tqdm import tqdm  # Creates progress bars to visualize the progress of loops or tasks
from gensim.models import LdaModel  # Implements LDA for topic modeling using the Gensim library
from gensim.corpora import Dictionary  # Represents a collection of text documents as a bag-of-words corpus
from gensim.models import CoherenceModel  # Computes coherence scores for topic models
import pyLDAvis
import IProgress 

import os  # Provides functions for interacting with the operating system, such as creating directories
import itertools  # Provides various functions for efficient iteration and combination of elements
import numpy as np  # Library for numerical computing in Python, used for array operations and calculations
from time import time, sleep # Measures the execution time of code snippets or functions
import pprint as pp  # Pretty-printing library, used here to format output in a readable way
import pandas as pd
import logging # Logging module for generating log messages
import sys # Provides access to some variables used or maintained by the interpreter and to functions that interact with the interpreter 
import shutil # High-level file operations such as copying and removal 
import zipfile # Provides tools to create, read, write, append, and list a ZIP file
from tqdm.notebook import tqdm  # Creates progress bars in Jupyter Notebook environment
from json import load
import random
import logging
import csv
import pprint as pp
from pandas.api.types import CategoricalDtype
from typing import Union, List
import math
from scipy import stats

from dask.distributed import as_completed
import dask   # Parallel computing library that scales Python workflows across multiple cores or machines 
from dask.distributed import Client, LocalCluster, wait   # Distributed computing framework that extends Dask functionality 
from dask.diagnostics import ProgressBar   # Visualizes progress of Dask computations
from dask.distributed import progress
from dask.delayed import Delayed # Decorator for creating delayed objects in Dask computations
#from dask.distributed import as_completed
from dask.bag import Bag
from dask import delayed
import dask.config
#from dask.distributed import wait
from dask.distributed import performance_report, wait, as_completed #,print
from distributed import get_worker
import gc
import hashlib
import pickle



In [3]:
import logging
from datetime import datetime

DECADE_TO_PROCESS ='2010s'
LOG_DIRECTORY = f"C:/_harvester/data/lda-models/{DECADE_TO_PROCESS}_html/log/"
# Ensure the LOG_DIRECTORY exists
os.makedirs(LOG_DIRECTORY, exist_ok=True)

# Get the current date and time
now = datetime.now()

# Format the date and time as per your requirement
# Note: %w is the day of the week as a decimal (0=Sunday, 6=Saturday)
#       %Y is the four-digit year
#       %m is the two-digit month (01-12)
#       %H%M is the hour (00-23) followed by minute (00-59) in 24hr format
log_filename = now.strftime('log-%w-%m-%Y-%H%M.log')
LOGFILE = os.path.join(LOG_DIRECTORY,log_filename)

# Configure logging to write to a file with this name
logging.basicConfig(
    filename=LOGFILE,
    filemode='a',  # Append mode if you want to keep adding to the same file during the day
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    level=logging.INFO
)

# Now when you use logging.info(), logging.debug(), etc., it will write to that log file.

In [4]:
# Dask dashboard throws deprecation warnings w.r.t. Bokeh
import warnings
from bokeh.util.deprecation import BokehDeprecationWarning

# Disable Bokeh deprecation warnings
warnings.filterwarnings("ignore", category=BokehDeprecationWarning)
# Filter out the specific warning message
# Set the logging level for distributed.utils_perf to suppress warnings
logging.getLogger('distributed.utils_perf').setLevel(logging.ERROR)
warnings.filterwarnings("ignore", module="distributed.utils_perf")

In [5]:

# Define the range of number of topics for LDA and step size
START_TOPICS = 100
END_TOPICS = 205
STEP_SIZE = 5

# define the decade that is being modelled 
DECADE = DECADE_TO_PROCESS

# In the case of this machine, since it has an Intel Core i9 processor with 8 physical cores (16 threads with Hyper-Threading), 
# it would be appropriate to set the number of workers in Dask Distributed LocalCluster to 8 or slightly lower to allow some CPU 
# resources for other tasks running on your system.
CORES = 10
MAXIMUM_CORES = 12

THREADS_PER_CORE = 2

RAM_MEMORY_LIMIT = "12GB" 

# Specify the local directory path
DASK_DIR = '/_harvester/tmp-dask-out'

# specify the number of passes for Gensim LdaModel
PASSES = 15

# specify the number of iterations
ITERATIONS = 50

# Number of documents to be iterated through for each update. 
# Set to 0 for batch learning, > 1 for online iterative learning.
UPDATE_EVERY = 5

# Log perplexity is estimated every that many updates. 
# Setting this to one slows down training by ~2x.
EVAL_EVERY = 10

RANDOM_STATE = 75

PER_WORD_TOPICS = True

# number of documents to extract from the JSON source file when testing and developing
NUM_DOCUMENTS = 25

# the number of documents to read from the JSON source file per batch
FUTURES_BATCH_SIZE = 300

# Constants for adaptive batching and retries
# Number of futures to process per iteration
BATCH_SIZE = 500 # number of documents
MAX_BATCH_SIZE = 650 
INCREASE_FACTOR = 1.2  # Increase batch size by n% upon success
DECREASE_FACTOR = .10 # Decrease batch size by p% upon failure or timeout
MAX_RETRIES = 5        # Maximum number of retries per task
BASE_WAIT_TIME = 1     # Base wait time in seconds for exponential backoff


# Load data from the JSON file
DATA_SOURCE = "C:/_harvester/data/tokenized-sentences/10s/tokenized_min_three_word-w-bigrams-08312024.json"
TRAIN_RATIO = .80

TIMEOUT = None #"90 minutes"

EXTENDED_TIMEOUT = None #"120 minutes"

CPU_UTILIZATION_THRESHOLD = 85 # ie 95%
MEMORY_UTILIZATION_THRESHOLD = .6 # per worker

# Enable serialization optimizations
dask.config.set(scheduler='distributed', serialize=True)
dask.config.set({'logging.distributed': 'error'})
dask.config.set({"distributed.scheduler.worker-ttl": None})
#dask.config.set({"distributed.scheduler.worker-ttl": None})

<dask.config.set at 0x17b5761eea0>

## Technical Documentation for LDA Model Data Management System
### Overview
The LDA Model Data Management System is designed to efficiently store and manage large volumes of text data and associated \
metadata generated by Latent Dirichlet Allocation (LDA) models. The system allows for quick access to text data based on \
queries of the metadata, facilitating dynamic generation of pyLDAvis objects or other topic analysis visualizations. 

### System Structure
The system comprises a top-level directory with several subdirectories designated for logs, visuals, metadata, and compressed \
text data. Each large body of text is stored as an individual ZIP file to save space, while metadata is stored in a Parquet \
file for efficient querying.

**Directory Structure**
* ROOT_DIR: The base directory containing all data related to the LDA models. \
* LOG_DIR: A subdirectory within ROOT_DIR that stores log files.\
* IMAGE_DIR: A subdirectory within ROOT_DIR that stores visualization files such as images or charts.\
* METADATA_DIR: A subdirectory within ROOT_DIR that stores metadata in a Parquet file.\
* TEXTS_ZIP_DIR: A subdirectory within ROOT_DIR where each text file is saved as an individual ZIP archive.\

**File Formats**
* **Parquet**: Used for storing metadata due to its efficiency in storage size and speed when querying columns.
* **ZIP**: Used for compressing individual text files to minimize disk space usage.

### Functions
**save_text_to_zip**(text_data) \
Saves a given string of text data into a ZIP file within the TEXTS_ZIP_DIR.
**Parameters**:
* text_data (str): The string content representing the body of text to be saved.
**Returns**:
* (str): The path to the created ZIP file containing the text data.

**add_model_data_to_metadata**(model_data) \
Adds new model data entries to the existing metadata Parquet file. If no Parquet file exists, it creates one.
**Parameters**:
* model_data (dict): A dictionary containing model-related information including texts and various scores like convergence, perplexity, coherence, etc.
**Side Effects**:
* Updates or creates a Parquet file at METADATA_DIR/metadata.parquet.

**get_text_from_zip**(zip_path)
Reads and returns the content of a specified text from its corresponding ZIP archive.
**Parameters**:
* zip_path (str): The path to the ZIP archive containing the text data.
**Returns**:
* (str): The text content extracted from the ZIP file.

**load_texts_for_analysis**(metadata_path, coherence_threshold=0.7) 
Loads metadata from a Parquet file and retrieves texts that meet specified criteria, such as a minimum coherence score.\
**Parameters**:
* metadata_path (str): The path to the metadata Parquet file.
* coherence_threshold (float, optional): The threshold for filtering records based on their coherence score. Defaults to 0.7.
**Returns**:
* (list of str): A list of text contents that meet the specified criteria.
    
### Usage
To use this system, follow these steps:

1. Ensure that all necessary directories (LOG_DIR, IMAGE_DIR, METADATA_DIR, TEXTS_ZIP_DIR) are created within the top-level directory (ROOT_DIR).
2. When new model data is generated, create a dictionary with keys corresponding to metadata fields and a 'text' key containing a list of large bodies of text.
3. Call add_model_data_to_metadata(new_model_data) to save the text data into individual ZIP files and update or create the metadata Parquet file with references to these ZIP files.
4. To retrieve texts for analysis based on metadata queries, call load_texts_for_analysis(parquet_file_path). You can specify a different coherence threshold if needed.
5. For any specific text retrieval based on its ZIP archive path, use get_text_from_zip(zip_path)

![image.png](attachment:image.png)

## Maintenance
The system requires minimal maintenance:
* Periodically check the available disk space in case the volume of stored texts grows significantly.
* Backup important data regularly, especially the Parquet file containing metadata and references to text files.
* Update directory paths and

In [6]:
import gc
def garbage_collection(development: bool, location: str):
    if development:
        # Enable debugging flags for leak statistics
        gc.set_debug(gc.DEBUG_LEAK)

    # Before calling collect, get a count of existing objects
    before = len(gc.get_objects())

    # Perform garbage collection
    collected = gc.collect()

    # After calling collect, get a new count of existing objects
    after = len(gc.get_objects())

    # Print or log before and after counts along with number collected
    logging.info(f"Garbage Collection at {location}:")
    logging.info(f"  Before GC: {before} objects")
    logging.info(f"  After GC: {after} objects")
    logging.info(f"  Collected: {collected} objects\n")


In [7]:
import os
import pandas as pd
import zipfile

# Define the top-level directory and subdirectories
DECADE = "2010s"  # Replace with your actual decade value
ROOT_DIR = f"C:/_harvester/data/lda-models/{DECADE}_html"
LOG_DIR = os.path.join(ROOT_DIR, "log")
IMAGE_DIR = os.path.join(ROOT_DIR, "visuals")
METADATA_DIR = os.path.join(ROOT_DIR, "metadata")
TEXTS_ZIP_DIR = os.path.join(ROOT_DIR, "texts_zip")

# Ensure that all necessary directories exist
os.makedirs(LOG_DIR, exist_ok=True)
os.makedirs(IMAGE_DIR, exist_ok=True)
os.makedirs(METADATA_DIR, exist_ok=True)
os.makedirs(TEXTS_ZIP_DIR, exist_ok=True)

# Function to save text data to a zip file and return the path
def save_text_to_zip(text_data):
    # Generate a unique filename based on current timestamp
    timestamp_str = pd.Timestamp.now().strftime('%Y%m%d%H%M%S%f')
    text_zip_filename = f"{timestamp_str}.zip"
    
    # Write the text content to a zip file within TEXTS_ZIP_DIR
    zip_path = os.path.join(TEXTS_ZIP_DIR, text_zip_filename)
    with zipfile.ZipFile(zip_path, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("text.txt", text_data)
    
    return zip_path

# Function to save text data and model to single ZIP file
def save_to_zip(time, text_data, ldamodel):
    # Generate a unique filename based on current timestamp
    timestamp_str = hashlib.md5(time.strftime('%Y%m%d%H%M%S%f').encode()).hexdigest()
    text_zip_filename = f"{timestamp_str}.zip"
    
    # Write the text content and model to a zip file within TEXTS_ZIP_DIR
    zip_path = os.path.join(TEXTS_ZIP_DIR, text_zip_filename)
    with zipfile.ZipFile(zip_path, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"doc_{text_zip_filename}.txt", text_data)
        ldamodel_bytes = pickle.dumps(ldamodel)
        zf.writestr(f"model_{text_zip_filename}.pkl", ldamodel_bytes)
    
    return zip_path

# method to deserialize and return the LDA model object
def load_pkl_from_zip(zip_path):
    with zipfile.ZipFile(zip_path, 'r') as zf:
        pkl_files = [file for file in zf.namelist() if file.endswith('.pkl')]
        if len(pkl_files) == 0:
            raise ValueError("No pkl files found in the ZIP archive.")
        
        pkl_file = pkl_files[0]
        pkl_bytes = zf.read(pkl_file)
        loaded_pkl = pickle.loads(pkl_bytes)
    
    return loaded_pkl

# Function to add new model data to metadata Parquet file
def add_model_data_to_metadata(model_data):
    #print("we are in the add_model_data_to_metadata method()")
    # Save large body of text to zip and update model_data reference
    texts_zipped = []
    
    #for text_list in model_data['text']:
    for text_list in model_data['text']:
        combined_text = ''.join([''.join(sent) for sent in text_list])  # Combine all sentences into one string
        zip_path = save_to_zip(model_data['time'], combined_text, model_data['lda_model'])
        texts_zipped.append(zip_path)
    # Update model data with zipped paths
    model_data['text'] = texts_zipped
     # Ensure other fields are not lists, or if they are, they should have only one element per model
    for key, value in model_data.items():
        #if isinstance(value, list) and key != 'text':
        if isinstance(value, list) and key not in ['text', 'top_words']:
            assert len(value) == 1, f"Field {key} has multiple elements"
            model_data[key] = value[0]  # Unwrap single-element list
               
    # Define the expected data types for each column
    expected_dtypes = {
        'type': str,
        'batch_size': int,
        'text': object,  # Use object dtype for lists of strings (file paths)
        'text_sha256': str,
        'text_md5': str,
        'convergence': 'float32',
        'perplexity': 'float32',
        'coherence': 'float32',
        'topics': int,
        # Use pd.Categorical.dtype for categorical columns
        # Ensure alpha and beta are already categorical when passed into this function
        # They should not be wrapped again with CategoricalDtype here.
        'alpha_str': str,
        'n_alpha': 'float32',
        'beta_str': str,
        'n_beta': 'float32',
        'passes': int,
        'iterations': int,
        'update_every': int,
        'eval_every': int,
        'chunksize': int,
        'random_state': int,
        'per_word_topics': bool,
        'top_words': object,
        'lda_model': object,
        # Enforce datetime type for time
        'time': 'datetime64[ns]',
    }   

    
    try:
        #df_new_metadata = pd.DataFrame({key: [value] if not isinstance(value, list) else value 
        #                                for key, value in model_data.items()}).astype(expected_dtypes)
        # Create a new DataFrame without enforcing dtypes initially
        df_new_metadata = pd.DataFrame({key: [value] if not isinstance(value, list) else value 
                                        for key, value in model_data.items()})
        
        # Apply type conversion selectively
        #for col_name in ['convergence', 'perplexity', 'coherence', 'n_beta', 'n_alpha']:
        for col_name in ['convergence', 'perplexity', 'coherence', 'n_beta', 'n_alpha']:
            df_new_metadata[col_name] = df_new_metadata[col_name].astype('float64')
            
        df_new_metadata['topics'] = df_new_metadata['topics'].astype(int)
        #df_new_metadata['time'] = pd.to_datetime(df_new_metadata['time'])
        df_new_metadata['batch_size'] = BATCH_SIZE
    except ValueError as e:
        # Initialize an error message list
        error_messages = [f"Error converting model_data to DataFrame with enforced dtypes: {e}"]
        
        
        # Iterate over each item in model_data to collect its key, expected dtype, and actual value
        for key, value in model_data.items():
            expected_dtype = expected_dtypes.get(key, 'No expected dtype specified')
            actual_dtype = type(value).__name__
            error_messages.append(f"Column: {key}, Expected dtype: {expected_dtype}, Actual dtype: {actual_dtype}, Value: {value}")
        
        # Join all error messages into a single string
        full_error_message = "\n".join(error_messages)

        logging.error(full_error_message)

        raise ValueError("Data type mismatch encountered during DataFrame conversion. Detailed log available.")
    
    # Path to the metadata Parquet file
    parquet_file_path = os.path.join(METADATA_DIR, "metadata.parquet")
    
    # Check if the Parquet file already exists
    if os.path.exists(parquet_file_path): 
        # If it exists, read the existing metadata and append the new data 
        df_metadata = pd.read_parquet(parquet_file_path) 
        df_metadata = pd.concat([df_metadata, df_new_metadata], ignore_index=True) 
    else: 
        # If it doesn't exist, use the new data as the starting point 
        df_metadata = df_new_metadata

    # drop lda model from dataframe
    df_metadata = df_metadata.drop('lda_model', axis=1)

    # Save updated metadata DataFrame back to Parquet file
    df_metadata.to_parquet(parquet_file_path)
    del df_metadata, model_data
    #garbage_collection(False, 'Cleaned add_model_data_to_metadata(...)')
    #print("\nthis is the value of the parquet file")
    #print(df_metadata)


# Function to read a specific text from its zip file based on metadata query
def get_text_from_zip(zip_path): 
    with zipfile.ZipFile(zip_path, 'r') as zf: 
        return zf.read('text.txt').decode('utf-8')

# Example usage: Load metadata and retrieve texts based on some criteria
def load_texts_for_analysis(metadata_path, coherence_threshold=0.7): 
    # Load the metadata into a DataFrame 
    df_metadata = pd.read_parquet(metadata_path)

    # Filter metadata based on some criteria (e.g., coherence > threshold)
    filtered_metadata = df_metadata[df_metadata['coherence'] > coherence_threshold]

    # Retrieve and decompress associated texts from their zip files
    texts = [get_text_from_zip(zip_path) for zip_path in filtered_metadata['text']]

    return texts

In [8]:
PERFORMANCE_TRAIN_LOG = os.path.join(LOG_DIR, "train_model_performance.html")
# INCLUDE EVAL AND TRAINING DATA OUTPUT FILEPATHS HERE

In [9]:

num_topics = len(range(START_TOPICS, END_TOPICS + 1, STEP_SIZE))

# Calculate numeric_alpha for symmetric prior
numeric_symmetric = 1.0 / num_topics
# Calculate numeric_alpha for asymmetric prior (using best judgment)
numeric_asymmetric = 1.0 / (num_topics + np.sqrt(num_topics))
# Create the list with numeric values
numeric_alpha = [numeric_symmetric, numeric_asymmetric] + np.arange(0.01, 1, 0.3).tolist()
numeric_beta = [numeric_symmetric] + np.arange(0.01, 1, 0.3).tolist()


# The parameter `alpha` in Latent Dirichlet Allocation (LDA) represents the concentration parameter of the Dirichlet 
# prior distribution for the topic-document distribution.
# It controls the sparsity of the resulting document-topic distributions.

# A lower value of `alpha` leads to sparser distributions, meaning that each document is likely to be associated with fewer topics.
# Conversely, a higher value of `alpha` encourages documents to be associated with more topics, resulting in denser distributions.

# The choice of `alpha` affects the balance between topic diversity and document specificity in LDA modeling.
alpha_values = ['symmetric', 'asymmetric']
alpha_values += np.arange(0.01, 1, 0.3).tolist()

# In Latent Dirichlet Allocation (LDA) topic analysis, the beta parameter represents the concentration 
# parameter of the Dirichlet distribution used to model the topic-word distribution. It controls the 
# sparsity of topics by influencing how likely a given word is to be assigned to a particular topic.

# A higher value of beta encourages topics to have a more uniform distribution over words, resulting in more 
# general and diverse topics. Conversely, a lower value of beta promotes sparser topics with fewer dominant words.

# The choice of beta can impact the interpretability and granularity of the discovered topics in LDA.
beta_values = ['symmetric']
beta_values += np.arange(0.01, 1, 0.3).tolist()


In [10]:
from decimal import Decimal
def calculate_numeric_alpha(alpha_str, num_topics=num_topics):
    if alpha_str == 'symmetric':
        return Decimal('1.0') / num_topics
    elif alpha_str == 'asymmetric':
        return Decimal('1.0') / (num_topics + Decimal(num_topics).sqrt())
    else:
        # Use Decimal for arbitrary precision
        return Decimal(alpha_str)

def calculate_numeric_beta(beta_str, num_topics=num_topics):
    if beta_str == 'symmetric':
        return Decimal('1.0') / num_topics
    else:
        # Use Decimal for arbitrary precision
        return Decimal(beta_str)

def validate_alpha_beta(alpha_str, beta_str):
    valid_strings = ['symmetric', 'asymmetric']
    if isinstance(alpha_str, str) and alpha_str not in valid_strings:
        logging.error(f"Invalid alpha_str value: {alpha_str}. Must be 'symmetric', 'asymmetric', or a numeric value.")
        raise ValueError(f"Invalid alpha_str value: {alpha_str}. Must be 'symmetric', 'asymmetric', or a numeric value.")
    if isinstance(beta_str, str) and beta_str not in valid_strings:
        logging.error(f"Invalid beta_str value: {beta_str}. Must be 'symmetric', or a numeric value.")
        raise ValueError(f"Invalid beta_str value: {beta_str}. Must be 'symmetric', or a numeric value.")

The method futures_create_lda_datasets is designed to read a JSON file containing a list of lists of tokens and yield batches \
of these tokens for training and evaluation purposes. However, there are several reasons why this method might not return \ 
exactly N lists of tokens from the data structure: \

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Data Length: The method assumes that the JSON file contains at least N records (lists of tokens). \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If the actual number of records in the JSON file is less than 25, then data[:25] will return fewer than 25 lists. \

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Batch Size: The method divides the data into batches according to the batch_size parameter. If batch_size \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;is greater than or equal to 25, then only one batch will be yielded (which could be either 'train' or 'eval' type depending on train_ratio). \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If batch_size is smaller than 25, multiple batches will be yielded until all available data has been used. \

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Train Ratio: The train_ratio parameter determines how many samples are used for training versus evaluation. Depending on \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;this ratio and the batch size, you may get a different number of 'train' and 'eval' batches.\

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Loop Conditions: The while loop continues yielding batches until both train_count < num_train_samples and eval_count < num_samples conditions are no longer true. \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This means that if there are fewer than 25 records after applying the train ratio split, you might not get exactly 25 lists even if your initial dataset had more than that. \

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Indexing Error: There's an error in your cumulative count update logic within each loop iteration: \

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For training: You should increment by len(train_data_batch) instead of doubling up with both incrementing \
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;by len(train_data_batch) and then adding it again with cumulative_count += train_count. \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For evaluation: You should increment by just len(eval_data_batch), but instead, you're \
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;adding both eval count (cumulative_count += eval_count) which leads to an incorrect cumulative count. \

To ensure that you get exactly 25 lists (if available), consider adjusting your code as follows:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Make sure your input JSON file contains at least 25 records. \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Set your batch size appropriately so that it divides into your  dataset size evenly. \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Adjust your train/eval split so that it aligns with your goal of getting exactly 25 lists. \

In [11]:
"""
!!! DO NOT EXECUTE THIS CELL OR ANY CELL USING IT WITHOUT FIRSST
!!! UPDATING THE OUTPUT FILEPATH FOR THE TRAINING AND EVAL DATA
"""
import os
from json import load
import random

def get_num_records(filename):
    with open(filename, 'r') as jsonfile:
        data = load(jsonfile)
        data = data
        num_samples = len(data)  # Count the total number of samples
    return num_samples

import os
import json
from random import shuffle

def load(jsonfile):
    return json.load(jsonfile)

def futures_create_lda_datasets(filename, train_ratio, batch_size=FUTURES_BATCH_SIZE):
    with open(filename, 'r') as jsonfile:
        data = load(jsonfile)
        print(f"the number of records read from the JSON file: {len(data)}")
        num_samples = len(data)  # Count the total number of samples
        print(f"the number of documents sampled from the JSON file: {len(data)}\n")
        
        # Shuffle data indices since we can't shuffle actual lines in a file efficiently
        indices = list(range(num_samples))
        shuffle(indices)
        
        num_train_samples = int(num_samples * train_ratio)  # Calculate number of samples for training
        
        cumulative_count = 0  # Initialize cumulative count
        # Initialize counters for train and eval datasets
        train_count = 0
        eval_count = num_train_samples
        
        # Yield batches as dictionaries for both train and eval datasets along with their sample count
        while train_count < num_train_samples or eval_count < num_samples:
            if train_count < num_train_samples:
                # Yield a training batch
                train_indices_batch = indices[train_count:train_count + batch_size]
                train_data_batch = [data[idx] for idx in train_indices_batch]
                if len(train_data_batch) > 0:
                    yield {
                        'type': 'train',
                        'data': train_data_batch,
                        'indices_batch': train_indices_batch,
                        'cumulative_count': train_count,
                        'num_samples': num_train_samples,
                        'whole_dataset': data[:num_train_samples]
                    }
                    train_count += len(train_data_batch)
                    cumulative_count += train_count
            
            if (eval_count < num_samples or train_count >= num_train_samples):
                # Yield an evaluation batch
                #print("we are in the method to create the futures trying to create the eval data.")
                #print(f"the eval count is {eval_count} and the train count is {train_count} and the num train samples is {num_train_samples}\n")
                eval_indices_batch = indices[eval_count:eval_count + batch_size]
                eval_data_batch = [data[idx] for idx in eval_indices_batch]
                #print(f"This is the size of the eval_data_batch from the create futures method {len(eval_data_batch)}\n")
                if len(eval_data_batch) > 0:
                    yield {
                        'type': 'eval',
                        'data': eval_data_batch,
                        'indices_batch': eval_indices_batch,
                        'cumulative_count': eval_count - num_train_samples,
                        'num_samples': num_samples - num_train_samples,
                        'whole_dataset': data[num_train_samples:]
                    }
                    eval_count += len(eval_data_batch)
                    cumulative_count += eval_count
                
    #garbage_collection(False,'futures_create_lda_datasets(...)')

In [12]:
# create training and eval dictionaries used in train_model(...) method
def create_dictionary(filename):
    with open(filename, 'r') as jsonfile:
        data = load(jsonfile)
        num_samples = len(data)  # Count the total number of samples
        logging.info(f"The min five with bigrams has {num_samples} sentences")
        return data
#minfivedict = create_dictionary(DATA_SOURCE)

In [13]:
def create_vis(ldaModel, filename, corpus, dictionary):
    LOGFILE = os.path.join(IMAGE_DIR,filename)

    pyLDAvis.disable_notebook()
    vis = pyLDAvis.gensim.prepare(ldaModel, corpus, dictionary)

    pyLDAvis.save_html(vis, LOGFILE)


In [14]:
import hashlib
import re
# specify the chunk size for LdaModel object
# Number of documents to be used in each training chunk
CHUNKSIZE = (get_num_records(DATA_SOURCE)//5)
def train_model(n_topics: int, alpha_str: list, beta_str: list, data: list, train_eval: str, chunksize=CHUNKSIZE):
        models_data = []
        coherehce_score_list = []
        corpus_batch = []
        zipped_texts = []
        time_of_method_call = pd.to_datetime('now')

        #print("this is an investigation into the full datafile")
        #pp.pprint(full_datafile)
        # Convert the Delayed object to a Dask Bag and compute it to get the actual data
        try:
            streaming_documents = dask.compute(*data)
            #print("these are the streaming documents")
            #print(streaming_documents)
            #garbage_collection(False, 'train_model(): streaming_documents = dask.compute(*data)')
        except Exception as e:
            logging.error(f"Error computing streaming_documents data: {e}")
            raise
        #print(f"This is the dtype for 'streaming_documents' {type(streaming_documents)}.\n")  # Should output <class 'tuple'>
        #print(streaming_documents[0][0])     # Check the first element to see if it's as expected

        # Select documents for current batch
        batch_documents = streaming_documents
        
        # Create a new Gensim Dictionary for the current batch
        try:
            dictionary_batch = Dictionary(list(batch_documents))
            #print("The dictionary was cretaed.")
        except TypeError:
            print("Error: The data structure is not correct.")
        #else:
        #    print("Dictionary created successfully!")

        #if isinstance(batch_documents[0], list) and all(isinstance(doc, list) for doc in batch_documents[0]):
        #bow_out = dictionary_batch.doc2bow(batch_documents[0])
        flattened_batch = [item for sublist in batch_documents for item in sublist]
        #bow_out = dictionary_batch.doc2bow(flattened_batch)
        #else:
        #    raise ValueError(f"Expected batch_documents[0] to be a list of token lists. Instead received {type(batch_documents[0])} with value {batch_documents[0]}\n")

        # Iterate over each document in batch_documents
        number_of_documents = 0
        for doc_tokens in batch_documents:
            # Create the bag-of-words representation for the current document using the dictionary
            bow_out = dictionary_batch.doc2bow(doc_tokens)
            # Append this representation to the corpus
            corpus_batch.append(bow_out)
            number_of_documents += 1
        logging.info(f"There was a total of {number_of_documents} documents added to the corpus_batch.")
            
        #logger.info(f"HERE IS THE TEXT for corpus_batch using LOGGER: {corpus_batch}\n")
        #except Exception as e:
        #    logger.error(f"An unexpected error occurred with BOW_OUT: {e}")
                
        #if isinstance(texts_out[0], list):
        #    texts_batch.append(texts_out[0])
        #else:
        #    logging.error("Expected texts_out to be a list of strings (words), got:", texts_out[0])
        #    raise ValueError("Expected texts_out to be a list of strings (words), got:", texts_out[0])
                
        n_alpha = calculate_numeric_alpha(alpha_str)
        n_beta = calculate_numeric_beta(beta_str)
        try:
            #logger.info("we are inside the try block at the beginning")
            lda_model_gensim = LdaModel(corpus=corpus_batch,
                                                id2word=dictionary_batch,
                                                num_topics=n_topics,
                                                alpha= float(n_alpha),
                                                eta= float(n_beta),
                                                random_state=RANDOM_STATE,
                                                passes=PASSES,
                                                iterations=ITERATIONS,
                                                update_every=UPDATE_EVERY,
                                                eval_every=EVAL_EVERY,
                                                chunksize=chunksize,
                                                per_word_topics=True)
            #logger.info("we are inside the try block after the constructor")

                                          
        except Exception as e:
            logging.error(f"An error occurred during LDA model training: {e}")
            raise  # Optionally re-raise the exception if you want it to propagate further      

        ldamodel_bytes = pickle.dumps(lda_model_gensim)

        #coherence_score = None  # Assign a default value
        with np.errstate(divide='ignore', invalid='ignore'):
            try:
                #coherence_model_lda = CoherenceModel(model=lda_model_gensim, processes=math.floor(CORES*(2/3)), dictionary=dictionary_batch, texts=batch_documents[0], coherence='c_v') 
                coherence_model_lda = CoherenceModel(model=lda_model_gensim, processes=math.floor(CORES*(1/3)), dictionary=dictionary_batch, texts=batch_documents, coherence='c_v') 
                coherence_score = coherence_model_lda.get_coherence()
                coherehce_score_list.append(coherence_score)
            except Exception as e:
                logging.error("there was an issue calculating coherence score. value 'Inf' has been assigned.\n")
                coherence_score = float('inf')
                coherehce_score_list.append(coherence_score)
                #sys.exit()

        try:
            convergence_score = lda_model_gensim.bound(corpus_batch)
        except Exception as e:
            logging.error("there was an issue calculating convergence score. value 'Inf' has been assigned.\n")
            convergence_score = float('inf')
                    
        try:
            perplexity_score = lda_model_gensim.log_perplexity(corpus_batch)
        except RuntimeWarning as e:
            logging.info("there was an issue calculating perplexity score. value 'Inf' has been assigned.\n")
            perplexity_score = float('inf')
            #sys.exit()
        
        # Initialize top_words as an empty list
        top_words = []

        # Assuming lda_model_gensim is already trained and available
        for topic_str, word_probs in lda_model_gensim.show_topics(num_topics=15, num_words=25):
            # Extract words from the formatted string using regular expressions
            # The pattern assumes that words are within double quotes
            words = re.findall(r'"(.*?)"', word_probs)
            # Extend the top_words list with these individual words
            top_words.extend(words)
        #pp.pprint(top_words)
            
       #print(f"type: {train_eval}, coherence: {coherence_score}, n_topics: {n_topics}, n_alpha: {n_alpha}, alpha_str: {alpha_str}, n_beta: {n_beta}, beta_str: {beta_str}")
        logging.info(f"type: {train_eval}, coherence: {coherence_score}, n_topics: {n_topics}, n_alpha: {n_alpha}, alpha_str: {alpha_str}, n_beta: {n_beta}, beta_str: {beta_str}\n      batch documents: {batch_documents}\n")     

        # transform list of tokens comprising the doc into a single string
        string_result = ' '.join(map(str, flattened_batch))


        # Convert numeric beta value to string if necessary
        if isinstance(beta_str, float):
            beta_str = str(beta_str)
                
        # Convert numeric alpha value to string if necessary
        if isinstance(alpha_str, float):
            alpha_str = str(alpha_str)

        current_increment_data = {
                'type': train_eval, 
                'batch_size': BATCH_SIZE,
                'text': [string_result],
                'text_sha256': hashlib.sha256(string_result.encode()).hexdigest(),
                'text_md5': hashlib.md5(string_result.encode()).hexdigest(),
                'convergence': convergence_score,
                'perplexity': perplexity_score,
                'coherence': coherence_score,
                'topics': n_topics,
                'alpha_str': [alpha_str],
                'n_alpha': calculate_numeric_alpha(alpha_str),
                'beta_str': [beta_str],
                'n_beta': calculate_numeric_beta(beta_str),
                'passes': PASSES,
                'iterations': ITERATIONS,
                'update_every': UPDATE_EVERY,
                'eval_every': EVAL_EVERY,
                'chunksize': CHUNKSIZE,
                'random_state': RANDOM_STATE,
                'per_word_topics': PER_WORD_TOPICS,
                'top_words': [top_words],
                'lda_model': ldamodel_bytes,
                'time': time_of_method_call
        }

        models_data.append(current_increment_data)
        #garbage_collection(False, 'train_model(): convergence and perplexity score calculations')
        del batch_documents, streaming_documents, lda_model_gensim, current_increment_data, dictionary_batch

        return models_data
         

In [15]:
# Define a delayed version of the train_model function
@dask.delayed
def delayed_train_model(n_topics, alpha_value, beta_value, scattered_data, train_eval_type):
    # Call the train_model function here
    train_model(n_topics, alpha_value, beta_value, scattered_data, train_eval_type)

In [16]:
"""
                    - The `process_completed_future` function is called when all futures in a batch complete within the specified timeout. It 
                        can be used to continue with your program using both completed training and evaluation futures.
                    - The `retry_processing` function is called when there are incomplete futures after iterating through a batch of 
                        data. It can be used to retry processing with those incomplete futures.
                    - The code checks if there are any remaining futures in the lists after completing all iterations. If so, it 
                        waits for them to complete and handles them accordingly.
"""

# List to store parameters of models that failed to complete even after a retry
failed_model_params = []

# Mapping from futures to their corresponding parameters (n_topics, alpha_value, beta_value)
future_to_params = {}
def process_completed_futures(completed_train_futures, completed_eval_futures, log_dir):
    #print("we are in the process_completed_futures method()")
    # Process training futures
    for future in completed_train_futures:
        try:
            # Retrieve the result of the training future
            #if isinstance(future.result(), list):
            models_data = future.result()  # This should be a list of dictionaries
            if not isinstance(models_data, list):
                models_data = list(future.result())  # This should be a list of dictionaries
            #logging.info(f"this is the value of the TRAIN MODELS_DATA within the process_completed method: {models_data}")
            #else:
            #    models_data = list(future.result())
            #print("this is the value of models data:", models_data)
            
        except TypeError as e:
            logging.error(f"Error occurred during training: {e}")
            #sys.exit()
        else:
            # Iterate over each model's data and save it
            for model_data in models_data:
                # Check if models_data is a non-empty list before iterating
                if isinstance(models_data, list) and models_data:
                    for model_data in models_data:
                        #logging.info(f"this is the value of model TRAIN data: {model_data}")
                        #save_model_and_log(model_data=model_data, log_dir=log_dir, train_or_eval=True)
                        add_model_data_to_metadata(model_data)
                else:
                    # Handle the case where models_data is not as expected
                    logging.info(f"Received unexpected result from TRAIN future: {models_data}")

    # Process evaluation futures
    for future in completed_eval_futures:
        try:
            # Retrieve the result of the training future
            #if isinstance(future.result(), list):
            models_data = future.result()  # This should be a list of dictionaries
            if not isinstance(models_data, list):
                models_data = list(future.result())  # This should be a list of dictionaries
            #logging.info(f"this is the value of the EVAL MODELS_DATA within the process_completed method: {models_data}")
            #else:
            #    models_data = list(future.result())
            #print("this is the value of models data:", models_data)
        except TypeError as e:
            logging.error(f"Error occurred during evaluation: {e}")
            sys.exit()
        else:
            # Iterate over each model's data and save it
            for model_data in models_data:
                # Check if models_data is a non-empty list before iterating
                if isinstance(models_data, list) and models_data:
                    for model_data in models_data:
                        #logging.info(f"this is the value of model EVAL data: {model_data}")
                        #save_model_and_log(model_data=model_data, log_dir=log_dir, train_or_eval=False)
                        add_model_data_to_metadata(model_data)
                else:
                    # Handle the case where models_data is not as expected
                    logging.info(f"Received unexpected result from EVAL future: {models_data}")
                
    #garbage_collection(False, 'process_completed_futures(...)')
    del completed_eval_futures, completed_train_futures, models_data

            
# Function to retry processing with incomplete futures
def retry_processing(incomplete_train_futures, incomplete_eval_futures, timeout=None):
    #print("we are in the retry_processing method()")
    # Retry processing with incomplete futures using an extended timeout
    # Process completed ones after reattempting
    #done_train = [f for f in done if f in train_futures]
    #done_eval = [f for f in done if f in eval_futures]
    # Wait for completion of eval_futures
    done_eval, not_done_eval = wait(incomplete_eval_futures, timeout=timeout)  # return_when='FIRST_COMPLETED'
    #print(f"This is the size of the done_eval list: {len(done_eval)} and this is the size of the not_done_eval list: {len(not_done_eval)}")

    # Wait for completion of train_futures
    done_train, not_done_train = wait(incomplete_train_futures, timeout=timeout)  # return_when='FIRST_COMPLETED'
    #print(f"This is the size of the done_train list: {len(done_train)} and this is the size of the not_done_train list: {len(not_done_train)}")

    done = done_train.union(done_eval)
    not_done = not_done_eval.union(not_done_train)
                
    #print(f"WAIT completed in {elapsed_time} minutes")
    #print(f"This is the size of DONE {len(done)}. And this is the size of NOT_DONE {len(not_done)}\n")
    #print(f"this is the value of done_train {done_train}")

    completed_train_futures = [f for f in done_train]
    #print(f"We have completed the TRAIN list comprehension. The size is {len(completed_train_futures)}")
    #print(f"This is the length of the TRAIN completed_train_futures var {len(completed_train_futures)}")
            
    completed_eval_futures = [f for f in done_eval]
    #print(f"We have completed the EVAL list comprehension. The size is {len(completed_eval_futures)}")
    #print(f"This is the length of the EVAL completed_eval_futures var {len(completed_eval_futures)}")

    #logging.info(f"This is the size of completed_train_futures {len(completed_train_futures)} and this is the size of completed_eval_futures {len(completed_eval_futures)}")
    if len(completed_eval_futures) > 0 or len(completed_train_futures) > 0:
        process_completed_futures(completed_train_futures, completed_eval_futures, LOG_DIR) 
    
    # Record parameters of still incomplete futures for later review
    failed_model_params.extend(future_to_params[future] for future in not_done)
    print("We have exited the retry_preprocessing() method.")
    logging.info(f"There were {len(not_done_eval)} EVAL documents that couldn't be processed in retry_processing().")
    logging.info(f"There were {len(not_done_train)} TRAIN documents that couldn't be processed in retry_processing().")

    #garbage_collection(False, 'retry_processing(...)')

In [17]:
# Dictionary to keep track of retries for each task
task_retries = {}

# Function to perform exponential backoff
def exponential_backoff(attempt):
    return BASE_WAIT_TIME * (2 ** attempt)

# Function to handle failed futures and potentially retry them
def handle_failed_future(future, future_to_params, train_futures, eval_futures, client):
    print("We are in the handle_failed_future() method.\n")
    params = future_to_params[future]
    attempt = task_retries.get(params, 0)
    
    if attempt < MAX_RETRIES:
        print(f"Retrying task {params} (attempt {attempt + 1}/{MAX_RETRIES})")
        wait_time = exponential_backoff(attempt)
        sleep(wait_time)  
        
        task_retries[params] = attempt + 1
        
        new_future_train = client.submit(train_model, *params)
        new_future_eval = client.submit(train_model, *params)
        
        future_to_params[new_future_train] = params
        future_to_params[new_future_eval] = params
        
        train_futures.append(new_future_train)
        eval_futures.append(new_future_eval)
    else:
        print(f"Task {params} failed after {MAX_RETRIES} attempts. No more retries.")

    #garbage_collection(False,'handle_failed_future')

## Asynchronous Execution as said by Brunhilda:

Asynchronous execution allows you to execute tasks concurrently, without waiting for each task to complete before moving on \
to the next one. This can improve the overall efficiency and speed of your program.

In the given code snippet, asynchronous execution is achieved using Dask's as_completed function. This function takes a list \
of futures (representing tasks) and returns an iterator that yields futures as they complete.

Here's how it works:

&nbsp;&nbsp;&nbsp;&nbsp;(1) First, you submit all your training and evaluation tasks using client.submit(). These tasks are represented by futures. \
&nbsp;&nbsp;&nbsp;&nbsp;(2) You add callback functions (callback_train and callback_eval) to these futures using the add_done_callback() method. These callbacks will be executed when their respective futures complete.\
&nbsp;&nbsp;&nbsp;&nbsp;(3) You create two lists, train_futures and eval_futures, to store the futures for training and evaluation models respectively.\
&nbsp;&nbsp;&nbsp;&nbsp;(4) After submitting all the tasks, you enter a loop where you iterate over the range of values for n_topics, alpha_value, and beta_value.\
&nbsp;&nbsp;&nbsp;&nbsp;(5) Inside this loop, you submit the training and evaluation tasks for each combination of parameters using client.submit(). These new futures are added to their respective lists.\
&nbsp;&nbsp;&nbsp;&nbsp;(6) Next, you use the as_completed function to iterate over both lists of futures (train_futures and eval_futures). This function returns an iterator that yields completed futures as they become available.\
&nbsp;&nbsp;&nbsp;&nbsp;(7) As each future completes, its associated callback function (callback_train or callback_eval) is executed.\
&nbsp;&nbsp;&nbsp;&nbsp;(8) Inside these callback functions, you retrieve the result of the completed future using .result(). You can then save the trained or evaluated model using the provided save_model_and_log function.\
&nbsp;&nbsp;&nbsp;&nbsp;(9) The loop continues until all combinations of parameters have been processed.\
&nbsp;&nbsp;&nbsp;&nbsp;(10) Finally, after all models have been saved and logged, you close the Dask client. 

By utilizing asynchronous execution with Dask's as_completed, your program can process multiple tasks concurrently while still ensuring that each model is saved once its associated task has completed.

In [18]:
from tqdm import tqdm
if __name__=="__main__":

    cluster = LocalCluster(
            n_workers=CORES,
            threads_per_worker=THREADS_PER_CORE,
            processes=False,
            memory_limit=RAM_MEMORY_LIMIT,
            local_directory=DASK_DIR,
            #dashboard_address=None,
            dashboard_address=":8787",
            protocol="tcp",
    )


    # Create the distributed client
    client = Client(cluster)

    client.cluster.adapt(minimum=CORES, maximum=MAXIMUM_CORES)
    
    # Get information about workers from scheduler
    workers_info = client.scheduler_info()["workers"]

    # Iterate over workers and set their memory limits
    for worker_id, worker_info in workers_info.items():
        worker_info["memory_limit"] = RAM_MEMORY_LIMIT

    # Verify that memory limits have been set correctly
    #for worker_id, worker_info in workers_info.items():
    #    print(f"Worker {worker_id}: Memory Limit - {worker_info['memory_limit']}")

    # Check if the Dask client is connected to a scheduler:
    if client.status == "running":
        print("Dask client is connected to a scheduler.")
        # Scatter the embedding vectors across Dask workers
    else:
        print("Dask client is not connected to a scheduler.")
        print("The system is shutting down.")
        client.close()
        cluster.close()
        sys.exit()

    # Check if Dask workers are running:
    if len(client.scheduler_info()["workers"]) > 0:
        print("Dask workers are running.")
    else:
        print("No Dask workers are running.")
        print("The system is shutting down.")
        client.close()
        cluster.close()
        sys.exit()

    print("Creating training and evaluation samples...")
    
    started = time()
    
    scattered_train_data_futures = []
    scattered_eval_data_futures = []

    total_num_samples = get_num_records(DATA_SOURCE)

    whole_train_dataset = None
    whole_eval_dataset = None

    with tqdm(total=total_num_samples) as pbar:
        # Process each batch as it is generated
        for batch_info in futures_create_lda_datasets(DATA_SOURCE, TRAIN_RATIO):
            if batch_info['type'] == 'train':
                # Handle training data
                #print("We are inside the IF/ELSE block for producing TRAIN scatter.")
                try:
                    scattered_future = client.scatter(batch_info['data'])
                    scattered_train_data_futures.append(scattered_future)
                except Exception as e:
                    print("there was an issue with creating the TRAIN scattered_future list")
                
                if whole_train_dataset is None:
                    whole_train_dataset = batch_info['whole_dataset']
            elif batch_info['type'] == 'eval':
                # Handle evaluation data
                #print("We are inside the IF/ELSE block for producing EVAL scatter.")
                try:
                    scattered_future = client.scatter(batch_info['data'])
                    scattered_eval_data_futures.append(scattered_future)
                except Exception as e:
                    print("there was an issue with creating the EVAL scattererd_future list.")
                    print(e)
                    
                
                if whole_eval_dataset is None:
                    whole_eval_dataset = batch_info['whole_dataset']

            # Update the progress bar with the cumulative count of samples processed
            #pbar.update(batch_info['cumulative_count'] - pbar.n)
            pbar.update(len(batch_info['data']))

        pbar.close()  # Ensure closure of the progress bar

    print(f"Completed creation of training and evaluation documents in {round((time() - started)/60,2)} minutes.\n")
   
    print("Data scatter complete...\n")
    #garbage_collection(False, 'scattering training and eval data')
    #del scattered_future
    #del whole_train_dataset, whole_eval_dataset # these variables are not used at all

    train_futures = []  # List to store futures for training
    eval_futures = []  # List to store futures for evaluation
   
    num_topics = len(range(START_TOPICS, END_TOPICS + 1, STEP_SIZE))
    num_alpha_values = len(alpha_values)
    num_beta_values = len(beta_values)

    TOTAL_MODELS = (num_topics * num_alpha_values * num_beta_values) * 2

    #progress_bar = tqdm(total=TOTAL_MODELS, desc="Creating and saving models")

    train_eval = ['eval', 'train']

    # Create a list of all combinations of n_topics, alpha_value, beta_value, and train_eval
    combinations = list(itertools.product(range(START_TOPICS, END_TOPICS + 1, STEP_SIZE), alpha_values, beta_values, train_eval))

    # Separate the combinations into two lists based on 'train' and 'eval'
    train_combinations = [combo for combo in combinations if combo[-1] == 'train']
    eval_combinations = [combo for combo in combinations if combo[-1] == 'eval']

    # Calculate the sample size for each category
    sample_size = min(len(train_combinations), len(eval_combinations))

    # Select random combinations from each category
    random_train_combinations = random.sample(train_combinations, sample_size)
    random_eval_combinations = random.sample(eval_combinations, sample_size)

    # Combine the randomly selected train and eval combinations
    random_combinations = random_eval_combinations+ random_train_combinations
    sample_size = max(1, int(len(combinations) * 0.375))

    # Select random_combinations conditionally
    random_combinations = random.sample(combinations, sample_size) if sample_size < len(combinations) else combinations
    progress_bar = tqdm(total=len(random_combinations), desc="Creating and saving models")
    print(f"The random sample combinations contains {len(random_combinations)}")

    # Determine which combinations were not drawn by using set difference
    undrawn_combinations = list(set(combinations) - set(random_combinations))

    print(f"this leaves {len(undrawn_combinations)} remaining\n")

    # Create empty lists to store all future objects for training and evaluation
    train_futures = []
    eval_futures = []
    
    # Convert total memory from GB to bytes (1 GB = 1024^3 bytes)
    TOTAL_MEMORY_BYTES = 128 * (1024 ** 3)

    # Iterate over the combinations and submit tasks
    for n_topics, alpha_value, beta_value, train_eval_type in random_combinations:

        # determine if throttling is needed
        logging.info("\nEvaluating if adaptive throttling is necessary (method exponential backoff)...")
        started, throttle_attempt = time(), 0

        while throttle_attempt < MAX_RETRIES and not all(worker['metrics']['cpu'] < CPU_UTILIZATION_THRESHOLD for worker in client.scheduler_info()['workers'].values()):
            logging.info(f"Adaptive throttling (attempt {throttle_attempt} of {MAX_RETRIES-1}")
            #logging.info(f"for LdaModel hyperparameters combination -- type: {train_eval_type}, topic: {n_topics}, ALPHA: {alpha_value} and ETA {beta_value}")
            sleep(exponential_backoff(throttle_attempt))
            throttle_attempt += 1

        logging.info(f"Adaptive throttling (method: exponential backoff) {'completed in {:.2f} seconds'.format(time() - started) if throttle_attempt else 'was not necessary...'}\n")

        #logging.info(f"for LdaModel hyperparameters combination -- type: {train_eval_type}, topic: {n_topics}, ALPHA: {alpha_value} and ETA {beta_value}")
        # Submit a future for each scattered data object in the training list
        #if train_eval_type == 'train':
        # Submit a future for each scattered data object in the training list
        for scattered_data in scattered_train_data_futures:
            future = client.submit(train_model, n_topics, alpha_value, beta_value, scattered_data, 'train')
            train_futures.append(future)
            logging.info(f"The training value is being appended to the train_futures list. Size: {len(train_futures)}")

        # Submit a future for each scattered data object in the evaluation list
        #if train_eval_type == 'eval':
        for scattered_data in scattered_eval_data_futures:
            future = client.submit(train_model, n_topics, alpha_value, beta_value, scattered_data, 'eval')
            eval_futures.append(future)
            logging.info(f"The evaluation value is being appended to the eval_futures list. Size: {len(eval_futures)}")
        #garbage_collection(False, 'client.submit(train_model(...) train and eval)')


        # Map the created futures to their parameters so we can identify them later if needed
        for future in train_futures:
            future_to_params[future] = ('train',n_topics, alpha_value, beta_value)

        # Do the same for eval_futures
        for future in eval_futures:
            future_to_params[future] = ('eval', n_topics, alpha_value, beta_value)

        #train_futures.append(all_train_futures)
        #eval_futures.append(all_eval_futures)
        #print(f"This is the size of the eval_futures {len(eval_futures)}")
        #print(f"this is the eval futures: {eval_futures}\n\n")
            
        # Check if it's time to process futures based on BATCH_SIZE
        #if int(len(train_futures)/3) >= (BATCH_SIZE % 10):
        train_eval_count = train_futures + eval_futures
        if int(len(train_eval_count)) >= (BATCH_SIZE % 10):
            print("In holding pattern until WAIT completes.")
            started = time()
                
            #done, not_done = wait(train_futures + eval_futures, timeout=None)        # Wait for all reattempted futures with an extended timeout (e.g., 120 seconds)

            # Process completed ones after reattempting
            #done_train = [f for f in done if f in train_futures]
            #done_eval = [f for f in done if f in eval_futures]
            # Wait for completion of eval_futures
            done_eval, not_done_eval = wait(eval_futures, timeout=None)  # return_when='FIRST_COMPLETED'
            print(f"This is the size of the done_eval list: {len(done_eval)} and this is the size of the not_done_eval list: {len(not_done_eval)}")

            # Wait for completion of train_futures
            done_train, not_done_train = wait(train_futures, timeout=None)  # return_when='FIRST_COMPLETED'
            print(f"This is the size of the done_train list: {len(done_train)} and this is the size of the not_done_train list: {len(not_done_train)}")

            done = done_train.union(done_eval)
            not_done = not_done_eval.union(not_done_train)
                
            elapsed_time = round(((time() - started) / 60), 2)
            print(f"WAIT completed in {elapsed_time} minutes")
            print(f"This is the size of DONE {len(done)}. And this is the size of NOT_DONE {len(not_done)}\n")
            #print(f"this is the value of done_train {done_train}")

            completed_train_futures = [f for f in done_train]
            print(f"We have completed the TRAIN list comprehension. The size is {len(completed_train_futures)}")
            print(f"This is the length of the TRAIN completed_train_futures var {len(completed_train_futures)}")
            
            completed_eval_futures = [f for f in done_eval]
            print(f"We have completed the EVAL list comprehension. The size is {len(completed_eval_futures)}")
            print(f"This is the length of the EVAL completed_eval_futures var {len(completed_eval_futures)}")

            logging.info(f"This is the size of completed_train_futures {len(completed_train_futures)} and this is the size of completed_eval_futures {len(completed_eval_futures)}")
            process_completed_futures(completed_train_futures, completed_eval_futures, LOG_DIR)
            progress_bar.update(len(done))

            # Handle failed futures using the previously defined function
            for future in not_done:
                failed_future_timer = time()
                print("Handling of failed future method has been initiated.")
                handle_failed_future(future, future_to_params, train_futures,  eval_futures, client)
                elapsed_time = round(((time() - started) / 60), 2)
                print(f"It took {elapsed_time} minutes to handle the failed future.")


            # If no tasks are pending (i.e., all have been processed), consider increasing BATCH_SIZE.
            completed_tasks = 0
            completed_tasks += len(done_train) + len(done_eval)

            # If no tasks are pending (i.e., all have been processed), consider increasing BATCH_SIZE.
            if completed_tasks >= len(train_futures) + len(eval_futures):
                BATCH_SIZE = int(math.ceil(BATCH_SIZE * INCREASE_FACTOR)) if int(math.ceil(BATCH_SIZE * INCREASE_FACTOR)) < MAX_BATCH_SIZE else MAX_BATCH_SIZE
                print(f"Increasing batch size to {BATCH_SIZE}")

            # If there are any tasks that were not done, consider decreasing BATCH_SIZE.
            else:
                BATCH_SIZE = max(1, int(BATCH_SIZE * DECREASE_FACTOR)) if max(1, int(BATCH_SIZE * DECREASE_FACTOR)) > 0 else BATCH_SIZE
                print(f"Decreasing batch size to {BATCH_SIZE}")

            # reset lists to empty for next iteration of models
            train_futures.clear()
            eval_futures.clear()

            #defensive programming to ensure utility model lists are empty
            done.clear()
            not_done.clear()
            done_train.clear()
            done_eval.clear()
            not_done_eval.clear()
            not_done_train.clear()
         
    #garbage_collection(False, "Cleaning WAIT -> done, not_done")     

    progress_bar.close()

    # After all loops have finished running...
    if len(train_futures) > 0 or len(eval_futures) > 0:
        print("we are in the first IF statement for retry_processing()")
        retry_processing(train_futures, eval_futures, TIMEOUT)


    # Now give one more chance with extended timeout only to those that were incomplete previously
    if len(failed_model_params) > 0:
        print("Retrying incomplete models with extended timeout...")
        
        # Create new lists for retrying futures
        retry_train_futures = []
        retry_eval_futures = []

        # Resubmit tasks only for those that failed in the first attempt
        for params in failed_model_params:
            n_topics, alpha_value, beta_value = params
            
            with performance_report(filename=PERFORMANCE_TRAIN_LOG):
                future_train_retry = client.submit(train_model, n_topics, alpha_value, beta_value, scattered_train_data_futures, 'train')
                future_eval_retry = client.submit(train_model, n_topics, alpha_value, beta_value, scattered_eval_data_futures, 'eval')

            retry_train_futures.append(future_train_retry)
            retry_eval_futures.append(future_eval_retry)

            # Keep track of these new futures as well
            future_to_params[future_train_retry] = params
            future_to_params[future_eval_retry] = params

        # Clear the list of failed model parameters before reattempting
        failed_model_params.clear()

        # Wait for all reattempted futures with an extended timeout (e.g., 120 seconds)
        done, not_done = wait(retry_train_futures + retry_eval_futures ) #, timeout=EXTENDED_TIMEOUT)

        # Process completed ones after reattempting
        process_completed_futures([f for f in done if f in retry_train_futures],
                                [f for f in done if f in retry_eval_futures],
                                LOG_DIR)
        
        progress_bar.update(len(done))

        # Record parameters of still incomplete futures after reattempting for later review
        for future in not_done:
            failed_model_params.append(future_to_params[future])

        # At this point `failed_model_params` contains the parameters of all models that didn't complete even after a retry

    #client.close()
    print("The training and evaluation loop has completed.")

    if len(failed_model_params) > 0:
        # You can now review `failed_model_params` to see which models did not complete successfully.
        logging.error("The following model parameters did not complete even after a second attempt:")
    #    perf_logger.info("The following model parameters did not complete even after a second attempt:")
        for params in failed_model_params:
            logging.error(params)
    #        perf_logger.info(params)
            
client.close()
cluster.close()

Dask client is connected to a scheduler.
Dask workers are running.
Creating training and evaluation samples...


  0%|          | 0/8495 [00:00<?, ?it/s]

the number of records read from the JSON file: 8495
the number of documents sampled from the JSON file: 8495



8599it [00:10, 826.43it/s]                          


Completed creation of training and evaluation documents in 0.17 minutes.

Data scatter complete...



Creating and saving models:   0%|          | 0/495 [00:00<?, ?it/s]

The random sample combinations contains 495
this leaves 825 remaining

In holding pattern until WAIT completes.


2024-09-06 16:36:51,117 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51022
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 16:36:51,259 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51022
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 7.52 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:   6%|▌         | 29/495 [07:40<2:03:18, 15.88s/it]

Increasing batch size to 600
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 2.58 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  12%|█▏        | 58/495 [10:22<1:11:31,  9.82s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 16:46:05,213 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51202
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2874, in get_data_from_worker
    response = await send_recv(
               ^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\core.py", line 101

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.94 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  18%|█▊        | 87/495 [16:30<1:15:45, 11.14s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 16:52:21,288 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51825
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 16:52:21,347 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:51825 -> tcp://127.0.0.1:52562
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 6.11 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  23%|██▎       | 116/495 [22:48<1:15:05, 11.89s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 16:59:15,540 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51046
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 6.34 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  29%|██▉       | 145/495 [29:16<1:12:26, 12.42s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.73 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  35%|███▌      | 174/495 [35:09<1:06:01, 12.34s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 17:08:41,313 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:52089
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2874, in get_data_from_worker
    response = await send_recv(
               ^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\core.py", line 101

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 2.99 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  41%|████      | 203/495 [38:22<50:59, 10.48s/it]  

Increasing batch size to 650
In holding pattern until WAIT completes.




This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 7.26 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  47%|████▋     | 232/495 [45:47<52:42, 12.02s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.48 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  53%|█████▎    | 261/495 [51:22<46:19, 11.88s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.03 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  59%|█████▊    | 290/495 [56:30<39:15, 11.49s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 17:31:41,554 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51031
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 4.88 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  64%|██████▍   | 319/495 [1:01:31<32:42, 11.15s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 17:39:25,914 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:55334
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2874, in get_data_from_worker
    response = await send_recv(
               ^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\core.py", line 101

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 8.16 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  70%|███████   | 348/495 [1:09:48<31:47, 12.98s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 9.21 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  76%|███████▌  | 377/495 [1:19:07<29:16, 14.89s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 6.19 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  82%|████████▏ | 406/495 [1:25:29<21:18, 14.37s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 4.26 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  88%|████████▊ | 435/495 [1:29:57<12:49, 12.82s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:04:42,800 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:55779
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 4.38 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models:  94%|█████████▎| 464/495 [1:34:33<06:06, 11.82s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 6.5 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 100%|█████████▉| 493/495 [1:41:14<00:24, 12.43s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:17:18,903 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:56234
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 18:17:18,916 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:56234
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                            

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.9 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 522it [1:47:17, 12.46s/it]                         

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:24:29,396 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:58372
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2874, in get_data_from_worker
    response = await send_recv(
               ^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\core.py", line 101

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 7.43 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 551it [1:54:50, 13.40s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:28:28,642 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51040
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 18:28:28,669 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51040
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 2.88 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 580it [1:57:53, 11.28s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:33:49,307 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:60944
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2874, in get_data_from_worker
    response = await send_recv(
               ^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\core.py", line 101

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.73 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 609it [2:03:45, 11.53s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:39:14,973 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51205
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.23 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 638it [2:09:06, 11.39s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:45:09,607 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:59248
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.9 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 667it [2:15:10, 11.75s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:49:31,593 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:60373
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 3.92 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 696it [2:19:17, 10.77s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 18:58:46,098 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:54875
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 9.9 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 725it [2:29:20, 13.78s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:03:30,270 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:62257
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 19:03:30,327 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:62257 -> tcp://127.0.0.1:63934
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 3.51 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 754it [2:33:07, 11.99s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:10:09,086 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:63934
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 7.58 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 783it [2:40:56, 13.25s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:15:23,770 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:63133
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 19:15:23,791 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:63133 -> tcp://127.0.0.1:64301
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 3.86 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 812it [2:45:05, 11.85s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:21:04,690 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:64301
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.69 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 841it [2:50:57, 11.94s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:26:54,865 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:62774
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 467, in read_into
    self._try_inline_read()
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 835, in _try_inline_read
    self._check_closed()
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 998, in _check_closed
    raise Strea

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 5.57 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 870it [2:56:40, 11.90s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:30:47,871 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:61790
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 467, in read_into
    self._try_inline_read()
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 835, in _try_inline_read
    self._check_closed()
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 998, in _check_closed
    raise Strea

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 3.44 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 899it [3:00:21, 10.62s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.




This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 3.82 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 928it [3:04:24,  9.94s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:39:21,333 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:64943
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 19:39:21,354 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:64943
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 4.04 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


Creating and saving models: 957it [3:08:43,  9.64s/it]

Increasing batch size to 650
In holding pattern until WAIT completes.




This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 6.25 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


2024-09-06 19:46:13,063 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51034
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

Increasing batch size to 650
In holding pattern until WAIT completes.
This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0


2024-09-06 19:50:06,541 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:50506
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 366, in read_bytes_rw
    actual = await stream.read_into(chunk)  # type: ignore[arg-type]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\env

This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 2.64 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


2024-09-06 19:50:07,103 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:51317 -> tcp://127.0.0.1:52025
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 1778, in get_data
    response = await comm.read(deserializers=serializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", li

Increasing batch size to 650
In holding pattern until WAIT completes.


2024-09-06 19:59:18,032 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51025
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2056, in gather_dep
    response = await get_data_from_worker(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 2888, in get_data_from_worker
    await comm.write("OK")
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
2024-09-06 19:59:18,413 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:51025
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                            

This is the size of the done_eval list: 6 and this is the size of the not_done_eval list: 0
This is the size of the done_train list: 23 and this is the size of the not_done_train list: 0
WAIT completed in 9.82 minutes
This is the size of DONE 29. And this is the size of NOT_DONE 0

We have completed the TRAIN list comprehension. The size is 23
This is the length of the TRAIN completed_train_futures var 23
We have completed the EVAL list comprehension. The size is 6
This is the length of the EVAL completed_eval_futures var 6


2024-09-06 20:00:42,806 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:52025 -> tcp://127.0.0.1:52261
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 861, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\iostream.py", line 1116, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\worker.py", line 1778, in get_data
    response = await comm.read(deserializers=serializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File

FutureCancelledError: train_model-b5bdfd1ef1d23484a812e324257e220e cancelled for reason: unknown.

2024-09-06 20:01:17,092 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:52261
Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\tcp.py", line 546, in connect
    stream = await self.client.connect(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\tornado\tcpclient.py", line 279, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\pqn7\.conda\envs\lda\Lib\site-packages\distributed\comm\core.py", line 342, in connect
    comm = await wait_for(
         

The provided code snippet is part of a larger program that appears to be running machine learning model training and evaluation tasks in parallel using Dask, a flexible parallel computing library for analytic computing. The code manages the execution of these tasks, handling retries for incomplete tasks, and tracking failures.

The script begins by initializing an empty list called failed_model_params to store the parameters of models that fail to complete even after a retry. It also creates a dictionary named future_to_params to map "futures" (a representation of an asynchronous execution) to their corresponding model parameters.

Two functions are defined: process_completed_futures, which processes completed futures, and retry_processing, which attempts to reprocess incomplete futures with an extended timeout period.

The main part of the script sets up multiple training and evaluation tasks across different combinations of hyperparameters (n_topics, alpha_value, and beta_value). These tasks are submitted to a Dask client asynchronously using the client.submit method. Each task returns a future, which is then mapped to its parameters in the future_to_params dictionary for later reference.

The script uses batch processing controlled by a variable called BATCH_SIZE. Once enough futures have been accumulated, or when all loops have finished running, it waits for all futures within each batch to complete using Dask's wait function with a specified timeout (TIMEOUT). Completed futures are processed while those that remain incomplete are recorded in the failed_model_params list for further action.

After processing each batch, if there are any remaining futures (either from incomplete batches or from the final iteration), they are retried using the previously defined retry_processing function with the same timeout value.

If there are still models that failed after this first attempt, they get one more chance. The script prints out a message indicating it will retry these incomplete models with an extended timeout (EXTENDED_TIMEOUT). It resubmits these tasks and waits again for completion. Any models that remain incomplete after this second attempt are added back into the failed_model_params.

Finally, once all retries have been exhausted and progress has been tracked via a progress bar (tqdm), the Dask client is closed. The script prints out and logs information about any model parameters that did not complete successfully even after two attempts.

In summary, this code automates the process of submitting parallelized machine learning training and evaluation jobs over various hyperparameter combinations, handles timeouts by retrying incomplete jobs,

In [None]:
import pandas as pd
import pyarrow.parquet as pa

# Uncomment the next two lines if you want to view the file's schema.
# parquetFile = pa.ParquetFile('test.parquet')
# print(parquetFile.schema)

df = pd.read_parquet(r'C:\_harvester\data\lda-models\2010s_html\metadata\metadata.parquet')
df.to_csv(r'C:\_harvester\data\lda-models\2010s_html\metadata\metadata.csv', sep=';')