## Initialize Project

We'll initialize the project using the `HyFI.initialize` function. The function takes the following parameters:

- `project_name`: Name of the project.
- `project_description`: Short description of the project.
- `project_root`: Root directory of the project.
- `project_workspace_name`: Name of the project's workspace directory.
- `global_hyfi_root`: Root directory of the global hyfi.
- `global_workspace_name`: Name of the global hierarchical workspace directory.
- `num_workers`: Number of workers to run.
- `logging_level`: Logging level, for example `INFO` or `DEBUG`.
- `autotime`: Whether to load the `autotime` notebook extension, which reports the run time of each cell.
- `retina`: Whether to enable retina (high-resolution) display for figures.
- `verbose`: Whether to enable verbose logging.

We'll check if we're running in Google Colab, and if so, we'll mount Google Drive.
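
For readers curious what a Colab check can look like, here is a minimal sketch. It is not HyFI's implementation of `HyFI.is_colab()`, and the helper name `running_in_colab` is hypothetical. It relies only on the fact that the `google.colab` package is importable inside a Colab runtime and normally absent elsewhere.

```python
def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported.

    Hypothetical helper for illustration; HyFI's own check may differ.
    """
    try:
        import google.colab  # noqa: F401  (normally only present inside Colab)
    except ImportError:
        return False
    return True


print("Colab detected:", running_in_colab())
```

Drive only needs to be mounted when the check succeeds, which is why the cell below guards `HyFI.mount_google_drive()` with `HyFI.is_colab()`.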


In [1]:
import os
from thematos import HyFI

os.environ["HYFI_LOG_LEVEL"] = "DEBUG"

if HyFI.is_colab():
    HyFI.mount_google_drive()

project_dir = HyFI.DotEnvConfig().DOTENV_DIR

h = HyFI.initialize(
    project_name="thematos",
    project_root=project_dir,
    logging_level="INFO",
    verbose=True,
)

print("project_dir:", h.project.root_dir)
print("project_workspace_dir:", h.project.workspace_dir)

INFO:hyfi.utils.notebooks:Google Colab not detected.
INFO:hyfi.utils.notebooks:Extension autotime not found. Install it first.
INFO:hyfi.joblib.joblib:initialized batcher with <hyfi.joblib.batch.batcher.Batcher object at 0x7fe3dc7765b0>
INFO:hyfi.main.config:HyFi project [thematos] initialized


project_dir: /raid/cis/yjlee/workspace/projects/thematos
project_workspace_dir: /raid/cis/yjlee/workspace/projects/thematos/workspace


In [2]:
from thematos.datasets import Corpus

data_file = (
    h.project.root_dir / "workspace/datasets/processed/khmer_tokenized/train.parquet"
)
data_load = {"data_file": str(data_file)}
c = Corpus(data_load=data_load, id_col="id", text_col="text")


INFO:hyfi.task.batch:Initalized batch: corpus(0) in /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/corpus


In [3]:
from thematos.models import WordPrior

data_file = h.project.root_dir / "workspace/datasets/word_prior.yaml"


w = WordPrior(data_file=str(data_file), verbose=True)
w.priors

INFO:hyfi.task.batch:Initalized batch: corpus(0) in /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/corpus


{0: ['central', 'bank', 'nbc'], 1: ['rates', 'interest']}
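
The output above maps topic ids to seed words. How `WordPrior` and `LdaModel` consume this mapping internally is not shown in this notebook, so the following is only a sketch of one common way seed words are used in seeded topic models: each seed word gets a weight vector over the `k` topics, with a larger weight on its assigned topic. The function name `seed_word_weights` and the `boost` and `base` values are assumptions made for illustration.

```python
def seed_word_weights(priors, k, boost=1.0, base=0.1):
    """Build a per-word weight vector over k topics from {topic_id: [words]}.

    Hypothetical helper: boost and base are illustrative values,
    not defaults taken from thematos.
    """
    weights = {}
    for topic_id, words in priors.items():
        if not 0 <= topic_id < k:
            raise ValueError(f"topic id {topic_id} is outside range(0, {k})")
        for word in words:
            # Start every word at the base weight for all topics,
            # then raise the weight of the topic it seeds.
            vec = weights.setdefault(word, [base] * k)
            vec[topic_id] = boost
    return weights


priors = {0: ["central", "bank", "nbc"], 1: ["rates", "interest"]}
weights = seed_word_weights(priors, k=10)
print(weights["bank"])
```

With the priors above, `bank` is weighted toward topic 0 and `rates` toward topic 1, while all other topics keep the base weight.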

In [4]:
from thematos.models import LdaModel

lda = LdaModel(corpus=c, wordprior=w, verbose=True)
lda.model_args.k = 10


INFO:hyfi.batch.batch:Setting seed to 642562094
INFO:hyfi.batch.batch:Init batch - Batch name: model, Batch num: 0
INFO:hyfi.batch.batch:Init batch - Batch name: model, Batch num: 0
INFO:hyfi.task.batch:Initalized batch: model(0) in /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/model


In [5]:
lda.model_args.model_dump()


{'tw': 1,
 'min_cf': 5,
 'min_df': 0,
 'rm_top': 0,
 'k': 10,
 'alpha': 0.1,
 'eta': 0.01,
 '_target_': 'thematos.models.config.LdaConfig',
 '_config_name_': 'lda',
 '_config_group_': '/model/config'}

In [None]:
lda.train()


In [7]:
lda.load(batch_num=0)


INFO:hyfi.task.batch:> Loading config for batch_name: model batch_num: 0
INFO:hyfi.task.batch:Loading config from /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/model/configs/model(0)_config.yaml
INFO:hyfi.task.batch:Merging config with the loaded config
INFO:hyfi.task.batch:Updating config with config_kwargs: {}
INFO:hyfi.task.batch:Initalized batch: corpus(0) in /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/corpus
INFO:hyfi.task.batch:Initalized batch: model(0) in /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/model
INFO:hyfi.utils.iolibs:Processing [1] files from ['/raid/cis/yjlee/workspace/projects/thematos/workspace/topic/model/outputs/LDA_model(0)_k(10)-ll_per_word.csv']
INFO:hyfi.utils.datasets.load:Loading data from /raid/cis/yjlee/workspace/projects/thematos/workspace/topic/model/outputs/LDA_model(0)_k(10)-ll_per_word.csv
INFO:hyfi.utils.datasets.load: >> elapsed time to load data: 0:00:00.010361
INFO:hyfi.utils.iolibs:Processing [1] fil