# Baseline Optimization

The primary objective of this baseline optimization run is to establish a foundational performance benchmark for the Latent Dirichlet Allocation (LDA) model implemented via the Gensim library on a pre-processed corpus. This will serve as the starting point for subsequent parameter tuning and optimizations.

## Purpose

1. Establish Baseline Metrics: To determine the initial effectiveness of the LDA model in discovering coherent and meaningful topics from the corpus without any parameter tuning.
2. Parameter Sensitivity Analysis: To identify which parameters have the most significant impact on the model's performance, providing a focused direction for detailed tuning.
3. Computational Efficiency Evaluation: To assess the training time and resource utilization of the model under default settings, ensuring the scalability and practicality of further experiments.

## Metrics for Evaluation

- Topic Coherence (C_v): Measures the degree of semantic similarity between the high-scoring words in each topic. This metric helps evaluate the interpretability and quality of the topics extracted by the model. Higher values indicate more coherent topics.
- Perplexity: Evaluates how well the probability distribution predicted by the model matches the actual distribution of the words in the documents. Lower values indicate better fitting models.
- Computation Time: The wall-clock time needed to train the model for a given set of parameter values. Shorter training is convenient, but it says nothing about the quality of the resulting model.
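
One practical caveat when reading perplexity from Gensim: `LdaModel.log_perplexity` returns a per-word likelihood bound rather than perplexity itself, and Gensim's own log message derives the perplexity estimate as 2 raised to the negative bound. The sketch below shows that conversion; the helper name `bound_to_perplexity` is hypothetical and not part of Gensim.

```python
def bound_to_perplexity(per_word_bound: float) -> float:
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2.0 ** (-per_word_bound)


# A less negative bound corresponds to a lower (better) perplexity.
print(bound_to_perplexity(-8.0))  # 256.0
print(bound_to_perplexity(-7.0))  # 128.0
```

Because of this relationship, a bound closer to zero is better, while for perplexity itself lower is better.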

## Methodology
1. Model Configuration
    - Number of Topics (num_topics): Start with a mid-range value, such as 10 or 20, to gauge the granularity of topics extracted from the corpus. The baseline code in this notebook uses 20.
    - Learning Method: Use online learning with default parameters for initial runs to assess the model's adaptability to incremental data.
    - Iterations and Passes: Leave at the Gensim defaults (`passes=1`, `iterations=50`) to evaluate the out-of-the-box convergence behavior of the model.
2. Execution
    - Training the Model: Utilize the pre-processed corpus to train the LDA model using the specified configurations.
    - Logging and Monitoring: Record the training progress, including computation time and intermediate metric scores (coherence and perplexity) for each iteration.
    - Per genre & meta: Training is conducted both on the entire dataset (meta) and on each genre separately. This establishes a baseline for every case and allows baseline differences between genres to be identified.
3. Post-Processing
    - Topic Examination: Review the topics generated by the model for relevance and coherence.
    - Metric Calculation: Compute the coherence and perplexity scores for the model.
4. Documentation
    - Reporting: Document all findings, including configurations, system utilization, and performance metrics.
    - Recommendations for Optimization: Based on the baseline results, recommend parameters for subsequent tuning phases.
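
The execution and logging steps above can be sketched as a loop that trains one model per corpus and records its wall-clock time. This is only a minimal sketch under stated assumptions: `run_baselines`, `train_fn`, `stub_train`, and the toy corpora are hypothetical stand-ins, and the stub trainer replaces real LDA training so that the example runs without Gensim.

```python
import time
from typing import Callable, Dict, List


def run_baselines(corpora: Dict[str, list], train_fn: Callable[[list], dict]) -> List[dict]:
    """Train one baseline per named corpus and record its wall-clock time."""
    rows = []
    for name, corpus in corpora.items():
        start = time.perf_counter()
        metrics = train_fn(corpus)  # expected to return a dict of metric scores
        elapsed = time.perf_counter() - start
        rows.append({"genre": name, "time": elapsed, **metrics})
    return rows


def stub_train(corpus: list) -> dict:
    """Hypothetical stand-in for LDA training: only counts documents."""
    return {"n_docs": len(corpus)}


# Toy tokenized documents; 'meta' is the union of all genres.
rock = [["guitar", "road"], ["night", "drive"]]
pop = [["love", "dance"]]
corpora = {"meta": rock + pop, "rock": rock, "pop": pop}

results = run_baselines(corpora, stub_train)
for row in results:
    print(row["genre"], row["n_docs"])
```

Keeping the corpora in a dictionary means the meta and per-genre runs share one code path, so every baseline row has the same columns and can be compared directly.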

## Expected Outcomes

The execution of this plan is expected to yield a comprehensive understanding of the baseline capabilities of the LDA model on the given corpus. The results will guide further fine-tuning of the model parameters, with a specific focus on enhancing topic quality and optimizing computational resources.

In [2]:
# Import packages
import time

from gensim.models import CoherenceModel, LdaModel
from gensim.corpora import Dictionary
import numpy as np
import pandas as pd

from src.utils import load_processed_dataset

# Import data
df = load_processed_dataset("../../../data/processed/song_lyrics_sampled_processed.csv")

# Separate into genres
meta = df.lyrics
country = df[df['tag'] == 'country'].lyrics
rap = df[df['tag'] == 'rap'].lyrics
rb = df[df['tag'] == 'rb'].lyrics
pop = df[df['tag'] == 'pop'].lyrics
rock = df[df['tag'] == 'rock'].lyrics

# Extract features
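def bow_extract(texts):
    """Build a Gensim dictionary and bag-of-words corpus from tokenized texts."""
    # Map each unique token to an integer id
    dictionary = Dictionary(texts)

    # Convert each document to a list of (token_id, count) pairs
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    return dictionary, corpus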

meta_dct, meta_corpus = bow_extract(meta)
country_dct, country_corpus = bow_extract(country)
rap_dct, rap_corpus = bow_extract(rap)
rb_dct, rb_corpus = bow_extract(rb)
pop_dct, pop_corpus = bow_extract(pop)
rock_dct, rock_corpus = bow_extract(rock)

def train_baseline_model(genre, dct, corpus, lyrics):
    print(f"Training baseline {genre} model...")
    
    baseline_data = {'genre': genre}
    
    # perf_counter is a monotonic clock, better suited to timing than time.time
    start = time.perf_counter()
    model = LdaModel(corpus=corpus, id2word=dct, num_topics=20)
    end = time.perf_counter()
    
    print(f"Trained {genre} model after {end-start}...")
    
    model.save(f"../../../models/sampled/baseline/{genre}_baseline.model")
    
    baseline_data['time'] = end - start
    
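    # C_v is CoherenceModel's default; stated explicitly to match the metrics section
    coh_model = CoherenceModel(model=model, texts=lyrics, dictionary=dct, coherence='c_v')
    baseline_data['coherence'] = coh_model.get_coherence()
    # log_perplexity returns the per-word likelihood bound; perplexity is 2^(-bound)
    baseline_data['perplexity'] = np.exp2(-model.log_perplexity(corpus))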
    return baseline_data

baselines = []
baselines.append(train_baseline_model('meta', meta_dct, meta_corpus, meta))
baselines.append(train_baseline_model('country', country_dct, country_corpus, country))
baselines.append(train_baseline_model('rap', rap_dct, rap_corpus, rap))
baselines.append(train_baseline_model('rb', rb_dct, rb_corpus, rb))
baselines.append(train_baseline_model('pop', pop_dct, pop_corpus, pop))
baselines.append(train_baseline_model('rock', rock_dct, rock_corpus, rock))

baseline_df = pd.DataFrame(baselines)
baseline_df.to_csv("../../../data/optimization/baseline/baseline_performances.csv")

Training baseline meta model...
Trained meta model after 31.76366686820984...
Training baseline country model...
Trained country model after 7.315662384033203...
Training baseline rap model...
Trained rap model after 12.06016492843628...
Training baseline rb model...
Trained rb model after 7.64744234085083...
Training baseline pop model...
Trained pop model after 7.286738872528076...
Training baseline rock model...
Trained rock model after 8.38888430595398...
