# LDA Optimization 1

To obtain the best possible model for each genre, I will optimize the model iteratively. This first set of optimization runs covers a large search area over all of the appropriate parameters, to get a broad sense of which parameters matter most for the coherence score of the `gensim` `LdaModel`.

After this, I will explore the outcomes of this optimization, visualizing which parameters have the largest impact on the coherence score. Using this knowledge, I will further optimize the parameters in a subsequent optimization run.

To conduct this first run, I will use these libraries:
- pandas
- NumPy
- scikit-optimize (skopt)
- gensim

pandas loads and prepares the data. gensim's `Dictionary` and its `doc2bow` method then extract the bag-of-words features. Finally, scikit-optimize's `gp_minimize` performs the optimization on gensim's `LdaModel`.

The optimization run will search over these parameters of the `LdaModel`:
- `num_topics`
- `decay`
- `gamma_threshold`
- `minimum_probability`
- `offset`
- `iterations`

The steps to perform this optimization are as follows:

1. Import the libraries
2. Import the data
3. Iterate over each genre:
    1. Obtain genre-specific lyrics
    2. Turn lyrics into series
    3. Clean the lyrics using the `LyricsCleaner` class from `src`.
    4. Create gensim `Dictionary`.
    5. Create the bag-of-words corpus using `Dictionary.doc2bow`.
    6. Initialize search space.
    7. Optimize gensim `LdaModel` on the search space using `gp_minimize`.
    8. Create results `DataFrame` and save results to `.csv` file using pandas.

To build this, I will create two functions:

1. A function to extract the features after cleaning.
2. A function to optimize the model.

After the genre data is prepared, both functions will be called inside the `for` loop over genres.
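
The overall shape of that loop can be sketched as follows. This is only a structural outline under assumptions: `group_by_genre` and `run_per_genre` are hypothetical names, the input is assumed to be `(genre, tokens)` pairs, and the two callables passed in are trivial placeholders for the real feature extraction and optimization functions.

```python
from collections import defaultdict


def group_by_genre(rows):
    """Group (genre, tokens) pairs into {genre: [tokens, ...]}."""
    grouped = defaultdict(list)
    for genre, tokens in rows:
        grouped[genre].append(tokens)
    return dict(grouped)


def run_per_genre(rows, extract_features, optimize):
    """Call the two pipeline functions once per genre and collect the results."""
    results = {}
    for genre, docs in group_by_genre(rows).items():
        features = extract_features(docs)
        results[genre] = optimize(features)
    return results


# Toy input: (genre, tokenized lyrics) pairs.
rows = [
    ("rock", ["guitar", "road"]),
    ("pop", ["dance", "night"]),
    ("rock", ["road", "night"]),
]

# Placeholder callables standing in for the real extraction and optimization.
results = run_per_genre(
    rows,
    extract_features=lambda docs: sorted({t for doc in docs for t in doc}),
    optimize=lambda vocab: {"vocab_size": len(vocab)},
)
print(results)
```

Passing the two functions in as arguments keeps the loop itself independent of the feature representation, which is convenient if the extraction step changes between runs.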


## Importing the libraries

As mentioned, I will first import the necessary libraries. I will also import the `LyricsCleaner` class and two helper functions from `src`, and load the stop words and vocabulary mappings that the cleaner needs.

In [1]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
import numpy as np
import pandas as pd
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args

from src.clean.lyrics_cleaner_class import LyricsCleaner
from src.utils import txt_to_set, read_json_mapping

stop_words = txt_to_set("../../../data/vocab/stopwords.txt")
contractions = read_json_mapping("../../../data/vocab/contractions.json")
dropped_gs = read_json_mapping("../../../data/vocab/dropped_gs.json")

## Import and clean the data

The next step is to import and pre-process the data. I will use the `LyricsCleaner` class, which is based on the steps in `preprocessing.ipynb` and is imported from `src`.

In [4]:
# Import data
data = pd.read_csv("../../../data/raw/song_lyrics_sampled.csv")

# Clean data
cleaner = LyricsCleaner(stop_words=stop_words, verbose=True, contractions=contractions, dropped_gs=dropped_gs)
_, lyrics = cleaner.clean_lyrics(data.lyrics)

Starting cleaning...
Lowercasing lyrics...
Normalizing unicode...
Removing square brackets...
Removing regular brackets...
Removing newline characters...
Removing carriage return characters...
Removing adlibs...
Removing whitespace...
Removing punctuation...
Tokenizing lyrics...
Mapping vocabulary...
Tagging part-of-speech...
Filtering part-of-speech...
Lemmatizing part-of-speech...
Removing stopwords...
Finished cleaning lyrics!


## Extract features

Here, I will use Bag-of-Words, which simply counts the frequency of each term in each set of lyrics. This representation gave less noisy and more human-interpretable topics than TF-IDF when I compared the output of the LDA model trained on both kinds of features (you can read more in `extraction_optimization.ipynb`).

In [5]:
# Make dictionary
dct = Dictionary(lyrics)

# Make corpus
corp = [dct.doc2bow(doc) for doc in lyrics]
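
To make the bag-of-words representation concrete, here is a small pure-Python sketch of the idea behind `Dictionary` and `doc2bow`: every distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. This is an illustration only, not gensim's implementation; the helper names `build_vocabulary` and `to_bow` are hypothetical, and the ids gensim assigns may differ from the ones here.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bow(doc, token_to_id):
    """Return sorted (token_id, count) pairs for the known tokens in one document."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


# Toy tokenized documents, standing in for the cleaned lyrics.
docs = [["love", "night", "love"], ["night", "road"]]
vocab = build_vocabulary(docs)
corpus = [to_bow(doc, vocab) for doc in docs]
print(vocab)   # {'love': 0, 'night': 1, 'road': 2}
print(corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only the per-document counts are passed to the topic model.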

## Optimization

Finally, it's time to optimize the LDA model. Here I will use `gp_minimize` to search a large space of parameters for the gensim `LdaModel`. I will score each model using the gensim `CoherenceModel` and build a DataFrame of the results. These results will be analyzed in `lda_analysis_1.ipynb`, after which I will optimize the model further on the most influential parameters.
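
One detail worth spelling out is the sign convention: `gp_minimize` minimizes its objective, so the objective has to return the negated coherence, and the sign is flipped back when the results are tabulated. The sketch below illustrates this with a deliberately simple stand-in. It uses plain random search instead of Gaussian-process optimization, and `toy_coherence` is a made-up function, not a real coherence measure.

```python
import random


def toy_coherence(num_topics):
    """Stand-in score (NOT a real coherence): highest at num_topics == 20."""
    return 0.5 - abs(num_topics - 20) / 200


def objective(num_topics):
    """A minimizer wants smaller-is-better, so return the negated score."""
    return -toy_coherence(num_topics)


rng = random.Random(0)

# Random search as a stand-in for gp_minimize: try 32 points and record them.
x_iters = [rng.randint(5, 100) for _ in range(32)]
func_vals = [objective(x) for x in x_iters]

# Flip the sign back before tabulating, so that higher means better again.
scores = [-v for v in func_vals]

# The smallest objective value corresponds to the highest score.
best_index = min(range(len(func_vals)), key=func_vals.__getitem__)
print(x_iters[best_index], scores[best_index])
```

The lists `x_iters` and `func_vals` are named after the attributes of the same name on the result object returned by `gp_minimize`, which hold the evaluated points and their objective values.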

In [6]:


# Create search space
space = [
    Integer(5, 100, prior='log-uniform', name='num_topics'),
    Real(0.5, 1, prior='log-uniform', name='decay'),
    Real(0.001, 1, prior='log-uniform', name='gamma_threshold'),
    Real(0.01, 1, prior='log-uniform', name='minimum_probability'),
    Real(1, 2, prior='log-uniform', name='offset'),
    Integer(10, 100, prior='log-uniform', name='iterations'),
]

# Define objective function: gp_minimize minimizes, so return the negated coherence
@use_named_args(space)
def lda_optimizer(**params):
    # Train an LDA model with the sampled parameters
    model = LdaModel(corpus=corp, id2word=dct, dtype=np.float64,
                     alpha='auto', random_state=42, **params)
    # Score the topics (CoherenceModel defaults to the 'c_v' measure)
    coherence_model = CoherenceModel(model=model, texts=lyrics, dictionary=dct)
    return -coherence_model.get_coherence()

# Run optimizer (n_jobs parallelizes the acquisition-function search, not the LDA training)
result = gp_minimize(lda_optimizer, space, n_calls=32, random_state=0,
                     verbose=True, initial_point_generator='sobol', n_jobs=3)

# Extract results into a DataFrame, flipping the sign back to a coherence score
params_df = pd.DataFrame(result.x_iters, columns=[dim.name for dim in space])
params_df['score'] = -result.func_vals
params_df.to_csv("../../../data/optimization/lda_optimization1.csv")
print(params_df)



Iteration No: 1 started. Evaluating function at random point.
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 49.2510
Function value obtained: -0.3136
Current minimum: -0.3136
Iteration No: 2 started. Evaluating function at random point.
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 38.7844
Function value obtained: -0.4040
Current minimum: -0.4040
Iteration No: 3 started. Evaluating function at random point.
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 37.8364
Function value obtained: -0.3589
Current minimum: -0.4040
Iteration No: 4 started. Evaluating function at random point.
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 48.6699
Function value obtained: -0.3052
Current minimum: -0.4040
Iteration No: 5 started. Evaluating function at random point.
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 109.4192
Function value obtained: -0.2978
Current minimum: -0.4040
Iteration No: 6 sta