# LDA Optimization 1

To find the best possible model for each genre, I will optimize the model iteratively, genre by genre. This first set of optimization runs covers a large search area over all of the relevant parameters, to get a broad sense of which parameters matter most for the coherence score of the `gensim` `LdaModel`.

After this, I will explore the results of the optimization, visualizing which parameters have the largest impact on the coherence score. With this knowledge, I will optimize the parameters further in a subsequent optimization run.

For this first run, I will use these libraries:
- pandas
- scikit-learn (sklearn)
- scikit-optimize (skopt)
- gensim

I will use pandas to load and prepare the data. To extract the features, I will use scikit-learn's `TfidfVectorizer` together with gensim's `Sparse2Corpus` and `Dictionary`. Finally, I will use scikit-optimize's `gp_minimize` to optimize gensim's `LdaModel`.

The optimization run will search over these parameters of the `LdaModel`:
- `num_topics`
- `decay`
- `gamma_threshold`
- `minimum_probability`
- `offset`
- `iterations`

The steps to perform this optimization are as follows:

1. Import the libraries
2. Import the data
3. Iterate over each genre:
    1. Obtain genre-specific lyrics.
    2. Turn the lyrics into a pandas `Series`.
    3. Clean the lyrics using the `clean_lyrics` function, from `clean.py`.
    4. Extract features into doc-term matrix using `TfidfVectorizer`.
    5. Create gensim corpus using `Sparse2Corpus`.
    6. Create gensim `Dictionary`.
    7. Initialize search space.
    8. Optimize gensim `LdaModel` on the search space using `gp_minimize`.
    9. Create results `DataFrame` and save results to `.csv` file using pandas.

To build this, I will create two functions:

1. A function to extract the features after cleaning.
2. A function to optimize the model.

After preparing each genre's data, I will call both functions inside the `for` loop over the genres.


## Importing the libraries

As mentioned, I will first import the necessary libraries. I will also import the `clean_lyrics` function from `src.clean`.

In [None]:
from gensim.corpora import Dictionary
from gensim.matutils import Sparse2Corpus
from gensim.models import LdaModel, CoherenceModel
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args

from src.clean import clean_lyrics

## Import data

The next step is to import the data.

In [None]:
# Import data
data = pd.read_csv("../../../data/raw/song_lyrics_sampled.csv")

## Extract features

Here, I will use Term Frequency-Inverse Document Frequency (TF-IDF) to extract the features of the corpus into a doc-term matrix. I use TF-IDF rather than a simple Bag of Words (BoW) model so that the quality of topics may be improved within the LDA 