# Extraction Optimization

In this notebook, I will choose the feature-extraction method that gives the LDA model the best topic coherence, comparing two options:
1. Term Frequency-Inverse Document Frequency (TF-IDF)
2. Bag of Words (BoW)

Although bag-of-words is the usual feature-extraction method for Latent Dirichlet Allocation, it may not be sufficient for my specific use case, which requires high-quality topics per genre. Bag-of-words simply records word frequencies, whereas TF-IDF down-weights words that occur in many documents. For this use case, TF-IDF **might** therefore lead to more coherent, better-quality output.

However, this is not the only consideration. LDA expects word counts, i.e. a bag-of-words, rather than TF-IDF weights. As such, using TF-IDF may skew the natural frequency signal that LDA relies on, potentially leading to a loss of topical coherence. Beyond this, commonly used terms sometimes help define topics. TF-IDF may underweight these common terms, so they contribute less to topic definitions.
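
To make the difference concrete, here is a minimal pure-Python sketch on a toy corpus. The three token lists are made up for illustration, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` that scikit-learn applies by default; the L2 normalization that `TfidfVectorizer` also applies is left out for brevity.

```python
import math
from collections import Counter

# Toy tokenized "lyrics"; purely illustrative, not taken from the dataset
docs = [
    ["love", "night", "love"],
    ["love", "road", "truck"],
    ["love", "money", "money"],
]

# Bag of words: raw term counts per document
bow = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)


def idf(term):
    # Smoothed inverse document frequency (no normalization applied here)
    return math.log((1 + n_docs) / (1 + df[term])) + 1


# TF-IDF: the same counts, rescaled by how rare each term is across documents
tfidf = [{t: c * idf(t) for t, c in counts.items()} for counts in bow]

print(dict(bow[1]))  # every term in the second document has the same count
print(tfidf[1])      # "love" (in all documents) now weighs less than "road"
```

In the second document, all three words have equal counts under BoW, but TF-IDF ranks "love" lowest because it appears in every document. This is exactly the effect that could either sharpen topics or strip out useful common terms.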

There doesn't appear to be much consensus online about which form of feature extraction is better for LDA topic modelling, so in this notebook I will try to identify which method is best for *my use case*. To do this, I will first preprocess the documents using the `clean_lyrics` function, an adapted form of the preprocessing found in the `preprocessing` notebook. Next, I will create two separate functions, one using TF-IDF and another using BoW; these are the options in my search space. After this, I will create a third function to conduct the optimization using scikit-optimize. Finally, I will call the optimization function on my data.

## Preprocessing

Here, I will import my dataset and preprocess it, ready for the optimization. This takes several steps:
1. Import pandas and `LyricsCleaner`.
2. Import dataset using `pd.read_csv()`.
3. Turn the lyrics into a `Series` object.
4. Call `clean_lyrics`.

So, first, importing:

In [5]:
# Import pandas
import pandas as pd


Now to import the data as a `DataFrame` using `pd.read_csv()`:

In [6]:
# Import dataset
data = pd.read_csv("../../../data/raw/song_lyrics_sampled.csv")

Turn the `DataFrame` object into a `Series` object of just the lyrics:

In [7]:
# Get Series of lyrics
data = data.lyrics

Clean the lyrics using the `clean_lyrics` method. Here, you'll notice that `remove_adlibs` and `remove_min_len_adlibs` are set to `False`. This is because I found that keeping ad-libs in improves the coherence and quality of the topics. Also, unlike the method in the `preprocessing` notebook, the function returns two items:
1. A `Series` of clean, tokenized lyrics.
2. A `Series` of cleaned lyrics in a string.

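Returning both matters because the two forms feed different consumers: the token lists are what gensim's `Dictionary` and `CoherenceModel` (for `c_v` coherence) expect, while the plain strings are what scikit-learn's `TfidfVectorizer` expects.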

In [8]:
from src.clean.lyrics_cleaner_class import LyricsCleaner
from src.utils import txt_to_set, read_json_mapping
stopwords = txt_to_set("../../../data/vocab/stopwords.txt")
contractions = read_json_mapping("../../../data/vocab/contractions.json")
dropped_gs = read_json_mapping("../../../data/vocab/dropped_gs.json")


# Clean the lyrics
cleaner = LyricsCleaner(stopwords, contractions, dropped_gs, verbose=True, remove_adlibs=False, remove_min_len_adlibs=False)
lyrics, text = cleaner.clean_lyrics(data)

Starting cleaning...
Lowercasing lyrics...
Normalizing unicode...
Removing square brackets...
Removing regular brackets...
Removing newline characters...
Removing carriage return characters...
Removing whitespace...
Removing punctuation...
Tokenizing lyrics...
Mapping vocabulary...
Tagging part-of-speech...
Filtering part-of-speech...
Lemmatizing part-of-speech...
Removing stopwords...


TypeError: remove_stopwords() takes 2 positional arguments but 7 were given

## Extraction Method 1 - Term Frequency-Inverse Document Frequency

Now I need to build the first extraction function, for term frequency-inverse document frequency. To do this, I will use scikit-learn's `TfidfVectorizer` class, along with gensim's `Sparse2Corpus` and `Dictionary`, to turn the doc-term matrix into a corpus and a dictionary, respectively, that are valid for gensim's `LdaModel`.

The first step is to import these:

In [None]:
# Import vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import gensim Sparse2Corpus and Dictionary
from gensim.matutils import Sparse2Corpus
from gensim.corpora import Dictionary

Finally, I will build the function to extract features. First, it will vectorize the cleaned lyrics, before turning them into a corpus and dictionary:

In [None]:
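def tfidf_extract(docs):
    """Build a gensim dictionary and TF-IDF corpus from cleaned lyric strings."""
    # Initialize vectorizer (input is already lowercased during cleaning)
    vec = TfidfVectorizer(lowercase=False)

    # Get doc-term matrix (rows are documents, columns are terms)
    dtm = vec.fit_transform(docs)

    # Build corpus; documents_columns=False because documents are rows
    corpus = Sparse2Corpus(dtm, documents_columns=False)

    # Build dictionary from corpus and vocab (invert term -> id to id -> term)
    vocab = vec.vocabulary_.items()
    dictionary = Dictionary.from_corpus(
        corpus, id2word={v: k for k, v in vocab})

    return dictionary, corpus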
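
To show the shape of what this conversion produces, here is a rough pure-Python sketch. The small dense matrix and vocabulary are hypothetical stand-ins for the sparse matrix and `vocabulary_` mapping that `TfidfVectorizer` would return; the real `Sparse2Corpus` works on a SciPy sparse matrix, but it also yields one list of `(term_id, weight)` pairs per document.

```python
# Hypothetical 2-document, 3-term TF-IDF matrix (rows are documents)
dtm = [
    [0.0, 0.8, 0.6],
    [0.7, 0.0, 0.7],
]

# Hypothetical term -> column mapping, in the style of vec.vocabulary_
vocabulary = {"love": 0, "night": 1, "road": 2}

# gensim-style corpus: for each document, keep (term_id, weight) for non-zero entries
corpus = [
    [(term_id, weight) for term_id, weight in enumerate(row) if weight != 0.0]
    for row in dtm
]

# Invert the vocabulary to id -> term, as the dict comprehension above does
id2word = {col: term for term, col in vocabulary.items()}

print(corpus)   # one list of (term_id, weight) pairs per document
print(id2word)  # lets the model print words instead of column numbers
```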

## Extraction Method 2 - Bag of Words

Here, I will create a function that extracts the features into a bag of words. This method is far simpler and only requires gensim's `Dictionary` and a list comprehension:

In [None]:
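def bow_extract(texts):
    """Build a gensim dictionary and BoW corpus from tokenized lyrics."""
    # Map each unique token to an integer id
    dictionary = Dictionary(texts)
    # Convert each document to a list of (token_id, count) pairs
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    return dictionary, corpus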
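
For intuition, the sketch below imitates what `Dictionary` and `doc2bow` do, using only the standard library. The token lists are made up, and the ids are assigned in order of first appearance, which is an assumption of this sketch; gensim's own id assignment may differ.

```python
from collections import Counter

# Toy tokenized documents (illustrative only)
texts = [["love", "night", "love"], ["road", "love"]]

# Give each distinct token an integer id, in order of first appearance
token2id = {}
for doc in texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def doc2bow(doc):
    # Count how often each token id occurs and return sorted (id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


corpus = [doc2bow(doc) for doc in texts]

print(token2id)
print(corpus)
```

Unlike the TF-IDF corpus, the second element of each pair here is an integer count, which is the input LDA is designed around.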

## Optimization

Finally, it's time to implement the optimization on my dataset. Here I will:
1. Import `LdaModel` and `CoherenceModel` from gensim, and `gp_minimize` from scikit-optimize.
2. Define the objective function that scores a feature extraction method.
3. Run the optimization to find the best model based on the mean coherence score across the k folds.

In [None]:
# Import LdaModel & CoherenceModel
from gensim.models import LdaModel, CoherenceModel

# Import from skopt
from skopt.space import Categorical, Integer
from skopt import gp_minimize
from skopt.utils import use_named_args

Now I must define the search space:

In [None]:
space = {'methods': ['tfidf', 'bow'],
         'num_topics': [1, 50, 100]}
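
As a quick sanity check on the size of this space, the combinations can be enumerated with `itertools.product`. This is only a sketch; the `grid` name is mine and is not used elsewhere in the notebook.

```python
from itertools import product

# Same search space as defined above
space = {"methods": ["tfidf", "bow"], "num_topics": [1, 50, 100]}

# Every (method, num_topics) pair as a dict of keyword arguments
grid = [
    {"method": method, "num_topics": num_topics}
    for method, num_topics in product(space["methods"], space["num_topics"])
]

print(len(grid))  # number of parameter combinations to evaluate
for params in grid:
    print(params)
```

With 6 combinations and 11 folds each, a full pass trains 66 LDA models, which is worth keeping in mind before adding more values to the space.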

Now for the objective function:

In [None]:

from sklearn.model_selection import KFold
import numpy as np

N_SPLITS = 11
kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

# Define objective function
def objective(**params):

    # Initialize results dictionary
    results = params.copy()

    # Initialize coherences list
    coherences = []

    # Iterate over splits (only the training indices of each split are used)
    for fold, (train_idx, _) in enumerate(kf.split(lyrics), start=1):

        # KFold yields positional indices, so look rows up with .iloc.
        # `lyrics` holds token lists and `text` holds plain strings.
        train_tokens = [lyrics.iloc[j] for j in train_idx]
        train_strings = [text.iloc[j] for j in train_idx]

        print(f"Scoring fold {fold}/{N_SPLITS}...")

        # Perform feature extraction based on params
        if params["method"] == "tfidf":
            # TfidfVectorizer expects raw strings
            dictionary, corpus = tfidf_extract(train_strings)

        elif params["method"] == "bow":
            # gensim's Dictionary expects lists of tokens
            dictionary, corpus = bow_extract(train_tokens)

        else:
            raise ValueError(f"Unknown method: {params['method']!r}")

        model = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=params["num_topics"])

        # Score model; c_v coherence needs the tokenized texts
        coherence_model = CoherenceModel(
            model=model, texts=train_tokens, dictionary=dictionary,
            coherence="c_v")
        score = coherence_model.get_coherence()
        print(f"Score received for fold {fold}: {score}\n")
        print("Params:\n-------")
        print(f"method: {params['method']}\nnum_topics: {params['num_topics']}\n")  # noqa

        # Append score to coherences
        coherences.append(score)

        # Add fold score to results
        results[f'fold {fold} score'] = score

    # Add mean_score variable to observation
    results['mean score'] = np.mean(coherences)

    # Return results dict
    return results
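
One design detail worth spelling out: the objective only uses the training indices of each split, so every model is fitted and scored on roughly ten elevenths of the data and the held-out fold is simply left out. The sketch below works out the fold sizes for a hypothetical sample of 1,000 songs, assuming scikit-learn's documented `KFold` behaviour that the first `n_samples % n_splits` folds receive one extra sample. The `kfold_sizes` helper is my own illustration, not part of scikit-learn.

```python
def kfold_sizes(n_samples, n_splits):
    # Size of each held-out fold: the remainder is spread over the first folds
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 1000  # hypothetical dataset size, not the real one
held_out = kfold_sizes(n_samples, 11)

# Each model sees everything except its own held-out fold
train_sizes = [n_samples - size for size in held_out]

print(held_out)
print(train_sizes)
```

Averaging coherence over these overlapping training subsets gives a more stable estimate than a single fit on the full dataset would.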

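Finally, it's time to run the optimization. Because the search space is small (two methods and three topic counts), I evaluate every combination in a plain loop rather than calling `gp_minimize`: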

In [None]:

# Iterate over search space
results_list = []
for method in space["methods"]:
    for num_topics in space["num_topics"]:
        results_list.append(
            objective(method=method, num_topics=num_topics))

# Extract results into a DataFrame
params_df = pd.DataFrame(results_list)

# Save to .csv
params_df.to_csv("../../../data/optimization/extraction_optimized.csv")
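
Once the results exist, picking the winner is a matter of taking the row with the highest mean score. The sketch below shows this on placeholder rows shaped like the dictionaries returned by `objective`. The scores are made-up numbers for illustration only and are **not** results from this notebook.

```python
# Placeholder rows in the shape returned by objective(); the scores are
# invented for illustration and say nothing about which method is better
example_results = [
    {"method": "tfidf", "num_topics": 50, "mean score": 0.10},
    {"method": "bow", "num_topics": 50, "mean score": 0.20},
    {"method": "bow", "num_topics": 100, "mean score": 0.15},
]

# Higher coherence is better, so take the maximum mean score
best = max(example_results, key=lambda row: row["mean score"])

print(best["method"], best["num_topics"], best["mean score"])
```

The same selection can be made on the real `params_df` with `params_df.loc[params_df["mean score"].idxmax()]`.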