# Notebook 03: Preprocessing

## Contents:
1. [Summary](#section1)
2. [Tokenization](#section2)

## Summary <a name="section1"></a>
In this notebook I will load in the cleaned dataset and prepare the song lyrics for modeling. Steps in this process will include splitting the full lyrics string into individual words, creating input / output sequences, tokenizing the words, and converting the target feature from integer tokens to categorical one-hot vectors.

Loading necessary libraries.

In [1]:
import json, time, re, string, pickle
import pandas as pd
import numpy as np
from scipy import sparse
from scipy.sparse import coo_matrix

from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical

%run ../assets/sql_cred.py

Using TensorFlow backend.


Loading in the file labeling helper function.

In [2]:
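def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=round(time.time()),
                        file_description=None):
    """Prefix the file name with a Unix timestamp and record it in the log file."""
    # Note: the default for `now` is evaluated once, when this cell is run, so every
    # file saved in a session shares that timestamp unless `now` is passed explicitly.

    # Require an extension: a single '.' that is neither the first character
    # nor part of the '..' in a relative path.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Locate where the file name begins (expects a name containing an underscore).
    stamp = re.search(r'(?<!^)(?<!\.)[0-z]+_[0-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Saved at: {time.asctime(time.gmtime(now))}'
    # Append one line per saved file to the running log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description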
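
To make the renaming step concrete, here is a minimal sketch of only the timestamp insertion, without the logging. The name `stamp_filename` is hypothetical and used just for this illustration; it reuses the same regular expression as the helper above and assumes the file name contains an underscore and an extension.

```python
import re

def stamp_filename(file_path, now):
    """Insert `now` in front of the file name, leaving the directory intact."""
    # Same pattern as the helper: find where a name such as 'lyric_df' begins,
    # i.e. a run of characters containing an underscore, followed by a '.'.
    match = re.search(r'(?<!^)(?<!\.)[0-z]+_[0-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a file name containing "_" and an extension.')
    start = match.start()
    return f'{file_path[:start]}{now}_{file_path[start:]}'

stamped = stamp_filename('../assets/lyric_df.csv', 1549918191)
print(stamped)  # ../assets/1549918191_lyric_df.csv
```

The directory part of the path is left untouched, so stamped files stay next to each other in `../assets`.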

Reading in the cleaned lyric data.

In [3]:
lyric_df = pd.read_csv('../assets/1549918191_lyric_df.csv', index_col='track_id')

In [4]:
lyric_df = lyric_df.dropna()

In [5]:
lyric_df.head()

| track_id | lyrics | clean_lyrics | rep_ratio | total_words_track | unique_words_track | mean_len_words_track | total_lines_track | unique_lines_track | mean_words_line | mean_unique_words_line |
|---|---|---|---|---|---|---|---|---|---|---|
| 0h7TlF8gKb61aSm874s3cV | `\n\nIf your needle is near\nNeedle is near\nYo...` | `if your needle is near \n needle is near \n yo...` | 0.65 | 57 | 20 | 3.7 | 14 | 8 | 5.9 | 5.1 |
| 6koowTu9pFHPEcZnACLKbK | `\n\n[Verse 1]\nBrown skin girl on the other si...` | `brown skin girl on the other side of the room ...` | 0.61 | 132 | 52 | 4.1 | 24 | 13 | 7.4 | 5.8 |
| 1JkhKUXAoNivi87ipmV3rp | `\n\n[Verse 1]\nIt's simple, I love it\nHaving ...` | `its simple i love it \n having you near me hav...` | 0.58 | 151 | 63 | 4.2 | 29 | 21 | 7.1 | 5.8 |
| 51lPx6ZCSalL2kvSrDUyJc | `\n\n[Intro: Whistling]\n\n[Verse 1]\nA great b...` | `a great big bang and dinosaurs \n fiery rainin...` | 0.4 | 126 | 76 | 4.0 | 20 | 18 | 8.2 | 7.2 |
| 3vqlZUIT3rEmLaYKDBfb4Q | `\n\n[Verse 1]\nIsn't she lovely\nIsn't she won...` | `isnt she lovely \n isnt she wonderful \n isnt ...` | 0.39 | 108 | 66 | 4.1 | 21 | 20 | 7.0 | 6.1 |


Reviewing the summary statistics for the data before preprocessing.

In [12]:
lyric_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| total_words_track | 1781.0 | 282.639528 | 125.380335 | 13.0 | 195.0 | 267.0 | 346.0 | 1339.0 |
| unique_words_track | 1781.0 | 96.058956 | 37.770647 | 5.0 | 74.0 | 92.0 | 112.0 | 433.0 |
| mean_len_words_track | 1781.0 | 3.748175 | 0.264051 | 2.8 | 3.6 | 3.7 | 3.9 | 5.7 |
| total_lines_track | 1781.0 | 40.800674 | 17.889364 | 1.0 | 28.0 | 38.0 | 51.0 | 224.0 |
| unique_lines_track | 1781.0 | 27.646828 | 12.11209 | 1.0 | 20.0 | 25.0 | 33.0 | 189.0 |
| mean_words_line | 1781.0 | 9.069175 | 2.031524 | 4.2 | 7.9 | 8.8 | 9.9 | 55.4 |
| mean_unique_words_line | 1781.0 | 7.585345 | 1.677593 | 3.6 | 6.6 | 7.4 | 8.3 | 39.8 |


Next I define a function that iterates through the dataset and performs the preprocessing steps. Since the model will make word-level predictions, I split each track's lyrics into individual words and tokenize based on the resulting vocabulary. The input / output sets follow a sliding-window format: an input sequence of `seq_len` words is used to predict the word that follows it, and the window then moves forward one word. At generation time this process repeats until a document of the desired length is produced; during preprocessing, I create these sequences for each track before tokenizing.

Once the tokenizer is fit on the complete corpus, it is saved so that the same word-to-index mapping can be applied to the input text the model will predict on. Finally, the split, sequenced, and tokenized input set is reshaped to match the shape required by the model's input layer.
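
As a minimal sketch of the sliding window on a single short lyric: the function name `sliding_windows` and the sample line are assumptions made for illustration and are not part of the pipeline itself. With `output_len=1` it produces the same pairs as the loop in the preprocessing function; for larger values it drops windows whose target would be incomplete, which is a choice made only in this sketch.

```python
def sliding_windows(words, seq_len=4, output_len=1):
    """Return (inputs, targets) built by sliding a window one word at a time."""
    inputs, targets = [], []
    # Stop early enough that every input window has a full target after it.
    for i in range(len(words) - seq_len - output_len + 1):
        inputs.append(words[i:i + seq_len])
        targets.append(words[i + seq_len:i + seq_len + output_len])
    return inputs, targets

# Line breaks are kept as their own '\n' token, as in the clean_lyrics column.
line = 'isnt she lovely \n isnt she wonderful'
words = line.split(' ')

X, y = sliding_windows(words)
for seq, target in zip(X, y):
    print(seq, '->', target)
```

Seven words with a window of four give three input / output pairs, and each pair overlaps the previous one by three words.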

## Tokenization <a name="section2"></a>

In [17]:
def tokenize_lyrics(
    df=lyric_df,
    lyrics_col=['clean_lyrics'],
    seq_len=4, 
    output_len=1,
    save_dir='../assets'):

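    X = []  # input windows of `seq_len` words
    y = []  # the word(s) that follow each window

    corpus = []  # every word from every track, used to fit the tokenizer

    print('Processing lyrics...')
    for _, track in df[lyrics_col].iterrows():
        lyrics = track.iloc[0]
        # Collapse runs of spaces before splitting on single spaces.
        lyrics_spaced = re.sub(r'( +)', ' ', lyrics)
        lyrics_split = lyrics_spaced.split(' ')
        corpus.extend(lyrics_split)

        # Slide a window over the track one word at a time.
        for i in range(len(lyrics_split) - seq_len):
            X.append(np.array(lyrics_split[i:i + seq_len]))
            y.extend(np.array(lyrics_split[i + seq_len:i + seq_len + output_len]))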

    print('Fitting Tokenizer...')
    # `oov_token` is normally a string; here the integer 0 serves as the placeholder.
    tokenizer = Tokenizer(oov_token=0)
    # Remove '\n' from the filtered characters so line breaks stay in the vocabulary.
    tokenizer.filters = tokenizer.filters.replace('\n', '')
    tokenizer.fit_on_texts(corpus)

    # Word indices start at 1, so add 1 to account for the unused index 0.
    vocab_size = len(tokenizer.word_index) + 1
    print(f'Vocab size = {vocab_size}')

    formatted_name, now, file_description = filename_format_log(f'{save_dir}/LSTM315_300tokenizer.pkl')

    with open(formatted_name, 'wb+') as f:
        pickle.dump(tokenizer, f)
    print(f'Tokenizer saved to {formatted_name}.')

    print('Indexing sequences...')
    # Each word is converted on its own, giving a one-element list per word.
    X_indexed = [[tokenizer.texts_to_sequences([word])[0] for word in row] for row in X]
    y_indexed = [tokenizer.texts_to_sequences([word])[0] for word in y]

    print('Reshaping and converting to Categorical...')
    # Flatten the one-element lists into a 2-D array of shape (n_sequences, seq_len).
    X_reshape = np.reshape(X_indexed, (len(X_indexed), seq_len))

    formatted_name, now, file_description = filename_format_log(f'{save_dir}/LSTM315_300Xreshape.npy')
    np.save(formatted_name, X_reshape)

    # One-hot encode the target word indices.
    y_cat = to_categorical(y_indexed)

    # The one-hot matrix is almost entirely zeros, so save it in sparse COO format.
    formatted_name, now, file_description = filename_format_log(f'{save_dir}/LSTM315_300ycat.npz')
    y_cat_coo = coo_matrix(y_cat)
    sparse.save_npz(formatted_name, y_cat_coo)

    print(f'Lyrics successfully tokenized, sequenced, indexed, and saved out to {save_dir}.')

    return X_reshape, y_cat, vocab_size
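
To illustrate what the fitting, indexing, and one-hot steps do conceptually, here is a simplified pure-Python stand-in. It is not the Keras implementation: `build_word_index` and `one_hot` are hypothetical helpers, the toy corpus is made up, and a string placeholder `'<oov>'` is assumed in place of the integer used above. It mirrors the Keras convention that indices start at 1, with the out-of-vocabulary placeholder first and the remaining words ordered by frequency.

```python
from collections import Counter

def build_word_index(corpus, oov_token='<oov>'):
    """Map each word to an integer, most frequent first; index 0 stays unused."""
    counts = Counter(corpus)
    # The OOV placeholder takes index 1, then words by descending frequency.
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

corpus = ['la', 'la', 'la', 'love', 'love', 'you']
word_index = build_word_index(corpus)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved

# Words never seen during fitting fall back to the OOV index.
encoded = [word_index.get(w, word_index['<oov>']) for w in ['love', 'la', 'baby']]
target = one_hot(encoded[0], vocab_size)

print(word_index)  # {'<oov>': 1, 'la': 2, 'love': 3, 'you': 4}
print(encoded)     # [3, 2, 1]
print(target)      # [0, 0, 0, 1, 0]
```

Each target row is as wide as the vocabulary and holds a single 1, which is why the one-hot matrix grows quickly with vocabulary size.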

In [20]:
X_reshape, y_cat, vocab_size = tokenize_lyrics(
    df=lyric_df,
    lyrics_col=['clean_lyrics'],
    seq_len=4, 
    output_len=1,
    save_dir='../assets'
)

Processing lyrics...
Fitting Tokenizer...
Vocab size = 12335
Tokenizer saved to ../assets/1549597149_LSTM315_300tokenizer.pkl.
Indexing sequences...
Reshaping and converting to Categorical...
Lyrics successfully tokenized, sequenced, indexed, and saved out to ../assets.
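
A rough back-of-the-envelope sketch of why the one-hot targets are saved in sparse COO form rather than as a dense array. The sequence count below is hypothetical, chosen only for illustration, and the estimate assumes 4-byte values, 4-byte row and column indices, and a one-hot width equal to the vocabulary size; it ignores any compression applied when the file is written.

```python
def dense_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed to store every cell of an n_rows x n_cols matrix."""
    return n_rows * n_cols * itemsize

def coo_bytes(n_nonzero, value_size=4, index_size=4):
    """Bytes for COO storage: one value plus a row and a column index per entry."""
    return n_nonzero * (value_size + 2 * index_size)

vocab_size = 12335      # from the output above
n_sequences = 100_000   # hypothetical count, for illustration only

dense = dense_bytes(n_sequences, vocab_size)
sparse_size = coo_bytes(n_sequences)  # a one-hot row holds exactly one nonzero

print(f'dense: {dense / 1e9:.2f} GB')       # dense: 4.93 GB
print(f'COO:   {sparse_size / 1e6:.2f} MB')  # COO:   1.20 MB
```

Under these assumptions the sparse form scales with the number of sequences alone, while the dense form scales with the number of sequences times the vocabulary size.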


# CONTINUE TO NOTEBOOK 04a: MODELING