# Language of Music - Preprocessing

This notebook describes and walks through the process of cleaning and normalizing lyrics for use in topic modelling.

It takes place in two parts:
- Loading the data
- Preprocessing the data

## Loading the Data

To load and manipulate the data, stored in a .csv format, I used Pandas DataFrames. This seems like the obvious choice when managing and manipulating data stored in a tabular format. It allows the user to open .csv files of various encodings, manage encoding errors, and directly access values within specific cells of a dataframe to manipulate, for a variety of data types.

Although this first part is specifically for loading the data, I also conducted a little bit of preliminary investigation into the raw data, to gain an understanding of what some of the steps might be required in preprocessing.

To start off, I import pandas.

In [43]:
# Import Pandas
import pandas as pd

Next, I use the `.read_csv()` function to read the dataset into a `DataFrame` object from the .csv file it is stored in.

*Please note: A sample of the full dataset has been used in the repository for brevity. Full dataset size is > 9GB*

In [44]:
# Read .csv as DataFrame object
df = pd.read_csv("../data/raw/song_lyrics_sampled.csv", index_col=0)

After loading the dataset, I had a look at the first five observations using `.head()`.

In [45]:
# Look at the first 5 rows of data
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,[Verse 1]\nWhen I was smaller wishing I was ta...,en
1,country,I've been a moonshiner for seventeen long year...,en
2,country,[Verse 1]\nSomewhere out there in the smoky ai...,en
3,country,[Verse 1]\nWe've paid in hell since Moscow bur...,en
4,country,"[Verse 1]\nI was fifteen, she was eighteen\nTh...",en


Looking at the data, we can identify four distinct columns:
1. This is the index column.
2. `tag` which appears to provide the genre.
3. `lyrics`. This is where the lyrics are held.
4. `language_cld3`. This is where the language of the lyrics are signified.

I can also identify two possible artifacts of each observation of lyrics which may need to be handled in preprocessing:
1. Annotations within the song lyrics, we can seen within square brackets, such as the `[Verse 1]`.
2. Newline escape characters `\n`.

Next I look at more information on the data stored in the dataset, using the `.info()` method.

In [46]:
# Look at the dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tag            50000 non-null  object
 1   lyrics         50000 non-null  object
 2   language_cld3  50000 non-null  object
dtypes: object(3)
memory usage: 1.5+ MB


It appears that the are no null values and 50,000 total observations. Each of the data types are `object`, which typically indicates that they have been identified as strings.

Just to get an idea of the split of genres across the dataset, I next use the `.value_counts()` method to count the number of observations for each genre in the `tag` column.

In [47]:
# Explore split of genres
df.value_counts(subset=["tag"])

tag    
country    10000
pop        10000
rap        10000
rb         10000
rock       10000
Name: count, dtype: int64

As we can see, each of the genres represent a fifth of the dataset. This makes sense, as it was a balanced sample I conducted for each genre. However, in the full dataset, the amounts vary significantly.

## PreProcessing

Now that I have explored the dataset, I will preprocess the lyrics using various natural language processing and string manipulation techniques:
- Lowercase strings
- unicode normalization
- removal of square brackets, and all content within them
- removal of regular brackets, and all content within them
- removal of new line escape characters `\n` and carriage return characters `\r`.
- removal of adlibs
- removal of unwanted white space
- removal of punctuation
- tokenization
- mapping of vocabulary
- part-of-speech tagging
- part-of-speech lemmatization
- part-of-speech filtering
- rejoining of words into a single string

The above is in chronological order, so we start with normalizing the casing to lowercase. This is done by dot notation on the desired column in our `DataFrame` (`lyrics`):

In [48]:
# lowercase lyrics
df.lyrics = df.lyrics.str.lower()

# view lowercased lyrics
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,[verse 1]\nwhen i was smaller wishing i was ta...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,[verse 1]\nsomewhere out there in the smoky ai...,en
3,country,[verse 1]\nwe've paid in hell since moscow bur...,en
4,country,"[verse 1]\ni was fifteen, she was eighteen\nth...",en


Next, I will use the unicodedata package to normalize the encoding of the lyrics. This will handle any particular cases where the characters may be accented.

In [49]:
# import unicodedata
import unicodedata

# normalize
df.lyrics = df.lyrics.apply(lambda x: unicodedata.normalize('NFKD', x))

# view normalized lyrics
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,[verse 1]\nwhen i was smaller wishing i was ta...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,[verse 1]\nsomewhere out there in the smoky ai...,en
3,country,[verse 1]\nwe've paid in hell since moscow bur...,en
4,country,"[verse 1]\ni was fifteen, she was eighteen\nth...",en


Now I will remove square brackets and normal brackets from our lyrics. This will remove the annotations in the lyrics, such as the `[verse 1]` we identified when loading our dataset. To do this, I will write a raw string pattern and use the `.replace()` method on the pandas series of lyrics, setting the regex flag to `True`. This will allow me to replace regex within each observation of the `Series`.

In [50]:

# Define pattern for square brackets
pattern = r"\[([^\[\]]*+(?:\[[^\[\]]*+])*+)\]"

# Replace matches in the pattern
df.lyrics = df.lyrics.str.replace(pattern, "", regex=True)

# View lyrics
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,\nwhen i was smaller wishing i was taller\ni w...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,\nsomewhere out there in the smoky air\nwhere ...,en
3,country,\nwe've paid in hell since moscow burned\nas c...,en
4,country,"\ni was fifteen, she was eighteen\nthe prettie...",en


In [51]:

# Define pattern for regular brackets
pattern = r"\([^)]*\)"

# Replace matches in the pattern
df.lyrics = df.lyrics.str.replace(pattern, "", regex=True)

# View lyrics
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,\nwhen i was smaller wishing i was taller\ni w...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,\nsomewhere out there in the smoky air\nwhere ...,en
3,country,\nwe've paid in hell since moscow burned\nas c...,en
4,country,"\ni was fifteen, she was eighteen\nthe prettie...",en


Next, I will remove new line escape characters (`'\n'`) and carriage return escape characters (`'\r'`).

In [52]:
# Replace new line characters
df.lyrics = df.lyrics.str.replace("\n", " ")

# Replace carriage return characters
df.lyrics = df.lyrics.str.replace("\r", " ")

# View head
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,when i was smaller wishing i was taller i wok...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,somewhere out there in the smoky air where th...,en
3,country,we've paid in hell since moscow burned as cos...,en
4,country,"i was fifteen, she was eighteen the prettiest...",en


Now it's time to handle the ad-libs. If you're unsure what ad-libs are, in this specific context, they are improvised phrases or words that might not add much to the actual meaning of the track. Phrases like 'yeah', 'uh-huh', 'woo-hoo'.

To achieve the removal of these, I will first define a set of adlibs. These are characters that can appear in any quantity or order, as long as they are the only characters within the word boundary. Then I will create a raw string expression, before finally using the `.replace()` function on the Series of lyrics.

In [53]:

# Create set of ad-libs
adlibs = {'ah', 'aw', 'anh', 'ay', 'ayo', 'ayoh', 'aye',
          'br', 'da', 'dae', 'do', 'er', 'goh', 'he',
          'ho', 'lad', 'ladi', 'ladium', 'li', 'm',
          'mh', 'na', 'nah', 'naw', 'noh', 'nouh', 'sh', 'uh',
          'woah', 'wo' 'h', 'wo', 'unh', 'uho',
          'umah', 'yo', 'yuh'}

# Create pattern
pattern = r'\b(?:' + '|'.join(adlibs) + r')\b'

# Replace
df.lyrics = df.lyrics.str.replace(pattern, "", regex=True)

# View top lyrics
df.head()


Unnamed: 0,tag,lyrics,language_cld3
0,country,when i was smaller wishing i was taller i wok...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,somewhere out there in the smoky air where th...,en
3,country,we've paid in hell since moscow burned as cos...,en
4,country,"i was fifteen, she was eighteen the prettiest...",en


Now, I will do this with adlibs of a required minimum length. These are similar to regular adlibs, however some adlibs may be a combination of characters, or a single character, which might be a legitimate word when smaller than a certain number of characters (e.g. 'he' is a real word, where as 'hehe' might be an adlib).

To achieve this, the adlibs are first defined in a set of sub-patterns, where the word characters are defined in the square brackets, and the minimum length in the curly braces.

In [54]:

# Define adlibs and minimum lengths
adlibs = {('\b[he]{3,}\b'), ('\b[hey]{4,}\b'), ('\b[i]{2,}\b'), ('\b[la]{3,}\b'),
        ('\b[na]{3,}\b'), ('\b[no]{5,}\b'), ('\b[ops]{4,}\b'), ('\b[bra]{4,}\b')}

# Create raw string pattern
pattern = fr"{'|'.join([sub_pattern for sub_pattern in adlibs])}"

# Replace substrings that match pattern
df.lyrics = df.lyrics.str.replace(pattern, "", regex=True)

# View
df.head()


Unnamed: 0,tag,lyrics,language_cld3
0,country,when i was smaller wishing i was taller i wok...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,somewhere out there in the smoky air where th...,en
3,country,we've paid in hell since moscow burned as cos...,en
4,country,"i was fifteen, she was eighteen the prettiest...",en


Next, I will remove any unwanted whitespace using two methods. First, I will use a regular expression. Second, I will use the `.strip()` function to remove any leading or trailing white space from the lyrics.

In [55]:
# Regex for whitespaces of 2 or more characters
pattern = r"\s{2,}"

# Replace pattern matches
df.lyrics = df.lyrics.str.replace(pattern, "", regex=True)

# Strip leading/trailing whitespace
df.lyrics = df.lyrics.str.strip()

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,when i was smaller wishing i was taller i woke...,en
1,country,i've been a moonshiner for seventeen long year...,en
2,country,somewhere out there in the smoky air where the...,en
3,country,we've paid in hell since moscow burned as coss...,en
4,country,"i was fifteen, she was eighteen the prettiest ...",en


Next I will remove all punctuation using a regular expression.

In [56]:
# Define punctuation regex
pattern = r"[^\w\s]"

# Replace pattern matches
df.lyrics = df.lyrics.str.replace(pattern, "", regex=True)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,when i was smaller wishing i was taller i woke...,en
1,country,ive been a moonshiner for seventeen long years...,en
2,country,somewhere out there in the smoky air where the...,en
3,country,weve paid in hell since moscow burned as cossa...,en
4,country,i was fifteen she was eighteen the prettiest t...,en


Now I'll start to use some more advanced techniques of preprocessing. First is word tokenization. To do this, I will use `word_tokenize()` from the `nltk` library.

In [57]:
# Import word_tokenize
from nltk import word_tokenize

# Tokenize lyrics
df.lyrics = df.lyrics.apply(word_tokenize)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,"[when, i, was, smaller, wishing, i, was, talle...",en
1,country,"[ive, been, a, moonshiner, for, seventeen, lon...",en
2,country,"[somewhere, out, there, in, the, smoky, air, w...",en
3,country,"[weve, paid, in, hell, since, moscow, burned, ...",en
4,country,"[i, was, fifteen, she, was, eighteen, the, pre...",en


Now that the lyrics have been tokenized, we can perform a few more preprocessing steps. To start, I will map contractions and dropped 'g's from the lyrics. This will make the same and/or similar words more identifiable.

I will retrieve the replacements from the `data/vocab/` directory's `.json` files. Using a custom function that I will apply to the lyrics, I will first replace contractions, followed by dropped 'g' letters.

In [58]:

# Import json to read .json files
import json

# Create function to read json files
def read_json_mapping(file):
    with open(file, encoding="utf-8-sig", errors="ignore") as f:
        return json.load(f)

# Create function to map tokens
def map_vocab(tokens):
    
    contractions = read_json_mapping("../data/vocab/contractions.json")
    dropped_gs = read_json_mapping("../data/vocab/contractions.json")
    
    # Initialize list of mapped
    # contractions
    mapped_contractions = []
    
    for token in tokens:
        
        # If the token is in the
        # CONTRACTIONS dictionary
        if token in contractions:
            
            # split the bi-gram and
            # add individual tokens
            # to mapped_contractions
            mapped_contractions.extend(contractions[token].split())
            
        else:  # If not a contraction
            # just add the token as it is to mapped_contractions
            mapped_contractions.append(token)

    # Map and return dropped g's
    return [dropped_gs.get(token, token) for token in mapped_contractions]

# Apply function to lyrics
df.lyrics = df.lyrics.apply(map_vocab)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,"[when, i, was, smaller, wishing, i, was, talle...",en
1,country,"[i, have, been, a, moonshiner, for, seventeen,...",en
2,country,"[somewhere, out, there, in, the, smoky, air, w...",en
3,country,"[we, have, paid, in, he, will, since, moscow, ...",en
4,country,"[i, was, fifteen, she, was, eighteen, the, pre...",en


Next, I will tag the part-of-speech for each token in the lyrics. This will help with the two following steps, part-of-speech lemmatization and part-of-speech filtering. I will use the `pos_tag()` function from `nltk` to do this.

In [59]:
# import pos_tag
from nltk import pos_tag

# apply
df.lyrics = df.lyrics.apply(pos_tag)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,"[(when, WRB), (i, NN), (was, VBD), (smaller, J...",en
1,country,"[(i, NNS), (have, VBP), (been, VBN), (a, DT), ...",en
2,country,"[(somewhere, RB), (out, IN), (there, RB), (in,...",en
3,country,"[(we, PRP), (have, VBP), (paid, VBN), (in, IN)...",en
4,country,"[(i, NN), (was, VBD), (fifteen, JJ), (she, PRP...",en


Next up is part-of-speech filtering. Although I might look into the efficacy of the filtering later, and optimize the filtering, to remove pos that optimizes the coherence in the LDA model, for now I will be keeping just nouns.

In [60]:
# Define set of kept pos groups
pos_groups = {'N'}

# Write function to remove tokens by part-of-speech
def filter_pos(tokens):
    return [(token, tag) for token, tag in tokens if tag[0] in pos_groups]

# Apply function
df.lyrics = df.lyrics.apply(filter_pos)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,"[(i, NN), (wishing, NN), (i, NN), (taller, NN)...",en
1,country,"[(i, NNS), (moonshiner, NN), (years, NNS), (mo...",en
2,country,"[(smoky, NN), (air, NN), (night, NN), (blue, N...",en
3,country,"[(moscow, NN), (cossacks, NNS), (piece, NN), (...",en
4,country,"[(i, NN), (thing, NN), (i, NN), (life, NN), (i...",en


Now I will lemmatize the tokens. This will ensure words of a similar meaning are represented as the same word across the lyrics. This will allow for consistency in the topic modelling.

I will make this with the idea of a broader number of parts-of-speech being available, in case they are kept in a future run.

As such, I will need to first identify the word-net equivalent of the tagged part-of-speech. This will allow for the appropriate lemmatization of the words, so the meaning of the tokens can be kept as true to their original meaning as possible.

In [61]:

# Import lemmatizer and wordnet pos
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet

# Define mapping of wn pos
wn_pos = {'J': wordnet.ADJ, 'V': wordnet.VERB,
              'N': wordnet.NOUN, 'R': wordnet.ADV}

# Initialize lemmatizer
wn = WordNetLemmatizer()

# Function to lemmatize tokens by part-of-speech
def pos_lemmatize(tokens):
    return [wn.lemmatize(token, wn_pos.get(tag[0], wordnet.NOUN)) for token, tag in tokens]

# Apply function
df.lyrics = df.lyrics.apply(pos_lemmatize)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,"[i, wishing, i, taller, i, morning, mama, hims...",en
1,country,"[i, moonshiner, year, money, whiskey, beer, i,...",en
2,country,"[smoky, air, night, blue, stranger, danger, dr...",en
3,country,"[moscow, cossack, piece, league, death, releas...",en
4,country,"[i, thing, i, life, i, sight, i, marietta, tow...",en


Lastly, before re-joining the lyrics into a single string, I will remove the stop words from the lyrics. I will be using `nltk` stopwords set, as well as some stopwords I have defined myself.

*Note: for the sake of demonstration and portfolio purposes, I have opted to remove some extreme profanity.*

In [62]:
# Import stop words
from nltk.corpus import stopwords

# Function to read .txt lines into a set
def txt_to_set(path: str) -> set[str]:
    with open(path, encoding='utf-8-sig', errors='ignore') as f:
        return {line for line in f.readlines()}

# Import and unionize custom stop words with stopwords
stopwords = set(stopwords.words('english')).union(txt_to_set("../data/vocab/stopwords.txt"))

# Function to remove stopwords
def remove_stopwords(tokens):
    return [token for token in tokens if token not in stopwords]

# Apply function
df.lyrics = df.lyrics.apply(remove_stopwords)

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,"[wishing, taller, morning, mama, himsaid, boy,...",en
1,country,"[moonshiner, year, money, whiskey, beer, hollo...",en
2,country,"[smoky, air, night, blue, stranger, danger, dr...",en
3,country,"[moscow, cossack, piece, league, death, releas...",en
4,country,"[thing, life, sight, marietta, town, north, at...",en


Finally, I will join the tokens of each lyrics back into a string and save for later.

In [63]:

# Use lambda and a string.join
# function to join all the
# tokens into a single string
df.lyrics = df.lyrics.str.join(" ")

# Save
df.to_csv("../data/processed/song_lyrics_sampled_clean.csv")

# View
df.head()

Unnamed: 0,tag,lyrics,language_cld3
0,country,wishing taller morning mama himsaid boy man cu...,en
1,country,moonshiner year money whiskey beer hollow gall...,en
2,country,smoky air night blue stranger danger drink fau...,en
3,country,moscow cossack piece league death release gran...,en
4,country,thing life sight marietta town north atlanta h...,en
