# GeniusTopics - WorkFlow
***

## Loading the Dataset

The first part of the process requires loading the dataset and garnering a better understanding of it. To do this, I will use the `pandas` library to import the dataset .csv file as a `pandas.DataFrame` object. I will then use the `.head()` method to view the first five records, then the `.info()` method to understand the values held within the `DataFrame` better.

In [14]:
# Import pandas library
import pandas as pd

# Read CSV file
df = pd.read_csv("song_lyrics_subset_10000.csv")

In [15]:
# View the data
df.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
0,Killa Cam,rap,Cam'ron,2004,173166,"{""Cam\\'ron"",""Opera Steve""}","[Chorus: Opera Steve & Cam'ron]\r\nKilla Cam, ...",1,en,en,en
1,Can I Live,rap,JAY-Z,1996,468624,{},[Produced by Irv Gotti]\r\n\r\n[Intro]\r\nYeah...,3,en,en,en
2,Forgive Me Father,rap,Fabolous,2003,4743,{},Maybe cause I'm eatin\r\nAnd these bastards fi...,4,en,en,en
3,Down and Out,rap,Cam'ron,2004,144404,"{""Cam\\'ron"",""Kanye West"",""Syleena Johnson""}",[Produced by Kanye West and Brian Miller]\r\n\...,5,en,en,en
4,Fly In,rap,Lil Wayne,2005,78271,{},"[Intro]\r\nSo they ask me\r\n""Young boy\r\nWha...",6,en,en,en


In [16]:
# View the data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          39999 non-null  object
 1   tag            40000 non-null  object
 2   artist         40000 non-null  object
 3   year           40000 non-null  int64 
 4   views          40000 non-null  int64 
 5   features       40000 non-null  object
 6   lyrics         40000 non-null  object
 7   id             40000 non-null  int64 
 8   language_cld3  40000 non-null  object
 9   language_ft    40000 non-null  object
 10  language       40000 non-null  object
dtypes: int64(3), object(8)
memory usage: 3.4+ MB


**Notes:**
- Most columns appear to be strings, with the exception of `id`, `year` and `views`. These appear to be integers.
- One of the rows does not have a value in the `title` column. This row will be removed for consistency.
- The `lyrics` column values appear to hold non-lyrical information within square brackets. These will need to be removed.

## Choosing Analysis

Of personal interest to me is identifying the most commonly recurring themes within each genre of music. This could be useful in a variety of ways, for example in musical genre classification algorithms that assign a most-likely genre to a track based on the underlying themes of the song.

In order to conduct this analysis, the workflow must undergo several operations:

### Pre-processing

This step cleans and normalizes the data, so that further processing and the subsequent analysis are reliable.

In order to be conducive to topic modelling, the data will undergo these transformations:
- Cleaning
    - Strip values inside square brackets. This will remove the strings that denote where in the song structure the following lines belong. These may be useful in the future, should more granular analysis be undertaken, but they are not currently necessary.
    - Strip punctuation. This will remove non-alphanumeric characters from the lyrics.
    - Strip white space. This will remove double spaces, carriage returns, line breaks and tabs from the lyrics.
    - Strip stop words. This will remove words within the lyrics that are not conducive to the task of topic modelling.
- Normalization
    - Remove blank values.
    - Normalize case. This will ensure all equivalent alphanumeric characters (and words) are directly comparable.
    - Lemmatization. This will transform words to their English base form (lemma), so that inflected variants of the same word are represented by a single token, making the corpus easier to process and better suited to topic modelling.
    - Word tokenization. This will split each lyrics document into a list of words.
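The transformations listed above can be sketched end to end on a single string. This is a minimal illustration, not the notebook's actual pipeline: it uses only the standard-library `re` module, the `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical, it handles non-nested brackets only, and it skips lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop word set (an assumption for illustration;
# a real list would come from a library such as NLTK)
STOP_WORDS = {"the", "and", "i", "you", "a", "in"}

def preprocess(lyrics):
    """Clean one lyrics document and return a list of tokens."""
    text = lyrics.casefold()                   # normalize case
    text = re.sub(r"\[[^\[\]]*\]", "", text)   # strip [section] markers (non-nested only)
    text = re.sub(r"[^\w\s]", "", text)        # strip punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse and trim white space
    tokens = text.split()                      # naive white space tokenization
    return [token for token in tokens if token not in STOP_WORDS]

tokens = preprocess("[Intro]\r\nYeah, I walk in the rain\r\n")
print(tokens)  # ['yeah', 'walk', 'rain']
```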

### Quantification

Quantification will allow for information to be understood by the chosen model/s for analysis, by representing the lyrics numerically. This will ensure analysis can be conducted by most common ML models, to provide an objective topic model for each set of lyrics.

To retrieve the topics most representative of each musical genre, I decided to quantify the terms in each genre's lyrics with the *Term Frequency-Inverse Document Frequency* (TF-IDF) method, as it measures the importance of words across each genre-specific corpus.

Instead of simply counting how often a term occurs within a corpus, it combines how often the term appears in each document *(term frequency)* with how rare the term is across the documents of the entire corpus *(inverse document frequency)*. The term frequency is multiplied by the inverse document frequency to give a final figure.

By applying this figure to each term within a corpus, words which appear more generally across a corpus are penalized, while words which are document-specific (and thus indicative of a more relevant term, rather than a general one) are given a greater weighting when analyzed. By doing this, the topic model will conduct analysis on terms not based on their frequency, but on their relevancy to the genre as a whole. In other words, this will allow me to find the most *relevant* topics for each music genre, not just the most *common*.
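As a worked example of this weighting, the sketch below computes TF-IDF by hand for three made-up, already-tokenized documents. It assumes one common textbook variant, `tf * log(N / df)`; libraries often apply smoothed or normalized versions, so their numbers will differ. The `tf_idf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter

# Three tiny, already-tokenized documents (made up for illustration)
docs = [
    ["love", "night", "love"],
    ["love", "money"],
    ["love", "road", "night"],
]

n_docs = len(docs)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Unsmoothed TF-IDF of a term within one document."""
    tf = doc.count(term) / len(doc)              # share of the document taken by the term
    idf = math.log(n_docs / doc_freq[term])      # zero when every document has the term
    return tf * idf

scores = {term: tf_idf(term, docs[0]) for term in set(docs[0])}
print(scores)
```

Here "love" appears in every document, so its inverse document frequency is `log(3 / 3) = 0` and its score is zero even though it is the most frequent word in the first document, while "night" keeps a positive weight.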

### Model

The chosen model for the task of topic modelling is a Latent Dirichlet Allocation (LDA) model. Using LDA to extract *n* topics from a given corpus is a common technique that will allow for an analysis of the themes across a genre of music. I will display each topic by its most relevant word, along with a number of words associated with that topic, in the form of a word cloud.

## Pre-processing

First, I will remove the row with the missing `title` value, using the `.dropna()` method.

In [17]:
# Remove row with missing title value
df.dropna(subset=["title"], inplace=True)

# View counts
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39999 entries, 0 to 39999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          39999 non-null  object
 1   tag            39999 non-null  object
 2   artist         39999 non-null  object
 3   year           39999 non-null  int64 
 4   views          39999 non-null  int64 
 5   features       39999 non-null  object
 6   lyrics         39999 non-null  object
 7   id             39999 non-null  int64 
 8   language_cld3  39999 non-null  object
 9   language_ft    39999 non-null  object
 10  language       39999 non-null  object
dtypes: int64(3), object(8)
memory usage: 3.7+ MB


Now that I have removed the row with the missing `title` value, I will separate the dataset into lists of lyrics (documents) for each genre. To do this, I will use square-bracket notation to access the `tag` column, and create a new copy of each `DataFrame` for each of the desired `tag` values (genres). After this, I will create a `Series` object of the `lyrics` column within each genre, while using the `.to_list()` method to create a list of documents (lyrics) for each `tag`-split `DataFrame`.

In [18]:
# Separate the dataframes into different tags
rap_df = df[df['tag'] == 'rap']
rb_df = df[df['tag'] == 'rb']
pop_df = df[df['tag'] == 'pop']
rock_df = df[df['tag'] == 'rock']

In [19]:
# Retrieve a list of lyrics from each data frame
rap_docs = rap_df['lyrics'].to_list()
rb_docs = rb_df['lyrics'].to_list()
pop_docs = pop_df['lyrics'].to_list()
rock_docs = rock_df['lyrics'].to_list()

Now that I have separated `df` into lists of lyrics for each genre, I will normalize the case of each document to lowercase. To do this, I will call the built-in string method `.casefold()` on each document, iterating over each list with a list comprehension.

In [20]:
# Lowercase the lyrics
rap_lower = [doc.casefold() for doc in rap_docs]
rb_lower = [doc.casefold() for doc in rb_docs]
pop_lower = [doc.casefold() for doc in pop_docs]
rock_lower = [doc.casefold() for doc in rock_docs]
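For plain ASCII text `.casefold()` and `.lower()` give the same result; the difference only shows up for certain non-ASCII characters, where `.casefold()` normalizes more aggressively. A small illustration, with made-up sample strings:

```python
# Made-up sample strings: one plain ASCII, one containing the German sharp s
samples = ["LOVE", "Straße"]

lowered = [s.lower() for s in samples]
folded = [s.casefold() for s in samples]

print(lowered)  # ['love', 'straße']
print(folded)   # ['love', 'strasse']
```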

Next, I will use `regex.sub()` to remove every square-bracketed span from each document. As identified earlier, these appear to hold non-lyrical information that may be detrimental to the analysis.

*NOTE: I am using the `regex` library here, instead of `re`. `regex` allows for the required complexity to deal with any nested square brackets properly (should they occur). For the sake of consistency, `regex` will then be used for all subsequent regular expressions in the preprocessing.*

In [21]:
# Import regex
import regex

# Compile regex pattern for square brackets
pattern = regex.compile(r"\[([^[\]]*+(?:(?R)[^[\]]*+)*)\]")

# Remove square brackets
rap_no_sq_brackets = [regex.sub(pattern, "", doc) for doc in rap_lower]
rb_no_sq_brackets = [regex.sub(pattern, "", doc) for doc in rb_lower]
pop_no_sq_brackets = [regex.sub(pattern, "", doc) for doc in pop_lower]
rock_no_sq_brackets = [regex.sub(pattern, "", doc) for doc in rock_lower]
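To illustrate what handling nested brackets means, here is a sketch of an alternative technique that needs only the standard-library `re` module: it repeatedly deletes innermost bracket pairs until none remain. This is not the recursive `regex` pattern used above; `strip_brackets` is a hypothetical helper, and the two approaches are only expected to agree when the brackets are balanced.

```python
import re

# Matches a bracket pair that contains no further brackets (an innermost pair)
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def strip_brackets(text):
    """Remove bracketed spans, including nested ones, by peeling innermost pairs."""
    while True:
        text, n_removed = INNERMOST.subn("", text)
        if n_removed == 0:
            return text

result = strip_brackets("[chorus [x2]] hello [outro]")
print(repr(result))  # ' hello '
```

A single non-recursive pass would only remove `[x2]` and `[outro]`, leaving the stray `[chorus ]` behind; the loop is what removes the outer pair.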

Now that I have removed the non-lyrical information within the square brackets, I will deal with all the other punctuation held within the documents. Using the same method, but a different pattern, I will use `regex.sub()` to remove punctuation from each document.

In [22]:
# Compile regex pattern for punctuation
pattern = regex.compile(r"[^\w\s]")

# Remove punctuation
rap_no_punc = [regex.sub(pattern, "", doc) for doc in rap_no_sq_brackets]
rb_no_punc = [regex.sub(pattern, "", doc) for doc in rb_no_sq_brackets]
pop_no_punc = [regex.sub(pattern, "", doc) for doc in pop_no_sq_brackets]
rock_no_punc = [regex.sub(pattern, "", doc) for doc in rock_no_sq_brackets]
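Two side effects of the `[^\w\s]` pattern are worth keeping in mind: underscores count as word characters and are kept, and apostrophes are deleted, so contractions are fused into a single token that may no longer match an entry such as "don't" in a stop word list. A small sketch using the standard-library `re` module, which treats this simple pattern the same way, on a made-up string:

```python
import re

# Same pattern as above: anything that is neither a word character nor white space
punctuation = re.compile(r"[^\w\s]")

cleaned = punctuation.sub("", "don't stop, rock_n_roll!")
print(cleaned)  # dont stop rock_n_roll
```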

Next up, I will collapse every run of two or more white space characters into a single space. This helps normalize the corpus, as the lyrics contain line breaks and other white space fairly regularly. Again, I will be using the `regex.sub()` method, this time with a pattern that matches two or more consecutive white space characters.

In [23]:
# Compile regex pattern to strip white space
pattern = regex.compile(r"\s{2,}")

# Strip white space
rap_no_white_space = [regex.sub(pattern, " ", doc) for doc in rap_no_punc]
rb_no_white_space = [regex.sub(pattern, " ", doc) for doc in rb_no_punc]
pop_no_white_space = [regex.sub(pattern, " ", doc) for doc in pop_no_punc]
rock_no_white_space = [regex.sub(pattern, " ", doc) for doc in rock_no_punc]
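Note that `\s{2,}` only matches runs of two or more white space characters, so a lone line break or tab is left as it is. The sketch below, using the standard-library `re` module on a made-up string, contrasts it with `\s+`, which normalizes every run, including single characters. Either choice should be harmless for a later tokenizer that splits on any white space.

```python
import re

# Made-up string with a single line break and a double space
text = "so they ask me\nyoung boy  what"

two_or_more = re.sub(r"\s{2,}", " ", text)
one_or_more = re.sub(r"\s+", " ", text)

print(repr(two_or_more))  # 'so they ask me\nyoung boy what'
print(repr(one_or_more))  # 'so they ask me young boy what'
```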

After removing the extra white space between words, I will remove white space at the beginning and end of each document, by using the built-in `.strip()` string method.

In [24]:
# Strip white space
rap_stripped = [doc.strip() for doc in rap_no_white_space]
rb_stripped = [doc.strip() for doc in rb_no_white_space]
pop_stripped = [doc.strip() for doc in pop_no_white_space]
rock_stripped = [doc.strip() for doc in rock_no_white_space]

In [25]:
# Import word_tokenize (requires NLTK's Punkt tokenizer data)
from nltk.tokenize import word_tokenize

# Split each document into a list of word tokens
rap_tokenized = [word_tokenize(doc) for doc in rap_stripped]
rb_tokenized = [word_tokenize(doc) for doc in rb_stripped]
pop_tokenized = [word_tokenize(doc) for doc in pop_stripped]
rock_tokenized = [word_tokenize(doc) for doc in rock_stripped]

Now that the documents are cleansed of unwanted characters and tokenized, I will remove stop words from each document. This removes many words that provide little insight into the themes of each genre, placing greater emphasis on the words that do reveal the topics discussed.

To do this, I will use the `NLTK` library.

In [26]:
# Import stopwords
from nltk.corpus import stopwords

# Retrieve English stop words as a set, for fast membership tests
stop_words = set(stopwords.words("english"))

# Strip stopwords
rap_no_stops = [[word for word in doc if word not in stop_words] for doc in rap_tokenized]
rb_no_stops = [[word for word in doc if word not in stop_words] for doc in rb_tokenized]
pop_no_stops = [[word for word in doc if word not in stop_words] for doc in pop_tokenized]
rock_no_stops = [[word for word in doc if word not in stop_words] for doc in rock_tokenized]

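After removing stop words, I am going to lemmatize the words. This reduces the inflected forms of a word (for example, plurals and verb tenses) to a single base form, so that the analysis can better identify recurring topics within each document and corpus. Because the lemmatizer works best when given each word's part of speech, I will first tag every word with its most likely part of speech according to WordNet, then use the `WordNetLemmatizer` from the `NLTK` library to lemmatize.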

In [27]:
# Import WordNet and Counter
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
    """Return (word, tag), where tag is the word's most common WordNet part of speech."""
    # Look up every WordNet synset for the word
    synsets = wordnet.synsets(word)

    # Count synsets per part of speech: noun, verb, adjective, adverb
    pos_counts = Counter()
    for tag in ("n", "v", "a", "r"):
        pos_counts[tag] = sum(1 for synset in synsets if synset.pos() == tag)

    # Pick the most frequent tag; a word with no synsets is tagged "n"
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return word, most_likely_part_of_speech

In [29]:
rap_tagged = [[get_part_of_speech(word) for word in doc] for doc in rap_no_stops]
rb_tagged = [[get_part_of_speech(word) for word in doc] for doc in rb_no_stops]
pop_tagged = [[get_part_of_speech(word) for word in doc] for doc in pop_no_stops]
rock_tagged = [[get_part_of_speech(word) for word in doc] for doc in rock_no_stops]
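One detail of the tagging logic is how ties are resolved: `Counter.most_common()` orders equal counts by first insertion, so a word with no WordNet synsets at all is tagged as a noun. The sketch below isolates that behaviour without needing WordNet; `pick_pos` is a hypothetical stand-in that takes a ready-made list of part-of-speech tags instead of looking up synsets.

```python
from collections import Counter

def pick_pos(synset_tags):
    """Return the most common tag, inserting counts in the order n, v, a, r."""
    counts = Counter()
    for tag in ("n", "v", "a", "r"):
        counts[tag] = synset_tags.count(tag)
    return counts.most_common(1)[0][0]

print(pick_pos(["v", "v", "n"]))  # v  (verbs are the majority)
print(pick_pos(["v", "n"]))       # n  (tie, the first inserted tag wins)
print(pick_pos([]))               # n  (no synsets, defaults to noun)
```

Since slang and misspellings in lyrics are likely to be missing from WordNet, many tokens will probably take this noun default.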

In [30]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Instantiate WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize
rap_lemmatized = [[lemmatizer.lemmatize(word, pos=tag) for word, tag in doc] for doc in rap_tagged]
rb_lemmatized = [[lemmatizer.lemmatize(word, pos=tag) for word, tag in doc] for doc in rb_tagged]
pop_lemmatized = [[lemmatizer.lemmatize(word, pos=tag) for word, tag in doc] for doc in pop_tagged]
rock_lemmatized = [[lemmatizer.lemmatize(word, pos=tag) for word, tag in doc] for doc in rock_tagged]