# GeniusTopics - WorkFlow
***

## Loading the Dataset

The first step is to load the dataset and get a better understanding of it. To do this, I will use the `pandas` library to read the dataset's .csv file into a `pandas.DataFrame` object. I will then use the `.head()` method to view the first five records, and the `.info()` method to inspect each column's data type and non-null count.

In [2]:
# Import pandas library
import pandas as pd

# Read CSV file
df = pd.read_csv("song_lyrics_subset_10000.csv")

In [3]:
# View the data
df.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
0,Killa Cam,rap,Cam'ron,2004,173166,"{""Cam\\'ron"",""Opera Steve""}","[Chorus: Opera Steve & Cam'ron]\nKilla Cam, Ki...",1,en,en,en
1,Can I Live,rap,JAY-Z,1996,468624,{},"[Produced by Irv Gotti]\n\n[Intro]\nYeah, hah,...",3,en,en,en
2,Forgive Me Father,rap,Fabolous,2003,4743,{},Maybe cause I'm eatin\nAnd these bastards fien...,4,en,en,en
3,Down and Out,rap,Cam'ron,2004,144404,"{""Cam\\'ron"",""Kanye West"",""Syleena Johnson""}",[Produced by Kanye West and Brian Miller]\n\n[...,5,en,en,en
4,Fly In,rap,Lil Wayne,2005,78271,{},"[Intro]\nSo they ask me\n""Young boy\nWhat you ...",6,en,en,en


In [4]:
# View the data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          39999 non-null  object
 1   tag            40000 non-null  object
 2   artist         40000 non-null  object
 3   year           40000 non-null  int64 
 4   views          40000 non-null  int64 
 5   features       40000 non-null  object
 6   lyrics         40000 non-null  object
 7   id             40000 non-null  int64 
 8   language_cld3  40000 non-null  object
 9   language_ft    40000 non-null  object
 10  language       40000 non-null  object
dtypes: int64(3), object(8)
memory usage: 3.4+ MB
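
The `.info()` output above reports 39999 non-null values for `title` against 40000 rows, so one title is missing. A minimal sketch of how that row could be located is shown here. The `demo` frame is a hypothetical stand-in for `df` so that the snippet runs on its own; the same calls would apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, with one missing title as in the .info() output
demo = pd.DataFrame({
    "title": ["Song A", None, "Song C"],
    "lyrics": ["la la", "na na", "do re mi"],
})

# Count the missing values in every column
missing_per_column = demo.isna().sum()

# Select the rows whose title is missing
rows_missing_title = demo[demo["title"].isna()]

print(missing_per_column)
print(rows_missing_title)
```

Since the analysis works on the `lyrics` column, the missing title may not need any handling; the check simply makes the gap visible.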


## Choosing Analysis

I think it would be interesting to identify the most commonly recurring themes in each genre of music. This could be useful in a variety of ways, for example in musical genre classification algorithms that assign a most-likely genre to a track based on the underlying themes of the song.

To conduct this analysis, the data must pass through several operations:

### PreProcessing

Cleaning and normalizing the data makes the further processing and subsequent analysis more reliable.

In order to be conducive to topic modelling, the data will undergo these transformations:
- Cleaning
    - Strip values inside square brackets. This will remove the strings that denote where in the song structure the following lines belong (for example `[Chorus]` or `[Intro]`). These may be useful in the future, should more granular analysis be undertaken, but they are not currently necessary.
    - Strip punctuation. This will remove non-alphanumeric characters from the lyrics.
    - Strip white space. This will remove repeated spaces, line breaks, and tabs from the lyrics.
    - Strip stop words. This will remove common words (such as "the" or "and") that carry little topical meaning and so do not help with topic modelling.
- Normalization
    - Normalize case. This will convert all text to a single case, so that equivalent characters (and words) are directly comparable.
    - Lemmatization. This will reduce each word to its English 'root' word (its lemma), so that different forms of the same word are represented by a single token. This makes the data easier to process and better suited to topic modelling.
    - Word tokenization. This will split each lyrics document into a list of words.
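
The transformations listed above can be sketched as a single function. This is a minimal illustration using only the standard library, not the final implementation. The name `preprocess`, the tiny `STOP_WORDS` set, and the sample lyrics are all hypothetical. Lemmatization is left as a pluggable `lemmatize` argument that defaults to leaving words unchanged, on the assumption that a real lemmatizer (for example NLTK's `WordNetLemmatizer`) would be passed in later. The sketch also lowercases and tokenizes before removing stop words, which is an ordering choice of mine because stop-word lookup is simplest on lowercase tokens.

```python
import re

# Tiny illustrative stop-word set (an assumption); a real run would use a
# fuller list such as the ones shipped with NLTK or spaCy.
STOP_WORDS = {"a", "an", "and", "the", "i", "me", "my", "you",
              "it", "is", "in", "on", "of", "to"}

BRACKETED = re.compile(r"\[[^\]]*\]")   # e.g. [Chorus: ...], [Intro]
NON_ALNUM = re.compile(r"[^\w\s]|_")    # anything that is not a letter, digit or whitespace


def preprocess(lyrics, lemmatize=lambda word: word):
    """Clean and normalize one lyrics document into a list of tokens."""
    text = BRACKETED.sub(" ", lyrics)    # strip values inside square brackets
    text = text.lower()                  # normalize case
    text = NON_ALNUM.sub("", text)       # strip punctuation ("I'm" becomes "im")
    tokens = text.split()                # tokenize; this also collapses spaces, tabs, returns
    tokens = [t for t in tokens if t not in STOP_WORDS]  # strip stop words
    return [lemmatize(t) for t in tokens]  # identity unless a lemmatizer is supplied


# Hypothetical sample, not taken from the dataset
sample = "[Chorus: Singer]\nThe sun is shining,\tshining on   me!\n\n[Verse 1]\nRivers run to the sea."
print(preprocess(sample))
# ['sun', 'shining', 'shining', 'rivers', 'run', 'sea']
```

With a real lemmatizer supplied, tokens such as `shining` and `rivers` would be expected to collapse to their lemmas. The function could then be applied to the whole dataset with `df["lyrics"].apply(preprocess)`.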