# Topic analysis based on Genius song lyrics

Author : lievre.thomas@gmail.com

---

In this notebook, we explore the Genius dataset extracted from [Kaggle](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information).

**The aim of this analysis is to extract topics from lyrics and to identify the main topics by year or decade.**

This notebook was written as part of a class project for the [text mining course (TDDE16)](https://www.ida.liu.se/~TDDE16/project.en.shtml) at Linköpings universitet.


## A few facts about the Genius website

Genius is an American digital company founded on August 27, 2009, by Tom Lehman, Ilan Zechory, and Mahbod Moghadam. Originally launched as Rap Genius with a focus on hip-hop music, it was initially a crowdsourced website where people could fill in the lyrics of rap songs and give an interpretation of them. Over the years the site has grown to contain several million annotated texts from all eras (from the [Wikipedia Genius page](https://en.wikipedia.org/wiki/Genius_(company))).


## Load the data in memory

The data are all contained in a single large 9 GB CSV file (around 5 million rows), which can be difficult to load into memory at once. To deal with this, I wrote a loader class that splits the data into 6 pickle files. The pickles are more compact than the CSV, which speeds up loading into memory. The pickles are then drawn at random so that the loaded sample is more representative. For now we assume that the pickle batches are identically distributed (we examine the batches in the second part). The class below handles the whole process.

In [1]:
from utils.loader import Loader
from config import data_input_path, kaggle_output_dir

In [2]:
# initiate the file loader
loader = Loader(in_path = data_input_path, out_path = kaggle_output_dir)

# load pickle 3
df_raw = loader.load_pickle(3)

Batches of data are loaded into memory at random. The number of batches loaded depends on the memory capacity of the machine running the script. For the analysis, we work only on the random sample that was loaded (the full dataset is on Kaggle).
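The `Loader` class itself is not shown in this notebook. As a rough illustration of the idea described above (split a large CSV into fixed-size pickled chunks, then load a random subset of them), here is a minimal sketch that uses only the standard library. The function names `split_csv_to_pickles` and `load_random_batches`, the file naming scheme, and the tiny demo file are all assumptions made for this example, not the author's implementation.

```python
import csv
import itertools
import pickle
import random
import tempfile
from pathlib import Path


def split_csv_to_pickles(csv_path, out_dir, chunk_size):
    """Write consecutive chunks of `chunk_size` rows to numbered pickle files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for index in itertools.count(1):
            # islice reads at most chunk_size rows, so memory use stays bounded
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            path = out_dir / f"data_{index}.pkl"
            with open(path, "wb") as out:
                pickle.dump(chunk, out)
            paths.append(path)
    return paths


def load_random_batches(paths, n_batches, seed=0):
    """Load `n_batches` pickles chosen at random, without replacement."""
    rng = random.Random(seed)
    rows = []
    for path in rng.sample(list(paths), n_batches):
        with open(path, "rb") as handle:
            rows.extend(pickle.load(handle))
    return rows


# Tiny demonstration on a 7-row CSV written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
demo_csv = workdir / "demo.csv"
with open(demo_csv, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "title"])
    writer.writerows([[i, f"song {i}"] for i in range(7)])

demo_paths = split_csv_to_pickles(demo_csv, workdir / "pickles", chunk_size=3)
demo_rows = load_random_batches(demo_paths, n_batches=2)
print(len(demo_paths), len(demo_rows))
```

With 7 rows and a chunk size of 3, the last pickle holds a single row, which mirrors the smaller final batch discussed later in the notebook.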

# Exploring the raw data

Let's visualize and explore the raw data before moving on to a deeper analysis.

In [3]:
df_raw.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2000000,Roses,rock,Vitja,2017,399,{},The roses start to wither\nWhere the devil lay...,3019113,en,en,en
2000001,Keep On Pushin,rap,Problem,2017,692,"{""My Princess Aeryn""}",[Hook]\nImma keep on pushin\nImma keep on push...,3019114,en,en,en
2000002,Inside,pop,The jepettos,2017,1302,{},"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...",3019115,en,en,en
2000003,Girls Like You,rb,PnB Rock,2017,60114,{},[Chorus]\nBaby it was real\nYeah we were the b...,3019117,en,en,en
2000004,Froideur,rap,N'Dirty Deh,2017,5813,"{""N\\'Dirty Deh""}","[Refrain]\nEt j'ai perdu la foi, mais c'est pa...",3019120,fr,fr,fr


For each song, we have several pieces of information:
- the title of the song
- the tag (genre)
- the name of the artist
- the release year
- the number of page views
- the names of the featured artists
- the lyrics
- the Genius identifier
- the lyrics language according to [CLD3](https://github.com/google/cld3). Unreliable results are NaN. CLD3 is a neural network model for language identification.
- the lyrics language according to [FastText's langid](https://fasttext.cc/docs/en/language-identification.html). Values with low confidence (<0.5) are NaN. fastText is a library developed by Facebook’s AI Research lab for efficient learning of word representations and sentence classification. It also provides a fast and accurate tool for text-based language identification capable of recognizing more than 170 languages.
- the combined language, which merges language_cld3 and language_ft. It has a non-NaN entry only if they both "agree".
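The agreement rule for the combined `language` column can be sketched in a few lines. This is only an illustration of the rule as described above: the function name `combine_languages` is hypothetical, and `None` stands in for the NaN values of the dataframe.

```python
def combine_languages(cld3_lang, fasttext_lang):
    """Return the shared language code, or None when the detectors disagree."""
    # a missing prediction from either detector means no agreement is possible
    if cld3_lang is None or fasttext_lang is None:
        return None
    return cld3_lang if cld3_lang == fasttext_lang else None


examples = [("en", "en"), ("fr", "en"), (None, "en")]
combined = [combine_languages(cld3, ft) for cld3, ft in examples]
print(combined)
```

Filtering on this combined column, as done later in the notebook, therefore keeps only the songs that both detectors label as English.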

In [5]:
df_raw.dtypes

title            object
tag              object
artist           object
year              int64
views             int64
features         object
lyrics           object
id                int64
language_cld3    object
language_ft      object
language         object
dtype: object

In [6]:
# display the size
print('Data frame size (row x columns):', df_raw.size)
print('Data rows number: ', len(df_raw))
print('Number of unique songs (following genius id): ', len(df_raw.id.unique()))

Data frame size (row x columns): 11000000
Data rows number:  1000000
Number of unique songs (following genius id):  1000000


The Genius id appears to be a unique row identifier.

Let's visualize the size of the raw data over the years before preprocessing, in order to compare the batch distributions. One thing to know before visualizing the data: the pickles are created by reading the CSV in chunks.

The table displayed above gives us some information about the data. The CSV file appears to be sorted by id, so the pickle files are sorted as well.

In [4]:
import os
import pandas as pd
import plotly.express as px

In [5]:
# get some information about the pickle data
def pickle_informations(loader: Loader):
    rows = []
    for i in range(1, len(os.listdir('data')) + 1):
        df = loader.load_pickle(i)
        rows.append(len(df))
        del df
    return rows

# get the rows
rows = pickle_informations(loader=loader)

# create the dataframe
df_data = pd.DataFrame(
    {'batch' : ['data ' + str(i) for i in range(1,len(rows) + 1)],
    'rows' : rows})

fig = px.bar(df_data, x="batch", y="rows")
fig.show()

All batches appear to have the same number of rows except the last one. This is consistent with how they are built: batches are created iteratively from chunks of 1e6 (one million) rows of the CSV, so the last batch is simply the remainder.
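The batch sizes follow directly from integer division of the total row count by the chunk size. A minimal sketch is shown below. The total of 5,100,000 rows is a hypothetical figure chosen for illustration, since the notebook only says "around 5 million", and `batch_sizes` is not a function from the project.

```python
def batch_sizes(total_rows, chunk_size):
    """Sizes of the batches obtained by reading `total_rows` in fixed-size chunks."""
    full_batches, remainder = divmod(total_rows, chunk_size)
    sizes = [chunk_size] * full_batches
    if remainder:
        # the leftover rows form one last, smaller batch
        sizes.append(remainder)
    return sizes


# Hypothetical total: the exact row count of the CSV is not stated in the notebook.
sizes = batch_sizes(total_rows=5_100_000, chunk_size=1_000_000)
print(sizes)
```

Under this assumption we get five full batches and a sixth, much smaller one, which matches the shape of the bar chart.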

In [9]:
from utils.plots import multi_barplot
import plotly.colors as col

# create the color list
colors = col.qualitative.Plotly

# 1960 - 1989
fig1 = multi_barplot(year1=1960, year2=1989, colors=colors, loader=loader)
fig1.show()
# 1990 - 2023
fig2 = multi_barplot(year1=1990, year2=2023, colors=colors, loader=loader)
fig2.show()

The first bar chart (1960 - 1989) shows the number of songs increasing over the years. The batches also seem to have quite similar distributions over the years. The data_1 and data_2 batches contain noticeably more songs from this period than the four others, and the data_6 batch contains fewer because of its small number of rows.
The data behave similarly until 2012, as we can see on the second chart (1990 - 2023). After that year, the amount of data retrieved increases sharply: every batch at least doubles (an increase of at least 100%), and some grow by up to 50 times.

# Data pre-processing

The aim of this part is to preprocess the data so that they are suitable for the analysis. Let's focus in particular on the year variable.

We focus on English songs, which simplifies the analysis and the work of the natural language processing algorithms.

In [10]:
# Retrieve only the texts identified as English language by both cld3 and fasttext langid
df = df_raw[df_raw.language == 'en']

Next, it is worth checking for NaN values.

In [25]:
# find which columns contain NaN values
df.columns[df.isna().any()].tolist()

['title']

In [26]:
# get all rows that contain NaN values
df_nan = df[df.isna().any(axis=1)]
df_nan

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2114245,,rock,Peter Dzubay,2017,52,{},"When I finally find what's real, why shouldn't...",3166588,en,en,en
2146772,,rap,OLlama,2017,2120,{},[Intro]\nToo many bitches; Too many bitches tr...,3210807,en,en,en
2158344,,rock,The Moth & The Flame,2011,1051,{},We used to be so similar\nWell that was wishfu...,3226446,en,en,en
2160846,,misc,Lawfermz,2017,3,{},(Intro)\nBeep beep beep beep beep beep.\nHello...,3230118,en,en,en
2210243,,rap,TripleYoThreat,2017,24,{},"TripleYoThreat, you ain't never gonna make it\...",3296318,en,en,en
2211581,,rap,Huntaps,2017,157,{},"[Hook]\nFuck 12, if you fuck with me I'll send...",3297867,en,en,en
2294150,,rap,Shaiza Maponyaza,2018,45,{},[intro]\n\nYah yah yah yah\nYah yah\nMan same ...,3438103,en,en,en
2309207,,rock,PAWS,2016,271,{},You're scared of history repeating on me\nCons...,3472732,en,en,en
2357786,,rap,J4ydizz1e,2018,55,{},[Intro: J4yDiZz1e]\n\nThis shit sounds just li...,3561848,en,en,en
2441948,,rap,Bass Santana,2018,1147,"{""Kin\\$oul"",""Skott Summerz""}",[Intro: Bass Santana]\nTo Bass be the glory\nF...,3681585,en,en,en


In [27]:
print('Number of untitled song:', len(df[df.isna().any(axis=1)]))

Number of untitled song: 21


The song title is not used to train the topic modeling algorithms, although titles may be related to the topics in a later phase of the analysis. Since very few songs have no title, I decide to delete these rows for the moment.

In [28]:
# Delete rows containing NaN values
df = df.dropna()
len(df)

645573

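Next, we also check for None values. In pandas, `isnull()` is an alias of `isna()` and treats None as missing, so this also confirms that `dropna()` removed every missing value.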

In [29]:
df[df.isnull().any(axis=1)]

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


No None values remain in this dataframe.

Next, let's look at the year variable, which is one of the most important variables in our analysis because we want to extract the topics by decade.

In [30]:
years = df.year.unique()
print(years)

print('Number of unique years: ',len(years))

[2017 2016 1998 2002 2011 2013 2014 2015 2006 2009 2012 2005 2018 2008
 2003 2010 2007 1989 1986 1999 1905 1985 1994 2001 1899 1913 2000 2020
 1990 1984 1983 1979 2021 1995 2004 1825 1992 1993 2019 1980 1977 1966
 1988 1991 1997 1949 1957 1982 1674 1996 1974 1942 1850 1842    1 1964
 1972 1915 1976 1930 1987 1919 1911 1960 1978 1968 1975 1950 1981 1952
 1973 1951 1947 1954 1961 1892 1878 1934 1935 1970 1969 1944 1962 1909
 2022 1971 1927 1945 1936 1926 1938 1901   12 1900 1922 1929 1937 1953
 1820 1864 1965 1955 1540 1916 1963 1967 1918 1904 1943   14 1772 1788
 1780 1959 1886 1939 1871 1912 1956 1958 1880 1923 1910 1844 1897 1914
 1862 1933 1946   15 1877 1771   17 1948 1931 1898 1700 1888 1928  510
 1791 1895 1941 1920   25 1917 1908 1940 1859 1907 1902 2023 1857 1863
 1666 1893 1889 1861 1869 1925 1867 1872 1676 1675  420 1932  709 1818
 1830 1005 1866 1066  176 1300 1924 1251 1838 1760 1739   21 1758 1469
    2 1400 1853 1320 1200 1890 1860 1903   18 1220  130 1794 1896 1461
 1210 

We first want to know whether the format of the year variable is suitable. It is highly likely that some years are truncated (for example, 92 instead of 1992).
Let's display the tag distribution for songs with a release year below 215.

In [11]:
df_tag = df[df['year'] < 215].groupby(['tag']).size().reset_index(name='count')

fig = px.pie(df_tag, names="tag", values="count", title = "Outlier tag distribution")
fig.show()

It is rather surprising that the dominant genre for this period (< 215) is rap, a genre that emerged only recently. Clearly, a large share of these records have an erroneous year.

In [12]:
# Extract the 'rap' songs with a release year below 215
df_rap = df[(df['year'] < 215) & (df['tag'] == 'rap')]
df_rap.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2111689,Issa Snack,rap,Phatboitheceleb,1,4182,{},Chorus: its ya birthday and you know I ain't f...,3163216,en,en,en
2268205,Oof sandman sans sans diss track,rap,Squeaky,15,3206,{},OOF POOF BOOTH LOSE LOOSE MOVE LUKE FRISK WRIS...,3387617,en,en,en
2218600,First Day Home,rap,RetcH,1,2334,{},"[Intro]\n\n[Chorus]\nHome, it's my first day h...",3307004,en,en,en
2381769,Papi,rap,DON-DON (US),1,2129,{},[INTRO]\nCall me papi\nCall me papi\nLead her\...,3596060,en,en,en
2715072,Same Love,rap,Macklemore,1,1767,"{""Macklemore & Ryan Lewis""}",When I was in the 3rd grade\nI thought that I ...,4080328,en,en,en
2969927,IWroteThisForTaylorSwift,rap,Vershad,1,1607,"{""Brooklyn Zoo""}","[Intro]\nJake: Sounds like your sad. Listen, ...",4484444,en,en,en
2268744,I like frisk,rap,Squeaky,1,1186,"{""Squeaky jr""}","Ok, ok, ok, ok, ok, ok, ok, yeah\nI like frisk...",3388825,en,en,en
2945081,Same Lil Bitch,rap,Shayla Gessler,1,1023,{},"Did me dirty, you been fuckin' with the Same L...",4441966,en,en,en
2018947,Dont Blame Me For You,rap,Kids These Days,1,868,"{""VIC MENSA"",""Macie Stewart"",""Liam Cunningham""}","[Verse 1: Vic Mensa]\nI'm larger than life, ha...",3044399,en,en,en
2431118,Robocop,rap,videogamedunkey,1,766,{​videogamedunkey},Now that I'm a Robocop\n\nNow I'm never gonna ...,3666068,en,en,en


Searching Google for the release date of one such track, we find 4 May 2021 on the [Genius website](https://genius.com/Kanye-east-the-secrets-of-dababy-lyrics). Comparing the year in our table with the real one suggests a problem with the date format (1 instead of 2021).

After some research on the Genius website, the most viewed songs in the list above appear to have been released in 2021. However, the fewer views a song has, the harder its true release date is to establish.

Let's check the second most popular tag, 'pop', among these outliers:

In [13]:
# Extract the 'pop' songs with a release year below 215
df_pop = df[(df['year'] < 215) & (df['tag'] == 'pop')]
df_pop.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2274779,Off White,pop,Zach Clayton,1,7894,{},"Chorus\nSwish on my kicks, off white\nStripes ...",3399877,en,en,en
2951623,I dont wanna see you cryin anymore,pop,Adam Melchor,1,7479,{},[Verse 1]\nI don't wanna see you cryin’ anymor...,4452665,en,en,en
2495830,Hello from the Dark Side,pop,RoyishGoodLooks,15,7019,{},[Chorus]\nHello from the Dark Side\nI must've ...,3763085,en,en,en
2385856,The Moons Detriment,pop,Shannon Lay,1,6698,{},If I were to know you\nI’d show you all the li...,3603840,en,en,en
2635909,Mostly to Yourself,pop,Noah Reid,1,3517,{},Well it's mostly in the mornin'\nWhen your eye...,3961929,en,en,en
2433617,Come September,pop,Anas Mitchell,1,1244,"{""Anaïs Mitchell""}","Autumn's ashes, summer's embers\nOn the sidewa...",3669873,en,en,en
2599147,Gone,pop,Ella Martine,1,946,{},[Verse 1]\nIt's hard we hurt each other\nTime ...,3910616,en,en,en
2433630,The Pursewarden Affair,pop,Anas Mitchell,1,732,"{""Anaïs Mitchell""}",Percy Pursewarden\nOpen up your door\nI haven'...,3669885,en,en,en
2318347,The Rose,pop,Common Holly,1,642,{},Well I know I was the rose\nBut now I feel lik...,3496110,en,en,en
2918072,U cut me off so the cuts r on me now..,pop,Sapphire2001,1,495,{},I can feel in the air tonight\nSmoke seeping f...,4401036,en,en,en


Most of the titles retrieved appear to be recent, not very popular songs whose year was badly indexed.

Case-by-case correction is too tedious given the amount of data to process. We only keep data with correctly formatted dates.

In [15]:
df = df[(df.year >= 1960) & (df.year <= 2024)]
len(df)

643004

We want to analyze the lyrics by decade, so let's add a decade column.

In [16]:
# work on a copy so the new column is not written to a slice of df_raw
df = df.copy()
# floor each year to its decade, e.g. 1968 -> 1960
df['decade'] = (df['year'] // 10) * 10
df.sort_values(by = 'year').head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
2374566,Aftermath,misc,Sylvia Plath,1960,159,{},Compelled by calamity's magnet\nThey loiter an...,3585539,en,en,en,1960
2374473,Full Fathom Five,misc,Sylvia Plath,1960,2226,{},"Old man, you surface seldom.\nThen you come in...",3585373,en,en,en,1960
2374453,Night Shift,misc,Sylvia Plath,1960,2140,{},"It was not a heart, beating.\nThat muted boom,...",3585346,en,en,en,1960
2374450,Spinster,misc,Sylvia Plath,1960,3269,{},Now this particular girl\nDuring a ceremonious...,3585343,en,en,en,1960
2809006,Panama Woman,rb,Mighty Sparrow,1960,29,{},"[Verse 1]\nChubby girl, you like too much mone...",4223131,en,en,en,1960
2809005,Maude,rb,Mighty Sparrow,1960,14,{},"[Verse 1]\nMaude, you mad\nHow you could throw...",4223130,en,en,en,1960
2459266,Im Waiting for Tomorrow,pop,Danny Rivers,1960,200,{},I'm waiting for tomorrow (tomorrow)\nAnd I'm p...,3706357,en,en,en,1960
2809003,Dont Touch Me,rb,Mighty Sparrow,1960,25,{},[Verse 1]\nBrethren and sistren\nWe are gather...,4223129,en,en,en,1960
2752281,Sing Again With The Chipmunks,pop,The Chipmunks,1960,106,{},"Come on, Chipmunks!\nWe're gonna sing again fo...",4136291,en,en,en,1960
2374475,Medallion,misc,Sylvia Plath,1960,2210,{},By the gate with star and moon\nWorked into th...,3585375,en,en,en,1960
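The year filter and the decade computation can be checked on a handful of values. The sketch below uses plain Python and two hypothetical helpers, `in_study_range` and `to_decade`, that mirror the 1960 to 2024 range and the floor-to-decade rule used above.

```python
def in_study_range(year, first=1960, last=2024):
    """True when the year falls inside the range kept for the analysis."""
    return first <= year <= last


def to_decade(year):
    """Floor a year to the first year of its decade, e.g. 1968 -> 1960."""
    return (year // 10) * 10


# a mix of plausible years and badly indexed ones taken from the list above
years = [1968, 1, 2017, 1674, 2020]
decades = [to_decade(year) for year in years if in_study_range(year)]
print(decades)
```

The badly indexed years (1 and 1674) are dropped by the range filter before any decade is computed.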


## Text preprocessing

After the visualization part, let's focus on the main data: the lyrics.

In [17]:
df.iloc[0]["lyrics"]

"The roses start to wither\nWhere the devil lays to sleep\nAnd sorrow seems to be like joy\nWhere dying lights dance on last time\n\nYou had me at hello\nBut I couldn't see right through your twisted soul\nThis was the downfall of us all\n\n(Roses start to wither\nWhere the devil lays to sleep)\n\nToo close to hell\nI can't see\nWhat ate me up from the inside\nToo close to hell\nI will never see\nWhat ate me up from the inside out\n(From the inside out)\n\nI'm caught in a nightmare\nNo chance to make it out alive\nWhy have we left the place\nWhere we found love\nYou had me at hello\nBut I couldn't see right through your twisted soul\nThis was the downfall of us all\n\n(Roses start to wither\nWhere the devil lays to sleep)\n\nToo close to hell\nI can't see\nWhat ate me up from the inside\nToo close to hell\nI will never see\nWhat ate me up from the inside out\n\nSo sing along\nWe are a battle\nSo sing along\nWhy have you forsaken me?\n\nWhy have you forsaken me?\nWhy have you forsaken m

There are many undesirable characters, such as the line break '\n', digits, and square, curly, and round brackets. Let's clean the data with regular expressions.

In [18]:
import re

def clean_text(text):
    # replace line breaks with spaces
    text = text.replace('\n', ' ')
    # remove basic punctuation
    text = re.sub(r'[,\.!?]', '', text)
    # remove section markers in square brackets, e.g. [Chorus]
    text = re.sub(r'\[.*?\]', ' ', text)
    # remove any token that contains a digit
    text = re.sub(r'\w*\d\w*',' ', text)
    # remove round brackets
    text = re.sub(r'[()]', ' ', text)
    # convert all words to lower case
    text = text.lower()
    return text

In [19]:
# get the results of data cleaning
cleaned_text = df["lyrics"].apply(clean_text)

# convert cleaned text to list
docs = cleaned_text.to_list()
docs[0]

"the roses start to wither where the devil lays to sleep and sorrow seems to be like joy where dying lights dance on last time  you had me at hello but i couldn't see right through your twisted soul this was the downfall of us all   roses start to wither where the devil lays to sleep   too close to hell i can't see what ate me up from the inside too close to hell i will never see what ate me up from the inside out  from the inside out   i'm caught in a nightmare no chance to make it out alive why have we left the place where we found love you had me at hello but i couldn't see right through your twisted soul this was the downfall of us all   roses start to wither where the devil lays to sleep   too close to hell i can't see what ate me up from the inside too close to hell i will never see what ate me up from the inside out  so sing along we are a battle so sing along why have you forsaken me  why have you forsaken me why have you forsaken me why have you forsaken me why have you forsak

In [20]:
# update dataframe
df.update(cleaned_text)
df.head(3)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
2000000,Roses,rock,Vitja,2017,399,{},the roses start to wither where the devil lays...,3019113,en,en,en,2010
2000001,Keep On Pushin,rap,Problem,2017,692,"{""My Princess Aeryn""}",imma keep on pushin imma keep on pushin stay...,3019114,en,en,en,2010
2000002,Inside,pop,The jepettos,2017,1302,{},ooh ooh ah ah i heard you were the wo...,3019115,en,en,en,2010


In [21]:
df.iloc[0]['lyrics']

"the roses start to wither where the devil lays to sleep and sorrow seems to be like joy where dying lights dance on last time  you had me at hello but i couldn't see right through your twisted soul this was the downfall of us all   roses start to wither where the devil lays to sleep   too close to hell i can't see what ate me up from the inside too close to hell i will never see what ate me up from the inside out  from the inside out   i'm caught in a nightmare no chance to make it out alive why have we left the place where we found love you had me at hello but i couldn't see right through your twisted soul this was the downfall of us all   roses start to wither where the devil lays to sleep   too close to hell i can't see what ate me up from the inside too close to hell i will never see what ate me up from the inside out  so sing along we are a battle so sing along why have you forsaken me  why have you forsaken me why have you forsaken me why have you forsaken me why have you forsak

That's better! The libraries that we use later for topic modeling usually provide their own preprocessing, but it is always good to keep control over what we manipulate.
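To see what each cleaning step does, it helps to run the same regular expressions on a short, made-up lyric. The sketch below repeats the steps of `clean_text` and adds one extra step that is not in the original function: collapsing the runs of spaces left behind by the substitutions. The name `clean_and_collapse` and the sample string are assumptions for this example.

```python
import re


def clean_and_collapse(text):
    """Apply the notebook's cleaning steps, then collapse repeated whitespace."""
    text = text.replace("\n", " ")           # line breaks -> spaces
    text = re.sub(r"[,\.!?]", "", text)      # basic punctuation
    text = re.sub(r"\[.*?\]", " ", text)     # section markers such as [Verse 1]
    text = re.sub(r"\w*\d\w*", " ", text)    # tokens containing a digit (x2, 1999)
    text = re.sub(r"[()]", " ", text)        # round brackets
    text = text.lower()
    # extra step (not in the original function): squeeze leftover spaces
    return " ".join(text.split())


sample = "[Verse 1]\nHello, World! (x2)\nIt's 1999"
cleaned = clean_and_collapse(sample)
print(cleaned)  # hello world it's
```

Apostrophes are kept on purpose here, as in the original function, so contractions such as "it's" survive for the later tokenization step.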

# Topic modeling

- [LDA (latent Dirichlet allocation)](https://fr.wikipedia.org/wiki/Allocation_de_Dirichlet_latente) has been the standard approach to topic modeling in recent years. It works well and is quite easy to use with common Python libraries such as [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html).
- [BERTopic](https://maartengr.github.io/BERTopic/index.html) seems to be one of the best current techniques for topic modeling. It combines embeddings from [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), the well-known language model, with a [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html) transformer.
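To give an intuition for the c-TF-IDF weighting mentioned in the last bullet, here is a simplified sketch in plain Python. It assumes the class-based formula described in the BERTopic documentation, weight = tf × log(1 + A / f), where tf is the frequency of a term inside a class, f its frequency across all classes, and A the average number of words per class. The function `c_tf_idf` and the toy classes are hypothetical, raw counts are used for tf, and this is not BERTopic's actual implementation.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Simplified class-based TF-IDF: tf * log(1 + avg_words_per_class / total_freq)."""
    term_freq = {label: Counter(tokens) for label, tokens in class_tokens.items()}
    # frequency of each term across all classes
    total_freq = Counter()
    for counts in term_freq.values():
        total_freq.update(counts)
    avg_words = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)
    return {
        label: {
            term: count * math.log(1 + avg_words / total_freq[term])
            for term, count in counts.items()
        }
        for label, counts in term_freq.items()
    }


# toy example: all documents of a class are concatenated into one token list
class_tokens = {
    "romance": ["love", "love", "baby"],
    "travel": ["road", "car", "baby"],
}
scores = c_tf_idf(class_tokens)
print(sorted(scores["romance"], key=scores["romance"].get, reverse=True))
```

A term shared by both classes ("baby") ends up with a lower weight than a term that is specific to one class, which is what makes the top terms of each topic distinctive.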

## Define default tokenizer and Lemmatizer

Processing all of this data on my computer or on Kaggle would quickly exhaust the memory. To avoid this, I work with a sample of the previously loaded data.

In [23]:
from utils.functions import sample
from utils.plots import barplot_by_decade

# sample the data
sdf = sample(data=df)

[(2010, 571595), (2000, 33998), (1990, 14189), (1980, 8788), (1970, 7088), (1960, 4727), (2020, 2619)]
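The `sample` helper is not shown in this notebook, so its exact strategy is unknown. Given how unbalanced the decades are in the counts above, one plausible approach, assumed here purely for illustration, is to cap every decade at the same maximum number of songs. The function `sample_by_decade` and the toy records below are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_by_decade(records, cap, seed=0):
    """Draw at most `cap` records per decade, at random and without replacement."""
    rng = random.Random(seed)
    by_decade = defaultdict(list)
    for record in records:
        by_decade[record["decade"]].append(record)
    sampled = []
    for decade in sorted(by_decade):
        group = by_decade[decade]
        # small decades are kept whole, large ones are cut down to the cap
        sampled.extend(rng.sample(group, min(cap, len(group))))
    return sampled


# toy data: one over-represented decade and one rare decade
records = [{"id": i, "decade": 2010} for i in range(50)]
records += [{"id": 100 + i, "decade": 1960} for i in range(5)]

subset = sample_by_decade(records, cap=10)
counts = Counter(record["decade"] for record in subset)
print(sorted(counts.items()))
```

Capping reduces the weight of the 2010s without discarding the rare decades, at the cost of no longer reflecting the true proportions of the dataset.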


In [24]:
# check the distribution of the sample
barplot_by_decade(sdf)

In [25]:
# get the results of data cleaning
cleaned_text = sdf["lyrics"].apply(clean_text)
# update dataframe
sdf.update(cleaned_text)
sdf.head(3)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
2584605,A Thousand Years,pop,Tom Paxton,1968,364,{},the burgher banged his fist on the table red f...,3890197,en,en,en,1960
2507633,How Can I Say Im Sorry,rb,Jimmy Ruffin,1965,357,{},ooh i want you back i want you back can't go...,3779801,en,en,en,1960
2409626,Sing What You Wanna,pop,Shorty Long,1968,47,{},ever since my baby been gone write no song...,3634638,en,en,en,1960


## Visualize the most frequent words over the decades

In [26]:
from utils.terms_document_matrix import TermsDocumentsMatrix
from utils.processing import preprocess

ModuleNotFoundError: No module named 'tqdm'

In [None]:
# first decades
tdm = TermsDocumentsMatrix(sdf, decades = [1960, 1970, 1980, 1990],
                           colorscale = 'Plasma')

# display bar charts of most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)

According to the bar charts displayed above, a group of words recurs in every decade: love, know, go, feel... These words fit popular songs, which often talk about love. This is consistent with our previous analysis of the pie charts showing the proportions of musical genres over time. We also notice a strong presence of onomatopoeia such as yeah or oh.

In [None]:
# last decades
tdm = TermsDocumentsMatrix(sdf, decades = [2000, 2010, 2020],
                          colorscale = 'Plasma')

# most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)

We get similar results for these later decades, with similar high-frequency words. We see more onomatopoeia in the current decade, which may be explained by the rise of rap over the current and previous decades. Rap makes heavy use of 'ad-libs': sounds, words, or onomatopoeias that artists utter between two verses or at the end of a line to give their lyrics more impact and to liven up a track.
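The `TermsDocumentsMatrix` class is not shown here. The counting behind such bar charts can be sketched with `collections.Counter`, as below. The function `most_frequent_terms` and the toy documents are assumptions for this example, and the real class may well rely on a vectorizer and lemmatization instead.

```python
from collections import Counter


def most_frequent_terms(docs_by_decade, n_terms, stopwords=frozenset()):
    """Return the `n_terms` most frequent words of each decade with their counts."""
    top = {}
    for decade, docs in docs_by_decade.items():
        counts = Counter(
            word for doc in docs for word in doc.split() if word not in stopwords
        )
        top[decade] = counts.most_common(n_terms)
    return top


# toy cleaned lyrics, grouped by decade
docs_by_decade = {
    1960: ["love love me do", "i feel fine"],
    2010: ["yeah yeah i know", "oh yeah we go"],
}
top_terms = most_frequent_terms(docs_by_decade, n_terms=1)
print(top_terms)
```

Passing a set of ad-libs such as `{"yeah", "oh"}` as `stopwords` is one simple way to see which content words remain once the onomatopoeias are set aside.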


## Topic modeling with LDA

LDA is a common technique for topic modeling. We start with basic preprocessing.

In [None]:
# utils
from datetime import datetime
import logging
from config import gensim_log

# configure logging (force=True replaces any handlers already installed)
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                    level=logging.INFO,
                   force = True)

In [None]:
from utils.lda.model import LDATopicModeling, matcher

lda_model = LDATopicModeling(df = sdf, decade = 1990)
likelihoods = lda_model.get_likelihood()

In [None]:
# read the gensim log back ("r" mode; "w" would truncate the file and forbid reading)
with open('/kaggle/working/sample.log', "r") as source:
    for line in source:
        match = matcher.search(line)
        if match:
            likelihoods.append(float(match.group(1)))
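The `matcher` imported from `utils.lda.model` is not shown in this notebook. Assuming the log contains gensim's usual "per-word bound ... perplexity estimate" lines, a pattern of the following kind could do the job. The name `bound_pattern` and the sample log lines are made up for illustration.

```python
import re

# Hypothetical pattern: group 1 is the per-word bound, group 2 the perplexity estimate.
bound_pattern = re.compile(r"(-?\d+\.\d+) per-word bound, (\d+\.\d+) perplexity")

# Made-up log lines in the '%(asctime)s:%(levelname)s:%(message)s' format used above.
log_lines = [
    "2023-01-10 12:00:01,123:INFO:-9.512 per-word bound, 730.1 perplexity "
    "estimate based on a held-out corpus of 100 documents with 5000 words",
    "2023-01-10 12:00:02,456:INFO:PROGRESS: pass 0, at document #100/100",
    "2023-01-10 12:00:03,789:INFO:-8.901 per-word bound, 478.0 perplexity "
    "estimate based on a held-out corpus of 100 documents with 5000 words",
]

bounds = []
for line in log_lines:
    match = bound_pattern.search(line)
    if match:
        # keep only the per-word bound, one value per evaluation
        bounds.append(float(match.group(1)))

print(bounds)
```

Lines that do not report a bound, such as the progress message, are simply skipped, so the list holds one value per evaluation pass.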

In [None]:
# list the documents whose topic distribution contains a single topic
topic_dists = [elem for elem in lda_model.model[lda_model.get_corpus]]
for dist in topic_dists:
    if len(dist) == 1:
        print("length is 1")
        print(dist)

In [None]:
print([i for i in lda_model.model[lda_model.get_corpus]])

In [None]:
print([i for i in lda_model.get_corpus])


In [None]:
# print the result
lda_model.model.print_topics()

In [None]:
lda_model.plot_tsne(2)

In [None]:
lda_model.get_likelihood()


In [None]:
lda_model.dashboard_LDAvis()

Let's run LDA for each decade.

In [None]:
from utils.lda.pipeline import LDAPipeline

lda_models = LDAPipeline()

In [None]:
lda_models.get_metrics()

Display information for each decade.

In [None]:
lda_1960 = lda_models.lda_info(1960)

In [None]:
lda_1960

In [None]:
lda_1970 = lda_models.lda_info(1970)

In [None]:
lda_1970

In [None]:
lda_1980 = lda_models.lda_info(1980)

In [None]:
lda_1980

In [None]:
lda_1990 = lda_models.lda_info(1990)

In [None]:
lda_1990

In [None]:
lda_2000 = lda_models.lda_info(2000)

In [None]:
lda_2000

In [None]:
lda_2010 = lda_models.lda_info(2010)

In [None]:
lda_2010

In [None]:
lda_2020 = lda_models.lda_info(2020)

In [None]:
lda_2020

The basic LDA model gives us our baseline, with default **perplexity** and **coherence** scores. Let's try to improve these scores, and also our qualitative intuition about the topics. As we saw, the topics look very similar across the decades, and it is quite difficult to retrieve good topics from the representation we computed.
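As a reminder of how the two numbers relate, gensim reports a per-word likelihood bound and derives the perplexity from it as 2 raised to minus that bound (this is my reading of gensim's `log_perplexity` output, so treat it as an assumption). The helper `perplexity_from_bound` below is hypothetical and only illustrates the arithmetic.

```python
def perplexity_from_bound(per_word_bound):
    """Perplexity implied by a per-word likelihood bound expressed in base 2."""
    return 2 ** (-per_word_bound)


# a bound closer to zero means a better fit, hence a lower perplexity
baseline = perplexity_from_bound(-3.0)
improved = perplexity_from_bound(-2.0)
print(baseline, improved)
```

Lower perplexity is better, whereas for coherence higher is better, so the two scores should be read in opposite directions when comparing models.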

# Improve the preprocessing

In this part, we build a preprocessing function that takes bigrams and trigrams into account and sets aside the terms that recurred too often in the previous part.

## ngram recognition with gensim

One way to improve the understanding of the lyrics is to detect bigrams and trigrams with the help of gensim's phraser.

In [None]:
import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes accent marks

# split the lyrics into lists of words
data = sdf['lyrics'].tolist()
data_words = list(sent_to_words(data))

In [None]:
# display the result
data_words[0][:10]

In [None]:
import gensim

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_model = gensim.models.phrases.Phraser(bigram)
trigram_model = gensim.models.phrases.Phraser(trigram)

In [None]:
from numpy.random import randint, seed

# set seed from numpy
seed(18)

# draw a random document index
upper_bound = randint(len(data_words))

# display a random example
print('bigram model: ', bigram_model[data_words[upper_bound]])
print('\ntrigram model: ', trigram_model[bigram_model[data_words[upper_bound]]])
print('\nsentence: ',' '.join(data_words[upper_bound]))

We can see some bigrams and trigrams. Most of the time they are a word with its adjective or a group of onomatopoeias.
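To make the role of `min_count` and `threshold` more concrete, the sketch below reproduces the scoring rule that, to my understanding of the gensim documentation, `Phrases` applies by default: (count(a, b) - min_count) / (count(a) × count(b)) × vocabulary size. The function `phrase_score` and all the counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    """Score of the candidate bigram (a, b); higher means a stronger collocation."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


threshold = 100  # same value as in the Phrases calls above

# two words that almost always appear together
strong = phrase_score(count_a=20, count_b=10, count_ab=9, min_count=5, vocab_size=10_000)
# same words, but seen together only slightly more often than min_count
weak = phrase_score(count_a=20, count_b=10, count_ab=6, min_count=5, vocab_size=10_000)

print(strong > threshold, weak > threshold)
```

Only pairs whose score exceeds the threshold are merged into a single token, which is why a higher threshold yields fewer phrases.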

As explained earlier, we also want to set aside irrelevant terms that recurred in our previous analysis. The earlier bar charts show a large number of such terms in every decade, for example like, know, or yeah. Most of these uninteresting terms are verbs or onomatopoeias. Let's identify the least interesting ones by comparing the bar charts with the pyLDAvis visualizations. The verbs like, know, come, and get recur strongly, yet they are not necessarily relevant because they appear in most of the topics analyzed. The onomatopoeias oh, yeah, and la also appear in most of the topics.
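The list of recurrent terms used in this notebook was compiled by eye from the charts. The same idea can be automated by keeping the terms that show up among the top terms of a large share of topics. The function `recurrent_terms` and the toy topics below are hypothetical and are not part of the project code.

```python
from collections import Counter


def recurrent_terms(topics, min_share):
    """Terms that appear in at least `min_share` (between 0 and 1) of the topics."""
    n_topics = len(topics)
    # count each term once per topic, even if it is listed twice
    topic_counts = Counter(term for terms in topics for term in set(terms))
    return {
        term for term, count in topic_counts.items() if count / n_topics >= min_share
    }


# toy top terms of three topics
topics = [
    ["love", "know", "yeah", "heart"],
    ["know", "yeah", "street", "money"],
    ["yeah", "know", "dance", "night"],
]
candidates = recurrent_terms(topics, min_share=1.0)
print(sorted(candidates))
```

Lowering `min_share` widens the candidate list, and the result is best reviewed by hand before adding it to the stopwords, since a frequent term can still be meaningful for some decades.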

In [None]:
# list recurrent terms
from utils.processing import new_nlp
recurrent_terms = {'like','know','come','get', 'got','go','to','oh','yeah','la'}


# default preprocessing
def ngram_preprocess(text, nlp = new_nlp,
                     bigram = bigram_model,
                     trigram = trigram_model,
                    new_stopwords = recurrent_terms):
    
    # perform basic preprocessing to transform sentence to list of words
    words = gensim.utils.simple_preprocess(text)
    
    # customize stopwords (use the nlp argument rather than the global new_nlp)
    spacy_stopwords = nlp.Defaults.stop_words
    ext_stopwords = spacy_stopwords | new_stopwords # union of set
    
    #removing stop words
    no_stop_words = [word for word in words if word not in ext_stopwords]
    
    # perform bigram model
    bigram_words = bigram[no_stop_words]
    
    # perform trigram model
    trigram_words = trigram[bigram_words]
    
    # recreate the sentence
    sentence = ' '.join(trigram_words)
    
    # tokenize with spaCy to get the lemmas
    tokens = [token for token in nlp(sentence)]
    
    # lemmatize and keep alphabetic tokens; n-grams joined with '_' are kept as well
    sentence = [word.lemma_ for word in tokens if word.text.replace('_', '').isalpha()]

    return sentence

In [None]:
# test with the previously drawn sentence
ngram_preprocess(' '.join(data_words[upper_bound]))[:10]

Let's rerun LDA with the new preprocessing function.

In [None]:
ngram_model = LDATopicModeling(decade = 2000, lang_preprocess = ngram_preprocess)

In [None]:
ngram_model.dashboard_LDAvis()

In [None]:
ngram_model.plot_tsne()

In [None]:
ngram_model = LDATopicModeling(
    df=sdf,
    decade = 2000,
    lang_preprocess = ngram_preprocess,
    cross_valid = True,
    epochs = 10
)

In [None]:
ngram_model.plot_coherence('alpha')

In [None]:
ngram_model.plot_coherence('eta')

In [None]:
ngram_model.plot_tsne()

In [None]:
ngram_model.dashboard_LDAvis()

In [None]:
lda_cv_models = LDAPipeline(prep = ngram_preprocess, cv = True)

In [None]:
lda_cv_models.get_metrics()

In [None]:
lda_cv_1960 = lda_cv_models.lda_info(1960)

In [None]:
lda_cv_1960

In [None]:
lda_cv_1970 = lda_cv_models.lda_info(1970)

In [None]:
lda_cv_1970

In [None]:
lda_cv_1980 = lda_cv_models.lda_info(1980)

In [None]:
lda_cv_1980

In [None]:
lda_cv_1990 = lda_cv_models.lda_info(1990)

In [None]:
lda_cv_1990

In [None]:
lda_cv_2000 = lda_cv_models.lda_info(2000)

In [None]:
lda_cv_2000

In [None]:
lda_cv_2010 = lda_cv_models.lda_info(2010)

In [None]:
lda_cv_2010

In [None]:
lda_cv_2020 = lda_cv_models.lda_info(2020)

In [None]:
lda_cv_2020

In [None]:
lda_cv_models.get_metrics()

## 2015 songs lyrics topic modelling

Let's first retrieve the English songs from 2015 (the year with the most lyrics indexed on Genius).

Let's plot the tag distribution.

The bar plot below shows the frequency of each tag, colored by total views.

Let's try topic modelling with the top2vec library, which is the easiest to start with. But first, let's filter the lyrics.

# 2000 song lyrics topic modelling

## References