### This notebook detects the language of each song from the lyrics retrieved with LyricsGenius. It uses the pretrained fastText language identification model available at https://fasttext.cc/docs/en/language-identification.html


The language identification model (lid.176.bin) is credited to the following papers:

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759)

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [FastText.zip: Compressing text classification models](https://arxiv.org/abs/1612.03651)

In [1]:
import fasttext
import pandas as pd

When called, the function loads the pretrained language identification model and predicts the most probable language for each song.

In [2]:
def lang_identification(data_to_use, column_to_detect):
    # Load the pretrained fastText language identification model
    pretrained_model_loc = './pre-trained-models/lid.176.bin'
    lang_model = fasttext.load_model(pretrained_model_loc)
    lang_identified = []
    lang_identified_prob = []
    for lyric in data_to_use[column_to_detect]:
        # predict() works on a single line, so replace newlines with spaces
        lyric = str(lyric).replace("\n", " ")
        # Predict once and unpack both the top label and its probability
        labels, probs = lang_model.predict(lyric)
        lang_identified.append(labels[0])
        lang_identified_prob.append(probs[0])

    # Reuse the input index so the new columns align row by row in pd.concat
    top_lang_identified = pd.Series(lang_identified, index=data_to_use.index, name="top_lang_identified")
    top_lang_identified_prob = pd.Series(lang_identified_prob, index=data_to_use.index, name="top_lang_identified_prob")

    data_to_use = pd.concat([data_to_use, top_lang_identified, top_lang_identified_prob], axis=1)

    # Missing lyrics were stringified to "nan" above, so blank out their predictions
    missing = data_to_use[column_to_detect].isna()
    data_to_use.loc[missing, "top_lang_identified"] = None
    data_to_use.loc[missing, "top_lang_identified_prob"] = None

    return data_to_use
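
The per-lyric step of the function can be isolated into a small helper, which makes it easier to check how missing lyrics and the `__label__` prefix are handled. The sketch below is illustrative only: `detect_language`, `predict_fn`, and `fake_predict` are hypothetical names that are not part of the notebook, and `fake_predict` merely imitates the `(labels, probabilities)` pair that `lang_model.predict` returns, so that the example runs without the model file.

```python
import math

LABEL_PREFIX = "__label__"  # prefix fastText puts in front of each language code


def detect_language(lyric, predict_fn):
    """Return (language_code, probability) for one lyric, or (None, None) if missing."""
    # Treat None and float NaN (how pandas stores missing text) as missing.
    if lyric is None or (isinstance(lyric, float) and math.isnan(lyric)):
        return None, None
    # The prediction works on a single line, so newlines become spaces.
    text = str(lyric).replace("\n", " ").strip()
    if not text:
        return None, None
    # One call yields both the labels and their probabilities.
    labels, probs = predict_fn(text)
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(probs[0])


def fake_predict(text):
    # Hypothetical stand-in for lang_model.predict; always answers English.
    return ("__label__en",), [0.93]


print(detect_language("See, I said\nlet go", fake_predict))
print(detect_language(float("nan"), fake_predict))
```

In the notebook, `lang_model.predict` would be passed in place of `fake_predict`. Skipping missing lyrics before prediction avoids classifying the literal string "nan" and then overwriting the result afterwards.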

  

Load the lyrics data

In [3]:
tracksLyricFeatures_complete = pd.read_csv('./tracksLyricFeatures/tracksLyricFeaturesComplete.csv', encoding='utf8')
tracksLyricFeatures_complete.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,track,trackArtist,genre,lyrics
0,0,0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra..."
1,1,0,3,Electric Ave,AWOL,Hip-Hop,Oversikt over norske artister Lyrics*&aute&rom...
2,2,0,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s..."
3,3,0,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ..."
4,4,0,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ..."


Call the model to find the language of the lyrics

In [4]:
tracksLyricFeatures = lang_identification(tracksLyricFeatures_complete,"lyrics")
tracksLyricFeatures.head()



Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob
0,0,0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra...",__label__en,0.932648
1,1,0,3,Electric Ave,AWOL,Hip-Hop,Oversikt over norske artister Lyrics*&aute&rom...,__label__en,0.253449
2,2,0,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s...",__label__en,0.949876
3,3,0,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ...",__label__en,0.960411
4,4,0,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ...",__label__en,0.948774


Save the data with the language information

In [5]:
tracksLyricFeatures.iloc[:,2:].to_csv('./tracksLyricFeatures/tracksLyricFeaturesLang.csv', encoding="utf8", index=False)

In [2]:
tracksLyricFeatures = pd.read_csv('./tracksLyricFeatures/tracksLyricFeaturesLang.csv')
tracksLyricFeatures.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob
0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra...",__label__en,0.932648
1,3,Electric Ave,AWOL,Hip-Hop,Oversikt over norske artister Lyrics*&aute&rom...,__label__en,0.253449
2,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s...",__label__en,0.949876
3,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ...",__label__en,0.960411
4,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ...",__label__en,0.948774


Overall count of songs with English as the top language

* Documents with English as the identified language = 12483 (about 50%) - Medium Dataset
* Documents with English as the identified language = 3954 (about 49%) - Small Dataset

In [3]:
tracksLyricFeatures["top_lang_identified"].value_counts()

__label__en    12483
__label__fr      409
__label__es      250
__label__pt      125
__label__de       98
__label__it       90
__label__pl       40
__label__sr       29
__label__la       16
__label__sv       13
__label__id       13
__label__fi       12
__label__ro       12
__label__tr       11
__label__hr       10
__label__ru        9
__label__da        9
__label__hu        7
__label__nl        6
__label__no        6
__label__cs        5
__label__sq        4
__label__ca        4
__label__sk        3
__label__ko        3
__label__ja        3
__label__et        2
__label__sh        2
__label__lv        2
__label__fa        1
__label__gd        1
__label__ms        1
__label__af        1
__label__vi        1
__label__sl        1
Name: top_lang_identified, dtype: int64

Count of English songs with a prediction probability of at least 90%:

* Documents with English as the identified language and prediction probability of at least 90% = 9623 (about 38%) - Medium Dataset
* Documents with English as the identified language and prediction probability of at least 90% = 2993 (about 37%) - Small Dataset

In [8]:
len(tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.9) & (tracksLyricFeatures['top_lang_identified']=="__label__en")])

9623

Count of English songs with a prediction probability of at least 80%:

* Documents with English as the identified language and prediction probability of at least 80% = 10910 (about 44%) - Medium Dataset
* Documents with English as the identified language and prediction probability of at least 80% = 3395 (about 42%) - Small Dataset


In [9]:
len(tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.8) & (tracksLyricFeatures['top_lang_identified']=="__label__en")])

10910

Count of English songs with a prediction probability of at least 70%:

* Documents with English as the identified language and prediction probability of at least 70% = 11106 (about 44%) - Medium Dataset
* Documents with English as the identified language and prediction probability of at least 70% = 3395 (about 43%) - Small Dataset


In [10]:
len(tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.7) & (tracksLyricFeatures['top_lang_identified']=="__label__en")])

11106

Count of English songs with a prediction probability of at least 60%:

* Documents with English as the identified language and prediction probability of at least 60% = 11521 (about 46%) - Medium Dataset
* Documents with English as the identified language and prediction probability of at least 60% = 3590 (about 45%) - Small Dataset


In [11]:
len(tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.6) & (tracksLyricFeatures['top_lang_identified']=="__label__en")])

11521

In [12]:
tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified']=="__label__en")]

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob
0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra...",__label__en,0.932648
1,3,Electric Ave,AWOL,Hip-Hop,Oversikt over norske artister Lyrics*&aute&rom...,__label__en,0.253449
2,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s...",__label__en,0.949876
3,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ...",__label__en,0.960411
4,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ...",__label__en,0.948774
...,...,...,...,...,...,...,...
24990,155292,Scarlet Sails,Alex Mason,Instrumental,Big Red Son LyricsTHE AMERICAN ACADEMY of Emer...,__label__en,0.944068
24992,155294,Cast Away,Alex Mason,Instrumental,March 2022 Singles Release Calendar LyricsAtte...,__label__en,0.481507
24993,155295,Attraction,Alex Mason/The Minor Emotion,Instrumental,Game Law: “Have Fun!” Lyrics|ABCDEFGHIJKLMNOPQ...,__label__en,0.936187
24994,155296,Fallen Stars,Alex Mason,Instrumental,2018 haywoodindahood Listening Log LyricsLast ...,__label__en,0.631934


In [13]:
tracksLyricFeatures.shape

(25000, 7)
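
The threshold cuts above repeat the same filter with a different cutoff each time, and the percentages are the resulting counts divided by the number of rows (for example, 12483 / 25000 is roughly 50%). A minimal sketch of doing this in one loop follows; the function name `english_counts_by_threshold` and the toy data are invented for illustration and do not come from the notebook.

```python
def english_counts_by_threshold(labels, probs, thresholds=(0.9, 0.8, 0.7, 0.6),
                                english_label="__label__en"):
    """Count English predictions at or above each probability threshold."""
    labels = list(labels)
    probs = list(probs)
    total = len(labels)
    summary = {}
    for t in thresholds:
        # NaN probabilities (missing lyrics) compare as False and are skipped.
        count = sum(
            1
            for lab, p in zip(labels, probs)
            if lab == english_label and p is not None and p >= t
        )
        # Share is relative to all rows, including non-English and missing ones.
        summary[t] = (count, count / total if total else 0.0)
    return summary


# Toy data, invented purely for illustration.
labels = ["__label__en", "__label__en", "__label__fr", "__label__en", None]
probs = [0.95, 0.72, 0.99, 0.25, float("nan")]

for t, (count, share) in english_counts_by_threshold(labels, probs).items():
    print(f">= {t:.0%}: {count} songs ({share:.0%} of all rows)")
```

Because the function only iterates over its inputs, the two DataFrame columns `tracksLyricFeatures["top_lang_identified"]` and `tracksLyricFeatures["top_lang_identified_prob"]` could be passed directly.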

Count of songs across all genres for different probability cuts in the medium dataset

Language prediction probability of at least 90%

In [17]:
tracksLyricFeatures_end90 = tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.9) & (tracksLyricFeatures['top_lang_identified']=="__label__en")]
tracksLyricFeatures_end90["genre"].value_counts()

Rock                   3833
Electronic             1860
Experimental            744
Folk                    728
Hip-Hop                 696
Instrumental            550
Pop                     523
Old-Time / Historic     187
International           141
Jazz                     85
Country                  79
Spoken                   61
Classical                52
Soul-RnB                 51
Blues                    32
Easy Listening            1
Name: genre, dtype: int64

Language prediction probability of at least 80%

In [14]:
tracksLyricFeatures_end80 = tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.8) & (tracksLyricFeatures['top_lang_identified']=="__label__en")]
tracksLyricFeatures_end80["genre"].value_counts()

Rock                   4411
Electronic             2096
Hip-Hop                 832
Experimental            809
Folk                    806
Pop                     607
Instrumental            583
Old-Time / Historic     207
International           153
Jazz                     98
Country                  87
Spoken                   63
Soul-RnB                 60
Classical                56
Blues                    39
Easy Listening            3
Name: genre, dtype: int64

Language prediction probability of at least 70%

In [15]:
tracksLyricFeatures_end70 = tracksLyricFeatures[(tracksLyricFeatures['top_lang_identified_prob']>=0.7) & (tracksLyricFeatures['top_lang_identified']=="__label__en")]
tracksLyricFeatures_end70["genre"].value_counts()

Rock                   4499
Electronic             2134
Hip-Hop                 847
Folk                    822
Experimental            816
Pop                     616
Instrumental            594
Old-Time / Historic     207
International           156
Jazz                    100
Country                  87
Spoken                   63
Soul-RnB                 60
Classical                59
Blues                    43
Easy Listening            3
Name: genre, dtype: int64

The classes are clearly imbalanced, with Rock and Electronic as the majority genres. Before model building, we will remove some of the minority classes (Old-Time / Historic, International, Jazz, Country, Spoken, Soul-RnB, Classical, Blues, Easy Listening) from the dataset and also remove songs from the majority classes (Rock, Electronic) to treat the class imbalance.

Also, we will use only the songs for which English was predicted as the top language with a probability of at least 80%.
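
One possible way to carry out the class-imbalance treatment described above is to drop the listed minority genres and randomly undersample any genre that exceeds a cap. The notebook does not state which method or how many songs will be kept, so everything in this sketch is an assumption: the function `balanced_indices`, the `cap` and `seed` parameters, and the toy genre list are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_indices(genres, drop_genres, cap, seed=0):
    """Return sorted row positions to keep after dropping and capping genres."""
    by_genre = defaultdict(list)
    for position, genre in enumerate(genres):
        # Skip rows that belong to a genre we want to remove entirely.
        if genre not in drop_genres:
            by_genre[genre].append(position)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = []
    for genre, positions in by_genre.items():
        # Randomly undersample genres that have more rows than the cap.
        if len(positions) > cap:
            positions = rng.sample(positions, cap)
        keep.extend(positions)
    return sorted(keep)


# Toy genre column, invented purely for illustration.
genres = ["Rock"] * 6 + ["Electronic"] * 4 + ["Folk"] * 2 + ["Jazz"]
keep = balanced_indices(genres, drop_genres={"Jazz"}, cap=3)
kept_genres = [genres[i] for i in keep]
print(Counter(kept_genres))
```

With a DataFrame, the returned positions could be used as `df.iloc[balanced_indices(df["genre"], drop_genres, cap)]`. Random undersampling discards data from the majority classes, so the cap is a trade-off between balance and training set size.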