**<font size="6">Lyric Topic Modelling</font>**
***
This Python program takes in Spotify user account info, grabs the lyrics of the user's top songs from Genius.com, and performs Latent Dirichlet Allocation (LDA) to turn a collection of lyrics into a set of topics. This provides a simple framework for collecting user-specific data and performing Natural Language Processing on it to find meaningful information that can be gleaned from lyrics. The program focuses on finding topics and patterns in lyrics that users might not be aware of, such as topics that are more common in a certain genre, or a topic that a user unknowingly favors. Because this program scrapes the full lyrics, sentiment analysis is also possible on this user-specific dataset, which would not be as powerful with a bag-of-words dataset of the lyrics.
   
# Packages and API
The required packages, which are listed in the requirements.txt file, are:

In [21]:
import json
import spotipy
import requests
import bs4
import nltk
import gensim
import pickle
import pyLDAvis

`pip install -r requirements.txt` can be used to install these in the proper environment with bash. <br>
The spotipy library is a Python wrapper for the Spotify API. This project uses it to get a user's most played tracks, and uses the Genius API to get lyrics. For the Spotify API, you need a Client ID, a Client Secret, and a redirect URI. spotipy reads these from the environment variables below, so you can export them in bash without putting them in the code (note that bash does not allow spaces around `=`):<br>
`export SPOTIPY_CLIENT_ID='CLIENT_ID'  # Your client id`<br>
`export SPOTIPY_CLIENT_SECRET='CLIENT_SECRET'  # Your secret`<br>
`export SPOTIPY_REDIRECT_URI='REDIRECT_URI'  # Your redirect uri`<br>

A token is also needed for the Genius API:<br>
`export genius_token='ACCESS_TOKEN'  # Your generated client access token`<br>
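
As a rough sketch of how exported values like these could be read back in Python: `require_env` below is a hypothetical helper (it is not part of this project), and the variable names passed to it are simply whatever names you exported. It fails early with a readable message when something is missing, rather than failing later inside an API call.

```python
import os

def require_env(names, env=None):
    """Return {name: value} for each name, or raise if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {n: env[n] for n in names}

# Example with an in-memory mapping standing in for the real environment
creds = require_env(["genius_token"], env={"genius_token": "ACCESS_TOKEN"})
headers = {"Authorization": "Bearer " + creds["genius_token"]}
```

In real use you would call `require_env([...])` without the `env` argument so that it reads from `os.environ`.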

***
# Usage
With these two APIs, the following functions were created:<br>

The first is the function users_top_tracks, which uses the Spotify API through the spotipy wrapper. This function takes in a Spotify User ID (not your display name), requests an authorization token to create a spotipy object, and then asks that object for the user's top tracks given a time range and a song limit (up to 50). It then pulls the wanted information out of the JSON response and returns a list of tuples, each consisting of the track name and the artist name. A second function, playlist_tracks, returns the same kind of list for a playlist given its Spotify URI.

In [22]:
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials

def users_top_tracks(username):
    #username = 22gkfie5pgkyn5ect6h3nixji for my account
    scope = 'user-top-read'
    token = util.prompt_for_user_token(username, scope)
    # 'short_term' is about 4 weeks, 'medium_term' about 6 months,
    # 'long_term' covers a longer history (on the order of years)
    time_range = 'medium_term'  # renamed from `range` to avoid shadowing the builtin
    tracks = []

    if token:
        sp = spotipy.Spotify(auth=token)
        sp.trace = False
        result = sp.current_user_top_tracks(time_range=time_range, limit=50)
        for item in result['items']:
            track_name = item['name']
            # use the first listed artist as the track's artist
            artist_name = item['artists'][0]['name']
            tracks.append((track_name, artist_name))

    return tracks

def playlist_tracks(uri):
    tracks = []
    client_credentials_manager = SpotifyClientCredentials()
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

    username = uri.split(':')[2]
    playlist_id = uri.split(':')[4]
    results = sp.user_playlist(username, playlist_id,fields='tracks,name')
    for item in results['tracks']['items']:
        track = item['track']
        # use the first listed artist as the track's artist
        tracks.append((track['name'], track['artists'][0]['name']))
    return tracks
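
To make the tuple extraction easier to follow without calling Spotify, here is a small offline sketch. The two dictionaries are made-up stand-ins whose shape is inferred only from how the functions above index into the responses, and `extract_track_pairs` is a hypothetical helper, not part of spotipy or of this project.

```python
def extract_track_pairs(items, nested_key=None):
    """Turn a list of track-like dicts into (track name, first artist name) tuples.

    nested_key is used when each item wraps the track, as playlist items do
    under the 'track' key.
    """
    pairs = []
    for item in items:
        track = item[nested_key] if nested_key else item
        pairs.append((track["name"], track["artists"][0]["name"]))
    return pairs

# Made-up data mimicking the two response shapes used above
top_tracks_response = {
    "items": [{"name": "Song A", "artists": [{"name": "Artist X"}, {"name": "Guest"}]}]
}
playlist_response = {
    "tracks": {"items": [{"track": {"name": "Song B", "artists": [{"name": "Artist Y"}]}}]}
}

top_pairs = extract_track_pairs(top_tracks_response["items"])
playlist_pairs = extract_track_pairs(playlist_response["tracks"]["items"], nested_key="track")
```

Only the first artist is kept, so featured artists are dropped. That matters later because the Genius lookup compares against the primary artist.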

The next two functions scrape lyrics from Genius, and were only slightly altered from Jack Schultz's blog post titled _Getting Song Lyrics from Genius’s API + Scraping_. The lyrics_from_genius function sends a search request for the given song title and goes through the JSON search results looking for a hit whose primary artist matches the given artist name. Once a match is found, the api_path tied to the song is passed into the lyrics_from_song_api_path function. There another URL is built from the song's page path, and the lyrics are scraped from the HTML and sent back to lyrics_from_genius, which replaces newlines and square brackets with spaces before returning one very long string containing the text. If no match is found, False is returned. <br>
link: __[Jack Schultz's Wordpress](https://bigishdata.com/2016/09/27/getting-song-lyrics-from-geniuss-api-scraping/)__

In [23]:
import time
from bs4 import BeautifulSoup
import os
# the Genius client access token is read from the genius_token environment variable
base_url = "http://api.genius.com"
headers = {'Authorization': 'Bearer ' + os.environ['genius_token']}
search_url = base_url + "/search"

def lyrics_from_song_api_path(song_api_path):
    song_url = base_url + song_api_path
    response = requests.get(song_url, headers=headers)
    json = response.json()
    path = json["response"]["song"]["path"]
    #gotta go regular html scraping... come on Genius
    page_url = "http://genius.com" + path
    page = requests.get(page_url)
    html = BeautifulSoup(page.text, "html.parser")
    #remove script tags that they put in the middle of the lyrics
    [h.extract() for h in html('script')]
    #at least Genius is nice and has a tag called 'lyrics'!
    lyrics = html.find("div", class_="lyrics").get_text() #updated css where the lyrics are based in HTML
    return lyrics

def lyrics_from_genius(song_title, artist_name):
    match = False
    search_url = base_url + "/search"
    data = {'q': song_title}
    response = requests.get(search_url, params=data, headers=headers)
    json = response.json()
    song_info = None
    for hit in json["response"]["hits"]:
        if hit["result"]["primary_artist"]["name"] == artist_name:
            song_info = hit
            match = True
            break
    if match:
        song_api_path = song_info["result"]["api_path"]
        text0 = lyrics_from_song_api_path(song_api_path)
        text1 = text0.replace('\n', ' ')
        text2 = text1.replace('[', ' ')
        text = text2.replace(']', ' ')
        return text
    else:
        return False
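
The matching and clean-up steps can be tried offline with a sketch like the one below. The `hits` list is made-up data shaped like the fields the code above reads (`result`, `primary_artist`, `api_path`), and `find_api_path` and `clean_lyrics` are hypothetical helpers. One deliberate difference from the code above: the artist comparison here ignores case and surrounding whitespace, which is an assumption about what might reduce missed matches, not something the original does.

```python
def find_api_path(hits, artist_name):
    """Return the api_path of the first hit by the given artist, or None."""
    wanted = artist_name.strip().lower()
    for hit in hits:
        result = hit["result"]
        if result["primary_artist"]["name"].strip().lower() == wanted:
            return result["api_path"]
    return None

def clean_lyrics(text):
    """Replace newlines and square brackets with spaces, as lyrics_from_genius does."""
    for ch in "\n[]":
        text = text.replace(ch, " ")
    return text

# Made-up search hits for illustration
hits = [
    {"result": {"primary_artist": {"name": "Someone Else"}, "api_path": "/songs/1"}},
    {"result": {"primary_artist": {"name": "Artist X"}, "api_path": "/songs/2"}},
]

path = find_api_path(hits, "artist x")
no_path = find_api_path(hits, "Unknown Artist")
cleaned = clean_lyrics("[Verse 1]\nHello")
```

Returning `None` instead of `False` when nothing matches is a small design choice; either works with the `if(track_lyrics):` check used later.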

The text is then sent to a tokenize_lyrics function that lowercases it, splits it into words, removes stopwords (including section labels added by Genius' lyric formatting and filler sounds commonly found in songs), and stems the remaining words with a Porter2 (Snowball) stemmer.

In [24]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

import warnings
warnings.simplefilter('ignore')

nltk.download('punkt')
nltk.download('stopwords')

def tokenize_lyrics(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text = text.lower()
    words = tokenizer.tokenize(text)

    tokens = []
    stop_words = stopwords.words('english')
    additional_stopwords = """a able about above abst accordance according accordingly across act actually added adj affected affecting affects after afterwards again against ah all almost alone along already also although always am among amongst an and announce another any anybody anyhow anymore anyone anything anyway anyways anywhere apparently approximately are aren arent arise around as aside ask asking at auth available away awfully b back be became because become becomes becoming been before beforehand begin beginning beginnings begins behind being believe below beside besides between beyond biol both brief briefly but by c ca came can cannot can't cause causes certain certainly co com come comes contain containing contains could couldnt d date did didn't different do does doesn't doing done don't down downwards due during e each ed edu effect eg eight eighty either else elsewhere end ending enough especially et et-al etc even ever every everybody everyone everything everywhere ex except f far few ff fifth first five fix followed following follows for former formerly forth found four from further furthermore g gave get gets getting give given gives giving go goes gone got gotten h had happens hardly has hasn't have haven't having he hed hence her here hereafter hereby herein heres hereupon hers herself hes hi hid him himself his hither home how howbeit however hundred i id ie if i'll im immediate immediately importance important in inc indeed index information instead into invention inward is isn't it itd it'll its itself i've j just k keep keeps kept kg km know known knows l largely last lately later latter latterly least less lest let lets like liked likely line little 'll look looking looks ltd m made mainly make makes many may maybe me mean means meantime meanwhile merely mg might million miss ml more moreover most mostly mr mrs much mug must my myself n na name namely nay nd near nearly necessarily necessary need needs neither never nevertheless new next
nine ninety no nobody non none nonetheless noone nor normally nos not noted nothing now nowhere o obtain obtained obviously of off often oh ok okay old omitted on once one ones only onto or ord other others otherwise ought our ours ourselves out outside over overall owing own p page pages part particular particularly past per perhaps placed please plus poorly possible possibly potentially pp predominantly present previously primarily probably promptly proud provides put q que quickly quite qv r ran rather rd re readily really recent recently ref refs regarding regardless regards related relatively research respectively resulted resulting results right run s said same saw say saying says sec section see seeing seem seemed seeming seems seen self selves sent seven several shall she shed she'll shes should shouldn't show showed shown showns shows significant significantly similar similarly since six slightly so some somebody somehow someone somethan something sometime sometimes somewhat somewhere soon sorry specifically specified specify specifying still stop strongly sub substantially successfully such sufficiently suggest sup sure t take taken taking tell tends th than thank thanks thanx that that'll thats that've the their theirs them themselves then thence there thereafter thereby thered therefore therein there'll thereof therere theres thereto thereupon there've these they theyd they'll theyre they've think this those thou though thoughh thousand throug through throughout thru thus til tip to together too took toward towards tried tries truly try trying ts twice two u un under unfortunately unless unlike unlikely until unto up upon ups us use used useful usefully usefulness uses using usually v value various 've very via viz vol vols vs w want wants was wasnt way we wed welcome we'll went were werent we've what whatever what'll whats when whence whenever where whereafter whereas whereby wherein wheres whereupon wherever whether which while whim whither who whod
whoever whole who'll whom whomever whos whose why widely willing wish with within without wont words world would wouldnt www x y yes yet you youd you'll your youre yours yourself yourselves you've z zero 1 2 3 4 5 6 7 8 9 0"""
    # section labels that Genius adds to lyric pages
    stop_words += ['chorus', 'verse', 'bridge', 'introduction', 'intro',
                   'interlude', 'hook', 'instrument']
    # filler sounds commonly found in songs
    stop_words += ['la', 'bah', 'uh', 'unh', 'yeah', 'ya', 'ay', 'hey',
                   'nae', 'ooh', 'woah']
    stop_words += additional_stopwords.split()
    # a set makes the membership test below fast
    stop_words = set(stop_words)
    tokens = [w for w in words if w not in stop_words]

    stems = []
    stemmer = SnowballStemmer("english")
    for token in tokens:
        token = stemmer.stem(token)
        if token != "":
            stems.append(token)
    return stems

[nltk_data] Downloading package punkt to /home/dhchen2/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dhchen2/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
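
For readers who want to see the lowercasing, splitting, and stopword filtering without installing NLTK, here is a standard-library sketch. It roughly mirrors the `\w+` tokenization above, but it uses a tiny made-up stopword set and leaves out stemming entirely, so it is an illustration of the idea and not a replacement for tokenize_lyrics.

```python
import re

# Tiny illustrative stopword set; the real function uses NLTK's English list
# plus the extra words shown above
STOP_WORDS = {"the", "a", "and", "chorus", "verse", "yeah", "ooh"}

def simple_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, split on runs of word characters, and drop stopwords."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in stop_words]

tokens = simple_tokens("[Chorus] Yeah, the night and the city")
```

Because the pattern only matches word characters, contractions such as "can't" are split into "can" and "t" before the stopword check runs.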


Lastly, the topic_modelling function takes in a collection of tokenized docs, creates a dictionary from the words, and converts the corpus into a bag-of-words representation. The number of topics is then set (and can be changed), and an LDA model from gensim is trained with the corpus, the number of topics, the dictionary, and the number of passes over the corpus, before the topics are returned. <br>
Much of this code was adapted, with few modifications, from Susan Li's __[Medium post](https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21)__ and its accompanying __[GitHub notebook](https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/topic_modeling_Gensim.ipynb)__

In [25]:
from gensim import corpora, models

def topic_modelling_demo(docs):
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(text) for text in docs]
    
    # save the corpus so the visualiser can reload it later
    with open('corpus.pkl', 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save('dictionary.gensim')
    
    NUM_TOPICS = 3
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=20)
    ldamodel.save('model5.gensim')
    topics = ldamodel.print_topics(num_words=3)
    
    return topics
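
The bag-of-words step can be illustrated in plain Python. The sketch below is a simplified stand-in for what a dictionary plus `doc2bow` produces, namely a list of `(token id, count)` pairs per document. The ids here are assigned in first-seen order, which is an assumption made for readability and not necessarily how gensim numbers its tokens. `build_vocab` and `to_bow` are hypothetical names.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token id, count) pairs for one tokenized document."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["love", "night", "love"], ["night", "feel"]]
vocab = build_vocab(docs)
corpus = [to_bow(doc, vocab) for doc in docs]
```

Word order is discarded at this point, which is why the introduction notes that sentiment analysis benefits from keeping the scraped text and not only the bag-of-words corpus.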

For the demo, we call the function that gets a playlist's tracks and use that instead of the user's most listened to tracks. The other functions are then called in order as usual. The number of tracks from the Spotify API is printed first, followed by the number of tracks for which lyrics were found on Genius. A collection is then made from the tokenized lyrics, and LDA is performed on the collection to produce the topics, which are printed. Finally, a visualiser is called to look at term frequency, a term's relevance to its topic, and the size of and relation between topics.

In [29]:
import warnings
warnings.simplefilter('ignore')

track_docs = []
playlist_uri = 'spotify:user:22gkfie5pgkyn5ect6h3nixji:playlist:4e7WfaU15ZSSPsZsVhLVSy'
demo_info = playlist_tracks(playlist_uri)
print("Number of tracks grabbed is " + str(len(demo_info)))
for i in demo_info:
    track_lyrics = lyrics_from_genius(i[0], i[1])
    if(track_lyrics):
        track_docs.append(track_lyrics)
print("Number of track lyrics found is " + str(len(track_docs)))
lyric_collection = []
for j in track_docs:
    lyric_collection.append(tokenize_lyrics(j))
topics = topic_modelling_demo(lyric_collection)
for topic in topics:
    print(topic)
    
dictionary_vis = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus_vis = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus_vis, dictionary_vis, sort_topics=False)
pyLDAvis.enable_notebook()
pyLDAvis.display(lda_display)

Number of tracks grabbed is 41
Number of track lyrics found is 36
(0, '0.015*"god" + 0.011*"jesus" + 0.010*"sing"')
(1, '0.017*"nigga" + 0.011*"bitch" + 0.010*"prais"')
(2, '0.048*"drug" + 0.028*"love" + 0.021*"feel"')


Here we run the last two code cells again, except that we change both the number of topics (i.e. *NUM_TOPICS = 4*) and the number of words per topic (i.e. *num_words = 4*). This time the functions are called as in the actual program, using the user's top tracks instead of a playlist.

In [27]:
from gensim import corpora, models

def topic_modelling(docs):
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(text) for text in docs]
    
    # save the corpus so the visualiser can reload it later
    with open('corpus.pkl', 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save('dictionary.gensim')
    
    NUM_TOPICS = 4
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=20)
    ldamodel.save('model5.gensim')
    topics = ldamodel.print_topics(num_words=4)
    
    return topics

In [28]:
track_docs = []
username = "22gkfie5pgkyn5ect6h3nixji"  # my Spotify user ID
track_info = users_top_tracks(username)
print("Number of tracks grabbed is " + str(len(track_info)))
for i in track_info:
    track_lyrics = lyrics_from_genius(i[0], i[1])
    if(track_lyrics):
        track_docs.append(track_lyrics)
print("Number of track lyrics found is " + str(len(track_docs)))
lyric_collection = []
for j in track_docs:
    lyric_collection.append(tokenize_lyrics(j))
topics = topic_modelling(lyric_collection)
for topic in topics:
    print(topic)
    
dictionary_vis = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus_vis = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus_vis, dictionary_vis, sort_topics=False)
pyLDAvis.enable_notebook()
pyLDAvis.display(lda_display)

Number of tracks grabbed is 50
Number of track lyrics found is 35
(0, '0.034*"love" + 0.016*"feel" + 0.010*"call" + 0.010*"nigga"')
(1, '0.027*"bitch" + 0.023*"nigga" + 0.014*"life" + 0.013*"time"')
(2, '0.016*"night" + 0.014*"feel" + 0.010*"fuck" + 0.009*"goin"')
(3, '0.018*"shit" + 0.012*"nigga" + 0.011*"talk" + 0.011*"gonna"')
