## Dataset - Spotify mining


#### To achieve all the aforementioned goals, we first defined the data that we would use in our analysis.

We decided to extract all the data we need from ***Spotify***.

Spotify is probably one of the best-known music streaming platforms in the world. It hosts more than 40 million tracks that users can listen to through its multi-platform web and mobile applications.

But in this case, we are mostly interested in the Spotify Web API endpoints. Using them, we have been able to obtain our desired information about music artists, albums, and tracks directly from the Spotify Data Catalogue. Furthermore, we have also been interested in obtaining calculated audio features of tracks.

The functions used to download our desired data are the following ones:

* Search by genre
* Get an Artist's Top Tracks
* Get an Artist's Albums
* Get an Album's Tracks
* Get a Track
* Get Song's audio features

To actually download all the data we needed, we used the existing Spotify library for Python (_Spotipy_). To use it, we first had to obtain the credentials that Spotify requires. Anyone can request these credentials through the Spotify website.
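
One way to keep such credentials out of the notebook itself is to read them from environment variables. The sketch below assumes the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`; the helper `load_spotify_credentials` is hypothetical and is not part of Spotipy.

```python
import os

def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    # Assumption: the credentials were exported under these two names
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Spotify credentials are not set in the environment")
    return client_id, client_secret
```

The returned pair could then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)`.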


### We found where we get the data from! Now what?

The amount of music data available, though, is huge! At first it was a bit overwhelming, but to keep things realistic while preserving the reliability of our work, we agreed on the following flow of actions:

- Retrieve the ***Top 100 Artists*** of ***7 Top Genres*** of Spotify. This totals _700 Artists_.
- Fetch the current ***Top 10 songs of each Artist***.
- Find the ***collaborations*** of each Artist, by mining each Artist's albums.
- Get ***audio features*** of each one of these songs.
- **Mine** the lyrics of each one of these songs.

The genres we decided to retrieve songs from are the following:

* Pop
* Rock
* Hip-hop
* Blues
* Indie
* Country
* Soul

# Basic stats. Let's understand the dataset better

### The installation and code developed to use the Spotify API is the following:

In [None]:
#Import needed libraries
import sys
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import pprint
import csv
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd

In [None]:
#Spotify authentication credentials
client_id = "b06999e849764d0cb9d85ff2b4762fd9"
client_secret = "4565b9044d694deb9cf54eddb9b08e69"

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## Artist's Dataset (artists.csv)

First, the list of the top 100 artists for each genre was created. For this, the *Search by genre* endpoint was used, specifying artist as the type of the response objects. Spotify allows at most 50 objects per response, so we had to make the call twice, using the *limit* and *offset* parameters, to obtain all the desired information.
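
The two-call pattern generalizes to any number of pages. Below is a minimal sketch of *limit*/*offset* pagination; `fetch_top_items` and `fake_fetch_page` are hypothetical names, and the fake catalogue stands in for the real search call so that the example runs without network access.

```python
def fetch_top_items(fetch_page, wanted=100, page_size=50):
    """Collect up to `wanted` items by calling fetch_page(limit, offset) repeatedly."""
    items = []
    offset = 0
    while len(items) < wanted:
        limit = min(page_size, wanted - len(items))
        page = fetch_page(limit, offset)
        if not page:
            break  # the catalogue ran out before we reached `wanted`
        items.extend(page)
        offset += len(page)  # the next page starts where this one ended
    return items

# Stand-in for the search call; a real fetch_page would wrap sp.search(...)
catalogue = ["artist_%d" % i for i in range(120)]

def fake_fetch_page(limit, offset):
    return catalogue[offset:offset + limit]

top_100 = fetch_top_items(fake_fetch_page)
```

With the defaults, the helper issues two calls (offsets 0 and 50), matching the approach described above.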

The downloaded list of artists was saved in the *artists.csv* file and, for each artist, the following information was kept:

* Artist
* Genre
* Artist uri
* Artist id

In [None]:
#Download the top 100 artists for each genre defined
def downloadTopArtistsGenre(genres):
  #Open 
  file = open("artists.csv", "w+")
  fileCsv = csv.writer(file)
  fileCsv.writerow(['Artist','Genre','Artist_uri','Artist_id'])
  for genre in genres:
    result = sp.search(q='genre:'+genre, type='artist',limit=50,offset=0)
    for artist in result['artists']['items']:
      #artist JSON dict like: https://developer.spotify.com/documentation/web-api/reference/artists/get-artist/
      row = [artist['name'].encode('ascii', 'ignore'),genre,artist['uri'].encode('ascii', 'ignore'),artist['id'].encode('ascii', 'ignore')]
      fileCsv.writerow(row)
    result = sp.search(q='genre:'+genre, type='artist',limit=50,offset=50)
    for artist in result['artists']['items']:
      row = [artist['name'].encode('ascii', 'ignore'),genre,artist['uri'].encode('ascii', 'ignore'),artist['id'].encode('ascii', 'ignore')]
      fileCsv.writerow(row)
  file.close()

In [None]:
#Genres defined
genres = ['pop','rock','hip-hop','blues','indie','country','soul']
downloadTopArtistsGenre(genres)

Afterwards, we wanted to retrieve the collaborations of each artist. To do so, we had to obtain the list of songs for each artist and then check whether each song has multiple artists.

As the Spotify platform does not allow obtaining the songs of an artist directly, we first obtained all the albums of each artist, again using the *limit* and *offset* parameters. Then, for each album retrieved, all songs were downloaded and analyzed. A similar procedure was executed for the singles of each artist, to obtain as much information about collaborations as possible, because we realized that some remixes were not included in any album. This time, we retrieved at most 50 singles for each artist.

Furthermore, we realized that we had to exclude the *compilation* albums and lists returned by the same service because, otherwise, information about playlists created by Spotify was also retrieved and processed when creating the graph.

In this step, we decided not to filter the collaborations down to our list of artists, as it would be nice to also extract some information from the other collaborations at some point in our further analysis. That is why, each time more than one artist was found on a song, a new entry was added to the *artists_colab.csv* or *artists_colab_singles.csv* file, in the following format and without any other filtering.

* Artist
* Artist id
* Artist colab.
* Artist colab. id
* Song
* Song id

In [None]:
#Download all collaborations of each artist based on their albums
def downloadColaborationsPerArtistsAlbums(artistsList):
  file = open("artists_colab.csv", "w+")
  fileCsv = csv.writer(file)
  fileCsv.writerow(['Artist', 'Artist_id', 'ArtistColab','ArtistColab_id', 'Song', 'Song_id'])

  artistsCol = list()
  for index, artist in artistsList.iterrows():
    total = 1
    offset = 0
    while(offset < total):
      result = sp.artist_albums(artist['Artist_uri'], limit=50, album_type='album', offset=offset)
      for album in result['items']:
        if (album['album_type'] != 'compilation'):
          result1 = sp.album_tracks(album['uri'], limit=50)
          for song in result1['items']:   
            for art in song['artists']:
              # keep every co-credited artist (no filtering against the top-artist list); skip only the artist themselves
              if(artist['Artist'] != art['name']):
                artistsCol.append(art['name'])
                row = [artist['Artist'].encode('ascii', 'ignore'), artist['Artist_id'].encode('ascii', 'ignore'), art['name'].encode('ascii', 'ignore'),art['id'].encode('ascii', 'ignore'),song['name'].encode('ascii', 'ignore'), song['id'].encode('ascii', 'ignore')]
                fileCsv.writerow(row)
      if (offset == 0):
        total = result['total']
      # advance by the page size (limit=50) so consecutive pages do not overlap
      offset += 50
  print("Done")
  file.close()

#Download all collaborations of each artist based on their singles
def downloadColaborationsPerArtistsSingles(artistsList):
  file = open("artists_colab_singles.csv", "w+")
  fileCsv = csv.writer(file)
  fileCsv.writerow(['Artist', 'Artist_id', 'ArtistColab','ArtistColab_id', 'Song', 'Song_id'])
  artistsCol = list()
  for index, artist in artistsList.iterrows():
    result = sp.artist_albums(artist['Artist_uri'], limit=50, album_type='single', offset=0)
    for album in result['items']:
      if (album['album_type'] != 'compilation'):
        result1 = sp.album_tracks(album['uri'], limit=50)
        for song in result1['items']:   
          for art in song['artists']:
            # keep every co-credited artist (no filtering against the top-artist list); skip only the artist themselves
            if(artist['Artist'] != art['name']):
              artistsCol.append(art['name'])
              row = [artist['Artist'].encode('ascii', 'ignore'), artist['Artist_id'].encode('ascii', 'ignore'), art['name'].encode('ascii', 'ignore'),art['id'].encode('ascii', 'ignore'),song['name'].encode('ascii', 'ignore'), song['id'].encode('ascii', 'ignore')]
              fileCsv.writerow(row)
  print("Done")
  file.close()

In [None]:
#Load artists from the previously saved file
artistsList = pd.read_csv('./SGI_Sportify_Project/artists.csv', encoding='utf-8')
#Create artists_colab.csv file
downloadColaborationsPerArtistsAlbums(artistsList)
#Create artists_colab_singles.csv file
downloadColaborationsPerArtistsSingles(artistsList)

## Songs Dataset (topsters_all.csv)

The dataset used for the Network of Song Similarity is topsters_all.csv. It contains information about the top 10 songs of every artist in artists.csv, across all genres. To create it, we relied on artists.csv.

To obtain all the relevant data for this particular dataset, we also used the Spotify Library of Python (Spotipy).
After mining the Spotify data, we got the elements we needed, mainly by using the following functions:

>sp.artist_top_tracks<br>
sp.audio_features

The topsters_all.csv contains the following data in order.
>- **Genre**: A list, which contains the genre(s) that the song belongs to. To do this we "borrowed" the genre(s) of the relevant artist from the artists.csv.
- **Artist**: The artist's name.
- **Artist_id**: Spotify API assigns a unique ID to each Artist.
- **Song**: The name of the song.
- **Song_id**: As with Artists, the Spotify API assigns a unique ID to each song. This way you can distinguish two songs that might share the same name.
- **Audio_Metric**: A vector of 3 elements. Each of these elements is a particular audio feature, namely _Valence_, _Energy_ and _Danceability_. We got those features by mining the Spotify API.
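
Because `Audio_Metric` is written to the csv as a Python list, it comes back as a string when the file is read again. A minimal sketch of turning such a cell back into numbers follows; it assumes the cell looks like `"[0.52, 0.81, 0.67]"` in the order valence, energy, danceability, and `parse_audio_metric` is a hypothetical helper.

```python
import ast

def parse_audio_metric(cell):
    """Convert a stored "[valence, energy, danceability]" string into a dict."""
    valence, energy, danceability = ast.literal_eval(cell)
    return {"valence": valence, "energy": energy, "danceability": danceability}

metric = parse_audio_metric("[0.52, 0.81, 0.67]")
```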

To gather the data for topsters_all.csv, **we relied on artists.csv**, from which we retrieved every artist. Having every artist leaves us with one thing to do: get the **top 10 songs** of each one.

#### The first issue that we encountered was that artists.csv contains a number of duplicate artists.<br>
> ***Why this happens is explained below:<br>***
To obtain artists.csv we downloaded each genre's 100 most popular artists. However, there are artists associated with more than one genre, thus they appear more than once in artists.csv.<br><br>
>**In topsters_all.csv we want every song to occur only once, but we also need to keep all the different genres that this song is associated with.**<br><br>
>To do this we make a list of all the genres that the artist of the respective song is associated with, and assign this list as the song's genre. 
(Unfortunately, the Spotify API does not directly return the genre(s) of a particular song, which is why we assign the corresponding artist's genre(s).)
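
The genre aggregation can be sketched with plain Python as follows. `genres_by_artist` is a hypothetical helper and the sample rows are made up; each row stands for one `(Artist_id, Genre)` pair from artists.csv.

```python
from collections import OrderedDict

def genres_by_artist(rows):
    """Map each artist id to the list of genres it appears under."""
    genres = OrderedDict()
    for artist_id, genre in rows:
        genres.setdefault(artist_id, [])
        if genre not in genres[artist_id]:
            genres[artist_id].append(genre)
    return genres

# Made-up rows: artist "a1" was returned for both pop and soul
sample_rows = [("a1", "pop"), ("a2", "rock"), ("a1", "soul")]
artist_genres = genres_by_artist(sample_rows)
```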

#### The second issue was that we still encountered duplicate songs.<br>
> ***Why this happens is explained below:***<br>
Since two artists might have collaborated on a very popular song, this song will appear in both artists' top 10 lists.<br><br>
> This was solved by keeping a list and, in every iteration, testing whether the new song is already in it. If not, we append it and write the song to the csv. If it is already in the list, we skip it. We keep doing that until we have processed all songs.<br><br>
> There is a caveat: since the genre we assign to each song is the genre of its artist, if a popular song "belongs" to two artists and we skip one of them, it is possible that we miss a genre. However, we assume that two collaborating artists most likely belong to the same genre(s), so there is no important "loss" of data.
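
The de-duplication step can be sketched as below. This version swaps the list for a `set`, which gives the same result with faster membership tests; `unique_songs` and the sample tracks are hypothetical.

```python
def unique_songs(tracks):
    """Keep the first occurrence of every song id, preserving order."""
    seen = set()
    kept = []
    for track in tracks:
        if track["id"] not in seen:
            seen.add(track["id"])
            kept.append(track)
    return kept

# Made-up example: the duet "s1" shows up in two artists' top tracks
tracks = [{"id": "s1", "artist": "A"},
          {"id": "s2", "artist": "A"},
          {"id": "s1", "artist": "B"}]
deduplicated = unique_songs(tracks)
```

Only the first artist's copy of a shared song survives, which is the behaviour described in the caveat above.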

#### ***Below is the code for obtaining the topsters_all.csv***

In [None]:
# Create the topsters_all.csv
path = "topsters_all.csv"
songs = [] # create a list, where every song will be appended at a time

with open(path, "wb") as f:
    top_tracks_csv = csv.writer(f)
    top_tracks_csv.writerow(['Genre', 'Artist', 'Artist_id', 'Song', 'Song_id', 'Audio_Metric'])
    
    for index, row in artists_df.iterrows():
        artist_id = row['Artist_id']
        artist_name = row['Artist']
        genre = list(artists_df.loc[artists_df['Artist_id'] == artist_id]["Genre"]) # get a list of every artist's genre(s)
        top_tracks_list = sp.artist_top_tracks(artist_id)['tracks'] # get top 10 tracks of every artist by the Spotify API
        for song in top_tracks_list:
            try:
                song_id = song['id']
                if song_id not in songs:# check if the song is already written in the csv.
                    songs.append(song_id)
                    audio_features = sp.audio_features([song_id])[0]
                    #in the three following lines I round the song features into two decimals.
                    valence = round(audio_features['valence'],2)
                    energy = round(audio_features['energy'],2)
                    danceability = round(audio_features['danceability'],2)
                    audio_metric = [valence, energy, danceability] # create the audio_metric vector
                    row = [genre, artist_name.encode('ascii', 'ignore'), str(artist_id).encode('ascii', 'ignore'), song['name'].encode('ascii', 'ignore'), str(song['id']).encode('ascii', 'ignore'), audio_metric]
                    top_tracks_csv.writerow(row)
            except:
                pass

In [None]:
# read the topsters_all.csv, to create the songs_df and use it later on, in the Network creation and analysis
songs_df = pd.read_csv(r"topsters_all.csv")

## Topsters Datasets (/topsters/)

### Set Up
Below we set up the libraries for this notebook. All needed libraries are imported here. 

If you lack a library, or experience any kind of issues please refer to the bottom of the page to find the exact versions of the libraries used (retrieved with `pip freeze`). You can always install a library on-the-fly on your notebook with `!pip install library`.

In [2]:
import sys
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import pprint
import csv
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import time
import math
import os
from os import path
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
from nltk.collocations import *
import seaborn as sns
import joypy
import matplotlib.cm as cm
import random
import socket #for lyric fetching
from nltk.stem.porter import *
from textblob import TextBlob
from nltk.corpus import stopwords 

In [3]:
#libraries to fetch lyrics
from azlyrics import lyrics
import pandas as pd
import re
import lyricwikia
import urllib2
import json
import bs4
import requests
import unicodedata
import nltk

We will be creating two directories on our system:
- `lyrics` is going to hold the lyrics in .csv files, presented by song and divided in files by Artist
- `topsters` is going to hold the Top 10 songs in .csv files, divided by Artist

In [4]:
mkdir lyrics

A subdirectory or file lyrics already exists.


In [5]:
mkdir topsters

A subdirectory or file topsters already exists.


Below we load the necessary Top 700 Artist data from the corresponding `.csv`.

It is necessary to save that data, as the Top 700 Artists can change depending on what Spotify returns. That inherently creates the need to fetch the lyrics of any Artist that recently made it into the Top 700 and, as we will see afterwards, this can be time-consuming.

Thus, to have a common basis of discussion for the full length of our report, we are using the same `.csv` files throughout.

In [6]:
#SETTING THE DATAFRAMES WE NEED HERE
artists_df = pd.read_csv('artists.csv')

Below, you will find our Spotify credentials. They are needed to fetch data from Spotify's REST API endpoints.

Since they are ours, please respect [Spotify's Developers ToS](https://developer.spotify.com/terms/) when you are using our code.

Furthermore, we are creating a Spotify client object to enact any calls needed.

In [7]:
client_id = "b06999e849764d0cb9d85ff2b4762fd9"
client_secret = "4565b9044d694deb9cf54eddb9b08e69"

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Below we are fetching the Top 10 songs (_topsters_) for each one of the Top 700 Artists and putting them inside corresponding `.csv` files. Each `.csv` will have rows that hold the following information: _'Artist', 'Artist id', 'Song', 'Song id'_.

The code segment includes a performance provision, skipping the calls if a _topster_ file already exists for an _Artist_.
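
The skip-if-cached idea can be isolated into a small check. The sketch below uses a temporary directory instead of `./topsters/` so that it can run anywhere; `topster_path` and `needs_download` are hypothetical helper names.

```python
import os
import tempfile

def topster_path(directory, artist_id):
    """Build the path of the per-artist top-songs file."""
    return os.path.join(directory, "songs_" + str(artist_id) + ".csv")

def needs_download(directory, artist_id):
    """True when no topster file exists yet for this artist."""
    return not os.path.exists(topster_path(directory, artist_id))

demo_dir = tempfile.mkdtemp()
before = needs_download(demo_dir, "abc123")          # nothing cached yet
open(topster_path(demo_dir, "abc123"), "w").close()  # simulate a finished download
after = needs_download(demo_dir, "abc123")           # file now exists, so skip
```

One consequence of this design is that a file left behind by a failed download also counts as cached, so it is safer to create the file only after the API call has succeeded.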

In [8]:
counter = 0
counter_dup = 0
import os.path
for index, row in artists_df.iterrows():
    counter += 1
    # have overview of how soon this code segment is done
    if (counter % 10 == 0):
        #print(counter)
        pass
    artist_id = row['Artist_id']
    artist_name = row['Artist']
    topster_path = "./topsters/songs_"+str(artist_id)+".csv"
    # do not query if we already have the information
    if (os.path.exists(topster_path)):
        counter_dup += 1
        continue
    try:
        top_track_list = sp.artist_top_tracks(str(artist_id))['tracks']
    except Exception:
        # skip this artist; no file is created, so a later run can retry
        continue
    # create the file only after the API call succeeded, and close it when done
    with open(topster_path, "w+") as f:
        fileCsv2 = csv.writer(f)
        fileCsv2.writerow(['Artist', 'Artist_id', 'Song', 'Song_id'])
        for track in top_track_list:
            row = [artist_name.encode('ascii', 'ignore'),str(artist_id).encode('ascii', 'ignore'),track['name'].encode('ascii', 'ignore'), track['id'].encode('ascii', 'ignore')]
            fileCsv2.writerow(row)

## Lyrics Mining (/lyrics/)
Lyric mining is a challenging task and is one of the caveats of our venture. After the challenges we faced preparing the data from Spotify, lyric mining became the next big hurdle for us. To successfully obtain the lyrics for our songs we had to resort to a mix of services. A simpler solution would be to use the [Musixmatch API](https://developer.musixmatch.com/), which offers direct access to the data we need.  

We have decided to refrain from using that API because:
- The free version of their service is limited to 2000 requests per day and to only 30% of each song's lyrics. Even though the first part of the problem can be dealt with by mining on consecutive days, the second part would be a roadblock for a meaningful lyrical analysis.
- [The costs are very high](https://developer.musixmatch.com/plans) for the amount of data we need.
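
To put the daily request limit into perspective, here is a back-of-the-envelope calculation. It assumes one request per song and ignores duplicate songs, so the real figure could differ.

```python
import math

artists = 700             # Top 100 artists of 7 genres
songs_per_artist = 10     # Top 10 songs of each artist
requests_per_day = 2000   # free-tier limit mentioned above

total_songs = artists * songs_per_artist
days_needed = int(math.ceil(total_songs / float(requests_per_day)))
# total_songs is 7000, days_needed is 4
```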

Regardless, it is interesting to note that Musixmatch is [offering solutions](https://developer.musixmatch.com/ai) that closely resemble what one could achieve by extending the solutions described in our research.

Consequently, we have utilized three methods of obtaining lyrics:
1. __AZLyrics__: We directly mine their website for the song in question (using the available `azlyrics` library). AZLyrics offers one of the most comprehensive open libraries of lyrics. Unfortunately, they impose a Rate Limit on the requests one can make to their website. The Rate Limit itself is unknown and, once it is reached, the client's IP is banned for an equally unknown amount of time.
2. __LyricWikia__: Our method (through the corresponding library) also uses HTTP requests. Although no Rate Limit seems to be restrictive with this solution, LyricWikia's library is not completely comprehensive.
3. __Genius__: Genius's library of lyrics is focused on (and started from) Hip-Hop and Pop songs, although in recent years there has been a surge of new lyric pages for more genres. This solution was designated to be a last resort. Genius does not offer an endpoint to retrieve lyrics (the closest would be `/songs/:id`, but it is essentially the same as an HTTP request to the website), only one to search with their search engine and fetch the results. Thus, when we find a possible match for a song, we fetch the song's page and parse it with Beautiful Soup. This means two requests per song (one search query and one page fetch), and that increased time requirement is why Genius is tried last. Regardless, Genius is used extensively in the final solution, as LyricWikia can drop the ball when it comes to finding lesser-known songs.
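
The "try one service, fall back to the next" logic can be expressed generically. The sketch below is a hypothetical helper, not the notebook's implementation; the three stand-in functions only simulate a banned, an empty and a working service.

```python
def first_lyrics(artist, song, sources):
    """Try each (name, fetch) pair in order and return the first non-empty result."""
    for name, fetch in sources:
        try:
            text = fetch(artist, song)
        except Exception:
            continue  # this service failed; move on to the next one
        if text:
            return name, text
    return None, None

# Stand-ins for the three services (hypothetical, no network access)
def banned_source(artist, song):
    raise IOError("rate limit reached")

def empty_source(artist, song):
    return None

def working_source(artist, song):
    return "la la la"

source, text = first_lyrics("Some Artist", "Some Song",
                            [("AZLyrics", banned_source),
                             ("LyricWikia", empty_source),
                             ("Genius", working_source)])
```

Keeping the order in a list makes it easy to reorder or drop a service without touching the control flow.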

Below, you will find our Genius API credentials. They are needed to fetch data from Genius's REST API endpoints.

Since they are ours and you are granted access for the purposes of this report, please respect [Genius's ToS](https://genius.com/static/terms) when you are using our code.

In [9]:
genius_api_dict = {'id':'eOFdp-v__I7JeKporMF3K49wKnz3cYIkJcOGUP1wRz2uV7AmWh1eeBup_zJMCkqA',
              'secret': 'kbbVWQxhXtsfnHm1W65gkjnvxlt9A3U2ih5bf6KWBJe6_KeYWZWyxC4rDK62OX9OQ9cLWTH0FgKZWoRmEofH9g',
              'access_token': 'jhpqKVzO880gvQ5i-JkEh08wTUhmmA4kMuCkt5MQKLKMrWnjjqs3Z0WiVd49TmR7'}

In the section below, we define a series of __utility functions__ that will allow us to fetch the lyrics for every song.

First and foremost, we define the function to fetch lyrics from Genius. It follows the thinking we mentioned before:
1. A search is conducted in Genius's engine using the artist name and title of the song
2. The first result is taken and its page HTML is fetched
3. The HTML is parsed with Beautiful Soup, taking lyrics and concluding if the corresponding song is the correct one
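
Step 2 relies on the shape of the search response. A minimal sketch of extracting the first hit's URL is shown below; the sample dictionary only mimics the `response` / `hits` / `result` / `url` keys that the function reads, and both `first_hit_url` and the URL are hypothetical.

```python
def first_hit_url(search_response):
    """Return the URL of the first search hit, or None when there are no hits."""
    hits = search_response["response"]["hits"]
    if not hits:
        return None
    return hits[0]["result"]["url"]

# Made-up response with a single hit
sample_response = {
    "response": {
        "hits": [{"result": {"url": "https://genius.com/example-lyrics"}}]
    }
}
url = first_hit_url(sample_response)
```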

In [10]:
def genius_Lyrics(artist, song):
    querystring = "http://api.genius.com/search?q=" + urllib2.quote(song + " " + artist)
    request = urllib2.Request(querystring)
    request.add_header("Authorization", "Bearer " + genius_api_dict['access_token'])   
    request.add_header("User-Agent", "curl/7.9.8 (i686-pc-linux-gnu) libcurl 7.9.8 (OpenSSL 0.9.6b) (ipv6 enabled)") #Must include user agent of some sort, otherwise 403 returned
    while True:
        try:
            response = urllib2.urlopen(request, timeout=4) #timeout set to 4 seconds; automatically retries if times out
            raw = response.read()
        except socket.timeout:
            #print("Timeout raised and caught")
            continue
        break
    json_obj = json.loads(raw)
    body = json_obj["response"]["hits"]

    num_hits = len(body)
    if num_hits==0:
        #print("\t\tNo results for: " + song)
        return

    for result in body:
        url = result["result"]["url"]
        
        page = requests.get(url)
        if page.status_code == 404:
            return None

        # Scrape the song lyrics from the HTML
#         start_time = time.time()
        html = bs4.BeautifulSoup(page.text, "html.parser")
        title_genius = html.find("h1", class_="header_with_cover_art-primary_info-title").get_text().strip()
        artist_genius = html.find("a", class_="header_with_cover_art-primary_info-primary_artist").get_text().strip()
        if (title_genius == song.strip() ):#and artist_genius == artist):
            lyrics = html.find("div", class_="lyrics").get_text()
            lyrics = re.sub('(\[.*?\])*', '', lyrics)
            lyrics = re.sub('\n{2}', '\n', lyrics)  # Gaps between verses

            lyrics = unicodedata.normalize('NFKD', lyrics).encode('ascii','ignore')
#             last_time = time.time()
#             print('MINE time: {}'.format(last_time - start_time))
            return lyrics.strip('\n')
    return None    

In [11]:
def removePunctuation(tokens):
    """
    This function removes punctuation (apart from '#' symbols) from each string token inside the provided token list and drops tokens that end up empty
    
    arguments: tokens: list of strings
    returns: list of strings (tokens)
    """
    new_token_list = []
    for token in tokens:
        #returns a new string without punctuation
        new_token = re.sub(r'[^\w\s#]','',token)
        if (len(new_token) > 0):
            new_token_list.append(new_token)
    return new_token_list

These two functions below can be used interchangeably, depending on what the developer wants to achieve.
- `getLyricsAllSongs`, fetches the lyrics for all songs of every one of the Top 700 Artists
- `getLyricsTopsters`, fetches the lyrics for the Top 10 songs of every one of the Top 700 Artists

Our focus in this research is on the results that can be obtained from the Top 10 songs and, thus, we use `getLyricsTopsters`.

In [12]:
def getLyricsAllSongs(artists_df):
    for index, row in artists_df.iterrows():
        artist_id = row['Artist_id']
        artist_name = row['Artist']
        artists_song_df = pd.read_csv('./songs/songs_'+artist_id+'.csv')
        getLyrics(artist_id, artist_name, artists_song_df)

        
def getLyricsTopsters(artists_df):
    for index, row in artists_df.iterrows():
        artist_id = row['Artist_id']
        artist_name = row['Artist']
        artists_song_df = pd.read_csv('./topsters/songs_'+artist_id+'.csv')
        getLyrics(artist_id, artist_name, artists_song_df)


`getLyrics` is the main utility function which creates an artist's file and gets the lyrics for all designated songs of an artist (depending on what is passed on `artists_song_df`).

Each lyric `.csv` file includes the following fields: __'Artist', 'Artist id','Song','Song id','Lyrics'__

Songs whose titles include the string _'strumental'_ or _'emix'_ are skipped (we deliberately leave out the first character of each word to avoid the need for capitalization or lowercasing, achieving better performance with the same results). 

Furthermore, we remove any parenthesized text from the song title, as in the overwhelming majority of cases it holds information about the song's current mix or version, which is not included on the lyrics page either.
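
The two title rules can be sketched in isolation as follows. `should_skip` and `strip_parentheses` are hypothetical helpers; the regular expression is the one used in `getLyrics`, while the trailing `.strip()` is an addition of this sketch.

```python
import re

# Markers without their first letter, so "Remix"/"remix" and
# "Instrumental"/"instrumental" all match without changing case
SKIP_MARKERS = ("strumental", "emix")

def should_skip(title):
    """True for titles that look like instrumentals or remixes."""
    return any(marker in title for marker in SKIP_MARKERS)

def strip_parentheses(title):
    """Drop parenthesized parts such as '(Radio Edit)' from a title."""
    return re.sub(r"\([^()]*\)", "", title).strip()

cleaned = strip_parentheses("Hello (Radio Edit)")
```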

Afterward we try to fetch the lyrics from the following services (in the presented order):
1. AZLyrics
2. LyricWikia
3. Genius

Finally, we remove all punctuation from the lyrics and add a row to the `.csv`.

If a corresponding lyric file already exists for the artist in question, then the algorithm returns.

In [13]:
def getLyrics(artist_id, artist_name, artists_song_df):
    entered = False
    if (os.path.exists("./lyrics/lyrics_"+str(artist_id)+".csv")):
#         print("{} already exists".format("./lyrics/lyrics_"+str(artist_id)+".csv"))
        return
    fileCsv_song_lyrics = csv.writer(open('./lyrics/lyrics_'+artist_id+'.csv', "w+"))
    fileCsv_song_lyrics.writerow(['Artist', 'Artist_id','Song','Song_id','Lyrics'])
    unique_song_list = []
    for index_song, row_song in artists_song_df.iterrows():
    #         print(row_song['Song'])
        if (not(row_song['Song'] in unique_song_list)):
            unique_song_list.append(row_song['Song'])
            # named song_lyrics so that it does not shadow the azlyrics lyrics() function
            song_lyrics = None
            # skip instrumentals and remixes (either marker is enough)
            if (not('strumental' in row_song['Song'] or 'emix' in row_song['Song'])):
                song_name_wo_parentheses = re.sub(r'\([^()]*\)', '', row_song['Song'])
                try:
                    wd = lyrics(artist_name, song_name_wo_parentheses)
                    if ('Error' in wd):
                        # retry with the original title before giving up on AZLyrics
                        wd = lyrics(artist_name, row_song['Song'])
                    if ('Error' in wd):
                        raise Exception
                    else:
                        song_lyrics = wd
                except: #if shadow banned from AZLyrics
                    try:
                        song_lyrics = lyricwikia.get_lyrics(artist_name, song_name_wo_parentheses)
                    except:
                        # Genius is the last resort; it returns None when nothing is found
                        song_lyrics = genius_Lyrics(artist_name, song_name_wo_parentheses)
            if (not(song_lyrics == None) and not('Unfortunately, we are not licensed' in song_lyrics)):
                entered = True
                #tokenize lyrics, drop punctuation and join the tokens back into one string
                tokens = removePunctuation(nltk.word_tokenize(song_lyrics))
                song_lyrics = " ".join(tokens)
                fileCsv_song_lyrics.writerow([artist_name, artist_id,row_song['Song'],row_song['Song_id'],song_lyrics])           #do NOT write in the CSV if you cannot find the lyrics
#                 print("\tADDED LYRICS")
    if not(entered):
        pass
#         print('\tNO SONG FOR ARTIST {}'.format(artist_name))

In [14]:
getLyricsTopsters(artists_df)