# Sentiment Analysis - Lyrics Extraction

- Website: https://genius.com/

- Genius API client: https://genius.com/developers?ref=genius.engineering


**Extract Lyrics**
- requests module
- create a .env file to store your API token
- use the python-dotenv package to load the environment variable from the .env file
- install in a conda environment:
    - conda install -c conda-forge python-dotenv
- install in the VS Code terminal:
    - pip install python-dotenv
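
The token setup above can be checked before any request is sent. This is a minimal sketch, assuming the .env file holds a single line of the form `GENIUS_API_TOKEN=<your token>`. The helper name `require_token` is hypothetical (it is not part of python-dotenv); it only reads the variable once `load_dotenv()` has run and fails early if the value is missing or blank.

```python
import os

def require_token(env=None, name="GENIUS_API_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or blank."""
    # Fall back to the process environment, which load_dotenv() populates
    env = os.environ if env is None else env
    token = (env.get(name) or "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return token

# Typical use, after load_dotenv() has been called:
# TOKEN = require_token()
```

Failing here gives a clearer message than the authorization error the API would otherwise return for an empty token.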


# Extract the lyrics of one song from Genius

In [5]:
import os 
from dotenv import load_dotenv
import requests
from bs4 import BeautifulSoup

#load environment variables from .env file
load_dotenv(override=True)

# genius API token from environment variable 
TOKEN = os.getenv('GENIUS_API_TOKEN')
BASE_URL = "https://api.genius.com"

# function to search for a song; returns the Genius song ID of the top hit, or None
def search_song(song_title, artist_name):
    search_url = f'{BASE_URL}/search'
    headers = {'Authorization': f'Bearer {TOKEN}'}
    params = {'q': f'{song_title} {artist_name}'}

    try:
        response = requests.get(search_url, headers=headers, params=params)
        response.raise_for_status()
        response_json = response.json()

        # search results are nested under response -> hits;
        # a missing 'response' key raises KeyError, handled below
        hits = response_json['response'].get('hits', [])
        if not hits:
            print('No results found')
            return None

        # take the first (top-ranked) hit
        song_hit = hits[0]['result']

        return song_hit['id']

    except (requests.RequestException, KeyError) as e:
        print(f'Error searching for song: {e}')
        return None
        

# function to get song lyrics
def get_lyrics(song_id):
    # search_song returns None when nothing was found
    if song_id is None:
        return "Lyrics not found"

    song_url = f'{BASE_URL}/songs/{song_id}'
    headers = {'Authorization': f'Bearer {TOKEN}'}

    response = requests.get(song_url, headers=headers)
    song_info = response.json()['response']['song']
    lyrics_path = song_info['path']

    # the API only returns the page path; the lyrics are on the public site
    lyrics_url = f'https://genius.com{lyrics_path}'
    lyrics_response = requests.get(lyrics_url)
    lyrics_html = lyrics_response.text

    # extract lyrics from HTML
    soup = BeautifulSoup(lyrics_html, 'html.parser')

    # Genius lyrics are usually inside <div> tags with class "Lyrics__Container"
    containers = soup.find_all("div", class_=lambda x: x and 'Lyrics__Container' in x)
    # join the containers with a newline so a section header does not fuse with the previous line
    lyrics = "\n".join(div.get_text(separator="\n") for div in containers)

    return lyrics if lyrics else "Lyrics not found"
    
song_title = 'How Deep Is Your Love'
artist_name = 'Bee Gees'
song_id = search_song(song_title, artist_name)
lyrics = get_lyrics(song_id)

print(lyrics)

[Verse 1]
I know your eyes in the mornin' sun
I feel you touch me in the pourin' rain
And the moment that you wander far from me
I wanna feel you in my arms again
[Pre-Chorus]
And you come to me on a summer breeze
Keep me warm in your love, then you softly leave
And it's me you need to show
How deep is your love?
[Chorus]
How deep is your love? How deep is your love?
I really mean to learn
'Cause we're livin' in a world of fools
Breakin' us down
When they all should let us be
We belong to you and me
[Verse 2]
I believe in you
You know the door to my very soul
You're the light in my deepest, darkest hour
You're my saviour when I fall[Pre-Chorus]
And you may not think I care for you
When you know down inside that I really do
And it's me you need to show
How deep is your love?
[Chorus]
How deep is your love? How deep is your love?
I really mean to learn
'Cause we're livin' in a world of fools
Breakin' us down
When they all should let us be
We belong to you and me
[Bridge]
Na-na-na-na-na
N

In [2]:
import json

def save_lyrics_to_json(song_title, artist_name, lyrics, file_path):
    data = {
        'song_title': song_title,
        'artist_name': artist_name,
        'lyrics': lyrics
    }
    
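    # write UTF-8 and keep non-ASCII characters (e.g. "Beyoncé") readable in the file
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)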

# save_lyrics_to_json(song_title, artist_name, lyrics, 'lyrics.json')

In [9]:
import pandas as pd

df_song = pd.read_csv(r'./Dataset/top_song_2014_2024.csv')

df_song_2014 = df_song[df_song['year'] == 2014]

df_song_2014

| Unnamed: 0 | year | song | artist | play_count |
|---|---|---|---|---|
| 0 | 2014 | Thinking out Loud | Ed Sheeran | 473 |
| 1 | 2014 | Come Away With Me | Norah Jones | 242 |
| 2 | 2014 | Happy Little Pill | Troye Sivan | 242 |
| 3 | 2014 | Superheroes | The Script | 232 |
| 4 | 2014 | How Long Will I Love You - Bonus Track | Ellie Goulding | 189 |
| 5 | 2014 | Try | Colbie Caillat | 189 |
| 6 | 2014 | Chandelier | Sia | 142 |
| 7 | 2014 | Not a Bad Thing | Justin Timberlake | 125 |
| 8 | 2014 | You're So Fine | Guba | 111 |
| 9 | 2014 | Cool Kids | Echosmith | 89 |


# Extract MULTIPLE song lyrics:

1) Read the dataset and load the top 15 songs from 2014 to 2024

2) Search for each song using Genius API: Use Genius API to find the song ID

3) Scrape the lyrics: For each song ID, scrape the lyrics using the song's genius page

4) Store the result in a structured JSON format
    - collect the year, song, artist and lyrics
    - store them in a JSON file

**Time Taken to scrape 150 lyrics**

- print to console: 11 minutes, 42.6 seconds

- store inside json file: 9 minutes, 51.4 seconds
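
For a sense of scale, the two timings above can be turned into an average per song. This is only arithmetic on the figures reported here; the helper name `seconds_per_song` is made up for the illustration.

```python
def seconds_per_song(minutes, seconds, n_songs):
    """Average wall-clock seconds spent per song."""
    return (minutes * 60 + seconds) / n_songs

# printing to console: 11 min 42.6 s for 150 songs
print(round(seconds_per_song(11, 42.6, 150), 2))  # 4.68
# storing in the JSON file: 9 min 51.4 s for 150 songs
print(round(seconds_per_song(9, 51.4, 150), 2))   # 3.94
```

So each song costs roughly 4 to 5 seconds, which covers one search request, one song request and one page download.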

In [19]:
import os
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd
from dotenv import load_dotenv


df_song = pd.read_csv(r'./Dataset/top_song_2014_2024.csv')

#load environment variables from .env file 
load_dotenv(override=True)

# genius API token from environment variable 
TOKEN = os.getenv('GENIUS_API_TOKEN')
BASE_URL = 'https://api.genius.com'

# Function to search for a song
def search_song1(song_title, artist_name):
    search_url = f'{BASE_URL}/search' 
    headers = {'Authorization': f'Bearer {TOKEN}'}
    params = {'q': f"{song_title} {artist_name}"}
    
    try:
        response = requests.get(search_url, headers=headers, params=params)
        response.raise_for_status()
        response_json = response.json()
        
        #extract song hit
        hits = response_json['response'].get('hits', [])
        if not hits:
            return None
        
        song_hit = hits[0]['result']
        return song_hit['id']
    
    except (requests.RequestException, KeyError) as e:
        print(f'Error searching for song: {e}')
        return None
    

# Function to get song lyrics
def get_lyrics(song_id):
    song_url = f'{BASE_URL}/songs/{song_id}'
    headers = {'Authorization': f'Bearer {TOKEN}'}

    response = requests.get(song_url, headers=headers)
    song_info = response.json()['response']['song']
    lyrics_path = song_info['path']

    # the API only returns the page path; the lyrics are on the public site
    lyrics_url = f'https://genius.com{lyrics_path}'
    lyrics_response = requests.get(lyrics_url)
    lyrics_html = lyrics_response.text

    # extract lyrics from HTML using Beautiful Soup
    soup = BeautifulSoup(lyrics_html, 'html.parser')
    containers = soup.find_all("div", class_=lambda x: x and 'Lyrics__Container' in x)
    # join the containers with a newline so a section header does not fuse with the previous line
    lyrics = "\n".join(div.get_text(separator="\n") for div in containers)

    return lyrics if lyrics else 'Lyrics not found'

# function to get and store lyrics for each song
def fetch_and_store_lyrics(df_songs, output_file=None):
    data = []

    # iterate through each song in the df
    for _, row in df_songs.iterrows():
        song_title = row['song']
        artist_name = row['artist']
        year = int(row['year'])  # plain int so json.dump can always serialise it

        song_id = search_song1(song_title, artist_name)
        if song_id:
            lyrics = get_lyrics(song_id)
            song_data = {
                'year': year,
                'artist': artist_name,
                'song': song_title,
                'lyrics': lyrics
            }
            data.append(song_data)
        else:
            print(f"Song '{song_title}' by '{artist_name}' not found")

    # save the data to a JSON file when an output path is given
    if output_file:
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    return data

# the full run takes roughly 10 minutes, so the call is left commented out
# fetch_and_store_lyrics(df_song, 'top_songs_lyrics.json')

# Errors - while scraping

1) Invalid URL 'https:genius.com/Ed-sheeran-thinking-out-loud-lyrics': No host supplied
    - Solution: the URL built in the get_lyrics function was missing the double slash (//) after https:
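
The "No host supplied" error above can be reproduced and guarded against with the standard library. This is a small sketch, not the notebook's own code: `SITE_URL` and `build_lyrics_url` are hypothetical names, and the path is the one from the error message.

```python
from urllib.parse import urljoin, urlparse

SITE_URL = "https://genius.com"  # assumed constant name, not in the original notebook

def build_lyrics_url(path):
    """Join the site root and the song path returned by the API."""
    url = urljoin(SITE_URL, path)
    # A URL without '//' after the scheme parses with an empty host
    if not urlparse(url).netloc:
        raise ValueError(f"URL has no host: {url}")
    return url

# The broken URL from the error message has no host part
print(urlparse("https:genius.com/Ed-sheeran-thinking-out-loud-lyrics").netloc == "")  # True

print(build_lyrics_url("/Ed-sheeran-thinking-out-loud-lyrics"))
# https://genius.com/Ed-sheeran-thinking-out-loud-lyrics
```

Building the URL with `urljoin` instead of string formatting avoids the missing-slash typo, since the scheme and host live in one constant.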

# Errors - json file output
* Year 2014: 
    - "artist": "Ellie Goulding",
    - "song": "How Long Will I Love You - Bonus Track"
    - error - bonus track

* Year 2015:
    1) "artist": "Meghan Trainor"
        - "song": "Like I'm Gonna Lose You (feat. John Legend)"
        - error - get rid of 'ft John Legend' 
    
    2) "artist": "Paris Music"
        - "song": "Mad World",
        - error - artist should be: Gary Jules/Michael Andrews

    3) "artist": "John Lennon",
        - "song": "Imagine - Remastered 2010",
        - error - get rid of 'Remastered 2010'

* Year 2016:
    1) "artist": "Beyoncé"
        - "song": "7-Nov",
        - it should be 7/11, not 7-Nov LOL - change the column format from date to 'text'

* Year 2018:
    1) "artist": "Bebe Rexha",
        - "song": "Meant to Be (feat. Florida Georgia Line)",
        - get rid of 'feat...'
    
    2)  "artist": "Queen",
        - "song": "Under Pressure - Remastered 2011",
        - get rid of 'Remastered...'
    
    3)  "artist": "Queen",
        -  "song": "Don't Stop Me Now - Remastered 2011",
        - get rid of 'Remastered...'

* Year 2019:
    1)  "artist": "Julia Michaels",
        - "song": "Anxiety (with Selena Gomez)",
        - get rid of selena gomez
    
    2) "artist": "Drax Project",
        - "song": "Woke Up Late (feat. Hailee Steinfeld)",
        - get rid of Hailee
    
    3) "artist": "Yuna",
        - "song": "Lelaki (Remaster 2015)",
        - get rid of remaster

    4)  "artist": "Fifth Harmony",
        - "song": "Work from Home (feat. Ty Dolla $ign)",
        - get rid of ty dolla 
    
    5)  "artist": "Ella",
        - "song": "Layar Impian",
        - No lyrics found: https://www.lyrics.my/artists/ella/lyrics/layar-impian

    6) "artist": "JP Saxe",
        - "song": "If the World Was Ending (feat. Julia Michaels)",
        - get rid of Julia

* Year 2020:
    1) "artist": "JP Saxe",
        - "song": "If the World Was Ending (feat. Julia Michaels)",
        - get rid of Julia

* Year 2021:
    1) "artist": "Lady Gaga",
        - "song": "Rain On Me (with Ariana Grande)",
        - get rid of Ariana
    
    2)  "artist": "G Sette",
        - "song": "Black Tattoo - Lonely",
        - It should be JB's Lonely 
    
    3) "artist": "Devis",
        - "song": "Chi se ne frega",
        - I don't know who this person is 😭😭😭
        - DELETE
    
    4)  "artist": "Ed Sheeran",
        - "song": "I Don't Care (with Justin Bieber)",
        - Get rid of JB
    
    5) "artist": "Sam Smith",
        - "song": "Dancing With A Stranger (with Normani)",
        - get rid of Normani

* Year 2022:
    1) "artist": "Sam Smith",
        - "song": "Unholy (feat. Kim Petras)",
        - get rid of Kim Petras
    
    2) "artist": "Rema",
        - "song": "Calm Down (with Selena Gomez)",
        - get rid of Selena 
    
    3) "artist": "Lil Nas X",
        - "song": "INDUSTRY BABY (feat. Jack Harlow)",
        - get rid of Jack Harlow

* Year 2023:
    1) "artist": "Stephen Sanchez",
        - "song": "Until I Found You (with Em Beihold)",
        - get rid of 'with Em..'
    
    2)  "artist": "Sarah",
        - "song": "Ke Hujung Dunia",
        - Wrong Lyrics: https://www.lyrics.my/artists/siti-sarah-raissuddin/lyrics/ke-hujung-dunia
    
    3) "artist": "Amir Jahari",
        - "song": "Hasrat (OST Imaginur)",
        - get rid of OST
    
    4) "artist": "Megan Thee Stallion",
        - "song": "Mamushi (feat. Yuki Chiba)",
        - mix of English and Japanese
        - get rid of Yuki
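
Most of the title errors listed above are suffixes that could be stripped before searching. A rough sketch follows, assuming a hypothetical helper `clean_title` (not in the original notebook) with regex patterns that cover only the suffix styles seen in this list. Titles such as "7-Nov" or "Black Tattoo - Lonely" are not suffix problems and would still need a manual fix.

```python
import re

# Parenthesised suffixes: "(feat. X)", "(with X)", "(Remaster 2015)", "(OST Imaginur)"
_PAREN_SUFFIX = re.compile(
    r"\s*\((?:feat|ft|with|remaster(?:ed)?|ost)\b[^)]*\)", re.IGNORECASE
)
# Dash suffixes: " - Remastered 2010", " - Bonus Track"
_DASH_SUFFIX = re.compile(
    r"\s+-\s+(?:remaster(?:ed)?\b|bonus track\b).*$", re.IGNORECASE
)

def clean_title(title):
    """Strip featured-artist and edition suffixes from a song title."""
    title = _PAREN_SUFFIX.sub("", title)
    title = _DASH_SUFFIX.sub("", title)
    return title.strip()

print(clean_title("Like I'm Gonna Lose You (feat. John Legend)"))  # Like I'm Gonna Lose You
print(clean_title("Imagine - Remastered 2010"))                    # Imagine
print(clean_title("Hasrat (OST Imaginur)"))                        # Hasrat
print(clean_title("Black Tattoo - Lonely"))                        # Black Tattoo - Lonely (unchanged)
```

The patterns are deliberately narrow: a dash followed by anything else is left alone, because a dash can also be part of the real title.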

# Solutions: 
- Manually edit the name of the song to get the right lyrics from Genius

- Run the search once again with the correct song name

- Songs in my native language:
    * Manually insert the lyrics into the JSON file
    * I deleted the lyrics and left them empty for now.
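
The manual corrections can also be kept in code, so a re-run uses the right names without editing the dataset by hand. This is a sketch under assumptions: `MANUAL_FIXES` and `corrected_query` are hypothetical names, and the two entries are taken from the error notes in this notebook (for "Mad World" only the first of the two listed artists is used).

```python
# Hypothetical override table: (artist, song) as they appear in the dataset,
# mapped to the (artist, song) that should be sent to the Genius search.
MANUAL_FIXES = {
    ("Beyoncé", "7-Nov"): ("Beyoncé", "7/11"),
    ("Paris Music", "Mad World"): ("Gary Jules", "Mad World"),
}

def corrected_query(artist, song):
    """Return the corrected (artist, song) pair, or the input pair if no fix is listed."""
    return MANUAL_FIXES.get((artist, song), (artist, song))

print(corrected_query("Beyoncé", "7-Nov"))              # ('Beyoncé', '7/11')
print(corrected_query("Ed Sheeran", "Thinking out Loud"))  # ('Ed Sheeran', 'Thinking out Loud')
```

The corrected pair would then be passed to the search function in place of the raw dataset values.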