# ADS 509 Module 1: APIs and Web Scraping

This notebook has two parts. In the first part, you will scrape lyrics from AZLyrics.com. In the second part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who have at least 20 songs with lyrics on AZLyrics.com. We start with pulling some information and analyzing them.


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


# Importing Libraries

In [1]:
import os
import datetime
import re

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import random

In [36]:
# Use this cell for any import statements you add

import shutil
from bs4 import Comment

---

# Lyrics Scrape

This section asks you to pull data by scraping www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [3]:
artists = {'posty':"https://www.azlyrics.com/p/postmalone.html",
           'slipknot':"https://www.azlyrics.com/s/slipknot.html"}
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not have an explicit maximum on number of requests in any one time, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages. 

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: The webscraping we are about to do is allowed because only two things are disallowed on the robots.txt page for azlyrics.com. This includes /lyricsdb/ and /song/. We are not using either page, so what we are about to preform is acceptable. All crawlers on azlyrics.com have these rules.


In [8]:
# Set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

# Iterate over each artist and their respective artist page
for artist, artist_page in artists.items():
    print(f"Grabbing lyrics pages for: {artist}")

    # Request the artist's page
    r = requests.get(artist_page)
    time.sleep(5 + 10 * random.random())  # Sleep to avoid overwhelming the server

    # Parse the HTML content
    soup = BeautifulSoup(r.text, 'html.parser')
    print(f"Processing {artist_page}")

    # Find all links (<a> tags) to lyrics pages inside divs with class 'listalbum-item'
    for div in soup.find_all('div', class_='listalbum-item'):
        a_tag = div.find('a', href=True)  # Find the <a> tag inside the div

        if a_tag:
            href = a_tag['href']

            # Check if the href ends with '.html' (a lyrics page) and starts with '/'
            if href.endswith('.html') and href.startswith('/'):
                full_url = f"https://www.azlyrics.com{href}"
                lyrics_pages[artist].append(full_url)

Grabbing lyrics pages for: posty
Processing https://www.azlyrics.com/p/postmalone.html
Grabbing lyrics pages for: slipknot
Processing https://www.azlyrics.com/s/slipknot.html


In [9]:
# Print the first 5 links for each artist
for artist, links in lyrics_pages.items():
    print(f"{artist}: {links[:5]}")

posty: ['https://www.azlyrics.com/lyrics/postmalone/neverunderstand.html', 'https://www.azlyrics.com/lyrics/postmalone/moneymademedoit.html', 'https://www.azlyrics.com/lyrics/postmalone/gitwitu.html', 'https://www.azlyrics.com/lyrics/postmalone/goddamn.html', 'https://www.azlyrics.com/lyrics/postmalone/fuck.html']
slipknot: ['https://www.azlyrics.com/lyrics/slipknot/slipknot.html', 'https://www.azlyrics.com/lyrics/slipknot/gently-1996.html', 'https://www.azlyrics.com/lyrics/slipknot/donothingbitchslap.html', 'https://www.azlyrics.com/lyrics/slipknot/onlyone.html', 'https://www.azlyrics.com/lyrics/slipknot/tatteredandtorn48176.html']


Let's make sure we have enough lyrics pages to scrape. 

In [10]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 

In [11]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull will take for this artist will take {round(len(links)*10/3600,2)} hours.")

For posty we have 148.
The full pull will take for this artist will take 0.41 hours.
For slipknot we have 132.
The full pull will take for this artist will take 0.37 hours.


In [12]:
# Display the populated lyrics pages dictionary
for artist, song_urls in lyrics_pages.items():
    print(f"\nArtist: {artist}")
    for url in song_urls[:5]:  # Show the first 5 URLs to keep output short
        print(url)


Artist: posty
https://www.azlyrics.com/lyrics/postmalone/neverunderstand.html
https://www.azlyrics.com/lyrics/postmalone/moneymademedoit.html
https://www.azlyrics.com/lyrics/postmalone/gitwitu.html
https://www.azlyrics.com/lyrics/postmalone/goddamn.html
https://www.azlyrics.com/lyrics/postmalone/fuck.html

Artist: slipknot
https://www.azlyrics.com/lyrics/slipknot/slipknot.html
https://www.azlyrics.com/lyrics/slipknot/gently-1996.html
https://www.azlyrics.com/lyrics/slipknot/donothingbitchslap.html
https://www.azlyrics.com/lyrics/slipknot/onlyone.html
https://www.azlyrics.com/lyrics/slipknot/tatteredandtorn48176.html


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 


In [48]:
def generate_filename_from_link(link):
    if not link:
        return None
    
    # Remove the protocol
    link = link.replace("https://", "").replace("http://", "")
    
    # Remove the ".html" suffix
    link = link.replace(".html", "")
    
    # Find the last occurrence of "/" and get the part after it
    last_slash_index = link.rfind("/")
    if last_slash_index != -1:
        # Extract the part after the last "/"
        name = link[last_slash_index + 1:]
    else:
        # If there is no "/", return None or handle as needed
        return None
    
    # Replace problematic characters with underscores
    name = name.replace(":", "_").replace("?", "_")
    
    # Add the ".txt" extension
    name = name + ".txt"
    
    return name


In [33]:
# Function to clean and extract the song title
def extract_title(raw_title):
    # Extract the part between " - " and " Lyrics"
    if " - " in raw_title and "Lyrics" in raw_title:
        start = raw_title.index(" - ") + 3
        end = raw_title.index(" Lyrics")
        return raw_title[start:end].strip()
    return raw_title.strip()

In [40]:
# Function to clean the lyrics by removing the comment
def clean_lyrics(lyrics_div):
    # Remove comments (including the "Usage of azlyrics.com content" comment)
    for element in lyrics_div(text=lambda text: isinstance(text, Comment)):
        element.extract()
    
    # Return the clean text from the div
    return lyrics_div.get_text(separator="\n").strip()

In [45]:
# Define the folder path
folder_name = "C:/Users/benog/OneDrive/Documents/Grad School/USD/ADS 509 Text Mining/lyrics"

# Check if the folder exists
if os.path.isdir(folder_name):
    print(f"'{folder_name}' already exists \nRemoving it and creating a new one")
    shutil.rmtree(folder_name)

# Create the new folder
os.mkdir(folder_name)

'C:/Users/benog/OneDrive/Documents/Grad School/USD/ADS 509 Text Mining/lyrics' already exists. 
Removing it and creating a new one.


In [50]:
url_stub = "https://www.azlyrics.com" 
start = time.time()

total_pages = 0 

# Step to flatten the list of links for song lyrics
for artist, urls in lyrics_pages.items():
    # Flatten the list of links if there are any lists inside the urls
    lyrics_pages[artist] = [item for sublist in urls for item in sublist] if any(isinstance(i, list) for i in urls) else urls

# Extract song data from each artist
for artist, url_list in lyrics_pages.items():
    # Create a subfolder for the artist
    artist_folder = os.path.join(folder_name, artist).replace('\\', '/')

    # Check if the folder exists, remove and recreate it if needed
    if os.path.isdir(artist_folder):
        print(f"Folder '{artist_folder}' exists. Removing it and creating a new one.")
        shutil.rmtree(artist_folder)

    os.mkdir(artist_folder)

    # Iterate over the lyrics pages, request the page, extract title and lyrics, and save the data
    for i, song_url in enumerate(url_list):

        # Request the lyrics page
        response = requests.get(song_url)
        time.sleep(5 + 10 * random.random())  # Sleep to avoid overwhelming the server

        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract the song title
        raw_title = soup.find('title').text.strip()
        title = extract_title(raw_title) 

        # Find the comment preceding the lyrics
        comment = soup.find(string=lambda text: isinstance(text, Comment) and "Usage of azlyrics.com content" in text)

        if comment:
            # Find the div that contains the lyrics after the comment
            lyrics_div = comment.find_parent("div")
            if lyrics_div:
                # Clean the lyrics by removing the comment
                lyrics = clean_lyrics(lyrics_div)
            else:
                lyrics = "Lyrics not found"
        else:
            lyrics = "Lyrics not found"

        # Write the title and lyrics to a file
        filename = generate_filename_from_link(song_url)
        file_path = os.path.join(artist_folder, filename).replace('\\', '/')

        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(f"{title}\n\n{lyrics}")

end = time.time()
print(f"Total pages fetched: {total_pages}")
print(f"Time taken: {end - start} seconds")

Folder 'C:/Users/benog/OneDrive/Documents/Grad School/USD/ADS 509 Text Mining/lyrics/posty' exists. Removing it and creating a new one.
Total pages fetched: 0
Time taken: 2941.5527362823486 seconds


In [51]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 1.0 hours.


---

# Evaluation

This assignment asks you to pull data by scraping www.AZLyrics.com.  After you have finished the above sections , run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [52]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [53]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For posty we have 147 files.
For posty we have roughly 56180 words, 3824 are unique.
For slipknot we have 132 files.
For slipknot we have roughly 43801 words, 3570 are unique.
