
# Scraping Song Lyrics

In this lab, you will scrape the lyrics of songs by your favorite artist.

In Part B of this lab, you will train a model on these lyrics so that you can generate a new "song" in the style of that artist!

Find a web site or REST API that contains song lyrics by your favorite artist. If you decide to scrape a webpage, be sure to document any ethical decisions you made, such as:
- not making more requests to the website than necessary
- staggering your requests
- respecting the robots.txt file.
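
The first two points can be sketched as a small helper that requests each distinct URL only once and pauses between consecutive requests. This is a minimal sketch, not a required implementation: `fetch_all`, its `fetch` argument (any callable that takes a URL and returns page text, such as a wrapper around `requests.get`), and the one-second default delay are all assumptions made for illustration.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each distinct URL once, pausing between consecutive requests.

    `fetch` is a hypothetical callable that takes a URL and returns page text.
    """
    pages = {}
    for url in urls:
        if url in pages:
            continue  # never request the same page twice
        if pages:
            sleep(delay_seconds)  # stagger: wait before every request after the first
        pages[url] = fetch(url)
    return pages


# Offline demonstration with a stand-in fetcher, so no real site is contacted.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"


pages = fetch_all(["a.htm", "b.htm", "a.htm"], fake_fetch, delay_seconds=0)
print(len(calls), "requests made for 3 URLs")
```

Passing the sleep function as a parameter keeps the helper easy to test without actually waiting.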

Some words of wisdom from our experience:

- The [genius.com API](https://docs.genius.com/) does not return the lyrics, only a link to a webpage with the lyrics.
- The free version of the [MusixMatch API](https://developer.musixmatch.com/plans) only returns the first 30% of the lyrics.
- If you are web scraping, find a web page that has links to all the songs by the artist, like [this page for the Steve Miller Band](https://www.azlyrics.com/s/stevemillerband.html). [_Note:_ It appears that `azlyrics.com` blocks web scraping, so you may have to find a different lyrics web site.] Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song.
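
The last step, turning the hyperlinks on an index page into URLs you can request, might look roughly like the sketch below. It uses only the standard library so that it runs offline; BeautifulSoup's `find_all('a')` would do the same job. The `LinkCollector` class, the sample HTML, and the `example.com` address are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the address of the index page.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up index page; a real one would come from an HTTP request.
sample_html = (
    '<p><a href="song1.htm">First Song</a> '
    '<a href="/lyrics/song2.htm">Second Song</a></p>'
)
collector = LinkCollector("http://example.com/artist/index.htm")
collector.feed(sample_html)
links = collector.links
for song_url, title in links:
    print(title, song_url)
```

`urljoin` handles both page-relative and site-relative links, which plain string concatenation does not.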

Create a `DataFrame`, where each row represents a song by that artist. The `DataFrame` should have two columns:
- The first should contain the title of the song.
- The second should contain the complete lyrics of the song as a string.
    - It should not contain any extraneous text or HTML tags.

Display the `DataFrame` in the Colab. Save the `DataFrame` as a CSV file, and download the CSV. (You will need it in Part B.)
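
Before saving, it can help to check that no row is missing its lyrics or still contains HTML. The following is one possible sanity check, assuming the scraped songs are held as a list of dictionaries; the `Song` and `Lyrics` keys, the `find_problem_rows` helper, and the sample records are illustrative assumptions.

```python
import re

# Matches anything that looks like an HTML tag, such as <br/> or <p class="x">.
TAG_PATTERN = re.compile(r"<[^>]+>")


def find_problem_rows(records):
    """Return titles whose lyrics are missing, empty, or contain tag-like text."""
    problems = []
    for record in records:
        lyrics = record.get("Lyrics")
        if not lyrics or not lyrics.strip() or TAG_PATTERN.search(lyrics):
            problems.append(record.get("Song"))
    return problems


# Made-up records for illustration only.
sample_records = [
    {"Song": "Clean Song", "Lyrics": "la la la"},
    {"Song": "Tagged Song", "Lyrics": "la <br/> la"},
    {"Song": "Missing Song", "Lyrics": None},
]
problem_titles = find_problem_rows(sample_records)
print(problem_titles)
```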

For this assignment, I decided to scrape the lyrics of one of my favorite rock singers of all time, Stevie Nicks! I found this sketchy-looking website and thought it was perfect: http://rockalittle.com/snsongs.htm

In [None]:
# Fetch the index page that links to every Stevie Nicks song on the site.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://rockalittle.com/snsongs.htm"
response = requests.get(url)
# Parse the page with Python's built-in HTML parser.
soup = BeautifulSoup(response.text, "html.parser")

In [None]:
# This function strips HTML tags and line breaks from a scraped paragraph and
# returns its text as a single string.

def clean_line(line):
    i = 0
    cleaned_line = ""
    while i < len(line):
        if line[i] == "<":
            if line[i:i+5] == "<br/>":
                # Replace a line-break tag with a space so words stay separated.
                line = line[:i] + " " + line[i + 5:]
            else:
                # Skip every character up to and including the closing ">".
                while i < len(line) and line[i] != ">":
                    i += 1
                i += 1
        elif line[i] in ["\n", "\r"]:
            # Drop raw newline characters.
            i += 1
        else:
            cleaned_line += line[i]
            i += 1
    # Remove the "Lyrics" label when it leads the text.
    if cleaned_line[:6] == "Lyrics":
        return cleaned_line[6:]
    return cleaned_line

In [None]:
# This function downloads one song page and extracts its lyrics using the HTML
# parser.
def extract_lyrics(lyrics_url):
    new_response = requests.get(lyrics_url)
    new_soup = BeautifulSoup(new_response.text, "html.parser")
    table = new_soup.find("table")
    # Return None when the page has no table to read lyrics from.
    if table is None:
        return None
    rows = table.find_all("tr")
    lyric_data = ""
    for row in rows:
        cells = row.find_all("td")
        # Skip rows that contain no data cells.
        if not cells:
            continue
        # Collect the text of every <p> tag in the first cell of the row.
        for paragraph in cells[0].find_all("p"):
            lyric_data += clean_line(str(paragraph))
    return lyric_data

In [None]:
# This loop visits every link on the index page, calls extract_lyrics on each
# one, and stores the results as a list of dictionaries.

links = soup.find_all('a')
song_data = []
for link in links:
    song_name = link.text.strip()
    song_url = link.get('href')
    lyrics = extract_lyrics("http://rockalittle.com/" + song_url)
    song_data.append({'Song': song_name, 'Lyrics': lyrics})

In [None]:
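# Build the DataFrame with one row per scraped link.
df = pd.DataFrame(song_data)
# Drop the rows labeled 193 and 194, which are left out of the final dataset.
df.drop([193, 194], axis=0, inplace=True)
# Save without the index so the CSV holds only the Song and Lyrics columns.
df.to_csv('Stevie.csv', index=False)
df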

| | Song | Lyrics |
|---|---|---|
| 0 | Affairs of the Heart | One set of doors was the color of honey One se... |
| 1 | After the Glitter Fades | Well I never thought I'd make it Here in Holly... |
| 2 | Alice | Well I heard she flew down to the Mountain Cit... |
| 3 | The Apartment Song | I used to live in a two room apartment Neig... |
| 4 | Angel | Sometimes The most beautiful things The most i... |
| ... | ... | ... |
| 188 | Won't You Say You Will | When my time is only Passing me from day to d... |
| 189 | You Can Still Change Your Mind | It's gonna be another hard night You wanna tak... |
| 190 | You Like Me | Don't read your words into my words Don't let ... |
| 191 | You May Be The One | You may be the one But you'll never be the one... |


_DOCUMENT ANY ETHICAL DECISIONS HERE._



For this assignment, I chose my favorite artist, Stevie Nicks, whose concert I am currently in Omaha, Nebraska to see. She was in her prime in the 70s, so there are many websites that list links to her song lyrics, and a lot of them look untouched since the early 2000s. This was the case for rockalittle.com, the website I used for this assignment. The site is so old that its HTML has no classes to indicate where to look for the text. I had to do a lot of string cleaning, but I had no trouble with being blocked from the page. I checked the robots.txt, which says:

User-agent: *
Allow: *

Sitemap: http://rockalittle.com/troops/photos_jan27_2007/resources/ebook/sitemap.xml

There were also no ads on the page, so there was nothing to profit from. Because of all these factors, I didn't have to make many ethical decisions as I scraped this data.
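
The robots.txt check can also be done programmatically with Python's standard `urllib.robotparser` module. The sketch below feeds in the rules quoted above as text, so it runs without contacting the site; the variable names and the choice of `"*"` as the user agent are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted above, pasted in as text so the check runs offline.
robots_lines = [
    "User-agent: *",
    "Allow: *",
    "",
    "Sitemap: http://rockalittle.com/troops/photos_jan27_2007/resources/ebook/sitemap.xml",
]

robots = RobotFileParser()
robots.parse(robots_lines)

# Ask whether a generic crawler may fetch the song index page.
index_url = "http://rockalittle.com/snsongs.htm"
allowed = robots.can_fetch("*", index_url)
print("Allowed to fetch index page:", allowed)
```

To check against the live file instead, `set_url()` followed by `read()` downloads and parses it in one step.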

## Submission Instructions

- Restart this notebook and run the cells from beginning to end.
  - Go to Runtime > Restart and Run All.

In [None]:
# @markdown Run this cell to download this notebook as a webpage, `_NOTEBOOK.html`.

import google, json, nbformat

# Get the current notebook and write it to _NOTEBOOK.ipynb
raw_notebook = google.colab._message.blocking_request("get_ipynb",
                                                      timeout_sec=30)["ipynb"]
with open("_NOTEBOOK.ipynb", "w", encoding="utf-8") as ipynb_file:
  ipynb_file.write(json.dumps(raw_notebook))

# Use nbconvert to convert .ipynb to .html.
!jupyter nbconvert --to html --log-level WARN _NOTEBOOK.ipynb

# Download the .html file.
google.colab.files.download("_NOTEBOOK.html")


- Open `_NOTEBOOK.html` in your browser, and save it as a PDF.
    - Go to File > Print > Save as PDF.
- Double check that all of your code and output is visible in the saved PDF.
- Upload the PDF to [Gradescope](https://www.gradescope.com/courses/694907).
    - Please be sure to select the correct pages corresponding to each question.