# Ultimate Guitar Scrape - 2
## Scraping individual tabs
*After the initial scrape of the top 5,000 tabs on ultimate-guitar.com, scrape each individual tab page for its metadata and chords.*

----

**Final Project for Data & Databases**

**C.J. Robinson**

**Fall 2024**

-----
### Scrape song pages

In [3]:
import re
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
from playwright.async_api import async_playwright

In [4]:
# read in initial data
top_songs_df = pd.read_csv("top_songs.csv")
top_songs_df.head()

Unnamed: 0.1,Unnamed: 0,rank,artist,artist_list,song,ratings,hits,type,song_link,artist_link,star_count
0,0,1,Ed Sheeran,['Ed Sheeran'],Perfect,48238,41205313,chords,https://tabs.ultimate-guitar.com/tab/ed-sheera...,https://www.ultimate-guitar.com/artist/ed_shee...,5.0
1,1,2,Jeff Buckley,['Jeff Buckley'],Hallelujah (ver 2),54484,39807305,chords,https://tabs.ultimate-guitar.com/tab/jeff-buck...,https://www.ultimate-guitar.com/artist/jeff_bu...,5.0
2,2,3,Elvis Presley,['Elvis Presley'],Cant Help Falling In Love,32809,33890059,chords,https://tabs.ultimate-guitar.com/tab/elvis-pre...,https://www.ultimate-guitar.com/artist/elvis_p...,5.0
3,3,4,Passenger,['Passenger'],Let Her Go,24248,31904817,chords,https://tabs.ultimate-guitar.com/tab/passenger...,https://www.ultimate-guitar.com/artist/passeng...,5.0
4,4,5,John Legend,['John Legend'],All Of Me,26699,29790560,chords,https://tabs.ultimate-guitar.com/tab/john-lege...,https://www.ultimate-guitar.com/artist/john_le...,5.0


### Loop through all the pages!

In [14]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

song_list = []
bad_urls_1k = []
counter = 0

for url in top_songs_df['song_link'][0:1000]:
    my_url = url
    #counter for progress
    counter += 1
    
    try:
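        await page.goto(my_url, timeout=120000)
        # wait without blocking the event loop (time.sleep would block it)
        await page.wait_for_timeout(3000)
        #scroll down to get chords to actually load
        await page.evaluate("window.scrollTo(0, 2000)") 
        await page.wait_for_timeout(3000)
        html = await page.content()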
        
        soup_doc = BeautifulSoup(html, "html.parser")
        
        song_dict = {}
        song_dict['link'] = my_url
        
        #there's a bunch of metadata that may or may not be in the header
        #since it depends on the author
        #go through anything in that header and pull out metadata name and value
        try:
            for tag in soup_doc.find_all("th", class_ = "ZvOWv"):
                data_type = tag.text.lower()
                song_dict[data_type] = tag.next_sibling()[0].text
        except Exception:
            print("No meta data")
            print(my_url)
        
        # only grab chord elements if it is chords, not tabs
        try:
            chord_list = []
            #pull all chords
            for chords in soup_doc.find_all("span", class_="_Oy28"):
                chord_list.append(chords.text)
        
            #get unique chord list
            song_dict['chord_list'] = list(set(chord_list))
            #also grab full text...just in case
            song_dict['full_text'] = soup_doc.find("pre", class_="tK8GG Ty_RP").text
        except Exception:
            print("No chords")
        
        # get raw text of contributions, will regex later
        try:
            song_dict['contributors'] = soup_doc.find("span", class_="zku_4").text
        except Exception:
            print("No contributors" + my_url)
        
        # get popularity raw, also will regex later
        try:
            song_dict['popularity'] = soup_doc.find("div", class_="_apVM").text
        except Exception:
            print("No popularity" + my_url)
    
        song_list.append(song_dict)

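    # catch Exception only, so a kernel interrupt or task cancellation still stops the loop
    except Exception:
        print("error with " + str(url))
        # keep a list of URLs that failed
        bad_urls_1k.append(url)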

    if counter % 10 == 0:
        print(str(counter) + " / 1000 done") 
    
top_songs_df_meta = pd.json_normalize(song_list)
top_songs_df_meta.to_csv("top_songs_metadata_1k.csv", index = False)

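bad_urls_1k_df = pd.DataFrame(bad_urls_1k, columns=["bad_urls"])
bad_urls_1k_df.to_csv('bad_urls_1k_df.csv', index=False)

# close the browser and stop Playwright so the next batch starts clean
await browser.close()
await playwright.stop()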

Future exception was never retrieved
future: <Future finished exception=Exception('Connection closed while reading from the driver')>
Exception: Connection closed while reading from the driver
Future exception was never retrieved
future: <Future finished exception=Exception('Connection closed while reading from the driver')>
Exception: Connection closed while reading from the driver


No chords
10 / 1000 done
No chords
No chords
20 / 1000 done
No chords
30 / 1000 done
40 / 1000 done
No chords
50 / 1000 done
No chords
No chords
No chords
60 / 1000 done
No chords
No chords
70 / 1000 done
No chords
No chords
No chords
80 / 1000 done
No chords
90 / 1000 done
100 / 1000 done
No chords
No chords
110 / 1000 done
No chords
No chords
120 / 1000 done
130 / 1000 done
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/759809
140 / 1000 done
No chords
150 / 1000 done
No chords
No chords
No chords
No chords
No chords
160 / 1000 done
No chords
No chords
No chords
No chords
170 / 1000 done
180 / 1000 done
No chords
No chords
190 / 1000 done
No chords
200 / 1000 done
210 / 1000 done
No contributorshttps://tabs.ultimate-guitar.com/tab/anji/menunggu-kamu-chords-2333561
No chords
220 / 1000 done
No chords
230 / 1000 done
No chords
240 / 1000 done
No chords
No chords
250 / 1000 done
No chords
No chords
No chords
260 / 1000 done
No chords
270 / 1000 done
No chords
No chords
28
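
The loop above stores `contributors` and `popularity` as raw text, to be parsed with a regex later. Below is a minimal sketch of that later step, assuming each raw string begins with a count such as a contributor or view total. The helper name `leading_count` and the sample strings are hypothetical; the real text on the page may be worded differently.

```python
import re

# Matches integers with or without thousands separators, e.g. "1,234,567" or "42".
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def leading_count(raw_text):
    """Return the first integer in raw_text, or None when there is none.

    Non-string input (for example NaN from a missing CSV cell) also gives None.
    """
    if not isinstance(raw_text, str):
        return None
    match = NUMBER.search(raw_text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Hypothetical sample strings, not copied from the site.
print(leading_count("1,234,567 views, added to favorites 45,678 times"))
print(leading_count("12 contributors total, last edit on Mar 03, 2021"))
```

If the assumption holds, the same function could be applied to a whole column, for example `top_songs_df_meta["popularity"].apply(leading_count)`.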

In [143]:
# top_songs_df = pd.json_normalize(song_list)
# top_songs_df.to_csv("top_songs_metadata_1k.csv", index = False)
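
The cells that follow repeat the same loop over other slices of `song_link` (1000:2000, 2000:3000, 3000:5000). One way to avoid copying the cell is to generate the slice bounds once and run a single loop over them. This is a small sketch only: `batch_bounds` is a hypothetical helper that is not part of the original notebook, and it produces equal-sized batches rather than the larger final slice used here.

```python
def batch_bounds(total, batch_size):
    """Return (start, stop) pairs that cover range(total) in batch_size steps."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, total))
        for start in range(0, total, batch_size)
    ]


for start, stop in batch_bounds(5000, 1000):
    # Each pair could drive one scraping pass,
    # e.g. over top_songs_df['song_link'][start:stop]
    print(start, stop)
```

Writing one CSV per pair, as the notebook already does per batch, keeps earlier results safe if a later batch crashes.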

In [18]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

song_list_2k = []
bad_urls_2k = []
counter = 0

for url in top_songs_df['song_link'][1000:2000]:
    my_url = url
    #counter for progress
    counter += 1
    
    try:
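        await page.goto(my_url, timeout=120000)
        # wait without blocking the event loop (time.sleep would block it)
        await page.wait_for_timeout(3000)
        #scroll down to get chords to actually load
        await page.evaluate("window.scrollTo(0, 2000)") 
        await page.wait_for_timeout(3000)
        html = await page.content()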
        
        soup_doc = BeautifulSoup(html, "html.parser")
        
        song_dict = {}
        song_dict['link'] = my_url
        
        #there's a bunch of metadata that may or may not be in the header
        #since it depends on the author
        #go through anything in that header and pull out metadata name and value
        try:
            for tag in soup_doc.find_all("th", class_ = "ZvOWv"):
                data_type = tag.text.lower()
                song_dict[data_type] = tag.next_sibling()[0].text
        except Exception:
            print("No meta data")
            print(my_url)
        
        # only grab chord elements if it is chords, not tabs
        try:
            chord_list = []
            #pull all chords
            for chords in soup_doc.find_all("span", class_="_Oy28"):
                chord_list.append(chords.text)
        
            #get unique chord list
            song_dict['chord_list'] = list(set(chord_list))
            #also grab full text...just in case
            song_dict['full_text'] = soup_doc.find("pre", class_="tK8GG Ty_RP").text
        except Exception:
            print("No chords")
        
        # get raw text of contributions, will regex later
        try:
            song_dict['contributors'] = soup_doc.find("span", class_="zku_4").text
        except Exception:
            print("No contributors" + my_url)
        
        # get popularity raw, also will regex later
        try:
            song_dict['popularity'] = soup_doc.find("div", class_="_apVM").text
        except Exception:
            print("No popularity" + my_url)
    
        song_list_2k.append(song_dict)

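    # catch Exception only, so a kernel interrupt or task cancellation still stops the loop
    except Exception:
        print("error with " + str(url))
        # keep a list of URLs that failed
        bad_urls_2k.append(url)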

    if counter % 10 == 0:
        print(str(counter) + " / 1000 done") 
    
top_songs_df_meta_2k = pd.json_normalize(song_list_2k)
top_songs_df_meta_2k.to_csv("top_songs_metadata_2k.csv", index = False)

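bad_urls_2k_df = pd.DataFrame(bad_urls_2k, columns=["bad_urls"])
bad_urls_2k_df.to_csv('bad_urls_2k_df.csv', index=False)

# close the browser and stop Playwright so the next batch starts clean
await browser.close()
await playwright.stop()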

Future exception was never retrieved
future: <Future finished exception=Exception('Connection closed while reading from the driver')>
Exception: Connection closed while reading from the driver
Future exception was never retrieved
future: <Future finished exception=Exception('Connection closed while reading from the driver')>
Exception: Connection closed while reading from the driver


No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/759410
10 / 1000 done
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/rihanna/fourfiveseconds-chords-1705341
No chords
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/rihanna/california-king-bed-chords-1004189
20 / 1000 done
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/guns-n-roses/sweet-child-o-mine-guitar-pro-220689
30 / 1000 done
No chords
No chords
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/rihanna/love-on-the-brain-chords-1809471
40 / 1000 done
No chords
No chords
50 / 1000 done
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/michael-jackson/heal-the-world-chords-71843
60 / 1000 done
No chords
No chords
No chords
No chords
No chords
70 / 1000 done
No chords
80 / 1000 done
No chords
90 / 1000 done
100 / 1000 done
No chords
No chords
110 / 1000 done
No chords
No chords
No chords
120 / 1000 done
N
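
The loop builds `chord_list` with `list(set(chord_list))`, which removes duplicates but does not keep the order in which chords appear in the song. If first-appearance order matters for later analysis, one alternative is sketched below. `unique_in_order` is a hypothetical helper, and the chord names are made-up sample input.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item.

    dict keys are unique and preserve insertion order (Python 3.7+),
    so dict.fromkeys gives an order-preserving de-duplication.
    """
    return list(dict.fromkeys(items))


# Made-up sample: chords in the order they might appear on a page.
sample_chords = ["G", "Em", "C", "G", "D", "Em"]
print(unique_in_order(sample_chords))
```

Swapping this in for `list(set(chord_list))` would leave the rest of the loop unchanged.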

In [7]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

song_list_3k = []
bad_urls_3k = []
counter = 0

for url in top_songs_df['song_link'][2000:3000]:
    my_url = url
    #counter for progress
    counter += 1
    
    try:
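        await page.goto(my_url, timeout=120000)
        # wait without blocking the event loop (time.sleep would block it)
        await page.wait_for_timeout(2000)
        #scroll down to get chords to actually load
        await page.evaluate("window.scrollTo(0, 2000)") 
        await page.wait_for_timeout(2000)
        html = await page.content()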
        
        soup_doc = BeautifulSoup(html, "html.parser")
        
        song_dict = {}
        song_dict['link'] = my_url
        
        #there's a bunch of metadata that may or may not be in the header
        #since it depends on the author
        #go through anything in that header and pull out metadata name and value
        try:
            for tag in soup_doc.find_all("th", class_ = "ZvOWv"):
                data_type = tag.text.lower()
                song_dict[data_type] = tag.next_sibling()[0].text
        except Exception:
            print("No meta data")
            print(my_url)
        
        # only grab chord elements if it is chords, not tabs
        try:
            chord_list = []
            #pull all chords
            for chords in soup_doc.find_all("span", class_="_Oy28"):
                chord_list.append(chords.text)
        
            #get unique chord list
            song_dict['chord_list'] = list(set(chord_list))
            #also grab full text...just in case
            song_dict['full_text'] = soup_doc.find("pre", class_="tK8GG Ty_RP").text
        except Exception:
            print("No chords")
        
        # get raw text of contributions, will regex later
        try:
            song_dict['contributors'] = soup_doc.find("span", class_="zku_4").text
        except Exception:
            print("No contributors" + my_url)
        
        # get popularity raw, also will regex later
        try:
            song_dict['popularity'] = soup_doc.find("div", class_="_apVM").text
        except Exception:
            print("No popularity" + my_url)
    
        song_list_3k.append(song_dict)

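    # catch Exception only, so a kernel interrupt or task cancellation still stops the loop
    except Exception:
        print("error with " + str(url))
        # keep a list of URLs that failed
        bad_urls_3k.append(url)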

    if counter % 10 == 0:
        print(str(counter) + " / 1000 done") 
    
top_songs_df_meta_3k = pd.json_normalize(song_list_3k)
top_songs_df_meta_3k.to_csv("top_songs_metadata_3k.csv", index = False)

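bad_urls_3k_df = pd.DataFrame(bad_urls_3k, columns=["bad_urls"])
bad_urls_3k_df.to_csv('bad_urls_3k_df.csv', index=False)

# close the browser and stop Playwright so the next batch starts clean
await browser.close()
await playwright.stop()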

No chords
10 / 1000 done
No contributorshttps://tabs.ultimate-guitar.com/tab/bob-dylan/girl-from-the-north-country-chords-1087317
No chords
20 / 1000 done
No chords
30 / 1000 done
No chords
No chords
No chords
40 / 1000 done
No chords
No chords
50 / 1000 done
No chords
No chords
60 / 1000 done
No chords
70 / 1000 done
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/one-direction/perfect-chords-1774305
No chords
No chords
80 / 1000 done
No chords
90 / 1000 done
100 / 1000 done
No contributorshttps://tabs.ultimate-guitar.com/tab/the-itchyworms/di-na-muli-chords-2413785
No chords
No chords
No chords
No chords
110 / 1000 done
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/hivi/remaja-chords-1980289
120 / 1000 done
No chords
No chords
No chords
130 / 1000 done
No chords
140 / 1000 done
No chords
error with #
No chords
150 / 1000 done
No chords
160 / 1000 done
No chords
170 / 1000 done
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/
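
Each batch writes the URLs that raised an error to a `bad_urls_*k_df.csv` file. A second pass over those URLs could be sketched as below. This is an assumption about how a retry might look, not something the notebook does: `retry_urls` is a hypothetical helper, and `fetch` stands in for any coroutine that loads one page and returns its HTML (for instance a wrapper around `page.goto` and `page.content`).

```python
import asyncio


async def retry_urls(urls, fetch, attempts=3, delay=1.0):
    """Try each URL up to `attempts` times.

    Returns (results, still_bad): results maps URL -> value returned by fetch,
    still_bad lists the URLs that failed on every attempt.
    """
    results = {}
    still_bad = []
    for url in urls:
        for attempt in range(attempts):
            try:
                results[url] = await fetch(url)
                break
            except Exception:
                # Pause before the next attempt, but not after the last one.
                if attempt < attempts - 1:
                    await asyncio.sleep(delay)
        else:
            # The inner loop finished without a break: every attempt failed.
            still_bad.append(url)
    return results, still_bad
```

Inside the notebook, where an event loop is already running, this would be called with `await retry_urls(...)` rather than `asyncio.run(...)`.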

In [8]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

song_list_4k = []
bad_urls_4k = []
counter = 0

for url in top_songs_df['song_link'][3000:5000]:
    my_url = url
    #counter for progress
    counter += 1
    
    try:
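        await page.goto(my_url, timeout=120000)
        # wait without blocking the event loop (time.sleep would block it)
        await page.wait_for_timeout(2000)
        #scroll down to get chords to actually load
        await page.evaluate("window.scrollTo(0, 2000)") 
        await page.wait_for_timeout(2000)
        html = await page.content()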
        
        soup_doc = BeautifulSoup(html, "html.parser")
        
        song_dict = {}
        song_dict['link'] = my_url
        
        #there's a bunch of metadata that may or may not be in the header
        #since it depends on the author
        #go through anything in that header and pull out metadata name and value
        try:
            for tag in soup_doc.find_all("th", class_ = "ZvOWv"):
                data_type = tag.text.lower()
                song_dict[data_type] = tag.next_sibling()[0].text
        except Exception:
            print("No meta data")
            print(my_url)
        
        # only grab chord elements if it is chords, not tabs
        try:
            chord_list = []
            #pull all chords
            for chords in soup_doc.find_all("span", class_="_Oy28"):
                chord_list.append(chords.text)
        
            #get unique chord list
            song_dict['chord_list'] = list(set(chord_list))
            #also grab full text...just in case
            song_dict['full_text'] = soup_doc.find("pre", class_="tK8GG Ty_RP").text
        except Exception:
            print("No chords")
        
        # get raw text of contributions, will regex later
        try:
            song_dict['contributors'] = soup_doc.find("span", class_="zku_4").text
        except Exception:
            print("No contributors" + my_url)
        
        # get popularity raw, also will regex later
        try:
            song_dict['popularity'] = soup_doc.find("div", class_="_apVM").text
        except Exception:
            print("No popularity" + my_url)
    
        song_list_4k.append(song_dict)

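    # catch Exception only, so a kernel interrupt or task cancellation still stops the loop
    except Exception:
        print("error with " + str(url))
        # keep a list of URLs that failed
        bad_urls_4k.append(url)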

    if counter % 10 == 0:
        print(str(counter) + " done") 
    
top_songs_df_meta_4k = pd.json_normalize(song_list_4k)
top_songs_df_meta_4k.to_csv("top_songs_metadata_4k.csv", index = False)

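bad_urls_4k_df = pd.DataFrame(bad_urls_4k, columns=["bad_urls"])
bad_urls_4k_df.to_csv('bad_urls_4k_df.csv', index=False)

# close the browser and stop Playwright once the last batch is saved
await browser.close()
await playwright.stop()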

No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/kelly-clarkson/since-u-been-gone-chords-148473
No contributorshttps://tabs.ultimate-guitar.com/tab/supertramp/give-a-little-bit-chords-1087315
No chords
No chords
10 done
No chords
No chords
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/metallica/nothing-else-matters-video-1024840
20 done
No contributorshttps://tabs.ultimate-guitar.com/tab/dua-lipa/idgaf-chords-2064489
30 done
No chords
No chords
No chords
No chords
No chords
40 done
No chords
No chords
50 done
No chords
No chords
No chords
No chords
60 done
No chords
No chords
No chords
No chords
70 done
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/maroon-5/animals-chords-1513699
No chords
No chords
No contributorshttps://tabs.ultimate-guitar.com/tab/katy-perry/teenage-dream-chords-972550
No chords
80 done
No chords
No chords
90 done
No chords
100 done
No chords
No chords
No chords
110 done
120 done
No chords
No chords
No chords
13