# Building DB of not_100 songs
1. Discogs 3,000 songs
3. Kaggle DB 500 songs
4. Spotify 500 songs - with my selected artists/playlists:


In [3]:
import requests # to download html code
from bs4 import BeautifulSoup # to navigate through the html code
import pandas as pd
import numpy as np
import re  #regexp

In [4]:
url = "https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=1"

In [5]:
# 3. download html with a get request. Use the function request.get() and store the output in response
response = requests.get(url)
# 200 status code means OK! response.status_code
print(response.status_code)

200


In [6]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.text, 'html.parser')
# 4.2. check that the html code looks like it should
print(soup.prettify())

<!DOCTYPE html>
<html class="is_not_mobile needs_reduced_ui" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="en" http-equiv="content-language"/>
  <meta content="no-cache" http-equiv="pragma">
   <meta content="-1" http-equiv="expires">
    <!-- OT will rewrite convert these to javascript and update our consent module accordingly -->
    <script class="optanon-category-C0002" type="text/plain">
     window.consent.resolveGroup(window.consent.PERFORMANCE_GROUP)
    </script>
    <script class="optanon-category-C0003" type="text/plain">
     window.consent.resolveGroup(window.consent.FUNCTIONALITY_GROUP)
    </script>
    <script class="optanon-category-C0004" type="text/plain">
     window.consent.resolveGroup(window.consent.TARGETING_GROUP)
    </script>
    <meta content="initial-scale=1.0,width=device-width" id="viewport" 

In [None]:
#"https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=1
#https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=2
#https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=3

In [9]:
#URL iterations - 8 pages total 

iterations = range(1, 9, 1)

for i in iterations:
    page = str(i)
    url = "https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=" + page 
    print(url)

https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=1
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=2
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=3
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=4
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=5
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=6
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=7
https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=8


Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending massive, automated requests to websites: it's rude.

We just have 13 of them, which is not too many, but it's still a good practice to let a few seconds pass in between requests. Some pages don't like being scraped and will block your IP if they detect it's sending automated requests. Others might have a small server for the traffic they handle, and sending too many requests might crash the site.

The sleep module will help us with that.

We will now scrape all the pages and store the response into a list - waiting a few seconds in between requests:

In [10]:
# To make it more "human", we can randomize the waiting time:
from time import sleep
from random import randint

pages = []

for i in range(1, 9, 1):

    # assemble the url:
    page = str(i)
    url = "https://www.discogs.com/search/?limit=250&sort=have%2Cdesc&type=release&page=" + page 

    # download html with a get request:
    response = requests.get(url)

    # monitor the process by printing the status code
    print("Status code: " + str(response.status_code))

    # store response into "pages" list
    pages.append(response)

    # respectful nap:
    wait_time = randint(1,6)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)

Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 5 second/s.


Note how if you print the object pages after running the code above, you'll just see the response code messages, but the html code is still accessible and you can parse it the same way we've always done:

In [13]:
for i in range(1, 8, 1):
    BeautifulSoup(pages[i].content, "html.parser")

In [33]:
soup_1 = BeautifulSoup(pages[0].content, "html.parser")
soup_2 = BeautifulSoup(pages[1].content, "html.parser")
soup_3 = BeautifulSoup(pages[2].content, "html.parser")
soup_4 = BeautifulSoup(pages[3].content, "html.parser")
soup_5 = BeautifulSoup(pages[4].content, "html.parser")
soup_6 = BeautifulSoup(pages[5].content, "html.parser")
soup_7 = BeautifulSoup(pages[6].content, "html.parser")
soup_8 = BeautifulSoup(pages[7].content, "html.parser")

In [21]:
print(soup_1)


<!DOCTYPE html>

<html class="is_not_mobile needs_reduced_ui" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="en" http-equiv="content-language"/>
<meta content="no-cache" http-equiv="pragma">
<meta content="-1" http-equiv="expires">
<!-- OT will rewrite convert these to javascript and update our consent module accordingly -->
<script class="optanon-category-C0002" type="text/plain">
            window.consent.resolveGroup(window.consent.PERFORMANCE_GROUP)
        </script>
<script class="optanon-category-C0003" type="text/plain">
            window.consent.resolveGroup(window.consent.FUNCTIONALITY_GROUP)
        </script>
<script class="optanon-category-C0004" type="text/plain">
            window.consent.resolveGroup(window.consent.TARGETING_GROUP)
        </script>
<meta content="initial-scale=1.0,width=device-width" id="viewpor

Locate:
- song name
- artist



#### Song names

In [31]:
soup_1.select("div:nth-child(1)>h4>a")[0].get_text()

'Random Access Memories'

In [43]:
# This is just for one page right now
song_names = []
for div in soup_1.find_all("div", {"class": "card card_large float_fix shortcut_navigable"}):
    for elem in div.find_all("h4"):
        song_names.append(elem.get_text().replace("\n",""))
song_names

['Random Access Memories',
 'Good Kid, M.A.A.d City',
 'Thriller',
 'The Dark Side Of The Moon',
 'Rumours',
 'The Dark Side Of The Moon',
 'Lazaretto',
 'Purple Rain',
 'Nevermind',
 'Harvest',
 'The Rise And Fall Of Ziggy Stardust And The Spiders From Mars',
 'Hotel California',
 'Born In The U.S.A.',
 'AM',
 'Thriller',
 'Brothers In Arms',
 'Déjà Vu',
 'The Cars',
 'My Beautiful Dark Twisted Fantasy',
 'Lost In The Dream',
 'Breakfast In America',
 'Carrie & Lowell',
 'Bridge Over Troubled Water',
 'Wish You Were Here',
 'The Chronic',
 'Fleet Foxes',
 "(What's The Story) Morning Glory?",
 'The Stranger',
 'Bad',
 'Wish You Were Here',
 'The Dark Side Of The Moon',
 'For Emma, Forever Ago',
 'Let It Be',
 'Lateralus',
 'Untitled',
 'Tapestry',
 'Discovery',
 'Johnny Cash At San Quentin',
 'Grease (The Original Soundtrack From The Motion Picture)',
 'Mezzanine',
 'In Rainbows',
 'Back To Black',
 'Frampton Comes Alive!',
 'Enter The Wu-Tang (36 Chambers)',
 'Madvillainy',
 'The Wall

#### Artist Names

In [44]:
artists = []
for div in soup_1.find_all("div", {"class": "card card_large float_fix shortcut_navigable"}):
    for elem in div.find_all("h5"):
        artists.append(elem.get_text().replace("\n",""))
artists

['Daft Punk',
 'Kendrick Lamar',
 'Michael Jackson',
 'Pink Floyd',
 'Fleetwood Mac',
 'Pink Floyd',
 'Jack White (2)',
 'Prince And The Revolution',
 'Nirvana',
 'Neil Young',
 'David Bowie',
 'Eagles',
 'Bruce Springsteen',
 'Arctic Monkeys',
 'Michael Jackson',
 'Dire Straits',
 'Crosby, Stills, Nash & Young',
 'The Cars',
 'Kanye West',
 'The War On Drugs',
 'Supertramp',
 'Sufjan Stevens',
 'Simon and Garfunkel*',
 'Pink Floyd',
 'Dr. Dre',
 'Fleet Foxes',
 'Oasis (2)',
 'Billy Joel',
 'Michael Jackson',
 'Pink Floyd',
 'Pink Floyd',
 'Bon Iver',
 'The Beatles',
 'Tool (2)',
 'Led Zeppelin',
 'Carole King',
 'Daft Punk',
 'Johnny Cash',
 'Various',
 'Massive Attack',
 'Radiohead',
 'Amy Winehouse',
 'Peter Frampton',
 'Wu-Tang Clan',
 'Doom* And Madlib - Madvillain',
 'Pink Floyd',
 'Foreigner',
 'The Beatles',
 'Neutral Milk Hotel',
 'Billy Joel',
 'Led Zeppelin',
 'David Bowie',
 'Simon & Garfunkel',
 'Huey Lewis And The News*',
 'John Williams (4), The London Symphony Orchestra

In [46]:
pages_parsed = []
song_names = []
artists = []

for i in range(len(pages)):
    # parse all pages
    pages_parsed.append(BeautifulSoup(pages[i].content, "html.parser"))
    # select only the info about the movies
    #discogs_html = pages_parsed[i]
    for div in soup_1.find_all("div", {"class": "card card_large float_fix shortcut_navigable"}):
        for elem in div.find_all("h4"):
            song_names.append(elem.get_text().replace("\n",""))
    for div in soup_1.find_all("div", {"class": "card card_large float_fix shortcut_navigable"}):
        for elem in div.find_all("h5"):
            artists.append(elem.get_text().replace("\n",""))
        
        


# Checking our output:
print(len(song_names)) # output: 
print(len(artists))  # output: 

2000
2000


In [48]:
not_hot_songs = pd.DataFrame({"song_title":song_names,
                           "artist":artists})
not_hot_songs

Unnamed: 0,song_title,artist
0,Random Access Memories,Daft Punk
1,"Good Kid, M.A.A.d City",Kendrick Lamar
2,Thriller,Michael Jackson
3,The Dark Side Of The Moon,Pink Floyd
4,Rumours,Fleetwood Mac
...,...,...
1995,Simon And Garfunkel's Greatest Hits,Simon & Garfunkel
1996,Stop Making Sense,Talking Heads
1997,Desire,Bob Dylan
1998,Van Halen II,Van Halen


In [51]:
# If i wanted just a random fraction of theese. Not_hot_songs.sample(frac=0.2, replace=False, random_state=1)

In [49]:
#Savr them into a csv at the nd top_100_df.to_csv("hot100.csv",index=False)

In [169]:
!ls

[34mLab_2[m[m                   Web_scraping_Lab1.ipynb
README.md               hot100.csv
