## Evaluating Jazz album prices on Discogs.com
**Program:** 01_scrape_data.ipynb <br>
**Created by:** Chris Chan<br>
**Date:** Jan 14, 2021<br>
**Purpose:** Using selenium, scrape jazz vinyl data from discogs.com<br>
**Key Features:** Price, media condition, sleeve condition, release date, artist, album, record label, haves, wants, ratings, etc.

### Research Ideas:
- Predict (collector) vinyl price
- https://blog.discogs.com/en/discogs-top-100-most-expensive-records/
- limits: only 100 data points
- can we obtain the number of copies/pressings?
- The goal isn't to analyze why records have gone up in price in general. Rather, for collector records, which attributes may predict the price? Alternatively, which attributes predict the RATINGS of expensive records?

**UPDATED** 
**DISCOGS**
- most expensive lists: https://www.discogs.com/lists?list=expensive+items&page=2
- use this to obtain all years (2010-2019)
- example of month (Oct 2010): https://www.discogs.com/lists/Most-expensive-items-sold-in-Discogs-Marketplace-for-October-2010/140095
- example of album within month (jan2010): https://www.discogs.com/La-Monte-Young-Drift-Study-43740-50950-PM-5-VIII-68-NYC/release/1512276
- example of 100 expensive from archives: http://web.archive.org/web/20180502225137/https://blog.discogs.com/en/discogs-top-100-most-expensive-records/
- graph of 10 years: https://blog.discogs.com/en/discogs-top-100-most-expensive-records/

**MISC MUSIC**
- vinylfactory: https://thevinylfactory.com/features/online-tools-for-record-collectors/
- links to spotify: http://www.disconest.com/
- discogs misc: https://web.archive.org/web/20210106081812/https://blog.discogs.com/en/
- discogs misc: https://blog.discogs.com/en/vinyl-record-price-guide/
- data: https://www.discogs.com/developers#page:user-collection
- data: https://data.discogs.com/?prefix=data/2020/
- pitchfork: https://pitchfork.com/reviews/best/reissues/?page=1
- pitchfork github: https://github.com/nsgrantham/pitchfork-reviews
- spotify kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
- spotify medium: https://towardsdatascience.com/step-by-step-to-visualize-music-genres-with-spotify-api-ce6c273fb827
- discogs (full code): http://www.diva-portal.org/smash/get/diva2:1443317/FULLTEXT01.pdf

**NEW LIST - ALL ITEMS. Need to pare down to something specific**
**to consider:**
- vintage vinyl years?
- jazz only?
- certain format?
- create a ranking of how many times an artist shows up in the list
- differential in haves and wants?
- price history?
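
Two of the ideas above, the artist ranking and the haves/wants differential, can be sketched with the standard library alone. This is a minimal illustration, not the analysis used in this project: the `rows` tuples are placeholder values (not scraped data), and it assumes haves and wants have already been parsed into integers.

```python
from collections import Counter

# Hypothetical rows: (artist, haves, wants). Placeholder values, not scraped data.
rows = [
    ("Artist A", 120, 300),
    ("Artist B", 80, 40),
    ("Artist A", 15, 90),
    ("Artist C", 200, 210),
]

# Rank artists by how many times they appear in the list.
artist_counts = Counter(artist for artist, _, _ in rows)
ranking = artist_counts.most_common()

# Demand gap per listing: wants minus haves
# (positive means more people want the record than own it).
demand_gap = [(artist, wants - haves) for artist, haves, wants in rows]
```

A positive gap could serve as a rough scarcity signal when modeling price, though whether it actually predicts price is an open question for the analysis.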

- discogs lists: https://www.discogs.com/user/discogs/lists?header=1
- https://medium.com/@kdavis7190/vinyl-resale-price-prediction-6cb0adaedcb9
- https://github.com/kdavis01/projects/blob/master/vinyl_resale_regression/Data_Gathering.ipynb

**For Write up:**
- https://dgmono.com/2014/01/08/perspective-collecting-vintage-jazz-vinyl-a-labor-of-love/
- miles davis - kind of blue will change your life - reasons for grabbing more expensive records ($25 for a reissue these days)
    - or you can just find avg price of jazz records in general
    

This is the program I ultimately used to scrape Discogs data. My previous attempts using BeautifulSoup kept running into limitations, so I went the Selenium route. Some of this code was borrowed from a previous Metis project found on GitHub, which gave me a foundation for a workable solution that scraped the data with fewer issues. For the BeautifulSoup version, please see /pgms/scratch_vinyl2.ipynb

### I. setup

In [67]:
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

# regular expressions, used to search fields
import re

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

from selenium.webdriver.support import expected_conditions as EC

import random
import time
import csv
import os

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

# user agent
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent

options = Options()
ua = UserAgent()
userAgent = ua.random
print(userAgent)
options.add_argument(f'user-agent={userAgent}')

#driver = webdriver.Chrome(chrome_options=options)


Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36


### II. Selenium to scrape album contents

In [68]:
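# Note: this cell uses the Selenium 3 API (chrome_options=, find_element(s)_by_xpath);
# Selenium 4 replaces these with options=/Service and find_element(By.XPATH, ...).
driver = webdriver.Chrome(chromedriver,chrome_options=options)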

# update the page here to start where left off. Page is essentially the count as well
pagestart=21
pageend=34

#_url = "https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&price=over40&genre=Jazz&currency=USD&style=Hard+Bop&page="
# not just bebop
_url = "https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&genre=Jazz&currency=USD&style=Hard+Bop&price=20to40&page="
url= _url + str(pagestart) + "#more%3Dyear" 

# starting url
#driver.get("https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&price=over40&genre=Jazz&currency=USD&style=Hard+Bop&page=3#more%3Dyear")
driver.get(url)


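# NOTE: mode 'w' truncates recorddata.csv on every run; move or rename earlier output before resuming from a later pagestart
with open('recorddata.csv', 'w',newline='') as csvfile: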
    file = csv.writer(csvfile)
    # make headers
    file.writerow(['Artist_Album', 'Label', 'Country', 'Format', 'Notes', 'Genre', 'Release_Date', 'Style', 'Rate_Haves_Wants', 'Media_Condition',
                     'Sleeve_Condition', 'Seller_Rating', 'Recorded_at', 'Pressed_by', 'Price'])
    count = pagestart #3
    
    while count < pageend: # 7:
        
        # find all links on album marketplace page and store in list

        result_elements = '//a[contains(@href, "/sell/item/")]'

        albums = []

        albumdriver = driver.find_elements_by_xpath(result_elements)

        for link in albumdriver:
            albums.append(link.get_attribute('href'))
        
        # get rid of duplicates

        albumsclean = [album.split('?', 1)[0] for album in albums]
        albumurls = set(albumsclean)

        # scrape info from album page

        for album in albumurls:
            
            driver.get(album)

            soup = BeautifulSoup(driver.page_source, 'html.parser')

            try:
                artist_album = driver.find_element_by_xpath("//h1[contains(@id, 'profile_title')]").text.strip()
            except:
                artist_album = np.nan
        
            ## cc: label
            try:
                rlabel = driver.find_element_by_xpath("//div[contains(text(), 'Label:')]/following-sibling::div").text.strip()
            except:
                rlabel = np.nan
                
            ## cc: country
            try:
                country = driver.find_element_by_xpath("//div[contains(text(), 'Country:')]/following-sibling::div").text.strip()
            except:
                country = np.nan    
            
            ## cc: format
            try:
                rformat = driver.find_element_by_xpath("//div[contains(text(), 'Format:')]/following-sibling::div").text.strip()
            except:
                rformat = np.nan    
                
            ## cc: notes
            try:
                notes = driver.find_element_by_xpath("//h3[contains(text(), 'Notes')]/following-sibling::div").text.strip()
            except:
                notes = np.nan    
                        
            
            try:
                genre = driver.find_element_by_xpath("//div[contains(text(), 'Genre:')]/following-sibling::div").text.strip()
            except:
                genre = np.nan

            try:
                release_date = driver.find_element_by_xpath("//div[contains(text(), 'Released:')]/following-sibling::div").text.strip()
            except:
                release_date = np.nan

            try:
                style = driver.find_element_by_xpath("//div[contains(text(), 'Style:')]/following-sibling::div").text.strip()
            except:
                style = np.nan

            try:
                rate_haves_wants = driver.find_element_by_xpath("//a[contains(@class, 'button-blue')]/following-sibling::div").text.strip()
            except:
                rate_haves_wants = np.nan

            try:
                m_condition = driver.find_element_by_xpath("//strong[contains(text(), 'Media:')]/following-sibling::span").text.strip()
            except:
                m_condition = np.nan
    
            try:
                sleeve = driver.find_element_by_xpath("//strong[contains(text(), 'Sleeve:')]")
                s_condition = sleeve.find_element_by_xpath('..').text.strip()
            except:
                s_condition = np.nan

            try:
                seller_rating = driver.find_element_by_xpath("//span[@class='star_rating']/following-sibling::strong").text.strip()
            except:
                seller_rating = np.nan

            # recorded at        
            try:
                recorded_at = driver.find_element_by_xpath("//span[contains(text(),'Recorded At')]/following-sibling::a").text.strip()
            except:
                recorded_at = np.nan    
                
            # pressed by
            try:
                pressed_by = driver.find_element_by_xpath("//span[contains(text(),'Pressed By')]/following-sibling::a").text.strip()
            except:
                pressed_by = np.nan    
    

            try:
                price = soup.find(class_='price').text.strip()
            except:
                price = np.nan

            observation = [artist_album, rlabel, country, rformat, notes, genre, release_date, style, rate_haves_wants, m_condition, 
                           s_condition, seller_rating, recorded_at, pressed_by, price]

            file.writerow(observation)

            time.sleep(.5+2*random.random())
            
        # go to next page in the marketplace
        
        #nextpage = "https://www.discogs.com/sell/list?sort=listed%2Casc&currency=USD&limit=25&page=" + str(count+1) + "&format=Vinyl"
        #nextpage = "https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&price=over40&genre=Jazz&currency=USD&style=Hard+Bop&page=" + str(count+1) + "#more%3Dyear"
        nextpage = "https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&genre=Jazz&currency=USD&style=Hard+Bop&price=20to40&page=" + str(count+1)
        driver.get(nextpage)
        
        count = count + 1

driver = webdriver.Chrome(chromedriver,chrome_options=options)
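
The scraping loop above repeats the same try/except pattern for every field. One possible way to condense it is a small helper that takes a zero-argument callable. `safe_text` is a hypothetical name introduced here for illustration, not part of the original notebook. It catches `Exception` rather than using a bare `except:`, so a `KeyboardInterrupt` can still stop a long scrape.

```python
import math


def safe_text(getter, default=float("nan")):
    """Call getter() and return its stripped text, or default if the lookup fails.

    getter is any zero-argument callable that returns a string,
    e.g. a lambda wrapping a Selenium element lookup.
    """
    try:
        return getter().strip()
    except Exception:
        # Missing element, stale page, etc.: fall back to the default (NaN).
        return default


# Usage inside the loop might look like this (Selenium 3 call, as in the cell above):
# genre = safe_text(lambda: driver.find_element_by_xpath(
#     "//div[contains(text(), 'Genre:')]/following-sibling::div").text)
```

The lambda defers the lookup so that any exception is raised inside the helper rather than at the call site.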

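
Because `pagestart` is meant to be updated to resume where a previous run left off, it may help to append to the CSV instead of overwriting it. The sketch below is one way to do that, assuming the header should be written only when the file is new or empty. `append_rows` is a hypothetical helper, not part of the original notebook.

```python
import csv
import os


def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


# Usage might look like this, collecting one page of observations before writing:
# append_rows("recorddata.csv", ["Artist_Album", "Label", "Price"], observations)
```

Writing one page at a time also means a crash partway through loses at most the current page of observations.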