### Research Ideas:
- Predict (collector) vinyl price
- https://blog.discogs.com/en/discogs-top-100-most-expensive-records/
- limits: only 100 data points
- can we obtain # copies/presses
- The goal isn't to analyze why recs have gone up in general, but if collector, what are some attributes that may predict the price. OR we can ask what are some attributes that predict the RATINGS of expensive recs

**UPDATED** 
**DISCOGS**
- most expensive lists: https://www.discogs.com/lists?list=expensive+items&page=2
- use this to obtain all years (2010-2019)
- example of month (jan2010): https://www.discogs.com/lists/Most-expensive-items-sold-in-Discogs-Marketplace-for-October-2010/140095
- example of album within month (jan2010): https://www.discogs.com/La-Monte-Young-Drift-Study-43740-50950-PM-5-VIII-68-NYC/release/1512276
- example of 100 expensive from archives: http://web.archive.org/web/20180502225137/https://blog.discogs.com/en/discogs-top-100-most-expensive-records/
- graph of 10 years: https://blog.discogs.com/en/discogs-top-100-most-expensive-records/

**MISC MUSIC**
- vinylfactory: https://thevinylfactory.com/features/online-tools-for-record-collectors/
- links to spotify: http://www.disconest.com/
- discogs misc: https://web.archive.org/web/20210106081812/https://blog.discogs.com/en/
- discogs misc: https://blog.discogs.com/en/vinyl-record-price-guide/
- data: https://www.discogs.com/developers#page:user-collection
- data: https://data.discogs.com/?prefix=data/2020/
- pitchfork: https://pitchfork.com/reviews/best/reissues/?page=1
- spotify github: https://github.com/nsgrantham/pitchfork-reviews
- spotify kaggle: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
- spotify medium: https://towardsdatascience.com/step-by-step-to-visualize-music-genres-with-spotify-api-ce6c273fb827
- discogs (full code): http://www.diva-portal.org/smash/get/diva2:1443317/FULLTEXT01.pdf

**NEW LIST - ALL ITEMS. Need to pare down to something specific**
**to consider:**
- vintage vinyl years?
- jazz only?
- certain format?
- create ranking of how many times an arist shows up in list
- differential in haves and wants?
- price history?

- dicogs: lists: https://www.discogs.com/user/discogs/lists?header=1
- https://medium.com/@kdavis7190/vinyl-resale-price-prediction-6cb0adaedcb9
- https://github.com/kdavis01/projects/blob/master/vinyl_resale_regression/Data_Gathering.ipynb

**For Write up:**
- https://dgmono.com/2014/01/08/perspective-collecting-vintage-jazz-vinyl-a-labor-of-love/


## I. setup

In [54]:
from bs4 import BeautifulSoup
import requests

## allows us to use reg expressions to search fields
import re
#runtime_regex = re.compile('Run')
#soup.find(text=runtime_regex)
import pandas as pd
from fake_useragent import UserAgent

### Function 1: for PRICE only
- this is different because it takes the first instance of the value in the field name
- may need to create a separate function for cleaning this field (no $)

In [1]:
def get_album_price(soup, field_name):
    
    '''
    Grab a value from discogs archive HTML
    Takes a string attribute of an album on the page and returns the string
    We can go directly to the field name using the class element
    
    This will be used for:
        . have, wants, rating, numrating, price
    '''
    if soup.find_all(class_=field_name):
        obj = soup.find_all(class_=field_name)[0].text
    #print(obj)
        return obj
    else:
        obj=float('NaN')
        return obj
    
    #if not obj: 
    #    return None
    
    # this works for most of the values
#     next_element = obj.findNext()
#     print(next_element)
    
#     if next_element:
#         return next_element.text 

#     else:
#         return None

### Function 2: for regular expression items
- the majority of our variables will process through this

In [2]:
def get_album_txt(soup, field_name):
    
    '''Grab a value from discogs archive HTML
    
    Takes a string attribute of an album on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    print(obj)
    if not obj: 
        return None
    
    # this works for most of the values
    next_element = obj.findNext()
    print(next_element)
    if next_element:
        return next_element.text 

    else:
        return None

#### Diagnostics below only work if soup is defined

In [None]:
# rated

rating = get_album_txt(soup,'Rated').strip()
#rated = get_album_txt(soup,'Rated').strip()
print(rating)

In [None]:
# ratings
gen = get_album_txt(soup,'Genre:').strip()
print(gen)

In [None]:
sleeve=soup.find(text=re.compile('Sleeve:..VG'))

#sleeve = get_album_txt(_soup,'Sleeve:').strip()
print(sleeve)

### Function 3: Grab votes, have, want
- performed parsing of votes field into haves/wants

In [3]:
def get_album_votes(soup, field_name, rtype ):
    
    '''Grab a value from discogs archive HTML
       Takes info in votes containing #votes, haves, wants
    '''    
    obj = soup.find(text=re.compile(field_name))
    #return obj

    if not obj:
        return None
    
    
#     if obj=='No Rating Yet':
#         return None
        
    splitvotes = obj.split('(')[0].strip()
    #print(splitvotes)
    #return splitvotes
    
    # parse votes
    if rtype == 'vote':
        votes = splitvotes.split(' ',1)[1].split(' ',1)[0]
        #print(votes)
        return votes

    # parse haves
    elif rtype=='have':
        splithaves = obj.split('(')[1].strip()
        #print(splithaves)
        #return splithaves
        haves= splithaves.split(',',1)[0].split(' ',1)[0]
        #print(haves)
        return haves
    
    # parse wants
    elif rtype=='want':
        splitwants = obj.split('(')[1].split(',',1)[1].strip()
        #print(splitwants)
        wants = splitwants.split(' ',1)[0]
        #print(wants)
        return wants
        
    else:
        return None

### Function 4: Sleeve condition

In [4]:
def get_album_sleeve(soup, field_name):

    # Condition is a way to get to sleeve
    _sl  = soup.find(text=re.compile(field_name))
    _sl2 =_sl.findNext().find_all('p')[1].text
    sleeve=_sl2.split(' ',0)[0].split(':',1)[1].strip()
    #print(sleeve)
    return sleeve


#### Diagnostic works only if soup is defined

In [None]:
# string of have/wants, etc. they need to be parsed
votestr = get_album_votes(_soup,'votes','want')
print(votestr)

#### Create helper functions to parse strings into appropriate data types_
- The returned values all need a bit of formatting before we can work with this data.  Here are a few helper functions.

In [5]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(float(moneystring))

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date

## RANDOM TIMER (if necessary)

In [49]:
import random
import time

# append the page number


In [None]:
# 

### II. STARTING POINT
- enter url of page with list depending on filters

In [60]:
# Jazz bop only
url='https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&price=over40&genre=Jazz&currency=USD&style=Hard+Bop&page=1'
response=requests.get(url)
response.status_code

200

In [None]:
page = response.text

In [None]:
soup = BeautifulSoup(page, "lxml")

In [None]:
print(soup.prettify())

#### obtain all rows of data, eventually parsing out links per album

In [None]:
rows = [row for row in soup.find(class_='table_block mpitems push_down table_responsive')
        .find_all('tr',class_='shortcut_navigable')]  # tr tag is for rows

In [None]:
# We are only looking for the LINK. and inside the link we'll pull data
rows

### Full code to scrape each page and link
- Extract list of links for each album/row of data 
- gives specific links to album pages

In [66]:
# Jazz bop only
# This is where you can also set your timers

from fake_useragent import UserAgent

n=2
#def discogsall():
#page_list = ['2']
page_list = [str(i) for i in str(n)]

r = "https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=100&year1=1900&year2=1970&format=Vinyl&price=over40&genre=Jazz&currency=USD&style=Hard+Bop&page="

# initialize final album list
coveralbinfo = [] # albumlist
inneralbinfo = [] # alb_page_info_list
    
for page in range(1,3): #page_list:
        
    url= r + str(page) + "#more%3Dyear" 
    
    ua = UserAgent()
    user_agent = {'User-agent': ua.random}
    #response  = requests.get(url, headers = user_agent)    

    response=requests.get(url,headers=user_agent)
    spage = response.text
    soup = BeautifulSoup(spage, "lxml")

    #time.sleep(.5+2*random.random())

    rows = [row for row in soup.find(class_='table_block mpitems push_down table_responsive')
            .find_all('tr',class_='shortcut_navigable')]  # tr tag is for rows
    
    '''
    ----------------------------------------------------------
    2. Call function for grabbing data from each row
    ----------------------------------------------------------    
    '''
    albumlist = {}

    for row in rows[0:]:

        '''
        items are the number of items within a td block (columns).
        there are 6 items in a block
        '''
        items = row.find_all('td')
        #print(len(items))
        #print(items)

        #link = row.find('a')['href']    
        # just take the first item in the td block whihch is link
        link = items[1].find('a')['href']
        #print(link)

        # for title, take the full artist+title for uniqueness. otherwise, dictionary will get unique artist only
        _link = items[1].find('a')

        #arttitle= _link.text #.split('-')[0].strip()

        artist= _link.text.split('-')[0].strip()

        #title1 = title_string.split('-')[1].strip()

        title1 = _link.text.split('-')[1].strip()
        title  = title1.split('(')[0].strip()

        arttitle=artist+" - "+title
        #print("art+title",arttitle,"art",artist)

        # sleeve
        if items[1].find('span',class_="item_sleeve_condition"):
            sleeve=items[1].find('span',class_="item_sleeve_condition").text.strip()
            #print(sleeve)
        else:
            sleeve=''
        
        albumlist[arttitle] = [link] + [artist] + [title] + [sleeve]
        #print(albumlist)
        #albumlist[title] = [link] + [i.text for i in items]
    
        ## END FOR LOOP (COVER)
    
    #print(albumlist)
    alb = pd.DataFrame(albumlist).T
    alb.columns = ['link_stub','artist','title','sleeve']
    coveralbinfo.append(albumlist)
    '''
    ----------------------------------------------------------
    3. Call function for grabbing data from each album
    ----------------------------------------------------------
    '''

    # LINK_STUB is the added parts to the URL
    alb_page_info_list = []

    for link in alb.link_stub:
        alb_page_info_list.append(get_album_dict(link))

        # This is your end product of info per page. you want this to be unique and set aside so it's appended to the next pages    
        #print(alb_page_info_list)
        
    #final_album_list.append(alb_page_info_list)
    
    
inneralbinfo.append(alb_page_info_list)
#coveralbinfo.append(albumlist)

print("Page Number:",page)

time.sleep(2+2*random.random())

        # convert dictionary to DF
    #     alb_page_info[page] = pd.DataFrame(alb_page_info_list)  #convert list of dict to df
    #     alb_page_info[page].set_index('arttitle', inplace=True)
    #     alb_page_info[page]
    
    

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4041

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4041

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4169

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4169

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/639322-Jazz-Line-2">Jazz Line (2)</a> ‎– JAZ-33-01

                </div>
Label:
<div class="content">
<a href="/label/639322-Jazz-Line-2">Jazz Line (2)</a> ‎– JAZ-33-01

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
          

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4163

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4163

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, Album, Mono, LP
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, Album

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1544

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1544

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4017

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4017

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono, <i>Microgroove</i>
<br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                             

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4023

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4023

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4048

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4048

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4032

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4032

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4056

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4056

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/34094-Riverside-Records">Riverside Records</a> ‎– RLP 12-280

                </div>
Label:
<div class="content">
<a href="/label/34094-Riverside-Records">Riverside Records</a> ‎– RLP 12-280

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                             

Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7038

                </div>
Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7038

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP,

Label:
<div class="content">
<a href="/label/76075-New-Jazz">New Jazz</a> ‎– NJLP-8236

                </div>
Label:
<div class="content">
<a href="/label/76075-New-Jazz">New Jazz</a> ‎– NJLP-8236

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP,

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1521

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1521

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Compilation, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl,

Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7047

                </div>
Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7047

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP,

Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– prLP 186

                </div>
Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– prLP 186

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, 10", Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, 

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– 1554, 
            <a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1554

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– 1554, 
            <a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1554

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                      

Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7157, 
            <a href="/label/19591-Prestige">Prestige</a> ‎– 7157

                </div>
Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7157, 
            <a href="/label/19591-Prestige">Prestige</a> ‎– 7157

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                             

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1539

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1539

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Repress, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vin

Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– 7084, 
            <a href="/label/19591-Prestige">Prestige</a> ‎– LP 7084, 
            <a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7084

                </div>
Label:
<div class="content">
<a href="/label/19591-Prestige">Prestige</a> ‎– 7084, 
            <a href="/label/19591-Prestige">Prestige</a> ‎– LP 7084, 
            <a href="/label/19591-Prestige">Prestige</a> ‎– PRLP 7084

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                 

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1581

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 1581

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4147

                </div>
Label:
<div class="content">
<a href="/label/281-Blue-Note">Blue Note</a> ‎– BLP 4147

                </div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, Album, Mono
            <br/>
</div>
Format:
<div class="content">
                                                
                                                                                                                                                                                                                                                                            
            Vinyl, LP, A

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None


In [67]:
coveralbinfo

[{'Tina Brooks - True Blue': ['/sell/item/1244267034',
   'Tina Brooks',
   'True Blue',
   'Near Mint (NM or M-)'],
  'Cliff Jordan* & John Gilmore - Blowing In From Chicago': ['/sell/item/1146903229',
   'Cliff Jordan* & John Gilmore',
   'Blowing In From Chicago',
   'Very Good (VG)'],
  'Freddie Hubbard - Hub': ['/sell/item/1105531603',
   'Freddie Hubbard',
   'Hub',
   'Good (G)'],
  'Curtis Fuller - The Opener': ['/sell/item/1134421976',
   'Curtis Fuller',
   'The Opener',
   'Very Good (VG)'],
  'Lee Morgan - Search For The New Land': ['/sell/item/1069322934',
   'Lee Morgan',
   'Search For The New Land',
   'Very Good (VG)'],
  "Cannonball Adderley - Somethin' Else": ['/sell/item/1281156432',
   'Cannonball Adderley',
   "Somethin' Else",
   'Very Good (VG)'],
  'Hank Mobley With Donald Byrd And Lee Morgan - Hank Mobley With Donald Byrd And Lee Morgan': ['/sell/item/786354043',
   'Hank Mobley With Donald Byrd And Lee Morgan',
   'Hank Mobley With Donald Byrd And Lee Morgan'

In [68]:
#inneralbinfo[0][:1:-1]
inneralbinfo

[[{'arttitle': 'The John Coltrane Quartet - Africa / Brass',
   'rlabel': 'Impulse!',
   'rformat': 'Vinyl, LP, Album, Mono, Gatefold',
   'country': 'US',
   'release_dt': datetime.datetime(1961, 9, 15, 0, 0),
   'genre': 'Jazz',
   'style': 'Free Jazz, Hard Bop, Modal',
   'rating': 4.64,
   'media': 'Very Good Plus (VG+)',
   'price': '$500.00',
   'votes': 42,
   'haves': 428,
   'wants': 361},
  {'arttitle': 'Thad Jones - Frank Wess',
   'rlabel': 'Prestige',
   'rformat': 'Vinyl, LP, Album, Mono',
   'country': 'US',
   'release_dt': datetime.datetime(1957, 1, 15, 0, 0),
   'genre': 'Jazz',
   'style': 'Hard Bop',
   'rating': 4.5,
   'media': 'Very Good Plus (VG+)',
   'price': '$587.99',
   'votes': 14,
   'haves': 58,
   'wants': 236},
  {'arttitle': 'Kenny Drew Quintet / Quartet* - This Is New',
   'rlabel': 'Riverside Records',
   'rformat': 'Vinyl, LP, Album, Mono',
   'country': 'US',
   'release_dt': datetime.datetime(1957, 1, 15, 0, 0),
   'genre': 'Jazz',
   'style': 'H

### Convert both of these dictionaries to DF and MERGE

In [None]:
print(len(inneralbinfo[0]))

### Function: Pull data from each row in album list

In [50]:
def pullalb():
    albumlist = {}

    for row in rows[0:]:

        '''
        items are the number of items within a td block (columns).
        there are 6 items in a block
        '''
        items = row.find_all('td')
        #print(len(items))
        #print(items)

        #link = row.find('a')['href']    
        # just take the first item in the td block whihch is link
        link = items[1].find('a')['href']
        #print(link)

        # for title, take the full artist+title for uniqueness. otherwise, dictionary will get unique artist only
        _link = items[1].find('a')

        #arttitle= _link.text #.split('-')[0].strip()

        artist= _link.text.split('-')[0].strip()

        #title1 = title_string.split('-')[1].strip()

        title1 = _link.text.split('-')[1].strip()
        title  = title1.split('(')[0].strip()

        arttitle=artist+" - "+title
        #print("art+title",arttitle,"art",artist)

        # sleeve
        if items[1].find('span',class_="item_sleeve_condition"):
            sleeve=items[1].find('span',class_="item_sleeve_condition").text.strip()
            #print(sleeve)
        else:
            sleeve=''
        
        albumlist[arttitle] = [link] + [artist] + [title] + [sleeve]
        #print(albumlist)
        #albumlist[title] = [link] + [i.text for i in items]

    albumlist

    

### Scraping Multiple Pages

Now that we have the links for album lists we can visit each link to extract even more information about each record. 

### Create DF of dictionary above and Transpose

In [None]:
# you put into the function so remove
# alb = pd.DataFrame(albumlist).T
# alb.columns
# alb.columns = ['link_stub','artist','title','sleeve']
# alb

### Function: Use full list of records to scrape individual record info

In [61]:
def get_album_dict(link):
    '''
    From discogs link stub, request html, parse with BeautifulSoup, and
    collect 
        - artist + title        
        - label
        - format
        - country
        - released
        - genre
        - style
        - haves
        - wants
        - avg rating
        - rates        
        - price
        - media
        - sleeve
        
    Return information as a dictionary.
    '''
    
    base_url = 'https://www.discogs.com'
    
    
    #Create full url to scrape
    url = base_url + link
    #print(url)
    
    ua = UserAgent()
    user_agent = {'User-agent': ua.random}
    #response  = requests.get(url, headers = user_agent)    
    
    #Request HTML and parse
    response = requests.get(url,headers=user_agent)
    page = response.text
    soup = BeautifulSoup(page,"lxml")

    
    headers = ['arttitle',
               'rlabel',
               'rformat',
               'country',
               'release_dt',
               'genre',
               'style',
               'rating',
               'media',
               #'sleeve', 
               'price',
               'votes',
               'haves',
               'wants'
              ]
    
    #Get title
    title_string = soup.find('title').text    
    artist = title_string.split('-')[0].strip()    
    title1 = title_string.split('-')[1].strip()
    title  = title1.split(':')[0].strip()
    
    arttitle = artist+" - "+title
#     arttitle= _link.text #.split('-')[0].strip()
#     artist= _link.text.split('-')[0].strip()
#     title = _link.text.split('-')[1].strip()
    

    #Get label (unicode split - need to figure out how to split)
    #rlabel = get_album_txt(soup,'Label:').split(r'-')[0].strip()
    
    if get_album_txt(soup,'Label:'):
        rlabel = get_album_txt(soup,'Label:').split('\u200e')[0].strip()
    else:
        rlabel=''
    
    #Get format
    if get_album_txt(soup,'Format:'):
        rformat = get_album_txt(soup,'Format:').strip()
    else:
        rformat=''
    #print(rformat)
    
    #Get country    
    if get_album_txt(soup,'Country:'):        
        country = get_album_txt(soup,'Country:').strip()
    else:
        country=''
    #print(country)
    
    #Get release date
    if get_album_txt(soup,'Released:'):
        release_dt = get_album_txt(soup,'Released:').strip()
        release_dt = to_date(release_dt)
    else:
        release_dt =''
    #print(release_dt)
    
    #Get genre
    if get_album_txt(soup,'Genre:'):
        genre = get_album_txt(soup,'Genre:').strip()
    else:
        genre=''
    #print(genre)
    
    #Get style
    if get_album_txt(soup,'Style:'):
        style = get_album_txt(soup,'Style:').strip()
    else:
        style=''
    #print(style)
    
    #Get rating    
    _rating = get_album_txt(soup,'Rated')
    if _rating:
        rating= float(_rating)
    else:
        rating=float('NaN')
    
    #Media
    if get_album_txt(soup,'Media:'):
        media = get_album_txt(soup,'Media:').strip()
    else:
        media=''
    #print(media)
    
    #Sleeve (use 'condition' for sleeve condition)
#     sleeve = get_album_sleeve(soup,'Condition').strip()
#     print(sleeve)
    
    #Price (remove $)
    #if get_album_price(soup,'price'):
    price = get_album_price(soup,'price')
    #price = money_to_int(_price)
    #else:
    #    price = float('Nan')
    #print(price)
    
    # votes 
    if get_album_votes(soup,'votes','vote'):
        _votes= get_album_votes(soup,'votes','vote')    
        votes=int(float(_votes))
    else:
        votes=float('NaN')

    # haves
    if get_album_votes(soup,'votes','have'):
        _haves= get_album_votes(soup,'votes','have')    
        haves=int(float(_haves))
    else:
        haves=float('NaN')
        
    # wants
    if get_album_votes(soup,'votes','want'):
        _wants= get_album_votes(soup,'votes','want')    
        wants=int(float(_wants))
    else:
        wants=float('NaN')
    
    #Create album dictionary and return
    album_dict = dict(zip(headers, [arttitle,
                                    rlabel,
                                    rformat,
                                    country,
                                    release_dt,
                                    genre,
                                    style ,
                                    rating, 
                                    media,
                                    #sleeve,
                                    price,
                                    votes,
                                    haves,
                                    wants
                                   ]))
    return album_dict

    time.sleep(2+2*random.random())

#### Call the function to pull data

In [None]:
# LINK_STUB is the added parts to the URL
alb_page_info_list = []

for link in alb.link_stub:
    alb_page_info_list.append(get_album_dict(link))

## For Loop to pull data

In [None]:
import random
import time


# 1. get list of 100 items into row and create df

# append the page number
page_list = ['1','2']

for page in page_list:
    
    url= "https://www.discogs.com/sell/list?sort=listed%2Casc&currency=USD&limit=25&page=" + str(page) + "#more%3Dyear"
    response=requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "lxml")
    #print(page)
    
    time.sleep(.5+2*random.random())

### This is your dictionary of x albums for data

In [None]:
alb_page_info_list

### III. bringing together data

In [None]:
alb_page_info = pd.DataFrame(alb_page_info_list)  #convert list of dict to df
alb_page_info.set_index('arttitle', inplace=True)

alb_page_info

(Note: the rating is indeed missing from a few of these pages!  How could you fix that?)

We can now match this back up with the movie information collected from the table by merging these dataframes.

In [None]:
alb_page_info.columns

In [None]:
alb.shape

### Merge DF together

In [None]:
alb_mrg = alb.merge(alb_page_info, left_index=True, right_index=True)

alb_mrg

In [None]:
alb_mrg.to_csv('discog2.csv')

## TESTING

### List page
- this page gives a list of 250 records based on teh filter criteria used
- this is only used for testing on code below

In [None]:
#url='https://www.discogs.com/La-Monte-Young-Drift-Study-43740-50950-PM-5-VIII-68-NYC/release/1512276'
url='https://www.discogs.com/sell/list?sort=price%2Cdesc&limit=250&year1=1900&year2=1970&format=Vinyl&price=over40&genre=Jazz&currency=USD'
response=requests.get(url)
response.status_code

In [None]:
page = response.text

In [None]:
soup = BeautifulSoup(page, "lxml")

In [None]:
print(soup.prettify()) ## Makes the code nice

#### .find gives you FIRST instance of tag

In [None]:
soup.find('a')

#### .find_all gives ALL instances. you can select instance

In [None]:
soup.find_all('a')[-1]

#### Extracting things like HREF from an element in HTML

In [None]:
testing = soup.find_all('a')[-1]

In [None]:
testing.get("href")

### Test on individual record
- this gives an idea of each record that is clicked and the info in it

In [None]:
url='https://www.discogs.com/sell/item/1105531603'
response=requests.get(url)
response.status_code

In [None]:
_page = response.text

In [None]:
_soup = BeautifulSoup(_page, "lxml")

In [None]:
#sleeve=_soup.find(text=re.compile('Condition'))
#next_element = sleeve.findNext()

#sleeve = get_album_txt(_soup,'Sleeve:').strip()

#x=next_element.find_all('p')[1].text

#y=x.findNext()
#print(sleeve)
#print(next_element,x)
#print(x,y)
#print(x)

#y=x.split(' ',0)[0].split(':',1)[1].strip()


#print(y)
#_soup.find(class_='section_content').find_all('strong').text

# haves= splithaves.split(',',1)[0].split(' ',1)[0]
# print(haves)

_sleeve=_soup.find(text=re.compile('Condition'))
#_sleeve=_soup.find(class_='section_content')
#print(_sleeve)

#sleeve2=_sleeve.find_all(text=re.compile('Media:'))
#print('sleeve2',sleeve2)

_sleeve2=_sleeve.findNext(text=re.compile('Sleeve'))
print(_sleeve2)

x=_sleeve2.find_all('a')
print(x)

#_sleeve2=_sleeve.findNext().find('a')[1].text
#print(_sleeve2)

#_sleeve3=_sleeve2.find_all('p')[1].text

sleeve=_sleeve2.split(' ',0)[0].split(':',1)[1].strip()
#print(sleeve)


#### Based on function for album price

In [None]:
# string of have/wants, etc. they need to be parsed
price = get_album_price(soup,'price')
print(price)

In [None]:
#Get domestic gross
def get_album_txt2(soup, field_name):
    r = (soup.find_all(class_=field_name)[0]).text
    return r
get_album_txt2(soup,'price')   

In [None]:
# string of have/wants, etc. they need to be parsed
# potentially separate way to get ratings 
rateetc = get_album_price(soup,'rating_value_sm')
print(rateetc)

In [None]:
rs = soup.find(text=re.compile('Rated'))
next_element = rs
next_element2 = next_element.findNext()    
print(next_element2)

#### Votes gives us the info we need for votes and haves/wants

In [None]:
print(soup.find(text=re.compile('votes')))

In [None]:
# country
country = get_album_txt(soup,'Country:').strip()
print(country)

In [None]:
# media condition
media = get_album_txt(soup,'Media:').strip()
print(media)

In [None]:
sleeve = get_album_txt(soup,'Sleeve:').strip()
print(sleeve)

In [None]:
# media condition
rlabel = get_album_txt(soup,'Label:').strip()
print(rlabel)

In [None]:
# votes, haves, wants
votes = get_album_txt(soup,'votes').strip()
print(votes)

### Testing : get votes, haves wants
- this one works but refined futher below

In [None]:
# def get_album_votes(soup, field_name):
    
#     '''Grab a value from discogs archive HTML
#        Takes info in votes containing #votes, haves, wants
#     '''    
#     obj = soup.find(text=re.compile(field_name))
#     return obj
#     if not obj: 
#         return None
    

In [None]:
# string of have/wants, etc. they need to be parsed
votestr = get_album_votes(_soup,'votes')
print(votestr)

#### Testing of parsing individually

In [None]:
splitvotes = votestr.split('(')[0].strip()
print(splitvotes)

# parse votes
votes = splitvotes.split(' ',1)[1].split(' ',1)[0]
print(votes)

# parse haves
splithaves = votestr.split('(')[1].strip()
print(splithaves)

haves= splithaves.split(',',1)[0].split(' ',1)[0]
print(haves)

# parse wants
splitwants = votestr.split('(')[1].split(',',1)[1].strip()
wants = splitwants.split(' ',1)[0]
print(wants)

#### Testing to get artist, title

In [None]:
#Get title
title_string = soup.find('title').text
artist = title_string.split('-')[0].strip()
title1 = title_string.split('-')[1].strip()
title2 = title1.split(':')[0].strip()
print(artist,title2)

### LECTURES:
- linear regression
- cross validation<br>
    . 80% training<br>
    . 20% testing<br>
- categorical (one hot encoding)
- continuous - standardize (z scores?)


#### testing how this function works

looping through append put replace 
tutorilal - section 
webmojo - base url - indiv 
made list of extra part of link appended

tutorial scrape
. 

for loop for indiidual pages
list of ind link stub
. for loop -f


