back off spotify on errors; comparison fixes suggested by @iangilman

commit 752e52c0a587971a36d15fdc449e93b9ae69ed8c 1 parent efc663a
@edsu authored
Showing with 4,482 additions and 29,505 deletions.
  1. +1 −0  .gitignore
  2. +0 −32 README.rst
  3. +1,309 −4,091 aoty_cmp.csv
  4. +3,108 −25,368 aoty_cmp.json
  5. +63 −13 aoty_cmp.py
  6. +1 −1  aoty_csv.py
1  .gitignore
@@ -1,3 +1,4 @@
config.py
config.pyc
*.log
+*.pyc
32 README.rst
@@ -1,32 +0,0 @@
-aotycmp
-=======
-
-aotycmp is a hack to see what albums on Alf Eaton's `Albums of the Year (AOTY) <http://aoty.hubmed.org>`_ list of lists can be streamed from `Spotify <http://spotify.com>`_ and `Rdio <http://rdio.com>`_.
-
-There are a few JSON data files in this repository:
-
-* aoty.json - the full dump of AOTY scraped data
-* aoty_dedupe.json - the albums aggregated together
-* aoty_cmp.json - the results of looking up albums at rdio and spotify
-* aoty_cmp.csv - results suitable for import to a spreadsheet
-
-The steps for reproducing the results stored in aoty_cmp.json are to:
-
-#. pip install -r requirements.pip
-#. cp config.py.orig config.py
-#. get a Rdio API Key and put credentials in config.py
-#. ./aoty.py # crawls aoty.hubmed.org and stores data in aoty.json
-#. ./aoty_dedupe.py # dedupes albums across lists and stores in aoty_dedupe.json
-#. ./aoty_cmp.py # reads aoty_dedupe.json and stores results of rdio/spotify lookups in aoty_cmp.json
-#. ./aoty_csv.py # dump aoty_cmp.json as csv for spreadsheet
-
-Maybe I should've dumped the crawled data into CouchDB instead of chaining
-JSON dumps together like this. Could be more fun right? It would make it
-easier to not repeat spotify and rdio API lookups.
-
-If you have your own list of albums, and you want to see if they are available
-on spotify and rdio, you should be able to format your list like
-aoty_dedupe.json and point aoty_cmp.py at it.
-
-Alf says, it might be easier to scrape the content using this URL in the future:
-http://apps.hubmed.org/aoty/?_start=0&_limit=1000&_format=xml
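
The removed README invites readers to point aoty_cmp.py at their own list formatted like aoty_dedupe.json. The exact schema of aoty_dedupe.json isn't shown in this commit; as a rough sketch, assuming each record carries the 'artist' and 'album' keys that the lookup code in aoty_cmp.py reads, a hand-rolled input could be written like this:

# hypothetical example input for aoty_cmp.py; only the 'artist' and 'album'
# keys are confirmed by the lookup code in this commit, everything else about
# the aoty_dedupe.json schema is a guess
import json

my_albums = [
    {"artist": "Portishead", "album": "Third"},
    {"artist": "TV on the Radio", "album": "Dear Science"},
]

open("aoty_dedupe.json", "w").write(json.dumps(my_albums, indent=2))
# ./aoty_cmp.py then reads aoty_dedupe.json and writes aoty_cmp.json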
5,400 aoty_cmp.csv
1,309 additions, 4,091 deletions not shown
28,476 aoty_cmp.json
3,108 additions, 25,368 deletions not shown
76 aoty_cmp.py
@@ -1,5 +1,7 @@
#!/usr/bin/env python
+import re
+import sys
import json
import time
import logging
@@ -10,7 +12,7 @@
import config
-def main():
+def main(console=False):
    logging.basicConfig(filename="aoty_cmp.log", level=logging.INFO)
    aoty = json.loads(open("aoty_dedupe.json").read())
    for a in aoty:
@@ -19,9 +21,12 @@ def main():
            album = a['album']
            a['spotify'] = spotify(artist, album)
            a['rdio'] = rdio(artist, album)
+            if console:
+                progress(a)
            logging.info(a)
        except Exception, e:
            logging.exception(e)
+            sys.exit()
        time.sleep(1)
    open("aoty_cmp.json", "w").write(json.dumps(aoty, indent=2))
@@ -29,23 +34,44 @@ def spotify(artist, album):
    q = '%s AND "%s"' % (artist, album)
    q = quote(q.encode('utf-8'))
    url = 'http://ws.spotify.com/search/1/album.json?q=' + q
-    j = urlopen(url).read()
-    response = json.loads(j)
+
+    # spotify search api throws sporadic 502 errors
+    tries = 0
+    max_tries = 10
+    response = None
+    while True:
+        tries += 1
+        r = urlopen(url)
+
+        if r.code == 200:
+            response = json.loads(r.read())
+            break
+
+        if tries > max_tries:
+            break
+
+        backoff = tries ** 2
+        logging.warn("received %s when fetching %s, sleeping %s", r.code, url, backoff)
+        time.sleep(backoff)
+
+    if not response:
+        raise Exception("couldn't talk to Spotify for %s/%s" % (artist, album))
    can_stream = False
    url = None
    for a in response['albums']:
-        if a['name'] == album and spotify_artist(a, artist):
+        if a['name'].lower() == album.lower() and spotify_artist(a, artist):

@iangilman: Shouldn't you use clean() on the album name here?

@edsu (Owner): Good eye, thnx!

            url = a['href']
-            if config.COUNTRY in a['availability']['territories'].split(' '):
+            if config.COUNTRY in a['availability']['territories'].split(' ') or a['availability']['territories'] == 'worldwide':
                can_stream = True
    return {'can_stream': can_stream, 'url': url}
def spotify_artist(a, artist_name):
    for artist in a['artists']:
-        if artist['name'] == artist_name:
+        if clean(artist['name']) == clean(artist_name):
            return True
    return False
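
As an aside on the retry loop added to spotify() above: the same quadratic back-off could be factored into a small helper so any flaky lookup can reuse it. This is only a sketch, not part of the commit; it assumes the urlopen, logging and time imports the script already has, and mirrors the r.code check used above.

# hypothetical helper (not in this commit): retry a flaky GET with
# quadratic back-off, mirroring the loop in spotify() above
def fetch_with_backoff(url, max_tries=10):
    tries = 0
    while True:
        tries += 1
        r = urlopen(url)
        if r.code == 200:
            return r.read()
        if tries > max_tries:
            raise Exception("giving up on %s after %s tries" % (url, tries))
        backoff = tries ** 2  # 1, 4, 9, ... seconds
        logging.warn("received %s when fetching %s, sleeping %s", r.code, url, backoff)
        time.sleep(backoff)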
@@ -53,22 +79,46 @@ def rdio(artist, album):
    consumer = oauth.Consumer(config.RDIO_CONSUMER_KEY,
                              config.RDIO_CONSUMER_SECRET)
    client = oauth.Client(consumer)
-    q = {'method': 'search',
-         'query': ('%s %s' % (artist, album)).encode('utf-8'),
-         'types': 'Album'}
+    q = {
+        'method': 'search',
+        'query': ('%s %s' % (artist, album)).encode('utf-8'),
+        'types': 'Album',
+        '_region': config.COUNTRY
+    }
    j = client.request('http://api.rdio.com/1/', 'POST', urlencode(q))[1]
-    response = json.loads(j)
+    try:
+        response = json.loads(j)
+    except ValueError:
+        logging.error("unable to load json from rdio: %s", j)
+        return {'can_stream': False, 'url': None}
    can_stream = False
    url = None
    for r in response['result']['results']:
-        if r['name'] == album and r['artist']:
+        if clean(r['name']) == clean(album) and clean(r['artist']) == clean(artist):
            url = "http://rdio.com" + r['url']
            if r['canStream'] == True:
                can_stream = True
    return {'can_stream': can_stream, 'url': url}
-
+
+def progress(a):
+    r = a['rdio']['can_stream']
+    s = a['spotify']['can_stream']
+    if r and s:
+        sys.stderr.write(".")
+    elif r:
+        sys.stderr.write("r")
+    elif s:
+        sys.stderr.write("s")
+    else:
+        sys.stderr.write("x")
+
+def clean(a):
+    a = a.lower()
+    a = re.sub('^the ', '', a)
+    a = re.sub(' \(.+\)$', '', a)
+    return a

@iangilman: I would remove spaces and punctuation as well.

If you want to be hard core about it, you can have a diagnostic mode where it spits out non-matching names, to see if there are any other cases you should cover. When building http://letsfathom.com, I ended up also removing things like "original soundtrack" before comparing. Another one to watch out for is & versus and.

@edsu (Owner): Yes, I am recording non-matches in the resulting JSON file, and in the log file that gets generated as the search is underway. I just never looked terribly closely at it, so thank you :+1:

I'll rip out punctuation, which should cover "&" and ".". I did end up removing anything in parentheses to get around problems like http://www.rdio.com/artist/Portishead/album/Third_(Non_EU_Version)_1/; hopefully that isn't too overzealous.

@iangilman: Yeah, in my experience, parentheticals are better removed.

You should explicitly remove the word " and ", since that won't be caught by the punctuation search, and it's often interchangeable with &. The other approach is to convert "&" to "and".

@edsu (Owner): Ok, I added " and ".

if __name__ == "__main__":
-    main()
+    main(console=True)
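
Pulling the thread's suggestions together, a follow-up version of clean() might lower-case, drop a leading "the", strip trailing parentheticals, treat "&" and " and " as equivalent, and then remove remaining punctuation and whitespace. This is just a sketch of one way to do it, not the change edsu actually committed:

# hypothetical follow-up to clean(), combining the suggestions above;
# not the code in this commit
import re

def clean(a):
    a = a.lower()
    a = re.sub(r'^the ', '', a)      # leading article
    a = re.sub(r' \(.+\)$', '', a)   # trailing parenthetical, e.g. "(Non EU Version)"
    a = a.replace('&', ' and ')      # treat "&" and "and" as the same word
    a = re.sub(r'\band\b', '', a)    # ...then drop that word entirely
    a = re.sub(r'[^a-z0-9]', '', a)  # strip punctuation and whitespace
    return a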
2  aoty_csv.py
@@ -5,7 +5,7 @@
import csv
import json
-writer = csv.writer(open('aoty_cmp.csv', 'w'), dialect='excel')
+writer = csv.writer(open('aoty_cmp.csv', 'w'))
writer.writerow(["Artist", "Album", "Listed", "Spotify URL",
                 "Spotify Streamable", "Rdio URL", "Rdio Streamable"])
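
For context on the header row above: the row-writing part of aoty_csv.py isn't shown in this diff, but a loop along these lines would turn each aoty_cmp.json record into one CSV row. The spotify/rdio sub-dictionaries match what aoty_cmp.py stores; the source of the "Listed" column is an assumption here (a per-album list count presumably produced by aoty_dedupe.py).

# hypothetical sketch of the row-writing loop that the diff above doesn't show;
# a.get('lists', []) as the source of the "Listed" column is an assumption
for a in json.loads(open('aoty_cmp.json').read()):
    writer.writerow([
        a['artist'].encode('utf-8'),
        a['album'].encode('utf-8'),
        len(a.get('lists', [])),
        a['spotify']['url'],
        a['spotify']['can_stream'],
        a['rdio']['url'],
        a['rdio']['can_stream'],
    ])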