<h1>Using the Last Fm API to extract data and analyze it with Python (Part 1)</h1>

As you may already know, Last Fm is a service (kind of a social network) that stores your music listening activity, which is called *scrobbling*. I have been using it since late 2008. The goal of this project is to create new visualizations: Last Fm itself shows some nice charts and numbers, but I wanted a little more.

In this first part I'm going to show you how to extract the data and clean it using Python 🐍, but please keep in mind that **I'm new to the data analytics/science world and English is not my native language. I'm still learning. If you could give me your feedback about these two topics, I would really appreciate it.** 🤗

<h2> 🤓 Understanding the API.</h2>

Last Fm offers a very simple API. Reading public data doesn't need user authentication, only an API key, and the payload is very simple. In return, they ask a few things of you:

* 👾 To use an identifiable User-Agent header on all requests.
* 👾 To be reasonable and not make an excessive number of calls.
* 👾 And of course, the assumption that if you use the API you are accepting its terms of service.

Link: https://www.last.fm/api/intro
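To illustrate the first point in the list above, here is a small sketch of what an identifiable User-Agent could look like. The helper name, the application name and the contact address are made up for this example, and the "name/version (contact)" layout is only a common convention, not something Last Fm prescribes.

```python
def build_headers(app_name, version, contact):
    """Return request headers with a User-Agent that identifies this script.

    The "name/version (contact)" layout is a common convention,
    not a Last Fm requirement.
    """
    user_agent = "{}/{} ({})".format(app_name, version, contact)
    return {"user-agent": user_agent}


# Hypothetical values, replace them with your own.
headers = build_headers("lastfm-notebook", "0.1", "you@example.com")
print(headers)
```

A dictionary like this can be passed to `requests.get` through its `headers` argument.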

You can create an API key at https://www.last.fm/api/account/create. The API itself is served from http://ws.audioscrobbler.com/2.0/

The API has groups of methods such as album, artist, user and playlist, and each method has some customizable options. In this post I'm going to use two: recent tracks and loved tracks.
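To make those two methods a bit more concrete, here is a rough sketch of the query parameters each call might carry. The helper `build_payload`, the user name and the key are placeholders I made up; the parameter names are the same ones used in the script later in this post.

```python
def build_payload(method, user, api_key, **options):
    """Build the query parameters for one Last Fm API call."""
    payload = {
        "method": method,
        "user": user,
        "api_key": api_key,
        "format": "json",  # ask for JSON instead of the default XML
    }
    payload.update(options)  # e.g. limit=50, page=2
    return payload


# 'YOUR_API_KEY' and 'some_user' are placeholders, not real credentials.
recent = build_payload("user.getrecenttracks", "some_user", "YOUR_API_KEY", limit=50)
loved = build_payload("user.getlovedtracks", "some_user", "YOUR_API_KEY", page=2)
print(recent)
print(loved)
```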

<h2> ✍🏻 Building the script</h2>

I'm going to make a lot of requests with the same user, key and user agent, so the best way to do it is to create a function.

In [6]:
import requests

API_KEY = '62db2e8a92143737f526939ea0b7471c'  # replace with your own key
USER_AGENT = 'Mozilla/5.0'  # ideally something that identifies your script
USERNAME = 'chmedina'

def lastfm_get(payload):
    # Send the identifying header with every request.
    headers = {'user-agent': USER_AGENT}
    url = 'https://ws.audioscrobbler.com/2.0/'

    # Add the parameters that every request shares.
    payload['user'] = USERNAME
    payload['api_key'] = API_KEY
    payload['format'] = 'json'

    response = requests.get(url, headers=headers, params=payload)
    return response

If the function works, the response should have status code 200:

HTTP Status codes: https://developer.mozilla.org/es/docs/Web/HTTP/Status

In [7]:
r = lastfm_get({
    'method': 'user.getrecenttracks'
})
r.status_code

200
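A status code of 200 is the happy path. As a hedged sketch, the helper below (a hypothetical name, not part of `requests` or of the Last Fm API) turns a status code and a parsed body into a short message. It assumes that a failed call returns a JSON body with `error` and `message` keys, so please check the API documentation before relying on that.

```python
def describe_response(status_code, body):
    """Summarize an API response given its status code and parsed JSON body."""
    if status_code == 200 and "error" not in body:
        return "OK"
    # Assumption: error bodies look like {"error": <number>, "message": <text>}.
    code = body.get("error", "unknown")
    message = body.get("message", "no message provided")
    return "HTTP {}: API error {} ({})".format(status_code, code, message)


# Invented example bodies, only to show both branches.
print(describe_response(200, {"recenttracks": {}}))
print(describe_response(403, {"error": 10, "message": "Invalid API key"}))
```

With the function from this post, it could be called as `describe_response(r.status_code, r.json())`.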

The metadata of the response looks something like this:

In [8]:
import json
def jprint(obj):
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)
jprint(r.json()['recenttracks']['@attr'])

{
    "page": "1",
    "perPage": "50",
    "total": "239925",
    "totalPages": "4799",
    "user": "chmedina"
}


<h2> 📖 Paginated Data</h2>

Depending on how many songs you have or need to download, the response could be many pages long. As you can see, my user has 4799 pages with 50 results per page.
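The page count follows directly from the total and the page size. A tiny sketch of that arithmetic (the function name is mine):

```python
import math


def count_pages(total_items, per_page):
    """Number of pages needed to hold total_items at per_page items each."""
    return math.ceil(total_items / per_page)


print(count_pages(239925, 50))   # the page size shown in the response above
print(count_pages(239925, 500))  # a larger page size means fewer requests
```

The last page is usually only partly filled, which is why the division is rounded up.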

To work with this, you need a script that handles paging:

In [9]:
results = []
page = 1
total_pages = 10
while page <= total_pages:
    # "endpoint_url" is a placeholder for the real API URL
    r = requests.get("endpoint_url", params={"page": page})
    results.append(r.json())
    page += 1

In this example total_pages = 10 because I wanted to make a short request. That number is going to change.
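The loop above uses a fixed page count. A slightly more general sketch reads the real page count from each response instead. To keep it runnable without network access, `fake_fetch_page` stands in for a real API call; both it and `fetch_all` are hypothetical names, and the sample data is invented.

```python
def fetch_all(fetch_page):
    """Collect every page, learning the page count from each response.

    fetch_page(page) must return a dict shaped like the API response:
    {'recenttracks': {'@attr': {'totalPages': '3', ...}, 'track': [...]}}
    """
    results = []
    page = 1
    total_pages = 1  # provisional, corrected after the first response
    while page <= total_pages:
        data = fetch_page(page)
        total_pages = int(data["recenttracks"]["@attr"]["totalPages"])
        results.append(data)
        page += 1
    return results


def fake_fetch_page(page):
    """Stand-in for a real request: pretends the user has 3 pages."""
    return {
        "recenttracks": {
            "@attr": {"page": str(page), "totalPages": "3"},
            "track": ["track from page {}".format(page)],
        }
    }


pages = fetch_all(fake_fetch_page)
print(len(pages))
```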

Also, import the 'time' module to leave a gap between requests so you don't get your IP banned. 💩

In [10]:
import time

print("one")
time.sleep(1)
print("two")

one
two
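A fixed `time.sleep` works fine. As a possible refinement, the sketch below only waits for the part of the gap that has not already passed, which helps when a request itself is slow. The `Throttle` class is something I made up for illustration, and the 0.2 second interval is an arbitrary value.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for page in range(1, 4):
    throttle.wait()  # a real request would go right after this line
    print("requesting page", page)
elapsed = time.monotonic() - start
```

The first call goes through immediately, and each later call is delayed until the interval has passed.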


One more thing: this is going to be a large amount of data, and if your code breaks you will lose everything. 💩 Please install the requests-cache module.

In [11]:
pip install requests-cache

Note: you may need to restart the kernel to use updated packages.
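To show why a cache helps, here is a very small sketch of the idea using only the standard library. This is not how requests-cache works internally and it is not a replacement for it; it only illustrates that a page saved to disk never has to be requested twice. All the names and the file name are made up, and `fake_fetch_page` stands in for a real request.

```python
import json
import os
import tempfile


def load_cache(path):
    """Return the saved pages, or an empty dict if nothing was saved yet."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}


def get_page(page, fetch_page, cache, path):
    """Return a page from the cache, fetching and saving it only if missing."""
    key = str(page)  # JSON object keys are always strings
    if key not in cache:
        cache[key] = fetch_page(page)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return cache[key]


calls = []


def fake_fetch_page(page):
    """Stand-in for a real request that records how often it is called."""
    calls.append(page)
    return {"page": page}


cache_path = os.path.join(tempfile.mkdtemp(), "pages.json")
cache = load_cache(cache_path)
get_page(1, fake_fetch_page, cache, cache_path)
get_page(1, fake_fetch_page, cache, cache_path)  # served from the cache
print(calls)
```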


<h2> 📍 The final script is this:</h2>

This will take time, depending on the amount of data and the time.sleep setting.

In [12]:
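import time
from IPython.display import clear_output
import requests_cache

# Cache every response on disk so a crash doesn't force a full re-download.
requests_cache.install_cache()

responses = []
page = 1
total_pages = 9999  # placeholder, replaced by the real value after the first response

while page <= total_pages:
    payload = {
        'method': 'user.getrecenttracks',
        'limit': 500,
        'page': page,
    }
    print("Requesting page {}/{}".format(page, total_pages))
    clear_output(wait=True)  # keep the progress message on a single line
    response = lastfm_get(payload)

    # Stop on the first error and show what the API returned.
    if response.status_code != 200:
        print(response.text)
        break

    # Read the current page and the real page count from the response.
    attr = response.json()['recenttracks']['@attr']
    page = int(attr['page'])
    total_pages = int(attr['totalPages'])

    responses.append(response)

    # Only pause after real network requests, not after cache hits.
    if not getattr(response, 'from_cache', False):
        time.sleep(0.5)

    page += 1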

Requesting page 480/480
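Each stored response holds one page of tracks. As a rough sketch, the pages could be merged into a single list before building a DataFrame. The function name is made up and the sample data is invented. It also assumes that a track that is playing right now carries an `@attr` entry with `nowplaying` and has no `date` yet, so it is skipped; please verify this against your own data.

```python
def collect_tracks(pages):
    """Merge the 'track' lists of several parsed responses into one list."""
    tracks = []
    for data in pages:
        for track in data["recenttracks"]["track"]:
            # Assumption: a currently playing track is flagged like this
            # and has no 'date' field yet, so it is left out.
            if track.get("@attr", {}).get("nowplaying") == "true":
                continue
            tracks.append(track)
    return tracks


# Invented sample pages in the same shape as the API response.
sample_pages = [
    {"recenttracks": {"track": [
        {"name": "Song A", "@attr": {"nowplaying": "true"}},
        {"name": "Song B", "date": {"uts": "1628444107"}},
    ]}},
    {"recenttracks": {"track": [
        {"name": "Song C", "date": {"uts": "1628443611"}},
    ]}},
]

all_tracks = collect_tracks(sample_pages)
print([t["name"] for t in all_tracks])
```

With the `responses` list from the script above, the input would be `[r.json() for r in responses]`, and `pd.DataFrame(all_tracks)` would give one table for the whole history.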


<h2> 🐼 Using Pandas to clean the data</h2>

In [13]:
import pandas as pd

r0 = responses[1]
r0_json = r0.json()
r0_track = r0_json['recenttracks']['track']
r0_df = pd.DataFrame(r0_track)
r0_df.head()

Unnamed: 0,artist,album,image,streamable,date,url,name,mbid
0,{'mbid': 'c147b96e-3428-4604-9b6c-d931f980f684...,{'mbid': '188d704b-678d-499d-b488-516a7247cc01...,"[{'size': 'small', '#text': 'https://lastfm.fr...",0,"{'uts': '1628444107', '#text': '08 Aug 2021, 1...",https://www.last.fm/music/Oceansize/_/Music+Fo...,Music For A Nurse,713c0038-737f-3d76-a488-3c11cadac8a1
1,{'mbid': '1136bc83-44bf-4fd5-b036-9067a30b3a75...,"{'mbid': '', '#text': 'Smile of Tears'}","[{'size': 'small', '#text': 'https://lastfm.fr...",0,"{'uts': '1628443611', '#text': '08 Aug 2021, 1...",https://www.last.fm/music/Aisles/_/Smile+of+Tears,Smile of Tears,9d52ae3e-65f8-49f0-b075-6d41e507e932
2,{'mbid': '1136bc83-44bf-4fd5-b036-9067a30b3a75...,{'mbid': 'd62f8969-3f38-4df0-8556-a14cfe87c1e6...,"[{'size': 'small', '#text': 'https://lastfm.fr...",0,"{'uts': '1628443406', '#text': '08 Aug 2021, 1...","https://www.last.fm/music/Aisles/_/The+Poet,+P...","The Poet, Pt. I: Dusk",
3,{'mbid': '1932f5b6-0b7b-4050-b1df-833ca89e5f44...,{'mbid': '34d13866-e51b-410f-8c69-2a884f8031da...,"[{'size': 'small', '#text': 'https://lastfm.fr...",0,"{'uts': '1628442796', '#text': '08 Aug 2021, 1...",https://www.last.fm/music/Marillion/_/Power,Power,0aedae8c-9350-46c5-8bb2-f1d422913b81
4,"{'mbid': '', '#text': 'Seedpicker'}","{'mbid': '', '#text': 'Virginia Rhapsody'}","[{'size': 'small', '#text': 'https://lastfm.fr...",0,"{'uts': '1628442429', '#text': '08 Aug 2021, 1...",https://www.last.fm/music/Seedpicker/_/Power+Down,Power Down,


The JSON response has some columns that we don't need. The album and artist columns are dictionaries that carry an 'mbid' field we don't need either, and the date column is a dictionary too ('uts' plus '#text').

There are two options: 

* 🧶 Use pandas to select the #text field.
* 🧶 Convert the column to a string and split it.
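For comparison, the first option can be sketched without any string handling, by picking the `#text` values out of each track before the DataFrame is built. `flatten_track` is a hypothetical helper, and the sample track is copied from the output above.

```python
def flatten_track(track):
    """Keep only the readable fields of one track dictionary."""
    return {
        "name": track["name"],
        "artist": track["artist"]["#text"],
        "album": track["album"]["#text"],
        "date": track["date"]["#text"],
    }


# One track in the shape returned by the API (values from the output above).
sample_track = {
    "name": "Power Down",
    "artist": {"mbid": "", "#text": "Seedpicker"},
    "album": {"mbid": "", "#text": "Virginia Rhapsody"},
    "date": {"uts": "1628442429", "#text": "08 Aug 2021, 17:07"},
    "mbid": "",
}

print(flatten_track(sample_track))
```

`pd.DataFrame([flatten_track(t) for t in r0_track])` would then produce the clean columns directly.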

I chose the second option. I did this in steps to make it easier to follow.

Drop the columns we don't need:

In [14]:
r0_df = r0_df.drop(columns=['image', 'streamable', 'url', 'mbid'])
r0_df.head()

Unnamed: 0,artist,album,date,name
0,{'mbid': 'c147b96e-3428-4604-9b6c-d931f980f684...,{'mbid': '188d704b-678d-499d-b488-516a7247cc01...,"{'uts': '1628444107', '#text': '08 Aug 2021, 1...",Music For A Nurse
1,{'mbid': '1136bc83-44bf-4fd5-b036-9067a30b3a75...,"{'mbid': '', '#text': 'Smile of Tears'}","{'uts': '1628443611', '#text': '08 Aug 2021, 1...",Smile of Tears
2,{'mbid': '1136bc83-44bf-4fd5-b036-9067a30b3a75...,{'mbid': 'd62f8969-3f38-4df0-8556-a14cfe87c1e6...,"{'uts': '1628443406', '#text': '08 Aug 2021, 1...","The Poet, Pt. I: Dusk"
3,{'mbid': '1932f5b6-0b7b-4050-b1df-833ca89e5f44...,{'mbid': '34d13866-e51b-410f-8c69-2a884f8031da...,"{'uts': '1628442796', '#text': '08 Aug 2021, 1...",Power
4,"{'mbid': '', '#text': 'Seedpicker'}","{'mbid': '', '#text': 'Virginia Rhapsody'}","{'uts': '1628442429', '#text': '08 Aug 2021, 1...",Power Down


Convert the album, artist and date columns to strings:

In [15]:
r0_df['artist'] = r0_df.artist.astype(str)
r0_df['album'] = r0_df.album.astype(str)
r0_df['date'] = r0_df.date.astype(str)
r0_df.head()

Unnamed: 0,artist,album,date,name
0,{'mbid': 'c147b96e-3428-4604-9b6c-d931f980f684...,{'mbid': '188d704b-678d-499d-b488-516a7247cc01...,"{'uts': '1628444107', '#text': '08 Aug 2021, 1...",Music For A Nurse
1,{'mbid': '1136bc83-44bf-4fd5-b036-9067a30b3a75...,"{'mbid': '', '#text': 'Smile of Tears'}","{'uts': '1628443611', '#text': '08 Aug 2021, 1...",Smile of Tears
2,{'mbid': '1136bc83-44bf-4fd5-b036-9067a30b3a75...,{'mbid': 'd62f8969-3f38-4df0-8556-a14cfe87c1e6...,"{'uts': '1628443406', '#text': '08 Aug 2021, 1...","The Poet, Pt. I: Dusk"
3,{'mbid': '1932f5b6-0b7b-4050-b1df-833ca89e5f44...,{'mbid': '34d13866-e51b-410f-8c69-2a884f8031da...,"{'uts': '1628442796', '#text': '08 Aug 2021, 1...",Power
4,"{'mbid': '', '#text': 'Seedpicker'}","{'mbid': '', '#text': 'Virginia Rhapsody'}","{'uts': '1628442429', '#text': '08 Aug 2021, 1...",Power Down


Split the strings:

In [16]:
# Keep only the text after "#text':" for each column.
r0_df[['mbid2', 'new_artist']] = r0_df['artist'].str.split("#text':", n=1, expand=True)
r0_df = r0_df.drop(['mbid2', 'artist'], axis=1)
r0_df[['mbid3', 'new_album']] = r0_df['album'].str.split("#text':", n=1, expand=True)
r0_df = r0_df.drop(['mbid3', 'album'], axis=1)
# For the date, the first half (with the Unix timestamp) is kept as 'uts'.
r0_df[['uts', 'new_date']] = r0_df['date'].str.split("#text':", n=1, expand=True)
r0_df = r0_df.drop('date', axis=1)
r0_df.head()

Unnamed: 0,name,new_artist,new_album,uts,new_date
0,Music For A Nurse,'Oceansize'},'Everyone Into Position'},"{'uts': '1628444107', '","'08 Aug 2021, 17:35'}"
1,Smile of Tears,'Aisles'},'Smile of Tears'},"{'uts': '1628443611', '","'08 Aug 2021, 17:26'}"
2,"The Poet, Pt. I: Dusk",'Aisles'},'Hawaii'},"{'uts': '1628443406', '","'08 Aug 2021, 17:23'}"
3,Power,'Marillion'},"""Sounds That Can't Be Made""}","{'uts': '1628442796', '","'08 Aug 2021, 17:13'}"
4,Power Down,'Seedpicker'},'Virginia Rhapsody'},"{'uts': '1628442429', '","'08 Aug 2021, 17:07'}"
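The `uts` column above still holds a fragment such as `{'uts': '1628444107', '`. The number inside is a Unix timestamp (seconds since 1970), which is handier for analysis than the text date. A minimal sketch, with a made-up helper name, that pulls the digits out and converts them to a UTC datetime:

```python
import re
from datetime import datetime, timezone


def parse_uts(fragment):
    """Extract the Unix timestamp from the leftover text and convert it to UTC."""
    match = re.search(r"\d+", fragment)
    if match is None:
        return None  # no timestamp found in this fragment
    return datetime.fromtimestamp(int(match.group()), tz=timezone.utc)


played_at = parse_uts("{'uts': '1628444107', '")
print(played_at.isoformat())
```

The result agrees with the `08 Aug 2021, 17:35` shown in `new_date` for the same row. On the DataFrame it could be applied with `r0_df['uts'].map(parse_uts)`.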


The final steps are to remove the leftover quote and '}' characters and, if you like, save the result in a .csv file. For my project I'm going to use four users, each one with scrobbled and loved tracks, so at the end I will have eight csv files.
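One possible way to do that cleanup, as a sketch rather than the exact code I used: after the split, each value is a quoted Python string followed by a closing brace, so the brace can be dropped and the rest parsed with `ast.literal_eval`. The helper name `clean_field` is made up.

```python
import ast


def clean_field(value):
    """Turn a fragment like " 'Oceansize'}" into the plain text Oceansize."""
    text = value.strip()
    if text.endswith("}"):
        text = text[:-1]  # drop the closing brace of the original dictionary
    # What is left is a quoted Python string, so it can be parsed safely.
    return ast.literal_eval(text)


print(clean_field(" 'Oceansize'}"))
print(clean_field(" \"Sounds That Can't Be Made\"}"))
```

On the DataFrame this could be applied as `r0_df['new_artist'] = r0_df['new_artist'].map(clean_field)`, and the result saved with `r0_df.to_csv('scrobbles.csv', index=False)`, where the file name is only an example.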

In part 2 I'm going to use Python for data visualization. 🐾

To write this post I read the Dataquest blog https://www.dataquest.io/. That's where I learned the process. If you have questions about any procedure shown here, please go to their blog post https://www.dataquest.io/blog/last-fm-api-python/ where Celeste Grupman explains it in more detail.

Thanks for reading, and remember I'm new to the data analytics/science world and English is not my native language. I'm still learning 🧑‍🎓. If you could give me your feedback about these two topics, I would really appreciate it. 🙇‍♂️

Carlos Medina 🎩